Merge pull request #83 from phac-nml/patch/sistr

Updated table parsing to prevent mangled/mis-aligned values from entering the report.
phac-nml · Jun 4, 2024 · df48c71
2 parents f1efb35 + e5a7570

Showing 23 changed files with 528 additions and 33 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,7 +3,14 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v0.2.0 - [2024-05-14]
+## [0.2.1] - 2024-06-03
+
+### `Fixed`
+
+- Fixed parsed table values not showing up properly when values were missing. See [PR 83](https://github.com/phac-nml/mikrokondo/pull/83)
+- Fixed mismatched description for minimap2 and mash databases. See [PR 83](https://github.com/phac-nml/mikrokondo/pull/83)
+
+## [0.2.0] - 2024-05-14
 
 ### `Added`
 
@@ -37,7 +44,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 - Updated StarAMR to version 0.10.0. See [PR 74](https://github.com/phac-nml/mikrokondo/pull/74)
 
-## v0.1.2 - [2024-05-02]
+## [0.1.2] - 2024-05-02
 
 ### Changed
 
@@ -46,13 +53,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Set `--kraken2_db` to be a required parameter for the pipeline. See [PR 71](https://github.com/phac-nml/mikrokondo/pull/71)
 - Hide bakta parameters from IRIDA Next UI. See [PR 71](https://github.com/phac-nml/mikrokondo/pull/71)
 
-## v0.1.1 - [2024-04-22]
+## [0.1.1] - 2024-04-22
 
 ### Changed
 
 - Switched the resource labels for **parse_fastp**, **select_pointfinder**, **report**, and **parse_kat** from `process_low` to `process_single` as they are all configured to run on the local Nextflow machine. See [PR 67](https://github.com/phac-nml/mikrokondo/pull/67)
 
-## v0.1.0 - [2024-03-22]
+## [0.1.0] - 2024-03-22
 
 Initial release of phac-nml/mikrokondo. Mikrokondo currently supports: read trimming and quality control, contamination detection, assembly (isolate, metagenomic or hybrid), annotation, AMR detection and subtyping of genomic sequencing data targeting bacterial or metagenomic data.
 
@@ -79,3 +86,9 @@ Initial release of phac-nml/mikrokondo. Mikrokondo currently supports: read trim
 - Changed Salmonella default coverage to 40
 
 - Added integration testing using [nf-test](https://www.nf-test.com/).
+
+[0.2.1]: https://github.com/phac-nml/mikrokondo/releases/tag/0.2.1
+[0.2.0]: https://github.com/phac-nml/mikrokondo/releases/tag/0.2.0
+[0.1.2]: https://github.com/phac-nml/mikrokondo/releases/tag/0.1.2
+[0.1.1]: https://github.com/phac-nml/mikrokondo/releases/tag/0.1.1
+[0.1.0]: https://github.com/phac-nml/mikrokondo/releases/tag/0.1.0
diff --git a/modules/local/report.nf b/modules/local/report.nf
@@ -799,34 +799,73 @@ def table_values(file_path, header_p, seperator, headers=null){
 
         returns a map
     */
-    def split_header = null
-    def split_line = null
-    def converted_data = [:]
-    def idx = 0
-    def lines_read = false
-    file_path.withReader{
-        String line
-        if(header_p){
-            header = it.readLine()
-            split_header = header.tokenize(seperator)
+    def missing_value = 'NoData'
+    def default_index_col = "__default_index__"
+    def rows_list = null
+    def use_modified_headers_from_file = false
+    def is_missing = { it == null || it == '' }
+    def replace_missing = { is_missing(it) ? missing_value : it }
+
+    // Reads two lines (up to one header line + one row) for making decisions on how to parse the file
+    def file_lines = file_path.splitText(limit: 2)
+    if (!header_p && headers == null) {
+        throw new Exception("Header is not provided in file [header_p=${header_p}], but headers passed to function is null")
+    } else if (!header_p) {
+        if (file_lines.size() == 0) {
+            // headers were not in the file, and file size is 0, so return missing data based
+            // on passed headers (i.e., single row of empty values)
+            rows_list = [headers.collectEntries { [(it): null] }]
+        } else {
+            // verify that the passed headers and the first row have the same number of columns
+            def row_line = file_lines[0].replaceAll('(\n|\r\n)$', '')
+            def row_line_columns = row_line.split(seperator, -1)
+            if (headers.size() != row_line_columns.size()) {
+                throw new Exception("Mismatched number of passed headers ${headers} and column values ${row_line_columns} for file ${file_path}")
+            } else {
+                rows_list = file_path.splitCsv(header: headers, sep:seperator)
+            }
         }
-        if(headers){
-            split_header = headers
+    } else {
+        // Headers exist in file
+
+        if (file_lines.size() == 0) {
+            throw new Exception("Attempting to parse empty file [${file_path}] as a table where header_p=${header_p}")
         }
-        while(line = it.readLine()){
-            split_line = line.tokenize(seperator)
-            // Transpose, and collect converts the data to a map
-            converted_data[idx] = [split_header, split_line].transpose().collectEntries()
-            idx++
-            lines_read = true
+
+        def header_line = file_lines[0].replaceAll('(\n|\r\n)$', '')
+        def headers_from_file = header_line.split(seperator, -1)
+        def total_missing_headers = headers_from_file.collect{ is_missing(it) ? 1 : 0 }.sum()
+
+        if (total_missing_headers > 1) {
+            throw new Exception("Attempting to parse tabular file with more than one missing header: [${file_path}]")
+        } else if (is_missing(headers_from_file[0])) {
+            // Case, single missing header as first column
+            headers_from_file[0] = default_index_col
+            use_modified_headers_from_file = true
         }
-        if(!lines_read){
-            converted_data[idx] = [split_header, Collections.nCopies(split_header.size, "NoData")].transpose().collectEntries()
+
+        if (file_lines.size() == 1) {
+            // There are no row lines, only headers, so return missing data
+            // (single row of empty values)
+            rows_list = [headers_from_file.collectEntries { [(it): null] }]
+        } else {
+            // If there exists a row line, then make sure rows + headers match
+
+            def row_line1 = file_lines[1].replaceAll('(\n|\r\n)$', '')
+            def row_line1_columns = row_line1.split(seperator, -1)
+            if (headers_from_file.size() != row_line1_columns.size()) {
+                throw new Exception("Mismatched number of headers ${headers_from_file} and column values ${row_line1_columns} for file ${file_path}")
+            }
+
+            if (use_modified_headers_from_file) {
+                rows_list = file_path.splitCsv(header: headers_from_file as List, sep:seperator, skip: 1)
+            } else {
+                rows_list = file_path.splitCsv(header: true, sep:seperator)
+            }
         }
+    }
 
+    return rows_list.indexed().collectEntries { idx, row ->
+        [(idx): row.collectEntries { k, v -> [(k): replace_missing(v)] }]
     }
-    return converted_data
 }
-
-
-
diff --git a/nextflow.config b/nextflow.config
@@ -1078,12 +1078,12 @@ dag {
 
 manifest {
     name            = 'phac-nml/mikrokondo'
-    author          = """matthew wells"""
+    author          = """Matthew Wells, James Robertson, Aaron Petkau, Christy-Lynn Peterson, Eric Marinier"""
     homePage        = 'https://github.com/phac-nml/mikrokondo'
-    description     = """Mikrokondo beta"""
+    description     = """Mikrokondo"""
     mainScript      = 'main.nf'
     nextflowVersion = '!>=23.04.0'
-    version         = '0.2.0'
+    version         = '0.2.1'
     defaultBranch   = 'main'
     doi             = ''
 }

diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -76,14 +76,14 @@
             "properties": {
                 "dehosting_idx": {
                     "type": "string",
-                    "description": "Mash sketch used for contamination detection and speciation (Sketch comments must be a taxonomic string similar to what Kraken2 outputs)",
+                    "description": "Minimap2 index for dehosting and kitome removal",
                     "pattern": "^\\S+$",
                     "exists": true,
                     "format": "file-path"
                 },
                 "mash_sketch": {
                     "type": "string",
-                    "description": "Minimpa2 index for dehosting and kitome removal",
+                    "description": "Mash sketch used for contamination detection and speciation (Sketch comments must be a taxonomic string similar to what Kraken2 outputs)",
                     "pattern": "^\\S+$",
                     "exists": true,
                     "format": "file-path"

diff --git a/tests/data/tables/all_values_missing.csv b/tests/data/tables/all_values_missing.csv
@@ -0,0 +1,2 @@
+header1,header2,header3
+,,
diff --git a/tests/data/tables/empty.csv b/tests/data/tables/empty.csv
diff --git a/tests/data/tables/header_missing_val.csv b/tests/data/tables/header_missing_val.csv
@@ -0,0 +1,2 @@
+,header2,header3
+stuff1,stuff2,stuff3
diff --git a/tests/data/tables/missing_all_headers.csv b/tests/data/tables/missing_all_headers.csv
@@ -0,0 +1,2 @@
+
+stuff1,stuff2,stuff3
diff --git a/tests/data/tables/missing_all_headers_single_line.csv b/tests/data/tables/missing_all_headers_single_line.csv
@@ -0,0 +1 @@
+stuff1,stuff2,stuff3
diff --git a/tests/data/tables/missing_last_value.tab b/tests/data/tables/missing_last_value.tab
@@ -0,0 +1,2 @@
+header1	header2	header3
+stuff1	stuff2
diff --git a/tests/data/tables/missing_multiple_value_separators.tab b/tests/data/tables/missing_multiple_value_separators.tab
@@ -0,0 +1,2 @@
+header1	header2	header3	header4
+stuff1	stuff2
diff --git a/tests/data/tables/missing_multiple_value_separators_extra_field.tab b/tests/data/tables/missing_multiple_value_separators_extra_field.tab
@@ -0,0 +1,2 @@
+header1	header2	header3	header4
+stuff1	stuff2		stuff4
diff --git a/tests/data/tables/mistmatch_headers_values.csv b/tests/data/tables/mistmatch_headers_values.csv
@@ -0,0 +1,2 @@
+header1,header2,header3
+stuff1,stuff2,stuff3,stuff4
diff --git a/tests/data/tables/mock_missing_value.csv b/tests/data/tables/mock_missing_value.csv
@@ -0,0 +1,2 @@
+header1,header2,header3
+,stuff2,stuff3
diff --git a/tests/data/tables/mock_missing_value.tab b/tests/data/tables/mock_missing_value.tab
@@ -0,0 +1,2 @@
+header1	header2	header3
+	stuff2	stuff3
diff --git a/tests/data/tables/mock_missing_value_2.tab b/tests/data/tables/mock_missing_value_2.tab
@@ -0,0 +1,2 @@
+header1	header2	header3
+		stuff3
diff --git a/tests/data/tables/no_header.csv b/tests/data/tables/no_header.csv
@@ -0,0 +1 @@
+stuff1,stuff2,stuff3
diff --git a/tests/data/tables/no_missing.csv b/tests/data/tables/no_missing.csv
@@ -0,0 +1,2 @@
+header1,header2,header3
+stuff1,stuff2,stuff3
diff --git a/tests/data/tables/no_missing.tab b/tests/data/tables/no_missing.tab
@@ -0,0 +1,2 @@
+header1	header2	header3
+stuff1	stuff2	stuff3
diff --git a/tests/data/tables/two_missing_headers.csv b/tests/data/tables/two_missing_headers.csv
@@ -0,0 +1,2 @@
+,,header3
+stuff1,stuff2,stuff3
diff --git a/tests/data/tables/vector.csv b/tests/data/tables/vector.csv
@@ -0,0 +1,4 @@
+header1
+stuff1
+stuff2
+stuff3
diff --git a/tests/data/tables/vector_no_hdr.csv b/tests/data/tables/vector_no_hdr.csv
@@ -0,0 +1,4 @@
+
+stuff1
+stuff2
+stuff3
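
The decision flow of the new `table_values` in `modules/local/report.nf` can be hard to follow from the diff alone, so here is a minimal Python sketch of the same idea. It is an illustration under stated assumptions, not the pipeline's implementation: it takes the file contents as a string instead of a Nextflow path, it replaces `splitCsv` with plain `str.split`, and the names `old_style_row`, `MISSING_VALUE` and `DEFAULT_INDEX_COL` are hypothetical. `old_style_row` emulates the removed code, where empty tokens were dropped before headers and values were paired, which is how values could end up under the wrong header.

```python
MISSING_VALUE = "NoData"
DEFAULT_INDEX_COL = "__default_index__"


def is_missing(value):
    """True for None or the empty string, like the is_missing closure in the diff."""
    return value is None or value == ""


def old_style_row(header_line, row_line, separator):
    """Emulation of the removed behaviour (assumption: empty tokens are dropped)."""
    keys = [t for t in header_line.split(separator) if t]
    values = [t for t in row_line.split(separator) if t]
    return dict(zip(keys, values))  # zip truncates, so values shift left


def table_values(text, header_p, separator, headers=None):
    """Parse delimited text into {row_index: {column: value}}."""
    lines = text.splitlines()
    if not header_p and headers is None:
        raise ValueError("no header line in the text and no headers passed")

    if header_p:
        if not lines:
            raise ValueError("cannot parse empty text when a header is expected")
        names = lines[0].split(separator)  # str.split keeps empty fields
        if sum(is_missing(n) for n in names) > 1:
            raise ValueError("more than one missing header")
        if is_missing(names[0]):
            # A single missing header in the first column becomes an index column.
            names[0] = DEFAULT_INDEX_COL
        data_lines = lines[1:]
    else:
        names = list(headers)
        data_lines = lines

    if not data_lines:
        # No data rows: return a single row of missing values.
        rows = [[None] * len(names)]
    else:
        # As in the diff, only the first data row is checked against the headers.
        first_row = data_lines[0].split(separator)
        if len(first_row) != len(names):
            raise ValueError(f"mismatched headers {names} and values {first_row}")
        rows = [line.split(separator) for line in data_lines]

    return {
        idx: {
            name: (MISSING_VALUE if is_missing(value) else value)
            for name, value in zip(names, row)
        }
        for idx, row in enumerate(rows)
    }


if __name__ == "__main__":
    header, row = "header1\theader2\theader3", "\tstuff2\tstuff3"
    print(old_style_row(header, row, "\t"))
    print(table_values(header + "\n" + row + "\n", True, "\t"))
```

Run on the contents of `mock_missing_value.tab`, the emulated old behaviour pairs `stuff2` with `header1`, while the sketch of the new logic keeps `stuff2` under `header2` and fills `header1` with `NoData`. Inputs shaped like `missing_last_value.tab` or `mistmatch_headers_values.csv` raise an error in this sketch instead of producing a shifted row.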