Merge pull request #83 from phac-nml/patch/sistr
Updated table parsing to prevent mangled/mis-aligned values from entering the report.
mattheww95 authored Jun 4, 2024
2 parents f1efb35 + e5a7570 commit df48c71
Showing 23 changed files with 528 additions and 33 deletions.
21 changes: 17 additions & 4 deletions CHANGELOG.md
@@ -3,7 +3,14 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v0.2.0 - [2024-05-14]
## [0.2.1] - 2024-06-03

### `Fixed`

- Fixed parsed table values not showing up properly when values were missing. See [PR 83](https://github.com/phac-nml/mikrokondo/pull/83)
- Fixed mismatched description for minimap2 and mash databases. See [PR 83](https://github.com/phac-nml/mikrokondo/pull/83)

## [0.2.0] - 2024-05-14

### `Added`

@@ -37,7 +44,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Updated StarAMR to version 0.10.0. See [PR 74](https://github.com/phac-nml/mikrokondo/pull/74)

## v0.1.2 - [2024-05-02]
## [0.1.2] - 2024-05-02

### Changed

@@ -46,13 +53,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Set `--kraken2_db` to be a required parameter for the pipeline. See [PR 71](https://github.com/phac-nml/mikrokondo/pull/71)
- Hide bakta parameters from IRIDA Next UI. See [PR 71](https://github.com/phac-nml/mikrokondo/pull/71)

## v0.1.1 - [2024-04-22]
## [0.1.1] - 2024-04-22

### Changed

- Switched the resource labels for **parse_fastp**, **select_pointfinder**, **report**, and **parse_kat** from `process_low` to `process_single` as they are all configured to run on the local Nextflow machine. See [PR 67](https://github.com/phac-nml/mikrokondo/pull/67)

## v0.1.0 - [2024-03-22]
## [0.1.0] - 2024-03-22

Initial release of phac-nml/mikrokondo. Mikrokondo currently supports: read trimming and quality control, contamination detection, assembly (isolate, metagenomic or hybrid), annotation, AMR detection and subtyping of genomic sequencing data targeting bacterial or metagenomic data.

@@ -79,3 +86,9 @@ Initial release of phac-nml/mikrokondo. Mikrokondo currently supports: read trim
- Changed Salmonella default coverage to 40

- Added integration testing using [nf-test](https://www.nf-test.com/).

[0.2.1]: https://github.com/phac-nml/mikrokondo/releases/tag/0.2.1
[0.2.0]: https://github.com/phac-nml/mikrokondo/releases/tag/0.2.0
[0.1.2]: https://github.com/phac-nml/mikrokondo/releases/tag/0.1.2
[0.1.1]: https://github.com/phac-nml/mikrokondo/releases/tag/0.1.1
[0.1.0]: https://github.com/phac-nml/mikrokondo/releases/tag/0.1.0
87 changes: 63 additions & 24 deletions modules/local/report.nf
@@ -799,34 +799,73 @@ def table_values(file_path, header_p, seperator, headers=null){
returns a map
*/
def split_header = null
def split_line = null
def converted_data = [:]
def idx = 0
def lines_read = false
file_path.withReader{
String line
if(header_p){
header = it.readLine()
split_header = header.tokenize(seperator)
def missing_value = 'NoData'
def default_index_col = "__default_index__"
def rows_list = null
def use_modified_headers_from_file = false
def is_missing = { it == null || it == '' }
def replace_missing = { is_missing(it) ? missing_value : it }

// Reads two lines (up to one header line + one row) for making decisions on how to parse the file
def file_lines = file_path.splitText(limit: 2)
if (!header_p && headers == null) {
throw new Exception("Header is not provided in file [header_p=${header_p}], but headers passed to function is null")
} else if (!header_p) {
if (file_lines.size() == 0) {
// headers were not in the file, and file size is 0, so return missing data based
// on passed headers (i.e., single row of empty values)
rows_list = [headers.collectEntries { [(it): null] }]
} else {
// verify that passed headers and rows have same number
def row_line = file_lines[0].replaceAll('(\n|\r\n)$', '')
def row_line_columns = row_line.split(seperator, -1)
if (headers.size() != row_line_columns.size()) {
throw new Exception("Mismatched number of passed headers ${headers} and column values ${row_line_columns} for file ${file_path}")
} else {
rows_list = file_path.splitCsv(header: headers, sep:seperator)
}
}
if(headers){
split_header = headers
} else {
// Headers exist in file

if (file_lines.size() == 0) {
throw new Exception("Attempting to parse empty file [${file_path}] as a table where header_p=${header_p}")
}
while(line = it.readLine()){
split_line = line.tokenize(seperator)
// Transpose, and collect converts the data to a map
converted_data[idx] = [split_header, split_line].transpose().collectEntries()
idx++
lines_read = true

def header_line = file_lines[0].replaceAll('(\n|\r\n)$', '')
def headers_from_file = header_line.split(seperator, -1)
def total_missing_headers = headers_from_file.collect{ is_missing(it) ? 1 : 0 }.sum()

if (total_missing_headers > 1) {
throw new Exception("Attempting to parse tabular file with more than one missing header: [${file_path}]")
} else if (is_missing(headers_from_file[0])) {
// Case, single missing header as first column
headers_from_file[0] = default_index_col
use_modified_headers_from_file = true
}
if(!lines_read){
converted_data[idx] = [split_header, Collections.nCopies(split_header.size, "NoData")].transpose().collectEntries()

if (file_lines.size() == 1) {
// There is no row lines, only headers, so return missing data
// (single row of empty values)
rows_list = [headers_from_file.collectEntries { [(it): null] }]
} else {
// If there exists a row line, then make sure rows + headers match

def row_line1 = file_lines[1].replaceAll('(\n|\r\n)$', '')
def row_line1_columns = row_line1.split(seperator, -1)
if (headers_from_file.size() != row_line1_columns.size()) {
throw new Exception("Mismatched number of headers ${headers_from_file} and column values ${row_line1_columns} for file ${file_path}")
}

if (use_modified_headers_from_file) {
rows_list = file_path.splitCsv(header: headers_from_file as List, sep:seperator, skip: 1)
} else {
rows_list = file_path.splitCsv(header: true, sep:seperator)
}
}
}

return rows_list.indexed().collectEntries { idx, row ->
[(idx): row.collectEntries { k, v -> [(k): replace_missing(v)] }]
}
return converted_data
}
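
The mis-alignment this change targets can be reproduced outside the pipeline. The sketch below is Python rather than the module's Groovy, and `tokenize` and `split_keep_empty` are hypothetical stand-ins for Groovy's `String.tokenize` (which drops empty tokens) and `split(seperator, -1)` (which keeps them), assuming a single-character separator. It is an illustration only, not the pipeline's code.

```python
MISSING_VALUE = "NoData"  # same placeholder string the module uses


def tokenize(line, sep):
    """Stand-in for Groovy's String.tokenize: empty fields are dropped."""
    return [field for field in line.split(sep) if field != ""]


def split_keep_empty(line, sep):
    """Stand-in for Groovy's line.split(sep, -1): empty fields stay in place."""
    return line.split(sep)


headers = ["header1", "header2", "header3"]
row = ",stuff2,stuff3"  # first value missing, as in mock_missing_value.csv

# Old behaviour: the two surviving values slide left under the wrong headers.
old_row = dict(zip(headers, tokenize(row, ",")))

# New behaviour: every column keeps its position and gaps become "NoData".
new_row = {
    header: (value if value != "" else MISSING_VALUE)
    for header, value in zip(headers, split_keep_empty(row, ","))
}

print(old_row)
print(new_row)
```

With the old approach `stuff2` ends up under `header1` and `header3` disappears; keeping the empty field lets the missing value be replaced in place.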



6 changes: 3 additions & 3 deletions nextflow.config
@@ -1078,12 +1078,12 @@ dag {

manifest {
name = 'phac-nml/mikrokondo'
author = """matthew wells"""
author = """Matthew Wells, James Robertson, Aaron Petkau, Christy-Lynn Peterson, Eric Marinier"""
homePage = 'https://github.com/phac-nml/mikrokondo'
description = """Mikrokondo beta"""
description = """Mikrokondo"""
mainScript = 'main.nf'
nextflowVersion = '!>=23.04.0'
version = '0.2.0'
version = '0.2.1'
defaultBranch = 'main'
doi = ''
}
4 changes: 2 additions & 2 deletions nextflow_schema.json
@@ -76,14 +76,14 @@
"properties": {
"dehosting_idx": {
"type": "string",
"description": "Mash sketch used for contamination detection and speciation (Sketch comments must be a taxonomic string similar to what Kraken2 outputs)",
"description": "Minimap2 index for dehosting and kitome removal",
"pattern": "^\\S+$",
"exists": true,
"format": "file-path"
},
"mash_sketch": {
"type": "string",
"description": "Minimap2 index for dehosting and kitome removal",
"description": "Mash sketch used for contamination detection and speciation (Sketch comments must be a taxonomic string similar to what Kraken2 outputs)",
"pattern": "^\\S+$",
"exists": true,
"format": "file-path"
2 changes: 2 additions & 0 deletions tests/data/tables/all_values_missing.csv
@@ -0,0 +1,2 @@
header1,header2,header3
,,
Empty file added tests/data/tables/empty.csv
Empty file.
2 changes: 2 additions & 0 deletions tests/data/tables/header_missing_val.csv
@@ -0,0 +1,2 @@
,header2,header3
stuff1,stuff2,stuff3
2 changes: 2 additions & 0 deletions tests/data/tables/missing_all_headers.csv
@@ -0,0 +1,2 @@

stuff1,stuff2,stuff3
1 change: 1 addition & 0 deletions tests/data/tables/missing_all_headers_single_line.csv
@@ -0,0 +1 @@
stuff1,stuff2,stuff3
2 changes: 2 additions & 0 deletions tests/data/tables/missing_last_value.tab
@@ -0,0 +1,2 @@
header1 header2 header3
stuff1 stuff2
2 changes: 2 additions & 0 deletions tests/data/tables/missing_multiple_value_separators.tab
@@ -0,0 +1,2 @@
header1 header2 header3 header4
stuff1 stuff2
@@ -0,0 +1,2 @@
header1 header2 header3 header4
stuff1 stuff2 stuff4
2 changes: 2 additions & 0 deletions tests/data/tables/mistmatch_headers_values.csv
@@ -0,0 +1,2 @@
header1,header2,header3
stuff1,stuff2,stuff3,stuff4
2 changes: 2 additions & 0 deletions tests/data/tables/mock_missing_value.csv
@@ -0,0 +1,2 @@
header1,header2,header3
,stuff2,stuff3
2 changes: 2 additions & 0 deletions tests/data/tables/mock_missing_value.tab
@@ -0,0 +1,2 @@
header1 header2 header3
stuff2 stuff3
2 changes: 2 additions & 0 deletions tests/data/tables/mock_missing_value_2.tab
@@ -0,0 +1,2 @@
header1 header2 header3
stuff3
1 change: 1 addition & 0 deletions tests/data/tables/no_header.csv
@@ -0,0 +1 @@
stuff1,stuff2,stuff3
2 changes: 2 additions & 0 deletions tests/data/tables/no_missing.csv
@@ -0,0 +1,2 @@
header1,header2,header3
stuff1,stuff2,stuff3
2 changes: 2 additions & 0 deletions tests/data/tables/no_missing.tab
@@ -0,0 +1,2 @@
header1 header2 header3
stuff1 stuff2 stuff3
2 changes: 2 additions & 0 deletions tests/data/tables/two_missing_headers.csv
@@ -0,0 +1,2 @@
,,header3
stuff1,stuff2,stuff3
4 changes: 4 additions & 0 deletions tests/data/tables/vector.csv
@@ -0,0 +1,4 @@
header1
stuff1
stuff2
stuff3
4 changes: 4 additions & 0 deletions tests/data/tables/vector_no_hdr.csv
@@ -0,0 +1,4 @@

stuff1
stuff2
stuff3
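
Going by the new `table_values` logic in this commit, the fixtures above exercise a small set of header checks. A minimal Python sketch of those checks follows; `check_table` and `rejects` are hypothetical names, `ValueError` stands in for the Groovy `Exception`, and the file reading and `splitCsv` calls are left out, so treat it as a reading aid rather than the module's implementation.

```python
DEFAULT_INDEX_COL = "__default_index__"  # same placeholder name as in report.nf


def check_table(header_line, first_row=None, sep=","):
    """Return the headers to use, or raise ValueError for a malformed table."""
    headers = header_line.split(sep)  # keeps empty fields, like split(sep, -1)
    missing = sum(1 for h in headers if h == "")
    if missing > 1:
        raise ValueError("more than one missing header")
    if headers[0] == "":
        # A single missing header in the first column is treated as an index column.
        headers[0] = DEFAULT_INDEX_COL
    if first_row is not None:
        values = first_row.split(sep)
        if len(headers) != len(values):
            raise ValueError(f"{len(headers)} headers but {len(values)} values")
    return headers


def rejects(header_line, first_row=None):
    """True when check_table raises for this header/row pair."""
    try:
        check_table(header_line, first_row)
    except ValueError:
        return True
    return False


# header_missing_val.csv: one missing header in the first column is tolerated.
index_headers = check_table(",header2,header3", "stuff1,stuff2,stuff3")
# two_missing_headers.csv and mistmatch_headers_values.csv are rejected.
two_missing = rejects(",,header3", "stuff1,stuff2,stuff3")
mismatch = rejects("header1,header2,header3", "stuff1,stuff2,stuff3,stuff4")
```

Only the first data row is compared against the headers here, matching the two-line peek the new function does before handing the file to `splitCsv`.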