Feature/bruker data #275

Merged: 52 commits, merged Sep 29, 2023

Commits (52)
c4b17e4  added tdf2mzml (jspaezp, Aug 3, 2023)
0b55503  added path to conversion (jspaezp, Aug 3, 2023)
4e9a931  changed comment character (jspaezp, Aug 4, 2023)
51a6e66  added tuple of meta to tdf2mzml outs (jspaezp, Aug 4, 2023)
9cc93fa  added debug prints to diann conversion (jspaezp, Aug 4, 2023)
e8752f6  added renaming of dotd files after extraction (jspaezp, Aug 4, 2023)
b559b21  yet more debug printing info (jspaezp, Aug 4, 2023)
0793038  added not to branching (jspaezp, Aug 4, 2023)
224e340  refactoring of diann convert (jspaezp, Aug 6, 2023)
9e4c872  fixed bug where mzml AND raw files were passed (jspaezp, Aug 6, 2023)
1a86393  added speclib to schema (jspaezp, Aug 6, 2023)
92ee452  returned report in abstracted diannconvert (jspaezp, Aug 6, 2023)
33b3579  refactor and speedup of diann summary (jspaezp, Aug 6, 2023)
6ae3565  added debug info to versions (jspaezp, Aug 6, 2023)
4b58fba  moved tar version in the workflow from tracking to logging (jspaezp, Aug 6, 2023)
288415a  fixed dumb error (jspaezp, Aug 6, 2023)
ffff4c9  experimental change of the experimental design to make multiqc pass (jspaezp, Aug 6, 2023)
d9feb35  changed debug listing of contents in multiqc from tree to ls (jspaezp, Aug 6, 2023)
5e8fa66  stuff (jspaezp, Aug 7, 2023)
bf084e2  major speedup (jspaezp, Aug 9, 2023)
786798f  speed and logging improvement (jspaezp, Aug 9, 2023)
ec37001  improved error messaging when calculating coverages (jspaezp, Aug 9, 2023)
882b968  further optimization (jspaezp, Aug 10, 2023)
69ae560  changed paths to vals (jspaezp, Aug 11, 2023)
266c121  typo fix (jspaezp, Aug 11, 2023)
4d94097  even more optimization in diann conversion (jspaezp, Aug 12, 2023)
41f76c2  added a bit of debug logging (jspaezp, Aug 12, 2023)
7ce33c4  change to path in the empirical lib step and yet even more optimization (jspaezp, Aug 12, 2023)
454b911  Further speedup of diann conversion, prevent staging of un-needed fil… (jspaezp, Aug 12, 2023)
85f3060  Experimental/bruker report (#2) (jspaezp, Aug 17, 2023)
21ee38b  Merge branch 'dev' into feature/bruker_data (jspaezp, Aug 17, 2023)
f4d8cbe  incorporated code review notes (jspaezp, Aug 17, 2023)
062e6ba  minor fix on nf-core linting (jspaezp, Aug 17, 2023)
6c933e4  whitespace related linting (jspaezp, Sep 1, 2023)
282b9aa  prettier autofix of quotes (jspaezp, Sep 1, 2023)
aa32722  Experimental/bruker agg metrics (#3) (jspaezp, Sep 3, 2023)
0da9965  Updating to upstream dev branch (#4) (jspaezp, Sep 3, 2023)
f755690  updated example of cli run (jspaezp, Sep 3, 2023)
83904d9  Merge branch 'dev' into feature/bruker_data (jspaezp, Sep 3, 2023)
88ef3f3  fixed linting on decompress dotd nf file (jspaezp, Sep 10, 2023)
484a19c  Merge branch 'feature/bruker_data' of github.com:TalusBio/quantms int… (jspaezp, Sep 10, 2023)
bbd8f63  fixed inverted rows error (code review) (jspaezp, Sep 10, 2023)
50b1205  Merge branch 'dev' into feature/bruker_data (jspaezp, Sep 10, 2023)
0b11023  changed lookup key for the PRH best score (jspaezp, Sep 10, 2023)
bcc5295  fixed error that arose from fixing merge conflicts (jspaezp, Sep 10, 2023)
07a6ed2  updated pmultiqc version (jspaezp, Sep 15, 2023)
b72e000  Maintainance/finish integration with bigbio (#5) (jspaezp, Sep 19, 2023)
54b685b  Merge branch 'dev' into feature/bruker_data (jspaezp, Sep 19, 2023)
f1b0bf7  review suggestions (jspaezp, Sep 28, 2023)
503538d  Update modules/local/dotd_to_mqc/main.nf (jspaezp, Sep 28, 2023)
ce2670e  Update modules/local/decompress_dotd/main.nf (jspaezp, Sep 28, 2023)
b20d91c  Merge branch 'dev' into feature/bruker_data (jspaezp, Sep 28, 2023)
574 changes: 415 additions & 159 deletions bin/diann_convert.py

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion modules/local/diann_preliminary_analysis/main.nf
@@ -7,7 +7,7 @@ process DIANN_PRELIMINARY_ANALYSIS {
     'biocontainers/diann:v1.8.1_cv1' }"

     input:
-    tuple val(meta), file(mzML), file(predict_tsv)
+    tuple val(meta), path(mzML), path(predict_tsv)

     output:
     path "*.quant", emit: diann_quant
1 change: 1 addition & 0 deletions modules/local/diannconvert/main.nf
@@ -36,6 +36,7 @@ process DIANNCONVERT {
"""
diann_convert.py convert \\
--folder ./ \\
--exp_design ${exp_design} \\
--diann_version ./version/versions.yml \\
--dia_params "${dia_params}" \\
--charge $params.max_precursor_charge \\
10 changes: 8 additions & 2 deletions modules/local/sdrfparsing/main.nf
@@ -23,12 +23,18 @@ process SDRFPARSING {
"""
## -t2 since the one-table format parser is broken in OpenMS2.5
## -l for legacy behavior to always add sample columns
## TODO Update the sdrf-pipelines to dynamic print versions

parse_sdrf convert-openms -t2 -l -s ${sdrf} 2>&1 | tee ${sdrf.baseName}_parsing.log
## JSPP 2023-Aug -- Adding --raw for now, this will allow the development of the
# bypass diann pipelie but break every other aspect of it. Make sure
# this flag is gone when PRing

parse_sdrf convert-openms --raw -t2 -l -s ${sdrf} 2>&1 | tee ${sdrf.baseName}_parsing.log
mv openms.tsv ${sdrf.baseName}_config.tsv
mv experimental_design.tsv ${sdrf.baseName}_openms_design.tsv

## TODO Update the sdrf-pipelines to dynamic print versions
# Version reporting can now be programmatic, since:
# https://github.com/bigbio/sdrf-pipelines/pull/134
cat <<-END_VERSIONS > versions.yml
"${task.process}":
sdrf-pipelines: \$(echo "0.0.22")
63 changes: 63 additions & 0 deletions modules/local/tdf2mzml/main.nf
@@ -0,0 +1,63 @@

process TDF2MZML {
tag "$meta.mzml_id"
label 'process_low'
label 'process_single'
label 'error_retry'

// for rawfileparser this is conda "conda-forge::mono bioconda::thermorawfileparser=1.3.4"
// conda is not enabled for DIA so ... disabling anyway

// container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
// 'https://depot.galaxyproject.org/singularity/thermorawfileparser:1.3.4--ha8f3691_0' :
// 'quay.io/biocontainers/thermorawfileparser:1.3.4--ha8f3691_0' }"
container 'mfreitas/tdf2mzml:latest' // I don't know which stable tag to use...

stageInMode {
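    // Retry-aware staging: attempt 1 uses the cheapest mode for the executor
    // (symlink on AWS Batch, hard link elsewhere); attempt 2 steps up
    // (copy on AWS Batch, symlink elsewhere); any later attempt falls back
    // to a plain copy.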
if (task.attempt == 1) {
if (executor == "awsbatch") {
'symlink'
} else {
'link'
}
} else if (task.attempt == 2) {
if (executor == "awsbatch") {
'copy'
} else {
'symlink'
}
} else {
'copy'
}
}

input:
tuple val(meta), path(rawfile)

output:
tuple val(meta), path("*.mzML"), emit: mzmls_converted
tuple val(meta), path("*.d"), emit: dotd_files
path "versions.yml", emit: version
path "*.log", emit: log

script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.mzml_id}"

"""
tar --version
echo "Unpacking..." | tee --append ${rawfile.baseName}_conversion.log
tar -xvf ${rawfile} 2>&1 | tee --append ${rawfile.baseName}_conversion.log
echo "Converting..." | tee --append ${rawfile.baseName}_conversion.log
tdf2mzml.py -i *.d 2>&1 | tee --append ${rawfile.baseName}_conversion.log
echo "Compressing..." | tee --append ${rawfile.baseName}_conversion.log
mv *.mzml ${file(rawfile.baseName).baseName}.mzML
mv *.d ${file(rawfile.baseName).baseName}.d
# gzip ${file(rawfile.baseName).baseName}.mzML

cat <<-END_VERSIONS > versions.yml
"${task.process}":
tdf2mzml.py: \$(tdf2mzml.py --version)
END_VERSIONS
"""
}
42 changes: 42 additions & 0 deletions modules/local/tdf2mzml/meta.yml
@@ -0,0 +1,42 @@
name: tdf2mzml
description: convert raw Bruker files to mzML files
keywords:
- raw
- mzML
- .d
tools:
- tdf2mzml:
description: |
It takes a Bruker .d raw file as input and outputs indexed mzML
homepage: https://github.com/mafreitas/tdf2mzml
documentation: https://github.com/mafreitas/tdf2mzml
input:
- meta:
type: map
description: |
Groovy Map containing sample information
- rawfile:
type: file
description: |
Bruker Raw file archived using tar
pattern: "*.d.tar"
output:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'MD5', enzyme:trypsin ]
- mzml:
type: file
description: indexed mzML
pattern: "*.mzML"
- log:
type: file
description: log file
pattern: "*.log"
- version:
type: file
description: File containing software version
pattern: "versions.yml"
authors:
- "@jspaezp"

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion nextflow.config
@@ -157,7 +157,8 @@ params {
     mass_acc_automatic           = true
     pg_level                     = 2
     species_genes                = false
-    diann_normalize              = true
+    diann_normalize              = true
+    diann_speclib                = ''

     // MSstats general options
     msstats_remove_one_feat_prot = true
7 changes: 7 additions & 0 deletions nextflow_schema.json
@@ -891,6 +891,13 @@
"fa_icon": "far fa-check-square",
"default": false
},
"diann_speclib": {
"type": "string",
"description": "The spectral library to use for DIA-NN",
"fa_icon": "fas fa-file",
"help_text": "If passed, will use that spectral library to carry out the DIA-NN search, instead of predicting one from the fasta file.",
"hidden": false
},
"diann_debug": {
"type": "integer",
"description": "Debug level",
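
For illustration, a run using this new parameter might look like the following (hypothetical command; the pipeline name and paths are placeholders, not taken from this PR):

    nextflow run bigbio/quantms -profile docker --input design.sdrf.tsv --diann_speclib /path/to/library.predicted.speclib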
24 changes: 19 additions & 5 deletions subworkflows/local/file_preparation.nf
@@ -3,13 +3,14 @@
 //

 include { THERMORAWFILEPARSER } from '../../modules/local/thermorawfileparser/main'
+include { TDF2MZML } from '../../modules/local/tdf2mzml/main'
 include { MZMLINDEXING } from '../../modules/local/openms/mzmlindexing/main'
 include { MZMLSTATISTICS } from '../../modules/local/mzmlstatistics/main'
 include { OPENMSPEAKPICKER } from '../../modules/local/openms/openmspeakpicker/main'

 workflow FILE_PREPARATION {
     take:
-    ch_mzmls // channel: [ val(meta), raw/mzml ]
+    ch_mzmls // channel: [ val(meta), raw/mzml/d.tar ]

     main:
     ch_versions = Channel.empty()
@@ -23,6 +24,7 @@ workflow FILE_PREPARATION {
         .branch {
             raw: WorkflowQuantms.hasExtension(it[1], 'raw')
             mzML: WorkflowQuantms.hasExtension(it[1], 'mzML')
+            dotD: WorkflowQuantms.hasExtension(it[1], '.d.tar')

[Review thread on the dotD branch line above]

jspaezp (Contributor, author):
TODO add branch here with a plain .d and mix them (and add the exception to TDF2MZML)

Member:
@jspaezp wouldn't it be more interesting to have .d gzipped rather than tarred? I'm asking because I have seen most of the compressed files using gzip instead of tar.

jspaezp (Contributor, author):
I could implement either. We tested locally, and since most of the data inside the file is already compressed, gzip offered very little benefit (<10% compression) whilst dramatically increasing the time it took to generate (4x longer, if I recall correctly). So, in our use case at least, it was not worth having the compression.

I could definitely add something like this to the extraction step:
https://gist.github.com/hightemp/5071909#file-bash-aliases-L32-L60

and correspondingly I would just have the branch be something like

        raw: WorkflowQuantms.hasExtension(it[1], 'raw')
        mzML: WorkflowQuantms.hasExtension(it[1], 'mzML')
        dotD: WorkflowQuantms.hasExtension(it[1], '.d.{tar,tar.gz,tar.bz....}')

I will add this to the options (I think we should enforce having the .d though; it would be very hard to track file properties if we attempt to allow "myfile.tar.gz").

Will add this in the next commit

Member:
@jspaezp interesting information about why you use .tar. It would be great to support this approach for those using other formats.

 raw: WorkflowQuantms.hasExtension(it[1], 'raw')
 mzML: WorkflowQuantms.hasExtension(it[1], 'mzML')
 dotD: WorkflowQuantms.hasExtension(it[1], '.d.{tar,tar.gz,tar.bz....}')

jspaezp (Contributor, author), Aug 15, 2023:
Note to self:

public static boolean hasExtension(file, extension) {
return file.toString().toLowerCase().endsWith(extension.toLowerCase())
}

since the extension is checked here, a more verbose branching is needed. I think that supporting tar/tar.gz/tar.bz/zip should be enough for now.
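
A minimal sketch of what that more verbose branching could build on (hypothetical "hasAnyExtension" helper, not part of WorkflowQuantms; only the single-suffix hasExtension above exists):

    // Hypothetical helper, shown only to illustrate endsWith-based matching
    // over several archive suffixes; not in the codebase.
    public static boolean hasAnyExtension(file, List<String> extensions) {
        def name = file.toString().toLowerCase()
        return extensions.any { ext -> name.endsWith(ext.toLowerCase()) }
    }

    // e.g. in the branch block:
    // dotD: hasAnyExtension(it[1], ['.d.tar', '.d.tar.gz', '.d.tar.bz2', '.d.zip'])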

jspaezp (Contributor, author):
added here: 85f3060

jspaezp (Contributor, author):
@ypriverol
On a semi-related note: do you have any suggestions for public data? If not, I think we could upload something.
I was thinking of maybe using https://www.ebi.ac.uk/pride/archive/projects/PXD034128 but it's phospho, so it might take A WHILE to search. And it seems like most people are uploading the data as a single .zip with all the files (and I am not sure if we want to support that, or if there is a way to stage the files more efficiently).

LMK what you think

}
.set { ch_branched_input }

@@ -46,22 +48,34 @@
     ch_results = ch_results.mix(ch_branched_input_mzMLs.inputIndexedMzML)

     THERMORAWFILEPARSER( ch_branched_input.raw )
+    // Output is
+    // {'mzmls_converted': Tuple[val(meta), path(mzml)],
+    //  'version': Path(versions.yml),
+    //  'log': Path(*.txt)}
+
+    // Where meta is the same as the input meta
     ch_versions = ch_versions.mix(THERMORAWFILEPARSER.out.version)
     ch_results = ch_results.mix(THERMORAWFILEPARSER.out.mzmls_converted)

     MZMLINDEXING( ch_branched_input_mzMLs.nonIndexedMzML )
     ch_versions = ch_versions.mix(MZMLINDEXING.out.version)
     ch_results = ch_results.mix(MZMLINDEXING.out.mzmls_indexed)

-    ch_results.map{ it -> [it[0], it[1]] }.set{ ch_mzml }
+    ch_results.map{ it -> [it[0], it[1]] }.set{ indexed_mzml_bundle }

+    TDF2MZML( ch_branched_input.dotD )
+    ch_versions = ch_versions.mix(TDF2MZML.out.version)
+    ch_results = indexed_mzml_bundle.mix(TDF2MZML.out.dotd_files)
+    indexed_mzml_bundle = indexed_mzml_bundle.mix(TDF2MZML.out.mzmls_converted)
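     // Note: ch_results carries the extracted .d directories downstream (mixed
     // with the mzMLs), while indexed_mzml_bundle keeps only mzMLs for the
     // statistics and optional peak-picking steps below.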

-    MZMLSTATISTICS( ch_mzml )
+    MZMLSTATISTICS( indexed_mzml_bundle )
     ch_statistics = ch_statistics.mix(MZMLSTATISTICS.out.mzml_statistics.collect())
     ch_versions = ch_versions.mix(MZMLSTATISTICS.out.version)

     if (params.openms_peakpicking){
+        // If the peak picker is enabled, it will over-write, not bypass, the .d files
         OPENMSPEAKPICKER (
-            ch_results
+            indexed_mzml_bundle
         )

         ch_versions = ch_versions.mix(OPENMSPEAKPICKER.out.version)
@@ -70,7 +84,7 @@


emit:
-    results = ch_results // channel: [val(mzml_id), indexedmzml]
+    results = ch_results // channel: [val(mzml_id), indexedmzml|.d.tar]
statistics = ch_statistics // channel: [ *_mzml_info.tsv ]
version = ch_versions // channel: [ *.version.txt ]
}
12 changes: 9 additions & 3 deletions workflows/dia.nf
@@ -55,12 +55,18 @@ workflow DIA {
     //
     // MODULE: SILICOLIBRARYGENERATION
     //
-    SILICOLIBRARYGENERATION(ch_searchdb, DIANNCFG.out.diann_cfg)
+    if (!params.diann_speclib) {
+        SILICOLIBRARYGENERATION(ch_searchdb, DIANNCFG.out.diann_cfg)
+        speclib = SILICOLIBRARYGENERATION.out.predict_speclib
+    } else {
+        speclib = Channel.fromPath(params.diann_speclib)
+    }

     //
     // MODULE: DIANN_PRELIMINARY_ANALYSIS
     //
-    DIANN_PRELIMINARY_ANALYSIS(ch_file_preparation_results.combine(SILICOLIBRARYGENERATION.out.predict_speclib))
+    DIANN_PRELIMINARY_ANALYSIS(ch_file_preparation_results.combine(speclib))
     ch_software_versions = ch_software_versions.mix(DIANN_PRELIMINARY_ANALYSIS.out.version.ifEmpty(null))

     //
@@ -69,7 +75,7 @@
     ASSEMBLE_EMPIRICAL_LIBRARY(ch_result.mzml.collect(),
         meta,
         DIANN_PRELIMINARY_ANALYSIS.out.diann_quant.collect(),
-        SILICOLIBRARYGENERATION.out.predict_speclib
+        speclib
     )
     ch_software_versions = ch_software_versions.mix(ASSEMBLE_EMPIRICAL_LIBRARY.out.version.ifEmpty(null))

Expand Down