Commit 9e34019

feat(assembly): predict megahit memory consumption
1 parent 41dd5d5 commit 9e34019

13 files changed (+408 -77 lines changed)

docs/modules/assembly.md

+23 -3
Original file line number | Diff line number | Diff line change
@@ -30,10 +30,30 @@
3030

3131
The output is a gzipped fasta file containing contigs.
3232

33-
## Megahit Error Handling
33+
## Megahit
34+
35+
### Error Handling
3436

3537
On error with exit codes ([-9, 137]) (e.g. due to memory restrictions), the tool is executed again with higher cpu and memory values.
36-
The memory and cpu values are computed by the formula 2^(number of attempts) * (cpu/memory value of the assigned flavour).
37-
The highest possible cpu/memory value is restricted by the highest cpu/memory value of all flavours defined in the resource section
38+
The memory and cpu values are computed by the formula 2^(number of attempts) * (cpu/memory value of the assigned or predicted flavor).
39+
The highest possible cpu/memory value is restricted by the highest cpu/memory value of all flavors defined in the resource section
3840
(see global [configuration](../pipeline_configuration.md) section).
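The retry rule described above can be illustrated with a minimal sketch. It is written in Python rather than the pipeline's Nextflow/Groovy, and it is not the pipeline's implementation. The function name `scaled_resource`, the flavor table, and the convention that `attempts` is the exponent exactly as written in the formula are all assumptions made for illustration; the documentation does not state whether counting starts at 0 or 1.

```python
def scaled_resource(base: int, attempts: int, maximum: int) -> int:
    """Return 2^attempts * base, capped at the largest value of any flavor."""
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return min((2 ** attempts) * base, maximum)


# Hypothetical flavors: name -> (cpus, memory value). Not taken from the pipeline.
FLAVORS = {"small": (2, 4), "medium": (14, 32), "large": (28, 256)}

# The cap is the highest cpu/memory value over all defined flavors.
max_cpus = max(cpus for cpus, _ in FLAVORS.values())
max_memory = max(memory for _, memory in FLAVORS.values())

base_cpus, base_memory = FLAVORS["medium"]
for attempts in range(4):
    print(
        attempts,
        scaled_resource(base_cpus, attempts, max_cpus),
        scaled_resource(base_memory, attempts, max_memory),
    )
```

Because of the cap, repeated retries eventually stop increasing the request: once the largest flavor is reached, further attempts run with the same resources.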
3941

42+
### Peak memory usage prediction
43+
44+
Memory consumption of an assembler varies with the diversity and size of the dataset. We trained a machine learning model on k-mer frequencies
45+
and the Nonpareil diversity index to predict the peak memory consumption of Megahit in our full pipeline mode. The resources
46+
requested for the assembler are thereby fitted to the resources that a specific dataset actually needs. If this
47+
mode is enabled, Nonpareil and Jellyfish, which are part of the quality control module, are automatically executed before the assembler runs.
48+
49+
```
50+
resources:
51+
RAM:
52+
mode: MODE
53+
predictMinLabel: LABEL
54+
```
55+
56+
where
57+
* MODE can be either 'PREDICT' for predicting memory usage or 'DEFAULT' for using a default flavor defined in the resources section.
58+
59+
* LABEL is the minimum flavor: it is used whenever the predicted RAM is below the memory value defined for the LABEL flavor. Set it to AUTO to always use the predicted flavor.
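One possible reading of the `predictMinLabel` rule is sketched below in Python. This is an interpretation, not the pipeline's code. The function `select_flavor`, the flavor table, and the rule that the "predicted flavor" is the smallest flavor whose memory value covers the prediction are assumptions made for illustration.

```python
def select_flavor(predicted_ram, flavors, predict_min_label="AUTO"):
    """Pick a flavor name for a predicted peak RAM value.

    flavors maps a flavor name to its memory value (hypothetical table).
    """
    by_memory = sorted(flavors.items(), key=lambda item: item[1])
    # Assumption: the predicted flavor is the smallest one that fits the
    # prediction; if none fits, fall back to the largest flavor.
    predicted = next(
        (name for name, memory in by_memory if memory >= predicted_ram),
        by_memory[-1][0],
    )
    if predict_min_label == "AUTO":
        return predicted
    if predict_min_label not in flavors:
        raise KeyError(f"unknown flavor label: {predict_min_label}")
    # LABEL acts as a lower bound: use it when the prediction is below it.
    if predicted_ram < flavors[predict_min_label]:
        return predict_min_label
    return predicted


# Hypothetical flavors: name -> memory value.
flavors = {"small": 4, "medium": 32, "large": 256}
print(select_flavor(3, flavors))             # AUTO: use the predicted flavor
print(select_flavor(3, flavors, "medium"))   # prediction below LABEL: use LABEL
print(select_flavor(100, flavors, "medium")) # prediction above LABEL: use prediction
```

Setting a concrete label trades some efficiency for safety: small datasets never get less than the chosen flavor, which guards against underestimated predictions.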

docs/pipeline_configuration.md

+2
@@ -241,3 +241,5 @@ resources:
241241
cpus: 1
242242
memory: 2
243243
```
244+
245+
The full pipeline mode can predict the memory consumption of some assemblers (see the assembly module section).

example_params/assembly.yml

+4
@@ -11,6 +11,10 @@ steps:
1111
megahit:
1212
additionalParams: " --min-contig-len 200 "
1313
fastg: true
14+
resources:
15+
RAM:
16+
mode: 'DEFAULT'
17+
predictMinLabel: 'AUTO'
1418

1519
resources:
1620
large:

example_params/fullPipeline.yml

+12
@@ -15,10 +15,22 @@ steps:
1515
fastp:
1616
# Example params: " --cut_front --cut_tail --detect_adapter_for_pe "
1717
additionalParams: " "
18+
nonpareil:
19+
additionalParams: " -v 10 -r 1234 "
20+
jellyfish:
21+
additionalParams:
22+
count: " -m 21 -s 100M "
23+
histo: " "
24+
1825
assembly:
1926
megahit:
2027
additionalParams: " --min-contig-len 200 "
2128
fastg: true
29+
resources:
30+
RAM:
31+
mode: 'PREDICT'
32+
predictMinLabel: 'medium'
33+
2234
binning:
2335
bowtie:
2436
additionalParams:

example_params/fullPipeline_fraction/fullPipeline_fraction_binning_metabinner.yml

+4
@@ -18,6 +18,10 @@ steps:
1818
assembly:
1919
megahit:
2020
additionalParams: " --min-contig-len 200 "
21+
resources:
22+
RAM:
23+
mode: 'DEFAULT'
24+
predictMinLabel: 'AUTO'
2125
binning:
2226
bowtie:
2327
additionalParams:

example_params/fullPipeline_fraction/fullPipeline_fraction_magAttributes.yml

+4
@@ -18,6 +18,10 @@ steps:
1818
assembly:
1919
megahit:
2020
additionalParams: " --min-contig-len 200 "
21+
resources:
22+
RAM:
23+
mode: 'DEFAULT'
24+
predictMinLabel: 'AUTO'
2125
binning:
2226
bowtie:
2327
additionalParams:

example_params/fullPipeline_fraction/fullPipeline_fraction_plasmid.yml

+4
@@ -19,6 +19,10 @@ steps:
1919
megahit:
2020
fastg: false
2121
additionalParams: " --min-contig-len 200 "
22+
resources:
23+
RAM:
24+
mode: 'DEFAULT'
25+
predictMinLabel: 'AUTO'
2226
binning:
2327
bowtie:
2428
additionalParams:

main.nf

+14 -1
@@ -187,6 +187,19 @@ workflow _wConfigurePipeline {
187187
assembler, parameter -> params.steps.assembly.get(assembler).putAll(fastg)
188188
}
189189
}
190+
191+
// If memory resources should be predicted for megahit, then nonpareil and jellyfish
192+
// must be enabled
193+
if(params.steps?.assembly?.megahit?.resources?.RAM?.mode == "PREDICT"){
194+
if(!params.steps?.qc.containsKey("nonpareil")){
195+
def nonpareil = [ nonpareil: [additionalParams: " -v 10 -r 1234 "]]
196+
params.steps.qc.putAll(nonpareil)
197+
}
198+
if(!params.steps?.qc.containsKey("jellyfish")){
199+
def jellyfish = [ jellyfish: [additionalParams: [ count: " -m 21 -s 100M ", histo: " "]]]
200+
params.steps.qc.putAll(jellyfish)
201+
}
202+
}
190203
}
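The effect of the added configuration block can be illustrated with a small Python sketch that works on a plain dictionary instead of Nextflow's `params`. It is a hedged re-expression, not the pipeline's code. The names `dig` and `with_prediction_prerequisites` are hypothetical. Unlike the Groovy code, the sketch returns a copy and creates a missing `qc` section instead of expecting it to exist.

```python
import copy

# Defaults copied from the diff above.
NONPAREIL_DEFAULT = {"additionalParams": " -v 10 -r 1234 "}
JELLYFISH_DEFAULT = {"additionalParams": {"count": " -m 21 -s 100M ", "histo": " "}}


def dig(mapping, *keys):
    """Follow nested keys; return None as soon as one is missing (like Groovy's ?.)."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def with_prediction_prerequisites(params):
    """Return a copy of params with nonpareil/jellyfish enabled if mode is PREDICT."""
    result = copy.deepcopy(params)
    mode = dig(result, "steps", "assembly", "megahit", "resources", "RAM", "mode")
    if mode != "PREDICT":
        return result
    steps = result["steps"]
    # Assumption: a missing or empty qc section is created here.
    if not isinstance(steps.get("qc"), dict):
        steps["qc"] = {}
    qc = steps["qc"]
    # User-provided settings win; defaults are added only for absent keys.
    if "nonpareil" not in qc:
        qc["nonpareil"] = copy.deepcopy(NONPAREIL_DEFAULT)
    if "jellyfish" not in qc:
        qc["jellyfish"] = copy.deepcopy(JELLYFISH_DEFAULT)
    return result


params = {
    "steps": {
        "qc": {"fastp": {"additionalParams": " "}},
        "assembly": {"megahit": {"resources": {"RAM": {"mode": "PREDICT"}}}},
    }
}
print(sorted(with_prediction_prerequisites(params)["steps"]["qc"]))
```

Adding defaults only for absent keys mirrors the `containsKey` checks in the diff: a user who already configured `nonpareil` or `jellyfish` keeps their own parameters.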
191204

192205

@@ -224,7 +237,7 @@ workflow wPipeline {
224237
wQualityControlList.out.readsPair \
225238
| join(wQualityControlList.out.readsSingle) | set { qcReads }
226239

227-
wAssemblyList(qcReads)
240+
wAssemblyList(qcReads, wQualityControlList.out.nonpareil, wQualityControlList.out.kmerFrequencies)
228241

229242
wBinning(wAssemblyList.out.contigs, qcReads)
230243

models/assembler/megahit/default.pkl

326 KB
Binary file not shown.
