Dorado duplex using older dna model for V10? #1210
Hi @Patricie34,
You're correct, we do not have duplex models for the v10 condition. My apologies here, I should have spotted that this would have been a problem in your original issue.
Your pipeline sounds reasonable: basecall the v14 data in duplex mode and the v10 data in simplex mode, then map each and merge the resulting BAMs.
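For the final merge step, a minimal sketch with samtools (assuming both BAMs are aligned to the same reference genome; the file names here are placeholders, not from your setup):
samtools sort -o v14.duplex.sorted.bam v14.duplex.bam
samtools sort -o v10.simplex.sorted.bam v10.simplex.bam
samtools merge -f merged.bam v14.duplex.sorted.bam v10.simplex.sorted.bam
samtools index merged.bam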
Please check out this documentation on duplex performance. Best regards,
Hi @HalfPhoton, thank you very much for your response. Over the last few days I have been working on setting up duplex basecalling on Kubernetes. I followed the documentation, as you advised, and first ran the split-Pod5-by-channel process, followed by duplex basecalling. Unfortunately, the basecalling process fails after three days, probably due to memory limits; I have only 1 GPU available for this. Do you have any suggestions on how to make it work? The documentation mentions running one job per, for example, 100 channels. Does this mean that if I have, for instance, 2000 Pod5 files for one sample after splitting by channel, I should run around 20 processes (jobs), each processing 100 files? I am using a Nextflow pipeline. Below is the code for these two processes:
process SPLIT_POD5 {
tag "SPLIT_POD5 on $sample.name using $task.cpus CPUs $task.memory"
publishDir "${params.outDir}/${sample.name}/duplex", mode:'copy'
label "l_cpu"
label "xl_mem"
input:
tuple val(name), val(sample), val(pod5_files)
output:
tuple val(name), val(sample), path("split_by_channel"), emit: split_pod5
path "summary.tsv", emit: summary
script:
"""
echo "SPLIT_POD5 on ${name}"
pip install pod5
mkdir split_by_channel
pod5 view ${pod5_files} --include "read_id,channel" --output summary.tsv
pod5 subset ${pod5_files} --summary summary.tsv --columns channel --output split_by_channel
"""
}
process DORADO_DUPLEX {
tag "DORADO_DUPLEX on ${sample.name} using ${task.cpus} CPUs and GPU ${sample.gpu}"
publishDir "${params.outDir}/${sample.name}/mapped/duplex", mode: 'copy'
container 'ontresearch/dorado:latest'
accelerator 1, type: "${sample.gpu}"
label "xl_cpu"
label "xxxl_mem"
maxForks 2
input:
tuple val(name), val(sample), path(split_pod5s)
output:
tuple val(name), val(sample), path("${sample.name}.calls.duplex.bam"), path("${sample.name}.summary.tsv")
script:
def model = get_model(sample.basecall_qual, sample.chemistry)
"""
echo "DORADO_DUPLEX on ${sample.name} "
echo "Model: ${model}"
echo "Downloading model..."
dorado download --model ${model}
echo "Basecalling without methylation..."
dorado duplex ${model} ${split_pod5s} \
--reference ${sample.ref} \
--verbose \
-o ${sample.name}.calls.duplex.bam
dorado summary ${sample.name}.calls.duplex.bam > ${sample.name}.summary.tsv
echo "Finished DORADO_DUPLEX on ${sample.name} "
"""
}
Thank you! Best regards, Patricie
Hi @Patricie34, I believe the issue here is that you're basecalling all of the split pod5 files together in one job. The purpose of splitting the pod5 files by channel is to improve pair-searching efficiency by providing mostly plausible read pairs instead of all reads. This requires you to be more specific with the dorado call so that only data from one (or a few) channels is loaded in each process. What you need to do is run dorado duplex separately on small collections of per-channel pod5 files. The size of the collections is up to you, mostly depending on how long you want each process to run. There is a diminishing return when calling single per-channel pod5 files, as the Nextflow process / dorado load time becomes a significant fraction of the overall wall time. I'd recommend trying collections in the range of 20 to 200 (see the sketch below). Best regards,
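For illustration, a minimal sketch of one such job, assuming a subset of the per-channel pod5 files has been gathered into one directory (the directory, glob, and model names are placeholders, not taken from your setup):
mkdir -p batch_001
mv split_by_channel/channel-1*.pod5 batch_001/        # placeholder glob: ~20-200 per-channel files
dorado download --model dna_r10.4.1_e8.2_400bps_sup@v5.0.0
dorado duplex dna_r10.4.1_e8.2_400bps_sup@v5.0.0 batch_001/ > batch_001.duplex.bam   # dorado writes the BAM to stdout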
Hi @HalfPhoton, thank you for your advice. I tried running dorado duplex on chunks of the split pod5 files, but the process fails with:
[error] Zero positional arguments expected, did you mean --dump_stats_file VAR
That's why I directly used the entire split_by_channel directory; only that way did it actually work. Do you have any idea what could be causing the issue? I'm guessing it might have something to do with how the input split_pod5 files are being streamed from the previous step, but I haven't been able to figure it out. I've tried different approaches and always ended up with the same error. My guess is that it's more likely a problem with Nextflow rather than the dorado command itself. The code I used is the following:
process SPLIT_POD5 {
tag "SPLIT_POD5 on $sample.name using $task.cpus CPUs $task.memory"
publishDir "${params.outDir}/${sample.name}/nano/duplex", mode:'copy'
label "l_cpu"
label "xl_mem"
input:
tuple val(name), val(sample), val(pod5_files)
output:
tuple val(name), val(sample), path("split_by_channel/*.pod5"), emit: split_pod5
path "summary.tsv", emit: summary
script:
"""
echo "SPLIT_POD5 on ${name}"
pip install pod5
mkdir split_by_channel
pod5 view ${pod5_files} --include "read_id,channel" --output summary.tsv
pod5 subset ${pod5_files} --summary summary.tsv --columns channel --output split_by_channel
"""
}
process DORADO_DUPLEX {
tag "DORADO_DUPLEX on ${sample.name} chunk of ${chunk_size} files using ${task.cpus} CPUs and GPU ${sample.gpu}"
container 'ontresearch/dorado:latest'
publishDir "${params.outDir}/${sample.name}/nano/mapped/duplex/calls", mode: 'copy'
accelerator 1, type: "${sample.gpu}"
label "xl_cpu"
label "xxxl_mem"
input:
tuple val(name), val(sample), val(chunk_size), val(pod5_files)
output:
tuple val(name), val(sample), path("${sample.name}_chunk${task.index}.calls.duplex.bam"), path("${sample.name}.summary.tsv")
//tuple val(name), val(sample), path("${sample.name}_batch${batch_index}.methyl.duplex.bam"), path("${sample.name}_batch${batch_index}.methyl.summary.tsv"), optional: true
script:
def model = get_model(sample.basecall_qual, sample.chemistry)
"""
echo "DORADO_DUPLEX on ${sample.name} chunk ${task.index} (${chunk_size} files)"
echo "Model: ${model}"
echo "Downloading model..."
dorado download --model ${model}
echo "Basecalling without methylation..."
dorado duplex ${model} ${pod5_files} \
--reference ${sample.ref} \
--verbose \
-o ${sample.name}_chunk${task.index}.calls.duplex.bam
dorado summary ${sample.name}_chunk${task.index}.calls.duplex.bam > ${sample.name}.summary.tsv
echo "Finished DORADO_DUPLEX on ${sample.name} chunk ${task.index}"
"""
}
The input channel looks like this:
Split_pod5s.split_pod5
.flatMap { name, meta, pod5s -> pod5s.collate(200).collect { chunk -> tuple(name, meta, chunk.size(), chunk)} }
.set { chunked_pod5s }
Dorado_duplex = DORADO_DUPLEX(chunked_pod5s)
Thanks a lot for your help! Best regards, Patricie
Hi @Patricie34,
I'm guessing that what caused this error is that multiple pod5 file paths were passed to dorado on the command line. All you might have to do is move the staged pod5 files into a subdirectory and call dorado on that directory instead of on the list of pod5 files, which I think is what's currently happening. Alternatively, you could batch together collections of pod5 files at the end of SPLIT_POD5 and emit them as:
output:
tuple val(name), val(sample), path("batches/batch_*"), emit: split_pod5
You can then run dorado for each of these directories in separate DORADO_DUPLEX tasks (a Nextflow sketch follows the bash snippet below). Best regards,
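As a rough sketch of the first option, assuming the chunk is declared with path(pod5_files) so Nextflow stages the files into the task directory (the pod5_dir name is only an illustration):
script:
"""
mkdir pod5_dir
mv *.pod5 pod5_dir/          # gather the staged chunk into a single directory
dorado duplex ${model} pod5_dir \
    --reference ${sample.ref} \
    > ${sample.name}_chunk${task.index}.calls.duplex.bam
"""
This keeps a single positional data path on the dorado command line, which should avoid the zero-positional-arguments error.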
Hi @HalfPhoton, That's actually what I did the first time – I used the subdirectory where the split POD5 files were located. However, the problem was that it ended up using all >2000 files. So, I thought I needed to separate them into batches of 20-200, as you suggested. Best regards, Patricie
To elaborate on creating subdirectories of batches: extend the bash script in SPLIT_POD5 as follows.
... # the pod5 subset by channel stuff
mkdir batches
BATCH_SIZE=20
IDX=1
files=($(find "split_by_channel/" -maxdepth 1 -type f))
# Move BATCH_SIZE files into a new subdirectory batch_${IDX}
while [ ${#files[@]} -gt 0 ]; do
# Create new directory
sub="batches/batch_${IDX}"
mkdir -p "${sub}"
# Move up to BATCH_SIZE files
for ((i = 0; i < BATCH_SIZE && i < ${#files[@]}; i++)); do
mv "${files[$i]}" "$sub"
done
# Remove moved files from the list
files=("${files[@]:BATCH_SIZE}")
((IDX++))
done
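On the Nextflow side, one way (a sketch only, reusing the channel names from your earlier snippets) to launch one DORADO_DUPLEX task per emitted batch directory is to transpose the output channel:
Split_pod5s.split_pod5
    .transpose()          // emit one (name, sample, batch_dir) tuple per batch directory
    .set { batched_pod5s }
Dorado_duplex = DORADO_DUPLEX(batched_pod5s)
DORADO_DUPLEX would then declare the batch directory as path(batch_dir) in its input tuple and call dorado duplex on ${batch_dir}.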
Hi @HalfPhoton, thank you very much for helping me set up dorado duplex. One thing that comes to mind: since duplex basecalling can be handled in a distributed way on Kubernetes, is there a way to do the same with simplex basecalling? Thank you. Best regards, Patricie
Hi @Patricie34, you're welcome! Yes, you can apply similar steps to distribute simplex basecalling (see the sketch below). Best regards,
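For illustration, a per-batch simplex job could look like the following sketch. Since simplex basecalling does not need duplex pair searching, there is no need to split by channel; you can simply batch the original pod5 files into directories (directory, reference, and model names below are placeholders):
dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.0.0 batch_001/ \
    --reference ref.fa \
    > batch_001.simplex.bam
dorado summary batch_001.simplex.bam > batch_001.summary.tsv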
Hi everyone,
I previously discussed an issue here regarding merging data from samples sequenced using V10 and V14 chemistries (#1160).
My goal was to perform dorado duplex basecalling, map the reads to a reference genome, and then merge the BAM files before variant calling. However, I encountered an error when trying to use dorado duplex with the model [email protected].
I assume duplex basecalling is not available for these older models. Is that correct?
If so, could I use duplex basecalling for V14 samples with [email protected], simplex basecalling for V10 samples, and then proceed with mapping and merging?
Additionally, I would like to ask for advice on performing duplex basecalling on Kubernetes.
Is there a way to speed it up?
Thank you very much!
Regards,
Patricie