Dorado basecaller hangs at 99% completion #575
Comments
Hi @nglaszik - can you post what command you're running? Are you able to check your GPU utilization (using |
@nglaszik - are there any updates on this issue? |
Hi there, sorry for just now getting back to this. I was able to recreate this error with an updated NVIDIA driver v550.54.14. The command I run is the following:
I did notice that different models will create different errors, e.g. the v4.2 5mCG_5hmCG model hangs after only a few seconds. Maybe it's a model incompatibility I'm overlooking? I also forgot to mention the data is coming from an R10.4 flow cell, if that's relevant. I am now getting back to this project, so I can test some other permutations of models / versions of dorado, since I think the v4.3 models and dorado 0.5.3 weren't available when I was trying earlier. I've just continued using the 0.4.1 basecaller for now. Edit: I'll look at the GPU utilization as well. |
Hi there, Confirmed that 0.5.3 basecaller still hangs at 99% completion, producing a .bam file without an EOF, using the v4.3 hac model. During the hang, nvidia-smi shows that the process is still running on the GPU, and taking up the same amount of memory as before the hang. However, volatile GPU-util is at 0% so it doesn't seem to be actually processing anything. |
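For anyone wanting to confirm the missing EOF without extra tools, here is a minimal Python sketch. It relies only on the fact that a complete BAM file ends with the fixed 28-byte empty BGZF block defined in the SAM/BAM specification. The function name `has_bgzf_eof` is made up for illustration and is not part of dorado or samtools.

```python
import os

# The 28-byte empty BGZF block that terminates a complete BAM file
# (as defined in the SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Jump to the last 28 bytes and compare them with the marker.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

A `False` result only says the file was not closed cleanly; it does not say how many records, if any, were lost.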
Another update: dorado basecaller 0.5.3 can run on smaller pod5 datasets... If I split the original 46GB pod5 file into multiple pod5's, basecaller will still hang at 99% completion. However, if I choose a subset (3 pod5's of 4000 reads each) to run basecaller on, it runs to completion. Interestingly, the success of running basecaller for modified bases also seems to be dependent on the input file sizes... 5mCG_5hmCG can successfully run on a single pod5 of 4000 reads, whereas 5mC_5hmC freezes somewhere in the middle of completion. However, 5mC_5hmC runs to completion on a pod5 of 100 reads. Sounds like a memory issue, perhaps related to optimization for other video cards? It's running on an NVIDIA RTX3090 with 24GB of memory, far below the 40 or 80 on an A100. |
@tijyojwad - sorry to bug, but any insight into this? Especially the last post where dorado runs with smaller input pod5's but not with large ones? |
Hi @nglaszik - this is an odd situation and I don't have any obvious solution yet. It feels like it could be related to a specific offending read(s)... One suggestion is to fetch the read ids from BAM of the hung run (remember to collect the read ids in the |
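The suggestion above is cut off, so the following is only one possible reading of it: compare the read ids present in the BAM from the hung run against the full list of input read ids, and look at what is left over. This Python sketch assumes both lists have already been exported to plain text files with one read id per line; the file layout and the helper names `load_read_ids` and `missing_read_ids` are hypothetical.

```python
def load_read_ids(path):
    """Load read ids from a text file with one id per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def missing_read_ids(all_ids, basecalled_ids):
    """Return the input read ids absent from the basecalled output, in input order."""
    seen = set(basecalled_ids)
    return [read_id for read_id in all_ids if read_id not in seen]
```

If the leftover set is small, basecalling just those reads is a quick way to test whether specific reads trigger the hang.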
Sounds good, thank you @tijyojwad I'll try that! |
I'm having the same issue with Dorado 0.5.3 while performing RNA basecalling, using the command: dorado basecaller sup,m6A_DRACH ./pod5/ --estimate-poly-a > ./unmapped.bam My basecalling hangs at 100%. Similarly, the command works fine for individual pod5s, and the volatile GPU usage is at 0% during the hang. In my case, killing guppy produces a truncated BAM file, but it seems that all the reads are basecalled. |
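Both reports describe the same signature: the process still holds GPU memory while utilization sits at 0%. A rough Python sketch for spotting this automatically is below. It assumes `nvidia-smi` is on the PATH and uses its CSV query output; the helper names and the "all samples idle" rule are illustrative choices, not anything dorado provides.

```python
import subprocess

# nvidia-smi query that prints "utilization, memory used" per GPU as bare numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse the CSV output into (utilization percent, memory MiB) tuples, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def looks_stalled(samples, min_memory_mib=1):
    """True if every sample shows 0% utilization while memory is still allocated."""
    return bool(samples) and all(
        util == 0 and mem >= min_memory_mib for util, mem in samples
    )


def sample_gpu():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Collecting the first GPU's reading from `sample_gpu()` every few seconds and passing the list to `looks_stalled` gives a crude hang detector. A short idle stretch can also occur during normal runs, so several consecutive samples are needed before drawing conclusions.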
Thanks for reporting @pre-mRNA - what is the size of your combined dataset? |
2.98M RNA004 reads, across ~700 POD5 files |
Does this issue persist in dorado-0.8.0? Kind regards, |
Hi,
I still get similar errors with default multi-modification usage, but now it's stable once I specify the chunk/batch size, e.g.:
dorado basecaller sup,pseU,inosine_m6A,m5C "$pod5_dir/" --estimate-poly-a -r -b 416 -c 9216 --models-directory ./bin
For some reason, I also need to specify the model directory manually with the v0.8 update.
Cheers,
|
@pre-mRNA - Ok, I'll keep this issue open as the underlying problem isn't resolved.
Models should be downloaded into the cwd and cleaned up if the models-directory isn't set. Are you having trouble? |
My GPU nodes aren't internet connected, so I need to specify the model directory manually, even when the models are already present in cwd. I think Dorado 0.7 automatically searched cwd for models, whereas now the path needs to be set. Not a big deal. |
As the title says, running dorado basecaller (0.5.1) on (multiple or single) pod5 files hangs at 99% completion. The time indicated on the progress bar halts. No errors are output, and operating in verbose mode offers no additional information. The process can be terminated, and examination of the resultant bam file with dorado summary & nanopolish produces normal-looking results. The bam file doesn't have an EOF though, so it's unclear if there is more data that needs to be written. The input pod5's are converted by "pod5 convert fast5" from fast5s created by minknow.
It seems to only happen for particular pod5 files. Some files get to 100% completion. Hanging is perhaps more likely for larger files.
System: Ubuntu 20.04, NVIDIA RTX3090, Driver 470.223.02, CUDA 11.4
Edit: dorado 0.4.1 basecaller doesn't have this issue.