
Dorado basecaller hangs at 99% completion #575

Open
nglaszik opened this issue Jan 10, 2024 · 15 comments

Comments

@nglaszik

nglaszik commented Jan 10, 2024

As the title says, running dorado basecaller (0.5.1) on pod5 files (single or multiple) hangs at 99% completion. The time shown on the progress bar stops advancing, no errors are output, and running in verbose mode gives no additional information. The process can be terminated, and examining the resulting BAM file with dorado summary & nanopolish produces normal-looking results. The BAM file doesn't have an EOF marker though, so it's unclear whether more data still needed to be written. The input pod5s were converted with "pod5 convert fast5" from fast5s created by MinKNOW.
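A minimal sketch of that conversion step (paths are placeholders; the exact options can be confirmed with pod5 convert fast5 --help):

pod5 convert fast5 ./fast5_dir/*.fast5 --output output.pod5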

It seems to only happen for particular pod5 files. Some files get to 100% completion. Hanging is perhaps more likely for larger files.

System: Ubuntu 20.04, NVIDIA RTX3090, Driver 470.223.02, CUDA 11.4

Edit: dorado 0.4.1 basecaller doesn't have this issue.

@tijyojwad
Collaborator

Hi @nglaszik - can you post the command you're running? Are you able to check your GPU utilization (using nvidia-smi or nvtop) when the run is stuck at 99%? Is it showing anything >0%?
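For example, a one-liner that polls utilization and memory once a second (standard nvidia-smi query flags):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1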

@HalfPhoton
Collaborator

@nglaszik - are there any updates on this issue?

@nglaszik
Author

nglaszik commented Feb 28, 2024

Hi there, sorry for just now getting back to this. I was able to recreate this error with an updated NVIDIA driver v550.54.14. The command I run is the following:

/home/dorado-0.5.1-linux-x64/bin/dorado basecaller --device cuda:0 /home/dorado_models/[email protected] /home/nanopore/data/231208_BrdU500_PGvM_2/output.pod5 > /home/nanopore/data/231208_BrdU500_PGvM_2/calls_0.5.1.bam

I did notice that different models produce different errors, e.g. the v4.2 5mCG_5hmCG model hangs after only a few seconds. Maybe it's a model incompatibility I'm overlooking? I also forgot to mention that the data comes from an R10.4 flow cell, if that's relevant. I'm now getting back to this project, so I can test some other combinations of models and dorado versions, since I think the v4.3 models and dorado 0.5.3 weren't available when I tried earlier.

I've just continued using the 0.4.1 basecaller for now.

Edit: I'll look at the GPU utilization as well.

@nglaszik
Author

nglaszik commented Feb 29, 2024

Hi there,

Confirmed that 0.5.3 basecaller still hangs at 99% completion, producing a .bam file without an EOF, using the v4.3 hac model.

During the hang, nvidia-smi shows that the process is still running on the GPU, and taking up the same amount of memory as before the hang. However, volatile GPU-util is at 0% so it doesn't seem to be actually processing anything.

@nglaszik
Author

nglaszik commented Mar 1, 2024

Another update: dorado basecaller 0.5.3 can run on smaller pod5 datasets...

If I split the original 46 GB pod5 file into multiple pod5s, basecaller still hangs at 99% completion.

However, if I choose a subset (3 pod5s of 4000 reads each) to run basecaller on, it runs to completion.

Interestingly, success with the modified-base models also seems to depend on input file size... 5mCG_5hmCG runs successfully on a single pod5 of 4000 reads, whereas 5mC_5hmC freezes partway through. However, 5mC_5hmC runs to completion on a pod5 of 100 reads.

Sounds like a memory issue, perhaps related to optimization for other GPUs? This is running on an NVIDIA RTX 3090 with 24 GB of memory, far below the 40 or 80 GB on an A100.
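As a rough sketch, the split test described above can be reproduced with the pod5 tools; the flag and column names below are assumptions to be confirmed with pod5 view --help and pod5 filter --help, and the file names are placeholders:

pod5 view --include "read_id" output.pod5 | tail -n +2 > all_ids.txt   # list read ids, dropping the header row
split -l 4000 all_ids.txt chunk_                                       # 4000 read ids per chunk
for ids in chunk_*; do
    pod5 filter output.pod5 --ids "$ids" --output "${ids}.pod5"        # write one pod5 per chunk
done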

@nglaszik
Author

nglaszik commented Mar 6, 2024

@tijyojwad - sorry to bug you, but any insight into this? Especially the last post, where dorado runs with smaller input pod5s but not with larger ones?

@tijyojwad
Collaborator

Hi @nglaszik - this is an odd situation and I don't have an obvious solution yet. It feels like it could be related to one or more specific offending reads...

One suggestion is to fetch the read ids from the BAM of the hung run (remember to collect the read ids in the pi:Z tag as well, for split reads) and compare them against the read ids in the pod5. Whichever pod5 read id (or ids) is missing from the BAM is likely causing the issue.
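A rough way to do that comparison with samtools and the pod5 tools (file names are placeholders; the pod5 view flags should be double-checked against --help):

samtools view calls.bam | cut -f1 > bam_ids.txt                                      # QNAMEs of basecalled reads
samtools view calls.bam | grep -o 'pi:Z:[^[:space:]]*' | cut -d: -f3 >> bam_ids.txt  # parent ids of split reads
pod5 view --include "read_id" output.pod5 | tail -n +2 > pod5_ids.txt                # read ids in the input pod5
comm -23 <(sort -u pod5_ids.txt) <(sort -u bam_ids.txt)                              # ids in the pod5 but missing from the BAM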

@nglaszik
Author

nglaszik commented Mar 7, 2024

Sounds good, thank you @tijyojwad I'll try that!

@pre-mRNA

pre-mRNA commented Apr 2, 2024

I'm having the same issue with Dorado 0.5.3 while performing RNA basecalling, using the command:

dorado basecaller sup,m6A_DRACH ./pod5/ --estimate-poly-a > ./unmapped.bam

My basecalling hangs at 100%. Similarly, the command works fine for individual pod5s, and the volatile GPU usage is at 0% during the hang.

In my case, killing dorado produces a truncated BAM file, but it seems that all the reads were basecalled.

@tijyojwad
Collaborator

thanks for reporting @pre-mRNA - what is the size of your combined dataset?

@pre-mRNA

pre-mRNA commented Apr 2, 2024

2.98M RNA004 reads, across ~700 POD5 files

@HalfPhoton
Collaborator

Does this issue persist in dorado-0.8.0?

Kind regards,
Rich

@pre-mRNA

pre-mRNA commented Sep 17, 2024 via email

@HalfPhoton
Collaborator

@pre-mRNA - Ok, I'll keep this issue open as the underlying problem isn't resolved.

> For some reason, I also need to specify the model directory manually with the v0.8 update.

Models should be downloaded into the cwd and cleaned up if the models-directory isn't set. Are you having trouble?

@pre-mRNA

> Models should be downloaded into the cwd and cleaned up if the models-directory isn't set. Are you having trouble?

My GPU nodes aren't internet connected, so I need to specify the model directory manually, even when the models are already present in cwd.

I think Dorado 0.7 automatically searched cwd for models, whereas now the path needs to be set.

Not a big deal.
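For what it's worth, a sketch of that offline workflow (the directory path is a placeholder, and the model folders need to be copied there beforehand):

dorado basecaller sup,m6A_DRACH ./pod5/ --estimate-poly-a --models-directory /path/to/dorado_models > ./unmapped.bam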
