
Dorado basecaller hangs at 99% completion #575

Open
nglaszik opened this issue Jan 10, 2024 · 15 comments

Comments

@nglaszik

nglaszik commented Jan 10, 2024

As the title says, running dorado basecaller (0.5.1) on pod5 files (single or multiple) hangs at 99% completion. The time shown on the progress bar stops advancing, no errors are output, and running in verbose mode gives no additional information. The process can be terminated, and examining the resulting BAM file with dorado summary & nanopolish produces normal-looking results. The BAM file doesn't have an EOF marker though, so it's unclear whether more data still needed to be written. The input pod5s were converted with "pod5 convert fast5" from fast5s created by MinKNOW.
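A minimal sketch of that conversion step (paths are placeholders; the exact options can be confirmed with pod5 convert fast5 --help):

pod5 convert fast5 ./fast5_dir/*.fast5 --output output.pod5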

It seems to only happen for particular pod5 files. Some files get to 100% completion. Hanging is perhaps more likely for larger files.

System: Ubuntu 20.04, NVIDIA RTX3090, Driver 470.223.02, CUDA 11.4

Edit: dorado 0.4.1 basecaller doesn't have this issue.

@tijyojwad
Collaborator

Hi @nglaszik - can you post the command you're running? Are you able to check your GPU utilization (using nvidia-smi or nvtop) when the run is stuck at 99%? Is it showing anything >0%?
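For example, a one-liner that polls utilization and memory once a second (standard nvidia-smi query flags):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1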

@HalfPhoton
Collaborator

@nglaszik - are there any updates on this issue?

@nglaszik
Author

nglaszik commented Feb 28, 2024

Hi there, sorry for just now getting back to this. I was able to recreate this error with an updated NVIDIA driver v550.54.14. The command I run is the following:

/home/dorado-0.5.1-linux-x64/bin/dorado basecaller --device cuda:0 /home/dorado_models/[email protected] /home/nanopore/data/231208_BrdU500_PGvM_2/output.pod5 > /home/nanopore/data/231208_BrdU500_PGvM_2/calls_0.5.1.bam

I did notice that different models produce different errors, e.g. the v4.2 5mCG_5hmCG model hangs after only a few seconds. Maybe it's a model incompatibility I'm overlooking? I also forgot to mention that the data comes from an R10.4 flow cell, if that's relevant. I'm now getting back to this project, so I can test some other combinations of models and dorado versions, since I think the v4.3 models and dorado 0.5.3 weren't available when I tried earlier.

I've just continued using the 0.4.1 basecaller for now.

Edit: I'll look at the GPU utilization as well.

@nglaszik
Author

nglaszik commented Feb 29, 2024

Hi there,

Confirmed that 0.5.3 basecaller still hangs at 99% completion, producing a .bam file without an EOF, using the v4.3 hac model.

During the hang, nvidia-smi shows that the process is still running on the GPU, and taking up the same amount of memory as before the hang. However, volatile GPU-util is at 0% so it doesn't seem to be actually processing anything.

@nglaszik
Author

nglaszik commented Mar 1, 2024

Another update: dorado basecaller 0.5.3 can run on smaller pod5 datasets...

If I split the original 46 GB pod5 file into multiple pod5s, basecaller still hangs at 99% completion.

However, if I choose a subset (3 pod5s of 4000 reads each) to run basecaller on, it runs to completion.

Interestingly, success with the modified-base models also seems to depend on input file size... 5mCG_5hmCG runs successfully on a single pod5 of 4000 reads, whereas 5mC_5hmC freezes partway through. However, 5mC_5hmC runs to completion on a pod5 of 100 reads.

Sounds like a memory issue, perhaps related to optimization for other GPUs? This is running on an NVIDIA RTX 3090 with 24 GB of memory, far below the 40 or 80 GB on an A100.
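As a rough sketch, the split test described above can be reproduced with the pod5 tools; the flag and column names below are assumptions to be confirmed with pod5 view --help and pod5 filter --help, and the file names are placeholders:

pod5 view --include "read_id" output.pod5 | tail -n +2 > all_ids.txt   # list read ids, dropping the header row
split -l 4000 all_ids.txt chunk_                                       # 4000 read ids per chunk
for ids in chunk_*; do
    pod5 filter output.pod5 --ids "$ids" --output "${ids}.pod5"        # write one pod5 per chunk
done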

@nglaszik
Author

nglaszik commented Mar 6, 2024

@tijyojwad - sorry to bug you, but any insight into this? Especially the last post, where dorado runs with smaller input pod5s but not with larger ones?

@tijyojwad
Collaborator

Hi @nglaszik - this is an odd situation and I don't have an obvious solution yet. It feels like it could be related to one or more specific offending reads...

One suggestion is to fetch the read ids from the BAM of the hung run (remember to collect the read ids in the pi:Z tag as well, for split reads) and compare them against the read ids in the pod5. Whichever pod5 read id (or ids) is missing from the BAM is likely causing the issue.
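A rough way to do that comparison with samtools and the pod5 tools (file names are placeholders; the pod5 view flags should be double-checked against --help):

samtools view calls.bam | cut -f1 > bam_ids.txt                                      # QNAMEs of basecalled reads
samtools view calls.bam | grep -o 'pi:Z:[^[:space:]]*' | cut -d: -f3 >> bam_ids.txt  # parent ids of split reads
pod5 view --include "read_id" output.pod5 | tail -n +2 > pod5_ids.txt                # read ids in the input pod5
comm -23 <(sort -u pod5_ids.txt) <(sort -u bam_ids.txt)                              # ids in the pod5 but missing from the BAM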

@nglaszik
Author

nglaszik commented Mar 7, 2024

Sounds good, thank you @tijyojwad I'll try that!

@pre-mRNA

pre-mRNA commented Apr 2, 2024

I'm having the same issue with Dorado 0.5.3 while performing RNA basecalling, using the command:

dorado basecaller sup,m6A_DRACH ./pod5/ --estimate-poly-a > ./unmapped.bam

My basecalling hangs at 100%. Similarly, the command works fine for individual pod5s, and the volatile GPU usage is at 0% during the hang.

In my case, killing dorado produces a truncated BAM file, but it seems that all the reads were basecalled.

@tijyojwad
Collaborator

thanks for reporting @pre-mRNA - what is the size of your combined dataset?

@pre-mRNA

pre-mRNA commented Apr 2, 2024

2.98M RNA004 reads, across ~700 POD5 files

@HalfPhoton
Collaborator

Does this issue persist in dorado-0.8.0?

Kind regards,
Rich

@pre-mRNA

pre-mRNA commented Sep 17, 2024 via email

@HalfPhoton
Collaborator

@pre-mRNA - Ok, I'll keep this issue open as the underlying problem isn't resolved.

> For some reason, I also need to specify the model directory manually with the v0.8 update.

Models should be downloaded into the cwd and cleaned up if the models-directory isn't set. Are you having trouble?

@pre-mRNA

> Models should be downloaded into the cwd and cleaned up if the models-directory isn't set. Are you having trouble?

My GPU nodes aren't internet connected, so I need to specify the model directory manually, even when the models are already present in cwd.

I think Dorado 0.7 automatically searched cwd for models, whereas now the path needs to be set.

Not a big deal.
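For what it's worth, a sketch of that offline workflow (the directory path is a placeholder, and the model folders need to be copied there beforehand):

dorado basecaller sup,m6A_DRACH ./pod5/ --estimate-poly-a --models-directory /path/to/dorado_models > ./unmapped.bam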
