Q: v0.9.0: 'std::invalid_argument' (core dump) #1186

Open
sklages opened this issue Dec 18, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@sklages

sklages commented Dec 18, 2024

I just wanted to run v0.9.0 (0.9.0+9dc15a85) and ran into an "invalid argument" error whose cause I cannot spot in my command .. :-)

dorado basecaller \
  sup,5mCG_5hmCG \
  /path/to/pod5data \
  --device cuda:all \
  --batchsize 0 \
  --trim all \
  --kit-name SQK-NBD114-96 \
  --barcode-both-ends \
  --sample-sheet ../samplesheet.csv \
  > file.bam

error:

[2024-12-18 16:19:02.589] [info] 
Running: "basecaller" "sup,5mCG_5hmCG" "/path/to/pod5data" 
"--device" "cuda:all" "--batchsize" "0" "--trim" "all" 
"--kit-name" "SQK-NBD114-96" "--barcode-both-ends" 
"--sample-sheet" "../samplesheet.csv"

terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi

./call: line 3:  6319 Aborted (core dumped) 
/path/to/bin/dorado basecaller sup,5mCG_5hmCG /path/to/pod5data 
--device cuda:all --batchsize 0 --trim all --kit-name SQK-NBD114-96 
--barcode-both-ends --sample-sheet ../samplesheet.csv > file.bam

The same command works with v0.8.3, so something has obviously changed.
Can you point me to what I am missing here?

@malton-ont
Collaborator

malton-ont commented Dec 18, 2024

Hi @sklages,

Do you have CUDA_VISIBLE_DEVICES set? If so, ensure that this is simply a comma-delimited set of integers - e.g. export CUDA_VISIBLE_DEVICES=0,1

@sklages
Author

sklages commented Dec 19, 2024

@malton-ont

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-c02c4ecb-2dbd-0567-a809-dcc00241a59e)

CUDA_VISIBLE_DEVICES is set using the UUID of the GPU: export CUDA_VISIBLE_DEVICES=GPU-c02c4ecb-2dbd-0567-a809-dcc00241a59e

@malton-ont
Collaborator

@sklages,

This doesn't appear to be a setting we account for - please try setting this to the integer cuda id of the device instead while we investigate fixing this.

@sklages
Author

sklages commented Dec 19, 2024

@malton-ont - CUDA_VISIBLE_DEVICES is controlled here by the cluster manager; I can override this on single-GPU/unpartitioned GPU nodes only, knowing that there is only one GPU (partition) available.

But is this something that changed in v0.9.0? Using the UUID works in all other versions.

From the error message I would expect that I have used a wrong parameter ..

NVIDIA supports both integer indices and UUIDs in CUDA_VISIBLE_DEVICES: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
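To illustrate the two forms, here is a minimal C++ sketch that splits a CUDA_VISIBLE_DEVICES value on commas and labels each entry as an integer index or a UUID. The names `DeviceIdKind`, `classify_device_id` and `split_visible_devices` are hypothetical and are not taken from dorado. The `GPU-`/`MIG-` prefix check is an assumption based on the UUID format that nvidia-smi prints, and the sketch does not trim whitespace or validate the UUID body.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical classification of one CUDA_VISIBLE_DEVICES entry.
enum class DeviceIdKind { Index, Uuid, Unknown };

// Label an entry as an integer index, a UUID-style identifier, or unknown.
DeviceIdKind classify_device_id(const std::string& entry) {
    if (entry.empty()) {
        return DeviceIdKind::Unknown;
    }
    bool all_digits = true;
    for (unsigned char c : entry) {
        if (!std::isdigit(c)) {
            all_digits = false;
            break;
        }
    }
    if (all_digits) {
        return DeviceIdKind::Index;
    }
    // Assumption: UUID entries start with "GPU-" or "MIG-", as shown by nvidia-smi.
    if (entry.rfind("GPU-", 0) == 0 || entry.rfind("MIG-", 0) == 0) {
        return DeviceIdKind::Uuid;
    }
    return DeviceIdKind::Unknown;
}

// Split the raw environment value on commas; no whitespace trimming is done.
std::vector<std::string> split_visible_devices(const std::string& value) {
    std::vector<std::string> entries;
    std::stringstream stream(value);
    std::string entry;
    while (std::getline(stream, entry, ',')) {
        entries.push_back(entry);
    }
    return entries;
}
```

With this sketch, `0,1` yields two `Index` entries, while the value from this report, `GPU-c02c4ecb-2dbd-0567-a809-dcc00241a59e`, yields a single `Uuid` entry.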

@malton-ont
Collaborator

malton-ont commented Dec 19, 2024

@sklages,

Yes, some of the internal gpu monitoring code changed to use NVML, which doesn't respect CUDA_VISIBLE_DEVICES, so we had to put in some code to deal with it ourselves and it looks like we missed this use case.

std::invalid_argument is an internal C++ exception type - it's occurring because we attempt to parse what we expect to be a numeric ID but we're actually being passed a UUID string, which is an invalid argument to the std::stoi function. It has nothing to do with the arguments being passed to dorado.
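The failure mode described above can be sketched in a few lines of C++. `std::stoi` throws `std::invalid_argument` when the string has no leading digits, and an uncaught exception ends in `terminate called after throwing ...` (with libstdc++ the `what()` text is typically just `stoi`, matching the log). The helper name `parse_device_index` is hypothetical and this is not dorado's actual code; it only shows one way to return an empty result instead of aborting.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse one device entry as a non-negative integer index.
// Returns std::nullopt instead of letting std::stoi's exception escape.
std::optional<int> parse_device_index(const std::string& entry) {
    try {
        std::size_t consumed = 0;
        const int value = std::stoi(entry, &consumed);
        // Reject partial parses such as "1x" and negative values.
        if (consumed != entry.size() || value < 0) {
            return std::nullopt;
        }
        return value;
    } catch (const std::invalid_argument&) {
        // No leading digits at all, e.g. a "GPU-..." UUID string.
        return std::nullopt;
    } catch (const std::out_of_range&) {
        // Digits present, but the number does not fit in an int.
        return std::nullopt;
    }
}
```

A caller could then treat an empty result as "not an index" and fall back to a UUID lookup or a clearer error message, rather than crashing.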

@sklages
Author

sklages commented Dec 19, 2024

@malton-ont - thanks for the explanation!

So maybe I can leave this as a feature request / improvement: support for GPU/MIG UUIDs in CUDA_VISIBLE_DEVICES for dorado basecaller :-)

@malton-ont
Collaborator

Other developers appear to have had a similar issue, and I found this workaround:
microsoft/DeepSpeed#5278 (comment)

malton-ont added the bug label Dec 19, 2024