
Running a job for a long time without output (kunpeng920 CPU) #286

Open
zhoujingyu13687306871 opened this issue Jul 6, 2023 · 15 comments
Labels: bug (Something isn't working)

Comments

zhoujingyu13687306871 commented Jul 6, 2023

Dear author,
I submitted the job to run on a single node of the cluster, but after a long time there is no output. The node's CPU is aarch64 (Kunpeng 920) and the GPU is an A100 40 GB PCIe. The CPU information and the job script are as follows:

[scx6299@paraai-n32-h-01-agent-1 dorado-test]$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  64-bit
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    4
Vendor ID:                       HiSilicon
Model:                           0
Model name:                      Kunpeng-920
Stepping:                        0x1
BogoMIPS:                        200.00
L1d cache:                       8 MiB
L1i cache:                       8 MiB
L2 cache:                        64 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0-31
NUMA node1 CPU(s):               32-63
NUMA node2 CPU(s):               64-95
NUMA node3 CPU(s):               96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs

#!/bin/bash
#SBATCH -J dorado-test
#SBATCH -N 1
#SBATCH --gpus=2
#SBATCH -n 64
module purge
module load compilers/cuda/11.7 compilers/gcc/11.3.0 anaconda/2021.11 cudnn/8.4.0.27_cuda11.x
source activate pytorch-2.0
export LD_LIBRARY_PATH=/home/bingxing2/home/scx6299/software/nccl-2.17.1-1/build/lib:$LD_LIBRARY_PATH
export CPATH=/home/bingxing2/home/scx6299/software/nccl-2.17.1-1/build/include:$CPATH
export LD_LIBRARY_PATH=/home/bingxing2/home/scx6299/software/hdf5-serial/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/home/bingxing2/home/scx6299/software/hdf5-serial/lib:$LIBRARY_PATH
export PATH=/home/bingxing2/home/scx6299/software/hdf5-serial/bin:$PATH
export CPATH=/home/bingxing2/home/scx6299/software/hdf5-serial/include:$CPATH
export PATH=/home/bingxing2/home/scx6299/software/dorado-install/bin:$PATH
export LD_LIBRARY_PATH=/home/bingxing2/home/scx6299/software/dorado-install/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/home/bingxing2/home/scx6299/software/dorado-install/lib:$LIBRARY_PATH

cp -r pod5_pass/PAQ21605_pass__ce971a82_ad9362d2_559.pod5 /dev/shm
ls /dev/shm

dorado basecaller --device cuda:0,1 /home/bingxing2/home/scx6299/dorado-test/model/[email protected] /dev/shm/ --modified-bases 5mCG_5hmCG --verbose > 20230706/pass.bam

After running for an hour, there is only debug output and no real results, as shown below: the debug output is on the left and the GPU utilization on the right; the figure underneath shows the CPU utilization, which stays in the S (sleeping) state for a long time. I don't know whether this is caused by the CPU instruction set or by the system page size ("<jemalloc>: Unsupported system page size"). I hope to get your reply, thank you!

[screenshots: debug output, GPU utilization, CPU utilization]

tijyojwad (Collaborator) commented:
Which version of dorado are you using? How large is your input?

It's possible dorado is collecting some metadata from the pod5s first and that's taking a while. Is your data on an external disk? Can you try running with a smaller dataset for debugging?

zhoujingyu13687306871 (Author) commented Jul 7, 2023 via email

tijyojwad (Collaborator) commented:
Setup looks good to me.

I did some digging online about the jemalloc: Unsupported page size issue, and there are some reports of incompatibility with aarch64 processors. Not sure yet whether that's the same problem you're seeing.

Can you also try running with -x cpu? This will force basecalling on the CPU (it would be very slow), but we can check whether it makes any progress. If it doesn't, then at least it's not a CUDA issue.

zhoujingyu13687306871 (Author) commented:
Setup looks good to me.

I did a digging online about the jemalloc: Unsupported page size issue and there are some reports for incompatibility with aarch64 processors. Not sure if that's the same problem you're seeing yet.

Can you also try to run with -x cpu? This will force basecalling on CPU (would be very slow) but we can check if it's making any progress. If it doesn't, then at least it's not a CUDA issue.

Yes, I found the jemalloc: Unsupported page size issue online, so I added export MALLOC_CONF=lg_dirty_mult:-1 to my script, but it doesn't work.

I will try running with -x cpu, but the node resources are exhausted at the moment, so please wait a while.
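
For reference, the page size the kernel actually uses can be confirmed directly; on a 64K-page aarch64 kernel getconf reports 65536, whereas jemalloc builds that assume 4K pages expect 4096 (a minimal check, nothing Dorado-specific):

# Print the kernel page size in bytes; 65536 indicates 64K pages.
getconf PAGESIZE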

zhoujingyu13687306871 (Author) commented:
Dear author,
I would like to ask: would it be possible to provide a Dorado binary and source distribution built for the aarch64 64K system page size? That might completely solve the "jemalloc: Unsupported page size" issue.
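
For context, jemalloc's supported page size is fixed when jemalloc itself is configured, so a 64K-capable distribution would need that dependency built accordingly; a sketch of the upstream jemalloc configure step only (how POD5/Dorado actually vendor jemalloc may differ):

# Sketch: configure upstream jemalloc for a 64 KiB page size (2^16 bytes).
./configure --with-lg-page=16
make
make install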

zhoujingyu13687306871 (Author) commented:
Setup looks good to me.

I did some digging online about the jemalloc: Unsupported page size issue, and there are some reports of incompatibility with aarch64 processors. Not sure yet whether that's the same problem you're seeing.

Can you also try running with -x cpu? This will force basecalling on the CPU (it would be very slow), but we can check whether it makes any progress. If it doesn't, then at least it's not a CUDA issue.

I added '-x cpu' to the script. After it had run for an hour, there is still no useful output:

cat slurm-33744.out
cuda-11.7 loaded successful
gcc-11.3.0 loaded successful
<jemalloc>: Unsupported system page size
[2023-07-08 22:09:09.148] [debug] - matching modification model found: [email protected]_5mCG_5hmCG@v2
[2023-07-08 22:09:09.149] [info] > Creating basecall pipeline
[2023-07-08 22:09:09.164] [debug] - CPU calling: set batch size to 128, num_runners to 128

and there is no CPU utilization:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2520386 scx6299   20   0  231104  13824   5376 R   0.7   0.0   0:03.00 top
2490194 scx6299   20   0  215040   4416   3072 S   0.0   0.0   0:00.00 slurm_script
2503034 scx6299   20   0  290240  61952   3648 S   0.0   0.0   0:00.01 sshd
2503035 scx6299   20   0  228416  14272   5696 S   0.0   0.0   0:00.02 bash

tijyojwad (Collaborator) commented:
Dear author,
I would like to ask: would it be possible to provide a Dorado binary and source distribution built for the aarch64 64K system page size? That might completely solve the "jemalloc: Unsupported page size" issue.

Hmm, I'm not sure what this would entail, tbh. It feels more like something jemalloc would have to support rather than something we can add in dorado.

The fact that it's not making any progress on CPU either makes me suspect I/O issues. Have you tried running dorado (the same binary) in any other environment? I can suggest the following:

  1. Run dorado on a local machine instead of the cluster, with the data local as well
  2. Copy the data to /tmp in your HPC job first and then run dorado on the copied data (see the sketch below)
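
A minimal sketch of option 2, reusing the paths from the job script above (the /tmp staging directory name is only an example):

# Stage one pod5 on node-local /tmp, then basecall from there.
mkdir -p /tmp/dorado-input
cp pod5_pass/PAQ21605_pass__ce971a82_ad9362d2_559.pod5 /tmp/dorado-input/
dorado basecaller --device cuda:0,1 \
    /home/bingxing2/home/scx6299/dorado-test/model/[email protected] \
    /tmp/dorado-input/ --modified-bases 5mCG_5hmCG --verbose > 20230706/pass.bam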

zhoujingyu13687306871 (Author) commented Jul 10, 2023 via email

vellamike (Collaborator) commented:
Hi @zhoujingyu13687306871 - are you able to compile Dorado yourself on the kunpeng920 machine, by any chance? This is not a problem we've encountered before. I suspect that during compilation the page size of your host would be detected and Dorado would be compiled to work with the appropriate (64 KB?) page size. (Side note: this may have performance implications, though I think it will be fine.)
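
For reference, a minimal sketch of a from-source build on the host itself, assuming the standard CMake workflow described in the Dorado README (the build directory name is just an example):

# Build Dorado on the Kunpeng 920 host so that any page-size detection
# happens against the 64K-page kernel.
git clone https://github.com/nanoporetech/dorado.git
cd dorado
cmake -S . -B cmake-build
cmake --build cmake-build -j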

zhoujingyu13687306871 (Author) commented:
Hi @zhoujingyu13687306871 - are you able to compile Dorado yourself on the kunpeng920 machine, by any chance? This is not a problem we've encountered before. I suspect that during compilation the page size of your host would be detected and Dorado would be compiled to work with the appropriate (64 KB?) page size. (Side note: this may have performance implications, though I think it will be fine.)

Yes, I compiled dorado on the kunpeng920 machine, whose system page size is 64K.

vellamike (Collaborator) commented:
OK - this is probably because the POD5 dependency is not compiled to use a 64 KB page size. We are investigating a solution.
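
For what it's worth, one way to check which binary embeds jemalloc is to search it for jemalloc's own error strings (a diagnostic sketch only; the install path is the one from the job script above):

# If this prints the page-size message, jemalloc is baked into the dorado
# binary itself; otherwise check the shared libraries reported by ldd.
strings /home/bingxing2/home/scx6299/software/dorado-install/bin/dorado | grep -i "unsupported system page size"
ldd /home/bingxing2/home/scx6299/software/dorado-install/bin/dorado | grep -i jemalloc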

vellamike added the "bug" label on Jul 12, 2023
zhoujingyu13687306871 (Author) commented Jul 14, 2023 via email

zhoujingyu13687306871 (Author) commented Feb 28, 2024

OK - this is probably because the POD5 dependency is not compiled to use a 64 KB page size. We are investigating a solution.

@vellamike
Hi, I would like to ask: of the versions released in the past half year, which Dorado version has fixed this bug?

tijyojwad (Collaborator) commented:
Hi @zhoujingyu13687306871 - we haven't looked at fixing this yet

vellamike changed the title from "Running a job for a long time without output" to "Running a job for a long time without output (kunpeng920 CPU)" on May 3, 2024
malton-ont (Collaborator) commented:
See #637 for updates
