CPU utilization lower than expected #1171

Open
karlkashofer opened this issue Dec 10, 2024 · 4 comments
Labels
performance Issues related to basecalling performance

Comments

@karlkashofer

We have several cluster servers with 96 CPUs and we would like to do dorado basecalling on them.

When we put dorado in a Debian docker container and call it from within, it seems to detect the number of CPUs wrongly, as it runs on only a single CPU. nproc inside the container gives the correct number of CPUs, so I am at a loss as to what could be wrong.

Running the same dorado binary outside docker correctly utilizes all CPUs.
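
For anyone reproducing: a quick check inside the container is to compare the coreutils view of the CPU count with glibc's view (the latter is presumably what dorado consults):

nproc
getconf _NPROCESSORS_ONLN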

Steps to reproduce the issue:

Run dorado in a Debian docker image.

Our docker command (the basecaller script just runs dorado and then processes the resulting uBAMs):
docker run -d -v $PWD:/analysis ontsplit bash -c "cd /analysis; /basecaller.sh FBA23844_efef091c_d25addcf_1.pod5 '341, 342, 343, 344, 345, 346, 347, 348, 349, 350'"

Run environment:

  • Dorado version: dorado-0.8.2-linux-x64
  • Dorado command: /dorado-0.8.2-linux-x64/bin/dorado basecaller --device cpu hac FBA23844_efef091c_d25addcf_1.pod5
  • Operating system: Debian bookworm
  • Hardware (CPUs, Memory, GPUs): 32 CPUs, 125 GB RAM, no GPUs

Logs

[2024-12-10 22:46:50.091] [info] Running: "basecaller" "-v" "--device" "cpu" "hac" "FBA23844_efef091c_d25addcf_1.pod5"
[2024-12-10 22:46:50.114] [info] - downloading [email protected] with httplib
[2024-12-10 22:46:51.260] [info] Normalised: chunksize 10000 -> 9996
[2024-12-10 22:46:51.260] [info] Normalised: overlap 500 -> 498
[2024-12-10 22:46:51.260] [info] > Creating basecall pipeline
[2024-12-10 22:46:51.260] [debug] CRFModelConfig { qscale:1.050000 qbias:-0.600000 stride:6 bias:0 clamp:1 out_features:-1 state_len:4 outsize:1024 blank_score:2.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:9996 overlap:498 batch_size:128} convs: { 0: ConvParams { insize:1 size:16 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:16 size:16 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:16 size:384 winlen:19 stride:6 activation:tanh}} model_type: lstm { bias:0 outsize:1024 blank_score:2.000000 scale:1.000000}}
[2024-12-10 22:46:51.262] [debug] - CPU calling: set num_cpu_runners to 1
[2024-12-10 22:46:51.454] [debug] BasecallerNode chunk size 9996
[2024-12-10 22:46:51.469] [debug] Load reads from file FBA23844_efef091c_d25addcf_1.pod5

@malton-ont
Collaborator

Hi @karlkashofer,

Dorado does not simply create a runner per core - it estimates how much RAM is available and creates only as many runners as should fit without running out of memory. For the HAC model with a batch size of 128, this works out to roughly 1 runner per 4.5GB of available memory.
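
As a rough sketch of that arithmetic (the 4.5GB-per-runner figure is the estimate above; 100 is just an example value):

awk -v gb=100 'BEGIN { print int(gb / 4.5) }'   # 100GB available -> 22 runners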

However, it looks like dorado only checks for "free" RAM and does not allow itself access to any buffer/cache memory (at least on Linux), even though that memory could be made available. We'll investigate further and look into improving this.
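
You can see the gap on any Linux box by comparing the kernel's counters directly (MemFree is essentially what sysinfo() reports as freeram; MemAvailable additionally counts reclaimable buffer/cache):

grep -E 'MemFree|MemAvailable' /proc/meminfo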

malton-ont added the performance label Dec 11, 2024
@karlkashofer
Author

karlkashofer commented Dec 11, 2024

Thanks for looking into this !
I don't really understand why "free RAM" would be different inside the container. As I said, the same command uses many more CPUs on the same machine when I start it outside docker.

I already use the LD_PRELOAD hack to limit the CPUs used, so we can deploy this safely on the gridengine cluster. Do you think there is a similar override for the RAM? #567 (comment)

I'd be happy to test any suggestions.

@malton-ont
Collaborator

Hi @karlkashofer,

If I run free -h in my terminal, I see:

free -h
              total        used        free      shared  buff/cache   available
Mem:          377Gi        20Gi        20Gi        74Mi       336Gi       354Gi
Swap:         8.0Gi       526Mi       7.5Gi

But running dorado basecaller hac -x cpu -v ... produces:

[2024-12-11 11:24:51.512] [info] Running: "basecaller" "hac" "-x" "cpu" "-v" "./pod5"
...
[2024-12-11 11:24:51.954] [info] Normalised: chunksize 10000 -> 9996
[2024-12-11 11:24:51.954] [info] Normalised: overlap 500 -> 498
[2024-12-11 11:24:51.954] [info] > Creating basecall pipeline
...
[2024-12-11 11:24:51.973] [debug] - CPU calling: set num_cpu_runners to 4

Given the numbers above (20Gi in the free column at ~4.5GB per runner gives exactly the 4 runners in the log), this suggests we're only counting memory in the free column and not the buff/cache column, which could be made available.

It does appear to be possible to hack this in a similar way to the get_nprocs override:

cat > ram_override.cpp << "EOF"
#include <sys/sysinfo.h>
// Pretend a fixed amount of RAM is free; dorado divides this by its
// per-runner estimate (~4.5GB for HAC at batch size 128) to pick a runner count.
extern "C"
int sysinfo(struct sysinfo * __info) {
  __info->freeram = 100 * 1024 * 1024;  // counted in units of mem_unit
  __info->mem_unit = 1024;              // 100 * 1024 * 1024 KiB = 100GiB "free"
  return 0;
}
EOF
gcc -shared -fPIC ram_override.cpp -o ram_override.so
export LD_PRELOAD=${PWD}/ram_override.so

This generated 22 runners for me (100 / 4.5 ≈ 22). Change 100 to a sensible value for your system. Your mileage may vary - this is at your own risk.
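
You can confirm the override took effect from the debug log (reads.pod5 is a placeholder for your input):

dorado basecaller --device cpu -v hac reads.pod5 2>&1 | grep num_cpu_runners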

@karlkashofer
Author

Hi @malton-ont !

This is very hacky but works like a charm!
I added a "threads" parameter to my script and dynamically recompile both cpu_limiter.so and ram_override.so using that value, and now I am able to control resources in as fine-grained a way as I need for running on the cluster.
This is an example of running in docker using threads=12:
[screenshot omitted]

For reference, this is the relevant portion of my script:

# build limiters
origpath=$PWD
cd /dorado-0.8.2-linux-x64/lib2

# report $1 * 4.5GiB as "free" RAM so dorado creates ~$1 runners (~4.5GB each)
cat > ram_override.cpp << EOF
#include <sys/sysinfo.h>
extern "C"
int sysinfo(struct sysinfo * __info) {
  __info->freeram = $1 * 4.5 * 1024 * 1024;
  __info->mem_unit = 1024;
  return 0;
}
EOF
gcc -shared -fPIC ram_override.cpp -o ram_override.so

# report $1 CPUs so dorado sizes its thread pools to match
cat > cpu_limiter.cpp << EOF
#include <sys/sysinfo.h>
extern "C"
int get_nprocs() { return $1; }
EOF
gcc -shared -fPIC cpu_limiter.cpp -o cpu_limiter.so

export LD_PRELOAD=${PWD}/ram_override.so:${PWD}/cpu_limiter.so
cd $origpath
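
With the limiters preloaded, the dorado invocation later in the script needs no changes; a minimal sketch (input and output names are placeholders):

/dorado-0.8.2-linux-x64/bin/dorado basecaller --device cpu hac reads.pod5 > reads.ubam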

malton-ont changed the title CPU utilization wrong in docker container → CPU utilization lower than expected Dec 18, 2024