CPU utilization lower than expected #1171

Open
karlkashofer opened this issue Dec 10, 2024 · 4 comments
Labels
performance Issues related to basecalling performance

Comments

@karlkashofer

We have several cluster servers with 96 CPUs and we would like to do dorado basecalling on them.

When we put dorado in a Debian docker container and call it from within, it seems to detect the number of CPUs wrongly, as it runs on only a single CPU. nproc inside the container gives the correct number of CPUs, so I am at a loss as to what could be wrong.

Running the same dorado binary outside docker correctly utilizes all CPUs.
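
For anyone reproducing: a quick check inside the container is to compare the coreutils view of the CPU count with glibc's view (the latter is presumably what dorado consults):

nproc
getconf _NPROCESSORS_ONLN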

Steps to reproduce the issue:

Run dorado in a Debian docker image.

Our docker command (the basecaller script just runs dorado and then processes the resulting uBAMs):
docker run -d -v $PWD:/analysis ontsplit bash -c "cd /analysis; /basecaller.sh FBA23844_efef091c_d25addcf_1.pod5 '341, 342, 343, 344, 345, 346, 347, 348, 349, 350'"

Run environment:

  • Dorado version: dorado-0.8.2-linux-x64
  • Dorado command: /dorado-0.8.2-linux-x64/bin/dorado basecaller --device cpu hac FBA23844_efef091c_d25addcf_1.pod5
  • Operating system: Debian bookworm
  • Hardware (CPUs, Memory, GPUs): 32 CPUs, 125 GB RAM, no GPUs

Logs

[2024-12-10 22:46:50.091] [info] Running: "basecaller" "-v" "--device" "cpu" "hac" "FBA23844_efef091c_d25addcf_1.pod5"
[2024-12-10 22:46:50.114] [info] - downloading [email protected] with httplib
[2024-12-10 22:46:51.260] [info] Normalised: chunksize 10000 -> 9996
[2024-12-10 22:46:51.260] [info] Normalised: overlap 500 -> 498
[2024-12-10 22:46:51.260] [info] > Creating basecall pipeline
[2024-12-10 22:46:51.260] [debug] CRFModelConfig { qscale:1.050000 qbias:-0.600000 stride:6 bias:0 clamp:1 out_features:-1 state_len:4 outsize:1024 blank_score:2.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:9996 overlap:498 batch_size:128} convs: { 0: ConvParams { insize:1 size:16 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:16 size:16 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:16 size:384 winlen:19 stride:6 activation:tanh}} model_type: lstm { bias:0 outsize:1024 blank_score:2.000000 scale:1.000000}}
[2024-12-10 22:46:51.262] [debug] - CPU calling: set num_cpu_runners to 1
[2024-12-10 22:46:51.454] [debug] BasecallerNode chunk size 9996
[2024-12-10 22:46:51.469] [debug] Load reads from file FBA23844_efef091c_d25addcf_1.pod5

@malton-ont
Collaborator

Hi @karlkashofer,

Dorado does not simply create a runner per core - it estimates how much RAM is available and creates only as many runners as should fit without running out of memory. For the HAC model with a batch size of 128, this works out to roughly 1 runner per 4.5GB of available memory.
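
As a rough sketch of that arithmetic (the 4.5GB-per-runner figure is the estimate above; 100 is just an example value):

awk -v gb=100 'BEGIN { print int(gb / 4.5) }'   # 100GB available -> 22 runners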

However, it looks like dorado only checks for "free" RAM and does not allow itself access to any buffer/cache memory (at least on Linux), even though that memory could be made available. We'll investigate further and look into improving this.
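
You can see the gap on any Linux box by comparing the kernel's counters directly (MemFree is essentially what sysinfo() reports as freeram; MemAvailable additionally counts reclaimable buffer/cache):

grep -E 'MemFree|MemAvailable' /proc/meminfo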

malton-ont added the performance label Dec 11, 2024
@karlkashofer
Author

karlkashofer commented Dec 11, 2024

Thanks for looking into this !
I don't really understand why "free RAM" would be different inside the container. As I said, the same command uses many more CPUs on the same machine when I start it outside docker.

I already use the LD_PRELOAD hack to limit the CPUs used, so we can deploy this safely on the gridengine cluster. Do you think there is a similar override for the RAM? #567 (comment)

I'd be happy to test any suggestions.

@malton-ont
Collaborator

Hi @karlkashofer,

If I run free -h in my terminal, I see:

free -h
              total        used        free      shared  buff/cache   available
Mem:          377Gi        20Gi        20Gi        74Mi       336Gi       354Gi
Swap:         8.0Gi       526Mi       7.5Gi

But running dorado basecaller hac -x cpu -v ... produces:

[2024-12-11 11:24:51.512] [info] Running: "basecaller" "hac" "-x" "cpu" "-v" "./pod5"
...
[2024-12-11 11:24:51.954] [info] Normalised: chunksize 10000 -> 9996
[2024-12-11 11:24:51.954] [info] Normalised: overlap 500 -> 498
[2024-12-11 11:24:51.954] [info] > Creating basecall pipeline
...
[2024-12-11 11:24:51.973] [debug] - CPU calling: set num_cpu_runners to 4

Given the numbers above (20Gi in the free column at ~4.5GB per runner gives exactly the 4 runners in the log), this suggests we're only counting memory in the free column and not the buff/cache column, which could be made available.

It does appear to be possible to hack this in a similar way to the get_nprocs override:

cat > ram_override.cpp << "EOF"
#include <sys/sysinfo.h>
// Pretend a fixed amount of RAM is free; dorado divides this by its
// per-runner estimate (~4.5GB for HAC at batch size 128) to pick a runner count.
extern "C"
int sysinfo(struct sysinfo * __info) {
  __info->freeram = 100 * 1024 * 1024;  // counted in units of mem_unit
  __info->mem_unit = 1024;              // 100 * 1024 * 1024 KiB = 100GiB "free"
  return 0;
}
EOF
gcc -shared -fPIC ram_override.cpp -o ram_override.so
export LD_PRELOAD=${PWD}/ram_override.so

This generated 22 runners for me (100 / 4.5 ≈ 22). Change 100 to a sensible value for your system. Your mileage may vary - this is at your own risk.
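
You can confirm the override took effect from the debug log (reads.pod5 is a placeholder for your input):

dorado basecaller --device cpu -v hac reads.pod5 2>&1 | grep num_cpu_runners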

@karlkashofer
Author

Hi @malton-ont !

This is very hacky but works like a charm!
I added a "threads" parameter to my script and dynamically recompile both cpu_limiter.so and ram_override.so using that value, and now I am able to control resources in as fine-grained a way as I need for running on the cluster.
This is an example of running in docker using threads=12:
[screenshot omitted]

For reference, this is the relevant portion of my script:

# build limiters
origpath=$PWD
cd /dorado-0.8.2-linux-x64/lib2

# report $1 * 4.5GiB as "free" RAM so dorado creates ~$1 runners (~4.5GB each)
cat > ram_override.cpp << EOF
#include <sys/sysinfo.h>
extern "C"
int sysinfo(struct sysinfo * __info) {
  __info->freeram = $1 * 4.5 * 1024 * 1024;
  __info->mem_unit = 1024;
  return 0;
}
EOF
gcc -shared -fPIC ram_override.cpp -o ram_override.so

# report $1 CPUs so dorado sizes its thread pools to match
cat > cpu_limiter.cpp << EOF
#include <sys/sysinfo.h>
extern "C"
int get_nprocs() { return $1; }
EOF
gcc -shared -fPIC cpu_limiter.cpp -o cpu_limiter.so

export LD_PRELOAD=${PWD}/ram_override.so:${PWD}/cpu_limiter.so
cd $origpath
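
With the limiters preloaded, the dorado invocation later in the script needs no changes; a minimal sketch (input and output names are placeholders):

/dorado-0.8.2-linux-x64/bin/dorado basecaller --device cpu hac reads.pod5 > reads.ubam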

malton-ont changed the title CPU utilization wrong in docker container → CPU utilization lower than expected Dec 18, 2024