Getting negative throughput value for large batch sizes #128

Closed
mgiessing opened this issue Feb 19, 2024 · 7 comments

@mgiessing

Hi, I'm using optimum-benchmark with the onnxruntime backend (CPU), but for larger batch sizes I get negative throughput values, which doesn't seem correct. Is this a known bug?

This is the param setup:

[...]
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 1,2,4,8,16,32,64
      model: /data/LLMs/onnx/llama-2-7b_onnx,/data/LLMs/onnx/llama-2-7b_onnx_quant

E.g. for the llama-2 ONNX models with onnxruntime (there are always two runs per batch size: one for fp32, the other for int8):

for i in {1..13}; do
    jq '.batch_size, .decode.throughput."mean[tokens/s]"' "experiments/llama-2-7b_ort/$i/benchmark_report.json"
done

1
10.64650340356247
2
12.258327597687689
2
22.51221913908261
4
26.243146016015586
4
46.47543946736317
8
56.99298854476616
8
102.96338019358438
16
156.63038758907555
16
284.71610900884076
32
668.620966745494
32
2320.323920074327
64
-744.356127863481
64
-779.9233473188091

As you can see, for bs=64 the numbers are negative. Is this something you can confirm/reproduce with onnxruntime?

Thanks!
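For easier pairing, here is a small Python sketch that reads the same report files and prints batch size and decode throughput on one line (the run-directory layout and JSON keys are taken from the jq command above; adjust the range and paths to your own sweep):

import json
from pathlib import Path

# Print batch size and decode throughput side by side for each run directory.
for i in range(1, 14):
    report_path = Path(f"experiments/llama-2-7b_ort/{i}/benchmark_report.json")
    report = json.loads(report_path.read_text())
    print(report["batch_size"], report["decode"]["throughput"]["mean[tokens/s]"])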

@IlyasMoutawwakil
Member

Hi! I can't reproduce it. Can you please share more of the experiment config, the benchmark report, and the CLI log?
Is this using the main branch? Can you try running with 0.0.1, to see if it's caused by one of my new PRs?

@mgiessing
Author

I was using the main branch. My problem with version 0.0.1 is that it doesn't seem to recognize some keys in the backend section (such as export: false, which I use because I'm running pre-converted models, since I don't have a lot of space in my root partition).

I'll try to update the benchmark config so it runs with 0.0.1.

If you want to replicate my setup, I did the following (let's use TinyLlama, as it is much smaller but shows the same negative-throughput issue):

Use Hugging Face Optimum to export to ONNX:

optimum-cli export onnx --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 /data/LLMs/onnx/tinyllama_onnx/

Then use this benchmark config (batch sizes changed to only the high values, and the model to TinyLlama):

defaults:
  - backend: onnxruntime # default backend
  - launcher: process # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: tinyllama_ort
model: tinyllama-model

backend:
  device: cpu
  export: false
  task: text-generation
  library: transformers

benchmark:
  memory: true
  warmup_runs: 10
  input_shapes:
    batch_size: 1
    sequence_length: 256
  new_tokens: 512

# hydra/cli specific settings
hydra:
  run:
    # where to store run results
    dir: experiments/${experiment_name}
  sweep:
    # where to store sweep results
    dir: experiments/${experiment_name}
  job:
    # change working directory to the run directory
    chdir: true
    env_set:
      # set environment variable OVERRIDE_BENCHMARKS to 1
      # to not skip benchmarks that have been run before
      OVERRIDE_BENCHMARKS: 1
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 64,128
      model: /data/LLMs/onnx/tinyllama_onnx

Run with:

optimum-benchmark --config-dir benchmark_configs/ --config-name tinyllama_ort --multirun

@IlyasMoutawwakil
Member

@mgiessing thanks, I was able to reproduce it with smaller values for sequence length and new tokens.
It seems that, for some reason, generate takes less time than forward, which doesn't make sense:

[PROC-0][2024-02-19 17:25:34,014][latency][INFO] -              + forward latency: 6.503192 ± 2 x 0.068859 (s)
[PROC-0][2024-02-19 17:25:44,944][latency][INFO] -              + generate latency: 1.214331 ± 2 x 0.003735 (s)

The decoding latency is computed from these two measurements (prefill = forward, decode = generate - forward). I'll investigate what's happening here.
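For context, a minimal sketch of how a negative decode throughput falls out of these measurements; the decoded-token count used here is an illustrative assumption, not optimum-benchmark's exact accounting:

# Sketch: decode throughput turns negative whenever the measured
# generate latency is smaller than the forward (prefill) latency.

def decode_throughput(forward_latency_s, generate_latency_s, batch_size, new_tokens):
    """Tokens/s for the decode phase, with decode = generate - forward."""
    decode_latency_s = generate_latency_s - forward_latency_s  # negative here
    decoded_tokens = batch_size * new_tokens  # illustrative token count
    return decoded_tokens / decode_latency_s

# With the latencies from the log above (6.50 s forward vs 1.21 s generate),
# the decode latency is negative, so the reported throughput is too.
print(decode_throughput(6.503192, 1.214331, batch_size=64, new_tokens=512))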

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Feb 19, 2024

Okay, found it. I see why it's only happening with onnxruntime: I was filtering some inputs to avoid passing position_ids to models that don't require them (causal LMs are a mess), but other inputs got filtered accidentally, and since generate can be run with empty inputs it didn't raise any errors (it just decodes starting from a BOS token).
Thanks a lot for spotting this! I'll fix it ASAP and add some sanity checks to keep this from ever happening again.
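To illustrate the kind of filtering involved, a hypothetical sketch (not optimum-benchmark's actual code; the model path and input values are placeholders):

import onnxruntime as ort

# Load the exported decoder and collect the input names it actually declares.
session = ort.InferenceSession("decoder_model.onnx")  # placeholder path
expected_names = {node.name for node in session.get_inputs()}

# Inputs generated for the benchmark (tensor values elided for brevity).
generated_inputs = {"input_ids": ..., "attention_mask": ..., "position_ids": ...}

filtered_inputs = {}
for name, value in generated_inputs.items():
    if name in expected_names:
        filtered_inputs[name] = value
    else:
        # The bug was the silent version of this branch: extra inputs were
        # dropped without warning, and generate happily ran on (nearly) empty
        # inputs, decoding from just the BOS token.
        print(f"Input {name} is not in expected inputs names. Removing it.")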

@mgiessing
Author

Ah, amazing! Thanks for the quick investigation and for finding it, that's really appreciated :)

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Feb 20, 2024

@mgiessing fixed in #129 🤗
I also added some logging to notify users when an input is generated but then removed for a specific backend:

[08:35:31][test-cli][INFO] - [PROC-0][2024-02-20 08:35:31,920][onnxruntime][WARNING] - Input position_ids is not in expected inputs names. Removing it.

@mgiessing
Author

Thanks a lot @IlyasMoutawwakil - much appreciated & everything works fine now :)
