
[AMDGPU] iree-hal-hip-di SIGSEGV for llama 8b fp8 model #19809

Closed
AmosLewis opened this issue Jan 24, 2025 · 20 comments
Labels
bug 🐞 Something isn't working
@AmosLewis
Contributor

AmosLewis commented Jan 24, 2025

What happened?

Follow up of #19785

When running the Tracy-enabled iree-run-module on the llama 8b float8 model on an AMD GPU, I got a segmentation fault.

TRACY_NO_EXIT=1 \
  ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  /home/chi/src/iree-build-trace/tools/iree-run-module \
  --hip_use_streams=true \
  --module=fp8_tracy.vmfb \
  --parameters=model=fp8.irpa \
  --device=hip://4 \
  --function=prefill_bs1 \
  --input=@prefill/bf16_tokens.npy \
  --input=@prefill/bf16_seq_lens.npy \
  --input=@prefill/bf16_seq_block_ids.npy \
  --input=@prefill/bf16_cs_f16.npy
[1]    2696115 segmentation fault (core dumped)  TRACY_NO_EXIT=1 ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  --hip_use_streams=true

gdb bt:

Thread 10 "iree-hal-hip-di" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fbf96600640 (LWP 2749328)]
iree_hal_stream_tracing_zone_begin_external_impl (context=0x7fffe4400010, event_list=0x7fbf68001438, verbosity=verbosity@entry=IREE_HAL_STREAM_TRACING_VERBOSITY_COARSE, file_name=file_name@entry=0x0, file_name_length=file_name_length@entry=0, line=line@entry=0, function_name=0x55555558d55d "iree_hal_hip_stream_command_buffer", function_name_length=34, name=0x0, name_length=0) at /home/chi/src/iree/runtime/src/iree/hal/utils/stream_tracing.c:497
497       if (verbosity > context->verbosity) return;
(gdb) bt
#0  iree_hal_stream_tracing_zone_begin_external_impl (context=0x7fffe4400010, event_list=0x7fbf68001438,
    verbosity=verbosity@entry=IREE_HAL_STREAM_TRACING_VERBOSITY_COARSE, file_name=file_name@entry=0x0,
    file_name_length=file_name_length@entry=0, line=line@entry=0,
    function_name=0x55555558d55d "iree_hal_hip_stream_command_buffer", function_name_length=34, name=0x0,
    name_length=0) at /home/chi/src/iree/runtime/src/iree/hal/utils/stream_tracing.c:497
#1  0x0000555555606fc2 in iree_hal_hip_stream_command_buffer_begin (base_command_buffer=<optimized out>)
    at /home/chi/src/iree/runtime/src/iree/hal/drivers/hip/stream_command_buffer.c:178
#2  0x00005555555dde15 in iree_hal_command_buffer_begin (command_buffer=0x7fbf680013e0)
    at /home/chi/src/iree/runtime/src/iree/hal/command_buffer.c:273
#3  0x00005555556025a2 in iree_hal_hip_multi_queue_command_buffer_begin (base_command_buffer=0x7fbf680094d0)
    at /home/chi/src/iree/runtime/src/iree/hal/drivers/hip/hip_multi_queue_command_buffer.c:158
#4  0x00005555555dde15 in iree_hal_command_buffer_begin (command_buffer=0x7fbf680094d0)
    at /home/chi/src/iree/runtime/src/iree/hal/command_buffer.c:273
#5  0x00005555555f9e7e in iree_hal_hip_device_perform_queue_read_now (user_data=user_data@entry=0x5555579df230,
    status=0x55555558d55d, status@entry=0x0) at /home/chi/src/iree/runtime/src/iree/hal/drivers/hip/hip_device.c:1837
#6  0x00005555555fdadf in iree_hal_hip_dispatch_thread_main (param=0x55555786c4c0)
    at /home/chi/src/iree/runtime/src/iree/hal/drivers/hip/dispatch_thread.c:66
#7  0x00005555556317ab in iree_thread_start_routine (param=0x55555786c950)
    at /home/chi/src/iree/runtime/src/iree/base/internal/threading_pthreads.c:119
#8  0x00007ffff7894ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9  0x00007ffff7926850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Steps to reproduce your issue

  1. Check out iree at this commit:
commit 4215100513136f4215862ac2578c20e01597d862 (HEAD -> main, upstream/main)
Author: Zhuoran Yin <[email protected]>
Date:   Fri Jan 24 09:21:21 2025 -0500

    Skipping generic from root op when it computes slice indices (#19767)
  2. Build iree with this command:
cmake -G Ninja -B ../iree-build-trace   -S . -DCMAKE_BUILD_TYPE=RelWithDebInfo   \
-DIREE_ENABLE_ASSERTIONS=ON   -DCMAKE_C_COMPILER=clang   \
-DCMAKE_CXX_COMPILER=clang++   -DIREE_ENABLE_RUNTIME_TRACING=ON   \
-DIREE_BUILD_TRACY=ON   -DIREE_ENABLE_LLD=ON   \
-DIREE_BUILD_PYTHON_BINDINGS=ON   \
-DPython3_EXECUTABLE="$(which python3)"  \
-DIREE_TARGET_BACKEND_CUDA=OFF -DIREE_HAL_DRIVER_HIP=ON \
-DIREE_TARGET_BACKEND_ROCM=ON .

cmake --build ../iree-build-trace
  3. Generate the vmfb with iree-compile using the following command.
     Here is the input mlir: llama_8b_fp8.mlir
     The input mlir is generated with shark-ai: https://github.com/nod-ai/shark-ai/commits/users/dan_garvey/fp8_staging
../iree-build-tracy/tools/iree-compile \
  fp8.mlir \
  --iree-hip-target=gfx942 \
  -o=fp8_tracy.vmfb \
  --iree-hal-target-device=hip \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions \
  --iree-hal-executable-debug-level=3 \
  --iree-hal-dump-executable-sources-to=dump
  4. Run iree-run-module with the vmfb/irpa/npy. The irpa is private; reach out to me or @dan-garvey Daniel Garvey (SharkMI300X, /sharedfile/llama3_8b_fp8.irpa). The input npy files are generated by castf16.py, or copy them from the folder (SharkMI300X, /sharedfile/prefill/)
TRACY_NO_EXIT=1 \
  ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  /home/chi/src/iree-build-trace/tools/iree-run-module \
  --hip_use_streams=true \
  --module=fp8_tracy.vmfb \
  --parameters=model=fp8.irpa \
  --device=hip://4 \
  --function=prefill_bs1 \
  --input=@prefill/bf16_tokens.npy \
  --input=@prefill/bf16_seq_lens.npy \
  --input=@prefill/bf16_seq_block_ids.npy \
  --input=@prefill/bf16_cs_f16.npy
[1]    2696115 segmentation fault (core dumped)  TRACY_NO_EXIT=1 ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  --hip_use_streams=true

What component(s) does this issue relate to?

Runtime

Version information

4215100

Additional context

SharkMI300X

Tracy-related bug solution: #19826

@AmosLewis
Contributor Author

AmosLewis commented Jan 27, 2025

I just did a cmake debug-mode build without Tracy on top of master:

commit 9a34131e16f82dab188f84398e8f4a42f09d3350 (HEAD -> main, upstream/main)
Author: Scott Todd <[email protected]>
Date:   Mon Jan 27 10:31:21 2025 -0800

    Cherry-pick fix for torch-mlir build on MSVC. (#19823)

    See https://github.com/llvm/torch-mlir/pull/3984

Running iree-compile and iree-run-module without Tracy, I got a type issue, since numpy does not support the bf16 type.

/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=@prefill/bf16_tokens.npy \
--input=@prefill/bf16_seq_lens.npy \
--input=@prefill/bf16_seq_block_ids.npy \
--input=@prefill/bf16_cs_f16.npy
iree/runtime/src/iree/tooling/numpy_io.c:232: UNIMPLEMENTED; unsupported data type g; parsing input `@prefill/bf16_tokens.npy`; parsing function inputs

Given this, if I want to pass bf16 inputs, how can I create them? One way is to read the f32 .npy and save it as a PyTorch .pt file, since PyTorch supports bf16. But does iree support inputs in .pt format?

@benvanik
Collaborator

you can write the data to binary files and pass those in: --input=4x2xbf16=@some_file.bin

numpy does not support bf16 (without a fork), but some implementations are starting to use that - we could make our numpy loader use <V2 ala pytorch: pytorch/pytorch#143042
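For reference, raw bf16 bytes can be produced from float32 data with a short numpy script. This is a minimal sketch (the helper name `f32_to_bf16_bin` and the output file name are mine, not from the thread): bf16 is the upper 16 bits of the IEEE float32 encoding, so the conversion is a rounded truncation.

```python
import numpy as np

def f32_to_bf16_bin(values, path):
    """Write float32 values to `path` as raw bf16 bytes (2 bytes/element).

    bf16 keeps the upper 16 bits of the IEEE float32 encoding; we
    round-to-nearest-even before truncating the low 16 bits.
    """
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    bf16 = (rounded >> 16).astype(np.uint16)
    bf16.tofile(path)  # literal little-endian bytes, no header
    return bf16

encoded = f32_to_bf16_bin([1.0, -2.5, 0.5], "sample_bf16.bin")
```

A file produced this way could then be passed as, e.g., `--input=3xbf16=@sample_bf16.bin`.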

@AmosLewis
Contributor Author

AmosLewis commented Jan 27, 2025

some_file.bin

@benvanik I wrote a script numpy2TorchBf16Bin.py to convert f32.npy to torch bf16 and then write it into a .bin. When run with iree-run-module, it said only .npy is supported.

/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=@prefill/bf16_tokens.bin \
--input=@prefill/bf16_seq_lens.bin \
--input=@prefill/bf16_seq_block_ids.bin \
--input=@prefill/bf16_cs_f16.bin
iree/runtime/src/iree/tooling/function_io.c:607: UNIMPLEMENTED; only numpy (.npy) files are supported for metadata-less variant I/O; parsing input `@prefill/bf16_tokens.bin`; parsing function inputs

@ScottTodd
Member

When you pass binary data, you need to tell the runtime how to interpret that data, using for example --input=4x2xbf16=@some_file.bin. Numpy stores enough metadata in .npy files for the runtime to interpret them on their own.
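To illustrate the metadata point, the header that numpy writes into a .npy file can be inspected directly. A sketch (the file name is hypothetical):

```python
import numpy as np

# Save a small array; the .npy header records dtype and shape, which is
# why iree-run-module can interpret .npy inputs on their own, while raw
# .bin files need an explicit SHAPExDTYPE prefix on --input.
np.save("tokens_demo.npy", np.zeros((4, 128), dtype=np.int64))

with open("tokens_demo.npy", "rb") as f:
    version = np.lib.format.read_magic(f)
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)

print(version, shape, dtype)  # (1, 0) (4, 128) int64
```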

@AmosLewis
Contributor Author

When you pass binary data, you need to tell the runtime how to interpret that data, using for example --input=4x2xbf16=@some_file.bin. Numpy stores enough metadata in .npy files for the runtime to interpret them on their own.

/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=512xbf16=@prefill/bf16_tokens.bin \
--input=4xbf16=@prefill/bf16_seq_lens.bin \
--input=16xbf16=@prefill/bf16_seq_block_ids.bin \
--input=268435456xbf16=@prefill/bf16_cs_f16.bin
EXEC @prefill_bs1
iree/runtime/src/iree/modules/hal/utils/buffer_diagnostics.c:191: INVALID_ARGUMENT; tensor element type mismatch; expected i64 (10000040) but have bf16 (22000010); while invoking native function hal.buffer_view.assert; while calling import;
[ 1] bytecode module.prefill_bs1$async:4672 fp8.mlir:549:26
[ 0] bytecode module.prefill_bs1:68 fp8.mlir:549:3; invoking function 'prefill_bs1'

@benvanik
Collaborator

don't turn your things that should be i64 into bf16? I'm guessing your tokens aren't bf16 values?

@AmosLewis
Contributor Author

AmosLewis commented Jan 27, 2025

don't turn your things that should be i64 into bf16? I'm guessing your tokens aren't bf16 values?

I tried casting only cs_f16 to bf16 and left everything else the same, because when I print (numpy2TorchBf16Bin.py) from the raw npy, everything else is int. @dan-garvey I also doubt whether all the inputs should be bf16, and whether the sizes are the same.

  # data size:  cs_f16 268435456
  # data type:  cs_f16 float16
  # data size:  seq_block_ids 16
  # data type:  seq_block_ids int64
  # data size:  seq_lens 4
  # data type:  seq_lens int64
  # data size:  tokens 512
  # data type:  tokens int64
/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=@prefill/tokens.npy \
--input=@prefill/seq_lens.npy \
--input=@prefill/seq_block_ids.npy \
--input=268435456xbf16=@prefill/bf16_cs_f16.bin
EXEC @prefill_bs1
iree/runtime/src/iree/modules/hal/utils/buffer_diagnostics.c:225: INVALID_ARGUMENT; tensor shape dimension 0 mismatch; expected 1 but have 4; expected shape `1x128`, actual shape `4x128`; while invoking native function hal.buffer_view.assert; while calling import;
[ 1] bytecode module.prefill_bs1$async:4672 fp8.mlir:549:26
[ 0] bytecode module.prefill_bs1:68 fp8.mlir:549:3; invoking function 'prefill_bs1'

@benvanik
Collaborator

these are some basic errors from passing in the wrong values - this issue has bounced between asserts in tracy, bfloat16 numpy support, and missized inputs - it'd be good to break these down and isolate things so we can actually make some progress. all are issues, but together it's too hard to track.

@drprajap

drprajap commented Jan 27, 2025

The originally reported segfault (exposed in the Llama3.1_8b_f16_tp8 model) has been resolved by runtime fix 1bf7249; the fix has been verified in both cases.
After that fix, it exposed input-specific issues; would it be good to file separate issues for those for easier tracking?

@AmosLewis
Contributor Author

AmosLewis commented Jan 28, 2025

The tracy issue is now fixed. With the corrected SizexDtype inputs just generated by Dan at (SharkMI300, /sharedfile/prefill/), the INVALID_ARGUMENT issue is also fixed.
Now we get a new hip HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION issue at runtime, both with and without Tracy. Same as #19564.

commit 4b0ca34a768377d648f1efc7f377a852d7a943a9 (HEAD -> main, upstream/main)
Author: Ian Wood <[email protected]>
Date:   Mon Jan 27 16:08:05 2025 -0800

    Support fusing broadcast transposes with attention (#19828)
/home/chi/src/iree-build/tools/iree-compile fp8.mlir \
  --iree-hip-target=gfx942 \
  -o=fp8.vmfb \
  --iree-hal-target-device=hip \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions
/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=1x32xi64=@/sharedfile/prefill/prefill_token_ids_1_32.bin \
--input=1xi64=@/sharedfile/prefill/prefill_seq_lens_1.bin \
--input=1x1xi64=@/sharedfile/prefill/prefill_seq_block_ids_1_1.bin \
--input=128x2097152xf8E4M3FNUZ=@/sharedfile/prefill/prefill_cache_state_128_2097152.bin
EXEC @prefill_bs1
:0:rocdevice.cpp            :2984: 268244551229 us: [pid:3664007 tid:0x7e7d1b000640] Callback: Queue 0x7e7d1a500000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
[1]    3664007 IOT instruction (core dumped)  /home/chi/src/iree-build/tools/iree-run-module --hip_use_streams=true

@IanWood1
Contributor

@AmosLewis I think the inputs might still be incorrect. Using rocgdb, I found that the failing dispatch is prefill_bs1$async_dispatch_0_elementwise_broadcast_Dx4096_i64xbf16, which is the first dispatch. It uses user-provided input to index into the kvcache (torch.embedding). prefill_seq_block_ids_1_1.bin appears to be 1.1kb, which doesn't seem right since it's a binary file with a single i64.

Also, I'm not sure #19564 is related. It was producing a similar error but due to the incorrect input. The mentioned runtime commit fixed a secondary issue #19564 (comment).
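One quick sanity check along these lines is to compare each file's size on disk against the expected element count. A sketch (the helper and file names are mine, not from the thread):

```python
import os
import numpy as np

def check_raw_input(path, shape, dtype):
    """A raw --input .bin must be exactly prod(shape) * itemsize bytes."""
    expected = int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"{path}: expected {expected} bytes, got {actual}")
    return actual

# A 1x1 i64 tensor is exactly 8 bytes on disk, not ~1.1kb.
np.array([[3]], dtype=np.int64).tofile("seq_block_ids_demo.bin")
size = check_raw_input("seq_block_ids_demo.bin", (1, 1), np.int64)
```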

@dan-garvey
Contributor

I think the bin files have significant size overhead; multiple 1-value files are over 1kb.

@benvanik
Collaborator

they are invalid if so - they are supposed to be the literal data - a 4 byte value should be 4 bytes on disk.
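This is easy to verify: numpy's `tofile` writes only the literal element bytes, while container formats add headers on top (torch.save, for instance, writes a zip archive since PyTorch 1.6, which would explain the >1kb one-value files). A minimal sketch with hypothetical file names:

```python
import os
import numpy as np

one = np.array([42], dtype=np.int64)

one.tofile("one_i64.bin")    # literal data: exactly 8 bytes on disk
np.save("one_i64.npy", one)  # container format: header + data

bin_size = os.path.getsize("one_i64.bin")
npy_size = os.path.getsize("one_i64.npy")
print(bin_size, npy_size)  # 8, and something larger for the .npy
```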

@dan-garvey
Contributor

@AmosLewis have you successfully used any ".bin" files produced via torch.save?

@benvanik
Collaborator

open them in a hex editor and check - here's 24 floats of 1.0, should look like this:

[Image: hex editor screenshot showing 24 float32 values of 1.0]
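The same check can be done without a hex editor. A sketch (the file name is mine): IEEE-754 encodes 1.0f as 0x3F800000, so on a little-endian machine the raw file is the 4-byte pattern `00 00 80 3f` repeated 24 times.

```python
import numpy as np

# Write 24 float32 values of 1.0 as literal bytes (96 bytes total).
np.full(24, 1.0, dtype=np.float32).tofile("ones_f32.bin")

with open("ones_f32.bin", "rb") as f:
    data = f.read()

# Each 1.0f is little-endian 0x3F800000 -> bytes 00 00 80 3f.
print(len(data), data[:4].hex(" "))  # 96 00 00 80 3f
```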

@dan-garvey
Contributor

Yeah, it's certainly not one value. But if a bin file is meant to be just a sequence of literal values, that should be pretty easy to produce.

In the meantime @AmosLewis you can just load these using the torch api and then pass via the iree python api, none of them are bf16 so the numpy intermediary won't be a problem until output.

@AmosLewis
Contributor Author

@AmosLewis have you successfully used any ".bin" files produced via torch.save?

No. This is the first time I've used them with iree-run-module.

@dan-garvey
Contributor

dan-garvey commented Jan 29, 2025 via email

@ScottTodd
Member

Yeah don't mix file types. .bin is just binary data, no magic, no metadata, nothing framework-specific. At the base level that is all IREE sees at the boundaries anyways - buffers of data.

@AmosLewis
Contributor Author

AmosLewis commented Jan 30, 2025

With Dan's new .bin inputs created in pr nod-ai/shark-ai#885, iree-run-module runs successfully and the tracy file is generated. But now we have a numeric issue; I will file a new issue #19859 for the numerics separately.

TRACY_NO_EXIT=1 \
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
/home/chi/src/iree-build-trace/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8_tracy.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=1x32xi64=@/sharedfile/prefill/prefill_token_ids_1_32.bin \
--input=1xi64=@/sharedfile/prefill/prefill_seq_lens_1.bin \
--input=1x1xi64=@/sharedfile/prefill/prefill_seq_block_ids_1_1.bin \
--input=128x2097152xf8E4M3FNUZ=@/sharedfile/prefill/prefill_cache_state_128_2097152.bin
EXEC @prefill_bs1
result[0]: hal.buffer_view
1x32x128256xbf16=[[NAN NAN NAN NAN NAN...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...]]
