
[AMDGPU] iree-hal-hip-di SIGSEGV for llama 8b fp8 model #19809

Closed
AmosLewis opened this issue Jan 24, 2025 · 20 comments
Labels
bug 🐞 Something isn't working
@AmosLewis
Contributor

AmosLewis commented Jan 24, 2025

What happened?

Follow up of #19785

When running the Tracy-enabled iree-run-module on the llama 8b float8 model on an AMD GPU, I got a segmentation fault.

TRACY_NO_EXIT=1 \
  ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  /home/chi/src/iree-build-trace/tools/iree-run-module \
  --hip_use_streams=true \
  --module=fp8_tracy.vmfb \
  --parameters=model=fp8.irpa \
  --device=hip://4 \
  --function=prefill_bs1 \
  --input=@prefill/bf16_tokens.npy \
  --input=@prefill/bf16_seq_lens.npy \
  --input=@prefill/bf16_seq_block_ids.npy \
  --input=@prefill/bf16_cs_f16.npy
[1]    2696115 segmentation fault (core dumped)  TRACY_NO_EXIT=1 ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  --hip_use_streams=true

gdb bt:

Thread 10 "iree-hal-hip-di" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fbf96600640 (LWP 2749328)]
iree_hal_stream_tracing_zone_begin_external_impl (context=0x7fffe4400010, event_list=0x7fbf68001438, verbosity=verbosity@entry=IREE_HAL_STREAM_TRACING_VERBOSITY_COARSE, file_name=file_name@entry=0x0, file_name_length=file_name_length@entry=0, line=line@entry=0, function_name=0x55555558d55d "iree_hal_hip_stream_command_buffer", function_name_length=34, name=0x0, name_length=0) at /home/chi/src/iree/runtime/src/iree/hal/utils/stream_tracing.c:497
497       if (verbosity > context->verbosity) return;
(gdb) bt
#0  iree_hal_stream_tracing_zone_begin_external_impl (context=0x7fffe4400010, event_list=0x7fbf68001438,
    verbosity=verbosity@entry=IREE_HAL_STREAM_TRACING_VERBOSITY_COARSE, file_name=file_name@entry=0x0,
    file_name_length=file_name_length@entry=0, line=line@entry=0,
    function_name=0x55555558d55d "iree_hal_hip_stream_command_buffer", function_name_length=34, name=0x0,
    name_length=0) at /home/chi/src/iree/runtime/src/iree/hal/utils/stream_tracing.c:497
#1  0x0000555555606fc2 in iree_hal_hip_stream_command_buffer_begin (base_command_buffer=<optimized out>)
    at /home/chi/src/iree/runtime/src/iree/hal/drivers/hip/stream_command_buffer.c:178
#2  0x00005555555dde15 in iree_hal_command_buffer_begin (command_buffer=0x7fbf680013e0)
    at /home/chi/src/iree/runtime/src/iree/hal/command_buffer.c:273
#3  0x00005555556025a2 in iree_hal_hip_multi_queue_command_buffer_begin (base_command_buffer=0x7fbf680094d0)
    at /home/chi/src/iree/runtime/src/iree/hal/drivers/hip/hip_multi_queue_command_buffer.c:158
#4  0x00005555555dde15 in iree_hal_command_buffer_begin (command_buffer=0x7fbf680094d0)
    at /home/chi/src/iree/runtime/src/iree/hal/command_buffer.c:273
#5  0x00005555555f9e7e in iree_hal_hip_device_perform_queue_read_now (user_data=user_data@entry=0x5555579df230,
    status=0x55555558d55d, status@entry=0x0) at /home/chi/src/iree/runtime/src/iree/hal/drivers/hip/hip_device.c:1837
#6  0x00005555555fdadf in iree_hal_hip_dispatch_thread_main (param=0x55555786c4c0)
    at /home/chi/src/iree/runtime/src/iree/hal/drivers/hip/dispatch_thread.c:66
#7  0x00005555556317ab in iree_thread_start_routine (param=0x55555786c950)
    at /home/chi/src/iree/runtime/src/iree/base/internal/threading_pthreads.c:119
#8  0x00007ffff7894ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9  0x00007ffff7926850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Steps to reproduce your issue

  1. Check out iree at this commit:
commit 4215100513136f4215862ac2578c20e01597d862 (HEAD -> main, upstream/main)
Author: Zhuoran Yin <[email protected]>
Date:   Fri Jan 24 09:21:21 2025 -0500

    Skipping generic from root op when it computes slice indices (#19767)
  2. Build iree with this command:
cmake -G Ninja -B ../iree-build-trace   -S . -DCMAKE_BUILD_TYPE=RelWithDebInfo   \
-DIREE_ENABLE_ASSERTIONS=ON   -DCMAKE_C_COMPILER=clang   \
-DCMAKE_CXX_COMPILER=clang++   -DIREE_ENABLE_RUNTIME_TRACING=ON   \
-DIREE_BUILD_TRACY=ON   -DIREE_ENABLE_LLD=ON   \
-DIREE_BUILD_PYTHON_BINDINGS=ON   \
-DPython3_EXECUTABLE="$(which python3)"  \
-DIREE_TARGET_BACKEND_CUDA=OFF -DIREE_HAL_DRIVER_HIP=ON \
-DIREE_TARGET_BACKEND_ROCM=ON .

cmake --build ../iree-build-trace
  3. Generate the vmfb with iree-compile using the following command.
     Here is the input mlir: llama_8b_fp8.mlir
     The input mlir is generated with shark-ai: https://github.com/nod-ai/shark-ai/commits/users/dan_garvey/fp8_staging
../iree-build-tracy/tools/iree-compile \
  fp8.mlir \
  --iree-hip-target=gfx942 \
  -o=fp8_tracy.vmfb \
  --iree-hal-target-device=hip \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions \
  --iree-hal-executable-debug-level=3 \
  --iree-hal-dump-executable-sources-to=dump
  4. Run iree-run-module with the vmfb/irpa/npy. The irpa is private; reach out to me or @dan-garvey Daniel Garvey (SharkMI300X, /sharedfile/llama3_8b_fp8.irpa). The input npy files are generated by castf16.py, or copy them from the folder (SharkMI300X, /sharedfile/prefill/)
TRACY_NO_EXIT=1 \
  ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  /home/chi/src/iree-build-trace/tools/iree-run-module \
  --hip_use_streams=true \
  --module=fp8_tracy.vmfb \
  --parameters=model=fp8.irpa \
  --device=hip://4 \
  --function=prefill_bs1 \
  --input=@prefill/bf16_tokens.npy \
  --input=@prefill/bf16_seq_lens.npy \
  --input=@prefill/bf16_seq_block_ids.npy \
  --input=@prefill/bf16_cs_f16.npy
[1]    2696115 segmentation fault (core dumped)  TRACY_NO_EXIT=1 ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  --hip_use_streams=true

What component(s) does this issue relate to?

Runtime

Version information

4215100

Additional context

SharkMI300X

Tracy-related bug solution: #19826

@AmosLewis
Contributor Author

AmosLewis commented Jan 27, 2025

I just did a cmake debug-mode build without Tracy on top of master:

commit 9a34131e16f82dab188f84398e8f4a42f09d3350 (HEAD -> main, upstream/main)
Author: Scott Todd <[email protected]>
Date:   Mon Jan 27 10:31:21 2025 -0800

    Cherry-pick fix for torch-mlir build on MSVC. (#19823)

    See https://github.com/llvm/torch-mlir/pull/3984

Running iree-compile and iree-run-module without Tracy, I got a type issue, since numpy does not support the bf16 type.

/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=@prefill/bf16_tokens.npy \
--input=@prefill/bf16_seq_lens.npy \
--input=@prefill/bf16_seq_block_ids.npy \
--input=@prefill/bf16_cs_f16.npy
iree/runtime/src/iree/tooling/numpy_io.c:232: UNIMPLEMENTED; unsupported data type g; parsing input `@prefill/bf16_tokens.npy`; parsing function inputs

Given this, if I want to pass bf16 inputs, how can I create them? One way is to read the f32 .npy and save it as a PyTorch .pt file, since PyTorch supports bf16. But does iree support inputs in .pt format?

@benvanik
Collaborator

you can write the data to binary files and pass those in: --input=4x2xbf16=@some_file.bin

numpy does not support bf16 (without a fork), but some implementations are starting to use that - we could make our numpy loader use <V2 ala pytorch: pytorch/pytorch#143042
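For reference, raw bf16 bytes can be produced from float32 data with a short numpy script. This is a minimal sketch (the helper name `f32_to_bf16_bin` and the output file name are mine, not from the thread): bf16 is the upper 16 bits of the IEEE float32 encoding, so the conversion is a rounded truncation.

```python
import numpy as np

def f32_to_bf16_bin(values, path):
    """Write float32 values to `path` as raw bf16 bytes (2 bytes/element).

    bf16 keeps the upper 16 bits of the IEEE float32 encoding; we
    round-to-nearest-even before truncating the low 16 bits.
    """
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    bf16 = (rounded >> 16).astype(np.uint16)
    bf16.tofile(path)  # literal little-endian bytes, no header
    return bf16

encoded = f32_to_bf16_bin([1.0, -2.5, 0.5], "sample_bf16.bin")
```

A file produced this way could then be passed as, e.g., `--input=3xbf16=@sample_bf16.bin`.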

@AmosLewis
Contributor Author

AmosLewis commented Jan 27, 2025

some_file.bin

@benvanik I wrote a script numpy2TorchBf16Bin.py to convert f32.npy to torch bf16 and then write it into a .bin. When run with iree-run-module, it said only .npy is supported.

/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=@prefill/bf16_tokens.bin \
--input=@prefill/bf16_seq_lens.bin \
--input=@prefill/bf16_seq_block_ids.bin \
--input=@prefill/bf16_cs_f16.bin
iree/runtime/src/iree/tooling/function_io.c:607: UNIMPLEMENTED; only numpy (.npy) files are supported for metadata-less variant I/O; parsing input `@prefill/bf16_tokens.bin`; parsing function inputs

@ScottTodd
Member

When you pass binary data, you need to tell the runtime how to interpret that data, using for example --input=4x2xbf16=@some_file.bin. Numpy stores enough metadata in .npy files for the runtime to interpret them on their own.
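To illustrate the metadata point, the header that numpy writes into a .npy file can be inspected directly. A sketch (the file name is hypothetical):

```python
import numpy as np

# Save a small array; the .npy header records dtype and shape, which is
# why iree-run-module can interpret .npy inputs on their own, while raw
# .bin files need an explicit SHAPExDTYPE prefix on --input.
np.save("tokens_demo.npy", np.zeros((4, 128), dtype=np.int64))

with open("tokens_demo.npy", "rb") as f:
    version = np.lib.format.read_magic(f)
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)

print(version, shape, dtype)  # (1, 0) (4, 128) int64
```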

@AmosLewis
Contributor Author

When you pass binary data, you need to tell the runtime how to interpret that data, using for example --input=4x2xbf16=@some_file.bin. Numpy stores enough metadata in .npy files for the runtime to interpret them on their own.

/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=512xbf16=@prefill/bf16_tokens.bin \
--input=4xbf16=@prefill/bf16_seq_lens.bin \
--input=16xbf16=@prefill/bf16_seq_block_ids.bin \
--input=268435456xbf16=@prefill/bf16_cs_f16.bin
EXEC @prefill_bs1
iree/runtime/src/iree/modules/hal/utils/buffer_diagnostics.c:191: INVALID_ARGUMENT; tensor element type mismatch; expected i64 (10000040) but have bf16 (22000010); while invoking native function hal.buffer_view.assert; while calling import;
[ 1] bytecode module.prefill_bs1$async:4672 fp8.mlir:549:26
[ 0] bytecode module.prefill_bs1:68 fp8.mlir:549:3; invoking function 'prefill_bs1'

@benvanik
Collaborator

don't turn your things that should be i64 into bf16? I'm guessing your tokens aren't bf16 values?

@AmosLewis
Contributor Author

AmosLewis commented Jan 27, 2025

don't turn your things that should be i64 into bf16? I'm guessing your tokens aren't bf16 values?

I tried casting only cs_f16 to bf16 and left everything else the same, because when I print (numpy2TorchBf16Bin.py) from the raw npy, everything else is int. @dan-garvey I also doubt whether all the inputs should be bf16, and whether the sizes are the same.

  # data size:  cs_f16 268435456
  # data type:  cs_f16 float16
  # data size:  seq_block_ids 16
  # data type:  seq_block_ids int64
  # data size:  seq_lens 4
  # data type:  seq_lens int64
  # data size:  tokens 512
  # data type:  tokens int64
/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=@prefill/tokens.npy \
--input=@prefill/seq_lens.npy \
--input=@prefill/seq_block_ids.npy \
--input=268435456xbf16=@prefill/bf16_cs_f16.bin
EXEC @prefill_bs1
iree/runtime/src/iree/modules/hal/utils/buffer_diagnostics.c:225: INVALID_ARGUMENT; tensor shape dimension 0 mismatch; expected 1 but have 4; expected shape `1x128`, actual shape `4x128`; while invoking native function hal.buffer_view.assert; while calling import;
[ 1] bytecode module.prefill_bs1$async:4672 fp8.mlir:549:26
[ 0] bytecode module.prefill_bs1:68 fp8.mlir:549:3; invoking function 'prefill_bs1'

@benvanik
Collaborator

these are some basic errors from passing in the wrong values - this issue has bounced between asserts in tracy, bfloat16 numpy support, and missized inputs - it'd be good to break these down and isolate things so we can actually make some progress. all are issues, but together it's too hard to track.

@drprajap

drprajap commented Jan 27, 2025

The originally reported segfault (exposed in the Llama3.1_8b_f16_tp8 model) has been resolved by runtime fix 1bf7249; the fix has been verified in both cases.
After that fix, it exposed input-specific issues; would it be good to file separate issues for those for easier tracking?

@AmosLewis
Contributor Author

AmosLewis commented Jan 28, 2025

The tracy issue is now fixed. With the corrected SizexDtype inputs just generated by Dan at (SharkMI300, /sharedfile/prefill/), the INVALID_ARGUMENT issue is also fixed.
Now we get a new hip HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION issue at runtime, both with and without Tracy. Same as #19564.

commit 4b0ca34a768377d648f1efc7f377a852d7a943a9 (HEAD -> main, upstream/main)
Author: Ian Wood <[email protected]>
Date:   Mon Jan 27 16:08:05 2025 -0800

    Support fusing broadcast transposes with attention (#19828)
/home/chi/src/iree-build/tools/iree-compile fp8.mlir \
  --iree-hip-target=gfx942 \
  -o=fp8.vmfb \
  --iree-hal-target-device=hip \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions
/home/chi/src/iree-build/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=1x32xi64=@/sharedfile/prefill/prefill_token_ids_1_32.bin \
--input=1xi64=@/sharedfile/prefill/prefill_seq_lens_1.bin \
--input=1x1xi64=@/sharedfile/prefill/prefill_seq_block_ids_1_1.bin \
--input=128x2097152xf8E4M3FNUZ=@/sharedfile/prefill/prefill_cache_state_128_2097152.bin
EXEC @prefill_bs1
:0:rocdevice.cpp            :2984: 268244551229 us: [pid:3664007 tid:0x7e7d1b000640] Callback: Queue 0x7e7d1a500000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
[1]    3664007 IOT instruction (core dumped)  /home/chi/src/iree-build/tools/iree-run-module --hip_use_streams=true

@IanWood1
Contributor

@AmosLewis I think the inputs might still be incorrect. Using rocgdb, I found that the failing dispatch is prefill_bs1$async_dispatch_0_elementwise_broadcast_Dx4096_i64xbf16, which is the first dispatch. It uses user-provided input to index into the kvcache (torch.embedding). prefill_seq_block_ids_1_1.bin appears to be 1.1kb, which doesn't seem right since it's a binary file with a single i64.

Also, I'm not sure #19564 is related. It was producing a similar error but due to the incorrect input. The mentioned runtime commit fixed a secondary issue #19564 (comment).
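One quick sanity check along these lines is to compare each file's size on disk against the expected element count. A sketch (the helper and file names are mine, not from the thread):

```python
import os
import numpy as np

def check_raw_input(path, shape, dtype):
    """A raw --input .bin must be exactly prod(shape) * itemsize bytes."""
    expected = int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"{path}: expected {expected} bytes, got {actual}")
    return actual

# A 1x1 i64 tensor is exactly 8 bytes on disk, not ~1.1kb.
np.array([[3]], dtype=np.int64).tofile("seq_block_ids_demo.bin")
size = check_raw_input("seq_block_ids_demo.bin", (1, 1), np.int64)
```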

@dan-garvey
Contributor

I think the bin files have significant size overhead; multiple 1-value files are over 1kb.

@benvanik
Collaborator

they are invalid if so - they are supposed to be the literal data - a 4 byte value should be 4 bytes on disk.
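This is easy to verify: numpy's `tofile` writes only the literal element bytes, while container formats add headers on top (torch.save, for instance, writes a zip archive since PyTorch 1.6, which would explain the >1kb one-value files). A minimal sketch with hypothetical file names:

```python
import os
import numpy as np

one = np.array([42], dtype=np.int64)

one.tofile("one_i64.bin")    # literal data: exactly 8 bytes on disk
np.save("one_i64.npy", one)  # container format: header + data

bin_size = os.path.getsize("one_i64.bin")
npy_size = os.path.getsize("one_i64.npy")
print(bin_size, npy_size)  # 8, and something larger for the .npy
```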

@dan-garvey
Contributor

@AmosLewis have you successfully used any ".bin" files produced via torch.save?

@benvanik
Collaborator

open them in a hex editor and check - here's 24 floats of 1.0, should look like this:

[Image: hex editor screenshot showing 24 float32 values of 1.0]
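The same check can be done without a hex editor. A sketch (the file name is mine): IEEE-754 encodes 1.0f as 0x3F800000, so on a little-endian machine the raw file is the 4-byte pattern `00 00 80 3f` repeated 24 times.

```python
import numpy as np

# Write 24 float32 values of 1.0 as literal bytes (96 bytes total).
np.full(24, 1.0, dtype=np.float32).tofile("ones_f32.bin")

with open("ones_f32.bin", "rb") as f:
    data = f.read()

# Each 1.0f is little-endian 0x3F800000 -> bytes 00 00 80 3f.
print(len(data), data[:4].hex(" "))  # 96 00 00 80 3f
```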

@dan-garvey
Contributor

Yeah, it's certainly not one value. But if a bin file is meant to be just a sequence of literal values, that should be pretty easy to produce.

In the meantime @AmosLewis you can just load these using the torch api and then pass via the iree python api, none of them are bf16 so the numpy intermediary won't be a problem until output.

@AmosLewis
Contributor Author

@AmosLewis have you successfully used any ".bin" files produced via torch.save?

No. This is the first time I've used them with iree-run-module.

@dan-garvey
Contributor

dan-garvey commented Jan 29, 2025 via email

@ScottTodd
Member

Yeah don't mix file types. .bin is just binary data, no magic, no metadata, nothing framework-specific. At the base level that is all IREE sees at the boundaries anyways - buffers of data.

@AmosLewis
Contributor Author

AmosLewis commented Jan 30, 2025

With Dan's new .bin inputs created in pr nod-ai/shark-ai#885, iree-run-module runs successfully and the tracy file is generated. But now we have a numeric issue; I will file a new issue #19859 for the numerics separately.

TRACY_NO_EXIT=1 \
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
/home/chi/src/iree-build-trace/tools/iree-run-module \
--hip_use_streams=true \
--module=fp8_tracy.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=1x32xi64=@/sharedfile/prefill/prefill_token_ids_1_32.bin \
--input=1xi64=@/sharedfile/prefill/prefill_seq_lens_1.bin \
--input=1x1xi64=@/sharedfile/prefill/prefill_seq_block_ids_1_1.bin \
--input=128x2097152xf8E4M3FNUZ=@/sharedfile/prefill/prefill_cache_state_128_2097152.bin
EXEC @prefill_bs1
result[0]: hal.buffer_view
1x32x128256xbf16=[[NAN NAN NAN NAN NAN...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...]]
