Error in Stream due to https://github.com/iree-org/iree/pull/18907 #19559

Open

MaheshRavishankar opened this issue Dec 24, 2024 · 15 comments

@MaheshRavishankar (Contributor)

I have been hitting unrelated issues with #18907 (which is meant to test LLVM PR llvm/llvm-project#113501). It has been tricky to track down the core issue because the change is almost an NFC and is definitely not the root cause of the errors being hit. Using this issue to track down the root cause. So far it seems to point to the OptimizeIntArithmetic pass and some folder.

Repro instructions

Input IR

toy_llama_tp2.mlir.txt

Compile command

iree-compile  --iree-input-type=auto \
                --iree-vm-bytecode-module-output-format=flatbuffer-binary \
                --mlir-print-debuginfo \
                --mlir-print-op-on-diagnostic=false \
                --iree-hal-target-device=llvm-cpu[0] \
                --iree-hal-target-device=llvm-cpu[1] \
                --iree-llvmcpu-target-cpu=host \
                -o toy_llama_tp2.vmfb \

The IR dump after all passes is huge, but the difference seems to start at the OptimizeIntArithmetic pass. Here are the IR dumps before and after this pass for ToT and for #18907:

ToT.optimize_int_arithmetic.mlir.txt
pr18907.optimize_int_arithmetic.mlir.txt

Specifically

With ToT

  %424 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x12288xf16>{%91} in !stream.resource<external>{%365} loc("toy_llama_tp2.mlir":2822:12)
  %425 = stream.timepoint.await %65 => %424 : !stream.resource<external>{%365} loc("toy_llama_tp2.mlir":2822:12)
  %426 = stream.async.update on(#hal.device.affinity<@__device_0>) %364, %425[%c0 to %365] : !stream.resource<*>{%365} -> %425 as !stream.resource<external>{%365} loc("toy_llama_tp2.mlir":2822:12)
  %427 = stream.async.transfer %426 : !stream.resource<external>{%365} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%365} loc("toy_llama_tp2.mlir":2822:12)
  %428 = stream.tensor.import on(#hal.device.affinity<@__device_1>) %arg4 : !hal.buffer_view -> tensor<?x12288xf16>{%91} in !stream.resource<external>{%368} loc("toy_llama_tp2.mlir":2825:12)
  %429 = stream.timepoint.await %85 => %428 : !stream.resource<external>{%368} loc("toy_llama_tp2.mlir":2825:12)
  %430 = stream.async.update on(#hal.device.affinity<@__device_1>) %367, %429[%c0 to %368] : !stream.resource<*>{%368} -> %429 as !stream.resource<external>{%368} loc("toy_llama_tp2.mlir":2825:12)
  %431 = stream.async.transfer %430 : !stream.resource<external>{%368} from(#hal.device.affinity<@__device_1>) -> to(#hal.device.affinity<@__device_1>) !stream.resource<*>{%368} loc("toy_llama_tp2.mlir":2825:12)

With #18907

  %165 = arith.muli %91, %c6 : index loc("toy_llama_tp2.mlir":681:21)
  %365 = arith.divui %165, %c6 : index loc("toy_llama_tp2.mlir":2322:12)
  ....

 %425 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x12288xf16>{%91} : index loc("toy_llama_tp2.mlir":2822:12)
  %426 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x12288xf16>{%91} in !stream.resource<external>{%425} loc("toy_llama_tp2.mlir":2822:12)
  %427 = stream.timepoint.await %65 => %426 : !stream.resource<external>{%425} loc("toy_llama_tp2.mlir":2822:12)
  %428 = stream.async.update on(#hal.device.affinity<@__device_0>) %364, %427[%c0 to %366] : !stream.resource<*>{%366} -> %427 as !stream.resource<external>{%425} loc("toy_llama_tp2.mlir":2822:12)
  %429 = stream.async.slice on(#hal.device.affinity<@__device_0>) %428[%c0 to %366] : !stream.resource<external>{%425} -> !stream.resource<external>{%366} loc("toy_llama_tp2.mlir":2822:12)
  %430 = stream.async.transfer %429 : !stream.resource<external>{%366} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%366} loc("toy_llama_tp2.mlir":2822:12)
  %431 = stream.tensor.sizeof on(#hal.device.affinity<@__device_1>) tensor<?x12288xf16>{%91} : index loc("toy_llama_tp2.mlir":2825:12)
  %432 = stream.tensor.import on(#hal.device.affinity<@__device_1>) %arg4 : !hal.buffer_view -> tensor<?x12288xf16>{%91} in !stream.resource<external>{%431} loc("toy_llama_tp2.mlir":2825:12)
  %433 = stream.timepoint.await %85 => %432 : !stream.resource<external>{%431} loc("toy_llama_tp2.mlir":2825:12)
  %434 = stream.async.update on(#hal.device.affinity<@__device_1>) %368, %433[%c0 to %369] : !stream.resource<*>{%369} -> %433 as !stream.resource<external>{%431} loc("toy_llama_tp2.mlir":2825:12)
  %435 = stream.async.slice on(#hal.device.affinity<@__device_1>) %434[%c0 to %369] : !stream.resource<external>{%431} -> !stream.resource<external>{%369} loc("toy_llama_tp2.mlir":2825:12)

There are some extra stream.async.slice ops with PR #18907. The before IR is essentially equivalent, but might be missing a folder for this muli/divui pair (a sketch of such a fold is included after the snippet below):

  %165 = arith.muli %91, %c6 : index loc("toy_llama_tp2.mlir":681:21)
  %365 = arith.divui %165, %c6 : index loc("toy_llama_tp2.mlir":2322:12)
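
As a rough illustration, here is a minimal sketch of the kind of fold that would restore SSA equality; this is not IREE's or upstream MLIR's actual code, and the pattern name and the exact overflow-flag accessors are assumptions:

// Hypothetical sketch of a fold that rewrites
//   %m = arith.muli %x, %c overflow<nuw>
//   %d = arith.divui %m, %c
// back to %x, so downstream size values compare equal as SSA values again.
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

struct FoldDivUIOfMulI : public OpRewritePattern<arith::DivUIOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(arith::DivUIOp divOp,
                                PatternRewriter &rewriter) const override {
    auto mulOp = divOp.getLhs().getDefiningOp<arith::MulIOp>();
    // The divisor must be the exact same SSA value as the multiplier, and the
    // multiply must be known not to wrap; otherwise (x * c) / c may not be x.
    if (!mulOp || mulOp.getRhs() != divOp.getRhs())
      return failure();
    if (!bitEnumContainsAll(mulOp.getOverflowFlags(),
                            arith::IntegerOverflowFlags::nuw))
      return failure();
    rewriter.replaceOp(divOp, mulOp.getLhs());
    return success();
  }
};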

@MaheshRavishankar (Contributor Author)

cc @benvanik if you have cycles and can try this example out. It would help if you could give more insight as to what's going on here and how to fix this. (I have the full IR dump after all passes if you want to take a look at that, but it is too big to upload here.)

@benvanik (Collaborator) commented Jan 6, 2025

Many things rely on SSA value equality and the additional unfolded divui breaks equality and thus hurts any folder/canonicalizer/analysis that is relying on it to elide the extra slices. Your folder fixing it returns the universe back to the happy place where %91=%91.

@IanWood1 (Contributor)

I found that in the case of prefill there are stream.async.update ops that are updating the full size of the KV cache (i.e. the update and target were the same size but couldn't be folded). From the docs, this seems to be legal for this op. However, it gets converted to stream.cmd.copy, which makes overlapping src/dst illegal:

static LogicalResult applyAsyncUpdateOp(IREE::Stream::AsyncUpdateOp asyncOp,
                                        AllocationScope &scope,
                                        OpBuilder builder) {
  auto sourceRange = scope.lookupResourceRange(asyncOp.getUpdate());
  auto sourceOffset = sourceRange.offset;
  auto targetRange = scope.lookupResourceRange(asyncOp.getResult());
  auto targetOffset = scope.add(asyncOp.getLoc(), targetRange.offset,
                                asyncOp.getTargetOffset());
  builder.create<IREE::Stream::CmdCopyOp>(
      asyncOp.getLoc(), sourceRange.resource, sourceRange.resourceSize,
      sourceOffset, targetRange.resource, targetRange.resourceSize,
      targetOffset, asyncOp.getUpdateSize());
  asyncOp.erase();
  return success();
}

@benvanik (Collaborator)

a well-formed program should never reach a case with an overlap - if the input is good but we turn it into something that isn't, then that's a bug (sounds like something is not folding SSA values that it should be?)

@IanWood1 (Contributor)

Yeah, it looks like stream.async.update and stream.async.slice ops both have folders that compare SSA values (as discussed above)
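
To make the failure mode concrete, here is a hedged sketch (not the actual folders on those ops) of the check such folders effectively boil down to; the helper name is made up:

#include "mlir/IR/Value.h"

// SSA value equality, not semantic equality: with the unfolded muli/divui
// pair the target size and update size are distinct Values even though they
// are semantically equal, so the full-overwrite case can no longer be
// recognized and the fold bails out.
static bool isFullUpdate(mlir::Value targetSize, mlir::Value updateSize) {
  return targetSize == updateSize;
}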

@MaheshRavishankar (Contributor Author)

Update on the errors I am seeing here.

I am now hitting an error with CPU as well. Here is the input IR: https://drive.google.com/file/d/1jsITqbLfEdQsMl6a6PGdwuh4gYUdj4KQ/view?usp=drive_link

This is the compile command

iree-compile toy_llama.mlir \
              --iree-input-type=auto \
              --iree-vm-bytecode-module-output-format=flatbuffer-binary \
              --iree-hal-target-device=llvm-cpu[0] \
              --iree-llvmcpu-target-cpu=host \
              -o output.vmfb

Here is the IR dump with this change

https://drive.google.com/file/d/1Iihzlto2K5FVIxxpNc4Yz5MGREeWf74n/view?usp=drive_link

And the IR dump without this change

https://drive.google.com/file/d/1sIJhts7mINKsmkj-flAqx9JhelRBxfya/view?usp=drive_link

I still need to find where things go off the rails, but documenting state here.

@IanWood1 (Contributor)

I am now hitting an error with CPU as well. Here is the input IR: drive.google.com/file/d/1jsITqbLfEdQsMl6a6PGdwuh4gYUdj4KQ/view?usp=drive_link

Is this due to data-tiling introducing extract slices?

@benvanik (Collaborator) commented Jan 22, 2025

Definitely looks like the new math not folding is the origin;

same in both:

%5 = hal.buffer_view.dim<%arg3 : !hal.buffer_view>[0] : index
%12 = util.assume.int %5<umin = 1, umax = 9007199254740991> : index
%39 = affine.apply affine_map<()[s0] -> (s0 * 6)>()[%12]
%139 = flow.dispatch @prefill_bs1$async_dispatch_74::@prefill_bs1$async_dispatch_74[%42, %10, %39](%138, %137, %136, %42, %10, %39) : (tensor<?x32x4x32xf16>{%42}, tensor<?x1xi32>{%10}, tensor<?x32x4x32xf16>{%39}, index, index, index) -> %136{%39}

before:

%140 = flow.tensor.reshape %139 : tensor<?x32x4x32xf16>{%39} -> tensor<?x24576xf16>{%12}

after:

%140 = arith.muli %12, %c6 overflow<nsw> : index
%141 = arith.divui %140, %c6 : index
%142 = flow.tensor.reshape %139 : tensor<?x32x4x32xf16>{%39} -> tensor<?x24576xf16>{%141}

I haven't looked at how that eventually produces the error, but this does show that we are losing the association early and it's leading to weirdness. For example, we know later on (in both cases) that the reshape result is tensor<?x24576xf16>{%12}:

%165 = hal.tensor.alias wait(%arg4) => %142 : tensor<?x24576xf16>{%12} to %arg3 : !hal.buffer_view

so the fact that it changes to the calculated %141 is probably confusing things.

I'll keep looking later in the pipeline - I suspect there's a pattern that is unsafe in this kind of case, but it would be easier to spot with a smaller repro (we may be able to prune a lot if we can find the one that's causing the actual memmove semantics).

@benvanik (Collaborator)

Interesting observation: at some point we end up with two imports in the same function with different sizes (well, same size but different SSA values). I think this confuses something.

same:

%8 = hal.buffer_view.dim<%arg3 : !hal.buffer_view>[0] : index
%9 = arith.muli %8, %c49152 : index
%10 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%8} in !stream.resource<external>{%9}
%14 = util.assume.int %8<umin = 1, umax = 9007199254740991> : index

new (this is in addition to the above!):

%24 = arith.muli %14, %c49152 : index
%33 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%14} in !stream.resource<external>{%24}

Looking for where that comes from, I see in the IR that what's expected here (an update) gets turned into an update+slice now:

before:

  %132 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%18} in !stream.resource<external>{%41}
  %133 = stream.timepoint.await %3 => %132 : !stream.resource<external>{%41}
  %134 = stream.async.update on(#hal.device.affinity<@__device_0>) %114, %133[%c0 to %41] : !stream.resource<external>{%41} -> %133 as !stream.resource<external>{%41}
  %result, %result_timepoint = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %134 : !stream.resource<external>{%41} => !stream.timepoint
  %result_0, %result_timepoint_1 = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %131 : !stream.resource<external>{%20} => !stream.timepoint
  %135 = stream.timepoint.join max(%result_timepoint, %result_timepoint_1) => !stream.timepoint
  stream.timepoint.chain_external on(#hal.device.affinity<@__device_0>) %135 => (%arg5 : !hal.fence)
  %136 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %result_0 : tensor<1x?x256xf16>{%16} in !stream.resource<external>{%20} -> !hal.buffer_view
  util.return %136 : !hal.buffer_view

after:

  %115 = arith.divui %40, %c6 : index
  %116 = arith.muli %115, %c49152 : index
  %134 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%18} in !stream.resource<external>{%41}
  %135 = stream.timepoint.await %3 => %134 : !stream.resource<external>{%41}
  %136 = stream.async.update on(#hal.device.affinity<@__device_0>) %114, %135[%c0 to %116] : !stream.resource<external>{%116} -> %135 as !stream.resource<external>{%41}
  %137 = stream.async.slice on(#hal.device.affinity<@__device_0>) %136[%c0 to %116] : !stream.resource<external>{%41} -> !stream.resource<external>{%116}
  %result, %result_timepoint = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %137 : !stream.resource<external>{%116} => !stream.timepoint
  %result_0, %result_timepoint_1 = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %133 : !stream.resource<external>{%20} => !stream.timepoint
  %138 = stream.timepoint.join max(%result_timepoint, %result_timepoint_1) => !stream.timepoint
  stream.timepoint.chain_external on(#hal.device.affinity<@__device_0>) %138 => (%arg5 : !hal.fence)
  %139 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %result_0 : tensor<1x?x256xf16>{%16} in !stream.resource<external>{%20} -> !hal.buffer_view
  util.return %139 : !hal.buffer_view

The update and slice are happening because there's a mismatch - before, it is overwriting range 0-%41 in the target and that's the entire tensor (size %41) - in the after, since the size is decoupled we can't tell that: it's updating 0-%116 of a tensor with size %41, and since %41 != %116 it can't tell it's a full update and has to do a slice. Definitely confusion in here. The proper fix is to make sure we're folding things correctly, but there may be some things we can do to make it not explode so much, and I'll keep looking. We need it to work correctly for cases like this, though (otherwise we risk cloning the entire kv cache or something).

@benvanik (Collaborator)

Do you have the error you are hitting at runtime?

@MaheshRavishankar (Contributor Author)

Do you have the error you are hitting at runtime?

Yup, here : https://github.com/iree-org/iree/actions/runs/12874673677/job/35894913114

@benvanik (Collaborator)

Cool, then I'm going to guess you are hitting this:
https://github.com/iree-org/iree/blob/main/runtime/src/iree/hal/buffer.c#L470

(you could add a printf or something to confirm)

This may be ok - but there's no guarantee an implementation won't still issue and perform the copy (it likely will in an indirect command buffer as we don't have the buffer pointers until we go to execute the operation, as opposed to this case where it's happening when we record it in the validation layer).
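
For reference, here is a generic sketch of the kind of same-buffer overlap test a copy validation layer performs; this is an illustration, not the actual code at the link above, and the function name is made up:

#include <cstdint>

// Two byte ranges within the same allocation overlap when neither one ends
// at or before the start of the other. memcpy-style copies reject this; the
// case in this issue is the degenerate full overlap (same offset, same
// length), which could in principle be a no-op but still fails the check.
static bool ranges_overlap(uint64_t src_offset, uint64_t dst_offset,
                           uint64_t length) {
  return src_offset < dst_offset + length && dst_offset < src_offset + length;
}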

@benvanik (Collaborator)

yeah, the new IR ends up with a stream.cmd.copy on %arg3 (%arg10 = %10, %arg18 = %33, both of which are aliasing %arg3):

%10 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%8} in !stream.resource<external>{%9}
%33 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%14} in !stream.resource<external>{%24}
%89 = stream.cmd.execute on(#hal.device.affinity<@__device_0>) await(%60) => with(%__hoisted_tensor_128xi64 as %arg6: !stream.resource<constant>{%c3267648}, %2 as %arg7: !stream.resource<external>{%1}, %__auto.blk.0.attn_norm.weight as %arg8: !stream.resource<constant>{%c1024}, %7 as %arg9: !stream.resource<external>{%6}, %10 as %arg10: !stream.resource<external>{%9}, %4 as %arg11: !stream.resource<external>{%c8}, %__auto.blk.0.ffn_norm.weight as %arg12: !stream.resource<constant>{%c1024}, %__auto.blk.1.attn_norm.weight as %arg13: !stream.resource<constant>{%c1024}, %__auto.blk.1.ffn_norm.weight as %arg14: !stream.resource<constant>{%c1024}, %__auto.blk.2.attn_norm.weight as %arg15: !stream.resource<constant>{%c1024}, %__auto.blk.2.ffn_norm.weight as %arg16: !stream.resource<constant>{%c1024}, %__auto.output_norm.weight as %arg17: !stream.resource<constant>{%c1024}, %33 as %arg18: !stream.resource<external>{%24}, %result as %arg19: !stream.resource<external>{%35}, %result_0 as %arg20: !stream.resource<transient>{%59}) {
...
stream.cmd.copy %arg10[%c0], %arg18[%c0], %32 : !stream.resource<external>{%9} -> !stream.resource<external>{%24}

@benvanik (Collaborator) commented Jan 22, 2025

I suspect mystery solved with that; we're giving the same tensor (which happens to be the input aliased by the user) two different sizes (SSA values), and we end up with two stream.tensor.import ops that treat the imported tensor as unique values - the rest of the pipeline then treats them as unique values and at some point ends up with a copy between them. Complete overlap is a special case of memmove that could be handled by memcpy or no-oped, but we can't really know that in all cases and definitely don't want that copy, so we have to prevent the compiler from treating the tensors as different.

This is unlikely to happen in a lot of programs because we don't alias inputs for kvcache stuff - but when we are aliasing inputs intentionally they must line up right.

(I don't really have a solution, but I think that's the problem :)
