Error in Stream due to https://github.com/iree-org/iree/pull/18907 #19559

Open

MaheshRavishankar opened this issue Dec 24, 2024 · 15 comments

@MaheshRavishankar (Contributor)

I have been hitting unrelated issues with #18907 (which is meant to test LLVM PR llvm/llvm-project#113501). It has been tricky to track down the core issue because the change is almost an NFC and is definitely not the root cause of the errors being hit. Using this issue to track down the root cause. So far it seems to point to the OptimizeIntArithmetic pass and some folder.

Repro instructions

Input IR

toy_llama_tp2.mlir.txt

Compile command

iree-compile  --iree-input-type=auto \
                --iree-vm-bytecode-module-output-format=flatbuffer-binary \
                --mlir-print-debuginfo \
                --mlir-print-op-on-diagnostic=false \
                --iree-hal-target-device=llvm-cpu[0] \
                --iree-hal-target-device=llvm-cpu[1] \
                --iree-llvmcpu-target-cpu=host \
                -o toy_llama_tp2.vmfb \

The IR dump after all passes is huge, but the difference seems to start at the OptimizeIntArithmetic pass. Here are the IR dumps before and after this pass for ToT and for #18907:

ToT.optimize_int_arithmetic.mlir.txt
pr18907.optimize_int_arithmetic.mlir.txt

Specifically

With ToT

  %424 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x12288xf16>{%91} in !stream.resource<external>{%365} loc("toy_llama_tp2.mlir":2822:12)
  %425 = stream.timepoint.await %65 => %424 : !stream.resource<external>{%365} loc("toy_llama_tp2.mlir":2822:12)
  %426 = stream.async.update on(#hal.device.affinity<@__device_0>) %364, %425[%c0 to %365] : !stream.resource<*>{%365} -> %425 as !stream.resource<external>{%365} loc("toy_llama_tp2.mlir":2822:12)
  %427 = stream.async.transfer %426 : !stream.resource<external>{%365} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%365} loc("toy_llama_tp2.mlir":2822:12)
  %428 = stream.tensor.import on(#hal.device.affinity<@__device_1>) %arg4 : !hal.buffer_view -> tensor<?x12288xf16>{%91} in !stream.resource<external>{%368} loc("toy_llama_tp2.mlir":2825:12)
  %429 = stream.timepoint.await %85 => %428 : !stream.resource<external>{%368} loc("toy_llama_tp2.mlir":2825:12)
  %430 = stream.async.update on(#hal.device.affinity<@__device_1>) %367, %429[%c0 to %368] : !stream.resource<*>{%368} -> %429 as !stream.resource<external>{%368} loc("toy_llama_tp2.mlir":2825:12)
  %431 = stream.async.transfer %430 : !stream.resource<external>{%368} from(#hal.device.affinity<@__device_1>) -> to(#hal.device.affinity<@__device_1>) !stream.resource<*>{%368} loc("toy_llama_tp2.mlir":2825:12)

With #18907

  %165 = arith.muli %91, %c6 : index loc("toy_llama_tp2.mlir":681:21)
  %365 = arith.divui %165, %c6 : index loc("toy_llama_tp2.mlir":2322:12)
  ....

 %425 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x12288xf16>{%91} : index loc("toy_llama_tp2.mlir":2822:12)
  %426 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x12288xf16>{%91} in !stream.resource<external>{%425} loc("toy_llama_tp2.mlir":2822:12)
  %427 = stream.timepoint.await %65 => %426 : !stream.resource<external>{%425} loc("toy_llama_tp2.mlir":2822:12)
  %428 = stream.async.update on(#hal.device.affinity<@__device_0>) %364, %427[%c0 to %366] : !stream.resource<*>{%366} -> %427 as !stream.resource<external>{%425} loc("toy_llama_tp2.mlir":2822:12)
  %429 = stream.async.slice on(#hal.device.affinity<@__device_0>) %428[%c0 to %366] : !stream.resource<external>{%425} -> !stream.resource<external>{%366} loc("toy_llama_tp2.mlir":2822:12)
  %430 = stream.async.transfer %429 : !stream.resource<external>{%366} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%366} loc("toy_llama_tp2.mlir":2822:12)
  %431 = stream.tensor.sizeof on(#hal.device.affinity<@__device_1>) tensor<?x12288xf16>{%91} : index loc("toy_llama_tp2.mlir":2825:12)
  %432 = stream.tensor.import on(#hal.device.affinity<@__device_1>) %arg4 : !hal.buffer_view -> tensor<?x12288xf16>{%91} in !stream.resource<external>{%431} loc("toy_llama_tp2.mlir":2825:12)
  %433 = stream.timepoint.await %85 => %432 : !stream.resource<external>{%431} loc("toy_llama_tp2.mlir":2825:12)
  %434 = stream.async.update on(#hal.device.affinity<@__device_1>) %368, %433[%c0 to %369] : !stream.resource<*>{%369} -> %433 as !stream.resource<external>{%431} loc("toy_llama_tp2.mlir":2825:12)
  %435 = stream.async.slice on(#hal.device.affinity<@__device_1>) %434[%c0 to %369] : !stream.resource<external>{%431} -> !stream.resource<external>{%369} loc("toy_llama_tp2.mlir":2825:12)

There are some extra stream.async.slice ops with PR #18907. The before IR is essentially equivalent, but might be missing a folder for this muli/divui pair (a sketch of such a fold is included after the snippet below):

  %165 = arith.muli %91, %c6 : index loc("toy_llama_tp2.mlir":681:21)
  %365 = arith.divui %165, %c6 : index loc("toy_llama_tp2.mlir":2322:12)
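
As a rough illustration, here is a minimal sketch of the kind of fold that would restore SSA equality; this is not IREE's or upstream MLIR's actual code, and the pattern name and the exact overflow-flag accessors are assumptions:

// Hypothetical sketch of a fold that rewrites
//   %m = arith.muli %x, %c overflow<nuw>
//   %d = arith.divui %m, %c
// back to %x, so downstream size values compare equal as SSA values again.
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

struct FoldDivUIOfMulI : public OpRewritePattern<arith::DivUIOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(arith::DivUIOp divOp,
                                PatternRewriter &rewriter) const override {
    auto mulOp = divOp.getLhs().getDefiningOp<arith::MulIOp>();
    // The divisor must be the exact same SSA value as the multiplier, and the
    // multiply must be known not to wrap; otherwise (x * c) / c may not be x.
    if (!mulOp || mulOp.getRhs() != divOp.getRhs())
      return failure();
    if (!bitEnumContainsAll(mulOp.getOverflowFlags(),
                            arith::IntegerOverflowFlags::nuw))
      return failure();
    rewriter.replaceOp(divOp, mulOp.getLhs());
    return success();
  }
};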

@MaheshRavishankar (Contributor Author)

cc @benvanik if you have cycles and can try this example out. It would help if you could give more insight as to what's going on here and how to fix this. (I have the full IR dump after all passes if you want to take a look at that, but it is too big to upload here.)

@benvanik (Collaborator) commented Jan 6, 2025

Many things rely on SSA value equality and the additional unfolded divui breaks equality and thus hurts any folder/canonicalizer/analysis that is relying on it to elide the extra slices. Your folder fixing it returns the universe back to the happy place where %91=%91.

@IanWood1 (Contributor)

I found that in the case of prefill there are stream.async.update ops that are updating the full size of the KV cache (i.e. the update and target were the same size but couldn't be folded). From the docs, this seems to be legal for this op. However, it gets converted to stream.cmd.copy, which makes overlapping src/dst illegal:

static LogicalResult applyAsyncUpdateOp(IREE::Stream::AsyncUpdateOp asyncOp,
                                        AllocationScope &scope,
                                        OpBuilder builder) {
  auto sourceRange = scope.lookupResourceRange(asyncOp.getUpdate());
  auto sourceOffset = sourceRange.offset;
  auto targetRange = scope.lookupResourceRange(asyncOp.getResult());
  auto targetOffset = scope.add(asyncOp.getLoc(), targetRange.offset,
                                asyncOp.getTargetOffset());
  builder.create<IREE::Stream::CmdCopyOp>(
      asyncOp.getLoc(), sourceRange.resource, sourceRange.resourceSize,
      sourceOffset, targetRange.resource, targetRange.resourceSize,
      targetOffset, asyncOp.getUpdateSize());
  asyncOp.erase();
  return success();
}

@benvanik (Collaborator)

a well-formed program should never reach a case with an overlap - if the input is good but we turn it into something that isn't, then that's a bug (sounds like something is not folding SSA values that it should be?)

@IanWood1 (Contributor)

Yeah, it looks like stream.async.update and stream.async.slice ops both have folders that compare SSA values (as discussed above)
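
To make the failure mode concrete, here is a hedged sketch (not the actual folders on those ops) of the check such folders effectively boil down to; the helper name is made up:

#include "mlir/IR/Value.h"

// SSA value equality, not semantic equality: with the unfolded muli/divui
// pair the target size and update size are distinct Values even though they
// are semantically equal, so the full-overwrite case can no longer be
// recognized and the fold bails out.
static bool isFullUpdate(mlir::Value targetSize, mlir::Value updateSize) {
  return targetSize == updateSize;
}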

@MaheshRavishankar (Contributor Author)

Update on the errors I am seeing here.

I am now hitting an error with CPU as well. Here is the input IR: https://drive.google.com/file/d/1jsITqbLfEdQsMl6a6PGdwuh4gYUdj4KQ/view?usp=drive_link

This is the compile command

iree-compile toy_llama.mlir \
              --iree-input-type=auto \
              --iree-vm-bytecode-module-output-format=flatbuffer-binary \
              --iree-hal-target-device=llvm-cpu[0] \
              --iree-llvmcpu-target-cpu=host \
              -o output.vmfb

Here is the IR dump with this change

https://drive.google.com/file/d/1Iihzlto2K5FVIxxpNc4Yz5MGREeWf74n/view?usp=drive_link

And the IR dump without this change

https://drive.google.com/file/d/1sIJhts7mINKsmkj-flAqx9JhelRBxfya/view?usp=drive_link

I still need to find where things go off the rails, but documenting state here.

@IanWood1 (Contributor)

I am now hitting an error with CPU as well. Here is the input IR: drive.google.com/file/d/1jsITqbLfEdQsMl6a6PGdwuh4gYUdj4KQ/view?usp=drive_link

Is this due to data-tiling introducing extract slices?

@benvanik (Collaborator) commented Jan 22, 2025

Definitely looks like the new math not folding is the origin;

same in both:

%5 = hal.buffer_view.dim<%arg3 : !hal.buffer_view>[0] : index
%12 = util.assume.int %5<umin = 1, umax = 9007199254740991> : index
%39 = affine.apply affine_map<()[s0] -> (s0 * 6)>()[%12]
%139 = flow.dispatch @prefill_bs1$async_dispatch_74::@prefill_bs1$async_dispatch_74[%42, %10, %39](%138, %137, %136, %42, %10, %39) : (tensor<?x32x4x32xf16>{%42}, tensor<?x1xi32>{%10}, tensor<?x32x4x32xf16>{%39}, index, index, index) -> %136{%39}

before:

%140 = flow.tensor.reshape %139 : tensor<?x32x4x32xf16>{%39} -> tensor<?x24576xf16>{%12}

after:

%140 = arith.muli %12, %c6 overflow<nsw> : index
%141 = arith.divui %140, %c6 : index
%142 = flow.tensor.reshape %139 : tensor<?x32x4x32xf16>{%39} -> tensor<?x24576xf16>{%141}

I haven't looked at how that eventually produces the error, but this does show that we are losing the association early and it's leading to weirdness. For example, we know later on (in both cases) that the reshape result is tensor<?x24576xf16>{%12}:

%165 = hal.tensor.alias wait(%arg4) => %142 : tensor<?x24576xf16>{%12} to %arg3 : !hal.buffer_view

so the fact that it changes to the calculated %141 is probably confusing things.

I'll keep looking later in the pipeline - I suspect there's a pattern that is unsafe in this kind of case, but it would be easier to spot with a smaller repro (we may be able to prune a lot if we can find the one that's causing the actual memmove semantics).

@benvanik (Collaborator)

Interesting observation: at some point we end up with two imports in the same function with different sizes (well, same size but different SSA values). I think this confuses something.

same:

%8 = hal.buffer_view.dim<%arg3 : !hal.buffer_view>[0] : index
%9 = arith.muli %8, %c49152 : index
%10 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%8} in !stream.resource<external>{%9}
%14 = util.assume.int %8<umin = 1, umax = 9007199254740991> : index

new (this is in addition to the above!):

%24 = arith.muli %14, %c49152 : index
%33 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%14} in !stream.resource<external>{%24}

Looking for where that comes from, I see in the IR that what's expected here (an update) gets turned into an update+slice now:

before:

  %132 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%18} in !stream.resource<external>{%41}
  %133 = stream.timepoint.await %3 => %132 : !stream.resource<external>{%41}
  %134 = stream.async.update on(#hal.device.affinity<@__device_0>) %114, %133[%c0 to %41] : !stream.resource<external>{%41} -> %133 as !stream.resource<external>{%41}
  %result, %result_timepoint = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %134 : !stream.resource<external>{%41} => !stream.timepoint
  %result_0, %result_timepoint_1 = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %131 : !stream.resource<external>{%20} => !stream.timepoint
  %135 = stream.timepoint.join max(%result_timepoint, %result_timepoint_1) => !stream.timepoint
  stream.timepoint.chain_external on(#hal.device.affinity<@__device_0>) %135 => (%arg5 : !hal.fence)
  %136 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %result_0 : tensor<1x?x256xf16>{%16} in !stream.resource<external>{%20} -> !hal.buffer_view
  util.return %136 : !hal.buffer_view

after:

  %115 = arith.divui %40, %c6 : index
  %116 = arith.muli %115, %c49152 : index
  %134 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%18} in !stream.resource<external>{%41}
  %135 = stream.timepoint.await %3 => %134 : !stream.resource<external>{%41}
  %136 = stream.async.update on(#hal.device.affinity<@__device_0>) %114, %135[%c0 to %116] : !stream.resource<external>{%116} -> %135 as !stream.resource<external>{%41}
  %137 = stream.async.slice on(#hal.device.affinity<@__device_0>) %136[%c0 to %116] : !stream.resource<external>{%41} -> !stream.resource<external>{%116}
  %result, %result_timepoint = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %137 : !stream.resource<external>{%116} => !stream.timepoint
  %result_0, %result_timepoint_1 = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %133 : !stream.resource<external>{%20} => !stream.timepoint
  %138 = stream.timepoint.join max(%result_timepoint, %result_timepoint_1) => !stream.timepoint
  stream.timepoint.chain_external on(#hal.device.affinity<@__device_0>) %138 => (%arg5 : !hal.fence)
  %139 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %result_0 : tensor<1x?x256xf16>{%16} in !stream.resource<external>{%20} -> !hal.buffer_view
  util.return %139 : !hal.buffer_view

The update and slice are happening because there's a mismatch - before, it is overwriting range 0-%41 in the target and that's the entire tensor (size %41) - in the after, since the size is decoupled we can't tell that: it's updating 0-%116 of a tensor with size %41, and since %41 != %116 it can't tell it's a full update and has to do a slice. Definitely confusion in here. The proper fix is to make sure we're folding things correctly, but there may be some things we can do to make it not explode so much, and I'll keep looking. We need it to work correctly for cases like this, though (otherwise we risk cloning the entire kv cache or something).

@benvanik (Collaborator)

Do you have the error you are hitting at runtime?

@MaheshRavishankar (Contributor Author)

Do you have the error you are hitting at runtime?

Yup, here : https://github.com/iree-org/iree/actions/runs/12874673677/job/35894913114

@benvanik (Collaborator)

Cool, then I'm going to guess you are hitting this:
https://github.com/iree-org/iree/blob/main/runtime/src/iree/hal/buffer.c#L470

(you could add a printf or something to confirm)

This may be ok - but there's no guarantee an implementation won't still issue and perform the copy (it likely will in an indirect command buffer as we don't have the buffer pointers until we go to execute the operation, as opposed to this case where it's happening when we record it in the validation layer).
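
For reference, here is a generic sketch of the kind of same-buffer overlap test a copy validation layer performs; this is an illustration, not the actual code at the link above, and the function name is made up:

#include <cstdint>

// Two byte ranges within the same allocation overlap when neither one ends
// at or before the start of the other. memcpy-style copies reject this; the
// case in this issue is the degenerate full overlap (same offset, same
// length), which could in principle be a no-op but still fails the check.
static bool ranges_overlap(uint64_t src_offset, uint64_t dst_offset,
                           uint64_t length) {
  return src_offset < dst_offset + length && dst_offset < src_offset + length;
}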

@benvanik (Collaborator)

yeah, the new IR ends up with a stream.cmd.copy on %arg3 (%arg10 = %10, %arg18 = %33, both of which are aliasing %arg3):

%10 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%8} in !stream.resource<external>{%9}
%33 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%14} in !stream.resource<external>{%24}
%89 = stream.cmd.execute on(#hal.device.affinity<@__device_0>) await(%60) => with(%__hoisted_tensor_128xi64 as %arg6: !stream.resource<constant>{%c3267648}, %2 as %arg7: !stream.resource<external>{%1}, %__auto.blk.0.attn_norm.weight as %arg8: !stream.resource<constant>{%c1024}, %7 as %arg9: !stream.resource<external>{%6}, %10 as %arg10: !stream.resource<external>{%9}, %4 as %arg11: !stream.resource<external>{%c8}, %__auto.blk.0.ffn_norm.weight as %arg12: !stream.resource<constant>{%c1024}, %__auto.blk.1.attn_norm.weight as %arg13: !stream.resource<constant>{%c1024}, %__auto.blk.1.ffn_norm.weight as %arg14: !stream.resource<constant>{%c1024}, %__auto.blk.2.attn_norm.weight as %arg15: !stream.resource<constant>{%c1024}, %__auto.blk.2.ffn_norm.weight as %arg16: !stream.resource<constant>{%c1024}, %__auto.output_norm.weight as %arg17: !stream.resource<constant>{%c1024}, %33 as %arg18: !stream.resource<external>{%24}, %result as %arg19: !stream.resource<external>{%35}, %result_0 as %arg20: !stream.resource<transient>{%59}) {
...
stream.cmd.copy %arg10[%c0], %arg18[%c0], %32 : !stream.resource<external>{%9} -> !stream.resource<external>{%24}

@benvanik (Collaborator) commented Jan 22, 2025

I suspect mystery solved with that; we're giving the same tensor (which happens to be the input aliased by the user) two different sizes (SSA values), and we end up with two stream.tensor.import ops that treat the imported tensor as unique values - the rest of the pipeline then treats them as unique values and at some point ends up with a copy between them. Complete overlap is a special case of memmove that could be handled by memcpy or no-oped, but we can't really know that in all cases and definitely don't want that copy, so we have to prevent the compiler from treating the tensors as different.

This is unlikely to happen in a lot of programs because we don't alias inputs for kvcache stuff - but when we are aliasing inputs intentionally they must line up right.

(I don't really have a solution, but I think that's the problem :)
