Error in Stream due to https://github.com/iree-org/iree/pull/18907 #19559
Comments
cc @benvanik if you have cycles and can try this example out - it would help if you can give more insight as to what's going on here and how to fix it. (I have the full IR dump after all passes if you want to just take a look at that, but it is too big to upload here) |
Testing with the folder seems to have fixed the issue |
Many things rely on SSA value equality and the additional unfolded divui breaks equality and thus hurts any folder/canonicalizer/analysis that is relying on it to elide the extra slices. Your folder fixing it returns the universe back to the happy place where %91=%91. |
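A minimal sketch (not IREE code, hypothetical `Value`/`fold` names) of the point above: folders and analyses key their state on SSA value *identity*, so an unfolded `divui(muli(x, c), c)` is a distinct value that equality-based analyses cannot see through, while applying the fold restores `%91 = %91`.

```python
# Sketch: SSA values modeled as plain expression nodes. An analysis that keys
# a dict on the value object only "sees through" two expressions when folding
# canonicalizes them to the same node.

class Value:
    """A hypothetical SSA value: an op name plus operands (values/constants)."""
    def __init__(self, op, *operands):
        self.op, self.operands = op, operands

def fold(v):
    # The fold in question: divui(muli(x, c), c) -> x, valid when the
    # multiply is known not to overflow (e.g. it carries an overflow flag).
    if (v.op == "divui" and isinstance(v.operands[0], Value)
            and v.operands[0].op == "muli"
            and v.operands[0].operands[1] == v.operands[1]):
        return v.operands[0].operands[0]
    return v

x = Value("dim")                                   # e.g. %12
unfolded = Value("divui", Value("muli", x, 6), 6)  # the extra divui

sizes_seen = {id(x): "known size"}      # identity-keyed analysis state
assert id(unfolded) not in sizes_seen   # distinct value -> analysis misses it
assert fold(unfolded) is x              # folding restores value equality
```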
I found that in the case of prefill we hit this code: `iree/compiler/src/iree/compiler/Dialect/Stream/Transforms/ScheduleAllocation.cpp`, lines 592 to 606 (in 3e34e03). |
A well-formed program should never reach a case with an overlap - if the input is good but we turn it into something that isn't, then that's a bug (sounds like something is not folding SSA values it should be?) |
Yeah, it looks like |
Update on the errors I am seeing here. I am now hitting an error with CPU as well. Here is the input IR: https://drive.google.com/file/d/1jsITqbLfEdQsMl6a6PGdwuh4gYUdj4KQ/view?usp=drive_link This is the compile command
Here is the IR dump with this change: https://drive.google.com/file/d/1Iihzlto2K5FVIxxpNc4Yz5MGREeWf74n/view?usp=drive_link And the IR dump without this change: https://drive.google.com/file/d/1sIJhts7mINKsmkj-flAqx9JhelRBxfya/view?usp=drive_link I still need to find where things go off the rails, but documenting state here. |
Is this due to data-tiling introducing extract slices? |
Definitely looks like the new math not folding is the origin; same in both:

```mlir
%5 = hal.buffer_view.dim<%arg3 : !hal.buffer_view>[0] : index
%12 = util.assume.int %5<umin = 1, umax = 9007199254740991> : index
%39 = affine.apply affine_map<()[s0] -> (s0 * 6)>()[%12]
%139 = flow.dispatch @prefill_bs1$async_dispatch_74::@prefill_bs1$async_dispatch_74[%42, %10, %39](%138, %137, %136, %42, %10, %39) : (tensor<?x32x4x32xf16>{%42}, tensor<?x1xi32>{%10}, tensor<?x32x4x32xf16>{%39}, index, index, index) -> %136{%39}
```

before:

```mlir
%140 = flow.tensor.reshape %139 : tensor<?x32x4x32xf16>{%39} -> tensor<?x24576xf16>{%12}
```

after:

```mlir
%140 = arith.muli %12, %c6 overflow<nsw> : index
%141 = arith.divui %140, %c6 : index
%142 = flow.tensor.reshape %139 : tensor<?x32x4x32xf16>{%39} -> tensor<?x24576xf16>{%141}
```

I haven't looked at how that eventually produces the error, but it does show that we are losing the association early and it's leading to weirdness. For example, we know later on (in both cases) that the reshape result is

```mlir
%165 = hal.tensor.alias wait(%arg4) => %142 : tensor<?x24576xf16>{%12} to %arg3 : !hal.buffer_view
```

so that it changes to the calculated size. I'll keep looking later in the pipeline - I suspect there's a pattern that is unsafe in this kind of case, but a smaller repro would be easier to spot (we may be able to prune a lot if we can find the one that's causing the actual memmove semantics). |
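A short sketch of the arithmetic the missing fold relies on, using plain Python integers to model 64-bit `index` math: given the `util.assume.int` range (`umax = 9007199254740991`, i.e. 2^53 - 1), `%12 * 6` cannot wrap, so `divui(muli(%12, 6), 6)` is exactly `%12`; without that range knowledge the fold would be unsound, since a wrapping multiply breaks the round trip.

```python
# Model index-width (assumed 64-bit) arithmetic with wrapping multiply and
# unsigned (floor) division, mirroring arith.muli / arith.divui semantics.
MASK64 = (1 << 64) - 1

def muli(a, b):
    return (a * b) & MASK64   # wrapping multiply

def divui(a, b):
    return a // b             # unsigned division

# In range: the round trip is exact, so the fold divui(muli(x, 6), 6) -> x
# is safe for any x up to the assumed umax.
x = 9007199254740991          # umax from util.assume.int (2**53 - 1)
assert divui(muli(x, 6), 6) == x

# Out of range: the multiply wraps and divui no longer recovers x, which is
# why the fold needs the no-overflow / range information.
y = (1 << 62) + 1
assert divui(muli(y, 6), 6) != y
```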
Interesting observation: at some point we end up with two imports in the same function with different sizes (well, same size but different SSA values). I think this confuses something.

same:

```mlir
%8 = hal.buffer_view.dim<%arg3 : !hal.buffer_view>[0] : index
%9 = arith.muli %8, %c49152 : index
%10 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%8} in !stream.resource<external>{%9}
%14 = util.assume.int %8<umin = 1, umax = 9007199254740991> : index
```

new (this is in addition to the above!):

```mlir
%24 = arith.muli %14, %c49152 : index
%33 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%14} in !stream.resource<external>{%24}
```

Looking for where that comes from, I see in the IR that what's expected here (an update) gets turned into an update+slice now:

before:

```mlir
%132 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%18} in !stream.resource<external>{%41}
%133 = stream.timepoint.await %3 => %132 : !stream.resource<external>{%41}
%134 = stream.async.update on(#hal.device.affinity<@__device_0>) %114, %133[%c0 to %41] : !stream.resource<external>{%41} -> %133 as !stream.resource<external>{%41}
%result, %result_timepoint = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %134 : !stream.resource<external>{%41} => !stream.timepoint
%result_0, %result_timepoint_1 = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %131 : !stream.resource<external>{%20} => !stream.timepoint
%135 = stream.timepoint.join max(%result_timepoint, %result_timepoint_1) => !stream.timepoint
stream.timepoint.chain_external on(#hal.device.affinity<@__device_0>) %135 => (%arg5 : !hal.fence)
%136 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %result_0 : tensor<1x?x256xf16>{%16} in !stream.resource<external>{%20} -> !hal.buffer_view
util.return %136 : !hal.buffer_view
```

after:

```mlir
%115 = arith.divui %40, %c6 : index
%116 = arith.muli %115, %c49152 : index
%134 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg3 : !hal.buffer_view -> tensor<?x24576xf16>{%18} in !stream.resource<external>{%41}
%135 = stream.timepoint.await %3 => %134 : !stream.resource<external>{%41}
%136 = stream.async.update on(#hal.device.affinity<@__device_0>) %114, %135[%c0 to %116] : !stream.resource<external>{%116} -> %135 as !stream.resource<external>{%41}
%137 = stream.async.slice on(#hal.device.affinity<@__device_0>) %136[%c0 to %116] : !stream.resource<external>{%41} -> !stream.resource<external>{%116}
%result, %result_timepoint = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %137 : !stream.resource<external>{%116} => !stream.timepoint
%result_0, %result_timepoint_1 = stream.timepoint.barrier on(#hal.device.affinity<@__device_0>) %133 : !stream.resource<external>{%20} => !stream.timepoint
%138 = stream.timepoint.join max(%result_timepoint, %result_timepoint_1) => !stream.timepoint
stream.timepoint.chain_external on(#hal.device.affinity<@__device_0>) %138 => (%arg5 : !hal.fence)
%139 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %result_0 : tensor<1x?x256xf16>{%16} in !stream.resource<external>{%20} -> !hal.buffer_view
util.return %139 : !hal.buffer_view
```

The update and slice are happening because there's a mismatch: before, it is overwriting range 0-%41 in the target and that's the entire tensor (size %41). In the after, since the size is decoupled we can't tell that: it's updating 0-%116 of a tensor with size %41, and since %41 != %116 it can't tell it's a full update and has to do a slice. Definitely confusion in here. The proper fix is to make sure we're folding things correctly, but there may be some things we can do to make it not explode so much and I'll keep looking. We need to have it work correctly for cases like this, though (otherwise we risk cloning the entire kv cache or something). |
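The full-update decision above can be sketched as follows (a hypothetical simplification, not the actual Stream pass): an update of `[0, len)` into a target whose size is provably the same SSA value can stay a plain update, but once the two sizes are distinct values (%116 vs %41) the compiler must assume a partial update and materialize the extra slice.

```python
# Sizes compare as SSA values, i.e. by identity, not by runtime value:
# two values that happen to be equal at runtime still look different here.

def plan_update(update_begin, update_len, target_size):
    """Return the ops needed for an update of [begin, begin+len) into target."""
    if update_begin == 0 and update_len is target_size:
        return ["stream.async.update"]          # provably whole-tensor update
    # Decoupled sizes: cannot prove full coverage, so slice out the range.
    return ["stream.async.update", "stream.async.slice"]

size_41 = object()    # one SSA value for the size
size_116 = object()   # an equal-at-runtime but distinct SSA value

assert plan_update(0, size_41, size_41) == ["stream.async.update"]
assert plan_update(0, size_116, size_41) == [
    "stream.async.update", "stream.async.slice"]
```

This is why restoring the fold (so both sizes collapse back to one SSA value) removes the extra slices.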
Do you have the error you are hitting at runtime? |
Yup, here : https://github.com/iree-org/iree/actions/runs/12874673677/job/35894913114 |
Cool, then I'm going to guess you are hitting this (you could add a printf or something to confirm): This may be ok, but there's no guarantee an implementation won't still issue and perform the copy (it likely will in an indirect command buffer, as we don't have the buffer pointers until we go to execute the operation, as opposed to this case where it's happening when we record it in the validation layer). |
yeah, the new IR ends up with a stream.cmd.copy on %arg3 (%arg10 = %10, %arg18 = %33, both of which are aliasing %arg3):
|
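The kind of overlap check involved can be sketched like this (a hypothetical simplification of the ScheduleAllocation validation referenced earlier, with invented helper names): a copy whose source and target ranges land in the same underlying resource and intersect has memmove semantics, which a plain copy does not guarantee.

```python
# Half-open range intersection: [src_offset, src_end) vs [dst_offset, dst_end).
def ranges_overlap(src_offset, src_end, dst_offset, dst_end):
    return src_offset < dst_end and dst_offset < src_end

def copy_is_unsafe(src_resource, src_offset, src_end,
                   dst_resource, dst_offset, dst_end):
    # Two imports aliasing the same buffer (%10 and %33 both alias %arg3)
    # count as the same resource even though they are distinct SSA values.
    return (src_resource == dst_resource
            and ranges_overlap(src_offset, src_end, dst_offset, dst_end))

arg3 = "buffer-backing-%arg3"
assert copy_is_unsafe(arg3, 0, 100, arg3, 50, 150)       # overlapping ranges
assert not copy_is_unsafe(arg3, 0, 100, arg3, 100, 200)  # disjoint ranges
```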
I suspect mystery solved with that: we're giving the same tensor two different sizes (SSA values), that tensor happens to be the input aliased by the user, and we end up with two imports. This is unlikely to happen in a lot of programs because we don't alias inputs for kvcache stuff, but when we are aliasing inputs intentionally they must line up right. (I don't really have a solution, but I think that's the problem :) |
I have been hitting unrelated issues with #18907 (which is meant to test LLVM PR llvm/llvm-project#113501). It has been tricky to track down the core issue because the change is almost an NFC, and it is definitely not the root cause of the errors being hit. Using this issue to track down the root cause. So far it seems to point to the `OptimizeIntArithmetic` pass and some folder.

Repro instructions
Input IR
toy_llama_tp2.mlir.txt
Compile command
The IR dump after all passes is huge, but the difference seems to start from the optimize int arithmetic pass. Here is the IR dump before and after this pass for ToT and for #18907:
ToT.optimize_int_arithmetic.mlir.txt
pr18907.optimize_int_arithmetic.mlir.txt
Specifically:

With ToT

With #18907

There are some extra `stream.async.slice` ops in PR #18907. The before IR is essentially equivalent, but might be missing this folder.