Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIE2] Enable up-combining for VLD + UNPACK #241

Merged
merged 1 commit into from
Nov 27, 2024

Conversation

andcarminati
Copy link
Collaborator

If we can't delay VLD, we can try to antecipate UNPACK. This approach can be extended to other selection combiners.

@andcarminati
Copy link
Collaborator Author

QoR:

|--------------------------------------------------------------|------------|----------|--------------|
| Core_Compute_Cycle_Count                                     | aie-public | This PR. | Total diff   |
|--------------------------------------------------------------|------------|----------|--------------|
| Softmax_bf16_0                                               |       6350 |     6351 | SAME(+0.02%) |
|--------------------------------------------------------------|------------|----------|--------------|
| ThresholdedRelu_aie2_int8                                    |        865 |      849 | IMPR(-1.85%) |
|--------------------------------------------------------------|------------|----------|--------------|
| Erf_aie2_int8_0                                              |       2554 |     2490 | IMPR(-2.51%) |
|--------------------------------------------------------------|------------|----------|--------------|
| Erf_aie2_int8_0_ptr_interface                                |       2533 |     2469 | IMPR(-2.53%) |
|--------------------------------------------------------------|------------|----------|--------------|
| HardswishAsHardsigmoid_aie2_0                                |       1368 |     1240 | IMPR(-9.36%) |
|--------------------------------------------------------------|------------|----------|--------------|
| Hardswish_aie2_0                                             |       1368 |     1240 | IMPR(-9.36%) |
|--------------------------------------------------------------|------------|----------|--------------|
| Averege diff                                                 |            | -0.07%   | -0.07%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Diff stdev                                                   |            |     0.75 |         0.75 |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #1                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #2                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #3                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #4                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #5                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #6                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #7                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #8                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|
| Quantile #9                                                  |            | +0.00%   | +0.00%       |
|--------------------------------------------------------------|------------|----------|--------------|

bool isUseOf(const MachineInstr &MI, const MachineInstr &Use) {
for (auto &Defs : Use.defs()) {
bool isUseOf(const MachineInstr &MI, const MachineInstr &Def) {
for (auto &Defs : Def.defs()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that iterate on implicit defs as well? I'm never sure

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just explicit ones! To cover all definitions, we need all_defs()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should be careful about implicit defs as well. The instruction might write to a control register that is used by a later instruction. I don't think that case can happen in AIE2, but who knows.

However, I'm now thinking about a case like this:

%0:vec256 = VLD
$crSat = ...
%1:vec512 = VUNPACK %0, implicit $crSat

There we cannot advance the VUNPACK, because it would then use a different value for $crSat.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this could be a case if we extend for VUPS case...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best, in this case, is to iterate over all uses and defs, like:

bool isUseOf(const MachineInstr &MI, const MachineInstr &Def) {
  for (auto &Defs : Def.all_defs()) {
    for (auto &MIUse : MI.all_uses()) {
      if (MIUse.isReg() && Defs.getReg() == MIUse.getReg())
        return true;
    }
  }
  return false;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh alright, it's VPACK that uses crSat, not VUNPACK. Still, the latter could use crUnpackSign, so we can end up with the same problem. I think we need to use all_uses and all_defs like you proposed.

@andcarminati andcarminati force-pushed the andreu.upcombining.vld.unpack branch from fcdfcd4 to 20b677f Compare November 20, 2024 14:04
@@ -1628,6 +1633,9 @@ bool AIE2InstructionSelector::selectG_AIE_LOAD_UNPACK(
Register DstReg = UNPACKI.getOperand(0).getReg();
Register SignReg = UNPACKI.getOperand(3).getReg();

if (ShouldAntecipate)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it not be ShouldAnticipate ?
Is there a more descriptive name for it? Like CanJoinLoadAndUse ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the name and also included a comment.

@andcarminati andcarminati force-pushed the andreu.upcombining.vld.unpack branch from 20b677f to 1bb9385 Compare November 21, 2024 13:41
Copy link
Collaborator

@gbossu gbossu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but before merging, we need to be a bit more careful about implicit uses/defs I think. Otherwise I think it's nice and concise code :)

auto UnsafeToMoveBefore = [&](const MachineInstr &MI) {
return (isUseOf(Dest, MI));
};
return none_of(InstrRange, UnsafeToMoveBefore);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice way of using none_of 🚀

@andcarminati andcarminati force-pushed the andreu.upcombining.vld.unpack branch from 1bb9385 to 11f9a74 Compare November 22, 2024 07:42
@andcarminati
Copy link
Collaborator Author

Hi @gbossu, all comments were addressed. Thank you for the review ;-).

@andcarminati andcarminati force-pushed the andreu.upcombining.vld.unpack branch from 11f9a74 to c00687b Compare November 22, 2024 14:26
for (auto &MIUse : MI.uses()) {
bool isUseOf(const MachineInstr &MI, const MachineInstr &Def) {
for (auto &Defs : Def.all_defs()) {
for (auto &MIUse : MI.all_uses()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test to check that the VPACK/VUNPACK isn't advanced when one of its implicit inputs is defined in between the load/store and the op?
E.g.

%0 = VLD ...
VST ... ; prevent delaying VLD
$crSign = ...
%1 = VPACK %0, implicit $crSign

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included a test for VUNPACK with a control register write in between. I am also looking explicitly to G_INTRINSIC_W_SIDE_EFFECTS when we try to advance an instruction.

If we can't delay VLD, we can try to antecipate UNPACK. This approach can be
extended to other selection combiners.
@andcarminati andcarminati force-pushed the andreu.upcombining.vld.unpack branch from c00687b to 2a76196 Compare November 26, 2024 15:21
@andcarminati
Copy link
Collaborator Author

After this change:

  for (auto &Defs : Def.all_defs()) {
    for (auto &MIUse : MI.all_uses()) {

No side-effects were observed in QoR.

; CHECK-NEXT: [[COPY1:%[0-9]+]]:er = COPY $r0
; CHECK-NEXT: [[COPY2:%[0-9]+]]:em = COPY [[COPY1]]
; CHECK-NEXT: [[VLD_pstm_pseudo:%[0-9]+]]:vec256, [[VLD_pstm_pseudo1:%[0-9]+]]:ep = VLD_pstm_pseudo [[COPY]], [[COPY2]] :: (load (<32 x s8>))
; CHECK-NEXT: $crunpacksign = COPY [[COPY1]]
; CHECK-NEXT: [[VUNPACK_S16_S8_:%[0-9]+]]:vec512 = VUNPACK_S16_S8 [[VLD_pstm_pseudo]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super-nit: we are writing to crSign, but VUNPACK_S16_S8 does not read it. If we get smarter, I guess we will be able to combine. I guess what we should really check is VST + PACK with a set_crsat() in between.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure it could be improved even more. When we are combining/selecting, this csr write in-between is just an intrinsic call with a constant operand representing the register. We can parse this call against the registers that are consumed by the final instruction (we need a mapping for this). About the test, VUNPACK_S16_S8 uses indeed the crunpacksign, but if we change the register, the result will be the same.

As this PR solves the problems of the tracked benchmarks, we can keep it as is and look to set_csr() in the future if we see this as a blocker in some benchmarks.

Copy link
Collaborator

@gbossu gbossu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@andcarminati andcarminati merged commit aa267e3 into aie-public Nov 27, 2024
8 checks passed
@andcarminati andcarminati deleted the andreu.upcombining.vld.unpack branch November 27, 2024 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants