Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AutoBump] Merge with 5ec73b7d (Aug 21) (8) #361

Merged
merged 128 commits into from
Dec 4, 2024

Conversation

mgehre-amd
Copy link
Collaborator

No description provided.

ssahasra and others added 30 commits August 20, 2024 15:23
)

The original implementation provided a simple method to check whether
the forest of nested cycles is well-formed. This is now augmented with
other methods to check well-formedness of all cycles, either
invdividually, or as the entire forest. These will be used by future
transforms that modify CycleInfo.
Use the nuw attribute of GEPs to prove that pointers do not alias, in
cases matching the following:

   +                +                     +
   | BaseOffset     |   +<nuw> Indices    |
   ---------------->|-------------------->|
   |-->V2Size       |                     |-------> V1Size
  LHS                                    RHS

If the difference between pointers is Offset +<nuw> Indices then we know
that the addition does not wrap the pointer index type (add nuw) and the
constant Offset is a lower bound on the distance between the pointers. We
can then prove NoAlias via Offset u>= V2Size.
…102613)

After decomposition of OpenMP compound constructs and assignment of
applicable clauses to each leaf construct, composite constructs are then
combined again into a single element in the construct queue. This helped
later lowering stages easily identify composite constructs.

However, as a result of the re-composition stage, the same list of
clauses is used to produce all MLIR operations corresponding to each
leaf of the original composite construct. This undoes existing logic
introducing implicit clauses and deciding to which leaf construct(s)
each clause applies.

This patch removes construct re-composition logic and updates Flang
lowering to be able to identify composite constructs from a list of leaf
constructs. As a result, the right set of clauses is produced for each
operation representing a leaf of a composite construct.

PR stack:
- llvm#102612
- llvm#102613
…vm#104595)

This new interface is supposed to capture the core functionality of
DLTI: querying for values at keys. As such this new interface unifies
the ability to query DLTI attributes in a single method: query(). All
existing DLTI interfaces exposing their own query methods now 1) now
extend this new interface and 2) provide a default implementation for
`query()`.

As DLTIQueryInterface::query() returns an attribute, it naturally
enables recursive queries on nested DLTI attrs. A utility function,
`dlti::query()`, implements the logic for nested lookups.

A new `#dlti.map` attribute is introduced to capture the most generic
form of a finite DLTI-mapping. One of the benefits is that it allows for
more easily encoding hierachical information that is suitably queryable,
i.e. by means of nested attributes.

In line with the above, `transform.dlti.query` is modified so as to take
an arbitrary number of keys and to perform a nested lookup using the
above utility function.
…atalyst (llvm#104872)

Mac Catalyst is the iOS platform, but it builds against the macOS SDK
and so it needs to be checking the macOS SDK version instead of the iOS
one. Add tests against a greater-than SDK version just to make sure this
works beyond the initially supporting SDKs.
…lvm#102300)"

This reverts commit b432afc.

Reverted due to linker failures in expensive-checks.
…#104805)

This extends SimplifyCFG hoisting to also hoist instructions with
commuted operands, for example a+b on one side and b+a on the other
side.

This should address the issue mentioned in:
llvm#91185 (comment)
Avoids implicit sint_to_fp which wasn't occurring on strict fp codegen

Fixes llvm#104848
…cy zero (llvm#102915)

A long time ago (back in 2009) there was a commit 52d4d82
that changed the scheduler to not dirty height/depth when adding or
removing SUnit predecessors when the latency on the edge was zero. That
commit message is claiming that the depth or height isn't affected when
the latency is zero.

As a matter of fact, the depth/height can change even with a zero
latency on the edge. If for example adding a new SUnit A, with zero
latency, but as a predecessor to a SUnit B, then both height of A and
depth of B should be marked as dirty. If for example B has a greater
height than A, then the height of A needs to be adjusted even if the
latency is zero.

I think this has been wrong for many years. Downstream we have had
commit 52d4d82 reverted since back in 2016. There is no
motivating lit test for 52d4d82 (only an incomplete C level
reproducer in llvm#3613).

After commit 13d04fa there finally appeared an upstream
lit test that shows that we get better code if marking height/depth as
dirty (llvm/test/CodeGen/AArch64/abds.ll).
This change does two kinds of splits:
- Splits each target into a different file. Some targets are left in the
same files, such as riscv32/64 and x86/_64 as these tests and lists are
very similar.
- Splits up the very long 'note:' lines which contain a list of CPUs,
using `CHECK-SAME`. There was a note about this not being possible
before, but with `{{^}}`, this is now possible -- I have
verified that this does the right thing if a single CPU anywhere in the
list is left out.

These tests had become quite annoying to change when adding a CPU, and I
believe this change makes these easier to maintain, and should cut down
on conflicts in these files (or at least makes conflicts easier to
resolve).

I apologise in advance for downstream conflicts, but hopefully that's a
small amount of short term pain, in return for fewer conflicts in
future.
Small PR to add additional getters for LLVMContextRef in the C API.
…lvm#104775)

Another upstreaming of C API extensions we have in Julia/LLVM.jl.
Although [we went](maleadt/LLVM.jl#431) with a
string-based API there, here I'm proposing something that's similar to
existing metadata/attribute APIs:
- explicit functions to map syncscope names to IDs, and back
- `LLVM*SyncScope` versions of builder APIs that already take a
`SingleThread` argument: atomic rmw, atomic xchg, fence
- `LLVMGetAtomicSyncScopeID` and `LLVMSetAtomicSyncScopeID` for other
atomic instructions
- testing through `llvm-c-test`'s `--echo` functionality
Add a hint to use the no-verify-fixpoint option.
These are annoying to update, and are redundant since the tests in
clang/test/Driver/print-enabled-extensions/ were added.
)

Patterns were previously added to allow the following reductions
- fminimum(abs(a), abs(b)) -> famin(a, b)
- fmaximum(abs(a), abs(b)) -> famax(a, b)
- llvm#103027
 
It was suggested by @davemgreen that the following reductions are also
possible
- fminnum[nnan](abs(a), abs(b)) -> famin(a, b)
- fmaxnum[nnan](abs(a), abs(b)) -> famax(a, b)

('nnan' documenatation:
https://llvm.org/docs/LangRef.html#fast-math-flags)

The 'no NaNs' flag allows optimisations to assume that neither argument
is a NaN, and so the differing NaN propagation semantics of
llvm.maxnum/llvm.minnum and FAMAX/FAMIN can be ignored in this
reduction.
(llvm.maxnum/llvm.minnum:
https://llvm.org/docs/LangRef.html#llvm-minnum-intrinsic)

- Changes to LLVM
	- lib/target/AArch64/AArch64InstrInfo.td
- add 'fminnm_nnan' and 'fmaxnm_nnan'; patfrags on fminnm/fmaxnm that
are predicated on the instrinsic call having the 'nnan' flag.
- add AArch64famin and AArch64famax patfrags, containing the new and
existing reductions.
	- test/CodeGen/AArch64/aarch64-neon-faminmax.ll
- add positive and negative tests for the new reduction, based on the
presence of 'nnan' in the IR intrinsic call.
This patch moves utilities from
`offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h` to
`llvm/Frontend/Offloading/Utility.h` to be reused by
other projects.

Concretely the following changes were made:
- Rename `KernelMetaDataTy` to `AMDGPUKernelMetaData`.
- Remove unused fields `KernelObject`, `KernelSegmentSize`,
`ExplicitArgumentCount` and `ImplicitArgumentCount` from
`AMDGPUKernelMetaData`.
- Return the produced error if `ELFObj.sections()` failed instead of
using `cantFail`.
- Added `AGPRCount` field to `AMDGPUKernelMetaData`.
- Added a default invalid value to all the fields in
`AMDGPUKernelMetaData`.
…vm#104692)

Inline asm operands could contain any kind of relocation, so remove the
checks.

Fixes llvm#103493
…e_map (llvm#104918)

This test is already disabled for Windows because of symlinks. Disable
it for cross build on Windows host too.
This change looks for instructions of storing symmetric constants
instruction 32-bit units. usually consisting of several 'MOV' and
one or less 'ORR'.

If found, load only the lower 32-bit constant and change it to copy
and save to the upper 32-bit using the 'STP' instruction.

For example:
  renamable $x8 = MOVZXi 49370, 0
  renamable $x8 = MOVKXi $x8, 320, 16
  renamable $x8 = ORRXrs $x8, $x8, 32
  STRXui killed renamable $x8, killed renamable $x0, 0
becomes
  $w8 = MOVZWi 49370, 0
  $w8 = MOVKWi $w8, 320, 16
STPWi killed renamable $w8, killed renamable $w8, killed renamable $x0,
0


related issue : llvm#51483
…lvm#102300)"

This reverts commit 4aacc60.

The original implementation provided a simple method to check whether the forest
of nested cycles is well-formed. This is now augmented with other methods to
check well-formedness of every cycle, either individually, or as the entire
forest. These will be used by future transforms that modify CycleInfo.
This extends the existing sxtw peephole optimization (llvm#96293) to uxtw,
which in llvm is a ORRWrr which clears the top bits.

Fixes llvm#98481
…odel=aggressive (llvm#100453)

This change modifies -ffp-model=fast to select options that more closely
match -funsafe-math-optimizations, and introduces a new model,
-ffp-model=aggressive which matches the existing behavior (except for a
minor change in the fp-contract behavior).

The primary motivation for this change is to make -ffp-model=fast more
user friendly, particularly in light of LLVM's aggressive optimizations
when -fno-honor-nans and -fno-honor-infinites are used.

This was previously proposed here:

https://discourse.llvm.org/t/making-ffp-model-fast-more-user-friendly/78402
Baed off worst case llvm-mca numbers for CTLZ/CTTZ(+ZERO_UNDEF) codegen

Prep work for llvm#102885
Reland [CGData] llvm-cgdata llvm#89884 using `Opt` instead of `cl`
- Action options are required, `--convert`, `--show`, `--merge`. This
was similar to sub-commands previously implemented, but having a prefix
`--`.
- `--format` option is added, which specifies `text` or `binary`.

---------

Co-authored-by: Kyungwoo Lee <[email protected]>
preames and others added 26 commits August 20, 2024 13:51
…ich can trap (llvm#105214)

This allows the use a single wider operation with a restricted EVL
instead of having to split and cover via decreasing powers-of-two sizes.

On RISCV, this avoids the need for a bunch of vslidedown and vslideup
instructions to extract subvectors, and VL toggles to switch between the
various widths.

Note there is a potential downside of using vp nodes; we loose any
generic DAG combines which might have applied to the split form.
Previously the libc startup code was marked `EXCLUDE_FROM_ALL` due to
build issues. This patch removes that as no longer necessary.
Nowadays, an ASM_MASM tool is required for building the BLAKE3 assembly
in llvm/lib/Support - the llvm-ml tool can do this.
In llvm#100024 we moved from safe_load to load for reading the yaml in
newheadergen due to dependency issues. Those should be resolved by now
so this should be a simple safety improvement.
Rework `IntrinsicEmitter::EmitIntrinsicToBuiltinMap` for improved
    peformance as well as refactor the code.

    Performance:
    - Current generated code does a linear search on the TargetPrefix,
      followed by a binary search on the builtin names for that
      target's builtins.
    - Improve the performance of this code in 2 ways:
      (a) Use binary search on the target prefix to lookup the builtin
          table for the target.
      (b) Improve the (common) case of when all builtins for a target
          share a common prefix.  Check this common prefix first, and
then do the binary search in the builtin table using the builtin
          name with the common prefix removed. This should help
          both data size (by creating a smaller static string table) and
          runtime (by reducing the cost of binary search on smaller
          strings).

    Refactor:
    - Use range based for loops for iterating over maps.
- Use formatv() and C++ raw string literals to simplify the emission
code.
    - Change the generated `getIntrinsicForClangBuiltin` and 
      `getIntrinsicForMSBuiltin`  to take a `StringRef` instead of 
      `const char *` for the prefix.
Don't try to fold x87 extended precision operations in a test unless
it's targeting x86-64.
The mismatch between the comment on this test and the test itself was
pointed out in
llvm#100699 (comment),
but apparently I failed to actually commit the fix.
This recently added test is failing on Windows with:
```
c:\users\tcwg\llvm-worker\lldb-aarch64-windows\build\bin\lldb.exe --no-lldbinit -S C:/Users/tcwg/llvm-worker/lldb-aarch64-windows/build/tools/lldb\test\Shell\lit-lldb-init-quiet C:\Users\tcwg\llvm-worker\lldb-aarch64-windows\build\tools\lldb\test\Shell\Expr\Output\TestAnonNamespaceParamFunc.cpp.tmp -o run -o "expression func(a)" -o exit | c:\users\tcwg\llvm-worker\lldb-aarch64-windows\build\bin\filecheck.exe C:\Users\tcwg\llvm-worker\lldb-aarch64-windows\llvm-project\lldb\test\Shell\Expr\TestAnonNamespaceParamFunc.cpp
executed command: 'c:\users\tcwg\llvm-worker\lldb-aarch64-windows\build\bin\lldb.exe' --no-lldbinit -S 'C:/Users/tcwg/llvm-worker/lldb-aarch64-windows/build/tools/lldb\test\Shell\lit-lldb-init-quiet' 'C:\Users\tcwg\llvm-worker\lldb-aarch64-windows\build\tools\lldb\test\Shell\Expr\Output\TestAnonNamespaceParamFunc.cpp.tmp' -o run -o 'expression func(a)' -o exit
.---command stderr------------
| TestAnonNamespaceParamFunc.cpp.tmp :: Class 'tagARRAYDESC' has a member 'tdescElem' of type 'tagTYPEDESC' which does not have a complete definition.error: TestAnonNamespaceParamFunc.cpp.tmp :: Class 'tagARRAYDESC' has a member 'tdescElem' of type 'tagTYPEDESC' which does not have a complete definition.
| (lldb) TestAnonNamespaceParamFunc.cpp.tmp :: Class 'std::partial_ordering' has a member 'less' of type 'std::partial_ordering' which does not have a complete definition.error: TestAnonNamespaceParamFunc.cpp.tmp :: Class 'std::partial_ordering' has a member 'less' of type 'std::partial_ordering' which does not have a complete definition.
| (lldb) TestAnonNamespaceParamFunc.cpp.tmp :: Class 'std::strong_ordering' has a member 'less' of type 'std::strong_ordering' which does not have a complete definition.error: TestAnonNamespaceParamFunc.cpp.tmp :: Class 'std::strong_ordering' has a member 'less' of type 'std::strong_ordering' which does not have a complete definition.
| (lldb) TestAnonNamespaceParamFunc.cpp.tmp :: Class 'std::weak_ordering' has a member 'less' of type 'std::weak_ordering' which does not have a complete definition.error: TestAnonNamespaceParamFunc.cpp.tmp :: Class 'std::weak_ordering' has a member 'less' of type 'std::weak_ordering' which does not have a complete definition.
| (lldb) error: Couldn't look up symbols:
|   int func(struct `anonymous namespace'::InAnon)
| Hint: The expression tried to call a function that is not present in the target, perhaps because it was optimized out by the compiler.
`-----------------------------
executed command: 'c:\users\tcwg\llvm-worker\lldb-aarch64-windows\build\bin\filecheck.exe' 'C:\Users\tcwg\llvm-worker\lldb-aarch64-windows\llvm-project\lldb\test\Shell\Expr\TestAnonNamespaceParamFunc.cpp'
.---command stderr------------
| C:\Users\tcwg\llvm-worker\lldb-aarch64-windows\llvm-project\lldb\test\Shell\Expr\TestAnonNamespaceParamFunc.cpp:10:11: error: CHECK: expected string not found in input
| // CHECK: (int) $0 = 15
|           ^
| <stdin>:16:26: note: scanning from here
| (lldb) expression func(a)
|                          ^
```

So the function is still not callable. But AFAICT, this is not a
regression, since this function wasn't callable prior to the patch
anyway. I currently do not have a Windows setup to test this on,
so XFAIL for now.
Remove widenToNextPow2 from StoreActions.
Reorder clampScalar and lowerIfMemSizeNotByteSizePow2 for StoreActions.

These match AArch64 and got me further on a test case I was playing with
that contained a i129 store.
…104523)

Compilers and language runtimes often use helper functions that are
fundamentally uninteresting when debugging anything but the
compiler/runtime itself. This patch introduces a user-extensible
mechanism that allows for these frames to be hidden from backtraces and
automatically skipped over when navigating the stack with `up` and
`down`.

This does not affect the numbering of frames, so `f <N>` will still
provide access to the hidden frames. The `bt` output will also print a
hint that frames have been hidden.

My primary motivation for this feature is to hide thunks in the Swift
programming language, but I'm including an example recognizer for
`std::function::operator()` that I wished for myself many times while
debugging LLDB.

rdar://126629381


Example output. (Yes, my proof-of-concept recognizer could hide even
more frames if we had a method that returned the function name without
the return type or I used something that isn't based off regex, but it's
really only meant as an example).

before:
```
(lldb) thread backtrace --filtered=false
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100001f04 a.out`foo(x=1, y=1) at main.cpp:4:10
    frame #1: 0x0000000100003a00 a.out`decltype(std::declval<int (*&)(int, int)>()(std::declval<int>(), std::declval<int>())) std::__1::__invoke[abi:se200000]<int (*&)(int, int), int, int>(__f=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:149:25
    frame #2: 0x000000010000399c a.out`int std::__1::__invoke_void_return_wrapper<int, false>::__call[abi:se200000]<int (*&)(int, int), int, int>(__args=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:216:12
    frame #3: 0x0000000100003968 a.out`std::__1::__function::__alloc_func<int (*)(int, int), std::__1::allocator<int (*)(int, int)>, int (int, int)>::operator()[abi:se200000](this=0x000000016fdff280, __arg=0x000000016fdff224, __arg=0x000000016fdff220) at function.h:171:12
    frame #4: 0x00000001000026bc a.out`std::__1::__function::__func<int (*)(int, int), std::__1::allocator<int (*)(int, int)>, int (int, int)>::operator()(this=0x000000016fdff278, __arg=0x000000016fdff224, __arg=0x000000016fdff220) at function.h:313:10
    frame #5: 0x0000000100003c38 a.out`std::__1::__function::__value_func<int (int, int)>::operator()[abi:se200000](this=0x000000016fdff278, __args=0x000000016fdff224, __args=0x000000016fdff220) const at function.h:430:12
    frame #6: 0x0000000100002038 a.out`std::__1::function<int (int, int)>::operator()(this= Function = foo(int, int) , __arg=1, __arg=1) const at function.h:989:10
    frame #7: 0x0000000100001f64 a.out`main(argc=1, argv=0x000000016fdff4f8) at main.cpp:9:10
    frame #8: 0x0000000183cdf154 dyld`start + 2476
(lldb) 
```

after

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100001f04 a.out`foo(x=1, y=1) at main.cpp:4:10
    frame #1: 0x0000000100003a00 a.out`decltype(std::declval<int (*&)(int, int)>()(std::declval<int>(), std::declval<int>())) std::__1::__invoke[abi:se200000]<int (*&)(int, int), int, int>(__f=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:149:25
    frame #2: 0x000000010000399c a.out`int std::__1::__invoke_void_return_wrapper<int, false>::__call[abi:se200000]<int (*&)(int, int), int, int>(__args=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:216:12
    frame #6: 0x0000000100002038 a.out`std::__1::function<int (int, int)>::operator()(this= Function = foo(int, int) , __arg=1, __arg=1) const at function.h:989:10
    frame #7: 0x0000000100001f64 a.out`main(argc=1, argv=0x000000016fdff4f8) at main.cpp:9:10
    frame #8: 0x0000000183cdf154 dyld`start + 2476
Note: Some frames were hidden by frame recognizers
```
…entwiseOpFusion (llvm#104409)

This commit changes the getPreservedProducerResults function so that it
takes the consumer into account along with the producer, in order to
predict which of the producer’s outputs can be dropped during the fusion
process. It provides a more accurate prediction, considering that the
fusion process also depends on the consumer.
This wires up dxil-op-lower, dxil-intrinsic-expansion, dxil-translate-metadata,
and dxil-pretty-printer to the new pass manager, both as a matter of future
proofing the backend and so that they can be used more flexibly in tests.

A few arbitrary tests are updated in order to test the new PM path, and we drop
the "print-dxil-resource-md" pass since it's redundant with the pretty printer.

Pull Request: llvm#104250
It didn't crash so I thought this worked now, but upon further review
it miscalculates the stack address for the return.
Specifically, to illustrate our general lowering strategy for
non-power of two vectors.
…vm#104826)

With -fsanitize-cfi-icall-experimental-normalize-integers, Clang
appends ".normalized" to KCFI types in CodeGenModule::CreateKCFITypeId,
which changes type hashes also for functions that don't have integer
types in their signatures. However, llvm::setKCFIType does not take
integer normalization into account, which means LLVM generated
functions with KCFI types, e.g. sanitizer constructors, will fail KCFI
checks when integer normalization is enabled in Clang.

Add a cfi-normalize-integers module flag to indicate integer
normalization is used, and append ".normalized" to KCFI types also in
llvm::setKCFIType to fix the type mismatch.
For the tests I just added +sve instead of what actual hardware has, which is only SME,
since otherwise all the test functions need to be marked as streaming mode.

rdar://121864771
…Lowering::lowerReturn. NFC

This is similar to X86 and AArch64 structure.
/llvm-project/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp:124:7:
error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
      opOperandsToIgnore.pop_back_val();
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Since 2.2, `fmin.s/fmax.s` instructions follow the IEEE754-2019, if F
extension is avaiable; and `fmin.d/fmax.d` also follow the IEEE754-2019
if D extension is avaiable.

So, let's mark them as Legal.
This reverts commit 6476a1d.
d3fb41d relanded in 9bb5556.

...amended to incorporate changes from the reland.
Base automatically changed from bump_to_e47b5075 to feature/fused-ops December 3, 2024 14:42
@mgehre-amd mgehre-amd merged commit c914244 into feature/fused-ops Dec 4, 2024
11 checks passed
@mgehre-amd mgehre-amd deleted the bump_to_5ec73b7d branch December 4, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.