move GeneralizeNamedOp pass #72

Closed
wants to merge 41 commits

Conversation

PhaneeshB
Collaborator

  • moves the SPIRVGeneralizeNamedOp pass to Common/GPU
  • uses it in the LLVMGPU pipeline
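
For context, here is a minimal sketch of what a generalize-named-ops step does per op, assuming it wraps the upstream `linalg::generalizeNamedOp` helper (the actual pass may organize this differently):

```cpp
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Rewrite a named linalg op (e.g. linalg.matmul) into an equivalent
// linalg.generic so downstream GPU patterns only need to handle one form.
static LogicalResult generalizeOne(RewriterBase &rewriter,
                                   linalg::LinalgOp namedOp) {
  FailureOr<linalg::GenericOp> generic =
      linalg::generalizeNamedOp(rewriter, namedOp);
  return success(succeeded(generic));
}
```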

ThomasRaoux and others added 30 commits November 22, 2023 19:34
The conversion pass is enabled with `--iree-flow-enable-conv-nchw-to-nhwc-transform`

Includes partial support for propagating and cancelling transposes generated
when converting from nchw to nhwc. The high-level strategy for this pass is
as follows (a sketch of steps 1 and 2 follows the list):
    1. Do the conversions for all conv_nchw_fchw ops (and pooling ops) and
    wrap the converted convolutions in transposes. Each transpose is tagged
    to indicate which direction the transpose should propagate through the
    graph.
    2. Traverse the ops in the function in reverse to propagate transposes
    marked for upwards propagation to their parents, ideally landing just
    before ops such as arith.constant or function arguments.
    3. Propagate the transposes marked for downward propagation to their
    users, ideally to just before the return.
    4. Canonicalize out all adjacent cancelling transposes and generalize the
    remaining transposes to allow for fusing them with nearby ops.
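
As a rough illustration of steps 1 and 2, a minimal C++ sketch of the tagging and reverse traversal (the attribute names here are hypothetical, not necessarily what the pass uses):

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "llvm/ADT/STLExtras.h"

using namespace mlir;

// Step 1: each transpose wrapping a converted convolution records which
// direction it should move through the graph.
static void tagTranspose(linalg::TransposeOp transposeOp, StringRef direction) {
  transposeOp->setAttr(direction, UnitAttr::get(transposeOp.getContext()));
}

// Step 2: walk the entry block in reverse so transposes marked for upward
// propagation are visited before the producers they should move past.
static void propagateUpwards(func::FuncOp funcOp) {
  for (Operation &op : llvm::reverse(funcOp.getBody().front())) {
    if (!op.hasAttr("__propagate_up__"))
      continue;
    // ... swap the tagged transpose with its defining op when legal ...
  }
}
```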
- Speedup filter transform folding
- Add points for 4x4, switch to that tile size
- Move winograd after im2col + padding; in im2col, do not
  touch the conv if it has been marked as winograd
- Remove prints/chrono and adjust Attribute rawKernelAttr for Windows (by
  Quinn)

Co-authored-by: Quinn Dawkins <[email protected]>
Add pass to insert markers for function bisecting.
Add pass to outline marked operation ranges into separate functions.
Expose a Python binding that extracts an operation
list from an MLIR file.
This list is then used to execute the entry MLIR
with IREE while resolving calls to functions in
other MLIR files.
This can be useful when trying to do layout propagation and guaranteeing
specific fusion at the same time (use with caution).
This pass is the spiritual successor to `convert-conv-nchw-to-nhwc`,
focused on generalizing to enable data tiling and more robust layout
propagation, as well as supporting non-named convolutions.

Currently this includes some baked-in generalization patterns and does
not support padding. Tile size selection is pass-wide, but there is
limited attribute control to enable fully transposing. Further
generalization should aim to rework this pass to allow per-op tile
size control.
Sub-32-bit types are handled on the SPIR-V side by introducing
bitcasts to and from i32 and bubbling them toward the center of the kernel
in the hope that they cancel. This adds a pattern for a bitcast on the result
of an scf.if, which comes from the way that padding is handled (transfer_read
in the `then` branch, else yield a splat constant).
Build Experimental ROCM builds
Use flags -DIREE_BUILD_EXPERIMENTAL_LEVEL_ZERO=ON -DLEVEL_ZERO_HEADERS_API_ROOT=/home/stanley/nod/level-zero
- LevelZero HAL Driver
- Add OpenCL HAL Target compiler
- Add SPIRV Codegen for Kernel capability
- Fix illegal pointer arithmetic on void* (by Boian)
- Use events for command buffer execution and synchronization (iree-org#47)
- Add flag for switching between Physical64 and Physical32 addressing in OpenCL
- Enable creation of a device by UUID + ID (device handle as uintptr_t)
Note that a device ID implemented this way is ephemeral and is valid
only in the current IREE runtime context. If you start a new process,
the IDs will be different.
With this change you can do

$ iree-run-module --list_devices
level_zero://00005100-0000-0000-0000-000000000001

$ iree-run-module --device=level_zero://00005100-0000-0000-0000-000000000001 ...

- Add query_memory_heaps implementation

Fixes
error: arithmetic on a pointer to void is a GNU extension [-Werror,-Wgnu-pointer-arith]
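
A sketch of the usual fix for that warning: offset through a byte-sized pointer instead of `void*` (illustrative helper, not the exact code in the driver):

```cpp
#include <cstddef>
#include <cstdint>

// Pointer arithmetic on void* is a GNU extension; cast to uint8_t* so the
// offset is computed in well-defined byte units.
static void *offset_ptr(void *base, size_t offset) {
  return static_cast<uint8_t *>(base) + offset;
}
```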

- Supply structure type when passing such arguments in accordance with the API (iree-org#34)

When calling Level Zero API functions that query information and use a struct to
populate the information, the user must supply the structure type (stype) in
the structure itself. The struct is both an in and an out argument.
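
For example, a minimal query following that rule (a sketch against the public Level Zero API; the driver code may differ):

```cpp
#include <level_zero/ze_api.h>

// The properties struct is both an input and an output: stype must be set
// before the call so the API knows which structure version it is filling.
static ze_result_t query_device_properties(ze_device_handle_t device,
                                           ze_device_properties_t *props) {
  *props = {};
  props->stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
  return zeDeviceGetProperties(device, props);
}
```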

Fix Level Zero build

Remove usage of iree_hal_command_buffer_dyn_cast.
* Add rudimentary non-production distributed Python API

* Distributed execution validation

Add functionality that validates that distributed StableHLO
produces the same results as non-distributed execution.

* Add execution time measurement

* Distributed Python API: add call_count to run_ranks

* Add setup script for distributed Python API

* Add JAX to install setup

---------

Co-authored-by: Boian Petkantchin <[email protected]>
…f-hosted, clean macos bindist

Drop instrumented builds and Python < 3.11
Add Upstream sync CI

This fixes the problem of potentially dropping commits that have
been submitted while an automatic rebase with upstream IREE is going
on.

[CI] Fix macos clean up logic

Fixes the macos builder.
Instead of requiring an exact NCCL version,
relax the constraint to the standard ABI versioning rules, namely
`found_version >= major.minor && found_version < major + 1`,
where major and minor are from the NCCL headers we build against.
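
A sketch of that check, assuming NCCL's usual integer version encoding of `major*10000 + minor*100 + patch` (names here are illustrative):

```cpp
// Accept any NCCL runtime within the same major version at or above the
// minor version from the headers we compiled against.
static bool nccl_version_compatible(int found_code, int hdr_major,
                                    int hdr_minor) {
  int min_code = hdr_major * 10000 + hdr_minor * 100;  // >= major.minor
  int max_code = (hdr_major + 1) * 10000;              // < major + 1
  return found_code >= min_code && found_code < max_code;
}
```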
Makes the driver compliant with the HAL API change.
vivekkhandelwal1 and others added 10 commits November 22, 2023 19:34
The semantics for specifying different kinds of advice are unclear, so I
set it in two stages.
…uped qmm MegaPR

[LLVMCPU] Allow parallel tiling in LLVMCPUSplitReduction, tile reduction by 2
This commit enables tiling of parallel dimensions in LLVMCPUSplitReduction,
as well as changing the tile size of the resulting reduction to 2. The latter
change is an x86-specific optimization that allows targeting specific
instructions through VectorContractCustomKernels.

[LLVMCPU] Add support for vecmat cases in VectorContractCustomKernel
This commit introduces some new functionality to VectorContractCustomKernels:
  1. Matching for vecmat kernels that have 1D vector shapes
  2. Support for `vector.contract` ops with split reduction dimensions
  3. Ability to allow promoting smaller bitwidth inputs with `arith.extui` or
     `arith.extsi` before passing into the `llvm.inline_asm` op
  4. Ability to specify explicit constraint strings per register input in a
     VectorContractCustomKernel
  5. Support for `i4` and `i8` input types
  6. New x86 AVX512VNNI i16xi16->i32 vecmat kernel with split reduction

This commit also adds `vector.transfer_read` flattening patterns and
VectorContractCustomKernel lowering patterns to LLVMCPUVectorLowering.

[LLVMCPU] Add pass to breakdown subbyte `arith.extui`
This pass breaks down `arith.extui` ops that have `i4` inputs into a
sequence of `vector.shuffle->arith.andi->arith.shrui`. This avoids bad
lowering of subbyte extends in the x86 backend. This pass is somewhat
specific to some work on vecmat VectorContractCustomKernels right now,
and has some unique matching requirements.

The pass also attempts to make use of AVX512 registers, so the vector
size for the resulting IR is hardcoded as 512 bits. This needs to
change before landing. This pass in general needs some refactoring
before landing.
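
As a scalar analogue of the breakdown (a sketch only; the pass itself works on vectors sized for AVX512 registers and uses `vector.shuffle` to separate the nibbles):

```cpp
#include <cstdint>

// Two i4 values packed into one byte are widened with and/shift instead of
// relying on a native sub-byte extend.
static void extend_packed_i4(uint8_t packed, uint32_t &lo, uint32_t &hi) {
  lo = packed & 0xF;         // arith.andi: low nibble
  hi = (packed >> 4) & 0xF;  // arith.shrui (+ andi): high nibble
}
```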

[LLVMCPU] Add pass to fold away unit dimensions on `vector.contract` ops
This pass folds away unit dimensions on `vector.contract` ops to get these
ops into a form that is recognizable by the VectorContractCustomKernels
patterns.

This pass also hoists `vector.shape_cast` ops out of containing
`scf.for` ops if possible when the shape cast operates on the accumulator
of a `vector.contract` op. This pattern may be better off somewhere else,
but for now it is here because the unit dim folding pattern can produce
a hoistable `vector.shape_cast` op in cases with split reduction.

[LLVMCPU] Add flag to restrict reassociated quantized matmul optimizations

[LLVMCPU] Add additional Memref alias foldings

[LLVMCPU] Simplify VectorContractCustomKernels x86 constraint codes, add new AVX512 kernel
1) Level Zero is probably not buildable. Registration CMake for external
drivers changed and I didn't take the time to figure out how that works.

2) Some GitHub Actions changes; I might have messed up the workflows,
   unsure.

3) CPU Performance patches had a number of conflicts, including the
   FuseDequantMatmul pass seemingly being dropped. Will need help from
   Max to make sure they are all in order.
This commit adds a new tiling configuration pass in LLVMCPU. This pass
sets a special tiling configuration for reassociated quantized matmuls,
since the non-root ops of these dispatches require specific tiling to
target certain x86 instructions. This pass is a place to set abnormal
tile sizes on non-root ops for specific types of workloads.
@PhaneeshB PhaneeshB requested review from antiagainst and qedawkins and removed request for antiagainst November 25, 2023 01:09
Collaborator

@antiagainst antiagainst left a comment


Any reason this cannot be done directly in upstream IREE? Could you send a pull request there? Tests also need to be moved.

@PhaneeshB
Collaborator Author

merged upstream and rebased

@PhaneeshB PhaneeshB closed this Nov 28, 2023