move GeneralizeNamedOp pass #72

Closed
wants to merge 41 commits

Conversation

PhaneeshB
Collaborator

  • moves the SPIRVGeneralizeNamedOp pass to Common/GPU
  • uses it in the LLVMGPU pipeline
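
For context, here is a minimal sketch of what a generalize-named-ops step does per op, assuming it wraps the upstream `linalg::generalizeNamedOp` helper (the actual pass may organize this differently):

```cpp
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Rewrite a named linalg op (e.g. linalg.matmul) into an equivalent
// linalg.generic so downstream GPU patterns only need to handle one form.
static LogicalResult generalizeOne(RewriterBase &rewriter,
                                   linalg::LinalgOp namedOp) {
  FailureOr<linalg::GenericOp> generic =
      linalg::generalizeNamedOp(rewriter, namedOp);
  return success(succeeded(generic));
}
```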

ThomasRaoux and others added 30 commits November 22, 2023 19:34
The conversion pass is enabled with `--iree-flow-enable-conv-nchw-to-nhwc-transform`

Includes partial support for propagating and cancelling transposes generated
when converting from nchw to nhwc. The high-level strategy for this pass is
as follows (a sketch of steps 1 and 2 follows the list):
    1. Do the conversions for all conv_nchw_fchw ops (and pooling ops) and
    wrap the converted convolutions in transposes. Each transpose is tagged
    to indicate which direction the transpose should propagate through the
    graph.
    2. Traverse the ops in the function in reverse to propagate transposes
    marked for upwards propagation to their parents, ideally landing just
    before ops such as arith.constant or function arguments.
    3. Propagate the transposes marked for downward propagation to their
    users, ideally to just before the return.
    4. Canonicalize out all adjacent cancelling transposes and generalize the
    remaining transposes to allow for fusing them with nearby ops.
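
As a rough illustration of steps 1 and 2, a minimal C++ sketch of the tagging and reverse traversal (the attribute names here are hypothetical, not necessarily what the pass uses):

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "llvm/ADT/STLExtras.h"

using namespace mlir;

// Step 1: each transpose wrapping a converted convolution records which
// direction it should move through the graph.
static void tagTranspose(linalg::TransposeOp transposeOp, StringRef direction) {
  transposeOp->setAttr(direction, UnitAttr::get(transposeOp.getContext()));
}

// Step 2: walk the entry block in reverse so transposes marked for upward
// propagation are visited before the producers they should move past.
static void propagateUpwards(func::FuncOp funcOp) {
  for (Operation &op : llvm::reverse(funcOp.getBody().front())) {
    if (!op.hasAttr("__propagate_up__"))
      continue;
    // ... swap the tagged transpose with its defining op when legal ...
  }
}
```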
- Speedup filter transform folding
- Add points for 4x4, switch to that tile size
- Move winograd after im2col + padding; in im2col, do not
  touch the conv if it has been marked as winograd
- Remove prints/chrono and adjust Attribute rawKernelAttr for Windows (by
  Quinn)

Co-authored-by: Quinn Dawkins <[email protected]>
Add pass to insert markers for function bisecting.
Add pass to outline marked operation ranges into separate functions.
Expose a Python binding that extracts an operation
list from an MLIR file.
This list is then used to execute the entry MLIR
with IREE while resolving calls to functions in
other MLIR files.
This can be useful when trying to do layout propagation and guaranteeing
specific fusion at the same time (use with caution).
This pass is the spiritual successor to `convert-conv-nchw-to-nhwc`,
focused on generalizing to enable data tiling and more robust layout
propagation, as well as supporting non-named convolutions.

Currently this includes some baked-in generalization patterns and does
not support padding. Tile size selection is pass-wide, but there is
limited attribute control to enable fully transposing. Further
generalization should aim to rework this pass to allow per-op tile
size control.
Sub-32-bit types are handled on the SPIR-V side by introducing
bitcasts to and from i32 and bubbling them toward the center of the kernel
in the hope that they cancel. This adds a pattern for a bitcast on the result
of an scf.if, which comes from the way that padding is handled (transfer_read
in the `then` branch, else yield a splat constant).
Build Experimental ROCM builds
Use flags -DIREE_BUILD_EXPERIMENTAL_LEVEL_ZERO=ON -DLEVEL_ZERO_HEADERS_API_ROOT=/home/stanley/nod/level-zero
- LevelZero HAL Driver
- Add OpenCL HAL Target compiler
- Add SPIRV Codegen for Kernel capability
- Fix illegal pointer arithmetic on void* (by Boian)
- Use events for command buffer execution and synchronization (iree-org#47)
- Add flag for switching between Physical64 and Physical32 addressing in OpenCL
- Enable creation of a device by UUID + ID (device handle as uintptr_t)
Note that a device ID implemented this way is ephemeral and is valid
only in the current IREE runtime context. If you start a new process,
the IDs will be different.
With this change you can do

$ iree-run-module --list_devices
level_zero://00005100-0000-0000-0000-000000000001

$ iree-run-module --device=level_zero://00005100-0000-0000-0000-000000000001 ...

- Add query_memory_heaps implementation

Fixes
error: arithmetic on a pointer to void is a GNU extension [-Werror,-Wgnu-pointer-arith]
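
A sketch of the usual fix for that warning: offset through a byte-sized pointer instead of `void*` (illustrative helper, not the exact code in the driver):

```cpp
#include <cstddef>
#include <cstdint>

// Pointer arithmetic on void* is a GNU extension; cast to uint8_t* so the
// offset is computed in well-defined byte units.
static void *offset_ptr(void *base, size_t offset) {
  return static_cast<uint8_t *>(base) + offset;
}
```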

- Supply structure type when passing such arguments in accordance with the API (iree-org#34)

When calling Level Zero API functions that query information and use a struct to
populate the information, the user must supply the structure type (stype) in
the structure itself. The struct is both an in and an out argument.
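
For example, a minimal query following that rule (a sketch against the public Level Zero API; the driver code may differ):

```cpp
#include <level_zero/ze_api.h>

// The properties struct is both an input and an output: stype must be set
// before the call so the API knows which structure version it is filling.
static ze_result_t query_device_properties(ze_device_handle_t device,
                                           ze_device_properties_t *props) {
  *props = {};
  props->stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
  return zeDeviceGetProperties(device, props);
}
```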

Fix Level Zero build

Remove usage of iree_hal_command_buffer_dyn_cast.
* Add rudimentary non-production distributed Python API

* Distributed execution validation

Add functionality that validates that distributed StableHLO
produces the same results as non-distributed execution.

* Add execution time measurement

* Distributed Python API: add call_count to run_ranks

* Add setup script for distributed Python API

* Add JAX to install setup

---------

Co-authored-by: Boian Petkantchin <[email protected]>
…f-hosted, clean macos bindist

Drop instrumented builds and Python < 3.11
Add Upstream sync CI

This fixes the problem of potentially dropping commits that have
been submitted while an automatic rebase with upstream IREE is going
on.

[CI] Fix macos clean up logic

Fixes the macos builder.
Instead of requiring an exact NCCL version,
relax the constraint to the standard ABI versioning rules, namely
`found_version >= major.minor && found_version < major + 1`,
where major and minor are from the NCCL headers we build against.
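
A sketch of that check, assuming NCCL's usual integer version encoding of `major*10000 + minor*100 + patch` (names here are illustrative):

```cpp
// Accept any NCCL runtime within the same major version at or above the
// minor version from the headers we compiled against.
static bool nccl_version_compatible(int found_code, int hdr_major,
                                    int hdr_minor) {
  int min_code = hdr_major * 10000 + hdr_minor * 100;  // >= major.minor
  int max_code = (hdr_major + 1) * 10000;              // < major + 1
  return found_code >= min_code && found_code < max_code;
}
```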
Makes the driver compliant with the HAL API change.
vivekkhandelwal1 and others added 10 commits November 22, 2023 19:34
The semantics for specifying different kinds of advice are unclear, so I
set it in two stages.
…uped qmm MegaPR

[LLVMCPU] Allow parallel tiling in LLVMCPUSplitReduction, tile reduction by 2
This commit enables tiling of parallel dimensions in LLVMCPUSplitReduction,
as well as changing the tile size of the resulting reduction to 2. The latter
change is an x86-specific optimization that allows targeting specific
instructions through VectorContractCustomKernels.

[LLVMCPU] Add support for vecmat cases in VectorContractCustomKernel
This commit introduces some new functionality to VectorContractCustomKernels:
  1. Matching for vecmat kernels that have 1D vector shapes
  2. Support for `vector.contract` ops with split reduction dimensions
  3. Ability to allow promoting smaller bitwidth inputs with `arith.extui` or
     `arith.extsi` before passing into the `llvm.inline_asm` op
  4. Ability to specify explicit constraint strings per register input in a
     VectorContractCustomKernel
  5. Support for `i4` and `i8` input types
  6. New x86 AVX512VNNI i16xi16->i32 vecmat kernel with split reduction

This commit also adds `vector.transfer_read` flattening patterns and
VectorContractCustomKernel lowering patterns to LLVMCPUVectorLowering.

[LLVMCPU] Add pass to breakdown subbyte `arith.extui`
This pass breaks down `arith.extui` ops that have `i4` inputs into a
sequence of `vector.shuffle->arith.andi->arith.shrui`. This avoids bad
lowering of subbyte extends in the x86 backend. This pass is somewhat
specific to some work on vecmat VectorContractCustomKernels right now,
and has some unique matching requirements.

The pass also attempts to make use of AVX512 registers, so the vector
size for the resulting IR is hardcoded as 512 bits. This needs to
change before landing. This pass in general needs some refactoring
before landing.
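
As a scalar analogue of the breakdown (a sketch only; the pass itself works on vectors sized for AVX512 registers and uses `vector.shuffle` to separate the nibbles):

```cpp
#include <cstdint>

// Two i4 values packed into one byte are widened with and/shift instead of
// relying on a native sub-byte extend.
static void extend_packed_i4(uint8_t packed, uint32_t &lo, uint32_t &hi) {
  lo = packed & 0xF;         // arith.andi: low nibble
  hi = (packed >> 4) & 0xF;  // arith.shrui (+ andi): high nibble
}
```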

[LLVMCPU] Add pass to fold away unit dimensions on `vector.contract` ops
This pass folds away unit dimensions on `vector.contract` ops to get these
ops into a form that is recognizable by the VectorContractCustomKernels
patterns.

This pass also hoists `vector.shape_cast` ops out of containing
`scf.for` ops if possible when the shape cast operates on the accumulator
of a `vector.contract` op. This pattern may be better off somewhere else,
but for now it is here because the unit dim folding pattern can produce
a hoistable `vector.shape_cast` op in cases with split reduction.

[LLVMCPU] Add flag to restrict reassociated quantized matmul optimizations

[LLVMCPU] Add additional Memref alias foldings

[LLVMCPU] Simplify VectorContractCustomKernels x86 constraint codes, add new AVX512 kernel
1) Level Zero is probably not buildable. Registration CMake for external
drivers changed and I didn't take the time to figure out how that works.

2) Some GitHub Actions changes; I might have messed up the workflows,
   unsure.

3) CPU Performance patches had a number of conflicts, including the
   FuseDequantMatmul pass seemingly being dropped. Will need help from
   Max to make sure they are all in order.
This commit adds a new tiling configuration pass in LLVMCPU. This pass
sets a special tiling configuration for reassociated quantized matmuls,
since the non-root ops of these dispatches require specific tiling to
target certain x86 instructions. This pass is a place to set abnormal
tile sizes on non-root ops for specific types of workloads.
@PhaneeshB PhaneeshB requested review from antiagainst and qedawkins and removed request for antiagainst November 25, 2023 01:09
Collaborator

@antiagainst antiagainst left a comment


Any reason this cannot be done directly in upstream IREE? Could you send a pull request there? Tests also need to be moved.

@PhaneeshB
Collaborator Author

merged upstream and rebased

@PhaneeshB PhaneeshB closed this Nov 28, 2023