
Add compiler support for offloading mmt4d ops to custom dispatch plugin. #70

Closed
wants to merge 44 commits into from

Conversation

monorimet
Collaborator

No description provided.

@powderluv powderluv force-pushed the shark branch 2 times, most recently from d94fb2e to 14b5d20 Compare September 26, 2023 18:08
@monorimet monorimet force-pushed the ean-accel branch 3 times, most recently from 5f31690 to 42d13a3 Compare September 26, 2023 21:30
@powderluv powderluv force-pushed the shark branch 2 times, most recently from cb85819 to 8359b2b Compare September 27, 2023 01:41
@powderluv powderluv force-pushed the shark branch 4 times, most recently from 152623a to 7b615fe Compare September 28, 2023 04:07
@monorimet monorimet force-pushed the ean-accel branch 4 times, most recently from b7de8ea to 421cc19 Compare September 28, 2023 05:26
@powderluv powderluv force-pushed the shark branch 2 times, most recently from 079253d to de90ea2 Compare September 28, 2023 16:07
ThomasRaoux and others added 10 commits September 28, 2023 17:04
The conversion pass is enabled with `--iree-flow-enable-conv-nchw-to-nhwc-transform`

Includes partial support for propagating and cancelling transposes generated
when converting from nchw to nhwc. The high level strategy for this pass is
as follows:
    1. Do the conversions for all conv_nchw_fchw ops (and pooling ops) and
    wrap the converted convolutions in transposes. Each transpose is tagged
    to indicate which direction the transpose should propagate through the
    graph.
    2. Traverse the ops in the function in reverse to propagate transposes
    marked for upwards propagation to their parents. Ideally just before ops
    such as arith.constant or function arguments.
    3. Propagate the transposes marked for downward propagation to its users,
    ideally to just before return.
    4. Canonicalize out all adjacent cancelling transposes and generalize the
    remaining transposes to allow for fusing them with nearby ops.
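The cancellation in step 4 can be sketched in a few lines. This is an illustrative model only, not IREE's actual pass: ops are plain tuples and the permutation encoding is hypothetical, but the core check is the same — two adjacent transposes whose permutations are inverses of each other compose to the identity and can be dropped.

```python
# Illustrative sketch of step 4 (cancelling adjacent transposes).
# Op representation and permutation encoding are hypothetical.

NCHW_TO_NHWC = (0, 2, 3, 1)
NHWC_TO_NCHW = (0, 3, 1, 2)

def inverse(perm):
    """Return the permutation that undoes `perm`."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return tuple(inv)

def cancel_adjacent_transposes(ops):
    """Remove adjacent transpose pairs whose permutations compose to identity."""
    out = []
    for op in ops:
        if (out and op[0] == "transpose" and out[-1][0] == "transpose"
                and out[-1][1] == inverse(op[1])):
            out.pop()  # back-to-back inverse transposes cancel
        else:
            out.append(op)
    return out

# Two converted convolutions, each wrapped in transposes (step 1). After
# propagation, the NHWC->NCHW / NCHW->NHWC pair between them is adjacent
# and cancels, leaving a single transpose at each end of the chain.
pipeline = [
    ("transpose", NCHW_TO_NHWC),
    ("conv_nhwc", None),
    ("transpose", NHWC_TO_NCHW),
    ("transpose", NCHW_TO_NHWC),
    ("conv_nhwc", None),
    ("transpose", NHWC_TO_NCHW),
]
simplified = cancel_adjacent_transposes(pipeline)
```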
- Speedup filter transform folding
- Add points for 4x4, switch to that tile size
- Move winograd after im2col + padding, in im2col do not
  touch conv if it has been marked as winograd
- Remove prints/chrono and adjust Attribute rawKernelAttr for Windows (by
  Quinn)

Co-authored-by: Quinn Dawkins <[email protected]>
sogartar and others added 22 commits September 28, 2023 17:04
* Add rudimentary non-production distributed Python API

* Distributed execution validation

Add functionality that validates distributed StableHLO
is producing the same results as non-distributed.

* Add execution time measurement

* Distributed Python API: add call_count to run_ranks

* Add setup script for distributed Python API

* Add JAX to install setup

---------

Co-authored-by: Boian Petkantchin <[email protected]>
…f-hosted, clean macos bindist

Drop instrumented builds and Python < 3.11
Add Upstream sync CI

This fixes the problem of potentially dropping commits that have
been submitted while an automatic rebase with upstream IREE is going
on.

[CI] Fix macOS cleanup logic

Fixes the macOS builder.
Instead of requiring an exact NCCL version, relax the constraint to the
standard ABI versioning rules, namely
found_version >= major.minor && found_version < major + 1,
where major and minor come from the NCCL headers we use.
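The relaxed constraint above can be expressed as a small predicate. A minimal sketch, assuming versions are (major, minor, patch) tuples; the concrete version numbers below are illustrative, not taken from any particular NCCL release.

```python
# Sketch of the relaxed ABI-compatibility check described above:
# accept found_version >= major.minor and found_version < major + 1.

def nccl_version_ok(found, header):
    """found/header are (major, minor, patch) tuples."""
    return found >= (header[0], header[1], 0) and found[0] < header[0] + 1

header = (2, 18, 1)  # version from the NCCL headers used at build time

nccl_version_ok((2, 18, 1), header)   # exact match: accepted
nccl_version_ok((2, 19, 3), header)   # newer minor: ABI-compatible
nccl_version_ok((2, 17, 0), header)   # older minor: rejected
nccl_version_ok((3, 0, 0), header)    # next major: ABI break, rejected
```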
Makes the driver compliant with the HAL API change.
Currently the number of subgroups to use is driven by a target vector
size, which for large reductions can end up translating to a large
number of subgroups. This adds a preferred unrolling factor on the
vector size to reduce the default number of subgroups.
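The effect of the preferred unrolling factor can be shown with some back-of-the-envelope arithmetic. This is a simplified model of the sizing described above, not the actual codegen heuristic, and all numbers (reduction size, subgroup size) are hypothetical.

```python
# Illustrative model: each subgroup covers subgroup_size * unroll_factor
# elements per pass, so unrolling directly divides the subgroup count.
import math

def num_subgroups(reduction_size, subgroup_size, unroll_factor):
    elems_per_subgroup = subgroup_size * unroll_factor
    return math.ceil(reduction_size / elems_per_subgroup)

# A large reduction with no unrolling wants many subgroups...
num_subgroups(4096, 32, 1)   # 128 subgroups
# ...while a preferred unroll factor of 4 cuts that to a quarter.
num_subgroups(4096, 32, 4)   # 32 subgroups
```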
The semantics for specifying different kinds of advice are unclear, so I
set it in two stages.
We don't currently insert deallocas and don't track live ranges but that
can come in the future as we support more control flow. For now this
at least gets all of the common allocations within an invocation into
the queue-ordered bucket so that we can do proper async execution and
use native queue-ordered (e.g. stream-ordered allocations in CUDA)
functionality.

With this change the caching allocator is no longer needed for CUDA
in almost all cases.
Previously all external resources (results returned by an invocation)
were made host-visible and mappable and this prevented the use of
queue-ordered allocations in CUDA as memory pools cannot service memory
with associated host pointers. Depending on device the host-visible
memory could also be much slower to access (or have more potential
pitfalls with page management) vs pinned device-local memory and this
got worse once we started doing more dispatches in-place on the results.

Now all external buffers are by default allocated as device-local. Users
will need to manually stage the buffers and otherwise they'll remain
on-device. For externalized state this is a good thing as it means we'll
keep state on device automatically. A temporary flag has been added to
revert to the old mappable behavior with
`--iree-stream-external-resources-mappable=true`. Note that some devices
(like CPU) will always allow mapping even if not requested and users can
avoid the copies by checking before performing the transfers.
monorimet and others added 2 commits September 28, 2023 10:08
Add AccelMatmulExpert pass pipeline

Apply clang-format to new C++ files.

Remove the currently unused skip-intermediate-roundings option.

Reorder entries in BUILD.bazel and CMakeLists.txt alphabetically. (#5)

Wrap everything but the factory in the unnamed namespace. (#8)

use 'accel' identifier

Use parameter struct calling convention

Use mmt4d path instead of defining linalg.matmul path.

Fix lit test and pass to use mmt4d op.

(WIP) Use rank-reduced slices of operands in ukernel call.

Fix rank reduction and lit test.

Remove isInitializedToZero from accel codegen

Fix dims

Fix post-accel lowering passes to handle linalg.fill

Fix lit test expected output.

Correct dims for plugin interface

Co-authored-by: Sungsoon Cho <[email protected]>
@powderluv powderluv force-pushed the shark branch 2 times, most recently from 147cf90 to 54a5fbb Compare September 28, 2023 22:04
@monorimet monorimet closed this Sep 28, 2023
godot73 pushed a commit to godot73/SRT that referenced this pull request Sep 29, 2023