forked from iree-org/iree
Add compiler support for offloading mmt4d ops to custom dispatch plugin. #70
Conversation
powderluv force-pushed the shark branch 2 times, most recently from d94fb2e to 14b5d20 on September 26, 2023 18:08
monorimet force-pushed the ean-accel branch from 4b646e2 to 4174ffe on September 26, 2023 18:31
monorimet force-pushed the ean-accel branch 3 times, most recently from 5f31690 to 42d13a3 on September 26, 2023 21:30
powderluv force-pushed the shark branch 2 times, most recently from cb85819 to 8359b2b on September 27, 2023 01:41
monorimet force-pushed the ean-accel branch from 3fd3cfa to c9de6de on September 27, 2023 19:19
powderluv force-pushed the shark branch 4 times, most recently from 152623a to 7b615fe on September 28, 2023 04:07
monorimet force-pushed the ean-accel branch 4 times, most recently from b7de8ea to 421cc19 on September 28, 2023 05:26
powderluv force-pushed the shark branch 2 times, most recently from 079253d to de90ea2 on September 28, 2023 16:07
Co-authored-by: Elias Joseph <[email protected]>
The conversion pass is enabled with `--iree-flow-enable-conv-nchw-to-nhwc-transform`. Includes partial support for propagating and cancelling the transposes generated when converting from NCHW to NHWC. The high-level strategy for this pass is as follows:
1. Convert all conv_nchw_fchw ops (and pooling ops) and wrap each converted convolution in transposes. Each transpose is tagged to indicate the direction in which it should propagate through the graph.
2. Traverse the ops in the function in reverse to propagate transposes marked for upward propagation to their parents, ideally to just before ops such as arith.constant or function arguments.
3. Propagate transposes marked for downward propagation to their users, ideally to just before the return.
4. Canonicalize away all adjacent cancelling transposes and generalize the remaining transposes so they can be fused with nearby ops.
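Step 4 above can be sketched in a few lines of Python. This is an illustrative model only, not the actual IREE pass: ops are a flat list, and the hypothetical `cancel_adjacent_transposes` helper drops any transpose pair whose permutations compose to the identity.

```python
# Minimal sketch of cancelling adjacent transposes (step 4 above).
# Data model and helper names are hypothetical, not the IREE implementation.

def compose(p, q):
    """Permutation obtained by applying p first, then q."""
    return [p[q[i]] for i in range(len(q))]

def is_identity(p):
    return list(p) == list(range(len(p)))

def cancel_adjacent_transposes(ops):
    """ops: list of ("transpose", perm) or ("op", name) tuples.
    Removes adjacent transpose pairs that compose to the identity."""
    out = []
    for op in ops:
        if (op[0] == "transpose" and out and out[-1][0] == "transpose"
                and is_identity(compose(out[-1][1], op[1]))):
            out.pop()  # the two transposes cancel each other
        else:
            out.append(op)
    return out

# NCHW -> NHWC is (0, 2, 3, 1); its inverse NHWC -> NCHW is (0, 3, 1, 2).
ops = [("transpose", (0, 2, 3, 1)), ("transpose", (0, 3, 1, 2)), ("op", "conv")]
print(cancel_adjacent_transposes(ops))  # [('op', 'conv')]
```

The same pairwise check is what makes the propagation in steps 2 and 3 pay off: pushing transposes toward each other creates adjacent inverse pairs that this canonicalization can delete.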
- Speed up filter transform folding.
- Add points for 4x4 and switch to that tile size.
- Move winograd after im2col + padding; in im2col, do not touch a conv that has been marked as winograd.
- Remove prints/chrono and adjust Attribute rawKernelAttr for Windows (by Quinn).
Co-authored-by: Quinn Dawkins <[email protected]>
* Add rudimentary non-production distributed Python API.
* Distributed execution validation: add functionality that validates distributed StableHLO produces the same results as non-distributed.
* Add execution time measurement.
* Distributed Python API: add call_count to run_ranks.
* Add setup script for distributed Python API.
* Add JAX to install setup.
Co-authored-by: Boian Petkantchin <[email protected]>
…f-hosted, clean macos bindist. Drop instrumented builds and Python < 3.11.
Add upstream sync CI. This fixes the problem of potentially dropping commits that were submitted while an automatic rebase with upstream IREE is going on.
[CI] Fix macOS cleanup logic. Fixes the macOS builder.
Instead of requiring an exact NCCL version, relax the constraint to the standard ABI versioning rule, namely `found_version >= major.minor && found_version < major + 1`, where major and minor come from the NCCL headers we build against.
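The relaxed rule amounts to accepting any found version in the half-open range [major.minor, major+1). A small sketch, with versions modeled as plain tuples (NCCL itself encodes the version as a single integer in its headers, so the function name and representation here are illustrative):

```python
# Sketch of the relaxed ABI-style check: accept found versions in
# [required_major.required_minor, required_major + 1). Hypothetical helper,
# not the actual build-system logic.

def nccl_version_ok(found, required):
    """found, required: (major, minor, patch) tuples."""
    req_major, req_minor, _ = required
    return found >= (req_major, req_minor, 0) and found[0] < req_major + 1

print(nccl_version_ok((2, 18, 3), (2, 16, 0)))  # True: newer minor is fine
print(nccl_version_ok((2, 14, 0), (2, 16, 0)))  # False: older than required
print(nccl_version_ok((3, 0, 0), (2, 16, 0)))   # False: next major, ABI break
```

The upper bound on the major version is what the "standard ABI versioning rules" buy: minor releases within a major series are expected to stay ABI-compatible, while a major bump is not.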
Makes the driver compliant with the HAL API change.
Currently the number of subgroups to use is driven by a target vector size, which for large reductions can translate to a large number of subgroups. This adds a preferred unrolling factor on the vector size to reduce the default number of subgroups.
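The arithmetic behind this can be sketched as follows. All names, the formula, and the concrete numbers are hypothetical, chosen only to show how an unrolling factor shrinks the subgroup count:

```python
# Illustrative model: each subgroup covers subgroup_size lanes times the
# vector width, times an optional unrolling factor. Not the actual
# codegen heuristic, just the shape of the trade-off described above.

def num_subgroups(reduction_size, subgroup_size, target_vector_size, unroll=1):
    elems_per_subgroup = subgroup_size * target_vector_size * unroll
    return max(1, reduction_size // elems_per_subgroup)

# A large reduction with 32-lane subgroups and 4-wide vectors:
print(num_subgroups(16384, 32, 4))            # 128 subgroups, no unrolling
print(num_subgroups(16384, 32, 4, unroll=8))  # 16 subgroups with 8x unrolling
```

With unrolling, each subgroup processes more elements sequentially, so far fewer subgroups are needed to cover the same reduction.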
This reverts commit a6512dc.
This reverts commit 31e7635.
The semantics for specifying different kinds of advice are unclear, so I set it in two stages.
We don't currently insert deallocas and don't track live ranges but that can come in the future as we support more control flow. For now this at least gets all of the common allocations within an invocation into the queue-ordered bucket so that we can do proper async execution and use native queue-ordered (e.g. stream-ordered allocations in CUDA) functionality. With this change the caching allocator is no longer needed for CUDA in almost all cases.
Previously all external resources (results returned by an invocation) were made host-visible and mappable. This prevented the use of queue-ordered allocations in CUDA, as memory pools cannot service memory with associated host pointers. Depending on the device, host-visible memory could also be much slower to access (or have more potential pitfalls with page management) than pinned device-local memory, and this got worse once we started doing more dispatches in-place on the results.
Now all external buffers are allocated device-local by default. Users will need to manually stage the buffers; otherwise they'll remain on-device. For externalized state this is a good thing, as it means we'll keep state on device automatically. A temporary flag, `--iree-stream-external-resources-mappable=true`, reverts to the old mappable behavior.
Note that some devices (like CPU) will always allow mapping even if not requested, and users can avoid the copies by checking before performing the transfers.
Add AccelMatmulExpert pass pipeline.
- Apply clang-format to new C++ files.
- Remove the currently unused skip-intermediate-roundings option.
- Reorder entries in BUILD.bazel and CMakeLists.txt alphabetically. (#5)
- Wrap everything but the factory in the unnamed namespace. (#8)
- Use the 'accel' identifier.
- Use the parameter struct calling convention.
- Use the mmt4d path instead of defining a linalg.matmul path.
- Fix lit test and pass to use the mmt4d op.
- (WIP) Use rank-reduced slices of operands in the ukernel call.
- Fix rank reduction and lit test.
- Remove isInitializedToZero from accel codegen.
- Fix dims.
- Fix post-accel lowering passes to handle linalg.fill.
- Fix lit test expected output.
- Correct dims for the plugin interface.
Co-authored-by: Sungsoon Cho <[email protected]>
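For readers unfamiliar with the op being offloaded: linalg.mmt4d is a matmul on pre-tiled 4-D operands, with the LHS laid out as [M1, K1, M0, K0], the RHS already transposed as [N1, K1, N0, K0], and the accumulator as [M1, N1, M0, N0]. A plain-Python reference sketch of those semantics (loop structure and names are illustrative, not the plugin's ukernel):

```python
# Reference semantics of mmt4d on nested lists. The inner [M0, K0] x [N0, K0]
# tile product is what the custom dispatch plugin's ukernel would compute.

def mmt4d_ref(lhs, rhs, acc):
    M1, K1 = len(lhs), len(lhs[0])
    M0, K0 = len(lhs[0][0]), len(lhs[0][0][0])
    N1, N0 = len(rhs), len(rhs[0][0])
    out = [[[[acc[i][j][m][n] for n in range(N0)] for m in range(M0)]
            for j in range(N1)] for i in range(M1)]
    for i in range(M1):
        for j in range(N1):
            for k in range(K1):
                for m in range(M0):
                    for n in range(N0):
                        for kk in range(K0):
                            out[i][j][m][n] += lhs[i][k][m][kk] * rhs[j][k][n][kk]
    return out

# 1x1 outer tiles with 2x2 inner tiles: equivalent to a plain 2x2 matmul,
# with the RHS tile given in transposed [N0, K0] layout.
lhs = [[[[1, 2], [3, 4]]]]       # A = [[1, 2], [3, 4]]
rhs = [[[[5, 7], [6, 8]]]]       # row n holds column n of B = [[5, 6], [7, 8]]
acc = [[[[0, 0], [0, 0]]]]
print(mmt4d_ref(lhs, rhs, acc))  # [[[[19, 22], [43, 50]]]]
```

Because the RHS inner tile arrives pre-transposed, the innermost product is a dot of two contiguous rows, which is what makes the op a good target for a tiled ukernel.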
monorimet force-pushed the ean-accel branch from 81079a1 to e08fba7 on September 28, 2023 17:09
powderluv force-pushed the shark branch 2 times, most recently from 147cf90 to 54a5fbb on September 28, 2023 22:04
godot73 pushed a commit to godot73/SRT that referenced this pull request on Sep 29, 2023
No description provided.