Use kernels from the kernel hub #2988

danieldk · 2025-02-03T12:00:48Z

What does this PR do?

Use hub kernels for paged attention, MoE, and quantization (Marlin, cutlass, etc.).

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

server/text_generation_server/layers/compressed_tensors/w8a8_int.py

danieldk · 2025-02-05T15:44:19Z

nix/impure-shell.nix

@@ -90,7 +90,7 @@ mkShell {

  postVenvCreation = ''
    unset SOURCE_DATE_EPOCH
-    ( cd server ; python -m pip install --no-dependencies -e . )
+    ( cd server ; python -m pip install --no-build-isolation --no-dependencies -e . )


This avoids downloading torch as a build dependency: without build isolation, the build uses the torch that is already installed in the environment.
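With `--no-build-isolation`, pip does not create a temporary build environment, so every build requirement has to be present in the active environment already. A minimal sketch of a pre-flight check for that condition follows. The helper name `missing_build_requirements` and the example requirement list are assumptions for illustration; they are not part of this PR.

```python
import re
from importlib import metadata


def missing_build_requirements(requires):
    """Return the names of build requirements not installed in this environment.

    `requires` is a list of requirement strings such as "torch>=2.0".
    Only the distribution name is checked; version constraints are ignored.
    """
    missing = []
    for requirement in requires:
        # Take the leading distribution name, dropping any version specifier.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement.strip())
        if match is None:
            continue
        name = match.group(0)
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Hypothetical requirement list; the real one lives in the server's build config.
print(missing_build_requirements(["setuptools", "torch"]))
```

An empty result suggests the editable install can proceed without pip fetching anything for the build.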

danieldk · 2025-02-05T15:45:36Z

server/text_generation_server/layers/moe/gptq_marlin.py

@@ -230,3 +232,111 @@ def _pack_weight(
    moe_weight.perm[expert] = weight.perm

    return moe_weight
+
+
+def fused_marlin_moe(


I'd like to keep the MoE kernels on the kernel hub as close to vLLM as possible, so I moved this here, with our own extensions.

danieldk · 2025-02-05T15:45:55Z

server/text_generation_server/layers/moe/unquantized.py

@@ -146,3 +159,110 @@ def _load_expert_weights_row(
    assert all_weight is not None

    return all_weight
+
+
+def fused_moe(


I'd like to keep the MoE kernels on the kernel hub as close to vLLM as possible, so I moved this here, with our own extensions.

Narsil reviewed Feb 3, 2025


server/text_generation_server/layers/compressed_tensors/w8a8_int.py (outdated, resolved)

danieldk force-pushed the kernel-hub branch 3 times, most recently from e01578d to 27decc5, February 4, 2025 13:22

danieldk added 22 commits February 5, 2025 15:41

Use Hub kernels for Marlin and cutlass quantization kernels

5360e7e

Use hub kernels for MoE/GPTQ-Marlin MoE

aec8707

Use attention kernels from the Hub

b1a3c45

Cache the kernels in the Docker image

e3f303f

Docker: enable uv build isolation again

530f49f

Update moe kernels

9d89cba

Support loading local kernels for development

c082395

Support latest moe kernels

612b0a9

Update to moe 0.1.1

3781435

CI: download locked kernels for server tests

5b8f6b0

Fixup some imports

41004dd

CI: activate venv

859ebd5

Fix unused imports

9f1854c

Nix: add attention/moe/quantization kernels

990503d

Update hf-kernels to 0.1.5

871f9f4

Update kernels

a7171f4

Update tgi-nix flake for hf-kernels

27e6f6d

Fix EOF

1def55d

Take load_kernel out of a frequently-called function

d6ac7fd

Hoist another case of kernel loading out of a somewhat hot function

8bca785

marlin-kernels -> quantization

8677989

attention -> paged-attention

3726ab7

danieldk force-pushed the kernel-hub branch from fac14af to 3726ab7, February 5, 2025 15:41

danieldk commented Feb 5, 2025


EOF fix

44eb825

danieldk marked this pull request as ready for review February 5, 2025 17:28
