
More Jetstream Pytorch fixes, prepare for release #116

Merged: 12 commits into main from jetstream-switch on Nov 20, 2024

Conversation

tengomucho (Collaborator)

What does this PR do?

This PR includes several fixes for the Jetstream Pytorch TGI implementation, including corrected batch handling and support for switching the sampling strategy between requests.
It also prepares for the upcoming 0.2.0 release, the first that will officially support Jetstream Pytorch on TGI.

Before submitting

  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

The cached batch returned was wrong: the generator expects exactly one
cached batch to be returned per prefill/decode call.
Also, the slot list size is now fixed: this prevents creating and
destroying elements in the slot list, which allows further
optimizations and avoids triggering JIT recompilation.
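
For illustration, here is a minimal, hypothetical sketch of a fixed-size slot list; it is not the actual optimum-tpu `Slot` class, just the pattern of pre-allocating slots and reusing them instead of growing and shrinking the list:

```python
from enum import Enum


class SlotState(Enum):
    EMPTY = 0
    READY = 1


class Slot:
    """Hypothetical slot holding the state of one generation request."""

    def __init__(self, index: int):
        self.index = index
        self.state = SlotState.EMPTY
        self.request_id = None

    def assign(self, request_id: int):
        self.state = SlotState.READY
        self.request_id = request_id

    def clear(self):
        self.state = SlotState.EMPTY
        self.request_id = None


BATCH_SIZE = 4
# The list is created once with a fixed size and never resized;
# requests only flip slots between EMPTY and READY.
slots = [Slot(i) for i in range(BATCH_SIZE)]
free_slot = next(s for s in slots if s.state == SlotState.EMPTY)
free_slot.assign(request_id=0)
```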
The randomness of sampling has been improved by splitting the PRNG key,
as suggested by the documentation of the JAX random submodule (see the sketch below).
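
As a reminder of the pattern recommended by the `jax.random` documentation, here is a small standalone sketch (the logits below are made up for the example): the key is split before each sampling call so the same key is never reused.

```python
import jax
import jax.numpy as jnp

# Made-up logits for a single decoding step over a tiny vocabulary.
logits = jnp.array([0.1, 2.0, 0.3, 1.5, 0.0, 0.7, 0.2, 1.1])

key = jax.random.PRNGKey(0)
# Split the key and sample with the fresh subkey; keep the other half
# for the next step instead of reusing the consumed one.
key, subkey = jax.random.split(key)
token = jax.random.categorical(subkey, logits)
print(int(token))
```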
A GPT2 test file already verifies the generator behaviour when using the
legacy Pytorch/XLA code; an equivalent test has now been added to verify
the same behaviour with the Jetstream/Pytorch counterpart.
The Jetstream/Pt engine allows passing a callback to the prefill and
generate methods. This callback is used to sample the generated token
with a custom function, but the caller function is JIT'ed, placing a
strong constraint on the callback signature. Until now, the callback was
compiled on the first call, making it impossible to change the sampling
algorithm across requests.
This commit fixes the issue by subclassing the PytorchEngine class and
defining a new `prefill_ex` method that is not JIT'ed. The model calls
are still compiled, so performance should not be noticeably affected
(see the sketch below).
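
The snippet below is a simplified, self-contained illustration of that idea, not the actual jetstream_pt API: the model forward pass is jitted once, while the sampling callback is invoked in plain Python outside the compiled function, so a different sampling function can be passed on each request without being frozen at trace time.

```python
import jax
import jax.numpy as jnp


@jax.jit
def model_forward(params, tokens):
    # Stand-in for the real model call; produces fake "logits".
    return tokens @ params


def prefill_ex(params, tokens, sample_fn):
    """Non-jitted wrapper: the model call is compiled, the sampler is not."""
    logits = model_forward(params, tokens)
    return sample_fn(logits)


def greedy(logits):
    return jnp.argmax(logits, axis=-1)


def sample(logits, key=jax.random.PRNGKey(0)):
    return jax.random.categorical(key, logits, axis=-1)


params = jnp.ones((4, 8))
tokens = jnp.ones((1, 4))
# The sampling strategy can change between calls without retracing the model.
print(prefill_ex(params, tokens, greedy))
print(prefill_ex(params, tokens, sample))
```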
The minor version is increased, mainly because of the Jetstream Pytorch
support in TGI.

@dacorvo left a comment


A few nits but otherwise LGTM. Thanks!

@@ -15,5 +15,5 @@
 from packaging.version import parse
 
 
-__version__ = "0.1.5"
+__version__ = "0.1.6"
Review comment: 0.2.0

@@ -524,8 +524,8 @@ def decode(self, batches: List[CachedBatch]) -> Tuple[List[Generation], CachedBatch]:
 raise ValueError("Unable to decode tokens for non-prefilled batches (probably due to a previous failure)")
 
 # Use a custom function to select the next token for each slot
-select_fn = jax.tree_util.Partial(self._select_from_slots)
-self.decode_state, result_tokens = self.engine.generate(self.params, self.decode_state, select_fn)
+# select_fn = jax.tree_util.Partial(self._select_from_slots)

Review comment: Debug leftover

@baptistecolle (Contributor)

Except for that, it looks good to me.

Maybe for the testing part, I would have broken down the big tests into multiple smaller ones. This would make them easier to maintain and help pinpoint issues more easily, as the current tests cover a lot of different things at once.

@dacorvo commented Nov 20, 2024

> Maybe for the testing part, I would have broken down the big tests into multiple smaller ones. This would make them easier to maintain and help pinpoint issues more easily, as the current tests cover a lot of different things at once.

Can you elaborate? The tests are inspired by the tests I wrote for neuron TGI, and are all testing a single feature, except for the multiple decode.
This longer test verifies that the decoding of multiple sequences that do not end at the same time works, so it cannot be broken down into smaller pieces.

@baptistecolle (Contributor) commented Nov 20, 2024

> Maybe for the testing part, I would have broken down the big tests into multiple smaller ones. This would make them easier to maintain and help pinpoint issues more easily, as the current tests cover a lot of different things at once.

> Can you elaborate? The tests are inspired by the tests I wrote for neuron TGI, and are all testing a single feature, except for the multiple decode. This longer test verifies that the decoding of multiple sequences that are not ending at the same time works, so it cannot be broken down into smaller pieces.

Thanks for the clarification. I think I misunderstood the goal of the test and the specific scenario it was testing. Maybe it would be nice to add some documentation to the test to explain it in more detail, like https://github.com/huggingface/optimum-neuron/blob/dd60749502cd05385d6f4fe3dd884dc221e22926/text-generation-inference/tests/server/helpers.py#L83C1-L85C8

@tengomucho (Collaborator, Author)

@baptistecolle I just added the docstring to the test, hoping to make it simpler to understand.

@tengomucho merged commit 1fc59ce into main on Nov 20, 2024 (5 checks passed).
@tengomucho deleted the jetstream-switch branch on November 20, 2024 at 12:17.