✈️ Introduce Jetstream/Pytorch in TGI #88
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
use packaging.version's parse instead of pkg_resources' parse_version.
The custom HfEngine contains functions that will allow for prefill and generate functions to use custom sampling functions.
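For illustration, a minimal sketch of that idea, assuming only that jetstream_pt exposes engine.PyTorchEngine as shown in the diff hunks further down; the constructor argument and attribute names here are illustrative, not the PR's actual interface:

```python
from typing import Callable, Optional

import jax.numpy as jnp
from jetstream_pt import engine


class HfEngine(engine.PyTorchEngine):
    """Engine variant that lets prefill/generate use a caller-provided sampling function."""

    def __init__(self, *args, sampling_fn: Optional[Callable[[jnp.ndarray], jnp.ndarray]] = None, **kwargs):
        super().__init__(*args, **kwargs)
        # Map logits -> selected token ids; default to greedy selection when no
        # custom sampling function is provided (hypothetical attribute name).
        self.sampling_fn = sampling_fn or (lambda logits: jnp.argmax(logits, axis=-1))
```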
This implementation is equivalent to the torch_xla one, but uses the Jetstream/Pytorch engine instead.
This way we can avoid trying to import torch_xla.
This is just a way to provide a factory class method to create Jetstream/Pytorch or Pytorch XLA generator.
There are still some issues related to some fine-tuned models, so for now just enable only when JETSTREAM_PT is set.
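A rough sketch of the selection logic these commits describe: the Jetstream/Pytorch generator is chosen only when the JETSTREAM_PT environment variable opts in, otherwise the Pytorch XLA generator is used. The class name, the import paths, and the exact gating condition are assumptions; the keyword arguments mirror the diff hunk quoted later in the conversation.

```python
import os


class AutoGenerator:
    """Factory that picks the Jetstream/Pytorch generator only when explicitly enabled."""

    @classmethod
    def from_pretrained(cls, model_path, revision, max_batch_size, max_sequence_length):
        # Opt-in gate: use the Jetstream/Pytorch path only when JETSTREAM_PT is set.
        if os.environ.get("JETSTREAM_PT"):
            from .jetstream_pt_support import TpuGeneratorJetStream  # hypothetical import path
            return TpuGeneratorJetStream.from_pretrained(
                model_path, revision=revision, max_batch_size=max_batch_size, max_sequence_length=max_sequence_length
            )
        from .generator import TpuGenerator
        return TpuGenerator.from_pretrained(
            model_path, revision=revision, max_batch_size=max_batch_size, max_sequence_length=max_sequence_length
        )
```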
Force-pushed from 5a73926 to 82849fa.
For now it is possible to install the dependency after optimum-tpu has been installed, by issuing this command: pip install "optimum-tpu[jetstream-pt]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
Force-pushed from aab4506 to c11d2bb.
Also adapted other tests to avoid the torch-xla generator implementation, to prevent conflicts. I also added the Jetstream/Pytorch test to the CI workflow.
Force-pushed from c11d2bb to 07a71db.
        )
        return tokens, true_length

    def prefill(self, batch: Batch) -> Tuple[List[Generation], CachedBatch]:
Can you share more insights on where the server takes the request and calls prefill?
Thanks Alvaro, great work! I took a first pass and left some fairly minor comments.
        return False
    # Torch XLA should not be imported before torch_xla2 to avoid conflicts.
    if 'torch_xla2' not in sys.modules and 'torch_xla.core' in sys.modules:
        return False
nit: would it make sense to emit a warning here? Like "JETSTREAM_PT is enabled, but torch_xla2 is not installed. Falling back to torch_xla".
It's actually a little trickier than that: torch_xla2 cannot be imported after torch_xla has been imported. I will add a warning.
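A sketch of what such a warning could look like, built around the check shown in the diff quoted above; the function name, the use of the standard logging module, and the message wording are assumptions, and any other checks the PR performs are omitted here.

```python
import logging
import sys

logger = logging.getLogger(__name__)


def check() -> bool:
    """Return True when the Jetstream/Pytorch path can be used safely (partial sketch)."""
    # torch_xla2 cannot be loaded once torch_xla has already been imported,
    # so warn and fall back to the torch_xla generator in that case.
    if "torch_xla2" not in sys.modules and "torch_xla.core" in sys.modules:
        logger.warning(
            "torch_xla is already imported; torch_xla2 cannot be loaded after it, "
            "so Jetstream/Pytorch support is disabled and the torch_xla generator will be used."
        )
        return False
    return True
```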
@@ -0,0 +1,35 @@
from .generator_base import Generator
from .jetstream_pt_support import check
from .jetstream_pt_support import check as should_use_jetstream, or something along these lines, could be more descriptive. Alternatively, you could just change the check def within jetstream_pt_support.
I renamed it model_can_use_jetstream_pt.
            model_path, revision=revision, max_batch_size=max_batch_size, max_sequence_length=max_sequence_length
        )
    else:
        from .generator import TpuGenerator
It would be useful to a user to log 1) when we have successfully loaded Jetstream and 2) when we're falling back to the base generator.
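For illustration, the kind of logging being asked for could look like the small sketch below; the helper name and the message wording are invented, not the PR's code.

```python
import logging

logger = logging.getLogger(__name__)


def log_generator_choice(use_jetstream_pt: bool) -> None:
    """Log which generator backend the factory selected (illustrative helper)."""
    if use_jetstream_pt:
        logger.info("Jetstream/Pytorch engine loaded successfully; using TpuGeneratorJetStream.")
    else:
        logger.info("Falling back to the base torch_xla TpuGenerator.")
```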
Rather than re-implement llama/model_exportable.py, could we implement some type of parameter transformation logic instead? That would allow us to directly use jetstream_pt's code.
That's what I tried to do at first, but if we want to support models as they are defined in transformers, the simplest way is to extract the model parameters from the config file. In the model definition in transformers, for Llama some of the original parameters (hidden_dim, multiple_of and ffn_dim_multiplier) were combined into the intermediate_size variable. I could not see a trivial way to go back to the original values. That is why I ended up re-implementing FeedForward, and as a consequence I ended up modifying the other classes that use it. If you can think of a way to get the original parameters back in a reliable way, then I can drop most of this and just use jetstream_pt's code.
        return len(self._tokens) == 0


class TpuGeneratorJetStream(Generator):
One general comment (no need for change at this point): since this is essentially re-implementing the responsibility of JetStream's orchestrator as designed, this will lose out on features like disaggregated serving and will likely result in different performance.
Yes, I thought about it, and I agree with you that using only the engine and not the orchestrator means we will end up with different performance results. The reason why I did this was the API: the engine API is similar to TGI's model_server, while the orchestrator is not meant to be driven through a Python API, but rather through gRPC, and its interface is more similar to the one in the TGI router. So interfacing the orchestrator with TGI would mean taking the TGI requests, re-encoding them as requests for the Jetstream orchestrator, forwarding them, and then re-transcoding the responses. So yes, at some point we might need to look at a way to integrate those, but it seems more complicated and I think we can do that later.
from jetstream_pt import engine


class HfEngine(engine.PyTorchEngine):
General note (no need to respond to this within this PR): Ray support for multi-node currently lives within PyTorchRayEngine, so as is, this won't be able to take advantage of Ray multi-host. A few options:
- [within JetStream] Consolidate PyTorchRayEngine with PyTorchEngine - probably preferred, since we saw issues arise because of the decoupled design (cc @FanhaiLu1)
- [within TGI] Create a RayHfEngine or use some type of mixin
- Added warning when trying to load torch_xla2 after torch_xla
- Renamed jetstream_pt_support.check to model_can_use_jetstream_pt
Force-pushed from 1f5e9c4 to 76fbf94.
    def __call__(self, logits: jnp.ndarray) -> Tuple[jnp.ndarray, jnp.ndarray]:
        if self.temperature != 1.0:
            logits = logits / self.temperature
qq: what happens if temp = 0?
Good question @miladm. In that case the operation will give an array with [inf, -inf] values. The generation will still give some result, though probably not the one you would expect (in my case it was as if it was using greedy search). By the way, you will have the same division in the Jetstream sampling code.
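One way to guard against this, sketched here as an assumption rather than the PR's actual code, is to treat a zero temperature as greedy decoding so the division by zero never happens:

```python
import jax
import jax.numpy as jnp


def sample_with_temperature(logits: jnp.ndarray, key: jax.Array, temperature: float) -> jnp.ndarray:
    """Sample token ids, treating temperature == 0 as greedy instead of dividing by zero."""
    if temperature <= 0.0:
        # logits / 0 would produce +/-inf values; greedy decoding is the usual convention here.
        return jnp.argmax(logits, axis=-1)
    return jax.random.categorical(key, logits / temperature, axis=-1)
```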
What does this PR do?
This allows using TGI with the meta-llama/Llama-2-7b-hf model through the Jetstream/Pytorch engine. This should be the starting point for a more complete integration in the future. It is not ready yet to replace the legacy implementation, in particular because:
Before submitting