
More Inference Endpoints features and fixes #68

Merged · 12 commits merged from debug-tgi-ie-pt2 into main on Jul 6, 2024
Conversation

tengomucho (Collaborator)

What does this PR do?

Several changes to improve Inference Endpoints and TGI overall.

  • Better handling of exceptions in threads. They should not happen, but when they do it is even worse if no error surfaces.
  • The entrypoint sets log output to JSON by default.
  • Add prefill bucketing, so that XLA compilation happens less often. The price is that some tokens are wasted when a bucket's space is not fully used (see the sketch after this list).
  • Fix an input truncation bug.
  • Enable logs on child processes.
  • Warmup tries all possible combinations of prefill sequence length and batch size, to speed up inference and avoid surprises later.
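
Roughly, bucketing means rounding each prompt length up to one of a fixed set of sizes, so XLA only ever sees a handful of input shapes. A minimal sketch of the idea (the bucket values and the helper name are illustrative, not this PR's exact code):

    import bisect

    # Illustrative bucket sizes; the real list is defined in the PR.
    PREFILL_LENGTHS = [32, 64, 128, 256, 512, 1024]

    def bucket_length(input_length: int) -> int:
        # Round the prompt length up to the nearest bucket so XLA compiles one
        # graph per bucket instead of one per exact length. The padding between
        # the real length and the bucket is the "wasted tokens" mentioned above.
        idx = bisect.bisect_left(PREFILL_LENGTHS, input_length)
        return PREFILL_LENGTHS[min(idx, len(PREFILL_LENGTHS) - 1)]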

This will raise an error, signaling there was a problem. Before, the root thread was getting stuck waiting for an agent that was already dead; this way it should exit.
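
A common way to surface a worker-thread failure to the root thread is to capture the exception and re-raise it on join. A sketch of that pattern (one possible shape, not necessarily this PR's implementation):

    import threading

    class PropagatingThread(threading.Thread):
        # Stores any exception raised in run() and re-raises it on join(), so
        # the root thread fails loudly instead of waiting on a dead worker.
        def run(self):
            self.exc = None
            try:
                super().run()
            except BaseException as e:
                self.exc = e

        def join(self, timeout=None):
            super().join(timeout)
            if self.exc is not None:
                raise self.exc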
It is possible to disable the JSON log output by setting JSON_OUTPUT_DISABLE.
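
The toggle presumably amounts to an environment check along these lines (a sketch: only the JSON_OUTPUT_DISABLE name comes from this PR; the launcher invocation and its --json-output flag are assumptions about the surrounding setup):

    import os
    import subprocess

    # JSON log output is on by default and skipped when the variable is set.
    args = ["text-generation-launcher"]
    if "JSON_OUTPUT_DISABLE" not in os.environ:
        args.append("--json-output")
    subprocess.run(args, check=True)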
It is now also possible to play with more batch sizes.
This will further simplify the implementation of prefill bucketing.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Truncation was sub-optimal, and it was done on the wrong side.
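
For context, the side matters because causal generation needs the end of the prompt: truncating on the right drops the most recent tokens. With a transformers tokenizer, keeping the last tokens looks like this (a sketch, not the PR's code; "gpt2" is just a stand-in model):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # Truncate on the left so the end of the prompt is kept.
    tokenizer.truncation_side = "left"
    inputs = tokenizer("a very long prompt ...", truncation=True, max_length=128)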
This will prevent XLA compilation at inference time. Note that I had to disable dynamo compilation, though; otherwise the model was not generating correct results. This leads to slower generation, but at least generation seems stable now.
This will allow for testing IE before release.
@tengomucho tengomucho marked this pull request as ready for review July 4, 2024 21:31
@@ -3,6 +3,9 @@ name: Release
 on:
   release:
     types: [published]
+  push:
+    branches:
+      - debug-tgi-ie-pt2
Member

Is it still required?

Collaborator Author

Nope, I can remove it

for _ in reversed(range(batch_size)):
    # Prefill with different truncate sizes to test all prefill lengths. The list
    # is reversed so the longest sequences are tested first and, if there is a
    # memory failure, it will appear sooner.
    for l in reversed(PREFILL_LENGTHS):
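
Testing the combinations longest-first means a worst-case out-of-memory failure surfaces during warmup rather than later, mid-serving.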
Member

We will certainly need to cap PREFILL_LENGTHS w.r.t. the model config's max_position_embeddings, otherwise it might generate lengths that are too large or unsupported.

Collaborator Author

Good point. For now it will stop at the sequence length passed as a parameter, but we can also add this check. I will do it in a future PR.
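
That deferred check could look something like this (hypothetical: capped_prefill_lengths is an invented name, and max_position_embeddings is the usual transformers config attribute):

    def capped_prefill_lengths(prefill_lengths, sequence_length, config):
        # Keep only warmup lengths supported by both the serving configuration
        # and the model's position embeddings.
        max_len = min(sequence_length,
                      getattr(config, "max_position_embeddings", sequence_length))
        return [l for l in prefill_lengths if l <= max_len]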

@tengomucho tengomucho merged commit fd29591 into main Jul 6, 2024
4 checks passed
@tengomucho tengomucho deleted the debug-tgi-ie-pt2 branch July 6, 2024 11:08