
More Inference Endpoints features and fixes #68

Merged · 12 commits merged from debug-tgi-ie-pt2 into main on Jul 6, 2024
Conversation

tengomucho (Collaborator)

What does this PR do?

Several changes to improve Inference Endpoints and TGI overall.

  • Better handling of exceptions in threads. They should not happen, but when they do it is even worse if no error surfaces.
  • The entrypoint sets log output to JSON by default.
  • Add prefill bucketing, so that XLA compilation happens less often. The price is that some tokens are wasted when a bucket's space is not fully used (see the sketch after this list).
  • Fix an input truncation bug.
  • Enable logs on child processes.
  • Warmup tries all possible combinations of prefill sequence length and batch size, to speed up inference and avoid surprises later.
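
Roughly, bucketing means rounding each prompt length up to one of a fixed set of sizes, so XLA only ever sees a handful of input shapes. A minimal sketch of the idea (the bucket values and the helper name are illustrative, not this PR's exact code):

    import bisect

    # Illustrative bucket sizes; the real list is defined in the PR.
    PREFILL_LENGTHS = [32, 64, 128, 256, 512, 1024]

    def bucket_length(input_length: int) -> int:
        # Round the prompt length up to the nearest bucket so XLA compiles one
        # graph per bucket instead of one per exact length. The padding between
        # the real length and the bucket is the "wasted tokens" mentioned above.
        idx = bisect.bisect_left(PREFILL_LENGTHS, input_length)
        return PREFILL_LENGTHS[min(idx, len(PREFILL_LENGTHS) - 1)]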

This will raise an error, signaling there was a problem. Before, the root thread was getting stuck waiting for an agent that was already dead; this way it should exit.
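
A common way to surface a worker-thread failure to the root thread is to capture the exception and re-raise it on join. A sketch of that pattern (one possible shape, not necessarily this PR's implementation):

    import threading

    class PropagatingThread(threading.Thread):
        # Stores any exception raised in run() and re-raises it on join(), so
        # the root thread fails loudly instead of waiting on a dead worker.
        def run(self):
            self.exc = None
            try:
                super().run()
            except BaseException as e:
                self.exc = e

        def join(self, timeout=None):
            super().join(timeout)
            if self.exc is not None:
                raise self.exc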
It is possible to disable the JSON log output by setting JSON_OUTPUT_DISABLE.
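
The toggle presumably amounts to an environment check along these lines (a sketch: only the JSON_OUTPUT_DISABLE name comes from this PR; the launcher invocation and its --json-output flag are assumptions about the surrounding setup):

    import os
    import subprocess

    # JSON log output is on by default and skipped when the variable is set.
    args = ["text-generation-launcher"]
    if "JSON_OUTPUT_DISABLE" not in os.environ:
        args.append("--json-output")
    subprocess.run(args, check=True)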
It is now also possible to play with more batch sizes.
This will further simplify the implementation of prefill bucketing.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Truncation was sub-optimal, and it was done on the wrong side.
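
For context, the side matters because causal generation needs the end of the prompt: truncating on the right drops the most recent tokens. With a transformers tokenizer, keeping the last tokens looks like this (a sketch, not the PR's code; "gpt2" is just a stand-in model):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # Truncate on the left so the end of the prompt is kept.
    tokenizer.truncation_side = "left"
    inputs = tokenizer("a very long prompt ...", truncation=True, max_length=128)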
This will prevent XLA compilation at inference time. Note that I had to disable dynamo compilation, though; otherwise the model was not generating correct results. This leads to slower generation, but at least generation seems stable now.
This will allow for testing IE before release.
@tengomucho tengomucho marked this pull request as ready for review July 4, 2024 21:31
@@ -3,6 +3,9 @@ name: Release
 on:
   release:
     types: [published]
+  push:
+    branches:
+      - debug-tgi-ie-pt2
Member

Is it still required?

Collaborator Author

Nope, I can remove it

for _ in reversed(range(batch_size)):
    # Prefill with different truncate sizes to test all prefill lengths. The list
    # is reversed so the longest sequences are tested first and, if there is a
    # memory failure, it will appear sooner.
    for l in reversed(PREFILL_LENGTHS):
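
Testing the combinations longest-first means a worst-case out-of-memory failure surfaces during warmup rather than later, mid-serving.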
Member

We will certainly need to cap PREFILL_LENGTHS w.r.t. the model config's max_position_embeddings, otherwise it might generate lengths that are too large or unsupported.

Collaborator Author

Good point. For now it will stop at the sequence length passed as a parameter, but we can also add this check. I will do it in a future PR.
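
That deferred check could look something like this (hypothetical: capped_prefill_lengths is an invented name, and max_position_embeddings is the usual transformers config attribute):

    def capped_prefill_lengths(prefill_lengths, sequence_length, config):
        # Keep only warmup lengths supported by both the serving configuration
        # and the model's position embeddings.
        max_len = min(sequence_length,
                      getattr(config, "max_position_embeddings", sequence_length))
        return [l for l in prefill_lengths if l <= max_len]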

@tengomucho tengomucho merged commit fd29591 into main Jul 6, 2024
4 checks passed
@tengomucho tengomucho deleted the debug-tgi-ie-pt2 branch July 6, 2024 11:08