More Inference Endpoints features and fixes #68
Conversation
This will raise an error, signaling there was a problem. Previously the root thread was getting stuck waiting for an agent that had died; this way it should exit.
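A minimal sketch of the idea, assuming a multiprocessing-based agent and a mailbox queue (both names are illustrative, not the PR's actual API): instead of blocking forever on the mailbox, the root thread polls it and raises if the agent process has died.

```python
import multiprocessing as mp
from queue import Empty

def receive_from_agent(agent: mp.Process, mailbox, timeout: float = 1.0):
    # Poll the mailbox instead of blocking indefinitely; if the agent process
    # died, raise so the caller can exit instead of waiting forever.
    while True:
        try:
            return mailbox.get(timeout=timeout)
        except Empty:
            if not agent.is_alive():
                raise RuntimeError(
                    f"Agent process died with exit code {agent.exitcode}"
                )
```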
It is possible to disable it by setting JSON_OUTPUT_DISABLE. It is now also possible to experiment with more batch sizes.
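A sketch of how such a toggle is commonly read; the exact semantics of JSON_OUTPUT_DISABLE in this PR (presence vs. value) are an assumption:

```python
import os

# JSON log output stays enabled unless the variable is set (assumed semantics).
json_output = "JSON_OUTPUT_DISABLE" not in os.environ
```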
This will further simplify the implementation of prefill bucketing.
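For context, prefill bucketing rounds each prompt length up to the nearest predefined bucket so that only a fixed set of input shapes is ever compiled. A hedged sketch, with hypothetical bucket values:

```python
PREFILL_LENGTHS = [64, 128, 256, 512, 1024]  # illustrative bucket values

def bucket_prefill_length(n_tokens: int) -> int:
    # Round the prompt length up to the nearest bucket so the runtime compiles
    # one graph per bucket instead of one per distinct prompt length.
    for length in PREFILL_LENGTHS:
        if n_tokens <= length:
            return length
    return PREFILL_LENGTHS[-1]
```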
Force-pushed from 8738cd2 to 890d402.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Truncation was sub-optimal, and it was done on the wrong side.
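Assuming "wrong side" means tokens were dropped from the end of the prompt, the usual fix is left-side truncation, which keeps the most recent context; a minimal sketch:

```python
def truncate_left(input_ids: list[int], max_tokens: int) -> list[int]:
    # For causal generation the most recent tokens carry the relevant context,
    # so drop tokens from the beginning of the prompt, not the end.
    return input_ids[-max_tokens:]
```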
This will prevent XLA compilation at inference time. Note that I had to disable dynamo compilation, though, otherwise the model was not generating correct results. This leads to slower generation, but at least generation seems stable now.
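One common way to avoid compilation at inference time is to warm up every shape the server can receive at startup; whether this PR does exactly that is an assumption, and the function below is only a sketch:

```python
import torch

def warmup(model, batch_sizes, prefill_lengths, device):
    # Run a dummy forward pass for every (batch size, prefill length) pair so
    # all graphs are compiled at startup rather than on the first real request.
    for bs in batch_sizes:
        for length in prefill_lengths:
            input_ids = torch.zeros((bs, length), dtype=torch.long, device=device)
            attention_mask = torch.ones_like(input_ids)
            with torch.no_grad():
                model(input_ids=input_ids, attention_mask=attention_mask)
```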
This will allow for testing IE before release.
Force-pushed from 890d402 to 3145343.
```diff
@@ -3,6 +3,9 @@ name: Release
 on:
   release:
     types: [published]
+  push:
+    branches:
+      - debug-tgi-ie-pt2
```
Is it still required?
Nope, I can remove it
```python
for _ in reversed(range(batch_size)):
    # Prefill with different truncate sizes to test all prefill lengths. The list is
    # reversed so the longest sequences are tested first and, if there is a memory
    # failure, it appears sooner.
    for l in reversed(PREFILL_LENGTHS):
```
We will certainly need to cap PREFILL_LENGTHS w.r.t. the model config's max_position_embeddings, otherwise it might generate too many buckets, including unsupported lengths.
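A sketch of the suggested cap, assuming the standard transformers config attribute (the model name and bucket values are illustrative):

```python
from transformers import AutoConfig

PREFILL_LENGTHS = [64, 128, 256, 512, 1024, 2048, 4096]  # illustrative buckets

config = AutoConfig.from_pretrained("gpt2")  # illustrative model
capped_lengths = [
    l for l in PREFILL_LENGTHS if l <= config.max_position_embeddings
]
```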
Good point. For now it will stop at the sequence length passed as a parameter, but we can also add this check. I will do it in a future PR.
What does this PR do?
Several changes to improve Inference Endpoints and TGI overall.