Add timeout param for DocSum and FaqGen to deal with long context #1329

Open · wants to merge 4 commits into base: main
1 change: 1 addition & 0 deletions comps/cores/proto/api_protocol.py
@@ -195,6 +195,7 @@ class ChatCompletionRequest(BaseModel):
# top_p: Optional[float] = None # Priority use openai
typical_p: Optional[float] = None
# repetition_penalty: Optional[float] = None
+ timeout: Optional[int] = None

# doc: begin-chat-completion-extra-params
echo: Optional[bool] = Field(
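
For context, a minimal sketch (not part of the diff) of how the new field behaves on a trimmed-down request model; the real `ChatCompletionRequest` defines many more fields:

```python
from typing import Optional

from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Trimmed to the fields relevant here; the real model has many more.
    messages: str  # simplified; the actual field accepts richer message types
    max_tokens: Optional[int] = None
    timeout: Optional[int] = None  # new: per-request timeout, in seconds


# Omitting "timeout" still validates; it defaults to None, letting each
# backend integration pick its own default (e.g. 120s for TGI).
req = ChatCompletionRequest(messages="hello", timeout=200)
print(req.timeout)  # 200
```
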
6 changes: 4 additions & 2 deletions comps/llms/src/doc-summarization/README.md
@@ -133,6 +133,8 @@ curl http://${your_ip}:9000/v1/docsum \

"summary_type" is set to be "auto" by default, in this mode we will check input token length, if it exceed `MAX_INPUT_TOKENS`, `summary_type` will automatically be set to `refine` mode, otherwise will be set to `stuff` mode.

+ With long contexts, a request may be canceled because generation takes longer than the default `timeout` value (120s for TGI). Increase the timeout as needed.

**summary_type=stuff**

In this mode the LLM generates the summary from the complete input text. Please set `MAX_INPUT_TOKENS` and `MAX_TOTAL_TOKENS` carefully according to your model and device memory; otherwise long inputs may exceed the LLM's context limit and raise an error.
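
The collapsed part of the diff contains the README's `stuff`-mode request; a sketch in the same style with the new `timeout` field added (values are illustrative):

```bash
curl http://${your_ip}:9000/v1/docsum \
  -X POST \
  -d '{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models.", "max_tokens":32, "language":"en", "summary_type": "stuff", "stream":false, "timeout":200}' \
  -H 'Content-Type: application/json'
```
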
@@ -157,7 +159,7 @@ In this mode, default `chunk_size` is set to be `min(MAX_TOTAL_TOKENS - input.ma
```bash
curl http://${your_ip}:9000/v1/docsum \
-X POST \
-d '{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false}' \
-d '{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false, "timeout":200}' \
-H 'Content-Type: application/json'
```

@@ -170,6 +172,6 @@ In this mode, default `chunk_size` is set to be `min(MAX_TOTAL_TOKENS - 2 * inpu
```bash
curl http://${your_ip}:9000/v1/docsum \
-X POST \
-d '{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000}' \
-d '{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000, "timeout":200}' \
-H 'Content-Type: application/json'
```
1 change: 1 addition & 0 deletions comps/llms/src/doc-summarization/integrations/tgi.py
@@ -70,6 +70,7 @@ async def invoke(self, input: DocSumChatCompletionRequest):
temperature=input.temperature if input.temperature else 0.01,
repetition_penalty=input.repetition_penalty if input.repetition_penalty else 1.03,
streaming=input.stream,
+ timeout=input.timeout if input.timeout is not None else 120,
server_kwargs=server_kwargs,
task="text-generation",
)
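
Note the explicit `is not None` test: writing `input.timeout or 120` instead would also discard an explicit `timeout=0`. A minimal sketch of the pattern (the helper below is hypothetical, not part of the diff):

```python
from typing import Optional


def resolve_timeout(requested: Optional[int], default: int = 120) -> int:
    # "requested or default" would wrongly replace 0; test for None instead.
    return requested if requested is not None else default


assert resolve_timeout(None) == 120  # no client value -> TGI default
assert resolve_timeout(200) == 200   # client override wins
assert resolve_timeout(0) == 0       # explicit 0 is preserved
```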
1 change: 1 addition & 0 deletions comps/llms/src/doc-summarization/integrations/vllm.py
@@ -63,6 +63,7 @@ async def invoke(self, input: DocSumChatCompletionRequest):
top_p=input.top_p if input.top_p else 0.95,
streaming=input.stream,
temperature=input.temperature if input.temperature else 0.01,
+ request_timeout=float(input.timeout) if input.timeout is not None else None,
)
result = await self.generate(input, self.client)

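On the vLLM path the value is forwarded as `request_timeout`, which the underlying OpenAI-compatible client takes as a float (presumably the reason for the `float(...)` cast). When calling the microservice, it can help to pair the server-side timeout with a slightly larger client-side one; a sketch using `requests` (host and port assumed from the README examples):

```python
import requests

payload = {
    "messages": "a long document ...",
    "summary_type": "refine",
    "timeout": 200,  # server-side generation timeout, in seconds
}

# Keep the client-side timeout above the server-side one so the server can
# time out first and return a proper error instead of a dropped connection.
resp = requests.post(
    "http://localhost:9000/v1/docsum",  # assumed endpoint from the README
    json=payload,
    timeout=210,
)
print(resp.status_code, resp.text[:200])
```
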
1 change: 1 addition & 0 deletions comps/llms/src/faq-generation/integrations/tgi.py
@@ -67,6 +67,7 @@ async def invoke(self, input: ChatCompletionRequest):
temperature=input.temperature if input.temperature else 0.01,
repetition_penalty=input.repetition_penalty if input.repetition_penalty else 1.03,
streaming=input.stream,
+ timeout=input.timeout if input.timeout is not None else 120,
server_kwargs=server_kwargs,
)
result = await self.generate(input, self.client)
1 change: 1 addition & 0 deletions comps/llms/src/faq-generation/integrations/vllm.py
@@ -60,6 +60,7 @@ async def invoke(self, input: ChatCompletionRequest):
top_p=input.top_p if input.top_p else 0.95,
streaming=input.stream,
temperature=input.temperature if input.temperature else 0.01,
+ request_timeout=float(input.timeout) if input.timeout is not None else None,
)
result = await self.generate(input, self.client)

4 changes: 2 additions & 2 deletions tests/llms/test_llms_doc-summarization_tgi.sh
@@ -125,15 +125,15 @@ function validate_microservices() {
'text' \
"docsum-tgi" \
"docsum-tgi" \
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false}'
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false, "timeout":200}'

echo "Validate refine mode..."
validate_services \
"$URL" \
'text' \
"docsum-tgi" \
"docsum-tgi" \
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000}'
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000, "timeout":200}'
}
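
The `validate_services` helper is defined elsewhere in the test harness; a hypothetical minimal stand-in, to show what these calls exercise (the real implementation differs in detail):

```bash
validate_services() {
    local url=$1 expected=$2 service=$3 container=$4 payload=$5
    local response
    response=$(curl -s -X POST "$url" \
        -H 'Content-Type: application/json' \
        -d "$payload")
    if echo "$response" | grep -q "$expected"; then
        echo "[$service] PASS"
    else
        echo "[$service] FAIL: $response"
        docker logs "$container" | tail -n 20
        exit 1
    fi
}
```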

function stop_docker() {
4 changes: 2 additions & 2 deletions tests/llms/test_llms_doc-summarization_tgi_on_intel_hpu.sh
@@ -126,15 +126,15 @@ function validate_microservices() {
'text' \
"docsum-tgi-gaudi" \
"docsum-tgi-gaudi" \
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false}'
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false, "timeout":200}'

echo "Validate refine mode..."
validate_services \
"$URL" \
'text' \
"docsum-tgi-gaudi" \
"docsum-tgi-gaudi" \
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000}'
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000, "timeout":200}'
}

function stop_docker() {
4 changes: 2 additions & 2 deletions tests/llms/test_llms_doc-summarization_vllm.sh
@@ -140,15 +140,15 @@ function validate_microservices() {
'text' \
"docsum-vllm" \
"docsum-vllm" \
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false}'
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false, "timeout":200}'

echo "Validate refine mode..."
validate_services \
"$URL" \
'text' \
"docsum-vllm" \
"docsum-vllm" \
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000}'
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000, "timeout":200}'
}

function stop_docker() {
4 changes: 2 additions & 2 deletions tests/llms/test_llms_doc-summarization_vllm_on_intel_hpu.sh
@@ -139,15 +139,15 @@ function validate_microservices() {
'text' \
"docsum-vllm-gaudi" \
"docsum-vllm-gaudi" \
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false}'
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "map_reduce", "chunk_size": 2000, "stream":false, "timeout":200}'

echo "Validate refine mode..."
validate_services \
"$URL" \
'text' \
"docsum-vllm-gaudi" \
"docsum-vllm-gaudi" \
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000}'
'{"messages":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "summary_type": "refine", "chunk_size": 2000, "timeout":200}'
}

function stop_docker() {