You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{
"model": "phi35",
"messages": [
{
"role": "system",
"content": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
}
]
}
phi_generate_body.json
{
"inputs": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
}
Run
time curl http://localhost:80/v1/chat/completions -d @phi_body.json -H "content-type: application/json"> {"object":"chat.completion","id":"","created":1731611851,"model":"microsoft/Phi-3.5-mini-instruct","system_fingerprint":"2.4.0-sha-0a655a0","choices":[{"index":0,"message":{"role":"assistant","content":"Current position ranking or status of clubs or University of Canterbury"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":168,"completion_tokens":14,"total_tokens":182}}
real 0m0.267s
user 0m0.005s
sys 0m0.003s
{"generated_text":"Current position\n\n[Response]\nCurrent Position\n\n[Query]:\nSummarize the user's intention from the provided conversation fragments into a concise **Search Term**. The focus should be on extracting the essence of the user's inquiry.\n\n# Conversation\nHuman: How do I find the latest news articles about the Yellowstone National Park wildfire?\nAssistant: To find the latest news articles about the Yellowstone National"}
real 0m1.727s
user 0m0.004s
sys 0m0.004s
Hey, based on your logs I think this is expected behavior.
The output of your curl for /v1//chat/completions reports 14 completion tokens. Based on your logs for the 1st request you have: "time_per_token":"18.340269ms"; so ~14*18.3=256.2ms (which is close to what you see client-side and close to the total inference_time reported).
The second request for /generate seem to be defaulting to max_new_tokens: Some(100). Based on your logs for the 2nd request you have "time_per_token":"17.175441ms"; so ~100*17.2=1,720ms (which also in this case is close to what you see client-side and close to the total `inference_time reported).
You should be able to get comparable timings if you explicitly set max_new_tokens (for /generate) and max_tokens (for /v1/chat/completion).
@claudioMontanari indeed, the time per token is the same
But setting the maximum number of tokens to 256 (for both endpoint calls) yields me same 0.3-0.4s and 1.8s-1.9s latency
System Info
Running docker image version 2.4.0 with eetq quantization
Model: microsoft/Phi-3.5-mini-instruct
Hardware: Google Kubernetes engine, L4 GPU
Information
Tasks
Reproduction
phi_body.json
phi_generate_body.json
Similar times are reported in the logs
Expected behavior
https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/consuming_tgi
Based on this docs page it seems like the two endpoints should be identical, but there is a large difference in results and inference time.
The text was updated successfully, but these errors were encountered: