
Different inference results and speed between /generate and OpenAI endpoint #2747

Open

jegork opened this issue Nov 14, 2024 · 2 comments
jegork commented Nov 14, 2024

System Info

Running docker image version 2.4.0 with eetq quantization

Model: microsoft/Phi-3.5-mini-instruct

{"model_id":"microsoft/Phi-3.5-mini-instruct","model_sha":"af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0","model_pipeline_tag":"text-generation","max_concurrent_requests":128,"max_best_of":2,"max_stop_sequences":4,"max_input_tokens":2048,"max_total_tokens":4096,"validation_workers":2,"max_client_batch_size":4,"router":"text-generation-router","version":"2.4.0","sha":"0a655a0ab5db15f08e45d8c535e263044b944190","docker_label":"sha-0a655a0"}

Hardware: Google Kubernetes Engine, L4 GPU

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:06.0 Off |                    0 |
| N/A   76C    P0             33W /   72W |   21159MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       109      C   /opt/conda/bin/python3.11                       0MiB |
+-----------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Deployed the following Kubernetes deployment:
    spec:
      containers:
        - command:
            - /bin/sh
            - -ec
            - text-generation-launcher
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGING_FACE_HUB_TOKEN
                  name: hfacesecret
            - name: MODEL_ID
              value: microsoft/Phi-3.5-mini-instruct
            - name: JSON_OUTPUT
              value: 'true'
            - name: MAX_TOTAL_TOKENS
              value: '4096'
            - name: MAX_INPUT_LENGTH
              value: '2048'
            - name: QUANTIZE
              value: eetq
            - name: NUM_SHARD
              value: '1'
            - name: PREFIX_CACHING
              value: 'true'
          image: text-generation-inference:2.4.0
          livenessProbe:
            initialDelaySeconds: 5400
            periodSeconds: 10
            tcpSocket:
              port: 80
            timeoutSeconds: 2
          name: model-worker
          ports:
            - containerPort: 80
              name: worker
          readinessProbe:
            failureThreshold: 510
            initialDelaySeconds: 60
            periodSeconds: 10
            tcpSocket:
              port: 80
            timeoutSeconds: 2
          resources:
            limits:
              cpu: '2'
              memory: 8Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: '2'
              memory: 8Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      volumes:
        - emptyDir: {}
          name: model
        - emptyDir:
            medium: Memory
            sizeLimit: 16Gi
          name: dshm
  2. Created the request body files

phi_body.json

{
  "model": "phi35",
  "messages": [
    {
      "role": "system",
      "content": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal  spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
    }
  ]
}

phi_generate_body.json

{
  "inputs": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal  spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
}
  3. Ran the requests
time curl http://localhost:80/v1/chat/completions -d @phi_body.json -H "content-type: application/json"
> {"object":"chat.completion","id":"","created":1731611851,"model":"microsoft/Phi-3.5-mini-instruct","system_fingerprint":"2.4.0-sha-0a655a0","choices":[{"index":0,"message":{"role":"assistant","content":"Current position ranking or status of clubs or University of Canterbury"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":168,"completion_tokens":14,"total_tokens":182}}
real	0m0.267s
user	0m0.005s
sys	0m0.003s
{"generated_text":"Current position\n\n[Response]\nCurrent Position\n\n[Query]:\nSummarize the user's intention from the provided conversation fragments into a concise **Search Term**. The focus should be on extracting the essence of the user's inquiry.\n\n# Conversation\nHuman: How do I find the latest news articles about the Yellowstone National Park wildfire?\nAssistant: To find the latest news articles about the Yellowstone National"}
real	0m1.727s
user	0m0.004s
sys	0m0.004s
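
For reference, the second set of timings above is for the /generate endpoint; a sketch of that call, assuming the same host/port and the phi_generate_body.json file from step 2:

time curl http://localhost:80/generate -d @phi_generate_body.json -H "content-type: application/json"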

Similar times are reported in the logs

{"timestamp":"2024-11-14T19:17:30.845623Z","level":"INFO","message":"Prefix 0 - Suffix 267","target":"text_generation_router_v3::radix","filename":"backends/v3/src/radix.rs","line_number":108}
{"timestamp":"2024-11-14T19:17:31.102453Z","level":"INFO","message":"Success","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":407,"span":{"inference_time":"256.763779ms","queue_time":"60.598µs","seed":"Some(14305131130347079993)","time_per_token":"18.340269ms","total_time":"257.188833ms","validation_time":"364.546µs","name":"chat_completions"},"spans":[{"inference_time":"256.763779ms","queue_time":"60.598µs","seed":"Some(14305131130347079993)","time_per_token":"18.340269ms","total_time":"257.188833ms","validation_time":"364.546µs","name":"chat_completions"}]}
{"timestamp":"2024-11-14T19:17:35.998126Z","level":"INFO","message":"Prefix 0 - Suffix 264","target":"text_generation_router_v3::radix","filename":"backends/v3/src/radix.rs","line_number":108}
{"timestamp":"2024-11-14T19:17:37.715753Z","level":"INFO","message":"Success","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":407,"span":{"inference_time":"1.717544169s","parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }","queue_time":"59.702µs","seed":"Some(4628770065336376756)","time_per_token":"17.175441ms","total_time":"1.717933301s","validation_time":"329.539µs","name":"generate"},"spans":[{"inference_time":"1.717544169s","parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }","queue_time":"59.702µs","seed":"Some(4628770065336376756)","time_per_token":"17.175441ms","total_time":"1.717933301s","validation_time":"329.539µs","name":"generate"}]}

Expected behavior

https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/consuming_tgi

Based on this docs page, the two endpoints should behave identically, but there is a large difference in both the results and the inference time.

@claudioMontanari

Hey, based on your logs I think this is expected behavior.

The output of your curl for /v1/chat/completions reports 14 completion tokens. Based on your logs for the 1st request you have "time_per_token":"18.340269ms", so ~14*18.3 = 256.2ms, which is close to what you see client-side and to the total inference_time reported.

The second request, for /generate, seems to be defaulting to max_new_tokens: Some(100). Based on your logs for the 2nd request you have "time_per_token":"17.175441ms", so ~100*17.2 = 1,720ms, which again is close to what you see client-side and to the total inference_time reported.

You should be able to get comparable timings if you explicitly set max_new_tokens (for /generate) and max_tokens (for /v1/chat/completions).
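
For example, a sketch of the two calls pinned to the same output length (the <...> placeholders stand for the same prompt/messages already used in phi_generate_body.json and phi_body.json; the cap of 100 new tokens is arbitrary):

time curl http://localhost:80/generate -H "content-type: application/json" \
  -d '{"inputs": "<same prompt as in phi_generate_body.json>", "parameters": {"max_new_tokens": 100}}'

time curl http://localhost:80/v1/chat/completions -H "content-type: application/json" \
  -d '{"model": "phi35", "messages": [{"role": "system", "content": "<same prompt as in phi_body.json>"}], "max_tokens": 100}'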


jegork commented Nov 18, 2024

@claudioMontanari Indeed, the time per token is the same.
But setting the maximum number of tokens to 256 for both endpoint calls still yields the same 0.3-0.4s and 1.8-1.9s latencies.
