Randomly distribute traffic across multiple workers of the same model #2857

Open · LinJianping opened this issue Feb 13, 2025 · 9 comments

@LinJianping
Feature request

I have deployed one supervisor and two qwen2-vl-7b-instruct workers. However, I've noticed that clients can currently only query by model uid. I would like to query models by name, such as qwen2-vl-7b-instruct, and have traffic randomly distributed across multiple workers serving the same model. Is this currently supported?
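For context, a minimal sketch of the current uid-based client flow (the uid string and addresses below are hypothetical placeholders; only the value returned by launch_model is authoritative):

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
# Today a client must address a model by the uid returned from launch_model:
model = client.get_model("qwen2-vl-instruct-0")  # hypothetical uid
# Desired: look the model up by name and let the supervisor pick a replica.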

Motivation

Additional workers should be deployed for models that require more resources.

Your contribution

None

@XprobeBot added this to the v1.x milestone Feb 13, 2025
@qinxuye (Contributor) commented Feb 13, 2025

Launch the same model with replica=2; the model will then have 2 replicas, one on each of the 2 workers.
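A minimal sketch of that launch, assuming launch_model accepts a replica keyword (as in recent Xinference client releases) and reusing the placeholder addresses from this thread:

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
# A single launch call; the supervisor places one replica per available worker
# and balances incoming requests between them.
model_uid = client.launch_model(
  model_engine="transformers",
  model_name="qwen2-vl-instruct",
  model_format="pytorch",
  model_size_in_billions=7,
  replica=2,  # assumed keyword: number of replicas of this model
)
print('LLM Model uid: ' + model_uid)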

@LinJianping (Author)

> Launch the same model with replica=2; the model will then have 2 replicas, one on each of the 2 workers.

In my situation, I intend to launch multiple GPU Docker instances, each automatically starting one xinference worker. Is this scenario suitable for the replica configuration?
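A sketch of one such instance, assuming the published xprobe/xinference image (the tag, GPU selection, and addresses are illustrative placeholders):

# One container per GPU; each container runs a worker that registers with the supervisor.
docker run -d --gpus '"device=0"' xprobe/xinference:latest \
  xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418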

@qinxuye (Contributor) commented Feb 13, 2025

That should work well.

@LinJianping (Author) commented Feb 13, 2025

> That should work well.

Start the supervisor:

xinference-supervisor -H SupervisorAddress -p 8416 --supervisor-port 8417

Start worker 1:

xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418

Launch model 1 on worker 1:

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
# Each launch_model call creates an independent model instance and returns its uid.
model_uid = client.launch_model(
  model_engine="transformers",
  model_name="qwen2-vl-instruct",
  model_format="pytorch",
  model_size_in_billions=7  # whole-number sizes are ints; strings are for sizes like "1_8"
)
print('LLM Model uid: ' + model_uid)

Start worker 2:

xinference-worker -e "http://SupervisorAddress:8416" -H "Worker2Address" --worker-port 8418

Launch model 2 on worker 2:

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
# Second, independent launch: same model name, but a separate instance with its own uid.
model_uid = client.launch_model(
  model_engine="transformers",
  model_name="qwen2-vl-instruct",
  model_format="pytorch",
  model_size_in_billions=7
)
print('LLM Model uid: ' + model_uid)

How can I modify my deployment method?

@LinJianping (Author)

> [quoting the previous comment in full]

I have tested that if all the GPU Docker instances are ready and all workers have started, then launching the model once with replica=2 works. However, in my scenario I want to dynamically add workers to an existing model. Is there any way to achieve this?

@qinxuye (Contributor) commented Feb 13, 2025

Oh, you mean dynamically scaling replicas, e.g. from 1 to 2 and then to 3?

@LinJianping (Author) commented Feb 14, 2025

> Oh, you mean dynamically scaling replicas, e.g. from 1 to 2 and then to 3?

Yes, the replica count may need to be adjusted dynamically after the initial model launch as traffic changes. I am hoping for support for adding/removing workers and increasing/decreasing model replicas dynamically after the first launch.

@qinxuye (Contributor) commented Feb 14, 2025

Sorry, this functionality is part of the enterprise version.

@LinJianping (Author)

> Sorry, this functionality is part of the enterprise version.

Got it, thank you for your kind reply.
