Randomly distribute traffic across multiple workers of the same model #2857

Open · LinJianping opened this issue Feb 13, 2025 · 9 comments

@LinJianping
Feature request

I have deployed one supervisor and two qwen2-vl-7b-instruct workers. However, I've noticed that clients can currently only query by model uid. I would like to query models by name, such as qwen2-vl-7b-instruct, and have traffic randomly distributed across multiple workers serving the same model. Is this currently supported?
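For context, a minimal sketch of the current uid-based client flow (the uid string and addresses below are hypothetical placeholders; only the value returned by launch_model is authoritative):

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
# Today a client must address a model by the uid returned from launch_model:
model = client.get_model("qwen2-vl-instruct-0")  # hypothetical uid
# Desired: look the model up by name and let the supervisor pick a replica.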

Motivation

Additional workers should be deployed for models that require more resources.

Your contribution

None

@XprobeBot added this to the v1.x milestone Feb 13, 2025
@qinxuye (Contributor) commented Feb 13, 2025

Launch the same model with replica=2; the model will then have 2 replicas, one on each of the 2 workers.
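A minimal sketch of that launch, assuming launch_model accepts a replica keyword (as in recent Xinference client releases) and reusing the placeholder addresses from this thread:

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
# A single launch call; the supervisor places one replica per available worker
# and balances incoming requests between them.
model_uid = client.launch_model(
  model_engine="transformers",
  model_name="qwen2-vl-instruct",
  model_format="pytorch",
  model_size_in_billions=7,
  replica=2,  # assumed keyword: number of replicas of this model
)
print('LLM Model uid: ' + model_uid)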

@LinJianping (Author)

> Launch the same model with replica=2; the model will then have 2 replicas, one on each of the 2 workers.

In my situation, I intend to launch multiple GPU Docker instances, each automatically starting one xinference worker. Is this scenario suitable for the replica configuration?
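A sketch of one such instance, assuming the published xprobe/xinference image (the tag, GPU selection, and addresses are illustrative placeholders):

# One container per GPU; each container runs a worker that registers with the supervisor.
docker run -d --gpus '"device=0"' xprobe/xinference:latest \
  xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418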

@qinxuye (Contributor) commented Feb 13, 2025

That should work well.

@LinJianping (Author) commented Feb 13, 2025

> That should work well.

Start the supervisor:

xinference-supervisor -H SupervisorAddress -p 8416 --supervisor-port 8417

Start worker 1:

xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418

Launch model 1 on worker 1:

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
# Each launch_model call creates an independent model instance and returns its uid.
model_uid = client.launch_model(
  model_engine="transformers",
  model_name="qwen2-vl-instruct",
  model_format="pytorch",
  model_size_in_billions=7  # whole-number sizes are ints; strings are for sizes like "1_8"
)
print('LLM Model uid: ' + model_uid)

Start worker 2:

xinference-worker -e "http://SupervisorAddress:8416" -H "Worker2Address" --worker-port 8418

Launch model 2 on worker 2:

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
# Second, independent launch: same model name, but a separate instance with its own uid.
model_uid = client.launch_model(
  model_engine="transformers",
  model_name="qwen2-vl-instruct",
  model_format="pytorch",
  model_size_in_billions=7
)
print('LLM Model uid: ' + model_uid)

How can I modify my deployment method?

@LinJianping (Author)

> [quoting the previous comment in full]

I have tested that if all the GPU Docker instances are ready and all workers have started, then launching the model once with replica=2 works. However, in my scenario I want to dynamically add workers to an existing model. Is there any way to achieve this?

@qinxuye (Contributor) commented Feb 13, 2025

Oh, you mean dynamically scaling replicas, e.g. from 1 to 2 and then to 3?

@LinJianping (Author) commented Feb 14, 2025

> Oh, you mean dynamically scaling replicas, e.g. from 1 to 2 and then to 3?

Yes, the replica count may need to be adjusted dynamically after the initial model launch as traffic changes. I am hoping for support for adding/removing workers and increasing/decreasing model replicas dynamically after the first launch.

@qinxuye (Contributor) commented Feb 14, 2025

Sorry, this functionality is part of the enterprise version.

@LinJianping (Author)

> Sorry, this functionality is part of the enterprise version.

Got it, thank you for your kind reply.
