
Supporting Multi-LoRA inferencing via JetStream server #221

Open
wants to merge 32 commits into base: main
Conversation


@aman2930 aman2930 commented Mar 6, 2025

Supporting Multi-LoRA inferencing via the JetStream server, following the LLM Inference Gateway API protocols.

  • Implemented an adapter_tensorstore to load, store, manage, and unload adapter weights.
  • Added and exposed the required metrics at the Prometheus endpoint.
  • Added a multi_lora_decoding service with the corresponding APIs per the requirement.
  • Implemented single-LoRA functionality support.

…tAdapters, LoadAdapter and UnloadAdapter. 2) The Driver, which holds the list of all loaded base parameters, now also stores the list of LoRA-updated parameters for each loaded LoRA. Implemented methods for loading, unloading, and listing LoRA adapters on the Driver object. The original base-model params are kept intact and saved into the params dictionary with key . 3) Created a proxy client to make MultiAdapterManager service requests to the JetStream server.
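For illustration, the Driver's side-by-side storage of base and per-adapter params could look roughly like the sketch below; the dictionary key, variable names, and helper are assumptions, not the exact code in this PR.

```python
# Illustrative sketch only: the real Driver keeps the base params under a reserved key
# in its params dictionary; the key and helper names here are assumptions.
base_params = {"decoder": "..."}          # stands in for the loaded base-model params
lora_updated_params = {"decoder": "..."}  # stands in for params with one LoRA applied

loaded_params = {
    "base_model": base_params,        # original base-model params, kept intact
    "my_adapter": lora_updated_params,
}

def params_for(adapter_id: str):
    """Return the params for the requested adapter, falling back to the base model."""
    return loaded_params.get(adapter_id, loaded_params["base_model"])
```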
…pters. Its functionality includes loading and unloading adapters between CPU RAM and HBM. It also follows an LRU policy to evict an adapter when a new load_adapter request arrives. Currently it stores each adapter as separate tensors (lora_a and lora_b); the lora_b x lora_a product is computed in prefill() and generate() during the decode request. The adapter_tensorstore can be configured with a max limit on HBM and RAM.
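For illustration, the on-the-fly lora_b x lora_a computation mentioned above could look roughly like this; the shapes, scaling factor, and function name are assumptions, not the code in this PR.

```python
import jax.numpy as jnp

def apply_lora_delta(base_weight, lora_a, lora_b, scaling=1.0):
    """Fold a low-rank LoRA update into a base weight on the fly.

    Assumed (illustrative) shapes:
      base_weight: (out_dim, in_dim)
      lora_a:      (rank, in_dim)
      lora_b:      (out_dim, rank)
    Computing lora_b @ lora_a here, rather than storing the full-size delta,
    is why the tensorstore only needs to keep the two small factors.
    """
    return base_weight + scaling * jnp.matmul(lora_b, lora_a)
```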

2) Added functionality to load from a catalog file at server start. If no file is given, only the base params are loaded. Loading from the catalog file is done into CPU RAM; after that, based on incoming requests, those params are moved to or evicted from HBM.
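A rough sketch of that startup path follows; the catalog schema, file format, and tensorstore method name are assumptions for illustration only.

```python
import json

def preload_adapters_to_ram(tensorstore, catalog_path=None):
    """Load the adapters listed in a catalog file into CPU RAM at server start.

    If no catalog is given, only the base params are served. Adapters are later
    moved into HBM (or evicted back to RAM) on demand as requests arrive.
    The catalog schema below is an illustrative assumption:
      {"adapters": [{"id": "my_adapter", "path": "gs://bucket/adapters/my_adapter"}]}
    """
    if catalog_path is None:
        return
    with open(catalog_path) as f:
        catalog = json.load(f)
    for entry in catalog.get("adapters", []):
        # Hypothetical tensorstore method; the real API in this PR may differ.
        tensorstore.load_adapter_to_ram(entry["id"], entry["path"])
```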
3) Some proto updates so that only a single path is passed for each adapter; that path is expected to contain an adapter_config.json and Orbax-format weights in the 0/items folder.
…n API (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/docs/proposals/003-model-server-protocol/README.md#inference-api-protocol),  & .

2) Added a flag to explicitly run the JetStream server with these APIs when . Otherwise only the older Decode() & HealthCheck() APIs of the JetStream server are exposed.
3) Fixed a bug in the adapter_tensorstore when converting between jnp and np arrays.
4) Added a  which makes requests to the new APIs (v1/load_lora_adapter, v1/unload_lora_adapter, v1/models, v1/completions), as sketched below.
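For illustration only, client calls against these four endpoints might look roughly like this; the host/port and request-body fields are assumptions based on the gateway protocol, not taken from this PR.

```python
import requests

BASE = "http://localhost:8000"  # assumed host/port for the JetStream HTTP frontend

# Load an adapter (body fields follow the common load_lora_adapter convention; assumed here).
requests.post(f"{BASE}/v1/load_lora_adapter",
              json={"lora_name": "my_adapter", "lora_path": "gs://bucket/adapters/my_adapter"})

# List the base model and adapters currently known to the server.
print(requests.get(f"{BASE}/v1/models").json())

# Send a completion request routed to a specific adapter.
resp = requests.post(f"{BASE}/v1/completions",
                     json={"model": "my_adapter", "prompt": "Hello", "max_tokens": 16})
print(resp.json())

# Unload the adapter when it is no longer needed.
requests.post(f"{BASE}/v1/unload_lora_adapter", json={"lora_name": "my_adapter"})
```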
1) kv_cache_utilization: the percentage of memory in the allocated KV cache on TPU HBM that is actually used during decode, based on the percentage of slots in use.
2) num_requests_waiting: the total number of requests waiting to be decoded.
3) lora_requests_info: the list of LoRA adapters loaded into TPU HBM for serving requests (see the registration sketch below).
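A minimal sketch of registering these gauges with prometheus_client; the label name and update helper are assumptions, not the PR's exact wiring.

```python
from prometheus_client import Gauge

# Metric names mirror the list above; label names and update cadence are assumptions.
kv_cache_utilization = Gauge(
    "kv_cache_utilization",
    "Percentage of allocated KV-cache slots on TPU HBM in use during decode")
num_requests_waiting = Gauge(
    "num_requests_waiting",
    "Number of requests waiting to be decoded")
lora_requests_info = Gauge(
    "lora_requests_info",
    "LoRA adapters currently loaded in TPU HBM for serving",
    ["running_lora_adapters"])

def update_metrics(used_slots, total_slots, waiting, loaded_adapters):
    """Push a snapshot of server state into the gauges (illustrative helper)."""
    kv_cache_utilization.set(100.0 * used_slots / total_slots)
    num_requests_waiting.set(waiting)
    lora_requests_info.labels(running_lora_adapters=",".join(loaded_adapters)).set(1)
```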
2) Fixing model_ckpt_conversion.sh after refactoring and merging from main.
@aman2930 aman2930 requested a review from vipannalla as a code owner March 6, 2025 22:24
@aman2930 aman2930 requested review from yixinshi, vipannalla and gangji and removed request for vipannalla March 6, 2025 22:28

@mailvijayasingh mailvijayasingh left a comment


Looked at it at a high level and left some comments. Will take a deeper look again.


@vipannalla vipannalla left a comment


Thanks for the PR, it's a bit longish and I'd have preferred you to send the adapter_tensorstore.py and related code as a separate PR, since it's isolated enough, along with the unit tests, before sending the PR to integrate it into the orchestrator.

I've some initial comments.


@vipannalla vipannalla left a comment


Looks good for the initial version.

aman2930 added 9 commits April 1, 2025 12:57
… for updating the AdapterTensorstore metadata. It is fixed with a design pattern of unsafe methods, delegating the responsibility of acquiring the locks to the caller. These unsafe methods are for internal use only; all public methods are safe and acquire the locks themselves (a sketch of this pattern appears below).

- Added unit tests for LoRA implementation workflow and adapter_tensorstore.
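A minimal sketch of the unsafe/safe locking convention mentioned in the commit above; the class and method names here are assumptions, not the PR's actual code.

```python
import asyncio

class AdapterTensorStoreSketch:
    """Illustrative sketch of the locking convention (not the PR's actual class)."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._metadata = {}

    def _unsafe_set_status(self, adapter_id, status):
        # Internal-only helper: the caller must already hold self._lock.
        self._metadata[adapter_id] = status

    async def load_adapter(self, adapter_id):
        # Public method: safe, acquires the lock itself and delegates to unsafe helpers.
        async with self._lock:
            self._unsafe_set_status(adapter_id, "loading")
            # ... transfer weights ...
            self._unsafe_set_status(adapter_id, "loaded")
```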