Update model_loader deps and qqq quantization deps #2220
base: main
There are some failures due to |
Overall LGTM; I left some comments.
Except for rope, vllm.distributed, and quant, everything else that depends on vllm needs to be removed, including some utils.
BTW, python/sglang/srt/models/phi3_small.py should also be handled.
import torch
from torch.nn.parameter import Parameter
from torchao.ops import marlin_qqq_gemm
ImportError: cannot import name 'marlin_qqq_gemm' from 'torchao.ops'
This is most likely a version issue: the current torchao release (v0.6.1) does not include qqq.
Perhaps we can introduce qqq in the next PR, after torchao releases a new version. How about that, @HandH1998?
ok
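For illustration only, the ImportError discussed in this thread could also be guarded against with an optional import, so that the module still loads on torchao releases that lack the kernel. This is a minimal sketch and not what the PR does (the agreed plan is to defer qqq to a later PR). The helper names `optional_attr` and `load_qqq_kernel` are hypothetical; only `torchao.ops` and `marlin_qqq_gemm` come from the diff above.

```python
import importlib
from typing import Any, Optional


def optional_attr(module_name: str, attr: str) -> Optional[Any]:
    """Return module_name.attr, or None if the module or attribute is missing.

    Hypothetical helper: it only covers the ImportError case seen in this
    thread, not other failures that may occur while a module is loading.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


def load_qqq_kernel() -> Optional[Any]:
    """Look up the qqq GEMM kernel named in the diff, if this torchao has it."""
    return optional_attr("torchao.ops", "marlin_qqq_gemm")


def require_qqq_kernel() -> Any:
    """Fail with a readable message when qqq quantization is requested."""
    kernel = load_qqq_kernel()
    if kernel is None:
        raise RuntimeError(
            "qqq quantization needs a torchao release that provides "
            "torchao.ops.marlin_qqq_gemm"
        )
    return kernel
```

With this shape, the lookup happens only when qqq is actually requested, so other quantization paths are unaffected by the installed torchao version.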
from torchao.ops import marlin_qqq_gemm
from torchao.quantization.utils import dynamically_quantize_per_channel
from vllm.model_executor.layers.linear import LinearBase, LinearMethodBase
from vllm.model_executor.parameter import (
This part also needs to be migrated.
from typing import Optional

from torch import nn
from vllm.config import (
This part also needs to be migrated.
from torch import nn
from transformers import AutoModelForCausalLM, PretrainedConfig
from transformers.utils import SAFE_WEIGHTS_INDEX_NAME
from vllm.config import (
This part also needs to be migrated.
get_tensor_model_parallel_rank,
get_tensor_model_parallel_world_size,
)
from vllm.envs import VLLM_USE_MODELSCOPE
L39:46 This part also needs to be migrated.
import torch
from torch import nn
from vllm.config import ModelConfig
This part also needs to be migrated.
from huggingface_hub import HfFileSystem, hf_hub_download, snapshot_download
from safetensors.torch import load_file, safe_open, save_file
from tqdm.auto import tqdm
from vllm.config import LoadConfig, ModelConfig
L22:L25 This part also needs to be migrated. (except vllm.distributed)
Motivation
Update the model_loader deps and qqq quantization deps for SGLang.
Modifications
We modified the relevant code primarily according to vLLM. Thanks to the vLLM team for their significant contributions. The main modifications are listed here.
model_loader: code from https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/model_loader, modified and adapted for SGLang. The updated model_loader code is located at python/sglang/srt/model_loader.
registry.py: located at python/sglang/srt/models/registry.py, with all the models registered into class ModelRegistry. Consequently, we removed all monkey patches in python/sglang/srt/model_executor/model_runner.py.
qqq: marlin_qqq_gemm from torchao. For more details on qqq, please refer to our paper and our code repo.
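To make the registry item above concrete, here is a minimal sketch of how a central registry can replace monkey patching: model classes are registered under their architecture names, and the loader resolves a class by name. The class body, method names, and the toy model are assumptions for illustration; this is not the actual contents of python/sglang/srt/models/registry.py.

```python
from typing import Dict, List


class ModelRegistry:
    """Hypothetical registry mapping architecture names to model classes."""

    def __init__(self) -> None:
        self._models: Dict[str, type] = {}

    def register(self, arch: str, model_cls: type) -> None:
        # Reject duplicates so two models cannot silently shadow each other.
        if arch in self._models:
            raise ValueError(f"architecture already registered: {arch}")
        self._models[arch] = model_cls

    def resolve(self, architectures: List[str]) -> type:
        # Return the class for the first architecture name that is known.
        for arch in architectures:
            if arch in self._models:
                return self._models[arch]
        raise ValueError(
            f"none of {architectures} is supported; "
            f"known architectures: {self.supported_archs()}"
        )

    def supported_archs(self) -> List[str]:
        return sorted(self._models)


class ToyLlamaForCausalLM:
    """Stand-in model class used only for this example."""


registry = ModelRegistry()
registry.register("LlamaForCausalLM", ToyLlamaForCausalLM)
model_cls = registry.resolve(["LlamaForCausalLM"])
```

Because lookups go through one explicit object, a caller such as a model runner can ask the registry for a class instead of patching another library's lookup function at import time.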