[INTEGRATION] Expose stable kernel/packing/repacking apis #726

Open
wenhuach21 opened this issue Dec 3, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@wenhuach21

When installing via pip, the marlin kernel cannot be imported:

ValueError: Trying to use the marlin backend, but could not import the C++/CUDA dependencies with the following error: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/wenhuach/anaconda3/envs/autoround/lib/python3.10/site-packages/gptqmodel_marlin_cuda_inference.cpython-310-x86_64-linux-gnu.so)

When installing from source, the build also fails:

[image]

wenhuach21 added the bug label Dec 3, 2024
@Qubitium
Collaborator

Qubitium commented Dec 3, 2024

@wenhuach21 It appears there are two issues.

  1. Pip install failed. Can you show the stack trace for the marlin error from the pip install? It may be caused by our cached prebuilt whl. We also need your Linux OS version, kernel version, and libc/glibc version.

  2. Source build error. Can you confirm which commit or release tag you are using for the source install?

Thanks. @CSY-ModelCloud
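
A minimal sketch for collecting the details requested above, using only Python's standard library. The helper names (`collect_env_info`, `glibc_at_least`, `parse_version`) are made up for illustration and are not part of GPTQModel. The default of 2.32 comes from the `GLIBC_2.32' not found` message in the report.

```python
import platform
import sys


def parse_version(text: str) -> tuple:
    """Turn '2.31' into (2, 31); non-numeric parts are ignored."""
    return tuple(int(p) for p in text.split(".") if p.isdigit())


def collect_env_info() -> dict:
    """Gather OS, kernel, libc, and Python versions for a bug report."""
    libc_name, libc_version = platform.libc_ver()  # ('', '') if undetected
    return {
        "os": platform.platform(),
        "kernel": platform.release(),
        "libc": f"{libc_name} {libc_version}".strip() or "unknown",
        "python": sys.version.split()[0],
    }


def glibc_at_least(found: str, required: str = "2.32") -> bool:
    """True if the found glibc version satisfies the required one."""
    return parse_version(found) >= parse_version(required)


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

If the reported libc version is lower than the one named in the import error, a wheel built against a newer glibc would be a plausible cause, which matches the suspicion about the cached prebuilt whl.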

@CSY-ModelCloud
Member

We have renamed gptqmodel_marlin_cuda_inference. Can you pull the latest code, delete the build directory, and then pip install again?

@wenhuach21
Author

wenhuach21 commented Dec 3, 2024

Got it. It would be beneficial for GPTQModel to provide a backward-compatible API for layer packing and repacking, accommodating both the original AutoGPTQ linear layer and your/AutoRound fixed zero-point layer in future implementations. This would allow AutoRound to rely seamlessly on your CUDA kernels for Marlin, asymmetric quantization, and other operations.

@Qubitium
Collaborator

Qubitium commented Dec 3, 2024

> Got it. It would be beneficial for GPTQModel to provide a backward-compatible API for layer packing and repacking, accommodating both the original AutoGPTQ linear layer and your/AutoRound fixed zero-point layer in future implementations. This would allow AutoRound to rely seamlessly on your CUDA kernels for Marlin, asymmetric quantization, and other operations.

We are adding hf_select_quant_linear as an external API for the HF/optimum repo. Can AutoRound use this? The API should become stable later today or tonight.

Tracking PR: #713

The code is not ready yet; we are still finalizing it. The PR above links to the HF/optimum PR that will be submitted upstream.

@Qubitium
Collaborator

Qubitium commented Dec 3, 2024

[1-3] https://github.com/ModelCloud/GPTQModel/pull/727/files

We will expose the 3 hf_-prefixed methods as stable APIs for HF/optimum. This is still a work in progress and may change.

Correction: 4 hf_ methods.

[4] https://github.com/ModelCloud/GPTQModel/pull/728/files

@wenhuach21
Author

Thanks for the info. However, this may not help on our side: we need layer-wise packing and repacking, since AutoRound could support mixed bits or mixed group sizes.
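
To illustrate why a single model-level selection may not be enough here, the sketch below shows a layer-wise configuration with mixed bits and mixed group sizes. The `LayerQuantConfig` class, the layer names, and `group_layers_by_config` are hypothetical and belong to neither AutoRound nor GPTQModel.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerQuantConfig:
    """Hypothetical per-layer settings; not an AutoRound or GPTQModel class."""
    bits: int
    group_size: int
    sym: bool = True


# Hypothetical layer names with mixed bits and mixed group sizes.
layer_configs = {
    "model.layers.0.self_attn.q_proj": LayerQuantConfig(bits=4, group_size=128),
    "model.layers.0.mlp.down_proj": LayerQuantConfig(bits=8, group_size=128),
    "model.layers.1.self_attn.q_proj": LayerQuantConfig(bits=4, group_size=32),
}


def group_layers_by_config(configs: dict) -> dict:
    """Group layer names by identical config, so each group can be packed
    (or repacked) with a kernel chosen for that specific config."""
    groups = defaultdict(list)
    for name, cfg in configs.items():
        groups[cfg].append(name)
    return dict(groups)
```

With three distinct configurations in one model, any packing or repacking call would have to accept a per-layer config rather than a single global one.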

@Qubitium
Collaborator

Qubitium commented Dec 4, 2024

@wenhuach21 We are currently refactoring and making sure gptqmodel is correctly integrated into transformers/optimum/peft.

Can you list the exact APIs you want? Feel free to imagine any and all APIs you would like to have so that AutoRound can work with our kernels.

API stability can be enforced by locking the package dependency to a specific release, as we can't promise that internal APIs will always be stable.

Please give a detailed description, preferably with pseudocode that illustrates the usage, so I can visualize actual usage scenarios. Be as detailed as possible.
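
The idea of locking the dependency to a specific release can be sketched as a runtime check, in addition to an exact pin such as `gptqmodel==X.Y.Z` in the requirements (where `X.Y.Z` is a placeholder, not an actual release). The helper name `is_pinned_version` is an assumption for illustration; `importlib.metadata.version` is part of the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def is_pinned_version(package: str, expected: str) -> bool:
    """Return True only if `package` is installed at exactly `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        # The package is not installed at all.
        return False
```

A caller could then refuse to use internal APIs, or fall back to another backend, whenever `is_pinned_version("gptqmodel", "X.Y.Z")` is false for the release it was tested against.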

Qubitium changed the title from "[Question] install issue" to "[Feature] Expose stable kernel/packing/repacking apis" Dec 4, 2024
Qubitium changed the title from "[Feature] Expose stable kernel/packing/repacking apis" to "[INTEGRATION] Expose stable kernel/packing/repacking apis" Dec 5, 2024
@Qubitium
Collaborator

@wenhuach21 Our refactor is complete, and we are preparing for the transformers/optimum/peft upstream PRs to be merged and integrated.

Now is a good time to review exactly what you and the intel/auto-round team need from us, explicitly at the code level. Please provide detailed examples (pseudocode is okay) showing which APIs we need to expose.

@wenhuach21
Author

wenhuach21 commented Dec 10, 2024

> @wenhuach21 Our refactor is complete, and we are preparing for the transformers/optimum/peft upstream PRs to be merged and integrated.
>
> Now is a good time to review exactly what you and the intel/auto-round team need from us, explicitly at the code level. Please provide detailed examples (pseudocode is okay) showing which APIs we need to expose.

Sorry for the delayed response. At the moment, the following come to mind, as we want to support mixed-bit quantization later:

Symmetric Quantization

layer.pack(xxx, backend="marlin")  # Packs the layer using the specified format. WrapperLinear is actually fine if there are no big changes in the future.
check_packing_feasibility(xxx, backend)  # Checks whether the layer and its quantization config can be packed with the specified backend.
check_best_packing_format(xxx, target_device="cuda")  # Returns the best-performing format in your repository for the specified bit-width and group size.

Asymmetric Quantization
Since there are differences in the zero-point (zp) settings, the API should include additional arguments to reflect these variations appropriately.
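
One way the two check functions requested above could behave is sketched below. The function names, `backend="marlin"`, and `target_device="cuda"` come from the request; everything else (`LayerQuantConfig`, `BACKEND_CAPS`, `BACKEND_PRIORITY`, the `generic_cuda` backend, and every supported value in the table) is a placeholder assumption and does not describe real kernel constraints in GPTQModel. The `sym` flag stands in for the extra argument that asymmetric quantization would need.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerQuantConfig:
    """Hypothetical per-layer quantization settings."""
    bits: int
    group_size: int
    sym: bool = True


# Placeholder capability table: all values here are illustrative assumptions,
# not the actual constraints of any GPTQModel kernel.
BACKEND_CAPS = {
    "marlin": {"bits": {4}, "group_sizes": {-1, 128},
               "sym_only": True, "device": "cuda"},
    "generic_cuda": {"bits": {2, 4, 8}, "group_sizes": {-1, 32, 64, 128},
                     "sym_only": False, "device": "cuda"},
}
# Hypothetical preference order, fastest first.
BACKEND_PRIORITY = ["marlin", "generic_cuda"]


def check_packing_feasibility(cfg: LayerQuantConfig, backend: str) -> bool:
    """Check whether a layer config can be packed with the given backend."""
    caps = BACKEND_CAPS.get(backend)
    if caps is None:
        return False
    if cfg.bits not in caps["bits"] or cfg.group_size not in caps["group_sizes"]:
        return False
    # Asymmetric configs are rejected by symmetric-only backends.
    return cfg.sym or not caps["sym_only"]


def check_best_packing_format(cfg: LayerQuantConfig,
                              target_device: str = "cuda") -> Optional[str]:
    """Return the first feasible backend in priority order, or None."""
    for backend in BACKEND_PRIORITY:
        if (BACKEND_CAPS[backend]["device"] == target_device
                and check_packing_feasibility(cfg, backend)):
            return backend
    return None
```

Keeping the capability data in a table separate from the check logic would let new kernels be registered without changing the externally visible function signatures.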

@wenhuach21
Author

Sorry, I forgot to mention the repacking API and the pos_init API.

@Qubitium
Collaborator

@wenhuach21 Feel free to open a WIP PR and make core changes as you see fit. I can monitor it, and we can also connect on Teams to smooth out ideas. My only requirements are the following:

  1. If an API is exposed externally, add the hf_ prefix to its name. We are doing this for the stable APIs for transformers/optimum/peft and would follow the same principle here for the layer/packing/repacking changes. Internally it can be def pack, but if auto_round wants to call it, we would expose it as def hf_pack. That may well be just a wrapper around pack, but I would still require the wrapper so that we stay backward compatible and have a stable API.

  2. Add unit tests for the new hf_ external APIs.
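
The two requirements above can be sketched together as follows. The names `pack` and `hf_pack` come from the comment; the body of `pack` is a stand-in that packs small integers into one Python int, chosen only so the example runs, and is not GPTQModel's packing logic.

```python
import unittest


def pack(values, bits=4):
    """Stand-in internal packer (hypothetical): packs small unsigned ints
    into one Python int, lowest index in the lowest bits."""
    packed = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError(f"value {v} does not fit in {bits} bits")
        packed |= v << (i * bits)
    return packed


def hf_pack(values, bits=4):
    """Stable external entry point: a thin wrapper around the internal pack().
    The internal signature may change; this one stays backward compatible."""
    return pack(values, bits=bits)


class TestHfPack(unittest.TestCase):
    def test_matches_internal(self):
        self.assertEqual(hf_pack([1, 2, 3]), pack([1, 2, 3]))

    def test_known_value(self):
        # 1 | (2 << 4) | (3 << 8) = 1 + 32 + 768 = 0x321
        self.assertEqual(hf_pack([1, 2, 3], bits=4), 0x321)

    def test_rejects_overflow(self):
        with self.assertRaises(ValueError):
            hf_pack([16], bits=4)
```

Because external callers only touch `hf_pack`, the internal `pack` can be renamed or given new parameters later, and only the wrapper needs to be updated to keep the old behavior.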

3 participants