CPU accelerate solution #166

Aisuko · 2024-06-26T05:09:33Z

Aisuko
Jun 26, 2024
Maintainer

Hugging Face Transformers load gguf into fp32

After I check the code. The Hugging Face Transformers support gguf isn't what we think. According to the PR. This PR offers the ability to load .gguf files within transformers, dequantizing them to float32. This isn't we want to. So, we need to move on to the other solutions.

Based on the currently researching. Here are two paths.

There are also have some other bindings. However, I use Python as backend. So, we want to use Python binding. Furthermore, here is my thinking.

We can import Python binding through kimchima. And support CPU accelerate. And we also what already done. This is breaking change, because we will use Llama.cpp rather than transformers. And currently, we need to do investigation on how to support our Conversation history mechanism.

We can import Llama.cpp server as a seperate stateless micro-service. Our backend can be an orchestration system. It means we can also let kimchima become a micro-service.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CPU accelerate solution #166

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

CPU accelerate solution #166

Aisuko Jun 26, 2024 Maintainer

Hugging Face Transformers load gguf into fp32

Reference

Replies: 0 comments

Aisuko
Jun 26, 2024
Maintainer