[Question] Android app issue #3010

Open
j0h0k0i0m opened this issue Nov 4, 2024 · 2 comments
Labels
question Question about the usage

Comments

@j0h0k0i0m

❓ General Questions

Hello, I have some questions regarding the Android app.

  1. Currently, I am using q4f16_0 quantization, but there's a significant difference in prefill tokens per second compared to q4f16_1. I’m using the phi-3.5-mini model as a basis, but when testing q4f16_1, the device (Galaxy S24 Ultra) even shuts down entirely. I understand that q4f16_1 generally offers better performance, so I’d like to ask if there are any ways to improve this.

  2. Is the repetition penalty working correctly? I couldn't find a parameter for it in the ChatCompletionRequest within the app, so I'm unsure if it functions as expected. When reviewing the generated sentences, it produces a continuous sequence in a similar style, which suggests it may not be applied properly.

Thanks.
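
For context on question 2, here is a minimal sketch of one commonly used repetition-penalty formulation: the logit of every token that has already been generated is divided by the penalty when positive and multiplied by it when negative. It is only an illustration of the idea; the function name `apply_repetition_penalty` and its arguments are hypothetical, and whether the Android app's runtime applies the penalty this way (or at all) is not confirmed in this thread.

```python
def apply_repetition_penalty(logits, generated_token_ids, penalty):
    """Return a copy of `logits` with already-generated tokens made less likely.

    Illustrative only: this is a generic formulation, not the app's code.
    `logits` is a list of floats indexed by token id, and `penalty` > 1.0
    discourages repetition (1.0 leaves the logits unchanged).
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")

    adjusted = list(logits)  # do not mutate the caller's list
    for token_id in set(generated_token_ids):
        value = adjusted[token_id]
        # Positive logits shrink toward zero; negative logits move further
        # below zero. Either way the token becomes less probable.
        adjusted[token_id] = value / penalty if value > 0 else value * penalty
    return adjusted


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]
    # Tokens 0 and 1 were already generated; token 2 was not.
    print(apply_repetition_penalty(logits, [0, 1], penalty=2.0))
```

If the penalty is applied, text that keeps repeating the same tokens should become less likely as the value rises above 1.0. Seeing no change across very different penalty values would suggest the parameter is not reaching the sampler.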

j0h0k0i0m added the "question" label (Question about the usage) Nov 4, 2024
@Hzfengsy
Member

Hzfengsy commented Nov 4, 2024

On mobile phones, I don't think q4f16_1 offers better performance. In the prefill stage, q4f16_0 is much faster than q4f16_1.

@j0h0k0i0m
Author

@Hzfengsy

Thank you for replying to the issue.

It's understandable to use q4f16_0 due to the prefill stage, but I recall seeing an issue raised earlier stating that the decoding performance is lower. Currently, the prefill tokens per second for the phi-3.5-mini model (q4f16_1) are below 1, and I would like to achieve a better quality response than q4f16_0 with a suitable prefill. Is there any way to do this?
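
When comparing the two quantizations, it can help to report prefill and decode throughput separately, since the thread suggests they move in different directions. The sketch below shows the arithmetic only; the helper names and the timing numbers in the usage example are hypothetical placeholders, not measurements from the device.

```python
def tokens_per_second(num_tokens, elapsed_seconds):
    """Throughput in tokens per second for one stage of generation."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def summarize_throughput(prompt_tokens, prefill_seconds,
                         completion_tokens, decode_seconds):
    """Report prefill and decode throughput separately.

    Prefill processes the whole prompt before the first output token;
    decode produces the completion one token at a time.
    """
    return {
        "prefill_tok_per_s": tokens_per_second(prompt_tokens, prefill_seconds),
        "decode_tok_per_s": tokens_per_second(completion_tokens, decode_seconds),
    }


if __name__ == "__main__":
    # Hypothetical numbers: a 128-token prompt that takes 160 s to prefill
    # corresponds to a prefill rate below 1 token per second.
    print(summarize_throughput(prompt_tokens=128, prefill_seconds=160.0,
                               completion_tokens=64, decode_seconds=8.0))
```

Measuring both quantizations with the same prompt and the same completion length keeps the comparison fair.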
