fix OVModelForCausalLM for auto device #433
@AlexKoff88 @helena-intel @echarlaix could you please take a look?
Thanks @eaidova ! It is a shame it is not possible to query this for the AUTO device. It is possible to do that once the model has been loaded on the "final" device, but we can't predict when the model will have been loaded to that device. I really wish we could find a way to get the optimal PKV precision with AUTO too, even if it's a bit hacky for now. Should we, for now, show a warning about potentially slower performance on some devices when using AUTO with LLMs? And should we have a separate method that lets users set this PKV precision manually, without having to set ov_config's INFERENCE_PRECISION_HINT?
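To make the two questions above concrete, here is a minimal sketch of what a warning plus a manual setter could look like. Everything in it is hypothetical: `PKVPrecisionSettings`, `set_pkv_precision`, and the allowed precision names are assumptions for illustration, not part of optimum-intel.

```python
import warnings
from typing import Optional

# Assumed set of precision names, for illustration only.
ALLOWED_PKV_PRECISIONS = {"f32", "f16", "bf16"}


class PKVPrecisionSettings:
    """Hypothetical holder for a user-selected PKV precision."""

    def __init__(self, device: str = "CPU") -> None:
        self.device = device
        self.pkv_precision: Optional[str] = None
        # Warn once at construction if the precision cannot be queried.
        if self.device.upper().startswith("AUTO"):
            warnings.warn(
                "PKV precision cannot be queried for AUTO; performance may "
                "be slower on some devices. Consider setting it manually.",
                UserWarning,
            )

    def set_pkv_precision(self, precision: str) -> None:
        """Set the PKV precision directly, bypassing ov_config."""
        if precision not in ALLOWED_PKV_PRECISIONS:
            raise ValueError(f"Unsupported PKV precision: {precision!r}")
        self.pkv_precision = precision
```

With this shape, a user on AUTO would see the warning at construction time and could call `set_pkv_precision("f16")` afterwards without touching INFERENCE_PRECISION_HINT.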
thanks for the fix @eaidova !
@helena-intel let's discuss internally what we can suggest to users who are interested in using AUTO (we probably also need some advice from @peterchen-intel here). In my opinion, we can postpone applying the PKV transformation until the target device is known and move it to the compilation step (only if AUTO is selected, because reloading the model several times may be time-consuming on some devices). I should also note that the default device for the model class is now CPU, and AUTO is used only if the user explicitly specifies it, so the impact on users is probably not that big. The current solution is also not universal, because it does not take into account that model precision and device can be changed at runtime; right now it applies only at the moment the model is initialized. Moving this logic inside compile would probably be better, as recompilation is also triggered by the half() and to() methods.
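A rough sketch of the "move it into compile" idea, under stated assumptions: `DeferredPKVModel` and the injected `resolve_precision` callable are hypothetical stand-ins, and the real model class may track device and precision differently. The point is only that `to()` and `half()` invalidate the compiled state, so the PKV precision is re-resolved on the next compile.

```python
from typing import Callable, Optional


class DeferredPKVModel:
    """Hypothetical wrapper that resolves PKV precision at compile time."""

    def __init__(
        self,
        resolve_precision: Callable[[str, bool], str],
        device: str = "CPU",
    ) -> None:
        self._resolve_precision = resolve_precision
        self.device = device
        self.use_half = False
        self.pkv_precision: Optional[str] = None
        self.compiled = False

    def to(self, device: str) -> "DeferredPKVModel":
        self.device = device
        self.compiled = False  # device changed, force recompilation
        return self

    def half(self) -> "DeferredPKVModel":
        self.use_half = True
        self.compiled = False  # precision changed, force recompilation
        return self

    def compile(self) -> None:
        if self.compiled:
            return
        # Resolve only now, when the current device and precision are known.
        self.pkv_precision = self._resolve_precision(self.device, self.use_half)
        self.compiled = True
```

Nothing is resolved at construction, so changing the device or precision before the first compile costs nothing, and repeated `compile()` calls without changes are no-ops.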
What does this PR do?
Fixes loading OVModelForCausalLM when AUTO is specified as the device (the AUTO device does not list INFERENCE_PRECISION_HINT among its supported properties).
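A minimal sketch of the kind of guard this description implies, not the PR's actual code. `resolve_pkv_precision`, the injected `query_property` callable, and the `f32` fallback are all assumptions made for illustration; the only detail taken from the PR is that AUTO must not be asked for INFERENCE_PRECISION_HINT.

```python
from typing import Callable, Optional

# Hypothetical fallback used when the device cannot report a precision.
DEFAULT_PKV_PRECISION = "f32"


def resolve_pkv_precision(
    device: str,
    query_property: Callable[[str, str], str],
    user_hint: Optional[str] = None,
) -> str:
    """Pick a PKV precision without querying the AUTO device."""
    # An explicit user setting always wins.
    if user_hint is not None:
        return user_hint
    # AUTO does not support INFERENCE_PRECISION_HINT, so skip the query.
    if device.upper().startswith("AUTO"):
        return DEFAULT_PKV_PRECISION
    return query_property(device, "INFERENCE_PRECISION_HINT")
```

The property lookup is passed in as a callable so the sketch stays independent of any particular runtime; in practice it would presumably be bound to the runtime's device property query.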