I'm exploring inference on SBCs (specifically RK3588s) and want to benchmark mistral.rs performance with detailed settings through mistralrs-server. Are these knobs available? I'm fine accessing them via Python if that's the best option:

- Can I limit the thread count (to handle big.LITTLE CPUs)? Inference is much faster with exactly 4 threads on the 4-big/4-little RK3588.
- Is it possible to set a smaller context window (for long-context models) to boost inference performance by 10-20% on these small devices?
- Can I set the temperature to 0 for repeatability (and slightly faster inference)?
- Are ARM-specialized quantizations supported (similar to llama.cpp's Q4_0_4_4 support)?
- Can I provide a prompt for one-shot inference for benchmarking (see the sketch after this list)?
- Is there a way to get more detailed performance metrics, as in llama.cpp (e.g., prompt processing vs. new-token generation times)? That would be especially helpful when working with vision LLMs, which mistral.rs supports well.
- Is Vulkan support likely (a long shot, and so far GPU inference seems slower than CPU on these RK3588s)?
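
For context, here's roughly what I'd like to script for a one-shot benchmark. This is only a sketch: it assumes mistralrs-server exposes the usual OpenAI-compatible `/v1/chat/completions` route with an OpenAI-style `usage` block in the response, and the port, the placeholder model id, and the `RAYON_NUM_THREADS` trick for pinning to the 4 big cores are all assumptions on my part.

```python
# Hypothetical one-shot benchmark against mistralrs-server's OpenAI-compatible API.
# Assumed server launch (exact subcommand/flags per the mistral.rs README):
#   RAYON_NUM_THREADS=4 ./mistralrs-server --port 1234 <model selection>
# RAYON_NUM_THREADS is a rayon env var; whether mistral.rs honors it for CPU
# thread count is an assumption I'd like confirmed.
import time
import requests

URL = "http://localhost:1234/v1/chat/completions"  # port is an assumption

payload = {
    "model": "default",  # placeholder model id
    "messages": [{"role": "user", "content": "Explain big.LITTLE scheduling in one paragraph."}],
    "temperature": 0.0,  # deterministic-ish sampling for repeatable runs
    "max_tokens": 128,   # fixed generation budget so runs are comparable
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.perf_counter() - start
resp.raise_for_status()

usage = resp.json().get("usage", {})
prompt_tokens = usage.get("prompt_tokens", 0)
completion_tokens = usage.get("completion_tokens", 0)

print(f"wall time: {elapsed:.2f}s")
print(f"prompt tokens: {prompt_tokens}, completion tokens: {completion_tokens}")
if elapsed > 0 and completion_tokens:
    # Coarse end-to-end rate; separating prompt processing from generation time
    # would need per-phase timings reported by the server itself.
    print(f"~{completion_tokens / elapsed:.1f} tok/s end-to-end")
```

Something like this would cover the temperature, one-shot prompt, and basic throughput questions, but it only gives an end-to-end number, which is why I'm asking whether more detailed per-phase metrics are exposed anywhere.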