I'm exploring inference on SBCs (specifically RK3588s) and want to benchmark mistral.rs performance with detailed settings through mistralrs-server. Are these knobs available? I'm fine accessing them via Python if that's the best option:

- Can I limit the thread count (to handle big.LITTLE CPUs)? Inference is much faster with exactly 4 threads on the 4-big/4-little RK3588.
- Is it possible to set a smaller context window (for long-context models) to boost inference performance by 10-20% on these small devices?
- Can I set the temperature to 0 for repeatability (and slightly faster inference)?
- Are ARM-specialized quantizations supported (similar to llama.cpp's Q4_0_4_4 support)?
- Can I provide a prompt for one-shot inference for benchmarking (see the sketch after this list)?
- Is there a way to get more detailed performance metrics, as in llama.cpp (e.g., prompt processing vs. new-token generation times)? That would be especially helpful when working with vision LLMs, which mistral.rs supports well.
- Is Vulkan support likely (a long shot, and so far GPU inference seems slower than CPU on these RK3588s)?
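
For context, here's roughly what I'd like to script for a one-shot benchmark. This is only a sketch: it assumes mistralrs-server exposes the usual OpenAI-compatible `/v1/chat/completions` route with an OpenAI-style `usage` block in the response, and the port, the placeholder model id, and the `RAYON_NUM_THREADS` trick for pinning to the 4 big cores are all assumptions on my part.

```python
# Hypothetical one-shot benchmark against mistralrs-server's OpenAI-compatible API.
# Assumed server launch (exact subcommand/flags per the mistral.rs README):
#   RAYON_NUM_THREADS=4 ./mistralrs-server --port 1234 <model selection>
# RAYON_NUM_THREADS is a rayon env var; whether mistral.rs honors it for CPU
# thread count is an assumption I'd like confirmed.
import time
import requests

URL = "http://localhost:1234/v1/chat/completions"  # port is an assumption

payload = {
    "model": "default",  # placeholder model id
    "messages": [{"role": "user", "content": "Explain big.LITTLE scheduling in one paragraph."}],
    "temperature": 0.0,  # deterministic-ish sampling for repeatable runs
    "max_tokens": 128,   # fixed generation budget so runs are comparable
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.perf_counter() - start
resp.raise_for_status()

usage = resp.json().get("usage", {})
prompt_tokens = usage.get("prompt_tokens", 0)
completion_tokens = usage.get("completion_tokens", 0)

print(f"wall time: {elapsed:.2f}s")
print(f"prompt tokens: {prompt_tokens}, completion tokens: {completion_tokens}")
if elapsed > 0 and completion_tokens:
    # Coarse end-to-end rate; separating prompt processing from generation time
    # would need per-phase timings reported by the server itself.
    print(f"~{completion_tokens / elapsed:.1f} tok/s end-to-end")
```

Something like this would cover the temperature, one-shot prompt, and basic throughput questions, but it only gives an end-to-end number, which is why I'm asking whether more detailed per-phase metrics are exposed anywhere.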