
[Request] Bigger picture of the system #32

Open
taehyunzzz opened this issue Dec 18, 2024 · 0 comments

taehyunzzz commented Dec 18, 2024

Hi, thank you for sharing this work.
I was wondering whether there is a fuller demonstration of your implementation and the bigger picture of the system.
Specifically:

  • How do I check which experts are currently cached? Are they observable from the Python API?
  • How would I measure the cache hit rate and prefetch accuracy for this implementation? (The second sketch below shows the kind of accounting I have in mind.)
  • I saw a note saying that "device_memory_ratio" is the ratio of the KV caches to be kept on the GPU. How do I control the number of experts to keep on the GPU (i.e., how do I control the expert cache)?
  • Are the non-expert parameters all kept on the GPU by default? When I run the code, the non-MoE parameters seem to reside on the CPU. How exactly are the parameters placed, and how do I confirm this? (See the first sketch after this list for the check I am using.)
  • Is inference lossy? If the prefetched experts turn out not to be the correct ones, does a fallback on-the-fly expert fetch occur, or does inference proceed with the mis-fetched experts? Sometimes the LLM outputs incomprehensible sequences, which may be due to using the wrong experts.
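
To give context for the fourth point, here is the minimal check I am using to see where each parameter currently resides. It is plain PyTorch and not specific to your runtime; the expert/non-expert split is a naive name-based heuristic, and loading the model is left as a placeholder:

```python
import collections

import torch


def summarize_parameter_placement(model: torch.nn.Module) -> None:
    """Print how many parameters live on each device, split into expert vs. non-expert."""
    counts = collections.defaultdict(int)
    for name, param in model.named_parameters():
        # "expert" in the parameter name is only a heuristic; it would need to be
        # adjusted to the actual module naming used by the MoE implementation.
        kind = "expert" if "expert" in name.lower() else "non-expert"
        counts[(kind, str(param.device))] += param.numel()

    for (kind, device), numel in sorted(counts.items()):
        print(f"{kind:>11} | {device:>8} | {numel / 1e6:8.1f} M params")
```

Running this on the loaded model is how I noticed the non-MoE parameters being reported on the CPU.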

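To make the second point concrete, this is the kind of accounting I have in mind for cache hit rate and prefetch accuracy. The counters and the events that would increment them are purely hypothetical on my side; I do not know where (or whether) the corresponding hooks exist in your code base:

```python
from dataclasses import dataclass


@dataclass
class ExpertCacheStats:
    """Hypothetical counters; the runtime would increment these wherever it
    resolves an expert for a token."""
    hits: int = 0              # requested expert was already resident on the GPU
    misses: int = 0            # requested expert had to be fetched on demand
    prefetched: int = 0        # experts brought in ahead of time
    prefetched_used: int = 0   # prefetched experts that were actually routed to

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def prefetch_accuracy(self) -> float:
        return self.prefetched_used / self.prefetched if self.prefetched else 0.0
```
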
I am really interested in your work. It would be a great help if you could clarify the points above.
