Hi, thank you for sharing this work.
I was wondering whether there is a more detailed demonstration of your implementation. Specifically:
1. How can I inspect the cached experts? Are they observable from the Python API?
2. How would I measure the cache hit rate and prefetch accuracy of this implementation? (See the first sketch after this list for the metrics I have in mind.)
3. I saw a note saying that "device_memory_ratio" is the ratio of the KV cache to be kept on the GPU. How do I control the number of experts kept on the GPU (i.e., how do I control the expert cache)?
4. Are the non-expert parameters all kept on the GPU by default? When I run the code, the non-MoE parameters seem to reside on the CPU. How exactly are the parameters placed, and how can I confirm this? (See the second sketch after this list for how I have been checking.)
5. Are the inference results lossy? If the prefetched experts turn out not to be the correct ones, does a fallback on-the-fly expert fetch occur, or does inference proceed with the mis-fetched experts? The LLM sometimes outputs incomprehensible sequences, which may be due to using the wrong experts.
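
To make question 2 concrete, this is the kind of metric I have in mind. The expert IDs below are made-up placeholders; I don't yet know how to extract the real prefetched/routed sets from your runtime:

```python
# Hypothetical illustration of the metrics meant in question 2.
# The per-layer expert IDs are placeholders; I do not yet know how to pull
# the prefetched / routed expert sets out of the runtime.
prefetched = {0: {1, 5, 7}, 1: {2, 3}}  # layer -> expert IDs prefetched onto the GPU
routed = {0: {1, 5, 9}, 1: {2, 3}}      # layer -> expert IDs the router actually selected

hits = sum(len(prefetched[layer] & routed[layer]) for layer in routed)
needed = sum(len(experts) for experts in routed.values())
fetched = sum(len(experts) for experts in prefetched.values())

print(f"cache hit rate:    {hits / needed:.2%}")   # selected experts already resident on GPU
print(f"prefetch accuracy: {hits / fetched:.2%}")  # prefetched experts actually used
```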
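And for question 4, this is roughly how I have been trying to check placement, assuming the checkpoint loads as an ordinary PyTorch module; the `"experts"` name filter is just my guess at the module naming convention:

```python
from collections import Counter

import torch.nn as nn


def report_param_placement(model: nn.Module) -> None:
    """Print parameter counts per device, split into expert vs. non-expert weights.

    The "experts" substring is my guess at the naming convention; adjust it to
    whatever the expert modules are actually called in this codebase.
    """
    placement = Counter()
    for name, param in model.named_parameters():
        kind = "expert" if "experts" in name else "non-expert"
        placement[(kind, str(param.device))] += param.numel()
    for (kind, device), numel in sorted(placement.items()):
        print(f"{kind:>10} params on {device}: {numel / 1e6:.1f}M")
```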
I am really interested in your work. It would be a great help if you could clarify the above.