
[Request] Bigger picture of the system #32

Open
taehyunzzz opened this issue Dec 18, 2024 · 0 comments

taehyunzzz commented Dec 18, 2024

Hi, thank you for sharing this work.
I was wondering whether there is a fuller demonstration of your implementation and the bigger picture of the system.
Specifically:

  • How do I check which experts are currently cached? Are they observable from the Python API?
  • How would I measure the cache hit rate and prefetch accuracy for this implementation? (The second sketch below shows the kind of accounting I have in mind.)
  • I saw a note saying that "device_memory_ratio" is the ratio of the KV caches to be kept on the GPU. How do I control the number of experts to keep on the GPU (i.e., how do I control the expert cache)?
  • Are the non-expert parameters all kept on the GPU by default? When I run the code, the non-MoE parameters seem to reside on the CPU. How exactly are the parameters placed, and how do I confirm this? (See the first sketch after this list for the check I am using.)
  • Is inference lossy? If the prefetched experts turn out not to be the correct ones, does a fallback on-the-fly expert fetch occur, or does inference proceed with the mis-fetched experts? Sometimes the LLM outputs incomprehensible sequences, which may be due to using the wrong experts.
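
To give context for the fourth point, here is the minimal check I am using to see where each parameter currently resides. It is plain PyTorch and not specific to your runtime; the expert/non-expert split is a naive name-based heuristic, and loading the model is left as a placeholder:

```python
import collections

import torch


def summarize_parameter_placement(model: torch.nn.Module) -> None:
    """Print how many parameters live on each device, split into expert vs. non-expert."""
    counts = collections.defaultdict(int)
    for name, param in model.named_parameters():
        # "expert" in the parameter name is only a heuristic; it would need to be
        # adjusted to the actual module naming used by the MoE implementation.
        kind = "expert" if "expert" in name.lower() else "non-expert"
        counts[(kind, str(param.device))] += param.numel()

    for (kind, device), numel in sorted(counts.items()):
        print(f"{kind:>11} | {device:>8} | {numel / 1e6:8.1f} M params")
```

Running this on the loaded model is how I noticed the non-MoE parameters being reported on the CPU.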

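To make the second point concrete, this is the kind of accounting I have in mind for cache hit rate and prefetch accuracy. The counters and the events that would increment them are purely hypothetical on my side; I do not know where (or whether) the corresponding hooks exist in your code base:

```python
from dataclasses import dataclass


@dataclass
class ExpertCacheStats:
    """Hypothetical counters; the runtime would increment these wherever it
    resolves an expert for a token."""
    hits: int = 0              # requested expert was already resident on the GPU
    misses: int = 0            # requested expert had to be fetched on demand
    prefetched: int = 0        # experts brought in ahead of time
    prefetched_used: int = 0   # prefetched experts that were actually routed to

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def prefetch_accuracy(self) -> float:
        return self.prefetched_used / self.prefetched if self.prefetched else 0.0
```
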
I am really interested in your work. It would be a great help if you could clarify the points above.
