How do you feed long texts to a model? #2
Comments
As illustrated in our run_streaming_llama.py, KV cache eviction happens only before the prompt input and before generation, so the demo code isn't designed for a single long input sample. For long text inputs with LLMs, you can refer to our perplexity evaluation script, where we feed the text in and evict the KV cache token by token. As highlighted in the FAQ section of our README, StreamingLLM does not enlarge the LLM's context window. If you want to expand the context window, consider using a model such as Llama-2-7B-32K-Instruct for your experiments.
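For concreteness, here is a minimal sketch of what token-by-token eviction with attention sinks can look like, assuming a Hugging Face causal LM and the legacy tuple-style `past_key_values`. This is not the repo's actual code (the real implementation also re-assigns rotary position ids when it evicts entries); the model name, input file, and window sizes are placeholders.

```python
# Minimal sketch (not the repo's actual code) of token-by-token KV cache
# eviction with attention sinks, in the spirit of the perplexity evaluation
# script. Assumes the legacy tuple-of-tuples `past_key_values` format and
# omits the rotary-position re-indexing the real implementation performs.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
start_size, recent_size = 4, 2000         # sink tokens + recent window

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


def evict(past_key_values, start_size, recent_size):
    """Keep only the first `start_size` (sink) and last `recent_size` KV entries."""
    trimmed = []
    for k, v in past_key_values:  # k, v: [batch, heads, seq_len, head_dim]
        if k.size(2) <= start_size + recent_size:
            trimmed.append((k, v))
        else:
            trimmed.append((
                torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2),
                torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2),
            ))
    return tuple(trimmed)


text = open("long_text.txt").read()       # placeholder long input
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

past = None
nlls = []
with torch.no_grad():
    for i in range(input_ids.size(1) - 1):
        # Feed one token, reuse the (evicted) cache, and score the next token.
        out = model(input_ids[:, i : i + 1], past_key_values=past, use_cache=True)
        past = evict(out.past_key_values, start_size, recent_size)
        nlls.append(F.cross_entropy(out.logits[:, -1], input_ids[:, i + 1]))

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```

Increasing `start_size` in a sketch like this would keep more sink tokens around, but the window over past tokens is still bounded, so it does not extend the context the model can actually attend to.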
Ok! Thank you for your answer. I knew it was too good to be true, but it's still a great achievement!
Hijacking this thread (tell me if I should open a separate issue), but would adding more sink tokens similarly act as "state" once the sliding mechanism starts evicting tokens? I imagine the model would have to learn to use the register cache during training. EDIT: In case this wasn't clear, a sink cache plus sliding tokens computed in an autoregressive manner is similar to an RNN because of the "state". We've somehow backtracked to having our RNN "hidden state" alongside the attention mechanism.
I naively tried adding examples to https://github.com/mit-han-lab/streaming-llm/blob/main/data/mt_bench.jsonl, including examples about 4k tokens long, without changing anything in the script. I receive:
Did I misunderstand "infinite-length inputs without sacrificing efficiency and performance"?