How do you feed long texts to a model? #2

Closed
CorentinvdBdO opened this issue Oct 2, 2023 · 3 comments

@CorentinvdBdO

I naively tried adding examples to https://github.com/mit-han-lab/streaming-llm/blob/main/data/mt_bench.jsonl, including some around 4k tokens long, without changing anything in the script. I get:

ASSISTANT: Token indices sequence length is longer than the specified maximum sequence length for this model (3905 > 2048). Running this sequence through the model will result in indexing errors
- - - - - - - - - d - d - d - d - - - - - - - - - - d - d d d d d - d - d … (the generation continues as a long degenerate stream of "d", "d0", "et", and punctuation tokens)

Did I misunderstand "infinite-length inputs without sacrificing efficiency and performance"?

@Guangxuan-Xiao
Collaborator

Guangxuan-Xiao commented Oct 3, 2023

As illustrated in our run_streaming_llama.py, the KV cache eviction occurs only before the prompt input and generation. This means the demo code isn't designed for single, long input samples.

However, for long text inputs with LLMs, you can refer to our perplexity evaluation script, where we feed the text and evict the KV cache token by token.
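
For reference, here is a minimal sketch of that token-by-token scheme using plain Hugging Face `transformers` (assuming an older version where `past_key_values` is the legacy tuple of per-layer `(key, value)` tensors). The model name, window sizes, and input file are illustrative assumptions, and the sketch omits the position re-mapping that the actual implementation patches into the attention layers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM with a 2k-4k window
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

n_sink, n_recent = 4, 2044  # 4 attention sinks + a recent window (illustrative sizes)

def evict(past_key_values, n_sink, n_recent):
    """Keep only the first n_sink and last n_recent positions of each layer's KV cache."""
    new_past = []
    for k, v in past_key_values:  # k, v: [batch, heads, seq_len, head_dim]
        if k.size(2) <= n_sink + n_recent:
            new_past.append((k, v))
        else:
            new_past.append((
                torch.cat([k[:, :, :n_sink], k[:, :, -n_recent:]], dim=2),
                torch.cat([v[:, :, :n_sink], v[:, :, -n_recent:]], dim=2),
            ))
    return tuple(new_past)

text = open("long_text.txt").read()  # assumption: your long input document
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

past, nlls = None, []
with torch.no_grad():
    for i in range(input_ids.size(1) - 1):
        out = model(input_ids[:, i : i + 1], past_key_values=past, use_cache=True)
        past = evict(out.past_key_values, n_sink, n_recent)
        # negative log-likelihood of the next token given the (evicted) cache
        nlls.append(torch.nn.functional.cross_entropy(out.logits[:, -1, :], input_ids[:, i + 1]))

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```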

As highlighted in our README's FAQ section, StreamingLLM doesn't enlarge the LLM context window. If you want to expand the context window, consider using a model like Llama-2-7B-32K-Instruct for your experiments.
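
For example, a minimal sketch of the latter (assuming the Together AI release of that model on the Hugging Face Hub; the prompt file and generation arguments are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "togethercomputer/Llama-2-7B-32K-Instruct"  # 32K-context model on the Hub
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

prompt = open("long_prompt.txt").read()  # e.g. one of your ~4k-token samples
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```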

@CorentinvdBdO
Author

Ok! Thank you for your answer. I knew it was too good to be true, but it's still a great achievement!

@gembancud

gembancud commented Oct 3, 2023

Hijacking this thread (tell me if I should open a separate issue), but would adding more sink tokens similarly act as "state" once the sliding mechanism starts evicting tokens? I suspect the model would have to learn to use that register cache during training.
Similar to the ViT paper, a research direction could be to look for activation outliers and check whether they are likewise removed once registers/sinks are available. That could hopefully make quantization a bit easier! Exciting ideas, and amazing work!

EDIT: In case this wasn't clear: a sink cache plus a sliding window of tokens, computed autoregressively, is similar to an RNN because of the "state". We've somehow come back around to having an RNN-like "hidden state" alongside the attention mechanism.
