feat: added k, v cache for inference speed up #7
base: main
Conversation
Thanks for the implementation! What kind of speedups did you get with this, and did you get identical output to the non-KV-cache version? Just FYI, I'm going to leave this unmerged to keep the implementation as simple as possible. However, I'll keep this PR open if people want to reference it in the future. There's also an inference optimization section in my blog post with some further resources to read up on.
Yes, the output is identical. I am seeing a ~25% speedup on CPU. Sample output:

> the most powerful machines on the planet. The computer is a machine that can perform complex calculations, and it can perform these calculations in a way that is very similar to the human brain.

Yeah, it makes sense not to merge it. Probably we can create another file.
Hi @immortal3, I love the minimal implementation. I'm having trouble reproducing the 25% speedup though. I've been using
@clam004 I don't remember exactly how I arrived at the 25% figure, but it was definitely not a scientific measurement. 😄 The speedup depends heavily on the combination of CPU/memory and the length of the input tokens. So you might not get exactly 25%, but try feeding a sufficiently long sequence; that should show a clear improvement over the normal forward pass without the KV cache. As for a proper comparison, I'm not sure it would be worth the time at this point to do it thoroughly.
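For anyone trying to reproduce the numbers, a rough timing comparison along the lines described above might look like the sketch below. It assumes picoGPT's `load_encoder_hparams_and_params` from `utils.py` and a `generate`-style entry point with the signature `(input_ids, params, n_head, n_tokens_to_generate)`; `generate_with_kvcache` is a placeholder name for the cached variant, not an existing function in the repo.

```python
# Rough, unscientific timing sketch: compare the baseline decode loop against a
# KV-cache-enabled variant on the same (long-ish) prompt and check the outputs match.
import time

from utils import load_encoder_hparams_and_params  # picoGPT helper


def time_generation(generate_fn, prompt, n_tokens=40, model_size="124M"):
    encoder, hparams, params = load_encoder_hparams_and_params(model_size, "models")
    input_ids = encoder.encode(prompt)

    start = time.perf_counter()
    output_ids = generate_fn(input_ids, params, hparams["n_head"], n_tokens)
    elapsed = time.perf_counter() - start

    return encoder.decode(output_ids), elapsed


# The cache saves more work as the sequence grows, so use a long prompt.
prompt = "Alan Turing theorized that computers would one day become " * 8

# text_base, t_base = time_generation(generate, prompt)            # no cache
# text_kv, t_kv = time_generation(generate_with_kvcache, prompt)   # with cache
# print(f"baseline: {t_base:.2f}s, kvcache: {t_kv:.2f}s, identical: {text_base == text_kv}")
```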
```diff
 for _ in tqdm(range(n_tokens_to_generate), "generating"):  # auto-regressive decode loop
-    logits = gpt2(inputs, **params, n_head=n_head)  # model forward pass
+    logits, kvcache = gpt2(inputs, **params, n_head=n_head, kvcache=kvcache)  # model forward pass
```
The main benefit of KV caching is that you don't need to recompute the MLPs for tokens you've already run the forward pass on, so in the decoding phase you only pass the new token as input to the network.
You should only pass the `next_id` as input in the decoding phase. In the prefill phase, the initial `inputs` should be passed. Check out https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L72 or https://github.com/meta-llama/llama/blob/main/llama/generation.py#L187C51-L187C59 for an example.
more: https://www.perplexity.ai/search/what-should-be-the-input-to-th-bsYpXZiuRFinjT11Ck33EA#0
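The split described above could look roughly like this in picoGPT style. This is an illustration, not the code in this PR: it assumes a `gpt2()` forward pass that accepts and returns a `kvcache` (as in the diff above) and that derives a new token's position from the cache length rather than from `len(inputs)`.

```python
import numpy as np


def generate_with_prefill(inputs, params, n_head, n_tokens_to_generate):
    # prefill: run the full prompt once to populate the KV cache
    logits, kvcache = gpt2(inputs, **params, n_head=n_head, kvcache=None)
    next_id = int(np.argmax(logits[-1]))
    generated = [next_id]

    # decode: feed only the newest token; the cached keys/values cover everything before it
    for _ in range(n_tokens_to_generate - 1):
        logits, kvcache = gpt2([next_id], **params, n_head=n_head, kvcache=kvcache)
        next_id = int(np.argmax(logits[-1]))
        generated.append(next_id)

    return generated
```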
```python
    wpe_out = wpe[range(len(inputs))]
else:
    wpe_out = wpe[[len(inputs)-1]]
    inputs = [inputs[-1]]
```
@panaali You're correct: if the kvcache is there, then only the last token should be passed. But this is me being lazy; I didn't want to change the function signature, so I handle it inside the function and just use the last token as input when the kvcache is present.
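For readers following along, a minimal sketch of where the cache actually lives (the per-layer key/value handling inside multi-head attention) could look like the code below, reusing picoGPT's `attention` and `linear` helpers from `gpt2.py`. It illustrates the idea rather than reproducing this PR's exact code; the rectangular causal mask and the `(k, v)` tuple layout of `kvcache_layer` are assumptions of the sketch.

```python
import numpy as np

from gpt2 import attention, linear  # picoGPT helpers: attention(q, k, v, mask), linear(x, w, b)


def mha_with_cache(x, c_attn, c_proj, n_head, kvcache_layer=None):
    # x contains only the new token(s): the full prompt at prefill, a single token afterwards
    x = linear(x, **c_attn)                 # qkv projection
    q, k, v = np.split(x, 3, axis=-1)

    if kvcache_layer is not None:
        cached_k, cached_v = kvcache_layer
        k = np.vstack([cached_k, k])        # keys for the whole sequence so far
        v = np.vstack([cached_v, v])        # values for the whole sequence so far
    new_kvcache_layer = (k, v)              # handed back and reused on the next step

    # causal mask of shape [n_new_queries, n_total_keys]:
    # each new token may attend to itself and everything before it
    causal_mask = (1 - np.tri(q.shape[0], k.shape[0], k.shape[0] - q.shape[0])) * -1e10

    # split into heads, attend, merge heads, output projection (as in gpt2.py)
    qs, ks, vs = (np.split(t, n_head, axis=-1) for t in (q, k, v))
    out = np.hstack([attention(qh, kh, vh, causal_mask) for qh, kh, vh in zip(qs, ks, vs)])
    return linear(out, **c_proj), new_kvcache_layer
```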
Hi @jaymody, awesome blog post. I was interested in learning how the KV cache works during inference and searched for it, but existing articles on KV caching don't focus on the implementation side. So, I decided to implement it in picoGPT.
Are you interested in writing a post on optimizing inference time? I would love to collaborate on it.