Project Archived: see https://huggingface.co/RWKV
- Fixed a crash when running fp16 on CPU.
- Fixed issues with float16 (half) and CUDA half modes; the 7B model runs in 14 GB.
- Added an option to exclude the original context from the generated output.
- Fixed the examples.
- Bug fix: state generation can now be properly resumed; there was an issue with the state array not being copied.
- Bug fix: float16 can now be used on GPU.
- CPU inference remains float32.
- Reduced the logging noise in the library.
- Provides stability and state management; documentation to be provided at a later date.
RWKV - Receptance Weighted Key Value
RWKV is a sequence-to-sequence model for language modelling (LM) that combines the best features of Generative Pre-Training (GPT) style Transformers and Recurrent Neural Networks (RNNs). It is used to generate text autoregressively (AR).
It offers Transformer-level performance without the quadratic attention mechanism. It borrows ideas from Attention Free Transformers, so the attention is linear in complexity, allowing for infinite context windows.
More from the Research and Development Repository: https://github.com/BlinkDL/RWKV-LM
This project aims to make RWKV accessible to everyone through a familiar, Hugging Face-like interface.
Q: So why not port it to Hugging Face?
A: As of right now RWKV is changing rapidly and is under active research. The Hugging Face framework is large and vast, and contributions go through a lengthy PR process that may be ignored for long periods of time. This project aims to port the latest developments in RWKV and make them accessible with a few lines of code, while staying close to the RWKV research and development branch. It is a very thin layer over the core features of RWKV.
Update: I am working on the HF version.
import time

from prwkv.rwkvtokenizer import RWKVTokenizer
from prwkv.rwkvrnnmodel import RWKVRNN4NeoForCausalLM

tokenizer = RWKVTokenizer.default()
model = RWKVRNN4NeoForCausalLM.from_pretrained("/Users/michaelchung/Code/Production-RWKV/RWKV-4-Pile-430M-20220808-8066", number_of_layers=24, embedding_dimension=1024, context_length=1024)

context = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."
context_input_ids = tokenizer.encode(context).ids
model.warmup_with_context(context=context_input_ids)  # encode the prompt once into the hidden state

# Stream each generated token to stdout as it is produced.
def callback(ind):
    token = tokenizer.decode([ind], skip_special_tokens=False)
    print(token, end="")

# Repetition penalty and temperature are sensitive and can cause generations to go off the rails.
start = time.time()
ctx = model.generate(input_ids=[], streaming_callback=callback, max_length=32, repetition_penalty=0.0, temperature=0.8, stop_on_eos=True)
end = time.time()
print(f"{end - start} seconds \n")

result = tokenizer.decode(ctx, skip_special_tokens=False)  # CPU: about 3 tokens per second
print(f"\n---Result---:\n{result}")
# Using the latest pretrained weights from Hugging Face.
tokenizer = RWKVTokenizer.default()
model = RWKVRNN4NeoForCausalLM.from_pretrained("RWKV-4-430M")  # other options: RWKV-4-1B5, RWKV-4-3B, RWKV-4-7B, RWKV-4-14B
pip3 install prwkv
I first read about this model on LessWrong (https://www.lesswrong.com/posts/K4urTDkBbtNuLivJx/why-i-think-strong-general-ai-is-coming-soon) and then spent quite a bit of time digging. The important information seemed to be scattered around the Discord and locked up behind the mind of a genius, so this is an attempt to simplify, clarify, and surface the ideas behind the model and how it works.
There are two forms of RWKV, referred to as modes, found in the RWKV-v4neo folder (https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v4neo).
This is due to the discovery of an algebraic formulation of the RWKV_RNN model that allows it to be rewritten as a GPT-style model (RWKV GPT) with a linear self-attention.
What makes this very special is that weights can be shared and loaded between the two models, allowing interoperation between GPT mode and RNN mode. This means you can use both models at the same time, because they can share the same weights in memory. More on this idea later; first we need the specific properties of each model mode.
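To make the equivalence concrete, here is a toy, single-channel sketch of the WKV mixing that both modes compute. It is only an illustration under simplifying assumptions (real RWKV applies this per channel inside each block, with learned decay, token/channel mixing, and a numerical-stability trick): the GPT-style form builds an explicit attention-like weight matrix, while the RNN-style form reaches the same numbers with a constant-size running state.

```python
import numpy as np

T = 8
rng = np.random.default_rng(0)
k, v = rng.normal(size=T), rng.normal(size=T)   # toy per-token "key" and "value" (one channel)
w, u = 0.3, 0.1                                 # toy decay and current-token bonus

# GPT-style form: an explicit, attention-like (lower-triangular) weight matrix over the sequence.
W = np.zeros((T, T))
for t in range(T):
    for i in range(t):
        W[t, i] = np.exp(-(t - 1 - i) * w + k[i])   # older tokens decay exponentially
    W[t, t] = np.exp(u + k[t])                      # the current token gets a bonus weight
wkv_parallel = (W @ v) / W.sum(axis=1)

# RNN-style form: the same result from a constant-size running state (a, b).
a = b = 0.0
wkv_recurrent = np.zeros(T)
for t in range(T):
    e = np.exp(u + k[t])
    wkv_recurrent[t] = (a + e * v[t]) / (b + e)
    a = np.exp(-w) * a + np.exp(k[t]) * v[t]
    b = np.exp(-w) * b + np.exp(k[t])

assert np.allclose(wkv_parallel, wkv_recurrent)     # both modes compute the same thing
```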
RWKV_RNN:
- This model is designed for running inference quickly. (Available for inference in this package.)
- It has a hidden state that stays a constant size. This hidden state encodes and compresses the prompt context and any subsequent additions to it. This means we don't need to keep the prompt context and its history in memory the way a vanilla transformer does, because it is encoded in the hidden state (see the state-reuse sketch after the quote below). This feature has limitations, however, and depends entirely on the context length of the training samples used when training with RWKV GPT, as well as on floating point accuracy.
- BlinkDL mentioned that when training in GPT mode with a context length of 1024, RWKV_RNN deteriorated around a context length of 2000, so it can extrapolate and compress the prompt context a bit beyond what it was trained on. This is likely because the model doesn't know how to handle samples beyond that size. It also implies that the hidden state could allow the prompt context to be effectively infinite, if we can fine-tune it properly. (Unclear right now how.)
BlinkDL mentioned:
If you train RWKV using the correct method (GPT mode with ctxlen 1024 but apply smart "continuation" of hidden state) to let it learn longer ctxlen, the RNN mode can easily support ctxlen of at least 100k.
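In practice this is exactly what the warmup/generate split in the usage example above exploits. The sketch below reuses that API: the long prompt is folded into the hidden state once, and later calls keep generating from that compressed state instead of re-reading the prompt. It assumes the model carries its state across generate calls, which the "resume the state generation" fix in the release notes suggests.

```python
from prwkv.rwkvtokenizer import RWKVTokenizer
from prwkv.rwkvrnnmodel import RWKVRNN4NeoForCausalLM

tokenizer = RWKVTokenizer.default()
model = RWKVRNN4NeoForCausalLM.from_pretrained("RWKV-4-430M")

# The long prompt is consumed exactly once; only the fixed-size hidden state is kept around.
prompt_ids = tokenizer.encode("A very long document could go here...").ids
model.warmup_with_context(context=prompt_ids)

# Each call continues from the current hidden state, so the prompt is never re-fed and
# memory stays constant no matter how much text has already been generated.
for _ in range(3):
    out = model.generate(input_ids=[], streaming_callback=lambda ind: None,
                         max_length=16, repetition_penalty=0.0,
                         temperature=0.8, stop_on_eos=True)
    print(tokenizer.decode(out, skip_special_tokens=False))
```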
- RWKV (RWKV GPT): This mode is for training or fine-tuning your model quickly. (Not available for training yet in this repo.)
- This mode is designed for training, and for quickly generating the initial hidden state when using in-memory weight sharing.
- The limitation of this mode is that it does not carry a hidden state, so it cannot hold an effectively infinite context length.
- The advantage of this mode is that it can use parallelism to train quickly, because it is in its GPT configuration.
- Another advantage of this mode is that its linear self-attention mechanism allows for large context lengths.
- Weirdly, RWKV can be trained as an RNN as well (mentioned in a Discord discussion but not implemented).
The same checkpoints can be used for both models. This allows you to transition between a GPT-like model and an RNN-like model, almost like a shape-shifting model.
- Another special note about RWKV-LM is that you can use RWKV GPT as a context encoder to generate the context for the decoder, very similar to the cross-attention mechanism in encoder-decoder architectures (a rough sketch follows below). This will be implemented at a future date, as it requires in-memory weight sharing.
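Conceptually (in the same toy single-channel setting as the sketch above, with made-up numbers, and not the package's actual implementation, which is planned rather than available), the encoder-decoder idea looks like this: a GPT-mode pass folds the entire prompt into the fixed-size state in one shot, and RNN-mode decoding then continues token by token from that state while sharing the same weights.

```python
import numpy as np

w, u = 0.3, 0.1                                             # toy decay and current-token bonus
rng = np.random.default_rng(1)

# "Encoder" (GPT-mode) pass: compute the state for the whole prompt at once.
k_prompt, v_prompt = rng.normal(size=32), rng.normal(size=32)
T = len(k_prompt)
weights = np.exp(-(T - 1 - np.arange(T)) * w + k_prompt)    # how much each prompt token still counts
a, b = np.sum(weights * v_prompt), np.sum(weights)          # same state an RNN would reach step by step

# "Decoder" (RNN-mode) pass: continue token by token from that state.
for k_t, v_t in zip(rng.normal(size=4), rng.normal(size=4)):
    e = np.exp(u + k_t)
    out = (a + e * v_t) / (b + e)                           # mixing result for the new token
    a = np.exp(-w) * a + np.exp(k_t) * v_t                  # decay the old state, add the new token
    b = np.exp(-w) * b + np.exp(k_t)
    print(out)
```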
Performance:
CPU M1 Pro | RWKV-430m fp32 | RWKV-1B5 fp32 | RWKV-3B | RWKV-7B | RWKV-14B |
---|---|---|---|---|---|
Tokens/Second | 17-18 | 4-5 | NA | NA | NA |
Memory (RAM) | ~1.3-2 GB | ~5.6-5.8 GB | NA | NA | NA |
Performance on a 3090 (non-CUDA kernel; these metrics might need to be revisited):
GPU 3090 24GB | RWKV-170m (RWKV-4a-Pile-170M-20221209-7955) fp16 | RWKV-430m (RWKV-4-Pile-430M-20220808-8066) fp16 | RWKV-1B5 (RWKV-4-Pile-1B5-20220929-ctx4096) fp16 | RWKV-3B (RWKV-4-Pile-3B-20221110-ctx4096) fp16 | RWKV-7B (RWKV-4-Pile-7B-20221123-ctx2048) fp16 | RWKV-14B fp16 |
---|---|---|---|---|---|---|
25 Token Generation Time | 0.6221s | 0.9178s | 0.8562s | 1.0058s | 1.0309s | x |
Memory (VRAM) | 900MB | 1.5GB | 3.5GB | 6GB | 14GB | x |
Warm Up Time | 0.7686s | 0.9178s | 0.8562s | 1.0058s | 1.0309s | x |
Load Time | 1.9397s | 3.0567s | 6.3156s | 14.0923s | 26.1861s | x |
Model | Device / CPU Threads | Load Time | Initialize Time | Run Time (25 tokens) | VRAM/RAM |
---|---|---|---|---|---|
RWKV-4-Pile-14B-20221128-3061 | 3090 & 3080Ti | 232.76s | 3.66s | 3.66s | 29199MB |
RWKV-4-Pile-14B-20221128-3061 | 38 | 218.46s | 40.33s | 60.73s | 54341MB |
RWKV-4-Pile-1B5-20220929-ctx4096 | 3090 & 3080Ti | 32.42s | 1.36s | 2.08s | 2936MB |
RWKV-4-Pile-1B5-20220929-ctx4096 | 38 | 3.8s | 3.55s | 5.2s | 5302MB |
RWKV-4a-Pile-170M-20221209-7955 | 3090 & 3080Ti | 3.92s | 0.58s | 0.92s | 286MB |
RWKV-4a-Pile-170M-20221209-7955 | 38 | 0.81s | 0.59s | 1.12s | 604MB |
RWKV-4-Pile-430M-20220808-8066 | 3090 & 3080Ti | 9.02s | 1.22s | 1.86s | 748MB |
RWKV-4-Pile-430M-20220808-8066 | 38 | 1.42s | 1.53s | 2.56s | 1507MB |
RWKV-4-Pile-7B-20221123-ctx2048 | 3090 & 3080Ti | 154.28s | 1.79s | 2.97s | 13620MB |
RWKV-4-Pile-7B-20221123-ctx2048 | 38 | 37.79s | 30.65s | 37.27s | 29667MB |
RWKV-4-Pile-3B-20221110-ctx4096 | 3090 & 3080Ti | 61.79s | 1.83s | 2.85s | 5700MB |
RWKV-4-Pile-3B-20221110-ctx4096 | 38 | 7.39s | 11.55s | 15.1s | 11797MB |
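For reference, numbers like the ones in these tables can be measured with the same API as the usage example above. This is only a rough sketch: it reuses the calls shown earlier, and the exact semantics of max_length and stop_on_eos may differ slightly from what is assumed here.

```python
import time
from prwkv.rwkvtokenizer import RWKVTokenizer
from prwkv.rwkvrnnmodel import RWKVRNN4NeoForCausalLM

tokenizer = RWKVTokenizer.default()

start = time.time()
model = RWKVRNN4NeoForCausalLM.from_pretrained("RWKV-4-430M")   # downloads/loads the checkpoint
print(f"Load Time: {time.time() - start:.2f}s")

context_ids = tokenizer.encode("Benchmarking RWKV inference speed.").ids
start = time.time()
model.warmup_with_context(context=context_ids)                   # builds the initial hidden state
print(f"Warm Up Time: {time.time() - start:.2f}s")

start = time.time()
model.generate(input_ids=[], streaming_callback=lambda ind: None,
               max_length=25, repetition_penalty=0.0,
               temperature=0.8, stop_on_eos=False)               # don't stop early: time all 25 tokens
print(f"Run Time (25 tokens): {time.time() - start:.2f}s")
```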
Roadmap:
- Provide Pip Package
- Zero Shot Benchmark Tests: provide performance numbers for the models under various deployment strategies, focusing on edge devices (iPhone and Android)
- Onnx CUDA / CPU | Performance Charts | Export Scripts
- Torch CUDA / CPU and Streaming Large Models | Performance Charts | Export Scripts
- Torch.jit CUDA / CPU and Streaming Large Models | Performance Charts | Export Scripts
- Beam Search
- Typical Sampling
- Tail Free Sampling
- Top-p-x sampling method
- Provide Better Documentation (WIP)
Train:
- Seeker Dialog Model
- Palm Instruction Tuned Model
- GPT Instruction Tuned Model
- Note: AdamW takes roughly 8 bytes per parameter.
- Note: each model is roughly 14 GB of VRAM for a 7B model at fp16, and 28 GB of VRAM for a 7B model at fp32.
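A quick back-of-the-envelope check of those numbers (weights and optimizer state only; activations and gradients are extra):

```python
# Rough memory math for a 7B-parameter model.
params = 7e9

fp16_weights = params * 2     # 2 bytes per parameter in fp16
fp32_weights = params * 4     # 4 bytes per parameter in fp32
adamw_state = params * 8      # AdamW keeps two fp32 moments per parameter: 4 + 4 bytes

gib = 1024 ** 3
print(f"fp16 weights: {fp16_weights / gib:.1f} GiB")   # ~13 GiB, commonly rounded to 14 GB
print(f"fp32 weights: {fp32_weights / gib:.1f} GiB")   # ~26 GiB, commonly rounded to 28 GB
print(f"AdamW state : {adamw_state / gib:.1f} GiB")    # additional memory needed for training
```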