TorchSharp NanoGpt #1379

GeorgeS2019 · 2024-10-01T22:08:43Z

GeorgeS2019
Oct 1, 2024

https://github.com/biegehydra/NanoGptDotnet/blob/master/src/NanoGpt/NanoGpt.cs

ejhg · 2024-10-01T22:28:23Z

ejhg
Oct 1, 2024

Thank you @GeorgeS2019 !

I can almost feel the AGI now!

0 replies

travisjj · 2024-10-05T17:12:41Z

travisjj
Oct 5, 2024

This is a nicely implemented version of Karpathy's first setup of reproducing GPT, but does not include the updates from GPT-2 such as some of the CUDA performance improvements, the checkmarks, and some minor tweaks to preactivation.

4 replies

GeorgeS2019 Oct 6, 2024
Author

@travisjj check these:
https://github.com/biegehydra/NanoGptDotnet
https://github.com/PentoreXannaci/NanoGPTNet

ejhg Nov 7, 2024

Along these lines, I have been playing a lot with o1-preview. I copy and paste large chunks of transformer related python code and ask it to translate it to C#. It is mind-blowingly effective. The tasks I give it are particularly challenging because I'm asking it to rewrite llama from scratch without dependencies, or I ask it to unflatten tensor operations a certain way, etc. This approach could be effective in bringing over any of Karpathy's implementations to C#/TorchSharp.

ejhg Nov 7, 2024

Example: I couldn't find any C# code to train a Llama model from a quick google search, so I asked it to write it for me from one of Karparthy's repos. (I still have to test it...)

https://chatgpt.com/share/672d06a0-e6fc-8009-b29a-408e0575a4bd

travisjj Nov 11, 2024

Hi @GeorgeS2019 , one of those (NanoGPTNet) is a simple nano-gpt.com wrapper, which doesn't really include any torchsharp code unfortunately. Pentore is also from the older version released by Karpathy. A great indicator is the lack of GELU, but there are more subtleties as well such as all of the performance increases he gains in the subsequent version; for example with F.scaled_dot_product_attention inside of the CausalSelfAttention's forward pass.

I do have a working version I could share, perhaps with some polish... I am used to publishing c# enterprise code and this code isn't commented and the call to start the training itself is rather chicken scratch. I haven't played around too much with it, but on an Azure A100 it hit 600k tokens per second, and 200k tokens processed per second on gradient accumulation.

travisjj · 2024-11-21T19:11:44Z

travisjj
Nov 21, 2024

Part of the intent with Karpathy making NanoGPT was to be able to mimic GPT-2 in such a way as to be able to load it's model weights. As I have gotten further into developing NanoGPT in TorchSharp I have noticed that the parameter count is slightly off from what it is in PyTorch.

I am using the hyper parameters that should be equivalent to GPT-2 small:

var gptConfig = new HyperParams()
{
    BlockSize = 1024,
    VocabSize = 50257,

    BlockLayers = 12,
    NumberOfHeads = 12,

    EmbedDimensions = 768
};

(huggingface config for GPT-2: https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2Config)

Unfortunately this is yielding 163M (163037184), which is off from GPT-2 small's 124M. As a result, and this is the important part, I cannot load a pretrained model from huggingface. I tried using the same train.bin that Karpathy loads but it is a mismatch. If anyone was curious about training their own models, it looks like the 163M parameter model would take 3.7 days per epoch on an Azure NVidia A100.

Of the other implementations of NanoGPT in TorchSharp, has anyone seen any which will load a huggingface model using Model.Load that "just works"?

0 replies

ejhg · 2024-11-21T20:24:42Z

ejhg
Nov 21, 2024

Can you share your branch to repro?

1 reply

travisjj Nov 22, 2024

I definitely will, however, it is currently prepped to run with cuda enabled on a linux Azure instance and lacks comments; so... maybe I should repackage (to cpu? what is everyone training these on? hobby? home gpu? Bueller!?) prior to sharing this exact version. I was mostly looking to reproduce the functionality prior to making this anything larger. To be honest I fully agree with your sentiment about llama2 and that direction.

ejhg · 2024-11-21T21:24:09Z

ejhg
Nov 21, 2024

Personally, I've shifted my attention away from nanogpt towards llama2.c instead: https://github.com/karpathy/llama2.c

IMO, the complexity of llama is comparable to nanogpt, yet llama uses a more modern architecture, and you have much better open source pre-trained models available to plug and play.

The neat thing about llama2.c is that you can run inference on the cpu, and the inference file is written in a single C file from scratch.

There's already a C# port of llama2.c, and I have my own port as well... If you compile with optimizations and AOT, it runs impressively fast, comparable to the original C version, though I don't have a benchmark.

3 replies

travisjj Nov 22, 2024

Do the llama2 pre-trained models load in the c# ported versions you have seen or created?

ejhg Nov 22, 2024

@travisjj I haven't tested this, but you would need to run Karpathy's script to convert the huggingface model into a .bin file, which you can then run in any of the llama2.c versions. This is the script: https://github.com/karpathy/llama2.c/blob/master/export.py

Here's a C# version of the CPU inference file: https://github.com/trrahul/llama2.cs

Here's my version before I broke it up into smaller files: https://github.com/ejhg/transformers/blob/21b0234f7de6d1f4bd135adc41baaed93128169f/llama.cs/model.cs

It's worth noting that I used o1-preview heavily to craft my version. Not only did it perform the initial translation, but I guided it to refactor it in certain ways. For instance, below is what my weights look like. Notice that I use multi-dimensional arrays instead of flat arrays like Karpathy and the other versions do. Since we're not running on GPUs, there's no point in flattening arrays and losing structure. It was very tricky to get this right, and some of the arrays can still be broken down more. I verified that this did not impact performance. I resorted to unsafe code in one of the matrix multiplies. The thread parallelization was ChatGPT's idea...

    public class LayerWeights
    {
        // Weights for RMSNorms
        public float[] rms_att_weight; // [dim]
        public float[] rms_ffn_weight; // [dim]

        // Weights for matmuls. note dim == n_heads * head_size
        public float[,] wq; // [n_heads * head_size][dim]
        public float[,] wk; // [n_kv_heads * head_size][dim]
        public float[,] wv; // [n_kv_heads * head_size][dim]
        public float[,] wo; // [dim][n_heads * head_size]

        // Weights for FFN
        public float[,] w1; // [hidden_dim][dim]
        public float[,] w2; // [dim][hidden_dim]
        public float[,] w3; // [hidden_dim][dim]
    }

ejhg Nov 22, 2024

This is llama inference running on TorchSharp, and it does load hugging face files successfully: https://github.com/ejhg/llama-torchsharp

I tested it with both Llama 2 and 3. Llama 3 does work, but there is one scaling layer that isn't implemented, so it works, but it could be better.

I plan to reuse the loader from this repo and combine it with the CPU implementation mentioned above so Karpathy's python script can be bypassed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TorchSharp NanoGpt #1379

{{title}}

Replies: 5 comments 8 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

TorchSharp NanoGpt #1379

Replies: 5 comments · 8 replies

GeorgeS2019 Oct 6, 2024 Author

Replies: 5 comments 8 replies

GeorgeS2019 Oct 6, 2024
Author