
add section for genai
javierluraschi committed May 28, 2024
1 parent 51071f1 commit 0b36dc7
Showing 15 changed files with 89 additions and 19 deletions.
2 changes: 1 addition & 1 deletion python/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "hal9"
version = "2.1.4"
version = "2.1.5"
description = ""
authors = ["Javier Luraschi <[email protected]>"]
readme = "README.md"
Binary file added stories15M.bin
Binary file not shown.
Binary file added tokenizer.bin
Binary file not shown.
File renamed without changes
File renamed without changes
File renamed without changes
24 changes: 12 additions & 12 deletions website/docs/genai/intro-ai.md → website/docs/genai/dnn.md
@@ -2,12 +2,12 @@
sidebar_position: 1
---

import RosenblattNYT from './intro-ai-nyt-learning-device-1958.jpeg';
import PerceptronLayered from './intro-ai-perceptron-multilayered.png';
import NewtonMethod from './intro-ai-newton-method.png';
import Imagenet from './intro-ai-imagenet.jpg';
import RosenblattNYT from './dnn-nyt-perceptron.jpeg';
import PerceptronLayered from './dnn-perceptron-multilayered.png';
import NewtonMethod from './dnn-newton-method.png';
import Imagenet from './dnn-imagenet.jpg';

# Intro to AI
# Deep Neural Networks

## The Perceptron

@@ -23,7 +23,7 @@ The perceptron maps stimuli to numeric inputs that are weighted into a threshold

For example, if you wanted to classify the number one in an image of 2x2 pixels, you could encode these pixels into an array of values `I = [[0,1], [0,1]]` and use the perceptron equation with coefficients `W = [[-1, 1], [-1, 1]]`, such that multiplying the pixels by the coefficients and summing the result (`W*I`) classifies the image as the number one when the value is larger than zero.
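
As a minimal sketch of that computation (using NumPy purely for illustration; the text does not assume any particular library), it looks like this:

```python
import numpy as np

# The "image" of a one has its right column lit up, and the hand-picked
# weights reward exactly those pixels.
I = np.array([[0, 1],
              [0, 1]])      # pixel intensities of the 2x2 image
W = np.array([[-1, 1],
              [-1, 1]])     # hand-picked coefficients for the digit "one"

activation = np.sum(W * I)  # element-wise product, then sum
print(activation > 0)       # True: classified as the number one
```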

Minsky and Papert found out that a single perceptron can classify only datasets that are linearly separable; however, they also revealed in their book _Perceptrons_ that layering perceptrons would bring additional classification capabilities. The next Figure presents the original diagram showcasing a multilayered perceptron.
Minsky and Papert found out that a single perceptron can classify only datasets that are linearly separable; however, they also revealed in their book _Perceptrons_ that layering perceptrons would bring additional classification capabilities. Since then, we've referred to layered perceptrons as **Neural Networks** (**NN**). The next figure presents the original diagram showcasing the multilayered perceptron neural network.

<center><img src={PerceptronLayered} alt="Multi-layered Perceptron" style={{width: 500}} /></center>

@@ -35,22 +35,22 @@ Let's go back to our 2x2 pixels image classification example. We can redefine th

You can think of *f(W)-1* as the error measurement that we are trying to minimize: we don't want just any random *W* values, we want weights that help us classify the image correctly.

**Calculus to the rescue!** We can solve equations like *f(W)-1* using the [Newton–Raphson method](https://en.wikipedia.org/wiki/Newton%27s_method), which differentiates the equation, picks random numbers of W, and then moves towards the direction that gets closer to the solution (the local minima). Check [Khan Academy's Newton's method](https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-1-new/ab-2-1/v/newton-leibniz-and-usain-bolt).
Calculus to the rescue! We can solve equations like *f(W)-1* using the [Newton–Raphson method](https://en.wikipedia.org/wiki/Newton%27s_method), which differentiates the equation, picks a random starting value for *W*, and then uses the derivative of the equation to move in the direction that gets closer to the solution (a local minimum). Check [Khan Academy's Newton's method](https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-1-new/ab-2-1/v/newton-leibniz-and-usain-bolt).

<center><a href="https://towardsdatascience.com/newton-raphson-explained-and-visualised-23f63da21bd5"><img src={NewtonMethod} alt="Multi-layered Perceptron" style={{width: 500}} /></a></center>
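
As a rough sketch of the idea, here is the method applied to a toy equation; the function, derivative, and starting point below are made up for illustration:

```python
# Follow the tangent line of f at the current guess down to zero, repeatedly.
def newton_raphson(f, df, x, steps=10):
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

# Toy example: solve w**2 - 2 = 0, i.e. approximate the square root of 2.
root = newton_raphson(lambda w: w ** 2 - 2, lambda w: 2 * w, x=1.0)
print(root)  # ~1.41421
```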

This approach, among many other improvements beyond the scope of this introduction, is what Geoff Hinton used to solve this. The strategy used in DNN is called *Backpropagation* and uses **gradient descent** as the method used to find local minima in n-dimensional spaces. To keep this simple, you can think of gradient descent as Newton-Raphson method for more complex equations.
This approach, among many other improvements beyond the scope of this introduction, is what Geoff Hinton used to find weights for networks with many layers, which we refer to as **Deep Neural Networks** (**DNN**). The strategy used in DNNs is called *Backpropagation* and uses **gradient descent** as the method to find local minima in n-dimensional spaces. The term backpropagation comes from propagating the error backwards (from output to input) and using the derivative to guide the direction in which the weights are updated. You can think of backpropagation as the Newton-Raphson method for more complex equations. The process of finding the right weights is referred to as **training** the model.

It is worth mentioning that the original perceptron function is not differentiable; to solve this problem, we replace it with a function that is similar enough, like a sigmoid or a ReLU, and call it a day. We also use techniques like dropout to escape local minima in our search for weights.
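
To make the idea of training a bit more concrete, here is a rough sketch of gradient descent for the earlier 2x2 example, assuming a sigmoid output and a cross-entropy error (neither of which is spelled out above):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 1, 0, 1],    # flattened 2x2 image of a "one"
              [1, 0, 1, 0]])   # flattened image that is not a "one"
y = np.array([1.0, 0.0])       # labels: is this a "one"?

W = np.zeros(4)                # start from arbitrary weights
for _ in range(100):
    pred = sigmoid(X @ W)      # forward pass through the "perceptron"
    grad = X.T @ (pred - y)    # derivative of the error with respect to W
    W -= 1.0 * grad            # step against the gradient (learning rate 1.0)

print(W.round(2))              # same sign pattern as the hand-picked [-1, 1, -1, 1]
```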

## Autodiff

You might be thinking, in order to classify images using DNN, do I need to differentiate complex equations by hand? That would be ceritanly a lot of fun, but certainly not. We have developed computer libraries that perform automatic differentiation (**autodiff**) to avoid having to compute them by hand. So these days we can simply define our DNN network by code, and a computer library (usually written in Python) will differentiate, apply gradient descent and gives us the final weights. With one important caveat!
You might be thinking: in order to classify images using a DNN, do I need to differentiate complex equations by hand? That would certainly be a lot of fun, but no. We have developed computer libraries that perform **automatic differentiation** (**autodiff**) to avoid having to compute derivatives by hand. So these days we can simply define our DNN in code, and a computer library (usually written in Python) will differentiate it, apply gradient descent, and give us the final weights. With one important caveat!
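
Before getting to that caveat, here is a tiny sketch of what autodiff looks like in practice, assuming PyTorch as the library (any autodiff library would do):

```python
import torch

w = torch.zeros(4, requires_grad=True)     # weights we want to train
x = torch.tensor([0.0, 1.0, 0.0, 1.0])     # flattened 2x2 image of a "one"

error = (torch.sigmoid(w @ x) - 1.0) ** 2  # squared error against label 1
error.backward()                           # autodiff computes d(error)/dw
print(w.grad)                              # the gradient, no calculus by hand
```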

Gradient descent is an incremental algorithm taht takes computing resources, and even worse, DNNs have hundreds of layers and sometimes millions of perceptrongs or more. Not only that but we usually need also millions of iamges to classify is something is a cat or not, so there is a lot of computing going on to put all this together.
For image classification, DNNs can have hundreds of layers and millions of perceptrons, making the process far from trivial, since gradient descent is an incremental algorithm that requires significant computing resources. To train such a model we also usually need millions of images to classify, so there is a lot of computing going on to put all this together. The following figure shows a famous image classification dataset called [ImageNet](https://www.image-net.org/), which was used to train image classification DNNs and contains millions of labeled images.

<center><a href="https://paperswithcode.com/dataset/imagenet"><img src={Imagenet} style={{width: 400}} /></a></center>

To crunch all the numbers, is very comon not to use CPUs (Central Processing Unit) but rather GPUs (Graphic Processing Unit) which where originally designed to play videogames and are really good at rendering 3D in parallel, which happens to be very similar to iterate in parallel for DNN.
To crunch all these numbers, it is very common not to use CPUs (Central Processing Units) but rather GPUs (Graphics Processing Units), which were originally designed for videogames and are really good at rendering 3D graphics in parallel, which happens to be very similar to the operations (matrix multiplications) required to train DNNs.
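
To see why matrix multiplication is the bottleneck, note that applying one layer to a whole batch of images is a single matrix multiplication; the sizes below are made up for illustration:

```python
import numpy as np

batch = np.random.rand(10_000, 784)  # 10,000 flattened 28x28 images
layer = np.random.rand(784, 128)     # one layer with 128 perceptrons

activations = batch @ layer          # one (10000, 784) x (784, 128) matmul
print(activations.shape)             # (10000, 128)
```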

That's about it, you are somewhat caught up with AI just before the advent of [Generative AI](intro-genai).
That's about it: you are now caught up with AI just before the advent of Generative AI with [LLMs](llm.md).
5 changes: 0 additions & 5 deletions website/docs/genai/intro-genai.md

This file was deleted.

Binary file added website/docs/genai/llm-autoencoder.png
Binary file added website/docs/genai/llm-gpt-1.png
Binary file added website/docs/genai/llm-transformer.png
42 changes: 42 additions & 0 deletions website/docs/genai/llm.md
@@ -0,0 +1,42 @@
---
sidebar_position: 2
---

import Autoencoder from './llm-autoencoder.png';
import Transformer from './llm-transformer.png';
import GPT1 from './llm-gpt-1.png';

# Large Language Models

## Embeddings

An autoencoder is a type of [DNN](dnn.md) that does not require classification labels but rather performs unsupervised learning by asking the DNN to predict the inputs of the network as the outputs. For example, when classifying the image of a cat, the pixels of that cat would be the input and the classification label would also be all the pixels of the cat.

<center><a href="https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798"><img src={Autoencoder} style={{width: 500}} /></a></center>

This can seem pretty pointless: why would we spend so many compute resources training neural networks that produce the same output as the given input? Interestingly, it was discovered that the middle layer, which contains an array (vector) of only a few numbers, has very interesting properties; we will refer to this middle layer as the **embedding**.
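
A minimal sketch of an autoencoder, assuming PyTorch and made-up layer sizes, shows the bottleneck embedding in the middle and the input reused as its own label:

```python
import torch
from torch import nn
import torch.nn.functional as F

autoencoder = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),  # encoder: 784 pixels squeezed down
    nn.Linear(64, 8),               # the embedding: just 8 numbers
    nn.Linear(8, 64), nn.ReLU(),    # decoder: expand the embedding back
    nn.Linear(64, 784),             # reconstruct all 784 pixels
)

image = torch.rand(1, 784)                       # a fake flattened image
loss = F.mse_loss(autoencoder(image), image)     # the input is also the label
loss.backward()                                  # train to reproduce the input
```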

It was found that such embeddings generalize and build an intuitive understanding of the underlying data. For example, when using embeddings with text as input (as opposed to images), one can use them to ask a question like "What is the term for a king that is not a man?". Such a question can be answered by simply adding and subtracting [King – Man + Woman](https://www.technologyreview.com/2015/09/17/166211/king-man-woman-queen-the-marvelous-mathematics-of-computational-linguistics/) and finding that the resulting embedding is actually the vector for Queen, which is surprising given that this was learned by the autoencoder itself. This is arguably an early example of emergent abilities, as in, an unexpected behavior the model was not designed to accomplish.
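
A toy sketch of that arithmetic, using made-up 3-number embeddings instead of real learned ones, looks like this:

```python
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

target = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The stored embedding closest to king - man + woman turns out to be queen.
print(max(vectors, key=lambda word: cosine(vectors[word], target)))
```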

## Transformers

We can use a DNN to predict the next word from a given text; for example, we can train a DNN that, given 3 embeddings, tells us the next token, so we could ask the DNN to find the next word after `['King', 'wife', 'is']`. The initial text used is referred to as the **prompt**. Using a plain DNN, we would get decent completions from training over several books, and we would be able to get reasonable guesses like 'Queen' for that example.
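
As a toy sketch of the next-word prediction task itself (far simpler than a DNN), we could just count which word most often follows the last word of the prompt in some training text:

```python
from collections import Counter

training_text = "the king is dead long live the queen the king is old".split()

def predict_next(prompt_words, text=training_text):
    last = prompt_words[-1]
    # Count every word that follows the last prompt word in the training text.
    followers = Counter(text[i + 1] for i in range(len(text) - 1) if text[i] == last)
    return followers.most_common(1)[0][0]

print(predict_next(["the", "king", "is"]))  # a crude guess for the next word
```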

However, using a standard (feed-forward) DNN turns out not to really show signs of intelligence. If we were to use the prompt "Rose is the Queen. Who is the King's wife?" we would likely get a reply like "The Queen", or even worse, the name of a Queen seen in the training text, such as "Queen Elizabeth" and the like.

To solve that problem, variations of DNNs were explored, like Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and the like. Those showed improvements, but it was not until the **transformer** was presented in the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper that this problem was properly addressed.

<center><a href="https://arxiv.org/abs/1706.03762"><img src={Transformer} style={{width: 380}} /></a></center>

You can think of attention as using a DNN to figure out which parts of the text to pay attention to. Even if the reference to "Rose is the Queen" appears way earlier in the prompt, the attention mechanism tells the DNN that, to answer this question, it should also look for references in those other parts of the text.
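
A rough sketch of the scaled dot-product attention at the core of the transformer, with made-up token embeddings, looks like this:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # how similar each token is to every other
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)  # softmax: where to put attention
    return weights @ V                                     # mix of values, weighted by attention

tokens = np.random.rand(8, 16)                  # made-up embeddings for 8 prompt tokens
print(attention(tokens, tokens, tokens).shape)  # (8, 16): one mixed vector per token
```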

## Generative Pretrained Transformers

Transformers also surprised us with another level of emergent abilities, related to answering very basic questions, in 2018 with [GPT-1 by OpenAI](https://openai.com/index/language-unsupervised/). Over time, we found out that there are further [emergent abilities in larger transformer models](https://arxiv.org/abs/2206.07682) and started referring to pre-trained large transformer models as **Generative Pretrained Transformers** (**GPT**): models that use more data, more compute (GPUs), and more parameters to train complex DNNs with backpropagation. To leave room for other kinds of models that go beyond transformers, we refer to large GPT models as **Large Language Models** (**LLM**).

<center><a href="https://openai.com/index/language-unsupervised/"><img src={GPT1} style={{width: 500}} /></a></center>

The *Generative* term in GPT comes from the ability to generate text (embeddings) and the focus on applications that generate content for question answering, summarization, and many of the other emergent abilities a GPT shows.

Refer to [Advancements in Generative AI](https://arxiv.org/abs/2311.10242) for additional details or hop into the [Prompting](prompts.md) section to learn techniques to maximize the practical use of LLMs.

4 changes: 4 additions & 0 deletions website/docs/genai/prompts.md
@@ -0,0 +1,4 @@

# Prompt Engineering

Under construction; in the meantime, check [A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications](https://arxiv.org/pdf/2402.07927).
31 changes: 30 additions & 1 deletion website/docs/genapps/hello-world.md
@@ -4,7 +4,36 @@ sidebar_position: 1

# Hello World

## Chatbot

ChatGPT popularized the chat interface as the application interface to interoperate with LLMs; tools like MidJourney have also popularized it through their use of Discord.

From an application development perspective, the simplest chat interface we can build relies on the input / output functions provided by the language itself.

For example, the following Python code creates a chatbot that replies with an echo of whatever the input is:

```python
# Read a message from the user and reply by echoing it back
echo = input()
print(f"Echo: {echo}")
```

Arguably that's the simplest chatbot we can create, but it fails to use any generative AI technology.

## Tiny Llama

To fix that, we can change the chatbot to use a small **Open Source Software** (**OSS**) LLM called Tiny Llama, created by Andrej Karpathy, with a nice [Python wrapper](https://github.com/karpathy/llama2.c) that we can easily run.

```bash
# Install the Python wrapper, then download the model weights and tokenizer
pip install llama2-py==0.0.6
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://github.com/tairov/llama2.py/raw/master/tokenizer.bin
```

```python
import llama2_py

# Read a prompt from the user and complete it with the Tiny Llama model
prompt = input()
llama2_py.run({ "checkpoint": "stories15M.bin", "temperature": 0.0, "steps": 256, "prompt": prompt })
```

Notice that `llama2_py.run` prints the output itself, so there is no need to call `print()` explicitly.
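
As a small extension, and reusing the same `llama2_py.run` call shown above, a sketch that keeps the conversation going in a loop could look like this:

```python
import llama2_py

# Keep answering prompts until the user submits an empty line.
while True:
    prompt = input("You: ")
    if not prompt:
        break
    llama2_py.run({ "checkpoint": "stories15M.bin", "temperature": 0.0, "steps": 256, "prompt": prompt })
```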
