#import "@local/anu-typst-template:0.1.0": *
#show: anu-template.with(
title: [_My First LM_ Instructions],
subtitle: "A Cybernetic Studio Project",
socy_logo: true
)
Ever wanted to train your own Language Model by hand? Now you can.
== Training Phase
<training>
You'll need:
- a nice fine-tipped pen
- a grid printout
- a couple of paragraphs of text (a paragraph from the newspaper, a page from a
kids' book, etc.)
The training procedure is:
+ write the _first_ word of the text in the *first blank row* and *first blank
column* on the grid (in both cases it'll be right up near the logo in the
top-left-hand corner)
+ look at the _next_ word in your text: write it in the *next blank row* and
*next blank column* on the grid (which will be the second row and second column)
+ make a tally score mark in the *first row, second column* grid cell, so that
the grid cell contains a count of the number of times you've seen the second
word following the first one
+ return to step 2 and continue through your text: each time, check whether the next
word is one you've already written on the grid (if not, write it in the *next
blank row* and *next blank column* as before), then add a tally mark in the grid
cell at the _current_ word's row and the _next_ word's column
Once you're done, your grid (and the total tally scores in each grid cell) is
what's called a _co-occurrence table_, which keeps track of how often each word in
your text follows each other word. *This #emph[is] your language model.*
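If you'd like to check your hand-built grid against a computer, here's a minimal
Python sketch of the same counting procedure (the example sentence is made up;
substitute your own training text):

```python
from collections import defaultdict

# A made-up example text; swap in your own couple of paragraphs here.
text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

# counts[current][next] is the tally in the grid cell at row `current`,
# column `next`: how many times `next` followed `current` in the text.
counts = defaultdict(lambda: defaultdict(int))
for current, nxt in zip(words, words[1:]):
    counts[current][nxt] += 1

# Print every non-empty grid cell (row word, column word, tally).
for current, row in counts.items():
    for nxt, tally in row.items():
        print(f"{current!r} -> {nxt!r}: {tally}")
```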
#pagebreak()
== Inference Phase
<inference>
Now comes the fun part: you can use your model to generate new text (this is
called _inference_ in Machine Learning jargon).
+ choose one of the words in your model as your starting word (this is your
"prompt") and write it down---this is the start of your "generated text"
+ look along the row corresponding to that word:
- if there's only one grid cell in the row that has any tally marks in it,
that's your next word
- if there are multiple grid cells with tally marks, choose one at random
(weighted by the number of tally scores, so that e.g.~if one grid cell has 2
marks and one has 3, then choose the first grid cell 40% of the time and the
second grid cell 60% of the time)
+ write down the chosen word in your generated text, and repeat step 2 until
you've generated as much text as you like (feel free to add punctuation as
necessary to make the generated text make sense)
If you want to write a new sentence, go back to step 1 and choose a new starting
word (prompt).
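Here's a minimal Python sketch of the same inference loop. It assumes a `counts`
table shaped like the one in the training sketch (the tiny example table below is
made up), and uses `random.choices` for the tally-weighted pick:

```python
import random

# A made-up example table: counts[current][next] = tally. In practice this
# would be the co-occurrence grid you built in the training phase.
counts = {
    "the": {"cat": 1, "dog": 1, "mat": 2},
    "cat": {"sat": 1},
    "dog": {"sat": 1},
    "sat": {"on": 2},
    "on": {"the": 2},
    "mat": {},
}

def generate(counts, prompt, length=10):
    """Start from `prompt` and repeatedly pick the next word,
    weighted by the tally marks in the current word's row."""
    word = prompt
    generated = [word]
    for _ in range(length):
        row = counts.get(word)
        if not row:  # empty row: nothing ever followed this word, so stop
            break
        # random.choices makes the tally-weighted pick from step 2
        word = random.choices(list(row), weights=list(row.values()))[0]
        generated.append(word)
    return " ".join(generated)

print(generate(counts, "the"))
```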
== Extension activities
+ train a new model with a different training text, and compare the text
generated by the two models
+ swap your model with another group's model, and continue the training procedure
with your text but with their model as a starting point
+ re-train your model, but this time include special "tokens" for punctuation
(e.g.~a "full stop" token, a "comma" token, etc.) as words in their own right
+ try training a model with a different "n-gram" size (e.g.~a trigram model, where
you look at the _two_ preceding words to determine the next word), although note that
the grid gets pretty big pretty fast---if there are $N$ distinct words in your
text then you'll need $N^2$ rows in your grid (see the sketch after this list)
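For that last extension activity, here's a rough Python sketch of the same counting
procedure with a two-word context (again with a made-up sentence); the only real
change is that each row is now labelled by a _pair_ of words:

```python
from collections import defaultdict

text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

# Each row is now labelled by the pair of preceding words, so the number
# of possible rows grows as the square of the number of distinct words.
counts = defaultdict(lambda: defaultdict(int))
for w1, w2, nxt in zip(words, words[1:], words[2:]):
    counts[(w1, w2)][nxt] += 1

for pair, row in counts.items():
    for nxt, tally in row.items():
        print(f"{pair} -> {nxt!r}: {tally}")
```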
== Further reading
The specific type of model you've built is called a bigram model (a type of
#link("https://en.wikipedia.org/wiki/N-gram")["n-gram" model];, where "n" is 2,
because each count involves a pair of words). Each "next word" choice is determined
by the "current word" only. These models are different from (and less sophisticated
than) the Transformer-based models like GPT which now dominate the LLM landscape,
but there's evidence that a sufficiently large n-gram model can get you
#link("https://arxiv.org/abs/2407.12034")[almost 80% of the way to a fancier
GPT-based model].
Having said that, GPT-4 reportedly has around 1.75T parameters, which when printed out
(assuming one 1cm#super[2] grid cell per parameter) would cover 175km#super[2]---enough
to cover #link("https://en.wikipedia.org/wiki/Melbourne_Cricket_Ground")[the MCG]
_ten thousand times_.
== Appendix: weighted randomness with dice
<weighted-randomness>
You'll need a source of "weighted" random numbers for the "select the next word
based on the tally scores" part of the Inference procedure described above.
If you've got a d6 (a normal 6-sided die) it's not too tricky. Here's an example:
- if there's one column (i.e. one potential next word) with one tally mark and
another with two tally marks, then roll the dice---if it's 1 or 2 choose the
first column, otherwise choose the second column
Remember: what matters isn't the _absolute_ number of tallies in each column, just
the _relative_ number compared to all the other columns. So if the two potential
next words have e.g. 1 tally each or 4 tallies each, it makes no difference: you
just need to choose between them so that each one has an equal chance (a roll of
1-3 means the first one, 4-6 means the second one).
If you end up with a larger language model and need more than a d6, you could use
a d20 (a 20-sided die) with a similar approach.
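If you want to sanity-check a d20 partition, here's a minimal Python sketch of the
idea (the words and tallies are made-up examples): it gives each candidate next word
a band of faces in proportion to its tally, and any leftover faces mean "roll again".

```python
# Made-up example row from the grid: candidate next words and their tallies.
tallies = {"cat": 1, "dog": 2, "mat": 2}
sides = 20  # faces on the die

total = sum(tallies.values())
assert total <= sides, "this simple scheme needs at least one face per tally mark"
faces_per_tally = sides // total  # here: 20 // 5 = 4 faces per tally mark

band_start = 1
for word, tally in tallies.items():
    band_end = band_start + tally * faces_per_tally - 1
    print(f"roll {band_start}-{band_end}: choose {word!r}")
    band_start = band_end + 1
if band_start <= sides:
    print(f"roll {band_start}-{sides}: roll again")
```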
// TODO make a nice diagram of the above "partition the D20 range" approach
// #import "@preview/cetz:0.3.2"
// #cetz.canvas({
// import cetz.draw: *
// let anu-gold = rgb("#be830e")
// let anu-copper = rgb("#be4e0e")
// let anu-teal = rgb("#0085ad")
// rect((0, 0), (rel: (1, 1)), radius: .1, stroke: anu-copper)
// })