From 6e8b7a8548c8b118e90d3bef15787396f6a44f14 Mon Sep 17 00:00:00 2001 From: Michael Abbott <32575566+mcabbott@users.noreply.github.com> Date: Tue, 3 Dec 2024 01:23:21 -0500 Subject: [PATCH] re-write of basics page --- docs/src/guide/models/basics.md | 416 +++++++++++++++++++++----------- 1 file changed, 275 insertions(+), 141 deletions(-) diff --git a/docs/src/guide/models/basics.md b/docs/src/guide/models/basics.md index 3bb4358afe..2e3c2b0e9a 100644 --- a/docs/src/guide/models/basics.md +++ b/docs/src/guide/models/basics.md @@ -1,246 +1,380 @@ -# [How Flux Works: Gradients and Layers](@id man-basics) +# [How Flux Works: Parameters, Gradients, and Layers](@id man-basics) -## [Taking Gradients](@id man-taking-gradients) +A neural network is a function with *parameters*. +That is, it takes some input `x` and gives you some output `y`, +whose value also depends on some other numbers `θ`. -Flux's core feature is taking gradients of Julia code. The `gradient` function takes another Julia function `f` and a set of arguments, and returns the gradient with respect to each argument. (It's a good idea to try pasting these examples in the Julia terminal.) +A sufficiently flexible function can, by adjusting the parameters just right, +be made to do many things. And the one magic trick for adjusting parameters +is to follow a *gradient*. -```jldoctest basics -julia> using Flux +This page describes Flux's take on how to construct such flexible functions +containing many parameters, and how to handle their gradients. -julia> f(x) = 3x^2 + 2x + 1; +## Parameterised Functions -julia> df(x) = gradient(f, x)[1]; # df/dx = 6x + 2 +Let's start with very simple functions. This is a polynomial in `x::Real`, +returning another real number `y` which depends on some coefficients stored in a vector: -julia> df(2) -14.0 +```jldoctest poly +θ = [10, 1, 0.1] -julia> d2f(x) = gradient(df, x)[1]; # d²f/dx² = 6 +poly1(x::Real) = θ[1] + θ[2]*x + θ[3]*x^2 -julia> d2f(2) -6.0 +poly1(5) == 17.5 # true ``` -When a function has many parameters, we can get gradients of each one at the same time: +Here the parameters are a global variable `θ`. They could be handled in other ways, +for instance by explicitly passing them as an additional argument to the function: -```jldoctest basics -julia> f(x, y) = sum((x .- y).^2); +```jldoctest poly +poly2(x::Real, θ2) = evalpoly(x, θ2) -julia> gradient(f, [2, 1], [2, 0]) -([0.0, 2.0], [-0.0, -2.0]) +poly2(5, θ) == 17.5 # true ``` -These gradients are based on `x` and `y`. Flux works by instead taking gradients based on the weights and biases that make up the parameters of a model. +Flux chooses a third path, by *encapsulating* the parameters within the function. +The simplest way to do this is a *closure*, an anonymous function which Julia knows +to depend on some local variable `θ3`: + +```jldoctest poly +poly3 = let θ3 = [10, 1, 0.1] + x -> evalpoly(x, θ3) +end -Machine learning often can have *hundreds* of parameter arrays. -Instead of passing them to `gradient` individually, we can store them together in a structure. -The simplest example is a named tuple, created by the following syntax: +poly3(5) == 17.5 # true +``` -```jldoctest basics -julia> nt = (a = [2, 1], b = [2, 0], c = tanh); +An equivalent, but tidier, way is to construct a `struct` in which to store the parameters. 
+Any struct can be made callable, allowing its instances to act just like a function:

-julia> g(x::NamedTuple) = sum(abs2, x.a .- x.b);
+```jldoctest poly
+struct Poly3{T} # container struct
+    θ3::T
+end
+(p::Poly3)(x::Real) = evalpoly(x, p.θ3) # make this callable
+
-julia> g(nt)
-1
+poly3s = Poly3([10, 1, 0.1]) # construct an instance
+
-julia> dg_nt = gradient(g, nt)[1]
-(a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing)
+poly3s(5) == 17.5 # true
```

-Notice that `gradient` has returned a matching structure. The field `dg_nt.a` is the gradient
-for `nt.a`, and so on. Some fields have no gradient, indicated by `nothing`.
+Internally, there is little difference between a closure and a struct.
+They have the same fields, and equivalent methods:

-Rather than define a function like `g` every time (and think up a name for it),
-it is often useful to use anonymous functions: this one is `x -> sum(abs2, x.a .- x.b)`.
-Anonymous functions can be defined either with `->` or with `do`,
-and such `do` blocks are often useful if you have a few steps to perform:
+```jldoctest poly
+poly3s.θ3 == poly3.θ3 == θ # both have a field called :θ3
+dump(poly3) # contains θ3: Array
+dump(poly3s)
+methods(poly3)
+methods(poly3s) # each has 1 method, taking x::Real
+```

-```jldoctest basics
-julia> gradient((x, y) -> sum(abs2, x.a ./ y .- x.b), nt, [1, 2])
-((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])
+The virtue of encapsulation is that it makes composition very easy.
+We can make more complicated functions by combining simple ones,
+and each will keep track of its own parameters.
+Julia writes function composition as `∘`, for instance `(inv ∘ sin)(pi/6) ≈ 2`,
+and we can use exactly this for our parameterised polynomials:

-julia> gradient(nt, [1, 2]) do x, y
- z = x.a ./ y
- sum(abs2, z .- x.b)
- end
-((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])
-```
+```jldoctest poly
+poly4 = Poly3([1, 0.5, 0]) ∘ Poly3([10, 1, 0.1])
+
-Sometimes you may want to know the value of the function, as well as its gradient.
-Rather than calling the function a second time, you can call [`withgradient`](@ref Zygote.withgradient) instead:
+poly4 isa ComposedFunction # ∘ creates another struct...
+poly4.inner.θ3 == θ # which has fields :inner & :outer
+
-```
-julia> Flux.withgradient(g, nt)
-(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))
+poly4(5) == 9.75 # true
```

-## Building Simple Models
+Flux models are precisely made by such function composition.
+In fact, `poly3` and `poly4` are already valid Flux models.

-Consider a simple linear regression, which tries to predict an output array `y` from an input `x`.
-```julia
-predict(W, b, x) = W*x .+ b
+## [Taking Gradients](@id man-taking-gradients)

-function loss(W, b, x, y)
- ŷ = predict(W, b, x)
- sum((y .- ŷ).^2)
-end
+The derivative of a scalar function is its slope: how fast the output changes as the input is changed slightly.
+This may be found approximately by evaluating at two nearby points, and exactly by taking the limit in
+which the distance between them approaches zero:

-x, y = rand(5), rand(2) # Dummy data
-W = rand(2, 5)
-b = rand(2)
+```jldoctest poly
+julia> (poly1(5 + 0.1) - poly1(5)) / 0.1
+2.010000000000005
+
-loss(W, b, x, y) # ~ 3
+julia> (poly1(5 + 0.001) - poly1(5)) / 0.001 # answer is getting close to 2
+2.000100000003613
```

-To improve the prediction we can take the gradients of the loss with respect to `W` and `b` and perform gradient descent.
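+For this simple polynomial we can also differentiate by hand: `∂f/∂x = θ[2] + 2θ[3]*x`,
+which at `x = 5` is exactly 2. (The helper `dpoly1` below is only for illustration,
+it is not part of Flux.)
+
+```julia
+dpoly1(x) = θ[2] + 2*θ[3]*x   # derivative of poly1, written out by hand
+
+dpoly1(5) == 2   # true
+```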
+Flux's `gradient(f, x)` works this out for `f(x)`, and gives exactly `∂f/∂x = 2.0` here:

-```julia
-using Flux
+```jldoctest poly
+julia> gradient(poly1, 5)
+(2.0,)
+```
+
+The reason `gradient` returns a tuple, not just the number `2.0`, is to allow for
+functions taking several arguments. (That's also why it's not called "derivative".)
+For instance, this returns `∂f/∂x, ∂f/∂y, ∂f/∂z`:

-dW, db = gradient((W, b) -> loss(W, b, x, y), W, b)
+```jldoctest poly
+julia> gradient((x,y,z) -> (x*y)+z, 30, 40, 50)
+(40.0, 30.0, 1.0)
```

-Now that we have gradients, we can pull them out and update `W` to train the model.
+For our parameterised polynomial, we have `∂f/∂x` but we are really more interested
+in `∂f/∂θ`, as this will tell us how the parameters affect the answer.
+It is possible to track gradients with respect to global `θ` (see below),
+but clearer to track explicit arguments. Here's how this works for `poly2` (which takes `θ` as a 2nd argument)
+and `poly3s` (which encapsulates `θ`):

-```julia
-W .-= 0.1 .* dW
+```jldoctest poly
+julia> grad2 = gradient(poly2, 5, θ)
+(2.0, [1.0, 5.0, 25.0])
+
-loss(W, b, x, y) # ~ 2.5
+julia> grad3 = gradient((x,p) -> p(x), 5, poly3s)
+(2.0, (θ3 = [1.0, 5.0, 25.0],))
```

-The loss has decreased a little, meaning that our prediction `x` is closer to the target `y`. If we have some data we can already try [training the model](../training/training.md).
+The first entry is `∂f/∂x` as before, but the second entry is more interesting.
+For `poly2`, we get `∂f/∂θ` as `grad2[2]` directly.
+It is a vector, because `θ` is a vector, and has elements `[∂f/∂θ[1], ∂f/∂θ[2], ∂f/∂θ[3]]`.

-All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can *look* very different – they might have millions of parameters or complex control flow. Let's see how Flux handles more complex models.
+For `poly3s`, however, we get a `NamedTuple` whose fields correspond to those of the struct `Poly3`.
+This is called a *structural gradient*. And the nice thing about them is that they work for
+arbitrarily complicated structures, for instance:

-## Building Layers
+```jldoctest poly
+julia> grad4 = gradient(|>, 5, poly4)
+(1.0, (outer = (θ3 = [1.0, 17.5, 306.25],), inner = (θ3 = [0.5, 2.5, 12.5],)))
+```
+
+Here `grad4[2].inner.θ3` corresponds to `poly4.inner.θ3`.
+These matching nested structures are at the core of how Flux works.
+
+(An aside: why don't we just write `gradient(poly4, 5)`? It's just a convention that `gradient(f,x,y)`
+returns only `∂f(x,y)/∂x, ∂f(x,y)/∂y`, and not a derivative of the output with respect to the function itself.
+For ordinary pure functions like `(x,y) -> (x*y)`, this `∂f(x,y)/∂f` would always be zero.)
+
+!!! note "Implicit gradients"
+    Earlier versions of Flux used a different way to relate parameters and gradients,
+    which looks like this:
+    ```jldoctest poly
+    g1 = gradient(() -> poly1(5), Params([θ]))
+    g1[θ] == [1.0, 5.0, 25.0]
+    ```
+    Here `Params` is a set of references to global variables using `objectid`,
+    and `g1 isa Grads` is a dictionary from these to their gradients.
+    This method of `gradient` takes a zero-argument function, which only *implicitly*
+    depends on `θ`.
+
+```@raw html

+<h3><a href="https://github.com/FluxML/Zygote.jl">Zygote.jl</a></h3>

+```

-It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) in between them. We could write this as:
+Flux's [`gradient`](@ref) function by default calls a companion package called [Zygote](https://github.com/FluxML/Zygote.jl).
+Zygote performs source-to-source automatic differentiation, meaning that `gradient(f, x)`
+hooks into Julia's compiler to find out what operations `f` contains, and transforms this
+to produce code for computing `∂f/∂x`.
+
+Zygote can in principle differentiate almost any Julia code. However, it's not perfect,
+and you may eventually want to read its [page about limitations](https://fluxml.ai/Zygote.jl/dev/limitations/).
+In particular, a major limitation is that mutating an array is not allowed.
+
+Flux can also be used with other automatic differentiation (AD) packages.
+It was originally written using [Tracker](https://github.com/FluxML/Tracker.jl), a more traditional operator-overloading approach.
+The future might be [Enzyme](https://github.com/EnzymeAD/Enzyme.jl), and Flux now builds in an easy way to use this instead:

```julia
-using Flux
+julia> using Enzyme

-W1 = rand(3, 5)
-b1 = rand(3)
-layer1(x) = W1 * x .+ b1
+julia> grad3e = Flux.gradient((x,p) -> p(x), Const(5.0), Duplicated(poly3s))
+(nothing, (θ3 = [1.0, 5.0, 25.0],))
+```

-W2 = rand(2, 3)
-b2 = rand(2)
-layer2(x) = W2 * x .+ b2
+`Flux.gradient` follows Zygote's convention that arguments with no derivative are marked `nothing`.
+Here, this is because `Const(5.0)` is explicitly constant.
+Below, we will see an example where `nothing` shows up because the model struct has fields containing things other than parameters, such as an activation function.

-model(x) = layer2(sigmoid.(layer1(x)))
+## Neural Networks

-model(rand(5)) # => 2-element vector
+The polynomial functions above send a number `x` to another number `y`.
+Neural networks typically take a vector of numbers, mix them all up, and return another vector.
+Here's a very simple one, which will take a vector like `x = [1.0, 2.0, 3.0]`
+and return another vector `y = layer1(x)` with `length(y) == 2`:
+
+```julia
+W = randn(2, 3)
+b = zeros(2)
+layer1(x) = sigmoid.(W*x .+ b)
```

-This works but is fairly unwieldy, with a lot of repetition – especially as we add more layers. One way to factor this out is to create a function that returns linear layers.
+Here `sigmoid` is a nonlinear function, applied element-wise
+because it is called with `.()`, Julia's broadcasting syntax.

```julia
-function linear(in, out)
- W = randn(out, in)
- b = randn(out)
- x -> W * x .+ b
-end
+sigmoid(x::Real) = 1 / (1 + exp(-x))
+```

-linear1 = linear(5, 3) # we can access linear1.W etc
-linear2 = linear(3, 2)
+Like `poly1` above, this `layer1` has as its parameters the global variables `W, b`.
+We can similarly define a version which takes these as arguments (like `poly2`),
+and a version which encapsulates them (like `poly3` above):

-model(x) = linear2(sigmoid.(linear1(x)))
+```julia
+layer2(x, W2, b2) = sigmoid.(W2*x .+ b2) # explicit parameter arguments

-model(rand(5)) # => 2-element vector
+layer3 = let
+    W3 = randn(2,3)
+    b3 = zeros(2)
+    x -> sigmoid.(W3*x .+ b3) # closure over local variables
+end
```

-Another (equivalent) way is to create a struct that explicitly represents the affine layer.
+This third way is precisely a Flux model.
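+As a quick check (this is plain Julia and Zygote, nothing Flux-specific), we can call the
+closure and take a structural gradient, just as we did for the polynomial structs above.
+The gradient is a named tuple whose fields match the captured variables `W3` and `b3`,
+though the exact numbers will vary because the weights are random:
+
+```julia
+x = rand(3)                           # any 3-element input
+layer3(x)                             # 2-element output vector
+
+gradient(m -> sum(m(x)), layer3)[1]   # something like (W3 = [...], b3 = [...])
+```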
+And we can again make a tidier version
+using a `struct` to hold the parameters:

```julia
-struct Affine
- W
- b
+struct Layer # container struct
+    W::Matrix
+    b::Vector
+    act::Function
end

-Affine(in::Integer, out::Integer) =
- Affine(randn(out, in), zeros(out))
-
-# Overload call, so the object can be used as a function
-(m::Affine)(x) = m.W * x .+ m.b
+(d::Layer)(x) = d.act.(d.W*x .+ d.b) # make it callable

-a = Affine(10, 5)
+Layer(in::Int, out::Int, act::Function=sigmoid) =
+    Layer(randn(Float32, out, in), zeros(Float32, out), act)

-a(rand(10)) # => 5-element vector
+layer3s = Layer(3, 2) # instance with its own parameters
```

-Congratulations! You just built the [`Dense`](@ref) layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.
-
-(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, sigmoid)`.)
+The one new thing here is a friendly constructor `Layer(in, out, act)`. This is because we anticipate
+composing several instances of this thing, with different sizes and different random initial parameters.

-## Stacking It Up
+```julia
+x = Float32[0.1, 0.2, 0.3] # input

-It's pretty common to write models that look something like:
+layer3s(x) # output, 2-element Vector{Float32}

-```julia
-layer1 = Dense(10 => 5, relu)
-# ...
-model(x) = layer3(layer2(layer1(x)))
+gradient((x,d) -> d(x)[1], x, layer3s)[2] # NamedTuple{(:W, :b, :act)}
```

-For long chains, it might be a bit more intuitive to have a list of layers, like this:
+This `∂f/∂layer3s` is a named tuple with the same fields as `Layer`.
+Within it, the gradient with respect to `W` is a matrix of seemingly random numbers.
+Notice that there is also an entry for `act`, which is `nothing`,
+as this field of the struct is not a smoothly adjustable parameter.

-```julia
-using Flux
+We can compose these layers just as we did the polynomials above.
+Here's a composition of 3, in which the last step is the function `only`
+which takes a 1-element vector and gives us the number inside:

-layers = [Dense(10 => 5, relu), Dense(5 => 2), softmax]
+```julia
+model1 = only ∘ Layer(20, 1) ∘ Layer(1, 20)

-model(x) = foldl((x, m) -> m(x), layers, init = x)
+model1(Float32[0.1]) # output is a Float32 number

-model(rand(10)) # => 2-element vector
+grad = gradient(|>, [1f0], model1)[2]
```

-Handily, this is also provided for in Flux:
+This gradient is starting to be a complicated nested structure.
+But it works just like before: `grad.inner.W` corresponds to `model1.inner.W`.
+
+### ⋆ [Flux's layers](@ref man-layers)
+
+Rather than define everything from scratch every time, Flux provides a library of
+commonly used layers. The same model could be defined like this:

```julia
-model2 = Chain(
- Dense(10 => 5, relu),
- Dense(5 => 2),
- softmax)
+model2 = Chain(Dense(1 => 20, σ), Dense(20 => 1), only)
+```

-model2(rand(10)) # => 2-element vector
+How does this `model2` differ from the `model1` we had before?
+
+* Flux's [`Chain`](@ref Flux.Chain) works left-to-right, the reverse of Base's `∘`.
+  Its contents are stored in a tuple, thus `model2.layers[1].weight` is an array.
+* Flux's layer [`Dense`](@ref Flux.Dense) has only minor differences, mainly that,
+  like `struct Poly3{T}` above, it has type parameters for its fields, while our `Layer` does not --
+  so the compiler does not know exactly what type `layer3s.W` will be, which costs speed.
+  Its initialisation uses not `randn` but `glorot_uniform` by default.
+  It also produces more friendly errors on wrong-size input, and has some performance tricks.
+* The function [`σ`](@ref NNlib.sigmoid) is calculated in a slightly better way,
+  and has a rule telling Zygote how to differentiate it efficiently.
+* Flux overloads `Base.show` to give pretty printing at the REPL prompt.
+
+If what you need isn't covered by Flux's built-in layers, it's easy to write your own.
+There are more details later, but the steps are invariably those shown for `struct Layer` above:
+1. Define a `struct` which will hold the parameters.
+2. Make it callable, to define how it uses them to transform the input `x`.
+3. Define a constructor which initialises the parameters.
+4. Annotate with `@layer` to opt-in to pretty printing, and other enhancements.
+
+```@raw html

+<h3><a href="https://github.com/FluxML/Functors.jl">Functors.jl</a></h3>

``` -This quickly starts to look like a high-level deep learning library; yet you can see how it falls out of simple abstractions, and we lose none of the power of Julia code. +To deal with such nested structures, Flux relies heavily on an associated package +called Functors. Its basic function is [`fmap`](@ref Functors.fmap), +which generalises `map(f, x)` to work on almost anything. -A nice property of this approach is that because "models" are just functions (possibly with trainable parameters), you can also see this as simple function composition. +For example, this is how [gpu](@ref Flux.gpu) moves all arrays within a model to the GPU, +reconstructing another `only ∘ Layer(...) ∘ Layer(...)` around the new `CuArray`s: ```julia -m = Dense(5 => 2) ∘ Dense(10 => 5, σ) - -m(rand(10)) +using CUDA +fmap(cu, model) ``` -Likewise, `Chain` will happily work with any Julia function. +And this is a very simple gradient update of the parameters: ```julia -m = Chain(x -> x^2, x -> x+1) - -m(5) # => 26 +fmap((x, dx) -> x isa Array ? (x - dx/100) : x, model, grad) ``` -## Layer Helpers +Before Flux v0.15 (and Functors v0.5), this exploration of structs was opt-in. +After defining `struct Layer` it was necessary to call `@functor Layer` (or `@layer Layer`) before Flux would look inside. +This has now changed to be opt-out: Functors (and hence Flux) will explore arbitrary structs, unless told not to (using `Functors.@leaf`). +This is why even "anonymous structs" created by closures like `poly3` and `layer3` above are now valid Flux models, although the use of named structs is still recommended practice. -We can give our layer some additional functionality, like nice printing, using the [`@layer`](@ref Flux.@layer) macro: +## Curve Fitting + +Above we took gradients of the output, or sometimes to the first element +of the output -- it must be a number, not a vector. Adjusting the parameters +to make this smaller won't lead us anywhere interesting. Instead, we should minimise +some *loss function* which compares the actual output to our desired output. + +Perhaps the simplest example is curve fitting. Given a function like `f(x) = 2x - x^3` +evaluated at some points `x in -2:0.1:2`, we can aim to adjust the parameters +of the two-layer `model` from above so that its output is similar to the truth. +Here's how this might look: ```julia -Flux.@layer Affine +data = [([x], 2x-x^3) for x in -2:0.1f0:2] # training points (x, y) + +for _ in 1:1000 # adjust parameters to minimise the error: + Flux.train!((m,x,y) -> (m(x) - y)^2, model, data, Descent(0.01)) +end ``` -Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. 
-We can easily add these refinements to the `Affine` layer as follows, using the helper function [`create_bias`](@ref Flux.create_bias):

-```julia
-function Affine((in, out)::Pair; bias=true, init=glorot_uniform)
- W = init(out, in)
- b = Flux.create_bias(W, bias, out)
- return Affine(W, b)
+Here, `Flux.train!` is a loop which iterates over data, like this:
+```julia
+for xy in data
+    grads = gradient(m -> (m(xy[1]) - xy[2])^2, model1) # gradient of the loss at this datapoint
+    fmap(model1, grads[1]) do x, dx
+        if x isa Array
+            x .= x .- 0.01 * dx # mutates the parameters
+        else
+            x
+        end
+    end
end
+```
+
+And here's how to plot the desired and actual outputs:

-Affine(3 => 1, bias=false) |> gpu
+```julia
+using Plots
+plot(x -> 2x-x^3, -2, 2, legend=false)
+scatter!(-2:0.1:2, [model1([x]) for x in -2:0.1:2])
```

+If this general idea is unfamiliar, you may want to read the [tutorial on linear regression](#ref ???).
+
+More detail about what exactly the function `train!` is doing, and how to use rules other than simple [`Descent`](@ref Optimisers.Descent), is what the next page in this guide is about.
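+As a small taste of that page, the `train!` call above is roughly equivalent to the following
+sketch, written with `Flux.setup` and `Flux.update!` (one possible way among several, shown here
+only for orientation):
+
+```julia
+opt_state = Flux.setup(Descent(0.01), model1)     # optimiser state mirroring the model's structure
+
+for (x, y) in data
+    grads = Flux.gradient(m -> (m(x) - y)^2, model1)
+    Flux.update!(opt_state, model1, grads[1])     # walks model & gradient together, updating in place
+end
+```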