From 6e8b7a8548c8b118e90d3bef15787396f6a44f14 Mon Sep 17 00:00:00 2001 From: Michael Abbott <32575566+mcabbott@users.noreply.github.com> Date: Tue, 3 Dec 2024 01:23:21 -0500 Subject: [PATCH] re-write of basics page --- docs/src/guide/models/basics.md | 416 +++++++++++++++++++++----------- 1 file changed, 275 insertions(+), 141 deletions(-) diff --git a/docs/src/guide/models/basics.md b/docs/src/guide/models/basics.md index 3bb4358afe..2e3c2b0e9a 100644 --- a/docs/src/guide/models/basics.md +++ b/docs/src/guide/models/basics.md @@ -1,246 +1,380 @@ -# [How Flux Works: Gradients and Layers](@id man-basics) +# [How Flux Works: Parameters, Gradients, and Layers](@id man-basics) -## [Taking Gradients](@id man-taking-gradients) +A neural network is a function with *parameters*. +That is, it takes some input `x` and gives you some output `y`, +whose value also depends on some other numbers `θ`. -Flux's core feature is taking gradients of Julia code. The `gradient` function takes another Julia function `f` and a set of arguments, and returns the gradient with respect to each argument. (It's a good idea to try pasting these examples in the Julia terminal.) +A sufficiently flexible function can, by adjusting the parameters just right, +be made to do many things. And the one magic trick for adjusting parameters +is to follow a *gradient*. -```jldoctest basics -julia> using Flux +This page describes Flux's take on how to construct such flexible functions +containing many parameters, and how to handle their gradients. -julia> f(x) = 3x^2 + 2x + 1; +## Parameterised Functions -julia> df(x) = gradient(f, x)[1]; # df/dx = 6x + 2 +Let's start with very simple functions. This is a polynomial in `x::Real`, +returning another real number `y` which depends on some coefficients stored in a vector: -julia> df(2) -14.0 +```jldoctest poly +θ = [10, 1, 0.1] -julia> d2f(x) = gradient(df, x)[1]; # d²f/dx² = 6 +poly1(x::Real) = θ[1] + θ[2]*x + θ[3]*x^2 -julia> d2f(2) -6.0 +poly1(5) == 17.5 # true ``` -When a function has many parameters, we can get gradients of each one at the same time: +Here the parameters are a global variable `θ`. They could be handled in other ways, +for instance by explicitly passing them as an additional argument to the function: -```jldoctest basics -julia> f(x, y) = sum((x .- y).^2); +```jldoctest poly +poly2(x::Real, θ2) = evalpoly(x, θ2) -julia> gradient(f, [2, 1], [2, 0]) -([0.0, 2.0], [-0.0, -2.0]) +poly2(5, θ) == 17.5 # true ``` -These gradients are based on `x` and `y`. Flux works by instead taking gradients based on the weights and biases that make up the parameters of a model. +Flux chooses a third path, by *encapsulating* the parameters within the function. +The simplest way to do this is a *closure*, an anonymous function which Julia knows +to depend on some local variable `θ3`: + +```jldoctest poly +poly3 = let θ3 = [10, 1, 0.1] + x -> evalpoly(x, θ3) +end -Machine learning often can have *hundreds* of parameter arrays. -Instead of passing them to `gradient` individually, we can store them together in a structure. -The simplest example is a named tuple, created by the following syntax: +poly3(5) == 17.5 # true +``` -```jldoctest basics -julia> nt = (a = [2, 1], b = [2, 0], c = tanh); +An equivalent, but tidier, way is to construct a `struct` in which to store the parameters. 
+Any struct can be made callable, allowing its instances to act just like a function:

-julia> g(x::NamedTuple) = sum(abs2, x.a .- x.b);
+```jldoctest poly
+struct Poly3{T} # container struct
+    θ3::T
+end
+(p::Poly3)(x::Real) = evalpoly(x, p.θ3) # make this callable
+
-julia> g(nt)
-1
+poly3s = Poly3([10, 1, 0.1]) # construct an instance
+
-julia> dg_nt = gradient(g, nt)[1]
-(a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing)
+poly3s(5) == 17.5 # true
```

-Notice that `gradient` has returned a matching structure. The field `dg_nt.a` is the gradient
-for `nt.a`, and so on. Some fields have no gradient, indicated by `nothing`.
+Internally, there is little difference between a closure and a struct.
+They have the same fields, and equivalent methods:

-Rather than define a function like `g` every time (and think up a name for it),
-it is often useful to use anonymous functions: this one is `x -> sum(abs2, x.a .- x.b)`.
-Anonymous functions can be defined either with `->` or with `do`,
-and such `do` blocks are often useful if you have a few steps to perform:
+```jldoctest poly
+poly3s.θ3 == poly3.θ3 == θ # both have a field called :θ3
+dump(poly3) # contains θ3: Array
+dump(poly3s)
+methods(poly3)
+methods(poly3s) # each has 1 method, taking x::Real
+```

-```jldoctest basics
-julia> gradient((x, y) -> sum(abs2, x.a ./ y .- x.b), nt, [1, 2])
-((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])
+The virtue of encapsulation is that it makes composition very easy.
+We can make more complicated functions by combining simple ones,
+and each will keep track of its own parameters.
+Julia writes function composition as `∘`, for instance `(inv ∘ sin)(pi/6) ≈ 2`,
+and we can use exactly this for our parameterised polynomials:

-julia> gradient(nt, [1, 2]) do x, y
- z = x.a ./ y
- sum(abs2, z .- x.b)
- end
-((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])
-```
+```jldoctest poly
+poly4 = Poly3([1, 0.5, 0]) ∘ Poly3([10, 1, 0.1])
+
-Sometimes you may want to know the value of the function, as well as its gradient.
-Rather than calling the function a second time, you can call [`withgradient`](@ref Zygote.withgradient) instead:
+poly4 isa ComposedFunction # ∘ creates another struct...
+poly4.inner.θ3 == θ # which has fields :inner & :outer
+
-```
-julia> Flux.withgradient(g, nt)
-(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))
+poly4(5) == 9.75 # true
```

-## Building Simple Models
+Flux models are precisely made by such function composition.
+In fact, `poly3` and `poly4` are already valid Flux models.

-Consider a simple linear regression, which tries to predict an output array `y` from an input `x`.
-```julia
-predict(W, b, x) = W*x .+ b
+## [Taking Gradients](@id man-taking-gradients)

-function loss(W, b, x, y)
- ŷ = predict(W, b, x)
- sum((y .- ŷ).^2)
-end
+The derivative of a scalar function is its slope: how fast the output changes as the input is changed slightly.
+This may be found approximately by evaluating at two nearby points, and exactly by taking the limit in
+which the distance between them approaches zero:

-x, y = rand(5), rand(2) # Dummy data
-W = rand(2, 5)
-b = rand(2)
+```jldoctest poly
+julia> (poly1(5 + 0.1) - poly1(5)) / 0.1
+2.010000000000005
+
-loss(W, b, x, y) # ~ 3
+julia> (poly1(5 + 0.001) - poly1(5)) / 0.001 # answer is getting close to 2
+2.000100000003613
```

-To improve the prediction we can take the gradients of the loss with respect to `W` and `b` and perform gradient descent.
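+For this simple polynomial we can also differentiate by hand: `∂f/∂x = θ[2] + 2θ[3]*x`,
+which at `x = 5` is exactly 2. (The helper `dpoly1` below is only for illustration,
+it is not part of Flux.)
+
+```julia
+dpoly1(x) = θ[2] + 2*θ[3]*x   # derivative of poly1, written out by hand
+
+dpoly1(5) == 2   # true
+```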
+Flux's `gradient(f, x)` works this out for `f(x)`, and gives exactly `∂f/∂x = 2.0` here:

-```julia
-using Flux
+```jldoctest poly
+julia> gradient(poly1, 5)
+(2.0,)
+```
+
+The reason `gradient` returns a tuple, not just the number `2.0`, is to allow for
+functions taking several arguments. (That's also why it's not called "derivative".)
+For instance, this returns `∂f/∂x, ∂f/∂y, ∂f/∂z`:

-dW, db = gradient((W, b) -> loss(W, b, x, y), W, b)
+```jldoctest poly
+julia> gradient((x,y,z) -> (x*y)+z, 30, 40, 50)
+(40.0, 30.0, 1.0)
```

-Now that we have gradients, we can pull them out and update `W` to train the model.
+For our parameterised polynomial, we have `∂f/∂x` but we are really more interested
+in `∂f/∂θ`, as this will tell us how the parameters affect the answer.
+It is possible to track gradients with respect to global `θ` (see below),
+but clearer to track explicit arguments. Here's how this works for `poly2` (which takes `θ` as a 2nd argument)
+and `poly3s` (which encapsulates `θ`):

-```julia
-W .-= 0.1 .* dW
+```jldoctest poly
+julia> grad2 = gradient(poly2, 5, θ)
+(2.0, [1.0, 5.0, 25.0])
+
-loss(W, b, x, y) # ~ 2.5
+julia> grad3 = gradient((x,p) -> p(x), 5, poly3s)
+(2.0, (θ3 = [1.0, 5.0, 25.0],))
```

-The loss has decreased a little, meaning that our prediction `x` is closer to the target `y`. If we have some data we can already try [training the model](../training/training.md).
+The first entry is `∂f/∂x` as before, but the second entry is more interesting.
+For `poly2`, we get `∂f/∂θ` as `grad2[2]` directly.
+It is a vector, because `θ` is a vector, and has elements `[∂f/∂θ[1], ∂f/∂θ[2], ∂f/∂θ[3]]`.

-All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can *look* very different – they might have millions of parameters or complex control flow. Let's see how Flux handles more complex models.
+For `poly3s`, however, we get a `NamedTuple` whose fields correspond to those of the struct `Poly3`.
+This is called a *structural gradient*. And the nice thing about them is that they work for
+arbitrarily complicated structures, for instance:

-## Building Layers
+```jldoctest poly
+julia> grad4 = gradient(|>, 5, poly4)
+(1.0, (outer = (θ3 = [1.0, 17.5, 306.25],), inner = (θ3 = [0.5, 2.5, 12.5],)))
+```
+
+Here `grad4[2].inner.θ3` corresponds to `poly4.inner.θ3`.
+These matching nested structures are at the core of how Flux works.
+
+(An aside: why don't we just write `gradient(poly4, 5)`? It's just a convention that `gradient(f,x,y)`
+returns only `∂f(x,y)/∂x, ∂f(x,y)/∂y`, and not a derivative of the output with respect to the function itself.
+For ordinary pure functions like `(x,y) -> (x*y)`, this `∂f(x,y)/∂f` would always be zero.)
+
+!!! note "Implicit gradients"
+    Earlier versions of Flux used a different way to relate parameters and gradients,
+    which looks like this:
+    ```jldoctest poly
+    g1 = gradient(() -> poly1(5), Params([θ]))
+    g1[θ] == [1.0, 5.0, 25.0]
+    ```
+    Here `Params` is a set of references to global variables using `objectid`,
+    and `g1 isa Grads` is a dictionary from these to their gradients.
+    This method of `gradient` takes a zero-argument function, which only *implicitly*
+    depends on `θ`.
+
+```@raw html

+<h3><a href="https://github.com/FluxML/Zygote.jl">Zygote.jl</a></h3>

+```

-It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) in between them. We could write this as:
+Flux's [`gradient`](@ref) function by default calls a companion package called [Zygote](https://github.com/FluxML/Zygote.jl).
+Zygote performs source-to-source automatic differentiation, meaning that `gradient(f, x)`
+hooks into Julia's compiler to find out what operations `f` contains, and transforms this
+to produce code for computing `∂f/∂x`.
+
+Zygote can in principle differentiate almost any Julia code. However, it's not perfect,
+and you may eventually want to read its [page about limitations](https://fluxml.ai/Zygote.jl/dev/limitations/).
+In particular, a major limitation is that mutating an array is not allowed.
+
+Flux can also be used with other automatic differentiation (AD) packages.
+It was originally written using [Tracker](https://github.com/FluxML/Tracker.jl), a more traditional operator-overloading approach.
+The future might be [Enzyme](https://github.com/EnzymeAD/Enzyme.jl), and Flux now builds in an easy way to use this instead:

```julia
-using Flux
+julia> using Enzyme

-W1 = rand(3, 5)
-b1 = rand(3)
-layer1(x) = W1 * x .+ b1
+julia> grad3e = Flux.gradient((x,p) -> p(x), Const(5.0), Duplicated(poly3s))
+(nothing, (θ3 = [1.0, 5.0, 25.0],))
+```

-W2 = rand(2, 3)
-b2 = rand(2)
-layer2(x) = W2 * x .+ b2
+`Flux.gradient` follows Zygote's convention that arguments with no derivative are marked `nothing`.
+Here, this is because `Const(5.0)` is explicitly constant.
+Below, we will see an example where `nothing` shows up because the model struct has fields containing things other than parameters, such as an activation function.

-model(x) = layer2(sigmoid.(layer1(x)))
+## Neural Networks

-model(rand(5)) # => 2-element vector
+The polynomial functions above send a number `x` to another number `y`.
+Neural networks typically take a vector of numbers, mix them all up, and return another vector.
+Here's a very simple one, which will take a vector like `x = [1.0, 2.0, 3.0]`
+and return another vector `y = layer1(x)` with `length(y) == 2`:
+
+```julia
+W = randn(2, 3)
+b = zeros(2)
+layer1(x) = sigmoid.(W*x .+ b)
```

-This works but is fairly unwieldy, with a lot of repetition – especially as we add more layers. One way to factor this out is to create a function that returns linear layers.
+Here `sigmoid` is a nonlinear function, applied element-wise
+because it is called with `.()`, Julia's broadcasting syntax.

```julia
-function linear(in, out)
- W = randn(out, in)
- b = randn(out)
- x -> W * x .+ b
-end
+sigmoid(x::Real) = 1 / (1 + exp(-x))
+```

-linear1 = linear(5, 3) # we can access linear1.W etc
-linear2 = linear(3, 2)
+Like `poly1` above, this `layer1` has as its parameters the global variables `W, b`.
+We can similarly define a version which takes these as arguments (like `poly2`),
+and a version which encapsulates them (like `poly3` above):

-model(x) = linear2(sigmoid.(linear1(x)))
+```julia
+layer2(x, W2, b2) = sigmoid.(W2*x .+ b2) # explicit parameter arguments

-model(rand(5)) # => 2-element vector
+layer3 = let
+    W3 = randn(2,3)
+    b3 = zeros(2)
+    x -> sigmoid.(W3*x .+ b3) # closure over local variables
+end
```

-Another (equivalent) way is to create a struct that explicitly represents the affine layer.
+This third way is precisely a Flux model.
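+As a quick check (this is plain Julia and Zygote, nothing Flux-specific), we can call the
+closure and take a structural gradient, just as we did for the polynomial structs above.
+The gradient is a named tuple whose fields match the captured variables `W3` and `b3`,
+though the exact numbers will vary because the weights are random:
+
+```julia
+x = rand(3)                           # any 3-element input
+layer3(x)                             # 2-element output vector
+
+gradient(m -> sum(m(x)), layer3)[1]   # something like (W3 = [...], b3 = [...])
+```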
+And we can again make a tidier version
+using a `struct` to hold the parameters:

```julia
-struct Affine
- W
- b
+struct Layer # container struct
+    W::Matrix
+    b::Vector
+    act::Function
end

-Affine(in::Integer, out::Integer) =
- Affine(randn(out, in), zeros(out))
-
-# Overload call, so the object can be used as a function
-(m::Affine)(x) = m.W * x .+ m.b
+(d::Layer)(x) = d.act.(d.W*x .+ d.b) # make it callable

-a = Affine(10, 5)
+Layer(in::Int, out::Int, act::Function=sigmoid) =
+    Layer(randn(Float32, out, in), zeros(Float32, out), act)

-a(rand(10)) # => 5-element vector
+layer3s = Layer(3, 2) # instance with its own parameters
```

-Congratulations! You just built the [`Dense`](@ref) layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.
-
-(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, sigmoid)`.)
+The one new thing here is a friendly constructor `Layer(in, out, act)`. This is because we anticipate
+composing several instances of this thing, with different sizes and different random initial parameters.

-## Stacking It Up
+```julia
+x = Float32[0.1, 0.2, 0.3] # input

-It's pretty common to write models that look something like:
+layer3s(x) # output, 2-element Vector{Float32}

-```julia
-layer1 = Dense(10 => 5, relu)
-# ...
-model(x) = layer3(layer2(layer1(x)))
+gradient((x,d) -> d(x)[1], x, layer3s)[2] # NamedTuple{(:W, :b, :act)}
```

-For long chains, it might be a bit more intuitive to have a list of layers, like this:
+This `∂f/∂layer3s` is a named tuple with the same fields as `Layer`.
+Within it, the gradient with respect to `W` is a matrix of seemingly random numbers.
+Notice that there is also an entry for `act`, which is `nothing`,
+as this field of the struct is not a smoothly adjustable parameter.

-```julia
-using Flux
+We can compose these layers just as we did the polynomials above.
+Here's a composition of 3, in which the last step is the function `only`
+which takes a 1-element vector and gives us the number inside:

-layers = [Dense(10 => 5, relu), Dense(5 => 2), softmax]
+```julia
+model1 = only ∘ Layer(20, 1) ∘ Layer(1, 20)

-model(x) = foldl((x, m) -> m(x), layers, init = x)
+model1(Float32[0.1]) # output is a Float32 number

-model(rand(10)) # => 2-element vector
+grad = gradient(|>, [1f0], model1)[2]
```

-Handily, this is also provided for in Flux:
+This gradient is starting to be a complicated nested structure.
+But it works just like before: `grad.inner.W` corresponds to `model1.inner.W`.
+
+### ⋆ [Flux's layers](@ref man-layers)
+
+Rather than define everything from scratch every time, Flux provides a library of
+commonly used layers. The same model could be defined like this:

```julia
-model2 = Chain(
- Dense(10 => 5, relu),
- Dense(5 => 2),
- softmax)
+model2 = Chain(Dense(1 => 20, σ), Dense(20 => 1), only)
+```

-model2(rand(10)) # => 2-element vector
+How does this `model2` differ from the `model1` we had before?
+
+* Flux's [`Chain`](@ref Flux.Chain) works left-to-right, the reverse of Base's `∘`.
+  Its contents are stored in a tuple, thus `model2.layers[1].weight` is an array.
+* Flux's layer [`Dense`](@ref Flux.Dense) has only minor differences, mainly that,
+  like `struct Poly3{T}` above, it has type parameters for its fields, while our `Layer` does not --
+  so the compiler does not know exactly what type `layer3s.W` will be, which costs speed.
+  Its initialisation uses not `randn` but `glorot_uniform` by default.
+  It also produces more friendly errors on wrong-size input, and has some performance tricks.
+* The function [`σ`](@ref NNlib.sigmoid) is calculated in a slightly better way,
+  and has a rule telling Zygote how to differentiate it efficiently.
+* Flux overloads `Base.show` to give pretty printing at the REPL prompt.
+
+If what you need isn't covered by Flux's built-in layers, it's easy to write your own.
+There are more details later, but the steps are invariably those shown for `struct Layer` above:
+1. Define a `struct` which will hold the parameters.
+2. Make it callable, to define how it uses them to transform the input `x`.
+3. Define a constructor which initialises the parameters.
+4. Annotate with `@layer` to opt-in to pretty printing, and other enhancements.
+
+```@raw html

+<h3><a href="https://github.com/FluxML/Functors.jl">Functors.jl</a></h3>

``` -This quickly starts to look like a high-level deep learning library; yet you can see how it falls out of simple abstractions, and we lose none of the power of Julia code. +To deal with such nested structures, Flux relies heavily on an associated package +called Functors. Its basic function is [`fmap`](@ref Functors.fmap), +which generalises `map(f, x)` to work on almost anything. -A nice property of this approach is that because "models" are just functions (possibly with trainable parameters), you can also see this as simple function composition. +For example, this is how [gpu](@ref Flux.gpu) moves all arrays within a model to the GPU, +reconstructing another `only ∘ Layer(...) ∘ Layer(...)` around the new `CuArray`s: ```julia -m = Dense(5 => 2) ∘ Dense(10 => 5, σ) - -m(rand(10)) +using CUDA +fmap(cu, model) ``` -Likewise, `Chain` will happily work with any Julia function. +And this is a very simple gradient update of the parameters: ```julia -m = Chain(x -> x^2, x -> x+1) - -m(5) # => 26 +fmap((x, dx) -> x isa Array ? (x - dx/100) : x, model, grad) ``` -## Layer Helpers +Before Flux v0.15 (and Functors v0.5), this exploration of structs was opt-in. +After defining `struct Layer` it was necessary to call `@functor Layer` (or `@layer Layer`) before Flux would look inside. +This has now changed to be opt-out: Functors (and hence Flux) will explore arbitrary structs, unless told not to (using `Functors.@leaf`). +This is why even "anonymous structs" created by closures like `poly3` and `layer3` above are now valid Flux models, although the use of named structs is still recommended practice. -We can give our layer some additional functionality, like nice printing, using the [`@layer`](@ref Flux.@layer) macro: +## Curve Fitting + +Above we took gradients of the output, or sometimes to the first element +of the output -- it must be a number, not a vector. Adjusting the parameters +to make this smaller won't lead us anywhere interesting. Instead, we should minimise +some *loss function* which compares the actual output to our desired output. + +Perhaps the simplest example is curve fitting. Given a function like `f(x) = 2x - x^3` +evaluated at some points `x in -2:0.1:2`, we can aim to adjust the parameters +of the two-layer `model` from above so that its output is similar to the truth. +Here's how this might look: ```julia -Flux.@layer Affine +data = [([x], 2x-x^3) for x in -2:0.1f0:2] # training points (x, y) + +for _ in 1:1000 # adjust parameters to minimise the error: + Flux.train!((m,x,y) -> (m(x) - y)^2, model, data, Descent(0.01)) +end ``` -Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. 
-We can easily add these refinements to the `Affine` layer as follows, using the helper function [`create_bias`](@ref Flux.create_bias):

-```julia
-function Affine((in, out)::Pair; bias=true, init=glorot_uniform)
- W = init(out, in)
- b = Flux.create_bias(W, bias, out)
- return Affine(W, b)
+Here, `Flux.train!` is a loop which iterates over data, like this:
+```julia
+for xy in data
+    grads = gradient(m -> (m(xy[1]) - xy[2])^2, model1) # gradient of the loss at this datapoint
+    fmap(model1, grads[1]) do x, dx
+        if x isa Array
+            x .= x .- 0.01 * dx # mutates the parameters
+        else
+            x
+        end
+    end
end
+```
+
+And here's how to plot the desired and actual outputs:

-Affine(3 => 1, bias=false) |> gpu
+```julia
+using Plots
+plot(x -> 2x-x^3, -2, 2, legend=false)
+scatter!(-2:0.1:2, [model1([x]) for x in -2:0.1:2])
```

+If this general idea is unfamiliar, you may want to read the [tutorial on linear regression](#ref ???).
+
+More detail about what exactly the function `train!` is doing, and how to use rules other than simple [`Descent`](@ref Optimisers.Descent), is what the next page in this guide is about.
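+As a small taste of that page, the `train!` call above is roughly equivalent to the following
+sketch, written with `Flux.setup` and `Flux.update!` (one possible way among several, shown here
+only for orientation):
+
+```julia
+opt_state = Flux.setup(Descent(0.01), model1)     # optimiser state mirroring the model's structure
+
+for (x, y) in data
+    grads = Flux.gradient(m -> (m(x) - y)^2, model1)
+    Flux.update!(opt_state, model1, grads[1])     # walks model & gradient together, updating in place
+end
+```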