
more work
CarloLucibello committed Apr 7, 2024
1 parent 0a34a8a commit 2fb1716
Showing 20 changed files with 135 additions and 282 deletions.
14 changes: 0 additions & 14 deletions docs/src/models/quickstart.md
@@ -101,17 +101,3 @@ for epoch in 1:1_000
end
end
```

!!! compat "Implicit-style training, Flux ≤ 0.14"
Until recently Flux's training worked a bit differently.
Any code which looks like
```
gradient(() -> loss(model, x, y), Flux.params(model))
```
(gradient of a zero-argument function) or
```
train!((x,y) -> loss(model, x, y), Flux.params(model), loader, opt)
```
(with `Flux.params`) is in the old "implicit" style.
This still works on Flux 0.14, but will be removed from Flux 0.15.
See the [training section](@ref man-training) for more details.
47 changes: 5 additions & 42 deletions docs/src/training/training.md
@@ -64,16 +64,6 @@ in order for the influence of the model's parameters to be observed by Zygote.
It is also important that every `update!` step receives a newly computed gradient,
as it will change whenever the model's parameters are changed, and for each new data point.
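
For concreteness, one explicit-style step looks something like this (a sketch, assuming `model`, `input`, `label`, a two-argument `loss`, and `opt_state = Flux.setup(...)` are already defined):

```julia
# A fresh gradient with respect to the model's current parameters...
grad = Flux.gradient(m -> loss(m(input), label), model)[1]

# ...used immediately: `update!` mutates both the model and the optimiser state.
Flux.update!(opt_state, model, grad)
```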

!!! compat "Implicit gradients"
Flux ≤ 0.14 used Zygote's "implicit" mode, in which `gradient` takes a zero-argument function.
It looks like this:
```
pars = Flux.params(model)
grad = gradient(() -> loss(model(input), label), pars)
```
Here `pars::Params` and `grad::Grads` are two dictionary-like structures.
Support for this will be removed from Flux 0.15, and these blue (teal?) boxes
explain what needs to change.

## Loss Functions

@@ -208,15 +198,6 @@ end
Or explicitly writing the anonymous function which this `do` block creates,
`train!((m,x,y) -> loss(m(x),y), model, train_set, opt_state)` is exactly equivalent.
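
For reference, here are both spellings side by side (a sketch, assuming `model`, `train_set`, `opt_state = Flux.setup(...)`, and a two-argument `loss` already exist):

```julia
# `do` block form: the block becomes the function passed as the 1st argument.
Flux.train!(model, train_set, opt_state) do m, x, y
    loss(m(x), y)
end

# The same call with the anonymous function written out explicitly.
Flux.train!((m, x, y) -> loss(m(x), y), model, train_set, opt_state)
```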

!!! compat "Implicit-style `train!`"
This is a new method of `train!`, which takes the result of `setup` as its 4th argument.
The 1st argument is a function which accepts the model itself.
Flux versions ≤ 0.14 provided a method of `train!` for "implicit" parameters,
which works like this:
```
train!((x,y) -> loss(model(x), y), Flux.params(model), train_set, Adam())
```

Real training loops often need more flexibility, and the best way to achieve this is just
to write the loop. This is ordinary Julia code, without any need to work through some
callback API. Here is an example, in which it may be helpful to note:
@@ -284,21 +265,21 @@ A very simple model could be implemented as follows:
grads = Flux.gradient(densemodel) do m
result = m(input)
penalty = sum(abs2, m.weight)/2 + sum(abs2, m.bias)/2
my_loss(result, label) + 0.42 * penalty
my_loss(result, label) + 0.42f0 * penalty
end
```

Accessing each individual parameter array by hand won't work well for large models.
Instead, we can use [`Flux.params`](@ref) to collect all of them,
Instead, we can use [`Flux.trainables`](@ref Optimisers.trainables) to collect all of them,
and then apply a function to each one, and sum the result:

```julia
pen_l2(x::AbstractArray) = sum(abs2, x)/2

grads = Flux.gradient(model) do m
result = m(input)
penalty = sum(pen_l2, Flux.params(m))
my_loss(result, label) + 0.42 * penalty
penalty = sum(pen_l2, Flux.trainables(m))
my_loss(result, label) + 0.42f0 * penalty
end
```

@@ -317,7 +298,7 @@ decay_opt_state = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)
```

Flux's optimisers are really modifications applied to the gradient before using it to update
the parameters, and `OptimiserChain` applies two such modifications.
the parameters, and [`OptimiserChain`](@ref Optimisers.OptimiserChain) applies two such modifications.
The first, [`WeightDecay`](@ref Flux.WeightDecay) adds `0.42` times the original parameter to the gradient,
matching the gradient of the penalty above (with the same, unrealistically large, constant).
After that, in either case, [`Adam`](@ref Flux.Adam) computes the final update.
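
To make the equivalence concrete, here is a tiny sketch of what happens to one parameter array (illustrative values only, not Flux's literal internals):

```julia
p = randn(Float32, 3)          # one parameter array
g = randn(Float32, 3)          # its gradient from some loss

# WeightDecay(0.42) adds 0.42 .* p to the gradient, matching the gradient
# of the explicit 0.42f0 * penalty term above:
g_decayed = g .+ 0.42f0 .* p

# Adam(0.1) then turns `g_decayed` into the actual step which `update!` applies.
```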
@@ -348,10 +329,6 @@ for epoch in 1:1000
end
```

!!! compat "Flux ≤ 0.14"
With the old "implicit" optimiser, `opt = Adam(0.1)`, the equivalent was to
directly mutate the `Adam` struct, `opt.eta = 0.001`.

Other hyper-parameters can also be adjusted, such as `Flux.adjust!(opt_state, beta = (0.8, 0.99))`.
And such modifications can be applied to just one part of the model.
For instance, this sets a different learning rate for the encoder and the decoder:
@@ -382,21 +359,7 @@ train!(loss, bimodel, data, opt_state)
Flux.thaw!(opt_state)
```

!!! compat "Flux ≤ 0.14"
The earlier "implicit" equivalent was to pass to `gradient` an object referencing only
part of the model, such as `Flux.params(bimodel.layers.enc)`.

While `adjust!` and `freeze!`/`thaw!` make temporary modifications to the optimiser state,
permanently removing some fields of a new layer type from training is usually done
when defining the layer, by calling for example [`@layer`](@ref Flux.@layer)` NewLayer trainable=(weight,)`.
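
For example, a hypothetical layer whose `bias` should never be trained could be declared like this (a sketch; the name `NewLayer` and its fields are illustrative):

```julia
struct NewLayer
    weight
    bias
end

NewLayer(n::Int) = NewLayer(randn(Float32, n, n), zeros(Float32, n))

(l::NewLayer)(x) = l.weight * x .+ l.bias

# Only `weight` is reported by `trainables` and updated during training;
# `bias` is still stored (and moved by `gpu`/`cpu`) but never modified.
Flux.@layer NewLayer trainable=(weight,)
```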

## Implicit or Explicit?

Flux used to handle gradients, training, and optimisation rules quite differently.
The new style described above is called "explicit" by Zygote, and the old style "implicit".
Flux 0.13 and 0.14 are the transitional versions which support both.

The blue-green boxes above describe the changes.
For more details on training in the implicit style, see [Flux 0.13.6 documentation](https://fluxml.ai/Flux.jl/v0.13.6/training/training/).

For details about the two gradient modes, see [Zygote's documentation](https://fluxml.ai/Zygote.jl/dev/#Explicit-and-Implicit-Parameters-1).
16 changes: 0 additions & 16 deletions docs/src/training/zygote.md
@@ -18,22 +18,6 @@ Zygote.hessian_reverse
Zygote.diaghessian
```

## Implicit style (Flux ≤ 0.14)

Flux used to use what Zygote calls "implicit" gradients, [described here](https://fluxml.ai/Zygote.jl/dev/#Explicit-and-Implicit-Parameters-1) in its documentation.
However, support for this will be removed from Flux 0.15.

!!! compat "Training"
The blue-green boxes in the [training section](@ref man-training) describe
the changes needed to upgrade old code from implicit to explicit style.

```@docs
Zygote.gradient(loss, ::Params)
Zygote.Params
Zygote.Grads
Zygote.jacobian(loss, ::Params)
```

## ChainRules

Sometimes it is necessary to exclude some code, or a whole function, from automatic differentiation. This can be done using [ChainRules](https://github.com/JuliaDiff/ChainRules.jl):
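
For instance, a small sketch of the idea (the helper `make_targets` is an illustrative assumption, not from the docs):

```julia
using Flux, ChainRulesCore

# Label preprocessing we never want to differentiate through:
make_targets(y) = Flux.onehotbatch(y, 0:9)

# Mark it non-differentiable; Zygote will treat its output as a constant.
ChainRulesCore.@non_differentiable make_targets(::Any)
```

Alternatively, `ChainRulesCore.ignore_derivatives() do ... end` excludes just a block of code inside a function that is otherwise differentiated.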
71 changes: 25 additions & 46 deletions docs/src/tutorials/2020-09-15-deep-learning-flux.md
@@ -167,26 +167,8 @@ gradient(myloss, W, b, x)

Now we get gradients for each of the inputs `W`, `b` and `x`, which will come in handy when we want to train models.

Because ML models can contain hundreds of parameters, Flux provides a slightly different way of writing `gradient`. We instead mark arrays with `param` to indicate that we want their derivatives. `W` and `b` represent the weight and bias respectively.

```julia
using Flux: params

W = randn(3, 5)
b = zeros(3)
x = rand(5)

y(x) = sum(W * x .+ b)

grads = gradient(()->y(x), params([W, b]))

grads[W], grads[b]
```


We can now grab the gradients of `W` and `b` directly from those parameters.

This comes in handy when working with *layers*. A layer is just a handy container for some parameters. For example, `Dense` does a linear transform for you.
ML models can contain hundreds of parameter arrays, so it is convenient to group them into **layers**.
A layer is just a handy container for some parameters. For example, `Dense` does a linear transform for you.

```julia
using Flux
@@ -196,23 +178,22 @@ m = Dense(10 => 5)
x = rand(Float32, 10)
```

We can easily get the parameters of any layer or model with params with `params`.
We can easily get the parameters of any layer or model with `trainables`.

```julia
params(m)
trainables(m)
```

This makes it very easy to calculate the gradient for all parameters in a network, even if it has many parameters.
It is very easy to calculate the gradient for all parameters in a network, even if it has many parameters.
The function `gradient` is not limited to arrays, but can compute the gradient with respect to generic composite types.

```julia
x = rand(Float32, 10)
m = Chain(Dense(10 => 5, relu), Dense(5 => 2), softmax)
l(x) = sum(Flux.crossentropy(m(x), [0.5, 0.5]))
grads = gradient(params(m)) do
l(x)
end
for p in params(m)
println(grads[p])
model = Chain(Dense(10 => 5, relu), Dense(5 => 2))
loss(model, x) = Flux.logitcrossentropy(model(x), [0.5, 0.5])
grad = gradient(m -> loss(m, x), model)[1]
for (k, p) in trainables(model, path=true)
println("$k => $(getkeypath(grad, k))")
end
```

@@ -221,38 +202,37 @@ You don't have to use layers, but they can be convenient for many simple kinds of
The next step is to update our weights and perform optimisation. As you may know, *Gradient Descent* is a simple algorithm that takes the weights and steps them using a learning rate and the gradients: `weights = weights - learning_rate * gradient`.

```julia
using Flux.Optimise: update!, Descent
η = 0.1
for p in params(m)
update!(p, -η * grads[p])
for (k, p) in trainables(model, path=true)
p .+= -η * getkeypath(grad, k)
end
```

While this is a valid way of updating our weights, it can get more complicated as the algorithms we use get more involved.

Flux comes with a bunch of pre-defined optimisers and makes writing our own really simple. We just give it the learning rate η:
Flux comes with a bunch of pre-defined optimisers and makes writing our own really simple. We just give it the learning rate `η`:

```julia
opt = Descent(0.01)
opt_state = Flux.setup(Descent(η), model)
```

`Training` a network reduces down to iterating on a dataset mulitple times, performing these steps in order. Just for a quick implementation, let’s train a network that learns to predict `0.5` for every input of 10 floats. `Flux` defines the `train!` function to do it for us.
Training a network reduces to iterating over a dataset multiple times, performing these steps in order. Just for a quick implementation, let’s train a network that learns to predict `0.5` for every input of 10 floats. `Flux` defines the `train!` function to do it for us.

```julia
data, labels = rand(10, 100), fill(0.5, 2, 100)
loss(x, y) = sum(Flux.crossentropy(m(x), y))
Flux.train!(loss, params(m), [(data,labels)], opt)
loss(m, x, y) = Flux.logitcrossentropy(m(x), y)
Flux.train!(loss, model, [(data, labels)], opt_state)
```

You don't have to use `train!`. In cases where arbitrary logic might be better suited, you could open up this training loop like so:

```julia
for d in training_set # assuming d looks like (data, labels)
# our super logic
gs = gradient(params(m)) do #m is our model
l = loss(d...)
g = gradient(model) do model
l = loss(model, d...)
end
update!(opt, params(m), gs)
Flux.update!(opt_state, model, g[1])
end
```

Expand All @@ -272,7 +252,7 @@ We will do the following steps in order:

```julia
using Statistics
using Flux, Flux.Optimise
using Flux
using MLDatasets: CIFAR10
using Images.ImageCore
using Flux: onehotbatch, onecold
@@ -323,16 +303,15 @@ m = Chain(
x -> reshape(x, :, size(x, 4)),
Dense(200 => 120),
Dense(120 => 84),
Dense(84 => 10),
softmax) |> gpu
Dense(84 => 10)) |> gpu
```

We will use a crossentropy loss and a Momentum optimiser here. Crossentropy is a good option when working with multiple independent classes. Momentum smooths the optimisation by accumulating past gradients, which helps maintain a bit of adaptivity and keeps us from overshooting our desired destination.

```julia
using Flux: crossentropy, Momentum
using Flux: logitcrossentropy, Momentum

loss(x, y) = sum(crossentropy(m(x), y))
loss(m, x, y) = logitcrossentropy(m(x), y)
opt = Momentum(0.01)
```

5 changes: 1 addition & 4 deletions src/Flux.jl
@@ -15,13 +15,10 @@ import Optimisers: trainable
using Optimisers: update!, trainables
using Random: default_rng
using Zygote, ChainRulesCore
using Zygote: Params, @adjoint, gradient, pullback
using Zygote: @adjoint, gradient, pullback
using Zygote.ForwardDiff: value
export gradient

# Pirate error to catch a common mistake. (Internal function `base` because overloading `update!` is more likely to give ambiguities.)
Optimisers.base(dx::Zygote.Grads) = error("Optimisers.jl cannot be used with Zygote.jl's implicit gradients, `Params` & `Grads`")

export Chain, Dense, Embedding, Maxout, SkipConnection, Parallel, PairwiseFusion,
RNN, LSTM, GRU, GRUv3,
SamePad, Conv, CrossCor, ConvTranspose, DepthwiseConv,
31 changes: 31 additions & 0 deletions src/deprecations.jl
@@ -24,3 +24,34 @@ Train.train!(loss::Function, ps::Zygote.Params, data, opt) = throw(ArgumentError
where `loss_mxy` accepts the model as its first argument.
"""
))


function params!(p::Params, x, seen = IdSet())
# @depwarn "Implicit use of `params` is deprecated. TODO."

if x isa AbstractArray{<:Number} && Functors.isleaf(x)
return push!(p, x)
elseif x in seen
nothing
else
_check_new_macro(x) # complains if you used @functor not @layer
push!(seen, x)
for child in trainable(x)
params!(p, child, seen)
end
end
end

function params(m...)
# @depwarn "Implicit use of `params` is deprecated. TODO."
ps = Params()
params!(ps, m)
return ps
end

# Allows caching of the parameters when params is called within gradient() to fix #2040.
# @non_differentiable params(m...) # https://github.com/FluxML/Flux.jl/pull/2054
# That speeds up implicit use, and silently breaks explicit use.
# From @macroexpand Zygote.@non_differentiable params(m...) and https://github.com/FluxML/Zygote.jl/pull/1248
Zygote._pullback(::Zygote.Context{true}, ::typeof(params), m...) = params(m), _ -> nothing

59 changes: 0 additions & 59 deletions src/functor.jl
@@ -75,65 +75,6 @@ function testmode!(m, mode)
m
end

function params!(p::Params, x, seen = IdSet())
if x isa AbstractArray{<:Number} && Functors.isleaf(x)
return push!(p, x)
elseif x in seen
nothing
else
_check_new_macro(x) # complains if you used @functor not @layer
push!(seen, x)
for child in trainable(x)
params!(p, child, seen)
end
end
end

"""
params(model)
params(layers...)
Given a model or specific layers from a model, create a `Params` object pointing to its trainable parameters.
This can be used with the `gradient` function, see the [training section of the manual](@ref man-training), or as input to the [`Flux.train!`](@ref Flux.train!) function.
The behaviour of `params` on custom types can be customized using [`Functors.@functor`](@ref) or [`Flux.trainable`](@ref).
# Examples
```jldoctest
julia> using Flux: params
julia> params(Chain(Dense(ones(2,3)), softmax)) # unpacks Flux models
Params([[1.0 1.0 1.0; 1.0 1.0 1.0], [0.0, 0.0]])
julia> bn = BatchNorm(2, relu)
BatchNorm(2, relu) # 4 parameters, plus 4 non-trainable
julia> params(bn) # only the trainable parameters
Params([Float32[0.0, 0.0], Float32[1.0, 1.0]])
julia> params([1, 2, 3], [4]) # one or more arrays of numbers
Params([[1, 2, 3], [4]])
julia> params([[1, 2, 3], [4]]) # unpacks array of arrays
Params([[1, 2, 3], [4]])
julia> params(1, [2 2], (alpha=[3,3,3], beta=Ref(4), gamma=sin)) # ignores scalars, unpacks NamedTuples
Params([[2 2], [3, 3, 3]])
```
"""
function params(m...)
ps = Params()
params!(ps, m)
return ps
end

# Allows caching of the parameters when params is called within gradient() to fix #2040.
# @non_differentiable params(m...) # https://github.com/FluxML/Flux.jl/pull/2054
# That speeds up implicit use, and silently breaks explicit use.
# From @macroexpand Zygote.@non_differentiable params(m...) and https://github.com/FluxML/Zygote.jl/pull/1248
Zygote._pullback(::Zygote.Context{true}, ::typeof(params), m...) = params(m), _ -> nothing

struct FluxCPUAdaptor end

# define rules for handling structured arrays