
more work
CarloLucibello committed Apr 7, 2024
1 parent 0a34a8a commit 2fb1716
Showing 20 changed files with 135 additions and 282 deletions.
14 changes: 0 additions & 14 deletions docs/src/models/quickstart.md
@@ -101,17 +101,3 @@ for epoch in 1:1_000
end
end
```

!!! compat "Implicit-style training, Flux ≤ 0.14"
Until recently Flux's training worked a bit differently.
Any code which looks like
```
gradient(() -> loss(model, x, y), Flux.params(model))
```
(gradient of a zero-argument function) or
```
train!((x,y) -> loss(model, x, y), Flux.params(model), loader, opt)
```
(with `Flux.params`) is in the old "implicit" style.
This still works on Flux 0.14, but will be removed from Flux 0.15.
See the [training section](@ref man-training) for more details.
47 changes: 5 additions & 42 deletions docs/src/training/training.md
@@ -64,16 +64,6 @@ in order for the influence of the model's parameters to be observed by Zygote.
It is also important that every `update!` step receives a newly computed gradient,
as it will change whenever the model's parameters are changed, and for each new data point.
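
For concreteness, one explicit-style step looks something like this (a sketch, assuming `model`, `input`, `label`, a two-argument `loss`, and `opt_state = Flux.setup(...)` are already defined):

```julia
# A fresh gradient with respect to the model's current parameters...
grad = Flux.gradient(m -> loss(m(input), label), model)[1]

# ...used immediately: `update!` mutates both the model and the optimiser state.
Flux.update!(opt_state, model, grad)
```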

!!! compat "Implicit gradients"
Flux ≤ 0.14 used Zygote's "implicit" mode, in which `gradient` takes a zero-argument function.
It looks like this:
```
pars = Flux.params(model)
grad = gradient(() -> loss(model(input), label), pars)
```
Here `pars::Params` and `grad::Grads` are two dictionary-like structures.
Support for this will be removed from Flux 0.15, and these blue (teal?) boxes
explain what needs to change.

## Loss Functions

@@ -208,15 +198,6 @@ end
Or explicitly writing the anonymous function which this `do` block creates,
`train!((m,x,y) -> loss(m(x),y), model, train_set, opt_state)` is exactly equivalent.
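
For reference, here are both spellings side by side (a sketch, assuming `model`, `train_set`, `opt_state = Flux.setup(...)`, and a two-argument `loss` already exist):

```julia
# `do` block form: the block becomes the function passed as the 1st argument.
Flux.train!(model, train_set, opt_state) do m, x, y
    loss(m(x), y)
end

# The same call with the anonymous function written out explicitly.
Flux.train!((m, x, y) -> loss(m(x), y), model, train_set, opt_state)
```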

!!! compat "Implicit-style `train!`"
This is a new method of `train!`, which takes the result of `setup` as its 4th argument.
The 1st argument is a function which accepts the model itself.
Flux versions ≤ 0.14 provided a method of `train!` for "implicit" parameters,
which works like this:
```
train!((x,y) -> loss(model(x), y), Flux.params(model), train_set, Adam())
```

Real training loops often need more flexibility, and the best way to achieve this is just
to write the loop. This is ordinary Julia code, without any need to work through some
callback API. Here is an example, in which it may be helpful to note:
@@ -284,21 +265,21 @@ A very simple model could be implemented as follows:
grads = Flux.gradient(densemodel) do m
result = m(input)
penalty = sum(abs2, m.weight)/2 + sum(abs2, m.bias)/2
my_loss(result, label) + 0.42 * penalty
my_loss(result, label) + 0.42f0 * penalty
end
```

Accessing each individual parameter array by hand won't work well for large models.
Instead, we can use [`Flux.params`](@ref) to collect all of them,
Instead, we can use [`Flux.trainables`](@ref Optimisers.trainables) to collect all of them,
and then apply a function to each one, and sum the result:

```julia
pen_l2(x::AbstractArray) = sum(abs2, x)/2

grads = Flux.gradient(model) do m
result = m(input)
penalty = sum(pen_l2, Flux.params(m))
my_loss(result, label) + 0.42 * penalty
penalty = sum(pen_l2, Flux.trainables(m))
my_loss(result, label) + 0.42f0 * penalty
end
```

@@ -317,7 +298,7 @@ decay_opt_state = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)
```

Flux's optimisers are really modifications applied to the gradient before using it to update
the parameters, and `OptimiserChain` applies two such modifications.
the parameters, and [`OptimiserChain`](@ref Optimisers.OptimiserChain) applies two such modifications.
The first, [`WeightDecay`](@ref Flux.WeightDecay) adds `0.42` times the original parameter to the gradient,
matching the gradient of the penalty above (with the same, unrealistically large, constant).
After that, in either case, [`Adam`](@ref Flux.Adam) computes the final update.
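
To make the equivalence concrete, here is a tiny sketch of what happens to one parameter array (illustrative values only, not Flux's literal internals):

```julia
p = randn(Float32, 3)          # one parameter array
g = randn(Float32, 3)          # its gradient from some loss

# WeightDecay(0.42) adds 0.42 .* p to the gradient, matching the gradient
# of the explicit 0.42f0 * penalty term above:
g_decayed = g .+ 0.42f0 .* p

# Adam(0.1) then turns `g_decayed` into the actual step which `update!` applies.
```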
@@ -348,10 +329,6 @@ for epoch in 1:1000
end
```

!!! compat "Flux ≤ 0.14"
With the old "implicit" optimiser, `opt = Adam(0.1)`, the equivalent was to
directly mutate the `Adam` struct, `opt.eta = 0.001`.

Other hyper-parameters can also be adjusted, such as `Flux.adjust!(opt_state, beta = (0.8, 0.99))`.
And such modifications can be applied to just one part of the model.
For instance, this sets a different learning rate for the encoder and the decoder:
@@ -382,21 +359,7 @@ train!(loss, bimodel, data, opt_state)
Flux.thaw!(opt_state)
```

!!! compat "Flux ≤ 0.14"
The earlier "implicit" equivalent was to pass to `gradient` an object referencing only
part of the model, such as `Flux.params(bimodel.layers.enc)`.

While `adjust!` and `freeze!`/`thaw!` make temporary modifications to the optimiser state,
permanently removing some fields of a new layer type from training is usually done
when defining the layer, by calling for example [`@layer`](@ref Flux.@layer)` NewLayer trainable=(weight,)`.
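
For example, a hypothetical layer whose `bias` should never be trained could be declared like this (a sketch; the name `NewLayer` and its fields are illustrative):

```julia
struct NewLayer
    weight
    bias
end

NewLayer(n::Int) = NewLayer(randn(Float32, n, n), zeros(Float32, n))

(l::NewLayer)(x) = l.weight * x .+ l.bias

# Only `weight` is reported by `trainables` and updated during training;
# `bias` is still stored (and moved by `gpu`/`cpu`) but never modified.
Flux.@layer NewLayer trainable=(weight,)
```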

## Implicit or Explicit?

Flux used to handle gradients, training, and optimisation rules quite differently.
The new style described above is called "explicit" by Zygote, and the old style "implicit".
Flux 0.13 and 0.14 are the transitional versions which support both.

The blue-green boxes above describe the changes.
For more details on training in the implicit style, see [Flux 0.13.6 documentation](https://fluxml.ai/Flux.jl/v0.13.6/training/training/).

For details about the two gradient modes, see [Zygote's documentation](https://fluxml.ai/Zygote.jl/dev/#Explicit-and-Implicit-Parameters-1).
16 changes: 0 additions & 16 deletions docs/src/training/zygote.md
@@ -18,22 +18,6 @@ Zygote.hessian_reverse
Zygote.diaghessian
```

## Implicit style (Flux ≤ 0.14)

Flux used to use what Zygote calls "implicit" gradients, [described here](https://fluxml.ai/Zygote.jl/dev/#Explicit-and-Implicit-Parameters-1) in its documentation.
However, support for this will be removed from Flux 0.15.

!!! compat "Training"
The blue-green boxes in the [training section](@ref man-training) describe
the changes needed to upgrade old code from implicit to explicit style.

```@docs
Zygote.gradient(loss, ::Params)
Zygote.Params
Zygote.Grads
Zygote.jacobian(loss, ::Params)
```

## ChainRules

Sometimes it is necessary to exclude some code, or a whole function, from automatic differentiation. This can be done using [ChainRules](https://github.com/JuliaDiff/ChainRules.jl):
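
For instance, a small sketch of the idea (the helper `make_targets` is an illustrative assumption, not from the docs):

```julia
using Flux, ChainRulesCore

# Label preprocessing we never want to differentiate through:
make_targets(y) = Flux.onehotbatch(y, 0:9)

# Mark it non-differentiable; Zygote will treat its output as a constant.
ChainRulesCore.@non_differentiable make_targets(::Any)
```

Alternatively, `ChainRulesCore.ignore_derivatives() do ... end` excludes just a block of code inside a function that is otherwise differentiated.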
71 changes: 25 additions & 46 deletions docs/src/tutorials/2020-09-15-deep-learning-flux.md
@@ -167,26 +167,8 @@ gradient(myloss, W, b, x)

Now we get gradients for each of the inputs `W`, `b` and `x`, which will come in handy when we want to train models.

Because ML models can contain hundreds of parameters, Flux provides a slightly different way of writing `gradient`. We instead mark arrays with `param` to indicate that we want their derivatives. `W` and `b` represent the weight and bias respectively.

```julia
using Flux: params

W = randn(3, 5)
b = zeros(3)
x = rand(5)

y(x) = sum(W * x .+ b)

grads = gradient(()->y(x), params([W, b]))

grads[W], grads[b]
```


We can now grab the gradients of `W` and `b` directly from those parameters.

This comes in handy when working with *layers*. A layer is just a handy container for some parameters. For example, `Dense` does a linear transform for you.
ML models can contain hundreds of parameter arrays, so it is convenient to group them into **layers**.
A layer is just a handy container for some parameters. For example, `Dense` does a linear transform for you.

```julia
using Flux
@@ -196,23 +178,22 @@ m = Dense(10 => 5)
x = rand(Float32, 10)
```

We can easily get the parameters of any layer or model with params with `params`.
We can easily get the parameters of any layer or model with `trainables`.

```julia
params(m)
trainables(m)
```

This makes it very easy to calculate the gradient for all parameters in a network, even if it has many parameters.
It is very easy to calculate the gradient for all parameters in a network, even if it has many parameters.
The function `gradient` is not limited to arrays, but can compute the gradient with respect to generic composite types.

```julia
x = rand(Float32, 10)
m = Chain(Dense(10 => 5, relu), Dense(5 => 2), softmax)
l(x) = sum(Flux.crossentropy(m(x), [0.5, 0.5]))
grads = gradient(params(m)) do
l(x)
end
for p in params(m)
println(grads[p])
model = Chain(Dense(10 => 5, relu), Dense(5 => 2))
loss(model, x) = Flux.logitcrossentropy(model(x), [0.5, 0.5])
grad = gradient(m -> loss(m, x), model)[1]
for (k, p) in trainables(model, path=true)
println("$k => $(getkeypath(grad, k))")
end
```

@@ -221,38 +202,37 @@ You don't have to use layers, but they can be convenient for many simple kinds of
The next step is to update our weights and perform optimisation. As you may know, *Gradient Descent* is a simple algorithm that takes the weights and steps them using a learning rate and the gradients: `weights = weights - learning_rate * gradient`.

```julia
using Flux.Optimise: update!, Descent
η = 0.1
for p in params(m)
update!(p, -η * grads[p])
for (k, p) in trainables(model, path=true)
p .+= -η * getkeypath(grad, k)
end
```

While this is a valid way of updating our weights, it can get more complicated as the algorithms we use get more involved.

Flux comes with a bunch of pre-defined optimisers and makes writing our own really simple. We just give it the learning rate η:
Flux comes with a bunch of pre-defined optimisers and makes writing our own really simple. We just give it the learning rate `η`:

```julia
opt = Descent(0.01)
opt_state = Flux.setup(Descent(η), model)
```

`Training` a network reduces down to iterating on a dataset mulitple times, performing these steps in order. Just for a quick implementation, let’s train a network that learns to predict `0.5` for every input of 10 floats. `Flux` defines the `train!` function to do it for us.
Training a network reduces to iterating over a dataset multiple times, performing these steps in order. Just for a quick implementation, let’s train a network that learns to predict `0.5` for every input of 10 floats. `Flux` defines the `train!` function to do it for us.

```julia
data, labels = rand(10, 100), fill(0.5, 2, 100)
loss(x, y) = sum(Flux.crossentropy(m(x), y))
Flux.train!(loss, params(m), [(data,labels)], opt)
loss(m, x, y) = Flux.logitcrossentropy(m(x), y)
Flux.train!(loss, model, [(data, labels)], opt_state)
```

You don't have to use `train!`. In cases where arbitrary logic might be better suited, you could open up this training loop like so:

```julia
for d in training_set # assuming d looks like (data, labels)
# our super logic
gs = gradient(params(m)) do #m is our model
l = loss(d...)
g = gradient(model) do model
l = loss(model, d...)
end
update!(opt, params(m), gs)
Flux.update!(opt_state, model, g[1])
end
```

Expand All @@ -272,7 +252,7 @@ We will do the following steps in order:

```julia
using Statistics
using Flux, Flux.Optimise
using Flux
using MLDatasets: CIFAR10
using Images.ImageCore
using Flux: onehotbatch, onecold
@@ -323,16 +303,15 @@ m = Chain(
x -> reshape(x, :, size(x, 4)),
Dense(200 => 120),
Dense(120 => 84),
Dense(84 => 10),
softmax) |> gpu
Dense(84 => 10)) |> gpu
```

We will use a crossentropy loss and a Momentum optimiser here. Crossentropy is a good option when working with multiple independent classes. Momentum smooths the optimisation by accumulating past gradients, which helps maintain a bit of adaptivity and keeps us from overshooting our desired destination.

```julia
using Flux: crossentropy, Momentum
using Flux: logitcrossentropy, Momentum

loss(x, y) = sum(crossentropy(m(x), y))
loss(m, x, y) = logitcrossentropy(m(x), y)
opt = Momentum(0.01)
```

5 changes: 1 addition & 4 deletions src/Flux.jl
@@ -15,13 +15,10 @@ import Optimisers: trainable
using Optimisers: update!, trainables
using Random: default_rng
using Zygote, ChainRulesCore
using Zygote: Params, @adjoint, gradient, pullback
using Zygote: @adjoint, gradient, pullback
using Zygote.ForwardDiff: value
export gradient

# Pirate error to catch a common mistake. (Internal function `base` because overloading `update!` is more likely to give ambiguities.)
Optimisers.base(dx::Zygote.Grads) = error("Optimisers.jl cannot be used with Zygote.jl's implicit gradients, `Params` & `Grads`")

export Chain, Dense, Embedding, Maxout, SkipConnection, Parallel, PairwiseFusion,
RNN, LSTM, GRU, GRUv3,
SamePad, Conv, CrossCor, ConvTranspose, DepthwiseConv,
31 changes: 31 additions & 0 deletions src/deprecations.jl
@@ -24,3 +24,34 @@ Train.train!(loss::Function, ps::Zygote.Params, data, opt) = throw(ArgumentError
where `loss_mxy` accepts the model as its first argument.
"""
))


function params!(p::Params, x, seen = IdSet())
# @depwarn "Implicit use of `params` is deprecated. TODO."

if x isa AbstractArray{<:Number} && Functors.isleaf(x)
return push!(p, x)
elseif x in seen
nothing
else
_check_new_macro(x) # complains if you used @functor not @layer
push!(seen, x)
for child in trainable(x)
params!(p, child, seen)
end
end
end

function params(m...)
# @depwarn "Implicit use of `params` is deprecated. TODO."
ps = Params()
params!(ps, m)
return ps
end

# Allows caching of the parameters when params is called within gradient() to fix #2040.
# @non_differentiable params(m...) # https://github.com/FluxML/Flux.jl/pull/2054
# That speeds up implicit use, and silently breaks explicit use.
# From @macroexpand Zygote.@non_differentiable params(m...) and https://github.com/FluxML/Zygote.jl/pull/1248
Zygote._pullback(::Zygote.Context{true}, ::typeof(params), m...) = params(m), _ -> nothing

59 changes: 0 additions & 59 deletions src/functor.jl
@@ -75,65 +75,6 @@ function testmode!(m, mode)
m
end

function params!(p::Params, x, seen = IdSet())
if x isa AbstractArray{<:Number} && Functors.isleaf(x)
return push!(p, x)
elseif x in seen
nothing
else
_check_new_macro(x) # complains if you used @functor not @layer
push!(seen, x)
for child in trainable(x)
params!(p, child, seen)
end
end
end

"""
params(model)
params(layers...)
Given a model or specific layers from a model, create a `Params` object pointing to its trainable parameters.
This can be used with the `gradient` function, see the [training section of the manual](@ref man-training), or as input to the [`Flux.train!`](@ref Flux.train!) function.
The behaviour of `params` on custom types can be customized using [`Functors.@functor`](@ref) or [`Flux.trainable`](@ref).
# Examples
```jldoctest
julia> using Flux: params
julia> params(Chain(Dense(ones(2,3)), softmax)) # unpacks Flux models
Params([[1.0 1.0 1.0; 1.0 1.0 1.0], [0.0, 0.0]])
julia> bn = BatchNorm(2, relu)
BatchNorm(2, relu) # 4 parameters, plus 4 non-trainable
julia> params(bn) # only the trainable parameters
Params([Float32[0.0, 0.0], Float32[1.0, 1.0]])
julia> params([1, 2, 3], [4]) # one or more arrays of numbers
Params([[1, 2, 3], [4]])
julia> params([[1, 2, 3], [4]]) # unpacks array of arrays
Params([[1, 2, 3], [4]])
julia> params(1, [2 2], (alpha=[3,3,3], beta=Ref(4), gamma=sin)) # ignores scalars, unpacks NamedTuples
Params([[2 2], [3, 3, 3]])
```
"""
function params(m...)
ps = Params()
params!(ps, m)
return ps
end

# Allows caching of the parameters when params is called within gradient() to fix #2040.
# @non_differentiable params(m...) # https://github.com/FluxML/Flux.jl/pull/2054
# That speeds up implicit use, and silently breaks explicit use.
# From @macroexpand Zygote.@non_differentiable params(m...) and https://github.com/FluxML/Zygote.jl/pull/1248
Zygote._pullback(::Zygote.Context{true}, ::typeof(params), m...) = params(m), _ -> nothing

struct FluxCPUAdaptor end

# define rules for handling structured arrays