Commit

Print channel dimensions of Dense like those of Conv (#1658)
* print channel dims of Dense like Conv, and accept as input

* do the same for Bilinear

* fix tests

* fix tests

* docstring

* change a few more

* update

* docs

* rm circular ref

* fixup

* news + fixes
mcabbott authored Feb 19, 2022
1 parent b35b23b commit f49e81e
Showing 17 changed files with 142 additions and 130 deletions.
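
In brief, this commit makes `Dense` (and `Bilinear` and `Embedding`) print and accept their channel dimensions as an `in => out` pair, matching `Conv`. A minimal sketch of the two equivalent spellings (not part of the diff below; assumes a Flux version that includes this commit):

```julia
using Flux

d1 = Dense(2 => 3, σ)   # new pair notation, matching Conv((3, 3), 1 => 16)
d2 = Dense(2, 3, σ)     # old positional notation, still accepted

size(d1.weight) == size(d2.weight) == (3, 2)  # both map 2 inputs to 3 outputs
```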
1 change: 1 addition & 0 deletions NEWS.md
@@ -7,6 +7,7 @@ been removed in favour of MLDatasets.jl.
* `flatten` is not exported anymore due to clash with Iterators.flatten.
* Remove Juno.jl progress bar support as it is now obsolete.
* `Dropout` gained improved compatibility with Int and Complex arrays and is now twice-differentiable.
+* Notation `Dense(2 => 3, σ)` for channels matches `Conv`; the equivalent `Dense(2, 3, σ)` still works.
* Many utility functions and the `DataLoader` are [now provided by MLUtils.jl](https://github.com/FluxML/Flux.jl/pull/1874).
* The DataLoader is now compatible with generic dataset types implementing `MLUtils.numobs` and `MLUtils.getobs`.
* Added [truncated normal initialisation](https://github.com/FluxML/Flux.jl/pull/1877) of weights.
8 changes: 4 additions & 4 deletions docs/src/gpu.md
@@ -39,12 +39,12 @@ Note that we convert both the parameters (`W`, `b`) and the data set (`x`, `y`)
If you define a structured model, like a `Dense` layer or `Chain`, you just need to convert the internal parameters. Flux provides `fmap`, which allows you to alter all parameters of a model at once.

```julia
-d = Dense(10, 5, σ)
+d = Dense(10 => 5, σ)
d = fmap(cu, d)
d.weight # CuArray
d(cu(rand(10))) # CuArray output

-m = Chain(Dense(10, 5, σ), Dense(5, 2), softmax)
+m = Chain(Dense(10 => 5, σ), Dense(5 => 2), softmax)
m = fmap(cu, m)
d(cu(rand(10)))
```
@@ -54,8 +54,8 @@ As a convenience, Flux provides the `gpu` function to convert models and data to
```julia
julia> using Flux, CUDA

-julia> m = Dense(10,5) |> gpu
-Dense(10, 5)
+julia> m = Dense(10, 5) |> gpu
+Dense(10 => 5)

julia> x = rand(10) |> gpu
10-element CuArray{Float32,1}:
29 changes: 15 additions & 14 deletions docs/src/models/advanced.md
@@ -74,10 +74,10 @@ this using the slicing features `Chain` provides:

```julia
m = Chain(
-Dense(784, 64, relu),
-Dense(64, 64, relu),
-Dense(32, 10)
-)
+Dense(784 => 64, relu),
+Dense(64 => 64, relu),
+Dense(32 => 10)
+);

ps = Flux.params(m[3:end])
```
@@ -142,10 +142,11 @@ Lastly, we can test our new layer. Thanks to the proper abstractions in Julia, o
```julia
model = Chain(
Join(vcat,
-Chain(Dense(1, 5),Dense(5, 1)), # branch 1
-Dense(1, 2), # branch 2
-Dense(1, 1)), # branch 3
-Dense(4, 1)
+Chain(Dense(1 => 5, relu), Dense(5 => 1)), # branch 1
+Dense(1 => 2), # branch 2
+Dense(1 => 1) # branch 3
+),
+Dense(4 => 1)
) |> gpu

xs = map(gpu, (rand(1), rand(1), rand(1)))
@@ -164,11 +165,11 @@ Join(combine, paths...) = Join(combine, paths)
# use vararg/tuple version of Parallel forward pass
model = Chain(
Join(vcat,
-Chain(Dense(1, 5),Dense(5, 1)),
-Dense(1, 2),
-Dense(1, 1)
+Chain(Dense(1 => 5, relu), Dense(5 => 1)),
+Dense(1 => 2),
+Dense(1 => 1)
),
-Dense(4, 1)
+Dense(4 => 1)
) |> gpu

xs = map(gpu, (rand(1), rand(1), rand(1)))
@@ -201,8 +202,8 @@ Flux.@functor Split
Now we can test to see that our `Split` does indeed produce multiple outputs.
```julia
model = Chain(
-Dense(10, 5),
-Split(Dense(5, 1),Dense(5, 3),Dense(5, 2))
+Dense(10 => 5),
+Split(Dense(5 => 1, tanh), Dense(5 => 3, tanh), Dense(5 => 2))
) |> gpu

model(gpu(rand(10)))
12 changes: 6 additions & 6 deletions docs/src/models/basics.md
@@ -158,14 +158,14 @@ a(rand(10)) # => 5-element vector

Congratulations! You just built the `Dense` layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.

-(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10, 5, σ)`.)
+(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, σ)`.)

## Stacking It Up

It's pretty common to write models that look something like:

```julia
-layer1 = Dense(10, 5, σ)
+layer1 = Dense(10 => 5, σ)
# ...
model(x) = layer3(layer2(layer1(x)))
```
@@ -175,7 +175,7 @@ For long chains, it might be a bit more intuitive to have a list of layers, like
```julia
using Flux

-layers = [Dense(10, 5, σ), Dense(5, 2), softmax]
+layers = [Dense(10 => 5, σ), Dense(5 => 2), softmax]

model(x) = foldl((x, m) -> m(x), layers, init = x)

@@ -186,8 +186,8 @@ Handily, this is also provided for in Flux:

```julia
model2 = Chain(
-Dense(10, 5, σ),
-Dense(5, 2),
+Dense(10 => 5, σ),
+Dense(5 => 2),
softmax)

model2(rand(10)) # => 2-element vector
@@ -198,7 +198,7 @@ This quickly starts to look like a high-level deep learning library; yet you can
A nice property of this approach is that because "models" are just functions (possibly with trainable parameters), you can also see this as simple function composition.

```julia
-m = Dense(5, 2) ∘ Dense(10, 5, σ)
+m = Dense(5 => 2) ∘ Dense(10 => 5, σ)

m(rand(10))
```
10 changes: 5 additions & 5 deletions docs/src/models/overview.md
@@ -43,8 +43,8 @@ Normally, your training and test data come from real world observations, but thi
Now, build a model to make predictions with `1` input and `1` output:

```julia
-julia> model = Dense(1, 1)
-Dense(1, 1)
+julia> model = Dense(1 => 1)
+Dense(1 => 1)

julia> model.weight
1×1 Matrix{Float32}:
@@ -58,10 +58,10 @@ julia> model.bias
Under the hood, a dense layer is a struct with fields `weight` and `bias`. `weight` represents a weights' matrix and `bias` represents a bias vector. There's another way to think about a model. In Flux, *models are conceptually predictive functions*:

```julia
-julia> predict = Dense(1, 1)
+julia> predict = Dense(1 => 1)
```

-`Dense(1, 1)` also implements the function `σ(Wx+b)` where `W` and `b` are the weights and biases. `σ` is an activation function (more on activations later). Our model has one weight and one bias, but typical models will have many more. Think of weights and biases as knobs and levers Flux can use to tune predictions. Activation functions are transformations that tailor models to your needs.
+`Dense(1 => 1)` also implements the function `σ(Wx+b)` where `W` and `b` are the weights and biases. `σ` is an activation function (more on activations later). Our model has one weight and one bias, but typical models will have many more. Think of weights and biases as knobs and levers Flux can use to tune predictions. Activation functions are transformations that tailor models to your needs.

This model will already make predictions, though not accurate ones yet:

@@ -185,7 +185,7 @@ The predictions are good. Here's how we got there.

First, we gathered real-world data into the variables `x_train`, `y_train`, `x_test`, and `y_test`. The `x_*` data defines inputs, and the `y_*` data defines outputs. The `*_train` data is for training the model, and the `*_test` data is for verifying the model. Our data was based on the function `4x + 2`.

-Then, we built a single input, single output predictive model, `predict = Dense(1, 1)`. The initial predictions weren't accurate, because we had not trained the model yet.
+Then, we built a single input, single output predictive model, `predict = Dense(1 => 1)`. The initial predictions weren't accurate, because we had not trained the model yet.

After building the model, we trained it with `train!(loss, parameters, data, opt)`. The loss function is first, followed by the `parameters` holding the weights and biases of the model, the training data, and the `Descent` optimizer provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the `train!` many times to finish the training process.

2 changes: 1 addition & 1 deletion docs/src/models/recurrence.md
@@ -74,7 +74,7 @@ Equivalent to the `RNN` stateful constructor, `LSTM` and `GRU` are also availabl
Using these tools, we can now build the model shown in the above diagram with:

```julia
-m = Chain(RNN(2, 5), Dense(5, 1))
+m = Chain(RNN(2, 5), Dense(5 => 1))
```
In this example, each output has only one component.

12 changes: 6 additions & 6 deletions docs/src/models/regularisation.md
@@ -9,7 +9,7 @@ For example, say we have a simple regression.
```julia
using Flux
using Flux.Losses: logitcrossentropy
-m = Dense(10, 5)
+m = Dense(10 => 5)
loss(x, y) = logitcrossentropy(m(x), y)
```

@@ -39,9 +39,9 @@ Here's a larger example with a multi-layer perceptron.

```julia
m = Chain(
-Dense(28^2, 128, relu),
-Dense(128, 32, relu),
-Dense(32, 10))
+Dense(28^2 => 128, relu),
+Dense(128 => 32, relu),
+Dense(32 => 10))

sqnorm(x) = sum(abs2, x)

@@ -55,8 +55,8 @@ One can also easily add per-layer regularisation via the `activations` function:
```julia
julia> using Flux: activations

-julia> c = Chain(Dense(10, 5, σ), Dense(5, 2), softmax)
-Chain(Dense(10, 5, σ), Dense(5, 2), softmax)
+julia> c = Chain(Dense(10 => 5, σ), Dense(5 => 2), softmax)
+Chain(Dense(10 => 5, σ), Dense(5 => 2), softmax)

julia> activations(c, rand(10))
3-element Array{Any,1}:
14 changes: 7 additions & 7 deletions docs/src/saving.md
@@ -11,8 +11,8 @@ julia> using Flux

julia> model = Chain(Dense(10, 5, NNlib.relu), Dense(5, 2), NNlib.softmax)
Chain(
-Dense(10, 5, relu), # 55 parameters
-Dense(5, 2), # 12 parameters
+Dense(10 => 5, relu), # 55 parameters
+Dense(5 => 2), # 12 parameters
NNlib.softmax,
) # Total: 4 arrays, 67 parameters, 524 bytes.

@@ -32,8 +32,8 @@ julia> @load "mymodel.bson" model

julia> model
Chain(
-Dense(10, 5, relu), # 55 parameters
-Dense(5, 2), # 12 parameters
+Dense(10 => 5, relu), # 55 parameters
+Dense(5 => 2), # 12 parameters
NNlib.softmax,
) # Total: 4 arrays, 67 parameters, 524 bytes.

@@ -59,7 +59,7 @@ model parameters.
```Julia
julia> using Flux

-julia> model = Chain(Dense(10,5,relu),Dense(5,2),softmax)
+julia> model = Chain(Dense(10 => 5,relu),Dense(5 => 2),softmax)
Chain(Dense(10, 5, NNlib.relu), Dense(5, 2), NNlib.softmax)

julia> weights = Flux.params(model);
@@ -74,7 +74,7 @@ You can easily load parameters back into a model with `Flux.loadparams!`.
```julia
julia> using Flux

-julia> model = Chain(Dense(10,5,relu),Dense(5,2),softmax)
+julia> model = Chain(Dense(10 => 5,relu),Dense(5 => 2),softmax)
Chain(Dense(10, 5, NNlib.relu), Dense(5, 2), NNlib.softmax)

julia> using BSON: @load
@@ -94,7 +94,7 @@ In longer training runs it's a good idea to periodically save your model, so tha
using Flux: throttle
using BSON: @save

-m = Chain(Dense(10,5,relu),Dense(5,2),softmax)
+m = Chain(Dense(10 => 5, relu), Dense(5 => 2), softmax)

evalcb = throttle(30) do
# Show loss
4 changes: 2 additions & 2 deletions docs/src/training/training.md
@@ -47,8 +47,8 @@ We can also define an objective in terms of some model:

```julia
m = Chain(
-Dense(784, 32, σ),
-Dense(32, 10), softmax)
+Dense(784 => 32, σ),
+Dense(32 => 10), softmax)

loss(x, y) = Flux.Losses.mse(m(x), y)
ps = Flux.params(m)
2 changes: 1 addition & 1 deletion docs/src/utilities.md
@@ -92,7 +92,7 @@ function make_model(width, height, inchannels, nclasses;

# the input dimension to Dense is programmatically calculated from
# width, height, and inchannels
-return Chain(conv_layers..., Dense(prod(conv_outsize), nclasses))
+return Chain(conv_layers..., Dense(prod(conv_outsize) => nclasses))
end
```

10 changes: 10 additions & 0 deletions src/deprecations.jl
@@ -16,4 +16,14 @@ ones32(::Type, dims...) = throw(ArgumentError("Flux.ones32 is always Float32, us
zeros32(::Type, dims...) = throw(ArgumentError("Flux.zeros32 is always Float32, use Base.zeros to specify the element type"))

# v0.13 deprecations

@deprecate frequencies(xs) group_counts(xs)

# Channel notation: Changed to match Conv, but very softly deprecated!
# Perhaps change to @deprecate for v0.14, but there is no plan to remove these.
Dense(in::Integer, out::Integer, σ = identity; kw...) =
Dense(in => out, σ; kw...)
Bilinear(in1::Integer, in2::Integer, out::Integer, σ = identity; kw...) =
Bilinear((in1, in2) => out, σ; kw...)
Embedding(in::Integer, out::Integer; kw...) = Embedding(in => out; kw...)
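
These three methods simply forward the old positional arguments to the new pair-based constructors, so existing code keeps working. A small illustrative sketch (not part of the diff; `Embedding` is qualified as `Flux.Embedding` here in case it is not exported):

```julia
using Flux

Dense(10, 5, relu)        # forwarded to Dense(10 => 5, relu)
Bilinear(5, 5, 7)         # forwarded to Bilinear((5, 5) => 7)
Flux.Embedding(1000, 64)  # forwarded to Flux.Embedding(1000 => 64)
```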

