diff --git a/previews/PR2535/.documenter-siteinfo.json b/previews/PR2535/.documenter-siteinfo.json index da0adf9bbd..d7571407c6 100644 --- a/previews/PR2535/.documenter-siteinfo.json +++ b/previews/PR2535/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.11.2","generation_timestamp":"2024-12-05T16:35:39","documenter_version":"1.8.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.11.2","generation_timestamp":"2024-12-08T16:33:33","documenter_version":"1.8.0"}} \ No newline at end of file diff --git a/previews/PR2535/ecosystem/index.html b/previews/PR2535/ecosystem/index.html index 8f2e13f04c..a4e8525088 100644 --- a/previews/PR2535/ecosystem/index.html +++ b/previews/PR2535/ecosystem/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash}); -

The Julia Ecosystem around Flux

One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.

This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.

Flux models

  • Flux's model-zoo contains examples from many domains.

Computer vision

  • ObjectDetector.jl provides ready-to-go image detection via YOLO.
  • Metalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.
  • UNet.jl is a generic UNet implementation.

Natural language processing

  • Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.
  • TextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.

Reinforcement learning

  • AlphaZero.jl provides a generic, simple and fast implementation of Deepmind's AlphaZero algorithm.
  • ReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.

Graph learning

  • GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.
  • GeometricFlux.jl is the first graph neural network library for Julia.
  • NeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.
  • SeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.

Time series

Robust networks

  • RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.

Tools closely associated with Flux

Utility tools you're unlikely to have met if you never used Flux!

High-level training flows

  • FastAI.jl is a Julia port of Python's fast.ai library.
  • FluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl.
  • Ignite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.
  • Tsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.

Datasets

Commonly used machine learning datasets are provided by the following packages in the Julia ecosystem:

Plumbing

Tools to put data into the right order for creating a model.

  • Augmentor.jl is a real-time image augmentation library for increasing the number of training images.
  • DataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, keypoints and segmentation masks.
  • MLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.

Parameters


Differentiable programming

Packages based on differentiable programming but not necessarily related to Machine Learning.

  • The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.
  • DiffEqFlux.jl provides tools for creating Neural Differential Equations.
  • Flux3D.jl shows off machine learning on 3D data.
  • RayTracer.jl combines ML with computer vision via a differentiable renderer.
  • Duckietown.jl Differentiable Duckietown simulator.
  • The Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.
  • AtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.
  • DiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.

Probabilistic programming

  • Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.
  • Omega.jl is a research project aimed at causal, higher-order probabilistic programming.
  • Stheno.jl provides flexible Gaussian processes.

Statistics


Useful miscellaneous packages

Some useful and random packages!

  • AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.
  • Mill.jl helps to prototype flexible multi-instance learning models.
  • MLMetrics.jl is a utility for scoring models in data science and machine learning.
  • Torch.jl exposes torch in Julia.
  • ValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.
  • InvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.
  • ProgressMeter.jl progress meters for long-running computations.
  • TensorBoardLogger.jl easy peasy logging to tensorboard in Julia
  • ArgParse.jl is a package for parsing command-line arguments to Julia programs.
  • Parameters.jl types with default field values, keyword constructors and (un-)pack macros.
  • BSON.jl is a package for working with the Binary JSON serialisation format.
  • DataFrames.jl in-memory tabular data in Julia.
  • DrWatson.jl is a scientific project assistant software.

This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.


Alternatives to Flux

Julia has several other libraries for making neural networks.

  • SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl)

  • Knet.jl is a neural network library built around AutoGrad.jl.

  • Lux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.

Explicit or explicit?

Flux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) Lux also uses Zygote, but uses the word "explicit" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.

+

The Julia Ecosystem around Flux

One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.

This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.

Flux models

  • Flux's model-zoo contains examples from many domains.

Computer vision

  • ObjectDetector.jl provides ready-to-go image detection via YOLO.
  • Metalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.
  • UNet.jl is a generic UNet implementation.

Natural language processing

  • Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.
  • TextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.

Reinforcement learning

  • AlphaZero.jl provides a generic, simple and fast implementation of Deepmind's AlphaZero algorithm.
  • ReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.

Graph learning

  • GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.
  • GeometricFlux.jl is the first graph neural network library for Julia.
  • NeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.
  • SeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.

Time series

Robust networks

  • RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.

Tools closely associated with Flux

Utility tools you're unlikely to have met if you never used Flux!

High-level training flows

  • FastAI.jl is a Julia port of Python's fast.ai library.
  • FluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl.
  • Ignite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.
  • Tsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.

Datasets

Commonly used machine learning datasets are provided by the following packages in the Julia ecosystem:

Plumbing

Tools to put data into the right order for creating a model.

  • Augmentor.jl is a real-time image augmentation library for increasing the number of training images.
  • DataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, keypoints and segmentation masks.
  • MLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.

Parameters


Differentiable programming

Packages based on differentiable programming but not necessarily related to Machine Learning.

  • The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.
  • DiffEqFlux.jl provides tools for creating Neural Differential Equations.
  • Flux3D.jl shows off machine learning on 3D data.
  • RayTracer.jl combines ML with computer vision via a differentiable renderer.
  • Duckietown.jl Differentiable Duckietown simulator.
  • The Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.
  • AtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.
  • DiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.

Probabilistic programming

  • Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.
  • Omega.jl is a research project aimed at causal, higher-order probabilistic programming.
  • Stheno.jl provides flexible Gaussian processes.

Statistics


Useful miscellaneous packages

Some useful and random packages!

  • AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.
  • Mill.jl helps to prototype flexible multi-instance learning models.
  • MLMetrics.jl is a utility for scoring models in data science and machine learning.
  • Torch.jl exposes torch in Julia.
  • ValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.
  • InvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.
  • ProgressMeter.jl progress meters for long-running computations.
  • TensorBoardLogger.jl easy peasy logging to tensorboard in Julia
  • ArgParse.jl is a package for parsing command-line arguments to Julia programs.
  • Parameters.jl types with default field values, keyword constructors and (un-)pack macros.
  • BSON.jl is a package for working with the Binary JSON serialisation format.
  • DataFrames.jl in-memory tabular data in Julia.
  • DrWatson.jl is a scientific project assistant software.

This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.


Alternatives to Flux

Julia has several other libraries for making neural networks.

  • SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl)

  • Knet.jl is a neural network library built around AutoGrad.jl.

  • Lux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.

Explicit or explicit?

Flux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) Lux also uses Zygote, but uses the word "explicit" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.

diff --git a/previews/PR2535/guide/gpu/index.html b/previews/PR2535/guide/gpu/index.html index 21f22e2757..71b930df58 100644 --- a/previews/PR2535/guide/gpu/index.html +++ b/previews/PR2535/guide/gpu/index.html @@ -174,4 +174,4 @@ true

For Metal GPU:

julia> using Metal
 
 julia> Metal.functional()
-true
+true diff --git a/previews/PR2535/guide/models/basics/index.html b/previews/PR2535/guide/models/basics/index.html index 603893781c..05ae96bb57 100644 --- a/previews/PR2535/guide/models/basics/index.html +++ b/previews/PR2535/guide/models/basics/index.html @@ -20,11 +20,10 @@ poly3s = Poly3([10, 1, 0.1]) # construct an instance -poly3s(5) == 17.5 # true

Internally, there is little difference between a closure and a struct. They have the same fields, and equivalent methods:

poly3s.θ3 == poly3.θ3 == θ  # both have a field called :θ3
-dump(poly3)  # contains θ3: Array
-dump(poly3s)
+poly3s(5) == 17.5  # true

Internally, there is little difference between a closure and a struct. They have the same fields, and equivalent methods:

dump(poly3), dump(poly3s)  # both contain θ3: Array
+poly3s.θ3 == poly3.θ3 == θ  # field called :θ3 has same value
 methods(poly3)
-methods(poly3s)  # each has 1 method, taking x::Real

The virtue of encapsulation is that it makes composition very easy. We can make more complicated functions by combining simple ones, and each will keep track of its own parameters. Julia writes function composition as ∘, for instance (inv ∘ sin)(pi/6) ≈ 2, and we can use exactly this for our parameterised polynomials:

poly4 = Poly3([1, 0.5, 0]) ∘ Poly3([10, 1, 0.1])
+methods(poly3s)  # each has 1 method, accepting x

The virtue of encapsulation is that it makes composition very easy. We can make more complicated functions by combining simple ones, and each will keep track of its own parameters. Julia writes function composition as ∘, for instance (inv ∘ sin)(pi/6) ≈ 2, and we can use exactly this for our parameterised polynomials:

poly4 = Poly3([1, 0.5, 0]) ∘ Poly3([10, 1, 0.1])
 
 poly4 isa ComposedFunction  # ∘ creates another struct...
 poly4.outer.θ3 == θ         # which has fields :inner & :outer
@@ -74,7 +73,7 @@
 
 layer3s(x)  # output, 2-element Vector{Float32}
 
-Flux.gradient((x,d) -> d(x)[1], x, layer3s)[2]  # NamedTuple{(:W, :b, :act)}

This ∂f/∂layer3s is a named tuple with the same fields as Layer. Within it, the gradient with respect to W is a matrix of seemingly random numbers. Notice that there is also an entry for act, which is nothing, as this field of the struct is not a smoothly adjustable parameter.

We can compose these layers just as we did the polynomials above. Here's a composition of 3, in which the last step is the function only which takes a 2-element vector and gives us the number inside:

model1 = only ∘ Layer(20, 1) ∘ Layer(1, 20)
+Flux.gradient((x,d) -> d(x)[1], x, layer3s)[2]  # NamedTuple{(:W, :b, :act)}

This ∂f/∂layer3s is a named tuple with the same fields as Layer. Within it, the gradient with respect to W is a matrix of seemingly random numbers. Notice that there is also an entry for act, which is nothing, as this field of the struct is not a smoothly adjustable parameter.

We can compose these layers just as we did the polynomials above. Here's a composition of 3, in which the last step is the function only which takes a 1-element vector and gives us the number inside:

model1 = only ∘ Layer(20, 1) ∘ Layer(1, 20)
 
 y = model1(Float32[0.1])  # output is a Float32 number
 
@@ -89,11 +88,11 @@
 
 model2(Float32[0.1])
 
-Flux.gradient(|>, [1f0], model2)[2]

 Flux's layers

Rather than define everything from scratch every time, Flux provides a library of commonly used layers. The same model could be defined:

model3 = Chain(Dense(1 => 20, σ), Dense(20 => 1), only)

How does this model3 differ from the model1 we had before?

If what you need isn't covered by Flux's built-in layers, it's easy to write your own. There are more details later, but the steps are invariably those shown for struct Layer above:

  1. Define a struct which will hold the parameters.
  2. Make it callable, to define how it uses them to transform the input x
  3. Define a constructor which initialises the parameters (if the default constructor doesn't do what you want).
  4. Annotate with @layer to opt in to pretty printing, and other enhancements.

 Functors.jl

To deal with such nested structures, Flux relies heavily on an associated package called Functors. Its basic function is fmap, which generalises map(f, x) to work on almost anything.

For example, this is how gpu moves all arrays within a model to the GPU, reconstructing another only ∘ Layer(...) ∘ Layer(...) (or a Chain etc.) around the new CuArrays:

using CUDA, Functors
-fmap(cu, model1)

And this is a very simple gradient update of the parameters, walking over model and grad simultaneously:

fmap((x, dx) -> x isa Array ? (x - dx/100) : x, model, grad)
Note

Before Flux v0.15 (and Functors v0.5), this exploration of structs was opt-in. After defining struct Layer it was necessary to call @functor Layer (or @layer Layer) before Flux would look inside. This has now changed to be opt-out: Functors (and hence Flux) will explore arbitrary structs, unless told not to (using Functors.@leaf). This is why even "anonymous structs" created by closures like poly3 and layer3 above are now valid Flux models, although the use of named structs is still recommended practice.

Curve Fitting

Above we took gradients of the output, or sometimes of the first element of the output – it must be a number, not a vector. Adjusting the parameters to make this smaller won't lead us anywhere interesting. Instead, we should minimise some loss function which compares the actual output to our desired output.

Perhaps the simplest example is curve fitting. The previous page fitted a linear model to data. With our two-layer model, we can fit a nonlinear function. For example, let us use f(x) = 2x - x^3 evaluated at some points x in -2:0.1:2 as the data, and adjust the parameters of model3 from above so that its output is similar.

data = [([x], 2x-x^3) for x in -2:0.1f0:2]  # training points (x, y)
+Flux.gradient(|>, [1f0], model2)[2]

 Flux's layers

Rather than define everything from scratch every time, Flux provides a library of commonly used layers. The same model could be defined:

model3 = Chain(Dense(1 => 20, σ), Dense(20 => 1), only)

How does this model3 differ from the model1 we had before?

If what you need isn't covered by Flux's built-in layers, it's easy to write your own. There are more details later, but the steps are invariably those shown for struct Layer above:

  1. Define a struct which will hold the parameters.
  2. Make it callable, to define how it uses them to transform the input x
  3. Define a constructor which initialises the parameters (if the default constructor doesn't do what you want).
  4. Annotate with @layer to opt in to pretty printing, and other enhancements.

 Functors.jl

To deal with such nested structures, Flux relies heavily on an associated package called Functors. Its basic function is fmap, which generalises map(f, x) to work on almost anything.
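As a small sketch of what fmap does, assuming only that the Functors package is installed, the following applies a function to every array inside a nested NamedTuple. The names nested and doubled are made up for illustration and do not appear elsewhere in this guide.

```julia
using Functors

# A nested container standing in for a model; the field names are hypothetical.
nested = (a = [1.0, 2.0], b = (c = [3.0],))

# fmap walks the structure, applies the function to each leaf (here, each
# numeric array), and rebuilds a container of the same shape around the results.
doubled = fmap(x -> 2 .* x, nested)
```

The result has the same field names and nesting as the input, which is the property that lets the same mechanism rebuild a whole model around new arrays.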

For example, this is how gpu moves all arrays within a model to the GPU, reconstructing another only ∘ Layer(...) ∘ Layer(...) (or a Chain etc.) around the new CuArrays:

using CUDA, Functors
+fmap(cu, model1)

And this is a very simple gradient update of the parameters, walking over model and grad simultaneously:

fmap((x, dx) -> x isa Array ? (x - dx/100) : x, model, grad)
Note

Before Flux v0.15 (and Functors v0.5), this exploration of structs was opt-in. After defining struct Layer it was necessary to call @functor Layer (or @layer Layer) before Flux would look inside. This has now changed to be opt-out: Functors (and hence Flux) will explore arbitrary structs, unless told not to (using Functors.@leaf). This is why even "anonymous structs" created by closures, like poly3 and layer3 above, are now valid Flux models, although the use of named structs is still recommended practice.

Curve Fitting

Above we took gradients of the output, or sometimes of the first element of the output – it must be a number, not a vector. Adjusting the parameters to make this smaller won't lead us anywhere interesting. Instead, we should minimise some loss function which compares the actual output to our desired output.

Perhaps the simplest example is curve fitting. The previous page fitted a linear model to data. With our two-layer model, we can fit a nonlinear function. For example, let us use f(x) = 2x - x^3 evaluated at some points x in -2:0.1:2 as the data, and adjust the parameters of model3 from above so that its output is similar.

data = [([x], 2x-x^3) for x in -2:0.1f0:2]  # training points (x, y)
 
 for _ in 1:1000  # adjust parameters to minimise the error:
   Flux.train!((m,x,y) -> (m(x) - y)^2, model3, data, Descent(0.01))
 end

The same code will also work with model1 or model2 instead. Here's how to plot the desired and actual outputs:

using Plots
 plot(x -> 2x-x^3, -2, 2, label="truth")
-scatter!(x -> model3([x]), -2:0.1f0:2, label="fitted")

More detail about what exactly the function train! is doing, and how to use rules other than simple Descent, is what the next page in this guide is about: training.

+scatter!(x -> model3([x]), -2:0.1f0:2, label="fitted")

More detail about what exactly the function train! is doing, and how to use rules other than simple Descent, is what the next page in this guide is about: training.

diff --git a/previews/PR2535/guide/models/overview/index.html b/previews/PR2535/guide/models/overview/index.html index 743b5f5a6f..8f02150d30 100644 --- a/previews/PR2535/guide/models/overview/index.html +++ b/previews/PR2535/guide/models/overview/index.html @@ -56,4 +56,4 @@ julia> y_test 1×5 Matrix{Int64}: - 26 30 34 38 42

The predictions are good. Here's how we got there.

First, we gathered real-world data into the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.

Then, we built a single input, single output predictive model, predict = Dense(1 => 1). The initial predictions weren't accurate, because we had not trained the model yet.

After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the train! many times to finish the training process.

After we trained the model, we checked its predictions against the test data.

This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.

+ 26 30 34 38 42

The predictions are good. Here's how we got there.

First, we gathered real-world data into the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.

Then, we built a single input, single output predictive model, predict = Dense(1 => 1). The initial predictions weren't accurate, because we had not trained the model yet.

After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the train! many times to finish the training process.
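The flow described so far can be condensed into a minimal sketch. It assumes a recent Flux release, sets up the optimiser state explicitly with Flux.setup, and uses a step size of 0.01 chosen here for illustration; the variables before and after are likewise not part of the original page.

```julia
using Flux

actual(x) = 4x + 2                      # the function the data is based on
x_train = Float32.(hcat(0:5...))        # 1×6 matrix of inputs
y_train = actual.(x_train)              # matching outputs

predict = Dense(1 => 1)                 # single input, single output model
loss(model, x, y) = Flux.mse(model(x), y)

before = loss(predict, x_train, y_train)

# Assumed step size; the optimiser state is created explicitly here.
opt_state = Flux.setup(Descent(0.01), predict)

# One training step over a single (x, y) batch.
Flux.train!(loss, predict, [(x_train, y_train)], opt_state)

after = loss(predict, x_train, y_train)
```

Comparing before and after shows the loss going down after one step; wrapping the train! call in a loop repeats that step many times.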

After we trained the model, we checked its predictions against the test data.

This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.

diff --git a/previews/PR2535/guide/models/quickstart/index.html b/previews/PR2535/guide/models/quickstart/index.html index 41108485fd..78ce04259a 100644 --- a/previews/PR2535/guide/models/quickstart/index.html +++ b/previews/PR2535/guide/models/quickstart/index.html @@ -64,4 +64,4 @@ y_hat = m(x) Flux.logitcrossentropy(y_hat, y) end -end

For more complex models, you can define a custom struct MyModel containing layers and arrays and implement the call operator (::MyModel)(x) = ... to define the forward pass. This is all that is needed for Flux to work. Marking the struct with Flux.@layer will add some more functionality, like pretty printing and the ability to mark some internal fields as trainable or not (also see trainable).

+end

For more complex models, you can define a custom struct MyModel containing layers and arrays and implement the call operator (::MyModel)(x) = ... to define the forward pass. This is all that is needed for Flux to work. Marking the struct with Flux.@layer will add some more functionality, like pretty printing and the ability to mark some internal fields as trainable or not (also see trainable).
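A minimal sketch of such a custom model might look as follows. The field names encoder and head, and the layer sizes, are hypothetical choices made for this illustration only.

```julia
using Flux

# A custom container holding two layers; the field names are hypothetical.
struct MyModel
    encoder
    head
end

# The call operator defines the forward pass.
(m::MyModel)(x) = m.head(m.encoder(x))

# Optional: opt in to pretty printing and related features.
Flux.@layer MyModel

model = MyModel(Dense(4 => 8, relu), Dense(8 => 2))
y = model(rand(Float32, 4, 3))   # batch of 3 inputs with 4 features each
```

The output y has one column per input sample, with as many rows as the last layer has outputs.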

diff --git a/previews/PR2535/guide/models/recurrence/index.html b/previews/PR2535/guide/models/recurrence/index.html index 5af63447de..c09fb1639e 100644 --- a/previews/PR2535/guide/models/recurrence/index.html +++ b/previews/PR2535/guide/models/recurrence/index.html @@ -114,4 +114,4 @@ opt_state = Flux.setup(AdamW(1e-3), model) g = gradient(m -> Flux.mse(m(x), y), model)[1] -Flux.update!(opt_state, model, g) +Flux.update!(opt_state, model, g) diff --git a/previews/PR2535/guide/performance/index.html b/previews/PR2535/guide/performance/index.html index 4e3574b068..1b751e6f1b 100644 --- a/previews/PR2535/guide/performance/index.html +++ b/previews/PR2535/guide/performance/index.html @@ -14,4 +14,4 @@ function loss_total(x_batch::Matrix, y_batch::Matrix) y_preds = model(x_batch) sum(loss.(y_preds, y_batch)) -end

When doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). This will avoid the splatting penalty, and will hit the optimised reduce method.

Be aware of GPU memory inefficiencies

Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.

+end

When doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). This will avoid the splatting penalty, and will hit the optimised reduce method.
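A short sketch of the two spellings, using only Base Julia; the vector sizes and the variable names are arbitrary choices for illustration.

```julia
# Four length-3 vectors standing in for individual samples.
xs = [fill(Float32(i), 3) for i in 1:4]

# Preferred: no splatting, dispatches to the specialised reduce method.
batch = reduce(hcat, xs)

# Same result, but splats the whole collection into the argument list.
splatted = hcat(xs...)
```

Both produce the same matrix, with one column per element of xs; the difference is in how the call is compiled and executed, which matters more as xs grows.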

Be aware of GPU memory inefficiencies

Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.
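One way to run the incremental collection at regular intervals is sketched below. The function step!, the loop structure, and the interval gc_every are hypothetical stand-ins; a real training loop would call its own update step, and a suitable interval depends on the workload.

```julia
# Hypothetical stand-in for one training step that allocates temporaries.
step!(buffer) = (buffer .= rand(Float32, length(buffer)); sum(buffer))

function run_steps(nsteps; gc_every = 10)
    buffer = zeros(Float32, 1024)
    total = 0f0
    for i in 1:nsteps
        total += step!(buffer)
        # Every `gc_every` iterations, request an incremental (non-full) collection.
        i % gc_every == 0 && GC.gc(false)
    end
    return total
end

total = run_steps(50)
```

Collecting on every iteration would add overhead of its own, which is why the sketch only does so periodically.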

diff --git a/previews/PR2535/guide/saving/index.html b/previews/PR2535/guide/saving/index.html index 3887500752..c9e3d69fff 100644 --- a/previews/PR2535/guide/saving/index.html +++ b/previews/PR2535/guide/saving/index.html @@ -64,4 +64,4 @@ Chain( Dense(10 => 5, relu), # 55 parameters Dense(5 => 2), # 12 parameters -) # Total: 4 arrays, 67 parameters, 476 bytes.
Warning

Saving models this way could lead to compatibility issues across Julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long-term storage; use Flux.state instead.

+) # Total: 4 arrays, 67 parameters, 476 bytes.
Warning

Saving models this way could lead to compatibility issues across Julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long-term storage; use Flux.state instead.

diff --git a/previews/PR2535/guide/training/training/index.html b/previews/PR2535/guide/training/training/index.html index 1fdbcf34fa..f204d1be44 100644 --- a/previews/PR2535/guide/training/training/index.html +++ b/previews/PR2535/guide/training/training/index.html @@ -118,4 +118,4 @@ train!(loss, bimodel, data, opt_state) # Un-freeze the entire model: -Flux.thaw!(opt_state)

While adjust! and freeze!/thaw! make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layer NewLayer trainable=(weight,).

+Flux.thaw!(opt_state)

While adjust! and freeze!/thaw! make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layer NewLayer trainable=(weight,).
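As an illustration of restricting the trainable fields, here is a sketch in which the struct NewLayer, its fields weight and bias, and its forward pass are all hypothetical and defined only for this example.

```julia
using Flux

# A hypothetical layer with two array fields.
struct NewLayer
    weight
    bias
end
(l::NewLayer)(x) = l.weight * x .+ l.bias

# Only `weight` is marked trainable; `bias` is excluded from training.
Flux.@layer NewLayer trainable=(weight,)

layer = NewLayer(rand(Float32, 2, 3), zeros(Float32, 2))

# Inspect which fields the optimiser will see.
trainable_fields = Flux.trainable(layer)
```

Unlike freeze!, this choice is attached to the layer type itself, so it applies to every optimiser state later created for a model containing the layer.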

diff --git a/previews/PR2535/index.html b/previews/PR2535/index.html index a55cdf4143..ddb5cac69c 100644 --- a/previews/PR2535/index.html +++ b/previews/PR2535/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash}); -

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

  • Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.
  • Extensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.
  • Play nicely with others. Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.

Installation

Download Julia 1.10 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.

Learning Flux

The quick start page trains a simple neural network.

The rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.

There are some tutorials about building particular models. The model zoo has starting points for many other common ones. And finally, the ecosystem page lists packages which define Flux models.

The reference section includes, besides Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.

Community

Everyone is welcome to join our community on the Julia Discourse forum or the Slack chat (channel #machine-learning). If you have questions or issues we'll try to help you out.

If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.

+

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

  • Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.
  • Extensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.
  • Play nicely with others. Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.

Installation

Download Julia 1.10 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.

Learning Flux

The quick start page trains a simple neural network.

The rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.

There are some tutorials about building particular models. The model zoo has starting points for many other common ones. And finally, the ecosystem page lists packages which define Flux models.

The reference section includes, besides Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.

Community

Everyone is welcome to join our community on the Julia Discourse forum or the Slack chat (channel #machine-learning). If you have questions or issues we'll try to help you out.

If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.

diff --git a/previews/PR2535/reference/data/mldatadevices/index.html b/previews/PR2535/reference/data/mldatadevices/index.html index 33ef539153..a1094f60c1 100644 --- a/previews/PR2535/reference/data/mldatadevices/index.html +++ b/previews/PR2535/reference/data/mldatadevices/index.html @@ -27,4 +27,4 @@ end (i, summary(x)) = (1, "3×13 CuArray{Float32, 2, CUDA.DeviceMemory}") (i, summary(x)) = (2, "3×13 CuArray{Float32, 2, CUDA.DeviceMemory}") -(i, summary(x)) = (3, "3×7 CuArray{Float32, 2, CUDA.DeviceMemory}")source
+(i, summary(x)) = (3, "3×7 CuArray{Float32, 2, CUDA.DeviceMemory}")source
diff --git a/previews/PR2535/reference/data/mlutils/index.html b/previews/PR2535/reference/data/mlutils/index.html index aef80802f9..35b26bbdb0 100644 --- a/previews/PR2535/reference/data/mlutils/index.html +++ b/previews/PR2535/reference/data/mlutils/index.html @@ -198,7 +198,7 @@ 1.7 1.7source
MLUtils.filterobsFunction
filterobs(f, data)

Return a subset of data container data including all indices i for which f(getobs(data, i)) === true.

data = 1:10
 numobs(data) == 10
 fdata = filterobs(>(5), data)
-numobs(fdata) == 5
source
Flux.flattenFunction

flatten(x)

Same as MLUtils.flatten, which should be preferred to this method; this one exists only for backward compatibility.

source
MLUtils.flattenFunction
flatten(x::AbstractArray)

Reshape arbitrarily shaped input into a matrix-shaped output, preserving the size of the last dimension.

See also unsqueeze.

Examples

julia> rand(3,4,5) |> flatten |> size
+numobs(fdata) == 5
source
Flux.flattenFunction

flatten(x)

Same as MLUtils.flatten, which should be preferred to this method; this one exists only for backward compatibility.

source
MLUtils.flattenFunction
flatten(x::AbstractArray)

Reshape arbitrarily shaped input into a matrix-shaped output, preserving the size of the last dimension.

See also unsqueeze.

Examples

julia> rand(3,4,5) |> flatten |> size
 (12, 5)
source
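
The reshape described for flatten can be sketched in plain Julia, without loading any package. The name flatten_sketch is hypothetical and is not the MLUtils implementation; it only illustrates collapsing all leading dimensions while keeping the last one:

```julia
# Hypothetical stand-in for the behaviour described above:
# collapse every dimension except the last into a single row dimension.
flatten_sketch(x::AbstractArray) = reshape(x, :, size(x, ndims(x)))

x = rand(3, 4, 5)
flat = flatten_sketch(x)
size(flat)   # (12, 5): 3*4 rows, last dimension of 5 preserved
```

Because reshape shares memory with its argument, this sketch does not copy the data.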
MLUtils.getobsFunction
getobs(data, [idx])

Return the observations corresponding to the observation index idx. Note that idx can be any type as long as data has defined getobs for that type. If idx is not provided, then materialize all observations in data.

If data does not have getobs defined, then in the case of Tables.table(data) == true returns the row(s) in position idx, otherwise returns data[idx].

Authors of custom data containers should implement Base.getindex for their type instead of getobs. getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).

The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on what this "actual data" must look like. Every author of a custom data container can make this decision themselves. The output should be consistent when idx is a scalar vs vector.

By default, getobs supports nested combinations of arrays, tuples, named tuples, and dictionaries.

See also getobs! and numobs.

Examples

# named tuples 
 x = (a = [1, 2, 3], b = rand(6, 3))
 
@@ -570,4 +570,4 @@
 julia> zeros_like(x, Float64)
 2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
  0.0  0.0
- 0.0  0.0
source
+ 0.0  0.0
source
diff --git a/previews/PR2535/reference/data/onehot/index.html b/previews/PR2535/reference/data/onehot/index.html index 330482aca4..194b23bc0f 100644 --- a/previews/PR2535/reference/data/onehot/index.html +++ b/previews/PR2535/reference/data/onehot/index.html @@ -73,4 +73,4 @@ 3 6 15 3 9 3 12 3 6 15 3source
OneHotArrays.OneHotArrayType
OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}
 OneHotArray(indices, L)

A one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and sum(A, dims=1) == 1) stored as a compact N == M-1-dimensional array of indices.

Typically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.

source
OneHotArrays.OneHotVectorType
OneHotVector{T} = OneHotArray{T, 0, 1, T}
 OneHotVector(indices, L)

A one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.

source
OneHotArrays.OneHotMatrixType
OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}
-OneHotMatrix(indices, L)

A one-hot matrix (with L labels) typically constructed using onehotbatch. Stored efficiently as a vector of indices with type I and eltype T.

source
+OneHotMatrix(indices, L)

A one-hot matrix (with L labels) typically constructed using onehotbatch. Stored efficiently as a vector of indices with type I and eltype T.

source
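
To make the relation between the compact index storage and the one-hot array concrete, here is a plain-Julia sketch. The helper dense_onehot is hypothetical and is not part of OneHotArrays; it materialises the Bool matrix that a vector of indices stands for:

```julia
# Hypothetical helper: expand a vector of label indices into a dense
# Bool matrix with L rows and one column per observation.
function dense_onehot(indices::AbstractVector{<:Integer}, L::Integer)
    A = falses(L, length(indices))
    for (col, idx) in enumerate(indices)
        A[idx, col] = true
    end
    return A
end

indices = UInt32[2, 1, 3, 3]   # compact storage: one index per column
A = dense_onehot(indices, 3)

size(A, 1) == 3                # L labels along the first dimension
all(sum(A, dims=1) .== 1)      # exactly one `true` in every column
```

The dense matrix stores L Bools per column, while the index vector stores a single integer per column, which is why the one-hot types keep only the indices.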
diff --git a/previews/PR2535/reference/destructure/index.html b/previews/PR2535/reference/destructure/index.html index 7bff8898ef..4401d9fd47 100644 --- a/previews/PR2535/reference/destructure/index.html +++ b/previews/PR2535/reference/destructure/index.html @@ -80,7 +80,7 @@ julia> getkeypath(x, KeyPath(:b, 1, "c")) 2-element Vector{Float64}: 3.0 - 4.0source
Optimisers.isnumericFunction
isnumeric(x) -> Bool

Returns true on any parameter to be adjusted by Optimisers.jl, namely arrays of non-integer numbers. Returns false on all other types.

Requires also that Functors.isleaf(x) == true, to focus on e.g. the parent of a transposed matrix, not the wrapper.

source
Flux.paramsFunction
params(model)

Returns a Zygote.Params object containing all parameter arrays from the model. This is deprecated! This function was the cornerstone of how Flux used Zygote's implicit mode gradients, but since Flux 0.13 we use explicit mode gradient(m -> loss(m, x, y), model) instead. To collect all the parameter arrays for other purposes, use Flux.trainables(model).

source

All Layers

Another kind of flat view of a nested model is provided by the modules command. This extracts a list of all layers:

Flux.modulesFunction
modules(m)

Return an iterator over non-leaf objects that can be reached by recursing m over the children given by Functors.functor.

Useful for applying a function (e.g. a regularizer) over specific modules or subsets of the parameters (e.g. the weights but not the biases).

Examples

julia> m1 = Chain(Dense(28^2, 64), BatchNorm(64, relu));
+ 4.0
source
Optimisers.isnumericFunction
isnumeric(x) -> Bool

Returns true on any parameter to be adjusted by Optimisers.jl, namely arrays of non-integer numbers. Returns false on all other types.

Requires also that Functors.isleaf(x) == true, to focus on e.g. the parent of a transposed matrix, not the wrapper.

source
Flux.paramsFunction
params(model)

Returns a Zygote.Params object containing all parameter arrays from the model. This is deprecated! This function was the cornerstone of how Flux used Zygote's implicit mode gradients, but since Flux 0.13 we use explicit mode gradient(m -> loss(m, x, y), model) instead. To collect all the parameter arrays for other purposes, use Flux.trainables(model).

source
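
A minimal sketch of the explicit-mode replacement mentioned above, assuming a small Dense model, random data and a mean-squared-error loss chosen only for illustration:

```julia
using Flux

model = Dense(2 => 1)
x = rand(Float32, 2, 4)   # 2 features, batch of 4
y = rand(Float32, 1, 4)

# The loss takes the model as an explicit argument.
loss(m, x, y) = Flux.mse(m(x), y)

# Explicit mode: differentiate with respect to the model itself.
grads = Flux.gradient(m -> loss(m, x, y), model)

# The gradient mirrors the model's structure, field by field.
size(grads[1].weight) == size(model.weight)

# Flat collection of the parameter arrays, for purposes other than gradients.
ps = Flux.trainables(model)
```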

All Layers

Another kind of flat view of a nested model is provided by the modules command. This extracts a list of all layers:

Flux.modulesFunction
modules(m)

Return an iterator over non-leaf objects that can be reached by recursing m over the children given by Functors.functor.

Useful for applying a function (e.g. a regularizer) over specific modules or subsets of the parameters (e.g. the weights but not the biases).

Examples

julia> m1 = Chain(Dense(28^2, 64), BatchNorm(64, relu));
 
 julia> m2 = Chain(m1, Dense(64, 10))
 Chain(
@@ -106,7 +106,7 @@
 L2 (generic function with 1 method)
 
 julia> L2(m2) isa Float32
-true
source

Save and Load

Flux.stateFunction
state(x)

Return an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).

Besides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. The leaf types that end up in the state could increase in the future.

This method is particularly useful for saving and loading models, since the state contains only simple data types that can be easily serialized.

The state can be passed to loadmodel! to restore the model.

Examples

Copy the state into another model

julia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));
+true
source

Save and Load

Flux.stateFunction
state(x)

Return an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).

Besides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. The leaf types that end up in the state could increase in the future.

This method is particularly useful for saving and loading models, since the state contains only simple data types that can be easily serialized.

The state can be passed to loadmodel! to restore the model.

Examples

Copy the state into another model

julia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));
 
 julia> s = Flux.state(m1)
 (layers = ((weight = [1.0; 1.0;;], bias = [0.0, 0.0], σ = ()), (weight = [1.0 1.0], bias = [0.0], σ = ())),)
@@ -132,7 +132,7 @@
 
 julia> JLD2.jldsave("checkpoint.jld2", model_state = s)
 
-julia> Flux.loadmodel!(m2, JLD2.load("checkpoint.jld2", "model_state"))
source
Flux.loadmodel!Function
loadmodel!(dst, src)

Copy all the parameters (trainable and non-trainable) from src into dst.

Recursively walks dst and src together using Functors.children, calling copyto! on parameter arrays and throwing an error when there is a mismatch. Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).

See also Flux.state.

Examples

julia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))
+julia> Flux.loadmodel!(m2, JLD2.load("checkpoint.jld2", "model_state"))
source
Flux.loadmodel!Function
loadmodel!(dst, src)

Copy all the parameters (trainable and non-trainable) from src into dst.

Recursively walks dst and src together using Functors.children, calling copyto! on parameter arrays and throwing an error when there is a mismatch. Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).

See also Flux.state.

Examples

julia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))
 Chain(
   Dense(5 => 2, tanh),                  # 12 parameters
   Dense(2 => 1),                        # 3 parameters
@@ -149,7 +149,7 @@
 false
 
 julia> iszero(dst[2].bias)
-true

Extended help

Throws an error when:

  • dst and src do not share the same fields (at any level)
  • the sizes of leaf nodes are mismatched between dst and src
  • copying non-array values to/from an array parameter (except inactive parameters described below)
  • dst is a "tied" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values

Inactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.

source

KeyPath

Functors.KeyPathType
KeyPath(keys...)

A type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, Int, or CartesianIndex.

For custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.

For string, integer, and cartesian index keys, the access is done with getindex instead.

See also getkeypath, haskeypath.

Examples

julia> kp = KeyPath(:b, 3)
+true

Extended help

Throws an error when:

  • dst and src do not share the same fields (at any level)
  • the sizes of leaf nodes are mismatched between dst and src
  • copying non-array values to/from an array parameter (except inactive parameters described below)
  • dst is a "tied" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values

Inactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.

source
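
The inactive-parameter rule can be sketched as follows, assuming two small Dense layers that differ only in whether they carry a bias (the layer sizes are arbitrary):

```julia
using Flux

src = Dense(2 => 1)               # bias is an all-zero vector by default
dst = Dense(2 => 1; bias=false)   # inactive bias, stored as `false`

# Allowed: an all-zero source array is loaded into an inactive parameter,
# so the weights are copied and the bias stays inactive.
Flux.loadmodel!(dst, src)

dst.weight == src.weight
dst.bias === false
```

Had src carried a non-zero bias, the same call would throw, as described above.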

KeyPath

Functors.KeyPathType
KeyPath(keys...)

A type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, Int, or CartesianIndex.

For custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.

For string, integer, and cartesian index keys, the access is done with getindex instead.

See also getkeypath, haskeypath.

Examples

julia> kp = KeyPath(:b, 3)
 KeyPath(:b, 3)
 
 julia> KeyPath(:a, kp, :c, 4) # construct mixing keys and keypaths
@@ -196,4 +196,4 @@
 true
 
 julia> haskeypath(x, KeyPath(:b, "d", 4))
-false
source
Functors.setkeypath!Function
setkeypath!(x, kp::KeyPath, v)

Set the value in x at the path kp to v.

See also KeyPath, getkeypath, and haskeypath.

source
+false
source
Functors.setkeypath!Function
setkeypath!(x, kp::KeyPath, v)

Set the value in x at the path kp to v.

See also KeyPath, getkeypath, and haskeypath.

source
diff --git a/previews/PR2535/reference/models/activation/index.html b/previews/PR2535/reference/models/activation/index.html index 879603b62a..acfbc8cd07 100644 --- a/previews/PR2535/reference/models/activation/index.html +++ b/previews/PR2535/reference/models/activation/index.html @@ -430,4 +430,4 @@ -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⡤⠤⠔⠒⠉⠁⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ └────────────────────────────────────────┘ ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ - ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
diff --git a/previews/PR2535/reference/models/functors/index.html b/previews/PR2535/reference/models/functors/index.html index c6fd9c55eb..063937f288 100644 --- a/previews/PR2535/reference/models/functors/index.html +++ b/previews/PR2535/reference/models/functors/index.html @@ -22,7 +22,7 @@ julia> tri # now the layer is printed compactly Trio(Dense(2 => 1, tanh), Dense(1 => 1; bias=false), Dropout(0.4)) # 4 parameters -julia> opt_state = Flux.setup(Adam(), tri); # `c` is not in the optimizer state

The macro also adds methods to make using Flux with Enzyme easier.

source
Functors.@leafMacro
@leaf T

Define functor for the type T so that isleaf(x::T) == true.

source
Functors.@functorMacro
@functor T
+julia> opt_state = Flux.setup(Adam(), tri); # `c` is not in the optimizer state

The macro also adds methods to make using Flux with Enzyme easier.

  • Duplicated(m::Layer) allocates a copy for the gradient (initially zero).
  • This is made callable, (m::Duplicated{<:Layer})(x...) = m.val(x...)
  • Pretty printing for show(io, mime, ::Duplicated{<:Layer})
source
Functors.@leafMacro
@leaf T

Define functor for the type T so that isleaf(x::T) == true.

source
Functors.@functorMacro
@functor T
 @functor T (x,)

Adds methods to functor allowing recursion into objects of type T, and reconstruction. Assumes that T has a constructor accepting all of its fields, which is true unless you have provided an inner constructor which does not.

By default all fields of T are considered children; this can be restricted by providing a tuple of field names.

Examples

julia> struct Foo; x; y; end
 
 julia> Functors.children(Foo(1,2))
@@ -185,7 +185,7 @@
 julia> m.bias
 2-element Vector{Float32}:
  0.0
- 0.0
source
Flux.gpuMethod
gpu(m)

Copies m to the current GPU device (using current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time). It recurses into structs according to Functors.jl.

Use cpu to copy back to ordinary Arrays. See also f32 and f16 to change element type only.

This function is just defined for convenience around gpu_device, and is equivalent to gpu_device()(m). You may consider defining device = gpu_device() once and then using device(m) to move data.

Example

julia> m = Dense(rand(2, 3))  # constructed with Float64 weight matrix
+ 0.0
source
Flux.gpuMethod
gpu(m)

Copies m to the current GPU device (using current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time). It recurses into structs according to Functors.jl.

Use cpu to copy back to ordinary Arrays. See also f32 and f16 to change element type only.

This function is just defined for convenience around gpu_device, and is equivalent to gpu_device()(m). You may consider defining device = gpu_device() once and then using device(m) to move data.

Example

julia> m = Dense(rand(2, 3))  # constructed with Float64 weight matrix
 Dense(3 => 2)       # 8 parameters
 
 julia> typeof(m.weight)
@@ -195,7 +195,7 @@
 Dense(3 => 2)       # 8 parameters
 
 julia> typeof(m_gpu.weight)
-CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
source
Flux.gpuMethod
gpu(data::DataLoader)
+CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
source
Flux.gpuMethod
gpu(data::DataLoader)
 cpu(data::DataLoader)

Transforms a given DataLoader to apply gpu or cpu to each batch of data, when iterated over. (If no GPU is available, this does nothing.)

Example

julia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)
 4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)
   with first element:
@@ -215,4 +215,4 @@
  1.0  1.0  1.0

For large datasets, this is preferred over moving all the data to the GPU before creating the DataLoader, like this:

julia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)
 4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)
   with first element:
-  (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
Warning

This only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.

source
+ (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
Warning

This only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.

source
diff --git a/previews/PR2535/reference/models/layers/index.html b/previews/PR2535/reference/models/layers/index.html index 519d38dc7f..16493fbe02 100644 --- a/previews/PR2535/reference/models/layers/index.html +++ b/previews/PR2535/reference/models/layers/index.html @@ -23,7 +23,7 @@ julia> Flux.trainables(model2) # no trainable bias 1-element Vector{AbstractArray}: - [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]source
Flux.BilinearType
Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)
+ [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]
source
Flux.BilinearType
Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)
 Bilinear(W::AbstractArray, [bias, σ])

Creates a layer which is fully connected between two inputs and the output, and otherwise similar to Dense. Its output, given vectors x & y, is another vector z with, for all i ∈ 1:out:

z[i] = σ(x' * W[i,:,:] * y + bias[i])

If x and y are matrices, then each column of the output z = B(x, y) is of this form, with B the Bilinear layer.

If the second input y is not given, it is taken to be equal to x, i.e. B(x) == B(x, x)

The two inputs may also be provided as a tuple, B((x, y)) == B(x, y), which is accepted as the input to a Chain.

If the two input sizes are the same, in1 == in2, then you may write Bilinear(in => out, σ).

The initialisation works as for the Dense layer, with W = init(out, in1, in2). By default the bias vector is zeros(Float32, out); the option bias=false switches off the trainable bias. Either of these may be provided explicitly.
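
The formula for z[i] can be evaluated by hand in plain Julia. This is only a sketch of the arithmetic for vector inputs; the sizes, the random weight array and the choice of tanh are assumptions made for illustration, and no Flux layer is constructed:

```julia
using LinearAlgebra

out, in1, in2 = 3, 4, 5
W = randn(Float32, out, in1, in2)   # first dimension of the weight is the output
bias = zeros(Float32, out)
σ = tanh

x = randn(Float32, in1)
y = randn(Float32, in2)

# z[i] = σ(x' * W[i,:,:] * y + bias[i]) for every output index i
z = [σ(dot(x, W[i, :, :] * y) + bias[i]) for i in 1:out]
length(z) == out
```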

Examples

julia> x, y = randn(Float32, 5, 32), randn(Float32, 5, 32);
 
 julia> B = Flux.Bilinear((5, 5) => 7)
@@ -44,7 +44,7 @@
 (3, 32)
 
 julia> Flux.Bilinear(rand(4,8,16), false, tanh)  # first dim of weight is the output
-Bilinear((8, 16) => 4, tanh; bias=false)  # 512 parameters
source
Flux.ScaleType
Scale(size::Integer..., σ=identity; bias=true, init=ones32)
+Bilinear((8, 16) => 4, tanh; bias=false)  # 512 parameters
source
Flux.ScaleType
Scale(size::Integer..., σ=identity; bias=true, init=ones32)
 Scale(scale::AbstractArray, [bias, σ])

Create an element-wise layer, whose forward pass is given by:

y = σ.(scale .* x .+ bias)

This uses .* instead of matrix multiplication * of Dense.

The learnable scale & bias are initialised with init(size...) and zeros32(size...), with init=ones32 by default. You may specify the function init, turn off trainable bias with bias=false, or provide the array(s) explicitly.

Used by LayerNorm with affine=true.

Examples

julia> a = Flux.Scale(2)
 Scale(2)            # 4 parameters
 
@@ -68,7 +68,7 @@
 
 julia> Flux.trainables(b)
 1-element Vector{AbstractArray}:
- Float32[1.0 2.0 3.0 4.0]
source

Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.

Convolution Models

These layers are used to build convolutional neural networks (CNNs).

They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).

Besides images (2D data), they also work with 1D data, where for instance a stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.

To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.

Flux.ConvType
Conv(filter, in => out, σ = identity;
+ Float32[1.0 2.0 3.0 4.0]
source

Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.
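
The equivalence suggested here can be checked numerically with the LinearAlgebra standard library alone. The arrays below are made-up values, and σ is taken to be identity:

```julia
using LinearAlgebra

scale = Float32[1, 2, 3, 4]
bias  = Float32[0.5, 0.5, 0.5, 0.5]
x     = randn(Float32, 4, 8)                # 4 features, batch of 8

elementwise = scale .* x .+ bias            # the Scale forward pass
via_matrix  = Diagonal(scale) * x .+ bias   # Dense-style matrix product

elementwise ≈ via_matrix
```

The element-wise form never materialises the zeros of the diagonal matrix, which is the practical reason to prefer it.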

Convolution Models

These layers are used to build convolutional neural networks (CNNs).

They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).

Besides images (2D data), they also work with 1D data, where for instance a stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.

To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.

Flux.ConvType
Conv(filter, in => out, σ = identity;
      stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])
 Conv(weight, [bias, activation; stride, pad, dilation])

Standard convolutional layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.

Image data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5,5), a 2-tuple of integers.

To take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N+2, where size(x, N+1) == in is the number of input channels, and size(x, ndims(x)) is (as always) the number of observations in a batch. Then:

  • filter should be a tuple of N integers.
  • Keywords stride and dilation should each be either a single integer, or a tuple with N integers.
  • Keyword pad specifies the number of elements added to the borders of the data array. It can be
    • a single integer for equal padding all around,
    • a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
    • a tuple of 2*N integers, for asymmetric padding, or
    • the singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.
  • Keyword groups is expected to be an Int. It specifies the number of groups to divide a convolution into.
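
How these keywords combine to determine the output size along one spatial dimension can be sketched with the usual convolution arithmetic. The helper conv_out is hypothetical, and the formula is the standard one rather than something quoted from Flux's source; pad_total denotes the padding summed over both ends of the dimension:

```julia
# Hypothetical helper: output length along one spatial dimension for
# input length n and kernel length k, using standard convolution arithmetic.
conv_out(n, k; stride=1, pad_total=0, dilation=1) =
    fld(n + pad_total - dilation * (k - 1) - 1, stride) + 1

conv_out(100, 5)                          # 96
conv_out(100, 3)                          # 98
conv_out(100, 5; stride=2, pad_total=4)   # 50
```

With stride 1 and no dilation, a total padding of k - 1 leaves the size unchanged, which is the situation SamePad() is described as arranging.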

Keywords to control initialization of the layer:

  • init - Function used to generate initial weights. Defaults to glorot_uniform.
  • bias - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to false, or another vector can be provided such as bias = randn(Float32, out).

The second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself.

See also ConvTranspose, DepthwiseConv, CrossCor.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images
 
@@ -99,7 +99,7 @@
 (98, 5, 64)
 
 julia> Flux.trainables(layer) |> length
-2
source
Flux.ConvTransposeType
ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])
+2
source
Flux.ConvTransposeType
ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])
 ConvTranspose(weight, [bias, activation; stride, pad, outpad, dilation])

Standard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.

Note that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.

To preserve Conv invertibility when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. Whereas pad is used to zero-pad the input, outpad only affects the output shape.
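
The role of outpad can be illustrated with the usual size arithmetic for transposed convolutions. Both helpers below are hypothetical and use the standard formulas, stated here as an assumption rather than taken from Flux's source; pad_total is the padding summed over both ends of a dimension:

```julia
# Hypothetical helpers: output length along one spatial dimension.
conv_out(n, k; stride=1, pad_total=0, dilation=1) =
    fld(n + pad_total - dilation * (k - 1) - 1, stride) + 1

convtranspose_out(n, k; stride=1, pad_total=0, outpad=0, dilation=1) =
    (n - 1) * stride + dilation * (k - 1) + 1 - pad_total + outpad

# With stride 2, inputs of length 99 and 100 both convolve down to 50 ...
conv_out(99, 3; stride=2, pad_total=2)                      # 50
conv_out(100, 3; stride=2, pad_total=2)                     # 50

# ... so the transpose needs outpad to choose which length to restore.
convtranspose_out(50, 3; stride=2, pad_total=2)             # 99
convtranspose_out(50, 3; stride=2, pad_total=2, outpad=1)   # 100
```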

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

The second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself.

See also Conv for more detailed description of keywords.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = ConvTranspose((5,5), 3 => 7, relu)
@@ -126,7 +126,7 @@
 (102, 4, 64)
 
 julia> Flux.trainables(layer) |> length
-2
source
Flux.CrossCorType
CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
+2
source
Flux.CrossCorType
CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
 CrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Standard cross correlation layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

The second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself.

See also Conv for more detailed description of keywords.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = CrossCor((5,5), 3 => 6, relu; bias=false)
@@ -144,7 +144,7 @@
 CrossCor((3,), 4 => 5, relu)  # 65 parameters
 
 julia> layer(randn(Float32, 100, 4, 64)) |> size
-(98, 5, 64)
source
Flux.DepthwiseConvFunction
DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
+(98, 5, 64)
source
Flux.DepthwiseConvFunction
DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
 DepthwiseConv(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Return a depthwise convolutional layer, that is a Conv layer with number of groups equal to the number of input channels.

See Conv for a description of the arguments.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = DepthwiseConv((5,5), 3 => 6, relu; bias=false)
@@ -154,7 +154,7 @@
 (96, 96, 6, 50)
 
 julia> DepthwiseConv((5, 5), 3 => 9, stride=2, pad=2)(xs) |> size
-(50, 50, 9, 50)
source
Flux.SamePadType
SamePad()

Passed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).

See also Conv, MaxPool.

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of images
+(50, 50, 9, 50)
source
Flux.SamePadType
SamePad()

Passed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).

See also Conv, MaxPool.
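The size rule above can be checked with a little arithmetic. The sketch below is an assumption-laden illustration, not Flux's implementation: the helper names `same_padding` and `conv_outsize` are hypothetical, and the uneven split of the padding (larger part first) is assumed.

```julia
# Hypothetical helpers, not Flux's internals.
# Total padding that keeps the size unchanged at stride 1, split as (before, after).
function same_padding(k::Int; dilation::Int=1)
    total = dilation * (k - 1)
    return (cld(total, 2), fld(total, 2))
end

# Standard convolution output length along one dimension.
function conv_outsize(n::Int, k::Int; stride::Int=1,
                      pad::Tuple{Int,Int}=(0, 0), dilation::Int=1)
    return fld(n + sum(pad) - dilation * (k - 1) - 1, stride) + 1
end

same_padding(2)                                        # (1, 0)
conv_outsize(100, 2; pad=same_padding(2))              # 100, same as the input
conv_outsize(100, 5; stride=2, pad=same_padding(5))    # 50 == cld(100, 2)
```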

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of images
 
 julia> layer = Conv((2,2), 3 => 7, pad=SamePad())
 Conv((2, 2), 3 => 7, pad=(1, 0, 1, 0))  # 91 parameters
@@ -172,7 +172,7 @@
 Conv((5, 5), 3 => 7, pad=2, stride=2)  # 532 parameters
 
 julia> layer3(xs) |> size  # output size = `ceil(input_size/stride)` = 50
-(50, 50, 7, 50)
source

MultiHeadAttention

The basic blocks needed to implement Transformer architectures. See also the functional counterparts documented in NNlib's Attention section.

Flux.MultiHeadAttentionType
MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])

The multi-head dot-product attention layer used in Transformer architectures [1].

Returns the transformed input sequence and the attention scores.

[1] Vaswani et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

Arguments

  • dims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. Can take also simpler forms as b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.
  • nheads: number of heads. Default 8.
  • init: weight initializer for the Dense layers. Default glorot_uniform.
  • bias : whether pointwise QKVO dense transforms use bias. Default false.
  • dropout_prob: dropout probability for the attention scores. Default 0.0.

Forward

(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])

The arguments of the forward pass are:

  • q_in: Input query array of size (q_in_dim, q_len, batch_size).
  • k_in: Input key array of size (k_in_dim, kv_len, batch_size).
  • v_in: Input value array of size (v_in_dim, kv_len, batch_size).
  • bias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.
  • mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. Default nothing.

Alternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).

See also NNlib.dot_product_attention.

Examples

mha = MultiHeadAttention(64, nheads = 8)
+(50, 50, 7, 50)
source

MultiHeadAttention

The basic blocks needed to implement Transformer architectures. See also the functional counterparts documented in NNlib's Attention section.

Flux.MultiHeadAttentionType
MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])

The multi-head dot-product attention layer used in Transformer architectures [1].

Returns the transformed input sequence and the attention scores.

[1] Vaswani et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

Arguments

  • dims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. Can take also simpler forms as b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.
  • nheads: number of heads. Default 8.
  • init: weight initializer for the Dense layers. Default glorot_uniform.
  • bias : whether pointwise QKVO dense transforms use bias. Default false.
  • dropout_prob: dropout probability for the attention scores. Default 0.0.

Forward

(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])

The arguments of the forward pass are:

  • q_in: Input query array of size (q_in_dim, q_len, batch_size).
  • k_in: Input key array of size (k_in_dim, kv_len, batch_size).
  • v_in: Input value array of size (v_in_dim, kv_len, batch_size).
  • bias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.
  • mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. Default nothing.

Alternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).

See also NNlib.dot_product_attention.
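To make the array layouts above concrete, here is a plain-array sketch of a single attention head with an optional mask. It is only an illustration under stated assumptions: the function names are hypothetical, the input projections, dropout and bias are omitted, and the mask convention (true means "may attend") is assumed. It is not the Flux or NNlib implementation.

```julia
# Plain-array sketch of one attention head; not Flux's or NNlib's code.
# Layout follows the text above: features first, then sequence length.

# Softmax over the first (kv_len) dimension.
function softmax_kv(s::AbstractMatrix)
    e = exp.(s .- maximum(s; dims=1))
    return e ./ sum(e; dims=1)
end

function single_head_attention(q, k, v; mask=nothing)
    scores = (k' * q) ./ sqrt(Float32(size(q, 1)))   # (kv_len, q_len)
    if mask !== nothing
        # Entries where the mask is false get zero weight after the softmax.
        scores = ifelse.(mask, scores, typemin(eltype(scores)))
    end
    α = softmax_kv(scores)                           # (kv_len, q_len)
    return v * α, α                                  # (v_dim, q_len) and the scores
end

q = rand(Float32, 16, 10)
k = rand(Float32, 16, 10)
v = rand(Float32, 16, 10)
# Assumed causal convention: key i is visible to query j only if i <= j.
causal = [i <= j for i in 1:10, j in 1:10]
y, α = single_head_attention(q, k, v; mask=causal)
size(y), size(α)   # ((16, 10), (10, 10))
```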

Examples

mha = MultiHeadAttention(64, nheads = 8)
 q = rand(Float32, (64, 10, 32))
 k = rand(Float32, (64, 20, 32))
 v = rand(Float32, (64, 20, 32))
@@ -183,13 +183,13 @@
 mha = MultiHeadAttention(64 => 1024 => 1024, nheads = 8)
 y, α = mha(q) # self-attention
 # [y] = [1024, 10, 32]
-# [α] = [10, 10, 8, 32]
source

Pooling

These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.

Flux.AdaptiveMaxPoolType
AdaptiveMaxPool(out::NTuple)

Adaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+# [α] = [10, 10, 8, 32]
source

Pooling

These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.

Flux.AdaptiveMaxPoolType
AdaptiveMaxPool(out::NTuple)

Adaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMeanPool.
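One common way to derive the window from the requested output size is sketched below. The helper `adaptive_window` is hypothetical, and the rule it uses (stride = in ÷ out, window = in - (out - 1) * stride) is an assumption about how such a layer can be implemented rather than a statement of Flux's source.

```julia
# Hypothetical helper: window and stride for adaptive pooling along one dimension.
function adaptive_window(insize::Int, outsize::Int)
    stride = insize ÷ outsize
    window = insize - (outsize - 1) * stride
    return (window=window, stride=stride)
end

adaptive_window(100, 25)   # (window = 4, stride = 4)
```

For a 100-pixel dimension reduced to 25, this gives a window and stride of 4, which agrees with the comparison against MaxPool((4,4)) in the example.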

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> AdaptiveMaxPool((25, 25))(xs) |> size
 (25, 25, 3, 50)
 
 julia> MaxPool((4,4))(xs) ≈ AdaptiveMaxPool((25, 25))(xs)
-true
source
Flux.MaxPoolType
MaxPool(window::NTuple; pad=0, stride=window)

Max pooling layer, which replaces all pixels in a block of size window with one.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+true
source
Flux.MaxPoolType
MaxPool(window::NTuple; pad=0, stride=window)

Max pooling layer, which replaces all pixels in a block of size window with one.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.
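A minimal sketch of the pooling arithmetic along one dimension, using hypothetical helpers and assuming symmetric padding. It illustrates the standard size formula and the "one value per block" behaviour, not Flux's implementation.

```julia
# Hypothetical helper: output length of a pooling layer along one dimension.
# By default the stride equals the window, as described above.
pool_outsize(n::Int, window::Int; pad::Int=0, stride::Int=window) =
    fld(n + 2 * pad - window, stride) + 1

# Unpadded 1-D max pooling with stride equal to the window.
maxpool1d(x::AbstractVector, window::Int) =
    [maximum(x[i:i+window-1]) for i in 1:window:length(x)-window+1]

pool_outsize(100, 4)                    # 25
pool_outsize(100, 5; pad=2, stride=3)   # 34
maxpool1d([1, 5, 2, 4, 3, 0], 2)        # [5, 4, 3]
```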

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> m = Chain(Conv((5, 5), 3 => 7, pad=SamePad()), MaxPool((5, 5), pad=SamePad()))
 Chain(
@@ -207,7 +207,7 @@
 MaxPool((5,), pad=2, stride=3)
 
 julia> layer(rand(Float32, 100, 7, 50)) |> size
-(34, 7, 50)
source
Flux.GlobalMaxPoolType
GlobalMaxPool()

Global max pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.

See also MaxPool, GlobalMeanPool.

julia> xs = rand(Float32, 100, 100, 3, 50);
+(34, 7, 50)
source
Flux.GlobalMaxPoolType
GlobalMaxPool()

Global max pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.

See also MaxPool, GlobalMeanPool.

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((3,3), 3 => 7), GlobalMaxPool());
 
@@ -215,13 +215,13 @@
 (1, 1, 7, 50)
 
 julia> GlobalMaxPool()(rand(3,5,7)) |> size  # preserves 2 dimensions
-(1, 5, 7)
source
Flux.AdaptiveMeanPoolType
AdaptiveMeanPool(out::NTuple)

Adaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+(1, 5, 7)
source
Flux.AdaptiveMeanPoolType
AdaptiveMeanPool(out::NTuple)

Adaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> AdaptiveMeanPool((25, 25))(xs) |> size
 (25, 25, 3, 50)
 
 julia> MeanPool((4,4))(xs) ≈ AdaptiveMeanPool((25, 25))(xs)
-true
source
Flux.MeanPoolType
MeanPool(window::NTuple; pad=0, stride=window)

Mean pooling layer, averaging all pixels in a block of size window.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);
+true
source
Flux.MeanPoolType
MeanPool(window::NTuple; pad=0, stride=window)

Mean pooling layer, averaging all pixels in a block of size window.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((5,5), 3 => 7), MeanPool((5,5), pad=SamePad()))
 Chain(
@@ -233,12 +233,12 @@
 (96, 96, 7, 50)
 
 julia> m(xs) |> size
-(20, 20, 7, 50)
source
Flux.GlobalMeanPoolType
GlobalMeanPool()

Global mean pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.

julia> xs = rand(Float32, 100, 100, 3, 50);
+(20, 20, 7, 50)
source
Flux.GlobalMeanPoolType
GlobalMeanPool()

Global mean pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((3,3), 3 => 7), GlobalMeanPool());
 
 julia> m(xs) |> size
-(1, 1, 7, 50)
source

Upsampling

The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.

Flux.UpsampleType
Upsample(mode = :nearest; [scale, size]) 
+(1, 1, 7, 50)
source

Upsampling

The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.

Flux.UpsampleType
Upsample(mode = :nearest; [scale, size]) 
 Upsample(scale, mode = :nearest)

An upsampling layer. One of two keywords must be given:

If scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually. Alternatively, keyword size accepts a tuple, to directly specify the leading dimensions of the output.

Currently supported upsampling modes and corresponding NNlib's methods are:

Examples

julia> m = Upsample(scale = (2, 3))
 Upsample(:nearest, scale = (2, 3))
 
@@ -249,7 +249,7 @@
 Upsample(:bilinear, size = (4, 5))
 
 julia> m(ones(2, 2, 1, 1)) |> size
-(4, 5, 1, 1)
source
Flux.PixelShuffleType
PixelShuffle(r::Int)

Pixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.

See NNlib.pixel_shuffle.

Examples

julia> p = PixelShuffle(2);
+(4, 5, 1, 1)
source
Flux.PixelShuffleType
PixelShuffle(r::Int)

Pixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.

See NNlib.pixel_shuffle.

Examples

julia> p = PixelShuffle(2);
 
 julia> xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]
 2×2×4×1 Array{Float64, 4}:
@@ -301,7 +301,7 @@
  4.1  4.3  5.1  5.3  6.1  6.3
  4.2  4.4  5.2  5.4  6.2  6.4
  7.1  7.3  8.1  8.3  9.1  9.3
- 7.2  7.4  8.2  8.4  9.2  9.4
source

Embedding Vectors

These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.

Flux.EmbeddingType
Embedding(in => out; init=randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.

This layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.

For indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).

Examples

julia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))
+ 7.2  7.4  8.2  8.4  9.2  9.4
source

Embedding Vectors

These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.

Flux.EmbeddingType
Embedding(in => out; init=randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.

This layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.

For indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).
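The size rules above follow from treating the layer as column lookup in an out × in weight matrix. A minimal sketch with plain arrays, where the names `embed` and `onehot` are hypothetical and do not refer to Flux's own functions:

```julia
W = randn(Float32, 4, 26)   # out × in weight matrix

# A single index selects one column; an array of indices selects many.
embed(W, i::Integer) = W[:, i]
embed(W, x::AbstractArray{<:Integer}) =
    reshape(W[:, vec(x)], size(W, 1), size(x)...)

# A one-hot vector picks out the same column by matrix multiplication.
onehot(i, n) = Float32.((1:n) .== i)

size(embed(W, rand(1:26, 10, 1, 12)))    # (4, 10, 1, 12)
embed(W, 2) ≈ W * onehot(2, 26)          # true
```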

Examples

julia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))
 Embedding(26 => 4)  # 104 parameters
 
 julia> emb(2)  # one column of emb.weight (here not random!)
@@ -322,7 +322,7 @@
 true
 
 julia> emb(rand(1:26, (10, 1, 12))) |> size  # three batch dimensions
-(4, 10, 1, 12)
source
Flux.EmbeddingBagType
EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in. Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts on a vector of indices which it calls a "bag". Their individual embedding vectors are reduced to one, using mean or some other function.

Instead of acting on one "bag", such as x::Vector{Int}, the layer can also act on several:

  • Acting on a vector of "bags", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).

  • Any higher-rank array of integers is interpreted as a collection of "bags" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all "bags" have the same length.

  • A vector of "bags" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.

The "bag" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produce a stack of embeddings. See details below.

Examples

julia> vocab_size = 26;  # embed into 3 dimensions, with non-random vectors:
+(4, 10, 1, 12)
source
Flux.EmbeddingBagType
EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in. Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts on a vector of indices which it calls a "bag". Their individual embedding vectors are reduced to one, using mean or some other function.

Instead of acting on one "bag", such as x::Vector{Int}, the layer can also act on several:

  • Acting on a vector of "bags", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).

  • Any higher-rank array of integers is interpreted as a collection of "bags" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all "bags" have the same length.

  • A vector of "bags" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.

The "bag" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produce a stack of embeddings. See details below.
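The reduction of a "bag" can be sketched with plain arrays: look up the columns for every index in the bag, then reduce across them. The name `bag_embed` is hypothetical, and this is an illustration of the idea rather than Flux's implementation.

```julia
using Statistics: mean

W = Float32[1 0 0 2; 0 1 0 2; 0 0 1 2]   # out = 3, vocabulary of size 4

# Reduce the embedding vectors of one bag to a single vector.
bag_embed(W, bag::AbstractVector{<:Integer}; reduction=mean) =
    vec(reduction(W[:, bag]; dims=2))

bag_embed(W, [1, 2])                   # Float32[0.5, 0.5, 0.0]
bag_embed(W, [1, 2]; reduction=sum)    # Float32[1.0, 1.0, 0.0]

# A vector of bags (of possibly different lengths) gives one column per bag.
bags = [[1, 2], [3], [4, 4, 1]]
out = reduce(hcat, [bag_embed(W, b) for b in bags])
size(out)                              # (3, 3)
```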

Examples

julia> vocab_size = 26;  # embed into 3 dimensions, with non-random vectors:
 
 julia> eb = EmbeddingBag(vocab_size => 3, init=Flux.identity_init(gain=100))
 EmbeddingBag(26 => 3)  # 78 parameters
@@ -371,7 +371,7 @@
 3×2 Matrix{Float32}:
  33.3333    0.0
  66.6667    0.0
-  0.0     100.0
source

Dataflow Layers, or Containers

The basic Chain(F, G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.

Flux.ChainType
Chain(layers...)
+  0.0     100.0
source

Dataflow Layers, or Containers

The basic Chain(F, G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.

Flux.ChainType
Chain(layers...)
 Chain(name = layer, ...)

Collects multiple layers / functions to be called in sequence on a given input. Supports indexing and slicing, m[2] or m[1:end-1], and if names are given, m[:name] == m[1] etc.
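Calling layers in sequence amounts to a left fold over the layers. A minimal sketch with plain functions, where `apply_chain` is a hypothetical name and not how Flux defines Chain:

```julia
# Apply each layer to the result of the previous one.
apply_chain(layers, x) = foldl((acc, f) -> f(acc), layers; init=x)

F = x -> x^2
G = x -> x + 1

apply_chain((F, G), 5)               # 26
apply_chain((F, G), 5) == (G ∘ F)(5) # true
```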

Examples

julia> m = Chain(x -> x^2, x -> x+1);
 
 julia> m(5) == 26
@@ -394,12 +394,12 @@
 
 julia> Chain(x->@show(x), Parallel(+, inv, abs2))(4, 5)  # returns 1/4 + 5^2
 x = (4, 5)
-25.25

For large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). This feature is somewhat experimental, beware!

source
Flux.activationsFunction
activations(c::Chain, input)

Like calling a Chain, but saves the result of each layer as an output.

Examples

julia> using Flux: activations
+25.25

For large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). This feature is somewhat experimental, beware!

source
Flux.activationsFunction
activations(c::Chain, input)

Like calling a Chain, but saves the result of each layer as an output.

Examples

julia> using Flux: activations
 
 julia> c = Chain(x -> x + 1, x -> x * 2, x -> x ^ 3);
 
 julia> activations(c, 1)
-(2, 4, 64)
source
Flux.MaxoutType
Maxout(layers...)
+(2, 4, 64)
source
Flux.MaxoutType
Maxout(layers...)
 Maxout(f, n_alts)

This contains a number of internal layers, each of which receives the same input. Its output is the elementwise maximum of the internal layers' outputs.

Instead of defining layers individually, you can provide a zero-argument function which constructs them, and the number to construct.

Maxout over linear dense layers satisfies the universal approximation theorem. See Goodfellow, Warde-Farley, Mirza, Courville & Bengio "Maxout Networks" https://arxiv.org/abs/1302.4389.

See also Parallel to reduce with other operators.
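The elementwise maximum over several internal layers can be sketched with plain functions. The name `maxout` below is hypothetical, and the sketch ignores trainable parameters entirely.

```julia
# Every layer sees the same input; outputs are combined by elementwise max.
maxout(layers, x) = mapreduce(f -> f(x), (a, b) -> max.(a, b), layers)

layers = (x -> abs2.(x), x -> x .* 3)
maxout(layers, [-2 -1 0 1 2])   # [4 1 0 3 6]
```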

Examples

julia> m = Maxout(x -> abs2.(x), x -> x .* 3);
 
 julia> m([-2 -1 0 1 2])
@@ -414,7 +414,7 @@
 )                   # Total: 6 arrays, 126 parameters, 816 bytes.
 
 julia> Flux.outputsize(m3, (5, 11))
-(7, 11)
source
Flux.SkipConnectionType
SkipConnection(layer, connection)

Create a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, "skipped" input.

The simplest "ResNet"-type connection is just SkipConnection(layer, +). Here is a more complicated example:

julia> m = Conv((3,3), 4 => 7, pad=(1,1));
+(7, 11)
source
Flux.SkipConnectionType
SkipConnection(layer, connection)

Create a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, "skipped" input.
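In other words, the output is connection(layer(x), x). A minimal sketch with plain functions, using the hypothetical name `skip_connection`:

```julia
# The callable receives the layer's output first and the untouched input second.
skip_connection(layer, connection) = x -> connection(layer(x), x)

residual = skip_connection(x -> 2 .* x, +)
residual([1, 2, 3])   # [3, 6, 9]

concat = skip_connection(x -> 2 .* x, (mx, x) -> vcat(mx, x))
concat([1, 2, 3])     # [2, 4, 6, 1, 2, 3]
```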

The simplest "ResNet"-type connection is just SkipConnection(layer, +). Here is a more complicated example:

julia> m = Conv((3,3), 4 => 7, pad=(1,1));
 
 julia> x = ones(Float32, 5, 5, 4, 10);
 
@@ -424,7 +424,7 @@
 julia> sm = SkipConnection(m, (mx, x) -> cat(mx, x, dims=3));
 
 julia> size(sm(x)) == (5, 5, 11, 10)
-true

See also Parallel, Maxout.

source
Flux.ParallelType
Parallel(connection, layers...)
+true

See also Parallel, Maxout.

source
Flux.ParallelType
Parallel(connection, layers...)
 Parallel(connection; name = layer, ...)

Create a layer which passes an input array to each path in layers, before reducing the output with connection.

Obeys rules similar to broadcasting:

  • Called with one input x, this is equivalent to connection([l(x) for l in layers]...).
  • With multiple inputs and just one layer, it is instead connection([layer(x) for x in inputs]...).
  • With multiple inputs and multiple layers, one input is passed to each layer, thus Parallel(+, f, g)(x, y) = f(x) + g(y).
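The three rules can be written out directly. The function `parallel` below is a hypothetical stand-in built from plain functions, intended only to mirror the rules as stated, not Flux's implementation.

```julia
function parallel(connection, layers::Tuple, xs...)
    if length(xs) == 1
        # One input: every layer receives it.
        return connection((l(xs[1]) for l in layers)...)
    elseif length(layers) == 1
        # One layer: it is applied to every input.
        return connection((layers[1](x) for x in xs)...)
    else
        # Otherwise inputs and layers are paired up one to one.
        length(layers) == length(xs) ||
            throw(ArgumentError("need as many inputs as layers"))
        return connection((l(x) for (l, x) in zip(layers, xs))...)
    end
end

parallel(+, (abs2, sqrt), 4)       # 4^2 + √4 == 18.0
parallel(hcat, (inv,), 1, 2, 4)    # [1.0 0.5 0.25]
parallel(+, (abs2, sqrt), 3, 4)    # 3^2 + √4 == 11.0
```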

Like Chain, its sub-layers may be given names using the keyword constructor. These can be accessed by indexing: m[1] == m[:name] is the first layer.

See also SkipConnection which is Parallel with one identity, and Maxout which reduces by broadcasting max.

Examples

julia> p = Parallel(+, abs2, sqrt);
 
 julia> p(3, 4)  # == 3^2 + √4, two functions two inputs
@@ -459,7 +459,7 @@
 (2,)
 
 julia> model2[:β] == model2[2]
-true
source
Flux.PairwiseFusionType
PairwiseFusion(connection, layers...)

Arguments

  • connection: A function taking 2 inputs and combining them into a single output
  • layers: The layers whose outputs are combined

Inputs

This layer behaves differently based on input type:

  1. If input x is a tuple of length N (or the input is xs with N x's), matching the number of layers, then each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:

x1 → layer1 → y1 ↘
+true
source
Flux.PairwiseFusionType
PairwiseFusion(connection, layers...)

Arguments

  • connection: A function taking 2 inputs and combining them into a single output
  • layers: The layers whose outputs are combined

Inputs

This layer behaves differently based on input type:

  1. If input x is a tuple of length N (or the input is xs with N x's), matching the number of layers, then each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:

x1 → layer1 → y1 ↘
                   connection → layer2 → y2 ↘
               x2 ↗                          connection → layer3 → y3
                                         x3 ↗

... or written as:

y1 = layer1(x1)
@@ -467,7 +467,7 @@
 y3 = layer3(connection(y2, x3))
  2. With just one input, each layer receives the same x combined with the previous output. Thus y = PairwiseFusion(connection, layers...)(x) obeys:
y[1] == layers[1](x)
 for i in 2:length(layers)
     y[i] == layers[i](connection(y[i-1], x))
-end

Returns

A tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).

source

Recurrent Models

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNCellType
RNNCell(in => out, σ = tanh; init_kernel = glorot_uniform, 
+end

Returns

A tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).
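The data flow drawn in the diagram can be sketched with plain functions. The name `pairwise_fusion` is hypothetical; the sketch follows the diagram (each layer is applied to the combination of the previous output and the next input) and is not Flux's own code.

```julia
# Tuple input: one input per layer.
function pairwise_fusion(connection, layers::Tuple, xs::Tuple)
    y = layers[1](xs[1])
    outputs = Any[y]
    for i in 2:length(layers)
        # Combine the previous output with the next input, then apply the layer.
        y = layers[i](connection(y, xs[i]))
        push!(outputs, y)
    end
    return Tuple(outputs)
end

# Single input: every layer receives the same x.
pairwise_fusion(connection, layers::Tuple, x) =
    pairwise_fusion(connection, layers, ntuple(_ -> x, length(layers)))

layers = (x -> x + 1, x -> 2x, x -> x^2)
pairwise_fusion(+, layers, (1, 10, 100))   # (2, 24, 15376)
pairwise_fusion(+, layers, 1)              # (2, 6, 49)
```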

source

Recurrent Models

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNCellType
RNNCell(in => out, σ = tanh; init_kernel = glorot_uniform, 
   init_recurrent_kernel = glorot_uniform, bias = true)

The most basic recurrent layer. Essentially acts as a Dense layer, but with the output fed back into the input each time step.

In the forward pass, implements the function

\[h^\prime = \sigma(W_i x + W_h h + b)\]

and returns h'.

See RNN for a layer that processes entire sequences.

Arguments

  • in => out: The input and output dimensions of the layer.
  • σ: The non-linearity to apply to the output. Default is tanh.
  • init_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.
  • init_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.
  • bias: Whether to include a bias term initialized to zero. Default is true.

Forward

rnncell(x, [h])

The arguments of the forward pass are:

  • x: The input to the RNN. It should be a vector of size in or a matrix of size in x batch_size.
  • h: The hidden state of the RNN. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.
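The forward formula can be written with plain matrices. In this sketch the function name `rnn_step` and the variable names are hypothetical, and the weights are random placeholders rather than a trained cell.

```julia
# One step of h′ = σ(W_i x + W_h h + b), applied to a whole batch at once.
rnn_step(Wi, Wh, b, x, h; σ=tanh) = σ.(Wi * x .+ Wh * h .+ b)

d_in, d_out, batch_size = 3, 5, 4
Wi = randn(Float32, d_out, d_in)     # input-to-hidden weights
Wh = randn(Float32, d_out, d_out)    # hidden-to-hidden weights
b  = zeros(Float32, d_out)           # bias, initialized to zero

x = rand(Float32, d_in, batch_size)
h = zeros(Float32, d_out, batch_size)

h′ = rnn_step(Wi, Wh, b, x, h)
size(h′)   # (5, 4), i.e. out x batch_size
```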

Examples

r = RNNCell(3 => 5)
 
 # A sequence of length 10 and batch size 4
@@ -488,7 +488,7 @@
 end
 
 h   # The final hidden state
-ŷ   # The hidden states at each time step
source
Flux.RNNType
RNN(in => out, σ = tanh; init_kernel = glorot_uniform, 
+ŷ   # The hidden states at each time step
source
Flux.RNNType
RNN(in => out, σ = tanh; init_kernel = glorot_uniform, 
   init_recurrent_kernel = glorot_uniform, bias = true)

The most basic recurrent layer. Essentially acts as a Dense layer, but with the output fed back into the input each time step.

In the forward pass computes

\[h_t = \sigma(W_i x_t + W_h h_{t-1} + b)\]

for all len steps t in the input sequence.

See RNNCell for a layer that processes a single time step.

Arguments

  • in => out: The input and output dimensions of the layer.
  • σ: The non-linearity to apply to the output. Default is tanh.
  • init_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.
  • init_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.
  • bias: Whether to include a bias term initialized to zero. Default is true.

Forward

rnn(x, [h])

The arguments of the forward pass are:

  • x: The input to the RNN. It should be a matrix of size in x len or an array of size in x len x batch_size.
  • h: The initial hidden state of the RNN. If given, it is a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.

Returns all new hidden states h_t as an array of size out x len x batch_size.
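Processing a whole sequence amounts to applying the single-step formula once per time step and stacking the results. A minimal sketch with plain arrays, where `rnn_sequence` and the random weights are hypothetical placeholders:

```julia
function rnn_sequence(Wi, Wh, b, x::AbstractArray{T,3}, h0; σ=tanh) where {T}
    len, batch_size = size(x, 2), size(x, 3)
    hs = Array{T}(undef, size(Wh, 1), len, batch_size)
    h = h0
    for t in 1:len
        # h_t = σ(W_i x_t + W_h h_{t-1} + b)
        h = σ.(Wi * x[:, t, :] .+ Wh * h .+ b)
        hs[:, t, :] = h
    end
    return hs
end

d_in, d_out, len, batch_size = 4, 6, 3, 5
Wi = randn(Float32, d_out, d_in)
Wh = randn(Float32, d_out, d_out)
b  = zeros(Float32, d_out)

x  = rand(Float32, d_in, len, batch_size)
h0 = zeros(Float32, d_out)           # a vector, broadcast over the batch

hs = rnn_sequence(Wi, Wh, b, x, h0)
size(hs)   # (6, 3, 5), i.e. out x len x batch_size
```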

Examples

julia> d_in, d_out, len, batch_size = 4, 6, 3, 5;
 
 julia> x = rand(Float32, (d_in, len, batch_size));
@@ -509,7 +509,7 @@
 
 (m::Model)(x) = m.rnn(x, m.h0)
 
-model = Model(RNN(32 => 64), zeros(Float32, 64))
source
Flux.LSTMCellType
LSTMCell(in => out; init_kernel = glorot_uniform,
+model = Model(RNN(32 => 64), zeros(Float32, 64))
source
Flux.LSTMCellType
LSTMCell(in => out; init_kernel = glorot_uniform,
   init_recurrent_kernel = glorot_uniform, bias = true)

The Long Short Term Memory cell. Behaves like an RNN but generally exhibits a longer memory span over sequences.

In the forward pass, computes

\[i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
 f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
 c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
@@ -527,7 +527,8 @@
 julia> h′, c′ = l(x, (h, c));
 
 julia> size(h′) # out x batch_size
-(5, 4)

source
Flux.LSTMType

" LSTM(in => out; initkernel = glorotuniform, initrecurrentkernel = glorot_uniform, bias = true)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

In the forward pass, computes

\[i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
+(5, 4)

source
Flux.LSTMType
LSTM(in => out; init_kernel = glorot_uniform,
+  init_recurrent_kernel = glorot_uniform, bias = true)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

In the forward pass, computes

\[i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
 f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
 c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
 o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
@@ -546,7 +547,7 @@
 x = rand(Float32, (d_in, len, batch_size))
 model = Model(LSTM(d_in => d_out), zeros(Float32, d_out), zeros(Float32, d_out))
 h, c = model(x)
-size(h) # out x len x batch_size

source
Flux.GRUCellType
GRUCell(in => out; init_kernel = glorot_uniform,
+size(h)  # out x len x batch_size
source
Flux.GRUCellType
GRUCell(in => out; init_kernel = glorot_uniform,
   init_recurrent_kernel = glorot_uniform, bias = true)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.

In the forward pass, computes

\[r = \sigma(W_{xi} x + W_{hi} h + b_i)
z = \sigma(W_{xz} x + W_{hz} h + b_z)
h̃ = \tanh(W_{xh} x + r \odot W_{hh} h + b_h)
@@ -558,7 +559,7 @@
 julia> x = rand(Float32, 3, 4); # in x batch_size
-julia> h′ = g(x, h);

source
Flux.GRUType
GRU(in => out; init_kernel = glorot_uniform,
+julia> h′ = g(x, h);
source
Flux.GRUType
GRU(in => out; init_kernel = glorot_uniform,
   init_recurrent_kernel = glorot_uniform, bias = true)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.

The forward pass computes

\[r_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)
h̃_t = \tanh(W_{xh} x_t + r_t \odot W_{hh} h_{t-1} + b_h)
@@ -566,15 +567,15 @@
 gru = GRU(d_in => d_out)
 x = rand(Float32, (d_in, len, batch_size))
 h0 = zeros(Float32, d_out)
-h = gru(x, h0) # out x len x batch_size

source
Flux.GRUv3CellType
GRUv3Cell(in => out; init_kernel = glorot_uniform,
+h = gru(x, h0)  # out x len x batch_size
source
Flux.GRUv3CellType
GRUv3Cell(in => out; init_kernel = glorot_uniform,
   init_recurrent_kernel = glorot_uniform, bias = true)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.

The forward pass computes

\[r = \sigma(W_{xi} x + W_{hi} h + b_i)
z = \sigma(W_{xz} x + W_{hz} h + b_z)
h̃ = \tanh(W_{xh} x + W_{hh̃} (r \odot W_{hh} h) + b_h)
-h' = (1 - z) \odot h̃ + z \odot h\]

and returns h'. This is a single time step of the GRU.

See GRUv3 for a layer that processes entire sequences. See GRU and GRUCell for variants of this layer.

Arguments

  • in => out: The input and output dimensions of the layer.
  • init_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.
  • init_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.
  • bias: Whether to include a bias term initialized to zero. Default is true.

Forward

gruv3cell(x, [h])

The arguments of the forward pass are:

  • x: The input to the GRU. It should be a vector of size in or a matrix of size in x batch_size.
  • h: The hidden state of the GRU. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.

Returns the new hidden state h' as an array of size out or out x batch_size.

source
Flux.GRUv3Type
GRUv3(in => out; init_kernel = glorot_uniform,
+h' = (1 - z) \odot h̃ + z \odot h\]

and returns h'. This is a single time step of the GRU.

See GRUv3 for a layer that processes entire sequences. See GRU and GRUCell for variants of this layer.

Arguments

  • in => out: The input and output dimensions of the layer.
  • init_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.
  • init_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.
  • bias: Whether to include a bias term initialized to zero. Default is true.

Forward

gruv3cell(x, [h])

The arguments of the forward pass are:

  • x: The input to the GRU. It should be a vector of size in or a matrix of size in x batch_size.
  • h: The hidden state of the GRU. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.

Returns the new hidden state h' as an array of size out or out x batch_size.
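
As a rough illustration of the equations above, a single GRUv3 step can be written in plain Julia. This is only a sketch and not Flux's implementation: the helper names sigmoid_sketch and gruv3_step_sketch, the NamedTuple p, and its field names (Wxi, Whi, Whh_tilde, and so on) are hypothetical, chosen to mirror the symbols in the formula.

```julia
# Logistic sigmoid, defined locally so the sketch needs no packages.
sigmoid_sketch(x) = 1 / (1 + exp(-x))

# One GRUv3 time step following the documented equations.
# `p` is a hypothetical NamedTuple holding the weights and biases.
function gruv3_step_sketch(p, x, h)
    r = sigmoid_sketch.(p.Wxi * x .+ p.Whi * h .+ p.bi)   # reset gate
    z = sigmoid_sketch.(p.Wxz * x .+ p.Whz * h .+ p.bz)   # update gate
    # Candidate state: the reset gate is applied to W_hh * h before the second projection.
    h_cand = tanh.(p.Wxh * x .+ p.Whh_tilde * (r .* (p.Whh * h)) .+ p.bh)
    return (1 .- z) .* h_cand .+ z .* h                    # new hidden state h'
end

d_in, d_out = 2, 3
p = (Wxi = zeros(d_out, d_in), Whi = zeros(d_out, d_out), bi = zeros(d_out),
     Wxz = zeros(d_out, d_in), Whz = zeros(d_out, d_out), bz = zeros(d_out),
     Wxh = zeros(d_out, d_in), Whh = zeros(d_out, d_out),
     Whh_tilde = zeros(d_out, d_out), bh = zeros(d_out))

h_new = gruv3_step_sketch(p, ones(d_in), ones(d_out))
```

With all weights and biases set to zero, both gates evaluate to 0.5 and the candidate state is zero, so the new hidden state is half of the previous one. Real layers would of course use initialised, trained weights.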

source
Flux.GRUv3Type
GRUv3(in => out; init_kernel = glorot_uniform,
   init_recurrent_kernel = glorot_uniform, bias = true)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.

The forward pass computes

\[r_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)
h̃_t = \tanh(W_{xh} x_t + W_{hh̃} (r_t \odot W_{hh} h_{t-1}) + b_h)
-h_t = (1 - z_t) \odot h̃_t + z_t \odot h_{t-1}\]

for all len steps t in the input sequence. See GRUv3Cell for a layer that processes a single time step. See GRU and GRUCell for variants of this layer.

Notice that GRUv3 is not a more advanced version of GRU but only a less popular variant.

Arguments

  • in => out: The input and output dimensions of the layer.
  • init_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.
  • init_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.
  • bias: Whether to include a bias term initialized to zero. Default is true.
source

Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.

Flux.BatchNormType
BatchNorm(channels::Integer, λ=identity;
+h_t = (1 - z_t) \odot h̃_t + z_t \odot h_{t-1}\]

for all len steps t in the input sequence. See GRUv3Cell for a layer that processes a single time step. See GRU and GRUCell for variants of this layer.

Notice that GRUv3 is not a more advanced version of GRU but only a less popular variant.

Arguments

  • in => out: The input and output dimensions of the layer.
  • init_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.
  • init_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.
  • bias: Whether to include a bias term initialized to zero. Default is true.
source

Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.

Flux.BatchNormType
BatchNorm(channels::Integer, λ=identity;
           initβ=zeros32, initγ=ones32,
           affine=true, track_stats=true, active=nothing,
           eps=1f-5, momentum= 0.1f0)

Batch Normalization layer. channels should be the size of the channel dimension in your data (see below).

Given an array with N dimensions, call the N-1th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.

BatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input slice and normalises the input accordingly.

If affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.

After normalisation, elementwise activation λ is applied.

If track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.

Use testmode! during inference.

Examples

julia> using Statistics
@@ -586,7 +587,7 @@
 julia> Flux.trainmode!(m);
 
 julia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))
-true
source
Flux.DropoutType
Dropout(p; [dims, rng, active])

Layer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.

While training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.

By default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.

By default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).

Keyword rng lets you specify a custom random number generator. (Only supported on the CPU.)

Examples

julia> m = Chain(Dense(ones(3,2)), Dropout(0.4))
+true
source
Flux.DropoutType
Dropout(p; [dims, rng, active])

Layer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.

While training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.

By default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.

By default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).

Keyword rng lets you specify a custom random number generator. (Only supported on the CPU.)
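
The training-mode rule described above (zero an entry with probability p, otherwise scale it by 1 / (1 - p)) can be sketched in plain Julia. The helper dropout_sketch is hypothetical and only illustrates the rule; Flux itself delegates to NNlib.dropout, whose internals may differ.

```julia
using Random

# Hypothetical helper illustrating the training-mode behaviour described above.
function dropout_sketch(rng::AbstractRNG, x::AbstractArray, p::Real)
    keep = rand(rng, Float32, size(x)) .>= p   # each entry is kept with probability 1 - p
    return x .* keep ./ (1 - p)                # kept entries are scaled by 1 / (1 - p)
end

rng = MersenneTwister(0)
x = ones(Float32, 100, 100)
y = dropout_sketch(rng, x, 0.4)
```

The scaling keeps the expected value of each entry unchanged, which is why no rescaling is needed when the layer is inactive at test time.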

Examples

julia> m = Chain(Dense(ones(3,2)), Dropout(0.4))
 Chain(
   Dense(2 => 3),                        # 9 parameters
   Dropout(0.4),
@@ -618,7 +619,7 @@
 1.9989999999999961
 
 julia> mean(iszero, y)  # is about 0.4
-0.4003
source
Flux.AlphaDropoutType
AlphaDropout(p; [rng, active])

A dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.

Does nothing to the input once testmode! is true.

Examples

julia> using Statistics
+0.4003
source
Flux.AlphaDropoutType
AlphaDropout(p; [rng, active])

A dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.

Does nothing to the input once testmode! is true.

Examples

julia> using Statistics
 
 julia> x = randn32(1000,1);
 
@@ -629,7 +630,7 @@
 julia> y = m(x);
 
 julia> isapprox(std(x), std(y), atol=0.2)
-true
source
Flux.LayerNormType
LayerNorm(size..., λ=identity; affine=true, eps=1f-5)

A normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.

In the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.

If affine=true, it also applies a learnable shift and rescaling using the Scale layer.

See also BatchNorm, InstanceNorm, GroupNorm, and normalise.

Examples

julia> using Statistics
+true
source
Flux.LayerNormType
LayerNorm(size..., λ=identity; affine=true, eps=1f-5)

A normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.

In the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.

If affine=true, it also applies a learnable shift and rescaling using the Scale layer.

See also BatchNorm, InstanceNorm, GroupNorm, and normalise.

Examples

julia> using Statistics
 
 julia> xs = rand(3, 3, 3, 2);  # a batch of 2 images, each having 3 channels
 
@@ -638,7 +639,7 @@
 julia> y = m(xs);
 
 julia> isapprox(std(y, dims=1:3), ones(1, 1, 1, 2), atol=0.1) && std(y, dims=1:3) != std(xs, dims=1:3)
-true
source
Flux.InstanceNormType
InstanceNorm(channels::Integer, λ=identity;
+true
source
Flux.InstanceNormType
InstanceNorm(channels::Integer, λ=identity;
              initβ=zeros32, initγ=ones32,
              affine=false, track_stats=false,
              eps=1f-5, momentum=0.1f0)

Instance Normalization layer. channels should be the size of the channel dimension in your data (see below).

Given an array with N > 2 dimensions, call the N-1th the channel dimension. For WHCN images it's the usual channel dimension.

InstanceNorm computes the mean and variance for each D_1×...×D_{N-2}×1×1 input slice and normalises the input accordingly.

If affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.

If track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.

Warning: the defaults for affine and track_stats used to be true in previous Flux versions (< v0.12).

Examples

julia> using Statistics
@@ -650,7 +651,7 @@
 julia> y = m(xs);
 
 julia> isapprox(std(y, dims=1:2), ones(1, 1, 3, 2), atol=0.2) && std(y, dims=1:2) != std(xs, dims=1:2)
-true
source
Flux.GroupNormType
GroupNorm(channels::Int, G::Int, λ = identity;
+true
source
Flux.GroupNormType
GroupNorm(channels::Int, G::Int, λ = identity;
           initβ = zeros32,
           initγ = ones32,
           affine = true,
@@ -667,7 +668,7 @@
 true
 
 julia> isapprox(std(y[:, :, 3:4, 2]), 1, atol=0.1) && std(xs[:, :, 3:4, 2]) != std(y[:, :, 3:4, 2])
-true
source
Flux.normaliseFunction
normalise(x; dims=ndims(x), eps=1f-5)

Normalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. eps is a small term added to the variance for numerical stability.

Examples

julia> using Statistics
+true
source
Flux.normaliseFunction
normalise(x; dims=ndims(x), eps=1f-5)

Normalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. eps is a small term added to the variance for numerical stability.
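
A minimal sketch of this computation, following the description above in which eps is added to the variance. The name normalise_sketch is hypothetical, and Flux's actual implementation may differ in detail (for example in how eps enters the denominator).

```julia
using Statistics

# Hypothetical re-implementation of the documented behaviour.
function normalise_sketch(x; dims = ndims(x), eps = 1f-5)
    μ = mean(x; dims)
    σ² = var(x; dims, mean = μ, corrected = false)   # population variance along dims
    return (x .- μ) ./ sqrt.(σ² .+ eps)
end

x = [90, 100, 110, 130, 70]
y = normalise_sketch(x)
```

For this input the mean is 100 and the uncorrected variance is 400, so y is approximately (x .- 100) ./ 20.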

Examples

julia> using Statistics
 
 julia> x = [90, 100, 110, 130, 70];
 
@@ -690,7 +691,7 @@
 julia> y = Flux.normalise(x, dims=1);
 
 julia> isapprox(std(y; dims=1, corrected=false), ones(1, 10), atol=1e-5)
-true
source

Test vs. Train

Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.

Warning

This automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.

The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.

Flux.testmode!Function
testmode!(model, [mode]) -> model

Set a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.

If you manually set a model into test mode, you need to manually place it back into train mode during training phase, using trainmode!.

There is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.

Example

julia> d = Dropout(0.3)
+true
source

Test vs. Train

Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.

Warning

This automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.

The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.

Flux.testmode!Function
testmode!(model, [mode]) -> model

Set a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.

If you manually set a model into test mode, you need to manually place it back into train mode during training phase, using trainmode!.

There is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.

Example

julia> d = Dropout(0.3)
 Dropout(0.3)
 
 julia> testmode!(d)   # dropout is now always disabled
@@ -700,4 +701,4 @@
 Dropout(0.3, active=true)
 
 julia> testmode!(d, :auto)  # back to default
-Dropout(0.3)
source
Flux.trainmode!Function
trainmode!(model) -> model

Set a layer, or all layers in a model, to training mode. Opposite to testmode!, see further details there.

source
+Dropout(0.3)
source
Flux.trainmode!Function
trainmode!(model) -> model

Set a layer, or all layers in a model, to training mode. Opposite to testmode!, see further details there.

source
diff --git a/previews/PR2535/reference/models/losses/index.html b/previews/PR2535/reference/models/losses/index.html
index 4c7883209c..0faf2f8880 100644
--- a/previews/PR2535/reference/models/losses/index.html
+++ b/previews/PR2535/reference/models/losses/index.html
@@ -10,16 +10,16 @@
 loss(ŷ, y, agg=identity) # no aggregation.

Function listing

Flux.Losses.maeFunction
mae(ŷ, y; agg = mean)

Return the loss corresponding to mean absolute error:

agg(abs.(ŷ .- y))

Example

julia> y_model = [1.1, 1.9, 3.1];
 
 julia> Flux.mae(y_model, 1:3)
-0.10000000000000009
source
Flux.Losses.mseFunction
mse(ŷ, y; agg = mean)

Return the loss corresponding to mean square error:

agg((ŷ .- y) .^ 2)

See also: mae, msle, crossentropy.

Example

julia> y_model = [1.1, 1.9, 3.1];
+0.10000000000000009
source
Flux.Losses.mseFunction
mse(ŷ, y; agg = mean)

Return the loss corresponding to mean square error:

agg((ŷ .- y) .^ 2)

See also: mae, msle, crossentropy.

Example

julia> y_model = [1.1, 1.9, 3.1];
 
 julia> y_true = 1:3;
 
 julia> Flux.mse(y_model, y_true)
-0.010000000000000018
source
Flux.Losses.msleFunction
msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

The loss corresponding to mean squared logarithmic errors, calculated as

agg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)

The ϵ == eps term provides numerical stability. Penalizes an under-estimation more than an over-estimation.

Example

julia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)
+0.010000000000000018
source
Flux.Losses.msleFunction
msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

The loss corresponding to mean squared logarithmic errors, calculated as

agg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)

The ϵ == eps term provides numerical stability. Penalizes an under-estimation more than an over-estimation.

Example

julia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)
 0.009084041f0
 
 julia> Flux.msle(Float32[0.9, 1.8, 2.7], 1:3)
-0.011100831f0
source
Flux.Losses.huber_lossFunction
huber_loss(ŷ, y; delta = 1, agg = mean)

Return the mean of the Huber loss given the prediction ŷ and true values y.

             | 0.5 * |ŷ - y|^2,            for |ŷ - y| <= δ
+0.011100831f0
source
Flux.Losses.huber_lossFunction
huber_loss(ŷ, y; delta = 1, agg = mean)

Return the mean of the Huber loss given the prediction ŷ and true values y.

             | 0.5 * |ŷ - y|^2,            for |ŷ - y| <= δ
 Huber loss = |
              |  δ * (|ŷ - y| - 0.5 * δ), otherwise
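
The piecewise definition above can be sketched directly in plain Julia. The helper huber_sketch is hypothetical and is not Flux's implementation; it only mirrors the two branches of the formula, with delta playing the role of δ.

```julia
using Statistics: mean

# Hypothetical helper mirroring the piecewise Huber definition.
function huber_sketch(ŷ, y; delta = 1, agg = mean)
    abs_error = abs.(ŷ .- y)
    quadratic = 0.5 .* abs_error .^ 2                 # branch for |ŷ - y| <= δ
    linear = delta .* (abs_error .- 0.5 * delta)      # branch for |ŷ - y| > δ
    return agg(ifelse.(abs_error .<= delta, quadratic, linear))
end

ŷ = [1.1, 2.1, 3.1]
loss_default = huber_sketch(ŷ, 1:3)               # every error is below δ = 1
loss_small_δ = huber_sketch(ŷ, 1:3; delta = 0.05) # every error is above δ = 0.05
```

With every error close to 0.1, the first call uses only the quadratic branch and the second only the linear branch.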

Example

julia> ŷ = [1.1, 2.1, 3.1];
 
@@ -27,7 +27,7 @@
 0.005000000000000009
 
 julia> Flux.huber_loss(ŷ, 1:3, delta=0.05)  # changes behaviour as |ŷ - y| > δ
-0.003750000000000005
source
Flux.Losses.label_smoothingFunction
label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)

Returns smoothed labels, meaning the confidence in the label values is relaxed.

When y is given as a one-hot vector or a batch of one-hot vectors, it is calculated as

y .* (1 - α) .+ α / size(y, dims)

When y is given as a number or a batch of numbers for binary classification, it is calculated as

y .* (1 - α) .+ α / 2

in which case the labels are squeezed towards 0.5.

α is a number in the interval (0, 1) called the smoothing factor. The higher the value of α, the larger the smoothing of y.

dims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.

Example

julia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)
+0.003750000000000005
source
Flux.Losses.label_smoothingFunction
label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)

Returns smoothed labels, meaning the confidence in the label values is relaxed.

When y is given as a one-hot vector or a batch of one-hot vectors, it is calculated as

y .* (1 - α) .+ α / size(y, dims)

When y is given as a number or a batch of numbers for binary classification, it is calculated as

y .* (1 - α) .+ α / 2

in which case the labels are squeezed towards 0.5.

α is a number in the interval (0, 1) called the smoothing factor. The higher the value of α, the larger the smoothing of y.

dims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.
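
The two formulas above can be tried out in plain Julia. The helper names smooth_onehot_sketch and smooth_binary_sketch are hypothetical and only mirror the documented expressions; they are not part of Flux.

```julia
# Hypothetical helpers mirroring the two documented formulas.
smooth_onehot_sketch(y, α; dims = 1) = y .* (1 - α) .+ α / size(y, dims)
smooth_binary_sketch(y, α) = y .* (1 - α) .+ α / 2

# A batch of six one-hot columns over two classes.
y = [0 0 0 1 0 1
     1 1 1 0 1 0]

y_smoothed = smooth_onehot_sketch(y, 0.2)
```

With α = 0.2 and two classes, each 1 becomes roughly 0.9 and each 0 becomes 0.1, so every column still sums to one.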

Example

julia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)
 2×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  ⋅  ⋅  ⋅  1  ⋅  1
  1  1  1  ⋅  1  ⋅
@@ -51,7 +51,7 @@
 true
 
 julia> Flux.crossentropy(y_dis, y) > Flux.crossentropy(y_dis, y_smoothed)
-true
source
Flux.Losses.crossentropyFunction
crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)

Return the cross entropy between the given probability distributions; calculated as

agg(-sum(y .* log.(ŷ .+ ϵ); dims))

Cross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. The prediction is supposed to sum to one across dims, as would be the case with the output of a softmax operation.

For numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy.

Use label_smoothing to smooth the true labels as preprocessing before computing the loss.

See also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.

Example

julia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)
+true
source
Flux.Losses.crossentropyFunction
crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)

Return the cross entropy between the given probability distributions; calculated as

agg(-sum(y .* log.(ŷ .+ ϵ); dims))

Cross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. The prediction is supposed to sum to one across dims, as would be the case with the output of a softmax operation.

For numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy.

Use label_smoothing to smooth the true labels as preprocessing before computing the loss.

See also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.

Example

julia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)
 3×5 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  1  ⋅  ⋅  ⋅  1
  ⋅  1  ⋅  1  ⋅
@@ -80,7 +80,7 @@
  0.05  0.05  0.9   0.05  0.05
 
 julia> Flux.crossentropy(y_model, y_smooth)
-1.5776052f0
source
Flux.Losses.logitcrossentropyFunction
logitcrossentropy(ŷ, y; dims = 1, agg = mean)

Return the cross entropy calculated by

agg(-sum(y .* logsoftmax(ŷ; dims); dims))

This is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.

See also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.

Example

julia> y_label = Flux.onehotbatch(collect("abcabaa"), 'a':'c')
+1.5776052f0
source
Flux.Losses.logitcrossentropyFunction
logitcrossentropy(ŷ, y; dims = 1, agg = mean)

Return the cross entropy calculated by

agg(-sum(y .* logsoftmax(ŷ; dims); dims))

This is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.

See also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.
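
To make the stated equivalence concrete, here is a small sketch in plain Julia that evaluates both expressions on the same logits. All functions with the _sketch suffix are hypothetical stand-ins written for this illustration; they are not the Flux or NNlib implementations.

```julia
using Statistics: mean

# Numerically stabilised softmax and log-softmax along `dims`.
function softmax_sketch(x; dims = 1)
    e = exp.(x .- maximum(x; dims))
    return e ./ sum(e; dims)
end

function logsoftmax_sketch(x; dims = 1)
    shifted = x .- maximum(x; dims)
    return shifted .- log.(sum(exp.(shifted); dims))
end

# The two documented formulas, with agg fixed to mean.
crossentropy_sketch(ŷ, y; dims = 1, ϵ = eps(eltype(ŷ))) =
    mean(-sum(y .* log.(ŷ .+ ϵ); dims))

logitcrossentropy_sketch(ŷ, y; dims = 1) =
    mean(-sum(y .* logsoftmax_sketch(ŷ; dims); dims))

logits = Float32[2 0; 1 3; 0 1]   # 3 classes, batch of 2
labels = Bool[1 0; 0 1; 0 0]      # one-hot columns

from_logits = logitcrossentropy_sketch(logits, labels)
from_probs = crossentropy_sketch(softmax_sketch(logits), labels)
```

The two results agree up to floating-point error. The logit version avoids taking the logarithm of a probability that may have underflowed to zero.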

Example

julia> y_label = Flux.onehotbatch(collect("abcabaa"), 'a':'c')
 3×7 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  1  ⋅  ⋅  1  ⋅  1  1
  ⋅  1  ⋅  ⋅  1  ⋅  ⋅
@@ -96,7 +96,7 @@
 1.5791205f0
 
 julia> Flux.crossentropy(softmax(y_model), y_label)
-1.5791197f0
source
Flux.Losses.binarycrossentropyFunction
binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the binary cross-entropy loss, computed as

agg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))

Typically, the prediction ŷ is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recommended over binarycrossentropy for numerical stability.

Use label_smoothing to smooth the y value as preprocessing before computing the loss.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1]
+1.5791197f0
source
Flux.Losses.binarycrossentropyFunction
binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the binary cross-entropy loss, computed as

agg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))

Typically, the prediction ŷ is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recommended over binarycrossentropy for numerical stability.

Use label_smoothing to smooth the y value as preprocessing before computing the loss.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1]
 3-element Vector{Bool}:
  1
  0
@@ -119,7 +119,7 @@
  1  ⋅  1
 
 julia> Flux.crossentropy(y_prob, y_hot)
-0.43989f0
source
Flux.Losses.logitbinarycrossentropyFunction
logitbinarycrossentropy(ŷ, y; agg = mean)

Mathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1];
+0.43989f0
source
Flux.Losses.logitbinarycrossentropyFunction
logitbinarycrossentropy(ŷ, y; agg = mean)

Mathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1];
 
 julia> y_model = Float32[2, -1, pi]
 3-element Vector{Float32}:
@@ -131,7 +131,7 @@
 0.160832f0
 
 julia> Flux.binarycrossentropy(sigmoid.(y_model), y_bin)
-0.16083185f0
source
Flux.Losses.kldivergenceFunction
kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the Kullback-Leibler divergence between the given probability distributions.

The KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.

Example

julia> p1 = [1 0; 0 1]
+0.16083185f0
source
Flux.Losses.kldivergenceFunction
kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the Kullback-Leibler divergence between the given probability distributions.

The KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.

Example

julia> p1 = [1 0; 0 1]
 2×2 Matrix{Int64}:
  1  0
  0  1
@@ -151,10 +151,10 @@
 0.0
 
 julia> Flux.kldivergence(p1, p2; eps = 0)  # about 17.3 with the regulator
-Inf
source
Flux.Losses.poisson_lossFunction
poisson_loss(ŷ, y; agg = mean)

Return how much the predicted distribution ŷ diverges from the expected Poisson distribution y; calculated as

sum(ŷ .- y .* log.(ŷ)) / size(y, 2)

More information.

Example

julia> y_model = [1, 3, 3];  # data should only take integral values
+Inf
source
Flux.Losses.poisson_lossFunction
poisson_loss(ŷ, y; agg = mean)

Return how much the predicted distribution ŷ diverges from the expected Poisson distribution y; calculated as

sum(ŷ .- y .* log.(ŷ)) / size(y, 2)

More information.

Example

julia> y_model = [1, 3, 3];  # data should only take integral values
 
 julia> Flux.poisson_loss(y_model, 1:3)
-0.5023128522198171
source
Flux.Losses.hinge_lossFunction
hinge_loss(ŷ, y; agg = mean)

Return the hinge_loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as

sum(max.(0, 1 .- ŷ .* y)) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: squared_hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
+0.5023128522198171
source
Flux.Losses.hinge_lossFunction
hinge_loss(ŷ, y; agg = mean)

Return the hinge_loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as

sum(max.(0, 1 .- ŷ .* y)) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: squared_hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
 
 julia> y_pred = [0.1, 0.3, 1, 1.5];
 
@@ -168,7 +168,7 @@
 true
 
 julia> Flux.hinge_loss(y_pred[2], y_true[2]) != 0 # opposite signs
-true
source
Flux.Losses.squared_hinge_lossFunction
squared_hinge_loss(ŷ, y)

Return the squared hinge loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as

sum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
+true
source
Flux.Losses.squared_hinge_lossFunction
squared_hinge_loss(ŷ, y)

Return the squared hinge loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as

sum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
 
 julia> y_pred = [0.1, 0.3, 1, 1.5];
 
@@ -182,13 +182,13 @@
 true
 
 julia> Flux.squared_hinge_loss(y_pred[2], y_true[2]) != 0
-true
source
Flux.Losses.dice_coeff_lossFunction
dice_coeff_loss(ŷ, y; smooth = 1)

Return a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. Loss calculated as:

1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)

Example

julia> y_pred = [1.1, 2.1, 3.1];
+true
source
Flux.Losses.dice_coeff_lossFunction
dice_coeff_loss(ŷ, y; smooth = 1)

Return a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. Loss calculated as:

1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)

Example

julia> y_pred = [1.1, 2.1, 3.1];
 
 julia> Flux.dice_coeff_loss(y_pred, 1:3)
 0.000992391663909964
 
 julia> 1 - Flux.dice_coeff_loss(y_pred, 1:3)  # ~ F1 score for image segmentation
-0.99900760833609
source
Flux.Losses.tversky_lossFunction
tversky_loss(ŷ, y; beta = 0.7)

Return the Tversky loss. Used with imbalanced data to give more weight to false negatives. Larger values of β == beta weigh recall more than precision (by placing more emphasis on false negatives). Calculated as:

1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)
source
Flux.Losses.binary_focal_lossFunction
binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the binary_focal_loss. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

For gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.

See also: Losses.focal_loss for the multi-class setting.

Example

julia> y = [0  1  0
+0.99900760833609
source
Flux.Losses.tversky_lossFunction
tversky_loss(ŷ, y; beta = 0.7)

Return the Tversky loss. Used with imbalanced data to give more weight to false negatives. Larger values of β == beta weigh recall more than precision (by placing more emphasis on false negatives). Calculated as:

1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)
source
Flux.Losses.binary_focal_lossFunction
binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the binary_focal_loss. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

For gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.

See also: Losses.focal_loss for multi-class setting

Example

julia> y = [0  1  0
             1  0  1]
 2×3 Matrix{Int64}:
  0  1  0
@@ -201,7 +201,7 @@
  0.731059  0.5  0.731059
 
 julia> Flux.binary_focal_loss(ŷ, y) ≈ 0.0728675615927385
-true
source
Flux.Losses.focal_lossFunction
focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

The modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.

Example

julia> y = [1  0  0  0  1
+true
source
Flux.Losses.focal_lossFunction
focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

The modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.

Example

julia> y = [1  0  0  0  1
             0  1  0  1  0
             0  0  1  0  0]
 3×5 Matrix{Int64}:
@@ -216,10 +216,10 @@
  0.665241   0.665241   0.665241   0.665241   0.665241
 
 julia> Flux.focal_loss(ŷ, y) ≈ 1.1277571935622628
-true

See also: Losses.binary_focal_loss for binary (not one-hot) labels

source
Flux.Losses.siamese_contrastive_lossFunction
siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)

Return the contrastive loss which can be useful for training Siamese Networks. It is given by

agg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)

Specify margin to set the baseline for distance at which pairs are dissimilar.

Example

julia> ŷ = [0.5, 1.5, 2.5];
+true

See also: Losses.binary_focal_loss for binary (not one-hot) labels

source
Flux.Losses.siamese_contrastive_lossFunction
siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)

Return the contrastive loss which can be useful for training Siamese Networks. It is given by

agg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)

Specify margin to set the baseline for distance at which pairs are dissimilar.

Example

julia> ŷ = [0.5, 1.5, 2.5];
 
 julia> Flux.siamese_contrastive_loss(ŷ, 1:3)
 -4.833333333333333
 
 julia> Flux.siamese_contrastive_loss(ŷ, 1:3, margin = 2)
--4.0
source
+-4.0source
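
The formula above can be evaluated directly in plain Julia. The sketch below is only an illustration (`contrastive_sketch` is a hypothetical name, not the Flux function). It reproduces the two example values, and also shows a call with labels restricted to 0 and 1, an assumed convention under which every term is non-negative.

```julia
using Statistics: mean

# Hypothetical helper: a direct transcription of the documented formula.
contrastive_sketch(ŷ, y; margin = 1, agg = mean) =
    agg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)

ŷ = [0.5, 1.5, 2.5]

c_default = contrastive_sketch(ŷ, 1:3)               # labels 1:3, as in the example
c_margin2 = contrastive_sketch(ŷ, 1:3; margin = 2)
c_binary  = contrastive_sketch(ŷ, [1, 0, 0])         # assumed 0/1 labels

println((c_default, c_margin2, c_binary))
```

With labels 1:3 the factor (1 - y) is zero or negative, which is why the example values are negative.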
diff --git a/previews/PR2535/reference/models/nnlib/index.html b/previews/PR2535/reference/models/nnlib/index.html index 483f4f8986..1c8a6ea52a 100644 --- a/previews/PR2535/reference/models/nnlib/index.html +++ b/previews/PR2535/reference/models/nnlib/index.html @@ -375,4 +375,4 @@ [:, :, 1, 1] = 1.0 3.0 1.5 3.5 - 2.0 4.0source
NNlib.∇grid_sampleFunction
∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T

Arguments

  • Δ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).
  • input: Input from primal computation in (W_in, H_in, C, N) shape.
  • grid: Grid from primal computation in (2, W_out, H_out, N) shape.
  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. Default is :zeros.

Returns

dinput (same shape as input) and dgrid (same shape as grid) gradients.

source

Losses

NNlib.ctc_lossFunction
ctc_loss(ŷ, y)

Computes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.

source

Miscellaneous

NNlib.logsumexpFunction
logsumexp(x; dims = :)

Computes log.(sum(exp.(x); dims)) in a numerically stable way. Without dims keyword this returns a scalar.

See also logsoftmax.

source
NNlib.gluFunction
glu(x, dim = 1)

The gated linear unit from the "Language Modeling with Gated Convolutional Networks" paper.

Calculates a .* sigmoid(b), where x is split in half along given dimension dim to form a and b.

source
+ 2.0 4.0source
NNlib.∇grid_sampleFunction
∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T

Arguments

  • Δ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).
  • input: Input from primal computation in (W_in, H_in, C, N) shape.
  • grid: Grid from primal computation in (2, W_out, H_out, N) shape.
  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. Default is :zeros.

Returns

dinput (same shape as input) and dgrid (same shape as grid) gradients.

source

Losses

NNlib.ctc_lossFunction
ctc_loss(ŷ, y)

Computes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.

source

Miscellaneous

NNlib.logsumexpFunction
logsumexp(x; dims = :)

Computes log.(sum(exp.(x); dims)) in a numerically stable way. Without dims keyword this returns a scalar.

See also logsoftmax.

source
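
One common way to obtain the numerical stability mentioned above is to subtract the maximum before exponentiating. The sketch below assumes that approach; it is not claimed to be NNlib's implementation, `logsumexp_sketch` is a hypothetical name, and it does not handle infinite inputs.

```julia
# Hypothetical helper: the max-shift trick for a stable log-sum-exp.
function logsumexp_sketch(x; dims = :)
    m = maximum(x; dims = dims)                      # largest entry along dims
    return m .+ log.(sum(exp.(x .- m); dims = dims)) # exponents are now <= 0
end

big = [1000.0, 1000.0]
naive  = log(sum(exp.(big)))        # overflows, because exp(1000.0) is Inf
stable = logsumexp_sketch(big)      # 1000 + log(2)
per_column = logsumexp_sketch(zeros(2, 3); dims = 1)

println((naive, stable, size(per_column)))
```

Without dims the sketch returns a scalar, and with dims = 1 it keeps a singleton first dimension, matching the shape behaviour described above.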
NNlib.gluFunction
glu(x, dim = 1)

The gated linear unit from the "Language Modeling with Gated Convolutional Networks" paper.

Calculates a .* sigmoid(b), where x is split in half along given dimension dim to form a and b.

source
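
The split-and-gate computation described above can be sketched in plain Julia as follows. The names `sigmoid_sketch` and `glu_sketch` are hypothetical, and the evenness check is an assumption of this sketch about how an odd-sized dimension should be handled.

```julia
# Hypothetical helpers illustrating a .* sigmoid(b) with x split along `dim`.
sigmoid_sketch(x) = 1 / (1 + exp(-x))

function glu_sketch(x, dim = 1)
    n = size(x, dim)
    iseven(n) || throw(ArgumentError("size along dim must be even"))
    half = n ÷ 2
    a = selectdim(x, dim, 1:half)          # first half: values
    b = selectdim(x, dim, (half + 1):n)    # second half: gates
    return a .* sigmoid_sketch.(b)
end

gated = glu_sketch([1.0, 2.0, 0.0, 0.0])   # gates are sigmoid(0) = 0.5
println(gated)
```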
diff --git a/previews/PR2535/reference/outputsize/index.html b/previews/PR2535/reference/outputsize/index.html index 01e9002029..a69a0ffdf6 100644 --- a/previews/PR2535/reference/outputsize/index.html +++ b/previews/PR2535/reference/outputsize/index.html @@ -68,7 +68,7 @@ # plus 2 non-trainable, 10 parameters, summarysize 10.469 KiB. julia> outputsize(ans, (28, 28, 1, 32)) -(10, 32)

Limitations:

source
Flux.outputsizeFunction
outputsize(m, x_size, y_size, ...; padbatch=false)

For model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given size_x = size(x), etc.

Examples

julia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);
+(10, 32)

Limitations:

  • While @autosize (5, 32) Flux.Bilinear(_ => 7) is OK, something like Bilinear((_, _) => 7) will fail.
  • While Scale(_) and LayerNorm(_) are fine (and use the first dimension), Scale(_,_) and LayerNorm(_,_) will fail if size(x,1) != size(x,2).
source
Flux.outputsizeFunction
outputsize(m, x_size, y_size, ...; padbatch=false)

For model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given size_x = size(x), etc.

Examples

julia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);
 
 julia> par = Parallel(vcat, Dense(5 => 9), Dense(7 => 11));
 
@@ -81,4 +81,4 @@
 (13, 1)
 
 julia> par(x, y) == par((x, y)) == Chain(par, identity)((x, y))
-true

Notice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.

source
+true

Notice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.

source
diff --git a/previews/PR2535/reference/training/callbacks/index.html b/previews/PR2535/reference/training/callbacks/index.html index 283e5d9797..8259c484ab 100644 --- a/previews/PR2535/reference/training/callbacks/index.html +++ b/previews/PR2535/reference/training/callbacks/index.html @@ -10,7 +10,7 @@ sleep(1) end Flux -Fluxsource

Patience Helpers

Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.

For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.

# create a pseudo-loss that decreases for 4 calls, then starts increasing
+Flux
source

Patience Helpers

Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.

For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.

# create a pseudo-loss that decreases for 4 calls, then starts increasing
 # we call this like loss()
 loss = let t = 0
   () -> begin
@@ -61,7 +61,7 @@
        end
 [ Info: Epoch 1
 [ Info: Epoch 2
-[ Info: Epoch 3
source
Flux.early_stoppingFunction
early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)

Return a function that internally counts by one when distance(best_score, f(...)) <= min_dist, where best_score is the last seen best value of f(...). If the count is greater than or equal to delay, the function returns true, otherwise it returns false. The count is reset when distance(best_score, f(...)) > min_dist.

Examples

julia> loss = let l = 0
+[ Info: Epoch 3
source
Flux.early_stoppingFunction
early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)

Return a function that internally counts by one when distance(best_score, f(...)) <= min_dist, where best_score is the last seen best value of f(...). If the count is greater than or equal to delay, the function returns true, otherwise it returns false. The count is reset when distance(best_score, f(...)) > min_dist.
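
The counting rule described above can be sketched as a closure in plain Julia. This is an illustration written from the description, not Flux's source: `early_stopping_sketch` is a hypothetical name, and updating best_score only when the score improves is an assumption of the sketch.

```julia
# Hypothetical helper following the documented rule: count calls that fail to
# improve on best_score by more than min_dist, and trigger after `delay` of them.
function early_stopping_sketch(f, delay; distance = -, init_score = 0, min_dist = 0)
    best_score = init_score
    count = 0
    return function (args...)
        score = f(args...)
        if distance(best_score, score) <= min_dist
            count += 1            # no sufficient improvement
        else
            count = 0             # improvement: reset and remember the new best
            best_score = score
        end
        return count >= delay
    end
end

# pseudo-loss that returns increasing values 1, 2, 3, ...
rising_loss = let l = 0
    () -> l += 1
end

es = early_stopping_sketch(rising_loss, 3)
triggered = [es() for _ in 1:3]
println(triggered)
```

Because the pseudo-loss never improves on init_score = 0, the count reaches the delay of 3 on the third call.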

Examples

julia> loss = let l = 0
          () -> l += 1
        end; # pseudo loss function that returns increasing values
 
@@ -74,7 +74,7 @@
        end
 [ Info: Epoch 1
 [ Info: Epoch 2
-[ Info: Epoch 3
source
Flux.plateauFunction
plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)

Return a function that internally counts by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the last value of f(...). If the count is greater than or equal to width, the function returns true, otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.

Examples

julia> f = let v = 10
+[ Info: Epoch 3
source
Flux.plateauFunction
plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)

Return a function that internally counts by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the last value of f(...). If the count is greater than or equal to width, the function returns true, otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.

Examples

julia> f = let v = 10
          () -> v = v / abs(v) - v
        end; # -9, 8, -7, 6, ...
 
@@ -88,4 +88,4 @@
 [ Info: Epoch 1
 [ Info: Epoch 2
 [ Info: Epoch 3
-[ Info: Epoch 4
source
+[ Info: Epoch 4source
diff --git a/previews/PR2535/reference/training/enzyme/index.html b/previews/PR2535/reference/training/enzyme/index.html index 160b457632..a74b16c94f 100644 --- a/previews/PR2535/reference/training/enzyme/index.html +++ b/previews/PR2535/reference/training/enzyme/index.html @@ -67,7 +67,7 @@ julia> Flux.gradient(dup_model, [1]; zero=false) do m, x # implict Const([1]), and grad accumulation sum(abs2, m(x)) end -((layers = ((weight = [12.0;;], bias = [12.0], σ = nothing),),), nothing)source
Flux.withgradientMethod
withgradient(f, args::Union{Const,Duplicated}...)

This should return the same answer as withgradient(f, model, args...), but it uses Enzyme.jl instead of Zygote.jl to compute the derivative.

Only available when Enzyme is loaded!

Experimental

Enzyme support like this is new and somewhat experimental. This method was added in Flux 0.15.

Example

julia> using Flux, Enzyme
+((layers = ((weight = [12.0;;], bias = [12.0], σ = nothing),),), nothing)
source
Flux.withgradientMethod
withgradient(f, args::Union{Const,Duplicated}...)

This should return the same answer as withgradient(f, model, args...), but it uses Enzyme.jl instead of Zygote.jl to compute the derivative.

Only available when Enzyme is loaded!

Experimental

Enzyme support like this is new and somewhat experimental. This method was added in Flux 0.15.

Example

julia> using Flux, Enzyme
 
 julia> model = Chain(Embedding([1.1 2.2 3.3]), Dense([4.4;;]), only);
 
@@ -82,4 +82,4 @@
 (val = (14.52, "aux"), grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))
 
 julia> Flux.withgradient(m -> (loss=m(3), aux=round.(m.(1:3); digits=3)), Duplicated(model))
-(val = (loss = 14.52, aux = [4.84, 9.68, 14.52]), grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))
source
Flux.Train.train!Method
train!(loss, Duplicated(model), data, opt_state)

This method uses Enzyme.jl instead of Zygote.jl to compute the gradients, but is otherwise the same as train!(loss, model, data, opt_state).

Only available when Enzyme is loaded.

New

This method was added in Flux 0.13.9.

source

Enzyme.jl has its own extensive documentation.

+(val = (loss = 14.52, aux = [4.84, 9.68, 14.52]), grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))source
Flux.Train.train!Method
train!(loss, Duplicated(model), data, opt_state)

This method uses Enzyme.jl instead of Zygote.jl to compute the gradients, but is otherwise the same as train!(loss, model, data, opt_state).

Only available when Enzyme is loaded.

New

This method was added in Flux 0.13.9.

source

Enzyme.jl has its own extensive documentation.

diff --git a/previews/PR2535/reference/training/optimisers/index.html b/previews/PR2535/reference/training/optimisers/index.html index e7188365cc..7ced8cba05 100644 --- a/previews/PR2535/reference/training/optimisers/index.html +++ b/previews/PR2535/reference/training/optimisers/index.html @@ -54,4 +54,4 @@ end

ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.

Decays

Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.

Optimisers.SignDecayType
SignDecay(λ = 1e-3)
 SignDecay(; [lambda])

Implements $L_1$ regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.

See also WeightDecay for $L_2$ regularisation. They can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source
Optimisers.WeightDecayType
WeightDecay(λ = 5e-4)
 WeightDecay(; [lambda])

Implements $L_2$ regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.

See also SignDecay for $L_1$ regularisation.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source
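
The equivalences stated for SignDecay and WeightDecay can be checked numerically. The sketch below uses a central finite difference in plain Julia; `fd_grad`, `l1_penalty` and `l2_penalty` are hypothetical helpers introduced only for this check, and the values of λ and x are arbitrary.

```julia
λ = 0.034
x = [1.0, -2.0, 3.0]

l1_penalty(x) = λ * sum(abs, x)         # penalty matched by SignDecay(λ)
l2_penalty(x) = λ / 2 * sum(abs2, x)    # penalty matched by WeightDecay(λ)

# Hypothetical helper: central finite-difference gradient of a scalar function.
function fd_grad(f, x; h = 1e-6)
    g = map(eachindex(x)) do i
        e = zeros(length(x))
        e[i] = h
        (f(x .+ e) - f(x .- e)) / (2h)
    end
    return g
end

println(fd_grad(l1_penalty, x))   # compare with λ .* sign.(x)
println(fd_grad(l2_penalty, x))   # compare with λ .* x
```

The factor of 1/2 in the $L_2$ penalty is what turns WeightDecay(0.034) into 0.017 * norm(x, 2)^2 in the combined example above.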

Gradient Clipping

Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. An example usage is

opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))
Optimisers.ClipGradType
ClipGrad(δ = 10)
-ClipGrad(; [delta])

Restricts every gradient component to obey -δ ≤ dx[i] ≤ δ.

Typically composed with other rules using OptimiserChain.

See also ClipNorm.

source
Optimisers.ClipNormType
ClipNorm(ω = 10, p = 2; throw = true)

Scales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).

Throws an error if the norm is infinite or NaN, which you can turn off with throw = false.

Typically composed with other rules using OptimiserChain.

See also ClipGrad.

source
+ClipGrad(; [delta])

Restricts every gradient component to obey -δ ≤ dx[i] ≤ δ.

Typically composed with other rules using OptimiserChain.

See also ClipNorm.

source
Optimisers.ClipNormType
ClipNorm(ω = 10, p = 2; throw = true)

Scales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).

Throws an error if the norm is infinite or NaN, which you can turn off with throw = false.

Typically composed with other rules using OptimiserChain.

See also ClipGrad.

source
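
The two clipping rules differ in what they preserve: clipping by value clamps each component independently, while clipping by norm rescales the whole array and keeps its direction. A minimal plain-Julia sketch of both is below; the function names are hypothetical, and the sketch omits the p == 0 case and the error on infinite or NaN norms.

```julia
using LinearAlgebra: norm

# Hypothetical helper: restrict every component to [-δ, δ].
clip_grad_sketch(dx, δ = 10) = clamp.(dx, -δ, δ)

# Hypothetical helper: rescale dx so that norm(dx, p) does not exceed ω.
function clip_norm_sketch(dx, ω = 10, p = 2)
    n = norm(dx, p)
    return n > ω ? dx .* (ω / n) : dx
end

by_value = clip_grad_sketch([-20.0, 3.0, 15.0])
by_norm  = clip_norm_sketch([30.0, 40.0])     # norm 50 is scaled down to 10

println((by_value, by_norm))
```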
diff --git a/previews/PR2535/reference/training/reference/index.html b/previews/PR2535/reference/training/reference/index.html index 8cf061b872..419643cf50 100644 --- a/previews/PR2535/reference/training/reference/index.html +++ b/previews/PR2535/reference/training/reference/index.html @@ -19,14 +19,14 @@ 10.19 julia> opt_state # mutated by Flux.train! -(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())source
opt_state = setup(rule, model::Duplicated) = setup(rule, model.val)

Special method for use with Enzyme.jl, ignores the stored gradient.

source
Flux.Train.train!Method
train!(loss, model, data, opt_state)

Uses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. Iterates through data once, evaluating for each d in data either loss(model, d...) if d isa Tuple, or else loss(model, d) for other d.

If model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.

For example, with these definitions...

data = [(x1, y1), (x2, y2), (x3, y3)]
+(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())
source
opt_state = setup(rule, model::Duplicated) = setup(rule, model.val)

Special method for use with Enzyme.jl, ignores the stored gradient.

source
Flux.Train.train!Method
train!(loss, model, data, opt_state)

Uses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. Iterates through data once, evaluating for each d in data either loss(model, d...) if d isa Tuple, or else loss(model, d) for other d.

If model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.

For example, with these definitions...

data = [(x1, y1), (x2, y2), (x3, y3)]
 
 loss3(m, x, y) = norm(m(x) .- y)        # the model is the first argument
 
 opt_state = Flux.setup(Adam(), model)   # explicit setup of optimiser momenta

...calling Flux.train!(loss3, model, data, opt_state) runs a loop much like this:

for d in data
     ∂L∂m = gradient(loss3, model, d...)[1]
     update!(opt_state, model, ∂L∂m)
-end

You can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:

  • Stop with a DomainError if the loss is infinite or NaN at any point.

  • Show a progress bar using @withprogress.

New

This method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:

  • It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's "implicit" parameter handling, with Grads.)
  • Instead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.
  • opt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.
  • Callback functions are not supported. (But any code can be included in the above for loop.)
source
Optimisers.updateFunction
Optimisers.update(tree, model, gradient) -> (tree, model)

Uses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.

See also update!, which will be faster for models of ordinary Arrays or CuArrays.

Example

julia> m = (x = Float32[1,2,3], y = tanh);
+end

You can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:

  • Stop with a DomainError if the loss is infinite or NaN at any point.

  • Show a progress bar using @withprogress.

New

This method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:

  • It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's "implicit" parameter handling, with Grads.)
  • Instead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.
  • opt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.
  • Callback functions are not supported. (But any code can be included in the above for loop.)
source
Optimisers.updateFunction
Optimisers.update(tree, model, gradient) -> (tree, model)

Uses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.

See also update!, which will be faster for models of ordinary Arrays or CuArrays.

Example

julia> m = (x = Float32[1,2,3], y = tanh);
 
 julia> t = Optimisers.setup(Descent(0.1), m)
 (x = Leaf(Descent(0.1), nothing), y = ())
@@ -115,4 +115,4 @@
 julia> Optimisers.thaw!(s)
 
 julia> s.x
-(Leaf(Momentum(0.01, 0.9), [0.0]), ())
source
Optimisers.thaw!Function
Optimisers.thaw!(tree)

The reverse of freeze!. Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).

source
+(Leaf(Momentum(0.01, 0.9), [0.0]), ())source
Optimisers.thaw!Function
Optimisers.thaw!(tree)

The reverse of freeze!. Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).

source
diff --git a/previews/PR2535/reference/training/zygote/index.html b/previews/PR2535/reference/training/zygote/index.html index eadc07fbd1..ba58a13d2b 100644 --- a/previews/PR2535/reference/training/zygote/index.html +++ b/previews/PR2535/reference/training/zygote/index.html @@ -203,4 +203,4 @@ # this definition of map is for any AD that only defines a reverse mode. # It is not as good as the rrule that can be used if the AD defines a forward-mode as well. -rrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, typeof(map), ::Vector) = ...

For more details see rule configurations and calling back into AD.

source
ChainRulesCore.TangentType
Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent

This type represents the tangent for a struct/NamedTuple, or Tuple. P is the corresponding primal type that this is a tangent for.

Tangent{P} should have fields (technically properties), that match to a subset of the fields of the primal type; and each should be a tangent type matching to the primal type of that field. Fields of the P that are not present in the Tangent are treated as Zero.

T is an implementation detail representing the backing data structure. For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by user.

For Tangents of Tuples, iterate and getindex are overloaded to behave similarly to a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explicitly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal, the canonicalize function is provided.

source
ChainRulesCore.canonicalizeFunction
canonicalize(tangent::Tangent{P}) -> Tangent{P}

Return the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explicitly set to ZeroTangent().

source
+rrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, typeof(map), ::Vector) = ...

For more details see rule configurations and calling back into AD.

source
ChainRulesCore.TangentType
Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent

This type represents the tangent for a struct/NamedTuple, or Tuple. P is the corresponding primal type that this is a tangent for.

Tangent{P} should have fields (technically properties), that match to a subset of the fields of the primal type; and each should be a tangent type matching to the primal type of that field. Fields of the P that are not present in the Tangent are treated as Zero.

T is an implementation detail representing the backing data structure. For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by user.

For Tangents of Tuples, iterate and getindex are overloaded to behave similarly to a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explicitly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal, the canonicalize function is provided.

source
ChainRulesCore.canonicalizeFunction
canonicalize(tangent::Tangent{P}) -> Tangent{P}

Return the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explicitly set to ZeroTangent().

source
diff --git a/previews/PR2535/reference/utilities/index.html b/previews/PR2535/reference/utilities/index.html index 75a95835a6..0ae01b5270 100644 --- a/previews/PR2535/reference/utilities/index.html +++ b/previews/PR2535/reference/utilities/index.html @@ -32,7 +32,7 @@ julia> ans.bias 2-element Vector{Float32}: 0.0 - 0.0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.glorot_normalFunction
glorot_normal([rng], size...; gain = 1) -> Array
+ 0.0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.glorot_normalFunction
glorot_normal([rng], size...; gain = 1) -> Array
 glorot_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers drawn from a normal distribution with standard deviation gain * sqrt(2 / (fan_in + fan_out)), using nfan.

This method is described in [1] and also known as Xavier initialization.

Examples

julia> using Statistics
 
 julia> round(std(Flux.glorot_normal(10, 1000)), digits=3)
@@ -48,7 +48,7 @@
 Dense(10 => 1000, tanh)  # 11_000 parameters
 
 julia> round(std(ans.weight), sigdigits=3)
-4.45f0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.kaiming_uniformFunction
kaiming_uniform([rng], size...; gain = √2) -> Array
+4.45f0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
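
The standard deviation formula given for glorot_normal can be illustrated without Flux. The sketch below assumes a 2-D weight of size (n_out, n_in), with fan_out and fan_in taken from those two dimensions; `glorot_normal_sketch` is a hypothetical name and the fixed seed is only there to make the run repeatable.

```julia
using Random, Statistics

# Hypothetical helper: Float32 normal samples with std gain * sqrt(2 / (fan_in + fan_out)).
function glorot_normal_sketch(rng, n_out, n_in; gain = 1)
    σ = Float32(gain * sqrt(2 / (n_in + n_out)))
    return randn(rng, Float32, n_out, n_in) .* σ
end

w_glorot = glorot_normal_sketch(MersenneTwister(0), 10, 1000)
target_std = sqrt(2 / (10 + 1000))

println((size(w_glorot), eltype(w_glorot)))
```

The sample standard deviation of w_glorot should land close to target_std, which is roughly 0.0445 for this shape.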
Flux.kaiming_uniformFunction
kaiming_uniform([rng], size...; gain = √2) -> Array
 kaiming_uniform([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(3/fan_in) using nfan.

This method is described in [1] and also known as He initialization.

Examples

julia> round.(extrema(Flux.kaiming_uniform(100, 10)), digits=3)
 (-0.774f0, 0.773f0)
 
@@ -56,7 +56,7 @@
 (-0.243f0, 0.245f0)
 
 julia> round.(extrema(Flux.kaiming_uniform(100, 100)), digits=3)
-(-0.245f0, 0.245f0)

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.kaiming_normalFunction
kaiming_normal([rng], size...; gain = √2) -> Array
+(-0.245f0, 0.245f0)

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
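
The interval bound x = gain * sqrt(3/fan_in) can likewise be sketched in plain Julia. This assumes a 2-D weight of size (n_out, n_in) with fan_in taken from the second dimension, which is consistent with the extrema shown for kaiming_uniform(100, 10) above; `kaiming_uniform_sketch` is a hypothetical name.

```julia
using Random

# Hypothetical helper: Float32 uniform samples on [-x, x] with x = gain * sqrt(3 / fan_in).
function kaiming_uniform_sketch(rng, n_out, n_in; gain = √2)
    bound = Float32(gain * sqrt(3 / n_in))
    return (rand(rng, Float32, n_out, n_in) .- 0.5f0) .* (2 * bound)
end

w_kaiming = kaiming_uniform_sketch(MersenneTwister(0), 100, 10)
largest = maximum(abs, w_kaiming)   # bounded by sqrt(2) * sqrt(3 / 10)

println(largest)
```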
Flux.kaiming_normalFunction
kaiming_normal([rng], size...; gain = √2) -> Array
 kaiming_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers taken from a normal distribution with standard deviation gain / sqrt(fan_in), using nfan.

This method is described in [1] and also known as He initialization.

Examples

julia> using Statistics
 
 julia> round(std(Flux.kaiming_normal(10, 1000)), digits=3)
@@ -66,7 +66,7 @@
 0.45f0
 
 julia> round(std(Flux.kaiming_normal(1000, 1000)), digits=3)
-0.045f0

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.truncated_normalFunction
truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array
+0.045f0

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.truncated_normalFunction
truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array
 truncated_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size where each element is drawn from a truncated normal distribution. The numbers are distributed like filter(x -> lo<=x<=hi, mean .+ std .* randn(100)).

The values are generated by sampling a Uniform(0, 1) (rand()) and then applying the inverse CDF of the truncated normal distribution. This method works best when lo ≤ mean ≤ hi.
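
The filter expression above describes the target distribution, and it can be turned into a simple rejection sampler. The sketch below is only meant to make that distribution concrete: the function name truncated_normal_rejection is hypothetical, and it deliberately does not use the inverse-CDF approach the docstring describes, so it becomes slow when [lo, hi] lies far out in a tail.

```julia
using Random

# Hypothetical rejection sampler: keep drawing from Normal(mean, std) until
# n values have fallen inside the closed interval [lo, hi].
function truncated_normal_rejection(rng::AbstractRNG, n::Integer;
                                    mean = 0, std = 1, lo = -2, hi = 2)
    out = Float32[]
    while length(out) < n
        x = mean + std * randn(rng)
        lo <= x <= hi && push!(out, Float32(x))
    end
    return out
end

xs = truncated_normal_rejection(Random.default_rng(), 10_000)
println(extrema(xs))   # both values lie within the default bounds -2 and 2
```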

Examples

julia> using Statistics
 
 julia> Flux.truncated_normal(3, 4) |> summary
@@ -76,7 +76,7 @@
 (-2.0f0, 2.0f0)
 
 julia> round(std(Flux.truncated_normal(10^6; lo = -100, hi = 100)))
-1.0f0
source
Flux.lecun_normalFunction
lecun_normal([rng], size...) -> Array
+1.0f0
source
Flux.lecun_normalFunction
lecun_normal([rng], size...) -> Array
 lecun_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers drawn from a truncated normal distribution centered on 0 with stddev sqrt(1 / fan_in), where fan_in is the number of input units in the weight tensor.

Examples

julia> using Statistics
 
 julia> round(std(Flux.lecun_normal(10, 1000)), digits=3)
@@ -92,7 +92,7 @@
 Dense(10 => 1000, selu)  # 11_000 parameters
 
 julia> round(std(ans.weight), digits=3)
-0.313f0

References

[1] Lecun, Yann, et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.

source
Flux.orthogonalFunction
orthogonal([rng], size...; gain = 1) -> Array
+0.313f0

References

[1] Lecun, Yann, et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.

source
Flux.orthogonalFunction
orthogonal([rng], size...; gain = 1) -> Array
 orthogonal([rng]; kw...) -> Function

Return an Array{Float32} of the given size which is a (semi) orthogonal matrix, as described in [1].

Cannot construct a vector, i.e. length(size) == 1 is forbidden. For length(size) > 2, a prod(size[1:(end - 1)]) by size[end] orthogonal matrix is computed before reshaping it to the original dimensions.

Examples

julia> W = Flux.orthogonal(5, 7);
 
 julia> summary(W)
@@ -112,7 +112,7 @@
 julia> W3 = Flux.orthogonal(3, 3, 2, 4);
 
 julia> transpose(reshape(W3, :, 4)) * reshape(W3, :, 4) ≈ I(4)
-true

References

[1] Saxe, McClelland, Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", ICLR 2014, https://arxiv.org/abs/1312.6120

source
Flux.sparse_initFunction
sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array
+true

References

[1] Saxe, McClelland, Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", ICLR 2014, https://arxiv.org/abs/1312.6120

source
Flux.sparse_initFunction
sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array
 sparse_init([rng]; kw...) -> Function

Return a Matrix{Float32} of size rows, cols where each column contains a fixed fraction of zero elements given by sparsity. Non-zero elements are normally distributed with a mean of zero and standard deviation std.

This method is described in [1].
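
To make the "fixed fraction of zeros per column" rule concrete, here is a minimal sketch in plain Julia. The name sparse_init_sketch is hypothetical, and the choice to round the number of zeros up with ceil is an assumption made for illustration rather than something stated in the text above.

```julia
using Random

# Hypothetical sketch: zero a fixed number of entries in every column and fill
# the remaining entries with normal draws scaled by std.
function sparse_init_sketch(rng::AbstractRNG, rows::Integer, cols::Integer;
                            sparsity, std = 0.01)
    num_zeros = ceil(Int, min(1.0, sparsity) * rows)    # assumption: round up
    W = randn(rng, Float32, rows, cols) .* Float32(std)
    for j in 1:cols
        zero_rows = randperm(rng, rows)[1:num_zeros]    # distinct rows to zero in column j
        W[zero_rows, j] .= 0
    end
    return W
end

W = sparse_init_sketch(Random.default_rng(), 10, 3; sparsity = 0.5)
println(count(iszero, W, dims = 1))   # number of zeros in each column
```

Choosing the zeroed rows independently per column keeps the per-column count exact while still randomising where the zeros fall.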

Examples

julia> count(iszero, Flux.sparse_init(10, 10, sparsity=1/5))
 20
 
@@ -125,7 +125,7 @@
 
 julia> count(iszero, ans.weight, dims=1)
 1×3 Matrix{Int64}:
- 5  5  5

References

[1] Martens, J, "Deep learning via Hessian-free optimization" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.

source
Flux.identity_initFunction
identity_init(size...; gain=1, shift=0) -> Array
+ 5  5  5

References

[1] Martens, J, "Deep learning via Hessian-free optimization" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.

source
Flux.identity_initFunction
identity_init(size...; gain=1, shift=0) -> Array
 identity_init(; kw...) -> Function

Return an Array{Float32} of the given size which yields an identity mapping when used as parameters in most Flux layers. Use gain to scale the identity by a constant.

Often useful in the context of transfer learning, i.e. when one wants to add more capacity to a model but start from the same mapping.

It has the following behaviour:

  • 1D: A Vector of zeros (useful for an identity bias)
  • 2D: An identity matrix (useful for an identity matrix multiplication)
  • More than 2D: A dense block array of center tap spatial filters (useful for an identity convolution)

Some caveats:

  • Not all layers will be identity mapping when used with this init. Exceptions include recurrent layers and normalization layers.

  • Layers must have input_size == output_size for identity mapping to be possible. When this is not the case, extra dimensions of the array are padded with zeros.

  • For convolutional layers, in addition to the above, the kernel sizes must also be odd and padding must be applied so that output feature maps have the same size as input feature maps, e.g. by using SamePad.

Use keyword shift (integer or tuple) to apply circular shift to the output, equivalent to Base.circshift(identity_init(size...), shift).

For consistency with other initialisers, it accepts rng::AbstractRNG as an optional first argument. But this is ignored, since the result is not random.
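
The 2-D case, including the zero padding and the shift keyword, can be sketched with the LinearAlgebra standard library alone. The function name identity_matrix_sketch is hypothetical; it only covers matrices and is an illustration under that assumption, not the library's code.

```julia
using LinearAlgebra

# Hypothetical 2-D sketch: a rectangular identity scaled by gain, then
# circularly shifted. Rows or columns beyond the square part stay zero.
function identity_matrix_sketch(rows::Integer, cols::Integer; gain = 1, shift = 0)
    W = Matrix{Float32}(gain * I, rows, cols)
    return circshift(W, shift)
end

W = identity_matrix_sketch(3, 5)
x = Float32[1, 2, 3, 4, 5]
println(W * x)   # keeps the first three entries of x, drops the rest
```

With input_size != output_size the product can only copy as many entries as the smaller dimension allows, which is why the caveat above requires equal sizes for a true identity mapping.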

Examples

julia> Flux.identity_init(3,5)
 3×5 Matrix{Float32}:
  1.0  0.0  0.0  0.0  0.0
@@ -157,7 +157,7 @@
 [:, :, 1, 1] =
  10.0  20.0  30.0
  40.0  50.0  60.0
- 70.0  80.0  90.0
source
Flux.ones32Function
ones32(size...) = ones(Float32, size...)

Return an Array{Float32} of the given size filled with 1s.

source
Flux.zeros32Function
zeros32(size...) = zeros(Float32, size...)

Return an Array{Float32} of the given size filled with 0s.

source
Flux.rand32Function
rand32([rng], size...)

Return an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.

source
Flux.randn32Function
randn32([rng], size...)

Return an Array{Float32} of the given size, filled like randn. When the size is not provided, randn32(rng::AbstractRNG) returns a function.

source
Flux.create_biasFunction
create_bias(weights, bias, size...)

Return a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.

  • bias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.
  • bias == false returns false, which is understood by AD to be non-differentiable.
  • bias::AbstractArray uses the array provided, provided it has the correct size. It will also correct the eltype to match that of weights.
source

These functions call:

Flux.rng_from_arrayFunction
rng_from_array(x)

Create an instance of the RNG most appropriate for x. As an example, if x is a CuArray, it will return a CUDA.default_rng(). If x is an Array instead, it will return a Random.default_rng().

source
Flux.nfanFunction
nfan(n_out, n_in=1) -> Tuple
+ 70.0  80.0  90.0
source
Flux.ones32Function
ones32(size...) = ones(Float32, size...)

Return an Array{Float32} of the given size filled with 1s.

source
Flux.zeros32Function
zeros32(size...) = zeros(Float32, size...)

Return an Array{Float32} of the given size filled with 0s.

source
Flux.rand32Function
rand32([rng], size...)

Return an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.

source
Flux.randn32Function
randn32([rng], size...)

Return an Array{Float32} of the given size, filled like randn. When the size is not provided, randn32(rng::AbstractRNG) returns a function.

source
Flux.create_biasFunction
create_bias(weights, bias, size...)

Return a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.

  • bias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.
  • bias == false returns false, which is understood by AD to be non-differentiable.
  • bias::AbstractArray uses the array provided, provided it has the correct size. It will also correct the eltype to match that of weights.
source

These functions call:

Flux.rng_from_arrayFunction
rng_from_array(x)

Create an instance of the RNG most appropriate for x. As an example, if x is a CuArray, it will return a CUDA.default_rng(). If x is an Array instead, it will return a Random.default_rng().

source
Flux.nfanFunction
nfan(n_out, n_in=1) -> Tuple
 nfan(dims...)
 nfan(dims::Tuple)

For a layer characterized by dimensions dims, return a tuple (fan_in, fan_out), where fan_in is the number of input neurons connected to an output one, and fan_out is the number of output neurons connected to an input one.

This function is mainly used by weight initializers, e.g., kaiming_normal.
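
The fan computation described above can be sketched as follows. The helper name fans is hypothetical, and the sketch assumes the usual weight layouts: (n_out, n_in) for a dense matrix and (kernel..., in_channels, out_channels) for a convolution, with a single dimension treated as an n×1 matrix.

```julia
# Hypothetical helper: return (fan_in, fan_out) for a weight array of the given dims.
function fans(dims::Integer...)
    n = length(dims)
    n == 1 && return (1, dims[1])          # a vector is treated as an n×1 matrix
    n == 2 && return (dims[2], dims[1])    # dense weight stored as (n_out, n_in)
    receptive_field = prod(dims[1:end-2])  # product of the kernel dimensions
    return (receptive_field * dims[end-1], receptive_field * dims[end])
end

println(fans(20, 10))        # dense weight mapping 10 inputs to 20 outputs
println(fans(3, 3, 2, 10))   # 3×3 convolution, 2 input and 10 output channels
```

For the convolution, each output unit sees 3 * 3 * 2 = 18 inputs and each input unit feeds 3 * 3 * 10 = 90 outputs, matching the (18, 90) shown in the docstring example.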

Examples

julia> layer = Dense(10, 20);
 
@@ -167,7 +167,7 @@
 julia> layer = Conv((3, 3), 2=>10);
 
 julia> Flux.nfan(size(layer.weight))
-(18, 90)
source

Changing the type of all parameters

The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):

Flux.f64Function
f64(m)

Converts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.

See also f32 and f16.

source
Flux.f32Function
f32(m)

Converts the eltype of model's floating point parameters to Float32 (which is Flux's default). Recurses into structs marked with @layer.

See also f64 and f16.

source
Flux.f16Function
f16(m)

Converts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.

Support for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.

See also f32 and f64.

Example

julia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10))  # all Float32
+(18, 90)
source

Changing the type of all parameters

The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):

Flux.f64Function
f64(m)

Converts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.

See also f32 and f16.

source
Flux.f32Function
f32(m)

Converts the eltype of model's floating point parameters to Float32 (which is Flux's default). Recurses into structs marked with @layer.

See also f64 and f16.

source
Flux.f16Function
f16(m)

Converts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.

Support for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.

See also f32 and f64.

Example

julia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10))  # all Float32
 Chain(
   Dense(784 => 2048, relu),             # 1_607_680 parameters
   Dense(2048 => 10),                    # 20_490 parameters
@@ -177,4 +177,4 @@
 Chain(
   Dense(784 => 2048, relu),             # 1_607_680 parameters
   Dense(2048 => 10),                    # 20_490 parameters
-)                   # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.
source
+) # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.source
diff --git a/previews/PR2535/search_index.js b/previews/PR2535/search_index.js index 1d5641a6c9..9fd5d8f79f 100644 --- a/previews/PR2535/search_index.js +++ b/previews/PR2535/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"guide/models/quickstart/#man-quickstart","page":"Quick Start","title":"A Neural Network in One Minute","text":"","category":"section"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"If you have used neural networks before, then this simple example might be helpful for seeing how the major parts of Flux work together. Try pasting the code into the REPL prompt.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"If you haven't, then you might prefer the Fitting a Straight Line page.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"# Install everything, including CUDA, and load packages:\nusing Pkg; Pkg.add([\"Flux\", \"CUDA\", \"cuDNN\", \"ProgressMeter\"])\nusing Flux, Statistics, ProgressMeter\nusing CUDA # optional\ndevice = gpu_device() # function to move data and model to the GPU\n\n# Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:\nnoisy = rand(Float32, 2, 1000) # 2×1000 Matrix{Float32}\ntruth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)] # 1000-element Vector{Bool}\n\n# Define our model, a multi-layer perceptron with one hidden layer of size 3:\nmodel = Chain(\n Dense(2 => 3, tanh), # activation function inside layer\n BatchNorm(3),\n Dense(3 => 2)) |> device # move model to GPU, if one is available\n\n# The model encapsulates parameters, randomly initialised. 
Its initial output is:\nout1 = model(noisy |> device) # 2×1000 Matrix{Float32}, or CuArray{Float32}\nprobs1 = softmax(out1) |> cpu # normalise to get probabilities (and move off GPU)\n\n# To train the model, we use batches of 64 samples, and one-hot encoding:\ntarget = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix\nloader = Flux.DataLoader((noisy, target), batchsize=64, shuffle=true);\n\nopt_state = Flux.setup(Flux.Adam(0.01), model) # will store optimiser momentum, etc.\n\n# Training loop, using the whole data set 1000 times:\nlosses = []\n@showprogress for epoch in 1:1_000\n for xy_cpu in loader\n # Unpack batch of data, and move to GPU:\n x, y = xy_cpu |> device\n loss, grads = Flux.withgradient(model) do m\n # Evaluate model and loss inside gradient context:\n y_hat = m(x)\n Flux.logitcrossentropy(y_hat, y)\n end\n Flux.update!(opt_state, model, grads[1])\n push!(losses, loss) # logging, outside gradient context\n end\nend\n\nopt_state # parameters, momenta and output have all changed\n\nout2 = model(noisy |> device) # first row is prob. 
of true, second row p(false)\nprobs2 = softmax(out2) |> cpu # normalise to get probabilities\nmean((probs2[1,:] .> 0.5) .== truth) # accuracy 94% so far!","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"(Image: )","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"using Plots # to draw the above figure\n\np_true = scatter(noisy[1,:], noisy[2,:], zcolor=truth, title=\"True classification\", legend=false)\np_raw = scatter(noisy[1,:], noisy[2,:], zcolor=probs1[1,:], title=\"Untrained network\", label=\"\", clims=(0,1))\np_done = scatter(noisy[1,:], noisy[2,:], zcolor=probs2[1,:], title=\"Trained network\", legend=false)\n\nplot(p_true, p_raw, p_done, layout=(1,3), size=(1000,330))","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Here's the loss during training:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"plot(losses; xaxis=(:log10, \"iteration\"),\n yaxis=\"loss\", label=\"per batch\")\nn = length(loader)\nplot!(n:n:length(losses), mean.(Iterators.partition(losses, n)),\n label=\"epoch mean\", dpi=200)","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"This XOR (\"exclusive or\") problem is a variant of the famous one which drove Minsky and Papert to invent deep neural networks in 1969. For small values of \"deep\" – this has one hidden layer, while earlier perceptrons had none. (What they call a hidden layer, Flux calls the output of the first layer, model[1](noisy).)","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Since then things have developed a little. 
","category":"page"},{"location":"guide/models/quickstart/#Features-to-Note","page":"Quick Start","title":"Features to Note","text":"","category":"section"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Some things to notice in this example are:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"The batch dimension of data is always the last one. Thus a 2×1000 Matrix is a thousand observations, each a column of length 2. Flux defaults to Float32, but most of Julia to Float64.\nThe model can be called like a function, y = model(x). Each layer like Dense is an ordinary struct, which encapsulates some arrays of parameters (and possibly other state, as for BatchNorm).\nBut the model does not contain the loss function, nor the optimisation rule. The momenta needed by Adam are stored in the object returned by setup. And Flux.logitcrossentropy is an ordinary function that combines the softmax and crossentropy functions.\nThe do block creates an anonymous function, as the first argument of gradient. Anything executed within this is differentiated.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Instead of calling gradient and update! separately, there is a convenience function train!. If we didn't want anything extra (like logging the loss), we could replace the training loop with the following:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"for epoch in 1:1_000\n Flux.train!(model, loader |> device, opt_state) do m, x, y\n y_hat = m(x)\n Flux.logitcrossentropy(y_hat, y)\n end\nend","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Notice that the full dataset noisy lives on the CPU, and is moved to the GPU one batch at a time, by xy_cpu |> device. 
This is generally what you want for large datasets. Calling loader |> device similarly modifies the DataLoader to move one batch at a time.\nIn our simple example, we conveniently created the model has a Chain of layers.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"For more complex models, you can define a custom struct MyModel containing layers and arrays and implement the call operator (::MyModel)(x) = ... to define the forward pass. This is all it is needed for Flux to work. Marking the struct with Flux.@layer will add some more functionality, like pretty printing and the ability to mark some internal fields as trainable or not (also see trainable).","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/training/reference/#Training-API-Reference","page":"Training API","title":"Training API Reference","text":"","category":"section"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The new version of Flux's training code was written as an independent package, Optimisers.jl. Only the function train! belongs to Flux itself.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The Optimisers package is designed to allow for immutable objects. But at present all Flux models contain parameter arrays (such as Arrays and CuArrays) which can be updated in-place. Because of this:","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The objects returned by Optimisers.update! can be ignored.\nFlux defines its own version of setup which checks this assumption. 
(Using instead Optimisers.setup will also work, they return the same thing.)","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The available optimization rules are listed the optimisation rules page here. See the Optimisers documentation for details on how the rules work.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"Flux.Train.setup\nFlux.Train.train!(loss, model, data, state)\nOptimisers.update\nOptimisers.update!\nOptimisers.setup","category":"page"},{"location":"reference/training/reference/#Flux.Train.setup","page":"Training API","title":"Flux.Train.setup","text":"opt_state = setup(rule, model)\n\nThis is a version of Optimisers.setup, and is the first step before using train!. It differs from Optimisers.setup in that it:\n\nhas one extra check for mutability (since Flux expects to mutate the model in-place, while Optimisers.jl is designed to return an updated model)\nhas methods which accept Flux's old optimisers, and convert them. 
(The old Flux.Optimise.Adam and new Optimisers.Adam are distinct types.)\n\nExample\n\njulia> model = Dense(2 => 1, leakyrelu; init=ones);\n\njulia> opt_state = Flux.setup(Momentum(0.1), model) # this encodes the optimiser and its state\n(weight = Leaf(Momentum(0.1, 0.9), [0.0 0.0]), bias = Leaf(Momentum(0.1, 0.9), [0.0]), σ = ())\n\njulia> x1, y1 = [0.2, -0.3], [0.4]; # use the same data for two steps:\n\njulia> Flux.train!(model, [(x1, y1), (x1, y1)], opt_state) do m, x, y\n sum(abs.(m(x) .- y)) * 100\n end\n\njulia> model.bias # was zero, mutated by Flux.train!\n1-element Vector{Float64}:\n 10.19\n\njulia> opt_state # mutated by Flux.train!\n(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())\n\n\n\n\n\nopt_state = setup(rule, model::Duplicated) = setup(rule, model.val)\n\nSpecial method for use with Enzyme.jl, ignores the stored gradient.\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Flux.Train.train!-NTuple{4, Any}","page":"Training API","title":"Flux.Train.train!","text":"train!(loss, model, data, opt_state)\n\nUses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. Iterates through data once, evaluating for each d in data either loss(model, d...) 
if d isa Tuple, or else loss(model, d) for other d.\n\nIf model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.\n\nFor example, with these definitions...\n\ndata = [(x1, y1), (x2, y2), (x3, y3)]\n\nloss3(m, x, y) = norm(m(x) .- y) # the model is the first argument\n\nopt_state = Flux.setup(Adam(), model) # explicit setup of optimiser momenta\n\n...calling Flux.train!(loss3, model, data, opt_state) runs a loop much like this:\n\nfor d in data\n ∂L∂m = gradient(loss3, model, d...)[1]\n update!(opt_state, model, ∂L∂m)\nend\n\nYou can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:\n\nStop with a DomainError if the loss is infinite or NaN at any point.\nShow a progress bar using @withprogress.\n\ncompat: New\nThis method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's \"implicit\" parameter handling, with Grads.)\nInstead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.\nopt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.\nCallback functions are not supported. (But any code can be included in the above for loop.)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/reference/#Optimisers.update","page":"Training API","title":"Optimisers.update","text":"Optimisers.update(tree, model, gradient) -> (tree, model)\n\nUses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. 
The initial tree of states comes from setup.\n\nSee also update!, which will be faster for models of ordinary Arrays or CuArrays.\n\nExample\n\njulia> m = (x = Float32[1,2,3], y = tanh);\n\njulia> t = Optimisers.setup(Descent(0.1), m)\n(x = Leaf(Descent(0.1), nothing), y = ())\n\njulia> g = (x = [1,1,1], y = nothing); # fake gradient\n\njulia> Optimisers.update(t, m, g)\n((x = Leaf(Descent(0.1), nothing), y = ()), (x = Float32[0.9, 1.9, 2.9], y = tanh))\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.update!","page":"Training API","title":"Optimisers.update!","text":"Optimisers.update!(tree, model, gradient) -> (tree, model)\n\nUses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.\n\nThis is used in exactly the same manner as update, but because it may mutate arrays within the old model (and the old state), it will be faster for models of ordinary Arrays or CuArrays. However, you should not rely on the old model being fully updated but rather use the returned model. (The original state tree is always mutated, as each Leaf is mutable.)\n\nExample\n\njulia> using StaticArrays, Zygote, Optimisers\n\njulia> m = (x = [1f0, 2f0], y = SA[4f0, 5f0]); # partly mutable model\n\njulia> t = Optimisers.setup(Momentum(1/30, 0.9), m) # tree of states\n(x = Leaf(Momentum(0.0333333, 0.9), Float32[0.0, 0.0]), y = Leaf(Momentum(0.0333333, 0.9), Float32[0.0, 0.0]))\n\njulia> g = gradient(m -> sum(abs2.(m.x .+ m.y)), m)[1] # structural gradient\n(x = Float32[10.0, 14.0], y = Float32[10.0, 14.0])\n\njulia> t2, m2 = Optimisers.update!(t, m, g);\n\njulia> m2 # after update or update!, this is the new model\n(x = Float32[0.6666666, 1.5333333], y = Float32[3.6666667, 4.5333333])\n\njulia> m2.x === m.x # update! 
has re-used this array, for efficiency\ntrue\n\njulia> m # original should be discarded, may be mutated but no guarantee\n(x = Float32[0.6666666, 1.5333333], y = Float32[4.0, 5.0])\n\njulia> t == t2 # original state tree is guaranteed to be mutated\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.setup","page":"Training API","title":"Optimisers.setup","text":"Optimisers.setup(rule, model) -> state_tree\n\nInitialises the given optimiser for every trainable parameter within the model. Returns a tree of the relevant states, which must be passed to update or update!.\n\nExample\n\njulia> m = (x = rand(3), y = (true, false), z = tanh);\n\njulia> Optimisers.setup(Momentum(), m) # same field names as m\n(x = Leaf(Momentum(0.01, 0.9), [0.0, 0.0, 0.0]), y = ((), ()), z = ())\n\nThe recursion into structures uses Functors.jl, and any new structs containing parameters need to be marked with Functors.@functor before use. See the Flux docs for more about this.\n\njulia> struct Layer; mat; fun; end\n\njulia> model = (lay = Layer([1 2; 3 4f0], sin), vec = [5, 6f0]);\n\njulia> Optimisers.setup(Momentum(), model) # new struct is by default ignored\n(lay = (), vec = Leaf(Momentum(0.01, 0.9), Float32[0.0, 0.0]))\n\njulia> destructure(model)\n(Float32[5.0, 6.0], Restructure(NamedTuple, ..., 2))\n\njulia> using Functors; @functor Layer # annotate this type as containing parameters\n\njulia> Optimisers.setup(Momentum(), model)\n(lay = (mat = Leaf(Momentum(0.01, 0.9), Float32[0.0 0.0; 0.0 0.0]), fun = ()), vec = Leaf(Momentum(0.01, 0.9), Float32[0.0, 0.0]))\n\njulia> destructure(model)\n(Float32[1.0, 3.0, 2.0, 4.0, 5.0, 6.0], Restructure(NamedTuple, ..., 6))\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"train! uses @progress which should show a progress bar in VSCode automatically. 
To see one in a terminal, you will need to install TerminalLoggers.jl and follow its setup instructions.","category":"page"},{"location":"reference/training/reference/#Optimisation-Modifiers","page":"Training API","title":"Optimisation Modifiers","text":"","category":"section"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The state returned by setup can be modified to temporarily prevent training of some parts of the model, or to change the learning rate or other hyperparameter. The functions for doing so may be accessed as Flux.freeze!, Flux.thaw!, and Flux.adjust!. All mutate the state (or part of it) and return nothing.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"Optimisers.adjust!\nOptimisers.freeze!\nOptimisers.thaw!","category":"page"},{"location":"reference/training/reference/#Optimisers.adjust!","page":"Training API","title":"Optimisers.adjust!","text":"Optimisers.adjust!(tree, η)\n\nAlters the state tree = setup(rule, model) to change the parameters of the optimisation rule, without destroying its stored state. Typically used mid-way through training.\n\nCan be applied to part of a model, by acting only on the corresponding part of the state tree.\n\nTo change just the learning rate, provide a number η::Real.\n\nExample\n\njulia> m = (vec = rand(Float32, 2), fun = sin);\n\njulia> st = Optimisers.setup(Nesterov(), m) # stored momentum is initialised to zero\n(vec = Leaf(Nesterov(0.001, 0.9), Float32[0.0, 0.0]), fun = ())\n\njulia> st, m = Optimisers.update(st, m, (vec = [16, 88], fun = nothing)); # with fake gradient\n\njulia> st\n(vec = Leaf(Nesterov(0.001, 0.9), Float32[-0.016, -0.088]), fun = ())\n\njulia> Optimisers.adjust!(st, 0.123) # change learning rate, stored momentum untouched\n\njulia> st\n(vec = Leaf(Nesterov(0.123, 0.9), Float32[-0.016, -0.088]), fun = ())\n\nTo change other parameters, adjust! 
also accepts keyword arguments matching the field names of the optimisation rule's type.\n\njulia> fieldnames(Adam)\n(:eta, :beta, :epsilon)\n\njulia> st2 = Optimisers.setup(OptimiserChain(ClipGrad(), Adam()), m)\n(vec = Leaf(OptimiserChain(ClipGrad(10.0), Adam(0.001, (0.9, 0.999), 1.0e-8)), (nothing, (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))), fun = ())\n\njulia> Optimisers.adjust(st2; beta = (0.777, 0.909), delta = 11.1) # delta acts on ClipGrad\n(vec = Leaf(OptimiserChain(ClipGrad(11.1), Adam(0.001, (0.777, 0.909), 1.0e-8)), (nothing, (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))), fun = ())\n\njulia> Optimisers.adjust(st; beta = \"no such field\") # silently ignored!\n(vec = Leaf(Nesterov(0.123, 0.9), Float32[-0.016, -0.088]), fun = ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.freeze!","page":"Training API","title":"Optimisers.freeze!","text":"Optimisers.freeze!(tree)\n\nTemporarily alters the state tree = setup(rule, model) so that parameters will not be updated. Un-done by thaw!.\n\nCan be applied to the state corresponding to only part of a model, for instance with model::Chain, to freeze model.layers[1] you should call freeze!(tree.layers[1]).\n\nExample\n\njulia> m = (x = ([1.0], 2.0), y = [3.0]);\n\njulia> s = Optimisers.setup(Momentum(), m);\n\njulia> Optimisers.freeze!(s.x)\n\njulia> Optimisers.update!(s, m, (x = ([pi], 10pi), y = [100pi])); # with fake gradient\n\njulia> m\n(x = ([1.0], 2.0), y = [-0.14159265358979312])\n\njulia> s\n(x = (Leaf(Momentum(0.01, 0.9), [0.0], frozen = true), ()), y = Leaf(Momentum(0.01, 0.9), [3.14159]))\n\njulia> Optimisers.thaw!(s)\n\njulia> s.x\n(Leaf(Momentum(0.01, 0.9), [0.0]), ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.thaw!","page":"Training API","title":"Optimisers.thaw!","text":"Optimisers.thaw!(tree)\n\nThe reverse of freeze!. 
Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).\n\n\n\n\n\n","category":"function"},{"location":"tutorials/logistic_regression/#Logistic-Regression","page":"Logistic Regression","title":"Logistic Regression","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The following page contains a step-by-step walkthrough of the logistic regression algorithm in Julia using Flux. We will then create a simple logistic regression model without any usage of Flux and compare the different working parts with Flux's implementation.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by importing the required Julia packages.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> using Flux, Statistics, MLDatasets, DataFrames, OneHotArrays","category":"page"},{"location":"tutorials/logistic_regression/#Dataset","page":"Logistic Regression","title":"Dataset","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by importing a dataset from MLDatasets.jl. We will use the Iris dataset that contains the data of three different Iris species. The data consists of 150 data points (xs), each having four features. Each of these x is mapped to a label (or target) y, the name of a particular Iris species. 
The following code will download the Iris dataset when run for the first time.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> Iris()\ndataset Iris:\n metadata => Dict{String, Any} with 4 entries\n features => 150×4 DataFrame\n targets => 150×1 DataFrame\n dataframe => 150×5 DataFrame\n\njulia> x, y = Iris(as_df=false)[:];","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's have a look at our dataset -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> y\n1×150 Matrix{InlineStrings.String15}:\n \"Iris-setosa\" \"Iris-setosa\" … \"Iris-virginica\" \"Iris-virginica\"\n\njulia> x |> summary\n\"4×150 Matrix{Float64}\"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Each y value here corresponds to a type of iris plant, with a total of 150 data points. The x values depict the sepal length, sepal width, petal length, and petal width (all in cm) of 150 iris plants (hence the matrix size 4×150). Different types of iris plants have different lengths and widths of sepals and petals associated with them, and there is a definite pattern for this in nature. We can leverage this to train a simple classifier that outputs the type of iris plant using the length and width of sepals and petals as inputs.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our next step would be to convert this data into a form that can be fed to a machine learning model. The x values are arranged in a matrix and should ideally be converted to Float32 type (see Performance tips), but the labels must be one hot encoded. 
Here is a great Discourse thread on different techniques that can be used to one hot encode data with or without using any external Julia package.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> x = Float32.(x);\n\njulia> y = vec(y);\n\njulia> custom_y_onehot = unique(y) .== permutedims(y)\n3×150 BitMatrix:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"This same operation can also be performed using OneHotArrays' onehotbatch function. We will use both of these outputs side by side to show how intuitive FluxML is!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> const classes = [\"Iris-setosa\", \"Iris-versicolor\", \"Iris-virginica\"];\n\njulia> flux_y_onehot = onehotbatch(y, classes)\n3×150 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 1 1 1 1 1 1 1 1 1 1","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our data is ready. 
The next step would be to build a classifier for the same.","category":"page"},{"location":"tutorials/logistic_regression/#Building-a-model","page":"Logistic Regression","title":"Building a model","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A logistic regression model is defined mathematically as -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"model(x) = σ(Wx + b)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"where W is the weight matrix, b is the bias vector, and σ is any activation function. For our case, let's use the softmax activation function as we will be performing a multiclass classification task.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> m(W, b, x) = W*x .+ b\nm (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Note that this model lacks an activation function, but we will come back to that.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now move ahead to initialize the parameters of our model. 
Given that our model has four inputs (4 features in every data point), and three outputs (3 different classes), the parameters can be initialized in the following way -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> W = rand(Float32, 3, 4);\n\njulia> b = [0.0f0, 0.0f0, 0.0f0];","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now our model can take in the complete dataset and predict the class of each x in one go. But, we need to ensure that our model outputs the probabilities of an input belonging to the respective classes. As our model has three outputs, each would denote the probability of the input belonging to a particular class.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We will use an activation function to map our outputs to a probability value. It would make sense to use a softmax activation function here, which is defined mathematically as -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"σ(\\vec{x}) = \\frac{e^{z_i}}{\\sum_{j=1}^{k} e^{z_j}}","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The softmax function scales down the outputs to probability values such that the sum of all the final outputs equals 1. Let's implement this in Julia.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_softmax(x) = exp.(x) ./ sum(exp.(x), dims=1)\ncustom_softmax (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The implementation looks straightforward enough! 
Note that we specify dims=1 in the sum function to calculate the sum of probabilities in each column. Remember, we will have a 3×150 matrix (predicted ys) as the output of our model, where each column would be an output of a corresponding input.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's combine this softmax function with our model to construct the complete custom_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_model(W, b, x) = m(W, b, x) |> custom_softmax\ncustom_model (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's check if our model works.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_model(W, b, x) |> size\n(3, 150)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"It works! Let's check if the softmax function is working.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> all(0 .<= custom_model(W, b, x) .<= 1)\ntrue\n\njulia> sum(custom_model(W, b, x), dims=1)\n1×150 Matrix{Float32}:\n 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … 1.0 1.0 1.0 1.0 1.0 1.0 1.0","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Every output value is between 0 and 1, and every column adds to 1!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's convert our custom_model to a Flux model. 
Flux provides users with a very elegant API that almost feels like writing your own code!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Note, all the flux_* variables in this tutorial are general, that is, they can be used as is with some other similar-looking dataset, but the custom_* variables will remain specific to this tutorial.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_model = Chain(Dense(4 => 3), softmax)\nChain(\n Dense(4 => 3), # 15 parameters\n softmax,\n)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A Dense(4 => 3) layer denotes a layer with four inputs (four features in every data point) and three outputs (three classes or labels). This layer is the same as the mathematical model defined by us above. Under the hood, Flux too calculates the output using the same expression, but we don't have to initialize the parameters ourselves this time; Flux does it for us.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The softmax function provided by NNlib.jl is re-exported by Flux, and that is what we have used here. 
Lastly, Flux provides users with a Chain struct which makes stacking layers seamless.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A model's weights and biases can be accessed as follows -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_model[1].weight, flux_model[1].bias\n(Float32[0.78588694 -0.45968163 -0.77409476 0.2358028; -0.9049773 -0.58643705 0.466441 -0.79523873; 0.82426906 0.4143493 0.7630932 0.020588955], Float32[0.0, 0.0, 0.0])","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now pass the complete data in one go, with each data point having four features (four inputs)!","category":"page"},{"location":"tutorials/logistic_regression/#Loss-and-accuracy","page":"Logistic Regression","title":"Loss and accuracy","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our next step should be to define some quantitative values for our model, which we will maximize or minimize during the complete training procedure. 
These values will be the loss function and the accuracy metric.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by defining a loss function, a logitcrossentropy function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_logitcrossentropy(ŷ, y) = mean(.-sum(y .* logsoftmax(ŷ; dims = 1); dims = 1));","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now we can wrap the custom_logitcrossentropy inside a function that takes in the model parameters, xs, and ys, and returns the loss value.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function custom_loss(weights, biases, features, labels_onehot)\n ŷ = custom_model(weights, biases, features)\n custom_logitcrossentropy(ŷ, labels_onehot)\n end;\n\njulia> custom_loss(W, b, x, custom_y_onehot)\n1.1714406827505623","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The loss function works!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Flux provides us with many minimal yet elegant loss functions. In fact, the custom_logitcrossentropy defined above has been taken directly from Flux. 
The functions present in Flux include sanity checks, ensure efficient performance, and behave well with the overall FluxML ecosystem.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function flux_loss(flux_model, features, labels_onehot)\n ŷ = flux_model(features)\n Flux.logitcrossentropy(ŷ, labels_onehot)\n end;\n\njulia> flux_loss(flux_model, x, flux_y_onehot)\n1.2156688659673647","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Next, let's define an accuracy function, which we will try to maximize during our training procedure. Before jumping to accuracy, let's define a onecold function. The onecold function converts our output, which, remember, consists of probability values, to the actual class names.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can divide this task into two parts -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Identify the index of the maximum element of each column in the output matrix\nConvert this index to a class name","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The maximum index should be calculated along the columns (remember, each column is the output of a single x data point). 
We can use Julia's argmax function to achieve this.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> argmax(custom_y_onehot, dims=1) # calculate the cartesian index of max element column-wise\n1×150 Matrix{CartesianIndex{2}}:\n CartesianIndex(1, 1) CartesianIndex(1, 2) … CartesianIndex(3, 150)\n\njulia> max_idx = [x[1] for x in argmax(custom_y_onehot; dims=1)]\n1×150 Matrix{Int64}:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … 3 3 3 3 3 3 3 3 3 3 3 3","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now we can write a function that calculates the indices of the maximum element in each column, and maps them to a class name.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function custom_onecold(labels_onehot)\n max_idx = [x[1] for x in argmax(labels_onehot; dims=1)]\n return vec(classes[max_idx])\n end;\n\njulia> custom_onecold(custom_y_onehot)\n150-element Vector{String}:\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n ⋮\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"It works!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Flux provides users with the onecold function so that we don't have to write it on our own. 
Let's see how our custom_onecold function compares to Flux.onecold.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> istrue = Flux.onecold(flux_y_onehot, classes) .== custom_onecold(custom_y_onehot);\n\njulia> all(istrue)\ntrue","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Both the functions act identically!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We now move to the accuracy metric and run it with the untrained custom_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_accuracy(W, b, x, y) = mean(custom_onecold(custom_model(W, b, x)) .== y);\n\njulia> custom_accuracy(W, b, x, y)\n0.3333333333333333","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We could also have used Flux's built-in functionality to define this accuracy function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_accuracy(x, y) = mean(Flux.onecold(flux_model(x), classes) .== y);\n\njulia> flux_accuracy(x, y)\n0.24","category":"page"},{"location":"tutorials/logistic_regression/#Training-the-model","page":"Logistic Regression","title":"Training the model","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's train our model using the classic Gradient Descent algorithm. 
According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"\\begin{aligned}\nW &= W - \\eta * \\frac{dL}{dW} \\\\\nb &= b - \\eta * \\frac{dL}{db}\n\\end{aligned}","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Here, W is the weight matrix, b is the bias vector, \\eta is the learning rate, \\frac{dL}{dW} is the derivative of the loss function with respect to the weight, and \\frac{dL}{db} is the derivative of the loss function with respect to the bias.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The derivatives are calculated using an Automatic Differentiation tool, and Flux uses Zygote.jl for the same. Since Zygote.jl is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of Zygote.jl for more information on the same.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. Flux re-exports Zygote's gradient function; hence, we don't need to import Zygote explicitly to use the functionality. gradient takes in a function and its arguments, and returns a tuple containing ∂f/∂x for each argument x. Let's pass in custom_loss and the arguments required by custom_loss to gradient. 
We will require the derivatives of the loss function (custom_loss) with respect to the weights (∂f/∂w) and the bias (∂f/∂b) to carry out gradient descent, but we can ignore the partial derivatives of the loss function (custom_loss) with respect to x (∂f/∂x) and one hot encoded y (∂f/∂y).","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, custom_y_onehot);","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now update the parameters, following the gradient descent algorithm -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> W .= W .- 0.1 .* dLdW;\n\njulia> b .= b .- 0.1 .* dLdb;","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The parameters have been updated! We can now check the value of our custom loss function -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_loss(W, b, x, custom_y_onehot)\n1.164742997664842","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The loss went down! 
Let's plug our super training logic inside a function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function train_custom_model!(f_loss, weights, biases, features, labels_onehot)\n dLdW, dLdb, _, _ = gradient(f_loss, weights, biases, features, labels_onehot)\n weights .= weights .- 0.1 .* dLdW\n biases .= biases .- 0.1 .* dLdb\n end;","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can plug the training function inside a loop and train the model for more epochs. The loop can be tailored to suit the user's needs, and the conditions can be specified in plain Julia. Here we will train the model for a maximum of 500 epochs, but to ensure that the model does not overfit, we will break as soon as our accuracy value crosses or becomes equal to 0.98.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> for i = 1:500\n train_custom_model!(custom_loss, W, b, x, custom_y_onehot);\n custom_accuracy(W, b, x, y) >= 0.98 && break\n end\n\njulia> @show custom_accuracy(W, b, x, y);\ncustom_accuracy(W, b, x, y) = 0.98","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Everything works! Our model achieved an accuracy of 0.98! Let's have a look at the loss.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_loss(W, b, x, custom_y_onehot)\n0.6520349798243569","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"As expected, the loss went down too! 
Now, let's repeat the same steps with our flux_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can write a similar-looking training loop for our flux_model and train it similarly.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_loss(flux_model, x, flux_y_onehot)\n1.215731131385928\n\njulia> function train_flux_model!(f_loss, model, features, labels_onehot)\n dLdm, _, _ = gradient(f_loss, model, features, labels_onehot)\n @. model[1].weight = model[1].weight - 0.1 * dLdm[:layers][1][:weight]\n @. model[1].bias = model[1].bias - 0.1 * dLdm[:layers][1][:bias]\n end;\n\njulia> for i = 1:500\n train_flux_model!(flux_loss, flux_model, x, flux_y_onehot);\n flux_accuracy(x, y) >= 0.98 && break\n end","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Looking at the accuracy and loss value -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> @show flux_accuracy(x, y);\nflux_accuracy(x, y) = 0.98\n\njulia> flux_loss(flux_model, x, flux_y_onehot)\n0.6952386604624324","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We see a very similar final loss and accuracy.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset, and one hot encoded the labels. 
Next, we defined our model, the loss function, and the accuracy, all by ourselves.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then compared them side by side with the functionality provided by Flux!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"info: Info\nOriginally published on 1st April 2023, by Saransh Chopra.","category":"page"},{"location":"tutorials/model_zoo/#Model-Zoo","page":"Model Zoo","title":"Model Zoo","text":"","category":"section"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"The model zoo is a collection of examples that demonstrate how to build and train models using Flux. The examples are organised by domain and include vision, text, and audio. 
Each example includes a description of the model, the data used, and the training process.","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Some of the examples are pedagogical, see for instance","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Multilayer Perceptron\nSimple Convolutional Neural Network","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Others are more advanced, see for instance","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Variational Autoencoder","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/data/mlutils/#Working-with-Data,-using-MLUtils.jl","page":"Batching Data – MLUtils.jl","title":"Working with Data, using MLUtils.jl","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"Flux re-exports the DataLoader type and utility functions for working with data from MLUtils.","category":"page"},{"location":"reference/data/mlutils/#DataLoader","page":"Batching Data – MLUtils.jl","title":"DataLoader","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"The DataLoader can be used to create mini-batches of data, in the format train! 
expects.","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"MLUtils.DataLoader","category":"page"},{"location":"reference/data/mlutils/#MLUtils.DataLoader","page":"Batching Data – MLUtils.jl","title":"MLUtils.DataLoader","text":"DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])\n\nAn object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).\n\nTakes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.\n\nThe last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.\n\nThe original data is preserved in the data field of the DataLoader.\n\nArguments\n\ndata: The data to be iterated over. The data type has to be supported by numobs and getobs.\nbatchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.\nbuffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.\ncollate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.\nparallel: Whether to load data in parallel using worker threads. Greatly speeds up data loading, by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. 
Default false.\npartial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.\nrng: A random number generator. Default Random.GLOBAL_RNG.\nshuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.\n\nExamples\n\njulia> Xtrain = rand(10, 100);\n\njulia> array_loader = DataLoader(Xtrain, batchsize=2);\n\njulia> for x in array_loader\n @assert size(x) == (10, 2)\n # do something with x, 50 times\n end\n\njulia> array_loader.data === Xtrain\ntrue\n\njulia> tuple_loader = DataLoader((Xtrain,), batchsize=2); # similar, but yielding 1-element tuples\n\njulia> for x in tuple_loader\n @assert x isa Tuple{Matrix}\n @assert size(x[1]) == (10, 2)\n end\n\njulia> Ytrain = rand('a':'z', 100); # now make a DataLoader yielding 2-element named tuples\n\njulia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);\n\njulia> for epoch in 1:100\n for (x, y) in train_loader # access via tuple destructuring\n @assert size(x) == (10, 5)\n @assert size(y) == (5,)\n # loss += f(x, y) # etc, runs 100 * 20 times\n end\n end\n\njulia> first(train_loader).label isa Vector{Char} # access via property name\ntrue\n\njulia> first(train_loader).label == Ytrain[1:5] # because of shuffle=true\nfalse\n\njulia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last\n10×30 Matrix{Int8}\n10×30 Matrix{Int8}\n10×4 Matrix{Int8}\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#Utility-Functions","page":"Batching Data – MLUtils.jl","title":"Utility Functions","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – 
MLUtils.jl","text":"The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"MLUtils.batch\nMLUtils.batchsize\nMLUtils.batchseq\nMLUtils.BatchView\nMLUtils.chunk\nMLUtils.eachobs\nMLUtils.fill_like\nMLUtils.filterobs\nFlux.flatten\nMLUtils.flatten\nMLUtils.getobs\nMLUtils.getobs!\nMLUtils.joinobs\nMLUtils.group_counts\nMLUtils.group_indices\nMLUtils.groupobs\nMLUtils.kfolds\nMLUtils.leavepout\nMLUtils.mapobs\nMLUtils.numobs\nMLUtils.normalise\nMLUtils.obsview\nMLUtils.ObsView\nMLUtils.ones_like\nMLUtils.oversample\nMLUtils.randobs\nMLUtils.rand_like\nMLUtils.randn_like\nMLUtils.rpad_constant\nMLUtils.shuffleobs\nMLUtils.splitobs\nMLUtils.unbatch\nMLUtils.undersample\nMLUtils.unsqueeze\nMLUtils.unstack\nMLUtils.zeros_like","category":"page"},{"location":"reference/data/mlutils/#MLUtils.batch","page":"Batching Data – MLUtils.jl","title":"MLUtils.batch","text":"batch(xs)\n\nBatch the arrays in xs into a single array with an extra dimension.\n\nIf the elements of xs are tuples, named tuples, or dicts, the output will be of the same type. 
\n\nSee also unbatch.\n\nExamples\n\njulia> batch([[1,2,3], \n [4,5,6]])\n3×2 Matrix{Int64}:\n 1 4\n 2 5\n 3 6\n\njulia> batch([(a=[1,2], b=[3,4])\n (a=[5,6], b=[7,8])]) \n(a = [1 5; 2 6], b = [3 7; 4 8])\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.batchsize","page":"Batching Data – MLUtils.jl","title":"MLUtils.batchsize","text":"batchsize(data::BatchView) -> Int\n\nReturn the fixed size of each batch in data.\n\nExamples\n\nusing MLUtils\nX, Y = MLUtils.load_iris()\n\nA = BatchView(X, batchsize=30)\n@assert batchsize(A) == 30\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.batchseq","page":"Batching Data – MLUtils.jl","title":"MLUtils.batchseq","text":"batchseq(seqs, val = 0)\n\nTake a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by val.\n\nExamples\n\njulia> batchseq([[1, 2, 3], [4, 5]], 0)\n3-element Vector{Vector{Int64}}:\n [1, 4]\n [2, 5]\n [3, 0]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.BatchView","page":"Batching Data – MLUtils.jl","title":"MLUtils.BatchView","text":"BatchView(data, batchsize; partial=true, collate=nothing)\nBatchView(data; batchsize=1, partial=true, collate=nothing)\n\nCreate a view of the given data that represents it as a vector of batches. Each batch will contain an equal amount of observations in them. The batch-size can be specified using the parameter batchsize. In the case that the size of the dataset is not dividable by the specified batchsize, the remaining observations will be ignored if partial=false. 
If partial=true instead the last batch-size can be slightly smaller.\n\nNote that any data access is delayed until getindex is called.\n\nIf used as an iterator, the object will iterate over the dataset once, effectively denoting an epoch.\n\nFor BatchView to work on some data structure, the type of the given variable data must implement the data container interface. See ObsView for more info.\n\nArguments\n\ndata : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).\nbatchsize : The batch-size of each batch. It is the number of observations that each batch must contain (except possibly for the last one).\npartial : If partial=false and the number of observations is not divisible by the batch-size, then the last mini-batch is dropped.\ncollate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. 
See batch for more information and examples.\n\nExamples\n\nusing MLUtils\nX, Y = MLUtils.load_iris()\n\nA = BatchView(X, batchsize=30)\n@assert typeof(A) <: BatchView <: AbstractVector\n@assert eltype(A) <: SubArray{Float64,2}\n@assert length(A) == 5 # Iris has 150 observations\n@assert size(A[1]) == (4,30) # Iris has 4 features\n\n# 5 batches of size 30 observations\nfor x in BatchView(X, batchsize=30)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert numobs(x) === 30\nend\n\n# 7 batches of size 20 observations\n# Note that the iris dataset has 150 observations,\n# which means that with a batchsize of 20, the last\n# 10 observations will be ignored\nfor (x, y) in BatchView((X, Y), batchsize=20, partial=false)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert typeof(y) <: SubArray{String,1}\n @assert numobs(x) == numobs(y) == 20\nend\n\n# collate tuple observations\nfor (x, y) in BatchView((rand(10, 3), [\"a\", \"b\", \"c\"]), batchsize=2, collate=true, partial=false)\n @assert size(x) == (10, 2)\n @assert size(y) == (2,)\nend\n\n\n# randomly assign observations to one and only one batch.\nfor (x, y) in BatchView(shuffleobs((X, Y)), batchsize=20)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert typeof(y) <: SubArray{String,1}\nend\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#MLUtils.chunk","page":"Batching Data – MLUtils.jl","title":"MLUtils.chunk","text":"chunk(x, n; [dims])\nchunk(x; [size, dims])\n\nSplit x into n parts or alternatively, if size is an integer, into equal chunks of size size. 
The parts contain the same number of elements except possibly for the last one that can be smaller.\n\nIn case size is a collection of integers instead, the elements of x are split into chunks of the given sizes.\n\nIf x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).\n\nExamples\n\njulia> chunk(1:10, 3)\n3-element Vector{UnitRange{Int64}}:\n 1:4\n 5:8\n 9:10\n\njulia> chunk(1:10; size = 2)\n5-element Vector{UnitRange{Int64}}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\njulia> x = reshape(collect(1:20), (5, 4))\n5×4 Matrix{Int64}:\n 1 6 11 16\n 2 7 12 17\n 3 8 13 18\n 4 9 14 19\n 5 10 15 20\n\njulia> xs = chunk(x, 2, dims=1)\n2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:\n [1 6 11 16; 2 7 12 17; 3 8 13 18]\n [4 9 14 19; 5 10 15 20]\n\njulia> xs[1]\n3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:\n 1 6 11 16\n 2 7 12 17\n 3 8 13 18\n\njulia> xes = chunk(x; size = 2, dims = 2)\n2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:\n [1 6; 2 7; … ; 4 9; 5 10]\n [11 16; 12 17; … ; 14 19; 15 20]\n\njulia> xes[2]\n5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:\n 11 16\n 12 17\n 13 18\n 14 19\n 15 20\n\njulia> chunk(1:6; size = [2, 4])\n2-element Vector{UnitRange{Int64}}:\n 1:2\n 3:6\n\n\n\n\n\nchunk(x, partition_idxs; [npartitions, dims])\n\nPartition the array x along the dimension dims according to the indexes in partition_idxs.\n\npartition_idxs must be sorted and contain only positive integers between 1 and the number of partitions. 
\n\nIf the number of partition npartitions is not provided, it is inferred from partition_idxs.\n\nIf dims is not provided, it defaults to the last dimension.\n\nSee also unbatch.\n\nExamples\n\njulia> x = reshape([1:10;], 2, 5)\n2×5 Matrix{Int64}:\n 1 3 5 7 9\n 2 4 6 8 10\n\njulia> chunk(x, [1, 2, 2, 3, 3])\n3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:\n [1; 2;;]\n [3 5; 4 6]\n [7 9; 8 10]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.eachobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.eachobs","text":"eachobs(data; kws...)\n\nReturn an iterator over data.\n\nSupports the same arguments as DataLoader. The batchsize default is -1 here while it is 1 for DataLoader.\n\nExamples\n\nX = rand(4,100)\n\nfor x in eachobs(X)\n # loop entered 100 times\n @assert typeof(x) <: Vector{Float64}\n @assert size(x) == (4,)\nend\n\n# mini-batch iterations\nfor x in eachobs(X, batchsize=10)\n # loop entered 10 times\n @assert typeof(x) <: Matrix{Float64}\n @assert size(x) == (4,10)\nend\n\n# support for tuples, named tuples, dicts\nfor (x, y) in eachobs((X, Y))\n # ...\nend\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.fill_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.fill_like","text":"fill_like(x, val, [element_type=eltype(x)], [dims=size(x)]))\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to val. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also zeros_like and ones_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.16087806\n 0.89916044\n\njulia> fill_like(x, 1.7, (3, 3))\n3×3 Matrix{Float32}:\n 1.7 1.7 1.7\n 1.7 1.7 1.7\n 1.7 1.7 1.7\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.803167 0.476101\n 0.303041 0.317581\n\njulia> fill_like(x, 1.7, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 1.7 1.7\n 1.7 1.7\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.filterobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.filterobs","text":"filterobs(f, data)\n\nReturn a subset of data container data including all indices i for which f(getobs(data, i)) === true.\n\ndata = 1:10\nnumobs(data) == 10\nfdata = filterobs(>(5), data)\nnumobs(fdata) == 5\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#Flux.flatten","page":"Batching Data – MLUtils.jl","title":"Flux.flatten","text":"flatten(x)\n\nSame as MLUtils.flatten, which should be preferred; this method exists only for backward compatibility.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.flatten","page":"Batching Data – MLUtils.jl","title":"MLUtils.flatten","text":"flatten(x::AbstractArray)\n\nReshape arbitrarily shaped input into a matrix-shaped output, preserving the size of the last dimension.\n\nSee also unsqueeze.\n\nExamples\n\njulia> rand(3,4,5) |> flatten |> size\n(12, 5)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.getobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.getobs","text":"getobs(data, [idx])\n\nReturn the observations corresponding to the observation index idx. Note that idx can be any type as long as data has defined getobs for that type. 
If idx is not provided, then materialize all observations in data.\n\nIf data does not have getobs defined, then in the case of Tables.table(data) == true returns the row(s) in position idx, otherwise returns data[idx].\n\nAuthors of custom data containers should implement Base.getindex for their type instead of getobs. getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).\n\nThe returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on how this \"actual data\" must look like. Every author behind some custom data container can make this decision themselves. The output should be consistent when idx is a scalar vs vector.\n\ngetobs supports by default nested combinations of array, tuple, named tuples, and dictionaries. \n\nSee also getobs! and numobs.\n\nExamples\n\n# named tuples \nx = (a = [1, 2, 3], b = rand(6, 3))\n\ngetobs(x, 2) == (a = 2, b = x.b[:, 2])\ngetobs(x, [1, 3]) == (a = [1, 3], b = x.b[:, [1, 3]])\n\n\n# dictionaries\nx = Dict(:a => [1, 2, 3], :b => rand(6, 3))\n\ngetobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])\ngetobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.getobs!","page":"Batching Data – MLUtils.jl","title":"MLUtils.getobs!","text":"getobs!(buffer, data, idx)\n\nInplace version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result, instead of allocating a dedicated object.\n\nImplementing this function is optional. In the case no such method is provided for the type of data, then buffer will be ignored and the result of getobs returned. This could be because the type of data may not lend itself to the concept of copy!. Thus, supporting a custom getobs! is optional and not required.\n\nSee also getobs and numobs. 
\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.joinobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.joinobs","text":"joinobs(datas...)\n\nConcatenate data containers datas.\n\ndata1, data2 = 1:10, 11:20\njdata = joinobs(data1, data2)\ngetobs(jdata, 15) == 15\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.group_counts","page":"Batching Data – MLUtils.jl","title":"MLUtils.group_counts","text":"group_counts(x)\n\nCount the number of times that each element of x appears.\n\nSee also group_indices\n\nExamples\n\njulia> group_counts(['a', 'b', 'b'])\nDict{Char, Int64} with 2 entries:\n 'a' => 1\n 'b' => 2\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.group_indices","page":"Batching Data – MLUtils.jl","title":"MLUtils.group_indices","text":"group_indices(x) -> Dict\n\nComputes the indices of elements in the vector x for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.\n\nSee also group_counts.\n\nExamples\n\njulia> x = [:yes, :no, :maybe, :yes];\n\njulia> group_indices(x)\nDict{Symbol, Vector{Int64}} with 3 entries:\n :yes => [1, 4]\n :maybe => [3]\n :no => [2]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.groupobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.groupobs","text":"groupobs(f, data)\n\nSplit data container data into different data containers, grouping observations by f(obs).\n\ndata = -10:10\ndatas = groupobs(>(0), data)\nlength(datas) == 2\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.kfolds","page":"Batching Data – MLUtils.jl","title":"MLUtils.kfolds","text":"kfolds(n::Integer, k = 5) -> Tuple\n\nCompute the train/validation assignments for k repartitions of n observations, and return them in the form of two vectors. 
The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either k = 5 or k = 10. The following code snippet generates the indices assignments for k = 5\n\njulia> train_idx, val_idx = kfolds(10, 5);\n\nEach observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.\n\njulia> train_idx\n5-element Array{Array{Int64,1},1}:\n [3,4,5,6,7,8,9,10]\n [1,2,5,6,7,8,9,10]\n [1,2,3,4,7,8,9,10]\n [1,2,3,4,5,6,9,10]\n [1,2,3,4,5,6,7,8]\n\njulia> val_idx\n5-element Array{UnitRange{Int64},1}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\n\n\n\n\nkfolds(data, [k = 5])\n\nRepartition a data container k times using a k folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs is invoked.\n\nConceptually, a k-folds repartitioning strategy divides the given data into k roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k different partitions of data.\n\nIn the case that the size of the dataset is not dividable by the specified k, the remaining observations will be evenly distributed among the parts.\n\nfor (x_train, x_val) in kfolds(X, k=10)\n # code called 10 times\n # nobs(x_val) may differ up to ±1 over iterations\nend\n\nMultiple variables are supported (e.g. for labeled data)\n\nfor ((x_train, y_train), val) in kfolds((X, Y), k=10)\n # ...\nend\n\nBy default the folds are created using static splits. 
Use shuffleobs to randomly assign observations to the folds.\n\nfor (x_train, x_val) in kfolds(shuffleobs(X), k = 10)\n # ...\nend\n\nSee leavepout for a related function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.leavepout","page":"Batching Data – MLUtils.jl","title":"MLUtils.leavepout","text":"leavepout(n::Integer, [size = 1]) -> Tuple\n\nCompute the train/validation assignments for k ≈ n/size repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either size or size+1 observations assigned to it. The following code snippet generates the index-vectors for size = 2.\n\njulia> train_idx, val_idx = leavepout(10, 2);\n\nEach observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.\n\njulia> train_idx\n5-element Array{Array{Int64,1},1}:\n [3,4,5,6,7,8,9,10]\n [1,2,5,6,7,8,9,10]\n [1,2,3,4,7,8,9,10]\n [1,2,3,4,5,6,9,10]\n [1,2,3,4,5,6,7,8]\n\njulia> val_idx\n5-element Array{UnitRange{Int64},1}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\n\n\n\n\nleavepout(data, p = 1)\n\nRepartition a data container using a k-fold strategy, where k is chosen in such a way, that each validation subset of the resulting folds contains roughly p observations. Defaults to p = 1, which is also known as \"leave-one-out\" partitioning.\n\nThe resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. 
That means no actual data is copied until getobs is invoked.\n\nfor (train, val) in leavepout(X, p=2)\n # if numobs(X) is divisible by 2,\n # then numobs(val) will be 2 for each iteration,\n # otherwise it may be 3 for the first few iterations.\nend\n\nSee kfolds for a related function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.mapobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.mapobs","text":"mapobs(f, data; batched=:auto)\n\nLazily map f over the observations in a data container data. Returns a new data container mdata that can be indexed and has a length. Indexing triggers the transformation f.\n\nThe batched keyword argument controls the behavior of mdata[idx] and mdata[idxs] where idx is an integer and idxs is a vector of integers:\n\nbatched=:auto (default). Let f handle the two cases. Calls f(getobs(data, idx)) and f(getobs(data, idxs)).\nbatched=:never. The function f is always called on a single observation. Calls f(getobs(data, idx)) and [f(getobs(data, idx)) for idx in idxs].\nbatched=:always. The function f is always called on a batch of observations. Calls getobs(f(getobs(data, [idx])), 1) and f(getobs(data, idxs)).\n\nExamples\n\njulia> data = (a=[1,2,3], b=[1,2,3]);\n\njulia> mdata = mapobs(data) do x\n (c = x.a .+ x.b, d = x.a .- x.b)\n end\nmapobs(#25, (a = [1, 2, 3], b = [1, 2, 3]); batched=:auto))\n\njulia> mdata[1]\n(c = 2, d = 0)\n\njulia> mdata[1:2]\n(c = [2, 4], d = [0, 0])\n\n\n\n\n\nmapobs(fs, data)\n\nLazily map each function in tuple fs over the observations in data container data. Returns a tuple of transformed data containers.\n\n\n\n\n\nmapobs(namedfs::NamedTuple, data)\n\nMap a NamedTuple of functions over data, turning it into a data container of NamedTuples. 
Field syntax can be used to select a column of the resulting data container.\n\ndata = 1:10\nnameddata = mapobs((x = sqrt, y = log), data)\ngetobs(nameddata, 10) == (x = sqrt(10), y = log(10))\ngetobs(nameddata.x, 10) == sqrt(10)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.numobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.numobs","text":"numobs(data)\n\nReturn the total number of observations contained in data.\n\nIf data does not have numobs defined, then in the case of Tables.table(data) == true returns the number of rows, otherwise returns length(data).\n\nAuthors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).\n\ngetobs supports by default nested combinations of array, tuple, named tuples, and dictionaries. \n\nSee also getobs.\n\nExamples\n\n\n# named tuples \nx = (a = [1, 2, 3], b = rand(6, 3))\nnumobs(x) == 3\n\n# dictionaries\nx = Dict(:a => [1, 2, 3], :b => rand(6, 3))\nnumobs(x) == 3\n\nAll internal containers must have the same number of observations:\n\njulia> x = (a = [1, 2, 3, 4], b = rand(6, 3));\n\njulia> numobs(x)\nERROR: DimensionMismatch: All data containers must have the same number of observations.\nStacktrace:\n [1] _check_numobs_error()\n @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:163\n [2] _check_numobs\n @ ~/.julia/dev/MLUtils/src/observation.jl:130 [inlined]\n [3] numobs(data::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Matrix{Float64}}})\n @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:177\n [4] top-level scope\n @ REPL[35]:1\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.normalise","page":"Batching Data – MLUtils.jl","title":"MLUtils.normalise","text":"normalise(x; dims=ndims(x), ϵ=1e-5)\n\nNormalise the array x to mean 0 and standard deviation 1 across the dimension(s) 
given by dims. By default, dims is the last dimension. \n\nϵ is a small additive factor added to the denominator for numerical stability.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.obsview","page":"Batching Data – MLUtils.jl","title":"MLUtils.obsview","text":"obsview(data, [indices])\n\nReturns a lazy view of the observations in data that correspond to the given indices. No data will be copied except for the indices. It is similar to constructing an ObsView, but returns a SubArray if the type of data is Array or SubArray. Furthermore, this function may be extended for custom types of data that also want to provide their own subset-type.\n\nIn case data is a tuple, the constructor will be mapped over its elements. That means that the constructor returns a tuple of ObsView instead of an ObsView of tuples.\n\nIf instead you want to get the subset of observations corresponding to the given indices in their native type, use getobs.\n\nSee ObsView for more information.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.ObsView","page":"Batching Data – MLUtils.jl","title":"MLUtils.ObsView","text":"ObsView(data, [indices])\n\nUsed to represent a subset of some data of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.\n\nThe main purpose for the existence of ObsView is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.\n\nAny data access is delayed until getindex is called, and even getindex returns the result of obsview which in general avoids data movement until getobs is called. 
If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.\n\nArguments\n\ndata : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).\nindices : Optional. The index or indices of the observation(s) in data that the subset should represent. Can be of type Int or some subtype of AbstractVector.\n\nMethods\n\ngetindex : Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.\nnumobs : Returns the total number observations in the subset.\ngetobs : Returns the underlying data that the ObsView represents at the given relative indices. Note that these indices are in \"subset space\", and in general will not directly correspond to the same indices in the underlying data set.\n\nDetails\n\nFor ObsView to work on some data structure, the desired type MyType must implement the following interface:\n\ngetobs(data::MyType, idx) : Should return the observation(s) indexed by idx. In what form is up to the user. Note that idx can be of type Int or AbstractVector.\nnumobs(data::MyType) : Should return the total number of observations in data\n\nThe following methods can also be provided and are optional:\n\ngetobs(data::MyType) : By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.\nobsview(data::MyType, idx) : If your custom type has its own kind of subset type, you can return it here. An example for such a case are SubArray for representing a subset of some AbstractArray.\ngetobs!(buffer, data::MyType, [idx]) : Inplace version of getobs(data, idx). If this method is provided for MyType, then eachobs can preallocate a buffer that is then reused every iteration. 
Note: buffer should be equivalent to the return value of getobs(::MyType, ...), since this is how buffer is preallocated by default.\n\nExamples\n\nX, Y = MLUtils.load_iris()\n\n# The iris set has 150 observations and 4 features\n@assert size(X) == (4,150)\n\n# Represents the 80 observations as a ObsView\nv = ObsView(X, 21:100)\n@assert numobs(v) == 80\n@assert typeof(v) <: ObsView\n# getobs indexes into v\n@assert getobs(v, 1:10) == X[:, 21:30]\n\n# Use `obsview` to avoid boxing into ObsView\n# for types that provide a custom \"subset\", such as arrays.\n# Here it instead creates a native SubArray.\nv = obsview(X, 1:100)\n@assert numobs(v) == 100\n@assert typeof(v) <: SubArray\n\n# Also works for tuples of arbitrary length\nsubset = obsview((X, Y), 1:100)\n@assert numobs(subset) == 100\n@assert typeof(subset) <: Tuple # tuple of SubArray\n\n# Use as iterator\nfor x in ObsView(X)\n @assert typeof(x) <: SubArray{Float64,1}\nend\n\n# iterate over each individual labeled observation\nfor (x, y) in ObsView((X, Y))\n @assert typeof(x) <: SubArray{Float64,1}\n @assert typeof(y) <: String\nend\n\n# same but in random order\nfor (x, y) in ObsView(shuffleobs((X, Y)))\n @assert typeof(x) <: SubArray{Float64,1}\n @assert typeof(y) <: String\nend\n\n# Indexing: take first 10 observations\nx, y = ObsView((X, Y))[1:10]\n\nSee also\n\nobsview, getobs, numobs, splitobs, shuffleobs, kfolds.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#MLUtils.ones_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.ones_like","text":"ones_like(x, [element_type=eltype(x)], [dims=size(x)]))\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also zeros_like and fill_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.8621633\n 0.5158395\n\njulia> ones_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.82297 0.656143\n 0.701828 0.391335\n\njulia> ones_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.oversample","page":"Batching Data – MLUtils.jl","title":"MLUtils.oversample","text":"oversample(data, classes; fraction=1, shuffle=true)\noversample(data::Tuple; fraction=1, shuffle=true)\n\nGenerate a re-balanced version of data by repeatedly sampling existing observations in such a way that every class will have at least fraction times the number of observations of the largest class in classes. This way, all classes will have a minimum number of observations in the resulting data set relative to what the largest class has in the given (original) data.\n\nAs an example, by default (i.e. with fraction = 1) the resulting dataset will be near perfectly balanced. On the other hand, with fraction = 0.5 every class in the resulting data will have at least 50% as many observations as the largest class.\n\nThe classes input is an array with the same length as numobs(data). \n\nThe convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the repeated samples will be together at the end, sorted by class. 
Defaults to true.\n\nThe output will contain both the resampled data and classes.\n\n# 6 observations with 3 features each\nX = rand(3, 6)\n# 2 classes, severely imbalanced\nY = [\"a\", \"b\", \"b\", \"b\", \"b\", \"a\"]\n\n# oversample the class \"a\" to match \"b\"\nX_bal, Y_bal = oversample(X, Y)\n\n# this results in a bigger dataset with repeated data\n@assert size(X_bal) == (3,8)\n@assert length(Y_bal) == 8\n\n# now both \"a\", and \"b\" have 4 observations each\n@assert sum(Y_bal .== \"a\") == 4\n@assert sum(Y_bal .== \"b\") == 4\n\nFor this function to work, the type of data must implement numobs and getobs. \n\nNote that if data is a tuple and classes is not given, then it will be assumed that the last element of the tuple contains the classes.\n\njulia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])\n6×3 DataFrames.DataFrame\n│ Row │ X1 │ X2 │ Y │\n├─────┼───────────┼─────────────┼───┤\n│ 1 │ 0.226582 │ 0.0443222 │ a │\n│ 2 │ 0.504629 │ 0.722906 │ b │\n│ 3 │ 0.933372 │ 0.812814 │ b │\n│ 4 │ 0.522172 │ 0.245457 │ b │\n│ 5 │ 0.505208 │ 0.11202 │ b │\n│ 6 │ 0.0997825 │ 0.000341996 │ a │\n\njulia> getobs(oversample(data, data.Y))\n8×3 DataFrame\n Row │ X1 X2 Y \n │ Float64 Float64 Symbol \n─────┼─────────────────────────────\n 1 │ 0.376304 0.100022 a\n 2 │ 0.467095 0.185437 b\n 3 │ 0.481957 0.319906 b\n 4 │ 0.336762 0.390811 b\n 5 │ 0.376304 0.100022 a\n 6 │ 0.427064 0.0648339 a\n 7 │ 0.427064 0.0648339 a\n 8 │ 0.457043 0.490688 b\n\nSee ObsView for more information on data subsets. See also undersample.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.randobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.randobs","text":"randobs(data, [n])\n\nPick a random observation or a batch of n random observations from data. 
For this function to work, the type of data must implement numobs and getobs.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.rand_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.rand_like","text":"rand_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to a random value. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.\n\nThe default random number generator is used, unless a custom one is passed in explicitly as the first argument.\n\nSee also Base.rand and randn_like.\n\nExamples\n\njulia> x = ones(Float32, 2)\n2-element Vector{Float32}:\n 1.0\n 1.0\n\njulia> rand_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 0.780032 0.920552 0.53689\n 0.121451 0.741334 0.5449\n 0.55348 0.138136 0.556404\n\njulia> using CUDA\n\njulia> CUDA.ones(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\njulia> rand_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 0.429274 0.135379\n 0.718895 0.0098756\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.randn_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.randn_like","text":"randn_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to a random value drawn from a normal distribution. The last two arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nThe default random number generator is used, unless a custom one is passed in explicitly as the first argument.\n\nSee also Base.randn and rand_like.\n\nExamples\n\njulia> x = ones(Float32, 2)\n2-element Vector{Float32}:\n 1.0\n 1.0\n\njulia> randn_like(x, (3, 3))\n3×3 Matrix{Float32}:\n -0.385331 0.956231 0.0745102\n 1.43756 -0.967328 2.06311\n 0.0482372 1.78728 -0.902547\n\njulia> using CUDA\n\njulia> CUDA.ones(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\njulia> randn_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n -0.578527 0.823445\n -1.01338 -0.612053\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.rpad_constant","page":"Batching Data – MLUtils.jl","title":"MLUtils.rpad_constant","text":"rpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)\n\nReturn the given sequence padded with val along the dimensions dims up to a maximum length in each direction specified by n.\n\nExamples\n\njulia> rpad_constant([1, 2], 4, -1) # passing with -1 up to size 4\n4-element Vector{Int64}:\n 1\n 2\n -1\n -1\n\njulia> rpad_constant([1, 2, 3], 2) # no padding if length is already greater than n\n3-element Vector{Int64}:\n 1\n 2\n 3\n\njulia> rpad_constant([1 2; 3 4], 4; dims=1) # padding along the first dimension\n4×2 Matrix{Int64}:\n 1 2\n 3 4\n 0 0\n 0 0 \n\njulia> rpad_constant([1 2; 3 4], 4) # padding along all dimensions by default\n4×2 Matrix{Int64}:\n 1 2\n 3 4\n 0 0\n 0 0 \n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.shuffleobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.shuffleobs","text":"shuffleobs([rng], data)\n\nReturn a \"subset\" of data that spans all observations, but has the order of the observations shuffled.\n\nThe values of data itself are not copied. Instead only the indices are shuffled. 
This function calls obsview to accomplish that, which means that the return value is likely of a different type than data.\n\n# For Arrays the subset will be of type SubArray\n@assert typeof(shuffleobs(rand(4,10))) <: SubArray\n\n# Iterate through all observations in random order\nfor x in eachobs(shuffleobs(X))\n ...\nend\n\nThe optional parameter rng allows one to specify the random number generator used for shuffling. This is useful when reproducible results are desired. By default, uses the global RNG. See Random in Julia's standard library for more info.\n\nFor this function to work, the type of data must implement numobs and getobs. See ObsView for more information.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.splitobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.splitobs","text":"splitobs(n::Int; at) -> Tuple\n\nCompute the indices for two or more disjoint subsets of the range 1:n with splits given by at.\n\nExamples\n\njulia> splitobs(100, at=0.7)\n(1:70, 71:100)\n\njulia> splitobs(100, at=(0.1, 0.4))\n(1:10, 11:50, 51:100)\n\n\n\n\n\nsplitobs(data; at, shuffle=false) -> Tuple\n\nPartition the data into two or more subsets. When at is a number (between 0 and 1) this specifies the proportion in the first subset. When at is a tuple, each entry specifies the proportion in a subset, with the last having 1-sum(at). In all there are length(at)+1 subsets returned.\n\nIf shuffle=true, randomly permute the observations before splitting.\n\nSupports any datatype implementing the numobs and getobs interfaces – including arrays, tuples & NamedTuples of arrays.\n\nExamples\n\njulia> splitobs(permutedims(1:100); at=0.7) # simple 70%-30% split, of a matrix\n([1 2 … 69 70], [71 72 … 99 100])\n\njulia> data = (x=ones(2,10), n=1:10) # a NamedTuple, consistent last dimension\n(x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:10)\n\njulia> splitobs(data, at=(0.5, 0.3)) # a 50%-30%-20% split, e.g. 
train/test/validation\n((x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:5), (x = [1.0 1.0 1.0; 1.0 1.0 1.0], n = 6:8), (x = [1.0 1.0; 1.0 1.0], n = 9:10))\n\njulia> train, test = splitobs((permutedims(1.0:100.0), 101:200), at=0.7, shuffle=true); # split a Tuple\n\njulia> vec(test[1]) .+ 100 == test[2]\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unbatch","page":"Batching Data – MLUtils.jl","title":"MLUtils.unbatch","text":"unbatch(x)\n\nReverse of the batch operation, unstacking the last dimension of the array x.\n\nSee also unstack and chunk.\n\nExamples\n\njulia> unbatch([1 3 5 7;\n 2 4 6 8])\n4-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n [7, 8]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.undersample","page":"Batching Data – MLUtils.jl","title":"MLUtils.undersample","text":"undersample(data, classes; shuffle=true)\n\nGenerate a class-balanced version of data by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) data.\n\nThe convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. 
Defaults to false.\n\nThe output will contain both the resampled data and classes.\n\n# 6 observations with 3 features each\nX = rand(3, 6)\n# 2 classes, severely imbalanced\nY = [\"a\", \"b\", \"b\", \"b\", \"b\", \"a\"]\n\n# subsample the class \"b\" to match \"a\"\nX_bal, Y_bal = undersample(X, Y)\n\n# this results in a smaller dataset\n@assert size(X_bal) == (3,4)\n@assert length(Y_bal) == 4\n\n# now both \"a\", and \"b\" have 2 observations each\n@assert sum(Y_bal .== \"a\") == 2\n@assert sum(Y_bal .== \"b\") == 2\n\nFor this function to work, the type of data must implement numobs and getobs. \n\nNote that if data is a tuple, then it will be assumed that the last element of the tuple contains the targets.\n\njulia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])\n6×3 DataFrames.DataFrame\n│ Row │ X1 │ X2 │ Y │\n├─────┼───────────┼─────────────┼───┤\n│ 1 │ 0.226582 │ 0.0443222 │ a │\n│ 2 │ 0.504629 │ 0.722906 │ b │\n│ 3 │ 0.933372 │ 0.812814 │ b │\n│ 4 │ 0.522172 │ 0.245457 │ b │\n│ 5 │ 0.505208 │ 0.11202 │ b │\n│ 6 │ 0.0997825 │ 0.000341996 │ a │\n\njulia> getobs(undersample(data, data.Y))\n4×3 DataFrame\n Row │ X1 X2 Y \n │ Float64 Float64 Symbol \n─────┼─────────────────────────────\n 1 │ 0.427064 0.0648339 a\n 2 │ 0.376304 0.100022 a\n 3 │ 0.467095 0.185437 b\n 4 │ 0.457043 0.490688 b\n\nSee ObsView for more information on data subsets. See also oversample.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unsqueeze","page":"Batching Data – MLUtils.jl","title":"MLUtils.unsqueeze","text":"unsqueeze(x; dims)\n\nReturn x reshaped into an array one dimensionality higher than x, where dims indicates in which dimension x is extended. 
dims can be an integer between 1 and ndims(x)+1.\n\nSee also flatten, stack.\n\nExamples\n\njulia> unsqueeze([1 2; 3 4], dims=2)\n2×1×2 Array{Int64, 3}:\n[:, :, 1] =\n 1\n 3\n\n[:, :, 2] =\n 2\n 4\n\n\njulia> xs = [[1, 2], [3, 4], [5, 6]]\n3-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n\njulia> unsqueeze(xs, dims=1)\n1×3 Matrix{Vector{Int64}}:\n [1, 2] [3, 4] [5, 6]\n\n\n\n\n\nunsqueeze(; dims)\n\nReturns a function which, acting on an array, inserts a dimension of size 1 at dims.\n\nExamples\n\njulia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size\n(21, 1, 22, 23)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unstack","page":"Batching Data – MLUtils.jl","title":"MLUtils.unstack","text":"unstack(xs; dims)\n\nUnroll the given xs into an array of arrays along the given dimension dims.\n\nSee also stack, unbatch, and chunk.\n\nExamples\n\njulia> unstack([1 3 5 7; 2 4 6 8], dims=2)\n4-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n [7, 8]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.zeros_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.zeros_like","text":"zeros_like(x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also ones_like and fill_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.4005432\n 0.36934233\n\njulia> zeros_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 0.0 0.0 0.0\n 0.0 0.0 0.0\n 0.0 0.0 0.0\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.0695155 0.667979\n 0.558468 0.59903\n\njulia> zeros_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 0.0 0.0\n 0.0 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/training/callbacks/#man-callback-helpers","page":"Callback Helpers","title":"Callback Helpers","text":"","category":"section"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux.throttle","category":"page"},{"location":"reference/training/callbacks/#Flux.throttle","page":"Callback Helpers","title":"Flux.throttle","text":"throttle(f, timeout; leading=true, trailing=false)\n\nReturn a function that when invoked, will only be triggered at most once during timeout seconds.\n\nNormally, the throttled function will run as much as it can, without ever going more than once per wait duration; but if you'd like to disable the execution on the leading edge, pass leading=false. 
To enable execution on the trailing edge, pass trailing=true.\n\nExamples\n\njulia> a = Flux.throttle(() -> println(\"Flux\"), 2);\n\njulia> for i = 1:4 # a called in alternate iterations\n a()\n sleep(1)\n end\nFlux\nFlux\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Patience-Helpers","page":"Callback Helpers","title":"Patience Helpers","text":"","category":"section"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"# create a pseudo-loss that decreases for 4 calls, then starts increasing\n# we call this like loss()\nloss = let t = 0\n () -> begin\n t += 1\n (t - 4) ^ 2\n end\nend\n\n# create an early stopping trigger\n# returns true when the loss increases for two consecutive steps\nes = early_stopping(loss, 2; init_score = 9)\n\n# this will stop at the 6th (4 decreasing + 2 increasing calls) epoch\nfor epoch in 1:10\n es() && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"The keyword argument distance of early_stopping is a function of the form distance(best_score, score). 
By default distance is -, which implies that the monitored metric f is expected to be decreasing and minimized. If you use some increasing metric (e.g. accuracy), you can customize the distance function: (best_score, score) -> score - best_score.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"# create a pseudo-accuracy that increases by 0.01 each time from 0 to 1\n# we call this like acc()\nacc = let v = 0\n () -> v = min(1, v + 0.01)\nend\n\n# create an early stopping trigger for accuracy\nes = early_stopping(acc, 3; distance = (best_score, score) -> score - best_score)\n\n# this will iterate until the 10th epoch\nfor epoch in 1:10\n es() && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"early_stopping and plateau are both built on top of patience. You can use patience to build your own triggers that use a patient counter. For example, if you want to trigger when the loss is below a threshold for several consecutive iterations:","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"threshold(f, thresh, delay) = patience(delay) do\n f() < thresh\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Both predicate in patience and f in early_stopping / plateau can accept extra arguments. 
You can pass such extra arguments to predicate or f through the returned function:","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"trigger = patience((a; b) -> a > b, 3)\n\n# this will iterate until the 10th epoch\nfor epoch in 1:10\n trigger(1; b = 2) && break\nend\n\n# this will stop at the 3rd epoch\nfor epoch in 1:10\n trigger(3; b = 2) && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux.patience\nFlux.early_stopping\nFlux.plateau","category":"page"},{"location":"reference/training/callbacks/#Flux.patience","page":"Callback Helpers","title":"Flux.patience","text":"patience(predicate, wait)\n\nReturn a function that internally counts by one when predicate(...) == true, otherwise the count is reset to zero. If the count is greater than or equal to wait, the function returns true, otherwise it returns false.\n\nExamples\n\njulia> loss() = rand();\n\njulia> trigger = Flux.patience(() -> loss() < 1, 3);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n trigger() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Flux.early_stopping","page":"Callback Helpers","title":"Flux.early_stopping","text":"early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)\n\nReturn a function that internally counts by one when distance(best_score, f(...)) <= min_dist, where best_score is the last seen best value of f(...). If the count is greater than or equal to delay, the function returns true, otherwise it returns false. 
The count is reset when distance(best_score, f(...)) > min_dist.\n\nExamples\n\njulia> loss = let l = 0\n () -> l += 1\n end; # pseudo loss function that returns increasing values\n\njulia> es = Flux.early_stopping(loss, 3);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n es() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Flux.plateau","page":"Callback Helpers","title":"Flux.plateau","text":"plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)\n\nReturn a function that internally counts by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the last value of f(...). If the count is greater than or equal to width, the function returns true, otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.\n\nExamples\n\njulia> f = let v = 10\n () -> v = v / abs(v) - v\n end; # -9, 8, -7, 6, ...\n\njulia> trigger = Flux.plateau(f, 3; init_score=10, min_dist=18);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n trigger() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n[ Info: Epoch 4\n\n\n\n\n\n","category":"function"},{"location":"guide/training/training/#man-training","page":"Training","title":"Training a Flux Model","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Training refers to the process of slowly adjusting the parameters of a model to make it work better. 
Besides the model itself, we will need three things:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"An objective function that evaluates how well a model is doing on some input.\nAn optimisation rule which describes how the model's parameters should be adjusted.\nSome training data to use as the input during this process.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Usually the training data is some collection of examples (or batches of examples) which are handled one-by-one. One epoch of training means that each example is used once, something like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Initialise the optimiser for this model:\nopt_state = Flux.setup(rule, model)\n\nfor data in train_set\n # Unpack this element (for supervised training):\n input, label = data\n\n # Calculate the gradient of the objective\n # with respect to the parameters within the model:\n grads = Flux.gradient(model) do m\n result = m(input)\n loss(result, label)\n end\n\n # Update the parameters so as to reduce the objective,\n # according to the chosen optimisation rule:\n Flux.update!(opt_state, model, grads[1])\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"This loop can also be written using the function train!, but it's helpful to understand the pieces first:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"train!(model, train_set, opt_state) do m, x, y\n loss(m(x), y)\nend","category":"page"},{"location":"guide/training/training/#Model-Gradients","page":"Training","title":"Model Gradients","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"First recall from the section on taking gradients that Flux.gradient(f, a, b) always 
calls f(a, b), and returns a tuple (∂f_∂a, ∂f_∂b). In the code above, the function f passed to gradient is an anonymous function with one argument, created by the do block, hence grads is a tuple with one element. Instead of a do block, we could have written:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"grads = Flux.gradient(m -> loss(m(input), label), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Since the model is some nested set of layers, grads[1] is a similarly nested set of NamedTuples, ultimately containing gradient components. If (for example) θ = model.layers[1].weight[2,3] is one scalar parameter, an entry in a matrix of weights, then the derivative of the loss with respect to it is ∂f_∂θ = grads[1].layers[1].weight[2,3].","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"It is important that the execution of the model takes place inside the call to gradient, in order for the influence of the model's parameters to be observed by Zygote.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"It is also important that every update! step receives a newly computed gradient, as it will change whenever the model's parameters are changed, and for each new data point.","category":"page"},{"location":"guide/training/training/#Loss-Functions","page":"Training","title":"Loss Functions","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The objective function must return a number representing how far the model is from the desired result. 
This is termed the loss of the model.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"This number can be produced by any ordinary Julia code, but this must be executed within the call to gradient. For instance, we could define a function","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"loss(y_hat, y) = sum((y_hat .- y).^2)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"or write this directly inside the do block above. Many commonly used functions, like mse for mean-squared error or crossentropy for cross-entropy loss, are available from the Flux.Losses module.","category":"page"},{"location":"guide/training/training/#Optimisation-Rules","page":"Training","title":"Optimisation Rules","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The simplest kind of optimisation using the gradient is termed gradient descent (or sometimes stochastic gradient descent when, as here, it is not applied to the entire dataset at once).","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Gradient descent needs a learning rate which is a small number describing how fast to walk downhill, usually written as the Greek letter \"eta\", η. This is often described as a hyperparameter, to distinguish it from the parameters which are being updated θ = θ - η * ∂loss_∂θ. 
We want to update all the parameters in the model, like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"η = 0.01 # learning rate\n\n# For each parameter array, update\n# according to the corresponding gradient:\nfmap(model, grads[1]) do p, g\n p .= p .- η .* g\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"A slightly more refined version of this loop to update all the parameters is wrapped up as a function update!(opt_state, model, grads[1]). And the learning rate is the only thing stored in the Descent struct.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"However, there are many other optimisation rules, which adjust the step size and direction in various clever ways. Most require some memory of the gradients from earlier steps, rather than always walking straight downhill – Momentum is the simplest. The function setup creates the necessary storage for this, for a particular model. It should be called once, before training, and returns a tree-like object which is the first argument of update!. Like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Initialise momentum \nopt_state = Flux.setup(Momentum(0.01, 0.9), model)\n\nfor data in train_set\n grads = [...]\n\n # Update both model parameters and optimiser state:\n Flux.update!(opt_state, model, grads[1])\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Many commonly-used optimisation rules, such as Adam, are built-in. These are listed on the optimisers page.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"compat: Implicit-style optimiser state\nThis setup makes another tree-like structure. 
Old versions of Flux did not do this, and instead stored a dictionary-like structure within the optimiser Adam(0.001). This was initialised on first use of the version of update! for \"implicit\" parameters.","category":"page"},{"location":"guide/training/training/#Datasets-and-Batches","page":"Training","title":"Datasets & Batches","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The loop above iterates through train_set, expecting at each step a tuple (input, label). The very simplest such object is a vector of tuples, such as this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"x = randn(28, 28)\ny = rand(10)\ndata = [(x, y)]","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"or data = [(x, y), (x, y), (x, y)] for the same values three times.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Very often, the initial data is large arrays which you need to slice into examples. To produce one iterator of pairs (x, y), you might want zip:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"X = rand(28, 28, 60_000); # many images, each 28 × 28\nY = rand(10, 60_000)\ndata = zip(eachslice(X; dims=3), eachcol(Y))\n\nfirst(data) isa Tuple{AbstractMatrix, AbstractVector} # true","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Here each iteration will use one matrix x (an image, perhaps) and one vector y. It is very common to instead train on batches of such inputs (or mini-batches, the two words mean the same thing) both for efficiency and for better results. 
This can be easily done using the DataLoader:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"data = Flux.DataLoader((X, Y), batchsize=32)\n\nx1, y1 = first(data)\nsize(x1) == (28, 28, 32)\nlength(data) == 1875 === 60_000 ÷ 32","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux's layers are set up to accept such a batch of input data, and the convolutional layers such as Conv require it. The batch index is always the last dimension.","category":"page"},{"location":"guide/training/training/#Training-Loops","page":"Training","title":"Training Loops","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Simple training loops like the one above can be written compactly using the train! function. Including setup, this reads:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(), model)\n\nfor epoch in 1:100\n Flux.train!(model, train_set, opt_state) do m, x, y\n loss(m(x), y)\n end\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Or explicitly writing the anonymous function which this do block creates, train!((m,x,y) -> loss(m(x),y), model, train_set, opt_state) is exactly equivalent.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Real training loops often need more flexibility, and the best way to do this is just to write the loop. This is ordinary Julia code, without any need to work through some callback API. 
Here is an example, in which it may be helpful to note:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The function withgradient is like gradient but also returns the value of the function, for logging or diagnostic use.\nLogging or printing is best done outside of the gradient call, as there is no need to differentiate these commands.\nTo use result for logging purposes, you could change the do block to end with return my_loss(result, label), result, i.e. make the function passed to withgradient return a tuple. The first element is always the loss.\nJulia's break and continue keywords let you exit from parts of the loop.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(), model)\n\nmy_log = []\nfor epoch in 1:100\n losses = Float32[]\n for (i, data) in enumerate(train_set)\n input, label = data\n\n val, grads = Flux.withgradient(model) do m\n # Any code inside here is differentiated.\n # Evaluation of the model and loss must be inside!\n result = m(input)\n my_loss(result, label)\n end\n\n # Save the loss from the forward pass. (Done outside of gradient.)\n push!(losses, val)\n\n # Detect loss of Inf or NaN. 
Print a warning, and then skip update!\n if !isfinite(val)\n @warn \"loss is $val on item $i\" epoch\n continue\n end\n\n Flux.update!(opt_state, model, grads[1])\n end\n\n # Compute some accuracy, and save details as a NamedTuple\n acc = my_accuracy(model, train_set)\n push!(my_log, (; acc, losses))\n\n # Stop training when some criterion is reached\n if acc > 0.95\n println(\"stopping after $epoch epochs\")\n break\n end\nend","category":"page"},{"location":"guide/training/training/#Regularisation","page":"Training","title":"Regularisation","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The term regularisation covers a wide variety of techniques aiming to improve the result of training. This is often done to avoid overfitting.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Some of these can be implemented by simply modifying the loss function. L₂ regularisation (sometimes called ridge regression) adds to the loss a penalty proportional to θ^2 for every scalar parameter. A very simple model could be implemented as follows:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"grads = Flux.gradient(densemodel) do m\n result = m(input)\n penalty = sum(abs2, m.weight)/2 + sum(abs2, m.bias)/2\n my_loss(result, label) + 0.42f0 * penalty\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Accessing each individual parameter array by hand won't work well for large models. 
Instead, we can use Flux.trainables to collect all of them, and then apply a function to each one, and sum the result:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"pen_l2(x::AbstractArray) = sum(abs2, x)/2\n\ngrads = Flux.gradient(model) do m\n result = m(input)\n penalty = sum(pen_l2, Flux.trainables(m))\n my_loss(result, label) + 0.42f0 * penalty\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"However, the gradient of this penalty term is very simple: It is proportional to the original weights. So there is a simpler way to implement exactly the same thing, by modifying the optimiser instead of the loss function. This is done by replacing this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(0.1), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"with this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"decay_opt_state = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux's optimisers are really modifications applied to the gradient before using it to update the parameters, and OptimiserChain applies two such modifications. The first, WeightDecay adds 0.42 times the original parameter to the gradient, matching the gradient of the penalty above (with the same, unrealistically large, constant). After that, in either case, Adam computes the final update.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The same trick works for L₁ regularisation (also called Lasso), where the penalty is pen_l1(x::AbstractArray) = sum(abs, x) instead. 
This is implemented by SignDecay(0.42).","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The same OptimiserChain mechanism can be used for other purposes, such as gradient clipping with ClipGrad or ClipNorm.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Besides L1 / L2 / weight decay, another common and quite different kind of regularisation is provided by the Dropout layer. This turns off some outputs of the previous layer during training. It should switch automatically, but see trainmode! / testmode! to manually enable or disable this layer.","category":"page"},{"location":"guide/training/training/#Learning-Rate-Schedules","page":"Training","title":"Learning Rate Schedules","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"For finer control of training, you may wish to alter the learning rate partway through. This can be done with adjust!, like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(0.1), model) # initialise once\n\nfor epoch in 1:1000\n train!([...], opt_state) # Train with η = 0.1 for first 100,\n if epoch == 100 # then change to use η = 0.01 for the rest.\n Flux.adjust!(opt_state, 0.01)\n end\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Other hyper-parameters can also be adjusted, such as Flux.adjust!(opt_state, beta = (0.8, 0.99)). Such modifications can also be applied to just one part of the model. 
For instance, this sets a different learning rate for the encoder and the decoder:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Consider some model with two parts:\nbimodel = Chain(enc = [...], dec = [...])\n\n# This returns a tree whose structure matches the model:\nopt_state = Flux.setup(Adam(0.02), bimodel)\n\n# Adjust the learning rate to be used for bimodel.layers.enc\nFlux.adjust!(opt_state.layers.enc, 0.03)","category":"page"},{"location":"guide/training/training/#Freezing-layer-parameters","page":"Training","title":"Freezing layer parameters","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"To completely disable training of some part of the model, use freeze!. This is a temporary modification, reversed by thaw!:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux.freeze!(opt_state.layers.enc)\n\n# Now training won't update parameters in bimodel.layers.enc\ntrain!(loss, bimodel, data, opt_state)\n\n# Un-freeze the entire model:\nFlux.thaw!(opt_state)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"While adjust! and freeze!/thaw! 
make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, for example by calling @layer NewLayer trainable=(weight,).","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/activation/#man-activations","page":"Activation Functions","title":"Activation Functions from NNlib.jl","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"These non-linearities, used between the layers of your model, are exported by the NNlib package.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call σ.(xs), relu.(xs) and so on. Alternatively, they can be passed to a layer like Dense(784 => 1024, relu) which will handle this broadcasting.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Functions like softmax are sometimes described as activation functions, but not by Flux. They must see all the outputs, and hence cannot be broadcast. 
See the next page for details.","category":"page"},{"location":"reference/models/activation/#Alphabetical-Listing","page":"Activation Functions","title":"Alphabetical Listing","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"celu\nelu\ngelu\nhardsigmoid\nhardswish\nhardtanh\nleakyrelu\nlisht\nlogcosh\nlogsigmoid\nmish\nrelu\nrelu6\nrrelu\nselu\nsigmoid\nsigmoid_fast\nsoftplus\nsoftshrink\nsoftsign\nswish\ntanhshrink\ntanh_fast\ntrelu","category":"page"},{"location":"reference/models/activation/#NNlib.celu","page":"Activation Functions","title":"NNlib.celu","text":"celu(x, α=1) = x ≥ 0 ? x : α * (exp(x/α) - 1)\n\nActivation function from \"Continuously Differentiable Exponential Linear Units\".\n\njulia> lineplot(celu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ celu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠔⠒⠋⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠤⠤⠤⠤⠔⠒⠒⠒⠊⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> celu(-10f0)\n-0.9999546f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.elu","page":"Activation Functions","title":"NNlib.elu","text":"elu(x, α=1) = x > 0 ? x : α * (exp(x) - 1)\n\nExponential Linear Unit activation function. See \"Fast and Accurate Deep Network Learning by Exponential Linear Units\". You can also specify the coefficient explicitly, e.g. 
elu(x, 1).\n\njulia> lineplot(elu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ elu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠔⠒⠋⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠤⠤⠤⠤⠔⠒⠒⠒⠊⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> elu(-10f0)\n-0.9999546f0\n\njulia> elu(-10f0, 2)\n-1.9999092f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.gelu","page":"Activation Functions","title":"NNlib.gelu","text":"gelu(x) = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))\n\nActivation function from \"Gaussian Error Linear Units\".\n\njulia> lineplot(gelu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊│ gelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡠⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣤⣤⣤⣤⣤⣤⣤⣤⡤⠤⠤⠤⠤⠤⠤⠤⣤⣤⣤⡤⡧⠶⠶⠭⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠉⠉⠉⠉⠉⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(gelu, -5, 0, height=7);\n\njulia> lineplot!(ans, swish)\n ┌────────────────────────────────────────┐ \n 0 │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠒⠒⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸│ gelu(x) \n │⠑⠒⠢⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇│ swish(x)\n │⠀⠀⠀⠀⠀⠈⠉⠒⠤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠁│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠒⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⢠⡇⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⣄⠀⠀⠀⠀⠀⢠⡞⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⢄⣀⣀⡤⢣⠃⠀⠀│ \n -0.2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⠇⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardsigmoid","page":"Activation Functions","title":"NNlib.hardsigmoid","text":"hardσ(x) = max(0, min(1, (x + 3) / 6))\n\nPiecewise linear approximation of sigmoid.\n\njulia> lineplot(hardsigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠉⠉⠉⠉⠉⠉⠉⠉│ hardσ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡠⠔⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⡗⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠋⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⠤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(sigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠒⠒⠋⠉⠉⠉⠉⠉⠉│ σ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡔⠋⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠊⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardswish","page":"Activation Functions","title":"NNlib.hardswish","text":"hardswish(x) = x * hardσ(x)\n\nHard-Swish activation function. 
See \"Searching for MobileNetV3\".\n\njulia> lineplot(hardswish, -2, 5, height = 7)\n ┌────────────────────────────────────────┐ \n 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠒⠉│ hardswish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠒⠉⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣤⣤⣖⣚⣉⣁⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀│ \n -1 │⠉⠒⠒⠒⠒⠉⠉⠉⠉⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(hardswish, -4, 0, height = 7);\n\njulia> lineplot!(ans, swish)\n ┌────────────────────────────────────────┐ \n 0 │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⢣⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡜│ hardswish(x)\n │⠒⠒⠢⠤⢄⣀⡀⠀⠀⠀⠀⠱⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠎⠀│ swish(x) \n │⠀⠀⠀⠀⠀⠀⠈⠉⠑⠒⠦⢄⣘⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠃⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠑⡖⠦⢄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⢔⠏⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠣⣄⠀⠉⠑⠒⠦⠤⢄⣀⣀⣀⣀⡠⠤⠖⣊⠕⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⠤⡀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀│ \n -0.4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠒⠢⠤⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-4⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> hardswish.(-5:5)'\n1×11 adjoint(::Vector{Float64}) with eltype Float64:\n -0.0 -0.0 -0.0 -0.333333 -0.333333 0.0 0.666667 1.66667 3.0 4.0 5.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardtanh","page":"Activation Functions","title":"NNlib.hardtanh","text":"hardtanh(x) = max(-1, min(1, x))\n\nSegment-wise linear approximation of tanh, much cheaper to compute. 
See \"Large Scale Machine Learning\".\n\nSee also tanh_fast.\n\njulia> lineplot(hardtanh, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⠔⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ hardtanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡷⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠋⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⠔⠋⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x\n\njulia> lineplot(tanh, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠤⠒⠒⠒⠊⠉⠉⠉│ tanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡷⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠊⠁⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⡠⠤⠤⠤⠖⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.leakyrelu","page":"Activation Functions","title":"NNlib.leakyrelu","text":"leakyrelu(x, a=0.01) = max(a*x, x)\n\nLeaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. 
leakyrelu(x, 0.01).\n\njulia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ #42(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⣤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠤⠒⠒⠋⠉⠁⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⠤⠤⠒⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> leakyrelu(-10f0, 0.2)\n-2.0f0\n\njulia> leakyrelu(-10f0, 0.02)\n-0.2f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.lisht","page":"Activation Functions","title":"NNlib.lisht","text":"lisht(x) = x * tanh(x)\n\nActivation function from \"LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ...\"\n\njulia> lineplot(lisht, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x)\n │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│ \n │⠀⠀⠀⠀⠈⠣⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠊⠁⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⢄⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡠⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⠦⣄⣀⣀⣇⣀⣀⠤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, logcosh)\n ┌────────────────────────────────────────┐ \n 2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x) \n │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│ logcosh(x)\n │⠢⣄⠀⠀⠈⠣⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⠀⠀⣀⠔│ \n f(x) │⠀⠈⠑⠢⣀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠊⠁⠀⣀⠔⠊⠁⠀│ \n │⠀⠀⠀⠀⠀⠉⠢⢄⡀⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⠔⠋⠀⡠⠔⠋⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠦⣌⡓⢄⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡠⠖⣁⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠪⠷⣦⣄⣀⣀⣇⣀⣀⣤⠶⠕⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.logcosh","page":"Activation Functions","title":"NNlib.logcosh","text":"logcosh(x)\n\nReturn log(cosh(x)) which is computed in a numerically stable way.\n\njulia> lineplot(logcosh, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 5 │⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ logcosh(x)\n │⠉⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ \n │⠀⠀⠀⠑⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠑⠦⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⠦⡀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⠦⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠑⠢⢄⣀⣀⣇⣀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.logsigmoid","page":"Activation Functions","title":"NNlib.logsigmoid","text":"logσ(x)\n\nReturn log(σ(x)) which is computed in a numerically stable way.\n\njulia> lineplot(logsigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡧⠤⠔⠒⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ logσ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⢀⡤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⡤⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.mish","page":"Activation Functions","title":"NNlib.mish","text":"mish(x) = x * tanh(softplus(x))\n\nActivation function from \"Mish: A Self Regularized Non-Monotonic Neural Activation Function\".\n\njulia> 
lineplot(mish, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋│ mish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠔⠋⠁⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣧⣔⣊⣁⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.relu","page":"Activation Functions","title":"NNlib.relu","text":"relu(x) = max(0, x)\n\nRectified Linear Unit activation function.\n\njulia> lineplot(relu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ relu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⡠⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.relu6","page":"Activation Functions","title":"NNlib.relu6","text":"relu6(x) = min(max(0, x), 6)\n\nRectified Linear Unit activation function capped at 6. 
See \"Convolutional Deep Belief Networks\" on CIFAR-10.\n\njulia> lineplot(relu6, -10, 10, height=7)\n ┌────────────────────────────────────────┐ \n 6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠎⠉⠉⠉⠉⠉⠉⠉⠉│ relu6(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡤⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⡠⠎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡔⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⡧⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-10⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.rrelu","page":"Activation Functions","title":"NNlib.rrelu","text":"rrelu(x, lo=1/8, hi=1/3) = max(a*x, x)\n# where `a` is randomly sampled from uniform distribution `U(lo, hi)`\n\nRandomized Leaky Rectified Linear Unit activation function. See \"Empirical Evaluation of Rectified Activations\". You can also specify the bounds explicitly, e.g. rrelu(x, 0.0, 1.0).\n\njulia> lineplot(rrelu, -20, 10, height=7)\n ┌────────────────────────────────────────┐ \n 10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ rrelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⠤⣤⣤⢤⣤⣤⠤⠤⠤⢼⠮⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⣰⢀⣆⡄⣄⡄⡠⡰⠦⠷⡜⢢⠷⠳⠢⠊⠉⠉⠀⠀⠁⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠃⠉⠙⠘⠃⠈⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-20⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> extrema(rrelu.(fill(-10f0, 1000)))\n(-3.3316886f0, -1.2548422f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.selu","page":"Activation Functions","title":"NNlib.selu","text":"selu(x) = λ * (x ≥ 0 ? x : α * (exp(x) - 1))\n\nλ ≈ 1.05070...\nα ≈ 1.67326...\n\nScaled exponential linear units. 
See \"Self-Normalizing Neural Networks\".\n\njulia> lineplot(selu, -3, 2, height=7)\n ┌────────────────────────────────────────┐ \n 3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ selu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⠒│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⣀⠤⠖⠊⠉⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⡠⠤⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⣉⠭⠛⡏⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⡤⠤⠒⠊⠉⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠤⠤⠖⠒⠒⠒⠒⠒⠒⠒⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> selu(-10f0)\n-1.7580194f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.sigmoid","page":"Activation Functions","title":"NNlib.sigmoid","text":"σ(x) = 1 / (1 + exp(-x))\n\nClassic sigmoid activation function. Unicode σ can be entered as \\sigma then tab, in many editors. The ascii name sigmoid is also exported.\n\nSee also sigmoid_fast.\n\njulia> using UnicodePlots\n\njulia> lineplot(sigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠒⠒⠋⠉⠉⠉⠉⠉⠉│ σ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡔⠋⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠊⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> sigmoid === σ\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.sigmoid_fast","page":"Activation Functions","title":"NNlib.sigmoid_fast","text":"sigmoid_fast(x)\n\nThis is a faster, and very slightly less accurate, version of sigmoid. 
For x::Float32, perhaps 3 times faster, and maximum errors 2 eps instead of 1.\n\nSee also tanh_fast.\n\njulia> sigmoid(0.2f0)\n0.54983395f0\n\njulia> sigmoid_fast(0.2f0)\n0.54983395f0\n\njulia> hardσ(0.2f0)\n0.53333336f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softplus","page":"Activation Functions","title":"NNlib.softplus","text":"softplus(x) = log(exp(x) + 1)\n\nSee \"Deep Sparse Rectifier Neural Networks\", JMLR 2011.\n\njulia> lineplot(softplus, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠊⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡠⠤⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡧⠤⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⡠⠤⠤⠤⠤⠔⠒⠒⠚⠉⠉⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, relu)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠│ relu(x) \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡴⠞⠋⠁│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣤⡴⠞⠋⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡠⢤⡲⠝⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡧⠤⠒⠊⣉⠥⠚⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣠⣤⣤⣤⣤⣔⣒⣒⣚⣉⣉⣁⣀⣇⠴⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> softplus(16f0)\n16.0f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softshrink","page":"Activation Functions","title":"NNlib.softshrink","text":"softshrink(x, λ=0.5) =\n (x ≥ λ ? x - λ : (-λ ≥ x ? 
x + λ : 0))\n\nSee \"Softshrink Activation Function\".\n\njulia> lineplot(softshrink, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠔⠒⠉⠁│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⠒⠋⠁⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⡤⠤⠤⠤⠤⠤⠤⡧⠤⠤⠤⠤⠶⠮⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⢀⣀⠤⠖⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⣀⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, tanhshrink)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠔⠒⣉⡡│ tanhshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⣒⣋⠥⠤⠒⠊⠉⠁⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⣤⣤⣤⡤⠤⠤⠤⠤⠤⠤⡷⠶⠶⠶⠶⠶⠾⠿⠯⠭⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⢀⣀⡠⠤⠖⢒⣋⠭⠗⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠊⣉⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀\n\njulia> softshrink.((-10f0, 10f0))\n(-9.5f0, 9.5f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softsign","page":"Activation Functions","title":"NNlib.softsign","text":"softsign(x) = x / (1 + |x|)\n\nSee \"Quadratic Polynomials Learn Better Image Features\" (2009).\n\njulia> lineplot(softsign, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⣀⣀⠤⠤⠤⠤⠤│ softsign(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡔⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠤⠤⠒⠋⠁⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠒⠒⠒⠒⠒⠊⠉⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, tanh)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡤⠖⠊⠉⠉⠉⣉⣉⣉⣉⣉⠭⠭⠭⠭⠭│ softsign(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡔⣃⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanh(x) \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣧⡞⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡴⠃⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠤⠤⠒⢋⠕⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣒⣒⣒⣒⣒⣊⣉⣉⣉⣉⣁⣀⣀⡠⠤⠒⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> softsign(1f0)\n0.5f0\n\njulia> softsign(100f0)\n0.990099f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.swish","page":"Activation Functions","title":"NNlib.swish","text":"swish(x) = x * σ(x)\n\nSelf-gated activation function. See \"Swish: a Self-Gated Activation Function\".\n\njulia> lineplot(swish, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤│ swish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋⠁⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⣀⡤⠔⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⣤⣤⡤⡧⠴⠶⠯⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠉⠑⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠉⠉⠉⠉⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.tanhshrink","page":"Activation Functions","title":"NNlib.tanhshrink","text":"tanhshrink(x) = x - tanh(x)\n\nSee \"Tanhshrink Activation Function\".\n\njulia> lineplot(tanhshrink, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanhshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠊│ \n 
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⣀⡠⠤⠒⠊⠉⠁⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⣤⡤⠤⠤⠤⠤⠤⠤⡷⠶⠶⠶⠶⠶⠮⠭⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⣀⡠⠴⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⡠⠴⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> tanhshrink.((-10f0, 10f0))\n(-9.0f0, 9.0f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.tanh_fast","page":"Activation Functions","title":"NNlib.tanh_fast","text":"tanh_fast(x)\n\nThis is a faster but slightly less accurate version of tanh.\n\nWhere Julia's tanh function has an error under 2 eps, this may be wrong by 5 eps, a reduction by less than one decimal digit. \n\nFor x::Float32 this is usually about 10 times faster, with a smaller speedup for x::Float64. For any other number types, it just calls tanh.\n\nSee also sigmoid_fast.\n\njulia> tanh(0.5f0)\n0.46211717f0\n\njulia> tanh_fast(0.5f0)\n0.46211714f0\n\njulia> hardtanh(0.5f0)\n0.5f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.trelu","page":"Activation Functions","title":"NNlib.trelu","text":"trelu(x, theta=1) = x > theta ? x : 0\n\nThreshold gated rectified linear activation function. 
See \"Zero-bias autoencoders and the benefits of co-adapting features\"\n\njulia> lineplot(trelu, -2, 4, height=7)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ trelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣠⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣀⣀⣀⣀⣀⣀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀4⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#One-More","page":"Activation Functions","title":"One More","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Julia's Base.Math also provides tanh, which can be used as an activation function.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Note that many Flux layers will automatically replace this with NNlib.tanh_fast when called, as Base's tanh is slow enough to sometimes be a bottleneck.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"julia> using UnicodePlots\n\njulia> lineplot(tanh, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⣀⠤⠔⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉│ tanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡰⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⡤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠎⠁⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠊⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⡤⠤⠔⠒⠉⠁⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n 
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ ","category":"page"},{"location":"ecosystem/#The-Julia-Ecosystem-around-Flux","page":"Ecosystem","title":"The Julia Ecosystem around Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.","category":"page"},{"location":"ecosystem/#Flux-models","page":"Ecosystem","title":"Flux models","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Flux's model-zoo contains examples from many domains.","category":"page"},{"location":"ecosystem/#Computer-vision","page":"Ecosystem","title":"Computer vision","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"ObjectDetector.jl provides ready-to-go image detection via YOLO.\nMetalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.\nUNet.jl is a generic UNet implementation.","category":"page"},{"location":"ecosystem/#Natural-language-processing","page":"Ecosystem","title":"Natural language processing","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.\nTextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.","category":"page"},{"location":"ecosystem/#Reinforcement-learning","page":"Ecosystem","title":"Reinforcement 
learning","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"AlphaZero.jl provides a generic, simple and fast implementation of DeepMind's AlphaZero algorithm.\nReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.","category":"page"},{"location":"ecosystem/#Graph-learning","page":"Ecosystem","title":"Graph learning","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.\nGeometricFlux.jl is the first graph neural network library for Julia.\nNeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.\nSeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.","category":"page"},{"location":"ecosystem/#Time-series","page":"Ecosystem","title":"Time series","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"FluxArchitectures.jl is a collection of advanced network architectures for time series forecasting.","category":"page"},{"location":"ecosystem/#Robust-networks","page":"Ecosystem","title":"Robust networks","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Tools-closely-associated-with-Flux","page":"Ecosystem","title":"Tools closely associated with Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Utility tools you're unlikely to have met if you never used 
Flux!","category":"page"},{"location":"ecosystem/#High-level-training-flows","page":"Ecosystem","title":"High-level training flows","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"FastAI.jl is a Julia port of Python's fast.ai library.\nFluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl.\nIgnite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.\nTsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.","category":"page"},{"location":"ecosystem/#Datasets","page":"Ecosystem","title":"Datasets","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Commonly used machine learning datasets are provided by the following packages in the Julia ecosystem:","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"MLDatasets.jl focuses on downloading, unpacking, and accessing benchmark datasets.\nGraphMLDatasets.jl: a library for machine learning datasets on graphs.","category":"page"},{"location":"ecosystem/#Plumbing","page":"Ecosystem","title":"Plumbing","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Tools to put data into the right order for creating a model.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Augmentor.jl is a real-time image augmentation library for increasing the number of training images.\nDataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, 
keypoints and segmentation masks.\nMLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.","category":"page"},{"location":"ecosystem/#Parameters","page":"Ecosystem","title":"Parameters","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"ParameterSchedulers.jl standard scheduling policies for machine learning.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Differentiable-programming","page":"Ecosystem","title":"Differentiable programming","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Packages based on differentiable programming but not necessarily related to Machine Learning. ","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.\nDiffEqFlux.jl provides tools for creating Neural Differential Equations.\nFlux3D.jl shows off machine learning on 3D data.\nRayTracer.jl combines ML with computer vision via a differentiable renderer.\nDuckietown.jl Differentiable Duckietown simulator.\nThe Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.\nAtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.\nDiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.","category":"page"},{"location":"ecosystem/#Probabilistic-programming","page":"Ecosystem","title":"Probabilistic programming","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.\nOmega.jl is a research project aimed at causal, higher-order 
probabilistic programming.\nStheno.jl provides flexible Gaussian processes.","category":"page"},{"location":"ecosystem/#Statistics","page":"Ecosystem","title":"Statistics","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"OnlineStats.jl provides single-pass algorithms for statistics.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Useful-miscellaneous-packages","page":"Ecosystem","title":"Useful miscellaneous packages","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Some useful and random packages!","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.\nMill.jl helps to prototype flexible multi-instance learning models.\nMLMetrics.jl is a utility for scoring models in data science and machine learning.\nTorch.jl exposes torch in Julia.\nValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.\nInvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.\nProgressMeter.jl progress meters for long-running computations.\nTensorBoardLogger.jl easy peasy logging to tensorboard in Julia\nArgParse.jl is a package for parsing command-line arguments to Julia programs.\nParameters.jl types with default field values, keyword constructors and (un-)pack macros.\nBSON.jl is a package for working with the Binary JSON serialisation format.\nDataFrames.jl in-memory tabular data in Julia.\nDrWatson.jl is a scientific project assistant 
software.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Alternatives-to-Flux","page":"Ecosystem","title":"Alternatives to Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Julia has several other libraries for making neural networks. ","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl) \nKnet.jl is a neural network library built around AutoGrad.jl.\nLux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"compat: Explicit or explicit?\nFlux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) 
Lux also uses Zygote, but uses the word \"explicit\" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/functors/#Recursive-transformations-from-Functors.jl","page":"Nested Structures – Functors.jl","title":"Recursive transformations from Functors.jl","text":"","category":"section"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux models are deeply nested structures, and Functors.jl provides tools needed to explore such objects, apply functions to the parameters they contain (e.g. for moving them to gpu), and re-build them.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"compat: Flux ≤ v0.14\nAll layers were previously defined with the Functors.@functor macro. This still works, but it is recommended that you use the new Flux.@layer macro instead. Both allow Flux.setup to see the parameters inside, and gpu to move them to the GPU, but Flux.@layer also overloads printing, and offers a way to define trainable at the same time.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"compat: Functors v0.5\nWith Functors.jl v0.5, which is required by Flux v0.15 and later, every custom type is a functor by default. 
This means that applying Flux.@layer to a type is no longer strictly necessary, but it is still recommended for additional features like pretty-printing.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Functors.jl has its own notes on basic usage for more details. Additionally, the Advanced Model Building and Customisation page covers the use cases of Functors in greater detail.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux.@layer\nFunctors.@leaf\nFunctors.@functor\nFunctors.fmap\nFunctors.fmap_with_path\nFunctors.isleaf\nFunctors.children\nFunctors.fcollect\nFunctors.functor\nFunctors.fmapstructure\nFunctors.fmapstructure_with_path\nFunctors.execute\nFunctors.AbstractWalk\nFunctors.ExcludeWalk\nFunctors.CachedWalk","category":"page"},{"location":"reference/models/functors/#Flux.@layer","page":"Nested Structures – Functors.jl","title":"Flux.@layer","text":"@layer [showtype] MyModel [trainable=(field1,...)]\n\nThis macro adds convenience functionality to a custom type to serve as a neural network layer, as a module, or as an entire model.\n\nThe optional keyword trainable allows you to specify which fields of your model can be trained, instead of assuming all fieldnames(MyModel) to be trainable. Note that it is never necessary to tell Flux to ignore non-array objects such as functions or sizes. This can also be done by defining trainable(::MyModel) for your type.\n\nThe macro also handles overloads of the 3-arg show(::IO, ::MIME\"text/plain\", ::MyModel) for pretty printing. 
The optional argument showtype can take any of the following values:\n\n:expand (default): This will expand the representation of container types like Chain, while maintaining a compact representation of types like Dense containing only arrays.\n:noexpand: This is to be used in case your type contains other layers but you want to keep the representation simple.\n:ignore: To opt out of the pretty printing.\n\nYou probably still want to define 2-arg show(::IO, ::MyModel); the macro does not touch this.\n\nNote that re-running the macro with different options may not remove all methods; you will need to restart.\n\nExample\n\njulia> struct Trio; a; b; c end\n\njulia> tri = Trio(Dense([1.1 2.2], [0.0], tanh), Dense(hcat(3.3), false), Dropout(0.4))\nTrio(Dense(2 => 1, tanh), Dense(1 => 1; bias=false), Dropout(0.4))\n\njulia> Flux.@layer Trio\n\njulia> tri # now the layer is printed like Chain\nTrio(\n Dense(2 => 1, tanh), # 3 parameters\n Dense(1 => 1; bias=false), # 1 parameters\n Dropout(0.4),\n) # Total: 3 arrays, 4 parameters, 240 bytes.\n\njulia> Flux.@layer :noexpand Trio trainable=(a,b)\n\njulia> tri # now the layer is printed compactly\nTrio(Dense(2 => 1, tanh), Dense(1 => 1; bias=false), Dropout(0.4)) # 4 parameters\n\njulia> opt_state = Flux.setup(Adam(), tri); # `c` is not in the optimizer state\n\nThe macro also adds methods to make using Flux with Enzyme easier.\n\nDuplicated(m::Layer) allocates a copy for the gradient (initially zero).\nThis is made callable, (m::Duplicated{<:Layer})(x...) 
= m.val(x...)\nPretty printing for show(io, mime, ::Duplicated{<:Layer})\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.@leaf","page":"Nested Structures – Functors.jl","title":"Functors.@leaf","text":"@leaf T\n\nDefine functor for the type T so that isleaf(x::T) == true.\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.@functor","page":"Nested Structures – Functors.jl","title":"Functors.@functor","text":"@functor T\n@functor T (x,)\n\nAdds methods to functor allowing recursion into objects of type T, and reconstruction. Assumes that T has a constructor accepting all of its fields, which is true unless you have provided an inner constructor which does not.\n\nBy default all fields of T are considered children; this can be restricted by providing a tuple of field names.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> Functors.children(Foo(1,2))\n(x = 1, y = 2)\n\njulia> _, re = Functors.functor(Foo(1,2));\n\njulia> re((10, 20))\nFoo(10, 20)\n\njulia> @functor Foo # same as before, nothing changes\n\njulia> struct TwoThirds a; b; c; end\n\njulia> @functor TwoThirds (a, c)\n\njulia> ch2, re3 = Functors.functor(TwoThirds(10,20,30));\n\njulia> ch2\n(a = 10, c = 30)\n\njulia> re3((\"ten\", \"thirty\"))\nTwoThirds(\"ten\", 20, \"thirty\")\n\njulia> fmap(x -> 10x, TwoThirds(Foo(1,2), Foo(3,4), 56))\nTwoThirds(Foo(10, 20), Foo(3, 4), 560)\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.fmap","page":"Nested Structures – Functors.jl","title":"Functors.fmap","text":"fmap(f, x, ys...; exclude = Functors.isleaf, walk = Functors.DefaultWalk(), [prune])\n\nA structure and type preserving map.\n\nBy default it transforms every leaf node (identified by exclude, default isleaf) by applying f, and otherwise traverses x recursively using functor. Optionally, it may also be associated with objects ys with the same tree structure. 
In that case, f is applied to the corresponding leaf nodes in x and ys.\n\nSee also fmap_with_path and fmapstructure.\n\nExamples\n\njulia> fmap(string, (x=1, y=(2, 3)))\n(x = \"1\", y = (\"2\", \"3\"))\n\njulia> nt = (a = [1,2], b = [23, (45,), (x=6//7, y=())], c = [8,9]);\n\njulia> fmap(println, nt)\n[1, 2]\n23\n45\n6//7\n()\n[8, 9]\n(a = nothing, b = Any[nothing, (nothing,), (x = nothing, y = nothing)], c = nothing)\n\njulia> fmap(println, nt; exclude = x -> x isa Array)\n[1, 2]\nAny[23, (45,), (x = 6//7, y = ())]\n[8, 9]\n(a = nothing, b = nothing, c = nothing)\n\njulia> twice = [1, 2]; # println only acts once on this\n\njulia> fmap(println, (i = twice, ii = 34, iii = [5, 6], iv = (twice, 34), v = 34.0))\n[1, 2]\n34\n[5, 6]\n34\n34.0\n(i = nothing, ii = nothing, iii = nothing, iv = (nothing, nothing), v = nothing)\n\njulia> d1 = Dict(\"x\" => [1,2], \"y\" => 3);\n\njulia> d2 = Dict(\"x\" => [4,5], \"y\" => 6, \"z\" => \"an_extra_value\");\n\njulia> fmap(+, d1, d2) == Dict(\"x\" => [5, 7], \"y\" => 9) # Note that \"z\" is ignored\ntrue\n\nMutable objects which appear more than once are only handled once (by caching f(x) in an IdDict). Thus the relationship x.i === x.iv[1] will be preserved. An immutable object which appears twice is not stored in the cache, thus f(34) will be called twice, and the results will agree only if f is pure.\n\nBy default, almost all container-like types have children to recurse into. 
Arrays of numbers do not.\n\nTo opt out of recursion for custom types use @leaf or pass a custom exclude function.\n\njulia> struct Foo; x; y; end\n\njulia> struct Bar; x; end\n\njulia> m = Foo(Bar([1,2,3]), (4, 5, Bar(Foo(6, 7))));\n\njulia> fmap(x -> 10x, m)\nFoo(Bar([10, 20, 30]), (40, 50, Bar(Foo(60, 70))))\n\njulia> fmap(string, m)\nFoo(Bar(\"[1, 2, 3]\"), (\"4\", \"5\", Bar(Foo(\"6\", \"7\"))))\n\njulia> fmap(string, m, exclude = v -> v isa Bar)\nFoo(\"Bar([1, 2, 3])\", (4, 5, \"Bar(Foo(6, 7))\"))\n\nTo recurse into custom types without reconstructing them afterwards, use fmapstructure.\n\nFor advanced customization of the traversal behaviour, pass a custom walk function that subtypes Functors.AbstractWalk. The call fmap(f, x, ys...; walk = mywalk) will wrap mywalk in ExcludeWalk then CachedWalk. Here, ExcludeWalk is responsible for applying f at excluded nodes. For a low-level interface for executing a user-constructed walk, see execute.\n\njulia> struct MyWalk <: Functors.AbstractWalk end\n\njulia> (::MyWalk)(recurse, x) = x isa Bar ? \"hello\" :\n Functors.DefaultWalk()(recurse, x)\n\njulia> fmap(x -> 10x, m; walk = MyWalk())\nFoo(\"hello\", (40, 50, \"hello\"))\n\nThe behaviour when the same node appears twice can be altered by giving a value to the prune keyword, which is then used in place of all but the first:\n\njulia> twice = [1, 2];\n\njulia> fmap(float, (x = twice, y = [1,2], z = twice); prune = missing)\n(x = [1.0, 2.0], y = [1.0, 2.0], z = missing)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmap_with_path","page":"Nested Structures – Functors.jl","title":"Functors.fmap_with_path","text":"fmap_with_path(f, x, ys...; exclude = isleaf, walk = DefaultWalkWithPath(), [prune])\n\nLike fmap, but also passes a KeyPath to f for each node in the recursion. The KeyPath is a tuple of the indices used to reach the current node from the root of the recursion. 
The KeyPath is constructed by the walk function, and can be used to reconstruct the path to the current node from the root of the recursion.\n\nf has to accept two arguments: the associated KeyPath and the value of the current node.\n\nexclude also receives the KeyPath as its first argument and a node as its second. It should return true if the recursion should stop at that node, so that f is applied to the node itself rather than to its children.\n\nprune is used to control the behaviour when the same node appears twice; see fmap for more information.\n\nExamples\n\njulia> x = ([1, 2, 3], 4, (a=5, b=Dict(\"A\"=>6, \"B\"=>7), c=Dict(\"C\"=>8, \"D\"=>9)));\n\njulia> exclude(kp, x) = kp == KeyPath(3, :c) || Functors.isleaf(x);\n\njulia> fmap_with_path((kp, x) -> x isa Dict ? nothing : x.^2, x; exclude = exclude)\n([1, 4, 9], 16, (a = 25, b = Dict(\"B\" => 49, \"A\" => 36), c = nothing))\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.isleaf","page":"Nested Structures – Functors.jl","title":"Functors.isleaf","text":"isleaf(x)\n\nReturn true if x has no children according to functor.\n\nExamples\n\njulia> Functors.isleaf(1)\ntrue\n\njulia> Functors.isleaf([2, 3, 4])\ntrue\n\njulia> Functors.isleaf([\"five\", [6, 7]])\nfalse\n\njulia> Functors.isleaf([])\nfalse\n\njulia> Functors.isleaf((8, 9))\nfalse\n\njulia> Functors.isleaf(())\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.children","page":"Nested Structures – Functors.jl","title":"Functors.children","text":"children(x)\n\nReturn the children of x as defined by functor. 
Equivalent to functor(x)[1].\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fcollect","page":"Nested Structures – Functors.jl","title":"Functors.fcollect","text":"fcollect(x; exclude = v -> false)\n\nTraverse x by recursing each child of x as defined by functor and collecting the results into a flat array, ordered by a breadth-first traversal of x, respecting the iteration order of children calls.\n\nDoesn't recurse inside branches rooted at nodes v for which exclude(v) == true. In such cases, the root v is also excluded from the result. By default, exclude always yields false.\n\nSee also children.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> struct Bar; x; end\n\njulia> struct TypeWithNoChildren; x; y; end\n\njulia> @leaf TypeWithNoChildren\n\njulia> m = Foo(Bar([1,2,3]), TypeWithNoChildren(:a, :b))\nFoo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n\njulia> fcollect(m)\n4-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n Bar([1, 2, 3])\n [1, 2, 3]\n TypeWithNoChildren(:a, :b)\n\njulia> fcollect(m, exclude = v -> v isa Bar)\n2-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n TypeWithNoChildren(:a, :b)\n\njulia> fcollect(m, exclude = v -> Functors.isleaf(v))\n2-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n Bar([1, 2, 3])\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.functor","page":"Nested Structures – Functors.jl","title":"Functors.functor","text":"functor(x)\nfunctor(typeof(x), x)\n\nReturns a tuple containing, first, a NamedTuple of the children of x (typically its fields), and second, a reconstruction function. 
This controls the behaviour of fmap.\n\nMethods should be added to functor(::Type{T}, x) for custom types, usually using the macro @functor.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmapstructure","page":"Nested Structures – Functors.jl","title":"Functors.fmapstructure","text":"fmapstructure(f, x, ys...; exclude = isleaf, [prune])\n\nLike fmap, but doesn't preserve the type of custom structs. Instead, it returns a NamedTuple (or a Tuple, or an array), or a nested set of these.\n\nUseful for when the output must not contain custom structs.\n\nSee also fmap and fmapstructure_with_path.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> m = Foo([1,2,3], [4, (5, 6), Foo(7, 8)]);\n\njulia> fmapstructure(x -> 2x, m)\n(x = [2, 4, 6], y = Any[8, (10, 12), (x = 14, y = 16)])\n\njulia> fmapstructure(println, m)\n[1, 2, 3]\n4\n5\n6\n7\n8\n(x = nothing, y = Any[nothing, (nothing, nothing), (x = nothing, y = nothing)])\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmapstructure_with_path","page":"Nested Structures – Functors.jl","title":"Functors.fmapstructure_with_path","text":"fmapstructure_with_path(f, x, ys...; [exclude, prune])\n\nLike fmap_with_path, but doesn't preserve the type of custom structs. Instead, it returns a named tuple, a tuple, an array, a dict, or a nested set of these.\n\nSee also fmapstructure.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.execute","page":"Nested Structures – Functors.jl","title":"Functors.execute","text":"execute(walk, x, ys...)\n\nExecute a walk that recursively calls itself, starting at a node x in a Functors tree, as well as optional associated nodes ys... in other Functors trees. 
Any custom walk function that subtypes Functors.AbstractWalk is permitted.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.AbstractWalk","page":"Nested Structures – Functors.jl","title":"Functors.AbstractWalk","text":"AbstractWalk\n\nAny walk for use with fmap should inherit from this type. A walk subtyping AbstractWalk must satisfy the walk function interface:\n\nstruct MyWalk <: AbstractWalk end\n\nfunction (::MyWalk)(recurse, x, ys...)\n # implement this\nend\n\nThe walk function is called on a node x in a Functors tree. It may also be passed associated nodes ys... in other Functors trees. The walk function recurses further into (x, ys...) by calling recurse on the child nodes. The choice of which nodes to recurse and in what order is custom to the walk.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Functors.ExcludeWalk","page":"Nested Structures – Functors.jl","title":"Functors.ExcludeWalk","text":"ExcludeWalk(walk, fn, exclude)\n\nA walk that recurses nodes (x, ys...) according to walk, except when exclude(x) is true. Then, fn(x, ys...) is applied instead of recursing further.\n\nTypically wraps an existing walk for use with fmap.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Functors.CachedWalk","page":"Nested Structures – Functors.jl","title":"Functors.CachedWalk","text":"CachedWalk(walk[; prune])\n\nA walk that recurses nodes (x, ys...) according to walk and storing the output of the recursion in a cache indexed by x (based on object ID). Whenever the cache already contains x, either:\n\nprune is specified, then it is returned, or\nprune is unspecified, and the previously cached recursion of (x, ys...) 
is returned.\n\nTypically wraps an existing walk for use with fmap.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Moving-models,-or-data,-to-the-GPU","page":"Nested Structures – Functors.jl","title":"Moving models, or data, to the GPU","text":"","category":"section"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux provides some convenience functions based on fmap. Some (f16, f32, f64) change the precision of all arrays in a model. Others are used for moving a model to or from GPU memory:","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"cpu\ngpu(::Any)\ngpu(::Flux.DataLoader)","category":"page"},{"location":"reference/models/functors/#Flux.cpu","page":"Nested Structures – Functors.jl","title":"Flux.cpu","text":"cpu(m)\n\nCopies m onto the CPU, the opposite of gpu. Recurses into structs (thanks to Functors.jl).\n\nExample\n\njulia> m_gpu = Dense(CUDA.randn(2, 5))\nDense(5 => 2) # 12 parameters\n\njulia> m_gpu.bias # matches the given weight matrix\n2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n 0.0\n 0.0\n\njulia> m = m_gpu |> cpu\nDense(5 => 2) # 12 parameters\n\njulia> m.bias\n2-element Vector{Float32}:\n 0.0\n 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Flux.gpu-Tuple{Any}","page":"Nested Structures – Functors.jl","title":"Flux.gpu","text":"gpu(m)\n\nCopies m to the current GPU device (using the current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time). It recurses into structs according to Functors.jl.\n\nUse cpu to copy back to ordinary Arrays. See also f32 and f16 to change element type only.\n\nThis function is just defined for convenience around gpu_device, and is equivalent to gpu_device()(m). 
You may consider defining device = gpu_device() once and then using device(m) to move data.\n\nExample\n\njulia> m = Dense(rand(2, 3)) # constructed with Float64 weight matrix\nDense(3 => 2) # 8 parameters\n\njulia> typeof(m.weight)\nMatrix{Float64} (alias for Array{Float64, 2})\n\njulia> m_gpu = gpu(m) # can equivalently be written m_gpu = m |> gpu\nDense(3 => 2) # 8 parameters\n\njulia> typeof(m_gpu.weight)\nCUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}\n\n\n\n\n\n","category":"method"},{"location":"reference/models/functors/#Flux.gpu-Tuple{DataLoader}","page":"Nested Structures – Functors.jl","title":"Flux.gpu","text":"gpu(data::DataLoader)\ncpu(data::DataLoader)\n\nTransforms a given DataLoader to apply gpu or cpu to each batch of data, when iterated over. (If no GPU is available, this does nothing.)\n\nExample\n\njulia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)\n4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)\n with first element:\n (; x = 2×3 Matrix{Float64}, y = 3-element StepRange{Char, Int64})\n\njulia> first(dl)\n(x = [1.0 1.0 1.0; 1.0 1.0 1.0], y = 'a':1:'c')\n\njulia> c_dl = gpu(dl)\n4-element DataLoader(::MLUtils.MappedData{:auto, typeof(gpu), NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}}, batchsize=3)\n with first element:\n (; x = 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element StepRange{Char, Int64})\n\njulia> first(c_dl).x\n2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\nFor large datasets, this is preferred over moving all the data to the GPU before creating the DataLoader, like this:\n\njulia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)\n4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)\n with first element:\n (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element 
UnitRange{Int64})\n\nwarning: Warning\nThis only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.\n\n\n\n\n\n","category":"method"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/losses/#man-losses","page":"Loss Functions","title":"Loss Functions","text":"","category":"section"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Flux provides a large number of common loss functions used for training machine learning models. They are grouped together in the Flux.Losses module.","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Loss functions for supervised learning typically expect as inputs a target y, and a prediction ŷ from your model. 
In Flux's convention, the order of the arguments is the following","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"loss(ŷ, y)","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Most loss functions in Flux have an optional argument agg, denoting the type of aggregation performed over the batch:","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"loss(ŷ, y) # defaults to `mean`\nloss(ŷ, y, agg=sum) # use `sum` for reduction\nloss(ŷ, y, agg=x->sum(x, dims=2)) # partial reduction\nloss(ŷ, y, agg=x->mean(w .* x)) # weighted mean\nloss(ŷ, y, agg=identity) # no aggregation.","category":"page"},{"location":"reference/models/losses/#Function-listing","page":"Loss Functions","title":"Function listing","text":"","category":"section"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Flux.Losses.mae\nFlux.Losses.mse\nFlux.Losses.msle\nFlux.Losses.huber_loss\nFlux.Losses.label_smoothing\nFlux.Losses.crossentropy\nFlux.Losses.logitcrossentropy\nFlux.Losses.binarycrossentropy\nFlux.Losses.logitbinarycrossentropy\nFlux.Losses.kldivergence\nFlux.Losses.poisson_loss\nFlux.Losses.hinge_loss\nFlux.Losses.squared_hinge_loss\nFlux.Losses.dice_coeff_loss\nFlux.Losses.tversky_loss\nFlux.Losses.binary_focal_loss\nFlux.Losses.focal_loss\nFlux.Losses.siamese_contrastive_loss","category":"page"},{"location":"reference/models/losses/#Flux.Losses.mae","page":"Loss Functions","title":"Flux.Losses.mae","text":"mae(ŷ, y; agg = mean)\n\nReturn the loss corresponding to mean absolute error:\n\nagg(abs.(ŷ .- y))\n\nExample\n\njulia> y_model = [1.1, 1.9, 3.1];\n\njulia> Flux.mae(y_model, 1:3)\n0.10000000000000009\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.mse","page":"Loss 
Functions","title":"Flux.Losses.mse","text":"mse(ŷ, y; agg = mean)\n\nReturn the loss corresponding to mean square error:\n\nagg((ŷ .- y) .^ 2)\n\nSee also: mae, msle, crossentropy.\n\nExample\n\njulia> y_model = [1.1, 1.9, 3.1];\n\njulia> y_true = 1:3;\n\njulia> Flux.mse(y_model, y_true)\n0.010000000000000018\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.msle","page":"Loss Functions","title":"Flux.Losses.msle","text":"msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nThe loss corresponding to mean squared logarithmic errors, calculated as\n\nagg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)\n\nThe ϵ == eps term provides numerical stability. Penalizes an under-estimation more than an over-estimation.\n\nExample\n\njulia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)\n0.009084041f0\n\njulia> Flux.msle(Float32[0.9, 1.8, 2.7], 1:3)\n0.011100831f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.huber_loss","page":"Loss Functions","title":"Flux.Losses.huber_loss","text":"huber_loss(ŷ, y; delta = 1, agg = mean)\n\nReturn the mean of the Huber loss given the prediction ŷ and true values y.\n\n | 0.5 * |ŷ - y|^2, for |ŷ - y| <= δ\nHuber loss = |\n | δ * (|ŷ - y| - 0.5 * δ), otherwise\n\nExample\n\njulia> ŷ = [1.1, 2.1, 3.1];\n\njulia> Flux.huber_loss(ŷ, 1:3) # default δ = 1 > |ŷ - y|\n0.005000000000000009\n\njulia> Flux.huber_loss(ŷ, 1:3, delta=0.05) # changes behaviour as |ŷ - y| > δ\n0.003750000000000005\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.label_smoothing","page":"Loss Functions","title":"Flux.Losses.label_smoothing","text":"label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)\n\nReturns smoothed labels, meaning the confidence on label values is relaxed.\n\nWhen y is given as a one-hot vector or a batch of one-hot vectors, it is calculated as\n\ny .* (1 - α) .+ α / size(y, dims)\n\nwhen y is given as a number or batch of numbers for binary classification, it is 
calculated as\n\ny .* (1 - α) .+ α / 2\n\nin which case the labels are squeezed towards 0.5.\n\nα is a number in interval (0, 1) called the smoothing factor. Higher the value of α larger the smoothing of y.\n\ndims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.\n\nExample\n\njulia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)\n2×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ ⋅ ⋅ 1 ⋅ 1\n 1 1 1 ⋅ 1 ⋅\n\njulia> y_smoothed = Flux.label_smoothing(y, 0.2f0)\n2×6 Matrix{Float32}:\n 0.1 0.1 0.1 0.9 0.1 0.9\n 0.9 0.9 0.9 0.1 0.9 0.1\n\njulia> y_sim = softmax(y .* log(2f0))\n2×6 Matrix{Float32}:\n 0.333333 0.333333 0.333333 0.666667 0.333333 0.666667\n 0.666667 0.666667 0.666667 0.333333 0.666667 0.333333\n\njulia> y_dis = vcat(y_sim[2,:]', y_sim[1,:]')\n2×6 Matrix{Float32}:\n 0.666667 0.666667 0.666667 0.333333 0.666667 0.333333\n 0.333333 0.333333 0.333333 0.666667 0.333333 0.666667\n\njulia> Flux.crossentropy(y_sim, y) < Flux.crossentropy(y_sim, y_smoothed)\ntrue\n\njulia> Flux.crossentropy(y_dis, y) > Flux.crossentropy(y_dis, y_smoothed)\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.crossentropy","page":"Loss Functions","title":"Flux.Losses.crossentropy","text":"crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)\n\nReturn the cross entropy between the given probability distributions; calculated as\n\nagg(-sum(y .* log.(ŷ .+ ϵ); dims))\n\nCross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. 
The prediction ŷ is supposed to sum to one across dims, as would be the case with the output of a softmax operation.\n\nFor numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy .\n\nUse label_smoothing to smooth the true labels as preprocessing before computing the loss.\n\nSee also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.\n\nExample\n\njulia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)\n3×5 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ ⋅ 1\n ⋅ 1 ⋅ 1 ⋅\n ⋅ ⋅ 1 ⋅ ⋅\n\njulia> y_model = softmax(reshape(-7:7, 3, 5) .* 1f0)\n3×5 Matrix{Float32}:\n 0.0900306 0.0900306 0.0900306 0.0900306 0.0900306\n 0.244728 0.244728 0.244728 0.244728 0.244728\n 0.665241 0.665241 0.665241 0.665241 0.665241\n\njulia> sum(y_model; dims=1)\n1×5 Matrix{Float32}:\n 1.0 1.0 1.0 1.0 1.0\n\njulia> Flux.crossentropy(y_model, y_label)\n1.6076053f0\n\njulia> 5 * ans ≈ Flux.crossentropy(y_model, y_label; agg=sum)\ntrue\n\njulia> y_smooth = Flux.label_smoothing(y_label, 0.15f0)\n3×5 Matrix{Float32}:\n 0.9 0.05 0.05 0.05 0.9\n 0.05 0.9 0.05 0.9 0.05\n 0.05 0.05 0.9 0.05 0.05\n\njulia> Flux.crossentropy(y_model, y_smooth)\n1.5776052f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.logitcrossentropy","page":"Loss Functions","title":"Flux.Losses.logitcrossentropy","text":"logitcrossentropy(ŷ, y; dims = 1, agg = mean)\n\nReturn the cross entropy calculated by\n\nagg(-sum(y .* logsoftmax(ŷ; dims); dims))\n\nThis is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.\n\nSee also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.\n\nExample\n\njulia> y_label = Flux.onehotbatch(collect(\"abcabaa\"), 'a':'c')\n3×7 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ 1 ⋅ 1 1\n ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅\n ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅\n\njulia> y_model = reshape(vcat(-9:0, 0:9, 7.5f0), 3, 7)\n3×7 
Matrix{Float32}:\n -9.0 -6.0 -3.0 0.0 2.0 5.0 8.0\n -8.0 -5.0 -2.0 0.0 3.0 6.0 9.0\n -7.0 -4.0 -1.0 1.0 4.0 7.0 7.5\n\njulia> Flux.logitcrossentropy(y_model, y_label)\n1.5791205f0\n\njulia> Flux.crossentropy(softmax(y_model), y_label)\n1.5791197f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.binarycrossentropy","page":"Loss Functions","title":"Flux.Losses.binarycrossentropy","text":"binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nReturn the binary cross-entropy loss, computed as\n\nagg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))\n\nTypically, the prediction ŷ is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recommended over binarycrossentropy for numerical stability.\n\nUse label_smoothing to smooth the y value as preprocessing before computing the loss.\n\nSee also: crossentropy, logitcrossentropy.\n\nExamples\n\njulia> y_bin = Bool[1,0,1]\n3-element Vector{Bool}:\n 1\n 0\n 1\n\njulia> y_prob = softmax(reshape(vcat(1:3, 3:5), 2, 3) .* 1f0)\n2×3 Matrix{Float32}:\n 0.268941 0.5 0.268941\n 0.731059 0.5 0.731059\n\njulia> Flux.binarycrossentropy(y_prob[2,:], y_bin)\n0.43989f0\n\njulia> all(p -> 0 < p < 1, y_prob[2,:]) # else DomainError\ntrue\n\njulia> y_hot = Flux.onehotbatch(y_bin, 0:1)\n2×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ 1 ⋅\n 1 ⋅ 1\n\njulia> Flux.crossentropy(y_prob, y_hot)\n0.43989f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.logitbinarycrossentropy","page":"Loss Functions","title":"Flux.Losses.logitbinarycrossentropy","text":"logitbinarycrossentropy(ŷ, y; agg = mean)\n\nMathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.\n\nSee also: crossentropy, logitcrossentropy.\n\nExamples\n\njulia> y_bin = Bool[1,0,1];\n\njulia> y_model = Float32[2, -1, pi]\n3-element Vector{Float32}:\n 2.0\n -1.0\n 3.1415927\n\njulia> 
Flux.logitbinarycrossentropy(y_model, y_bin)\n0.160832f0\n\njulia> Flux.binarycrossentropy(sigmoid.(y_model), y_bin)\n0.16083185f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.kldivergence","page":"Loss Functions","title":"Flux.Losses.kldivergence","text":"kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nReturn the Kullback-Leibler divergence between the given probability distributions.\n\nThe KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.\n\nExample\n\njulia> p1 = [1 0; 0 1]\n2×2 Matrix{Int64}:\n 1 0\n 0 1\n\njulia> p2 = fill(0.5, 2, 2)\n2×2 Matrix{Float64}:\n 0.5 0.5\n 0.5 0.5\n\njulia> Flux.kldivergence(p2, p1) ≈ log(2)\ntrue\n\njulia> Flux.kldivergence(p2, p1; agg = sum) ≈ 2log(2)\ntrue\n\njulia> Flux.kldivergence(p2, p2; eps = 0) # about -2e-16 with the regulator\n0.0\n\njulia> Flux.kldivergence(p1, p2; eps = 0) # about 17.3 with the regulator\nInf\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.poisson_loss","page":"Loss Functions","title":"Flux.Losses.poisson_loss","text":"poisson_loss(ŷ, y; agg = mean)\n\nReturn how much the predicted distribution ŷ diverges from the expected Poisson distribution y; calculated as -\n\nsum(ŷ .- y .* log.(ŷ)) / size(y, 2)\n\nMore information..\n\nExample\n\njulia> y_model = [1, 3, 3]; # data should only take integral values\n\njulia> Flux.poisson_loss(y_model, 1:3)\n0.5023128522198171\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.hinge_loss","page":"Loss Functions","title":"Flux.Losses.hinge_loss","text":"hinge_loss(ŷ, y; agg = mean)\n\nReturn the hinge_loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as\n\nsum(max.(0, 1 .- ŷ .* y)) / size(y, 2)\n\nUsually used with classifiers like Support Vector Machines. 
See also: squared_hinge_loss\n\nExample\n\njulia> y_true = [1, -1, 1, 1];\n\njulia> y_pred = [0.1, 0.3, 1, 1.5];\n\njulia> Flux.hinge_loss(y_pred, y_true)\n0.55\n\njulia> Flux.hinge_loss(y_pred[1], y_true[1]) != 0 # same sign but |ŷ| < 1\ntrue\n\njulia> Flux.hinge_loss(y_pred[end], y_true[end]) == 0 # same sign but |ŷ| >= 1\ntrue\n\njulia> Flux.hinge_loss(y_pred[2], y_true[2]) != 0 # opposite signs\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.squared_hinge_loss","page":"Loss Functions","title":"Flux.Losses.squared_hinge_loss","text":"squared_hinge_loss(ŷ, y)\n\nReturn the squared hinge loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as\n\nsum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)\n\nUsually used with classifiers like Support Vector Machines. See also: hinge_loss\n\nExample\n\njulia> y_true = [1, -1, 1, 1];\n\njulia> y_pred = [0.1, 0.3, 1, 1.5];\n\njulia> Flux.squared_hinge_loss(y_pred, y_true)\n0.625\n\njulia> Flux.squared_hinge_loss(y_pred[1], y_true[1]) != 0\ntrue\n\njulia> Flux.squared_hinge_loss(y_pred[end], y_true[end]) == 0\ntrue\n\njulia> Flux.squared_hinge_loss(y_pred[2], y_true[2]) != 0\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.dice_coeff_loss","page":"Loss Functions","title":"Flux.Losses.dice_coeff_loss","text":"dice_coeff_loss(ŷ, y; smooth = 1)\n\nReturn a loss based on the Dice coefficient. Used in the V-Net image segmentation architecture. The Dice coefficient is similar to the F1_score. 
Loss calculated as:\n\n1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)\n\nExample\n\njulia> y_pred = [1.1, 2.1, 3.1];\n\njulia> Flux.dice_coeff_loss(y_pred, 1:3)\n0.000992391663909964\n\njulia> 1 - Flux.dice_coeff_loss(y_pred, 1:3) # ~ F1 score for image segmentation\n0.99900760833609\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.tversky_loss","page":"Loss Functions","title":"Flux.Losses.tversky_loss","text":"tversky_loss(ŷ, y; beta = 0.7)\n\nReturn the Tversky loss. Used with imbalanced data to give more weight to false negatives. Larger values of β == beta weigh recall more than precision (by placing more emphasis on false negatives). Calculated as:\n\n1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.binary_focal_loss","page":"Loss Functions","title":"Flux.Losses.binary_focal_loss","text":"binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))\n\nReturn the binary_focal_loss. The input, 'ŷ', is expected to be normalized (i.e. softmax output).\n\nFor gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.\n\nSee also: Losses.focal_loss for multi-class setting\n\nExample\n\njulia> y = [0 1 0\n 1 0 1]\n2×3 Matrix{Int64}:\n 0 1 0\n 1 0 1\n\njulia> ŷ = [0.268941 0.5 0.268941\n 0.731059 0.5 0.731059]\n2×3 Matrix{Float64}:\n 0.268941 0.5 0.268941\n 0.731059 0.5 0.731059\n\njulia> Flux.binary_focal_loss(ŷ, y) ≈ 0.0728675615927385\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.focal_loss","page":"Loss Functions","title":"Flux.Losses.focal_loss","text":"focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))\n\nReturn the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. 
softmax output).\n\nThe modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.\n\nExample\n\njulia> y = [1 0 0 0 1\n 0 1 0 1 0\n 0 0 1 0 0]\n3×5 Matrix{Int64}:\n 1 0 0 0 1\n 0 1 0 1 0\n 0 0 1 0 0\n\njulia> ŷ = softmax(reshape(-7:7, 3, 5) .* 1f0)\n3×5 Matrix{Float32}:\n 0.0900306 0.0900306 0.0900306 0.0900306 0.0900306\n 0.244728 0.244728 0.244728 0.244728 0.244728\n 0.665241 0.665241 0.665241 0.665241 0.665241\n\njulia> Flux.focal_loss(ŷ, y) ≈ 1.1277571935622628\ntrue\n\nSee also: Losses.binary_focal_loss for binary (not one-hot) labels\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.siamese_contrastive_loss","page":"Loss Functions","title":"Flux.Losses.siamese_contrastive_loss","text":"siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)\n\nReturn the contrastive loss which can be useful for training Siamese Networks. It is given by\n\nagg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)\n\nSpecify margin to set the baseline for distance at which pairs are dissimilar.\n\nExample\n\njulia> ŷ = [0.5, 1.5, 2.5];\n\njulia> Flux.siamese_contrastive_loss(ŷ, 1:3)\n-4.833333333333333\n\njulia> Flux.siamese_contrastive_loss(ŷ, 1:3, margin = 2)\n-4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#man-layers","page":"Built-in Layers","title":"Built-in Layer Types","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"If you started at the beginning of the guide, then you have already met the basic Dense layer, and seen Chain for combining layers. 
These core layers form the foundation of almost all neural networks.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The Dense layer exemplifies several features:","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"It contains an activation function, which is broadcasted over the output. Because this broadcast can be fused with other operations, doing so is more efficient than applying the activation function separately.\nIt takes an init keyword, which accepts a function acting like rand. That is, init(2,3,4) should create an array of this size. Flux has many such functions built-in. All make a CPU array, moved later with gpu if desired.\nThe bias vector is always initialised with Flux.zeros32. The keyword bias=false will turn this off, i.e. keeping the bias permanently zero.\nIt is annotated with @layer, which means that Flux.setup will see the contents, and gpu will move their arrays to the GPU.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"By contrast, Chain itself contains no parameters, but connects other layers together. 
The section on dataflow layers introduces others like this.","category":"page"},{"location":"reference/models/layers/#Fully-Connected","page":"Built-in Layers","title":"Fully Connected","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Dense\nFlux.Bilinear\nFlux.Scale","category":"page"},{"location":"reference/models/layers/#Flux.Dense","page":"Built-in Layers","title":"Flux.Dense","text":"Dense(in => out, σ=identity; bias=true, init=glorot_uniform)\nDense(W::AbstractMatrix, [bias, σ])\n\nCreate a traditional fully connected layer, whose forward pass is given by:\n\ny = σ.(W * x .+ bias)\n\nThe input x should be a vector of length in, or batch of vectors represented as an in × N matrix, or any array with size(x,1) == in. The out y will be a vector of length out, or a batch with size(y) == (out, size(x)[2:end]...)\n\nKeyword bias=false will switch off trainable bias for the layer. The initialisation of the weight matrix is W = init(out, in), calling the function given to keyword init, with default glorot_uniform. 
The weight matrix and/or the bias vector (of length out) may also be provided explicitly.\n\nExamples\n\njulia> model = Dense(5 => 2)\nDense(5 => 2) # 12 parameters\n\njulia> model(rand32(5, 64)) |> size\n(2, 64)\n\njulia> model(rand32(5, 6, 4, 64)) |> size # treated as three batch dimensions\n(2, 6, 4, 64)\n\njulia> model2 = Dense(ones(2, 5), false, tanh) # using provided weight matrix\nDense(5 => 2, tanh; bias=false) # 10 parameters\n\njulia> model2(ones(5))\n2-element Vector{Float64}:\n 0.9999092042625951\n 0.9999092042625951\n\njulia> Flux.trainables(model2) # no trainable bias\n1-element Vector{AbstractArray}:\n [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Bilinear","page":"Built-in Layers","title":"Flux.Bilinear","text":"Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)\nBilinear(W::AbstractArray, [bias, σ])\n\nCreates a layer which is fully connected between two inputs and the output, and otherwise similar to Dense. Its output, given vectors x & y, is another vector z with, for all i ∈ 1:out:\n\nz[i] = σ(x' * W[i,:,:] * y + bias[i])\n\nIf x and y are matrices, then each column of the output z = B(x, y) is of this form, with B the Bilinear layer.\n\nIf the second input y is not given, it is taken to be equal to x, i.e. B(x) == B(x, x)\n\nThe two inputs may also be provided as a tuple, B((x, y)) == B(x, y), which is accepted as the input to a Chain.\n\nIf the two input sizes are the same, in1 == in2, then you may write Bilinear(in => out, σ).\n\nThe initialisation works as for Dense layer, with W = init(out, in1, in2). By default the bias vector is zeros(Float32, out), option bias=false will switch off trainable bias. 
Either of these may be provided explicitly.\n\nExamples\n\njulia> x, y = randn(Float32, 5, 32), randn(Float32, 5, 32);\n\njulia> B = Flux.Bilinear((5, 5) => 7)\nBilinear(5 => 7) # 182 parameters\n\njulia> B(x) |> size # interactions based on one input\n(7, 32)\n\njulia> B(x,y) == B((x,y)) # two inputs, may be given as a tuple\ntrue\n\njulia> sc = SkipConnection(\n Chain(Dense(5 => 20, tanh), Dense(20 => 9, tanh)),\n Flux.Bilinear((9, 5) => 3, bias=false),\n ); # used as the recombinator, with skip as the second input\n\njulia> sc(x) |> size\n(3, 32)\n\njulia> Flux.Bilinear(rand(4,8,16), false, tanh) # first dim of weight is the output\nBilinear((8, 16) => 4, tanh; bias=false) # 512 parameters\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Scale","page":"Built-in Layers","title":"Flux.Scale","text":"Scale(size::Integer..., σ=identity; bias=true, init=ones32)\nScale(scale::AbstractArray, [bias, σ])\n\nCreate an element-wise layer, whose forward pass is given by:\n\ny = σ.(scale .* x .+ bias)\n\nThis uses .* instead of matrix multiplication * of Dense.\n\nThe learnable scale & bias are initialised init(size...) and zeros32(size...), with init=ones32 by default. 
You may specify the function init, turn off trainable bias with bias=false, or provide the array(s) explicitly.\n\nUsed by LayerNorm with affine=true.\n\nExamples\n\njulia> a = Flux.Scale(2)\nScale(2) # 4 parameters\n\njulia> Flux.trainables(a)\n2-element Vector{AbstractArray}:\n Float32[1.0, 1.0]\n Float32[0.0, 0.0]\n\njulia> a([1 2 3])\n2×3 Matrix{Float32}:\n 1.0 2.0 3.0\n 1.0 2.0 3.0\n\njulia> b = Flux.Scale(Float32[1 2 3 4], false, abs2)\nScale(1, 4, abs2; bias=false) # 4 parameters\n\njulia> b([1, 10])\n2×4 Matrix{Float32}:\n 1.0 4.0 9.0 16.0\n 100.0 400.0 900.0 1600.0\n\njulia> Flux.trainables(b)\n1-element Vector{AbstractArray}:\n Float32[1.0 2.0 3.0 4.0]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.","category":"page"},{"location":"reference/models/layers/#Convolution-Models","page":"Built-in Layers","title":"Convolution Models","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers are used to build convolutional neural networks (CNNs).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Besides images, 2D data, they also work with 1D data, where for instance stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). 
They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Conv\nConvTranspose\nCrossCor\nDepthwiseConv\nSamePad","category":"page"},{"location":"reference/models/layers/#Flux.Conv","page":"Built-in Layers","title":"Flux.Conv","text":"Conv(filter, in => out, σ = identity;\n stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])\nConv(weight, [bias, activation; stride, pad, dilation])\n\nStandard convolutional layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.\n\nImage data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5,5), a 2-tuple of integers.\n\nTo take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N+2, where size(x, N+1) == in is the number of input channels, and size(x, ndims(x)) is (as always) the number of observations in a batch. Then:\n\nfilter should be a tuple of N integers.\nKeywords stride and dilation should each be either a single integer, or a tuple with N integers.\nKeyword pad specifies the number of elements added to the borders of the data array. 
It can be\na single integer for equal padding all around,\na tuple of N integers, to apply the same padding at begin/end of each spatial dimension,\na tuple of 2*N integers, for asymmetric padding, or\nthe singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.\nKeyword groups is expected to be an Int. It specifies the number of groups to divide a convolution into.\n\nKeywords to control initialization of the layer:\n\ninit - Function used to generate initial weights. Defaults to glorot_uniform.\nbias - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to false, or another vector can be provided such as bias = randn(Float32, out).\n\nThe second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself.\n\nSee also ConvTranspose, DepthwiseConv, CrossCor.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = Conv((5,5), 3 => 7, relu; bias = false)\nConv((5, 5), 3 => 7, relu, bias=false) # 525 parameters\n\njulia> layer(xs) |> size\n(96, 96, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2)(xs) |> size\n(48, 48, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2, pad = SamePad())(xs) |> size\n(50, 50, 7, 50)\n\njulia> Conv((1,1), 3 => 7; pad = (20,10,0,0))(xs) |> size\n(130, 100, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2, dilation = 4)(xs) |> size\n(42, 42, 7, 50)\n\njulia> weight = rand(Float32, 3, 4, 5);\n\njulia> bias = zeros(Float32, 5);\n\njulia> layer = Conv(weight, bias, sigmoid) # expects 1 spatial dimension\nConv((3,), 4 => 5, σ) # 65 parameters\n\njulia> layer(randn(Float32, 100, 4, 64)) |> size\n(98, 5, 64)\n\njulia> Flux.trainables(layer) |> length\n2\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.ConvTranspose","page":"Built-in 
Layers","title":"Flux.ConvTranspose","text":"ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])\nConvTranspose(weight, [bias, activation; stride, pad, outpad, dilation])\n\nStandard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.\n\nNote that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.\n\nTo conserve Conv invertibility when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. Whereas pad is used to zero-pad the input, outpad only affects the output shape.\n\nParameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.\n\nThe second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself.\n\nSee also Conv for a more detailed description of keywords.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = ConvTranspose((5,5), 3 => 7, relu)\nConvTranspose((5, 5), 3 => 7, relu) # 532 parameters\n\njulia> layer(xs) |> size\n(104, 104, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=2)(xs) |> size\n(203, 203, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=2, outpad=1)(xs) |> size\n(204, 204, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=3, pad=SamePad())(xs) |> size\n(300, 300, 7, 50)\n\njulia> weight = rand(Float32, 3, 4, 5);\n\njulia> bias = zeros(Float32, 4);\n\njulia> layer = ConvTranspose(weight, bias, sigmoid)\nConvTranspose((3,), 5 => 4, σ) # 64 parameters\n\njulia> layer(randn(Float32, 100, 5, 64)) |> size # transposed convolution will increase the dimension size (upsampling)\n(102, 4, 64)\n\njulia> Flux.trainables(layer) |> 
length\n2\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.CrossCor","page":"Built-in Layers","title":"Flux.CrossCor","text":"CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])\nCrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nStandard cross correlation layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.\n\nParameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.\n\nThe second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself\n\nSee also Conv for more detailed description of keywords.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = CrossCor((5,5), 3 => 6, relu; bias=false)\nCrossCor((5, 5), 3 => 6, relu, bias=false) # 450 parameters\n\njulia> layer(xs) |> size\n(96, 96, 6, 50)\n\njulia> CrossCor((5,5), 3 => 7, stride=3, pad=(2,0))(xs) |> size\n(34, 32, 7, 50)\n\njulia> weight = rand(Float32, 3, 4, 5);\n\njulia> bias = zeros(Float32, 5);\n\njulia> layer = CrossCor(weight, bias, relu)\nCrossCor((3,), 4 => 5, relu) # 65 parameters\n\njulia> layer(randn(Float32, 100, 4, 64)) |> size\n(98, 5, 64)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.DepthwiseConv","page":"Built-in Layers","title":"Flux.DepthwiseConv","text":"DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])\nDepthwiseConv(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nReturn a depthwise convolutional layer, that is a Conv layer with number of groups equal to the number of input channels.\n\nSee Conv for a description of the arguments.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = 
DepthwiseConv((5,5), 3 => 6, relu; bias=false)\nConv((5, 5), 3 => 6, relu, groups=3, bias=false) # 150 parameters \n\njulia> layer(xs) |> size\n(96, 96, 6, 50)\n\njulia> DepthwiseConv((5, 5), 3 => 9, stride=2, pad=2)(xs) |> size\n(50, 50, 9, 50)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.SamePad","page":"Built-in Layers","title":"Flux.SamePad","text":"SamePad()\n\nPassed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).\n\nSee also Conv, MaxPool.\n\nExamples\n\njulia> xs = rand32(100, 100, 3, 50); # a batch of images\n\njulia> layer = Conv((2,2), 3 => 7, pad=SamePad())\nConv((2, 2), 3 => 7, pad=(1, 0, 1, 0)) # 91 parameters\n\njulia> layer(xs) |> size # notice how the dimensions stay the same with this padding\n(100, 100, 7, 50)\n\njulia> layer2 = Conv((2,2), 3 => 7)\nConv((2, 2), 3 => 7) # 91 parameters\n\njulia> layer2(xs) |> size # the output dimension changes as the padding was not \"same\"\n(99, 99, 7, 50)\n\njulia> layer3 = Conv((5, 5), 3 => 7, stride=2, pad=SamePad())\nConv((5, 5), 3 => 7, pad=2, stride=2) # 532 parameters\n\njulia> layer3(xs) |> size # output size = `ceil(input_size/stride)` = 50\n(50, 50, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#MultiHeadAttention","page":"Built-in Layers","title":"MultiHeadAttention","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The basic blocks needed to implement Transformer architectures. 
See also the functional counterparts documented in NNlib's Attention section.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"MultiHeadAttention","category":"page"},{"location":"reference/models/layers/#Flux.MultiHeadAttention","page":"Built-in Layers","title":"Flux.MultiHeadAttention","text":"MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])\n\nThe multi-head dot-product attention layer used in Transformer architectures [1].\n\nReturns the transformed input sequence and the attention scores.\n\n[1] Vaswani et al. \"Attention is all you need.\" Advances in Neural Information Processing Systems. 2017.\n\nArguments\n\ndims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. Can take also simpler forms as b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.\nnheads: number of heads. Default 8.\ninit: weight initializer for the Dense layers. Default glorot_uniform.\nbias : whether pointwise QKVO dense transforms use bias. Default false.\ndropout_prob: dropout probability for the attention scores. Default 0.0.\n\nForward\n\n(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])\n\nThe arguments of the forward pass are:\n\nq_in: Input query array of size (q_in_dim, q_len, batch_size).\nk_in: Input key array of size (k_in_dim, kv_len, batch_size).\nv_in: Input value array of size (v_in_dim, kv_len, batch_size).\nbias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.\nmask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. 
Default nothing.\n\nAlternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).\n\nSee also NNlib.dot_product_attention.\n\nExamples\n\nmha = MultiHeadAttention(64, nheads = 8)\nq = rand(Float32, (64, 10, 32))\nk = rand(Float32, (64, 20, 32))\nv = rand(Float32, (64, 20, 32))\ny, α = mha(q, k, v) \n# [y] = [64, 10, 32]\n# [α] = [20, 10, 8, 32]\n\nmha = MultiHeadAttention(64 => 1024 => 1024, nheads = 8)\ny, α = mha(q) # self-attention\n# [y] = [1024, 10, 32]\n# [α] = [10, 10, 8, 32]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Pooling","page":"Built-in Layers","title":"Pooling","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"AdaptiveMaxPool\nMaxPool\nGlobalMaxPool\nAdaptiveMeanPool\nMeanPool\nGlobalMeanPool","category":"page"},{"location":"reference/models/layers/#Flux.AdaptiveMaxPool","page":"Built-in Layers","title":"Flux.AdaptiveMaxPool","text":"AdaptiveMaxPool(out::NTuple)\n\nAdaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.\n\nExpects as input an array with ndims(x) == N+2, i.e. 
channel and batch dimensions, after the N feature dimensions, where N = length(out).\n\nSee also MaxPool, AdaptiveMeanPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> AdaptiveMaxPool((25, 25))(xs) |> size\n(25, 25, 3, 50)\n\njulia> MaxPool((4,4))(xs) ≈ AdaptiveMaxPool((25, 25))(xs)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.MaxPool","page":"Built-in Layers","title":"Flux.MaxPool","text":"MaxPool(window::NTuple; pad=0, stride=window)\n\nMax pooling layer, which replaces all pixels in a block of size window with one.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).\n\nBy default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().\n\nSee also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> m = Chain(Conv((5, 5), 3 => 7, pad=SamePad()), MaxPool((5, 5), pad=SamePad()))\nChain(\n Conv((5, 5), 3 => 7, pad=2), # 532 parameters\n MaxPool((5, 5), pad=2),\n)\n\njulia> m[1](xs) |> size\n(100, 100, 7, 50)\n\njulia> m(xs) |> size\n(20, 20, 7, 50)\n\njulia> layer = MaxPool((5,), pad=2, stride=(3,)) # one-dimensional window\nMaxPool((5,), pad=2, stride=3)\n\njulia> layer(rand(Float32, 100, 7, 50)) |> size\n(34, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GlobalMaxPool","page":"Built-in Layers","title":"Flux.GlobalMaxPool","text":"GlobalMaxPool()\n\nGlobal max pooling layer.\n\nTransforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.\n\nSee also MaxPool, GlobalMeanPool.\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((3,3), 3 => 7), GlobalMaxPool());\n\njulia> m(xs) |> size\n(1, 1, 7, 
50)\n\njulia> GlobalMaxPool()(rand(3,5,7)) |> size # preserves 2 dimensions\n(1, 5, 7)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.AdaptiveMeanPool","page":"Built-in Layers","title":"Flux.AdaptiveMeanPool","text":"AdaptiveMeanPool(out::NTuple)\n\nAdaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).\n\nSee also MaxPool, AdaptiveMaxPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> AdaptiveMeanPool((25, 25))(xs) |> size\n(25, 25, 3, 50)\n\njulia> MeanPool((4,4))(xs) ≈ AdaptiveMeanPool((25, 25))(xs)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.MeanPool","page":"Built-in Layers","title":"Flux.MeanPool","text":"MeanPool(window::NTuple; pad=0, stride=window)\n\nMean pooling layer, averaging all pixels in a block of size window.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).\n\nBy default the window size is also the stride in each dimension. 
The keyword pad accepts the same options as for the Conv layer, including SamePad().\n\nSee also Conv, MaxPool, AdaptiveMeanPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((5,5), 3 => 7), MeanPool((5,5), pad=SamePad()))\nChain(\n Conv((5, 5), 3 => 7), # 532 parameters\n MeanPool((5, 5), pad=2),\n)\n\njulia> m[1](xs) |> size\n(96, 96, 7, 50)\n\njulia> m(xs) |> size\n(20, 20, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GlobalMeanPool","page":"Built-in Layers","title":"Flux.GlobalMeanPool","text":"GlobalMeanPool()\n\nGlobal mean pooling layer.\n\nTransforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((3,3), 3 => 7), GlobalMeanPool());\n\njulia> m(xs) |> size\n(1, 1, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Upsampling","page":"Built-in Layers","title":"Upsampling","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Upsample\nPixelShuffle","category":"page"},{"location":"reference/models/layers/#Flux.Upsample","page":"Built-in Layers","title":"Flux.Upsample","text":"Upsample(mode = :nearest; [scale, size]) \nUpsample(scale, mode = :nearest)\n\nAn upsampling layer. One of two keywords must be given:\n\nIf scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually. 
Alternatively, keyword size accepts a tuple, to directly specify the leading dimensions of the output.\n\nCurrently supported upsampling modes and corresponding NNlib's methods are:\n\n:nearest -> NNlib.upsample_nearest \n:bilinear -> NNlib.upsample_bilinear\n:trilinear -> NNlib.upsample_trilinear\n\nExamples\n\njulia> m = Upsample(scale = (2, 3))\nUpsample(:nearest, scale = (2, 3))\n\njulia> m(ones(2, 2, 1, 1)) |> size\n(4, 6, 1, 1)\n\njulia> m = Upsample(:bilinear, size = (4, 5))\nUpsample(:bilinear, size = (4, 5))\n\njulia> m(ones(2, 2, 1, 1)) |> size\n(4, 5, 1, 1)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.PixelShuffle","page":"Built-in Layers","title":"Flux.PixelShuffle","text":"PixelShuffle(r::Int)\n\nPixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.\n\nSee NNlib.pixel_shuffle.\n\nExamples\n\njulia> p = PixelShuffle(2);\n\njulia> xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]\n2×2×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 3.1 4.1\n 5.1 6.1\n\n[:, :, 2, 1] =\n 3.2 4.2\n 5.2 6.2\n\n[:, :, 3, 1] =\n 3.3 4.3\n 5.3 6.3\n\n[:, :, 4, 1] =\n 3.4 4.4\n 5.4 6.4\n\njulia> p(xs)\n4×4×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 3.1 3.3 4.1 4.3\n 3.2 3.4 4.2 4.4\n 5.1 5.3 6.1 6.3\n 5.2 5.4 6.2 6.4\n\njulia> xs = [3row + col + channel/10 for row in 1:2, col in 1:3, channel in 1:4, n in 1:1]\n2×3×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 4.1 5.1 6.1\n 7.1 8.1 9.1\n\n[:, :, 2, 1] =\n 4.2 5.2 6.2\n 7.2 8.2 9.2\n\n[:, :, 3, 1] =\n 4.3 5.3 6.3\n 7.3 8.3 9.3\n\n[:, :, 4, 1] =\n 4.4 5.4 6.4\n 7.4 8.4 9.4\n\njulia> p(xs)\n4×6×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 4.1 4.3 5.1 5.3 6.1 6.3\n 4.2 4.4 5.2 5.4 6.2 6.4\n 7.1 7.3 8.1 8.3 9.1 9.3\n 7.2 7.4 8.2 8.4 9.2 9.4\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Embedding-Vectors","page":"Built-in Layers","title":"Embedding 
Vectors","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Flux.Embedding\nFlux.EmbeddingBag","category":"page"},{"location":"reference/models/layers/#Flux.Embedding","page":"Built-in Layers","title":"Flux.Embedding","text":"Embedding(in => out; init=randn32)\n\nA lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.\n\nThis layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.\n\nFor indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).\n\nExamples\n\njulia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))\nEmbedding(26 => 4) # 104 parameters\n\njulia> emb(2) # one column of e.weight (here not random!)\n4-element Vector{Float32}:\n 0.0\n 22.0\n 0.0\n 0.0\n\njulia> emb([3, 1, 20, 14, 4, 15, 7]) # vocabulary indices, in 1:26\n4×7 Matrix{Float32}:\n 0.0 22.0 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n 22.0 0.0 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 22.0 0.0 0.0\n\njulia> ans == emb(Flux.onehotbatch(\"cat&dog\", 'a':'z', 'n'))\ntrue\n\njulia> emb(rand(1:26, (10, 1, 12))) |> size # three batch dimensions\n(4, 10, 1, 12)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.EmbeddingBag","page":"Built-in Layers","title":"Flux.EmbeddingBag","text":"EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)\n\nA lookup table that stores embeddings of dimension out for a vocabulary of size in. 
Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts on a vector of indices, which it calls a \"bag\". Their individual embedding vectors are reduced to one, using mean or some other function.\n\nInstead of acting on one \"bag\", such as x::Vector{Int}, the layer can also act on several:\n\nActing on a vector of \"bags\", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).\nAny higher-rank array of integers is interpreted as a collection of \"bags\" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all \"bags\" have the same length.\nA vector of \"bags\" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.\n\nThe \"bag\" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produces a stack of embeddings. 
See details below.\n\nExamples\n\njulia> vocab_size = 26; # embed into 3 dimensions, with non-random vectors:\n\njulia> eb = EmbeddingBag(vocab_size => 3, init=Flux.identity_init(gain=100))\nEmbeddingBag(26 => 3) # 78 parameters\n\njulia> eb([2]) # one bag of 1 item\n3-element Vector{Float32}:\n 0.0\n 100.0\n 0.0\n\njulia> eb([3,3,1]) # one bag of 3 items, one mean embedding\n3-element Vector{Float32}:\n 33.333332\n 0.0\n 66.666664\n\njulia> eb([[3,1,3], [2,1]]) # two bags\n3×2 Matrix{Float32}:\n 33.3333 50.0\n 0.0 50.0\n 66.6667 0.0\n\njulia> eb([1 1 1 1; 1 2 3 4]) # 4 bags each of 2 items, eachcol([1 1 1 1; 1 2 3 4])\n3×4 Matrix{Float32}:\n 100.0 50.0 50.0 50.0\n 0.0 50.0 0.0 0.0\n 0.0 0.0 50.0 0.0\n\njulia> eb(rand(1:26, 10, 5, 5)) |> size # 25 bags each of 10 items\n(3, 5, 5)\n\nAnother way to specify \"many bags of many items\" is to provide a vector data (each element in 1:in) and a vector at stating where to split that up into \"bags\". The first bag starts with data[at[1]], the second at data[at[2]], and so on, with no overlaps and nothing left out (thus it requires at[1]==1).\n\njulia> data = [11, 1, 12, 2, 13, 3, 14];\n\njulia> data[1:3], data[4:end]\n([11, 1, 12], [2, 13, 3, 14])\n\njulia> eb(data, [1, 4]) # two bags, of 3 and 4 items\n3×2 Matrix{Float32}:\n 33.3333 0.0\n 0.0 25.0\n 0.0 25.0\n\nFinally, each bag may also be represented as a OneHotMatrix.\n\njulia> eb(Flux.onehotbatch(\"bba\", 'a':'z')) # same as [2,2,1], one bag of 3 items\n3-element Vector{Float32}:\n 33.333332\n 66.666664\n 0.0\n\njulia> eb([Flux.onehotbatch(\"bba\", 'a':'z'), Flux.onehotbatch(\"cc\", 'a':'z')]) # two bags\n3×2 Matrix{Float32}:\n 33.3333 0.0\n 66.6667 0.0\n 0.0 100.0\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#man-dataflow-layers","page":"Built-in Layers","title":"Dataflow Layers, or Containers","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The basic Chain(F, 
G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Chain\nFlux.activations\nMaxout\nSkipConnection\nParallel\nPairwiseFusion","category":"page"},{"location":"reference/models/layers/#Flux.Chain","page":"Built-in Layers","title":"Flux.Chain","text":"Chain(layers...)\nChain(name = layer, ...)\n\nCollects multiple layers / functions to be called in sequence on a given input. Supports indexing and slicing, m[2] or m[1:end-1], and if names are given, m[:name] == m[1] etc.\n\nExamples\n\njulia> m = Chain(x -> x^2, x -> x+1);\n\njulia> m(5) == 26\ntrue\n\njulia> m = Chain(Dense(10 => 5, tanh), Dense(5 => 2));\n\njulia> x = rand32(10, 32);\n\njulia> m(x) == m[2](m[1](x))\ntrue\n\njulia> m2 = Chain(enc = Chain(Flux.flatten, Dense(10 => 5, tanh)), \n dec = Dense(5 => 2));\n\njulia> m2(x) == (m2[:dec] ∘ m2[:enc])(x)\ntrue\n\nA chain may be called with multiple arguments, which is equivalent to calling it with one tuple of these arguments. Such a tuple is understood by Parallel to mean the same as several arguments:\n\njulia> Chain(println, println)(1, 2, 3) # three arguments become a tuple\n(1, 2, 3)\nnothing\n\njulia> Chain(x->@show(x), Parallel(+, inv, abs2))(4, 5) # returns 1/4 + 5^2\nx = (4, 5)\n25.25\n\nFor large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). 
This feature is somewhat experimental, beware!\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.activations","page":"Built-in Layers","title":"Flux.activations","text":"activations(c::Chain, input)\n\nLike calling a Chain, but saves the result of each layer as an output.\n\nExamples\n\njulia> using Flux: activations\n\njulia> c = Chain(x -> x + 1, x -> x * 2, x -> x ^ 3);\n\njulia> activations(c, 1)\n(2, 4, 64)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.Maxout","page":"Built-in Layers","title":"Flux.Maxout","text":"Maxout(layers...)\nMaxout(f, n_alts)\n\nThis contains a number of internal layers, each of which receives the same input. Its output is the elementwise maximum of the internal layers' outputs.\n\nInstead of defining layers individually, you can provide a zero-argument function which constructs them, and the number to construct.\n\nMaxout over linear dense layers satisfies the universal approximation theorem. See Goodfellow, Warde-Farley, Mirza, Courville & Bengio \"Maxout Networks\" https://arxiv.org/abs/1302.4389.\n\nSee also Parallel to reduce with other operators.\n\nExamples\n\njulia> m = Maxout(x -> abs2.(x), x -> x .* 3);\n\njulia> m([-2 -1 0 1 2])\n1×5 Matrix{Int64}:\n 4 1 0 3 6\n\njulia> m3 = Maxout(() -> Dense(5 => 7, tanh), 3)\nMaxout(\n Dense(5 => 7, tanh), # 42 parameters\n Dense(5 => 7, tanh), # 42 parameters\n Dense(5 => 7, tanh), # 42 parameters\n) # Total: 6 arrays, 126 parameters, 816 bytes.\n\njulia> Flux.outputsize(m3, (5, 11))\n(7, 11)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.SkipConnection","page":"Built-in Layers","title":"Flux.SkipConnection","text":"SkipConnection(layer, connection)\n\nCreate a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. 
The first argument to the callable is the input after it has been propagated through the given layer, while the second is the unchanged, \"skipped\" input.\n\nThe simplest \"ResNet\"-type connection is just SkipConnection(layer, +). Here is a more complicated example:\n\njulia> m = Conv((3,3), 4 => 7, pad=(1,1));\n\njulia> x = ones(Float32, 5, 5, 4, 10);\n\njulia> size(m(x)) == (5, 5, 7, 10)\ntrue\n\njulia> sm = SkipConnection(m, (mx, x) -> cat(mx, x, dims=3));\n\njulia> size(sm(x)) == (5, 5, 11, 10)\ntrue\n\nSee also Parallel, Maxout.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Parallel","page":"Built-in Layers","title":"Flux.Parallel","text":"Parallel(connection, layers...)\nParallel(connection; name = layer, ...)\n\nCreate a layer which passes an input array to each path in layers, before reducing the output with connection.\n\nObeys rules similar to broadcasting:\n\nCalled with one input x, this is equivalent to connection([l(x) for l in layers]...).\nWith multiple inputs and just one layer, it is instead connection([layer(x) for x in inputs]...).\nWith multiple inputs and multiple layers, one input is passed to each layer, thus Parallel(+, f, g)(x, y) = f(x) + g(y).\n\nLike Chain, its sub-layers may be given names using the keyword constructor. 
These can be accessed by indexing: m[1] == m[:name] is the first layer.\n\nSee also SkipConnection which is Parallel with one identity, and Maxout which reduces by broadcasting max.\n\nExamples\n\njulia> p = Parallel(+, abs2, sqrt);\n\njulia> p(3, 4) # == 3^2 + √4, two functions two inputs\n11.0\n\njulia> p((3, 4)) # tuple is always splatted\n11.0\n\njulia> p(4) # == 4^2 + √4, one input used twice\n18.0\n\njulia> Parallel(hcat, inv)(1, 2, 4) # one function three inputs\n1×3 Matrix{Float64}:\n 1.0 0.5 0.25\n\nWith Flux layers:\n\njulia> model = Chain(Dense(3 => 5),\n Parallel(vcat, Dense(5 => 4), Chain(Dense(5 => 7), Dense(7 => 4))),\n Dense(8 => 17));\n\njulia> model(rand32(3)) |> size\n(17,)\n\njulia> model2 = Parallel(+; α = Dense(10 => 2, tanh), β = Dense(5 => 2))\nParallel(\n +,\n α = Dense(10 => 2, tanh), # 22 parameters\n β = Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 34 parameters, 344 bytes.\n\njulia> model2(rand32(10), rand32(5)) |> size\n(2,)\n\njulia> model2[:α](rand32(10)) |> size\n(2,)\n\njulia> model2[:β] == model2[2]\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.PairwiseFusion","page":"Built-in Layers","title":"Flux.PairwiseFusion","text":"PairwiseFusion(connection, layers...)\n\nArguments\n\nconnection: A function taking 2 inputs and combining them into a single output \nlayers: The layers whose outputs are combined\n\nInputs\n\nThis layer behaves differently based on input type:\n\nIf input x is a tuple of length N (or the input is xs with N x's), matching the number of layers, \n\nthen each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:\n\nx1 → layer1 → y1 ↘\n connection → layer2 → y2 ↘\n x2 ↗ connection → layer3 → y3\n x3 ↗\n\n... 
or written as:\n\ny1 = layer1(x1)\ny2 = layer2(connection(y1, x2))\ny3 = layer3(connection(y2, x3))\n\nWith just one input, each layer receives the same x combined with the previous output. Thus y = PairwiseFusion(connection, layers...)(x) obeys:\n\ny[1] == layers[1](x)\nfor i in 2:length(layers)\n y[i] == layers[i](connection(y[i-1], x))\nend\n\nReturns\n\nA tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Recurrent-Models","page":"Built-in Layers","title":"Recurrent Models","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"RNNCell\nRNN\nLSTMCell\nLSTM\nGRUCell\nGRU\nGRUv3Cell\nGRUv3","category":"page"},{"location":"reference/models/layers/#Flux.RNNCell","page":"Built-in Layers","title":"Flux.RNNCell","text":"RNNCell(in => out, σ = tanh; init_kernel = glorot_uniform, \n init_recurrent_kernel = glorot_uniform, bias = true)\n\nThe most basic recurrent layer. Essentially acts as a Dense layer, but with the output fed back into the input each time step.\n\nIn the forward pass, implements the function\n\nh' = sigma(W_i x + W_h h + b)\n\nand returns h'.\n\nSee RNN for a layer that processes entire sequences.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\nσ: The non-linearity to apply to the output. Default is tanh.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. 
Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\nrnncell(x, [h])\n\nThe arguments of the forward pass are:\n\nx: The input to the RNN. It should be a vector of size in or a matrix of size in x batch_size.\nh: The hidden state of the RNN. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.\n\nExamples\n\nr = RNNCell(3 => 5)\n\n# A sequence of length 10 and batch size 4\nx = [rand(Float32, 3, 4) for _ in 1:10]\n\n# Initialize the hidden state\nh = zeros(Float32, 5)\n\n# We collect the hidden states in an array `ŷ`\n# in case the loss depends on the entire sequence.\nŷ = []\n\nfor x_t in x\n h = r(x_t, h)\n ŷ = [ŷ..., h] # Cannot use `push!(ŷ, h)` here since mutation \n # is not automatic differentiation friendly yet.\n # Can use `ŷ = vcat(ŷ, [h])` as an alternative.\nend\n\nh # The final hidden state\nŷ # The hidden states at each time step\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.RNN","page":"Built-in Layers","title":"Flux.RNN","text":"RNN(in => out, σ = tanh; init_kernel = glorot_uniform, \n init_recurrent_kernel = glorot_uniform, bias = true)\n\nThe most basic recurrent layer. Essentially acts as a Dense layer, but with the output fed back into the input each time step. \n\nIn the forward pass computes\n\nh_t = sigma(W_i x_t + W_h h_t-1 + b)\n\nfor all len steps t in the input sequence. \n\nSee RNNCell for a layer that processes a single time step.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\nσ: The non-linearity to apply to the output. Default is tanh.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. 
Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\nrnn(x, [h])\n\nThe arguments of the forward pass are:\n\nx: The input to the RNN. It should be a matrix of size in x len or an array of size in x len x batch_size.\nh: The initial hidden state of the RNN. If given, it is a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.\n\nReturns all new hidden states h_t as an array of size out x len x batch_size.\n\nExamples\n\njulia> d_in, d_out, len, batch_size = 4, 6, 3, 5;\n\njulia> x = rand(Float32, (d_in, len, batch_size));\n\njulia> h = zeros(Float32, (d_out, batch_size));\n\njulia> rnn = RNN(d_in => d_out)\nRNN(\n RNNCell(4 => 6, tanh), # 66 parameters\n) # Total: 3 arrays, 66 parameters, 424 bytes.\n\njulia> y = rnn(x, h); # [y] = [d_out, len, batch_size]\n\nSometimes, the initial hidden state is a learnable parameter. In this case, the RNN should be wrapped in a custom struct.\n\nstruct Model\n rnn::RNN\n h0::AbstractVector\nend\n\nFlux.@layer Model\n\n(m::Model)(x) = m.rnn(x, m.h0)\n\nmodel = Model(RNN(32 => 64), zeros(Float32, 64))\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.LSTMCell","page":"Built-in Layers","title":"Flux.LSTMCell","text":"LSTMCell(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nThe Long Short Term Memory cell. Behaves like an RNN but generally exhibits a longer memory span over sequences.\n\nIn the forward pass, computes\n\ni_t = sigma(W_xi x_t + W_hi h_t-1 + b_i)\nf_t = sigma(W_xf x_t + W_hf h_t-1 + b_f)\nc_t = f_t odot c_t-1 + i_t odot tanh(W_xc x_t + W_hc h_t-1 + b_c)\no_t = sigma(W_xo x_t + W_ho h_t-1 + b_o)\nh_t = o_t odot tanh(c_t)\n\nThe LSTMCell returns the new hidden state h_t and cell state c_t for a single time step. 
See also LSTM for a layer that processes entire sequences.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\nlstmcell(x, (h, c))\nlstmcell(x)\n\nThe arguments of the forward pass are:\n\nx: The input to the LSTM. It should be a vector of size in or a matrix of size in x batch_size.\n(h, c): A tuple containing the hidden and cell states of the LSTM. They should be vectors of size out or matrices of size out x batch_size. If not provided, they are assumed to be vectors of zeros.\n\nReturns a tuple (h′, c′) containing the new hidden state and cell state in tensors of size out or out x batch_size. \n\nExamples\n\njulia> l = LSTMCell(3 => 5)\nLSTMCell(3 => 5) # 180 parameters\n\njulia> h = zeros(Float32, 5); # hidden state\n\njulia> c = zeros(Float32, 5); # cell state\n\njulia> x = rand(Float32, 3, 4); # in x batch_size\n\njulia> h′, c′ = l(x, (h, c));\n\njulia> size(h′) # out x batch_size\n(5, 4)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.LSTM","page":"Built-in Layers","title":"Flux.LSTM","text":"LSTM(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nLong Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.\n\nSee this article for a good overview of the internals.\n\nIn the forward pass, computes\n\ni_t = sigma(W_xi x_t + W_hi h_t-1 + b_i)\nf_t = sigma(W_xf x_t + W_hf h_t-1 + b_f)\nc_t = f_t odot c_t-1 + i_t odot tanh(W_xc x_t + W_hc h_t-1 + b_c)\no_t = sigma(W_xo x_t + W_ho h_t-1 + b_o)\nh_t = o_t odot tanh(c_t)\n\nfor all len steps t in the input sequence. 
See LSTMCell for a layer that processes a single time step.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\nlstm(x, (h, c))\nlstm(x)\n\nThe arguments of the forward pass are:\n\nx: The input to the LSTM. It should be a matrix of size in x len or an array of size in x len x batch_size.\n(h, c): A tuple containing the hidden and cell states of the LSTM. They should be vectors of size out or matrices of size out x batch_size. If not provided, they are assumed to be vectors of zeros.\n\nReturns a tuple (h′, c′) containing all new hidden states h_t and cell states c_t in tensors of size out x len or out x len x batch_size.\n\nExamples\n\nstruct Model\n lstm::LSTM\n h0::AbstractVector\n c0::AbstractVector\nend\n\nFlux.@layer Model\n\n(m::Model)(x) = m.lstm(x, (m.h0, m.c0))\n\nd_in, d_out, len, batch_size = 2, 3, 4, 5\nx = rand(Float32, (d_in, len, batch_size))\nmodel = Model(LSTM(d_in => d_out), zeros(Float32, d_out), zeros(Float32, d_out))\nh, c = model(x)\nsize(h) # out x len x batch_size\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GRUCell","page":"Built-in Layers","title":"Flux.GRUCell","text":"GRUCell(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. 
This implements the variant proposed in v1 of the referenced paper.\n\nIn the forward pass, computes\n\nr = sigma(W_xi x + W_hi h + b_i)\nz = sigma(W_xz x + W_hz h + b_z)\nh̃ = tanh(W_xh x + r odot W_hh h + b_h)\nh' = (1 - z) odot h̃ + z odot h\n\nSee also GRU for a layer that processes entire sequences.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\ngrucell(x, h)\ngrucell(x)\n\nThe arguments of the forward pass are:\n\nx: The input to the GRU. It should be a vector of size in or a matrix of size in x batch_size.\nh: The hidden state of the GRU. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.\n\nReturns the new hidden state h' as an array of size out or out x batch_size.\n\nExamples\n\njulia> g = GRUCell(3 => 5)\nGRUCell(3 => 5) # 135 parameters\n\njulia> h = zeros(Float32, 5); # hidden state\n\njulia> x = rand(Float32, 3, 4); # in x batch_size\n\njulia> h′ = g(x, h);\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GRU","page":"Built-in Layers","title":"Flux.GRU","text":"GRU(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.\n\nThe forward pass computes\n\nr_t = sigma(W_xi x_t + W_hi h_t-1 + b_i)\nz_t = sigma(W_xz x_t + W_hz h_t-1 + b_z)\nh̃_t = tanh(W_xh x_t + r_t odot W_hh h_t-1 + b_h)\nh_t = (1 - z_t) odot h̃_t + z_t odot h_t-1\n\nfor all len steps t in the input sequence. 
See GRUCell for a layer that processes a single time step.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\ngru(x, [h])\n\nThe arguments of the forward pass are:\n\nx: The input to the GRU. It should be a matrix of size in x len or an array of size in x len x batch_size.\nh: The initial hidden state of the GRU. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros. \n\nReturns all new hidden states h_t as an array of size out x len x batch_size.\n\nExamples\n\nd_in, d_out, len, batch_size = 2, 3, 4, 5\ngru = GRU(d_in => d_out)\nx = rand(Float32, (d_in, len, batch_size))\nh0 = zeros(Float32, d_out)\nh = gru(x, h0) # out x len x batch_size\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GRUv3Cell","page":"Built-in Layers","title":"Flux.GRUv3Cell","text":"GRUv3Cell(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.\n\nThe forward pass computes\n\nr = sigma(W_xi x + W_hi h + b_i)\nz = sigma(W_xz x + W_hz h + b_z)\nh̃ = tanh(W_xh x + W_hh (r odot W_hh h) + b_h)\nh' = (1 - z) odot h̃ + z odot h\n\nand returns h'. This is a single time step of the GRU.\n\nSee GRUv3 for a layer that processes entire sequences. 
See GRU and GRUCell for variants of this layer.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\ngruv3cell(x, [h])\n\nThe arguments of the forward pass are:\n\nx: The input to the GRU. It should be a vector of size in or a matrix of size in x batch_size.\nh: The hidden state of the GRU. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.\n\nReturns the new hidden state h' as an array of size out or out x batch_size.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GRUv3","page":"Built-in Layers","title":"Flux.GRUv3","text":"GRUv3(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.\n\nThe forward pass computes\n\nr_t = sigma(W_xi x_t + W_hi h_t-1 + b_i)\nz_t = sigma(W_xz x_t + W_hz h_t-1 + b_z)\nh_t = tanh(W_xh x_t + W_hh (r_t odot W_hh h_t-1) + b_h)\nh_t = (1 - z_t) odot h_t + z_t odot h_t-1\n\nfor all len steps t in the input sequence. See GRUv3Cell for a layer that processes a single time step. See GRU and GRUCell for variants of this layer.\n\nNotice that GRUv3 is not a more advanced version of GRU but only a less popular variant.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. 
Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Normalisation-and-Regularisation","page":"Built-in Layers","title":"Normalisation & Regularisation","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"BatchNorm\nDropout\nAlphaDropout\nLayerNorm\nInstanceNorm\nGroupNorm\nFlux.normalise","category":"page"},{"location":"reference/models/layers/#Flux.BatchNorm","page":"Built-in Layers","title":"Flux.BatchNorm","text":"BatchNorm(channels::Integer, λ=identity;\n initβ=zeros32, initγ=ones32,\n affine=true, track_stats=true, active=nothing,\n eps=1f-5, momentum=0.1f0)\n\nBatch Normalization layer. channels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N dimensions, call the N-1th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.\n\nBatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input slice and normalises the input accordingly.\n\nIf affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.\n\nAfter normalisation, elementwise activation λ is applied.\n\nIf track_stats=true, it accumulates mean and variance statistics during the training phase, which are then used to renormalise the input in the test phase.\n\nUse testmode! 
during inference.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = BatchNorm(3);\n\njulia> Flux.trainmode!(m);\n\njulia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Dropout","page":"Built-in Layers","title":"Flux.Dropout","text":"Dropout(p; [dims, rng, active])\n\nLayer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.\n\nWhile training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.\n\nBy default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.\n\nBy default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).\n\nKeyword rng lets you specify a custom random number generator. 
(Only supported on the CPU.)\n\nExamples\n\njulia> m = Chain(Dense(ones(3,2)), Dropout(0.4))\nChain(\n Dense(2 => 3), # 9 parameters\n Dropout(0.4),\n)\n\njulia> m(ones(2, 7)) # test mode, no effect\n3×7 Matrix{Float64}:\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n\njulia> Flux.trainmode!(m) # equivalent to use within gradient\nChain(\n Dense(2 => 3), # 9 parameters\n Dropout(0.4, active=true),\n)\n\njulia> m(ones(2, 7))\n3×7 Matrix{Float64}:\n 0.0 0.0 3.33333 0.0 0.0 0.0 0.0\n 3.33333 0.0 3.33333 0.0 3.33333 0.0 3.33333\n 3.33333 3.33333 0.0 3.33333 0.0 0.0 3.33333\n\njulia> y = m(ones(2, 10_000));\n\njulia> using Statistics\n\njulia> mean(y) # is about 2.0, same as in test mode\n1.9989999999999961\n\njulia> mean(iszero, y) # is about 0.4\n0.4003\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.AlphaDropout","page":"Built-in Layers","title":"Flux.AlphaDropout","text":"AlphaDropout(p; [rng, active])\n\nA dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.\n\nDoes nothing to the input once testmode! is true.\n\nExamples\n\njulia> using Statistics\n\njulia> x = randn32(1000,1);\n\njulia> m = Chain(Dense(1000 => 1000, selu), AlphaDropout(0.2));\n\njulia> Flux.trainmode!(m);\n\njulia> y = m(x);\n\njulia> isapprox(std(x), std(y), atol=0.2)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.LayerNorm","page":"Built-in Layers","title":"Flux.LayerNorm","text":"LayerNorm(size..., λ=identity; affine=true, eps=1f-5)\n\nA normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.\n\nIn the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. 
The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.\n\nIf affine=true, it also applies a learnable shift and rescaling using the Scale layer.\n\nSee also BatchNorm, InstanceNorm, GroupNorm, and normalise.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = LayerNorm(3);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y, dims=1:3), ones(1, 1, 1, 2), atol=0.1) && std(y, dims=1:3) != std(xs, dims=1:3)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.InstanceNorm","page":"Built-in Layers","title":"Flux.InstanceNorm","text":"InstanceNorm(channels::Integer, λ=identity;\n initβ=zeros32, initγ=ones32,\n affine=false, track_stats=false,\n eps=1f-5, momentum=0.1f0)\n\nInstance Normalization layer. channels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N > 2 dimensions, call the N-1th the channel dimension. 
For WHCN images it's the usual channel dimension.\n\nInstanceNorm computes the mean and variance for each D_1×...×D_{N-2}×1×1 input slice and normalises the input accordingly.\n\nIf affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.\n\nIf track_stats=true, it accumulates mean and variance statistics during the training phase, which are then used to renormalise the input in the test phase.\n\nWarning: the defaults for affine and track_stats used to be true in previous Flux versions (< v0.12).\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = InstanceNorm(3);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y, dims=1:2), ones(1, 1, 3, 2), atol=0.2) && std(y, dims=1:2) != std(xs, dims=1:2)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GroupNorm","page":"Built-in Layers","title":"Flux.GroupNorm","text":"GroupNorm(channels::Int, G::Int, λ = identity;\n initβ = zeros32,\n initγ = ones32,\n affine = true,\n eps = 1f-5,\n momentum = 0.1f0)\n\nGroup Normalization layer.\n\nchannels is the number of channels, the channel dimension of your input. For an array of N dimensions, the N-1th index is the channel dimension.\n\nG is the number of groups along which the statistics are computed. The number of channels must be an integer multiple of the number of groups.\n\nchannels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N > 2 dimensions, call the N-1th the channel dimension. 
For WHCN images it's the usual channel dimension.\n\nIf affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 4, 2); # a batch of 2 images, each having 4 channels\n\njulia> m = GroupNorm(4, 2);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y[:, :, 1:2, 1]), 1, atol=0.1) && std(xs[:, :, 1:2, 1]) != std(y[:, :, 1:2, 1])\ntrue\n\njulia> isapprox(std(y[:, :, 3:4, 2]), 1, atol=0.1) && std(xs[:, :, 3:4, 2]) != std(y[:, :, 3:4, 2])\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.normalise","page":"Built-in Layers","title":"Flux.normalise","text":"normalise(x; dims=ndims(x), eps=1f-5)\n\nNormalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. By default, dims is the last dimension. eps is a small term added to the variance for numerical stability.\n\nExamples\n\njulia> using Statistics\n\njulia> x = [90, 100, 110, 130, 70];\n\njulia> mean(x), std(x; corrected=false)\n(100.0, 20.0)\n\njulia> y = Flux.normalise(x)\n5-element Vector{Float64}:\n -0.4999999999999375\n 0.0\n 0.4999999999999375\n 1.4999999999998124\n -1.4999999999998124\n\njulia> isapprox(std(y; corrected=false), 1, atol=1e-5)\ntrue\n\njulia> x = rand(10:100, 10, 10);\n\njulia> y = Flux.normalise(x, dims=1);\n\njulia> isapprox(std(y; dims=1, corrected=false), ones(1, 10), atol=1e-5)\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Test-vs.-Train","page":"Built-in Layers","title":"Test vs. Train","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Several normalisation layers behave differently under training and inference (testing). 
By default, Flux will automatically determine when a layer evaluation is part of training or inference.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"warning: Warning\nThis automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"testmode!\ntrainmode!","category":"page"},{"location":"reference/models/layers/#Flux.testmode!","page":"Built-in Layers","title":"Flux.testmode!","text":"testmode!(model, [mode]) -> model\n\nSet a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.\n\nIf you manually set a model into test mode, you need to manually place it back into train mode during training phase, using trainmode!.\n\nThere is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.\n\nExample\n\njulia> d = Dropout(0.3)\nDropout(0.3)\n\njulia> testmode!(d) # dropout is now always disabled\nDropout(0.3, active=false)\n\njulia> trainmode!(d) # dropout is now always enabled\nDropout(0.3, active=true)\n\njulia> testmode!(d, :auto) # back to default\nDropout(0.3)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.trainmode!","page":"Built-in Layers","title":"Flux.trainmode!","text":"trainmode!(model) -> model\n\nSet a layer, or all layers in a model, to training mode. 
Opposite to testmode!, see further details there.\n\n\n\n\n\n","category":"function"},{"location":"reference/training/enzyme/#autodiff-enzyme","page":"Gradients – Enzyme.jl","title":"Automatic Differentiation using Enzyme.jl","text":"","category":"section"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Enzyme.jl is a new package for automatic differentiation. Like Zygote.jl, calling gradient(f, x) causes it to hook into the compiler and transform code that is executed while calculating f(x), in order to produce code for ∂f/∂x. But it does so much later in the optimisation process (on LLVM instead of Julia's untyped IR), which you can read about here. It needs far fewer custom rules than Zygote/ChainRules, and in particular is able to support mutation of arrays.","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Flux now builds in support for this, using Enzyme's own Duplicated type. Calling Duplicated on any Flux model which was defined using @layer will allocate space for the gradient, and passing that to gradient (or withgradient, or train!) will then use Enzyme instead of Zygote. 
The gradient functions still return the gradient as usual, which can then be passed to update!:","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"julia> using Flux, Enzyme\n\njulia> model = Chain(Dense(28^2 => 32, sigmoid), Dense(32 => 10), softmax); # from model zoo\n\njulia> dup_model = Enzyme.Duplicated(model) # this allocates space for the gradient\nDuplicated(\n Chain(\n Dense(784 => 32, σ), # 25_120 parameters\n Dense(32 => 10), # 330 parameters\n NNlib.softmax,\n ),\n # norm(∇) ≈ 0.0f0\n) # Total: 4 arrays, 25_450 parameters, 199.391 KiB.\n\njulia> x1 = randn32(28*28, 1); # fake image\n\njulia> y1 = [i==3 for i in 0:9]; # fake label\n\njulia> grads_f = Flux.gradient((m,x,y) -> sum(abs2, m(x) .- y), dup_model, Const(x1), Const(y1)) # uses Enzyme\n((layers = ((weight = Float32[-0.010354728 0.032972857 …\n -0.0014538406], σ = nothing), nothing),), nothing, nothing)","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"The gradient returned here is also stored within dup_model. Both share the same arrays – what is returned is not a copy, just a view of the same memory (wrapped in NamedTuples instead of structs). They will all be set to zero when you call gradient again, then replaced with the new values. Alternatively, gradient(f, args...; zero=false) will add the new gradient to what's already stored.","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Writing Const(x1) is optional, just plain x1 is implicitly constant. Any set of Duplicated and Const arguments may appear in any order, so long as there is at least one Duplicated.","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"The gradient grads_f[1] can be passed to update! as usual. 
But for convenience, you may also use what is stored within Duplicated. These are equivalent ways to perform an update step:","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"julia> opt_state = Flux.setup(Adam(), model)\n\njulia> ans == Flux.setup(Adam(), dup_model)\n\njulia> Flux.update!(opt_state, model, grads_f[1]) # exactly as for Zygote gradients\n\njulia> Flux.update!(opt_state, dup_model) # equivalent new path, Enzyme only","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Instead of using these Flux functions, you can also use Enzyme's own functions directly. Enzyme.gradient works like this:","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"julia> grads_e = Enzyme.gradient(Reverse, (m,x,y) -> sum(abs2, m(x) .- y), model, Const(x1), Const(y1))\n(Chain(Dense(784 => 32, σ), Dense(32 => 10), softmax), nothing, nothing)\n\njulia> grads_f[1].layers[2].bias ≈ grads_e[1].layers[2].bias\ntrue","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Note that what Enzyme.gradient returns is an object like deepcopy(model) of the same type, grads_e[1] isa Chain. But its fields contain the same gradient.","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"There is also a method of train! 
which similarly takes Duplicated(model):","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"julia> opt_state = Flux.setup(Adam(0), model);\n\njulia> Flux.train!((m,x,y) -> sum(abs2, m(x) .- y), dup_model, [(x1, y1)], opt_state)","category":"page"},{"location":"reference/training/enzyme/#Second-order-AD","page":"Gradients – Enzyme.jl","title":"Second-order AD","text":"","category":"section"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"If you calculate a gradient within the loss function, then training will involve second derivatives. While this is in principle supported by Zygote.jl, there are many bugs, and Enzyme.jl is probably a better choice.","category":"page"},{"location":"reference/training/enzyme/#Listing","page":"Gradients – Enzyme.jl","title":"Listing","text":"","category":"section"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Flux.gradient(f, args::Union{Flux.EnzymeCore.Const, Flux.EnzymeCore.Duplicated}...)\nFlux.withgradient(f, args::Union{Flux.EnzymeCore.Const, Flux.EnzymeCore.Duplicated}...)\nFlux.train!(loss, model::Flux.EnzymeCore.Duplicated, data, opt)","category":"page"},{"location":"reference/training/enzyme/#Flux.gradient-Tuple{Any, Vararg{Union{EnzymeCore.Const, EnzymeCore.Duplicated}}}","page":"Gradients – Enzyme.jl","title":"Flux.gradient","text":"gradient(f, args::Union{Const,Duplicated}...)\n\nThis should return the same answer as gradient(f, args...), but it uses Enzyme.jl instead of Zygote.jl to compute the derivative.\n\nOnly available when Enzyme is loaded!\n\nThis method is used when at least one argument is of type Duplicated, and all unspecified arguments are wrapped in Const. Note that Enzyme's Active is not supported.\n\nBesides being returned, the gradient is also stored within the Duplicated object. 
Calling Enzyme.Duplicated(model) allocates space for the gradient, which is zeroed before use when calling gradient. With the keyword zero=false, the new gradient will instead be added to what is already stored.\n\nwarning: Experimental\nEnzyme support like this is new and somewhat experimental. This method was added in Flux 0.15.\n\nExample\n\njulia> using Flux\n\njulia> model = Chain(Dense([3.0;;]));\n\njulia> Flux.gradient(model, [1]) do m, x # computed using Zygote\n sum(abs2, m(x))\n end\n((layers = ((weight = [6.0;;], bias = [6.0], σ = nothing),),), [18.0])\n\njulia> using Enzyme\n\njulia> dup_model = Duplicated(model); # allocates space for gradient\n\njulia> Flux.gradient(dup_model, Const([1])) do m, x # Enzyme, returns the same\n sum(abs2, m(x))\n end\n((layers = ((weight = [6.0;;], bias = [6.0], σ = nothing),),), nothing)\n\njulia> dup_model # same gradient is also stored within Duplicated\nDuplicated(\n Chain(\n Dense(1 => 1), # 2 parameters\n ),\n # norm(∇) ≈ 8.49\n)\n\njulia> Flux.destructure((weight = [6.0;;], bias = [6.0]))[1] |> norm\n8.48528137423857\n\njulia> Flux.gradient(dup_model, [1]; zero=false) do m, x # implicit Const([1]), and grad accumulation\n sum(abs2, m(x))\n end\n((layers = ((weight = [12.0;;], bias = [12.0], σ = nothing),),), nothing)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/enzyme/#Flux.withgradient-Tuple{Any, Vararg{Union{EnzymeCore.Const, EnzymeCore.Duplicated}}}","page":"Gradients – Enzyme.jl","title":"Flux.withgradient","text":"withgradient(f, args::Union{Const,Duplicated}...)\n\nThis should return the same answer as withgradient(f, model, args...), but it uses Enzyme.jl instead of Zygote.jl to compute the derivative.\n\nOnly available when Enzyme is loaded!\n\nwarning: Experimental\nEnzyme support like this is new and somewhat experimental. 
This method was added in Flux 0.15.\n\nExample\n\njulia> using Flux, Enzyme\n\njulia> model = Chain(Embedding([1.1 2.2 3.3]), Dense([4.4;;]), only);\n\njulia> model(3)\n14.52\n\njulia> Flux.withgradient(m -> m(3), model) # this uses Zygote\n(val = 14.52, grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))\n\njulia> Flux.withgradient(m -> m(3), Duplicated(model)) # this uses Enzyme\n(val = 14.52, grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))\n\nThe function f may return Tuple or NamedTuple, with the loss as the first element. The gradient is then grad = gradient(first∘f, args...) but the returned value is val = f(args...):\n\njulia> Flux.withgradient(m -> (m(3), \"aux\"), Duplicated(model))\n(val = (14.52, \"aux\"), grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))\n\njulia> Flux.withgradient(m -> (loss=m(3), aux=round.(m.(1:3); digits=3)), Duplicated(model))\n(val = (loss = 14.52, aux = [4.84, 9.68, 14.52]), grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/enzyme/#Flux.Train.train!-Tuple{Any, EnzymeCore.Duplicated, Any, Any}","page":"Gradients – Enzyme.jl","title":"Flux.Train.train!","text":"train!(loss, Duplicated(model), data, opt_state)\n\nThis method uses Enzyme.jl instead of Zygote.jl to compute the gradients, but is otherwise the same as train!(loss, model, data, opt_state).\n\nOnly available when Enzyme is loaded.\n\ncompat: New\nThis method was added in Flux 0.13.9.\n\n\n\n\n\n","category":"method"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Enzyme.jl has its own extensive documentation.","category":"page"},{"location":"reference/data/mldatadevices/","page":"Transfer Data to GPU – 
MLDataDevices.jl","title":"Transfer Data to GPU – MLDataDevices.jl","text":"CurrentModule = MLDataDevices\nCollapsedDocStrings = true","category":"page"},{"location":"reference/data/mldatadevices/#Transferring-data-across-devices","page":"Transfer Data to GPU – MLDataDevices.jl","title":"Transferring data across devices","text":"","category":"section"},{"location":"reference/data/mldatadevices/","page":"Transfer Data to GPU – MLDataDevices.jl","title":"Transfer Data to GPU – MLDataDevices.jl","text":"Flux relies on the MLDataDevices.jl package to manage devices and transfer data across them. You don't have to explicitly use the package, as Flux re-exports the necessary functions and types.","category":"page"},{"location":"reference/data/mldatadevices/","page":"Transfer Data to GPU – MLDataDevices.jl","title":"Transfer Data to GPU – MLDataDevices.jl","text":"MLDataDevices.cpu_device\nMLDataDevices.default_device_rng\nMLDataDevices.functional\nMLDataDevices.get_device\nMLDataDevices.gpu_device\nMLDataDevices.gpu_backend!\nMLDataDevices.get_device_type\nMLDataDevices.isleaf\nMLDataDevices.loaded\nMLDataDevices.reset_gpu_device!\nMLDataDevices.set_device!\nMLDataDevices.supported_gpu_backends\nMLDataDevices.DeviceIterator","category":"page"},{"location":"reference/data/mldatadevices/#MLDataDevices.cpu_device","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.cpu_device","text":"cpu_device() -> CPUDevice()\n\nReturn a CPUDevice object which can be used to transfer data to CPU.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.default_device_rng","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.default_device_rng","text":"default_device_rng(::AbstractDevice)\n\nReturns the default RNG for the device. 
This can be used to directly generate parameters and states on the device using WeightInitializers.jl.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.functional","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.functional","text":"functional(x::AbstractDevice) -> Bool\nfunctional(::Type{<:AbstractDevice}) -> Bool\n\nChecks if the device is functional. This is used to determine if the device can be used for computation. Note that even if the backend is loaded (as checked via MLDataDevices.loaded), the device may not be functional.\n\nNote that while this function is not exported, it is considered part of the public API.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.get_device","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.get_device","text":"get_device(x) -> dev::AbstractDevice | Exception | Nothing\n\nIf all arrays (on the leaves of the structure) are on the same device, we return that device. Otherwise, we throw an error. If the object is device agnostic, we return nothing.\n\nnote: Note\nTrigger Packages must be loaded for this to return the correct device.\n\nSpecial Returned Values\n\nnothing – denotes that the object is device agnostic. 
For example, scalar, abstract range, etc.\nUnknownDevice() – denotes that the device type is unknown.\n\nSee also get_device_type for a faster alternative that can be used for dispatch based on device type.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.gpu_device","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.gpu_device","text":"gpu_device(device_id::Union{Nothing, Integer}=nothing;\n force::Bool=false) -> AbstractDevice\n\nSelects GPU device based on the following criteria:\n\nIf gpu_backend preference is set and the backend is functional on the system, then that device is selected.\nOtherwise, an automatic selection algorithm is used. We go over possible device backends in the order specified by supported_gpu_backends() and select the first functional backend.\nIf no GPU device is functional and force is false, then cpu_device() is invoked.\nIf nothing works, an error is thrown.\n\nArguments\n\ndevice_id::Union{Nothing, Integer}: The device id to select. If nothing, then we return the last selected device or if none was selected then we run the autoselection and choose the current device using CUDA.device() or AMDGPU.device() or similar. If Integer, then we select the device with the given id. Note that this is 1-indexed, in contrast to the 0-indexed CUDA.jl. For example, id = 4 corresponds to CUDA.device!(3).\n\nwarning: Warning\ndevice_id is only applicable for CUDA and AMDGPU backends. For Metal, oneAPI and CPU backends, device_id is ignored and a warning is printed.\n\nwarning: Warning\ngpu_device won't select a CUDA device unless both CUDA.jl and cuDNN.jl are loaded. This is to ensure that deep learning operations work correctly. Nonetheless, if cuDNN is not loaded you can still manually create a CUDADevice object and use it (e.g. 
dev = CUDADevice()).\n\nKeyword Arguments\n\nforce::Bool: If true, then an error is thrown if no functional GPU device is found.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.gpu_backend!","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.gpu_backend!","text":"gpu_backend!() = gpu_backend!(\"\")\ngpu_backend!(backend) = gpu_backend!(string(backend))\ngpu_backend!(backend::AbstractGPUDevice)\ngpu_backend!(backend::String)\n\nCreates a LocalPreferences.toml file with the desired GPU backend.\n\nIf backend == \"\", then the gpu_backend preference is deleted. Otherwise, backend is validated to be one of the possible backends and the preference is set to backend.\n\nIf a new backend is successfully set, then the Julia session must be restarted for the change to take effect.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.get_device_type","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.get_device_type","text":"get_device_type(x) -> Type{<:AbstractDevice} | Exception | Type{Nothing}\n\nSimilar to get_device but returns the type of the device instead of the device itself. This value is often a compile-time constant, so it is recommended over get_device when defining dispatches based on the device type.\n\nnote: Note\nTrigger Packages must be loaded for this to return the correct device.\n\nSpecial Returned Values\n\nNothing – denotes that the object is device agnostic. 
For example, scalar, abstract range, etc.\nUnknownDevice – denotes that the device type is unknown.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.isleaf","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.isleaf","text":"isleaf(x) -> Bool\n\nReturns true if x is a leaf node in the data structure.\n\nDefining MLDataDevices.isleaf(x::T) = true for custom types can be used to customize the data movement behavior when an object with nested structure containing the type is transferred to a device.\n\nAdapt.adapt_structure(::AbstractDevice, x::T) or Adapt.adapt_structure(::AbstractDevice, x::T) will be called during data movement if isleaf(x::T).\n\nIf MLDataDevices.isleaf(x::T) is not defined, then it will fall back to Functors.isleaf(x).\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.loaded","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.loaded","text":"loaded(x::AbstractDevice) -> Bool\nloaded(::Type{<:AbstractDevice}) -> Bool\n\nChecks if the trigger package for the device is loaded. Trigger packages are as follows:\n\nCUDA.jl and cuDNN.jl (or just LuxCUDA.jl) for NVIDIA CUDA Support.\nAMDGPU.jl for AMD GPU ROCM Support.\nMetal.jl for Apple Metal GPU Support.\noneAPI.jl for Intel oneAPI GPU Support.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.reset_gpu_device!","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.reset_gpu_device!","text":"reset_gpu_device!()\n\nResets the selected GPU device. This is useful when automatic GPU selection needs to be run again.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.set_device!","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.set_device!","text":"set_device!(T::Type{<:AbstractDevice}, dev_or_id)\n\nSet the device for the given type. 
This is a no-op for CPUDevice. For CUDADevice and AMDGPUDevice, it prints a warning if the corresponding trigger package is not loaded.\n\nCurrently, MetalDevice and oneAPIDevice don't support setting the device.\n\nArguments\n\nT::Type{<:AbstractDevice}: The device type to set.\ndev_or_id: Can be the device from the corresponding package. For example for CUDA it can be a CuDevice. If it is an integer, it is the device id to set. This is 1-indexed.\n\ndanger: Danger\nThis specific function should be considered experimental at this point and is currently provided to support distributed training in Lux. As such please use Lux.DistributedUtils instead of using this function.\n\n\n\n\n\nset_device!(T::Type{<:AbstractDevice}, ::Nothing, rank::Integer)\n\nSet the device for the given type. This is a no-op for CPUDevice. For CUDADevice and AMDGPUDevice, it prints a warning if the corresponding trigger package is not loaded.\n\nCurrently, MetalDevice and oneAPIDevice don't support setting the device.\n\nArguments\n\nT::Type{<:AbstractDevice}: The device type to set.\nrank::Integer: Local Rank of the process. This is applicable for distributed training and must be 0-indexed.\n\ndanger: Danger\nThis specific function should be considered experimental at this point and is currently provided to support distributed training in Lux. 
As such please use Lux.DistributedUtils instead of using this function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.supported_gpu_backends","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.supported_gpu_backends","text":"supported_gpu_backends() -> Tuple{String, ...}\n\nReturn a tuple of supported GPU backends.\n\nwarning: Warning\nThis is not the list of functional backends on the system, but rather backends which MLDataDevices.jl supports.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.DeviceIterator","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.DeviceIterator","text":"DeviceIterator(dev::AbstractDevice, iterator)\n\nCreate a DeviceIterator that iterates through the provided iterator via iterate. Upon each iteration, the current batch is copied to the device dev, and the previous iteration is marked as freeable from GPU memory (via unsafe_free!) (no-op for a CPU device).\n\nThe conversion follows the same semantics as dev().\n\ntip: Similarity to `CUDA.CuIterator`\nThe design inspiration was taken from CUDA.CuIterator and was generalized to work with other backends and more complex iterators (using Functors).\n\ntip: `MLUtils.DataLoader`\nCalling dev(::MLUtils.DataLoader) will automatically convert the dataloader to use the same semantics as DeviceIterator. 
This is generally preferred over looping over the dataloader directly and transferring the data to the device.\n\nExamples\n\nThe following was run on a computer with an NVIDIA GPU.\n\njulia> using MLDataDevices, MLUtils\n\njulia> X = rand(Float64, 3, 33);\n\njulia> dataloader = DataLoader(X; batchsize=13, shuffle=false);\n\njulia> for (i, x) in enumerate(dataloader)\n @show i, summary(x)\n end\n(i, summary(x)) = (1, \"3×13 Matrix{Float64}\")\n(i, summary(x)) = (2, \"3×13 Matrix{Float64}\")\n(i, summary(x)) = (3, \"3×7 Matrix{Float64}\")\n\njulia> for (i, x) in enumerate(CUDADevice()(dataloader))\n @show i, summary(x)\n end\n(i, summary(x)) = (1, \"3×13 CuArray{Float32, 2, CUDA.DeviceMemory}\")\n(i, summary(x)) = (2, \"3×13 CuArray{Float32, 2, CUDA.DeviceMemory}\")\n(i, summary(x)) = (3, \"3×7 CuArray{Float32, 2, CUDA.DeviceMemory}\")\n\n\n\n\n\n","category":"type"},{"location":"tutorials/custom_layers/#man-advanced","page":"Custom Layers","title":"Defining Customised Layers","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Here we will try and describe usage of some more advanced features that Flux provides to give more control over model building.","category":"page"},{"location":"tutorials/custom_layers/#Custom-Model-Example","page":"Custom Layers","title":"Custom Model Example","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Here is a basic example of a custom model. 
It simply adds the input to the result from the neural network.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"struct CustomModel{T <: Chain} # Parameter to avoid type instability\n chain::T\nend\n\nfunction (m::CustomModel)(x)\n # Arbitrary code can go here, but note that everything will be differentiated.\n # Zygote does not allow some operations, like mutating arrays.\n\n return m.chain(x) + x\nend\n\n# This is optional but recommended for pretty printing and other niceties\nFlux.@layer CustomModel","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Notice that we parameterized the type of the chain field. This is necessary for fast Julia code, so that that struct field can be given a concrete type. Chains have a type parameter fully specifying the types of the layers they contain. By using a type parameter, we are freeing Julia to determine the correct concrete type, so that we do not need to specify the full, possibly quite long, type ourselves.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"You can then use the model like:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"chain = Chain(Dense(10 => 10, relu), Dense(10 => 10))\nmodel = CustomModel(chain)\nmodel(rand(Float32, 10))","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"For an intro to Flux and automatic differentiation, see this tutorial.","category":"page"},{"location":"tutorials/custom_layers/#Customising-Parameter-Collection-for-a-Model","page":"Custom Layers","title":"Customising Parameter Collection for a Model","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Taking reference from our example Affine 
layer from the basics.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"By default, all the fields in the Affine type are collected as its parameters. In some cases, however, a layer may hold other metadata that is not needed for training and should be ignored when the parameters are collected. With Flux, the way to mark only some fields of a layer as trainable is to overload the trainable function:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"julia> struct Affine\n W\n b\n end\n\njulia> Affine(in::Int, out::Int) = Affine(randn(out, in), randn(out));\n\njulia> (m::Affine)(x) = m.W * x .+ m.b;\n\njulia> Flux.@layer Affine\n\njulia> a = Affine(Float32[1 2; 3 4; 5 6], Float32[7, 8, 9])\nAffine(Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], Float32[7.0, 8.0, 9.0])\n\njulia> Flux.trainable(a) # default behavior\n(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], b = Float32[7.0, 8.0, 9.0])\n\njulia> Flux.trainable(a::Affine) = (; W = a.W) # returns a NamedTuple using the field's name\n\njulia> Flux.trainable(a)\n(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0],)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Only the fields returned by trainable will be seen by Flux.setup and Flux.update! for training. But all fields will be seen by gpu and similar functions, for example:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"julia> a |> f16\nAffine(Float16[1.0 2.0; 3.0 4.0; 5.0 6.0], Float16[7.0, 8.0, 9.0])","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Note that there is no need to overload trainable to hide fields which do not contain numerical arrays (for example, activation functions, or Boolean flags). 
These are always ignored by training.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"The exact same method of trainable can also be defined using the macro, for convenience:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux.@layer Affine trainable=(W,)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"There is a second, more severe, kind of restriction possible. This is not recommended, but is included here for completeness. Calling Functors.@functor Affine (W,) means that no exploration of the model will ever visit the other fields: they will not be moved to the GPU by gpu, and their precision will not be changed by f32. This requires the struct to have a corresponding constructor that accepts only W as an argument.","category":"page"},{"location":"tutorials/custom_layers/#Custom-multiple-input-or-output-layer","page":"Custom Layers","title":"Custom multiple input or output layer","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in machine learning literature is the inception module.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We could have a struct that stores the weights along each path and implement the joining/splitting in the forward pass function. That would mean a new struct for each different block, e.g. 
one would have a TransformerBlock struct for a transformer block, and a ResNetBlock struct for a ResNet block, each block being composed of smaller sub-blocks. This is often the simplest and cleanest way to implement complex models.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"This guide will instead show you how to construct a high-level layer (like Chain) that is made of multiple sub-layers, one for each path. Using the layers described below may make the definition of your model harder to read and to change. In that case, consider using the simpler approach of defining a custom structure described above.","category":"page"},{"location":"tutorials/custom_layers/#Multiple-inputs:-a-custom-Join-layer","page":"Custom Layers","title":"Multiple inputs: a custom Join layer","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Our custom Join layer will accept multiple inputs at once, pass each input through a separate path, then combine the results together. Note that this layer can already be constructed using Parallel, but we will first walk through how to do this manually.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We start by defining a new struct, Join, that stores the different paths and a combine operation as its fields.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Flux\nusing CUDA\n\n# custom join layer\nstruct Join{T, F}\n combine::F\n paths::T\nend\n\n# allow Join(op, m1, m2, ...) as a constructor\nJoin(combine, paths...) = Join(combine, paths)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Notice again that we parameterized the type of the combine and paths fields. 
In addition to the performance considerations of concrete types, this allows either field to be Vectors, Tuples, or one of each - we don't need to pay attention to which.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"The next step is to use Flux.@layer to make our struct behave like a Flux layer. In Flux < v0.15 this used to be important so that calling Flux.setup on a Join maps over the underlying trainable arrays on each path. Since Flux v0.15, this is no longer necessary, since now Functors.jl automatically traverses custom types. However, Flux.@layer is still recommended for pretty printing and other niceties.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux.@layer Join","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Finally, we define the forward pass. For Join, this means applying each path in paths to each input array, then using combine to merge the results.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"(m::Join)(xs::Tuple) = m.combine(map((f, x) -> f(x), m.paths, xs)...)\n(m::Join)(xs...) = m(xs)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Lastly, we can test our new layer. 
Thanks to the proper abstractions in Julia, our layer works on GPU arrays out of the box!","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"model = Chain(\n Join(vcat,\n Chain(Dense(1 => 5, relu), Dense(5 => 1)), # branch 1\n Dense(1 => 2), # branch 2\n Dense(1 => 1) # branch 3\n ),\n Dense(4 => 1)\n ) |> gpu\n\nxs = map(gpu, (rand(1), rand(1), rand(1)))\n\nmodel(xs)\n# returns a single float vector with one value","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"note: Note\nThis Join layer is available from the Fluxperimental.jl package.","category":"page"},{"location":"tutorials/custom_layers/#Using-Parallel","page":"Custom Layers","title":"Using Parallel","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux already provides Parallel that can offer the same functionality. In this case, Join is going to just be syntactic sugar for Parallel.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Join(combine, paths) = Parallel(combine, paths)\nJoin(combine, paths...) 
= Join(combine, paths)\n\n# use vararg/tuple version of Parallel forward pass\nmodel = Chain(\n Join(vcat,\n Chain(Dense(1 => 5, relu), Dense(5 => 1)),\n Dense(1 => 2),\n Dense(1 => 1)\n ),\n Dense(4 => 1)\n ) |> gpu\n\nxs = map(gpu, (rand(1), rand(1), rand(1)))\n\nmodel(xs)\n# returns a single float vector with one value","category":"page"},{"location":"tutorials/custom_layers/#Multiple-outputs:-a-custom-Split-layer","page":"Custom Layers","title":"Multiple outputs: a custom Split layer","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Our custom Split layer will accept a single input, then pass the input through a separate path to produce multiple outputs.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We start by following the same steps as the Join layer: define a struct, use Flux.@layer, and define the forward pass.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Flux\nusing CUDA\n\n# custom split layer\nstruct Split{T}\n paths::T\nend\n\nSplit(paths...) 
= Split(paths)\n\nFlux.@layer Split\n\n(m::Split)(x::AbstractArray) = map(f -> f(x), m.paths)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Now we can test to see that our Split does indeed produce multiple outputs.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"model = Chain(\n Dense(10 => 5),\n Split(Dense(5 => 1, tanh), Dense(5 => 3, tanh), Dense(5 => 2))\n ) |> gpu\n\nmodel(gpu(rand(10)))\n# returns a tuple with three float vectors","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"A custom loss function for the multiple outputs may look like this:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Statistics\n\n# assuming model returns the output of a Split\n# x is a single input\n# ys is a tuple of outputs\nfunction loss(x, ys, model)\n # rms over all the mse\n ŷs = model(x)\n return sqrt(mean(Flux.mse(y, ŷ) for (y, ŷ) in zip(ys, ŷs)))\nend","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"note: Note\nThis Split layer is available from the Fluxperimental.jl package.","category":"page"},{"location":"guide/models/overview/#man-overview","page":"Fitting a Line","title":"Flux Overview: Fitting a Straight Line","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Flux is a pure Julia ML stack that allows you to build predictive models. 
Here are the steps for a typical Flux program:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Provide training and test data\nBuild a model with configurable parameters to make predictions\nIteratively train the model by tweaking the parameters to improve predictions\nVerify your model","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Here's how you'd use Flux to build and train the most basic of models, step by step.","category":"page"},{"location":"guide/models/overview/#A-Trivial-Prediction","page":"Fitting a Line","title":"A Trivial Prediction","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This example will predict the output of the function 4x + 2. Making such predictions is called \"linear regression\", and is really too simple to need a neural network. 
But it's a nice toy example.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"First, import Flux and define the function we want to simulate:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Flux\n\njulia> actual(x) = 4x + 2\nactual (generic function with 1 method)","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This example will build a model to approximate the actual function.","category":"page"},{"location":"guide/models/overview/#1.-Provide-Training-and-Test-Data","page":"Fitting a Line","title":"1. Provide Training and Test Data","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Use the actual function to build sets of data for training and verification:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> x_train, x_test = hcat(0:5...), hcat(6:10...)\n([0 1 … 4 5], [6 7 … 9 10])\n\njulia> y_train, y_test = actual.(x_train), actual.(x_test)\n([2 6 … 18 22], [26 30 … 38 42])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Normally, your training and test data come from real world observations, but here we simulate them.","category":"page"},{"location":"guide/models/overview/#2.-Build-a-Model-to-Make-Predictions","page":"Fitting a Line","title":"2. 
Build a Model to Make Predictions","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, build a model to make predictions with 1 input and 1 output:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> model = Dense(1 => 1)\nDense(1 => 1) # 2 parameters\n\njulia> model.weight\n1×1 Matrix{Float32}:\n 0.95041317\n\njulia> model.bias\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, a dense layer is a struct with fields weight and bias. weight holds the weight matrix and bias holds the bias vector. There's another way to think about a model. In Flux, models are conceptually predictive functions: ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict = Dense(1 => 1)\nDense(1 => 1) # 2 parameters","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Dense(1 => 1) also implements the function σ(Wx+b) where W and b are the weights and biases. σ is an activation function (more on activations later). Our model has one weight and one bias, but typical models will have many more. Think of weights and biases as knobs and levers Flux can use to tune predictions. Activation functions are transformations that tailor models to your needs. 
","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This model will already make predictions, though not accurate ones yet:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict(x_train)\n1×6 Matrix{Float32}:\n 0.0 0.906654 1.81331 2.71996 3.62662 4.53327","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"In order to make better predictions, you'll need to provide a loss function to tell Flux how to objectively evaluate the quality of a prediction. Loss functions measure the distance between actual values and predictions. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Statistics\n\njulia> loss(model, x, y) = mean(abs2.(model(x) .- y));\n\njulia> loss(predict, x_train, y_train)\n122.64734f0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"More accurate predictions will yield a lower loss. You can write your own loss functions or rely on those already provided by Flux. This loss function is called mean squared error (and is built in as Flux.mse). Flux works by iteratively reducing the loss through training.","category":"page"},{"location":"guide/models/overview/#3.-Improve-the-Prediction","page":"Fitting a Line","title":"3. Improve the Prediction","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, the Flux.train! 
function uses a loss function and training data to improve the parameters of your model based on a pluggable optimiser:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Flux: train!\n\njulia> opt = Descent()\nDescent(0.1f0)\n\njulia> data = [(x_train, y_train)]\n1-element Vector{Tuple{Matrix{Int64}, Matrix{Int64}}}:\n ([0 1 … 4 5], [2 6 … 18 22])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now we have the optimiser and the data we'll pass to train!. All that remains are the parameters of the model. Recall that each model is a Julia struct with a function and configurable parameters. The dense layer has weights and biases that depend on the dimensions of the inputs and outputs: ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict.weight\n1×1 Matrix{Float32}:\n 0.9066542\n\njulia> predict.bias\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The dimensions of these model parameters depend on the number of inputs and outputs.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Flux will adjust predictions by iteratively changing these parameters according to the optimiser.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This optimiser implements the classic gradient descent strategy. Now improve the parameters of the model with a call to Flux.train! 
like this:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> train!(loss, predict, data, opt)","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"And check the loss:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> loss(predict, x_train, y_train)\n116.38745f0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"It went down. Why? ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict.weight, predict.bias\n(Float32[7.246838;;], Float32[1.748103])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The parameters have changed. This single step is the essence of machine learning.","category":"page"},{"location":"guide/models/overview/#3.-Iteratively-Train-the-Model","page":"Fitting a Line","title":"3+. Iteratively Train the Model","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"In the previous section, we made a single call to train! which iterates over the data we passed in just once. An epoch refers to one pass over the dataset. Typically, we will run the training for multiple epochs to drive the loss down even further. 
Let's run it a few more times:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> for epoch in 1:200\n train!(loss, predict, data, opt)\n end\n\njulia> loss(predict, x_train, y_train)\n0.00339581f0\n\njulia> predict.weight, predict.bias\n(Float32[4.0159144;;], Float32[2.004479])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After 200 more training steps, the loss went down further, and the parameters are now close to those of the function the model is built to predict.","category":"page"},{"location":"guide/models/overview/#4.-Verify-the-Results","page":"Fitting a Line","title":"4. Verify the Results","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, let's verify the predictions:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict(x_test)\n1×5 Matrix{Float32}:\n 26.1121 30.13 34.1479 38.1657 42.1836\n\njulia> y_test\n1×5 Matrix{Int64}:\n 26 30 34 38 42","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The predictions are good. Here's how we got there. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"First, we generated simulated data and stored it in the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Then, we built a single input, single output predictive model, predict = Dense(1 => 1). 
The initial predictions weren't accurate, because we had not trained the model yet.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran train! many more times to finish the training process.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After we trained the model, we checked its predictions against the test data. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/destructure/#man-destructure","page":"Flat vs. Nested","title":"Flat vs. Nested Structures","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"A Flux model is a nested structure, with parameters stored within many layers. Sometimes you may want a flat representation of them, to interact with functions expecting just one vector. This is provided by destructure:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"julia> model = Chain(Dense(2=>1, tanh), Dense(1=>1))\nChain(\n Dense(2 => 1, tanh), # 3 parameters\n Dense(1 => 1), # 2 parameters\n) # Total: 4 arrays, 5 parameters, 276 bytes.\n\njulia> flat, rebuild = Flux.destructure(model)\n(Float32[0.863101, 1.2454957, 0.0, -1.6345707, 0.0], Restructure(Chain, ..., 5))\n\njulia> rebuild(zeros(5)) # same structure, new parameters\nChain(\n Dense(2 => 1, tanh), # 3 parameters (all zero)\n Dense(1 => 1), # 2 parameters (all zero)\n) # Total: 4 arrays, 5 parameters, 276 bytes.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Both destructure and the Restructure function can be used within gradient computations. For instance, this computes the Hessian ∂²L/∂θᵢ∂θⱼ of some loss function, with respect to all parameters of the Flux model. The resulting matrix has off-diagonal entries, which cannot really be expressed in a nested structure:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"julia> x = rand(Float32, 2, 16);\n\njulia> grad = gradient(m -> sum(abs2, m(x)), model) # nested gradient\n((layers = ((weight = Float32[10.339018 11.379145], bias = Float32[22.845667], σ = nothing), (weight = Float32[-29.565302;;], bias = Float32[-37.644184], σ = nothing)),),)\n\njulia> function loss(v::Vector)\n m = rebuild(v)\n y = m(x)\n sum(abs2, y)\n end;\n\njulia> gradient(loss, flat) # flat gradient, same numbers\n(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184],)\n\njulia> Zygote.hessian(loss, flat) # second derivative\n5×5 Matrix{Float32}:\n -7.13131 -5.54714 -11.1393 -12.6504 -8.13492\n -5.54714 -7.11092 -11.0208 -13.9231 -9.36316\n -11.1393 -11.0208 -13.7126 -27.9531 -22.741\n -12.6504 -13.9231 -27.9531 18.0875 23.03\n -8.13492 -9.36316 -22.741 23.03 32.0\n\njulia> Flux.destructure(grad) # acts on non-models, too\n(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184], Restructure(Tuple, ..., 5))","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"In order to collect all parameters of a model into a list instead, you can use the trainables function:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"julia> Flux.trainables(model)\n4-element Vector{AbstractArray}:\n [0.863101 1.2454957]\n [0.0]\n [1.290355429422727;;]\n [0.0]","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Any mutation of the elements of the resulting list will affect the model's parameters.","category":"page"},{"location":"reference/destructure/#All-Parameters","page":"Flat vs. Nested","title":"All Parameters","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"The functions destructure and trainables live in Optimisers.jl.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Optimisers.destructure\nOptimisers.trainable\nOptimisers.trainables\nOptimisers.isnumeric\nFlux.params","category":"page"},{"location":"reference/destructure/#Optimisers.destructure","page":"Flat vs. Nested","title":"Optimisers.destructure","text":"destructure(model) -> vector, reconstructor\n\nCopies all trainable, isnumeric parameters in the model to a vector, and returns also a function which reverses this transformation. Differentiable.\n\nExample\n\njulia> v, re = destructure((x=[1.0, 2.0], y=(sin, [3.0 + 4.0im])))\n(ComplexF64[1.0 + 0.0im, 2.0 + 0.0im, 3.0 + 4.0im], Restructure(NamedTuple, ..., 3))\n\njulia> re([3, 5, 7+11im])\n(x = [3.0, 5.0], y = (sin, ComplexF64[7.0 + 11.0im]))\n\nIf model contains various number types, they are promoted to make vector, and are usually restored by Restructure. Such restoration follows the rules of ChainRulesCore.ProjectTo, and thus will restore floating point precision, but will permit more exotic numbers like ForwardDiff.Dual.\n\nIf model contains only GPU arrays, then vector will also live on the GPU. At present, a mixture of GPU and ordinary CPU arrays is undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.trainable","page":"Flat vs. Nested","title":"Optimisers.trainable","text":"trainable(x::Layer) -> NamedTuple\n\nThis may be overloaded to make optimisers ignore some fields of every Layer, which would otherwise contain trainable parameters.\n\nwarning: Warning\nThis is very rarely required. Fields of struct Layer which contain functions, or integers like sizes, are always ignored anyway. 
Overloading trainable is only necessary when some arrays of numbers are to be optimised, and some arrays of numbers are not.\n\nThe default is Functors.children(x), usually a NamedTuple of all fields, and trainable(x) must contain a subset of these.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.trainables","page":"Flat vs. Nested","title":"Optimisers.trainables","text":"trainables(x, path = false)\n\nReturn an iterable over all the trainable parameters in x, that is all the numerical arrays (see isnumeric) which are reachable through trainable.\n\nParameters appearing multiple times in the model (tied weights) will be present only once in the output.\n\nIf path = false, the output is a list of numerical arrays.\n\nIf path = true, the output is a list of (KeyPath, AbstractArray) pairs, where KeyPath is a type representing the path to the array in the original structure.\n\nSee also destructure for a similar operation that returns a single flat vector instead.\n\nExamples\n\njulia> struct MyLayer\n w\n b\n end\n\njulia> Functors.@functor MyLayer\n\njulia> Optimisers.trainable(x::MyLayer) = (; w = x.w,) # only w is trainable in this example\n\njulia> x = MyLayer([1.0,2.0,3.0], [4.0,5.0,6.0]);\n\njulia> trainables(x)\n1-element Vector{AbstractArray}:\n [1.0, 2.0, 3.0]\n\n julia> x = MyLayer((a=[1.0,2.0], b=[3.0]), [4.0,5.0,6.0]);\n\n julia> trainables(x) # collects nested parameters\n 2-element Vector{AbstractArray}:\n [1.0, 2.0]\n [3.0]\n\njulia> x = (a = [1.0,2.0], b = (Dict(\"c\" => [3.0, 4.0], \"d\" => 5.0), [6.0,7.0]));\n\njulia> for (kp, y) in trainables(x, path = true)\n println(kp, \" => \", y)\n end\nKeyPath(:a,) => [1.0, 2.0]\nKeyPath(:b, 1, \"c\") => [3.0, 4.0]\nKeyPath(:b, 2) => [6.0, 7.0]\n\njulia> getkeypath(x, KeyPath(:b, 1, \"c\"))\n2-element Vector{Float64}:\n 3.0\n 4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.isnumeric","page":"Flat vs. 
Nested","title":"Optimisers.isnumeric","text":"isnumeric(x) -> Bool\n\nReturns true on any parameter to be adjusted by Optimisers.jl, namely arrays of non-integer numbers. Returns false on all other types.\n\nRequires also that Functors.isleaf(x) == true, to focus on e.g. the parent of a transposed matrix, not the wrapper.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Flux.params","page":"Flat vs. Nested","title":"Flux.params","text":"params(model)\n\nReturns a Zygote.Params object containing all parameter arrays from the model. This is deprecated! This function was the cornerstone of how Flux used Zygote's implicit mode gradients, but since Flux 0.13 we use explicit mode gradient(m -> loss(m, x, y), model) instead. To collect all the parameter arrays for other purposes, use Flux.trainables(model).\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#All-Layers","page":"Flat vs. Nested","title":"All Layers","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Another kind of flat view of a nested model is provided by the modules command. This extracts a list of all layers:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Flux.modules","category":"page"},{"location":"reference/destructure/#Flux.modules","page":"Flat vs. Nested","title":"Flux.modules","text":"modules(m)\n\nReturn an iterator over non-leaf objects that can be reached by recursing m over the children given by Functors.functor.\n\nUseful for applying a function (e.g. a regularizer) over specific modules or subsets of the parameters (e.g. 
the weights but not the biases).\n\nExamples\n\njulia> m1 = Chain(Dense(28^2, 64), BatchNorm(64, relu));\n\njulia> m2 = Chain(m1, Dense(64, 10))\nChain(\n Chain(\n Dense(784 => 64), # 50_240 parameters\n BatchNorm(64, relu), # 128 parameters, plus 128\n ),\n Dense(64 => 10), # 650 parameters\n) # Total: 6 trainable arrays, 51_018 parameters,\n # plus 2 non-trainable, 128 parameters, summarysize 200.211 KiB.\n\njulia> Flux.modules(m2)\n7-element Vector{Any}:\n Chain(Chain(Dense(784 => 64), BatchNorm(64, relu)), Dense(64 => 10)) # 51_018 parameters, plus 128 non-trainable\n (Chain(Dense(784 => 64), BatchNorm(64, relu)), Dense(64 => 10))\n Chain(Dense(784 => 64), BatchNorm(64, relu)) # 50_368 parameters, plus 128 non-trainable\n (Dense(784 => 64), BatchNorm(64, relu))\n Dense(784 => 64) # 50_240 parameters\n BatchNorm(64, relu) # 128 parameters, plus 128 non-trainable\n Dense(64 => 10) # 650 parameters\n\njulia> L2(m) = sum(sum(abs2, l.weight) for l in Flux.modules(m) if l isa Dense)\nL2 (generic function with 1 method)\n\njulia> L2(m2) isa Float32\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Save-and-Load","page":"Flat vs. Nested","title":"Save and Load","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Flux.state\nFlux.loadmodel!","category":"page"},{"location":"reference/destructure/#Flux.state","page":"Flat vs. Nested","title":"Flux.state","text":"state(x)\n\nReturn an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).\n\nBesides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. 
The leaf types that end up in the state could increase in the future.\n\nThis method is particularly useful for saving and loading models, since the state contains only simple data types that can be easily serialized.\n\nThe state can be passed to loadmodel! to restore the model.\n\nExamples\n\nCopy the state into another model\n\njulia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));\n\njulia> s = Flux.state(m1)\n(layers = ((weight = [1.0; 1.0;;], bias = [0.0, 0.0], σ = ()), (weight = [1.0 1.0], bias = [0.0], σ = ())),)\n\njulia> m2 = Chain(Dense(1, 2, tanh), Dense(2, 1; bias=false)); # weights are random numbers\n\njulia> Flux.loadmodel!(m2, s);\n\njulia> m2[1].weight # now the weights of m2 are the same as m1\n2×1 Matrix{Float32}:\n 1.0\n 1.0\n\njulia> Flux.state(trainmode!(Dropout(0.2))) # contains p & activity, but not RNG state\n(p = 0.2, dims = (), active = true, rng = ())\n\njulia> Flux.state(BatchNorm(1)) # contains non-trainable arrays μ, σ²\n(λ = (), β = Float32[0.0], γ = Float32[1.0], μ = Float32[0.0], σ² = Float32[1.0], ϵ = 1.0f-5, momentum = 0.1f0, affine = true, track_stats = true, active = nothing, chs = 1)\n\nSave and load with BSON\n\njulia> using BSON\n\njulia> BSON.@save \"checkpoint.bson\" model_state = s\n\njulia> Flux.loadmodel!(m2, BSON.load(\"checkpoint.bson\")[:model_state])\n\nSave and load with JLD2\n\njulia> using JLD2\n\njulia> JLD2.jldsave(\"checkpoint.jld2\", model_state = s)\n\njulia> Flux.loadmodel!(m2, JLD2.load(\"checkpoint.jld2\", \"model_state\"))\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Flux.loadmodel!","page":"Flat vs. Nested","title":"Flux.loadmodel!","text":"loadmodel!(dst, src)\n\nCopy all the parameters (trainable and non-trainable) from src into dst.\n\nRecursively walks dst and src together using Functors.children, calling copyto! on parameter arrays or throwing an error when there is a mismatch. 
Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).\n\nSee also Flux.state.\n\nExamples\n\njulia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))\nChain(\n Dense(5 => 2, tanh), # 12 parameters\n Dense(2 => 1), # 3 parameters\n) # Total: 4 arrays, 15 parameters, 316 bytes.\n\njulia> dst[1].weight ≈ ones(2, 5) # by construction\ntrue\n\njulia> src = Chain(Dense(5 => 2, relu), Dense(2 => 1, bias=false));\n\njulia> Flux.loadmodel!(dst, src);\n\njulia> dst[1].weight ≈ ones(2, 5) # values changed\nfalse\n\njulia> iszero(dst[2].bias)\ntrue\n\nExtended help\n\nThrows an error when:\n\ndst and src do not share the same fields (at any level)\nthe sizes of leaf nodes are mismatched between dst and src\ncopying non-array values to/from an array parameter (except inactive parameters described below)\ndst is a \"tied\" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values\n\nInactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#KeyPath","page":"Flat vs. Nested","title":"KeyPath","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Functors.KeyPath\nFunctors.getkeypath\nFunctors.haskeypath\nFunctors.setkeypath!","category":"page"},{"location":"reference/destructure/#Functors.KeyPath","page":"Flat vs. 
Nested","title":"Functors.KeyPath","text":"KeyPath(keys...)\n\nA type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, Int, or CartesianIndex.\n\nFor custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.\n\nFor string, integer, and cartesian index keys, the access is done with getindex instead.\n\nSee also getkeypath, haskeypath.\n\nExamples\n\njulia> kp = KeyPath(:b, 3)\nKeyPath(:b, 3)\n\njulia> KeyPath(:a, kp, :c, 4) # construct mixing keys and keypaths\nKeyPath(:a, :b, 3, :c, 4)\n\njulia> struct T\n a\n b\n end\n\njulia> function Base.getproperty(x::T, k::Symbol)\n if k in fieldnames(T)\n return getfield(x, k)\n elseif k === :ab\n return \"ab\"\n else \n error()\n end\n end;\n\njulia> Base.propertynames(::T) = (:a, :b, :ab);\n\njulia> x = T(3, Dict(:c => 4, :d => 5));\n\njulia> getkeypath(x, KeyPath(:ab)) # equivalent to x.ab\n\"ab\"\n\njulia> getkeypath(x, KeyPath(:b, :c)) # equivalent to (x.b)[:c]\n4\n\n\n\n\n\n","category":"type"},{"location":"reference/destructure/#Functors.getkeypath","page":"Flat vs. Nested","title":"Functors.getkeypath","text":"getkeypath(x, kp::KeyPath)\n\nReturn the value in x at the path kp.\n\nSee also KeyPath, haskeypath, and setkeypath!.\n\nExamples\n\njulia> x = Dict(:a => 3, :b => Dict(:c => 4, \"d\" => [5, 6, 7]))\nDict{Symbol, Any} with 2 entries:\n :a => 3\n :b => Dict{Any, Any}(:c=>4, \"d\"=>[5, 6, 7])\n\njulia> getkeypath(x, KeyPath(:b, \"d\", 2))\n6\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Functors.haskeypath","page":"Flat vs. 
Nested","title":"Functors.haskeypath","text":"haskeypath(x, kp::KeyPath)\n\nReturn true if x has a value at the path kp.\n\nSee also KeyPath, getkeypath, and setkeypath!.\n\nExamples\n\njulia> x = Dict(:a => 3, :b => Dict(:c => 4, \"d\" => [5, 6, 7]))\nDict{Symbol, Any} with 2 entries:\n :a => 3\n :b => Dict{Any, Any}(:c=>4, \"d\"=>[5, 6, 7])\n\njulia> haskeypath(x, KeyPath(:a))\ntrue\n\njulia> haskeypath(x, KeyPath(:b, \"d\", 1))\ntrue\n\njulia> haskeypath(x, KeyPath(:b, \"d\", 4))\nfalse\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Functors.setkeypath!","page":"Flat vs. Nested","title":"Functors.setkeypath!","text":"setkeypath!(x, kp::KeyPath, v)\n\nSet the value in x at the path kp to v.\n\nSee also KeyPath, getkeypath, and haskeypath.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/data/onehot/#One-Hot-Encoding-with-OneHotArrays.jl","page":"OneHotArrays.jl","title":"One-Hot Encoding with OneHotArrays.jl","text":"","category":"section"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"It's common to encode categorical variables (like true, false or cat, dog) in \"one-of-k\" or \"one-hot\" form. OneHotArrays.jl provides the onehot function to make this easy.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> using OneHotArrays\n\njulia> onehot(:b, [:a, :b, :c])\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n 1\n ⋅\n\njulia> onehot(:c, [:a, :b, :c])\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n ⋅\n 1","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"There is also a onecold function, which is an inverse of onehot. 
It can also be given an array of numbers instead of booleans, in which case it performs an argmax-like operation, returning the label with the highest corresponding weight.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> onecold(ans, [:a, :b, :c])\n:c\n\njulia> onecold([true, false, false], [:a, :b, :c])\n:a\n\njulia> onecold([0.3, 0.2, 0.5], [:a, :b, :c])\n:c","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"For multiple samples at once, onehotbatch creates a batch (matrix) of one-hot vectors, and onecold treats matrices as batches.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> using OneHotArrays\n\njulia> onehotbatch([:b, :a, :b], [:a, :b, :c])\n3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ 1 ⋅\n 1 ⋅ 1\n ⋅ ⋅ ⋅\n\njulia> onecold(ans, [:a, :b, :c])\n3-element Vector{Symbol}:\n :b\n :a\n :b","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"Note that these operations returned OneHotVector and OneHotMatrix rather than Arrays. OneHotVectors behave like normal vectors but avoid any unnecessary cost compared to using an integer index directly. 
For example, multiplying a matrix with a one-hot vector simply slices out the relevant row of the matrix under the hood.","category":"page"},{"location":"reference/data/onehot/#Function-listing","page":"OneHotArrays.jl","title":"Function listing","text":"","category":"section"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"OneHotArrays.onehot\nOneHotArrays.onecold\nOneHotArrays.onehotbatch\nOneHotArrays.OneHotArray\nOneHotArrays.OneHotVector\nOneHotArrays.OneHotMatrix","category":"page"},{"location":"reference/data/onehot/#OneHotArrays.onehot","page":"OneHotArrays.jl","title":"OneHotArrays.onehot","text":"onehot(x, labels, [default])\n\nReturns a OneHotVector which is roughly a sparse representation of x .== labels.\n\nInstead of storing say Vector{Bool}, it stores the index of the first occurrence of x in labels. If x is not found in labels, then it either returns onehot(default, labels), or gives an error if no default is given.\n\nSee also onehotbatch to apply this to many xs, and onecold to reverse either of these, as well as to generalise argmax.\n\nExamples\n\njulia> β = onehot(:b, (:a, :b, :c))\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n 1\n ⋅\n\njulia> αβγ = (onehot(0, 0:2), β, onehot(:z, [:a, :b, :c], :c)) # uses default\n(Bool[1, 0, 0], Bool[0, 1, 0], Bool[0, 0, 1])\n\njulia> hcat(αβγ...) 
# preserves sparsity\n3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅\n ⋅ 1 ⋅\n ⋅ ⋅ 1\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.onecold","page":"OneHotArrays.jl","title":"OneHotArrays.onecold","text":"onecold(y::AbstractArray, labels = 1:size(y,1))\n\nRoughly the inverse operation of onehot or onehotbatch: This finds the index of the largest element of y, or each column of y, and looks them up in labels.\n\nIf labels are not specified, the default is integers 1:size(y,1) – the same operation as argmax(y, dims=1) but sometimes a different return type.\n\nExamples\n\njulia> onecold([false, true, false])\n2\n\njulia> onecold([0.3, 0.2, 0.5], (:a, :b, :c))\n:c\n\njulia> onecold([ 1 0 0 1 0 1 0 1 0 0 1\n 0 1 0 0 0 0 0 0 1 0 0\n 0 0 0 0 1 0 0 0 0 0 0\n 0 0 0 0 0 0 1 0 0 0 0\n 0 0 1 0 0 0 0 0 0 1 0 ], 'a':'e') |> String\n\"abeacadabea\"\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.onehotbatch","page":"OneHotArrays.jl","title":"OneHotArrays.onehotbatch","text":"onehotbatch(xs, labels, [default])\n\nReturns a OneHotMatrix where kth column of the matrix is onehot(xs[k], labels). This is a sparse matrix, which stores just a Vector{UInt32} containing the indices of the nonzero elements.\n\nIf one of the inputs in xs is not found in labels, that column is onehot(default, labels) if default is given, else an error.\n\nIf xs has more dimensions, N = ndims(xs) > 1, then the result is an AbstractArray{Bool, N+1} which is one-hot along the first dimension, i.e. result[:, k...] == onehot(xs[k...], labels).\n\nNote that xs can be any iterable, such as a string. 
And that using a tuple for labels will often speed up construction, certainly for less than 32 classes.\n\nExamples\n\njulia> oh = onehotbatch(\"abracadabra\", 'a':'e', 'e')\n5×11 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ 1 ⋅ 1 ⋅ 1 ⋅ ⋅ 1\n ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅\n\njulia> reshape(1:15, 3, 5) * oh # this matrix multiplication is done efficiently\n3×11 Matrix{Int64}:\n 1 4 13 1 7 1 10 1 4 13 1\n 2 5 14 2 8 2 11 2 5 14 2\n 3 6 15 3 9 3 12 3 6 15 3\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.OneHotArray","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotArray","text":"OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}\nOneHotArray(indices, L)\n\nA one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and sum(A, dims=1) == 1) stored as a compact N == M-1-dimensional array of indices.\n\nTypically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/onehot/#OneHotArrays.OneHotVector","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotVector","text":"OneHotVector{T} = OneHotArray{T, 0, 1, T}\nOneHotVector(indices, L)\n\nA one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/onehot/#OneHotArrays.OneHotMatrix","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotMatrix","text":"OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}\nOneHotMatrix(indices, L)\n\nA one-hot matrix (with L labels) typically constructed using onehotbatch. 
Stored efficiently as a vector of indices with type I and eltype T.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/training/zygote/#autodiff-zygote","page":"Gradients – Zygote.jl","title":"Automatic Differentiation using Zygote.jl","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Flux's gradient function uses Zygote by default, and also uses this function within train! to differentiate the model. Zygote has its own documentation, in particular listing some important limitations.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Flux also has support for Enzyme.jl, documented on its own page.","category":"page"},{"location":"reference/training/zygote/#Explicit-style","page":"Gradients – Zygote.jl","title":"Explicit style","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"The preferred way of using Zygote, and the only way of using most other AD packages, is to explicitly provide a function and its arguments.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Zygote.gradient(f, args...)\nZygote.withgradient(f, args...)\nZygote.jacobian(f, args...)\nZygote.withjacobian(f, args...)\nZygote.hessian\nZygote.hessian_reverse\nZygote.diaghessian\nZygote.pullback","category":"page"},{"location":"reference/training/zygote/#Zygote.gradient-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.gradient","text":"gradient(f, args...)\n\nReturns a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or the gradient. 
If no gradient is defined, ∂f/∂x will be nothing.\n\nf(args...) must be a real number, see jacobian for array output.\n\nSee also withgradient to keep the value f(args...), and pullback for value and back-propagator.\n\njulia> gradient(*, 2.0, 3.0, 5.0)\n(15.0, 10.0, 6.0)\n\njulia> gradient(x -> sum(abs2,x), [7.0, 11.0, 13.0])\n([14.0, 22.0, 26.0],)\n\njulia> gradient([7, 11], 0, 1) do x, y, d\n p = size(x, d)\n sum(x.^p .+ y)\n end\n([14.0, 22.0], 2.0, nothing)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.withgradient-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.withgradient","text":"withgradient(f, args...)\nwithgradient(f, ::Params)\n\nReturns both the value of the function and the gradient, as a named tuple.\n\njulia> y, ∇ = withgradient(/, 1, 2)\n(val = 0.5, grad = (0.5, -0.25))\n\njulia> ∇ == gradient(/, 1, 2)\ntrue\n\nAllows you to capture auxiliary outputs, in addition to the scalar used by gradient. To do this, f must return a Tuple or NamedTuple. Then it calculates grad = gradient(first∘f, args...) but returns the whole val = f(args...):\n\njulia> withgradient([1,2,4]) do x\n z = 1 ./ x\n sum(z), z # here z is an auxiliary output\n end\n(val = (1.75, [1.0, 0.5, 0.25]), grad = ([-1.0, -0.25, -0.0625],))\n\njulia> withgradient(3.0, 4.0) do x, y\n (div = x/y, mul = x*y)\n end\n(val = (div = 0.75, mul = 12.0), grad = (0.25, -0.1875))\n\nAlso supports implicit mode:\n\njulia> w = [3.0];\n\njulia> res = withgradient(() -> sum(abs2, w), Params([w]))\n(val = 9.0, grad = Grads(...))\n\njulia> res.grad[w]\n1-element Vector{Float64}:\n 6.0\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.jacobian-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.jacobian","text":"jacobian(f, args...) -> Tuple\n\nFor each array a ∈ args this returns a matrix with Ja[k,i] = ∂y[k]/∂a[i] where y = f(args...) is usually a vector. 
Arrays of higher dimension are treated like vec(a), or vec(y) for output.\n\nFor scalar x::Number ∈ args, the result is a vector Jx[k] = ∂y[k]/∂x, while for scalar y all results have just one row.\n\nWith any other argument type, no result is produced, even if gradient would work.\n\nThis reverse-mode Jacobian needs to evaluate the pullback once for each element of y. Doing so is usually only efficient when length(y) is small compared to length(a); otherwise forward mode is likely to be better.\n\nSee also withjacobian, hessian, hessian_reverse.\n\nExamples\n\njulia> jacobian(a -> 100*a[1:3].^2, 1:7)[1] # first index (rows) is output\n3×7 Matrix{Int64}:\n 200 0 0 0 0 0 0\n 0 400 0 0 0 0 0\n 0 0 600 0 0 0 0\n\njulia> jacobian((a,x) -> a.^2 .* x, [1,2,3], 1) # scalar argument has vector jacobian\n([2 0 0; 0 4 0; 0 0 6], [1, 4, 9])\n\njulia> jacobian((a,d) -> prod(a, dims=d), [1 2; 3 4; 5 6], 2)\n([2 0 … 0 0; 0 4 … 3 0; 0 0 … 0 5], [0, 0, 0])\n\nwarning: Warning\nFor arguments of any type except Number & AbstractArray, the result is nothing.\n\njulia> jacobian((a,s) -> a.^length(s), [1,2,3], \"str\")\n([3 0 0; 0 12 0; 0 0 27], nothing)\n\njulia> jacobian((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5))\n([4 4 4], nothing)\n\njulia> gradient((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5)) # gradient understands the tuple\n([4 4 4], (6, 1))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.withjacobian-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.withjacobian","text":"withjacobian(f, args...)\n\nReturns both the value f(args...) and the jacobian as a named tuple.\n\njulia> withjacobian(cumsum, [1,2,3])\n(val = [1, 3, 6], grad = ([1 0 0; 1 1 0; 1 1 1],))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.hessian","page":"Gradients – Zygote.jl","title":"Zygote.hessian","text":"hessian(f, x)\n\nConstruct the Hessian ∂²f/∂x², where x is a real number or an array, and f(x) is a real number. 
When x is an array, the result is a matrix H[i,j] = ∂²f/∂x[i]∂x[j], using linear indexing x[i] even if the argument is higher-dimensional.\n\nThis uses forward over reverse, ForwardDiff over Zygote, calling hessian_dual(f, x). See hessian_reverse for an all-Zygote alternative.\n\nSee also diaghessian to compute only the diagonal part.\n\nExamples\n\njulia> hessian(x -> x[1]*x[2], randn(2))\n2×2 Matrix{Float64}:\n 0.0 1.0\n 1.0 0.0\n\njulia> hessian(x -> sum(x.^3), [1 2; 3 4]) # uses linear indexing of x\n4×4 Matrix{Int64}:\n 6 0 0 0\n 0 18 0 0\n 0 0 12 0\n 0 0 0 24\n\njulia> hessian(sin, pi/2)\n-1.0\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#Zygote.hessian_reverse","page":"Gradients – Zygote.jl","title":"Zygote.hessian_reverse","text":"hessian_reverse(f, x)\n\nThis should be equivalent to hessian(f, x), but implemented using reverse over reverse mode, all Zygote. (This is usually much slower, and more likely to find errors.)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#Zygote.diaghessian","page":"Gradients – Zygote.jl","title":"Zygote.diaghessian","text":"diaghessian(f, args...) -> Tuple\n\nDiagonal part of the Hessian. Returns a tuple containing, for each argument x, h of the same shape with h[i] = Hᵢᵢ = ∂²y/∂x[i]∂x[i]. The original evaluation y = f(args...) must give a real number y.\n\nFor one vector argument x, this is equivalent to (diag(hessian(f,x)),). Like hessian it uses ForwardDiff over Zygote. 
\n\nwarning: Warning\nFor arguments of any type except Number & AbstractArray, the result is nothing.\n\nExamples\n\njulia> diaghessian(x -> sum(x.^3), [1 2; 3 4])[1]\n2×2 Matrix{Int64}:\n 6 12\n 18 24\n\njulia> Diagonal(vec(ans)) == hessian(x -> sum(x.^3), [1 2; 3 4]) # full Hessian is diagonal\ntrue\n\njulia> diaghessian((x,y) -> sum(x .* y .* y'), [1 22; 333 4], [0.5, 0.666]) # two array arguments\n([0.0 0.0; 0.0 0.0], [2.0, 8.0])\n\njulia> diaghessian(atan, 1, 2) # two scalar arguments\n(-0.16, 0.16)\n\njulia> hessian(xy -> atan(xy[1], xy[2]), [1, 2]) # full Hessian is not diagonal\n2×2 Matrix{Float64}:\n -0.16 -0.12\n -0.12 0.16\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ZygoteRules.pullback","page":"Gradients – Zygote.jl","title":"ZygoteRules.pullback","text":"pullback(f, args...)\npullback(f, ::Params)\n\nReturns the value of the function f and a back-propagator function, which can be called to obtain a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or gradient.\n\ny, back = pullback(f, args...)\n∇ = back(seed)\n\nback must be called with a start value seed matching the output of f(args...). If f(args...) returns a number, seed should be a number. If f(args...) 
returns an array, seed should be an equally-sized array.\n\nSee also withgradient to obtain the value and gradients in one call, and gradient for obtaining just the gradients.\n\njulia> y, back = pullback(*, 2.0, 3.0, 5.0);\n\njulia> y\n30.0\n\njulia> back(1.0)\n(15.0, 10.0, 6.0)\n\njulia> back(2.0)\n(30.0, 20.0, 12.0)\n\njulia> y, back = pullback(x -> [x, x], 1.0);\n\njulia> y\n2-element Vector{Float64}:\n 1.0\n 1.0\n\njulia> back([1.0, 1.0])\n(2.0,)\n\njulia> back([2.0, nothing])\n(2.0,)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRules","page":"Gradients – Zygote.jl","title":"ChainRules","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Sometimes it is necessary to exclude some code, or a whole function, from automatic differentiation. This can be done using ChainRules:","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"ChainRulesCore.ignore_derivatives\nChainRulesCore.@non_differentiable","category":"page"},{"location":"reference/training/zygote/#ChainRulesCore.ignore_derivatives","page":"Gradients – Zygote.jl","title":"ChainRulesCore.ignore_derivatives","text":"ignore_derivatives(f::Function)\n\nTells the AD system to ignore the gradients of the wrapped closure. The primal computation (forward pass) is executed normally.\n\nignore_derivatives() do\n value = rand()\n push!(collection, value)\nend\n\nUsing this incorrectly could lead to incorrect gradients. For example, the following function will have zero gradients with respect to its argument:\n\nfunction wrong_grads(x)\n y = ones(3)\n ignore_derivatives() do\n push!(y, x)\n end\n return sum(y)\nend\n\n\n\n\n\nignore_derivatives(x)\n\nTells the AD system to ignore the gradients of the argument. 
Can be used to avoid unnecessary computation of gradients.\n\nignore_derivatives(x) * w\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.@non_differentiable","page":"Gradients – Zygote.jl","title":"ChainRulesCore.@non_differentiable","text":"@non_differentiable(signature_expression)\n\nA helper to make it easier to declare that a method is not differentiable. This is a short-hand for defining an frule and rrule that return NoTangent() for all partials (even for the function s̄elf-partial itself)\n\nKeyword arguments should not be included.\n\njulia> @non_differentiable Base.:(==)(a, b)\n\njulia> _, pullback = rrule(==, 2.0, 3.0);\n\njulia> pullback(1.0)\n(NoTangent(), NoTangent(), NoTangent())\n\nYou can place type-constraints in the signature:\n\njulia> @non_differentiable Base.length(xs::Union{Number, Array})\n\njulia> frule((ZeroTangent(), 1), length, [2.0, 3.0])\n(2, NoTangent())\n\nwarning: Warning\nThis helper macro covers only the simple common cases. It does not support where-clauses. For these you can declare the rrule and frule directly\n\n\n\n\n\n","category":"macro"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"To manually supply the gradient for one function, you should define a method of rrule. ChainRules has detailed documentation on how this works.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"ChainRulesCore.rrule\nChainRulesCore.frule\nChainRulesCore.@scalar_rule\nChainRulesCore.NoTangent\nChainRulesCore.ZeroTangent\nChainRulesCore.RuleConfig\nChainRulesCore.Tangent\nChainRulesCore.canonicalize","category":"page"},{"location":"reference/training/zygote/#ChainRulesCore.rrule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.rrule","text":"rrule([::RuleConfig,] f, x...)\n\nExpressing x as the tuple (x₁, x₂, ...) and the output tuple of f(x...) 
as Ω, return the tuple:\n\n(Ω, (Ω̄₁, Ω̄₂, ...) -> (s̄elf, x̄₁, x̄₂, ...))\n\nWhere the second return value is the propagation rule or pullback. It takes in cotangents corresponding to the outputs (x̄₁, x̄₂, ...), and s̄elf, the internal values of the function itself (for closures)\n\nIf no method matching rrule(f, xs...) has been defined, then return nothing.\n\nExamples:\n\nunary input, unary output scalar function:\n\njulia> x = rand();\n\njulia> sinx, sin_pullback = rrule(sin, x);\n\njulia> sinx == sin(x)\ntrue\n\njulia> sin_pullback(1) == (NoTangent(), cos(x))\ntrue\n\nbinary input, unary output scalar function:\n\njulia> x, y = rand(2);\n\njulia> hypotxy, hypot_pullback = rrule(hypot, x, y);\n\njulia> hypotxy == hypot(x, y)\ntrue\n\njulia> hypot_pullback(1) == (NoTangent(), (x / hypot(x, y)), (y / hypot(x, y)))\ntrue\n\nThe optional RuleConfig option allows specifying rrules only for AD systems that support given features. If not needed, then it can be omitted and the rrule without it will be hit as a fallback. This is the case for most rules.\n\nSee also: frule, @scalar_rule, RuleConfig\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.frule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.frule","text":"frule([::RuleConfig,] (Δf, Δx...), f, x...)\n\nExpressing the output of f(x...) as Ω, return the tuple:\n\n(Ω, ΔΩ)\n\nThe second return value is the tangent w.r.t. the output.\n\nIf no method matching frule((Δf, Δx...), f, x...) 
has been defined, then return nothing.\n\nExamples:\n\nunary input, unary output scalar function:\n\njulia> dself = NoTangent();\n\njulia> x = rand()\n0.8236475079774124\n\njulia> sinx, Δsinx = frule((dself, 1), sin, x)\n(0.7336293678134624, 0.6795498147167869)\n\njulia> sinx == sin(x)\ntrue\n\njulia> Δsinx == cos(x)\ntrue\n\nUnary input, binary output scalar function:\n\njulia> sincosx, Δsincosx = frule((dself, 1), sincos, x);\n\njulia> sincosx == sincos(x)\ntrue\n\njulia> Δsincosx[1] == cos(x)\ntrue\n\njulia> Δsincosx[2] == -sin(x)\ntrue\n\nNote that technically speaking Julia does not have multiple output functions, just functions that return a single output that is iterable, like a Tuple. So this is actually a Tangent:\n\njulia> Δsincosx\nTangent{Tuple{Float64, Float64}}(0.6795498147167869, -0.7336293678134624)\n\nThe optional RuleConfig option allows specifying frules only for AD systems that support given features. If not needed, then it can be omitted and the frule without it will be hit as a fallback. This is the case for most rules.\n\nSee also: rrule, @scalar_rule, RuleConfig\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.@scalar_rule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.@scalar_rule","text":"@scalar_rule(f(x₁, x₂, ...),\n @setup(statement₁, statement₂, ...),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nA convenience macro that generates simple scalar forward or reverse rules using the provided partial derivatives. 
Specifically, generates the corresponding methods for frule and rrule:\n\nfunction ChainRulesCore.frule((NoTangent(), Δx₁, Δx₂, ...), ::typeof(f), x₁::Number, x₂::Number, ...)\n Ω = f(x₁, x₂, ...)\n $(statement₁, statement₂, ...)\n return Ω, (\n (∂f₁_∂x₁ * Δx₁ + ∂f₁_∂x₂ * Δx₂ + ...),\n (∂f₂_∂x₁ * Δx₁ + ∂f₂_∂x₂ * Δx₂ + ...),\n ...\n )\nend\n\nfunction ChainRulesCore.rrule(::typeof(f), x₁::Number, x₂::Number, ...)\n Ω = f(x₁, x₂, ...)\n $(statement₁, statement₂, ...)\n return Ω, ((ΔΩ₁, ΔΩ₂, ...)) -> (\n NoTangent(),\n (∂f₁_∂x₁ * ΔΩ₁ + ∂f₂_∂x₁ * ΔΩ₂ + ...),\n (∂f₁_∂x₂ * ΔΩ₁ + ∂f₂_∂x₂ * ΔΩ₂ + ...),\n ...\n )\nend\n\nIf no type constraints in f(x₁, x₂, ...) within the call to @scalar_rule are provided, each parameter in the resulting frule/rrule definition is given a type constraint of Number. Constraints may also be explicitly provided to override the Number constraint, e.g. f(x₁::Complex, x₂), which will constrain x₁ to Complex and x₂ to Number.\n\nAt present this does not support defining for closures/functors. Thus in reverse-mode, the first returned partial, representing the derivative with respect to the function itself, is always NoTangent(). And in forward-mode, the first input to the returned propagator is always ignored.\n\nThe result of f(x₁, x₂, ...) is automatically bound to Ω. This allows the primal result to be conveniently referenced (as Ω) within the derivative/setup expressions.\n\nThis macro assumes complex functions are holomorphic. In general, for non-holomorphic functions, the frule and rrule must be defined manually.\n\nIf the derivative is one, (e.g. for identity functions) true can be used as the most general multiplicative identity.\n\nThe @setup argument can be elided if no setup code is needed. 
In other words:\n\n@scalar_rule(f(x₁, x₂, ...),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nis equivalent to:\n\n@scalar_rule(f(x₁, x₂, ...),\n @setup(nothing),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nFor examples, see ChainRules' rulesets directory.\n\nSee also: frule, rrule.\n\n\n\n\n\n","category":"macro"},{"location":"reference/training/zygote/#ChainRulesCore.NoTangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.NoTangent","text":"NoTangent() <: AbstractZero\n\nThis tangent indicates that the derivative does not exist. It is the tangent type for primal types that are not differentiable, such as integers or booleans (when they are not being used to represent floating-point values). The only valid way to perturb such values is to not change them at all. As a consequence, NoTangent is functionally identical to ZeroTangent(), but it provides additional semantic information.\n\nAdding NoTangent() to a primal is generally wrong: gradient-based methods cannot be used to optimize over discrete variables. An optimization package making use of this might want to check for such a case.\n\nnote: Note\nThis does not indicate that the derivative is not implemented, but rather that mathematically it is not defined.\n\nThis mostly shows up as the derivative with respect to dimension, index, or size arguments.\n\nfunction rrule(::typeof(fill), x, len::Int)\n y = fill(x, len)\n fill_pullback(ȳ) = (NoTangent(), @thunk(sum(ȳ)), NoTangent())\n return y, fill_pullback\nend\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.ZeroTangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.ZeroTangent","text":"ZeroTangent() <: AbstractZero\n\nThe additive identity for tangents. This is basically the same as 0. 
A derivative of ZeroTangent() does not propagate through the primal function.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.RuleConfig","page":"Gradients – Zygote.jl","title":"ChainRulesCore.RuleConfig","text":"RuleConfig{T}\n\nThe configuration for what rules to use. T: traits. This should be a Union of all special traits needed for rules to be allowed to be defined for your AD. If nothing special this should be set to Union{}.\n\nAD authors should define a subtype of RuleConfig to use when calling frule/rrule.\n\nRule authors can dispatch on this config when defining rules. For example:\n\n# only define rrule for `pop!` on AD systems where mutation is supported.\nrrule(::RuleConfig{>:SupportsMutation}, ::typeof(pop!), ::Vector) = ...\n\n# this definition of map is for any AD that defines a forwards mode\nrrule(conf::RuleConfig{>:HasForwardsMode}, ::typeof(map), ::Vector) = ...\n\n# this definition of map is for any AD that only defines a reverse mode.\n# It is not as good as the rrule that can be used if the AD defines a forward-mode as well.\nrrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, ::typeof(map), ::Vector) = ...\n\nFor more details see rule configurations and calling back into AD.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.Tangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.Tangent","text":"Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent\n\nThis type represents the tangent for a struct/NamedTuple, or Tuple. P is the corresponding primal type that this is a tangent for.\n\nTangent{P} should have fields (technically properties), that match to a subset of the fields of the primal type; and each should be a tangent type matching to the primal type of that field. Fields of the P that are not present in the Tangent are treated as Zero.\n\nT is an implementation detail representing the backing data structure. 
For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by the user.\n\nFor Tangents of Tuples, iterate and getindex are overloaded to behave as they do for a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explicitly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal the canonicalize function is provided.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.canonicalize","page":"Gradients – Zygote.jl","title":"ChainRulesCore.canonicalize","text":"canonicalize(tangent::Tangent{P}) -> Tangent{P}\n\nReturn the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explicitly set to ZeroTangent().\n\n\n\n\n\n","category":"function"},{"location":"guide/models/basics/#man-basics","page":"Gradients and Layers","title":"How Flux Works: Parameters, Gradients, and Layers","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"A neural network is a function with parameters. That is, it takes some input x and gives you some output y, whose value also depends on some other numbers θ.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"A sufficiently flexible function can, by adjusting the parameters just right, be made to do many things. 
And the one magic trick for adjusting parameters is to follow a gradient.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This page describes Flux's take on how to construct such flexible functions containing many parameters, and how to handle their gradients.","category":"page"},{"location":"guide/models/basics/#Parameterised-Functions","page":"Gradients and Layers","title":"Parameterised Functions","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Let's start with very simple functions. This is a polynomial in x::Real, returning another real number y which depends on some coefficients stored in a vector:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"θ = [10, 1, 0.1]\n\npoly1(x::Real) = θ[1] + θ[2]*x + θ[3]*x^2\n\npoly1(5) == 17.5 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Here the parameters are a global variable θ. They could be handled in other ways, for instance by explicitly passing them as an additional argument to the function:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"poly2(x::Real, θ2) = evalpoly(x, θ2) # built-in, from Base.Math\n\npoly2(5, θ) == 17.5 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux chooses a third path, by encapsulating the parameters within the function. 
The simplest way to do this is a closure, an anonymous function which Julia knows to depend on some local variable θ3:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"poly3 = let θ3 = [10, 1, 0.1]\n x -> evalpoly(x, θ3)\nend\n\npoly3(5) == 17.5 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"An equivalent, but tidier, way is to construct a struct in which to store the parameters. Any struct can be made callable, allowing its instances to act just like a function:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"struct Poly3{T} # container struct\n θ3::T\nend\n(p::Poly3)(x::Real) = evalpoly(x, p.θ3) # make this callable\n\npoly3s = Poly3([10, 1, 0.1]) # construct an instance\n\npoly3s(5) == 17.5 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Internally, there is little difference between a closure and a struct. They have the same fields, and equivalent methods:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"poly3s.θ3 == poly3.θ3 == θ # both have a field called :θ3\ndump(poly3) # contains θ3: Array\ndump(poly3s)\nmethods(poly3)\nmethods(poly3s) # each has 1 method, taking x::Real","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The virtue of encapsulation is that it makes composition very easy. We can make more complicated functions by combining simple ones, and each will keep track of its own parameters. 
Julia writes function composition as ∘, for instance (inv ∘ sin)(pi/6) ≈ 2, and we can use exactly this for our parameterised polynomials:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"poly4 = Poly3([1, 0.5, 0]) ∘ Poly3([10, 1, 0.1])\n\npoly4 isa ComposedFunction # ∘ creates another struct...\npoly4.inner.θ3 == θ # which has fields :inner & :outer\n\npoly4(5) == 9.75 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux models are precisely made by such function composition. In fact, poly3 and poly4 are already valid Flux models.","category":"page"},{"location":"guide/models/basics/#man-taking-gradients","page":"Gradients and Layers","title":"Structural Gradients","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The derivative of a scalar function is its slope: how fast the output changes as the input is changed slightly. 
This may be found approximately by evaluating at two nearby points, and exactly by taking the limit in which the distance between them approaches zero:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> (poly1(5 + 0.1) - poly1(5)) / 0.1\n2.010000000000005\n\njulia> (poly1(5 + 0.001) - poly1(5)) / 0.001 # answer is getting close to 2\n2.000100000003613","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux's gradient(f, x) works this out for f(x), and gives exactly ∂f/∂x = 2.0 here:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> using Flux\n\njulia> gradient(poly1, 5)\n(2.0,)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The reason gradient returns a tuple, not just the number 2.0, is to allow for functions taking several arguments. (That's also why it's not called \"derivative\".) For instance, this returns ∂f/∂x, ∂f/∂y, ∂f/∂z:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> gradient((x,y,z) -> (x*y)+z, 30, 40, 50)\n(40.0, 30.0, 1.0)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"For our parameterised polynomial, we have ∂f/∂x but we are really more interested in ∂f/∂θ, as this will tell us about how the parameters are affecting the answer. It is not impossible to track gradients with respect to global θ, but much clearer to track explicit arguments. 
Here's how this works for poly2 (which takes θ as a 2nd argument) and poly3 (which encapsulates θ):","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> grad2 = gradient(poly2, 5, θ)\n(2.0, [1.0, 5.0, 25.0])\n\njulia> grad3 = gradient((x,p) -> p(x), 5, poly3s)\n(2.0, (θ3 = [1.0, 5.0, 25.0],))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The first entry is ∂f/∂x as before, but the second entry is more interesting. For poly2, we get ∂f/∂θ as grad2[2] directly. It is a vector, because θ is a vector, and has elements [∂f/∂θ[1], ∂f/∂θ[2], ∂f/∂θ[3]].","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"For poly3, however, we get a NamedTuple whose fields correspond to those of the struct Poly3. This is called a structural gradient. And the nice thing about them is that they work for arbitrarily complicated structures, for instance:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> grad4 = gradient(|>, 5, poly4)\n(1.0, (outer = (θ3 = [1.0, 17.5, 306.25],), inner = (θ3 = [0.5, 2.5, 12.5],)))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Here grad4.inner.θ3 corresponds to poly4.inner.θ3. 
These matching nested structures are at the core of how Flux works.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"note: Implicit gradients\nEarlier versions of Flux used a different way to relate parameters and gradients, which looks like this:\n\ng1 = gradient(() -> poly1(5), Params([θ]))\ng1[θ] == [1.0, 5.0, 25.0]\n\nHere Params is a set of references to global variables using objectid, and g1 isa Grads is a dictionary from these to their gradients. This method of gradient takes a zero-argument function, which only implicitly depends on θ.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"

 Zygote.jl

","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux's gradient function by default calls a companion packages called Zygote. Zygote performs source-to-source automatic differentiation, meaning that gradient(f, x) hooks into Julia's compiler to find out what operations f contains, and transforms this to produce code for computing ∂f/∂x.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Zygote can in principle differentiate almost any Julia code. However, it's not perfect, and you may eventually want to read its page about limitations. In particular, a major limitation is that mutating an array is not allowed.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux can also be used with other automatic differentiation (AD) packages. It was originally written using Tracker, a more traditional operator-overloading approach. The future might be Enzyme, and Flux now builds in an easy way to use this instead, turned on by wrapping the model in Duplicated. (For details, see the Enzyme page in the manual.)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> using Enzyme: Const, Duplicated\n\njulia> grad3e = Flux.gradient((x,p) -> p(x), Const(5.0), Duplicated(poly3s))\n(nothing, (θ3 = [1.0, 5.0, 25.0],))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux.gradient follows Zygote's convention that arguments with no derivative are marked nothing. Here, this is because Const(5.0) is explicitly constant. Below, we will see an example where nothing shows up because the model struct has fields containing things other than parameters, such as an activation function. 
(It also adopts the convention that gradient(f, x, y) returns a tuple (∂f/∂x, ∂f/∂y), without a \"∂f/∂f\" term for the function. This is why we had to write gradient(|>, 5, poly4) above, not just gradient(poly4, 5).)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Finally, the function withgradient works the same way, but also returns the value of the function:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> Flux.withgradient((x,p) -> p(x), 5.0, poly3s)\n(val = 17.5, grad = (2.0, (θ3 = [1.0, 5.0, 25.0],)))","category":"page"},{"location":"guide/models/basics/#Simple-Neural-Networks","page":"Gradients and Layers","title":"Simple Neural Networks","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The polynomial functions above send a number x to another number y. Neural networks typically take a vector of numbers, mix them all up, and return another vector. Here's a very simple one, which will take a vector like x = [1.0, 2.0, 3.0] and return another vector y = layer1(x) with length(y) == 2:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"W = randn(2, 3)\nb = zeros(2)\n\nsigmoid(x::Real) = 1 / (1 + exp(-x))\nlayer1(x) = sigmoid.(W*x .+ b)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Here sigmoid is a nonlinear function, applied element-wise because it is called with .(), a syntax known as broadcasting.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Like poly1 above, this layer1 has as its parameters the global variables W, b. 
We can similarly define a version which takes these as arguments (like poly2), and a version which encapsulates them (like poly3 above):","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"layer2(x, W2, b2) = sigmoid.(W2*x .+ b2) # explicit parameter arguments\n\nlayer3 = let\n W3 = randn(2, 3)\n b3 = zeros(2)\n x -> sigmoid.(W3*x .+ b3) # closure over local variables\nend\n\nlayer3([1.0, 2.0, 3.0]) isa Vector # check that it runs","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This third way is precisely a Flux model. And we can again make a tidier version using a struct to hold the parameters:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"struct Layer # container struct\n W::Matrix\n b::Vector\n act::Function\nend\n\n(d::Layer)(x) = d.act.(d.W*x .+ d.b) # make it callable\n\nLayer(in::Int, out::Int, act::Function=sigmoid) =\n Layer(randn(Float32, out, in), zeros(Float32, out), act)\n\nlayer3s = Layer(3, 2) # instance with its own parameters","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The one new thing here is a friendly constructor Layer(in, out, act). 
This is because we anticipate composing several instances of this thing, with independent parameter arrays, of different sizes and different random initial parameters.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"x = Float32[0.1, 0.2, 0.3] # input\n\nlayer3s(x) # output, 2-element Vector{Float32}\n\nFlux.gradient((x,d) -> d(x)[1], x, layer3s)[2] # NamedTuple{(:W, :b, :act)}","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This ∂f/∂layer3s is a named tuple with the same fields as Layer. Within it, the gradient with respect to W is a matrix of seemingly random numbers. Notice that there is also an entry for act, which is nothing, as this field of the struct is not a smoothly adjustable parameter.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"We can compose these layers just as we did the polynomials above. Here's a composition of 3, in which the last step is the function only which takes a 1-element vector and gives us the number inside:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"model1 = only ∘ Layer(20, 1) ∘ Layer(1, 20)\n\ny = model1(Float32[0.1]) # output is a Float32 number\n\ngrad = Flux.gradient(|>, [1f0], model1)[2]","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This gradient is starting to be a complicated nested structure. But it works just like before: grad.outer.inner.W corresponds to model1.outer.inner.W.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"We don't have to use ∘ (which makes a ComposedFunction struct) to combine layers. 
Instead, we could define our own container struct, or use a closure:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"model2 = let\n lay1 = Layer(1, 20) # local variables containing layers\n lay2 = Layer(20, 1)\n function fwd(x) # equivalent to x -> only(lay2(lay1(x)))\n mid = lay1(x)\n lay2(mid) |> only\n end\nend\n\nmodel2(Float32[0.1])\n\nFlux.gradient(|>, [1f0], model2)[2]","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"

 Flux's layers

","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Rather than define everything from scratch every time, Flux provides a library of commonly used layers. The same model could be defined:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"model3 = Chain(Dense(1 => 20, σ), Dense(20 => 1), only)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"How does this model3 differ from the model1 we had before?","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux's Chain works left-to-right, the reverse of Base's ∘. Its contents is stored in a tuple, thus model3.layers[1].weight is an array.\nFlux's layer Dense has only minor differences:\nLike struct Poly3{T} above, it has type parameters for its fields – the compiler does not know exactly what type layer3s.W will be, which costs speed.\nIts initialisation uses not randn (normal distribution) but glorot_uniform by default.\nIt reshapes some inputs (to allow several batch dimensions), and produces more friendly errors on wrong-size input.\nAnd it has some performance tricks: making sure element types match, and re-using some memory.\nThe function σ is calculated in a slightly better way, and has a rule telling Zygote how to differentiate it efficiently.\nFlux overloads Base.show so to give pretty printing at the REPL prompt. Calling Flux.@layer Layer will add this, and some other niceties.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"If what you need isn't covered by Flux's built-in layers, it's easy to write your own. 
There are more details later, but the steps are invariably those shown for struct Layer above:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Define a struct which will hold the parameters.\nMake it callable, to define how it uses them to transform the input x.\nDefine a constructor which initialises the parameters (if the default constructor doesn't do what you want).\nAnnotate with @layer to opt in to pretty printing, and other enhancements.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"

 Functors.jl

","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"To deal with such nested structures, Flux relies heavily on an associated package called Functors. Its basic function is fmap, which generalises map(f, x) to work on almost anything.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"For example, this is how gpu moves all arrays within a model to the GPU, reconstructing another only ∘ Layer(...) ∘ Layer(...) (or a Chain etc.) around the new CuArrays:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using CUDA, Functors\nfmap(cu, model1)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"And this is a very simple gradient update of the parameters, walking over model and grad simultaneously:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"fmap((x, dx) -> x isa Array ? (x - dx/100) : x, model, grad)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"note: Note\nBefore Flux v0.15 (and Functors v0.5), this exploration of structs was opt-in. After defining struct Layer it was necessary to call @functor Layer (or @layer Layer) before Flux would look inside. This has now changed to be opt-out: Functors (and hence Flux) will explore arbitrary structs, unless told not to (using Functors.@leaf). 
This is why even \"anonymous structs\" created by closures like poly3 and layer3 above are now valid Flux models, although the use of named structs is still recommended practice.","category":"page"},{"location":"guide/models/basics/#Curve-Fitting","page":"Gradients and Layers","title":"Curve Fitting","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Above we took gradients of the output, or sometimes of the first element of the output – it must be a number, not a vector. Adjusting the parameters to make this smaller won't lead us anywhere interesting. Instead, we should minimise some loss function which compares the actual output to our desired output.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Perhaps the simplest example is curve fitting. The previous page fitted a linear model to data. With our two-layer model, we can fit a nonlinear function. For example, let us use f(x) = 2x - x^3 evaluated at some points x in -2:0.1:2 as the data, and adjust the parameters of model3 from above so that its output is similar.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"data = [([x], 2x-x^3) for x in -2:0.1f0:2] # training points (x, y)\n\nfor _ in 1:1000 # adjust parameters to minimise the error:\n Flux.train!((m,x,y) -> (m(x) - y)^2, model3, data, Descent(0.01))\nend","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The same code will also work with model1 or model2 instead. 
Here's how to plot the desired and actual outputs:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using Plots\nplot(x -> 2x-x^3, -2, 2, label=\"truth\")\nscatter!(x -> model3([x]), -2:0.1f0:2, label=\"fitted\")","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"More detail about what exactly the function train! is doing, and how to use rules other than simple Descent, is what the next page in this guide is about: training.","category":"page"},{"location":"guide/models/recurrence/#Recurrent-Models","page":"Recurrence","title":"Recurrent Models","text":"","category":"section"},{"location":"guide/models/recurrence/#Recurrent-cells","page":"Recurrence","title":"Recurrent cells","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"To introduce Flux's recurrence functionalities, we will consider the following vanilla recurrent neural network structure:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"(Image: )","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the above, we have a sequence of length 3, where x1 to x3 represent the input at each step. It could be a timestamp or a word in a sentence encoded as vectors. y1 to y3 are their respective outputs.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"An aspect to recognise is that in such a model, the recurrent cells A all refer to the same structure. 
What distinguishes it from a simple dense layer is that the cell A is fed, in addition to an input x, with information from the previous state of the model (hidden state denoted as h1 & h2 in the diagram).","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the most basic RNN case, cell A could be defined by the following: ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"output_size = 5\ninput_size = 2\nWxh = randn(Float32, output_size, input_size)\nWhh = randn(Float32, output_size, output_size)\nb = zeros(Float32, output_size)\n\nfunction rnn_cell(x, h)\n h = tanh.(Wxh * x .+ Whh * h .+ b)\n return h\nend\n\nseq_len = 3\n# dummy input data\nx = [rand(Float32, input_size) for i = 1:seq_len] \n# initial hidden state (all zeros)\nh0 = zeros(Float32, output_size) \n\ny = []\nht = h0\nfor xt in x\n ht = rnn_cell(xt, ht)\n y = [y; [ht]] # concatenate in non-mutating (AD friendly) way\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Notice how the above is essentially a Dense layer that acts on two inputs, xt and ht.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The output at each time step, called the hidden state, is used as the input to the next time step and is also the output of the model. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"There are various recurrent cells available in Flux, notably RNNCell, LSTMCell and GRUCell, which are documented in the layer reference. 
The hand-written example above can be replaced with:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"using Flux\n\noutput_size = 5\ninput_size = 2\nseq_len = 3\nx = [rand(Float32, input_size) for i = 1:seq_len] \nh0 = zeros(Float32, output_size) \n\nrnn_cell = Flux.RNNCell(input_size => output_size)\n\ny = []\nht = h0\nfor xt in x\n ht = rnn_cell(xt, ht)\n y = [y; [ht]]\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The entire output y or just the last output y[end] can be used for further processing, such as classification or regression. ","category":"page"},{"location":"guide/models/recurrence/#Using-a-cell-as-part-of-a-model","page":"Recurrence","title":"Using a cell as part of a model","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Let's consider a simple model that is trained to predict a scalar quantity for each time step in a sequence. The model will have a single RNN cell, followed by a dense layer to produce the output. Since the RNNCell can deal with batches of data, we can define the model to accept an input where at each time step, the input is a matrix of size (input_size, batch_size). 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"struct RecurrentCellModel{H,C,D}\n h0::H\n cell::C\n dense::D\nend\n\n# we choose to not train the initial hidden state\nFlux.@layer RecurrentCellModel trainable=(cell,dense) \n\nfunction RecurrentCellModel(input_size::Int, hidden_size::Int)\n return RecurrentCellModel(\n zeros(Float32, hidden_size), \n RNNCell(input_size => hidden_size),\n Dense(hidden_size => 1))\nend\n\nfunction (m::RecurrentCellModel)(x)\n z = []\n ht = m.h0\n for xt in x\n ht = m.cell(xt, ht)\n z = [z; [ht]]\n end\n z = stack(z, dims=2) # [hidden_size, seq_len, batch_size] or [hidden_size, seq_len]\n ŷ = m.dense(z) # [1, seq_len, batch_size] or [1, seq_len]\n return ŷ\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Notice that we stack the hidden states z to form a tensor of size (hidden_size, seq_len, batch_size). This can speedup the final classification, since we then process all the outputs at once with a single forward pass of the dense layer. 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Let's now define the training loop for this model:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"using Optimisers: AdamW\n\nfunction loss(model, x, y)\n ŷ = model(x)\n y = stack(y, dims=2)\n return Flux.mse(ŷ, y)\nend\n\n# create dummy data\nseq_len, batch_size, input_size = 3, 4, 2\nx = [rand(Float32, input_size, batch_size) for _ = 1:seq_len]\ny = [rand(Float32, 1, batch_size) for _ = 1:seq_len]\n\n# initialize the model and optimizer\nmodel = RecurrentCellModel(input_size, 5)\nopt_state = Flux.setup(AdamW(1e-3), model)\n\n# compute the gradient and update the model\ng = gradient(m -> loss(m, x, y),model)[1]\nFlux.update!(opt_state, model, g)","category":"page"},{"location":"guide/models/recurrence/#Handling-the-whole-sequence-at-once","page":"Recurrence","title":"Handling the whole sequence at once","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the above example, we processed the sequence one time step at a time using a recurrent cell. However, it is possible to process the entire sequence at once. This can be done by stacking the input data x to form a tensor of size (input_size, seq_len) or (input_size, seq_len, batch_size). One can then use the RNN, LSTM or GRU layers to process the entire input tensor. 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Let's consider the same example as above, but this time we use an RNN layer instead of an RNNCell:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"struct RecurrentModel{H,C,D}\n h0::H\n rnn::C\n dense::D\nend\n\nFlux.@layer RecurrentModel trainable=(rnn, dense)\n\nfunction RecurrentModel(input_size::Int, hidden_size::Int)\n return RecurrentModel(\n zeros(Float32, hidden_size), \n RNN(input_size => hidden_size),\n Dense(hidden_size => 1))\nend\n\nfunction (m::RecurrentModel)(x)\n z = m.rnn(x, m.h0) # [hidden_size, seq_len, batch_size] or [hidden_size, seq_len]\n ŷ = m.dense(z) # [1, seq_len, batch_size] or [1, seq_len]\n return ŷ\nend\n\nseq_len, batch_size, input_size = 3, 4, 2\nx = rand(Float32, input_size, seq_len, batch_size)\ny = rand(Float32, 1, seq_len, batch_size)\n\nmodel = RecurrentModel(input_size, 5)\nopt_state = Flux.setup(AdamW(1e-3), model)\n\ng = gradient(m -> Flux.mse(m(x), y), model)[1]\nFlux.update!(opt_state, model, g)","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/nnlib/#Neural-Network-primitives-from-NNlib.jl","page":"Low-level Operations – NNlib.jl","title":"Neural Network primitives from NNlib.jl","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux re-exports all of the functions exported by the NNlib package. This includes activation functions, described on their own page. 
Many of the functions on this page exist primarily as the internal implementation of Flux layers, but can also be used independently.","category":"page"},{"location":"reference/models/nnlib/#Attention","page":"Low-level Operations – NNlib.jl","title":"Attention","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Primitives for the MultiHeadAttention layer.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.dot_product_attention\nNNlib.dot_product_attention_scores\nNNlib.make_causal_mask","category":"page"},{"location":"reference/models/nnlib/#NNlib.dot_product_attention","page":"Low-level Operations – NNlib.jl","title":"NNlib.dot_product_attention","text":"dot_product_attention(query, key, value, [bias]; [fdrop, mask, nheads])\n\nMultihead dot product attention used in transformer architectures.\n\nThe input arrays must have the first two dimensions given by the number of features and the sequence length, then an arbitrary number of batch dimensions or none.\n\nReturns the attention output array of size (v_dim, q_len, batch_size...) and the attention scores of size (kv_len, q_len, nheads, batch_size...).\n\nSee also dot_product_attention_scores if you only need the attention scores.\n\nArguments\n\nquery: Query array of size (qk_dim, q_len, batch_size...).\nkey: Key array of size (qk_dim, kv_len, batch_size...).\nvalue: Value array of size (v_dim, kv_len, batch_size...).\nbias: Either nothing or an array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before applying the softmax. Default nothing.\nfdrop: A dropout function or layer to be applied on the attention scores right after the softmax. 
Default identity (no dropout).\nmask: Either nothing or a boolean array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See make_causal_mask for creating causal masks. Default nothing.\nnheads: Number of heads to split the input arrays into. Default 1.\n\nExamples\n\nq, k, v = rand(10, 20, 2), rand(10, 30, 2), rand(20, 30, 2)\ny, α = dot_product_attention(q, k, v)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.dot_product_attention_scores","page":"Low-level Operations – NNlib.jl","title":"NNlib.dot_product_attention_scores","text":"dot_product_attention_scores(query, key, [bias]; [fdrop, mask])\n\nReturn the attention scores for the dot_product_attention. Input arrays must have dimensions (num_features ÷ nheads, nheads, sequence_length, batch_size).\n\nSee dot_product_attention for more details.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.make_causal_mask","page":"Low-level Operations – NNlib.jl","title":"NNlib.make_causal_mask","text":"make_causal_mask(x, dims=2)\n\nReturn a boolean square matrix m of the same type as x and of side size(x, dims). 
Its elements are set such that m[i, j] == i ≤ j.\n\nCan be used to mask the attention scores in dot_product_attention.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Softmax","page":"Low-level Operations – NNlib.jl","title":"Softmax","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Flux.logitcrossentropy uses NNlib.logsoftmax internally.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"softmax\nlogsoftmax","category":"page"},{"location":"reference/models/nnlib/#NNlib.softmax","page":"Low-level Operations – NNlib.jl","title":"NNlib.softmax","text":"softmax(x; dims = 1)\n\nSoftmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:\n\nsoftmax(x; dims = 1) = exp.(x) ./ sum(exp.(x), dims = dims)\n\nwith additional manipulations enhancing numerical stability.\n\nFor a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.\n\nSee also logsoftmax.\n\nExamples\n\njulia> softmax([1, 2, 3])\n3-element Vector{Float64}:\n 0.09003057317038046\n 0.24472847105479764\n 0.6652409557748218\n\njulia> softmax([1 2 3; 2 2 2]) # dims=1\n2×3 Matrix{Float64}:\n 0.268941 0.5 0.731059\n 0.731059 0.5 0.268941\n\njulia> softmax([1 2 3; 2 2 2]; dims=2)\n2×3 Matrix{Float64}:\n 0.0900306 0.244728 0.665241\n 0.333333 0.333333 0.333333\n\nNote that, when used with Flux.jl, softmax must not be passed to layers like Dense which accept an activation function. The activation is broadcasted over the result, thus applies to individual numbers. 
But softmax always needs to see the whole column.\n\njulia> using Flux\n\njulia> x = randn(Float32, 4, 4, 3, 13);\n\njulia> model = Chain(Conv((4, 4), 3 => 8, tanh), Flux.flatten, Dense(8 => 7), softmax);\n\njulia> model(x) |> size\n(7, 13)\n\njulia> Dense(4 => 7, softmax)(x)\nERROR: `softmax(x)` called with a number, but it expects an array. \n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.logsoftmax","page":"Low-level Operations – NNlib.jl","title":"NNlib.logsoftmax","text":"logsoftmax(x; dims = 1)\n\nComputes the log of softmax in a more numerically stable way than directly taking log.(softmax(xs)). Commonly used in computing cross entropy loss.\n\nIt is semantically equivalent to the following:\n\nlogsoftmax(x; dims = 1) = x .- log.(sum(exp.(x), dims = dims))\n\nSee also softmax.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Pooling","page":"Low-level Operations – NNlib.jl","title":"Pooling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, and MeanPool use NNlib.PoolDims, NNlib.maxpool, and NNlib.meanpool as their backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.PoolDims\nNNlib.lpnormpool\nNNlib.maxpool\nNNlib.meanpool","category":"page"},{"location":"reference/models/nnlib/#NNlib.PoolDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.PoolDims","text":"PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};\n stride=k, padding=0, dilation=1) where {M, L}\n\nDimensions for a \"pooling\" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. 
Used to dispatch onto efficient implementations at compile-time.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.lpnormpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.lpnormpool","text":"lpnormpool(x, p::Real, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform Lp pool operation with value of the Lp norm p and window size k on input tensor x, also known as LPPool in PyTorch. This pooling operator is from Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\np is restricted to 0 < p < Inf.\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. Default is k.\n\nFor all elements x in a size k window, lpnormpool computes (∑ᵢ xᵢ^p)^(1 / p) as an element of the output.\n\nThus lpnormpool(x, 1, k) ./ prod(k) ≈ meanpool(x, k) and lpnormpool(x, 2, k).^2 ./ prod(k) ≈ meanpool(x.^2, k).\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.maxpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.maxpool","text":"maxpool(x, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform max pool operation with window size k on input tensor x.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. Default is k.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.meanpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.meanpool","text":"meanpool(x, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform mean pool operation with window size k on input tensor x.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. 
Default is k.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Padding","page":"Low-level Operations – NNlib.jl","title":"Padding","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.pad_circular\nNNlib.pad_constant\nNNlib.pad_reflect\nNNlib.pad_repeat\nNNlib.pad_symmetric\nNNlib.pad_zeros","category":"page"},{"location":"reference/models/nnlib/#NNlib.pad_circular","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_circular","text":"pad_circular(x, pad::Tuple; [dims])\npad_circular(x, pad::Int; [dims])\n\nPad the array x \"circularly\" across the border by wrapping around values from the opposite side of x. \n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nThe pad length on either side in any dimension must not exceed the size of x in that dimension, i.e. 
pad_circular is not able to create arbitrarily sized tilings of x.\n\nSee also pad_repeat, pad_reflect, pad_symmetric, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_circular(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 9 3 6 9 3 6\n 7 1 4 7 1 4\n 8 2 5 8 2 5\n 9 3 6 9 3 6\n 7 1 4 7 1 4\n 8 2 5 8 2 5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_constant","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_constant","text":"pad_constant(x, pad::Tuple, val = 0; [dims = :])\npad_constant(x, pad::Int, val = 0; [dims = :])\n\nPad the array x with the constant value val.\n\npad can be a tuple of integers. If it has length 2 * length(dims), it specifies the left and right padding size for each of the dimensions in dims as (l1, r1, ..., ln, rn). If supplied with a tuple of length length(dims) instead, it applies symmetric padding. If dims is not given, it defaults to all dimensions.\n\nFor integer pad input, it is applied on both sides on every dimension in dims.\n\nSee also pad_zeros, pad_repeat, pad_reflect, pad_symmetric, and pad_circular.\n\njulia> r = reshape(1:4, 2, 2)\n2×2 reshape(::UnitRange{Int64}, 2, 2) with eltype Int64:\n 1 3\n 2 4\n\njulia> pad_constant(r, (1, 2, 3, 4), 8)\n5×9 Matrix{Int64}:\n 8 8 8 8 8 8 8 8 8\n 8 8 8 1 3 8 8 8 8\n 8 8 8 2 4 8 8 8 8\n 8 8 8 8 8 8 8 8 8\n 8 8 8 8 8 8 8 8 8\n\njulia> pad_constant(r, 1, 8)\n4×4 Matrix{Int64}:\n 8 8 8 8\n 8 1 3 8\n 8 2 4 8\n 8 8 8 8\n\njulia> r = reshape(1:27, 3, 3, 3)\n3×3×3 reshape(::UnitRange{Int64}, 3, 3, 3) with eltype Int64:\n[:, :, 1] =\n 1 4 7\n 2 5 8\n 3 6 9\n\n[:, :, 2] =\n 10 13 16\n 11 14 17\n 12 15 18\n\n[:, :, 3] =\n 19 22 25\n 20 23 26\n 21 24 27\n\njulia> pad_constant(r, (2,1), dims = 1) # asymmetric padding\n6×3×3 Array{Int64, 3}:\n[:, :, 1] =\n 0 0 0\n 0 0 0\n 1 4 7\n 2 5 8\n 3 6 9\n 0 0 0\n\n[:, :, 2] =\n 0 0 0\n 0 0 0\n 10 13 16\n 11 14 17\n 12 15 18\n 
 0 0 0\n\n[:, :, 3] =\n 0 0 0\n 0 0 0\n 19 22 25\n 20 23 26\n 21 24 27\n 0 0 0\n\njulia> pad_constant(r, (2,1, 3), dims = (1,2)) # padding must always be either the same length as dims, or double it\nERROR: ArgumentError: Could not parse padding (2, 1, 3) and dims (1, 2)\nStacktrace:\n[...]\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_reflect","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_reflect","text":"pad_reflect(x, pad::Tuple; [dims])\npad_reflect(x, pad::Int; [dims])\n\nPad the array x reflecting its values across the border.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nSee also pad_repeat, pad_symmetric, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_reflect(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 5 2 5 8 5 2\n 4 1 4 7 4 1\n 5 2 5 8 5 2\n 6 3 6 9 6 3\n 5 2 5 8 5 2\n 4 1 4 7 4 1\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_repeat","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_repeat","text":"pad_repeat(x, pad::Tuple; [dims])\npad_repeat(x, pad::Int; [dims])\n\nPad the array x repeating the values on the border.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. 
excludes the channel and batch dimension). \n\nSee also pad_reflect, pad_symmetric, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_repeat(r, (1,2,3,4))\n6×10 Matrix{Int64}:\n 1 1 1 1 4 7 7 7 7 7\n 1 1 1 1 4 7 7 7 7 7\n 2 2 2 2 5 8 8 8 8 8\n 3 3 3 3 6 9 9 9 9 9\n 3 3 3 3 6 9 9 9 9 9\n 3 3 3 3 6 9 9 9 9 9\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_symmetric","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_symmetric","text":"pad_symmetric(x, pad::Tuple; [dims])\npad_symmetric(x, pad::Int; [dims])\n\nPad the array x reflecting its values symmetrically across the border, i.e. the border values of x are present in the padding values, in contrast to pad_reflect.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nSee also pad_repeat, pad_reflect, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_symmetric(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 1 1 4 7 7 4\n 1 1 4 7 7 4\n 2 2 5 8 8 5\n 3 3 6 9 9 6\n 3 3 6 9 9 6\n 2 2 5 8 8 5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_zeros","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_zeros","text":"pad_zeros(x, pad::Tuple; [dims])\npad_zeros(x, pad::Int; [dims])\n\nPad the array x with zeros. Equivalent to pad_constant with the constant equal to 0. 
\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Convolution","page":"Low-level Operations – NNlib.jl","title":"Convolution","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally. ","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"conv\nConvDims\ndepthwiseconv\nDepthwiseConvDims\nDenseConvDims","category":"page"},{"location":"reference/models/nnlib/#NNlib.conv","page":"Low-level Operations – NNlib.jl","title":"NNlib.conv","text":"conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)\n\nApply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.ConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.ConvDims","text":"ConvDims\n\nType system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.depthwiseconv","page":"Low-level Operations – NNlib.jl","title":"NNlib.depthwiseconv","text":"depthwiseconv(x, w; stride=1, pad=0, dilation=1, flipped=false)\n\nDepthwise convolution operation with filter w on input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.DepthwiseConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.DepthwiseConvDims","text":"DepthwiseConvDims\n\nConcrete subclass of ConvDims for a depthwise convolution. 
Differs primarily due to characterization by C_in, C_mult, rather than C_in, C_out. Useful to be separate from DenseConvDims primarily for channel calculation differences.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.DenseConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.DenseConvDims","text":"DenseConvDims\n\nConcrete subclass of ConvDims for a normal, dense, conv2d/conv3d.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#Dropout","page":"Low-level Operations – NNlib.jl","title":"Dropout","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.dropout\nNNlib.dropout!","category":"page"},{"location":"reference/models/nnlib/#NNlib.dropout","page":"Low-level Operations – NNlib.jl","title":"NNlib.dropout","text":"dropout([rng], A, p; [dims])\n\nReturns an array in which each element of A is either replaced with zero, with probability p, or else multiplied by 1/(1-p).\n\nBy default every element is treated independently. With keyword dims=1, a choice is made for every value of the 1st index i.e. 
each row of a matrix is either zero or not.\n\nOptional first argument is the random number generator used.\n\nExamples\n\njulia> dropout(ones(2, 10), 0.2)\n2×10 Matrix{Float64}:\n 1.25 1.25 0.0 1.25 1.25 1.25 1.25 1.25 1.25 1.25\n 1.25 1.25 1.25 0.0 1.25 1.25 0.0 1.25 1.25 1.25\n\njulia> mean(dropout(ones(10^4, 5), 0.2), dims=1)\n1×5 Matrix{Float64}:\n 0.998 1.00075 0.99125 0.99575 1.00075\n\njulia> dropout(ones(5, 5), 0.7, dims=1) # whole row the same\n5×5 Matrix{Float64}:\n 3.33333 3.33333 3.33333 3.33333 3.33333\n 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 0.0\n 3.33333 3.33333 3.33333 3.33333 3.33333\n 0.0 0.0 0.0 0.0 0.0\n\njulia> mean(dropout(ones(10^4, 5), 0.3, dims=1), dims=1)\n1×5 Matrix{Float64}:\n 1.00571 1.00571 1.00571 1.00571 1.00571\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.dropout!","page":"Low-level Operations – NNlib.jl","title":"NNlib.dropout!","text":"dropout!(B, A, p; [dims])\n\nThis does exactly B .= dropout(A, p; dims), or rather, it's the implementation of out-of-place dropout.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Upsampling","page":"Low-level Operations – NNlib.jl","title":"Upsampling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. 
Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"upsample_nearest\nupsample_linear\n∇upsample_linear\nupsample_bilinear\n∇upsample_bilinear\nupsample_trilinear\n∇upsample_trilinear\npixel_shuffle","category":"page"},{"location":"reference/models/nnlib/#NNlib.upsample_nearest","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_nearest","text":"upsample_nearest(x, scale::NTuple{S,Int})\nupsample_nearest(x; size::NTuple{S,Int})\n\nUpsamples the array x by integer multiples along the first S dimensions. Subsequent dimensions of x are not altered.\n\nEither the scale factors or the final output size can be specified.\n\nSee also upsample_bilinear, for two dimensions of an N=4 array.\n\nExample\n\njulia> upsample_nearest([1 2 3; 4 5 6], (2, 3))\n4×9 Matrix{Int64}:\n 1 1 1 2 2 2 3 3 3\n 1 1 1 2 2 2 3 3 3\n 4 4 4 5 5 5 6 6 6\n 4 4 4 5 5 5 6 6 6\n\njulia> ans == upsample_nearest([1 2 3; 4 5 6]; size=(4, 9)) # equivalent\ntrue\n\njulia> upsample_nearest([1 2 3; 4 5 6], (2,))\n4×3 Matrix{Int64}:\n 1 2 3\n 1 2 3\n 4 5 6\n 4 5 6\n\njulia> ans == upsample_nearest([1 2 3; 4 5 6], size=(4,))\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_linear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_linear","text":"upsample_linear(x::AbstractArray{T,3}, scale::Real; align_corners::Bool = true)\nupsample_linear(x::AbstractArray{T,3}; size::Integer, align_corners::Bool = true)\n\nUpsamples the first dimension of the array x by the provided scale factor, using linear interpolation. 
As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_linear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_linear","text":"∇upsample_linear(Δ::AbstractArray{T,3}; size::Integer, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Size of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_bilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_bilinear","text":"upsample_bilinear(x::AbstractArray{T,4}, scale::NTuple{2,Real}; align_corners::Bool = true)\nupsample_bilinear(x::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true)\n\nUpsamples the first 2 dimensions of the array x by the upsample factors stored in scale, using bilinear interpolation. 
As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale[1]*S1, scale[2]*S2, S3, S4), where S1, S2, S3, S4 = size(x).\n\nExamples\n\njulia> x = reshape(Float32[1 2 3; 4 5 6], (2,3,1,1))\n2×3×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 2.0 3.0\n 4.0 5.0 6.0\n\njulia> upsample_bilinear(x, (2, 3))\n4×9×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 1.25 1.5 1.75 2.0 2.25 2.5 2.75 3.0\n 2.0 2.25 2.5 2.75 3.0 3.25 3.5 3.75 4.0\n 3.0 3.25 3.5 3.75 4.0 4.25 4.5 4.75 5.0\n 4.0 4.25 4.5 4.75 5.0 5.25 5.5 5.75 6.0\n\njulia> ans == upsample_bilinear(x; size=(4, 9)) # specify output size instead\ntrue\n\njulia> upsample_bilinear(x, (2.5, 3.5)) # non-integer scaling factors are allowed\n5×10×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 1.22222 1.44444 1.66667 1.88889 … 2.33333 2.55556 2.77778 3.0\n 1.75 1.97222 2.19444 2.41667 2.63889 3.08333 3.30556 3.52778 3.75\n 2.5 2.72222 2.94444 3.16667 3.38889 3.83333 4.05556 4.27778 4.5\n 3.25 3.47222 3.69444 3.91667 4.13889 4.58333 4.80556 5.02778 5.25\n 4.0 4.22222 4.44444 4.66667 4.88889 5.33333 5.55556 5.77778 6.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_bilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_bilinear","text":"∇upsample_bilinear(Δ::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Lateral (W,H) size of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_trilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_trilinear","text":"upsample_trilinear(x::AbstractArray{T,5}, scale::NTuple{3,Real}; align_corners::Bool = true)\nupsample_trilinear(x::AbstractArray{T,5}; size::NTuple{3,Integer}, 
align_corners::Bool = true)\n\nUpsamples the first 3 dimensions of the array x by the upsample factors stored in scale, using trilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale[1]*S1, scale[2]*S2, scale[3]*S3, S4, S5), where S1, S2, S3, S4, S5 = size(x).\n\nExamples\n\nupsample_trilinear(x, (2, 3, 4))\nupsample_trilinear(x; size=(4, 9, 11)) # specify output size instead\nupsample_trilinear(x, (2.5, 3.5, pi)) # non-integer scaling factors are allowed\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_trilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_trilinear","text":"∇upsample_trilinear(Δ::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Lateral size & depth (W,H,D) of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pixel_shuffle","page":"Low-level Operations – NNlib.jl","title":"NNlib.pixel_shuffle","text":"pixel_shuffle(x, r::Integer)\n\nPixel shuffling operation, upscaling by a factor r.\n\nFor 4-arrays representing N images, the operation converts input size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.\n\nUsed in super-resolution networks to upsample towards high resolution features. Reference: Shi et 
al., \"Real-Time Single Image and Video Super-Resolution ...\", CVPR 2016, https://arxiv.org/abs/1609.05158\n\nExamples\n\njulia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]\n2×3×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 11.1 12.1 13.1\n 21.1 22.1 23.1\n\n[:, :, 2, 1] =\n 11.2 12.2 13.2\n 21.2 22.2 23.2\n\n[:, :, 3, 1] =\n 11.3 12.3 13.3\n 21.3 22.3 23.3\n\n[:, :, 4, 1] =\n 11.4 12.4 13.4\n 21.4 22.4 23.4\n\njulia> pixel_shuffle(x, 2) # 4 channels used up as 2x upscaling of image dimensions\n4×6×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 11.1 11.3 12.1 12.3 13.1 13.3\n 11.2 11.4 12.2 12.4 13.2 13.4\n 21.1 21.3 22.1 22.3 23.1 23.3\n 21.2 21.4 22.2 22.4 23.2 23.4\n\njulia> y = [i + channel/10 for i in 1:3, channel in 1:6, batch in 1:1]\n3×6×1 Array{Float64, 3}:\n[:, :, 1] =\n 1.1 1.2 1.3 1.4 1.5 1.6\n 2.1 2.2 2.3 2.4 2.5 2.6\n 3.1 3.2 3.3 3.4 3.5 3.6\n\njulia> pixel_shuffle(y, 2) # 1D image, with 6 channels reduced to 3\n6×3×1 Array{Float64, 3}:\n[:, :, 1] =\n 1.1 1.3 1.5\n 1.2 1.4 1.6\n 2.1 2.3 2.5\n 2.2 2.4 2.6\n 3.1 3.3 3.5\n 3.2 3.4 3.6\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Batched-Operations","page":"Low-level Operations – NNlib.jl","title":"Batched Operations","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Flux.Bilinear layer uses NNlib.batched_mul internally.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"batched_mul\nbatched_mul!\nbatched_adjoint\nbatched_transpose\nbatched_vec","category":"page"},{"location":"reference/models/nnlib/#NNlib.batched_mul","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_mul","text":"batched_mul(A, B) -> C\nA ⊠ B # \\boxtimes\n\nBatched matrix multiplication. Result has C[:,:,k...] == A[:,:,k...] * B[:,:,k...] where k... 
represent any indices in the last dimensions.\n\nIf ndims(A) == ndims(B) == 3 and size(B,3) == 1 then instead C[:,:,k] == A[:,:,k] * B[:,:,1], and similarly for A.\n\nTo transpose each matrix, apply batched_transpose to the array, or batched_adjoint for conjugate-transpose:\n\njulia> A, B = randn(2,5,17), randn(5,9,17);\n\njulia> A ⊠ B |> size\n(2, 9, 17)\n\njulia> batched_adjoint(A) |> size\n(5, 2, 17)\n\njulia> batched_mul(A, batched_adjoint(randn(9,5,17))) |> size\n(2, 9, 17)\n\njulia> A ⊠ randn(5,9,1) |> size\n(2, 9, 17)\n\njulia> batched_transpose(A) == PermutedDimsArray(A, (2,1,3))\ntrue\n\nThe equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.\n\nHowever, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.\n\nBoth this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV[\"JULIA_DEBUG\"] = NNlib will display them.\n\n\n\n\n\nbatched_mul(A::Array{T,3}, B::Matrix)\nbatched_mul(A::Matrix, B::Array{T,3})\nA ⊠ B\n\nThis is always matrix-matrix multiplication, but either A or B may lack a batch index.\n\nWhen B is a matrix, result has C[:,:,k] == A[:,:,k] * B[:,:] for all k.\nWhen A is a matrix, then C[:,:,k] == A[:,:] * B[:,:,k]. 
This can also be done by reshaping and calling *, for instance A ⊡ B using TensorCore.jl, but is implemented here using batched_gemm instead of gemm.\n\njulia> randn(16,8,32) ⊠ randn(8,4) |> size\n(16, 4, 32)\n\njulia> randn(16,8,32) ⊠ randn(8,4,1) |> size # equivalent\n(16, 4, 32)\n\njulia> randn(16,8) ⊠ randn(8,4,32) |> size\n(16, 4, 32)\n\nSee also batched_vec to regard B as a batch of vectors, A[:,:,k] * B[:,k].\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_mul!","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_mul!","text":"batched_mul!(C, A, B) -> C\nbatched_mul!(C, A, B, α=1, β=0)\n\nIn-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.\n\nThis will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either stride(X,1)==1 or stride(X,2)==1, the latter may be caused by batched_transpose or by for instance PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul this will never make a copy.\n\nFor complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. 
In this case the strides accepted by BLAS are more restricted: if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_adjoint","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_adjoint","text":"batched_transpose(A::AbstractArray{T,3})\nbatched_adjoint(A)\n\nEquivalent to applying transpose or adjoint to each matrix A[:,:,k].\n\nThese exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.\n\nPermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).\n\nBatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}\nBatchedAdjoint{T, S}\n\nLazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_transpose","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_transpose","text":"batched_transpose(A::AbstractArray{T,3})\nbatched_adjoint(A)\n\nEquivalent to applying transpose or adjoint to each matrix A[:,:,k].\n\nThese exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.\n\nPermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).\n\nBatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}\nBatchedAdjoint{T, S}\n\nLazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_vec","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_vec","text":"batched_vec(A::Array{T,3}, B::Matrix)\nbatched_vec(A::Array{T,3}, b::Vector)\n\nBatched matrix-vector multiplication: the result has C[:,k] == A[:,:,k] * B[:,k] for all k, or else C[:,k] == 
A[:,:,k] * b for b::Vector.\n\nWith the same argument types, batched_mul(A, B) would regard B as a fixed matrix, not a batch of vectors. Both reshape and then call batched_mul(::Array{T,3}, ::Array{T,3}).\n\njulia> A, B, b = randn(16,8,32), randn(8,32), randn(8);\n\njulia> batched_vec(A,B) |> size\n(16, 32)\n\njulia> batched_vec(A,b) |> size\n(16, 32)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Gather-and-Scatter","page":"Low-level Operations – NNlib.jl","title":"Gather and Scatter","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Embedding layer uses NNlib.gather as its backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.gather\nNNlib.gather!\nNNlib.scatter\nNNlib.scatter!","category":"page"},{"location":"reference/models/nnlib/#NNlib.gather","page":"Low-level Operations – NNlib.jl","title":"NNlib.gather","text":"NNlib.gather(src, idx) -> dst\n\nReverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to\n\ndst[:, ... , k] .= src[:, ... , idx[k]...]\n\nNotice that if idx is a vector containing integers and src is a matrix, previous expression simplifies to\n\ndst[:, k] .= src[:, idx[k]]\n\nand k will run over 1:length(idx).\n\nThe elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.\n\nSee gather! 
for an in-place version.\n\nExamples\n\njulia> NNlib.gather([1,20,300,4000], [2,4,2])\n3-element Vector{Int64}:\n 20\n 4000\n 20\n\njulia> NNlib.gather([1 2 3; 4 5 6], [1,3,1,3,1])\n2×5 Matrix{Int64}:\n 1 3 1 3 1\n 4 6 4 6 4\n\n\n\n\n\ngather(src, IJK...)\n\nConvert the tuple of integer vectors IJK to a tuple of CartesianIndex and call gather on it: gather(src, CartesianIndex.(IJK...)).\n\nExamples\n\njulia> src = reshape([1:15;], 3, 5)\n3×5 Matrix{Int64}:\n 1 4 7 10 13\n 2 5 8 11 14\n 3 6 9 12 15\n\njulia> NNlib.gather(src, [1, 2], [2, 4])\n2-element Vector{Int64}:\n 4\n 11\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.gather!","page":"Low-level Operations – NNlib.jl","title":"NNlib.gather!","text":"NNlib.gather!(dst, src, idx)\n\nReverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to\n\ndst[:, ... , k] .= src[:, ... , idx[k]...]\n\nNotice that if idx is a vector containing integers, and both dst and src are matrices, previous expression simplifies to\n\ndst[:, k] .= src[:, idx[k]]\n\nand k will run over 1:length(idx).\n\nThe elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.\n\nSee gather for an allocating version.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.scatter","page":"Low-level Operations – NNlib.jl","title":"NNlib.scatter","text":"NNlib.scatter(op, src, idx; [init, dstsize])\n\nScatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.\n\nIf keyword init is provided, it is used to initialize the content of dst. Otherwise, the init value is inferred from the reduction operator op for some common operators (e.g. 
init = 0 for op = +).\nIf dstsize is provided, it will be used to define the size of destination array, otherwise it will be inferred by src and idx.\n\nSee scatter! for full details on how idx works.\n\nExamples\n\njulia> NNlib.scatter(+, [10,100,1000], [3,1,2])\n3-element Vector{Int64}:\n 100\n 1000\n 10\n\njulia> NNlib.scatter(+, [1 2 3 4; 5 6 7 8], [2,1,1,5])\n2×5 Matrix{Int64}:\n 5 1 0 0 4\n 13 5 0 0 8\n\njulia> NNlib.scatter(*, [10,200,3000], [1,4,2]; init = 10, dstsize = 6)\n6-element Vector{Int64}:\n 100\n 30000\n 10\n 2000\n 10\n 10\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.scatter!","page":"Low-level Operations – NNlib.jl","title":"NNlib.scatter!","text":"NNlib.scatter!(op, dst, src, idx)\n\nScatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, accumulates values in dst according to\n\ndst[:, ..., idx[k]...] = (op).(dst[:, ..., idx[k]...], src[:, ..., k...])\n\nSee also scatter, gather.\n\nArguments\n\nop: Operations to be applied on dst and src, e.g. +, -, *, /, max, min and mean.\ndst: The destination for src to aggregate to. This argument will be mutated.\nsrc: The source data for aggregating.\nidx: The mapping for aggregation from source (index) to destination (value). 
The idx array can contain either integers or tuples.\n\nExamples\n\njulia> NNlib.scatter!(+, ones(3), [10,100], [1,3])\n3-element Vector{Float64}:\n 11.0\n 1.0\n 101.0\n\njulia> NNlib.scatter!(*, fill(0.5, 2, 4), [1 10; 100 1000], [3,2])\n2×4 Matrix{Float64}:\n 0.5 5.0 0.5 0.5\n 0.5 500.0 50.0 0.5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Sampling","page":"Low-level Operations – NNlib.jl","title":"Sampling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"grid_sample\n∇grid_sample","category":"page"},{"location":"reference/models/nnlib/#NNlib.grid_sample","page":"Low-level Operations – NNlib.jl","title":"NNlib.grid_sample","text":"grid_sample(input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros)\n\nGiven input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.\n\nThis implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).\n\nArguments\n\ninput: Input array in (W_in, H_in, C, N) shape.\ngrid: Input grid in (2, W_out, H_out, N) shape. Where for each (W_out, H_out, N) grid contains (x, y) coordinates that specify sampling locations normalized by the input shape.\nTherefore, x and y should have values in [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.\nOut-of-bound values are handled according to the padding_mode.\npadding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. 
Default is :zeros.\n\nReturns\n\n(W_out, H_out, C, N) sampled grid from input.\n\nExamples\n\nIn the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.\n\njulia> x = reshape(collect(1.0:4.0), (2, 2, 1, 1))\n2×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 1.0 3.0\n 2.0 4.0\n\njulia> grid = Array{Float64}(undef, 2, 3, 2, 1);\n\njulia> grid[:, 1, 1, 1] .= (-3, -1);\n\njulia> grid[:, 2, 1, 1] .= (0, -1);\n\njulia> grid[:, 3, 1, 1] .= (1, -1);\n\njulia> grid[:, 1, 2, 1] .= (-1, 1);\n\njulia> grid[:, 2, 2, 1] .= (0, 1);\n\njulia> grid[:, 3, 2, 1] .= (3, 1);\n\njulia> grid_sample(x, grid; padding_mode=:zeros)\n3×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 0.0 3.0\n 1.5 3.5\n 2.0 0.0\n\njulia> grid_sample(x, grid; padding_mode=:border)\n3×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 1.0 3.0\n 1.5 3.5\n 2.0 4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇grid_sample","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇grid_sample","text":"∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T\n\nArguments\n\nΔ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).\ninput: Input from primal computation in (W_in, H_in, C, N) shape.\ngrid: Grid from primal computation in (2, W_out, H_out, N) shape.\npadding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. 
Default is :zeros.\n\nReturns\n\ndinput (same shape as input) and dgrid (same shape as grid) gradients.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Losses","page":"Low-level Operations – NNlib.jl","title":"Losses","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"ctc_loss","category":"page"},{"location":"reference/models/nnlib/#NNlib.ctc_loss","page":"Low-level Operations – NNlib.jl","title":"NNlib.ctc_loss","text":"ctc_loss(ŷ, y)\n\nComputes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Miscellaneous","page":"Low-level Operations – NNlib.jl","title":"Miscellaneous","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"logsumexp\nNNlib.glu","category":"page"},{"location":"reference/models/nnlib/#NNlib.logsumexp","page":"Low-level Operations – NNlib.jl","title":"NNlib.logsumexp","text":"logsumexp(x; dims = :)\n\nComputes log.(sum(exp.(x); dims)) in a numerically stable way. 
Without dims keyword this returns a scalar.\n\nSee also logsoftmax.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.glu","page":"Low-level Operations – NNlib.jl","title":"NNlib.glu","text":"glu(x, dim = 1)\n\nThe gated linear unit from the \"Language Modeling with Gated Convolutional Networks\" paper.\n\nCalculates a .* sigmoid(b), where x is split in half along given dimension dim to form a and b.\n\n\n\n\n\n","category":"function"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/training/optimisers/#man-optimisers","page":"Optimisation Rules","title":"Optimisation Rules","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Any optimization rule from Optimisers.jl can be used with train! and other training functions.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"For full details of how the new interface works, see the Optimisers.jl documentation.","category":"page"},{"location":"reference/training/optimisers/#Optimisers-Reference","page":"Optimisation Rules","title":"Optimisers Reference","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"All optimisers return an object that, when passed to train!, will update the parameters passed to it.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation 
Rules","text":"Optimisers.Descent\nOptimisers.Momentum\nOptimisers.Nesterov\nOptimisers.RMSProp\nOptimisers.Adam\nOptimisers.RAdam\nOptimisers.AdaMax\nOptimisers.AdaGrad\nOptimisers.AdaDelta\nOptimisers.AMSGrad\nOptimisers.NAdam\nOptimisers.AdamW\nOptimisers.OAdam\nOptimisers.AdaBelief\nOptimisers.Lion","category":"page"},{"location":"reference/training/optimisers/#Optimisers.Descent","page":"Optimisation Rules","title":"Optimisers.Descent","text":"Descent(η = 1f-1)\nDescent(; [eta])\n\nClassic gradient descent optimiser with learning rate η. For each parameter p and its gradient dp, this runs p -= η*dp.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Momentum","page":"Optimisation Rules","title":"Optimisers.Momentum","text":"Momentum(η = 0.01, ρ = 0.9)\nMomentum(; [eta, rho])\n\nGradient descent optimizer with learning rate η and momentum ρ.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ == rho): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Nesterov","page":"Optimisation Rules","title":"Optimisers.Nesterov","text":"Nesterov(η = 0.001, ρ = 0.9)\nNesterov(; [eta, rho])\n\nGradient descent optimizer with learning rate η and Nesterov momentum ρ.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nNesterov momentum (ρ): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.RMSProp","page":"Optimisation Rules","title":"Optimisers.RMSProp","text":"RMSProp(η = 0.001, ρ = 0.9, ϵ = 1e-8; centred = 
false)\nRMSProp(; [eta, rho, epsilon, centred])\n\nOptimizer using the RMSProp algorithm. Often a good choice for recurrent networks. Parameters other than learning rate generally don't need tuning.\n\nCentred RMSProp is a variant which normalises gradients by an estimate of their variance, instead of their second moment.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ == rho): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\nKeyword centred (or centered): Indicates whether to use centred variant of the algorithm.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Adam","page":"Optimisation Rules","title":"Optimisers.Adam","text":"Adam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nAdam(; [eta, beta, epsilon])\n\nAdam optimiser.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.RAdam","page":"Optimisation Rules","title":"Optimisers.RAdam","text":"RAdam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nRAdam(; [eta, beta, epsilon])\n\nRectified Adam optimizer.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change 
default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaMax","page":"Optimisation Rules","title":"Optimisers.AdaMax","text":"AdaMax(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nAdaMax(; [eta, beta, epsilon])\n\nAdaMax is a variant of Adam based on the ∞-norm.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaGrad","page":"Optimisation Rules","title":"Optimisers.AdaGrad","text":"AdaGrad(η = 0.1, ϵ = 1e-8)\nAdaGrad(; [eta, epsilon])\n\nAdaGrad optimizer. It has parameter specific learning rates based on how frequently it is updated. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaDelta","page":"Optimisation Rules","title":"Optimisers.AdaDelta","text":"AdaDelta(ρ = 0.9, ϵ = 1e-8)\nAdaDelta(; [rho, epsilon])\n\nAdaDelta is a version of AdaGrad adapting its learning rate based on a window of past gradient updates. 
Parameters don't need tuning.\n\nParameters\n\nRho (ρ == rho): Factor by which the gradient is decayed at each time step.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AMSGrad","page":"Optimisation Rules","title":"Optimisers.AMSGrad","text":"AMSGrad(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nAMSGrad(; [eta, beta, epsilon])\n\nThe AMSGrad version of the Adam optimiser. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.NAdam","page":"Optimisation Rules","title":"Optimisers.NAdam","text":"NAdam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nNAdam(; [eta, beta, epsilon])\n\nNAdam is a Nesterov variant of Adam. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdamW","page":"Optimisation Rules","title":"Optimisers.AdamW","text":"AdamW(η = 0.001, β = (0.9, 0.999), λ = 0, ϵ = 1e-8; couple = true)\nAdamW(; [eta, beta, lambda, epsilon, couple])\n\nAdamW is a variant of Adam fixing (as in repairing) its weight decay regularization. 
Implemented as an OptimiserChain of Adam and WeightDecay.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nWeight decay (λ == lambda): Controls the strength of L_2 regularisation.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\nKeyword couple: If true, the weight decay is coupled with the learning rate, as in PyTorch's AdamW. This corresponds to an update of the form x = x - η * (dx + λ * x), where dx is the update from Adam with learning rate 1. If false, the weight decay is decoupled from the learning rate, in the spirit of the original paper. This corresponds to an update of the form x = x - η * dx - λ * x. Default is true.\n\nwarning: Breaking change in v0.4\nWith version 0.4 the default update rule for AdamW has changed to match the PyTorch implementation. The previous rule, which is closer to the original paper, can be obtained by setting AdamW(..., couple=false). 
See this issue for more details.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.OAdam","page":"Optimisation Rules","title":"Optimisers.OAdam","text":"OAdam(η = 0.001, β = (0.5, 0.9), ϵ = 1e-8)\nOAdam(; [eta, beta, epsilon])\n\nOAdam (Optimistic Adam) is a variant of Adam adding an \"optimistic\" term suitable for adversarial training.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaBelief","page":"Optimisation Rules","title":"Optimisers.AdaBelief","text":"AdaBelief(η = 0.001, β = (0.9, 0.999), ϵ = 1e-16)\nAdaBelief(; [eta, beta, epsilon])\n\nThe AdaBelief optimiser is a variant of the well-known Adam optimiser.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Lion","page":"Optimisation Rules","title":"Optimisers.Lion","text":"Lion(η = 0.001, β = (0.9, 0.999))\nLion(; [eta, beta])\n\nLion optimiser.\n\nParameters\n\nLearning rate (η == eta): Magnitude by which gradients are updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Composing-Optimisers","page":"Optimisation Rules","title":"Composing 
Optimisers","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Flux (through Optimisers.jl) defines a special kind of optimiser called OptimiserChain which takes in arbitrary optimisers as input. Its behaviour is similar to the usual optimisers, but differs in that it acts by calling the optimisers listed in it sequentially. Each optimiser produces a modified gradient that will be fed into the next, and the resultant update will be applied to the parameter as usual. A classic use case is where adding decays is desirable. Optimisers.jl defines the basic decay corresponding to an L_2 regularization in the loss as WeightDecay.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"opt = OptimiserChain(WeightDecay(1e-4), Descent())","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Here we apply the weight decay to the Descent optimiser. 
The resulting optimiser opt can be used like any other optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"w = [randn(10, 10), randn(10, 10)]\nopt_state = Flux.setup(opt, w)\n\nloss(w, x) = Flux.mse(w[1] * x, w[2] * x)\n\nloss(w, rand(10)) # around 0.9\n\nfor t = 1:10^5\n g = gradient(w -> loss(w, rand(10)), w)[1]\n Flux.update!(opt_state, w, g)\nend\n\nloss(w, rand(10)) # around 0.9","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"It is possible to compose optimisers for some added flexibility.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.OptimiserChain","category":"page"},{"location":"reference/training/optimisers/#Optimisers.OptimiserChain","page":"Optimisation Rules","title":"Optimisers.OptimiserChain","text":"OptimiserChain(opts...)\n\nCompose a sequence of optimisers so that each opt in opts updates the gradient, in the order specified.\n\nWith an empty sequence, OptimiserChain() is the identity, so update! will subtract the full gradient from the parameters. This is equivalent to Descent(1).\n\nExample\n\njulia> o = OptimiserChain(ClipGrad(1.0), Descent(0.1));\n\njulia> m = (zeros(3),);\n\njulia> s = Optimisers.setup(o, m)\n(Leaf(OptimiserChain(ClipGrad(1.0), Descent(0.1)), (nothing, nothing)),)\n\njulia> Optimisers.update(s, m, ([0.3, 1, 7],))[2] # clips before discounting\n([-0.03, -0.1, -0.1],)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Scheduling-Optimisers","page":"Optimisation Rules","title":"Scheduling Optimisers","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. 
There are a variety of popular scheduling policies, and you can find implementations of them in ParameterSchedulers.jl. The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a cosine annealing schedule with a momentum optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between 1e-4 and 1e-2 every 10 steps. We also create a new Momentum optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"using ParameterSchedulers\n\nopt = Momentum()\nschedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)\nfor (eta, epoch) in zip(schedule, 1:100)\n opt.eta = eta\n # your training code here\nend","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"schedule can also be indexed (e.g. schedule(100)) or iterated like any iterator in Julia.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"ParameterSchedulers.jl schedules are stateless (they don't store their iteration state). 
If you want a stateful schedule, you can use ParameterSchedulers.Stateful:","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"using ParameterSchedulers: Stateful, next!\n\nschedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))\nfor epoch in 1:100\n opt.eta = next!(schedule)\n # your training code here\nend","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.","category":"page"},{"location":"reference/training/optimisers/#Decays","page":"Optimisation Rules","title":"Decays","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.SignDecay\nOptimisers.WeightDecay","category":"page"},{"location":"reference/training/optimisers/#Optimisers.SignDecay","page":"Optimisation Rules","title":"Optimisers.SignDecay","text":"SignDecay(λ = 1e-3)\nSignDecay(; [lambda])\n\nImplements L_1 regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.\n\nIt does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.\n\nSee also [WeightDecay] for L_2 normalisation. 
They can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.\n\nParameters\n\nPenalty (λ ≥ 0): Controls the strength of the regularisation.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.WeightDecay","page":"Optimisation Rules","title":"Optimisers.WeightDecay","text":"WeightDecay(λ = 5e-4)\nWeightDecay(; [lambda])\n\nImplements L_2 regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.\n\nIt does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.\n\nSee also [SignDecay] for L_1 normalisation.\n\nParameters\n\nPenalty (λ ≥ 0): Controls the strength of the regularisation.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Gradient-Clipping","page":"Optimisation Rules","title":"Gradient Clipping","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. 
An example usage is","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.ClipGrad\nOptimisers.ClipNorm","category":"page"},{"location":"reference/training/optimisers/#Optimisers.ClipGrad","page":"Optimisation Rules","title":"Optimisers.ClipGrad","text":"ClipGrad(δ = 10)\nClipGrad(; [delta])\n\nRestricts every gradient component to obey -δ ≤ dx[i] ≤ δ.\n\nTypically composed with other rules using OptimiserChain.\n\nSee also ClipNorm.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.ClipNorm","page":"Optimisation Rules","title":"Optimisers.ClipNorm","text":"ClipNorm(ω = 10, p = 2; throw = true)\n\nScales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).\n\nThrows an error if the norm is infinite or NaN, which you can turn off with throw = false.\n\nTypically composed with other rules using OptimiserChain.\n\nSee also ClipGrad.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#GPU-Support","page":"GPU Support","title":"GPU Support","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Most work on neural networks involves the use of GPUs, as they can typically perform the required computation much faster. This page describes how Flux co-operates with various other packages, which talk to GPU hardware.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For those in a hurry, see the quickstart page. Or do using CUDA and then call gpu on both the model and the data. 
","category":"page"},{"location":"guide/gpu/#Basic-GPU-use:-from-Array-to-CuArray","page":"GPU Support","title":"Basic GPU use: from Array to CuArray","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Julia's GPU packages work with special array types, in place of the built-in Array. The most used is CuArray provided by CUDA.jl, for GPUs made by NVIDIA. That package provides a function cu which converts an ordinary Array (living in CPu memory) to a CuArray (living in GPU memory). Functions like * and broadcasting specialise so that, when given CuArrays, all the computation happens on the GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"W = randn(3, 4) # some weights, on CPU: 3×4 Array{Float64, 2}\nx = randn(4) # fake data\ny = tanh.(W * x) # computation on the CPU\n\nusing CUDA\n\ncu(W) isa CuArray{Float32}\n(cW, cx) = (W, x) |> cu # move both to GPU\ncy = tanh.(cW * cx) # computation on the GPU","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Notice that cu doesn't only move arrays, it also recurses into many structures, such as the tuple (W, x) above. (Notice also that it converts Julia's default Float64 numbers to Float32, as this is what most GPUs support efficiently – it calls itself \"opinionated\". Flux defaults to Float32 in all cases.)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To use CUDA with Flux, you can simply use cu to move both the model, and the data. 
It will create a copy of the Flux model, with all of its parameter arrays moved to the GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using Pkg; Pkg.add([\"CUDA\", \"cuDNN\"]) # do this once\n\nusing Flux, CUDA\nCUDA.allowscalar(false) # recommended\n\nmodel = Dense(W, true, tanh) # wrap the same matrix W in a Flux layer\nmodel(x) ≈ y # same result, still on CPU\n\nc_model = cu(model) # move all the arrays within model to the GPU\nc_model(cx) # computation on the GPU","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Notice that you need using CUDA (every time) but also ] add cuDNN (once, when installing packages). This is a quirk of how these packages are set up. (The cuDNN.jl sub-package handles operations such as convolutions, called by Flux via NNlib.jl.)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux's gradient, and training functions like setup, update!, and train!, are all equally happy to accept GPU arrays and GPU models, and then perform all computations on the GPU. It is recommended that you move the model to the GPU before calling setup.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"grads = Flux.gradient((f,x) -> sum(abs2, f(x)), model, x) # on CPU\nc_grads = Flux.gradient((f,x) -> sum(abs2, f(x)), c_model, cx) # same result, all on GPU\n\nc_opt = Flux.setup(Adam(), c_model) # setup optimiser after moving model to GPU\n\nFlux.update!(c_opt, c_model, c_grads[1]) # mutates c_model but not model","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To move arrays and other objects back to the CPU, Flux provides a function cpu. 
This is recommended when saving models, Flux.state(c_model |> cpu), see below.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"cpu(cW) isa Array{Float32, 2}\n\nmodel2 = cpu(c_model) # copy model back to CPU\nmodel2(x)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"compat: Flux ≤ 0.13\nOld versions of Flux automatically loaded CUDA.jl to provide GPU support. Starting from Flux v0.14, it has to be loaded separately. Julia's package extensions allow Flux to automatically load some GPU-specific code when needed.","category":"page"},{"location":"guide/gpu/#Other-GPU-packages-for-AMD-and-Apple","page":"GPU Support","title":"Other GPU packages for AMD & Apple","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Non-NVIDIA graphics cards are supported by other packages. Each provides its own function which behaves like cu. AMD GPU support provided by AMDGPU.jl, on systems with ROCm and MIOpen installed. This package has a function roc which converts Array to ROCArray:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using Flux, AMDGPU\nAMDGPU.allowscalar(false)\n\nr_model = roc(model)\nr_model(roc(x))\n\nFlux.gradient((f,x) -> sum(abs2, f(x)), r_model, roc(x))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Experimental support for Apple devices with M-series chips is provided by Metal.jl. 
This has a function mtl which works like cu, converting Array to MtlArray:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using Flux, Metal\nMetal.allowscalar(false)\n\nm_model = mtl(model)\nm_y = m_model(mtl(x))\n\nFlux.gradient((f,x) -> sum(abs2, f(x)), m_model, mtl(x))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"danger: Experimental\nMetal support in Flux is experimental and many features are not yet available. AMD support is improving, but likely to have more rough edges than CUDA.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If you want your model to work with any brand of GPU, or none, then you may not wish to write cu everywhere. One simple way to be generic is, at the top of the file, to un-comment one of several lines which import a package and assign its \"adaptor\" to the same name:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using CUDA: cu as device # after this, `device === cu`\n# using AMDGPU: roc as device\n# device = identity # do-nothing, for CPU\n\nusing Flux\nmodel = Chain(...) |> device","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"note: Adapt.jl\nThe functions cu, mtl, roc all use Adapt.jl, to work within various wrappers. The reason they work on Flux models is that Flux.@layer Layer defines methods of Adapt.adapt_structure(to, lay::Layer).","category":"page"},{"location":"guide/gpu/#Automatic-GPU-choice-with-gpu-and-gpu_device","page":"GPU Support","title":"Automatic GPU choice with gpu and gpu_device","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux also provides a more automatic way of choosing which GPU (or none) to use. 
This is the function gpu:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default it does nothing.\nIf the package CUDA is loaded, and CUDA.functional() === true, then it behaves like cu.\nIf the package AMDGPU is loaded, and AMDGPU.functional() === true, then it behaves like roc.\nIf the package Metal is loaded, and Metal.functional() === true, then it behaves like mtl.\nIf two different GPU packages are loaded, the first one takes priority.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For the most part, this means that a script which says model |> gpu and data |> gpu will just work. It should always run, and if a GPU package is loaded (and finds the correct hardware) then that will be used.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The function gpu uses a lower-level function called gpu_device from MLDataDevices.jl, which checks what to do and then returns some device object. In fact, the entire implementation is just this:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"gpu(x) = gpu_device()(x)\ncpu(x) = cpu_device()(x)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Automatic backend selection through gpu is not type-stable. That doesn't matter if you do it once, or once per large batch – it costs a few microseconds. But it might matter if you do it within some loop.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To avoid this, you can first obtain a \"device object\" with device = gpu_device(), once, and then use this as the function to transfer data. 
Something like this:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"to_device = gpu_device()\ngpu_model = model |> to_device\n\nfor epoch in 1:num_epochs\n for (x, y) in dataloader\n x_gpu, y_gpu = (x, y) |> to_device\n # training code...","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Finally, setting a backend preference with gpu_backend! gives type stability to the whole pipeline.","category":"page"},{"location":"guide/gpu/#Transferring-Training-Data","page":"GPU Support","title":"Transferring Training Data","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In order to train the model using the GPU, both the model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Iterating over the batches in a DataLoader object, transferring one training batch at a time to the GPU. This is recommended for large datasets. Done by hand, it might look like this:\ntrain_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)\n# ... model definition, optimiser setup\nfor epoch in 1:epochs\n for (x_cpu, y_cpu) in train_loader\n x = gpu(x_cpu)\n y = gpu(y_cpu)\n grads = gradient(m -> loss(m, x, y), model)\n Flux.update!(opt_state, model, grads[1])\n end\nend\nRather than write this out every time, you can just call gpu(::DataLoader):\ngpu_train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu\n# ... model definition, optimiser setup\nfor epoch in 1:epochs\n for (x, y) in gpu_train_loader\n grads = gradient(m -> loss(m, x, y), model)\n Flux.update!(opt_state, model, grads[1])\n end\nend\nThis is equivalent to DataLoader(MLUtils.mapobs(gpu, (X, Y)); keywords...). 
Something similar can also be done with CUDA.CuIterator, gpu_train_loader = CUDA.CuIterator(train_loader). However, this only works with a limited number of data types: first(train_loader) should be a tuple (or NamedTuple) of arrays.\nTransferring all training data to the GPU at once before creating the DataLoader. This is usually performed for smaller datasets which are sure to fit in the available GPU memory.\ngpu_train_loader = Flux.DataLoader((X, Y) |> gpu, batchsize = 32)\n# ...\nfor epoch in 1:epochs\n for (x, y) in gpu_train_loader\n # ...\nHere (X, Y) |> gpu applies gpu to both arrays, as it recurses into structures.","category":"page"},{"location":"guide/gpu/#Saving-GPU-Trained-Models","page":"GPU Support","title":"Saving GPU-Trained Models","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"After the training process is done, we must always transfer the trained model back to the CPU memory before serializing or saving to disk. 
This can be done with cpu:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"model = cpu(model) # or model = model |> cpu","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"and then","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using BSON\n# ...\nBSON.@save \"./path/to/trained_model.bson\" model\n\n# in this approach the cpu-transferred model (referenced by the variable `model`)\n# only exists inside the `let` statement\nlet model = cpu(model)\n # ...\n BSON.@save \"./path/to/trained_model.bson\" model\nend\n\n# is equivalent to the above, but uses `key=value` storing directive from BSON.jl\nBSON.@save \"./path/to/trained_model.bson\" model = cpu(model)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The reason behind this is that models trained in the GPU but not transferred to the CPU memory scope will expect CuArrays as input. 
In other words, Flux models expect input data to live on the same kind of device on which they were trained.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In controlled scenarios in which the data fed to the loaded models is guaranteed to be on the GPU, there is no need to transfer them back to CPU memory. However, in production environments, where artifacts are shared among different processes, machines or configurations, there is no guarantee that the CUDA.jl package will be available for the process performing inference on the model loaded from the disk.","category":"page"},{"location":"guide/gpu/#Disabling-CUDA-or-choosing-which-GPUs-are-visible-to-Flux","page":"GPU Support","title":"Disabling CUDA or choosing which GPUs are visible to Flux","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Sometimes it is necessary to control which GPUs are visible to Julia on a system with multiple GPUs, or to disable GPUs entirely. 
This can be achieved with an environment variable CUDA_VISIBLE_DEVICES.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To disable all devices:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"$ export CUDA_VISIBLE_DEVICES='-1'","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To select specific devices by device id:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"$ export CUDA_VISIBLE_DEVICES='0,1'","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"More information for conditional use of GPUs in CUDA.jl can be found in its documentation, and information about the specific use of the variable is described in the Nvidia CUDA blog post.","category":"page"},{"location":"guide/gpu/#Data-movement-across-GPU-devices","page":"GPU Support","title":"Data movement across GPU devices","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux also supports getting handles to specific GPU devices, and transferring models from one GPU device to another GPU device from the same backend. Let's try it out for NVIDIA GPUs. First, we list all the available devices:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA;\n\njulia> CUDA.devices()\nCUDA.DeviceIterator() for 3 devices:\n0. NVIDIA TITAN RTX\n1. NVIDIA TITAN RTX\n2. 
NVIDIA TITAN RTX","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, let's select the device with id 0:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> device0 = gpu_device(1)\n(::CUDADevice{CuDevice}) (generic function with 4 methods)\n\njulia> device0.device\nCuDevice(0): NVIDIA TITAN RTX","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Notice that indexing starts from 0 in the CUDA.devices() output, but gpu_device expects the device id starting from 1.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, let's move a simple dense layer to the GPU represented by device0:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> dense_model = Dense(2 => 3)\nDense(2 => 3) # 9 parameters\n\njulia> dense_model = dense_model |> device0;\n\njulia> dense_model.weight\n3×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n -0.142062 -0.131455\n -0.828134 -1.06552\n 0.608595 -1.05375\n\njulia> CUDA.device(dense_model.weight) # check the GPU to which dense_model is attached\nCuDevice(0): NVIDIA TITAN RTX","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Next, we'll get a handle to the device with id 1, and move dense_model to that device:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> device1 = gpu_device(2)\n(::CUDADevice{CuDevice}) (generic function with 4 methods)\n\njulia> dense_model = dense_model |> device1; # don't directly print the model; see warning below\n\njulia> CUDA.device(dense_model.weight)\nCuDevice(1): NVIDIA TITAN RTX","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for 
CUDA and AMDGPU backends.","category":"page"},{"location":"guide/gpu/#Distributed-data-parallel-training","page":"GPU Support","title":"Distributed data parallel training","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"danger: Experimental\nDistributed support is experimental and could change in the future.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux now supports distributed data parallel training with the DistributedUtils module. If you want to run your code on multiple GPUs, you have to install MPI.jl (see docs for more info).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using MPI\n\njulia> MPI.install_mpiexecjl()","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Now you can run your code with mpiexecjl --project=. -n julia .jl from CLI.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You can use either the MPIBackend or the NCCLBackend, the latter only if NCCL.jl is also loaded. 
First, initialize a backend with DistributedUtils.initialize, e.g.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, MPI, NCCL, CUDA\n\njulia> CUDA.allowscalar(false)\n\njulia> DistributedUtils.initialize(NCCLBackend)\n\njulia> backend = DistributedUtils.get_distributed_backend(NCCLBackend)\nNCCLBackend{Communicator, MPIBackend{MPI.Comm}}(Communicator(Ptr{NCCL.LibNCCL.ncclComm} @0x000000000607a660), MPIBackend{MPI.Comm}(MPI.Comm(1140850688)))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Pass your model, as well as any data to GPU device.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> model = Chain(Dense(1 => 256, tanh), Dense(256 => 1)) |> gpu\nChain(\n Dense(1 => 256, tanh), # 512 parameters\n Dense(256 => 1), # 257 parameters\n) # Total: 4 arrays, 769 parameters, 744 bytes.\n\njulia> x = rand(Float32, 1, 16) |> gpu\n1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.239324 0.331029 0.924996 0.55593 0.853093 0.874513 0.810269 0.935858 0.477176 0.564591 0.678907 0.729682 0.96809 0.115833 0.66191 0.75822\n\njulia> y = x .^ 3\n1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.0137076 0.0362744 0.791443 0.171815 0.620854 0.668804 0.53197 0.819654 0.108651 0.179971 0.312918 0.388508 0.907292 0.00155418 0.29 0.435899","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In this case, we are training on a total of 16 * number of processes samples. 
You can also use DistributedUtils.DistributedDataContainer to split the data uniformly across processes (or do it manually).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> data = DistributedUtils.DistributedDataContainer(backend, x)\nFlux.DistributedUtils.DistributedDataContainer(Float32[0.23932439 0.33102947 … 0.66191036 0.75822026], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You have to wrap your model in DistributedUtils.FluxDistributedModel and synchronize it (broadcast across all processes):","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> model = DistributedUtils.synchronize!!(backend, DistributedUtils.FluxDistributedModel(model); root=0)\nChain(\n Dense(1 => 256, tanh), # 512 parameters\n\n Dense(256 => 1), # 257 parameters\n) # Total: 4 arrays, 769 parameters, 744 bytes.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Time to set up an optimizer by using DistributedUtils.DistributedOptimizer and synchronize it as well.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Optimisers\n\njulia> opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))\nDistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8))\n\njulia> st_opt = Optimisers.setup(opt, model)\n(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)\n\njulia> st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0)\n(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Now you can define loss and train the model.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> loss(model) = mean((model(x) .- y).^2)\nloss (generic function with 1 method)\n\njulia> for epoch in 1:100\n global model, st_opt\n l, grad = 
Zygote.withgradient(loss, model)\n println(\"Epoch $epoch: Loss $l\")\n st_opt, model = Optimisers.update(st_opt, model, grad[1])\n end\nEpoch 1: Loss 0.011638729\nEpoch 2: Loss 0.0116432225\nEpoch 3: Loss 0.012763695\n...","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Remember that in order to run it on multiple GPUs you have to run from CLI mpiexecjl --project=. -n julia .jl, where is the number of processes that you want to use. The number of processes usually corresponds to the number of GPUs.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default, the MPI installation used by MPI.jl is CUDA-unaware, so if you want to run in CUDA-aware mode, read more here on custom installation and rebuilding MPI.jl. Then test whether your MPI is CUDA-aware with","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> import Pkg\njulia> Pkg.test(\"MPI\"; test_args=[\"--backend=CUDA\"])","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If it is, set your local preference as below","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Preferences\njulia> set_preferences!(\"Flux\", \"FluxDistributedMPICUDAAware\" => true)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"warning: Known shortcomings\nWe don't run CUDA-aware tests, so you're running it at your own risk.","category":"page"},{"location":"guide/gpu/#Checking-GPU-Availability","page":"GPU Support","title":"Checking GPU Availability","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default, Flux will run checks on your system to see whether it can support GPU functionality. 
You can check if Flux identified a valid GPU setup by typing the following:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using CUDA\n\njulia> CUDA.functional()\ntrue","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For AMD GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using AMDGPU\n\njulia> AMDGPU.functional()\ntrue\n\njulia> AMDGPU.functional(:MIOpen)\ntrue","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For Metal GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Metal\n\njulia> Metal.functional()\ntrue","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/utilities/#man-init-funcs","page":"Weight Initialisation","title":"Random Weight Initialisation","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux initialises convolutional layers and recurrent cells with glorot_uniform by default. Most layers accept a function as an init keyword, which replaces this default. 
For example:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"julia> conv = Conv((3, 3), 3 => 2, relu; init=Flux.glorot_normal)\nConv((3, 3), 3 => 2, relu) # 56 parameters\n\njulia> conv.bias\n2-element Vector{Float32}:\n 0.0\n 0.0","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Note that init creates the weight array, but not the bias vector.","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Many of the initialisation functions accept keywords such as gain, and a random number generator. To make it easy to pass these to layers, there are methods which return a function:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"julia> Dense(4 => 5, tanh; init=Flux.glorot_uniform(gain=2))\nDense(4 => 5, tanh) # 25 parameters\n\njulia> Dense(4 => 5, tanh; init=Flux.randn32(MersenneTwister(1)))\nDense(4 => 5, tanh) # 25 parameters","category":"page"},{"location":"reference/utilities/#Initialisation-functions","page":"Weight Initialisation","title":"Initialisation functions","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.glorot_uniform\nFlux.glorot_normal\nFlux.kaiming_uniform\nFlux.kaiming_normal\nFlux.truncated_normal\nFlux.lecun_normal\nFlux.orthogonal\nFlux.sparse_init\nFlux.identity_init\nFlux.ones32\nFlux.zeros32\nFlux.rand32\nFlux.randn32\nFlux.create_bias","category":"page"},{"location":"reference/utilities/#Flux.glorot_uniform","page":"Weight Initialisation","title":"Flux.glorot_uniform","text":"glorot_uniform([rng], size...; gain = 1) -> Array\nglorot_uniform([rng]; kw...) 
-> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval -x x, where x = gain * sqrt(6 / (fan_in + fan_out)).\n\nThis method is described in [1] and also known as Xavier initialization.\n\nExamples\n\njulia> Flux.glorot_uniform(3, 4) |> summary\n\"3×4 Matrix{Float32}\"\n\njulia> round.(extrema(Flux.glorot_uniform(10, 100)), digits=3)\n(-0.233f0, 0.233f0)\n\njulia> round.(extrema(Flux.glorot_uniform(100, 10)), digits=3)\n(-0.234f0, 0.233f0)\n\njulia> round.(extrema(Flux.glorot_uniform(100, 100)), digits=3)\n(-0.173f0, 0.173f0)\n\njulia> Dense(3 => 2, tanh; init = Flux.glorot_uniform(MersenneTwister(1)))\nDense(3 => 2, tanh) # 8 parameters\n\njulia> ans.bias\n2-element Vector{Float32}:\n 0.0\n 0.0\n\nReferences\n\n[1] Glorot, Xavier, and Yoshua Bengio. \"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.glorot_normal","page":"Weight Initialisation","title":"Flux.glorot_normal","text":"glorot_normal([rng], size...; gain = 1) -> Array\nglorot_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a normal distribution with standard deviation gain * sqrt(2 / (fan_in + fan_out)), using nfan.\n\nThis method is described in [1] and also known as Xavier initialization.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.glorot_normal(10, 1000)), digits=3)\n0.044f0\n\njulia> round(std(Flux.glorot_normal(1000, 10)), digits=3)\n0.045f0\n\njulia> round(std(Flux.glorot_normal(1000, 1000)), digits=3)\n0.032f0\n\njulia> Dense(10 => 1000, tanh; init = Flux.glorot_normal(gain=100))\nDense(10 => 1000, tanh) # 11_000 parameters\n\njulia> round(std(ans.weight), sigdigits=3)\n4.45f0\n\nReferences\n\n[1] Glorot, Xavier, and Yoshua Bengio. 
\"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.kaiming_uniform","page":"Weight Initialisation","title":"Flux.kaiming_uniform","text":"kaiming_uniform([rng], size...; gain = √2) -> Array\nkaiming_uniform([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(3/fan_in) using nfan.\n\nThis method is described in [1] and also known as He initialization.\n\nExamples\n\njulia> round.(extrema(Flux.kaiming_uniform(100, 10)), digits=3)\n(-0.774f0, 0.773f0)\n\njulia> round.(extrema(Flux.kaiming_uniform(10, 100)), digits=3)\n(-0.243f0, 0.245f0)\n\njulia> round.(extrema(Flux.kaiming_uniform(100, 100)), digits=3)\n(-0.245f0, 0.245f0)\n\nReferences\n\n[1] He, Kaiming, et al. \"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.kaiming_normal","page":"Weight Initialisation","title":"Flux.kaiming_normal","text":"kaiming_normal([rng], size...; gain = √2) -> Array\nkaiming_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers taken from a normal distribution standard deviation gain / sqrt(fan_in), using nfan.\n\nThis method is described in [1] and also known as He initialization.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.kaiming_normal(10, 1000)), digits=3)\n0.044f0\n\njulia> round(std(Flux.kaiming_normal(1000, 10)), digits=3)\n0.45f0\n\njulia> round(std(Flux.kaiming_normal(1000, 1000)), digits=3)\n0.045f0\n\nReferences\n\n[1] He, Kaiming, et al. 
\"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.truncated_normal","page":"Weight Initialisation","title":"Flux.truncated_normal","text":"truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array\ntruncated_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size where each element is drawn from a truncated normal distribution. The numbers are distributed like filter(x -> lo<=x<=hi, mean .+ std .* randn(100)).\n\nThe values are generated by sampling a Uniform(0, 1) (rand()) and then applying the inverse CDF of the truncated normal distribution. This method works best when lo ≤ mean ≤ hi.\n\nExamples\n\njulia> using Statistics\n\njulia> Flux.truncated_normal(3, 4) |> summary\n\"3×4 Matrix{Float32}\"\n\njulia> round.(extrema(Flux.truncated_normal(10^6)); digits=3)\n(-2.0f0, 2.0f0)\n\njulia> round(std(Flux.truncated_normal(10^6; lo = -100, hi = 100)))\n1.0f0\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.lecun_normal","page":"Weight Initialisation","title":"Flux.lecun_normal","text":"lecun_normal([rng], size...) -> Array\nlecun_normal([rng]; kw...) 
-> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a truncated normal distribution centered on 0 with stddev sqrt(1 / fan_in), where fan_in is the number of input units in the weight tensor.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.lecun_normal(10, 1000)), digits=3)\n0.032f0\n\njulia> round(std(Flux.lecun_normal(1000, 10)), digits=3)\n0.32f0\n\njulia> round(std(Flux.lecun_normal(1000, 1000)), digits=3)\n0.032f0\n\njulia> Dense(10 => 1000, selu; init = Flux.lecun_normal())\nDense(10 => 1000, selu) # 11_000 parameters\n\njulia> round(std(ans.weight), digits=3)\n0.313f0\n\nReferences\n\n[1] Lecun, Yann, et al. \"Efficient backprop.\" Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.orthogonal","page":"Weight Initialisation","title":"Flux.orthogonal","text":"orthogonal([rng], size...; gain = 1) -> Array\northogonal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size which is a (semi) orthogonal matrix, as described in [1].\n\nCannot construct a vector, i.e. length(size) == 1 is forbidden. For length(size) > 2, a prod(size[1:(end - 1)]) by size[end] orthogonal matrix is computed before reshaping it to the original dimensions.\n\nExamples\n\njulia> W = Flux.orthogonal(5, 7);\n\njulia> summary(W)\n\"5×7 Matrix{Float32}\"\n\njulia> W * W' ≈ I(5)\ntrue\n\njulia> W2 = Flux.orthogonal(7, 5);\n\njulia> W2 * W2' ≈ I(7)\nfalse\n\njulia> W2' * W2 ≈ I(5)\ntrue\n\njulia> W3 = Flux.orthogonal(3, 3, 2, 4);\n\njulia> transpose(reshape(W3, :, 4)) * reshape(W3, :, 4) ≈ I(4)\ntrue\n\nReferences\n\n[1] Saxe, McClelland, Ganguli. 
\"Exact solutions to the nonlinear dynamics of learning in deep linear neural networks\", ICLR 2014, https://arxiv.org/abs/1312.6120\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.sparse_init","page":"Weight Initialisation","title":"Flux.sparse_init","text":"sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array\nsparse_init([rng]; kw...) -> Function\n\nReturn a Matrix{Float32} of size rows, cols where each column contains a fixed fraction of zero elements given by sparsity. Non-zero elements are normally distributed with a mean of zero and standard deviation std.\n\nThis method is described in [1].\n\nExamples\n\njulia> count(iszero, Flux.sparse_init(10, 10, sparsity=1/5))\n20\n\njulia> sum(0 .== Flux.sparse_init(10, 11, sparsity=0.9), dims=1)\n1×11 Matrix{Int64}:\n 9 9 9 9 9 9 9 9 9 9 9\n\njulia> Dense(3 => 10, tanh; init=Flux.sparse_init(sparsity=0.5))\nDense(3 => 10, tanh) # 40 parameters\n\njulia> count(iszero, ans.weight, dims=1)\n1×3 Matrix{Int64}:\n 5 5 5\n\nReferences\n\n[1] Martens, J, \"Deep learning via Hessian-free optimization\" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.identity_init","page":"Weight Initialisation","title":"Flux.identity_init","text":"identity_init(size...; gain=1, shift=0) -> Array\nidentity_init(; kw...) -> Function\n\nReturn an Array{Float32} of the given size which yields an identity mapping when used as parameters in most Flux layers. 
Use gain to scale the identity by a constant.\n\nOften useful in the context of transfer learning, i.e when one wants to add more capacity to a model but start from the same mapping.\n\nHas the following behaviour\n\n1D: A Vector of zeros (useful for an identity bias)\n2D: An identity matrix (useful for an identity matrix multiplication)\nMore than 2D: A dense block array of center tap spatial filters (useful for an identity convolution)\n\nSome caveats: \n\nNot all layers will be identity mapping when used with this init. Exceptions include recurrent layers and normalization layers.\nLayers must have input_size == output_size for identity mapping to be possible. When this is not the case, extra dimensions of the array are padded with zeros.\nFor convolutional layers, in addition to the above, the kernel sizes must also be odd and padding must be applied so that output feature maps have the same size as input feature maps, e.g by using SamePad.\n\nUse keyword shift (integer or tuple) to apply circular shift to the output, equivalent to Base.circshift(identity_init(size...), shift).\n\nFor consistency with other initialisers, it accepts rng::AbstractRNG as an optional first argument. 
But this is ignored, since the result is not random.\n\nExamples\n\njulia> Flux.identity_init(3,5)\n3×5 Matrix{Float32}:\n 1.0 0.0 0.0 0.0 0.0\n 0.0 1.0 0.0 0.0 0.0\n 0.0 0.0 1.0 0.0 0.0\n\njulia> Dense(5 => 3, relu, init=Flux.identity_init)([1,-2,3,-4,5])\n3-element Vector{Float32}:\n 1.0\n 0.0\n 3.0\n\njulia> Flux.identity_init(3,3,2; gain=100)\n3×3×2 Array{Float32, 3}:\n[:, :, 1] =\n 0.0 0.0 0.0\n 100.0 0.0 0.0\n 0.0 0.0 0.0\n\n[:, :, 2] =\n 0.0 0.0 0.0\n 0.0 100.0 0.0\n 0.0 0.0 0.0\n\njulia> x4 = cat([1 2 3; 4 5 6; 7 8 9]; dims=4);\n\njulia> Conv((2,2), 1 => 1, init=Flux.identity_init(gain=10), pad=SamePad())(x4)\n3×3×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 10.0 20.0 30.0\n 40.0 50.0 60.0\n 70.0 80.0 90.0\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.ones32","page":"Weight Initialisation","title":"Flux.ones32","text":"ones32(size...) = ones(Float32, size...)\n\nReturn an Array{Float32} of the given size filled with 1s.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.zeros32","page":"Weight Initialisation","title":"Flux.zeros32","text":"zeros32(size...) = zeros(Float32, size...)\n\nReturn an Array{Float32} of the given size filled with 0s.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.rand32","page":"Weight Initialisation","title":"Flux.rand32","text":"rand32([rng], size...)\n\nReturn an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.randn32","page":"Weight Initialisation","title":"Flux.randn32","text":"randn32([rng], size...)\n\nReturn an Array{Float32} of the given size, filled like randn. 
When the size is not provided, randn32(rng::AbstractRNG) returns a function.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.create_bias","page":"Weight Initialisation","title":"Flux.create_bias","text":"create_bias(weights, bias, size...)\n\nReturn a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.\n\nbias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.\nbias == false returns false, which is understood by AD to be non-differentiable.\nbias::AbstractArray uses the array provided, provided it has the correct size. It will also correct the eltype to match that of weights.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"These functions call:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.rng_from_array\nFlux.nfan","category":"page"},{"location":"reference/utilities/#Flux.rng_from_array","page":"Weight Initialisation","title":"Flux.rng_from_array","text":"rng_from_array(x)\n\nCreate an instance of the RNG most appropriate for x. As an example, if x is a CuArray, it will return a CUDA.default_rng(). 
If x is an Array instead, it will return a Random.default_rng().\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.nfan","page":"Weight Initialisation","title":"Flux.nfan","text":"nfan(n_out, n_in=1) -> Tuple\nnfan(dims...)\nnfan(dims::Tuple)\n\nFor a layer characterized by dimensions dims, return a tuple (fan_in, fan_out), where fan_in is the number of input neurons connected to an output one, and fan_out is the number of output neurons connected to an input one.\n\nThis function is mainly used by weight initializers, e.g., kaiming_normal.\n\nExamples\n\njulia> layer = Dense(10, 20);\n\njulia> Flux.nfan(size(layer.weight))\n(10, 20)\n\njulia> layer = Conv((3, 3), 2=>10);\n\njulia> Flux.nfan(size(layer.weight))\n(18, 90)\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Changing-the-type-of-all-parameters","page":"Weight Initialisation","title":"Changing the type of all parameters","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.f64\nFlux.f32\nFlux.f16","category":"page"},{"location":"reference/utilities/#Flux.f64","page":"Weight Initialisation","title":"Flux.f64","text":"f64(m)\n\nConverts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.\n\nSee also f32 and f16.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.f32","page":"Weight Initialisation","title":"Flux.f32","text":"f32(m)\n\nConverts the eltype of model's floating point parameters to Float32 (which is Flux's default). 
Recurses into structs marked with @layer.\n\nSee also f64 and f16.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.f16","page":"Weight Initialisation","title":"Flux.f16","text":"f16(m)\n\nConverts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.\n\nSupport for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.\n\nSee also f32 and f64.\n\nExample\n\njulia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10)) # all Float32\nChain(\n Dense(784 => 2048, relu), # 1_607_680 parameters\n Dense(2048 => 10), # 20_490 parameters\n) # Total: 4 arrays, 1_628_170 parameters, 6.211 MiB.\n\njulia> m |> f16 # takes half the memory\nChain(\n Dense(784 => 2048, relu), # 1_607_680 parameters\n Dense(2048 => 10), # 20_490 parameters\n) # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.\n\n\n\n\n\n","category":"function"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/outputsize/#Shape-Inference","page":"Shape Inference","title":"Shape Inference","text":"","category":"section"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Flux has some tools to help generate models in an automated fashion, by inferring the size of arrays that layers will receive, without doing any computation. This is especially useful for convolutional models, where the same Conv layer accepts any size of image, but the next layer may not. ","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The higher-level tool is a macro @autosize which acts on the code defining the layers, and replaces each appearance of _ with the relevant size. 
This simple example returns a model with Dense(845 => 10) as the last layer:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"@autosize (28, 28, 1, 32) Chain(Conv((3, 3), _ => 5, relu, stride=2), Flux.flatten, Dense(_ => 10))","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The input size may be provided at runtime, like @autosize (sz..., 1, 32) Chain(Conv(..., but all the layer constructors containing _ must be explicitly written out – the macro sees the code as written.","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"This macro relies on a lower-level function outputsize, which you can also use directly:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"c = Conv((3, 3), 1 => 5, relu, stride=2)\nFlux.outputsize(c, (28, 28, 1, 32)) # returns (13, 13, 5, 32)","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The function outputsize works by passing a \"dummy\" array into the model, which propagates through very cheaply. It should work for all layers, including custom layers, out of the box.","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"An example of how to automate model building is this:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"\"\"\"\n make_model(width, height, [inchannels, nclasses; layer_config])\n\nCreate a CNN for a given set of configuration parameters. 
Arguments:\n- `width`, `height`: the input image size in pixels\n- `inchannels`: the number of channels in the input image, default `1`\n- `nclasses`: the number of output classes, default `10`\n- Keyword `layer_config`: a vector of the number of channels per layer, default `[16, 16, 32, 64]`\n\"\"\"\nfunction make_model(width, height, inchannels = 1, nclasses = 10;\n layer_config = [16, 16, 32, 64])\n # construct a vector of layers:\n conv_layers = []\n push!(conv_layers, Conv((5, 5), inchannels => layer_config[1], relu, pad=SamePad()))\n for (inch, outch) in zip(layer_config, layer_config[2:end])\n push!(conv_layers, Conv((3, 3), inch => outch, sigmoid, stride=2))\n end\n\n # compute the output dimensions after these conv layers:\n conv_outsize = Flux.outputsize(conv_layers, (width, height, inchannels); padbatch=true)\n\n # use this to define appropriate Dense layer:\n last_layer = Dense(prod(conv_outsize) => nclasses)\n return Chain(conv_layers..., Flux.flatten, last_layer)\nend\n\nm = make_model(28, 28, 3, layer_config = [9, 17, 33, 65])\n\nFlux.outputsize(m, (28, 28, 3, 42)) == (10, 42) == size(m(randn(Float32, 28, 28, 3, 42)))","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Alternatively, using the macro, the definition of make_model could end with:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":" # compute the output dimensions & construct appropriate Dense layer:\n return @autosize (width, height, inchannels, 1) Chain(conv_layers..., Flux.flatten, Dense(_ => nclasses))\nend","category":"page"},{"location":"reference/outputsize/#Listing","page":"Shape Inference","title":"Listing","text":"","category":"section"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Flux.@autosize\nFlux.outputsize","category":"page"},{"location":"reference/outputsize/#Flux.@autosize","page":"Shape 
Inference","title":"Flux.@autosize","text":"@autosize (size...,) Chain(Layer(_ => 2), Layer(_), ...)\n\nReturns the specified model, with each _ replaced by an inferred number, for input of the given size.\n\nThe unknown sizes are usually the second-last dimension of that layer's input, which Flux regards as the channel dimension. (A few layers, Dense & LayerNorm, instead always use the first dimension.) The underscore may appear as an argument of a layer, or inside a =>. It may be used in further calculations, such as Dense(_ => _÷4).\n\nExamples\n\njulia> @autosize (3, 1) Chain(Dense(_ => 2, sigmoid), BatchNorm(_, affine=false))\nChain(\n Dense(3 => 2, σ), # 8 parameters\n BatchNorm(2, affine=false),\n) \n\njulia> img = [28, 28];\n\njulia> @autosize (img..., 1, 32) Chain( # size is only needed at runtime\n Chain(c = Conv((3,3), _ => 5; stride=2, pad=SamePad()),\n p = MeanPool((3,3)),\n b = BatchNorm(_),\n f = Flux.flatten),\n Dense(_ => _÷4, relu, init=Flux.rand32), # can calculate output size _÷4\n SkipConnection(Dense(_ => _, relu), +),\n Dense(_ => 10),\n )\nChain(\n Chain(\n c = Conv((3, 3), 1 => 5, pad=1, stride=2), # 50 parameters\n p = MeanPool((3, 3)),\n b = BatchNorm(5), # 10 parameters, plus 10\n f = Flux.flatten,\n ),\n Dense(80 => 20, relu), # 1_620 parameters\n SkipConnection(\n Dense(20 => 20, relu), # 420 parameters\n +,\n ),\n Dense(20 => 10), # 210 parameters\n) # Total: 10 trainable arrays, 2_310 parameters,\n # plus 2 non-trainable, 10 parameters, summarysize 10.469 KiB.\n\njulia> outputsize(ans, (28, 28, 1, 32))\n(10, 32)\n\nLimitations:\n\nWhile @autosize (5, 32) Flux.Bilinear(_ => 7) is OK, something like Bilinear((_, _) => 7) will fail.\nWhile Scale(_) and LayerNorm(_) are fine (and use the first dimension), Scale(_,_) and LayerNorm(_,_) will fail if size(x,1) != size(x,2).\n\n\n\n\n\n","category":"macro"},{"location":"reference/outputsize/#Flux.outputsize","page":"Shape Inference","title":"Flux.outputsize","text":"outputsize(m, x_size, 
y_size, ...; padbatch=false)\n\nFor model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given size_x = size(x), etc.\n\nExamples\n\njulia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);\n\njulia> par = Parallel(vcat, Dense(5 => 9), Dense(7 => 11));\n\njulia> Flux.outputsize(par, (5, 64), (7, 64))\n(20, 64)\n\njulia> m = Chain(par, Dense(20 => 13), softmax);\n\njulia> Flux.outputsize(m, (5,), (7,); padbatch=true)\n(13, 1)\n\njulia> par(x, y) == par((x, y)) == Chain(par, identity)((x, y))\ntrue\n\nNotice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.\n\n\n\n\n\n","category":"function"},{"location":"guide/performance/#man-performance-tips","page":"Performance Tips","title":"Performance Tips","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"All the usual Julia performance tips apply. As always profiling your code is generally a useful way of finding bottlenecks. Below follow some Flux specific tips/reminders.","category":"page"},{"location":"guide/performance/#Don't-use-more-precision-than-you-need","page":"Performance Tips","title":"Don't use more precision than you need","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Flux works great with all kinds of number types. But often you do not need to be working with say Float64 (let alone BigFloat). Switching to Float32 can give you a significant speed up, not because the operations are faster, but because the memory usage is halved. Which means allocations occur much faster. 
And you use less memory.","category":"page"},{"location":"guide/performance/#Preserve-inputs'-types","page":"Performance Tips","title":"Preserve inputs' types","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Not only should your activation and loss functions be type-stable, they should also preserve the type of their inputs.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"A very artificial example using an activation function like","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"my_tanh(x) = Float64(tanh(x))","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"will result in performance on Float32 input orders of magnitude slower than the normal tanh would, because it results in having to use slow mixed-type multiplication in the dense layers. Similar situations can occur in the loss function during backpropagation.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"This means that if you change your data, say from Float64 to Float32 (which should give a speedup: see above), you will see a large slow-down.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"This can occur sneakily, because you can cause type-promotion by interacting with numeric literals. For example, the following will run into the same problem as above:","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"leaky_tanh(x) = 0.01*x + tanh(x)","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"While one could change the activation function (e.g. 
to use 0.01f0*x), the idiomatic (and safe) way to avoid type casts whenever the input type changes is to use oftype:","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"leaky_tanh(x) = oftype(x/1, 0.01)*x + tanh(x)","category":"page"},{"location":"guide/performance/#Evaluate-batches-as-matrices-of-features","page":"Performance Tips","title":"Evaluate batches as matrices of features","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"While it can sometimes be tempting to process your observations (feature vectors) one at a time e.g.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"function loss_total(xs::AbstractVector{<:Vector}, ys::AbstractVector{<:Vector})\n sum(zip(xs, ys)) do (x, y_target)\n y_pred = model(x) # evaluate the model\n return loss(y_pred, y_target)\n end\nend","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"It is much faster to concatenate them into a matrix, as this will hit BLAS matrix-matrix multiplication, which is much faster than the equivalent sequence of matrix-vector multiplications. The improvement is enough that it is worthwhile allocating new memory to store them contiguously.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"x_batch = reduce(hcat, xs)\ny_batch = reduce(hcat, ys)\n...\nfunction loss_total(x_batch::Matrix, y_batch::Matrix)\n y_preds = model(x_batch)\n sum(loss.(y_preds, y_batch))\nend","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"When doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). 
This will avoid the splatting penalty, and will hit the optimised reduce method.","category":"page"},{"location":"guide/performance/#Be-aware-of-GPU-memory-inefficiencies","page":"Performance Tips","title":"Be aware of GPU memory inefficiencies","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.","category":"page"},{"location":"#Flux:-The-Julia-Machine-Learning-Library","page":"Welcome","title":"Flux: The Julia Machine Learning Library","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Flux is a library for machine learning. It comes \"batteries-included\" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.\nExtensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.\nPlay nicely with others. 
Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.","category":"page"},{"location":"#Installation","page":"Welcome","title":"Installation","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Download Julia 1.10 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.","category":"page"},{"location":"#Learning-Flux","page":"Welcome","title":"Learning Flux","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"The quick start page trains a simple neural network.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"The rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"There are some tutorials about building particular models. The model zoo has starting points for many other common ones. 
And finally, the ecosystem page lists packages which define Flux models.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"The reference section includes, beside Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.","category":"page"},{"location":"#Community","page":"Welcome","title":"Community","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Everyone is welcome to join our community on the Julia discourse forum, or the slack chat (channel #machine-learning). If you have questions or issues we'll try to help you out.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.","category":"page"},{"location":"tutorials/linear_regression/#man-linear-regression","page":"Linear Regression","title":"Tutorial: Linear Regression","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Flux is a pure Julia ML stack that allows you to build predictive models. Here are the steps for a typical Flux program:","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Provide training and test data\nBuild a model with configurable parameters to make predictions\nIteratively train the model by tweaking the parameters to improve predictions\nVerify your model","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. 
Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The following page contains a step-by-step walkthrough of the linear regression algorithm in Julia using Flux! We will start by creating a simple linear regression model for dummy data and then move on to a real dataset. The first part would involve writing some parts of the model on our own, which will later be replaced by Flux.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let us start by building a simple linear regression model. This model would be trained on the data points of the form (x₁, y₁), (x₂, y₂), ... , (xₙ, yₙ). In the real world, these xs can have multiple features, and the ys denote a label. 
In our example, each x has a single feature; hence, our data would have n data points, each point mapping a single feature to a single label.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Importing the required Julia packages -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> using Flux, Plots","category":"page"},{"location":"tutorials/linear_regression/#Generating-a-dataset","page":"Linear Regression","title":"Generating a dataset","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the xs of our data points and map them to the respective ys using a simple function. Remember, here each x is equivalent to a feature, and each y is the corresponding label. Combining all the xs and ys would create the complete dataset.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x = hcat(collect(Float32, -3:0.1:3)...)\n1×61 Matrix{Float32}:\n -3.0 -2.9 -2.8 -2.7 -2.6 -2.5 … 2.4 2.5 2.6 2.7 2.8 2.9 3.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The hcat call generates a Matrix with numbers ranging from -3.0 to 3.0 with a gap of 0.1 between them. Each column of this matrix holds a single x, a total of 61 xs. The next step would be to generate the corresponding labels or the ys.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> f(x) = @. 
3x + 2;\n\njulia> y = f(x)\n1×61 Matrix{Float32}:\n -7.0 -6.7 -6.4 -6.1 -5.8 -5.5 … 9.5 9.8 10.1 10.4 10.7 11.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The function f maps each x to a y, and as x is a Matrix, the expression broadcasts the scalar values using the @. macro. Our data points are ready, but they are too perfect. In a real-world scenario, we will not have an f function to generate y values; instead, the labels would be added manually. To mimic that, we scale each x by a random factor.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x = x .* reshape(rand(Float32, 61), (1, 61));","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Visualizing the final data -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> plot(vec(x), vec(y), lw = 3, seriestype = :scatter, label = \"\", title = \"Generated data\", xlabel = \"x\", ylabel= \"y\");","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"(Image: linear-regression-data)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data looks random enough now! 
The x and y values are still somewhat correlated; hence, the linear regression algorithm should work fine on our dataset.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now proceed ahead and build a model for our dataset!","category":"page"},{"location":"tutorials/linear_regression/#Building-a-model","page":"Linear Regression","title":"Building a model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"A linear regression model is defined mathematically as -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"model(W b x) = Wx + b","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"where W is the weight matrix and b is the bias. For our case, the weight matrix (W) would constitute only a single element, as we have only a single feature. We can define our model in Julia using the exact same notation!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_model(W, b, x) = @. W*x + b\ncustom_model (generic function with 1 method)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The @. macro allows you to perform the calculations by broadcasting the scalar quantities (for example - the bias).","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The next step would be to initialize the model parameters, which are the weight and the bias. 
There are a lot of initialization techniques available for different machine learning models, but for the sake of this example, let's draw the weight from a uniform distribution and initialize the bias as 0.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W = rand(Float32, 1, 1)\n1×1 Matrix{Float32}:\n 0.99285793\n\njulia> b = [0.0f0]\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Time to test if our model works!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_model(W, b, x) |> size\n(1, 61)\n\njulia> custom_model(W, b, x)[1], y[1]\n(-1.6116865f0, -7.0f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It does! But the predictions are way off. We need to train the model to improve the predictions, but before training the model we need to define the loss function. The loss function outputs a single quantity that we will try to minimize during the entire training process. Here we will use the mean squared error (MSE) loss function.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function custom_loss(weights, biases, features, labels)\n ŷ = custom_model(weights, biases, features)\n sum((labels .- ŷ).^2) / length(features)\n end;\n\njulia> custom_loss(W, b, x, y)\n23.772217f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Calling the loss function on our xs and ys shows how far our predictions (ŷ) are from the real labels. 
More precisely, it calculates the sum of the squares of residuals and divides it by the total number of data points.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We have successfully defined our model and the loss function, but surprisingly, we haven't used Flux anywhere so far. Let's see how we can write the same code using Flux.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model = Dense(1 => 1)\nDense(1 => 1) # 2 parameters","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"A Dense(1 => 1) layer denotes a layer of one neuron with one input (one feature) and one output. This layer is exactly the same as the mathematical model we defined above! Under the hood, Flux calculates the output using the same expression. But we don't have to initialize the parameters ourselves this time; Flux does it for us.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model.weight, flux_model.bias\n(Float32[-1.2678515;;], Float32[0.0])","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Now we can check if our model is acting right. We can pass the complete data in one go, with each x having exactly one feature (one input) -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model(x) |> size\n(1, 61)\n\njulia> flux_model(x)[1], y[1]\n(-1.8525281f0, -7.0f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It is! 
The next step would be defining the loss function using Flux's functions -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function flux_loss(flux_model, features, labels)\n ŷ = flux_model(features)\n Flux.mse(ŷ, labels)\n end;\n\njulia> flux_loss(flux_model, x, y)\n22.74856f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Everything works as before! It almost feels like Flux provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the flux_model is from our custom_model. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our custom_model to match those of the flux_model -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W = Float32[1.1412252]\n1-element Vector{Float32}:\n 1.1412252","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"To check how both the models are performing on the data, let's find out the losses using the custom_loss and flux_loss functions -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_loss(W, b, x, y), flux_loss(flux_model, x, y)\n(22.74856f0, 22.74856f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The losses are identical! This means that our model and the flux_model are identical on some level, and the loss functions are completely identical! The difference in models would be that Flux's Dense layer supports many other arguments that can be used to customize the layer further. 
But, for this tutorial, let us stick to our simple custom_model.","category":"page"},{"location":"tutorials/linear_regression/#Training-the-model","page":"Linear Regression","title":"Training the model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's train our model using the classic Gradient Descent algorithm. According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"beginaligned\nW = W - eta * fracdLdW \nb = b - eta * fracdLdb\nendaligned","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Here, W is the weight matrix, b is the bias vector, eta is the learning rate, fracdLdW is the derivative of the loss function with respect to the weight, and fracdLdb is the derivative of the loss function with respect to the bias.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The derivatives are calculated using an Automatic Differentiation tool, and Flux uses Zygote.jl for the same. Since Zygote.jl is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of Zygote.jl for more information on the same.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. 
Flux re-exports Zygote's gradient function; hence, we don't need to import Zygote explicitly to use the functionality.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now update the parameters, following the gradient descent algorithm -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W .= W .- 0.1 .* dLdW\n1-element Vector{Float32}:\n 1.8144473\n\njulia> b .= b .- 0.1 .* dLdb\n1-element Vector{Float32}:\n 0.41325632","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The parameters have been updated! We can now check the value of the loss function -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_loss(W, b, x, y)\n17.157953f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss went down! This means that we successfully trained our model for one epoch. We can plug the training code written above into a loop and train the model for a higher number of epochs. It can be customized either to have a fixed number of epochs or to stop when certain conditions are met, for example, change in loss < 0.1. 
The loop can be tailored to suit the user's needs, and the conditions can be specified in plain Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's wrap our training logic in a function and test it again -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function train_custom_model!(f_loss, weights, biases, features, labels)\n dLdW, dLdb, _, _ = gradient(f_loss, weights, biases, features, labels)\n @. weights = weights - 0.1 * dLdW\n @. biases = biases - 0.1 * dLdb\n end;\n\njulia> train_custom_model!(custom_loss, W, b, x, y);\n\njulia> W, b, custom_loss(W, b, x, y)\n(Float32[2.340657], Float32[0.7516814], 13.64972f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It works, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for another 40 epochs.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> for i = 1:40\n train_custom_model!(custom_loss, W, b, x, y)\n end\n\njulia> W, b, custom_loss(W, b, x, y)\n(Float32[4.2422233], Float32[2.2460847], 7.6680417f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"There was a significant reduction in loss, and the parameters were updated!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 42 epochs in total (1 + 1 + 40), and the loss went down from 22.74856 to 7.6680417. 
Time for some visualization!","category":"page"},{"location":"tutorials/linear_regression/#Results","page":"Linear Regression","title":"Results","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, Wx + b is nothing more than a line's equation, with slope = W[1] and y-intercept = b[1] (we index at 1 because W and b are arrays).","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Plotting the line and the data points using Plots.jl -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = \"\", title = \"Simple Linear Regression\", xlabel = \"x\", ylabel= \"y\");\n\njulia> plot!((x) -> b[1] + W[1] * x, -3, 3, label=\"Custom model\", lw=2);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"(Image: linear-regression-line)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The line fits well! There is room for improvement, but we leave that up to you! You can play with the optimisers, the number of epochs, learning rate, etc. 
to improve the fitting and reduce the loss!","category":"page"},{"location":"tutorials/linear_regression/#Linear-regression-model-on-a-real-dataset","page":"Linear Regression","title":"Linear regression model on a real dataset","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We now move on to a relatively complex linear regression model. Here we will use a real dataset from MLDatasets.jl, which will not confine our data points to have only one feature. Let's start by importing the required packages -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> using Flux, Statistics, MLDatasets, DataFrames","category":"page"},{"location":"tutorials/linear_regression/#Gathering-real-data","page":"Linear Regression","title":"Gathering real data","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's start by initializing our dataset. We will be using the BostonHousing dataset consisting of 506 data points. Each of these data points has 13 features and a corresponding label, the house's price. 
Each x is still mapped to a single y, but now a single x data point has 13 features.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> dataset = BostonHousing();\n\njulia> x, y = BostonHousing(as_df=false)[:];\n\njulia> x, y = Float32.(x), Float32.(y);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now split the obtained data into training and testing data -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end];\n\njulia> x_train |> size, x_test |> size, y_train |> size, y_test |> size\n((13, 400), (13, 106), (1, 400), (1, 106))","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"This data contains a diverse set of features, and these features have very different scales. A wise option here would be to normalise the data, which makes the training process more efficient. Let's check the standard deviation of the training data before normalising it.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> std(x_train)\n134.06786f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data is indeed not normalised. 
We can use the Flux.normalise function to normalise the training data.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_train_n = Flux.normalise(x_train);\n\njulia> std(x_train_n)\n1.0000844f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The standard deviation is now close to one! Our data is ready!","category":"page"},{"location":"tutorials/linear_regression/#Building-a-Flux-model","page":"Linear Regression","title":"Building a Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now directly use Flux and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and Flux will handle everything for us! Remember, we could have declared a model in plain Julia as well. The model will have 14 parameters: 13 weights and 1 bias.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> model = Dense(13 => 1)\nDense(13 => 1) # 14 parameters","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Same as before, our next step would be to define a loss function to quantify our accuracy somehow. 
The lower the loss, the better the model!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function loss(model, features, labels)\n ŷ = model(features)\n Flux.mse(ŷ, labels)\n end;\n\njulia> loss(model, x_train_n, y_train)\n676.1656f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now proceed to the training phase!","category":"page"},{"location":"tutorials/linear_regression/#Training-the-Flux-model","page":"Linear Regression","title":"Training the Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The training procedure would make use of the same mathematics, but now we can pass in the model inside the gradient call and let Flux and Zygote handle the derivatives!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function train_model!(f_loss, model, features, labels)\n dLdm, _, _ = gradient(f_loss, model, features, labels)\n @. model.weight = model.weight - 0.000001 * dLdm.weight\n @. model.bias = model.bias - 0.000001 * dLdm.bias\n end;","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when change in loss < δ. 
The quantity δ can be altered according to a user's need, but let's fix it to 10⁻⁴ for this tutorial.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can write such custom training loops effortlessly using Flux and plain Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> loss_init = Inf;\n\njulia> while true\n train_model!(loss, model, x_train_n, y_train)\n if loss_init == Inf\n loss_init = loss(model, x_train_n, y_train)\n continue\n end\n if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-4\n break\n else\n loss_init = loss(model, x_train_n, y_train)\n end\n end;","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The code starts by setting the initial value of the loss to infinity. Next, it runs an infinite loop that breaks if change in loss < 10⁻⁴; otherwise, it updates loss_init to the current loss and moves on to the next iteration.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"This custom loop works! This shows how easily a user can write down any custom training routine using Flux and Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's have a look at the loss -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> loss(model, x_train_n, y_train)\n27.1272f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss went down significantly! 
It can be minimized further by choosing an even smaller δ.","category":"page"},{"location":"tutorials/linear_regression/#Testing-the-Flux-model","page":"Linear Regression","title":"Testing the Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_test_n = Flux.normalise(x_test);\n\njulia> loss(model, x_test_n, y_test)\n66.91015f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss is not as small as the loss of the training data, but it looks good! This also shows that our model is not overfitting!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. 
We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"info: Info\nOriginally published on 21 November 2022, by Saransh Chopra.","category":"page"},{"location":"guide/saving/#Saving-and-Loading-Models","page":"Saving & Loading","title":"Saving and Loading Models","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You may wish to save models so that they can be loaded and run in a later session. Flux provides a number of ways to do this. 
The recommended way, which is the most robust one for long term storage, is to use Flux.state in combination with a serialization format like JLD2.jl or BSON.jl.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Save a model:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> struct MyModel\n net\n end\n\njulia> Flux.@layer MyModel\n\njulia> MyModel() = MyModel(Chain(Dense(10 => 5, relu), Dense(5 => 2)));\n\njulia> model = MyModel()\nMyModel(\n Chain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n ),\n) # Total: 4 arrays, 67 parameters, 484 bytes.\n\njulia> model_state = Flux.state(model);\n\njulia> using JLD2\n\njulia> jldsave(\"mymodel.jld2\"; model_state)","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Load it again in a new session using Flux.loadmodel!:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux, JLD2\n\njulia> model_state = JLD2.load(\"mymodel.jld2\", \"model_state\");\n\njulia> model = MyModel(); # MyModel definition must be available\n\njulia> Flux.loadmodel!(model, model_state);","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"note: Note\nIf a saved model's parameters are stored on the GPU, the model will not load later on if there is no GPU support available. 
It's best to move your model to the CPU with cpu(model) before saving it.","category":"page"},{"location":"guide/saving/#Checkpointing","page":"Saving & Loading","title":"Checkpointing","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"In longer training runs it's a good idea to periodically save your model, so that you can resume if training is interrupted (for example, if there's a power cut). ","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux: throttle\n\njulia> using JLD2\n\njulia> m = Chain(Dense(10 => 5, relu), Dense(5 => 2))\nChain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 67 parameters, 476 bytes.\n\njulia> for epoch in 1:10\n # ... train model ...\n jldsave(\"model-checkpoint.jld2\", model_state = Flux.state(m))\n end;","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"This will update the \"model-checkpoint.jld2\" every epoch.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You can get more advanced by saving a series of models throughout training, for example","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"jldsave(\"model-$(now()).jld2\", model_state = Flux.state(m))","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"will produce a series of models like \"model-2018-03-06T02:57:10.41.jld2\". 
You could also store the current test set loss, so that it's easy to (for example) revert to an older copy of the model if it starts to overfit.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"jldsave(\"model-$(now()).jld2\", model_state = Flux.state(m), loss = testloss())","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Note that to resume a model's training, you might need to restore other stateful parts of your training loop. Possible examples are the optimiser state and the randomness used to partition the original data into the training and validation sets.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You can store the optimiser state alongside the model, to resume training exactly where you left off: ","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"model = MyModel()\nopt_state = Flux.setup(AdamW(), model)\n\n# ... train model ...\n\nmodel_state = Flux.state(model)\njldsave(\"checkpoint_epoch=42.jld2\"; model_state, opt_state)","category":"page"},{"location":"guide/saving/#Saving-Models-as-Julia-Structs","page":"Saving & Loading","title":"Saving Models as Julia Structs","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Models are just normal Julia structs, so it's fine to use any Julia storage format to save the struct as it is instead of saving the state returned by Flux.state. 
BSON.jl is particularly convenient for this, since it can also save anonymous functions, which are sometimes part of a model definition.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Save a model:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> model = Chain(Dense(10 => 5, NNlib.relu), Dense(5 => 2));\n\njulia> using BSON: @save\n\njulia> @save \"mymodel.bson\" model","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Load it again in a new session:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux, BSON\n\njulia> BSON.@load \"mymodel.bson\" model\n\njulia> model\nChain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 67 parameters, 476 bytes.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"warning: Warning\nSaving models this way could lead to compatibility issues across julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long term storage, use Flux.state instead.","category":"page"}] +[{"location":"guide/models/quickstart/#man-quickstart","page":"Quick Start","title":"A Neural Network in One Minute","text":"","category":"section"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"If you have used neural networks before, then this simple example might be helpful for seeing how the major parts of Flux work together. 
Try pasting the code into the REPL prompt.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"If you haven't, then you might prefer the Fitting a Straight Line page.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"# Install everything, including CUDA, and load packages:\nusing Pkg; Pkg.add([\"Flux\", \"CUDA\", \"cuDNN\", \"ProgressMeter\"])\nusing Flux, Statistics, ProgressMeter\nusing CUDA # optional\ndevice = gpu_device() # function to move data and model to the GPU\n\n# Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:\nnoisy = rand(Float32, 2, 1000) # 2×1000 Matrix{Float32}\ntruth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)] # 1000-element Vector{Bool}\n\n# Define our model, a multi-layer perceptron with one hidden layer of size 3:\nmodel = Chain(\n Dense(2 => 3, tanh), # activation function inside layer\n BatchNorm(3),\n Dense(3 => 2)) |> device # move model to GPU, if one is available\n\n# The model encapsulates parameters, randomly initialised. 
Its initial output is:\nout1 = model(noisy |> device) # 2×1000 Matrix{Float32}, or CuArray{Float32}\nprobs1 = softmax(out1) |> cpu # normalise to get probabilities (and move off GPU)\n\n# To train the model, we use batches of 64 samples, and one-hot encoding:\ntarget = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix\nloader = Flux.DataLoader((noisy, target), batchsize=64, shuffle=true);\n\nopt_state = Flux.setup(Flux.Adam(0.01), model) # will store optimiser momentum, etc.\n\n# Training loop, using the whole data set 1000 times:\nlosses = []\n@showprogress for epoch in 1:1_000\n for xy_cpu in loader\n # Unpack batch of data, and move to GPU:\n x, y = xy_cpu |> device\n loss, grads = Flux.withgradient(model) do m\n # Evaluate model and loss inside gradient context:\n y_hat = m(x)\n Flux.logitcrossentropy(y_hat, y)\n end\n Flux.update!(opt_state, model, grads[1])\n push!(losses, loss) # logging, outside gradient context\n end\nend\n\nopt_state # parameters, momenta and output have all changed\n\nout2 = model(noisy |> device) # first row is prob. 
of true, second row p(false)\nprobs2 = softmax(out2) |> cpu # normalise to get probabilities\nmean((probs2[1,:] .> 0.5) .== truth) # accuracy 94% so far!","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"(Image: )","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"using Plots # to draw the above figure\n\np_true = scatter(noisy[1,:], noisy[2,:], zcolor=truth, title=\"True classification\", legend=false)\np_raw = scatter(noisy[1,:], noisy[2,:], zcolor=probs1[1,:], title=\"Untrained network\", label=\"\", clims=(0,1))\np_done = scatter(noisy[1,:], noisy[2,:], zcolor=probs2[1,:], title=\"Trained network\", legend=false)\n\nplot(p_true, p_raw, p_done, layout=(1,3), size=(1000,330))","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Here's the loss during training:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"plot(losses; xaxis=(:log10, \"iteration\"),\n yaxis=\"loss\", label=\"per batch\")\nn = length(loader)\nplot!(n:n:length(losses), mean.(Iterators.partition(losses, n)),\n label=\"epoch mean\", dpi=200)","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"This XOR (\"exclusive or\") problem is a variant of the famous one which drove Minsky and Papert to invent deep neural networks in 1969. For small values of \"deep\" – this has one hidden layer, while earlier perceptrons had none. (What they call a hidden layer, Flux calls the output of the first layer, model[1](noisy).)","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Since then things have developed a little. 
","category":"page"},{"location":"guide/models/quickstart/#Features-to-Note","page":"Quick Start","title":"Features to Note","text":"","category":"section"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Some things to notice in this example are:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"The batch dimension of data is always the last one. Thus a 2×1000 Matrix is a thousand observations, each a column of length 2. Flux defaults to Float32, but most of Julia to Float64.\nThe model can be called like a function, y = model(x). Each layer like Dense is an ordinary struct, which encapsulates some arrays of parameters (and possibly other state, as for BatchNorm).\nBut the model does not contain the loss function, nor the optimisation rule. The momenta needed by Adam are stored in the object returned by setup. And Flux.logitcrossentropy is an ordinary function that combines the softmax and crossentropy functions.\nThe do block creates an anonymous function, as the first argument of gradient. Anything executed within this is differentiated.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Instead of calling gradient and update! separately, there is a convenience function train!. If we didn't want anything extra (like logging the loss), we could replace the training loop with the following:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"for epoch in 1:1_000\n Flux.train!(model, loader |> device, opt_state) do m, x, y\n y_hat = m(x)\n Flux.logitcrossentropy(y_hat, y)\n end\nend","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Notice that the full dataset noisy lives on the CPU, and is moved to the GPU one batch at a time, by xy_cpu |> device. 
This is generally what you want for large datasets. Calling loader |> device similarly modifies the DataLoader to move one batch at a time.\nIn our simple example, we conveniently created the model as a Chain of layers.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"For more complex models, you can define a custom struct MyModel containing layers and arrays and implement the call operator (::MyModel)(x) = ... to define the forward pass. This is all that is needed for Flux to work. Marking the struct with Flux.@layer will add some more functionality, like pretty printing and the ability to mark some internal fields as trainable or not (also see trainable).","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/training/reference/#Training-API-Reference","page":"Training API","title":"Training API Reference","text":"","category":"section"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The new version of Flux's training code was written as an independent package, Optimisers.jl. Only the function train! belongs to Flux itself.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The Optimisers package is designed to allow for immutable objects. But at present all Flux models contain parameter arrays (such as Arrays and CuArrays) which can be updated in-place. Because of this:","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The objects returned by Optimisers.update! can be ignored.\nFlux defines its own version of setup which checks this assumption. 
(Using Optimisers.setup instead will also work; they return the same thing.)","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The available optimisation rules are listed on the optimisation rules page. See the Optimisers documentation for details on how the rules work.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"Flux.Train.setup\nFlux.Train.train!(loss, model, data, state)\nOptimisers.update\nOptimisers.update!\nOptimisers.setup","category":"page"},{"location":"reference/training/reference/#Flux.Train.setup","page":"Training API","title":"Flux.Train.setup","text":"opt_state = setup(rule, model)\n\nThis is a version of Optimisers.setup, and is the first step before using train!. It differs from Optimisers.setup in that it:\n\nhas one extra check for mutability (since Flux expects to mutate the model in-place, while Optimisers.jl is designed to return an updated model)\nhas methods which accept Flux's old optimisers, and convert them. 
(The old Flux.Optimise.Adam and new Optimisers.Adam are distinct types.)\n\nExample\n\njulia> model = Dense(2 => 1, leakyrelu; init=ones);\n\njulia> opt_state = Flux.setup(Momentum(0.1), model) # this encodes the optimiser and its state\n(weight = Leaf(Momentum(0.1, 0.9), [0.0 0.0]), bias = Leaf(Momentum(0.1, 0.9), [0.0]), σ = ())\n\njulia> x1, y1 = [0.2, -0.3], [0.4]; # use the same data for two steps:\n\njulia> Flux.train!(model, [(x1, y1), (x1, y1)], opt_state) do m, x, y\n sum(abs.(m(x) .- y)) * 100\n end\n\njulia> model.bias # was zero, mutated by Flux.train!\n1-element Vector{Float64}:\n 10.19\n\njulia> opt_state # mutated by Flux.train!\n(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())\n\n\n\n\n\nopt_state = setup(rule, model::Duplicated) = setup(rule, model.val)\n\nSpecial method for use with Enzyme.jl, ignores the stored gradient.\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Flux.Train.train!-NTuple{4, Any}","page":"Training API","title":"Flux.Train.train!","text":"train!(loss, model, data, opt_state)\n\nUses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. Iterates through data once, evaluating for each d in data either loss(model, d...) 
if d isa Tuple, or else loss(model, d) for other d.\n\nIf model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.\n\nFor example, with these definitions...\n\ndata = [(x1, y1), (x2, y2), (x3, y3)]\n\nloss3(m, x, y) = norm(m(x) .- y) # the model is the first argument\n\nopt_state = Flux.setup(Adam(), model) # explicit setup of optimiser momenta\n\n...calling Flux.train!(loss3, model, data, opt_state) runs a loop much like this:\n\nfor d in data\n ∂L∂m = gradient(loss3, model, d...)[1]\n update!(opt_state, model, ∂L∂m)\nend\n\nYou can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:\n\nStop with a DomainError if the loss is infinite or NaN at any point.\nShow a progress bar using @withprogress.\n\ncompat: New\nThis method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's \"implicit\" parameter handling, with Grads.)\nInstead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.\nopt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.\nCallback functions are not supported. (But any code can be included in the above for loop.)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/reference/#Optimisers.update","page":"Training API","title":"Optimisers.update","text":"Optimisers.update(tree, model, gradient) -> (tree, model)\n\nUses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. 
The initial tree of states comes from setup.\n\nSee also update!, which will be faster for models of ordinary Arrays or CuArrays.\n\nExample\n\njulia> m = (x = Float32[1,2,3], y = tanh);\n\njulia> t = Optimisers.setup(Descent(0.1), m)\n(x = Leaf(Descent(0.1), nothing), y = ())\n\njulia> g = (x = [1,1,1], y = nothing); # fake gradient\n\njulia> Optimisers.update(t, m, g)\n((x = Leaf(Descent(0.1), nothing), y = ()), (x = Float32[0.9, 1.9, 2.9], y = tanh))\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.update!","page":"Training API","title":"Optimisers.update!","text":"Optimisers.update!(tree, model, gradient) -> (tree, model)\n\nUses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.\n\nThis is used in exactly the same manner as update, but because it may mutate arrays within the old model (and the old state), it will be faster for models of ordinary Arrays or CuArrays. However, you should not rely on the old model being fully updated but rather use the returned model. (The original state tree is always mutated, as each Leaf is mutable.)\n\nExample\n\njulia> using StaticArrays, Zygote, Optimisers\n\njulia> m = (x = [1f0, 2f0], y = SA[4f0, 5f0]); # partly mutable model\n\njulia> t = Optimisers.setup(Momentum(1/30, 0.9), m) # tree of states\n(x = Leaf(Momentum(0.0333333, 0.9), Float32[0.0, 0.0]), y = Leaf(Momentum(0.0333333, 0.9), Float32[0.0, 0.0]))\n\njulia> g = gradient(m -> sum(abs2.(m.x .+ m.y)), m)[1] # structural gradient\n(x = Float32[10.0, 14.0], y = Float32[10.0, 14.0])\n\njulia> t2, m2 = Optimisers.update!(t, m, g);\n\njulia> m2 # after update or update!, this is the new model\n(x = Float32[0.6666666, 1.5333333], y = Float32[3.6666667, 4.5333333])\n\njulia> m2.x === m.x # update! 
has re-used this array, for efficiency\ntrue\n\njulia> m # original should be discarded, may be mutated but no guarantee\n(x = Float32[0.6666666, 1.5333333], y = Float32[4.0, 5.0])\n\njulia> t == t2 # original state tree is guaranteed to be mutated\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.setup","page":"Training API","title":"Optimisers.setup","text":"Optimisers.setup(rule, model) -> state_tree\n\nInitialises the given optimiser for every trainable parameter within the model. Returns a tree of the relevant states, which must be passed to update or update!.\n\nExample\n\njulia> m = (x = rand(3), y = (true, false), z = tanh);\n\njulia> Optimisers.setup(Momentum(), m) # same field names as m\n(x = Leaf(Momentum(0.01, 0.9), [0.0, 0.0, 0.0]), y = ((), ()), z = ())\n\nThe recursion into structures uses Functors.jl, and any new structs containing parameters need to be marked with Functors.@functor before use. See the Flux docs for more about this.\n\njulia> struct Layer; mat; fun; end\n\njulia> model = (lay = Layer([1 2; 3 4f0], sin), vec = [5, 6f0]);\n\njulia> Optimisers.setup(Momentum(), model) # new struct is by default ignored\n(lay = (), vec = Leaf(Momentum(0.01, 0.9), Float32[0.0, 0.0]))\n\njulia> destructure(model)\n(Float32[5.0, 6.0], Restructure(NamedTuple, ..., 2))\n\njulia> using Functors; @functor Layer # annotate this type as containing parameters\n\njulia> Optimisers.setup(Momentum(), model)\n(lay = (mat = Leaf(Momentum(0.01, 0.9), Float32[0.0 0.0; 0.0 0.0]), fun = ()), vec = Leaf(Momentum(0.01, 0.9), Float32[0.0, 0.0]))\n\njulia> destructure(model)\n(Float32[1.0, 3.0, 2.0, 4.0, 5.0, 6.0], Restructure(NamedTuple, ..., 6))\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"train! uses @progress which should show a progress bar in VSCode automatically. 
To see one in a terminal, you will need to install TerminalLoggers.jl and follow its setup instructions.","category":"page"},{"location":"reference/training/reference/#Optimisation-Modifiers","page":"Training API","title":"Optimisation Modifiers","text":"","category":"section"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The state returned by setup can be modified to temporarily prevent training of some parts of the model, or to change the learning rate or other hyperparameter. The functions for doing so may be accessed as Flux.freeze!, Flux.thaw!, and Flux.adjust!. All mutate the state (or part of it) and return nothing.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"Optimisers.adjust!\nOptimisers.freeze!\nOptimisers.thaw!","category":"page"},{"location":"reference/training/reference/#Optimisers.adjust!","page":"Training API","title":"Optimisers.adjust!","text":"Optimisers.adjust!(tree, η)\n\nAlters the state tree = setup(rule, model) to change the parameters of the optimisation rule, without destroying its stored state. Typically used mid-way through training.\n\nCan be applied to part of a model, by acting only on the corresponding part of the state tree.\n\nTo change just the learning rate, provide a number η::Real.\n\nExample\n\njulia> m = (vec = rand(Float32, 2), fun = sin);\n\njulia> st = Optimisers.setup(Nesterov(), m) # stored momentum is initialised to zero\n(vec = Leaf(Nesterov(0.001, 0.9), Float32[0.0, 0.0]), fun = ())\n\njulia> st, m = Optimisers.update(st, m, (vec = [16, 88], fun = nothing)); # with fake gradient\n\njulia> st\n(vec = Leaf(Nesterov(0.001, 0.9), Float32[-0.016, -0.088]), fun = ())\n\njulia> Optimisers.adjust!(st, 0.123) # change learning rate, stored momentum untouched\n\njulia> st\n(vec = Leaf(Nesterov(0.123, 0.9), Float32[-0.016, -0.088]), fun = ())\n\nTo change other parameters, adjust! 
also accepts keyword arguments matching the field names of the optimisation rule's type.\n\njulia> fieldnames(Adam)\n(:eta, :beta, :epsilon)\n\njulia> st2 = Optimisers.setup(OptimiserChain(ClipGrad(), Adam()), m)\n(vec = Leaf(OptimiserChain(ClipGrad(10.0), Adam(0.001, (0.9, 0.999), 1.0e-8)), (nothing, (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))), fun = ())\n\njulia> Optimisers.adjust(st2; beta = (0.777, 0.909), delta = 11.1) # delta acts on ClipGrad\n(vec = Leaf(OptimiserChain(ClipGrad(11.1), Adam(0.001, (0.777, 0.909), 1.0e-8)), (nothing, (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))), fun = ())\n\njulia> Optimisers.adjust(st; beta = \"no such field\") # silently ignored!\n(vec = Leaf(Nesterov(0.123, 0.9), Float32[-0.016, -0.088]), fun = ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.freeze!","page":"Training API","title":"Optimisers.freeze!","text":"Optimisers.freeze!(tree)\n\nTemporarily alters the state tree = setup(rule, model) so that parameters will not be updated. Un-done by thaw!.\n\nCan be applied to the state corresponding to only part of a model, for instance with model::Chain, to freeze model.layers[1] you should call freeze!(tree.layers[1]).\n\nExample\n\njulia> m = (x = ([1.0], 2.0), y = [3.0]);\n\njulia> s = Optimisers.setup(Momentum(), m);\n\njulia> Optimisers.freeze!(s.x)\n\njulia> Optimisers.update!(s, m, (x = ([pi], 10pi), y = [100pi])); # with fake gradient\n\njulia> m\n(x = ([1.0], 2.0), y = [-0.14159265358979312])\n\njulia> s\n(x = (Leaf(Momentum(0.01, 0.9), [0.0], frozen = true), ()), y = Leaf(Momentum(0.01, 0.9), [3.14159]))\n\njulia> Optimisers.thaw!(s)\n\njulia> s.x\n(Leaf(Momentum(0.01, 0.9), [0.0]), ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.thaw!","page":"Training API","title":"Optimisers.thaw!","text":"Optimisers.thaw!(tree)\n\nThe reverse of freeze!. 
Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).\n\n\n\n\n\n","category":"function"},{"location":"tutorials/logistic_regression/#Logistic-Regression","page":"Logistic Regression","title":"Logistic Regression","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The following page contains a step-by-step walkthrough of the logistic regression algorithm in Julia using Flux. We will then create a simple logistic regression model without any usage of Flux and compare the different working parts with Flux's implementation.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by importing the required Julia packages.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> using Flux, Statistics, MLDatasets, DataFrames, OneHotArrays","category":"page"},{"location":"tutorials/logistic_regression/#Dataset","page":"Logistic Regression","title":"Dataset","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by importing a dataset from MLDatasets.jl. We will use the Iris dataset that contains the data of three different Iris species. The data consists of 150 data points (xs), each having four features. Each of these x is mapped to a label (or target) y, the name of a particular Iris species. 
The following code will download the Iris dataset when run for the first time.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> Iris()\ndataset Iris:\n metadata => Dict{String, Any} with 4 entries\n features => 150×4 DataFrame\n targets => 150×1 DataFrame\n dataframe => 150×5 DataFrame\n\njulia> x, y = Iris(as_df=false)[:];","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's have a look at our dataset -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> y\n1×150 Matrix{InlineStrings.String15}:\n \"Iris-setosa\" \"Iris-setosa\" … \"Iris-virginica\" \"Iris-virginica\"\n\njulia> x |> summary\n\"4×150 Matrix{Float64}\"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Each y value here corresponds to a type of iris plant, for a total of 150 data points. The x values depict the sepal length, sepal width, petal length, and petal width (all in cm) of 150 iris plants (hence the matrix size 4×150). Different types of iris plants have different lengths and widths of sepals and petals associated with them, and there is a definite pattern for this in nature. We can leverage this to train a simple classifier that outputs the type of iris plant using the length and width of sepals and petals as inputs.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our next step would be to convert this data into a form that can be fed to a machine learning model. The x values are arranged in a matrix and should ideally be converted to Float32 type (see Performance tips), but the labels must be one hot encoded. 
Here is a great Discourse thread on different techniques that can be used to one hot encode data with or without using any external Julia package.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> x = Float32.(x);\n\njulia> y = vec(y);\n\njulia> custom_y_onehot = unique(y) .== permutedims(y)\n3×150 BitMatrix:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"This same operation can also be performed using OneHotArrays' onehotbatch function. We will use both of these outputs in parallel to show how intuitive FluxML is!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> const classes = [\"Iris-setosa\", \"Iris-versicolor\", \"Iris-virginica\"];\n\njulia> flux_y_onehot = onehotbatch(y, classes)\n3×150 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 1 1 1 1 1 1 1 1 1 1","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our data is ready. 
The next step would be to build a classifier for the same.","category":"page"},{"location":"tutorials/logistic_regression/#Building-a-model","page":"Logistic Regression","title":"Building a model","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A logistic regression model is defined mathematically as -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"model(x) = σ(Wx + b)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"where W is the weight matrix, b is the bias vector, and σ is any activation function. For our case, let's use the softmax activation function as we will be performing a multiclass classification task.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> m(W, b, x) = W*x .+ b\nm (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Note that this model lacks an activation function, but we will come back to that.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now move ahead to initialize the parameters of our model. 
Given that our model has four inputs (4 features in every data point), and three outputs (3 different classes), the parameters can be initialized in the following way -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> W = rand(Float32, 3, 4);\n\njulia> b = [0.0f0, 0.0f0, 0.0f0];","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now our model can take in the complete dataset and predict the class of each x in one go. But, we need to ensure that our model outputs the probabilities of an input belonging to the respective classes. As our model has three outputs, each would denote the probability of the input belonging to a particular class.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We will use an activation function to map our outputs to a probability value. It would make sense to use a softmax activation function here, which is defined mathematically as -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"σ(\\vec{x}) = \\frac{e^{z_i}}{\\sum_{j=1}^{k} e^{z_j}}","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The softmax function scales down the outputs to probability values such that the sum of all the final outputs equals 1. Let's implement this in Julia.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_softmax(x) = exp.(x) ./ sum(exp.(x), dims=1)\ncustom_softmax (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The implementation looks straightforward enough! 
Note that we specify dims=1 in the sum function to calculate the sum of probabilities in each column. Remember, we will have a 3×150 matrix (predicted ys) as the output of our model, where each column would be an output of a corresponding input.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's combine this softmax function with our model to construct the complete custom_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_model(W, b, x) = m(W, b, x) |> custom_softmax\ncustom_model (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's check if our model works.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_model(W, b, x) |> size\n(3, 150)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"It works! Let's check if the softmax function is working.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> all(0 .<= custom_model(W, b, x) .<= 1)\ntrue\n\njulia> sum(custom_model(W, b, x), dims=1)\n1×150 Matrix{Float32}:\n 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … 1.0 1.0 1.0 1.0 1.0 1.0 1.0","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Every output value is between 0 and 1, and every column adds to 1!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's convert our custom_model to a Flux model. 
Flux provides users with an elegant API that almost feels like writing your own code!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Note that all the flux_* variables in this tutorial are general, that is, they can be used as is with another similar-looking dataset, while the custom_* variables remain specific to this tutorial.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_model = Chain(Dense(4 => 3), softmax)\nChain(\n Dense(4 => 3), # 15 parameters\n softmax,\n)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A Dense(4 => 3) layer denotes a layer with four inputs (four features in every data point) and three outputs (three classes or labels). This layer is the same as the mathematical model we defined above. Under the hood, Flux calculates the output using the same expression, but we don't have to initialize the parameters ourselves this time; Flux does it for us.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The softmax function used here is provided by NNlib.jl and re-exported by Flux. 
Lastly, Flux provides users with a Chain struct which makes stacking layers seamless.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A model's weights and biases can be accessed as follows -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_model[1].weight, flux_model[1].bias\n(Float32[0.78588694 -0.45968163 -0.77409476 0.2358028; -0.9049773 -0.58643705 0.466441 -0.79523873; 0.82426906 0.4143493 0.7630932 0.020588955], Float32[0.0, 0.0, 0.0])","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now pass the complete data in one go, with each data point having four features (four inputs)!","category":"page"},{"location":"tutorials/logistic_regression/#Loss-and-accuracy","page":"Logistic Regression","title":"Loss and accuracy","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our next step should be to define some quantitative values for our model, which we will maximize or minimize during the complete training procedure. 
These values will be the loss function and the accuracy metric.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by defining a loss function, a logitcrossentropy function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_logitcrossentropy(ŷ, y) = mean(.-sum(y .* logsoftmax(ŷ; dims = 1); dims = 1));","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now we can wrap the custom_logitcrossentropy inside a function that takes in the model parameters, xs, and ys, and returns the loss value.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function custom_loss(weights, biases, features, labels_onehot)\n ŷ = custom_model(weights, biases, features)\n custom_logitcrossentropy(ŷ, labels_onehot)\n end;\n\njulia> custom_loss(W, b, x, custom_y_onehot)\n1.1714406827505623","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The loss function works!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Flux provides us with many minimal yet elegant loss functions. In fact, the custom_logitcrossentropy defined above has been taken directly from Flux. 
The functions in Flux include sanity checks, ensure efficient performance, and behave well with the overall FluxML ecosystem.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function flux_loss(flux_model, features, labels_onehot)\n ŷ = flux_model(features)\n Flux.logitcrossentropy(ŷ, labels_onehot)\n end;\n\njulia> flux_loss(flux_model, x, flux_y_onehot)\n1.2156688659673647","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Next, let's define an accuracy function, which we will try to maximize during our training procedure. Before jumping to accuracy, let's define a onecold function. The onecold function converts our output, which, remember, consists of probability values, to the actual class names.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can divide this task into two parts -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Identify the index of the maximum element of each column in the output matrix\nConvert this index to a class name","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The maximum index should be calculated along the columns (remember, each column is the output of a single x data point). 
We can use Julia's argmax function to achieve this.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> argmax(custom_y_onehot, dims=1) # calculate the cartesian index of max element column-wise\n1×150 Matrix{CartesianIndex{2}}:\n CartesianIndex(1, 1) CartesianIndex(1, 2) … CartesianIndex(3, 150)\n\njulia> max_idx = [x[1] for x in argmax(custom_y_onehot; dims=1)]\n1×150 Matrix{Int64}:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … 3 3 3 3 3 3 3 3 3 3 3 3","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now we can write a function that calculates the indices of the maximum element in each column, and maps them to a class name.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function custom_onecold(labels_onehot)\n max_idx = [x[1] for x in argmax(labels_onehot; dims=1)]\n return vec(classes[max_idx])\n end;\n\njulia> custom_onecold(custom_y_onehot)\n150-element Vector{String}:\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n ⋮\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"It works!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Flux provides users with the onecold function so that we don't have to write it on our own. 
Let's see how our custom_onecold function compares to Flux.onecold.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> istrue = Flux.onecold(flux_y_onehot, classes) .== custom_onecold(custom_y_onehot);\n\njulia> all(istrue)\ntrue","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Both the functions act identically!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We now move to the accuracy metric and run it with the untrained custom_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_accuracy(W, b, x, y) = mean(custom_onecold(custom_model(W, b, x)) .== y);\n\njulia> custom_accuracy(W, b, x, y)\n0.3333333333333333","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We could also have used Flux's built-in functionality to define this accuracy function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_accuracy(x, y) = mean(Flux.onecold(flux_model(x), classes) .== y);\n\njulia> flux_accuracy(x, y)\n0.24","category":"page"},{"location":"tutorials/logistic_regression/#Training-the-model","page":"Logistic Regression","title":"Training the model","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's train our model using the classic Gradient Descent algorithm. 
According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"\\begin{aligned}\nW &= W - \\eta * \\frac{dL}{dW} \\\\\nb &= b - \\eta * \\frac{dL}{db}\n\\end{aligned}","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Here, W is the weight matrix, b is the bias vector, \\eta is the learning rate, \\frac{dL}{dW} is the derivative of the loss function with respect to the weight, and \\frac{dL}{db} is the derivative of the loss function with respect to the bias.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The derivatives are calculated using an Automatic Differentiation tool, and Flux uses Zygote.jl for this. Since Zygote.jl is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of Zygote.jl for more information.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. Flux re-exports Zygote's gradient function; hence, we don't need to import Zygote explicitly to use the functionality. gradient takes in a function and its arguments, and returns a tuple containing ∂f/∂x for each argument x. Let's pass in custom_loss and the arguments required by custom_loss to gradient. 
We will require the derivatives of the loss function (custom_loss) with respect to the weights (∂f/∂w) and the bias (∂f/∂b) to carry out gradient descent, but we can ignore the partial derivatives of the loss function (custom_loss) with respect to x (∂f/∂x) and one hot encoded y (∂f/∂y).","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, custom_y_onehot);","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now update the parameters, following the gradient descent algorithm -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> W .= W .- 0.1 .* dLdW;\n\njulia> b .= b .- 0.1 .* dLdb;","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The parameters have been updated! We can now check the value of our custom loss function -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_loss(W, b, x, custom_y_onehot)\n1.164742997664842","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The loss went down! 
Let's plug our super training logic inside a function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function train_custom_model!(f_loss, weights, biases, features, labels_onehot)\n dLdW, dLdb, _, _ = gradient(f_loss, weights, biases, features, labels_onehot)\n weights .= weights .- 0.1 .* dLdW\n biases .= biases .- 0.1 .* dLdb\n end;","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can plug the training function inside a loop and train the model for more epochs. The loop can be tailored to suit the user's needs, and the conditions can be specified in plain Julia. Here we will train the model for a maximum of 500 epochs, but to ensure that the model does not overfit, we will break as soon as our accuracy value crosses or becomes equal to 0.98.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> for i = 1:500\n train_custom_model!(custom_loss, W, b, x, custom_y_onehot);\n custom_accuracy(W, b, x, y) >= 0.98 && break\n end\n\njulia> @show custom_accuracy(W, b, x, y);\ncustom_accuracy(W, b, x, y) = 0.98","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Everything works! Our model achieved an accuracy of 0.98! Let's have a look at the loss.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_loss(W, b, x, custom_y_onehot)\n0.6520349798243569","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"As expected, the loss went down too! 
Now, let's repeat the same steps with our flux_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can write a similar-looking training loop for our flux_model and train it similarly.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_loss(flux_model, x, flux_y_onehot)\n1.215731131385928\n\njulia> function train_flux_model!(f_loss, model, features, labels_onehot)\n dLdm, _, _ = gradient(f_loss, model, features, labels_onehot)\n @. model[1].weight = model[1].weight - 0.1 * dLdm[:layers][1][:weight]\n @. model[1].bias = model[1].bias - 0.1 * dLdm[:layers][1][:bias]\n end;\n\njulia> for i = 1:500\n train_flux_model!(flux_loss, flux_model, x, flux_y_onehot);\n flux_accuracy(x, y) >= 0.98 && break\n end","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Looking at the accuracy and loss value -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> @show flux_accuracy(x, y);\nflux_accuracy(x, y) = 0.98\n\njulia> flux_loss(flux_model, x, flux_y_onehot)\n0.6952386604624324","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We see a very similar final loss and accuracy.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset, and one hot encoded the labels. 
Next, we defined our model, the loss function, and the accuracy, all by ourselves.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Along the way, we implemented most of the functions on our own and compared them side by side with the functionality provided by Flux!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"info: Info\nOriginally published on 1st April 2023, by Saransh Chopra.","category":"page"},{"location":"tutorials/model_zoo/#Model-Zoo","page":"Model Zoo","title":"Model Zoo","text":"","category":"section"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"The model zoo is a collection of examples that demonstrate how to build and train models using Flux. The examples are organised by domain and include vision, text, and audio. 
Each example includes a description of the model, the data used, and the training process.","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Some of the examples are pedagogical, see for instance","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Multilayer Perceptron\nSimple Convolutional Neural Network","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Others are more advanced, see for instance","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Variational Autoencoder","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/data/mlutils/#Working-with-Data,-using-MLUtils.jl","page":"Batching Data – MLUtils.jl","title":"Working with Data, using MLUtils.jl","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"Flux re-exports the DataLoader type and utility functions for working with data from MLUtils.","category":"page"},{"location":"reference/data/mlutils/#DataLoader","page":"Batching Data – MLUtils.jl","title":"DataLoader","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"The DataLoader can be used to create mini-batches of data, in the format train! 
expects.","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"MLUtils.DataLoader","category":"page"},{"location":"reference/data/mlutils/#MLUtils.DataLoader","page":"Batching Data – MLUtils.jl","title":"MLUtils.DataLoader","text":"DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])\n\nAn object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).\n\nTakes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.\n\nThe last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.\n\nThe original data is preserved in the data field of the DataLoader.\n\nArguments\n\ndata: The data to be iterated over. The data type has to be supported by numobs and getobs.\nbatchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.\nbuffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.\ncollate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.\nparallel: Whether to load data in parallel using worker threads. Can greatly speed up data loading, by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. 
Default false.\npartial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.\nrng: A random number generator. Default Random.GLOBAL_RNG.\nshuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.\n\nExamples\n\njulia> Xtrain = rand(10, 100);\n\njulia> array_loader = DataLoader(Xtrain, batchsize=2);\n\njulia> for x in array_loader\n @assert size(x) == (10, 2)\n # do something with x, 50 times\n end\n\njulia> array_loader.data === Xtrain\ntrue\n\njulia> tuple_loader = DataLoader((Xtrain,), batchsize=2); # similar, but yielding 1-element tuples\n\njulia> for x in tuple_loader\n @assert x isa Tuple{Matrix}\n @assert size(x[1]) == (10, 2)\n end\n\njulia> Ytrain = rand('a':'z', 100); # now make a DataLoader yielding 2-element named tuples\n\njulia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);\n\njulia> for epoch in 1:100\n for (x, y) in train_loader # access via tuple destructuring\n @assert size(x) == (10, 5)\n @assert size(y) == (5,)\n # loss += f(x, y) # etc, runs 100 * 20 times\n end\n end\n\njulia> first(train_loader).label isa Vector{Char} # access via property name\ntrue\n\njulia> first(train_loader).label == Ytrain[1:5] # because of shuffle=true\nfalse\n\njulia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last\n10×30 Matrix{Int8}\n10×30 Matrix{Int8}\n10×4 Matrix{Int8}\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#Utility-Functions","page":"Batching Data – MLUtils.jl","title":"Utility Functions","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – 
MLUtils.jl","text":"The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"MLUtils.batch\nMLUtils.batchsize\nMLUtils.batchseq\nMLUtils.BatchView\nMLUtils.chunk\nMLUtils.eachobs\nMLUtils.fill_like\nMLUtils.filterobs\nFlux.flatten\nMLUtils.flatten\nMLUtils.getobs\nMLUtils.getobs!\nMLUtils.joinobs\nMLUtils.group_counts\nMLUtils.group_indices\nMLUtils.groupobs\nMLUtils.kfolds\nMLUtils.leavepout\nMLUtils.mapobs\nMLUtils.numobs\nMLUtils.normalise\nMLUtils.obsview\nMLUtils.ObsView\nMLUtils.ones_like\nMLUtils.oversample\nMLUtils.randobs\nMLUtils.rand_like\nMLUtils.randn_like\nMLUtils.rpad_constant\nMLUtils.shuffleobs\nMLUtils.splitobs\nMLUtils.unbatch\nMLUtils.undersample\nMLUtils.unsqueeze\nMLUtils.unstack\nMLUtils.zeros_like","category":"page"},{"location":"reference/data/mlutils/#MLUtils.batch","page":"Batching Data – MLUtils.jl","title":"MLUtils.batch","text":"batch(xs)\n\nBatch the arrays in xs into a single array with an extra dimension.\n\nIf the elements of xs are tuples, named tuples, or dicts, the output will be of the same type. 
\n\nSee also unbatch.\n\nExamples\n\njulia> batch([[1,2,3], \n [4,5,6]])\n3×2 Matrix{Int64}:\n 1 4\n 2 5\n 3 6\n\njulia> batch([(a=[1,2], b=[3,4])\n (a=[5,6], b=[7,8])]) \n(a = [1 5; 2 6], b = [3 7; 4 8])\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.batchsize","page":"Batching Data – MLUtils.jl","title":"MLUtils.batchsize","text":"batchsize(data::BatchView) -> Int\n\nReturn the fixed size of each batch in data.\n\nExamples\n\nusing MLUtils\nX, Y = MLUtils.load_iris()\n\nA = BatchView(X, batchsize=30)\n@assert batchsize(A) == 30\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.batchseq","page":"Batching Data – MLUtils.jl","title":"MLUtils.batchseq","text":"batchseq(seqs, val = 0)\n\nTake a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by val.\n\nExamples\n\njulia> batchseq([[1, 2, 3], [4, 5]], 0)\n3-element Vector{Vector{Int64}}:\n [1, 4]\n [2, 5]\n [3, 0]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.BatchView","page":"Batching Data – MLUtils.jl","title":"MLUtils.BatchView","text":"BatchView(data, batchsize; partial=true, collate=nothing)\nBatchView(data; batchsize=1, partial=true, collate=nothing)\n\nCreate a view of the given data that represents it as a vector of batches. Each batch will contain an equal number of observations. The batch-size can be specified using the parameter batchsize. In the case that the size of the dataset is not divisible by the specified batchsize, the remaining observations will be ignored if partial=false. 
If partial=true instead the last batch-size can be slightly smaller.\n\nNote that any data access is delayed until getindex is called.\n\nIf used as an iterator, the object will iterate over the dataset once, effectively denoting an epoch.\n\nFor BatchView to work on some data structure, the type of the given variable data must implement the data container interface. See ObsView for more info.\n\nArguments\n\ndata : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).\nbatchsize : The batch-size of each batch. It is the number of observations that each batch must contain (except possibly for the last one).\npartial : If partial=false and the number of observations is not divisible by the batch-size, then the last mini-batch is dropped.\ncollate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. 
See batch for more information and examples.\n\nExamples\n\nusing MLUtils\nX, Y = MLUtils.load_iris()\n\nA = BatchView(X, batchsize=30)\n@assert typeof(A) <: BatchView <: AbstractVector\n@assert eltype(A) <: SubArray{Float64,2}\n@assert length(A) == 5 # Iris has 150 observations\n@assert size(A[1]) == (4,30) # Iris has 4 features\n\n# 5 batches of size 30 observations\nfor x in BatchView(X, batchsize=30)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert numobs(x) === 30\nend\n\n# 7 batches of size 20 observations\n# Note that the iris dataset has 150 observations,\n# which means that with a batchsize of 20, the last\n# 10 observations will be ignored\nfor (x, y) in BatchView((X, Y), batchsize=20, partial=false)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert typeof(y) <: SubArray{String,1}\n @assert numobs(x) == numobs(y) == 20\nend\n\n# collate tuple observations\nfor (x, y) in BatchView((rand(10, 3), [\"a\", \"b\", \"c\"]), batchsize=2, collate=true, partial=false)\n @assert size(x) == (10, 2)\n @assert size(y) == (2,)\nend\n\n\n# randomly assign observations to one and only one batch.\nfor (x, y) in BatchView(shuffleobs((X, Y)), batchsize=20)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert typeof(y) <: SubArray{String,1}\nend\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#MLUtils.chunk","page":"Batching Data – MLUtils.jl","title":"MLUtils.chunk","text":"chunk(x, n; [dims])\nchunk(x; [size, dims])\n\nSplit x into n parts or alternatively, if size is an integer, into equal chunks of size size. 
The parts contain the same number of elements except possibly for the last one that can be smaller.\n\nIn case size is a collection of integers instead, the elements of x are split into chunks of the given sizes.\n\nIf x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).\n\nExamples\n\njulia> chunk(1:10, 3)\n3-element Vector{UnitRange{Int64}}:\n 1:4\n 5:8\n 9:10\n\njulia> chunk(1:10; size = 2)\n5-element Vector{UnitRange{Int64}}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\njulia> x = reshape(collect(1:20), (5, 4))\n5×4 Matrix{Int64}:\n 1 6 11 16\n 2 7 12 17\n 3 8 13 18\n 4 9 14 19\n 5 10 15 20\n\njulia> xs = chunk(x, 2, dims=1)\n2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:\n [1 6 11 16; 2 7 12 17; 3 8 13 18]\n [4 9 14 19; 5 10 15 20]\n\njulia> xs[1]\n3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:\n 1 6 11 16\n 2 7 12 17\n 3 8 13 18\n\njulia> xes = chunk(x; size = 2, dims = 2)\n2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:\n [1 6; 2 7; … ; 4 9; 5 10]\n [11 16; 12 17; … ; 14 19; 15 20]\n\njulia> xes[2]\n5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:\n 11 16\n 12 17\n 13 18\n 14 19\n 15 20\n\njulia> chunk(1:6; size = [2, 4])\n2-element Vector{UnitRange{Int64}}:\n 1:2\n 3:6\n\n\n\n\n\nchunk(x, partition_idxs; [npartitions, dims])\n\nPartition the array x along the dimension dims according to the indexes in partition_idxs.\n\npartition_idxs must be sorted and contain only positive integers between 1 and the number of partitions. 
\n\nIf the number of partitions npartitions is not provided, it is inferred from partition_idxs.\n\nIf dims is not provided, it defaults to the last dimension.\n\nSee also unbatch.\n\nExamples\n\njulia> x = reshape([1:10;], 2, 5)\n2×5 Matrix{Int64}:\n 1 3 5 7 9\n 2 4 6 8 10\n\njulia> chunk(x, [1, 2, 2, 3, 3])\n3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:\n [1; 2;;]\n [3 5; 4 6]\n [7 9; 8 10]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.eachobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.eachobs","text":"eachobs(data; kws...)\n\nReturn an iterator over data.\n\nSupports the same arguments as DataLoader. The batchsize default is -1 here while it is 1 for DataLoader.\n\nExamples\n\nX = rand(4,100)\n\nfor x in eachobs(X)\n # loop entered 100 times\n @assert typeof(x) <: Vector{Float64}\n @assert size(x) == (4,)\nend\n\n# mini-batch iterations\nfor x in eachobs(X, batchsize=10)\n # loop entered 10 times\n @assert typeof(x) <: Matrix{Float64}\n @assert size(x) == (4,10)\nend\n\n# support for tuples, named tuples, dicts\nfor (x, y) in eachobs((X, Y))\n # ...\nend\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.fill_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.fill_like","text":"fill_like(x, val, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to val. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also zeros_like and ones_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.16087806\n 0.89916044\n\njulia> fill_like(x, 1.7, (3, 3))\n3×3 Matrix{Float32}:\n 1.7 1.7 1.7\n 1.7 1.7 1.7\n 1.7 1.7 1.7\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.803167 0.476101\n 0.303041 0.317581\n\njulia> fill_like(x, 1.7, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 1.7 1.7\n 1.7 1.7\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.filterobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.filterobs","text":"filterobs(f, data)\n\nReturn a subset of data container data including all indices i for which f(getobs(data, i)) === true.\n\ndata = 1:10\nnumobs(data) == 10\nfdata = filterobs(>(5), data)\nnumobs(fdata) == 5\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#Flux.flatten","page":"Batching Data – MLUtils.jl","title":"Flux.flatten","text":"flatten(x)\n\nSame as MLUtils.flatten, which should be preferred; this method exists only for backward compatibility.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.flatten","page":"Batching Data – MLUtils.jl","title":"MLUtils.flatten","text":"flatten(x::AbstractArray)\n\nReshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.\n\nSee also unsqueeze.\n\nExamples\n\njulia> rand(3,4,5) |> flatten |> size\n(12, 5)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.getobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.getobs","text":"getobs(data, [idx])\n\nReturn the observations corresponding to the observation index idx. Note that idx can be any type as long as data has defined getobs for that type. 
If idx is not provided, then materialize all observations in data.\n\nIf data does not have getobs defined, then in the case of Tables.table(data) == true returns the row(s) in position idx, otherwise returns data[idx].\n\nAuthors of custom data containers should implement Base.getindex for their type instead of getobs. getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).\n\nThe returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on how this \"actual data\" must look like. Every author behind some custom data container can make this decision themselves. The output should be consistent when idx is a scalar vs vector.\n\ngetobs supports by default nested combinations of array, tuple, named tuples, and dictionaries. \n\nSee also getobs! and numobs.\n\nExamples\n\n# named tuples \nx = (a = [1, 2, 3], b = rand(6, 3))\n\ngetobs(x, 2) == (a = 2, b = x.b[:, 2])\ngetobs(x, [1, 3]) == (a = [1, 3], b = x.b[:, [1, 3]])\n\n\n# dictionaries\nx = Dict(:a => [1, 2, 3], :b => rand(6, 3))\n\ngetobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])\ngetobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.getobs!","page":"Batching Data – MLUtils.jl","title":"MLUtils.getobs!","text":"getobs!(buffer, data, idx)\n\nInplace version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result, instead of allocating a dedicated object.\n\nImplementing this function is optional. In the case no such method is provided for the type of data, then buffer will be ignored and the result of getobs returned. This could be because the type of data may not lend itself to the concept of copy!. Thus, supporting a custom getobs! is optional and not required.\n\nSee also getobs and numobs. 
\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.joinobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.joinobs","text":"joinobs(datas...)\n\nConcatenate data containers datas.\n\ndata1, data2 = 1:10, 11:20\njdata = joinobs(data1, data2)\ngetobs(jdata, 15) == 15\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.group_counts","page":"Batching Data – MLUtils.jl","title":"MLUtils.group_counts","text":"group_counts(x)\n\nCount the number of times that each element of x appears.\n\nSee also group_indices.\n\nExamples\n\njulia> group_counts(['a', 'b', 'b'])\nDict{Char, Int64} with 2 entries:\n 'a' => 1\n 'b' => 2\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.group_indices","page":"Batching Data – MLUtils.jl","title":"MLUtils.group_indices","text":"group_indices(x) -> Dict\n\nComputes the indices of elements in the vector x for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.\n\nSee also group_counts.\n\nExamples\n\njulia> x = [:yes, :no, :maybe, :yes];\n\njulia> group_indices(x)\nDict{Symbol, Vector{Int64}} with 3 entries:\n :yes => [1, 4]\n :maybe => [3]\n :no => [2]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.groupobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.groupobs","text":"groupobs(f, data)\n\nSplit data container data into different data containers, grouping observations by f(obs).\n\ndata = -10:10\ndatas = groupobs(>(0), data)\nlength(datas) == 2\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.kfolds","page":"Batching Data – MLUtils.jl","title":"MLUtils.kfolds","text":"kfolds(n::Integer, k = 5) -> Tuple\n\nCompute the train/validation assignments for k repartitions of n observations, and return them in the form of two vectors. 
The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either k = 5 or k = 10. The following code snippet generates the index assignments for k = 5:\n\njulia> train_idx, val_idx = kfolds(10, 5);\n\nEach observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.\n\njulia> train_idx\n5-element Array{Array{Int64,1},1}:\n [3,4,5,6,7,8,9,10]\n [1,2,5,6,7,8,9,10]\n [1,2,3,4,7,8,9,10]\n [1,2,3,4,5,6,9,10]\n [1,2,3,4,5,6,7,8]\n\njulia> val_idx\n5-element Array{UnitRange{Int64},1}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\n\n\n\n\nkfolds(data, [k = 5])\n\nRepartition a data container k times using a k folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs is invoked.\n\nConceptually, a k-folds repartitioning strategy divides the given data into k roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k different partitions of data.\n\nIn the case that the size of the dataset is not divisible by the specified k, the remaining observations will be evenly distributed among the parts.\n\nfor (x_train, x_val) in kfolds(X, k=10)\n # code called 10 times\n # numobs(x_val) may differ up to ±1 over iterations\nend\n\nMultiple variables are supported (e.g. for labeled data)\n\nfor ((x_train, y_train), val) in kfolds((X, Y), k=10)\n # ...\nend\n\nBy default the folds are created using static splits. 
Use shuffleobs to randomly assign observations to the folds.\n\nfor (x_train, x_val) in kfolds(shuffleobs(X), k = 10)\n # ...\nend\n\nSee leavepout for a related function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.leavepout","page":"Batching Data – MLUtils.jl","title":"MLUtils.leavepout","text":"leavepout(n::Integer, [size = 1]) -> Tuple\n\nCompute the train/validation assignments for k ≈ n/size repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either size or size+1 observations assigned to it. The following code snippet generates the index-vectors for size = 2.\n\njulia> train_idx, val_idx = leavepout(10, 2);\n\nEach observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.\n\njulia> train_idx\n5-element Array{Array{Int64,1},1}:\n [3,4,5,6,7,8,9,10]\n [1,2,5,6,7,8,9,10]\n [1,2,3,4,7,8,9,10]\n [1,2,3,4,5,6,9,10]\n [1,2,3,4,5,6,7,8]\n\njulia> val_idx\n5-element Array{UnitRange{Int64},1}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\n\n\n\n\nleavepout(data, p = 1)\n\nRepartition a data container using a k-fold strategy, where k is chosen in such a way, that each validation subset of the resulting folds contains roughly p observations. Defaults to p = 1, which is also known as \"leave-one-out\" partitioning.\n\nThe resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. 
That means no actual data is copied until getobs is invoked.\n\nfor (train, val) in leavepout(X, p=2)\n # if numobs(X) is divisible by 2,\n # then numobs(val) will be 2 for each iteration,\n # otherwise it may be 3 for the first few iterations.\nend\n\nSee kfolds for a related function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.mapobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.mapobs","text":"mapobs(f, data; batched=:auto)\n\nLazily map f over the observations in a data container data. Returns a new data container mdata that can be indexed and has a length. Indexing triggers the transformation f.\n\nThe batched keyword argument controls the behavior of mdata[idx] and mdata[idxs] where idx is an integer and idxs is a vector of integers:\n\nbatched=:auto (default). Let f handle the two cases. Calls f(getobs(data, idx)) and f(getobs(data, idxs)).\nbatched=:never. The function f is always called on a single observation. Calls f(getobs(data, idx)) and [f(getobs(data, idx)) for idx in idxs].\nbatched=:always. The function f is always called on a batch of observations. Calls getobs(f(getobs(data, [idx])), 1) and f(getobs(data, idxs)).\n\nExamples\n\njulia> data = (a=[1,2,3], b=[1,2,3]);\n\njulia> mdata = mapobs(data) do x\n (c = x.a .+ x.b, d = x.a .- x.b)\n end\nmapobs(#25, (a = [1, 2, 3], b = [1, 2, 3]); batched=:auto))\n\njulia> mdata[1]\n(c = 2, d = 0)\n\njulia> mdata[1:2]\n(c = [2, 4], d = [0, 0])\n\n\n\n\n\nmapobs(fs, data)\n\nLazily map each function in tuple fs over the observations in data container data. Returns a tuple of transformed data containers.\n\n\n\n\n\nmapobs(namedfs::NamedTuple, data)\n\nMap a NamedTuple of functions over data, turning it into a data container of NamedTuples. 
Field syntax can be used to select a column of the resulting data container.\n\ndata = 1:10\nnameddata = mapobs((x = sqrt, y = log), data)\ngetobs(nameddata, 10) == (x = sqrt(10), y = log(10))\ngetobs(nameddata.x, 10) == sqrt(10)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.numobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.numobs","text":"numobs(data)\n\nReturn the total number of observations contained in data.\n\nIf data does not have numobs defined, then in the case of Tables.table(data) == true returns the number of rows, otherwise returns length(data).\n\nAuthors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).\n\ngetobs supports by default nested combinations of array, tuple, named tuples, and dictionaries. \n\nSee also getobs.\n\nExamples\n\n\n# named tuples \nx = (a = [1, 2, 3], b = rand(6, 3))\nnumobs(x) == 3\n\n# dictionaries\nx = Dict(:a => [1, 2, 3], :b => rand(6, 3))\nnumobs(x) == 3\n\nAll internal containers must have the same number of observations:\n\njulia> x = (a = [1, 2, 3, 4], b = rand(6, 3));\n\njulia> numobs(x)\nERROR: DimensionMismatch: All data containers must have the same number of observations.\nStacktrace:\n [1] _check_numobs_error()\n @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:163\n [2] _check_numobs\n @ ~/.julia/dev/MLUtils/src/observation.jl:130 [inlined]\n [3] numobs(data::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Matrix{Float64}}})\n @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:177\n [4] top-level scope\n @ REPL[35]:1\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.normalise","page":"Batching Data – MLUtils.jl","title":"MLUtils.normalise","text":"normalise(x; dims=ndims(x), ϵ=1e-5)\n\nNormalise the array x to mean 0 and standard deviation 1 across the dimension(s) 
given by dims. By default, dims is the last dimension. \n\nϵ is a small additive factor added to the denominator for numerical stability.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.obsview","page":"Batching Data – MLUtils.jl","title":"MLUtils.obsview","text":"obsview(data, [indices])\n\nReturns a lazy view of the observations in data that correspond to the given indices. No data will be copied except for the indices. It is similar to constructing an ObsView, but returns a SubArray if the type of data is Array or SubArray. Furthermore, this function may be extended for custom types of data that also want to provide their own subset-type.\n\nIn case data is a tuple, the constructor will be mapped over its elements. That means that the constructor returns a tuple of ObsView instead of an ObsView of tuples.\n\nIf instead you want to get the subset of observations corresponding to the given indices in their native type, use getobs.\n\nSee ObsView for more information.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.ObsView","page":"Batching Data – MLUtils.jl","title":"MLUtils.ObsView","text":"ObsView(data, [indices])\n\nUsed to represent a subset of some data of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.\n\nThe main purpose for the existence of ObsView is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.\n\nAny data access is delayed until getindex is called, and even getindex returns the result of obsview which in general avoids data movement until getobs is called. 
If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.\n\nArguments\n\ndata : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).\nindices : Optional. The index or indices of the observation(s) in data that the subset should represent. Can be of type Int or some subtype of AbstractVector.\n\nMethods\n\ngetindex : Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.\nnumobs : Returns the total number of observations in the subset.\ngetobs : Returns the underlying data that the ObsView represents at the given relative indices. Note that these indices are in \"subset space\", and in general will not directly correspond to the same indices in the underlying data set.\n\nDetails\n\nFor ObsView to work on some data structure, the desired type MyType must implement the following interface:\n\ngetobs(data::MyType, idx) : Should return the observation(s) indexed by idx. In what form is up to the user. Note that idx can be of type Int or AbstractVector.\nnumobs(data::MyType) : Should return the total number of observations in data.\n\nThe following methods can also be provided and are optional:\n\ngetobs(data::MyType) : By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.\nobsview(data::MyType, idx) : If your custom type has its own kind of subset type, you can return it here. An example of such a case is SubArray, which represents a subset of some AbstractArray.\ngetobs!(buffer, data::MyType, [idx]) : Inplace version of getobs(data, idx). If this method is provided for MyType, then eachobs can preallocate a buffer that is then reused every iteration. 
Note: buffer should be equivalent to the return value of getobs(::MyType, ...), since this is how buffer is preallocated by default.\n\nExamples\n\nX, Y = MLUtils.load_iris()\n\n# The iris set has 150 observations and 4 features\n@assert size(X) == (4,150)\n\n# Represents 80 of the observations as an ObsView\nv = ObsView(X, 21:100)\n@assert numobs(v) == 80\n@assert typeof(v) <: ObsView\n# getobs indexes into v\n@assert getobs(v, 1:10) == X[:, 21:30]\n\n# Use `obsview` to avoid boxing into ObsView\n# for types that provide a custom \"subset\", such as arrays.\n# Here it instead creates a native SubArray.\nv = obsview(X, 1:100)\n@assert numobs(v) == 100\n@assert typeof(v) <: SubArray\n\n# Also works for tuples of arbitrary length\nsubset = obsview((X, Y), 1:100)\n@assert numobs(subset) == 100\n@assert typeof(subset) <: Tuple # tuple of SubArray\n\n# Use as iterator\nfor x in ObsView(X)\n @assert typeof(x) <: SubArray{Float64,1}\nend\n\n# iterate over each individual labeled observation\nfor (x, y) in ObsView((X, Y))\n @assert typeof(x) <: SubArray{Float64,1}\n @assert typeof(y) <: String\nend\n\n# same but in random order\nfor (x, y) in ObsView(shuffleobs((X, Y)))\n @assert typeof(x) <: SubArray{Float64,1}\n @assert typeof(y) <: String\nend\n\n# Indexing: take first 10 observations\nx, y = ObsView((X, Y))[1:10]\n\nSee also\n\nobsview, getobs, numobs, splitobs, shuffleobs, kfolds.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#MLUtils.ones_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.ones_like","text":"ones_like(x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also zeros_like and fill_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.8621633\n 0.5158395\n\njulia> ones_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.82297 0.656143\n 0.701828 0.391335\n\njulia> ones_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.oversample","page":"Batching Data – MLUtils.jl","title":"MLUtils.oversample","text":"oversample(data, classes; fraction=1, shuffle=true)\noversample(data::Tuple; fraction=1, shuffle=true)\n\nGenerate a re-balanced version of data by repeatedly sampling existing observations in such a way that every class will have at least fraction times the number of observations of the largest class in classes. This way, all classes will have a minimum number of observations in the resulting data set relative to what the largest class has in the given (original) data.\n\nAs an example, by default (i.e. with fraction = 1) the resulting dataset will be near perfectly balanced. On the other hand, with fraction = 0.5 every class in the resulting data will have at least 50% as many observations as the largest class.\n\nThe classes input is an array with the same length as numobs(data). \n\nThe convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the repeated samples will be together at the end, sorted by class. 
Defaults to true.\n\nThe output will contain both the resampled data and classes.\n\n# 6 observations with 3 features each\nX = rand(3, 6)\n# 2 classes, severely imbalanced\nY = [\"a\", \"b\", \"b\", \"b\", \"b\", \"a\"]\n\n# oversample the class \"a\" to match \"b\"\nX_bal, Y_bal = oversample(X, Y)\n\n# this results in a bigger dataset with repeated data\n@assert size(X_bal) == (3,8)\n@assert length(Y_bal) == 8\n\n# now both \"a\", and \"b\" have 4 observations each\n@assert sum(Y_bal .== \"a\") == 4\n@assert sum(Y_bal .== \"b\") == 4\n\nFor this function to work, the type of data must implement numobs and getobs. \n\nNote that if data is a tuple and classes is not given, then it will be assumed that the last element of the tuple contains the classes.\n\njulia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])\n6×3 DataFrames.DataFrame\n│ Row │ X1 │ X2 │ Y │\n├─────┼───────────┼─────────────┼───┤\n│ 1 │ 0.226582 │ 0.0443222 │ a │\n│ 2 │ 0.504629 │ 0.722906 │ b │\n│ 3 │ 0.933372 │ 0.812814 │ b │\n│ 4 │ 0.522172 │ 0.245457 │ b │\n│ 5 │ 0.505208 │ 0.11202 │ b │\n│ 6 │ 0.0997825 │ 0.000341996 │ a │\n\njulia> getobs(oversample(data, data.Y))\n8×3 DataFrame\n Row │ X1 X2 Y \n │ Float64 Float64 Symbol \n─────┼─────────────────────────────\n 1 │ 0.376304 0.100022 a\n 2 │ 0.467095 0.185437 b\n 3 │ 0.481957 0.319906 b\n 4 │ 0.336762 0.390811 b\n 5 │ 0.376304 0.100022 a\n 6 │ 0.427064 0.0648339 a\n 7 │ 0.427064 0.0648339 a\n 8 │ 0.457043 0.490688 b\n\nSee ObsView for more information on data subsets. See also undersample.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.randobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.randobs","text":"randobs(data, [n])\n\nPick a random observation or a batch of n random observations from data. 
For this function to work, the type of data must implement numobs and getobs.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.rand_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.rand_like","text":"rand_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to a random value. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.\n\nThe default random number generator is used, unless a custom one is passed in explicitly as the first argument.\n\nSee also Base.rand and randn_like.\n\nExamples\n\njulia> x = ones(Float32, 2)\n2-element Vector{Float32}:\n 1.0\n 1.0\n\njulia> rand_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 0.780032 0.920552 0.53689\n 0.121451 0.741334 0.5449\n 0.55348 0.138136 0.556404\n\njulia> using CUDA\n\njulia> CUDA.ones(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\njulia> rand_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 0.429274 0.135379\n 0.718895 0.0098756\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.randn_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.randn_like","text":"randn_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to a random value drawn from a normal distribution. The last two arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nThe default random number generator is used, unless a custom one is passed in explicitly as the first argument.\n\nSee also Base.randn and rand_like.\n\nExamples\n\njulia> x = ones(Float32, 2)\n2-element Vector{Float32}:\n 1.0\n 1.0\n\njulia> randn_like(x, (3, 3))\n3×3 Matrix{Float32}:\n -0.385331 0.956231 0.0745102\n 1.43756 -0.967328 2.06311\n 0.0482372 1.78728 -0.902547\n\njulia> using CUDA\n\njulia> CUDA.ones(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\njulia> randn_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n -0.578527 0.823445\n -1.01338 -0.612053\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.rpad_constant","page":"Batching Data – MLUtils.jl","title":"MLUtils.rpad_constant","text":"rpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)\n\nReturn the given sequence padded with val along the dimensions dims up to a maximum length in each direction specified by n.\n\nExamples\n\njulia> rpad_constant([1, 2], 4, -1) # passing with -1 up to size 4\n4-element Vector{Int64}:\n 1\n 2\n -1\n -1\n\njulia> rpad_constant([1, 2, 3], 2) # no padding if length is already greater than n\n3-element Vector{Int64}:\n 1\n 2\n 3\n\njulia> rpad_constant([1 2; 3 4], 4; dims=1) # padding along the first dimension\n4×2 Matrix{Int64}:\n 1 2\n 3 4\n 0 0\n 0 0 \n\njulia> rpad_constant([1 2; 3 4], 4) # padding along all dimensions by default\n4×2 Matrix{Int64}:\n 1 2\n 3 4\n 0 0\n 0 0 \n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.shuffleobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.shuffleobs","text":"shuffleobs([rng], data)\n\nReturn a \"subset\" of data that spans all observations, but has the order of the observations shuffled.\n\nThe values of data itself are not copied. Instead only the indices are shuffled. 
This function calls obsview to accomplish that, which means that the return value is likely of a different type than data.\n\n# For Arrays the subset will be of type SubArray\n@assert typeof(shuffleobs(rand(4,10))) <: SubArray\n\n# Iterate through all observations in random order\nfor x in eachobs(shuffleobs(X))\n ...\nend\n\nThe optional parameter rng allows one to specify the random number generator used for shuffling. This is useful when reproducible results are desired. By default, uses the global RNG. See Random in Julia's standard library for more info.\n\nFor this function to work, the type of data must implement numobs and getobs. See ObsView for more information.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.splitobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.splitobs","text":"splitobs(n::Int; at) -> Tuple\n\nCompute the indices for two or more disjoint subsets of the range 1:n with splits given by at.\n\nExamples\n\njulia> splitobs(100, at=0.7)\n(1:70, 71:100)\n\njulia> splitobs(100, at=(0.1, 0.4))\n(1:10, 11:50, 51:100)\n\n\n\n\n\nsplitobs(data; at, shuffle=false) -> Tuple\n\nPartition the data into two or more subsets. When at is a number (between 0 and 1) this specifies the proportion in the first subset. When at is a tuple, each entry specifies the proportion in a subset, with the last having 1-sum(at). In all there are length(at)+1 subsets returned.\n\nIf shuffle=true, randomly permute the observations before splitting.\n\nSupports any datatype implementing the numobs and getobs interfaces – including arrays, tuples & NamedTuples of arrays.\n\nExamples\n\njulia> splitobs(permutedims(1:100); at=0.7) # simple 70%-30% split, of a matrix\n([1 2 … 69 70], [71 72 … 99 100])\n\njulia> data = (x=ones(2,10), n=1:10) # a NamedTuple, consistent last dimension\n(x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:10)\n\njulia> splitobs(data, at=(0.5, 0.3)) # a 50%-30%-20% split, e.g. 
train/test/validation\n((x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:5), (x = [1.0 1.0 1.0; 1.0 1.0 1.0], n = 6:8), (x = [1.0 1.0; 1.0 1.0], n = 9:10))\n\njulia> train, test = splitobs((permutedims(1.0:100.0), 101:200), at=0.7, shuffle=true); # split a Tuple\n\njulia> vec(test[1]) .+ 100 == test[2]\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unbatch","page":"Batching Data – MLUtils.jl","title":"MLUtils.unbatch","text":"unbatch(x)\n\nReverse of the batch operation, unstacking the last dimension of the array x.\n\nSee also unstack and chunk.\n\nExamples\n\njulia> unbatch([1 3 5 7;\n 2 4 6 8])\n4-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n [7, 8]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.undersample","page":"Batching Data – MLUtils.jl","title":"MLUtils.undersample","text":"undersample(data, classes; shuffle=true)\n\nGenerate a class-balanced version of data by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) data.\n\nThe convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. 
Defaults to true.\n\nThe output will contain both the resampled data and classes.\n\n# 6 observations with 3 features each\nX = rand(3, 6)\n# 2 classes, severely imbalanced\nY = [\"a\", \"b\", \"b\", \"b\", \"b\", \"a\"]\n\n# subsample the class \"b\" to match \"a\"\nX_bal, Y_bal = undersample(X, Y)\n\n# this results in a smaller dataset\n@assert size(X_bal) == (3,4)\n@assert length(Y_bal) == 4\n\n# now both \"a\", and \"b\" have 2 observations each\n@assert sum(Y_bal .== \"a\") == 2\n@assert sum(Y_bal .== \"b\") == 2\n\nFor this function to work, the type of data must implement numobs and getobs. \n\nNote that if data is a tuple, then it will be assumed that the last element of the tuple contains the targets.\n\njulia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])\n6×3 DataFrames.DataFrame\n│ Row │ X1 │ X2 │ Y │\n├─────┼───────────┼─────────────┼───┤\n│ 1 │ 0.226582 │ 0.0443222 │ a │\n│ 2 │ 0.504629 │ 0.722906 │ b │\n│ 3 │ 0.933372 │ 0.812814 │ b │\n│ 4 │ 0.522172 │ 0.245457 │ b │\n│ 5 │ 0.505208 │ 0.11202 │ b │\n│ 6 │ 0.0997825 │ 0.000341996 │ a │\n\njulia> getobs(undersample(data, data.Y))\n4×3 DataFrame\n Row │ X1 X2 Y \n │ Float64 Float64 Symbol \n─────┼─────────────────────────────\n 1 │ 0.427064 0.0648339 a\n 2 │ 0.376304 0.100022 a\n 3 │ 0.467095 0.185437 b\n 4 │ 0.457043 0.490688 b\n\nSee ObsView for more information on data subsets. See also oversample.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unsqueeze","page":"Batching Data – MLUtils.jl","title":"MLUtils.unsqueeze","text":"unsqueeze(x; dims)\n\nReturn x reshaped into an array one dimensionality higher than x, where dims indicates in which dimension x is extended. 
dims can be an integer between 1 and ndims(x)+1.\n\nSee also flatten, stack.\n\nExamples\n\njulia> unsqueeze([1 2; 3 4], dims=2)\n2×1×2 Array{Int64, 3}:\n[:, :, 1] =\n 1\n 3\n\n[:, :, 2] =\n 2\n 4\n\n\njulia> xs = [[1, 2], [3, 4], [5, 6]]\n3-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n\njulia> unsqueeze(xs, dims=1)\n1×3 Matrix{Vector{Int64}}:\n [1, 2] [3, 4] [5, 6]\n\n\n\n\n\nunsqueeze(; dims)\n\nReturns a function which, acting on an array, inserts a dimension of size 1 at dims.\n\nExamples\n\njulia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size\n(21, 1, 22, 23)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unstack","page":"Batching Data – MLUtils.jl","title":"MLUtils.unstack","text":"unstack(xs; dims)\n\nUnroll the given xs into an array of arrays along the given dimension dims.\n\nSee also stack, unbatch, and chunk.\n\nExamples\n\njulia> unstack([1 3 5 7; 2 4 6 8], dims=2)\n4-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n [7, 8]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.zeros_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.zeros_like","text":"zeros_like(x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also ones_like and fill_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.4005432\n 0.36934233\n\njulia> zeros_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 0.0 0.0 0.0\n 0.0 0.0 0.0\n 0.0 0.0 0.0\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.0695155 0.667979\n 0.558468 0.59903\n\njulia> zeros_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 0.0 0.0\n 0.0 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/training/callbacks/#man-callback-helpers","page":"Callback Helpers","title":"Callback Helpers","text":"","category":"section"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux.throttle","category":"page"},{"location":"reference/training/callbacks/#Flux.throttle","page":"Callback Helpers","title":"Flux.throttle","text":"throttle(f, timeout; leading=true, trailing=false)\n\nReturn a function that, when invoked, will only be triggered at most once during timeout seconds.\n\nNormally, the throttled function will run as much as it can, without ever going more than once per timeout duration; but if you'd like to disable the execution on the leading edge, pass leading=false. 
To enable execution on the trailing edge, pass trailing=true.\n\nExamples\n\njulia> a = Flux.throttle(() -> println(\"Flux\"), 2);\n\njulia> for i = 1:4 # a called in alternate iterations\n a()\n sleep(1)\n end\nFlux\nFlux\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Patience-Helpers","page":"Callback Helpers","title":"Patience Helpers","text":"","category":"section"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"# create a pseudo-loss that decreases for 4 calls, then starts increasing\n# we call this like loss()\nloss = let t = 0\n () -> begin\n t += 1\n (t - 4) ^ 2\n end\nend\n\n# create an early stopping trigger\n# returns true when the loss increases for two consecutive steps\nes = early_stopping(loss, 2; init_score = 9)\n\n# this will stop at the 6th (4 decreasing + 2 increasing calls) epoch\nfor epoch in 1:10\n es() && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"The keyword argument distance of early_stopping is a function of the form distance(best_score, score). 
By default distance is -, which implies that the monitored metric f is expected to be decreasing and minimized. If you use some increasing metric (e.g. accuracy), you can customize the distance function: (best_score, score) -> score - best_score.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"# create a pseudo-accuracy that increases by 0.01 each time from 0 to 1\n# we call this like acc()\nacc = let v = 0\n () -> v = min(1, v + 0.01)\nend\n\n# create an early stopping trigger for accuracy\nes = early_stopping(acc, 3; distance = (best_score, score) -> score - best_score)\n\n# this will iterate until the 10th epoch\nfor epoch in 1:10\n es() && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"early_stopping and plateau are both built on top of patience. You can use patience to build your own triggers that use a patient counter. For example, if you want to trigger when the loss is below a threshold for several consecutive iterations:","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"threshold(f, thresh, delay) = patience(delay) do\n f() < thresh\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Both predicate in patience and f in early_stopping / plateau can accept extra arguments. 
You can pass such extra arguments to predicate or f through the returned function:","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"trigger = patience((a; b) -> a > b, 3)\n\n# this will iterate until the 10th epoch\nfor epoch in 1:10\n trigger(1; b = 2) && break\nend\n\n# this will stop at the 3rd epoch\nfor epoch in 1:10\n trigger(3; b = 2) && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux.patience\nFlux.early_stopping\nFlux.plateau","category":"page"},{"location":"reference/training/callbacks/#Flux.patience","page":"Callback Helpers","title":"Flux.patience","text":"patience(predicate, wait)\n\nReturn a function that internally counts by one when predicate(...) == true, otherwise the count is reset to zero. If the count is greater than or equal to wait, the function returns true, otherwise it returns false.\n\nExamples\n\njulia> loss() = rand();\n\njulia> trigger = Flux.patience(() -> loss() < 1, 3);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n trigger() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Flux.early_stopping","page":"Callback Helpers","title":"Flux.early_stopping","text":"early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)\n\nReturn a function that internally counts by one when distance(best_score, f(...)) <= min_dist, where best_score is the last seen best value of f(...). If the count is greater than or equal to delay, the function returns true, otherwise it returns false. 
The count is reset when distance(best_score, f(...)) > min_dist.\n\nExamples\n\njulia> loss = let l = 0\n () -> l += 1\n end; # pseudo loss function that returns increasing values\n\njulia> es = Flux.early_stopping(loss, 3);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n es() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Flux.plateau","page":"Callback Helpers","title":"Flux.plateau","text":"plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)\n\nReturn a function that internally counts by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the last value of f(...). If the count is greater than or equal to width, the function returns true, otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.\n\nExamples\n\njulia> f = let v = 10\n () -> v = v / abs(v) - v\n end; # -9, 8, -7, 6, ...\n\njulia> trigger = Flux.plateau(f, 3; init_score=10, min_dist=18);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n trigger() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n[ Info: Epoch 4\n\n\n\n\n\n","category":"function"},{"location":"guide/training/training/#man-training","page":"Training","title":"Training a Flux Model","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Training refers to the process of slowly adjusting the parameters of a model to make it work better. 
Besides the model itself, we will need three things:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"An objective function that evaluates how well a model is doing on some input.\nAn optimisation rule which describes how the model's parameters should be adjusted.\nSome training data to use as the input during this process.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Usually the training data is some collection of examples (or batches of examples) which are handled one-by-one. One epoch of training means that each example is used once, something like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Initialise the optimiser for this model:\nopt_state = Flux.setup(rule, model)\n\nfor data in train_set\n # Unpack this element (for supervised training):\n input, label = data\n\n # Calculate the gradient of the objective\n # with respect to the parameters within the model:\n grads = Flux.gradient(model) do m\n result = m(input)\n loss(result, label)\n end\n\n # Update the parameters so as to reduce the objective,\n # according to the chosen optimisation rule:\n Flux.update!(opt_state, model, grads[1])\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"This loop can also be written using the function train!, but it's helpful to understand the pieces first:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"train!(model, train_set, opt_state) do m, x, y\n loss(m(x), y)\nend","category":"page"},{"location":"guide/training/training/#Model-Gradients","page":"Training","title":"Model Gradients","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"First recall from the section on taking gradients that Flux.gradient(f, a, b) always 
calls f(a, b), and returns a tuple (∂f_∂a, ∂f_∂b). In the code above, the function f passed to gradient is an anonymous function with one argument, created by the do block, hence grads is a tuple with one element. Instead of a do block, we could have written:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"grads = Flux.gradient(m -> loss(m(input), label), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Since the model is some nested set of layers, grads[1] is a similarly nested set of NamedTuples, ultimately containing gradient components. If (for example) θ = model.layers[1].weight[2,3] is one scalar parameter, an entry in a matrix of weights, then the derivative of the loss with respect to it is ∂f_∂θ = grads[1].layers[1].weight[2,3].","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"It is important that the execution of the model takes place inside the call to gradient, in order for the influence of the model's parameters to be observed by Zygote.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"It is also important that every update! step receives a newly computed gradient, as it will change whenever the model's parameters are changed, and for each new data point.","category":"page"},{"location":"guide/training/training/#Loss-Functions","page":"Training","title":"Loss Functions","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The objective function must return a number representing how far the model is from the desired result. 
This is termed the loss of the model.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"This number can be produced by any ordinary Julia code, but this must be executed within the call to gradient. For instance, we could define a function","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"loss(y_hat, y) = sum((y_hat .- y).^2)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"or write this directly inside the do block above. Many commonly used functions, like mse for mean-squared error or crossentropy for cross-entropy loss, are available from the Flux.Losses module.","category":"page"},{"location":"guide/training/training/#Optimisation-Rules","page":"Training","title":"Optimisation Rules","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The simplest kind of optimisation using the gradient is termed gradient descent (or sometimes stochastic gradient descent when, as here, it is not applied to the entire dataset at once).","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Gradient descent needs a learning rate which is a small number describing how fast to walk downhill, usually written as the Greek letter \"eta\", η. This is often described as a hyperparameter, to distinguish it from the parameters which are being updated θ = θ - η * ∂loss_∂θ. 
We want to update all the parameters in the model, like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"η = 0.01 # learning rate\n\n# For each parameter array, update\n# according to the corresponding gradient:\nfmap(model, grads[1]) do p, g\n p .= p .- η .* g\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"A slightly more refined version of this loop to update all the parameters is wrapped up as a function update!(opt_state, model, grads[1]). And the learning rate is the only thing stored in the Descent struct.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"However, there are many other optimisation rules, which adjust the step size and direction in various clever ways. Most require some memory of the gradients from earlier steps, rather than always walking straight downhill – Momentum is the simplest. The function setup creates the necessary storage for this, for a particular model. It should be called once, before training, and returns a tree-like object which is the first argument of update!. Like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Initialise momentum \nopt_state = Flux.setup(Momentum(0.01, 0.9), model)\n\nfor data in train_set\n grads = [...]\n\n # Update both model parameters and optimiser state:\n Flux.update!(opt_state, model, grads[1])\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Many commonly-used optimisation rules, such as Adam, are built-in. These are listed on the optimisers page.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"compat: Implicit-style optimiser state\nThis setup makes another tree-like structure. 
Old versions of Flux did not do this, and instead stored a dictionary-like structure within the optimiser Adam(0.001). This was initialised on first use of the version of update! for \"implicit\" parameters.","category":"page"},{"location":"guide/training/training/#Datasets-and-Batches","page":"Training","title":"Datasets & Batches","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The loop above iterates through train_set, expecting at each step a tuple (input, label). The very simplest such object is a vector of tuples, such as this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"x = randn(28, 28)\ny = rand(10)\ndata = [(x, y)]","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"or data = [(x, y), (x, y), (x, y)] for the same values three times.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Very often, the initial data is large arrays which you need to slice into examples. To produce one iterator of pairs (x, y), you might want zip:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"X = rand(28, 28, 60_000); # many images, each 28 × 28\nY = rand(10, 60_000)\ndata = zip(eachslice(X; dims=3), eachcol(Y))\n\nfirst(data) isa Tuple{AbstractMatrix, AbstractVector} # true","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Here each iteration will use one matrix x (an image, perhaps) and one vector y. It is very common to instead train on batches of such inputs (or mini-batches, the two words mean the same thing) both for efficiency and for better results. 
This can be easily done using the DataLoader:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"data = Flux.DataLoader((X, Y), batchsize=32)\n\nx1, y1 = first(data)\nsize(x1) == (28, 28, 32)\nlength(data) == 1875 === 60_000 ÷ 32","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux's layers are set up to accept such a batch of input data, and the convolutional layers such as Conv require it. The batch index is always the last dimension.","category":"page"},{"location":"guide/training/training/#Training-Loops","page":"Training","title":"Training Loops","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Simple training loops like the one above can be written compactly using the train! function. Including setup, this reads:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(), model)\n\nfor epoch in 1:100\n Flux.train!(model, train_set, opt_state) do m, x, y\n loss(m(x), y)\n end\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Or explicitly writing the anonymous function which this do block creates, train!((m,x,y) -> loss(m(x),y), model, train_set, opt_state) is exactly equivalent.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Real training loops often need more flexibility, and the best way to do this is just to write the loop. This is ordinary Julia code, without any need to work through some callback API. 
Here is an example, in which it may be helpful to note:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The function withgradient is like gradient but also returns the value of the function, for logging or diagnostic use.\nLogging or printing is best done outside of the gradient call, as there is no need to differentiate these commands.\nTo use result for logging purposes, you could change the do block to end with return my_loss(result, label), result, i.e. make the function passed to withgradient return a tuple. The first element is always the loss.\nJulia's break and continue keywords let you exit from parts of the loop.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(), model)\n\nmy_log = []\nfor epoch in 1:100\n losses = Float32[]\n for (i, data) in enumerate(train_set)\n input, label = data\n\n val, grads = Flux.withgradient(model) do m\n # Any code inside here is differentiated.\n # Evaluation of the model and loss must be inside!\n result = m(input)\n my_loss(result, label)\n end\n\n # Save the loss from the forward pass. (Done outside of gradient.)\n push!(losses, val)\n\n # Detect loss of Inf or NaN. 
Print a warning, and then skip update!\n if !isfinite(val)\n @warn \"loss is $val on item $i\" epoch\n continue\n end\n\n Flux.update!(opt_state, model, grads[1])\n end\n\n # Compute some accuracy, and save details as a NamedTuple\n acc = my_accuracy(model, train_set)\n push!(my_log, (; acc, losses))\n\n # Stop training when some criterion is reached\n if acc > 0.95\n println(\"stopping after $epoch epochs\")\n break\n end\nend","category":"page"},{"location":"guide/training/training/#Regularisation","page":"Training","title":"Regularisation","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The term regularisation covers a wide variety of techniques aiming to improve the result of training. This is often done to avoid overfitting.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Some of these can be implemented by simply modifying the loss function. L₂ regularisation (sometimes called ridge regression) adds to the loss a penalty proportional to θ^2 for every scalar parameter. A very simple model could be implemented as follows:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"grads = Flux.gradient(densemodel) do m\n result = m(input)\n penalty = sum(abs2, m.weight)/2 + sum(abs2, m.bias)/2\n my_loss(result, label) + 0.42f0 * penalty\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Accessing each individual parameter array by hand won't work well for large models. 
Instead, we can use Flux.trainables to collect all of them, and then apply a function to each one, and sum the result:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"pen_l2(x::AbstractArray) = sum(abs2, x)/2\n\ngrads = Flux.gradient(model) do m\n result = m(input)\n penalty = sum(pen_l2, Flux.trainables(m))\n my_loss(result, label) + 0.42f0 * penalty\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"However, the gradient of this penalty term is very simple: It is proportional to the original weights. So there is a simpler way to implement exactly the same thing, by modifying the optimiser instead of the loss function. This is done by replacing this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(0.1), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"with this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"decay_opt_state = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux's optimisers are really modifications applied to the gradient before using it to update the parameters, and OptimiserChain applies two such modifications. The first, WeightDecay adds 0.42 times the original parameter to the gradient, matching the gradient of the penalty above (with the same, unrealistically large, constant). After that, in either case, Adam computes the final update.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The same trick works for L₁ regularisation (also called Lasso), where the penalty is pen_l1(x::AbstractArray) = sum(abs, x) instead. 
This is implemented by SignDecay(0.42).","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The same OptimiserChain mechanism can be used for other purposes, such as gradient clipping with ClipGrad or ClipNorm.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Besides L1 / L2 / weight decay, another common and quite different kind of regularisation is provided by the Dropout layer. This turns off some outputs of the previous layer during training. It should switch automatically, but see trainmode! / testmode! to manually enable or disable this layer.","category":"page"},{"location":"guide/training/training/#Learning-Rate-Schedules","page":"Training","title":"Learning Rate Schedules","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"For finer control of training, you may wish to alter the learning rate mid-way through training. This can be done with adjust!, like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(0.1), model) # initialise once\n\nfor epoch in 1:1000\n train!([...], opt_state) # Train with η = 0.1 for first 100,\n if epoch == 100 # then change to use η = 0.01 for the rest.\n Flux.adjust!(opt_state, 0.01)\n end\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Other hyper-parameters can also be adjusted, such as Flux.adjust!(opt_state, beta = (0.8, 0.99)). And such modifications can be applied to just one part of the model. 
For instance, this sets a different learning rate for the encoder and the decoder:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Consider some model with two parts:\nbimodel = Chain(enc = [...], dec = [...])\n\n# This returns a tree whose structure matches the model:\nopt_state = Flux.setup(Adam(0.02), bimodel)\n\n# Adjust the learning rate to be used for bimodel.layers.enc\nFlux.adjust!(opt_state.layers.enc, 0.03)","category":"page"},{"location":"guide/training/training/#Freezing-layer-parameters","page":"Training","title":"Freezing layer parameters","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"To completely disable training of some part of the model, use freeze!. This is a temporary modification, reversed by thaw!:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux.freeze!(opt_state.layers.enc)\n\n# Now training won't update parameters in bimodel.layers.enc\ntrain!(loss, bimodel, data, opt_state)\n\n# Un-freeze the entire model:\nFlux.thaw!(opt_state)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"While adjust! and freeze!/thaw! 
make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layer NewLayer trainable=(weight,).","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/activation/#man-activations","page":"Activation Functions","title":"Activation Functions from NNlib.jl","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"These non-linearities used between layers of your model are exported by the NNlib package.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call σ.(xs), relu.(xs) and so on. Alternatively, they can be passed to a layer like Dense(784 => 1024, relu) which will handle this broadcasting.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Functions like softmax are sometimes described as activation functions, but not by Flux. They must see all the outputs, and hence cannot be broadcasted. 
See the next page for details.","category":"page"},{"location":"reference/models/activation/#Alphabetical-Listing","page":"Activation Functions","title":"Alphabetical Listing","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"celu\nelu\ngelu\nhardsigmoid\nhardswish\nhardtanh\nleakyrelu\nlisht\nlogcosh\nlogsigmoid\nmish\nrelu\nrelu6\nrrelu\nselu\nsigmoid\nsigmoid_fast\nsoftplus\nsoftshrink\nsoftsign\nswish\ntanhshrink\ntanh_fast\ntrelu","category":"page"},{"location":"reference/models/activation/#NNlib.celu","page":"Activation Functions","title":"NNlib.celu","text":"celu(x, α=1) = x ≥ 0 ? x : α * (exp(x/α) - 1)\n\nActivation function from \"Continuously Differentiable Exponential Linear Units\".\n\njulia> lineplot(celu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ celu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠔⠒⠋⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠤⠤⠤⠤⠔⠒⠒⠒⠊⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> celu(-10f0)\n-0.9999546f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.elu","page":"Activation Functions","title":"NNlib.elu","text":"elu(x, α=1) = x > 0 ? x : α * (exp(x) - 1)\n\nExponential Linear Unit activation function. See \"Fast and Accurate Deep Network Learning by Exponential Linear Units\". You can also specify the coefficient explicitly, e.g. 
elu(x, 1).\n\njulia> lineplot(elu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ elu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠔⠒⠋⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠤⠤⠤⠤⠔⠒⠒⠒⠊⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> elu(-10f0)\n-0.9999546f0\n\njulia> elu(-10f0, 2)\n-1.9999092f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.gelu","page":"Activation Functions","title":"NNlib.gelu","text":"gelu(x) = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))\n\nActivation function from \"Gaussian Error Linear Units\".\n\njulia> lineplot(gelu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊│ gelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡠⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣤⣤⣤⣤⣤⣤⣤⣤⡤⠤⠤⠤⠤⠤⠤⠤⣤⣤⣤⡤⡧⠶⠶⠭⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠉⠉⠉⠉⠉⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(gelu, -5, 0, height=7);\n\njulia> lineplot!(ans, swish)\n ┌────────────────────────────────────────┐ \n 0 │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠒⠒⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸│ gelu(x) \n │⠑⠒⠢⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇│ swish(x)\n │⠀⠀⠀⠀⠀⠈⠉⠒⠤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠁│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠒⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⢠⡇⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⣄⠀⠀⠀⠀⠀⢠⡞⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⢄⣀⣀⡤⢣⠃⠀⠀│ \n -0.2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⠇⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardsigmoid","page":"Activation Functions","title":"NNlib.hardsigmoid","text":"hardσ(x) = max(0, min(1, (x + 3) / 6))\n\nPiecewise linear approximation of sigmoid.\n\njulia> lineplot(hardsigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠉⠉⠉⠉⠉⠉⠉⠉│ hardσ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡠⠔⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⡗⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠋⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⠤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(sigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠒⠒⠋⠉⠉⠉⠉⠉⠉│ σ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡔⠋⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠊⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardswish","page":"Activation Functions","title":"NNlib.hardswish","text":"hardswish(x) = x * hardσ(x)\n\nHard-Swish activation function. 
See \"Searching for MobileNetV3\".\n\njulia> lineplot(hardswish, -2, 5, height = 7)\n ┌────────────────────────────────────────┐ \n 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠒⠉│ hardswish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠒⠉⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣤⣤⣖⣚⣉⣁⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀│ \n -1 │⠉⠒⠒⠒⠒⠉⠉⠉⠉⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(hardswish, -4, 0, height = 7);\n\njulia> lineplot!(ans, swish)\n ┌────────────────────────────────────────┐ \n 0 │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⢣⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡜│ hardswish(x)\n │⠒⠒⠢⠤⢄⣀⡀⠀⠀⠀⠀⠱⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠎⠀│ swish(x) \n │⠀⠀⠀⠀⠀⠀⠈⠉⠑⠒⠦⢄⣘⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠃⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠑⡖⠦⢄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⢔⠏⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠣⣄⠀⠉⠑⠒⠦⠤⢄⣀⣀⣀⣀⡠⠤⠖⣊⠕⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⠤⡀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀│ \n -0.4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠒⠢⠤⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-4⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> hardswish.(-5:5)'\n1×11 adjoint(::Vector{Float64}) with eltype Float64:\n -0.0 -0.0 -0.0 -0.333333 -0.333333 0.0 0.666667 1.66667 3.0 4.0 5.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardtanh","page":"Activation Functions","title":"NNlib.hardtanh","text":"hardtanh(x) = max(-1, min(1, x))\n\nSegment-wise linear approximation of tanh, much cheaper to compute. 
See \"Large Scale Machine Learning\".\n\nSee also tanh_fast.\n\njulia> lineplot(hardtanh, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⠔⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ hardtanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡷⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠋⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⠔⠋⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x\n\njulia> lineplot(tanh, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠤⠒⠒⠒⠊⠉⠉⠉│ tanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡷⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠊⠁⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⡠⠤⠤⠤⠖⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.leakyrelu","page":"Activation Functions","title":"NNlib.leakyrelu","text":"leakyrelu(x, a=0.01) = max(a*x, x)\n\nLeaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. 
leakyrelu(x, 0.01).\n\njulia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ #42(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⣤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠤⠒⠒⠋⠉⠁⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⠤⠤⠒⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> leakyrelu(-10f0, 0.2)\n-2.0f0\n\njulia> leakyrelu(-10f0, 0.02)\n-0.2f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.lisht","page":"Activation Functions","title":"NNlib.lisht","text":"lisht(x) = x * tanh(x)\n\nActivation function from \"LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ...\"\n\njulia> lineplot(lisht, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x)\n │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│ \n │⠀⠀⠀⠀⠈⠣⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠊⠁⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⢄⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡠⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⠦⣄⣀⣀⣇⣀⣀⠤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, logcosh)\n ┌────────────────────────────────────────┐ \n 2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x) \n │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│ logcosh(x)\n │⠢⣄⠀⠀⠈⠣⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⠀⠀⣀⠔│ \n f(x) │⠀⠈⠑⠢⣀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠊⠁⠀⣀⠔⠊⠁⠀│ \n │⠀⠀⠀⠀⠀⠉⠢⢄⡀⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⠔⠋⠀⡠⠔⠋⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠦⣌⡓⢄⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡠⠖⣁⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠪⠷⣦⣄⣀⣀⣇⣀⣀⣤⠶⠕⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.logcosh","page":"Activation Functions","title":"NNlib.logcosh","text":"logcosh(x)\n\nReturn log(cosh(x)) which is computed in a numerically stable way.\n\njulia> lineplot(logcosh, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 5 │⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ logcosh(x)\n │⠉⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ \n │⠀⠀⠀⠑⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠑⠦⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⠦⡀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⠦⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠑⠢⢄⣀⣀⣇⣀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.logsigmoid","page":"Activation Functions","title":"NNlib.logsigmoid","text":"logσ(x)\n\nReturn log(σ(x)) which is computed in a numerically stable way.\n\njulia> lineplot(logsigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡧⠤⠔⠒⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ logσ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⢀⡤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⡤⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.mish","page":"Activation Functions","title":"NNlib.mish","text":"mish(x) = x * tanh(softplus(x))\n\nActivation function from \"Mish: A Self Regularized Non-Monotonic Neural Activation Function\".\n\njulia> 
lineplot(mish, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋│ mish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠔⠋⠁⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣧⣔⣊⣁⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.relu","page":"Activation Functions","title":"NNlib.relu","text":"relu(x) = max(0, x)\n\nRectified Linear Unit activation function.\n\njulia> lineplot(relu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ relu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⡠⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.relu6","page":"Activation Functions","title":"NNlib.relu6","text":"relu6(x) = min(max(0, x), 6)\n\nRectified Linear Unit activation function capped at 6. 
See \"Convolutional Deep Belief Networks on CIFAR-10\".\n\njulia> lineplot(relu6, -10, 10, height=7)\n ┌────────────────────────────────────────┐ \n 6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠎⠉⠉⠉⠉⠉⠉⠉⠉│ relu6(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡤⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⡠⠎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡔⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⡧⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-10⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.rrelu","page":"Activation Functions","title":"NNlib.rrelu","text":"rrelu(x, lo=1/8, hi=1/3) = max(a*x, x)\n# where `a` is randomly sampled from uniform distribution `U(lo, hi)`\n\nRandomized Leaky Rectified Linear Unit activation function. See \"Empirical Evaluation of Rectified Activations\". You can also specify the bounds explicitly, e.g. rrelu(x, 0.0, 1.0).\n\njulia> lineplot(rrelu, -20, 10, height=7)\n ┌────────────────────────────────────────┐ \n 10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ rrelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⠤⣤⣤⢤⣤⣤⠤⠤⠤⢼⠮⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⣰⢀⣆⡄⣄⡄⡠⡰⠦⠷⡜⢢⠷⠳⠢⠊⠉⠉⠀⠀⠁⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠃⠉⠙⠘⠃⠈⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-20⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> extrema(rrelu.(fill(-10f0, 1000)))\n(-3.3316886f0, -1.2548422f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.selu","page":"Activation Functions","title":"NNlib.selu","text":"selu(x) = λ * (x ≥ 0 ? x : α * (exp(x) - 1))\n\nλ ≈ 1.05070...\nα ≈ 1.67326...\n\nScaled exponential linear units. 
See \"Self-Normalizing Neural Networks\".\n\njulia> lineplot(selu, -3, 2, height=7)\n ┌────────────────────────────────────────┐ \n 3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ selu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⠒│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⣀⠤⠖⠊⠉⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⡠⠤⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⣉⠭⠛⡏⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⡤⠤⠒⠊⠉⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠤⠤⠖⠒⠒⠒⠒⠒⠒⠒⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> selu(-10f0)\n-1.7580194f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.sigmoid","page":"Activation Functions","title":"NNlib.sigmoid","text":"σ(x) = 1 / (1 + exp(-x))\n\nClassic sigmoid activation function. Unicode σ can be entered as \\sigma then tab, in many editors. The ascii name sigmoid is also exported.\n\nSee also sigmoid_fast.\n\njulia> using UnicodePlots\n\njulia> lineplot(sigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠒⠒⠋⠉⠉⠉⠉⠉⠉│ σ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡔⠋⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠊⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> sigmoid === σ\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.sigmoid_fast","page":"Activation Functions","title":"NNlib.sigmoid_fast","text":"sigmoid_fast(x)\n\nThis is a faster, and very slightly less accurate, version of sigmoid. 
For x::Float32, perhaps 3 times faster, and maximum errors 2 eps instead of 1.\n\nSee also tanh_fast.\n\njulia> sigmoid(0.2f0)\n0.54983395f0\n\njulia> sigmoid_fast(0.2f0)\n0.54983395f0\n\njulia> hardσ(0.2f0)\n0.53333336f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softplus","page":"Activation Functions","title":"NNlib.softplus","text":"softplus(x) = log(exp(x) + 1)\n\nSee \"Deep Sparse Rectifier Neural Networks\", JMLR 2011.\n\njulia> lineplot(softplus, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠊⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡠⠤⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡧⠤⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⡠⠤⠤⠤⠤⠔⠒⠒⠚⠉⠉⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, relu)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠│ relu(x) \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡴⠞⠋⠁│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣤⡴⠞⠋⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡠⢤⡲⠝⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡧⠤⠒⠊⣉⠥⠚⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣠⣤⣤⣤⣤⣔⣒⣒⣚⣉⣉⣁⣀⣇⠴⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> softplus(16f0)\n16.0f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softshrink","page":"Activation Functions","title":"NNlib.softshrink","text":"softshrink(x, λ=0.5) =\n (x ≥ λ ? x - λ : (-λ ≥ x ? 
x + λ : 0))\n\nSee \"Softshrink Activation Function\".\n\njulia> lineplot(softshrink, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠔⠒⠉⠁│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⠒⠋⠁⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⡤⠤⠤⠤⠤⠤⠤⡧⠤⠤⠤⠤⠶⠮⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⢀⣀⠤⠖⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⣀⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, tanhshrink)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠔⠒⣉⡡│ tanhshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⣒⣋⠥⠤⠒⠊⠉⠁⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⣤⣤⣤⡤⠤⠤⠤⠤⠤⠤⡷⠶⠶⠶⠶⠶⠾⠿⠯⠭⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⢀⣀⡠⠤⠖⢒⣋⠭⠗⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠊⣉⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀\n\njulia> softshrink.((-10f0, 10f0))\n(-9.5f0, 9.5f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softsign","page":"Activation Functions","title":"NNlib.softsign","text":"softsign(x) = x / (1 + |x|)\n\nSee \"Quadratic Polynomials Learn Better Image Features\" (2009).\n\njulia> lineplot(softsign, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⣀⣀⠤⠤⠤⠤⠤│ softsign(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡔⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠤⠤⠒⠋⠁⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠒⠒⠒⠒⠒⠊⠉⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, tanh)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡤⠖⠊⠉⠉⠉⣉⣉⣉⣉⣉⠭⠭⠭⠭⠭│ softsign(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡔⣃⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanh(x) \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣧⡞⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡴⠃⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠤⠤⠒⢋⠕⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣒⣒⣒⣒⣒⣊⣉⣉⣉⣉⣁⣀⣀⡠⠤⠒⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> softsign(1f0)\n0.5f0\n\njulia> softsign(100f0)\n0.990099f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.swish","page":"Activation Functions","title":"NNlib.swish","text":"swish(x) = x * σ(x)\n\nSelf-gated activation function. See \"Swish: a Self-Gated Activation Function\".\n\njulia> lineplot(swish, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤│ swish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋⠁⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⣀⡤⠔⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⣤⣤⡤⡧⠴⠶⠯⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠉⠑⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠉⠉⠉⠉⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.tanhshrink","page":"Activation Functions","title":"NNlib.tanhshrink","text":"tanhshrink(x) = x - tanh(x)\n\nSee \"Tanhshrink Activation Function\".\n\njulia> lineplot(tanhshrink, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanhshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠊│ \n 
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⣀⡠⠤⠒⠊⠉⠁⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⣤⡤⠤⠤⠤⠤⠤⠤⡷⠶⠶⠶⠶⠶⠮⠭⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⣀⡠⠴⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⡠⠴⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> tanhshrink.((-10f0, 10f0))\n(-9.0f0, 9.0f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.tanh_fast","page":"Activation Functions","title":"NNlib.tanh_fast","text":"tanh_fast(x)\n\nThis is a faster but slightly less accurate version of tanh.\n\nWhere Julia's tanh function has an error under 2 eps, this may be wrong by 5 eps, a reduction by less than one decimal digit. \n\nFor x::Float32 this is usually about 10 times faster, with a smaller speedup for x::Float64. For any other number types, it just calls tanh.\n\nSee also sigmoid_fast.\n\njulia> tanh(0.5f0)\n0.46211717f0\n\njulia> tanh_fast(0.5f0)\n0.46211714f0\n\njulia> hardtanh(0.5f0)\n0.5f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.trelu","page":"Activation Functions","title":"NNlib.trelu","text":"trelu(x, theta=1) = x > theta ? x : 0\n\nThreshold gated rectified linear activation function. 
See \"Zero-bias autoencoders and the benefits of co-adapting features\"\n\njulia> lineplot(trelu, -2, 4, height=7)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ trelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣠⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣀⣀⣀⣀⣀⣀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀4⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#One-More","page":"Activation Functions","title":"One More","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Julia's Base.Math also provides tanh, which can be used as an activation function.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Note that many Flux layers will automatically replace this with NNlib.tanh_fast when called, as Base's tanh is slow enough to sometimes be a bottleneck.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"julia> using UnicodePlots\n\njulia> lineplot(tanh, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⣀⠤⠔⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉│ tanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡰⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⡤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠎⠁⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠊⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⡤⠤⠔⠒⠉⠁⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n 
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ ","category":"page"},{"location":"ecosystem/#The-Julia-Ecosystem-around-Flux","page":"Ecosystem","title":"The Julia Ecosystem around Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.","category":"page"},{"location":"ecosystem/#Flux-models","page":"Ecosystem","title":"Flux models","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Flux's model-zoo contains examples from many domains.","category":"page"},{"location":"ecosystem/#Computer-vision","page":"Ecosystem","title":"Computer vision","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"ObjectDetector.jl provides ready-to-go image detection via YOLO.\nMetalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.\nUNet.jl is a generic UNet implementation.","category":"page"},{"location":"ecosystem/#Natural-language-processing","page":"Ecosystem","title":"Natural language processing","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.\nTextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.","category":"page"},{"location":"ecosystem/#Reinforcement-learning","page":"Ecosystem","title":"Reinforcement 
learning","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"AlphaZero.jl provides a generic, simple and fast implementation of DeepMind's AlphaZero algorithm.\nReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.","category":"page"},{"location":"ecosystem/#Graph-learning","page":"Ecosystem","title":"Graph learning","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.\nGeometricFlux.jl is the first graph neural network library for Julia. \nNeuralOperators.jl enables training infinite-dimensional PDEs by learning a continuous function instead of using the finite element method.\nSeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.","category":"page"},{"location":"ecosystem/#Time-series","page":"Ecosystem","title":"Time series","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"FluxArchitectures.jl is a collection of advanced network architectures for time series forecasting.","category":"page"},{"location":"ecosystem/#Robust-networks","page":"Ecosystem","title":"Robust networks","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Tools-closely-associated-with-Flux","page":"Ecosystem","title":"Tools closely associated with Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Utility tools you're unlikely to have met if you never used 
Flux!","category":"page"},{"location":"ecosystem/#High-level-training-flows","page":"Ecosystem","title":"High-level training flows","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"FastAI.jl is a Julia port of Python's fast.ai library.\nFluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl.\nIgnite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.\nTsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.","category":"page"},{"location":"ecosystem/#Datasets","page":"Ecosystem","title":"Datasets","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Commonly used machine learning datasets are provided by the following packages in the Julia ecosystem:","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"MLDatasets.jl focuses on downloading, unpacking, and accessing benchmark datasets.\nGraphMLDatasets.jl: a library for machine learning datasets on graphs.","category":"page"},{"location":"ecosystem/#Plumbing","page":"Ecosystem","title":"Plumbing","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Tools to put data into the right order for creating a model.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Augmentor.jl is a real-time image augmentation library for increasing the number of training images.\nDataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, 
keypoints and segmentation masks.\nMLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.","category":"page"},{"location":"ecosystem/#Parameters","page":"Ecosystem","title":"Parameters","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"ParameterSchedulers.jl standard scheduling policies for machine learning.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Differentiable-programming","page":"Ecosystem","title":"Differentiable programming","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Packages based on differentiable programming but not necessarily related to Machine Learning. ","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.\nDiffEqFlux.jl provides tools for creating Neural Differential Equations.\nFlux3D.jl shows off machine learning on 3D data.\nRayTracer.jl combines ML with computer vision via a differentiable renderer.\nDuckietown.jl Differentiable Duckietown simulator.\nThe Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.\nAtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.\nDiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.","category":"page"},{"location":"ecosystem/#Probabilistic-programming","page":"Ecosystem","title":"Probabilistic programming","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.\nOmega.jl is a research project aimed at causal, higher-order 
probabilistic programming.\nStheno.jl provides flexible Gaussian processes.","category":"page"},{"location":"ecosystem/#Statistics","page":"Ecosystem","title":"Statistics","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"OnlineStats.jl provides single-pass algorithms for statistics.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Useful-miscellaneous-packages","page":"Ecosystem","title":"Useful miscellaneous packages","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Some useful and random packages!","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.\nMill.jl helps to prototype flexible multi-instance learning models.\nMLMetrics.jl is a utility for scoring models in data science and machine learning.\nTorch.jl exposes torch in Julia.\nValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.\nInvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.\nProgressMeter.jl progress meters for long-running computations.\nTensorBoardLogger.jl easy peasy logging to tensorboard in Julia\nArgParse.jl is a package for parsing command-line arguments to Julia programs.\nParameters.jl types with default field values, keyword constructors and (un-)pack macros.\nBSON.jl is a package for working with the Binary JSON serialisation format.\nDataFrames.jl in-memory tabular data in Julia.\nDrWatson.jl is a scientific project assistant 
software.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Alternatives-to-Flux","page":"Ecosystem","title":"Alternatives to Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Julia has several other libraries for making neural networks. ","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl) \nKnet.jl is a neural network library built around AutoGrad.jl.\nLux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"compat: Explicit or explicit?\nFlux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) 
Lux also uses Zygote, but uses the word \"explicit\" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/functors/#Recursive-transformations-from-Functors.jl","page":"Nested Structures – Functors.jl","title":"Recursive transformations from Functors.jl","text":"","category":"section"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux models are deeply nested structures, and Functors.jl provides tools needed to explore such objects, apply functions to the parameters they contain (e.g. for moving them to gpu), and re-build them.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"compat: Flux ≤ v0.14\nAll layers were previously defined with the Functors.@functor macro. This still works, but it is recommended that you use the new Flux.@layer macro instead. Both allow Flux.setup to see the parameters inside, and gpu to move them to the GPU, but Flux.@layer also overloads printing, and offers a way to define trainable at the same time.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"compat: Functors v0.5\nWith Functors.jl v0.5, which is required by Flux v0.15 and later, every custom type is a functor by default. 
This means that applying Flux.@layer to a type is no longer strictly necessary, but it is still recommended for additional features like pretty-printing.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Functors.jl has its own notes on basic usage for more details. Additionally, the Advanced Model Building and Customisation page covers the use cases of Functors in greater detail.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux.@layer\nFunctors.@leaf\nFunctors.@functor\nFunctors.fmap\nFunctors.fmap_with_path\nFunctors.isleaf\nFunctors.children\nFunctors.fcollect\nFunctors.functor\nFunctors.fmapstructure\nFunctors.fmapstructure_with_path\nFunctors.execute\nFunctors.AbstractWalk\nFunctors.ExcludeWalk\nFunctors.CachedWalk","category":"page"},{"location":"reference/models/functors/#Flux.@layer","page":"Nested Structures – Functors.jl","title":"Flux.@layer","text":"@layer [showtype] MyModel [trainable=(field1,...)]\n\nThis macro adds convenience functionality to a custom type to serve as a neural network layer, as a module, or as an entire model.\n\nThe optional keyword trainable allows you to specify which fields of your model can be trained, instead of assuming all fieldnames(MyModel) to be trainable. Note that it is never necessary to tell Flux to ignore non-array objects such as functions or sizes. This can also be done by defining trainable(::MyModel) for your type.\n\nThe macro also handles overloads of the 3-arg show(::IO, ::MIME\"text/plain\", ::MyModel) for pretty printing. 
The optional argument showtype can take any of the following values:\n\n:expand (default): This will expand the representation of container types like Chain, while maintaining a compact representation of types like Dense containing only arrays.\n:noexpand: This is to be used in case your type contains other layers but you want to keep the representation simple.\n:ignore: To opt out of the pretty printing.\n\nYou probably still want to define 2-arg show(::IO, ::MyModel), the macro does not touch this.\n\nNote that re-running the macro with different options may not remove all methods, you will need to restart.\n\nExample\n\njulia> struct Trio; a; b; c end\n\njulia> tri = Trio(Dense([1.1 2.2], [0.0], tanh), Dense(hcat(3.3), false), Dropout(0.4))\nTrio(Dense(2 => 1, tanh), Dense(1 => 1; bias=false), Dropout(0.4))\n\njulia> Flux.@layer Trio\n\njulia> tri # now the layer is printed like Chain\nTrio(\n Dense(2 => 1, tanh), # 3 parameters\n Dense(1 => 1; bias=false), # 1 parameters\n Dropout(0.4),\n) # Total: 3 arrays, 4 parameters, 240 bytes.\n\njulia> Flux.@layer :noexpand Trio trainable=(a,b)\n\njulia> tri # now the layer is printed compactly\nTrio(Dense(2 => 1, tanh), Dense(1 => 1; bias=false), Dropout(0.4)) # 4 parameters\n\njulia> opt_state = Flux.setup(Adam(), tri); # `c` is not in the optimizer state\n\nThe macro also adds methods to make using Flux with Enzyme easier.\n\nDuplicated(m::Layer) allocates a copy for the gradient (initially zero).\nThis is made callable, (m::Duplicated{<:Layer})(x...) 
= m.val(x...)\nPretty printing for show(io, mime, ::Duplicated{<:Layer})\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.@leaf","page":"Nested Structures – Functors.jl","title":"Functors.@leaf","text":"@leaf T\n\nDefine functor for the type T so that isleaf(x::T) == true.\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.@functor","page":"Nested Structures – Functors.jl","title":"Functors.@functor","text":"@functor T\n@functor T (x,)\n\nAdds methods to functor allowing recursion into objects of type T, and reconstruction. Assumes that T has a constructor accepting all of its fields, which is true unless you have provided an inner constructor which does not.\n\nBy default all fields of T are considered children; this can be restricted by providing a tuple of field names.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> Functors.children(Foo(1,2))\n(x = 1, y = 2)\n\njulia> _, re = Functors.functor(Foo(1,2));\n\njulia> re((10, 20))\nFoo(10, 20)\n\njulia> @functor Foo # same as before, nothing changes\n\njulia> struct TwoThirds a; b; c; end\n\njulia> @functor TwoThirds (a, c)\n\njulia> ch2, re3 = Functors.functor(TwoThirds(10,20,30));\n\njulia> ch2\n(a = 10, c = 30)\n\njulia> re3((\"ten\", \"thirty\"))\nTwoThirds(\"ten\", 20, \"thirty\")\n\njulia> fmap(x -> 10x, TwoThirds(Foo(1,2), Foo(3,4), 56))\nTwoThirds(Foo(10, 20), Foo(3, 4), 560)\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.fmap","page":"Nested Structures – Functors.jl","title":"Functors.fmap","text":"fmap(f, x, ys...; exclude = Functors.isleaf, walk = Functors.DefaultWalk(), [prune])\n\nA structure and type preserving map.\n\nBy default it transforms every leaf node (identified by exclude, default isleaf) by applying f, and otherwise traverses x recursively using functor. Optionally, it may also be associated with objects ys with the same tree structure. 
In that case, f is applied to the corresponding leaf nodes in x and ys.\n\nSee also fmap_with_path and fmapstructure.\n\nExamples\n\njulia> fmap(string, (x=1, y=(2, 3)))\n(x = \"1\", y = (\"2\", \"3\"))\n\njulia> nt = (a = [1,2], b = [23, (45,), (x=6//7, y=())], c = [8,9]);\n\njulia> fmap(println, nt)\n[1, 2]\n23\n45\n6//7\n()\n[8, 9]\n(a = nothing, b = Any[nothing, (nothing,), (x = nothing, y = nothing)], c = nothing)\n\njulia> fmap(println, nt; exclude = x -> x isa Array)\n[1, 2]\nAny[23, (45,), (x = 6//7, y = ())]\n[8, 9]\n(a = nothing, b = nothing, c = nothing)\n\njulia> twice = [1, 2]; # println only acts once on this\n\njulia> fmap(println, (i = twice, ii = 34, iii = [5, 6], iv = (twice, 34), v = 34.0))\n[1, 2]\n34\n[5, 6]\n34\n34.0\n(i = nothing, ii = nothing, iii = nothing, iv = (nothing, nothing), v = nothing)\n\njulia> d1 = Dict(\"x\" => [1,2], \"y\" => 3);\n\njulia> d2 = Dict(\"x\" => [4,5], \"y\" => 6, \"z\" => \"an_extra_value\");\n\njulia> fmap(+, d1, d2) == Dict(\"x\" => [5, 7], \"y\" => 9) # Note that \"z\" is ignored\ntrue\n\nMutable objects which appear more than once are only handled once (by caching f(x) in an IdDict). Thus the relationship x.i === x.iv[1] will be preserved. An immutable object which appears twice is not stored in the cache, thus f(34) will be called twice, and the results will agree only if f is pure.\n\nBy default, almost all container-like types have children to recurse into. 
Arrays of numbers do not.\n\nTo opt out of recursion for custom types use @leaf or pass a custom exclude function.\n\njulia> struct Foo; x; y; end\n\njulia> struct Bar; x; end\n\njulia> m = Foo(Bar([1,2,3]), (4, 5, Bar(Foo(6, 7))));\n\njulia> fmap(x -> 10x, m)\nFoo(Bar([10, 20, 30]), (40, 50, Bar(Foo(60, 70))))\n\njulia> fmap(string, m)\nFoo(Bar(\"[1, 2, 3]\"), (\"4\", \"5\", Bar(Foo(\"6\", \"7\"))))\n\njulia> fmap(string, m, exclude = v -> v isa Bar)\nFoo(\"Bar([1, 2, 3])\", (4, 5, \"Bar(Foo(6, 7))\"))\n\nTo recurse into custom types without reconstructing them afterwards, use fmapstructure.\n\nFor advanced customization of the traversal behaviour, pass a custom walk function that subtypes Functors.AbstractWalk. The call fmap(f, x, ys...; walk = mywalk) will wrap mywalk in ExcludeWalk then CachedWalk. Here, ExcludeWalk is responsible for applying f at excluded nodes. For a low-level interface for executing a user-constructed walk, see execute.\n\njulia> struct MyWalk <: Functors.AbstractWalk end\n\njulia> (::MyWalk)(recurse, x) = x isa Bar ? \"hello\" :\n Functors.DefaultWalk()(recurse, x)\n\njulia> fmap(x -> 10x, m; walk = MyWalk())\nFoo(\"hello\", (40, 50, \"hello\"))\n\nThe behaviour when the same node appears twice can be altered by giving a value to the prune keyword, which is then used in place of all but the first:\n\njulia> twice = [1, 2];\n\njulia> fmap(float, (x = twice, y = [1,2], z = twice); prune = missing)\n(x = [1.0, 2.0], y = [1.0, 2.0], z = missing)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmap_with_path","page":"Nested Structures – Functors.jl","title":"Functors.fmap_with_path","text":"fmap_with_path(f, x, ys...; exclude = isleaf, walk = DefaultWalkWithPath(), [prune])\n\nLike fmap, but also passes a KeyPath to f for each node in the recursion. The KeyPath is a tuple of the indices used to reach the current node from the root of the recursion. 
The KeyPath is constructed by the walk function, and can be used to reconstruct the path to the current node from the root of the recursion.\n\nf has to accept two arguments: the associated KeyPath and the value of the current node.\n\nexclude also receives the KeyPath as its first argument and a node as its second. It should return true if the recursion should not continue on its children and f applied to it.\n\nprune is used to control the behaviour when the same node appears twice, see fmap for more information.\n\nExamples\n\njulia> x = ([1, 2, 3], 4, (a=5, b=Dict(\"A\"=>6, \"B\"=>7), c=Dict(\"C\"=>8, \"D\"=>9)));\n\njulia> exclude(kp, x) = kp == KeyPath(3, :c) || Functors.isleaf(x);\n\njulia> fmap_with_path((kp, x) -> x isa Dict ? nothing : x.^2, x; exclude = exclude)\n([1, 4, 9], 16, (a = 25, b = Dict(\"B\" => 49, \"A\" => 36), c = nothing))\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.isleaf","page":"Nested Structures – Functors.jl","title":"Functors.isleaf","text":"isleaf(x)\n\nReturn true if x has no children according to functor.\n\nExamples\n\njulia> Functors.isleaf(1)\ntrue\n\njulia> Functors.isleaf([2, 3, 4])\ntrue\n\njulia> Functors.isleaf([\"five\", [6, 7]])\nfalse\n\njulia> Functors.isleaf([])\nfalse\n\njulia> Functors.isleaf((8, 9))\nfalse\n\njulia> Functors.isleaf(())\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.children","page":"Nested Structures – Functors.jl","title":"Functors.children","text":"children(x)\n\nReturn the children of x as defined by functor. 
Equivalent to functor(x)[1].\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fcollect","page":"Nested Structures – Functors.jl","title":"Functors.fcollect","text":"fcollect(x; exclude = v -> false)\n\nTraverse x by recursing each child of x as defined by functor and collecting the results into a flat array, ordered by a breadth-first traversal of x, respecting the iteration order of children calls.\n\nDoesn't recurse inside branches rooted at nodes v for which exclude(v) == true. In such cases, the root v is also excluded from the result. By default, exclude always yields false.\n\nSee also children.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> struct Bar; x; end\n\njulia> struct TypeWithNoChildren; x; y; end\n\njulia> @leaf TypeWithNoChildren\n\njulia> m = Foo(Bar([1,2,3]), TypeWithNoChildren(:a, :b))\nFoo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n\njulia> fcollect(m)\n4-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n Bar([1, 2, 3])\n [1, 2, 3]\n TypeWithNoChildren(:a, :b)\n\njulia> fcollect(m, exclude = v -> v isa Bar)\n2-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n TypeWithNoChildren(:a, :b)\n\njulia> fcollect(m, exclude = v -> Functors.isleaf(v))\n2-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n Bar([1, 2, 3])\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.functor","page":"Nested Structures – Functors.jl","title":"Functors.functor","text":"functor(x)\nfunctor(typeof(x), x)\n\nReturns a tuple containing, first, a NamedTuple of the children of x (typically its fields), and second, a reconstruction function. 
This controls the behaviour of fmap.\n\nMethods should be added to functor(::Type{T}, x) for custom types, usually using the macro @functor.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmapstructure","page":"Nested Structures – Functors.jl","title":"Functors.fmapstructure","text":"fmapstructure(f, x, ys...; exclude = isleaf, [prune])\n\nLike fmap, but doesn't preserve the type of custom structs. Instead, it returns a NamedTuple (or a Tuple, or an array), or a nested set of these.\n\nUseful for when the output must not contain custom structs.\n\nSee also fmap and fmapstructure_with_path.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> m = Foo([1,2,3], [4, (5, 6), Foo(7, 8)]);\n\njulia> fmapstructure(x -> 2x, m)\n(x = [2, 4, 6], y = Any[8, (10, 12), (x = 14, y = 16)])\n\njulia> fmapstructure(println, m)\n[1, 2, 3]\n4\n5\n6\n7\n8\n(x = nothing, y = Any[nothing, (nothing, nothing), (x = nothing, y = nothing)])\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmapstructure_with_path","page":"Nested Structures – Functors.jl","title":"Functors.fmapstructure_with_path","text":"fmapstructure_with_path(f, x, ys...; [exclude, prune])\n\nLike fmap_with_path, but doesn't preserve the type of custom structs. Instead, it returns a named tuple, a tuple, an array, a dict, or a nested set of these.\n\nSee also fmapstructure.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.execute","page":"Nested Structures – Functors.jl","title":"Functors.execute","text":"execute(walk, x, ys...)\n\nExecute a walk that recursively calls itself, starting at a node x in a Functors tree, as well as optional associated nodes ys... in other Functors trees. 
Any custom walk function that subtypes Functors.AbstractWalk is permitted.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.AbstractWalk","page":"Nested Structures – Functors.jl","title":"Functors.AbstractWalk","text":"AbstractWalk\n\nAny walk for use with fmap should inherit from this type. A walk subtyping AbstractWalk must satisfy the walk function interface:\n\nstruct MyWalk <: AbstractWalk end\n\nfunction (::MyWalk)(recurse, x, ys...)\n # implement this\nend\n\nThe walk function is called on a node x in a Functors tree. It may also be passed associated nodes ys... in other Functors trees. The walk function recurses further into (x, ys...) by calling recurse on the child nodes. The choice of which nodes to recurse and in what order is custom to the walk.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Functors.ExcludeWalk","page":"Nested Structures – Functors.jl","title":"Functors.ExcludeWalk","text":"ExcludeWalk(walk, fn, exclude)\n\nA walk that recurses nodes (x, ys...) according to walk, except when exclude(x) is true. Then, fn(x, ys...) is applied instead of recursing further.\n\nTypically wraps an existing walk for use with fmap.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Functors.CachedWalk","page":"Nested Structures – Functors.jl","title":"Functors.CachedWalk","text":"CachedWalk(walk[; prune])\n\nA walk that recurses nodes (x, ys...) according to walk and storing the output of the recursion in a cache indexed by x (based on object ID). Whenever the cache already contains x, either:\n\nprune is specified, then it is returned, or\nprune is unspecified, and the previously cached recursion of (x, ys...) 
is returned.\n\nTypically wraps an existing walk for use with fmap.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Moving-models,-or-data,-to-the-GPU","page":"Nested Structures – Functors.jl","title":"Moving models, or data, to the GPU","text":"","category":"section"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux provides some convenience functions based on fmap. Some (f16, f32, f64) change the precision of all arrays in a model. Others are used for moving a model to or from GPU memory:","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"cpu\ngpu(::Any)\ngpu(::Flux.DataLoader)","category":"page"},{"location":"reference/models/functors/#Flux.cpu","page":"Nested Structures – Functors.jl","title":"Flux.cpu","text":"cpu(m)\n\nCopies m onto the CPU, the opposite of gpu. Recurses into structs (thanks to Functors.jl).\n\nExample\n\njulia> m_gpu = Dense(CUDA.randn(2, 5))\nDense(5 => 2) # 12 parameters\n\njulia> m_gpu.bias # matches the given weight matrix\n2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n 0.0\n 0.0\n\njulia> m = m_gpu |> cpu\nDense(5 => 2) # 12 parameters\n\njulia> m.bias\n2-element Vector{Float32}:\n 0.0\n 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Flux.gpu-Tuple{Any}","page":"Nested Structures – Functors.jl","title":"Flux.gpu","text":"gpu(m)\n\nCopies m to the current GPU device (using current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time). It recurses into structs according to Functors.jl.\n\nUse cpu to copy back to ordinary Arrays. See also f32 and f16 to change element type only.\n\nThis function is just defined for convenience around gpu_device, and is equivalent to gpu_device()(m). 
You may consider defining device = gpu_device() once and then using device(m) to move data.\n\nExample\n\njulia> m = Dense(rand(2, 3)) # constructed with Float64 weight matrix\nDense(3 => 2) # 8 parameters\n\njulia> typeof(m.weight)\nMatrix{Float64} (alias for Array{Float64, 2})\n\njulia> m_gpu = gpu(m) # can equivalently be written m_gpu = m |> gpu\nDense(3 => 2) # 8 parameters\n\njulia> typeof(m_gpu.weight)\nCUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}\n\n\n\n\n\n","category":"method"},{"location":"reference/models/functors/#Flux.gpu-Tuple{DataLoader}","page":"Nested Structures – Functors.jl","title":"Flux.gpu","text":"gpu(data::DataLoader)\ncpu(data::DataLoader)\n\nTransforms a given DataLoader to apply gpu or cpu to each batch of data, when iterated over. (If no GPU is available, this does nothing.)\n\nExample\n\njulia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)\n4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)\n with first element:\n (; x = 2×3 Matrix{Float64}, y = 3-element StepRange{Char, Int64})\n\njulia> first(dl)\n(x = [1.0 1.0 1.0; 1.0 1.0 1.0], y = 'a':1:'c')\n\njulia> c_dl = gpu(dl)\n4-element DataLoader(::MLUtils.MappedData{:auto, typeof(gpu), NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}}, batchsize=3)\n with first element:\n (; x = 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element StepRange{Char, Int64})\n\njulia> first(c_dl).x\n2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\nFor large datasets, this is preferred over moving all the data to the GPU before creating the DataLoader, like this:\n\njulia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)\n4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)\n with first element:\n (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element 
UnitRange{Int64})\n\nwarning: Warning\nThis only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.\n\n\n\n\n\n","category":"method"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/losses/#man-losses","page":"Loss Functions","title":"Loss Functions","text":"","category":"section"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Flux provides a large number of common loss functions used for training machine learning models. They are grouped together in the Flux.Losses module.","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Loss functions for supervised learning typically expect as inputs a target y, and a prediction ŷ from your model. 
In Flux's convention, the order of the arguments is the following","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"loss(ŷ, y)","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Most loss functions in Flux have an optional argument agg, denoting the type of aggregation performed over the batch:","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"loss(ŷ, y) # defaults to `mean`\nloss(ŷ, y, agg=sum) # use `sum` for reduction\nloss(ŷ, y, agg=x->sum(x, dims=2)) # partial reduction\nloss(ŷ, y, agg=x->mean(w .* x)) # weighted mean\nloss(ŷ, y, agg=identity) # no aggregation.","category":"page"},{"location":"reference/models/losses/#Function-listing","page":"Loss Functions","title":"Function listing","text":"","category":"section"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Flux.Losses.mae\nFlux.Losses.mse\nFlux.Losses.msle\nFlux.Losses.huber_loss\nFlux.Losses.label_smoothing\nFlux.Losses.crossentropy\nFlux.Losses.logitcrossentropy\nFlux.Losses.binarycrossentropy\nFlux.Losses.logitbinarycrossentropy\nFlux.Losses.kldivergence\nFlux.Losses.poisson_loss\nFlux.Losses.hinge_loss\nFlux.Losses.squared_hinge_loss\nFlux.Losses.dice_coeff_loss\nFlux.Losses.tversky_loss\nFlux.Losses.binary_focal_loss\nFlux.Losses.focal_loss\nFlux.Losses.siamese_contrastive_loss","category":"page"},{"location":"reference/models/losses/#Flux.Losses.mae","page":"Loss Functions","title":"Flux.Losses.mae","text":"mae(ŷ, y; agg = mean)\n\nReturn the loss corresponding to mean absolute error:\n\nagg(abs.(ŷ .- y))\n\nExample\n\njulia> y_model = [1.1, 1.9, 3.1];\n\njulia> Flux.mae(y_model, 1:3)\n0.10000000000000009\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.mse","page":"Loss 
Functions","title":"Flux.Losses.mse","text":"mse(ŷ, y; agg = mean)\n\nReturn the loss corresponding to mean square error:\n\nagg((ŷ .- y) .^ 2)\n\nSee also: mae, msle, crossentropy.\n\nExample\n\njulia> y_model = [1.1, 1.9, 3.1];\n\njulia> y_true = 1:3;\n\njulia> Flux.mse(y_model, y_true)\n0.010000000000000018\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.msle","page":"Loss Functions","title":"Flux.Losses.msle","text":"msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nThe loss corresponding to mean squared logarithmic errors, calculated as\n\nagg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)\n\nThe ϵ == eps term provides numerical stability. Penalizes an under-estimation more than an over-estimation.\n\nExample\n\njulia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)\n0.009084041f0\n\njulia> Flux.msle(Float32[0.9, 1.8, 2.7], 1:3)\n0.011100831f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.huber_loss","page":"Loss Functions","title":"Flux.Losses.huber_loss","text":"huber_loss(ŷ, y; delta = 1, agg = mean)\n\nReturn the mean of the Huber loss given the prediction ŷ and true values y.\n\n | 0.5 * |ŷ - y|^2, for |ŷ - y| <= δ\nHuber loss = |\n | δ * (|ŷ - y| - 0.5 * δ), otherwise\n\nExample\n\njulia> ŷ = [1.1, 2.1, 3.1];\n\njulia> Flux.huber_loss(ŷ, 1:3) # default δ = 1 > |ŷ - y|\n0.005000000000000009\n\njulia> Flux.huber_loss(ŷ, 1:3, delta=0.05) # changes behaviour as |ŷ - y| > δ\n0.003750000000000005\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.label_smoothing","page":"Loss Functions","title":"Flux.Losses.label_smoothing","text":"label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)\n\nReturns smoothed labels, meaning the confidence in the label values is relaxed.\n\nWhen y is given as a one-hot vector or a batch of one-hot vectors, it is calculated as\n\ny .* (1 - α) .+ α / size(y, dims)\n\nwhen y is given as a number or batch of numbers for binary classification, it is 
calculated as\n\ny .* (1 - α) .+ α / 2\n\nin which case the labels are squeezed towards 0.5.\n\nα is a number in interval (0, 1) called the smoothing factor. Higher the value of α larger the smoothing of y.\n\ndims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.\n\nExample\n\njulia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)\n2×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ ⋅ ⋅ 1 ⋅ 1\n 1 1 1 ⋅ 1 ⋅\n\njulia> y_smoothed = Flux.label_smoothing(y, 0.2f0)\n2×6 Matrix{Float32}:\n 0.1 0.1 0.1 0.9 0.1 0.9\n 0.9 0.9 0.9 0.1 0.9 0.1\n\njulia> y_sim = softmax(y .* log(2f0))\n2×6 Matrix{Float32}:\n 0.333333 0.333333 0.333333 0.666667 0.333333 0.666667\n 0.666667 0.666667 0.666667 0.333333 0.666667 0.333333\n\njulia> y_dis = vcat(y_sim[2,:]', y_sim[1,:]')\n2×6 Matrix{Float32}:\n 0.666667 0.666667 0.666667 0.333333 0.666667 0.333333\n 0.333333 0.333333 0.333333 0.666667 0.333333 0.666667\n\njulia> Flux.crossentropy(y_sim, y) < Flux.crossentropy(y_sim, y_smoothed)\ntrue\n\njulia> Flux.crossentropy(y_dis, y) > Flux.crossentropy(y_dis, y_smoothed)\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.crossentropy","page":"Loss Functions","title":"Flux.Losses.crossentropy","text":"crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)\n\nReturn the cross entropy between the given probability distributions; calculated as\n\nagg(-sum(y .* log.(ŷ .+ ϵ); dims))\n\nCross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. 
The prediction ŷ is supposed to sum to one across dims, as would be the case with the output of a softmax operation.\n\nFor numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy .\n\nUse label_smoothing to smooth the true labels as preprocessing before computing the loss.\n\nSee also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.\n\nExample\n\njulia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)\n3×5 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ ⋅ 1\n ⋅ 1 ⋅ 1 ⋅\n ⋅ ⋅ 1 ⋅ ⋅\n\njulia> y_model = softmax(reshape(-7:7, 3, 5) .* 1f0)\n3×5 Matrix{Float32}:\n 0.0900306 0.0900306 0.0900306 0.0900306 0.0900306\n 0.244728 0.244728 0.244728 0.244728 0.244728\n 0.665241 0.665241 0.665241 0.665241 0.665241\n\njulia> sum(y_model; dims=1)\n1×5 Matrix{Float32}:\n 1.0 1.0 1.0 1.0 1.0\n\njulia> Flux.crossentropy(y_model, y_label)\n1.6076053f0\n\njulia> 5 * ans ≈ Flux.crossentropy(y_model, y_label; agg=sum)\ntrue\n\njulia> y_smooth = Flux.label_smoothing(y_label, 0.15f0)\n3×5 Matrix{Float32}:\n 0.9 0.05 0.05 0.05 0.9\n 0.05 0.9 0.05 0.9 0.05\n 0.05 0.05 0.9 0.05 0.05\n\njulia> Flux.crossentropy(y_model, y_smooth)\n1.5776052f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.logitcrossentropy","page":"Loss Functions","title":"Flux.Losses.logitcrossentropy","text":"logitcrossentropy(ŷ, y; dims = 1, agg = mean)\n\nReturn the cross entropy calculated by\n\nagg(-sum(y .* logsoftmax(ŷ; dims); dims))\n\nThis is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.\n\nSee also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.\n\nExample\n\njulia> y_label = Flux.onehotbatch(collect(\"abcabaa\"), 'a':'c')\n3×7 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ 1 ⋅ 1 1\n ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅\n ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅\n\njulia> y_model = reshape(vcat(-9:0, 0:9, 7.5f0), 3, 7)\n3×7 
Matrix{Float32}:\n -9.0 -6.0 -3.0 0.0 2.0 5.0 8.0\n -8.0 -5.0 -2.0 0.0 3.0 6.0 9.0\n -7.0 -4.0 -1.0 1.0 4.0 7.0 7.5\n\njulia> Flux.logitcrossentropy(y_model, y_label)\n1.5791205f0\n\njulia> Flux.crossentropy(softmax(y_model), y_label)\n1.5791197f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.binarycrossentropy","page":"Loss Functions","title":"Flux.Losses.binarycrossentropy","text":"binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nReturn the binary cross-entropy loss, computed as\n\nagg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))\n\nTypically, the prediction ŷ is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recommended over binarycrossentropy for numerical stability.\n\nUse label_smoothing to smooth the y value as preprocessing before computing the loss.\n\nSee also: crossentropy, logitcrossentropy.\n\nExamples\n\njulia> y_bin = Bool[1,0,1]\n3-element Vector{Bool}:\n 1\n 0\n 1\n\njulia> y_prob = softmax(reshape(vcat(1:3, 3:5), 2, 3) .* 1f0)\n2×3 Matrix{Float32}:\n 0.268941 0.5 0.268941\n 0.731059 0.5 0.731059\n\njulia> Flux.binarycrossentropy(y_prob[2,:], y_bin)\n0.43989f0\n\njulia> all(p -> 0 < p < 1, y_prob[2,:]) # else DomainError\ntrue\n\njulia> y_hot = Flux.onehotbatch(y_bin, 0:1)\n2×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ 1 ⋅\n 1 ⋅ 1\n\njulia> Flux.crossentropy(y_prob, y_hot)\n0.43989f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.logitbinarycrossentropy","page":"Loss Functions","title":"Flux.Losses.logitbinarycrossentropy","text":"logitbinarycrossentropy(ŷ, y; agg = mean)\n\nMathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.\n\nSee also: crossentropy, logitcrossentropy.\n\nExamples\n\njulia> y_bin = Bool[1,0,1];\n\njulia> y_model = Float32[2, -1, pi]\n3-element Vector{Float32}:\n 2.0\n -1.0\n 3.1415927\n\njulia> 
Flux.logitbinarycrossentropy(y_model, y_bin)\n0.160832f0\n\njulia> Flux.binarycrossentropy(sigmoid.(y_model), y_bin)\n0.16083185f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.kldivergence","page":"Loss Functions","title":"Flux.Losses.kldivergence","text":"kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nReturn the Kullback-Leibler divergence between the given probability distributions.\n\nThe KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.\n\nExample\n\njulia> p1 = [1 0; 0 1]\n2×2 Matrix{Int64}:\n 1 0\n 0 1\n\njulia> p2 = fill(0.5, 2, 2)\n2×2 Matrix{Float64}:\n 0.5 0.5\n 0.5 0.5\n\njulia> Flux.kldivergence(p2, p1) ≈ log(2)\ntrue\n\njulia> Flux.kldivergence(p2, p1; agg = sum) ≈ 2log(2)\ntrue\n\njulia> Flux.kldivergence(p2, p2; eps = 0) # about -2e-16 with the regulator\n0.0\n\njulia> Flux.kldivergence(p1, p2; eps = 0) # about 17.3 with the regulator\nInf\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.poisson_loss","page":"Loss Functions","title":"Flux.Losses.poisson_loss","text":"poisson_loss(ŷ, y; agg = mean)\n\nReturn how much the predicted distribution ŷ diverges from the expected Poisson distribution y; calculated as -\n\nsum(ŷ .- y .* log.(ŷ)) / size(y, 2)\n\nMore information..\n\nExample\n\njulia> y_model = [1, 3, 3]; # data should only take integral values\n\njulia> Flux.poisson_loss(y_model, 1:3)\n0.5023128522198171\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.hinge_loss","page":"Loss Functions","title":"Flux.Losses.hinge_loss","text":"hinge_loss(ŷ, y; agg = mean)\n\nReturn the hinge_loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as\n\nsum(max.(0, 1 .- ŷ .* y)) / size(y, 2)\n\nUsually used with classifiers like Support Vector Machines. 
See also: squared_hinge_loss\n\nExample\n\njulia> y_true = [1, -1, 1, 1];\n\njulia> y_pred = [0.1, 0.3, 1, 1.5];\n\njulia> Flux.hinge_loss(y_pred, y_true)\n0.55\n\njulia> Flux.hinge_loss(y_pred[1], y_true[1]) != 0 # same sign but |ŷ| < 1\ntrue\n\njulia> Flux.hinge_loss(y_pred[end], y_true[end]) == 0 # same sign but |ŷ| >= 1\ntrue\n\njulia> Flux.hinge_loss(y_pred[2], y_true[2]) != 0 # opposite signs\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.squared_hinge_loss","page":"Loss Functions","title":"Flux.Losses.squared_hinge_loss","text":"squared_hinge_loss(ŷ, y)\n\nReturn the squared hinge_loss loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as\n\nsum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)\n\nUsually used with classifiers like Support Vector Machines. See also: hinge_loss\n\nExample\n\njulia> y_true = [1, -1, 1, 1];\n\njulia> y_pred = [0.1, 0.3, 1, 1.5];\n\njulia> Flux.squared_hinge_loss(y_pred, y_true)\n0.625\n\njulia> Flux.squared_hinge_loss(y_pred[1], y_true[1]) != 0\ntrue\n\njulia> Flux.squared_hinge_loss(y_pred[end], y_true[end]) == 0\ntrue\n\njulia> Flux.squared_hinge_loss(y_pred[2], y_true[2]) != 0\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.dice_coeff_loss","page":"Loss Functions","title":"Flux.Losses.dice_coeff_loss","text":"dice_coeff_loss(ŷ, y; smooth = 1)\n\nReturn a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. 
Loss calculated as:\n\n1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)\n\nExample\n\njulia> y_pred = [1.1, 2.1, 3.1];\n\njulia> Flux.dice_coeff_loss(y_pred, 1:3)\n0.000992391663909964\n\njulia> 1 - Flux.dice_coeff_loss(y_pred, 1:3) # ~ F1 score for image segmentation\n0.99900760833609\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.tversky_loss","page":"Loss Functions","title":"Flux.Losses.tversky_loss","text":"tversky_loss(ŷ, y; beta = 0.7)\n\nReturn the Tversky loss. Used with imbalanced data to give more weight to false negatives. Larger values of β == beta weigh recall more than precision (by placing more emphasis on false negatives). Calculated as:\n\n1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.binary_focal_loss","page":"Loss Functions","title":"Flux.Losses.binary_focal_loss","text":"binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))\n\nReturn the binary_focal_loss. The input, 'ŷ', is expected to be normalized (i.e. softmax output).\n\nFor gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.\n\nSee also: Losses.focal_loss for multi-class setting\n\nExample\n\njulia> y = [0 1 0\n 1 0 1]\n2×3 Matrix{Int64}:\n 0 1 0\n 1 0 1\n\njulia> ŷ = [0.268941 0.5 0.268941\n 0.731059 0.5 0.731059]\n2×3 Matrix{Float64}:\n 0.268941 0.5 0.268941\n 0.731059 0.5 0.731059\n\njulia> Flux.binary_focal_loss(ŷ, y) ≈ 0.0728675615927385\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.focal_loss","page":"Loss Functions","title":"Flux.Losses.focal_loss","text":"focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))\n\nReturn the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. 
softmax output).\n\nThe modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.\n\nExample\n\njulia> y = [1 0 0 0 1\n 0 1 0 1 0\n 0 0 1 0 0]\n3×5 Matrix{Int64}:\n 1 0 0 0 1\n 0 1 0 1 0\n 0 0 1 0 0\n\njulia> ŷ = softmax(reshape(-7:7, 3, 5) .* 1f0)\n3×5 Matrix{Float32}:\n 0.0900306 0.0900306 0.0900306 0.0900306 0.0900306\n 0.244728 0.244728 0.244728 0.244728 0.244728\n 0.665241 0.665241 0.665241 0.665241 0.665241\n\njulia> Flux.focal_loss(ŷ, y) ≈ 1.1277571935622628\ntrue\n\nSee also: Losses.binary_focal_loss for binary (not one-hot) labels\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.siamese_contrastive_loss","page":"Loss Functions","title":"Flux.Losses.siamese_contrastive_loss","text":"siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)\n\nReturn the contrastive loss which can be useful for training Siamese Networks. It is given by\n\nagg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)\n\nSpecify margin to set the baseline for distance at which pairs are dissimilar.\n\nExample\n\njulia> ŷ = [0.5, 1.5, 2.5];\n\njulia> Flux.siamese_contrastive_loss(ŷ, 1:3)\n-4.833333333333333\n\njulia> Flux.siamese_contrastive_loss(ŷ, 1:3, margin = 2)\n-4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#man-layers","page":"Built-in Layers","title":"Built-in Layer Types","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"If you started at the beginning of the guide, then you have already met the basic Dense layer, and seen Chain for combining layers. 
These core layers form the foundation of almost all neural networks.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The Dense layer exemplifies several features:","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"It contains an activation function, which is broadcast over the output. Because this broadcast can be fused with other operations, doing so is more efficient than applying the activation function separately.\nIt takes an init keyword, which accepts a function acting like rand. That is, init(2,3,4) should create an array of this size. Flux has many such functions built-in. All make a CPU array, moved later with gpu if desired.\nThe bias vector is always initialised with Flux.zeros32. The keyword bias=false will turn this off, i.e. keeping the bias permanently zero.\nIt is annotated with @layer, which means that Flux.setup will see the contents, and gpu will move their arrays to the GPU.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"By contrast, Chain itself contains no parameters, but connects other layers together. 
The section on dataflow layers introduces others like this.","category":"page"},{"location":"reference/models/layers/#Fully-Connected","page":"Built-in Layers","title":"Fully Connected","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Dense\nFlux.Bilinear\nFlux.Scale","category":"page"},{"location":"reference/models/layers/#Flux.Dense","page":"Built-in Layers","title":"Flux.Dense","text":"Dense(in => out, σ=identity; bias=true, init=glorot_uniform)\nDense(W::AbstractMatrix, [bias, σ])\n\nCreate a traditional fully connected layer, whose forward pass is given by:\n\ny = σ.(W * x .+ bias)\n\nThe input x should be a vector of length in, or batch of vectors represented as an in × N matrix, or any array with size(x,1) == in. The out y will be a vector of length out, or a batch with size(y) == (out, size(x)[2:end]...)\n\nKeyword bias=false will switch off trainable bias for the layer. The initialisation of the weight matrix is W = init(out, in), calling the function given to keyword init, with default glorot_uniform. 
The weight matrix and/or the bias vector (of length out) may also be provided explicitly.\n\nExamples\n\njulia> model = Dense(5 => 2)\nDense(5 => 2) # 12 parameters\n\njulia> model(rand32(5, 64)) |> size\n(2, 64)\n\njulia> model(rand32(5, 6, 4, 64)) |> size # treated as three batch dimensions\n(2, 6, 4, 64)\n\njulia> model2 = Dense(ones(2, 5), false, tanh) # using provided weight matrix\nDense(5 => 2, tanh; bias=false) # 10 parameters\n\njulia> model2(ones(5))\n2-element Vector{Float64}:\n 0.9999092042625951\n 0.9999092042625951\n\njulia> Flux.trainables(model2) # no trainable bias\n1-element Vector{AbstractArray}:\n [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Bilinear","page":"Built-in Layers","title":"Flux.Bilinear","text":"Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)\nBilinear(W::AbstractArray, [bias, σ])\n\nCreates a layer which is fully connected between two inputs and the output, and otherwise similar to Dense. Its output, given vectors x & y, is another vector z with, for all i ∈ 1:out:\n\nz[i] = σ(x' * W[i,:,:] * y + bias[i])\n\nIf x and y are matrices, then each column of the output z = B(x, y) is of this form, with B the Bilinear layer.\n\nIf the second input y is not given, it is taken to be equal to x, i.e. B(x) == B(x, x)\n\nThe two inputs may also be provided as a tuple, B((x, y)) == B(x, y), which is accepted as the input to a Chain.\n\nIf the two input sizes are the same, in1 == in2, then you may write Bilinear(in => out, σ).\n\nThe initialisation works as for Dense layer, with W = init(out, in1, in2). By default the bias vector is zeros(Float32, out), option bias=false will switch off trainable bias. 
Either of these may be provided explicitly.\n\nExamples\n\njulia> x, y = randn(Float32, 5, 32), randn(Float32, 5, 32);\n\njulia> B = Flux.Bilinear((5, 5) => 7)\nBilinear(5 => 7) # 182 parameters\n\njulia> B(x) |> size # interactions based on one input\n(7, 32)\n\njulia> B(x,y) == B((x,y)) # two inputs, may be given as a tuple\ntrue\n\njulia> sc = SkipConnection(\n Chain(Dense(5 => 20, tanh), Dense(20 => 9, tanh)),\n Flux.Bilinear((9, 5) => 3, bias=false),\n ); # used as the recombinator, with skip as the second input\n\njulia> sc(x) |> size\n(3, 32)\n\njulia> Flux.Bilinear(rand(4,8,16), false, tanh) # first dim of weight is the output\nBilinear((8, 16) => 4, tanh; bias=false) # 512 parameters\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Scale","page":"Built-in Layers","title":"Flux.Scale","text":"Scale(size::Integer..., σ=identity; bias=true, init=ones32)\nScale(scale::AbstractArray, [bias, σ])\n\nCreate an element-wise layer, whose forward pass is given by:\n\ny = σ.(scale .* x .+ bias)\n\nThis uses .* instead of matrix multiplication * of Dense.\n\nThe learnable scale & bias are initialised init(size...) and zeros32(size...), with init=ones32 by default. 
You may specify the function init, turn off trainable bias with bias=false, or provide the array(s) explicitly.\n\nUsed by LayerNorm with affine=true.\n\nExamples\n\njulia> a = Flux.Scale(2)\nScale(2) # 4 parameters\n\njulia> Flux.trainables(a)\n2-element Vector{AbstractArray}:\n Float32[1.0, 1.0]\n Float32[0.0, 0.0]\n\njulia> a([1 2 3])\n2×3 Matrix{Float32}:\n 1.0 2.0 3.0\n 1.0 2.0 3.0\n\njulia> b = Flux.Scale(Float32[1 2 3 4], false, abs2)\nScale(1, 4, abs2; bias=false) # 4 parameters\n\njulia> b([1, 10])\n2×4 Matrix{Float32}:\n 1.0 4.0 9.0 16.0\n 100.0 400.0 900.0 1600.0\n\njulia> Flux.trainables(b)\n1-element Vector{AbstractArray}:\n Float32[1.0 2.0 3.0 4.0]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.","category":"page"},{"location":"reference/models/layers/#Convolution-Models","page":"Built-in Layers","title":"Convolution Models","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers are used to build convolutional neural networks (CNNs).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Besides 2D data such as images, they also work with 1D data, where for instance a stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). 
They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Conv\nConvTranspose\nCrossCor\nDepthwiseConv\nSamePad","category":"page"},{"location":"reference/models/layers/#Flux.Conv","page":"Built-in Layers","title":"Flux.Conv","text":"Conv(filter, in => out, σ = identity;\n stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])\nConv(weight, [bias, activation; stride, pad, dilation])\n\nStandard convolutional layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.\n\nImage data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5,5), a 2-tuple of integers.\n\nTo take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N+2, where size(x, N+1) == in is the number of input channels, and size(x, ndims(x)) is (as always) the number of observations in a batch. Then:\n\nfilter should be a tuple of N integers.\nKeywords stride and dilation should each be either single integer, or a tuple with N integers.\nKeyword pad specifies the number of elements added to the borders of the data array. 
It can be\na single integer for equal padding all around,\na tuple of N integers, to apply the same padding at begin/end of each spatial dimension,\na tuple of 2*N integers, for asymmetric padding, or\nthe singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.\nKeyword groups is expected to be an Int. It specifies the number of groups to divide a convolution into.\n\nKeywords to control initialization of the layer:\n\ninit - Function used to generate initial weights. Defaults to glorot_uniform.\nbias - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to false, or another vector can be provided such as bias = randn(Float32, out).\n\nThe second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself.\n\nSee also ConvTranspose, DepthwiseConv, CrossCor.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = Conv((5,5), 3 => 7, relu; bias = false)\nConv((5, 5), 3 => 7, relu, bias=false) # 525 parameters\n\njulia> layer(xs) |> size\n(96, 96, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2)(xs) |> size\n(48, 48, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2, pad = SamePad())(xs) |> size\n(50, 50, 7, 50)\n\njulia> Conv((1,1), 3 => 7; pad = (20,10,0,0))(xs) |> size\n(130, 100, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2, dilation = 4)(xs) |> size\n(42, 42, 7, 50)\n\njulia> weight = rand(Float32, 3, 4, 5);\n\njulia> bias = zeros(Float32, 5);\n\njulia> layer = Conv(weight, bias, sigmoid) # expects 1 spatial dimension\nConv((3,), 4 => 5, σ) # 65 parameters\n\njulia> layer(randn(Float32, 100, 4, 64)) |> size\n(98, 5, 64)\n\njulia> Flux.trainables(layer) |> length\n2\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.ConvTranspose","page":"Built-in 
Layers","title":"Flux.ConvTranspose","text":"ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])\nConvTranspose(weight, [bias, activation; stride, pad, outpad, dilation])\n\nStandard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.\n\nNote that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.\n\nTo preserve the invertibility of Conv when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. Whereas pad is used to zero-pad the input, outpad only affects the output shape.\n\nParameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.\n\nThe second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself.\n\nSee also Conv for a more detailed description of keywords.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = ConvTranspose((5,5), 3 => 7, relu)\nConvTranspose((5, 5), 3 => 7, relu) # 532 parameters\n\njulia> layer(xs) |> size\n(104, 104, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=2)(xs) |> size\n(203, 203, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=2, outpad=1)(xs) |> size\n(204, 204, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=3, pad=SamePad())(xs) |> size\n(300, 300, 7, 50)\n\njulia> weight = rand(Float32, 3, 4, 5);\n\njulia> bias = zeros(Float32, 4);\n\njulia> layer = ConvTranspose(weight, bias, sigmoid)\nConvTranspose((3,), 5 => 4, σ) # 64 parameters\n\njulia> layer(randn(Float32, 100, 5, 64)) |> size # transposed convolution will increase the dimension size (upsampling)\n(102, 4, 64)\n\njulia> Flux.trainables(layer) |> 
length\n2\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.CrossCor","page":"Built-in Layers","title":"Flux.CrossCor","text":"CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])\nCrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nStandard cross correlation layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.\n\nParameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.\n\nThe second form of the constructor allows you to pass in a pre-constructed weight matrix and bias vector. This is useful when you want to initialize the weights yourself\n\nSee also Conv for more detailed description of keywords.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = CrossCor((5,5), 3 => 6, relu; bias=false)\nCrossCor((5, 5), 3 => 6, relu, bias=false) # 450 parameters\n\njulia> layer(xs) |> size\n(96, 96, 6, 50)\n\njulia> CrossCor((5,5), 3 => 7, stride=3, pad=(2,0))(xs) |> size\n(34, 32, 7, 50)\n\njulia> weight = rand(Float32, 3, 4, 5);\n\njulia> bias = zeros(Float32, 5);\n\njulia> layer = CrossCor(weight, bias, relu)\nCrossCor((3,), 4 => 5, relu) # 65 parameters\n\njulia> layer(randn(Float32, 100, 4, 64)) |> size\n(98, 5, 64)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.DepthwiseConv","page":"Built-in Layers","title":"Flux.DepthwiseConv","text":"DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])\nDepthwiseConv(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nReturn a depthwise convolutional layer, that is a Conv layer with number of groups equal to the number of input channels.\n\nSee Conv for a description of the arguments.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = 
DepthwiseConv((5,5), 3 => 6, relu; bias=false)\nConv((5, 5), 3 => 6, relu, groups=3, bias=false) # 150 parameters \n\njulia> layer(xs) |> size\n(96, 96, 6, 50)\n\njulia> DepthwiseConv((5, 5), 3 => 9, stride=2, pad=2)(xs) |> size\n(50, 50, 9, 50)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.SamePad","page":"Built-in Layers","title":"Flux.SamePad","text":"SamePad()\n\nPassed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).\n\nSee also Conv, MaxPool.\n\nExamples\n\njulia> xs = rand32(100, 100, 3, 50); # a batch of images\n\njulia> layer = Conv((2,2), 3 => 7, pad=SamePad())\nConv((2, 2), 3 => 7, pad=(1, 0, 1, 0)) # 91 parameters\n\njulia> layer(xs) |> size # notice how the dimensions stay the same with this padding\n(100, 100, 7, 50)\n\njulia> layer2 = Conv((2,2), 3 => 7)\nConv((2, 2), 3 => 7) # 91 parameters\n\njulia> layer2(xs) |> size # the output dimension changes as the padding was not \"same\"\n(99, 99, 7, 50)\n\njulia> layer3 = Conv((5, 5), 3 => 7, stride=2, pad=SamePad())\nConv((5, 5), 3 => 7, pad=2, stride=2) # 532 parameters\n\njulia> layer3(xs) |> size # output size = `ceil(input_size/stride)` = 50\n(50, 50, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#MultiHeadAttention","page":"Built-in Layers","title":"MultiHeadAttention","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The basic blocks needed to implement Transformer architectures. 
See also the functional counterparts documented in NNlib's Attention section.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"MultiHeadAttention","category":"page"},{"location":"reference/models/layers/#Flux.MultiHeadAttention","page":"Built-in Layers","title":"Flux.MultiHeadAttention","text":"MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])\n\nThe multi-head dot-product attention layer used in Transformer architectures [1].\n\nReturns the transformed input sequence and the attention scores.\n\n[1] Vaswani et al. \"Attention is all you need.\" Advances in Neural Information Processing Systems. 2017.\n\nArguments\n\ndims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. Can take also simpler forms as b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.\nnheads: number of heads. Default 8.\ninit: weight initializer for the Dense layers. Default glorot_uniform.\nbias : whether pointwise QKVO dense transforms use bias. Default false.\ndropout_prob: dropout probability for the attention scores. Default 0.0.\n\nForward\n\n(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])\n\nThe arguments of the forward pass are:\n\nq_in: Input query array of size (q_in_dim, q_len, batch_size).\nk_in: Input key array of size (k_in_dim, kv_len, batch_size).\nv_in: Input value array of size (v_in_dim, kv_len, batch_size).\nbias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.\nmask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. 
Default nothing.\n\nAlternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).\n\nSee also NNlib.dot_product_attention.\n\nExamples\n\nmha = MultiHeadAttention(64, nheads = 8)\nq = rand(Float32, (64, 10, 32))\nk = rand(Float32, (64, 20, 32))\nv = rand(Float32, (64, 20, 32))\ny, α = mha(q, k, v) \n# [y] = [64, 10, 32]\n# [α] = [20, 10, 8, 32]\n\nmha = MultiHeadAttention(64 => 1024 => 1024, nheads = 8)\ny, α = mha(q) # self-attention\n# [y] = [1024, 10, 32]\n# [α] = [10, 10, 8, 32]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Pooling","page":"Built-in Layers","title":"Pooling","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"AdaptiveMaxPool\nMaxPool\nGlobalMaxPool\nAdaptiveMeanPool\nMeanPool\nGlobalMeanPool","category":"page"},{"location":"reference/models/layers/#Flux.AdaptiveMaxPool","page":"Built-in Layers","title":"Flux.AdaptiveMaxPool","text":"AdaptiveMaxPool(out::NTuple)\n\nAdaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.\n\nExpects as input an array with ndims(x) == N+2, i.e. 
channel and batch dimensions, after the N feature dimensions, where N = length(out).\n\nSee also MaxPool, AdaptiveMeanPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> AdaptiveMaxPool((25, 25))(xs) |> size\n(25, 25, 3, 50)\n\njulia> MaxPool((4,4))(xs) ≈ AdaptiveMaxPool((25, 25))(xs)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.MaxPool","page":"Built-in Layers","title":"Flux.MaxPool","text":"MaxPool(window::NTuple; pad=0, stride=window)\n\nMax pooling layer, which replaces all pixels in a block of size window with one.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).\n\nBy default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().\n\nSee also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> m = Chain(Conv((5, 5), 3 => 7, pad=SamePad()), MaxPool((5, 5), pad=SamePad()))\nChain(\n Conv((5, 5), 3 => 7, pad=2), # 532 parameters\n MaxPool((5, 5), pad=2),\n)\n\njulia> m[1](xs) |> size\n(100, 100, 7, 50)\n\njulia> m(xs) |> size\n(20, 20, 7, 50)\n\njulia> layer = MaxPool((5,), pad=2, stride=(3,)) # one-dimensional window\nMaxPool((5,), pad=2, stride=3)\n\njulia> layer(rand(Float32, 100, 7, 50)) |> size\n(34, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GlobalMaxPool","page":"Built-in Layers","title":"Flux.GlobalMaxPool","text":"GlobalMaxPool()\n\nGlobal max pooling layer.\n\nTransforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.\n\nSee also MaxPool, GlobalMeanPool.\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((3,3), 3 => 7), GlobalMaxPool());\n\njulia> m(xs) |> size\n(1, 1, 7, 
50)\n\njulia> GlobalMaxPool()(rand(3,5,7)) |> size # preserves 2 dimensions\n(1, 5, 7)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.AdaptiveMeanPool","page":"Built-in Layers","title":"Flux.AdaptiveMeanPool","text":"AdaptiveMeanPool(out::NTuple)\n\nAdaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).\n\nSee also MaxPool, AdaptiveMaxPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> AdaptiveMeanPool((25, 25))(xs) |> size\n(25, 25, 3, 50)\n\njulia> MeanPool((4,4))(xs) ≈ AdaptiveMeanPool((25, 25))(xs)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.MeanPool","page":"Built-in Layers","title":"Flux.MeanPool","text":"MeanPool(window::NTuple; pad=0, stride=window)\n\nMean pooling layer, averaging all pixels in a block of size window.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).\n\nBy default the window size is also the stride in each dimension. 
The keyword pad accepts the same options as for the Conv layer, including SamePad().\n\nSee also Conv, MaxPool, AdaptiveMeanPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((5,5), 3 => 7), MeanPool((5,5), pad=SamePad()))\nChain(\n Conv((5, 5), 3 => 7), # 532 parameters\n MeanPool((5, 5), pad=2),\n)\n\njulia> m[1](xs) |> size\n(96, 96, 7, 50)\n\njulia> m(xs) |> size\n(20, 20, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GlobalMeanPool","page":"Built-in Layers","title":"Flux.GlobalMeanPool","text":"GlobalMeanPool()\n\nGlobal mean pooling layer.\n\nTransforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((3,3), 3 => 7), GlobalMeanPool());\n\njulia> m(xs) |> size\n(1, 1, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Upsampling","page":"Built-in Layers","title":"Upsampling","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Upsample\nPixelShuffle","category":"page"},{"location":"reference/models/layers/#Flux.Upsample","page":"Built-in Layers","title":"Flux.Upsample","text":"Upsample(mode = :nearest; [scale, size]) \nUpsample(scale, mode = :nearest)\n\nAn upsampling layer. One of two keywords must be given:\n\nIf scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually. 
Alternatively, keyword size accepts a tuple, to directly specify the leading dimensions of the output.\n\nCurrently supported upsampling modes and corresponding NNlib's methods are:\n\n:nearest -> NNlib.upsample_nearest \n:bilinear -> NNlib.upsample_bilinear\n:trilinear -> NNlib.upsample_trilinear\n\nExamples\n\njulia> m = Upsample(scale = (2, 3))\nUpsample(:nearest, scale = (2, 3))\n\njulia> m(ones(2, 2, 1, 1)) |> size\n(4, 6, 1, 1)\n\njulia> m = Upsample(:bilinear, size = (4, 5))\nUpsample(:bilinear, size = (4, 5))\n\njulia> m(ones(2, 2, 1, 1)) |> size\n(4, 5, 1, 1)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.PixelShuffle","page":"Built-in Layers","title":"Flux.PixelShuffle","text":"PixelShuffle(r::Int)\n\nPixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.\n\nSee NNlib.pixel_shuffle.\n\nExamples\n\njulia> p = PixelShuffle(2);\n\njulia> xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]\n2×2×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 3.1 4.1\n 5.1 6.1\n\n[:, :, 2, 1] =\n 3.2 4.2\n 5.2 6.2\n\n[:, :, 3, 1] =\n 3.3 4.3\n 5.3 6.3\n\n[:, :, 4, 1] =\n 3.4 4.4\n 5.4 6.4\n\njulia> p(xs)\n4×4×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 3.1 3.3 4.1 4.3\n 3.2 3.4 4.2 4.4\n 5.1 5.3 6.1 6.3\n 5.2 5.4 6.2 6.4\n\njulia> xs = [3row + col + channel/10 for row in 1:2, col in 1:3, channel in 1:4, n in 1:1]\n2×3×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 4.1 5.1 6.1\n 7.1 8.1 9.1\n\n[:, :, 2, 1] =\n 4.2 5.2 6.2\n 7.2 8.2 9.2\n\n[:, :, 3, 1] =\n 4.3 5.3 6.3\n 7.3 8.3 9.3\n\n[:, :, 4, 1] =\n 4.4 5.4 6.4\n 7.4 8.4 9.4\n\njulia> p(xs)\n4×6×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 4.1 4.3 5.1 5.3 6.1 6.3\n 4.2 4.4 5.2 5.4 6.2 6.4\n 7.1 7.3 8.1 8.3 9.1 9.3\n 7.2 7.4 8.2 8.4 9.2 9.4\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Embedding-Vectors","page":"Built-in Layers","title":"Embedding 
Vectors","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Flux.Embedding\nFlux.EmbeddingBag","category":"page"},{"location":"reference/models/layers/#Flux.Embedding","page":"Built-in Layers","title":"Flux.Embedding","text":"Embedding(in => out; init=randn32)\n\nA lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.\n\nThis layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.\n\nFor indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).\n\nExamples\n\njulia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))\nEmbedding(26 => 4) # 104 parameters\n\njulia> emb(2) # one column of e.weight (here not random!)\n4-element Vector{Float32}:\n 0.0\n 22.0\n 0.0\n 0.0\n\njulia> emb([3, 1, 20, 14, 4, 15, 7]) # vocabulary indices, in 1:26\n4×7 Matrix{Float32}:\n 0.0 22.0 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n 22.0 0.0 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 22.0 0.0 0.0\n\njulia> ans == emb(Flux.onehotbatch(\"cat&dog\", 'a':'z', 'n'))\ntrue\n\njulia> emb(rand(1:26, (10, 1, 12))) |> size # three batch dimensions\n(4, 10, 1, 12)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.EmbeddingBag","page":"Built-in Layers","title":"Flux.EmbeddingBag","text":"EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)\n\nA lookup table that stores embeddings of dimension out for a vocabulary of size in. 
Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts on a vector of indices which it calls a \"bag\". Their individual embedding vectors are reduced to one, using mean or some other function.\n\nInstead of acting on one \"bag\", such as x::Vector{Int}, the layer can also act on several:\n\nActing on a vector of \"bags\", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).\nAny higher-rank array of integers is interpreted as a collection of \"bags\" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all \"bags\" have the same length.\nA vector of \"bags\" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.\n\nThe \"bag\" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produces a stack of embeddings. 
See details below.\n\nExamples\n\njulia> vocab_size = 26; # embed into 3 dimensions, with non-random vectors:\n\njulia> eb = EmbeddingBag(vocab_size => 3, init=Flux.identity_init(gain=100))\nEmbeddingBag(26 => 3) # 78 parameters\n\njulia> eb([2]) # one bag of 1 item\n3-element Vector{Float32}:\n 0.0\n 100.0\n 0.0\n\njulia> eb([3,3,1]) # one bag of 3 items, one mean embedding\n3-element Vector{Float32}:\n 33.333332\n 0.0\n 66.666664\n\njulia> eb([[3,1,3], [2,1]]) # two bags\n3×2 Matrix{Float32}:\n 33.3333 50.0\n 0.0 50.0\n 66.6667 0.0\n\njulia> eb([1 1 1 1; 1 2 3 4]) # 4 bags each of 2 items, eachcol([1 1 1 1; 1 2 3 4])\n3×4 Matrix{Float32}:\n 100.0 50.0 50.0 50.0\n 0.0 50.0 0.0 0.0\n 0.0 0.0 50.0 0.0\n\njulia> eb(rand(1:26, 10, 5, 5)) |> size # 25 bags each of 10 items\n(3, 5, 5)\n\nAnother way to specify \"many bags of many items\" is to provide a vector data (each in 1:in) and a vector at stating where to split that up into \"bags\". The first bag starts with data[at[1]], the second at data[at[2]], and so on, with no overlaps and nothing left out (thus it requires at[1]==1).\n\njulia> data = [11, 1, 12, 2, 13, 3, 14];\n\njulia> data[1:3], data[4:end]\n([11, 1, 12], [2, 13, 3, 14])\n\njulia> eb(data, [1, 4]) # two bags, of 3 and 4 items\n3×2 Matrix{Float32}:\n 33.3333 0.0\n 0.0 25.0\n 0.0 25.0\n\nFinally, each bag may also be represented as a OneHotMatrix.\n\njulia> eb(Flux.onehotbatch(\"bba\", 'a':'z')) # same as [2,2,1], one bag of 3 items\n3-element Vector{Float32}:\n 33.333332\n 66.666664\n 0.0\n\njulia> eb([Flux.onehotbatch(\"bba\", 'a':'z'), Flux.onehotbatch(\"cc\", 'a':'z')]) # two bags\n3×2 Matrix{Float32}:\n 33.3333 0.0\n 66.6667 0.0\n 0.0 100.0\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#man-dataflow-layers","page":"Built-in Layers","title":"Dataflow Layers, or Containers","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The basic Chain(F, 
G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Chain\nFlux.activations\nMaxout\nSkipConnection\nParallel\nPairwiseFusion","category":"page"},{"location":"reference/models/layers/#Flux.Chain","page":"Built-in Layers","title":"Flux.Chain","text":"Chain(layers...)\nChain(name = layer, ...)\n\nCollects multiple layers / functions to be called in sequence on a given input. Supports indexing and slicing, m[2] or m[1:end-1], and if names are given, m[:name] == m[1] etc.\n\nExamples\n\njulia> m = Chain(x -> x^2, x -> x+1);\n\njulia> m(5) == 26\ntrue\n\njulia> m = Chain(Dense(10 => 5, tanh), Dense(5 => 2));\n\njulia> x = rand32(10, 32);\n\njulia> m(x) == m[2](m[1](x))\ntrue\n\njulia> m2 = Chain(enc = Chain(Flux.flatten, Dense(10 => 5, tanh)), \n dec = Dense(5 => 2));\n\njulia> m2(x) == (m2[:dec] ∘ m2[:enc])(x)\ntrue\n\nA chain may be called with multiple arguments, which is equivalent to calling it with one tuple of these arguments. Such a tuple is understood by Parallel to mean the same as several arguments:\n\njulia> Chain(println, println)(1, 2, 3) # three arguments become a tuple\n(1, 2, 3)\nnothing\n\njulia> Chain(x->@show(x), Parallel(+, inv, abs2))(4, 5) # returns 1/4 + 5^2\nx = (4, 5)\n25.25\n\nFor large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). 
This feature is somewhat experimental, beware!\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.activations","page":"Built-in Layers","title":"Flux.activations","text":"activations(c::Chain, input)\n\nLike calling a Chain, but saves the result of each layer as an output.\n\nExamples\n\njulia> using Flux: activations\n\njulia> c = Chain(x -> x + 1, x -> x * 2, x -> x ^ 3);\n\njulia> activations(c, 1)\n(2, 4, 64)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.Maxout","page":"Built-in Layers","title":"Flux.Maxout","text":"Maxout(layers...)\nMaxout(f, n_alts)\n\nThis contains a number of internal layers, each of which receives the same input. Its output is the elementwise maximum of the internal layers' outputs.\n\nInstead of defining layers individually, you can provide a zero-argument function which constructs them, and the number to construct.\n\nMaxout over linear dense layers satisfies the universal approximation theorem. See Goodfellow, Warde-Farley, Mirza, Courville & Bengio \"Maxout Networks\" https://arxiv.org/abs/1302.4389.\n\nSee also Parallel to reduce with other operators.\n\nExamples\n\njulia> m = Maxout(x -> abs2.(x), x -> x .* 3);\n\njulia> m([-2 -1 0 1 2])\n1×5 Matrix{Int64}:\n 4 1 0 3 6\n\njulia> m3 = Maxout(() -> Dense(5 => 7, tanh), 3)\nMaxout(\n Dense(5 => 7, tanh), # 42 parameters\n Dense(5 => 7, tanh), # 42 parameters\n Dense(5 => 7, tanh), # 42 parameters\n) # Total: 6 arrays, 126 parameters, 816 bytes.\n\njulia> Flux.outputsize(m3, (5, 11))\n(7, 11)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.SkipConnection","page":"Built-in Layers","title":"Flux.SkipConnection","text":"SkipConnection(layer, connection)\n\nCreate a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. 
The first argument to the callable will be propagated through the given layer while the second is the unchanged, \"skipped\" input.\n\nThe simplest \"ResNet\"-type connection is just SkipConnection(layer, +). Here is a more complicated example:\n\njulia> m = Conv((3,3), 4 => 7, pad=(1,1));\n\njulia> x = ones(Float32, 5, 5, 4, 10);\n\njulia> size(m(x)) == (5, 5, 7, 10)\ntrue\n\njulia> sm = SkipConnection(m, (mx, x) -> cat(mx, x, dims=3));\n\njulia> size(sm(x)) == (5, 5, 11, 10)\ntrue\n\nSee also Parallel, Maxout.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Parallel","page":"Built-in Layers","title":"Flux.Parallel","text":"Parallel(connection, layers...)\nParallel(connection; name = layer, ...)\n\nCreate a layer which passes an input array to each path in layers, before reducing the output with connection.\n\nObeys rules similar to broadcasting:\n\nCalled with one input x, this is equivalent to connection([l(x) for l in layers]...).\nWith multiple inputs and just one layer, it is instead connection([layer(x) for x in inputs]...).\nWith multiple inputs and multiple layers, one input is passed to each layer, thus Parallel(+, f, g)(x, y) = f(x) + g(y).\n\nLike Chain, its sub-layers may be given names using the keyword constructor. 
These can be accessed by indexing: m[1] == m[:name] is the first layer.\n\nSee also SkipConnection which is Parallel with one identity, and Maxout which reduces by broadcasting max.\n\nExamples\n\njulia> p = Parallel(+, abs2, sqrt);\n\njulia> p(3, 4) # == 3^2 + √4, two functions two inputs\n11.0\n\njulia> p((3, 4)) # tuple is always splatted\n11.0\n\njulia> p(4) # == 4^2 + √4, one input used twice\n18.0\n\njulia> Parallel(hcat, inv)(1, 2, 4) # one function three inputs\n1×3 Matrix{Float64}:\n 1.0 0.5 0.25\n\nWith Flux layers:\n\njulia> model = Chain(Dense(3 => 5),\n Parallel(vcat, Dense(5 => 4), Chain(Dense(5 => 7), Dense(7 => 4))),\n Dense(8 => 17));\n\njulia> model(rand32(3)) |> size\n(17,)\n\njulia> model2 = Parallel(+; α = Dense(10 => 2, tanh), β = Dense(5 => 2))\nParallel(\n +,\n α = Dense(10 => 2, tanh), # 22 parameters\n β = Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 34 parameters, 344 bytes.\n\njulia> model2(rand32(10), rand32(5)) |> size\n(2,)\n\njulia> model2[:α](rand32(10)) |> size\n(2,)\n\njulia> model2[:β] == model2[2]\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.PairwiseFusion","page":"Built-in Layers","title":"Flux.PairwiseFusion","text":"PairwiseFusion(connection, layers...)\n\nArguments\n\nconnection: A function taking 2 inputs and combining them into a single output \nlayers: The layers whose outputs are combined\n\nInputs\n\nThis layer behaves differently based on input type:\n\nIf input x is a tuple of length N (or the input is xs with N x's), matching the number of layers, \n\nthen each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:\n\nx1 → layer1 → y1 ↘\n connection → layer2 → y2 ↘\n x2 ↗ connection → layer3 → y3\n x3 ↗\n\n... 
or written as:\n\ny1 = layer1(x1)\ny2 = layer2(connection(y1, x2))\ny3 = layer3(connection(y2, x3))\n\nWith just one input, each layer receives the same x combined with the previous output. Thus y = PairwiseFusion(connection, layers...)(x) obeys:\n\ny[1] == layers[1](x)\nfor i in 2:length(layers)\n y[i] == connection(layers[i](y[i-1]), x)\nend\n\nReturns\n\nA tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Recurrent-Models","page":"Built-in Layers","title":"Recurrent Models","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"RNNCell\nRNN\nLSTMCell\nLSTM\nGRUCell\nGRU\nGRUv3Cell\nGRUv3","category":"page"},{"location":"reference/models/layers/#Flux.RNNCell","page":"Built-in Layers","title":"Flux.RNNCell","text":"RNNCell(in => out, σ = tanh; init_kernel = glorot_uniform, \n init_recurrent_kernel = glorot_uniform, bias = true)\n\nThe most basic recurrent layer. Essentially acts as a Dense layer, but with the output fed back into the input each time step.\n\nIn the forward pass, implements the function\n\nh^prime = sigma(W_i x + W_h h + b)\n\nand returns h'.\n\nSee RNN for a layer that processes entire sequences.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\nσ: The non-linearity to apply to the output. Default is tanh.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. 
Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\nrnncell(x, [h])\n\nThe arguments of the forward pass are:\n\nx: The input to the RNN. It should be a vector of size in or a matrix of size in x batch_size.\nh: The hidden state of the RNN. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.\n\nExamples\n\nr = RNNCell(3 => 5)\n\n# A sequence of length 10 and batch size 4\nx = [rand(Float32, 3, 4) for _ in 1:10]\n\n# Initialize the hidden state\nh = zeros(Float32, 5)\n\n# We collect the hidden states in an array `ŷ`\n# in case the loss depends on the entire sequence.\nŷ = []\n\nfor x_t in x\n h = r(x_t, h)\n ŷ = [ŷ..., h] # Cannot use `push!(ŷ, h)` here since mutation \n # is not automatic differentiation friendly yet.\n # Can use `ŷ = vcat(ŷ, [h])` as an alternative.\nend\n\nh # The final hidden state\nŷ # The hidden states at each time step\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.RNN","page":"Built-in Layers","title":"Flux.RNN","text":"RNN(in => out, σ = tanh; init_kernel = glorot_uniform, \n init_recurrent_kernel = glorot_uniform, bias = true)\n\nThe most basic recurrent layer. Essentially acts as a Dense layer, but with the output fed back into the input each time step. \n\nIn the forward pass computes\n\nh_t = sigma(W_i x_t + W_h h_t-1 + b)\n\nfor all len steps t in the input sequence. \n\nSee RNNCell for a layer that processes a single time step.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\nσ: The non-linearity to apply to the output. Default is tanh.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. 
Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\nrnn(x, [h])\n\nThe arguments of the forward pass are:\n\nx: The input to the RNN. It should be a matrix of size in x len or an array of size in x len x batch_size.\nh: The initial hidden state of the RNN. If given, it is a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.\n\nReturns all new hidden states h_t as an array of size out x len x batch_size.\n\nExamples\n\njulia> d_in, d_out, len, batch_size = 4, 6, 3, 5;\n\njulia> x = rand(Float32, (d_in, len, batch_size));\n\njulia> h = zeros(Float32, (d_out, batch_size));\n\njulia> rnn = RNN(d_in => d_out)\nRNN(\n RNNCell(4 => 6, tanh), # 66 parameters\n) # Total: 3 arrays, 66 parameters, 424 bytes.\n\njulia> y = rnn(x, h); # [y] = [d_out, len, batch_size]\n\nSometimes, the initial hidden state is a learnable parameter. In this case, the RNN should be wrapped in a custom struct.\n\nstruct Model\n rnn::RNN\n h0::AbstractVector\nend\n\nFlux.@layer Model\n\n(m::Model)(x) = m.rnn(x, m.h0)\n\nmodel = Model(RNN(32 => 64), zeros(Float32, 64))\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.LSTMCell","page":"Built-in Layers","title":"Flux.LSTMCell","text":"LSTMCell(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nThe Long Short Term Memory cell. Behaves like an RNN but generally exhibits a longer memory span over sequences.\n\nIn the forward pass, computes\n\ni_t = sigma(W_xi x_t + W_hi h_t-1 + b_i)\nf_t = sigma(W_xf x_t + W_hf h_t-1 + b_f)\nc_t = f_t odot c_t-1 + i_t odot tanh(W_xc x_t + W_hc h_t-1 + b_c)\no_t = sigma(W_xo x_t + W_ho h_t-1 + b_o)\nh_t = o_t odot tanh(c_t)\n\nThe LSTMCell returns the new hidden state h_t and cell state c_t for a single time step. 
See also LSTM for a layer that processes entire sequences.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\nlstmcell(x, (h, c))\nlstmcell(x)\n\nThe arguments of the forward pass are:\n\nx: The input to the LSTM. It should be a vector of size in or a matrix of size in x batch_size.\n(h, c): A tuple containing the hidden and cell states of the LSTM. They should be vectors of size out or matrices of size out x batch_size. If not provided, they are assumed to be vectors of zeros.\n\nReturns a tuple (h′, c′) containing the new hidden state and cell state in tensors of size out or out x batch_size. \n\nExamples\n\njulia> l = LSTMCell(3 => 5)\nLSTMCell(3 => 5) # 180 parameters\n\njulia> h = zeros(Float32, 5); # hidden state\n\njulia> c = zeros(Float32, 5); # cell state\n\njulia> x = rand(Float32, 3, 4); # in x batch_size\n\njulia> h′, c′ = l(x, (h, c));\n\njulia> size(h′) # out x batch_size\n(5, 4)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.LSTM","page":"Built-in Layers","title":"Flux.LSTM","text":"LSTM(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nLong Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.\n\nSee this article for a good overview of the internals.\n\nIn the forward pass, computes\n\ni_t = sigma(W_xi x_t + W_hi h_t-1 + b_i)\nf_t = sigma(W_xf x_t + W_hf h_t-1 + b_f)\nc_t = f_t odot c_t-1 + i_t odot tanh(W_xc x_t + W_hc h_t-1 + b_c)\no_t = sigma(W_xo x_t + W_ho h_t-1 + b_o)\nh_t = o_t odot tanh(c_t)\n\nfor all len steps t in the input sequence. 
See LSTMCell for a layer that processes a single time step.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\nlstm(x, (h, c))\nlstm(x)\n\nThe arguments of the forward pass are:\n\nx: The input to the LSTM. It should be a matrix of size in x len or an array of size in x len x batch_size.\n(h, c): A tuple containing the hidden and cell states of the LSTM. They should be vectors of size out or matrices of size out x batch_size. If not provided, they are assumed to be vectors of zeros.\n\nReturns a tuple (h′, c′) containing all new hidden states h_t and cell states c_t in tensors of size out x len or out x len x batch_size.\n\nExamples\n\nstruct Model\n lstm::LSTM\n h0::AbstractVector\n c0::AbstractVector\nend\n\nFlux.@layer Model\n\n(m::Model)(x) = m.lstm(x, (m.h0, m.c0))\n\nd_in, d_out, len, batch_size = 2, 3, 4, 5\nx = rand(Float32, (d_in, len, batch_size))\nmodel = Model(LSTM(d_in => d_out), zeros(Float32, d_out), zeros(Float32, d_out))\nh, c = model(x)\nsize(h) # out x len x batch_size\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GRUCell","page":"Built-in Layers","title":"Flux.GRUCell","text":"GRUCell(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. 
This implements the variant proposed in v1 of the referenced paper.\n\nIn the forward pass, computes\n\nr = sigma(W_xi x + W_hi h + b_i)\nz = sigma(W_xz x + W_hz h + b_z)\nh̃ = tanh(W_xh x + r odot W_hh h + b_h)\nh' = (1 - z) odot h̃ + z odot h\n\nSee also GRU for a layer that processes entire sequences.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\ngrucell(x, h)\ngrucell(x)\n\nThe arguments of the forward pass are:\n\nx: The input to the GRU. It should be a vector of size in or a matrix of size in x batch_size.\nh: The hidden state of the GRU. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.\n\nReturns the new hidden state h' as an array of size out or out x batch_size.\n\nExamples\n\njulia> g = GRUCell(3 => 5)\nGRUCell(3 => 5) # 135 parameters\n\njulia> h = zeros(Float32, 5); # hidden state\n\njulia> x = rand(Float32, 3, 4); # in x batch_size\n\njulia> h′ = g(x, h);\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GRU","page":"Built-in Layers","title":"Flux.GRU","text":"GRU(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.\n\nThe forward pass computes\n\nr_t = sigma(W_xi x_t + W_hi h_t-1 + b_i)\nz_t = sigma(W_xz x_t + W_hz h_t-1 + b_z)\nh̃_t = tanh(W_xh x_t + r_t odot W_hh h_t-1 + b_h)\nh_t = (1 - z_t) odot h̃_t + z_t odot h_t-1\n\nfor all len steps t in the input sequence. 
See GRUCell for a layer that processes a single time step.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\ngru(x, [h])\n\nThe arguments of the forward pass are:\n\nx: The input to the GRU. It should be a matrix of size in x len or an array of size in x len x batch_size.\nh: The initial hidden state of the GRU. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros. \n\nReturns all new hidden states h_t as an array of size out x len x batch_size.\n\nExamples\n\nd_in, d_out, len, batch_size = 2, 3, 4, 5\ngru = GRU(d_in => d_out)\nx = rand(Float32, (d_in, len, batch_size))\nh0 = zeros(Float32, d_out)\nh = gru(x, h0) # out x len x batch_size\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GRUv3Cell","page":"Built-in Layers","title":"Flux.GRUv3Cell","text":"GRUv3Cell(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.\n\nThe forward pass computes\n\nr = sigma(W_xi x + W_hi h + b_i)\nz = sigma(W_xz x + W_hz h + b_z)\nh̃ = tanh(W_xh x + W_hh (r odot W_hh h) + b_h)\nh' = (1 - z) odot h̃ + z odot h\n\nand returns h'. This is a single time step of the GRU.\n\nSee GRUv3 for a layer that processes entire sequences. 
See GRU and GRUCell for variants of this layer.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\nForward\n\ngruv3cell(x, [h])\n\nThe arguments of the forward pass are:\n\nx: The input to the GRU. It should be a vector of size in or a matrix of size in x batch_size.\nh: The hidden state of the GRU. It should be a vector of size out or a matrix of size out x batch_size. If not provided, it is assumed to be a vector of zeros.\n\nReturns the new hidden state h' as an array of size out or out x batch_size.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GRUv3","page":"Built-in Layers","title":"Flux.GRUv3","text":"GRUv3(in => out; init_kernel = glorot_uniform,\n init_recurrent_kernel = glorot_uniform, bias = true)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.\n\nThe forward pass computes\n\nr_t = sigma(W_xi x_t + W_hi h_t-1 + b_i)\nz_t = sigma(W_xz x_t + W_hz h_t-1 + b_z)\nh_t = tanh(W_xh x_t + W_hh (r_t odot W_hh h_t-1) + b_h)\nh_t = (1 - z_t) odot h_t + z_t odot h_t-1\n\nfor all len steps t in the input sequence. See GRUv3Cell for a layer that processes a single time step. See GRU and GRUCell for variants of this layer.\n\nNotice that GRUv3 is not a more advanced version of GRU but only a less popular variant.\n\nArguments\n\nin => out: The input and output dimensions of the layer.\ninit_kernel: The initialization function to use for the input to hidden connection weights. 
Default is glorot_uniform.\ninit_recurrent_kernel: The initialization function to use for the hidden to hidden connection weights. Default is glorot_uniform.\nbias: Whether to include a bias term initialized to zero. Default is true.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Normalisation-and-Regularisation","page":"Built-in Layers","title":"Normalisation & Regularisation","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"BatchNorm\nDropout\nAlphaDropout\nLayerNorm\nInstanceNorm\nGroupNorm\nFlux.normalise","category":"page"},{"location":"reference/models/layers/#Flux.BatchNorm","page":"Built-in Layers","title":"Flux.BatchNorm","text":"BatchNorm(channels::Integer, λ=identity;\n initβ=zeros32, initγ=ones32,\n affine=true, track_stats=true, active=nothing,\n eps=1f-5, momentum= 0.1f0)\n\nBatch Normalization layer. channels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N dimensions, call the N-1th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.\n\nBatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input slice and normalises the input accordingly.\n\nIf affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.\n\nAfter normalisation, elementwise activation λ is applied.\n\nIf track_stats=true, it accumulates mean and var statistics during the training phase, which are then used to renormalize the input in the test phase.\n\nUse testmode! 
during inference.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = BatchNorm(3);\n\njulia> Flux.trainmode!(m);\n\njulia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Dropout","page":"Built-in Layers","title":"Flux.Dropout","text":"Dropout(p; [dims, rng, active])\n\nLayer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.\n\nWhile training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.\n\nBy default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.\n\nBy default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).\n\nKeyword rng lets you specify a custom random number generator. 
(Only supported on the CPU.)\n\nExamples\n\njulia> m = Chain(Dense(ones(3,2)), Dropout(0.4))\nChain(\n Dense(2 => 3), # 9 parameters\n Dropout(0.4),\n)\n\njulia> m(ones(2, 7)) # test mode, no effect\n3×7 Matrix{Float64}:\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n\njulia> Flux.trainmode!(m) # equivalent to use within gradient\nChain(\n Dense(2 => 3), # 9 parameters\n Dropout(0.4, active=true),\n)\n\njulia> m(ones(2, 7))\n3×7 Matrix{Float64}:\n 0.0 0.0 3.33333 0.0 0.0 0.0 0.0\n 3.33333 0.0 3.33333 0.0 3.33333 0.0 3.33333\n 3.33333 3.33333 0.0 3.33333 0.0 0.0 3.33333\n\njulia> y = m(ones(2, 10_000));\n\njulia> using Statistics\n\njulia> mean(y) # is about 2.0, same as in test mode\n1.9989999999999961\n\njulia> mean(iszero, y) # is about 0.4\n0.4003\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.AlphaDropout","page":"Built-in Layers","title":"Flux.AlphaDropout","text":"AlphaDropout(p; [rng, active])\n\nA dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.\n\nDoes nothing to the input once testmode! is true.\n\nExamples\n\njulia> using Statistics\n\njulia> x = randn32(1000,1);\n\njulia> m = Chain(Dense(1000 => 1000, selu), AlphaDropout(0.2));\n\njulia> Flux.trainmode!(m);\n\njulia> y = m(x);\n\njulia> isapprox(std(x), std(y), atol=0.2)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.LayerNorm","page":"Built-in Layers","title":"Flux.LayerNorm","text":"LayerNorm(size..., λ=identity; affine=true, eps=1f-5)\n\nA normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.\n\nIn the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. 
The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.\n\nIf affine=true, it also applies a learnable shift and rescaling using the Scale layer.\n\nSee also BatchNorm, InstanceNorm, GroupNorm, and normalise.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = LayerNorm(3);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y, dims=1:3), ones(1, 1, 1, 2), atol=0.1) && std(y, dims=1:3) != std(xs, dims=1:3)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.InstanceNorm","page":"Built-in Layers","title":"Flux.InstanceNorm","text":"InstanceNorm(channels::Integer, λ=identity;\n initβ=zeros32, initγ=ones32,\n affine=false, track_stats=false,\n eps=1f-5, momentum=0.1f0)\n\nInstance Normalization layer. channels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N > 2 dimensions, call the N-1th the channel dimension. 
For WHCN images it's the usual channel dimension.\n\nInstanceNorm computes the mean and variance for each D_1×...×D_{N-2}×1×1 input slice and normalises the input accordingly.\n\nIf affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.\n\nIf track_stats=true, it accumulates mean and var statistics during the training phase, which are then used to renormalize the input in the test phase.\n\nWarning: the defaults for affine and track_stats used to be true in previous Flux versions (< v0.12).\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = InstanceNorm(3);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y, dims=1:2), ones(1, 1, 3, 2), atol=0.2) && std(y, dims=1:2) != std(xs, dims=1:2)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GroupNorm","page":"Built-in Layers","title":"Flux.GroupNorm","text":"GroupNorm(channels::Int, G::Int, λ = identity;\n initβ = zeros32,\n initγ = ones32,\n affine = true,\n eps = 1f-5,\n momentum = 0.1f0)\n\nGroup Normalization layer.\n\nchannels is the number of channels, the channel dimension of your input. For an array of N dimensions, the N-1th index is the channel dimension.\n\nG is the number of groups along which the statistics are computed. The number of channels must be an integer multiple of the number of groups.\n\nchannels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N > 2 dimensions, call the N-1th the channel dimension. 
For WHCN images it's the usual channel dimension.\n\nIf affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 4, 2); # a batch of 2 images, each having 4 channels\n\njulia> m = GroupNorm(4, 2);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y[:, :, 1:2, 1]), 1, atol=0.1) && std(xs[:, :, 1:2, 1]) != std(y[:, :, 1:2, 1])\ntrue\n\njulia> isapprox(std(y[:, :, 3:4, 2]), 1, atol=0.1) && std(xs[:, :, 3:4, 2]) != std(y[:, :, 3:4, 2])\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.normalise","page":"Built-in Layers","title":"Flux.normalise","text":"normalise(x; dims=ndims(x), eps=1f-5)\n\nNormalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. By default, dims is the last dimension. eps is a small term added to the variance for numerical stability.\n\nExamples\n\njulia> using Statistics\n\njulia> x = [90, 100, 110, 130, 70];\n\njulia> mean(x), std(x; corrected=false)\n(100.0, 20.0)\n\njulia> y = Flux.normalise(x)\n5-element Vector{Float64}:\n -0.4999999999999375\n 0.0\n 0.4999999999999375\n 1.4999999999998124\n -1.4999999999998124\n\njulia> isapprox(std(y; corrected=false), 1, atol=1e-5)\ntrue\n\njulia> x = rand(10:100, 10, 10);\n\njulia> y = Flux.normalise(x, dims=1);\n\njulia> isapprox(std(y; dims=1, corrected=false), ones(1, 10), atol=1e-5)\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Test-vs.-Train","page":"Built-in Layers","title":"Test vs. Train","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Several normalisation layers behave differently under training and inference (testing). 
By default, Flux will automatically determine when a layer evaluation is part of training or inference.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"warning: Warning\nThis automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"testmode!\ntrainmode!","category":"page"},{"location":"reference/models/layers/#Flux.testmode!","page":"Built-in Layers","title":"Flux.testmode!","text":"testmode!(model, [mode]) -> model\n\nSet a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.\n\nIf you manually set a model into test mode, you need to manually place it back into train mode during training phase, using trainmode!.\n\nThere is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.\n\nExample\n\njulia> d = Dropout(0.3)\nDropout(0.3)\n\njulia> testmode!(d) # dropout is now always disabled\nDropout(0.3, active=false)\n\njulia> trainmode!(d) # dropout is now always enabled\nDropout(0.3, active=true)\n\njulia> testmode!(d, :auto) # back to default\nDropout(0.3)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.trainmode!","page":"Built-in Layers","title":"Flux.trainmode!","text":"trainmode!(model) -> model\n\nSet a layer, or all layers in a model, to training mode. 
Opposite to testmode!, see further details there.\n\n\n\n\n\n","category":"function"},{"location":"reference/training/enzyme/#autodiff-enzyme","page":"Gradients – Enzyme.jl","title":"Automatic Differentiation using Enzyme.jl","text":"","category":"section"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Enzyme.jl is a new package for automatic differentiation. Like Zygote.jl, calling gradient(f, x) causes it to hook into the compiler and transform code that is executed while calculating f(x), in order to produce code for ∂f/∂x. But it does so much later in the optimisation process (on LLVM instead of Julia's untyped IR), which you can read about here. It needs far fewer custom rules than Zygote/ChainRules, and in particular is able to support mutation of arrays.","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Flux now builds in support for this, using Enzyme's own Duplicated type. Calling Duplicated on any Flux model which was defined using @layer will allocate space for the gradient, and passing that to gradient (or withgradient, or train!) will then use Enzyme instead of Zygote. 
The gradient functions still return the gradient as usual, which can then be passed to update!:","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"julia> using Flux, Enzyme\n\njulia> model = Chain(Dense(28^2 => 32, sigmoid), Dense(32 => 10), softmax); # from model zoo\n\njulia> dup_model = Enzyme.Duplicated(model) # this allocates space for the gradient\nDuplicated(\n Chain(\n Dense(784 => 32, σ), # 25_120 parameters\n Dense(32 => 10), # 330 parameters\n NNlib.softmax,\n ),\n # norm(∇) ≈ 0.0f0\n) # Total: 4 arrays, 25_450 parameters, 199.391 KiB.\n\njulia> x1 = randn32(28*28, 1); # fake image\n\njulia> y1 = [i==3 for i in 0:9]; # fake label\n\njulia> grads_f = Flux.gradient((m,x,y) -> sum(abs2, m(x) .- y), dup_model, Const(x1), Const(y1)) # uses Enzyme\n((layers = ((weight = Float32[-0.010354728 0.032972857 …\n -0.0014538406], σ = nothing), nothing),), nothing, nothing)","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"The gradient returned here is also stored within dup_model. Both share the same arrays – what is returned is not a copy, just a view of the same memory (wrapped in NamedTuples instead of structs). They will all be set to zero when you call gradient again, then replaced with the new values. Alternatively, gradient(f, args...; zero=false) will add the new gradient to what's already stored.","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Writing Const(x1) is optional, just plain x1 is implicitly constant. Any set of Duplicated and Const arguments may appear in any order, so long as there is at least one Duplicated.","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"The gradient grads_f[1] can be passed to update! as usual. 
But for convenience, you may also use what is stored within Duplicated. These are equivalent ways to perform an update step:","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"julia> opt_state = Flux.setup(Adam(), model)\n\njulia> ans == Flux.setup(Adam(), dup_model)\n\njulia> Flux.update!(opt_state, model, grads_f[1]) # exactly as for Zygote gradients\n\njulia> Flux.update!(opt_state, dup_model) # equivalent new path, Enzyme only","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Instead of using these Flux functions, you can also use Enzyme's own functions directly. Enzyme.gradient works like this:","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"julia> grads_e = Enzyme.gradient(Reverse, (m,x,y) -> sum(abs2, m(x) .- y), model, Const(x1), Const(y1))\n(Chain(Dense(784 => 32, σ), Dense(32 => 10), softmax), nothing, nothing)\n\njulia> grads_f[1].layers[2].bias ≈ grads_e[1].layers[2].bias\ntrue","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Note that what Enzyme.gradient returns is an object like deepcopy(model) of the same type, so grads_e[1] isa Chain. But its fields contain the same gradient.","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"There is also a method of train! 
which similarly takes Duplicated(model):","category":"page"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"julia> opt_state = Flux.setup(Adam(0), model);\n\njulia> Flux.train!((m,x,y) -> sum(abs2, m(x) .- y), dup_model, [(x1, y1)], opt_state)","category":"page"},{"location":"reference/training/enzyme/#Second-order-AD","page":"Gradients – Enzyme.jl","title":"Second-order AD","text":"","category":"section"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"If you calculate a gradient within the loss function, then training will involve 2nd derivatives. While this is in principle supported by Zygote.jl, there are many bugs, and Enzyme.jl is probably a better choice.","category":"page"},{"location":"reference/training/enzyme/#Listing","page":"Gradients – Enzyme.jl","title":"Listing","text":"","category":"section"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Flux.gradient(f, args::Union{Flux.EnzymeCore.Const, Flux.EnzymeCore.Duplicated}...)\nFlux.withgradient(f, args::Union{Flux.EnzymeCore.Const, Flux.EnzymeCore.Duplicated}...)\nFlux.train!(loss, model::Flux.EnzymeCore.Duplicated, data, opt)","category":"page"},{"location":"reference/training/enzyme/#Flux.gradient-Tuple{Any, Vararg{Union{EnzymeCore.Const, EnzymeCore.Duplicated}}}","page":"Gradients – Enzyme.jl","title":"Flux.gradient","text":"gradient(f, args::Union{Const,Duplicated}...)\n\nThis should return the same answer as gradient(f, args...), but it uses Enzyme.jl instead of Zygote.jl to compute the derivative.\n\nOnly available when Enzyme is loaded!\n\nThis method is used when at least one argument is of type Duplicated, and all unspecified arguments are wrapped in Const. Note that Enzyme's Active is not supported.\n\nBesides being returned, the gradient is also stored within the Duplicated object. 
Calling Enzyme.Duplicated(model) allocates space for the gradient, which is zeroed before use when calling gradient. With the keyword zero=false, the new gradient will instead be added to what is already stored.\n\nwarning: Experimental\nEnzyme support like this is new and somewhat experimental. This method was added in Flux 0.15.\n\nExample\n\njulia> using Flux\n\njulia> model = Chain(Dense([3.0;;]));\n\njulia> Flux.gradient(model, [1]) do m, x # computed using Zygote\n sum(abs2, m(x))\n end\n((layers = ((weight = [6.0;;], bias = [6.0], σ = nothing),),), [18.0])\n\njulia> using Enzyme\n\njulia> dup_model = Duplicated(model); # allocates space for gradient\n\njulia> Flux.gradient(dup_model, Const([1])) do m, x # Enzyme, returns the same\n sum(abs2, m(x))\n end\n((layers = ((weight = [6.0;;], bias = [6.0], σ = nothing),),), nothing)\n\njulia> dup_model # same gradient is also stored within Duplicated\nDuplicated(\n Chain(\n Dense(1 => 1), # 2 parameters\n ),\n # norm(∇) ≈ 8.49\n)\n\njulia> Flux.destructure((weight = [6.0;;], bias = [6.0]))[1] |> norm\n8.48528137423857\n\njulia> Flux.gradient(dup_model, [1]; zero=false) do m, x # implicit Const([1]), and grad accumulation\n sum(abs2, m(x))\n end\n((layers = ((weight = [12.0;;], bias = [12.0], σ = nothing),),), nothing)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/enzyme/#Flux.withgradient-Tuple{Any, Vararg{Union{EnzymeCore.Const, EnzymeCore.Duplicated}}}","page":"Gradients – Enzyme.jl","title":"Flux.withgradient","text":"withgradient(f, args::Union{Const,Duplicated}...)\n\nThis should return the same answer as withgradient(f, model, args...), but it uses Enzyme.jl instead of Zygote.jl to compute the derivative.\n\nOnly available when Enzyme is loaded!\n\nwarning: Experimental\nEnzyme support like this is new and somewhat experimental. 
This method was added in Flux 0.15.\n\nExample\n\njulia> using Flux, Enzyme\n\njulia> model = Chain(Embedding([1.1 2.2 3.3]), Dense([4.4;;]), only);\n\njulia> model(3)\n14.52\n\njulia> Flux.withgradient(m -> m(3), model) # this uses Zygote\n(val = 14.52, grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))\n\njulia> Flux.withgradient(m -> m(3), Duplicated(model)) # this uses Enzyme\n(val = 14.52, grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))\n\nThe function f may return Tuple or NamedTuple, with the loss as the first element. The gradient is then grad = gradient(first∘f, args...) but the returned value is val = f(args...):\n\njulia> Flux.withgradient(m -> (m(3), \"aux\"), Duplicated(model))\n(val = (14.52, \"aux\"), grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))\n\njulia> Flux.withgradient(m -> (loss=m(3), aux=round.(m.(1:3); digits=3)), Duplicated(model))\n(val = (loss = 14.52, aux = [4.84, 9.68, 14.52]), grad = ((layers = ((weight = [0.0 0.0 4.4],), (weight = [3.3;;], bias = [1.0], σ = nothing), nothing),),))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/enzyme/#Flux.Train.train!-Tuple{Any, EnzymeCore.Duplicated, Any, Any}","page":"Gradients – Enzyme.jl","title":"Flux.Train.train!","text":"train!(loss, Duplicated(model), data, opt_state)\n\nThis method uses Enzyme.jl instead of Zygote.jl to compute the gradients, but is otherwise the same as train!(loss, model, data, opt_state).\n\nOnly available when Enzyme is loaded.\n\ncompat: New\nThis method was added in Flux 0.13.9.\n\n\n\n\n\n","category":"method"},{"location":"reference/training/enzyme/","page":"Gradients – Enzyme.jl","title":"Gradients – Enzyme.jl","text":"Enzyme.jl has its own extensive documentation.","category":"page"},{"location":"reference/data/mldatadevices/","page":"Transfer Data to GPU – 
MLDataDevices.jl","title":"Transfer Data to GPU – MLDataDevices.jl","text":"CurrentModule = MLDataDevices\nCollapsedDocStrings = true","category":"page"},{"location":"reference/data/mldatadevices/#Transferring-data-across-devices","page":"Transfer Data to GPU – MLDataDevices.jl","title":"Transferring data across devices","text":"","category":"section"},{"location":"reference/data/mldatadevices/","page":"Transfer Data to GPU – MLDataDevices.jl","title":"Transfer Data to GPU – MLDataDevices.jl","text":"Flux relies on the MLDataDevices.jl package to manage devices and transfer data across them. You don't have to explicitly use the package, as Flux re-exports the necessary functions and types.","category":"page"},{"location":"reference/data/mldatadevices/","page":"Transfer Data to GPU – MLDataDevices.jl","title":"Transfer Data to GPU – MLDataDevices.jl","text":"MLDataDevices.cpu_device\nMLDataDevices.default_device_rng\nMLDataDevices.functional\nMLDataDevices.get_device\nMLDataDevices.gpu_device\nMLDataDevices.gpu_backend!\nMLDataDevices.get_device_type\nMLDataDevices.isleaf\nMLDataDevices.loaded\nMLDataDevices.reset_gpu_device!\nMLDataDevices.set_device!\nMLDataDevices.supported_gpu_backends\nMLDataDevices.DeviceIterator","category":"page"},{"location":"reference/data/mldatadevices/#MLDataDevices.cpu_device","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.cpu_device","text":"cpu_device() -> CPUDevice()\n\nReturn a CPUDevice object which can be used to transfer data to CPU.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.default_device_rng","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.default_device_rng","text":"default_device_rng(::AbstractDevice)\n\nReturns the default RNG for the device. 
This can be used to directly generate parameters and states on the device using WeightInitializers.jl.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.functional","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.functional","text":"functional(x::AbstractDevice) -> Bool\nfunctional(::Type{<:AbstractDevice}) -> Bool\n\nChecks if the device is functional. This is used to determine if the device can be used for computation. Note that even if the backend is loaded (as checked via MLDataDevices.loaded), the device may not be functional.\n\nNote that while this function is not exported, it is considered part of the public API.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.get_device","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.get_device","text":"get_device(x) -> dev::AbstractDevice | Exception | Nothing\n\nIf all arrays (on the leaves of the structure) are on the same device, we return that device. Otherwise, we throw an error. If the object is device agnostic, we return nothing.\n\nnote: Note\nTrigger Packages must be loaded for this to return the correct device.\n\nSpecial Returned Values\n\nnothing – denotes that the object is device agnostic. 
For example, scalar, abstract range, etc.\nUnknownDevice() – denotes that the device type is unknown.\n\nSee also get_device_type for a faster alternative that can be used for dispatch based on device type.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.gpu_device","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.gpu_device","text":"gpu_device(device_id::Union{Nothing, Integer}=nothing;\n force::Bool=false) -> AbstractDevice\n\nSelects GPU device based on the following criteria:\n\nIf gpu_backend preference is set and the backend is functional on the system, then that device is selected.\nOtherwise, an automatic selection algorithm is used. We go over possible device backends in the order specified by supported_gpu_backends() and select the first functional backend.\nIf no GPU device is functional and force is false, then cpu_device() is invoked.\nIf nothing works, an error is thrown.\n\nArguments\n\ndevice_id::Union{Nothing, Integer}: The device id to select. If nothing, then we return the last selected device or if none was selected then we run the autoselection and choose the current device using CUDA.device() or AMDGPU.device() or similar. If Integer, then we select the device with the given id. Note that this is 1-indexed, in contrast to the 0-indexed CUDA.jl. For example, id = 4 corresponds to CUDA.device!(3).\n\nwarning: Warning\ndevice_id is only applicable for CUDA and AMDGPU backends. For Metal, oneAPI and CPU backends, device_id is ignored and a warning is printed.\n\nwarning: Warning\ngpu_device won't select a CUDA device unless both CUDA.jl and cuDNN.jl are loaded. This is to ensure that deep learning operations work correctly. Nonetheless, if cuDNN is not loaded you can still manually create a CUDADevice object and use it (e.g. 
dev = CUDADevice()).\n\nKeyword Arguments\n\nforce::Bool: If true, then an error is thrown if no functional GPU device is found.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.gpu_backend!","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.gpu_backend!","text":"gpu_backend!() = gpu_backend!(\"\")\ngpu_backend!(backend) = gpu_backend!(string(backend))\ngpu_backend!(backend::AbstractGPUDevice)\ngpu_backend!(backend::String)\n\nCreates a LocalPreferences.toml file with the desired GPU backend.\n\nIf backend == \"\", then the gpu_backend preference is deleted. Otherwise, backend is validated to be one of the possible backends and the preference is set to backend.\n\nIf a new backend is successfully set, then the Julia session must be restarted for the change to take effect.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.get_device_type","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.get_device_type","text":"get_device_type(x) -> Type{<:AbstractDevice} | Exception | Type{Nothing}\n\nSimilar to get_device but returns the type of the device instead of the device itself. This value is often a compile time constant, and using it instead of get_device is recommended wherever dispatches are defined based on the device type.\n\nnote: Note\nTrigger Packages must be loaded for this to return the correct device.\n\nSpecial Returned Values\n\nNothing – denotes that the object is device agnostic. 
For example, scalar, abstract range, etc.\nUnknownDevice – denotes that the device type is unknown.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.isleaf","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.isleaf","text":"isleaf(x) -> Bool\n\nReturns true if x is a leaf node in the data structure.\n\nDefining MLDataDevices.isleaf(x::T) = true for custom types can be used to customize the data movement behavior when an object with nested structure containing the type is transferred to a device.\n\nAdapt.adapt_structure(::AbstractDevice, x::T) will be called during data movement if isleaf(x::T) is true.\n\nIf MLDataDevices.isleaf(x::T) is not defined, then it will fall back to Functors.isleaf(x).\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.loaded","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.loaded","text":"loaded(x::AbstractDevice) -> Bool\nloaded(::Type{<:AbstractDevice}) -> Bool\n\nChecks if the trigger package for the device is loaded. Trigger packages are as follows:\n\nCUDA.jl and cuDNN.jl (or just LuxCUDA.jl) for NVIDIA CUDA Support.\nAMDGPU.jl for AMD GPU ROCM Support.\nMetal.jl for Apple Metal GPU Support.\noneAPI.jl for Intel oneAPI GPU Support.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.reset_gpu_device!","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.reset_gpu_device!","text":"reset_gpu_device!()\n\nResets the selected GPU device. This is useful when automatic GPU selection needs to be run again.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.set_device!","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.set_device!","text":"set_device!(T::Type{<:AbstractDevice}, dev_or_id)\n\nSet the device for the given type. 
This is a no-op for CPUDevice. For CUDADevice and AMDGPUDevice, it prints a warning if the corresponding trigger package is not loaded.\n\nCurrently, MetalDevice and oneAPIDevice don't support setting the device.\n\nArguments\n\nT::Type{<:AbstractDevice}: The device type to set.\ndev_or_id: Can be the device from the corresponding package. For example for CUDA it can be a CuDevice. If it is an integer, it is the device id to set. This is 1-indexed.\n\ndanger: Danger\nThis specific function should be considered experimental at this point and is currently provided to support distributed training in Lux. As such please use Lux.DistributedUtils instead of using this function.\n\n\n\n\n\nset_device!(T::Type{<:AbstractDevice}, ::Nothing, rank::Integer)\n\nSet the device for the given type. This is a no-op for CPUDevice. For CUDADevice and AMDGPUDevice, it prints a warning if the corresponding trigger package is not loaded.\n\nCurrently, MetalDevice and oneAPIDevice don't support setting the device.\n\nArguments\n\nT::Type{<:AbstractDevice}: The device type to set.\nrank::Integer: Local Rank of the process. This is applicable for distributed training and must be 0-indexed.\n\ndanger: Danger\nThis specific function should be considered experimental at this point and is currently provided to support distributed training in Lux. 
As such please use Lux.DistributedUtils instead of using this function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.supported_gpu_backends","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.supported_gpu_backends","text":"supported_gpu_backends() -> Tuple{String, ...}\n\nReturn a tuple of supported GPU backends.\n\nwarning: Warning\nThis is not the list of functional backends on the system, but rather backends which MLDataDevices.jl supports.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mldatadevices/#MLDataDevices.DeviceIterator","page":"Transfer Data to GPU – MLDataDevices.jl","title":"MLDataDevices.DeviceIterator","text":"DeviceIterator(dev::AbstractDevice, iterator)\n\nCreate a DeviceIterator that iterates through the provided iterator via iterate. Upon each iteration, the current batch is copied to the device dev, and the previous iteration is marked as freeable from GPU memory (via unsafe_free!) (no-op for a CPU device).\n\nThe conversion follows the same semantics as dev().\n\ntip: Similarity to `CUDA.CuIterator`\nThe design inspiration was taken from CUDA.CuIterator and was generalized to work with other backends and more complex iterators (using Functors).\n\ntip: `MLUtils.DataLoader`\nCalling dev(::MLUtils.DataLoader) will automatically convert the dataloader to use the same semantics as DeviceIterator. 
This is generally preferred over looping over the dataloader directly and transferring the data to the device.\n\nExamples\n\nThe following was run on a computer with an NVIDIA GPU.\n\njulia> using MLDataDevices, MLUtils\n\njulia> X = rand(Float64, 3, 33);\n\njulia> dataloader = DataLoader(X; batchsize=13, shuffle=false);\n\njulia> for (i, x) in enumerate(dataloader)\n @show i, summary(x)\n end\n(i, summary(x)) = (1, \"3×13 Matrix{Float64}\")\n(i, summary(x)) = (2, \"3×13 Matrix{Float64}\")\n(i, summary(x)) = (3, \"3×7 Matrix{Float64}\")\n\njulia> for (i, x) in enumerate(CUDADevice()(dataloader))\n @show i, summary(x)\n end\n(i, summary(x)) = (1, \"3×13 CuArray{Float32, 2, CUDA.DeviceMemory}\")\n(i, summary(x)) = (2, \"3×13 CuArray{Float32, 2, CUDA.DeviceMemory}\")\n(i, summary(x)) = (3, \"3×7 CuArray{Float32, 2, CUDA.DeviceMemory}\")\n\n\n\n\n\n","category":"type"},{"location":"tutorials/custom_layers/#man-advanced","page":"Custom Layers","title":"Defining Customised Layers","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Here we will try and describe usage of some more advanced features that Flux provides to give more control over model building.","category":"page"},{"location":"tutorials/custom_layers/#Custom-Model-Example","page":"Custom Layers","title":"Custom Model Example","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Here is a basic example of a custom model. 
It simply adds the input to the result from the neural network.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"struct CustomModel{T <: Chain} # Parameter to avoid type instability\n chain::T\nend\n\nfunction (m::CustomModel)(x)\n # Arbitrary code can go here, but note that everything will be differentiated.\n # Zygote does not allow some operations, like mutating arrays.\n\n return m.chain(x) + x\nend\n\n# This is optional but recommended for pretty printing and other niceties\nFlux.@layer CustomModel","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Notice that we parameterized the type of the chain field. This is necessary for fast Julia code, so that that struct field can be given a concrete type. Chains have a type parameter fully specifying the types of the layers they contain. By using a type parameter, we are freeing Julia to determine the correct concrete type, so that we do not need to specify the full, possibly quite long, type ourselves.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"You can then use the model like:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"chain = Chain(Dense(10 => 10, relu), Dense(10 => 10))\nmodel = CustomModel(chain)\nmodel(rand(Float32, 10))","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"For an intro to Flux and automatic differentiation, see this tutorial.","category":"page"},{"location":"tutorials/custom_layers/#Customising-Parameter-Collection-for-a-Model","page":"Custom Layers","title":"Customising Parameter Collection for a Model","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Taking reference from our example Affine 
layer from the basics.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"By default, all the fields in the Affine type are collected as its parameters. However, in some cases it may be desirable to hold other metadata in our \"layers\" that is not needed for training and should therefore be ignored while the parameters are collected. With Flux, the way to mark some fields of our layer as trainable is through overloading the trainable function:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"julia> struct Affine\n W\n b\n end\n\njulia> Affine(in::Int, out::Int) = Affine(randn(out, in), randn(out));\n\njulia> (m::Affine)(x) = m.W * x .+ m.b;\n\njulia> Flux.@layer Affine\n\njulia> a = Affine(Float32[1 2; 3 4; 5 6], Float32[7, 8, 9])\nAffine(Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], Float32[7.0, 8.0, 9.0])\n\njulia> Flux.trainable(a) # default behavior\n(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], b = Float32[7.0, 8.0, 9.0])\n\njulia> Flux.trainable(a::Affine) = (; W = a.W) # returns a NamedTuple using the field's name\n\njulia> Flux.trainable(a)\n(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0],)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Only the fields returned by trainable will be seen by Flux.setup and Flux.update! for training. But all fields will be seen by gpu and similar functions, for example:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"julia> a |> f16\nAffine(Float16[1.0 2.0; 3.0 4.0; 5.0 6.0], Float16[7.0, 8.0, 9.0])","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Note that there is no need to overload trainable to hide fields which do not contain numerical arrays (for example, activation functions, or Boolean flags). 
These are always ignored by training.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"The exact same method of trainable can also be defined using the macro, for convenience:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux.@layer Affine trainable=(W,)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"There is a second, more severe, kind of restriction possible. This is not recommended, but is included here for completeness. Calling Functors.@functor Affine (W,) means that no exploration of the model will ever visit the other fields: They will not be moved to the GPU by gpu, and their precision will not be changed by f32. This requires the struct to have a corresponding constructor that accepts only W as an argument.","category":"page"},{"location":"tutorials/custom_layers/#Custom-multiple-input-or-output-layer","page":"Custom Layers","title":"Custom multiple input or output layer","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in machine learning literature is the inception module.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We could have a struct that stores the weights along each path and implement the joining/splitting in the forward pass function. That would mean a new struct for each different block, e.g. 
one would have a TransformerBlock struct for a transformer block, and a ResNetBlock struct for a ResNet block, each block being composed of smaller sub-blocks. This is often the simplest and cleanest way to implement complex models.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"This guide instead will show you how to construct a high-level layer (like Chain) that is made of multiple sub-layers for each path. It may be the case that using the layers described below makes the definition of your model harder to read and to change. In that case, consider using the simpler approach of defining a custom structure described above.","category":"page"},{"location":"tutorials/custom_layers/#Multiple-inputs:-a-custom-Join-layer","page":"Custom Layers","title":"Multiple inputs: a custom Join layer","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Our custom Join layer will accept multiple inputs at once, pass each input through a separate path, then combine the results together. Note that this layer can already be constructed using Parallel, but we will first walk through how to do this manually.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We start by defining a new struct, Join, that stores the different paths and a combine operation as its fields.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Flux\nusing CUDA\n\n# custom join layer\nstruct Join{T, F}\n combine::F\n paths::T\nend\n\n# allow Join(op, m1, m2, ...) as a constructor\nJoin(combine, paths...) = Join(combine, paths)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Notice again that we parameterized the type of the combine and paths fields. 
In addition to the performance considerations of concrete types, this allows either field to be Vectors, Tuples, or one of each - we don't need to pay attention to which.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"The next step is to use Flux.@layer to make our struct behave like a Flux layer. In Flux < v0.15 this used to be important so that calling Flux.setup on a Join maps over the underlying trainable arrays on each path. Since Flux v0.15, this is no longer necessary, since now Functors.jl automatically traverses custom types. However, Flux.@layer is still recommended for pretty printing and other niceties.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux.@layer Join","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Finally, we define the forward pass. For Join, this means applying each path in paths to each input array, then using combine to merge the results.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"(m::Join)(xs::Tuple) = m.combine(map((f, x) -> f(x), m.paths, xs)...)\n(m::Join)(xs...) = m(xs)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Lastly, we can test our new layer. 
Thanks to the proper abstractions in Julia, our layer works on GPU arrays out of the box!","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"model = Chain(\n Join(vcat,\n Chain(Dense(1 => 5, relu), Dense(5 => 1)), # branch 1\n Dense(1 => 2), # branch 2\n Dense(1 => 1) # branch 3\n ),\n Dense(4 => 1)\n ) |> gpu\n\nxs = map(gpu, (rand(1), rand(1), rand(1)))\n\nmodel(xs)\n# returns a single float vector with one value","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"note: Note\nThis Join layer is available from the Fluxperimental.jl package.","category":"page"},{"location":"tutorials/custom_layers/#Using-Parallel","page":"Custom Layers","title":"Using Parallel","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux already provides Parallel that can offer the same functionality. In this case, Join is going to just be syntactic sugar for Parallel.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Join(combine, paths) = Parallel(combine, paths)\nJoin(combine, paths...) 
= Join(combine, paths)\n\n# use vararg/tuple version of Parallel forward pass\nmodel = Chain(\n Join(vcat,\n Chain(Dense(1 => 5, relu), Dense(5 => 1)),\n Dense(1 => 2),\n Dense(1 => 1)\n ),\n Dense(4 => 1)\n ) |> gpu\n\nxs = map(gpu, (rand(1), rand(1), rand(1)))\n\nmodel(xs)\n# returns a single float vector with one value","category":"page"},{"location":"tutorials/custom_layers/#Multiple-outputs:-a-custom-Split-layer","page":"Custom Layers","title":"Multiple outputs: a custom Split layer","text":"","category":"section"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Our custom Split layer will accept a single input, then pass the input through a separate path to produce multiple outputs.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We start by following the same steps as the Join layer: define a struct, use Flux.@layer, and define the forward pass.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Flux\nusing CUDA\n\n# custom split layer\nstruct Split{T}\n paths::T\nend\n\nSplit(paths...) 
= Split(paths)\n\nFlux.@layer Split\n\n(m::Split)(x::AbstractArray) = map(f -> f(x), m.paths)","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Now we can test to see that our Split does indeed produce multiple outputs.","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"model = Chain(\n Dense(10 => 5),\n Split(Dense(5 => 1, tanh), Dense(5 => 3, tanh), Dense(5 => 2))\n ) |> gpu\n\nmodel(gpu(rand(10)))\n# returns a tuple with three float vectors","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"A custom loss function for the multiple outputs may look like this:","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Statistics\n\n# assuming model returns the output of a Split\n# x is a single input\n# ys is a tuple of outputs\nfunction loss(x, ys, model)\n # rms over all the mse\n ŷs = model(x)\n return sqrt(mean(Flux.mse(y, ŷ) for (y, ŷ) in zip(ys, ŷs)))\nend","category":"page"},{"location":"tutorials/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"note: Note\nThis Split layer is available from the Fluxperimental.jl package.","category":"page"},{"location":"guide/models/overview/#man-overview","page":"Fitting a Line","title":"Flux Overview: Fitting a Straight Line","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Flux is a pure Julia ML stack that allows you to build predictive models. 
Here are the steps for a typical Flux program:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Provide training and test data\nBuild a model with configurable parameters to make predictions\nIteratively train the model by tweaking the parameters to improve predictions\nVerify your model","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Here's how you'd use Flux to build and train the most basic of models, step by step.","category":"page"},{"location":"guide/models/overview/#A-Trivial-Prediction","page":"Fitting a Line","title":"A Trivial Prediction","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This example will predict the output of the function 4x + 2. Making such predictions is called \"linear regression\", and is really too simple to need a neural network. 
But it's a nice toy example.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"First, import Flux and define the function we want to simulate:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Flux\n\njulia> actual(x) = 4x + 2\nactual (generic function with 1 method)","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This example will build a model to approximate the actual function.","category":"page"},{"location":"guide/models/overview/#1.-Provide-Training-and-Test-Data","page":"Fitting a Line","title":"1. Provide Training and Test Data","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Use the actual function to build sets of data for training and verification:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> x_train, x_test = hcat(0:5...), hcat(6:10...)\n([0 1 … 4 5], [6 7 … 9 10])\n\njulia> y_train, y_test = actual.(x_train), actual.(x_test)\n([2 6 … 18 22], [26 30 … 38 42])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Normally, your training and test data come from real world observations, but here we simulate them.","category":"page"},{"location":"guide/models/overview/#2.-Build-a-Model-to-Make-Predictions","page":"Fitting a Line","title":"2. 
Build a Model to Make Predictions","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, build a model to make predictions with 1 input and 1 output:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> model = Dense(1 => 1)\nDense(1 => 1) # 2 parameters\n\njulia> model.weight\n1×1 Matrix{Float32}:\n 0.95041317\n\njulia> model.bias\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, a dense layer is a struct with fields weight and bias. weight represents a weight matrix and bias represents a bias vector. There's another way to think about a model. In Flux, models are conceptually predictive functions: ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict = Dense(1 => 1)\nDense(1 => 1) # 2 parameters","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Dense(1 => 1) also implements the function σ(Wx+b) where W and b are the weights and biases. σ is an activation function (more on activations later). Our model has one weight and one bias, but typical models will have many more. Think of weights and biases as knobs and levers Flux can use to tune predictions. Activation functions are transformations that tailor models to your needs. 
","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This model will already make predictions, though not accurate ones yet:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict(x_train)\n1×6 Matrix{Float32}:\n 0.0 0.906654 1.81331 2.71996 3.62662 4.53327","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"In order to make better predictions, you'll need to provide a loss function to tell Flux how to objectively evaluate the quality of a prediction. Loss functions compute the cumulative distance between actual values and predictions. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Statistics\n\njulia> loss(model, x, y) = mean(abs2.(model(x) .- y));\n\njulia> loss(predict, x_train, y_train)\n122.64734f0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"More accurate predictions will yield a lower loss. You can write your own loss functions or rely on those already provided by Flux. This loss function is called mean squared error (and built-in as mse). Flux works by iteratively reducing the loss through training.","category":"page"},{"location":"guide/models/overview/#3.-Improve-the-Prediction","page":"Fitting a Line","title":"3. Improve the Prediction","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, the Flux Flux.train! 
function uses a loss function and training data to improve the parameters of your model based on a pluggable optimiser:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Flux: train!\n\njulia> opt = Descent()\nDescent(0.1f0)\n\njulia> data = [(x_train, y_train)]\n1-element Vector{Tuple{Matrix{Int64}, Matrix{Int64}}}:\n ([0 1 … 4 5], [2 6 … 18 22])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, we have the optimiser and data we'll pass to train!. All that remains are the parameters of the model. Remember, each model is a Julia struct with a function and configurable parameters. In particular, the dense layer has weights and biases that depend on the dimensions of the inputs and outputs: ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict.weight\n1×1 Matrix{Float32}:\n 0.9066542\n\njulia> predict.bias\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The dimensions of these model parameters depend on the number of inputs and outputs.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Flux will adjust predictions by iteratively changing these parameters according to the optimiser.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This optimiser implements the classic gradient descent strategy. Now improve the parameters of the model with a call to Flux.train! 
like this:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> train!(loss, predict, data, opt)","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"And check the loss:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> loss(predict, x_train, y_train)\n116.38745f0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"It went down. Why? ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict.weight, predict.bias\n(Float32[7.246838;;], Float32[1.748103])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The parameters have changed. This single step is the essence of machine learning.","category":"page"},{"location":"guide/models/overview/#3.-Iteratively-Train-the-Model","page":"Fitting a Line","title":"3+. Iteratively Train the Model","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"In the previous section, we made a single call to train! which iterates over the data we passed in just once. An epoch refers to one pass over the dataset. Typically, we will run the training for multiple epochs to drive the loss down even further. 
Let's run it a few more times:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> for epoch in 1:200\n train!(loss, predict, data, opt)\n end\n\njulia> loss(predict, x_train, y_train)\n0.00339581f0\n\njulia> predict.weight, predict.bias\n(Float32[4.0159144;;], Float32[2.004479])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After 200 training steps, the loss went down, and the parameters are getting close to those in the function the model is built to predict.","category":"page"},{"location":"guide/models/overview/#4.-Verify-the-Results","page":"Fitting a Line","title":"4. Verify the Results","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, let's verify the predictions:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict(x_test)\n1×5 Matrix{Float32}:\n 26.1121 30.13 34.1479 38.1657 42.1836\n\njulia> y_test\n1×5 Matrix{Int64}:\n 26 30 34 38 42","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The predictions are good. Here's how we got there. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"First, we generated simulated data and stored it in the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Then, we built a single input, single output predictive model, predict = Dense(1 => 1). 
The initial predictions weren't accurate, because we had not trained the model yet.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran train! many times to finish the training process.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After we trained the model, we verified the results with the test data. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/destructure/#man-destructure","page":"Flat vs. Nested","title":"Flat vs. Nested Structures","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"A Flux model is a nested structure, with parameters stored within many layers. Sometimes you may want a flat representation of them, to interact with functions expecting just one vector. This is provided by destructure:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"julia> model = Chain(Dense(2=>1, tanh), Dense(1=>1))\nChain(\n Dense(2 => 1, tanh), # 3 parameters\n Dense(1 => 1), # 2 parameters\n) # Total: 4 arrays, 5 parameters, 276 bytes.\n\njulia> flat, rebuild = Flux.destructure(model)\n(Float32[0.863101, 1.2454957, 0.0, -1.6345707, 0.0], Restructure(Chain, ..., 5))\n\njulia> rebuild(zeros(5)) # same structure, new parameters\nChain(\n Dense(2 => 1, tanh), # 3 parameters (all zero)\n Dense(1 => 1), # 2 parameters (all zero)\n) # Total: 4 arrays, 5 parameters, 276 bytes.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Both destructure and the Restructure function can be used within gradient computations. For instance, this computes the Hessian ∂²L/∂θᵢ∂θⱼ of some loss function, with respect to all parameters of the Flux model. The resulting matrix has off-diagonal entries, which cannot really be expressed in a nested structure:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"julia> x = rand(Float32, 2, 16);\n\njulia> grad = gradient(m -> sum(abs2, m(x)), model) # nested gradient\n((layers = ((weight = Float32[10.339018 11.379145], bias = Float32[22.845667], σ = nothing), (weight = Float32[-29.565302;;], bias = Float32[-37.644184], σ = nothing)),),)\n\njulia> function loss(v::Vector)\n m = rebuild(v)\n y = m(x)\n sum(abs2, y)\n end;\n\njulia> gradient(loss, flat) # flat gradient, same numbers\n(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184],)\n\njulia> Zygote.hessian(loss, flat) # second derivative\n5×5 Matrix{Float32}:\n -7.13131 -5.54714 -11.1393 -12.6504 -8.13492\n -5.54714 -7.11092 -11.0208 -13.9231 -9.36316\n -11.1393 -11.0208 -13.7126 -27.9531 -22.741\n -12.6504 -13.9231 -27.9531 18.0875 23.03\n -8.13492 -9.36316 -22.741 23.03 32.0\n\njulia> Flux.destructure(grad) # acts on non-models, too\n(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184], Restructure(Tuple, ..., 5))","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"In order to collect all parameters of a model into a list instead, you can use the trainables function:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"julia> Flux.trainables(model)\n4-element Vector{AbstractArray}:\n [0.863101 1.2454957]\n [0.0]\n [1.290355429422727;;]\n [0.0]","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Any mutation of the elements of the resulting list will affect the model's parameters.","category":"page"},{"location":"reference/destructure/#All-Parameters","page":"Flat vs. Nested","title":"All Parameters","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"The functions destructure and trainables live in Optimisers.jl.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Optimisers.destructure\nOptimisers.trainable\nOptimisers.trainables\nOptimisers.isnumeric\nFlux.params","category":"page"},{"location":"reference/destructure/#Optimisers.destructure","page":"Flat vs. Nested","title":"Optimisers.destructure","text":"destructure(model) -> vector, reconstructor\n\nCopies all trainable, isnumeric parameters in the model to a vector, and returns also a function which reverses this transformation. Differentiable.\n\nExample\n\njulia> v, re = destructure((x=[1.0, 2.0], y=(sin, [3.0 + 4.0im])))\n(ComplexF64[1.0 + 0.0im, 2.0 + 0.0im, 3.0 + 4.0im], Restructure(NamedTuple, ..., 3))\n\njulia> re([3, 5, 7+11im])\n(x = [3.0, 5.0], y = (sin, ComplexF64[7.0 + 11.0im]))\n\nIf model contains various number types, they are promoted to make vector, and are usually restored by Restructure. Such restoration follows the rules of ChainRulesCore.ProjectTo, and thus will restore floating point precision, but will permit more exotic numbers like ForwardDiff.Dual.\n\nIf model contains only GPU arrays, then vector will also live on the GPU. At present, a mixture of GPU and ordinary CPU arrays is undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.trainable","page":"Flat vs. Nested","title":"Optimisers.trainable","text":"trainable(x::Layer) -> NamedTuple\n\nThis may be overloaded to make optimisers ignore some fields of every Layer, which would otherwise contain trainable parameters.\n\nwarning: Warning\nThis is very rarely required. Fields of struct Layer which contain functions, or integers like sizes, are always ignored anyway. 
Overloading trainable is only necessary when some arrays of numbers are to be optimised, and some arrays of numbers are not.\n\nThe default is Functors.children(x), usually a NamedTuple of all fields, and trainable(x) must contain a subset of these.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.trainables","page":"Flat vs. Nested","title":"Optimisers.trainables","text":"trainables(x, path = false)\n\nReturn an iterable over all the trainable parameters in x, that is all the numerical arrays (see isnumeric) which are reachable through trainable.\n\nParameters appearing multiple times in the model (tied weights) will be present only once in the output.\n\nIf path = false, the output is a list of numerical arrays.\n\nIf path = true, the output is a list of (KeyPath, AbstractArray) pairs, where KeyPath is a type representing the path to the array in the original structure.\n\nSee also destructure for a similar operation that returns a single flat vector instead.\n\nExamples\n\njulia> struct MyLayer\n w\n b\n end\n\njulia> Functors.@functor MyLayer\n\njulia> Optimisers.trainable(x::MyLayer) = (; w = x.w,) # only w is trainable in this example\n\njulia> x = MyLayer([1.0,2.0,3.0], [4.0,5.0,6.0]);\n\njulia> trainables(x)\n1-element Vector{AbstractArray}:\n [1.0, 2.0, 3.0]\n\n julia> x = MyLayer((a=[1.0,2.0], b=[3.0]), [4.0,5.0,6.0]);\n\n julia> trainables(x) # collects nested parameters\n 2-element Vector{AbstractArray}:\n [1.0, 2.0]\n [3.0]\n\njulia> x = (a = [1.0,2.0], b = (Dict(\"c\" => [3.0, 4.0], \"d\" => 5.0), [6.0,7.0]));\n\njulia> for (kp, y) in trainables(x, path = true)\n println(kp, \" => \", y)\n end\nKeyPath(:a,) => [1.0, 2.0]\nKeyPath(:b, 1, \"c\") => [3.0, 4.0]\nKeyPath(:b, 2) => [6.0, 7.0]\n\njulia> getkeypath(x, KeyPath(:b, 1, \"c\"))\n2-element Vector{Float64}:\n 3.0\n 4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.isnumeric","page":"Flat vs. 
Nested","title":"Optimisers.isnumeric","text":"isnumeric(x) -> Bool\n\nReturns true on any parameter to be adjusted by Optimisers.jl, namely arrays of non-integer numbers. Returns false on all other types.\n\nRequires also that Functors.isleaf(x) == true, to focus on e.g. the parent of a transposed matrix, not the wrapper.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Flux.params","page":"Flat vs. Nested","title":"Flux.params","text":"params(model)\n\nReturns a Zygote.Params object containing all parameter arrays from the model. This is deprecated! This function was the cornerstone of how Flux used Zygote's implicit mode gradients, but since Flux 0.13 we use explicit mode gradient(m -> loss(m, x, y), model) instead. To collect all the parameter arrays for other purposes, use Flux.trainables(model).\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#All-Layers","page":"Flat vs. Nested","title":"All Layers","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Another kind of flat view of a nested model is provided by the modules command. This extracts a list of all layers:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Flux.modules","category":"page"},{"location":"reference/destructure/#Flux.modules","page":"Flat vs. Nested","title":"Flux.modules","text":"modules(m)\n\nReturn an iterator over non-leaf objects that can be reached by recursing m over the children given by Functors.functor.\n\nUseful for applying a function (e.g. a regularizer) over specific modules or subsets of the parameters (e.g. 
the weights but not the biases).\n\nExamples\n\njulia> m1 = Chain(Dense(28^2, 64), BatchNorm(64, relu));\n\njulia> m2 = Chain(m1, Dense(64, 10))\nChain(\n Chain(\n Dense(784 => 64), # 50_240 parameters\n BatchNorm(64, relu), # 128 parameters, plus 128\n ),\n Dense(64 => 10), # 650 parameters\n) # Total: 6 trainable arrays, 51_018 parameters,\n # plus 2 non-trainable, 128 parameters, summarysize 200.211 KiB.\n\njulia> Flux.modules(m2)\n7-element Vector{Any}:\n Chain(Chain(Dense(784 => 64), BatchNorm(64, relu)), Dense(64 => 10)) # 51_018 parameters, plus 128 non-trainable\n (Chain(Dense(784 => 64), BatchNorm(64, relu)), Dense(64 => 10))\n Chain(Dense(784 => 64), BatchNorm(64, relu)) # 50_368 parameters, plus 128 non-trainable\n (Dense(784 => 64), BatchNorm(64, relu))\n Dense(784 => 64) # 50_240 parameters\n BatchNorm(64, relu) # 128 parameters, plus 128 non-trainable\n Dense(64 => 10) # 650 parameters\n\njulia> L2(m) = sum(sum(abs2, l.weight) for l in Flux.modules(m) if l isa Dense)\nL2 (generic function with 1 method)\n\njulia> L2(m2) isa Float32\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Save-and-Load","page":"Flat vs. Nested","title":"Save and Load","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Flux.state\nFlux.loadmodel!","category":"page"},{"location":"reference/destructure/#Flux.state","page":"Flat vs. Nested","title":"Flux.state","text":"state(x)\n\nReturn an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).\n\nBesides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. 
The set of leaf types that end up in the state may grow in the future.\n\nThis method is particularly useful for saving and loading models, since the state contains only simple data types that can be easily serialized.\n\nThe state can be passed to loadmodel! to restore the model.\n\nExamples\n\nCopy the state into another model\n\njulia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));\n\njulia> s = Flux.state(m1)\n(layers = ((weight = [1.0; 1.0;;], bias = [0.0, 0.0], σ = ()), (weight = [1.0 1.0], bias = [0.0], σ = ())),)\n\njulia> m2 = Chain(Dense(1, 2, tanh), Dense(2, 1; bias=false)); # weights are random numbers\n\njulia> Flux.loadmodel!(m2, s);\n\njulia> m2[1].weight # now the weights of m2 are the same as m1\n2×1 Matrix{Float32}:\n 1.0\n 1.0\n\njulia> Flux.state(trainmode!(Dropout(0.2))) # contains p & activity, but not RNG state\n(p = 0.2, dims = (), active = true, rng = ())\n\njulia> Flux.state(BatchNorm(1)) # contains non-trainable arrays μ, σ²\n(λ = (), β = Float32[0.0], γ = Float32[1.0], μ = Float32[0.0], σ² = Float32[1.0], ϵ = 1.0f-5, momentum = 0.1f0, affine = true, track_stats = true, active = nothing, chs = 1)\n\nSave and load with BSON\n\njulia> using BSON\n\njulia> BSON.@save \"checkpoint.bson\" model_state = s\n\njulia> Flux.loadmodel!(m2, BSON.load(\"checkpoint.bson\")[:model_state])\n\nSave and load with JLD2\n\njulia> using JLD2\n\njulia> JLD2.jldsave(\"checkpoint.jld2\", model_state = s)\n\njulia> Flux.loadmodel!(m2, JLD2.load(\"checkpoint.jld2\", \"model_state\"))\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Flux.loadmodel!","page":"Flat vs. Nested","title":"Flux.loadmodel!","text":"loadmodel!(dst, src)\n\nCopy all the parameters (trainable and non-trainable) from src into dst.\n\nRecursively walks dst and src together using Functors.children, calling copyto! on parameter arrays or throwing an error when there is a mismatch. 
Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).\n\nSee also Flux.state.\n\nExamples\n\njulia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))\nChain(\n Dense(5 => 2, tanh), # 12 parameters\n Dense(2 => 1), # 3 parameters\n) # Total: 4 arrays, 15 parameters, 316 bytes.\n\njulia> dst[1].weight ≈ ones(2, 5) # by construction\ntrue\n\njulia> src = Chain(Dense(5 => 2, relu), Dense(2 => 1, bias=false));\n\njulia> Flux.loadmodel!(dst, src);\n\njulia> dst[1].weight ≈ ones(2, 5) # values changed\nfalse\n\njulia> iszero(dst[2].bias)\ntrue\n\nExtended help\n\nThrows an error when:\n\ndst and src do not share the same fields (at any level)\nthe sizes of leaf nodes are mismatched between dst and src\ncopying non-array values to/from an array parameter (except inactive parameters described below)\ndst is a \"tied\" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values\n\nInactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#KeyPath","page":"Flat vs. Nested","title":"KeyPath","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Functors.KeyPath\nFunctors.getkeypath\nFunctors.haskeypath\nFunctors.setkeypath!","category":"page"},{"location":"reference/destructure/#Functors.KeyPath","page":"Flat vs. 
Nested","title":"Functors.KeyPath","text":"KeyPath(keys...)\n\nA type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, Int, or CartesianIndex.\n\nFor custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.\n\nFor string, integer, and cartesian index keys, the access is done with getindex instead.\n\nSee also getkeypath, haskeypath.\n\nExamples\n\njulia> kp = KeyPath(:b, 3)\nKeyPath(:b, 3)\n\njulia> KeyPath(:a, kp, :c, 4) # construct mixing keys and keypaths\nKeyPath(:a, :b, 3, :c, 4)\n\njulia> struct T\n a\n b\n end\n\njulia> function Base.getproperty(x::T, k::Symbol)\n if k in fieldnames(T)\n return getfield(x, k)\n elseif k === :ab\n return \"ab\"\n else \n error()\n end\n end;\n\njulia> Base.propertynames(::T) = (:a, :b, :ab);\n\njulia> x = T(3, Dict(:c => 4, :d => 5));\n\njulia> getkeypath(x, KeyPath(:ab)) # equivalent to x.ab\n\"ab\"\n\njulia> getkeypath(x, KeyPath(:b, :c)) # equivalent to (x.b)[:c]\n4\n\n\n\n\n\n","category":"type"},{"location":"reference/destructure/#Functors.getkeypath","page":"Flat vs. Nested","title":"Functors.getkeypath","text":"getkeypath(x, kp::KeyPath)\n\nReturn the value in x at the path kp.\n\nSee also KeyPath, haskeypath, and setkeypath!.\n\nExamples\n\njulia> x = Dict(:a => 3, :b => Dict(:c => 4, \"d\" => [5, 6, 7]))\nDict{Symbol, Any} with 2 entries:\n :a => 3\n :b => Dict{Any, Any}(:c=>4, \"d\"=>[5, 6, 7])\n\njulia> getkeypath(x, KeyPath(:b, \"d\", 2))\n6\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Functors.haskeypath","page":"Flat vs. 
Nested","title":"Functors.haskeypath","text":"haskeypath(x, kp::KeyPath)\n\nReturn true if x has a value at the path kp.\n\nSee also KeyPath, getkeypath, and setkeypath!.\n\nExamples\n\njulia> x = Dict(:a => 3, :b => Dict(:c => 4, \"d\" => [5, 6, 7]))\nDict{Symbol, Any} with 2 entries:\n :a => 3\n :b => Dict{Any, Any}(:c=>4, \"d\"=>[5, 6, 7])\n\njulia> haskeypath(x, KeyPath(:a))\ntrue\n\njulia> haskeypath(x, KeyPath(:b, \"d\", 1))\ntrue\n\njulia> haskeypath(x, KeyPath(:b, \"d\", 4))\nfalse\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Functors.setkeypath!","page":"Flat vs. Nested","title":"Functors.setkeypath!","text":"setkeypath!(x, kp::KeyPath, v)\n\nSet the value in x at the path kp to v.\n\nSee also KeyPath, getkeypath, and haskeypath.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/data/onehot/#One-Hot-Encoding-with-OneHotArrays.jl","page":"OneHotArrays.jl","title":"One-Hot Encoding with OneHotArrays.jl","text":"","category":"section"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"It's common to encode categorical variables (like true, false or cat, dog) in \"one-of-k\" or \"one-hot\" form. OneHotArrays.jl provides the onehot function to make this easy.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> using OneHotArrays\n\njulia> onehot(:b, [:a, :b, :c])\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n 1\n ⋅\n\njulia> onehot(:c, [:a, :b, :c])\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n ⋅\n 1","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"There is also a onecold function, which is an inverse of onehot. 
It can also be given an array of numbers instead of booleans, in which case it performs an argmax-like operation, returning the label with the highest corresponding weight.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> onecold(ans, [:a, :b, :c])\n:c\n\njulia> onecold([true, false, false], [:a, :b, :c])\n:a\n\njulia> onecold([0.3, 0.2, 0.5], [:a, :b, :c])\n:c","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"For multiple samples at once, onehotbatch creates a batch (matrix) of one-hot vectors, and onecold treats matrices as batches.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> using OneHotArrays\n\njulia> onehotbatch([:b, :a, :b], [:a, :b, :c])\n3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ 1 ⋅\n 1 ⋅ 1\n ⋅ ⋅ ⋅\n\njulia> onecold(ans, [:a, :b, :c])\n3-element Vector{Symbol}:\n :b\n :a\n :b","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"Note that these operations returned OneHotVector and OneHotMatrix rather than Arrays. OneHotVectors behave like normal vectors but avoid any unnecessary cost compared to using an integer index directly. 
For example, multiplying a matrix with a one-hot vector simply slices out the relevant row of the matrix under the hood.","category":"page"},{"location":"reference/data/onehot/#Function-listing","page":"OneHotArrays.jl","title":"Function listing","text":"","category":"section"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"OneHotArrays.onehot\nOneHotArrays.onecold\nOneHotArrays.onehotbatch\nOneHotArrays.OneHotArray\nOneHotArrays.OneHotVector\nOneHotArrays.OneHotMatrix","category":"page"},{"location":"reference/data/onehot/#OneHotArrays.onehot","page":"OneHotArrays.jl","title":"OneHotArrays.onehot","text":"onehot(x, labels, [default])\n\nReturns a OneHotVector which is roughly a sparse representation of x .== labels.\n\nInstead of storing say Vector{Bool}, it stores the index of the first occurrence of x in labels. If x is not found in labels, then it either returns onehot(default, labels), or gives an error if no default is given.\n\nSee also onehotbatch to apply this to many xs, and onecold to reverse either of these, as well as to generalise argmax.\n\nExamples\n\njulia> β = onehot(:b, (:a, :b, :c))\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n 1\n ⋅\n\njulia> αβγ = (onehot(0, 0:2), β, onehot(:z, [:a, :b, :c], :c)) # uses default\n(Bool[1, 0, 0], Bool[0, 1, 0], Bool[0, 0, 1])\n\njulia> hcat(αβγ...) 
# preserves sparsity\n3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅\n ⋅ 1 ⋅\n ⋅ ⋅ 1\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.onecold","page":"OneHotArrays.jl","title":"OneHotArrays.onecold","text":"onecold(y::AbstractArray, labels = 1:size(y,1))\n\nRoughly the inverse operation of onehot or onehotbatch: This finds the index of the largest element of y, or each column of y, and looks them up in labels.\n\nIf labels are not specified, the default is integers 1:size(y,1) – the same operation as argmax(y, dims=1) but sometimes a different return type.\n\nExamples\n\njulia> onecold([false, true, false])\n2\n\njulia> onecold([0.3, 0.2, 0.5], (:a, :b, :c))\n:c\n\njulia> onecold([ 1 0 0 1 0 1 0 1 0 0 1\n 0 1 0 0 0 0 0 0 1 0 0\n 0 0 0 0 1 0 0 0 0 0 0\n 0 0 0 0 0 0 1 0 0 0 0\n 0 0 1 0 0 0 0 0 0 1 0 ], 'a':'e') |> String\n\"abeacadabea\"\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.onehotbatch","page":"OneHotArrays.jl","title":"OneHotArrays.onehotbatch","text":"onehotbatch(xs, labels, [default])\n\nReturns a OneHotMatrix where kth column of the matrix is onehot(xs[k], labels). This is a sparse matrix, which stores just a Vector{UInt32} containing the indices of the nonzero elements.\n\nIf one of the inputs in xs is not found in labels, that column is onehot(default, labels) if default is given, else an error.\n\nIf xs has more dimensions, N = ndims(xs) > 1, then the result is an AbstractArray{Bool, N+1} which is one-hot along the first dimension, i.e. result[:, k...] == onehot(xs[k...], labels).\n\nNote that xs can be any iterable, such as a string. 
And that using a tuple for labels will often speed up construction, certainly for less than 32 classes.\n\nExamples\n\njulia> oh = onehotbatch(\"abracadabra\", 'a':'e', 'e')\n5×11 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ 1 ⋅ 1 ⋅ 1 ⋅ ⋅ 1\n ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅\n\njulia> reshape(1:15, 3, 5) * oh # this matrix multiplication is done efficiently\n3×11 Matrix{Int64}:\n 1 4 13 1 7 1 10 1 4 13 1\n 2 5 14 2 8 2 11 2 5 14 2\n 3 6 15 3 9 3 12 3 6 15 3\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.OneHotArray","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotArray","text":"OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}\nOneHotArray(indices, L)\n\nA one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and sum(A, dims=1) == 1) stored as a compact N == M-1-dimensional array of indices.\n\nTypically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/onehot/#OneHotArrays.OneHotVector","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotVector","text":"OneHotVector{T} = OneHotArray{T, 0, 1, T}\nOneHotVector(indices, L)\n\nA one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/onehot/#OneHotArrays.OneHotMatrix","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotMatrix","text":"OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}\nOneHotMatrix(indices, L)\n\nA one-hot matrix (with L labels) typically constructed using onehotbatch. 
Stored efficiently as a vector of indices with type I and eltype T.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/training/zygote/#autodiff-zygote","page":"Gradients – Zygote.jl","title":"Automatic Differentiation using Zygote.jl","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Flux's gradient function uses Zygote by default, and also uses this function within train! to differentiate the model. Zygote has its own documentation, in particular listing some important limitations.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Flux also has support for Enzyme.jl, documented on its own page.","category":"page"},{"location":"reference/training/zygote/#Explicit-style","page":"Gradients – Zygote.jl","title":"Explicit style","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"The preferred way of using Zygote, and the only way of using most other AD packages, is to explicitly provide a function and its arguments.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Zygote.gradient(f, args...)\nZygote.withgradient(f, args...)\nZygote.jacobian(f, args...)\nZygote.withjacobian(f, args...)\nZygote.hessian\nZygote.hessian_reverse\nZygote.diaghessian\nZygote.pullback","category":"page"},{"location":"reference/training/zygote/#Zygote.gradient-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.gradient","text":"gradient(f, args...)\n\nReturns a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or the gradient. 
If no gradient is defined, ∂f/∂x will be nothing.\n\nf(args...) must be a real number, see jacobian for array output.\n\nSee also withgradient to keep the value f(args...), and pullback for value and back-propagator.\n\njulia> gradient(*, 2.0, 3.0, 5.0)\n(15.0, 10.0, 6.0)\n\njulia> gradient(x -> sum(abs2,x), [7.0, 11.0, 13.0])\n([14.0, 22.0, 26.0],)\n\njulia> gradient([7, 11], 0, 1) do x, y, d\n p = size(x, d)\n sum(x.^p .+ y)\n end\n([14.0, 22.0], 2.0, nothing)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.withgradient-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.withgradient","text":"withgradient(f, args...)\nwithgradient(f, ::Params)\n\nReturns both the value of the function and the gradient, as a named tuple.\n\njulia> y, ∇ = withgradient(/, 1, 2)\n(val = 0.5, grad = (0.5, -0.25))\n\njulia> ∇ == gradient(/, 1, 2)\ntrue\n\nAllows you to capture auxiliary outputs, in addition to the scalar used by gradient. To do this, f must return a Tuple or NamedTuple. Then it calculates grad = gradient(first∘f, args...) but returns the whole val = f(args...):\n\njulia> withgradient([1,2,4]) do x\n z = 1 ./ x\n sum(z), z # here z is an auxiliary output\n end\n(val = (1.75, [1.0, 0.5, 0.25]), grad = ([-1.0, -0.25, -0.0625],))\n\njulia> withgradient(3.0, 4.0) do x, y\n (div = x/y, mul = x*y)\n end\n(val = (div = 0.75, mul = 12.0), grad = (0.25, -0.1875))\n\nAlso supports implicit mode:\n\njulia> w = [3.0];\n\njulia> res = withgradient(() -> sum(abs2, w), Params([w]))\n(val = 9.0, grad = Grads(...))\n\njulia> res.grad[w]\n1-element Vector{Float64}:\n 6.0\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.jacobian-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.jacobian","text":"jacobian(f, args...) -> Tuple\n\nFor each array a ∈ args this returns a matrix with Ja[k,i] = ∂y[k]/∂a[i] where y = f(args...) is usually a vector. 
Arrays of higher dimension are treated like vec(a), or vec(y) for output.\n\nFor scalar x::Number ∈ args, the result is a vector Jx[k] = ∂y[k]/∂x, while for scalar y all results have just one row.\n\nWith any other argument type, no result is produced, even if gradient would work.\n\nThis reverse-mode Jacobian needs to evaluate the pullback once for each element of y. Doing so is usually only efficient when length(y) is small compared to length(a), otherwise forward mode is likely to be better.\n\nSee also withjacobian, hessian, hessian_reverse.\n\nExamples\n\njulia> jacobian(a -> 100*a[1:3].^2, 1:7)[1] # first index (rows) is output\n3×7 Matrix{Int64}:\n 200 0 0 0 0 0 0\n 0 400 0 0 0 0 0\n 0 0 600 0 0 0 0\n\njulia> jacobian((a,x) -> a.^2 .* x, [1,2,3], 1) # scalar argument has vector jacobian\n([2 0 0; 0 4 0; 0 0 6], [1, 4, 9])\n\njulia> jacobian((a,d) -> prod(a, dims=d), [1 2; 3 4; 5 6], 2)\n([2 0 … 0 0; 0 4 … 3 0; 0 0 … 0 5], [0, 0, 0])\n\nwarning: Warning\nFor arguments of any type except Number & AbstractArray, the result is nothing.\n\njulia> jacobian((a,s) -> a.^length(s), [1,2,3], \"str\")\n([3 0 0; 0 12 0; 0 0 27], nothing)\n\njulia> jacobian((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5))\n([4 4 4], nothing)\n\njulia> gradient((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5)) # gradient understands the tuple\n([4 4 4], (6, 1))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.withjacobian-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.withjacobian","text":"withjacobian(f, args...)\n\nReturns both the value f(args...) and the jacobian as a named tuple.\n\njulia> withjacobian(cumsum, [1,2,3])\n(val = [1, 3, 6], grad = ([1 0 0; 1 1 0; 1 1 1],))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.hessian","page":"Gradients – Zygote.jl","title":"Zygote.hessian","text":"hessian(f, x)\n\nConstruct the Hessian ∂²f/∂x², where x is a real number or an array, and f(x) is a real number. 
When x is an array, the result is a matrix H[i,j] = ∂²f/∂x[i]∂x[j], using linear indexing x[i] even if the argument is higher-dimensional.\n\nThis uses forward over reverse, ForwardDiff over Zygote, calling hessian_dual(f, x). See hessian_reverse for an all-Zygote alternative.\n\nSee also diaghessian to compute only the diagonal part.\n\nExamples\n\njulia> hessian(x -> x[1]*x[2], randn(2))\n2×2 Matrix{Float64}:\n 0.0 1.0\n 1.0 0.0\n\njulia> hessian(x -> sum(x.^3), [1 2; 3 4]) # uses linear indexing of x\n4×4 Matrix{Int64}:\n 6 0 0 0\n 0 18 0 0\n 0 0 12 0\n 0 0 0 24\n\njulia> hessian(sin, pi/2)\n-1.0\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#Zygote.hessian_reverse","page":"Gradients – Zygote.jl","title":"Zygote.hessian_reverse","text":"hessian_reverse(f, x)\n\nThis should be equivalent to hessian(f, x), but implemented using reverse over reverse mode, all Zygote. (This is usually much slower, and more likely to find errors.)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#Zygote.diaghessian","page":"Gradients – Zygote.jl","title":"Zygote.diaghessian","text":"diaghessian(f, args...) -> Tuple\n\nDiagonal part of the Hessian. Returns a tuple containing, for each argument x, h of the same shape with h[i] = Hᵢᵢ = ∂²y/∂x[i]∂x[i]. The original evaluation y = f(args...) must give a real number y.\n\nFor one vector argument x, this is equivalent to (diag(hessian(f,x)),). Like hessian it uses ForwardDiff over Zygote. 
\n\nwarning: Warning\nFor arguments of any type except Number & AbstractArray, the result is nothing.\n\nExamples\n\njulia> diaghessian(x -> sum(x.^3), [1 2; 3 4])[1]\n2×2 Matrix{Int64}:\n 6 12\n 18 24\n\njulia> Diagonal(vec(ans)) == hessian(x -> sum(x.^3), [1 2; 3 4]) # full Hessian is diagonal\ntrue\n\njulia> diaghessian((x,y) -> sum(x .* y .* y'), [1 22; 333 4], [0.5, 0.666]) # two array arguments\n([0.0 0.0; 0.0 0.0], [2.0, 8.0])\n\njulia> diaghessian(atan, 1, 2) # two scalar arguments\n(-0.16, 0.16)\n\njulia> hessian(xy -> atan(xy[1], xy[2]), [1, 2]) # full Hessian is not diagonal\n2×2 Matrix{Float64}:\n -0.16 -0.12\n -0.12 0.16\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ZygoteRules.pullback","page":"Gradients – Zygote.jl","title":"ZygoteRules.pullback","text":"pullback(f, args...)\npullback(f, ::Params)\n\nReturns the value of the function f and a back-propagator function, which can be called to obtain a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or gradient.\n\ny, back = pullback(f, args...)\n∇ = back(seed)\n\nback must be called with a start value seed matching the output of f(args...). If f(args...) returns a number, seed should be a number. If f(args...) 
returns an array, seed should be an equally-sized array.\n\nSee also withgradient to obtain the value and gradients in one call, and gradient for obtaining just the gradients.\n\njulia> y, back = pullback(*, 2.0, 3.0, 5.0);\n\njulia> y\n30.0\n\njulia> back(1.0)\n(15.0, 10.0, 6.0)\n\njulia> back(2.0)\n(30.0, 20.0, 12.0)\n\njulia> y, back = pullback(x -> [x, x], 1.0);\n\njulia> y\n2-element Vector{Float64}:\n 1.0\n 1.0\n\njulia> back([1.0, 1.0])\n(2.0,)\n\njulia> back([2.0, nothing])\n(2.0,)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRules","page":"Gradients – Zygote.jl","title":"ChainRules","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Sometimes it is necessary to exclude some code, or a whole function, from automatic differentiation. This can be done using ChainRules:","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"ChainRulesCore.ignore_derivatives\nChainRulesCore.@non_differentiable","category":"page"},{"location":"reference/training/zygote/#ChainRulesCore.ignore_derivatives","page":"Gradients – Zygote.jl","title":"ChainRulesCore.ignore_derivatives","text":"ignore_derivatives(f::Function)\n\nTells the AD system to ignore the gradients of the wrapped closure. The primal computation (forward pass) is executed normally.\n\nignore_derivatives() do\n value = rand()\n push!(collection, value)\nend\n\nUsing this incorrectly could lead to incorrect gradients. For example, the following function will have zero gradients with respect to its argument:\n\nfunction wrong_grads(x)\n y = ones(3)\n ignore_derivatives() do\n push!(y, x)\n end\n return sum(y)\nend\n\n\n\n\n\nignore_derivatives(x)\n\nTells the AD system to ignore the gradients of the argument. 
Can be used to avoid unnecessary computation of gradients.\n\nignore_derivatives(x) * w\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.@non_differentiable","page":"Gradients – Zygote.jl","title":"ChainRulesCore.@non_differentiable","text":"@non_differentiable(signature_expression)\n\nA helper to make it easier to declare that a method is not differentiable. This is a short-hand for defining an frule and rrule that return NoTangent() for all partials (even for the function s̄elf-partial itself)\n\nKeyword arguments should not be included.\n\njulia> @non_differentiable Base.:(==)(a, b)\n\njulia> _, pullback = rrule(==, 2.0, 3.0);\n\njulia> pullback(1.0)\n(NoTangent(), NoTangent(), NoTangent())\n\nYou can place type-constraints in the signature:\n\njulia> @non_differentiable Base.length(xs::Union{Number, Array})\n\njulia> frule((ZeroTangent(), 1), length, [2.0, 3.0])\n(2, NoTangent())\n\nwarning: Warning\nThis helper macro covers only the simple common cases. It does not support where-clauses. For these you can declare the rrule and frule directly\n\n\n\n\n\n","category":"macro"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"To manually supply the gradient for one function, you should define a method of rrule. ChainRules has detailed documentation on how this works.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"ChainRulesCore.rrule\nChainRulesCore.frule\nChainRulesCore.@scalar_rule\nChainRulesCore.NoTangent\nChainRulesCore.ZeroTangent\nChainRulesCore.RuleConfig\nChainRulesCore.Tangent\nChainRulesCore.canonicalize","category":"page"},{"location":"reference/training/zygote/#ChainRulesCore.rrule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.rrule","text":"rrule([::RuleConfig,] f, x...)\n\nExpressing x as the tuple (x₁, x₂, ...) and the output tuple of f(x...) 
as Ω, return the tuple:\n\n(Ω, (Ω̄₁, Ω̄₂, ...) -> (s̄elf, x̄₁, x̄₂, ...))\n\nWhere the second return value is the propagation rule or pullback. It takes in cotangents corresponding to the outputs (Ω̄₁, Ω̄₂, ...) and returns the cotangents of the inputs (x̄₁, x̄₂, ...), preceded by s̄elf, the cotangent for the internal values of the function itself (for closures).\n\nIf no method matching rrule(f, xs...) has been defined, then return nothing.\n\nExamples:\n\nunary input, unary output scalar function:\n\njulia> x = rand();\n\njulia> sinx, sin_pullback = rrule(sin, x);\n\njulia> sinx == sin(x)\ntrue\n\njulia> sin_pullback(1) == (NoTangent(), cos(x))\ntrue\n\nbinary input, unary output scalar function:\n\njulia> x, y = rand(2);\n\njulia> hypotxy, hypot_pullback = rrule(hypot, x, y);\n\njulia> hypotxy == hypot(x, y)\ntrue\n\njulia> hypot_pullback(1) == (NoTangent(), (x / hypot(x, y)), (y / hypot(x, y)))\ntrue\n\nThe optional RuleConfig option allows specifying rrules only for AD systems that support given features. If not needed, then it can be omitted and the rrule without it will be hit as a fallback. This is the case for most rules.\n\nSee also: frule, @scalar_rule, RuleConfig\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.frule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.frule","text":"frule([::RuleConfig,] (Δf, Δx...), f, x...)\n\nExpressing the output of f(x...) as Ω, return the tuple:\n\n(Ω, ΔΩ)\n\nThe second return value is the tangent w.r.t. the output.\n\nIf no method matching frule((Δf, Δx...), f, x...) 
has been defined, then return nothing.\n\nExamples:\n\nunary input, unary output scalar function:\n\njulia> dself = NoTangent();\n\njulia> x = rand()\n0.8236475079774124\n\njulia> sinx, Δsinx = frule((dself, 1), sin, x)\n(0.7336293678134624, 0.6795498147167869)\n\njulia> sinx == sin(x)\ntrue\n\njulia> Δsinx == cos(x)\ntrue\n\nUnary input, binary output scalar function:\n\njulia> sincosx, Δsincosx = frule((dself, 1), sincos, x);\n\njulia> sincosx == sincos(x)\ntrue\n\njulia> Δsincosx[1] == cos(x)\ntrue\n\njulia> Δsincosx[2] == -sin(x)\ntrue\n\nNote that technically speaking Julia does not have multiple output functions, just functions that return a single output that is iterable, like a Tuple. So this is actually a Tangent:\n\njulia> Δsincosx\nTangent{Tuple{Float64, Float64}}(0.6795498147167869, -0.7336293678134624)\n\nThe optional RuleConfig option allows specifying frules only for AD systems that support given features. If not needed, then it can be omitted and the frule without it will be hit as a fallback. This is the case for most rules.\n\nSee also: rrule, @scalar_rule, RuleConfig\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.@scalar_rule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.@scalar_rule","text":"@scalar_rule(f(x₁, x₂, ...),\n @setup(statement₁, statement₂, ...),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nA convenience macro that generates simple scalar forward or reverse rules using the provided partial derivatives. 
Specifically, generates the corresponding methods for frule and rrule:\n\nfunction ChainRulesCore.frule((NoTangent(), Δx₁, Δx₂, ...), ::typeof(f), x₁::Number, x₂::Number, ...)\n Ω = f(x₁, x₂, ...)\n $(statement₁, statement₂, ...)\n return Ω, (\n (∂f₁_∂x₁ * Δx₁ + ∂f₁_∂x₂ * Δx₂ + ...),\n (∂f₂_∂x₁ * Δx₁ + ∂f₂_∂x₂ * Δx₂ + ...),\n ...\n )\nend\n\nfunction ChainRulesCore.rrule(::typeof(f), x₁::Number, x₂::Number, ...)\n Ω = f(x₁, x₂, ...)\n $(statement₁, statement₂, ...)\n return Ω, ((ΔΩ₁, ΔΩ₂, ...)) -> (\n NoTangent(),\n (∂f₁_∂x₁ * ΔΩ₁ + ∂f₂_∂x₁ * ΔΩ₂ + ...),\n (∂f₁_∂x₂ * ΔΩ₁ + ∂f₂_∂x₂ * ΔΩ₂ + ...),\n ...\n )\nend\n\nIf no type constraints in f(x₁, x₂, ...) within the call to @scalar_rule are provided, each parameter in the resulting frule/rrule definition is given a type constraint of Number. Constraints may also be explicitly provided to override the Number constraint, e.g. f(x₁::Complex, x₂), which will constrain x₁ to Complex and x₂ to Number.\n\nAt present this does not support defining for closures/functors. Thus in reverse-mode, the first returned partial, representing the derivative with respect to the function itself, is always NoTangent(). And in forward-mode, the first input to the returned propagator is always ignored.\n\nThe result of f(x₁, x₂, ...) is automatically bound to Ω. This allows the primal result to be conveniently referenced (as Ω) within the derivative/setup expressions.\n\nThis macro assumes complex functions are holomorphic. In general, for non-holomorphic functions, the frule and rrule must be defined manually.\n\nIf the derivative is one, (e.g. for identity functions) true can be used as the most general multiplicative identity.\n\nThe @setup argument can be elided if no setup code is needed. 
In other words:\n\n@scalar_rule(f(x₁, x₂, ...),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nis equivalent to:\n\n@scalar_rule(f(x₁, x₂, ...),\n @setup(nothing),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nFor examples, see ChainRules' rulesets directory.\n\nSee also: frule, rrule.\n\n\n\n\n\n","category":"macro"},{"location":"reference/training/zygote/#ChainRulesCore.NoTangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.NoTangent","text":"NoTangent() <: AbstractZero\n\nThis tangent indicates that the derivative does not exist. It is the tangent type for primal types that are not differentiable, such as integers or booleans (when they are not being used to represent floating-point values). The only valid way to perturb such values is to not change them at all. As a consequence, NoTangent is functionally identical to ZeroTangent(), but it provides additional semantic information.\n\nAdding NoTangent() to a primal is generally wrong: gradient-based methods cannot be used to optimize over discrete variables. An optimization package making use of this might want to check for such a case.\n\nnote: Note\nThis does not indicate that the derivative is not implemented, but rather that mathematically it is not defined.\n\nThis mostly shows up as the derivative with respect to dimension, index, or size arguments.\n\nfunction rrule(::typeof(fill), x, len::Int)\n y = fill(x, len)\n fill_pullback(ȳ) = (NoTangent(), @thunk(sum(ȳ)), NoTangent())\n return y, fill_pullback\nend\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.ZeroTangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.ZeroTangent","text":"ZeroTangent() <: AbstractZero\n\nThe additive identity for tangents. This is basically the same as 0. 
A derivative of ZeroTangent() does not propagate through the primal function.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.RuleConfig","page":"Gradients – Zygote.jl","title":"ChainRulesCore.RuleConfig","text":"RuleConfig{T}\n\nThe configuration for what rules to use. T: traits. This should be a Union of all special traits needed for rules to be allowed to be defined for your AD. If nothing special this should be set to Union{}.\n\nAD authors should define a subtype of RuleConfig to use when calling frule/rrule.\n\nRule authors can dispatch on this config when defining rules. For example:\n\n# only define rrule for `pop!` on AD systems where mutation is supported.\nrrule(::RuleConfig{>:SupportsMutation}, ::typeof(pop!), ::Vector) = ...\n\n# this definition of map is for any AD that defines a forwards mode\nrrule(conf::RuleConfig{>:HasForwardsMode}, ::typeof(map), ::Vector) = ...\n\n# this definition of map is for any AD that only defines a reverse mode.\n# It is not as good as the rrule that can be used if the AD defines a forward-mode as well.\nrrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, ::typeof(map), ::Vector) = ...\n\nFor more details see rule configurations and calling back into AD.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.Tangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.Tangent","text":"Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent\n\nThis type represents the tangent for a struct/NamedTuple, or Tuple. P is the corresponding primal type that this is a tangent for.\n\nTangent{P} should have fields (technically properties), that match to a subset of the fields of the primal type; and each should be a tangent type matching to the primal type of that field. Fields of P that are not present in the Tangent are treated as ZeroTangent().\n\nT is an implementation detail representing the backing data structure. 
For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by the user.\n\nFor Tangents of Tuples, iterate and getindex are overloaded to behave as they do for a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explicitly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal the canonicalize function is provided.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.canonicalize","page":"Gradients – Zygote.jl","title":"ChainRulesCore.canonicalize","text":"canonicalize(tangent::Tangent{P}) -> Tangent{P}\n\nReturn the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explicitly set to ZeroTangent().\n\n\n\n\n\n","category":"function"},{"location":"guide/models/basics/#man-basics","page":"Gradients and Layers","title":"How Flux Works: Parameters, Gradients, and Layers","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"A neural network is a function with parameters. That is, it takes some input x and gives you some output y, whose value also depends on some other numbers θ.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"A sufficiently flexible function can, by adjusting the parameters just right, be made to do many things. 
And the one magic trick for adjusting parameters is to follow a gradient.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This page describes Flux's take on how to construct such flexible functions containing many parameters, and how to handle their gradients.","category":"page"},{"location":"guide/models/basics/#Parameterised-Functions","page":"Gradients and Layers","title":"Parameterised Functions","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Let's start with very simple functions. This is a polynomial in x::Real, returning another real number y which depends on some coefficients stored in a vector:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"θ = [10, 1, 0.1]\n\npoly1(x::Real) = θ[1] + θ[2]*x + θ[3]*x^2\n\npoly1(5) == 17.5 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Here the parameters are a global variable θ. They could be handled in other ways, for instance by explicitly passing them as an additional argument to the function:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"poly2(x::Real, θ2) = evalpoly(x, θ2) # built-in, from Base.Math\n\npoly2(5, θ) == 17.5 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux chooses a third path, by encapsulating the parameters within the function. 
The simplest way to do this is a closure, an anonymous function which Julia knows to depend on some local variable θ3:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"poly3 = let θ3 = [10, 1, 0.1]\n x -> evalpoly(x, θ3)\nend\n\npoly3(5) == 17.5 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"An equivalent, but tidier, way is to construct a struct in which to store the parameters. Any struct can be made callable, allowing its instances to act just like a function:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"struct Poly3{T} # container struct\n θ3::T\nend\n(p::Poly3)(x::Real) = evalpoly(x, p.θ3) # make this callable\n\npoly3s = Poly3([10, 1, 0.1]) # construct an instance\n\npoly3s(5) == 17.5 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Internally, there is little difference between a closure and a struct. They have the same fields, and equivalent methods:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"dump(poly3), dump(poly3s) # both contain θ3: Array\npoly3s.θ3 == poly3.θ3 == θ # field called :θ3 has same value\nmethods(poly3)\nmethods(poly3s) # each has 1 method, accepting x","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The virtue of encapsulation is that it makes composition very easy. We can make more complicated functions by combining simple ones, and each will keep track of its own parameters. 
Julia writes function composition as ∘, for instance (inv ∘ sin)(pi/6) ≈ 2, and we can use exactly this for our parameterised polynomials:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"poly4 = Poly3([1, 0.5, 0]) ∘ Poly3([10, 1, 0.1])\n\npoly4 isa ComposedFunction # ∘ creates another struct...\npoly4.inner.θ3 == θ # which has fields :inner & :outer\n\npoly4(5) == 9.75 # true","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux models are precisely made by such function composition. In fact, poly3 and poly4 are already valid Flux models.","category":"page"},{"location":"guide/models/basics/#man-taking-gradients","page":"Gradients and Layers","title":"Structural Gradients","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The derivative of a scalar function is its slope: how fast the output changes as the input is changed slightly. 
This may be found approximately by evaluating at two nearby points, and exactly by taking the limit in which the distance between them approaches zero:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> (poly1(5 + 0.1) - poly1(5)) / 0.1\n2.010000000000005\n\njulia> (poly1(5 + 0.001) - poly1(5)) / 0.001 # answer is getting close to 2\n2.000100000003613","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux's gradient(f, x) works this out for f(x), and gives exactly ∂f/∂x = 2.0 here:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> using Flux\n\njulia> gradient(poly1, 5)\n(2.0,)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The reason gradient returns a tuple, not just the number 2.0, is to allow for functions taking several arguments. (That's also why it's not called \"derivative\".) For instance, this returns ∂f/∂x, ∂f/∂y, ∂f/∂z:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> gradient((x,y,z) -> (x*y)+z, 30, 40, 50)\n(40.0, 30.0, 1.0)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"For our parameterised polynomial, we have ∂f/∂x but we are really more interested in ∂f/∂θ, as this will tell us about how the parameters are affecting the answer. It is not impossible to track gradients with respect to global θ, but much clearer to track explicit arguments. 
Here's how this works for poly2 (which takes θ as a 2nd argument) and poly3 (which encapsulates θ):","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> grad2 = gradient(poly2, 5, θ)\n(2.0, [1.0, 5.0, 25.0])\n\njulia> grad3 = gradient((x,p) -> p(x), 5, poly3s)\n(2.0, (θ3 = [1.0, 5.0, 25.0],))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The first entry is ∂f/∂x as before, but the second entry is more interesting. For poly2, we get ∂f/∂θ as grad2[2] directly. It is a vector, because θ is a vector, and has elements [∂f/∂θ[1], ∂f/∂θ[2], ∂f/∂θ[3]].","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"For poly3, however, we get a NamedTuple whose fields correspond to those of the struct Poly3. This is called a structural gradient. And the nice thing about them is that they work for arbitrarily complicated structures, for instance:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> grad4 = gradient(|>, 5, poly4)\n(1.0, (outer = (θ3 = [1.0, 17.5, 306.25],), inner = (θ3 = [0.5, 2.5, 12.5],)))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Here grad4.inner.θ3 corresponds to poly4.inner.θ3. 
These matching nested structures are at the core of how Flux works.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"note: Implicit gradients\nEarlier versions of Flux used a different way to relate parameters and gradients, which looks like this:\n\ng1 = gradient(() -> poly1(5), Params([θ]))\ng1[θ] == [1.0, 5.0, 25.0]\n\nHere Params is a set of references to global variables using objectid, and g1 isa Grads is a dictionary from these to their gradients. This method of gradient takes a zero-argument function, which only implicitly depends on θ.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"

 Zygote.jl

","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux's gradient function by default calls a companion packages called Zygote. Zygote performs source-to-source automatic differentiation, meaning that gradient(f, x) hooks into Julia's compiler to find out what operations f contains, and transforms this to produce code for computing ∂f/∂x.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Zygote can in principle differentiate almost any Julia code. However, it's not perfect, and you may eventually want to read its page about limitations. In particular, a major limitation is that mutating an array is not allowed.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux can also be used with other automatic differentiation (AD) packages. It was originally written using Tracker, a more traditional operator-overloading approach. The future might be Enzyme, and Flux now builds in an easy way to use this instead, turned on by wrapping the model in Duplicated. (For details, see the Enzyme page in the manual.)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> using Enzyme: Const, Duplicated\n\njulia> grad3e = Flux.gradient((x,p) -> p(x), Const(5.0), Duplicated(poly3s))\n(nothing, (θ3 = [1.0, 5.0, 25.0],))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux.gradient follows Zygote's convention that arguments with no derivative are marked nothing. Here, this is because Const(5.0) is explicitly constant. Below, we will see an example where nothing shows up because the model struct has fields containing things other than parameters, such as an activation function. 
(It also adopts the convention that gradient(f, x, y) returns a tuple (∂f/∂x, ∂f/∂y), without a \"∂f/∂f\" term for the function. This is why we had to write gradient(|>, 5, poly4) above, not just gradient(poly4, 5).)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Finally, the function withgradient works the same way, but also returns the value of the function:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> Flux.withgradient((x,p) -> p(x), 5.0, poly3s)\n(val = 17.5, grad = (2.0, (θ3 = [1.0, 5.0, 25.0],)))","category":"page"},{"location":"guide/models/basics/#Simple-Neural-Networks","page":"Gradients and Layers","title":"Simple Neural Networks","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The polynomial functions above send a number x to another number y. Neural networks typically take a vector of numbers, mix them all up, and return another vector. Here's a very simple one, which will take a vector like x = [1.0, 2.0, 3.0] and return another vector y = layer1(x) with length(y) == 2:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"W = randn(2, 3)\nb = zeros(2)\n\nsigmoid(x::Real) = 1 / (1 + exp(-x))\nlayer1(x) = sigmoid.(W*x .+ b)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Here sigmoid is a nonlinear function, applied element-wise because it is called with the dot syntax .(), known as broadcasting.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Like poly1 above, this layer1 has as its parameters the global variables W, b. 
We can similarly define a version which takes these as arguments (like poly2), and a version which encapsulates them (like poly3 above):","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"layer2(x, W2, b2) = sigmoid.(W2*x .+ b2) # explicit parameter arguments\n\nlayer3 = let\n W3 = randn(2, 3)\n b3 = zeros(2)\n x -> sigmoid.(W3*x .+ b3) # closure over local variables\nend\n\nlayer3([1.0, 2.0, 3.0]) isa Vector # check that it runs","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This third way is precisely a Flux model. And we can again make a tidier version using a struct to hold the parameters:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"struct Layer # container struct\n W::Matrix\n b::Vector\n act::Function\nend\n\n(d::Layer)(x) = d.act.(d.W*x .+ d.b) # make it callable\n\nLayer(in::Int, out::Int, act::Function=sigmoid) =\n Layer(randn(Float32, out, in), zeros(Float32, out), act)\n\nlayer3s = Layer(3, 2) # instance with its own parameters","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The one new thing here is a friendly constructor Layer(in, out, act). 
This is because we anticipate composing several instances of this thing, with independent parameter arrays, of different sizes and different random initial parameters.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"x = Float32[0.1, 0.2, 0.3] # input\n\nlayer3s(x) # output, 2-element Vector{Float32}\n\nFlux.gradient((x,d) -> d(x)[1], x, layer3s)[2] # NamedTuple{(:W, :b, :act)}","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This ∂f/∂layer3s is a named tuple with the same fields as Layer. Within it, the gradient with respect to W is a matrix of seemingly random numbers. Notice that there is also an entry for act, which is nothing, as this field of the struct is not a smoothly adjustable parameter.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"We can compose these layers just as we did the polynomials above. Here's a composition of 3, in which the last step is the function only which takes a 1-element vector and gives us the number inside:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"model1 = only ∘ Layer(20, 1) ∘ Layer(1, 20)\n\ny = model1(Float32[0.1]) # output is a Float32 number\n\ngrad = Flux.gradient(|>, [1f0], model1)[2]","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This gradient is starting to be a complicated nested structure. But it works just like before: grad.outer.inner.W corresponds to model1.outer.inner.W.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"We don't have to use ∘ (which makes a ComposedFunction struct) to combine layers. 
Instead, we could define our own container struct, or use a closure:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"model2 = let\n lay1 = Layer(1, 20) # local variables containing layers\n lay2 = Layer(20, 1)\n function fwd(x) # equivalent to x -> only(lay2(lay1(x)))\n mid = lay1(x)\n lay2(mid) |> only\n end\nend\n\nmodel2(Float32[0.1])\n\nFlux.gradient(|>, [1f0], model2)[2]","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"

 Flux's layers

","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Rather than define everything from scratch every time, Flux provides a library of commonly used layers. The same model could be defined:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"model3 = Chain(Dense(1 => 20, σ), Dense(20 => 1), only)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"How does this model3 differ from the model1 we had before?","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux's Chain works left-to-right, the reverse of Base's ∘. Its contents is stored in a tuple, thus model3.layers[1].weight is an array.\nFlux's layer Dense has only minor differences:\nLike struct Poly3{T} above, it has type parameters for its fields – the compiler does not know exactly what type layer3s.W will be, which costs speed.\nIts initialisation uses not randn (normal distribution) but glorot_uniform by default.\nIt reshapes some inputs (to allow several batch dimensions), and produces more friendly errors on wrong-size input.\nAnd it has some performance tricks: making sure element types match, and re-using some memory.\nThe function σ is calculated in a slightly better way, and has a rule telling Zygote how to differentiate it efficiently.\nFlux overloads Base.show so to give pretty printing at the REPL prompt. Calling Flux.@layer Layer will add this, and some other niceties.\nAll Flux layers accept a batch of samples: Instead of mapping one sample x::Vector to one output y::Vector, they map columns of a matrix xs::Matrix to columns of the output. 
This looks like f(xs) ≈ stack(f(x) for x in eachcol(xs)) but is done more efficiently.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"If what you need isn't covered by Flux's built-in layers, it's easy to write your own. There are more details later, but the steps are invariably those shown for struct Layer above:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Define a struct which will hold the parameters.\nMake it callable, to define how it uses them to transform the input x.\nDefine a constructor which initialises the parameters (if the default constructor doesn't do what you want).\nAnnotate with @layer to opt-in to pretty printing, and other enhancements.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"

 Functors.jl

","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"To deal with such nested structures, Flux relies heavily on an associated package called Functors. Its basic function is fmap, which generalises map(f, x) to work on almost anything.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"For example, this is how gpu moves all arrays within a model to the GPU, reconstructing another only ∘ Layer(...) ∘ Layer(...) (or a Chain etc.) around the new CuArrays:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using CUDA, Functors\nfmap(cu, model1)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"And this is a very simple gradient update of the parameters, walking over model and grad simultaneously:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"fmap((x, dx) -> x isa Array ? (x - dx/100) : x, model, grad)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"note: Note\nBefore Flux v0.15 (and Functors v0.5), this exploration of structs was opt-in. After defining struct Layer it was necessary to call @functor Layer (or @layer Layer) before Flux would look inside. This has now changed to be opt-out: Functors (and hence Flux) will explore arbitrary structs, unless told not to (using Functors.@leaf). 
This is why even \"anonymous structs\" created by closures, like poly3 and layer3 above, are now valid Flux models, although the use of named structs is still recommended practice.","category":"page"},{"location":"guide/models/basics/#Curve-Fitting","page":"Gradients and Layers","title":"Curve Fitting","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Above we took gradients of the output, or sometimes of the first element of the output – it must be a number, not a vector. Adjusting the parameters to make this smaller won't lead us anywhere interesting. Instead, we should minimise some loss function which compares the actual output to our desired output.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Perhaps the simplest example is curve fitting. The previous page fitted a linear model to data. With our two-layer model, we can fit a nonlinear function. For example, let us use f(x) = 2x - x^3 evaluated at some points x in -2:0.1:2 as the data, and adjust the parameters of model3 from above so that its output is similar.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"data = [([x], 2x-x^3) for x in -2:0.1f0:2] # training points (x, y)\n\nfor _ in 1:1000 # adjust parameters to minimise the error:\n Flux.train!((m,x,y) -> (m(x) - y)^2, model3, data, Descent(0.01))\nend","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The same code will also work with model1 or model2 instead. 
Here's how to plot the desired and actual outputs:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using Plots\nplot(x -> 2x-x^3, -2, 2, label=\"truth\")\nscatter!(x -> model3([x]), -2:0.1f0:2, label=\"fitted\")","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"More detail about what exactly the function train! is doing, and how to use rules other than simple Descent, is what the next page in this guide is about: training.","category":"page"},{"location":"guide/models/recurrence/#Recurrent-Models","page":"Recurrence","title":"Recurrent Models","text":"","category":"section"},{"location":"guide/models/recurrence/#Recurrent-cells","page":"Recurrence","title":"Recurrent cells","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"To introduce Flux's recurrence functionalities, we will consider the following vanilla recurrent neural network structure:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"(Image: )","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the above, we have a sequence of length 3, where x1 to x3 represent the input at each step. It could be a timestamp or a word in a sentence encoded as vectors. y1 to y3 are their respective outputs.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"An aspect to recognise is that in such a model, the recurrent cells A all refer to the same structure. 
What distinguishes it from a simple dense layer is that the cell A is fed, in addition to an input x, with information from the previous state of the model (hidden state denoted as h1 & h2 in the diagram).","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the most basic RNN case, cell A could be defined by the following: ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"output_size = 5\ninput_size = 2\nWxh = randn(Float32, output_size, input_size)\nWhh = randn(Float32, output_size, output_size)\nb = zeros(Float32, output_size)\n\nfunction rnn_cell(x, h)\n h = tanh.(Wxh * x .+ Whh * h .+ b)\n return h\nend\n\nseq_len = 3\n# dummy input data\nx = [rand(Float32, input_size) for i = 1:seq_len] \n# initial hidden state (all zeros)\nh0 = zeros(Float32, output_size) \n\ny = []\nht = h0\nfor xt in x\n ht = rnn_cell(xt, ht)\n y = [y; [ht]] # concatenate in non-mutating (AD friendly) way\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Notice how the above is essentially a Dense layer that acts on two inputs, xt and ht.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The output at each time step, called the hidden state, is used as the input to the next time step and is also the output of the model. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"There are various recurrent cells available in Flux, notably RNNCell, LSTMCell and GRUCell, which are documented in the layer reference. 
The hand-written example above can be replaced with:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"using Flux\n\noutput_size = 5\ninput_size = 2\nseq_len = 3\nx = [rand(Float32, input_size) for i = 1:seq_len] \nh0 = zeros(Float32, output_size) \n\nrnn_cell = Flux.RNNCell(input_size => output_size)\n\ny = []\nht = h0\nfor xt in x\n ht = rnn_cell(xt, ht)\n y = [y; [ht]]\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The entire output y or just the last output y[end] can be used for further processing, such as classification or regression. ","category":"page"},{"location":"guide/models/recurrence/#Using-a-cell-as-part-of-a-model","page":"Recurrence","title":"Using a cell as part of a model","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Let's consider a simple model that is trained to predict a scalar quantity for each time step in a sequence. The model will have a single RNN cell, followed by a dense layer to produce the output. Since the RNNCell can deal with batches of data, we can define the model to accept an input where at each time step, the input is a matrix of size (input_size, batch_size). 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"struct RecurrentCellModel{H,C,D}\n h0::H\n cell::C\n dense::D\nend\n\n# we choose to not train the initial hidden state\nFlux.@layer RecurrentCellModel trainable=(cell,dense) \n\nfunction RecurrentCellModel(input_size::Int, hidden_size::Int)\n return RecurrentCellModel(\n zeros(Float32, hidden_size), \n RNNCell(input_size => hidden_size),\n Dense(hidden_size => 1))\nend\n\nfunction (m::RecurrentCellModel)(x)\n z = []\n ht = m.h0\n for xt in x\n ht = m.cell(xt, ht)\n z = [z; [ht]]\n end\n z = stack(z, dims=2) # [hidden_size, seq_len, batch_size] or [hidden_size, seq_len]\n ŷ = m.dense(z) # [1, seq_len, batch_size] or [1, seq_len]\n return ŷ\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Notice that we stack the hidden states z to form a tensor of size (hidden_size, seq_len, batch_size). This can speedup the final classification, since we then process all the outputs at once with a single forward pass of the dense layer. 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Let's now define the training loop for this model:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"using Optimisers: AdamW\n\nfunction loss(model, x, y)\n ŷ = model(x)\n y = stack(y, dims=2)\n return Flux.mse(ŷ, y)\nend\n\n# create dummy data\nseq_len, batch_size, input_size = 3, 4, 2\nx = [rand(Float32, input_size, batch_size) for _ = 1:seq_len]\ny = [rand(Float32, 1, batch_size) for _ = 1:seq_len]\n\n# initialize the model and optimizer\nmodel = RecurrentCellModel(input_size, 5)\nopt_state = Flux.setup(AdamW(1e-3), model)\n\n# compute the gradient and update the model\ng = gradient(m -> loss(m, x, y),model)[1]\nFlux.update!(opt_state, model, g)","category":"page"},{"location":"guide/models/recurrence/#Handling-the-whole-sequence-at-once","page":"Recurrence","title":"Handling the whole sequence at once","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the above example, we processed the sequence one time step at a time using a recurrent cell. However, it is possible to process the entire sequence at once. This can be done by stacking the input data x to form a tensor of size (input_size, seq_len) or (input_size, seq_len, batch_size). One can then use the RNN, LSTM or GRU layers to process the entire input tensor. 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Let's consider the same example as above, but this time we use an RNN layer instead of an RNNCell:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"struct RecurrentModel{H,C,D}\n h0::H\n rnn::C\n dense::D\nend\n\nFlux.@layer RecurrentModel trainable=(rnn, dense)\n\nfunction RecurrentModel(input_size::Int, hidden_size::Int)\n return RecurrentModel(\n zeros(Float32, hidden_size), \n RNN(input_size => hidden_size),\n Dense(hidden_size => 1))\nend\n\nfunction (m::RecurrentModel)(x)\n z = m.rnn(x, m.h0) # [hidden_size, seq_len, batch_size] or [hidden_size, seq_len]\n ŷ = m.dense(z) # [1, seq_len, batch_size] or [1, seq_len]\n return ŷ\nend\n\nseq_len, batch_size, input_size = 3, 4, 2\nx = rand(Float32, input_size, seq_len, batch_size)\ny = rand(Float32, 1, seq_len, batch_size)\n\nmodel = RecurrentModel(input_size, 5)\nopt_state = Flux.setup(AdamW(1e-3), model)\n\ng = gradient(m -> Flux.mse(m(x), y), model)[1]\nFlux.update!(opt_state, model, g)","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/nnlib/#Neural-Network-primitives-from-NNlib.jl","page":"Low-level Operations – NNlib.jl","title":"Neural Network primitives from NNlib.jl","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux re-exports all of the functions exported by the NNlib package. This includes activation functions, described on their own page. 
Many of the functions on this page exist primarily as the internal implementation of Flux layers, but can also be used independently.","category":"page"},{"location":"reference/models/nnlib/#Attention","page":"Low-level Operations – NNlib.jl","title":"Attention","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Primitives for the MultiHeadAttention layer.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.dot_product_attention\nNNlib.dot_product_attention_scores\nNNlib.make_causal_mask","category":"page"},{"location":"reference/models/nnlib/#NNlib.dot_product_attention","page":"Low-level Operations – NNlib.jl","title":"NNlib.dot_product_attention","text":"dot_product_attention(query, key, value, [bias]; [fdrop, mask, nheads])\n\nMultihead dot product attention used in transformer architectures.\n\nThe input arrays must have the first two dimensions given by the number of features and the sequence length, then an arbitrary number of batch dimensions or none.\n\nReturns the attention output array of size (v_dim, q_len, batch_size...) and the attention scores of size (kv_len, q_len, nheads, batch_size...).\n\nSee also dot_product_attention_scores if you only need the attention scores.\n\nArguments\n\nquery: Query array of size (qk_dim, q_len, batch_size...).\nkey: Key array of size (qk_dim, kv_len, batch_size...).\nvalue: Value array of size (v_dim, kv_len, batch_size...).\nbias: Either nothing or an array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before applying the softmax. Default nothing.\nfdrop: A dropout function or layer to be applied on the attention scores right after the softmax. 
Default identity (no dropout).\nmask: Either nothing or a boolean array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See make_causal_mask for creating causal masks. Default nothing.\nnheads: Number of heads to split the input arrays into. Default 1.\n\nExamples\n\nq, k, v = rand(10, 20, 2), rand(10, 30, 2), rand(20, 30, 2)\ny, α = dot_product_attention(q, k, v)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.dot_product_attention_scores","page":"Low-level Operations – NNlib.jl","title":"NNlib.dot_product_attention_scores","text":"dot_product_attention_scores(query, key, [bias]; [fdrop, mask])\n\nReturn the attention scores for the dot_product_attention. Input arrays must have dimensions (num_features ÷ nheads, nheads, sequence_length, batch_size).\n\nSee dot_product_attention for more details.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.make_causal_mask","page":"Low-level Operations – NNlib.jl","title":"NNlib.make_causal_mask","text":"make_causal_mask(x, dims=2)\n\nReturn a boolean square matrix m of the same type as x and of side size(x, dims). 
Its elements are set such that m[i, j] == i ≤ j.\n\nCan be used to mask the attention scores in dot_product_attention.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Softmax","page":"Low-level Operations – NNlib.jl","title":"Softmax","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Flux.logitcrossentropy uses NNlib.logsoftmax internally.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"softmax\nlogsoftmax","category":"page"},{"location":"reference/models/nnlib/#NNlib.softmax","page":"Low-level Operations – NNlib.jl","title":"NNlib.softmax","text":"softmax(x; dims = 1)\n\nSoftmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:\n\nsoftmax(x; dims = 1) = exp.(x) ./ sum(exp.(x), dims = dims)\n\nwith additional manipulations enhancing numerical stability.\n\nFor a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.\n\nSee also logsoftmax.\n\nExamples\n\njulia> softmax([1, 2, 3])\n3-element Vector{Float64}:\n 0.09003057317038046\n 0.24472847105479764\n 0.6652409557748218\n\njulia> softmax([1 2 3; 2 2 2]) # dims=1\n2×3 Matrix{Float64}:\n 0.268941 0.5 0.731059\n 0.731059 0.5 0.268941\n\njulia> softmax([1 2 3; 2 2 2]; dims=2)\n2×3 Matrix{Float64}:\n 0.0900306 0.244728 0.665241\n 0.333333 0.333333 0.333333\n\nNote that, when used with Flux.jl, softmax must not be passed to layers like Dense which accept an activation function. The activation is broadcasted over the result, thus applies to individual numbers. 
But softmax always needs to see the whole column.\n\njulia> using Flux\n\njulia> x = randn(Float32, 4, 4, 3, 13);\n\njulia> model = Chain(Conv((4, 4), 3 => 8, tanh), Flux.flatten, Dense(8 => 7), softmax);\n\njulia> model(x) |> size\n(7, 13)\n\njulia> Dense(4 => 7, softmax)(x)\nERROR: `softmax(x)` called with a number, but it expects an array. \n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.logsoftmax","page":"Low-level Operations – NNlib.jl","title":"NNlib.logsoftmax","text":"logsoftmax(x; dims = 1)\n\nComputes the log of softmax in a more numerically stable way than directly taking log.(softmax(xs)). Commonly used in computing cross entropy loss.\n\nIt is semantically equivalent to the following:\n\nlogsoftmax(x; dims = 1) = x .- log.(sum(exp.(x), dims = dims))\n\nSee also softmax.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Pooling","page":"Low-level Operations – NNlib.jl","title":"Pooling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, and MeanPool use NNlib.PoolDims, NNlib.maxpool, and NNlib.meanpool as their backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.PoolDims\nNNlib.lpnormpool\nNNlib.maxpool\nNNlib.meanpool","category":"page"},{"location":"reference/models/nnlib/#NNlib.PoolDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.PoolDims","text":"PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};\n stride=k, padding=0, dilation=1) where {M, L}\n\nDimensions for a \"pooling\" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. 
Used to dispatch onto efficient implementations at compile-time.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.lpnormpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.lpnormpool","text":"lpnormpool(x, p::Real, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform an Lp pooling operation with norm exponent p and window size k on input tensor x, also known as LPPool in PyTorch. This pooling operator is from Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\np is restricted to 0 < p < Inf.\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. Default is k.\n\nFor the elements xᵢ in each window of size k, lpnormpool computes (∑ᵢ xᵢ^p)^(1 / p) as an element of the output.\n\nThus lpnormpool(x, 1, k) ./ prod(k) ≈ meanpool(x, k) and lpnormpool(x, 2, k).^2 ./ prod(k) ≈ meanpool(x.^2, k).\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.maxpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.maxpool","text":"maxpool(x, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform max pool operation with window size k on input tensor x.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. Default is k.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.meanpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.meanpool","text":"meanpool(x, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform mean pool operation with window size k on input tensor x.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. 
Default is k.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Padding","page":"Low-level Operations – NNlib.jl","title":"Padding","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.pad_circular\nNNlib.pad_constant\nNNlib.pad_reflect\nNNlib.pad_repeat\nNNlib.pad_symmetric\nNNlib.pad_zeros","category":"page"},{"location":"reference/models/nnlib/#NNlib.pad_circular","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_circular","text":"pad_circular(x, pad::Tuple; [dims])\npad_circular(x, pad::Int; [dims])\n\nPad the array x \"circularly\" across the border by wrapping around values from the opposite side of x. \n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nThe pad length on either side in any dimension must not exceed the size of x in that dimension, i.e. 
pad_circular is not able to create arbitrarily sized tilings of x.\n\nSee also pad_repeat, pad_reflect, pad_symmetric, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_circular(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 9 3 6 9 3 6\n 7 1 4 7 1 4\n 8 2 5 8 2 5\n 9 3 6 9 3 6\n 7 1 4 7 1 4\n 8 2 5 8 2 5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_constant","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_constant","text":"pad_constant(x, pad::Tuple, val = 0; [dims = :])\npad_constant(x, pad::Int, val = 0; [dims = :])\n\nPad the array x with the constant value val.\n\npad can be a tuple of integers. If it has length 2 * length(dims), it specifies the left and right padding size for each of the dimensions in dims as (l1, r1, ..., ln, rn). If supplied with a tuple of length length(dims) instead, it applies symmetric padding. If dims is not given, it defaults to all dimensions.\n\nFor integer pad input, it is applied on both sides on every dimension in dims.\n\nSee also pad_zeros, pad_repeat, pad_reflect, pad_symmetric, and pad_circular.\n\njulia> r = reshape(1:4, 2, 2)\n2×2 reshape(::UnitRange{Int64}, 2, 2) with eltype Int64:\n 1 3\n 2 4\n\njulia> pad_constant(r, (1, 2, 3, 4), 8)\n5×9 Matrix{Int64}:\n 8 8 8 8 8 8 8 8 8\n 8 8 8 1 3 8 8 8 8\n 8 8 8 2 4 8 8 8 8\n 8 8 8 8 8 8 8 8 8\n 8 8 8 8 8 8 8 8 8\n\njulia> pad_constant(r, 1, 8)\n4×4 Matrix{Int64}:\n 8 8 8 8\n 8 1 3 8\n 8 2 4 8\n 8 8 8 8\n\njulia> r = reshape(1:27, 3, 3, 3)\n3×3×3 reshape(::UnitRange{Int64}, 3, 3, 3) with eltype Int64:\n[:, :, 1] =\n 1 4 7\n 2 5 8\n 3 6 9\n\n[:, :, 2] =\n 10 13 16\n 11 14 17\n 12 15 18\n\n[:, :, 3] =\n 19 22 25\n 20 23 26\n 21 24 27\n\njulia> pad_constant(r, (2,1), dims = 1) # asymmetric padding\n6×3×3 Array{Int64, 3}:\n[:, :, 1] =\n 0 0 0\n 0 0 0\n 1 4 7\n 2 5 8\n 3 6 9\n 0 0 0\n\n[:, :, 2] =\n 0 0 0\n 0 0 0\n 10 13 16\n 11 14 17\n 12 15 18\n 
0 0 0\n\n[:, :, 3] =\n 0 0 0\n 0 0 0\n 19 22 25\n 20 23 26\n 21 24 27\n 0 0 0\n\njulia> pad_constant(r, (2,1, 3), dims = (1,2)) # padding must always be either the same length as dims, or double it\nERROR: ArgumentError: Could not parse padding (2, 1, 3) and dims (1, 2)\nStacktrace:\n[...]\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_reflect","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_reflect","text":"pad_reflect(x, pad::Tuple; [dims])\npad_reflect(x, pad::Int; [dims])\n\nPad the array x reflecting its values across the border.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nSee also pad_repeat, pad_symmetric, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_reflect(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 5 2 5 8 5 2\n 4 1 4 7 4 1\n 5 2 5 8 5 2\n 6 3 6 9 6 3\n 5 2 5 8 5 2\n 4 1 4 7 4 1\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_repeat","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_repeat","text":"pad_repeat(x, pad::Tuple; [dims])\npad_repeat(x, pad::Int; [dims])\n\nPad the array x repeating the values on the border.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. 
excludes the channel and batch dimension). \n\nSee also pad_reflect, pad_symmetric, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_repeat(r, (1,2,3,4))\n6×10 Matrix{Int64}:\n 1 1 1 1 4 7 7 7 7 7\n 1 1 1 1 4 7 7 7 7 7\n 2 2 2 2 5 8 8 8 8 8\n 3 3 3 3 6 9 9 9 9 9\n 3 3 3 3 6 9 9 9 9 9\n 3 3 3 3 6 9 9 9 9 9\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_symmetric","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_symmetric","text":"pad_symmetric(x, pad::Tuple; [dims])\npad_symmetric(x, pad::Int; [dims])\n\nPad the array x reflecting its values symmetrically across the border, i.e. the border values of x are present in the padding values, in contrast to pad_reflect.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nSee also pad_repeat, pad_reflect, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_symmetric(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 1 1 4 7 7 4\n 1 1 4 7 7 4\n 2 2 5 8 8 5\n 3 3 6 9 9 6\n 3 3 6 9 9 6\n 2 2 5 8 8 5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_zeros","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_zeros","text":"pad_zeros(x, pad::Tuple; [dims])\npad_zeros(x, pad::Int; [dims])\n\nPad the array x with zeros. Equivalent to pad_constant with the constant equal to 0. 
\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Convolution","page":"Low-level Operations – NNlib.jl","title":"Convolution","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally. ","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"conv\nConvDims\ndepthwiseconv\nDepthwiseConvDims\nDenseConvDims","category":"page"},{"location":"reference/models/nnlib/#NNlib.conv","page":"Low-level Operations – NNlib.jl","title":"NNlib.conv","text":"conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)\n\nApply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.ConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.ConvDims","text":"ConvDims\n\nType system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.depthwiseconv","page":"Low-level Operations – NNlib.jl","title":"NNlib.depthwiseconv","text":"depthwiseconv(x, w; stride=1, pad=0, dilation=1, flipped=false)\n\nDepthwise convolution operation with filter w on input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.DepthwiseConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.DepthwiseConvDims","text":"DepthwiseConvDims\n\nConcrete subclass of ConvDims for a depthwise convolution. 
Differs primarily due to characterization by C_in, C_mult, rather than C_in, C_out. Useful to be separate from DenseConvDims primarily for channel calculation differences.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.DenseConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.DenseConvDims","text":"DenseConvDims\n\nConcrete subclass of ConvDims for a normal, dense, conv2d/conv3d.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#Dropout","page":"Low-level Operations – NNlib.jl","title":"Dropout","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.dropout\nNNlib.dropout!","category":"page"},{"location":"reference/models/nnlib/#NNlib.dropout","page":"Low-level Operations – NNlib.jl","title":"NNlib.dropout","text":"dropout([rng], A, p; [dims])\n\nReturns an array in which each element of A is either replaced with zero, with probability p, or else multiplied by 1/(1-p).\n\nBy default every element is treated independently. With keyword dims=1, a choice is made for every value of the 1st index i.e. 
each row of a matrix is either zero or not.\n\nOptional first argument is the random number generator used.\n\nExamples\n\njulia> dropout(ones(2, 10), 0.2)\n2×10 Matrix{Float64}:\n 1.25 1.25 0.0 1.25 1.25 1.25 1.25 1.25 1.25 1.25\n 1.25 1.25 1.25 0.0 1.25 1.25 0.0 1.25 1.25 1.25\n\njulia> mean(dropout(ones(10^4, 5), 0.2), dims=1)\n1×5 Matrix{Float64}:\n 0.998 1.00075 0.99125 0.99575 1.00075\n\njulia> dropout(ones(5, 5), 0.7, dims=1) # whole row the same\n5×5 Matrix{Float64}:\n 3.33333 3.33333 3.33333 3.33333 3.33333\n 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 0.0\n 3.33333 3.33333 3.33333 3.33333 3.33333\n 0.0 0.0 0.0 0.0 0.0\n\njulia> mean(dropout(ones(10^4, 5), 0.3, dims=1), dims=1)\n1×5 Matrix{Float64}:\n 1.00571 1.00571 1.00571 1.00571 1.00571\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.dropout!","page":"Low-level Operations – NNlib.jl","title":"NNlib.dropout!","text":"dropout!(B, A, p; [dims])\n\nThis does exactly B .= dropout(A, p; dims), or rather, it's the implementation of out-of-place dropout.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Upsampling","page":"Low-level Operations – NNlib.jl","title":"Upsampling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. 
Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"upsample_nearest\nupsample_linear\n∇upsample_linear\nupsample_bilinear\n∇upsample_bilinear\nupsample_trilinear\n∇upsample_trilinear\npixel_shuffle","category":"page"},{"location":"reference/models/nnlib/#NNlib.upsample_nearest","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_nearest","text":"upsample_nearest(x, scale::NTuple{S,Int})\nupsample_nearest(x; size::NTuple{S,Int})\n\nUpsamples the array x by integer multiples along the first S dimensions. Subsequent dimensions of x are not altered.\n\nEither the scale factors or the final output size can be specified.\n\nSee also upsample_bilinear, for two dimensions of an N=4 array.\n\nExample\n\njulia> upsample_nearest([1 2 3; 4 5 6], (2, 3))\n4×9 Matrix{Int64}:\n 1 1 1 2 2 2 3 3 3\n 1 1 1 2 2 2 3 3 3\n 4 4 4 5 5 5 6 6 6\n 4 4 4 5 5 5 6 6 6\n\njulia> ans == upsample_nearest([1 2 3; 4 5 6]; size=(4, 9)) # equivalent\ntrue\n\njulia> upsample_nearest([1 2 3; 4 5 6], (2,))\n4×3 Matrix{Int64}:\n 1 2 3\n 1 2 3\n 4 5 6\n 4 5 6\n\njulia> ans == upsample_nearest([1 2 3; 4 5 6], size=(4,))\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_linear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_linear","text":"upsample_linear(x::AbstractArray{T,3}, scale::Real; align_corners::Bool = true)\nupsample_linear(x::AbstractArray{T,3}; size::Integer, align_corners::Bool = true)\n\nUpsamples the first dimension of the array x by the upsample provided scale, using linear interpolation. 
As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_linear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_linear","text":"∇upsample_linear(Δ::AbstractArray{T,3}; size::Integer, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Size of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_bilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_bilinear","text":"upsample_bilinear(x::AbstractArray{T,4}, scale::NTuple{2,Real}; align_corners::Bool = true)\nupsample_bilinear(x::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true)\n\nUpsamples the first 2 dimensions of the array x by the upsample factors stored in scale, using bilinear interpolation. 
As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale[1]*S1, scale[2]*S2, S3, S4), where S1, S2, S3, S4 = size(x).\n\nExamples\n\njulia> x = reshape(Float32[1 2 3; 4 5 6], (2,3,1,1))\n2×3×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 2.0 3.0\n 4.0 5.0 6.0\n\njulia> upsample_bilinear(x, (2, 3))\n4×9×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 1.25 1.5 1.75 2.0 2.25 2.5 2.75 3.0\n 2.0 2.25 2.5 2.75 3.0 3.25 3.5 3.75 4.0\n 3.0 3.25 3.5 3.75 4.0 4.25 4.5 4.75 5.0\n 4.0 4.25 4.5 4.75 5.0 5.25 5.5 5.75 6.0\n\njulia> ans == upsample_bilinear(x; size=(4, 9)) # specify output size instead\ntrue\n\njulia> upsample_bilinear(x, (2.5, 3.5)) # non-integer scaling factors are allowed\n5×10×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 1.22222 1.44444 1.66667 1.88889 … 2.33333 2.55556 2.77778 3.0\n 1.75 1.97222 2.19444 2.41667 2.63889 3.08333 3.30556 3.52778 3.75\n 2.5 2.72222 2.94444 3.16667 3.38889 3.83333 4.05556 4.27778 4.5\n 3.25 3.47222 3.69444 3.91667 4.13889 4.58333 4.80556 5.02778 5.25\n 4.0 4.22222 4.44444 4.66667 4.88889 5.33333 5.55556 5.77778 6.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_bilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_bilinear","text":"∇upsample_bilinear(Δ::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Lateral (W,H) size of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_trilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_trilinear","text":"upsample_trilinear(x::AbstractArray{T,5}, scale::NTuple{3,Real}; align_corners::Bool = true)\nupsample_trilinear(x::AbstractArray{T,5}; size::NTuple{3,Integer}, 
align_corners::Bool = true)\n\nUpsamples the first 3 dimensions of the array x by the upsample factors stored in scale, using trilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale[1]*S1, scale[2]*S2, scale[3]*S3, S4, S5), where S1, S2, S3, S4, S5 = size(x).\n\nExamples\n\nupsample_trilinear(x, (2, 3, 4))\nupsample_trilinear(x; size=(4, 9, 11)) # specify output size instead\nupsample_trilinear(x, (2.5, 3.5, pi)) # non-integer scaling factors are allowed\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_trilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_trilinear","text":"∇upsample_trilinear(Δ::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Lateral size & depth (W,H,D) of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pixel_shuffle","page":"Low-level Operations – NNlib.jl","title":"NNlib.pixel_shuffle","text":"pixel_shuffle(x, r::Integer)\n\nPixel shuffling operation, upscaling by a factor r.\n\nFor 4-arrays representing N images, the operation converts input size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.\n\nUsed in super-resolution networks to upsample towards high resolution features. Reference: Shi et 
al., \"Real-Time Single Image and Video Super-Resolution ...\", CVPR 2016, https://arxiv.org/abs/1609.05158\n\nExamples\n\njulia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]\n2×3×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 11.1 12.1 13.1\n 21.1 22.1 23.1\n\n[:, :, 2, 1] =\n 11.2 12.2 13.2\n 21.2 22.2 23.2\n\n[:, :, 3, 1] =\n 11.3 12.3 13.3\n 21.3 22.3 23.3\n\n[:, :, 4, 1] =\n 11.4 12.4 13.4\n 21.4 22.4 23.4\n\njulia> pixel_shuffle(x, 2) # 4 channels used up as 2x upscaling of image dimensions\n4×6×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 11.1 11.3 12.1 12.3 13.1 13.3\n 11.2 11.4 12.2 12.4 13.2 13.4\n 21.1 21.3 22.1 22.3 23.1 23.3\n 21.2 21.4 22.2 22.4 23.2 23.4\n\njulia> y = [i + channel/10 for i in 1:3, channel in 1:6, batch in 1:1]\n3×6×1 Array{Float64, 3}:\n[:, :, 1] =\n 1.1 1.2 1.3 1.4 1.5 1.6\n 2.1 2.2 2.3 2.4 2.5 2.6\n 3.1 3.2 3.3 3.4 3.5 3.6\n\njulia> pixel_shuffle(y, 2) # 1D image, with 6 channels reduced to 3\n6×3×1 Array{Float64, 3}:\n[:, :, 1] =\n 1.1 1.3 1.5\n 1.2 1.4 1.6\n 2.1 2.3 2.5\n 2.2 2.4 2.6\n 3.1 3.3 3.5\n 3.2 3.4 3.6\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Batched-Operations","page":"Low-level Operations – NNlib.jl","title":"Batched Operations","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Flux.Bilinear layer uses NNlib.batched_mul internally.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"batched_mul\nbatched_mul!\nbatched_adjoint\nbatched_transpose\nbatched_vec","category":"page"},{"location":"reference/models/nnlib/#NNlib.batched_mul","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_mul","text":"batched_mul(A, B) -> C\nA ⊠ B # \\boxtimes\n\nBatched matrix multiplication. Result has C[:,:,k...] == A[:,:,k...] * B[:,:,k...] where k... 
represent any indices in the last dimensions.\n\nIf ndims(A) == ndims(B) == 3 and size(B,3) == 1 then instead C[:,:,k] == A[:,:,k] * B[:,:,1], and similarly for A.\n\nTo transpose each matrix, apply batched_transpose to the array, or batched_adjoint for conjugate-transpose:\n\njulia> A, B = randn(2,5,17), randn(5,9,17);\n\njulia> A ⊠ B |> size\n(2, 9, 17)\n\njulia> batched_adjoint(A) |> size\n(5, 2, 17)\n\njulia> batched_mul(A, batched_adjoint(randn(9,5,17))) |> size\n(2, 9, 17)\n\njulia> A ⊠ randn(5,9,1) |> size\n(2, 9, 17)\n\njulia> batched_transpose(A) == PermutedDimsArray(A, (2,1,3))\ntrue\n\nThe equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.\n\nHowever, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.\n\nBoth this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV[\"JULIA_DEBUG\"] = NNlib will display them.\n\n\n\n\n\nbatched_mul(A::Array{T,3}, B::Matrix)\nbatched_mul(A::Matrix, B::Array{T,3})\nA ⊠ B\n\nThis is always matrix-matrix multiplication, but either A or B may lack a batch index.\n\nWhen B is a matrix, result has C[:,:,k] == A[:,:,k] * B[:,:] for all k.\nWhen A is a matrix, then C[:,:,k] == A[:,:] * B[:,:,k]. 
This can also be done by reshaping and calling *, for instance A ⊡ B using TensorCore.jl, but is implemented here using batched_gemm instead of gemm.\n\njulia> randn(16,8,32) ⊠ randn(8,4) |> size\n(16, 4, 32)\n\njulia> randn(16,8,32) ⊠ randn(8,4,1) |> size # equivalent\n(16, 4, 32)\n\njulia> randn(16,8) ⊠ randn(8,4,32) |> size\n(16, 4, 32)\n\nSee also batched_vec to regard B as a batch of vectors, A[:,:,k] * B[:,k].\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_mul!","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_mul!","text":"batched_mul!(C, A, B) -> C\nbatched_mul!(C, A, B, α=1, β=0)\n\nIn-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.\n\nThis will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either stride(X,1)==1 or stride(X,2)==1, the latter may be caused by batched_transpose or by for instance PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul this will never make a copy.\n\nFor complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. 
In this case the strides accepted by BLAS are more restricted: if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_adjoint","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_adjoint","text":"batched_transpose(A::AbstractArray{T,3})\nbatched_adjoint(A)\n\nEquivalent to applying transpose or adjoint to each matrix A[:,:,k].\n\nThese exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.\n\nPermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).\n\nBatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}\nBatchedAdjoint{T, S}\n\nLazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_transpose","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_transpose","text":"batched_transpose(A::AbstractArray{T,3})\nbatched_adjoint(A)\n\nEquivalent to applying transpose or adjoint to each matrix A[:,:,k].\n\nThese exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.\n\nPermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).\n\nBatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}\nBatchedAdjoint{T, S}\n\nLazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_vec","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_vec","text":"batched_vec(A::Array{T,3}, B::Matrix)\nbatched_vec(A::Array{T,3}, b::Vector)\n\nBatched matrix-vector multiplication: the result has C[:,k] == A[:,:,k] * B[:,k] for all k, or else C[:,k] == 
A[:,:,k] * b for b::Vector.\n\nWith the same argument types, batched_mul(A, B) would regard B as a fixed matrix, not a batch of vectors. Both reshape and then call batched_mul(::Array{T,3}, ::Array{T,3}).\n\njulia> A, B, b = randn(16,8,32), randn(8,32), randn(8);\n\njulia> batched_vec(A,B) |> size\n(16, 32)\n\njulia> batched_vec(A,b) |> size\n(16, 32)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Gather-and-Scatter","page":"Low-level Operations – NNlib.jl","title":"Gather and Scatter","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Embedding layer uses NNlib.gather as its backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.gather\nNNlib.gather!\nNNlib.scatter\nNNlib.scatter!","category":"page"},{"location":"reference/models/nnlib/#NNlib.gather","page":"Low-level Operations – NNlib.jl","title":"NNlib.gather","text":"NNlib.gather(src, idx) -> dst\n\nReverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to\n\ndst[:, ... , k] .= src[:, ... , idx[k]...]\n\nNotice that if idx is a vector containing integers and src is a matrix, previous expression simplifies to\n\ndst[:, k] .= src[:, idx[k]]\n\nand k will run over 1:length(idx).\n\nThe elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.\n\nSee gather! 
for an in-place version.\n\nExamples\n\njulia> NNlib.gather([1,20,300,4000], [2,4,2])\n3-element Vector{Int64}:\n 20\n 4000\n 20\n\njulia> NNlib.gather([1 2 3; 4 5 6], [1,3,1,3,1])\n2×5 Matrix{Int64}:\n 1 3 1 3 1\n 4 6 4 6 4\n\n\n\n\n\ngather(src, IJK...)\n\nConvert the tuple of integer vectors IJK to a tuple of CartesianIndex and call gather on it: gather(src, CartesianIndex.(IJK...)).\n\nExamples\n\njulia> src = reshape([1:15;], 3, 5)\n3×5 Matrix{Int64}:\n 1 4 7 10 13\n 2 5 8 11 14\n 3 6 9 12 15\n\njulia> NNlib.gather(src, [1, 2], [2, 4])\n2-element Vector{Int64}:\n 4\n 11\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.gather!","page":"Low-level Operations – NNlib.jl","title":"NNlib.gather!","text":"NNlib.gather!(dst, src, idx)\n\nReverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to\n\ndst[:, ... , k] .= src[:, ... , idx[k]...]\n\nNotice that if idx is a vector containing integers, and both dst and src are matrices, the previous expression simplifies to\n\ndst[:, k] .= src[:, idx[k]]\n\nand k will run over 1:length(idx).\n\nThe elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.\n\nSee gather for an allocating version.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.scatter","page":"Low-level Operations – NNlib.jl","title":"NNlib.scatter","text":"NNlib.scatter(op, src, idx; [init, dstsize])\n\nScatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.\n\nIf keyword init is provided, it is used to initialize the content of dst. Otherwise, the init value is inferred from the reduction operator op for some common operators (e.g. 
init = 0 for op = +).\nIf dstsize is provided, it will be used to define the size of destination array, otherwise it will be inferred by src and idx.\n\nSee scatter! for full details on how idx works.\n\nExamples\n\njulia> NNlib.scatter(+, [10,100,1000], [3,1,2])\n3-element Vector{Int64}:\n 100\n 1000\n 10\n\njulia> NNlib.scatter(+, [1 2 3 4; 5 6 7 8], [2,1,1,5])\n2×5 Matrix{Int64}:\n 5 1 0 0 4\n 13 5 0 0 8\n\njulia> NNlib.scatter(*, [10,200,3000], [1,4,2]; init = 10, dstsize = 6)\n6-element Vector{Int64}:\n 100\n 30000\n 10\n 2000\n 10\n 10\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.scatter!","page":"Low-level Operations – NNlib.jl","title":"NNlib.scatter!","text":"NNlib.scatter!(op, dst, src, idx)\n\nScatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, accumulates values in dst according to\n\ndst[:, ..., idx[k]...] = (op).(dst[:, ..., idx[k]...], src[:, ..., k...])\n\nSee also scatter, gather.\n\nArguments\n\nop: Operations to be applied on dst and src, e.g. +, -, *, /, max, min and mean.\ndst: The destination for src to aggregate to. This argument will be mutated.\nsrc: The source data for aggregating.\nidx: The mapping for aggregation from source (index) to destination (value). 
The idx array can contain either integers or tuples.\n\nExamples\n\njulia> NNlib.scatter!(+, ones(3), [10,100], [1,3])\n3-element Vector{Float64}:\n 11.0\n 1.0\n 101.0\n\njulia> NNlib.scatter!(*, fill(0.5, 2, 4), [1 10; 100 1000], [3,2])\n2×4 Matrix{Float64}:\n 0.5 5.0 0.5 0.5\n 0.5 500.0 50.0 0.5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Sampling","page":"Low-level Operations – NNlib.jl","title":"Sampling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"grid_sample\n∇grid_sample","category":"page"},{"location":"reference/models/nnlib/#NNlib.grid_sample","page":"Low-level Operations – NNlib.jl","title":"NNlib.grid_sample","text":"grid_sample(input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros)\n\nGiven input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.\n\nThis implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).\n\nArguments\n\ninput: Input array in (W_in, H_in, C, N) shape.\ngrid: Input grid in (2, W_out, H_out, N) shape. Where for each (W_out, H_out, N) grid contains (x, y) coordinates that specify sampling locations normalized by the input shape.\nTherefore, x and y should have values in [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.\nOut-of-bound values are handled according to the padding_mode.\npadding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. 
Default is :zeros.\n\nReturns\n\n(W_out, H_out, C, N) sampled grid from input.\n\nExamples\n\nIn the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.\n\njulia> x = reshape(collect(1.0:4.0), (2, 2, 1, 1))\n2×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 1.0 3.0\n 2.0 4.0\n\njulia> grid = Array{Float64}(undef, 2, 3, 2, 1);\n\njulia> grid[:, 1, 1, 1] .= (-3, -1);\n\njulia> grid[:, 2, 1, 1] .= (0, -1);\n\njulia> grid[:, 3, 1, 1] .= (1, -1);\n\njulia> grid[:, 1, 2, 1] .= (-1, 1);\n\njulia> grid[:, 2, 2, 1] .= (0, 1);\n\njulia> grid[:, 3, 2, 1] .= (3, 1);\n\njulia> grid_sample(x, grid; padding_mode=:zeros)\n3×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 0.0 3.0\n 1.5 3.5\n 2.0 0.0\n\njulia> grid_sample(x, grid; padding_mode=:border)\n3×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 1.0 3.0\n 1.5 3.5\n 2.0 4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇grid_sample","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇grid_sample","text":"∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T\n\nArguments\n\nΔ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).\ninput: Input from primal computation in (W_in, H_in, C, N) shape.\ngrid: Grid from primal computation in (2, W_out, H_out, N) shape.\npadding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. 
Default is :zeros.\n\nReturns\n\ndinput (same shape as input) and dgrid (same shape as grid) gradients.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Losses","page":"Low-level Operations – NNlib.jl","title":"Losses","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"ctc_loss","category":"page"},{"location":"reference/models/nnlib/#NNlib.ctc_loss","page":"Low-level Operations – NNlib.jl","title":"NNlib.ctc_loss","text":"ctc_loss(ŷ, y)\n\nComputes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Miscellaneous","page":"Low-level Operations – NNlib.jl","title":"Miscellaneous","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"logsumexp\nNNlib.glu","category":"page"},{"location":"reference/models/nnlib/#NNlib.logsumexp","page":"Low-level Operations – NNlib.jl","title":"NNlib.logsumexp","text":"logsumexp(x; dims = :)\n\nComputes log.(sum(exp.(x); dims)) in a numerically stable way. 
Without dims keyword this returns a scalar.\n\nSee also logsoftmax.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.glu","page":"Low-level Operations – NNlib.jl","title":"NNlib.glu","text":"glu(x, dim = 1)\n\nThe gated linear unit from the \"Language Modeling with Gated Convolutional Networks\" paper.\n\nCalculates a .* sigmoid(b), where x is split in half along given dimension dim to form a and b.\n\n\n\n\n\n","category":"function"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/training/optimisers/#man-optimisers","page":"Optimisation Rules","title":"Optimisation Rules","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Any optimization rule from Optimisers.jl can be used with train! and other training functions.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"For full details of how the new interface works, see the Optimisers.jl documentation.","category":"page"},{"location":"reference/training/optimisers/#Optimisers-Reference","page":"Optimisation Rules","title":"Optimisers Reference","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"All optimisers return an object that, when passed to train!, will update the parameters passed to it.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation 
Rules","text":"Optimisers.Descent\nOptimisers.Momentum\nOptimisers.Nesterov\nOptimisers.RMSProp\nOptimisers.Adam\nOptimisers.RAdam\nOptimisers.AdaMax\nOptimisers.AdaGrad\nOptimisers.AdaDelta\nOptimisers.AMSGrad\nOptimisers.NAdam\nOptimisers.AdamW\nOptimisers.OAdam\nOptimisers.AdaBelief\nOptimisers.Lion","category":"page"},{"location":"reference/training/optimisers/#Optimisers.Descent","page":"Optimisation Rules","title":"Optimisers.Descent","text":"Descent(η = 1f-1)\nDescent(; [eta])\n\nClassic gradient descent optimiser with learning rate η. For each parameter p and its gradient dp, this runs p -= η*dp.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Momentum","page":"Optimisation Rules","title":"Optimisers.Momentum","text":"Momentum(η = 0.01, ρ = 0.9)\nMomentum(; [eta, rho])\n\nGradient descent optimizer with learning rate η and momentum ρ.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ == rho): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Nesterov","page":"Optimisation Rules","title":"Optimisers.Nesterov","text":"Nesterov(η = 0.001, ρ = 0.9)\nNesterov(; [eta, rho])\n\nGradient descent optimizer with learning rate η and Nesterov momentum ρ.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nNesterov momentum (ρ): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.RMSProp","page":"Optimisation Rules","title":"Optimisers.RMSProp","text":"RMSProp(η = 0.001, ρ = 0.9, ϵ = 1e-8; centred = 
false)\nRMSProp(; [eta, rho, epsilon, centred])\n\nOptimizer using the RMSProp algorithm. Often a good choice for recurrent networks. Parameters other than learning rate generally don't need tuning.\n\nCentred RMSProp is a variant which normalises gradients by an estimate of their variance, instead of their second moment.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ == rho): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\nKeyword centred (or centered): Indicates whether to use centred variant of the algorithm.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Adam","page":"Optimisation Rules","title":"Optimisers.Adam","text":"Adam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nAdam(; [eta, beta, epsilon])\n\nAdam optimiser.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.RAdam","page":"Optimisation Rules","title":"Optimisers.RAdam","text":"RAdam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nRAdam(; [eta, beta, epsilon])\n\nRectified Adam optimizer.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change 
default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaMax","page":"Optimisation Rules","title":"Optimisers.AdaMax","text":"AdaMax(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nAdaMax(; [eta, beta, epsilon])\n\nAdaMax is a variant of Adam based on the ∞-norm.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaGrad","page":"Optimisation Rules","title":"Optimisers.AdaGrad","text":"AdaGrad(η = 0.1, ϵ = 1e-8)\nAdaGrad(; [eta, epsilon])\n\nAdaGrad optimizer. It has parameter specific learning rates based on how frequently it is updated. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaDelta","page":"Optimisation Rules","title":"Optimisers.AdaDelta","text":"AdaDelta(ρ = 0.9, ϵ = 1e-8)\nAdaDelta(; [rho, epsilon])\n\nAdaDelta is a version of AdaGrad adapting its learning rate based on a window of past gradient updates. 
Parameters don't need tuning.\n\nParameters\n\nRho (ρ == rho): Factor by which the gradient is decayed at each time step.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AMSGrad","page":"Optimisation Rules","title":"Optimisers.AMSGrad","text":"AMSGrad(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nAMSGrad(; [eta, beta, epsilon])\n\nThe AMSGrad version of the Adam optimiser. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.NAdam","page":"Optimisation Rules","title":"Optimisers.NAdam","text":"NAdam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\nNAdam(; [eta, beta, epsilon])\n\nNAdam is a Nesterov variant of Adam. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdamW","page":"Optimisation Rules","title":"Optimisers.AdamW","text":"AdamW(η = 0.001, β = (0.9, 0.999), λ = 0, ϵ = 1e-8; couple = true)\nAdamW(; [eta, beta, lambda, epsilon, couple])\n\nAdamW is a variant of Adam fixing (as in repairing) its weight decay regularization. 
Implemented as an OptimiserChain of Adam and WeightDecay.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nWeight decay (λ == lambda): Controls the strength of L_2 regularisation.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\nKeyword couple: If true, the weight decay is coupled with the learning rate, as in PyTorch's AdamW. This corresponds to an update of the form x = x - η * (dx + λ * x), where dx is the update from Adam with learning rate 1. If false, the weight decay is decoupled from the learning rate, in the spirit of the original paper. This corresponds to an update of the form x = x - η * dx - λ * x. Default is true.\n\nwarning: Breaking change in v0.4\nWith version 0.4 the default update rule for AdamW has changed to match the PyTorch implementation. The previous rule, which is closer to the original paper, can be obtained by setting AdamW(..., couple=false). 
See this issue for more details.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.OAdam","page":"Optimisation Rules","title":"Optimisers.OAdam","text":"OAdam(η = 0.001, β = (0.5, 0.9), ϵ = 1e-8)\nOAdam(; [eta, beta, epsilon])\n\nOAdam (Optimistic Adam) is a variant of Adam adding an \"optimistic\" term suitable for adversarial training.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaBelief","page":"Optimisation Rules","title":"Optimisers.AdaBelief","text":"AdaBelief(η = 0.001, β = (0.9, 0.999), ϵ = 1e-16)\nAdaBelief(; [eta, beta, epsilon])\n\nThe AdaBelief optimiser is a variant of the well-known Adam optimiser.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Lion","page":"Optimisation Rules","title":"Optimisers.Lion","text":"Lion(η = 0.001, β = (0.9, 0.999))\nLion(; [eta, beta])\n\nLion optimiser.\n\nParameters\n\nLearning rate (η == eta): Magnitude by which gradients are updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Composing-Optimisers","page":"Optimisation Rules","title":"Composing 
Optimisers","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Flux (through Optimisers.jl) defines a special kind of optimiser called OptimiserChain which takes in arbitrary optimisers as input. Its behaviour is similar to the usual optimisers, but differs in that it acts by calling the optimisers listed in it sequentially. Each optimiser produces a modified gradient that will be fed into the next, and the resultant update will be applied to the parameter as usual. A classic use case is where adding decays is desirable. Optimisers.jl defines the basic decay corresponding to an L_2 regularization in the loss as WeightDecay.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"opt = OptimiserChain(WeightDecay(1e-4), Descent())","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Here we apply the weight decay to the Descent optimiser. 
The resulting optimiser opt can be used like any other optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"w = [randn(10, 10), randn(10, 10)]\nopt_state = Flux.setup(opt, w)\n\nloss(w, x) = Flux.mse(w[1] * x, w[2] * x)\n\nloss(w, rand(10)) # around 0.9\n\nfor t = 1:10^5\n g = gradient(w -> loss(w, rand(10)), w)[1] # gradient returns a tuple, one entry per argument\n Flux.update!(opt_state, w, g)\nend\n\nloss(w, rand(10)) # around 0.9","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"It is possible to compose optimisers for some added flexibility.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.OptimiserChain","category":"page"},{"location":"reference/training/optimisers/#Optimisers.OptimiserChain","page":"Optimisation Rules","title":"Optimisers.OptimiserChain","text":"OptimiserChain(opts...)\n\nCompose a sequence of optimisers so that each opt in opts updates the gradient, in the order specified.\n\nWith an empty sequence, OptimiserChain() is the identity, so update! will subtract the full gradient from the parameters. This is equivalent to Descent(1).\n\nExample\n\njulia> o = OptimiserChain(ClipGrad(1.0), Descent(0.1));\n\njulia> m = (zeros(3),);\n\njulia> s = Optimisers.setup(o, m)\n(Leaf(OptimiserChain(ClipGrad(1.0), Descent(0.1)), (nothing, nothing)),)\n\njulia> Optimisers.update(s, m, ([0.3, 1, 7],))[2] # clips before discounting\n([-0.03, -0.1, -0.1],)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Scheduling-Optimisers","page":"Optimisation Rules","title":"Scheduling Optimisers","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. 
There are a variety of popular scheduling policies, and you can find implementations of them in ParameterSchedulers.jl. The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a cosine annealing schedule with a momentum optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between 1e-4 and 1e-2 every 10 steps. We also create a new Momentum optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"using ParameterSchedulers\n\nopt = Momentum()\nschedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)\nfor (eta, epoch) in zip(schedule, 1:100)\n opt.eta = eta\n # your training code here\nend","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"schedule can also be indexed (e.g. schedule(100)) or iterated like any iterator in Julia.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"ParameterSchedulers.jl schedules are stateless (they don't store their iteration state). 
If you want a stateful schedule, you can use ParameterSchedulers.Stateful:","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"using ParameterSchedulers: Stateful, next!\n\nschedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))\nfor epoch in 1:100\n opt.eta = next!(schedule)\n # your training code here\nend","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.","category":"page"},{"location":"reference/training/optimisers/#Decays","page":"Optimisation Rules","title":"Decays","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.SignDecay\nOptimisers.WeightDecay","category":"page"},{"location":"reference/training/optimisers/#Optimisers.SignDecay","page":"Optimisation Rules","title":"Optimisers.SignDecay","text":"SignDecay(λ = 1e-3)\nSignDecay(; [lambda])\n\nImplements L_1 regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.\n\nIt does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.\n\nSee also [WeightDecay] for L_2 normalisation. 
They can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.\n\nParameters\n\nPenalty (λ ≥ 0): Controls the strength of the regularisation.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.WeightDecay","page":"Optimisation Rules","title":"Optimisers.WeightDecay","text":"WeightDecay(λ = 5e-4)\nWeightDecay(; [lambda])\n\nImplements L_2 regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.\n\nIt does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.\n\nSee also [SignDecay] for L_1 normalisation.\n\nParameters\n\nPenalty (λ ≥ 0): Controls the strength of the regularisation.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Gradient-Clipping","page":"Optimisation Rules","title":"Gradient Clipping","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. 
An example usage is","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.ClipGrad\nOptimisers.ClipNorm","category":"page"},{"location":"reference/training/optimisers/#Optimisers.ClipGrad","page":"Optimisation Rules","title":"Optimisers.ClipGrad","text":"ClipGrad(δ = 10)\nClipGrad(; [delta])\n\nRestricts every gradient component to obey -δ ≤ dx[i] ≤ δ.\n\nTypically composed with other rules using OptimiserChain.\n\nSee also ClipNorm.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.ClipNorm","page":"Optimisation Rules","title":"Optimisers.ClipNorm","text":"ClipNorm(ω = 10, p = 2; throw = true)\n\nScales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).\n\nThrows an error if the norm is infinite or NaN, which you can turn off with throw = false.\n\nTypically composed with other rules using OptimiserChain.\n\nSee also ClipGrad.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#GPU-Support","page":"GPU Support","title":"GPU Support","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Most work on neural networks involves the use of GPUs, as they can typically perform the required computation much faster. This page describes how Flux co-operates with various other packages, which talk to GPU hardware.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For those in a hurry, see the quickstart page. Or do using CUDA and then call gpu on both the model and the data. 
","category":"page"},{"location":"guide/gpu/#Basic-GPU-use:-from-Array-to-CuArray","page":"GPU Support","title":"Basic GPU use: from Array to CuArray","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Julia's GPU packages work with special array types, in place of the built-in Array. The most used is CuArray provided by CUDA.jl, for GPUs made by NVIDIA. That package provides a function cu which converts an ordinary Array (living in CPu memory) to a CuArray (living in GPU memory). Functions like * and broadcasting specialise so that, when given CuArrays, all the computation happens on the GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"W = randn(3, 4) # some weights, on CPU: 3×4 Array{Float64, 2}\nx = randn(4) # fake data\ny = tanh.(W * x) # computation on the CPU\n\nusing CUDA\n\ncu(W) isa CuArray{Float32}\n(cW, cx) = (W, x) |> cu # move both to GPU\ncy = tanh.(cW * cx) # computation on the GPU","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Notice that cu doesn't only move arrays, it also recurses into many structures, such as the tuple (W, x) above. (Notice also that it converts Julia's default Float64 numbers to Float32, as this is what most GPUs support efficiently – it calls itself \"opinionated\". Flux defaults to Float32 in all cases.)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To use CUDA with Flux, you can simply use cu to move both the model, and the data. 
It will create a copy of the Flux model, with all of its parameter arrays moved to the GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using Pkg; Pkg.add([\"CUDA\", \"cuDNN\"]) # do this once\n\nusing Flux, CUDA\nCUDA.allowscalar(false) # recommended\n\nmodel = Dense(W, true, tanh) # wrap the same matrix W in a Flux layer\nmodel(x) ≈ y # same result, still on CPU\n\nc_model = cu(model) # move all the arrays within model to the GPU\nc_model(cx) # computation on the GPU","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Notice that you need using CUDA (every time) but also ] add cuDNN (once, when installing packages). This is a quirk of how these packages are set up. (The cuDNN.jl sub-package handles operations such as convolutions, called by Flux via NNlib.jl.)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux's gradient, and training functions like setup, update!, and train!, are all equally happy to accept GPU arrays and GPU models, and then perform all computations on the GPU. It is recommended that you move the model to the GPU before calling setup.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"grads = Flux.gradient((f,x) -> sum(abs2, f(x)), model, x) # on CPU\nc_grads = Flux.gradient((f,x) -> sum(abs2, f(x)), c_model, cx) # same result, all on GPU\n\nc_opt = Flux.setup(Adam(), c_model) # setup optimiser after moving model to GPU\n\nFlux.update!(c_opt, c_model, c_grads[1]) # mutates c_model but not model","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To move arrays and other objects back to the CPU, Flux provides a function cpu. 
This is recommended when saving models, Flux.state(c_model |> cpu), see below.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"cpu(cW) isa Array{Float32, 2}\n\nmodel2 = cpu(c_model) # copy model back to CPU\nmodel2(x)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"compat: Flux ≤ 0.13\nOld versions of Flux automatically loaded CUDA.jl to provide GPU support. Starting from Flux v0.14, it has to be loaded separately. Julia's package extensions allow Flux to automatically load some GPU-specific code when needed.","category":"page"},{"location":"guide/gpu/#Other-GPU-packages-for-AMD-and-Apple","page":"GPU Support","title":"Other GPU packages for AMD & Apple","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Non-NVIDIA graphics cards are supported by other packages. Each provides its own function which behaves like cu. AMD GPU support provided by AMDGPU.jl, on systems with ROCm and MIOpen installed. This package has a function roc which converts Array to ROCArray:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using Flux, AMDGPU\nAMDGPU.allowscalar(false)\n\nr_model = roc(model)\nr_model(roc(x))\n\nFlux.gradient((f,x) -> sum(abs2, f(x)), r_model, roc(x))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Experimental support for Apple devices with M-series chips is provided by Metal.jl. 
This has a function mtl which works like cu, converting Array to MtlArray:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using Flux, Metal\nMetal.allowscalar(false)\n\nm_model = mtl(model)\nm_y = m_model(mtl(x))\n\nFlux.gradient((f,x) -> sum(abs2, f(x)), m_model, mtl(x))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"danger: Experimental\nMetal support in Flux is experimental and many features are not yet available. AMD support is improving, but likely to have more rough edges than CUDA.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If you want your model to work with any brand of GPU, or none, then you may not wish to write cu everywhere. One simple way to be generic is, at the top of the file, to un-comment one of several lines which import a package and assign its \"adaptor\" to the same name:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using CUDA: cu as device # after this, `device === cu`\n# using AMDGPU: roc as device\n# device = identity # do-nothing, for CPU\n\nusing Flux\nmodel = Chain(...) |> device","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"note: Adapt.jl\nThe functions cu, mtl, roc all use Adapt.jl, to work within various wrappers. The reason they work on Flux models is that Flux.@layer Layer defines methods of Adapt.adapt_structure(to, lay::Layer).","category":"page"},{"location":"guide/gpu/#Automatic-GPU-choice-with-gpu-and-gpu_device","page":"GPU Support","title":"Automatic GPU choice with gpu and gpu_device","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux also provides a more automatic way of choosing which GPU (or none) to use. 
This is the function gpu:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default it does nothing.\nIf the package CUDA is loaded, and CUDA.functional() === true, then it behaves like cu.\nIf the package AMDGPU is loaded, and AMDGPU.functional() === true, then it behaves like roc.\nIf the package Metal is loaded, and Metal.functional() === true, then it behaves like mtl.\nIf two different GPU packages are loaded, the first one takes priority.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For the most part, this means that a script which says model |> gpu and data |> gpu will just work. It should always run, and if a GPU package is loaded (and finds the correct hardware) then that will be used.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The function gpu uses a lower-level function called gpu_device from MLDataDevices.jl, which checks what to do and then returns some device object. In fact, the entire implementation is just this:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"gpu(x) = gpu_device()(x)\ncpu(x) = cpu_device()(x)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Automatic backend selection through gpu is not type-stable. That doesn't matter if you do it once, or once per large batch – it costs a few microseconds. But it might matter if you do it within some loop.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To avoid this, you can first obtain a \"device object\" with device = gpu_device(), once, and then use this as the function to transfer data. 
Something like this:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"to_device = gpu_device()\ngpu_model = model |> to_device\n\nfor epoch in 1:num_epochs\n for (x, y) in dataloader\n x_gpu, y_gpu = (x, y) |> to_device\n # training code...\n end\nend","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Finally, setting a backend preference with gpu_backend! gives type stability to the whole pipeline.","category":"page"},{"location":"guide/gpu/#Transferring-Training-Data","page":"GPU Support","title":"Transferring Training Data","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In order to train the model on the GPU, both the model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Iterating over the batches in a DataLoader object, transferring one training batch at a time to the GPU. This is recommended for large datasets. Done by hand, it might look like this:\ntrain_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)\n# ... model definition, optimiser setup\nfor epoch in 1:epochs\n for (x_cpu, y_cpu) in train_loader\n x = gpu(x_cpu)\n y = gpu(y_cpu)\n grads = gradient(m -> loss(m, x, y), model)\n Flux.update!(opt_state, model, grads[1])\n end\nend\nRather than write this out every time, you can just call gpu(::DataLoader):\ngpu_train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu\n# ... model definition, optimiser setup\nfor epoch in 1:epochs\n for (x, y) in gpu_train_loader\n grads = gradient(m -> loss(m, x, y), model)\n Flux.update!(opt_state, model, grads[1])\n end\nend\nThis is equivalent to DataLoader(MLUtils.mapobs(gpu, (X, Y)); keywords...). 
Something similar can also be done with CUDA.CuIterator, gpu_train_loader = CUDA.CuIterator(train_loader). However, this only works with a limited number of data types: first(train_loader) should be a tuple (or NamedTuple) of arrays.\nTransferring all training data to the GPU at once before creating the DataLoader. This is usually performed for smaller datasets which are sure to fit in the available GPU memory.\ngpu_train_loader = Flux.DataLoader((X, Y) |> gpu, batchsize = 32)\n# ...\nfor epoch in 1:epochs\n for (x, y) in gpu_train_loader\n # ...\nHere (X, Y) |> gpu applies gpu to both arrays, as it recurses into structures.","category":"page"},{"location":"guide/gpu/#Saving-GPU-Trained-Models","page":"GPU Support","title":"Saving GPU-Trained Models","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"After the training process is done, we must always transfer the trained model back to the CPU memory before serializing or saving to disk. 
This can be done with cpu:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"model = cpu(model) # or model = model |> cpu","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"and then","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using BSON\n# ...\nBSON.@save \"./path/to/trained_model.bson\" model\n\n# in this approach the cpu-transferred model (referenced by the variable `model`)\n# only exists inside the `let` statement\nlet model = cpu(model)\n # ...\n BSON.@save \"./path/to/trained_model.bson\" model\nend\n\n# is equivalent to the above, but uses `key=value` storing directive from BSON.jl\nBSON.@save \"./path/to/trained_model.bson\" model = cpu(model)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The reason behind this is that models trained in the GPU but not transferred to the CPU memory scope will expect CuArrays as input. 
In other words, Flux models expect input data to live on the same kind of device as the one on which they were trained.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In controlled scenarios, where the data fed to the loaded models is guaranteed to be on the GPU, there is no need to transfer the models back to CPU memory. In production environments, however, where artifacts are shared among different processes, machines or configurations, there is no guarantee that the CUDA.jl package will be available to the process performing inference on the model loaded from disk.","category":"page"},{"location":"guide/gpu/#Disabling-CUDA-or-choosing-which-GPUs-are-visible-to-Flux","page":"GPU Support","title":"Disabling CUDA or choosing which GPUs are visible to Flux","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Sometimes it is necessary to control which GPUs are visible to Julia on a system with multiple GPUs, or to disable GPUs entirely. 
This can be achieved with an environment variable CUDA_VISIBLE_DEVICES.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To disable all devices:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"$ export CUDA_VISIBLE_DEVICES='-1'","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To select specific devices by device id:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"$ export CUDA_VISIBLE_DEVICES='0,1'","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"More information for conditional use of GPUs in CUDA.jl can be found in its documentation, and information about the specific use of the variable is described in the Nvidia CUDA blog post.","category":"page"},{"location":"guide/gpu/#Data-movement-across-GPU-devices","page":"GPU Support","title":"Data movement across GPU devices","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux also supports getting handles to specific GPU devices, and transferring models from one GPU device to another GPU device from the same backend. Let's try it out for NVIDIA GPUs. First, we list all the available devices:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA;\n\njulia> CUDA.devices()\nCUDA.DeviceIterator() for 3 devices:\n0. NVIDIA TITAN RTX\n1. NVIDIA TITAN RTX\n2. 
NVIDIA TITAN RTX","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, let's select the device with id 0:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> device0 = gpu_device(1)\n(::CUDADevice{CuDevice}) (generic function with 4 methods)\n\njulia> device0.device\nCuDevice(0): NVIDIA TITAN RTX","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Notice that indexing starts from 0 in the CUDA.devices() output, but gpu_device expects the device id starting from 1.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, let's move a simple dense layer to the GPU represented by device0:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> dense_model = Dense(2 => 3)\nDense(2 => 3) # 9 parameters\n\njulia> dense_model = dense_model |> device0;\n\njulia> dense_model.weight\n3×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n -0.142062 -0.131455\n -0.828134 -1.06552\n 0.608595 -1.05375\n\njulia> CUDA.device(dense_model.weight) # check the GPU to which dense_model is attached\nCuDevice(0): NVIDIA TITAN RTX","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Next, we'll get a handle to the device with id 1, and move dense_model to that device:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> device1 = gpu_device(2)\n(::CUDADevice{CuDevice}) (generic function with 4 methods)\n\njulia> dense_model = dense_model |> device1; # don't directly print the model; see warning below\n\njulia> CUDA.device(dense_model.weight)\nCuDevice(1): NVIDIA TITAN RTX","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for 
CUDA and AMDGPU backends.","category":"page"},{"location":"guide/gpu/#Distributed-data-parallel-training","page":"GPU Support","title":"Distributed data parallel training","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"danger: Experimental\nDistributed support is experimental and could change in the future.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux now supports distributed data parallel training with the DistributedUtils module. If you want to run your code on multiple GPUs, you have to install MPI.jl (see docs for more info).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using MPI\n\njulia> MPI.install_mpiexecjl()","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Now you can run your code with mpiexecjl --project=. -n julia .jl from the command line.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You can use either the MPIBackend or the NCCLBackend, the latter only if NCCL.jl is also loaded. 
First, initialize a backend with DistributedUtils.initialize, e.g.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, MPI, NCCL, CUDA\n\njulia> CUDA.allowscalar(false)\n\njulia> DistributedUtils.initialize(NCCLBackend)\n\njulia> backend = DistributedUtils.get_distributed_backend(NCCLBackend)\nNCCLBackend{Communicator, MPIBackend{MPI.Comm}}(Communicator(Ptr{NCCL.LibNCCL.ncclComm} @0x000000000607a660), MPIBackend{MPI.Comm}(MPI.Comm(1140850688)))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Pass your model, as well as any data to GPU device.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> model = Chain(Dense(1 => 256, tanh), Dense(256 => 1)) |> gpu\nChain(\n Dense(1 => 256, tanh), # 512 parameters\n Dense(256 => 1), # 257 parameters\n) # Total: 4 arrays, 769 parameters, 744 bytes.\n\njulia> x = rand(Float32, 1, 16) |> gpu\n1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.239324 0.331029 0.924996 0.55593 0.853093 0.874513 0.810269 0.935858 0.477176 0.564591 0.678907 0.729682 0.96809 0.115833 0.66191 0.75822\n\njulia> y = x .^ 3\n1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.0137076 0.0362744 0.791443 0.171815 0.620854 0.668804 0.53197 0.819654 0.108651 0.179971 0.312918 0.388508 0.907292 0.00155418 0.29 0.435899","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In this case, we are training on a total of 16 * number of processes samples. 
You can also use DistributedUtils.DistributedDataContainer to split the data uniformly across processes (or do it manually).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> data = DistributedUtils.DistributedDataContainer(backend, x)\nFlux.DistributedUtils.DistributedDataContainer(Float32[0.23932439 0.33102947 … 0.66191036 0.75822026], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You have to wrap your model in DistributedUtils.FluxDistributedModel and synchronize it (broadcast across all processes):","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> model = DistributedUtils.synchronize!!(backend, DistributedUtils.FluxDistributedModel(model); root=0)\nChain(\n Dense(1 => 256, tanh), # 512 parameters\n\n Dense(256 => 1), # 257 parameters\n) # Total: 4 arrays, 769 parameters, 744 bytes.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Time to set up an optimizer by using DistributedUtils.DistributedOptimizer and synchronize it as well.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Optimisers\n\njulia> opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))\nDistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8))\n\njulia> st_opt = Optimisers.setup(opt, model)\n(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)\n\njulia> st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0)\n(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Now you can define loss and train the model.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> loss(model) = mean((model(x) .- y).^2)\nloss (generic function with 1 method)\n\njulia> for epoch in 1:100\n global model, st_opt\n l, grad = 
Zygote.withgradient(loss, model)\n println(\"Epoch $epoch: Loss $l\")\n st_opt, model = Optimisers.update(st_opt, model, grad[1])\n end\nEpoch 1: Loss 0.011638729\nEpoch 2: Loss 0.0116432225\nEpoch 3: Loss 0.012763695\n...","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Remember that to run this on multiple GPUs you have to launch it from the command line with mpiexecjl --project=. -n julia .jl, where -n takes the number of processes that you want to use. The number of processes usually corresponds to the number of GPUs.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default, the MPI installation used by MPI.jl is CUDA-unaware, so if you want to run in CUDA-aware mode, read more here on custom installation and rebuilding MPI.jl. Then test whether your MPI is CUDA-aware by running:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> import Pkg\njulia> Pkg.test(\"MPI\"; test_args=[\"--backend=CUDA\"])","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If it is, set your local preference as below:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Preferences\njulia> set_preferences!(\"Flux\", \"FluxDistributedMPICUDAAware\" => true)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"warning: Known shortcomings\nWe don't run CUDA-aware tests, so you use this mode at your own risk.","category":"page"},{"location":"guide/gpu/#Checking-GPU-Availability","page":"GPU Support","title":"Checking GPU Availability","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default, Flux will run the checks on your system to see if it can support GPU functionality. 
You can check if Flux identified a valid GPU setup by typing the following:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using CUDA\n\njulia> CUDA.functional()\ntrue","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For AMD GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using AMDGPU\n\njulia> AMDGPU.functional()\ntrue\n\njulia> AMDGPU.functional(:MIOpen)\ntrue","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For Metal GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Metal\n\njulia> Metal.functional()\ntrue","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/utilities/#man-init-funcs","page":"Weight Initialisation","title":"Random Weight Initialisation","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux initialises convolutional layers and recurrent cells with glorot_uniform by default. Most layers accept a function as an init keyword, which replaces this default. 
For example:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"julia> conv = Conv((3, 3), 3 => 2, relu; init=Flux.glorot_normal)\nConv((3, 3), 3 => 2, relu) # 56 parameters\n\njulia> conv.bias\n2-element Vector{Float32}:\n 0.0\n 0.0","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Note that init creates the weight array, but not the bias vector.","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Many of the initialisation functions accept keywords such as gain, and a random number generator. To make it easy to pass these to layers, there are methods which return a function:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"julia> Dense(4 => 5, tanh; init=Flux.glorot_uniform(gain=2))\nDense(4 => 5, tanh) # 25 parameters\n\njulia> Dense(4 => 5, tanh; init=Flux.randn32(MersenneTwister(1)))\nDense(4 => 5, tanh) # 25 parameters","category":"page"},{"location":"reference/utilities/#Initialisation-functions","page":"Weight Initialisation","title":"Initialisation functions","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.glorot_uniform\nFlux.glorot_normal\nFlux.kaiming_uniform\nFlux.kaiming_normal\nFlux.truncated_normal\nFlux.lecun_normal\nFlux.orthogonal\nFlux.sparse_init\nFlux.identity_init\nFlux.ones32\nFlux.zeros32\nFlux.rand32\nFlux.randn32\nFlux.create_bias","category":"page"},{"location":"reference/utilities/#Flux.glorot_uniform","page":"Weight Initialisation","title":"Flux.glorot_uniform","text":"glorot_uniform([rng], size...; gain = 1) -> Array\nglorot_uniform([rng]; kw...) 
-> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(6 / (fan_in + fan_out)).\n\nThis method is described in [1] and also known as Xavier initialization.\n\nExamples\n\njulia> Flux.glorot_uniform(3, 4) |> summary\n\"3×4 Matrix{Float32}\"\n\njulia> round.(extrema(Flux.glorot_uniform(10, 100)), digits=3)\n(-0.233f0, 0.233f0)\n\njulia> round.(extrema(Flux.glorot_uniform(100, 10)), digits=3)\n(-0.234f0, 0.233f0)\n\njulia> round.(extrema(Flux.glorot_uniform(100, 100)), digits=3)\n(-0.173f0, 0.173f0)\n\njulia> Dense(3 => 2, tanh; init = Flux.glorot_uniform(MersenneTwister(1)))\nDense(3 => 2, tanh) # 8 parameters\n\njulia> ans.bias\n2-element Vector{Float32}:\n 0.0\n 0.0\n\nReferences\n\n[1] Glorot, Xavier, and Yoshua Bengio. \"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.glorot_normal","page":"Weight Initialisation","title":"Flux.glorot_normal","text":"glorot_normal([rng], size...; gain = 1) -> Array\nglorot_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a normal distribution with standard deviation gain * sqrt(2 / (fan_in + fan_out)), using nfan.\n\nThis method is described in [1] and also known as Xavier initialization.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.glorot_normal(10, 1000)), digits=3)\n0.044f0\n\njulia> round(std(Flux.glorot_normal(1000, 10)), digits=3)\n0.045f0\n\njulia> round(std(Flux.glorot_normal(1000, 1000)), digits=3)\n0.032f0\n\njulia> Dense(10 => 1000, tanh; init = Flux.glorot_normal(gain=100))\nDense(10 => 1000, tanh) # 11_000 parameters\n\njulia> round(std(ans.weight), sigdigits=3)\n4.45f0\n\nReferences\n\n[1] Glorot, Xavier, and Yoshua Bengio. 
\"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.kaiming_uniform","page":"Weight Initialisation","title":"Flux.kaiming_uniform","text":"kaiming_uniform([rng], size...; gain = √2) -> Array\nkaiming_uniform([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(3/fan_in) using nfan.\n\nThis method is described in [1] and also known as He initialization.\n\nExamples\n\njulia> round.(extrema(Flux.kaiming_uniform(100, 10)), digits=3)\n(-0.774f0, 0.773f0)\n\njulia> round.(extrema(Flux.kaiming_uniform(10, 100)), digits=3)\n(-0.243f0, 0.245f0)\n\njulia> round.(extrema(Flux.kaiming_uniform(100, 100)), digits=3)\n(-0.245f0, 0.245f0)\n\nReferences\n\n[1] He, Kaiming, et al. \"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.kaiming_normal","page":"Weight Initialisation","title":"Flux.kaiming_normal","text":"kaiming_normal([rng], size...; gain = √2) -> Array\nkaiming_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers taken from a normal distribution standard deviation gain / sqrt(fan_in), using nfan.\n\nThis method is described in [1] and also known as He initialization.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.kaiming_normal(10, 1000)), digits=3)\n0.044f0\n\njulia> round(std(Flux.kaiming_normal(1000, 10)), digits=3)\n0.45f0\n\njulia> round(std(Flux.kaiming_normal(1000, 1000)), digits=3)\n0.045f0\n\nReferences\n\n[1] He, Kaiming, et al. 
\"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.truncated_normal","page":"Weight Initialisation","title":"Flux.truncated_normal","text":"truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array\ntruncated_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size where each element is drawn from a truncated normal distribution. The numbers are distributed like filter(x -> lo<=x<=hi, mean .+ std .* randn(100)).\n\nThe values are generated by sampling a Uniform(0, 1) (rand()) and then applying the inverse CDF of the truncated normal distribution. This method works best when lo ≤ mean ≤ hi.\n\nExamples\n\njulia> using Statistics\n\njulia> Flux.truncated_normal(3, 4) |> summary\n\"3×4 Matrix{Float32}\"\n\njulia> round.(extrema(Flux.truncated_normal(10^6)); digits=3)\n(-2.0f0, 2.0f0)\n\njulia> round(std(Flux.truncated_normal(10^6; lo = -100, hi = 100)))\n1.0f0\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.lecun_normal","page":"Weight Initialisation","title":"Flux.lecun_normal","text":"lecun_normal([rng], size...) -> Array\nlecun_normal([rng]; kw...) 
-> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a truncated normal distribution centered on 0 with stddev sqrt(1 / fan_in), where fan_in is the number of input units in the weight tensor.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.lecun_normal(10, 1000)), digits=3)\n0.032f0\n\njulia> round(std(Flux.lecun_normal(1000, 10)), digits=3)\n0.32f0\n\njulia> round(std(Flux.lecun_normal(1000, 1000)), digits=3)\n0.032f0\n\njulia> Dense(10 => 1000, selu; init = Flux.lecun_normal())\nDense(10 => 1000, selu) # 11_000 parameters\n\njulia> round(std(ans.weight), digits=3)\n0.313f0\n\nReferences\n\n[1] Lecun, Yann, et al. \"Efficient backprop.\" Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.orthogonal","page":"Weight Initialisation","title":"Flux.orthogonal","text":"orthogonal([rng], size...; gain = 1) -> Array\northogonal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size which is a (semi) orthogonal matrix, as described in [1].\n\nCannot construct a vector, i.e. length(size) == 1 is forbidden. For length(size) > 2, a prod(size[1:(end - 1)]) by size[end] orthogonal matrix is computed before reshaping it to the original dimensions.\n\nExamples\n\njulia> W = Flux.orthogonal(5, 7);\n\njulia> summary(W)\n\"5×7 Matrix{Float32}\"\n\njulia> W * W' ≈ I(5)\ntrue\n\njulia> W2 = Flux.orthogonal(7, 5);\n\njulia> W2 * W2' ≈ I(7)\nfalse\n\njulia> W2' * W2 ≈ I(5)\ntrue\n\njulia> W3 = Flux.orthogonal(3, 3, 2, 4);\n\njulia> transpose(reshape(W3, :, 4)) * reshape(W3, :, 4) ≈ I(4)\ntrue\n\nReferences\n\n[1] Saxe, McClelland, Ganguli. 
\"Exact solutions to the nonlinear dynamics of learning in deep linear neural networks\", ICLR 2014, https://arxiv.org/abs/1312.6120\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.sparse_init","page":"Weight Initialisation","title":"Flux.sparse_init","text":"sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array\nsparse_init([rng]; kw...) -> Function\n\nReturn a Matrix{Float32} of size rows, cols where each column contains a fixed fraction of zero elements given by sparsity. Non-zero elements are normally distributed with a mean of zero and standard deviation std.\n\nThis method is described in [1].\n\nExamples\n\njulia> count(iszero, Flux.sparse_init(10, 10, sparsity=1/5))\n20\n\njulia> sum(0 .== Flux.sparse_init(10, 11, sparsity=0.9), dims=1)\n1×11 Matrix{Int64}:\n 9 9 9 9 9 9 9 9 9 9 9\n\njulia> Dense(3 => 10, tanh; init=Flux.sparse_init(sparsity=0.5))\nDense(3 => 10, tanh) # 40 parameters\n\njulia> count(iszero, ans.weight, dims=1)\n1×3 Matrix{Int64}:\n 5 5 5\n\nReferences\n\n[1] Martens, J, \"Deep learning via Hessian-free optimization\" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.identity_init","page":"Weight Initialisation","title":"Flux.identity_init","text":"identity_init(size...; gain=1, shift=0) -> Array\nidentity_init(; kw...) -> Function\n\nReturn an Array{Float32} of the given size which yields an identity mapping when used as parameters in most Flux layers. 
Use gain to scale the identity by a constant.\n\nOften useful in the context of transfer learning, i.e. when one wants to add more capacity to a model but start from the same mapping.\n\nIt has the following behaviour:\n\n1D: A Vector of zeros (useful for an identity bias)\n2D: An identity matrix (useful for an identity matrix multiplication)\nMore than 2D: A dense block array of center tap spatial filters (useful for an identity convolution)\n\nSome caveats: \n\nNot all layers will be an identity mapping when used with this init. Exceptions include recurrent layers and normalization layers.\nLayers must have input_size == output_size for identity mapping to be possible. When this is not the case, extra dimensions of the array are padded with zeros.\nFor convolutional layers, in addition to the above, the kernel sizes must also be odd and padding must be applied so that output feature maps have the same size as input feature maps, e.g. by using SamePad.\n\nUse keyword shift (integer or tuple) to apply circular shift to the output, equivalent to Base.circshift(identity_init(size...), shift).\n\nFor consistency with other initialisers, it accepts rng::AbstractRNG as an optional first argument. 
But this is ignored, since the result is not random.\n\nExamples\n\njulia> Flux.identity_init(3,5)\n3×5 Matrix{Float32}:\n 1.0 0.0 0.0 0.0 0.0\n 0.0 1.0 0.0 0.0 0.0\n 0.0 0.0 1.0 0.0 0.0\n\njulia> Dense(5 => 3, relu, init=Flux.identity_init)([1,-2,3,-4,5])\n3-element Vector{Float32}:\n 1.0\n 0.0\n 3.0\n\njulia> Flux.identity_init(3,3,2; gain=100)\n3×3×2 Array{Float32, 3}:\n[:, :, 1] =\n 0.0 0.0 0.0\n 100.0 0.0 0.0\n 0.0 0.0 0.0\n\n[:, :, 2] =\n 0.0 0.0 0.0\n 0.0 100.0 0.0\n 0.0 0.0 0.0\n\njulia> x4 = cat([1 2 3; 4 5 6; 7 8 9]; dims=4);\n\njulia> Conv((2,2), 1 => 1, init=Flux.identity_init(gain=10), pad=SamePad())(x4)\n3×3×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 10.0 20.0 30.0\n 40.0 50.0 60.0\n 70.0 80.0 90.0\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.ones32","page":"Weight Initialisation","title":"Flux.ones32","text":"ones32(size...) = ones(Float32, size...)\n\nReturn an Array{Float32} of the given size filled with 1s.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.zeros32","page":"Weight Initialisation","title":"Flux.zeros32","text":"zeros32(size...) = zeros(Float32, size...)\n\nReturn an Array{Float32} of the given size filled with 0s.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.rand32","page":"Weight Initialisation","title":"Flux.rand32","text":"rand32([rng], size...)\n\nReturn an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.randn32","page":"Weight Initialisation","title":"Flux.randn32","text":"randn32([rng], size...)\n\nReturn an Array{Float32} of the given size, filled like randn. 
When the size is not provided, randn32(rng::AbstractRNG) returns a function.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.create_bias","page":"Weight Initialisation","title":"Flux.create_bias","text":"create_bias(weights, bias, size...)\n\nReturn a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.\n\nbias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.\nbias == false returns false, which is understood by AD to be non-differentiable.\nbias::AbstractArray uses the given array, provided it has the correct size. It will also correct the eltype to match that of weights.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"These functions call:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.rng_from_array\nFlux.nfan","category":"page"},{"location":"reference/utilities/#Flux.rng_from_array","page":"Weight Initialisation","title":"Flux.rng_from_array","text":"rng_from_array(x)\n\nCreate an instance of the RNG most appropriate for x. As an example, if x is a CuArray, it will return a CUDA.default_rng(). 
If x is an Array instead, it will return a Random.default_rng().\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.nfan","page":"Weight Initialisation","title":"Flux.nfan","text":"nfan(n_out, n_in=1) -> Tuple\nnfan(dims...)\nnfan(dims::Tuple)\n\nFor a layer characterized by dimensions dims, return a tuple (fan_in, fan_out), where fan_in is the number of input neurons connected to an output one, and fan_out is the number of output neurons connected to an input one.\n\nThis function is mainly used by weight initializers, e.g., kaiming_normal.\n\nExamples\n\njulia> layer = Dense(10, 20);\n\njulia> Flux.nfan(size(layer.weight))\n(10, 20)\n\njulia> layer = Conv((3, 3), 2=>10);\n\njulia> Flux.nfan(size(layer.weight))\n(18, 90)\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Changing-the-type-of-all-parameters","page":"Weight Initialisation","title":"Changing the type of all parameters","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.f64\nFlux.f32\nFlux.f16","category":"page"},{"location":"reference/utilities/#Flux.f64","page":"Weight Initialisation","title":"Flux.f64","text":"f64(m)\n\nConverts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.\n\nSee also f32 and f16.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.f32","page":"Weight Initialisation","title":"Flux.f32","text":"f32(m)\n\nConverts the eltype of model's floating point parameters to Float32 (which is Flux's default). 
Recurses into structs marked with @layer.\n\nSee also f64 and f16.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.f16","page":"Weight Initialisation","title":"Flux.f16","text":"f16(m)\n\nConverts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.\n\nSupport for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.\n\nSee also f32 and f64.\n\nExample\n\njulia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10)) # all Float32\nChain(\n Dense(784 => 2048, relu), # 1_607_680 parameters\n Dense(2048 => 10), # 20_490 parameters\n) # Total: 4 arrays, 1_628_170 parameters, 6.211 MiB.\n\njulia> m |> f16 # takes half the memory\nChain(\n Dense(784 => 2048, relu), # 1_607_680 parameters\n Dense(2048 => 10), # 20_490 parameters\n) # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.\n\n\n\n\n\n","category":"function"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/outputsize/#Shape-Inference","page":"Shape Inference","title":"Shape Inference","text":"","category":"section"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Flux has some tools to help generate models in an automated fashion, by inferring the size of arrays that layers will receive, without doing any computation. This is especially useful for convolutional models, where the same Conv layer accepts any size of image, but the next layer may not. ","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The higher-level tool is a macro @autosize which acts on the code defining the layers, and replaces each appearance of _ with the relevant size. 
This simple example returns a model with Dense(845 => 10) as the last layer:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"@autosize (28, 28, 1, 32) Chain(Conv((3, 3), _ => 5, relu, stride=2), Flux.flatten, Dense(_ => 10))","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The input size may be provided at runtime, like @autosize (sz..., 1, 32) Chain(Conv(..., but all the layer constructors containing _ must be explicitly written out – the macro sees the code as written.","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"This macro relies on a lower-level function outputsize, which you can also use directly:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"c = Conv((3, 3), 1 => 5, relu, stride=2)\nFlux.outputsize(c, (28, 28, 1, 32)) # returns (13, 13, 5, 32)","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The function outputsize works by passing a \"dummy\" array into the model, which propagates through very cheaply. It should work for all layers, including custom layers, out of the box.","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"An example of how to automate model building is this:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"\"\"\"\n make_model(width, height, [inchannels, nclasses; layer_config])\n\nCreate a CNN for a given set of configuration parameters. 
Arguments:\n- `width`, `height`: the input image size in pixels\n- `inchannels`: the number of channels in the input image, default `1`\n- `nclasses`: the number of output classes, default `10`\n- Keyword `layer_config`: a vector of the number of channels per layer, default `[16, 16, 32, 64]`\n\"\"\"\nfunction make_model(width, height, inchannels = 1, nclasses = 10;\n layer_config = [16, 16, 32, 64])\n # construct a vector of layers:\n conv_layers = []\n push!(conv_layers, Conv((5, 5), inchannels => layer_config[1], relu, pad=SamePad()))\n for (inch, outch) in zip(layer_config, layer_config[2:end])\n push!(conv_layers, Conv((3, 3), inch => outch, sigmoid, stride=2))\n end\n\n # compute the output dimensions after these conv layers:\n conv_outsize = Flux.outputsize(conv_layers, (width, height, inchannels); padbatch=true)\n\n # use this to define appropriate Dense layer:\n last_layer = Dense(prod(conv_outsize) => nclasses)\n return Chain(conv_layers..., Flux.flatten, last_layer)\nend\n\nm = make_model(28, 28, 3, layer_config = [9, 17, 33, 65])\n\nFlux.outputsize(m, (28, 28, 3, 42)) == (10, 42) == size(m(randn(Float32, 28, 28, 3, 42)))","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Alternatively, using the macro, the definition of make_model could end with:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":" # compute the output dimensions & construct appropriate Dense layer:\n return @autosize (width, height, inchannels, 1) Chain(conv_layers..., Flux.flatten, Dense(_ => nclasses))\nend","category":"page"},{"location":"reference/outputsize/#Listing","page":"Shape Inference","title":"Listing","text":"","category":"section"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Flux.@autosize\nFlux.outputsize","category":"page"},{"location":"reference/outputsize/#Flux.@autosize","page":"Shape 
Inference","title":"Flux.@autosize","text":"@autosize (size...,) Chain(Layer(_ => 2), Layer(_), ...)\n\nReturns the specified model, with each _ replaced by an inferred number, for input of the given size.\n\nThe unknown sizes are usually the second-last dimension of that layer's input, which Flux regards as the channel dimension. (A few layers, Dense & LayerNorm, instead always use the first dimension.) The underscore may appear as an argument of a layer, or inside a =>. It may be used in further calculations, such as Dense(_ => _÷4).\n\nExamples\n\njulia> @autosize (3, 1) Chain(Dense(_ => 2, sigmoid), BatchNorm(_, affine=false))\nChain(\n Dense(3 => 2, σ), # 8 parameters\n BatchNorm(2, affine=false),\n) \n\njulia> img = [28, 28];\n\njulia> @autosize (img..., 1, 32) Chain( # size is only needed at runtime\n Chain(c = Conv((3,3), _ => 5; stride=2, pad=SamePad()),\n p = MeanPool((3,3)),\n b = BatchNorm(_),\n f = Flux.flatten),\n Dense(_ => _÷4, relu, init=Flux.rand32), # can calculate output size _÷4\n SkipConnection(Dense(_ => _, relu), +),\n Dense(_ => 10),\n )\nChain(\n Chain(\n c = Conv((3, 3), 1 => 5, pad=1, stride=2), # 50 parameters\n p = MeanPool((3, 3)),\n b = BatchNorm(5), # 10 parameters, plus 10\n f = Flux.flatten,\n ),\n Dense(80 => 20, relu), # 1_620 parameters\n SkipConnection(\n Dense(20 => 20, relu), # 420 parameters\n +,\n ),\n Dense(20 => 10), # 210 parameters\n) # Total: 10 trainable arrays, 2_310 parameters,\n # plus 2 non-trainable, 10 parameters, summarysize 10.469 KiB.\n\njulia> outputsize(ans, (28, 28, 1, 32))\n(10, 32)\n\nLimitations:\n\nWhile @autosize (5, 32) Flux.Bilinear(_ => 7) is OK, something like Bilinear((_, _) => 7) will fail.\nWhile Scale(_) and LayerNorm(_) are fine (and use the first dimension), Scale(_,_) and LayerNorm(_,_) will fail if size(x,1) != size(x,2).\n\n\n\n\n\n","category":"macro"},{"location":"reference/outputsize/#Flux.outputsize","page":"Shape Inference","title":"Flux.outputsize","text":"outputsize(m, x_size, 
y_size, ...; padbatch=false)\n\nFor model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given x_size = size(x), etc.\n\nExamples\n\njulia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);\n\njulia> par = Parallel(vcat, Dense(5 => 9), Dense(7 => 11));\n\njulia> Flux.outputsize(par, (5, 64), (7, 64))\n(20, 64)\n\njulia> m = Chain(par, Dense(20 => 13), softmax);\n\njulia> Flux.outputsize(m, (5,), (7,); padbatch=true)\n(13, 1)\n\njulia> par(x, y) == par((x, y)) == Chain(par, identity)((x, y))\ntrue\n\nNotice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.\n\n\n\n\n\n","category":"function"},{"location":"guide/performance/#man-performance-tips","page":"Performance Tips","title":"Performance Tips","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"All the usual Julia performance tips apply. As always, profiling your code is generally a useful way of finding bottlenecks. Below are some Flux-specific tips and reminders.","category":"page"},{"location":"guide/performance/#Don't-use-more-precision-than-you-need","page":"Performance Tips","title":"Don't use more precision than you need","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Flux works great with all kinds of number types. But often you do not need to be working with, say, Float64 (let alone BigFloat). Switching to Float32 can give you a significant speedup, not because the operations are faster, but because the memory usage is halved, which means allocations occur much faster. 
And you use less memory.","category":"page"},{"location":"guide/performance/#Preserve-inputs'-types","page":"Performance Tips","title":"Preserve inputs' types","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Not only should your activation and loss functions be type-stable, they should also preserve the type of their inputs.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"A very artificial example using an activation function like","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"my_tanh(x) = Float64(tanh(x))","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"will result in performance on Float32 input that is orders of magnitude slower than with the normal tanh, because it forces slow mixed-type multiplication in the dense layers. Similar situations can occur in the loss function during backpropagation.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"This means that if you change your data, say from Float64 to Float32 (which should give a speedup: see above), you will instead see a large slow-down.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"This can occur sneakily, because you can cause type-promotion by interacting with numeric literals. E.g. the following will run into the same problem as above:","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"leaky_tanh(x) = 0.01*x + tanh(x)","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"While one could change the activation function (e.g. 
to use 0.01f0*x), the idiomatic (and safe) way to avoid type casts whenever the input type changes is to use oftype:","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"leaky_tanh(x) = oftype(x/1, 0.01)*x + tanh(x)","category":"page"},{"location":"guide/performance/#Evaluate-batches-as-matrices-of-features","page":"Performance Tips","title":"Evaluate batches as matrices of features","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"While it can sometimes be tempting to process your observations (feature vectors) one at a time, e.g.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"function loss_total(xs::AbstractVector{<:Vector}, ys::AbstractVector{<:Vector})\n sum(zip(xs, ys)) do (x, y_target)\n y_pred = model(x) # evaluate the model\n return loss(y_pred, y_target)\n end\nend","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"it is much faster to concatenate them into a matrix, as this will hit BLAS matrix-matrix multiplication, which is much faster than the equivalent sequence of matrix-vector multiplications. The improvement is enough that it is worthwhile allocating new memory to store them contiguously.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"x_batch = reduce(hcat, xs)\ny_batch = reduce(hcat, ys)\n...\nfunction loss_total(x_batch::Matrix, y_batch::Matrix)\n y_preds = model(x_batch)\n sum(loss.(y_preds, y_batch))\nend","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"When doing this kind of concatenation, use reduce(hcat, xs) rather than hcat(xs...). 
This will avoid the splatting penalty, and will hit the optimised reduce method.","category":"page"},{"location":"guide/performance/#Be-aware-of-GPU-memory-inefficiencies","page":"Performance Tips","title":"Be aware of GPU memory inefficiencies","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.","category":"page"},{"location":"#Flux:-The-Julia-Machine-Learning-Library","page":"Welcome","title":"Flux: The Julia Machine Learning Library","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Flux is a library for machine learning. It comes \"batteries-included\" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.\nExtensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.\nPlay nicely with others. 
Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.","category":"page"},{"location":"#Installation","page":"Welcome","title":"Installation","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Download Julia 1.10 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.","category":"page"},{"location":"#Learning-Flux","page":"Welcome","title":"Learning Flux","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"The quick start page trains a simple neural network.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"The rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"There are some tutorials about building particular models. The model zoo has starting points for many other common ones. 
And finally, the ecosystem page lists packages which define Flux models.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"The reference section includes, beside Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.","category":"page"},{"location":"#Community","page":"Welcome","title":"Community","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Everyone is welcome to join our community on the Julia discourse forum, or the slack chat (channel #machine-learning). If you have questions or issues we'll try to help you out.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.","category":"page"},{"location":"tutorials/linear_regression/#man-linear-regression","page":"Linear Regression","title":"Tutorial: Linear Regression","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Flux is a pure Julia ML stack that allows you to build predictive models. Here are the steps for a typical Flux program:","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Provide training and test data\nBuild a model with configurable parameters to make predictions\nIteratively train the model by tweaking the parameters to improve predictions\nVerify your model","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. 
Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The following page contains a step-by-step walkthrough of the linear regression algorithm in Julia using Flux! We will start by creating a simple linear regression model for dummy data and then move on to a real dataset. The first part would involve writing some parts of the model on our own, which will later be replaced by Flux.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let us start by building a simple linear regression model. This model would be trained on the data points of the form (x₁, y₁), (x₂, y₂), ... , (xₙ, yₙ). In the real world, these xs can have multiple features, and the ys denote a label. 
In our example, each x has a single feature; hence, our data would have n data points, each point mapping a single feature to a single label.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Importing the required Julia packages -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> using Flux, Plots","category":"page"},{"location":"tutorials/linear_regression/#Generating-a-dataset","page":"Linear Regression","title":"Generating a dataset","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the xs of our data points and map them to the respective ys using a simple function. Remember, here each x is equivalent to a feature, and each y is the corresponding label. Combining all the xs and ys would create the complete dataset.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x = hcat(collect(Float32, -3:0.1:3)...)\n1×61 Matrix{Float32}:\n -3.0 -2.9 -2.8 -2.7 -2.6 -2.5 … 2.4 2.5 2.6 2.7 2.8 2.9 3.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The hcat call generates a Matrix with numbers ranging from -3.0 to 3.0 with a gap of 0.1 between them. Each column of this matrix holds a single x, a total of 61 xs. The next step would be to generate the corresponding labels or the ys.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> f(x) = @. 
3x + 2;\n\njulia> y = f(x)\n1×61 Matrix{Float32}:\n -7.0 -6.7 -6.4 -6.1 -5.8 -5.5 … 9.5 9.8 10.1 10.4 10.7 11.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The function f maps each x to a y, and as x is a Matrix, the expression broadcasts the scalar values using @. macro. Our data points are ready, but they are too perfect. In a real-world scenario, we will not have an f function to generate y values, but instead, the labels would be manually added.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x = x .* reshape(rand(Float32, 61), (1, 61));","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Visualizing the final data -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> plot(vec(x), vec(y), lw = 3, seriestype = :scatter, label = \"\", title = \"Generated data\", xlabel = \"x\", ylabel= \"y\");","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"(Image: linear-regression-data)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data looks random enough now! 
The x and y values are still somewhat correlated; hence, the linear regression algorithm should work fine on our dataset.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now proceed to build a model for our dataset!","category":"page"},{"location":"tutorials/linear_regression/#Building-a-model","page":"Linear Regression","title":"Building a model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"A linear regression model is defined mathematically as -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"model(W, b, x) = Wx + b","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"where W is the weight matrix and b is the bias. For our case, the weight matrix (W) consists of only a single element, as we have only a single feature. We can define our model in Julia using the exact same notation!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_model(W, b, x) = @. W*x + b\ncustom_model (generic function with 1 method)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The @. macro allows you to perform the calculations by broadcasting the scalar quantities (for example - the bias).","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The next step would be to initialize the model parameters, which are the weight and the bias. 
There are a lot of initialization techniques available for different machine learning models, but for the sake of this example, let's draw the weight from a uniform distribution and initialize the bias as 0.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W = rand(Float32, 1, 1)\n1×1 Matrix{Float32}:\n 0.99285793\n\njulia> b = [0.0f0]\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Time to test if our model works!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_model(W, b, x) |> size\n(1, 61)\n\njulia> custom_model(W, b, x)[1], y[1]\n(-1.6116865f0, -7.0f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It does! But the predictions are way off. We need to train the model to improve the predictions, but before training the model we need to define the loss function. The loss function would ideally output a quantity that we will try to minimize during the entire training process. Here we will use the mean squared error loss function.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function custom_loss(weights, biases, features, labels)\n ŷ = custom_model(weights, biases, features)\n sum((labels .- ŷ).^2) / length(features)\n end;\n\njulia> custom_loss(W, b, x, y)\n23.772217f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Calling the loss function on our xs and ys shows how far our predictions (ŷ) are from the real labels. 
More precisely, it calculates the sum of the squares of residuals and divides it by the total number of data points.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We have successfully defined our model and the loss function, but surprisingly, we haven't used Flux anywhere till now. Let's see how we can write the same code using Flux.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model = Dense(1 => 1)\nDense(1 => 1) # 2 parameters","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"A Dense(1 => 1) layer denotes a layer of one neuron with one input (one feature) and one output. This layer is exactly same as the mathematical model defined by us above! Under the hood, Flux too calculates the output using the same expression! But, we don't have to initialize the parameters ourselves this time, instead Flux does it for us.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model.weight, flux_model.bias\n(Float32[-1.2678515;;], Float32[0.0])","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Now we can check if our model is acting right. We can pass the complete data in one go, with each x having exactly one feature (one input) -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model(x) |> size\n(1, 61)\n\njulia> flux_model(x)[1], y[1]\n(-1.8525281f0, -7.0f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It is! 
The next step would be defining the loss function using Flux's functions -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function flux_loss(flux_model, features, labels)\n ŷ = flux_model(features)\n Flux.mse(ŷ, labels)\n end;\n\njulia> flux_loss(flux_model, x, y)\n22.74856f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Everything works as before! It almost feels like Flux provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the flux_model is from our custom_model. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our custom_model to match those of the flux_model -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W = Float32[1.1412252]\n1-element Vector{Float32}:\n 1.1412252","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"To check how both the models are performing on the data, let's find out the losses using the custom_loss and flux_loss functions -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_loss(W, b, x, y), flux_loss(flux_model, x, y)\n(22.74856f0, 22.74856f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The losses are identical! This means that our model and the flux_model are identical on some level, and the loss functions are completely identical! The difference in models would be that Flux's Dense layer supports many other arguments that can be used to customize the layer further. 
But, for this tutorial, let us stick to our simple custom_model.","category":"page"},{"location":"tutorials/linear_regression/#Training-the-model","page":"Linear Regression","title":"Training the model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's train our model using the classic Gradient Descent algorithm. According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"\\begin{aligned}\nW &= W - \\eta \\frac{dL}{dW} \\\\\nb &= b - \\eta \\frac{dL}{db}\n\\end{aligned}","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Here, W is the weight matrix, b is the bias vector, \\eta is the learning rate, \\frac{dL}{dW} is the derivative of the loss function with respect to the weight, and \\frac{dL}{db} is the derivative of the loss function with respect to the bias.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The derivatives are calculated using an Automatic Differentiation tool, and Flux uses Zygote.jl for the same. Since Zygote.jl is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of Zygote.jl for more information on the same.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. 
Flux re-exports Zygote's gradient function; hence, we don't need to import Zygote explicitly to use the functionality.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now update the parameters, following the gradient descent algorithm -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W .= W .- 0.1 .* dLdW\n1-element Vector{Float32}:\n 1.8144473\n\njulia> b .= b .- 0.1 .* dLdb\n1-element Vector{Float32}:\n 0.41325632","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The parameters have been updated! We can now check the value of the loss function -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_loss(W, b, x, y)\n17.157953f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss went down! This means that we successfully trained our model for one epoch. We can plug the training code written above into a loop and train the model for a higher number of epochs. It can be customized either to have a fixed number of epochs or to stop when certain conditions are met, for example, change in loss < 0.1. 
The loop can be tailored to suit the user's needs, and the conditions can be specified in plain Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's plug our training logic inside a function and test it again -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function train_custom_model!(f_loss, weights, biases, features, labels)\n dLdW, dLdb, _, _ = gradient(f_loss, weights, biases, features, labels)\n @. weights = weights - 0.1 * dLdW\n @. biases = biases - 0.1 * dLdb\n end;\n\njulia> train_custom_model!(custom_loss, W, b, x, y);\n\njulia> W, b, custom_loss(W, b, x, y)\n(Float32[2.340657], Float32[0.7516814], 13.64972f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It works, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for 40 more epochs.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> for i = 1:40\n train_custom_model!(custom_loss, W, b, x, y)\n end\n\njulia> W, b, custom_loss(W, b, x, y)\n(Float32[4.2422233], Float32[2.2460847], 7.6680417f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"There was a significant reduction in loss, and the parameters were updated!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 42 epochs, and the loss went down from 22.74856 to 7.6680417. 
Time for some visualization!","category":"page"},{"location":"tutorials/linear_regression/#Results","page":"Linear Regression","title":"Results","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, Wx + b is nothing more than a line's equation, with slope = W[1] and y-intercept = b[1] (indexing at 1 because W and b are one-element arrays).","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Plotting the line and the data points using Plots.jl -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = \"\", title = \"Simple Linear Regression\", xlabel = \"x\", ylabel= \"y\");\n\njulia> plot!((x) -> b[1] + W[1] * x, -3, 3, label=\"Custom model\", lw=2);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"(Image: linear-regression-line)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The line fits well! There is room for improvement, but we leave that up to you! You can play with the optimisers, the number of epochs, learning rate, etc. 
to improve the fitting and reduce the loss!","category":"page"},{"location":"tutorials/linear_regression/#Linear-regression-model-on-a-real-dataset","page":"Linear Regression","title":"Linear regression model on a real dataset","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We now move on to a relatively complex linear regression model. Here we will use a real dataset from MLDatasets.jl, which will not confine our data points to have only one feature. Let's start by importing the required packages -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> using Flux, Statistics, MLDatasets, DataFrames","category":"page"},{"location":"tutorials/linear_regression/#Gathering-real-data","page":"Linear Regression","title":"Gathering real data","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's start by initializing our dataset. We will be using the BostonHousing dataset consisting of 506 data points. Each of these data points has 13 features and a corresponding label, the house's price. 
The xs are still mapped to a single y, but now, a single x data point has 13 features.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> dataset = BostonHousing();\n\njulia> x, y = BostonHousing(as_df=false)[:];\n\njulia> x, y = Float32.(x), Float32.(y);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now split the obtained data into training and testing data -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end];\n\njulia> x_train |> size, x_test |> size, y_train |> size, y_test |> size\n((13, 400), (13, 106), (1, 400), (1, 106))","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"This data contains a diverse number of features, which means that the features have different scales. A wise option here would be to normalise the data, making the training process more efficient and fast. Let's check the standard deviation of the training data before normalising it.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> std(x_train)\n134.06786f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data is indeed not normalised. 
We can use the Flux.normalise function to normalise the training data.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_train_n = Flux.normalise(x_train);\n\njulia> std(x_train_n)\n1.0000844f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The standard deviation is now close to one! Our data is ready!","category":"page"},{"location":"tutorials/linear_regression/#Building-a-Flux-model","page":"Linear Regression","title":"Building a Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now directly use Flux and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and Flux will handle everything for us! Remember, we could have declared a model in plain Julia as well. The model will have 14 parameters: 13 weights and 1 bias.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> model = Dense(13 => 1)\nDense(13 => 1) # 14 parameters","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Same as before, our next step would be to define a loss function to quantify our accuracy somehow. 
The lower the loss, the better the model!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function loss(model, features, labels)\n ŷ = model(features)\n Flux.mse(ŷ, labels)\n end;\n\njulia> loss(model, x_train_n, y_train)\n676.1656f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now proceed to the training phase!","category":"page"},{"location":"tutorials/linear_regression/#Training-the-Flux-model","page":"Linear Regression","title":"Training the Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The training procedure would make use of the same mathematics, but now we can pass in the model inside the gradient call and let Flux and Zygote handle the derivatives!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function train_model!(f_loss, model, features, labels)\n dLdm, _, _ = gradient(f_loss, model, features, labels)\n @. model.weight = model.weight - 0.000001 * dLdm.weight\n @. model.bias = model.bias - 0.000001 * dLdm.bias\n end;","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when change in loss < δ. 
The quantity δ can be altered according to a user's need, but let's fix it to 10⁻⁴ for this tutorial.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can write such custom training loops effortlessly using Flux and plain Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> loss_init = Inf;\n\njulia> while true\n train_model!(loss, model, x_train_n, y_train)\n if loss_init == Inf\n loss_init = loss(model, x_train_n, y_train)\n continue\n end\n if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-4\n break\n else\n loss_init = loss(model, x_train_n, y_train)\n end\n end;","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The code starts by setting the initial value of the loss to infinity. Next, it runs an infinite loop that breaks if change in loss < 10⁻⁴; otherwise, it updates loss_init to the current loss and moves on to the next iteration.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"This custom loop works! This shows how easily a user can write down any custom training routine using Flux and Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's have a look at the loss -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> loss(model, x_train_n, y_train)\n27.1272f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss went down significantly! 
It can be minimized further by choosing an even smaller δ.","category":"page"},{"location":"tutorials/linear_regression/#Testing-the-Flux-model","page":"Linear Regression","title":"Testing the Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_test_n = Flux.normalise(x_test);\n\njulia> loss(model, x_test_n, y_test)\n66.91015f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss is not as small as the loss of the training data, but it looks good! This also shows that our model is not overfitting!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. 
We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"info: Info\nOriginally published on 21 November 2022, by Saransh Chopra.","category":"page"},{"location":"guide/saving/#Saving-and-Loading-Models","page":"Saving & Loading","title":"Saving and Loading Models","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You may wish to save models so that they can be loaded and run in a later session. Flux provides a number of ways to do this. 
The recommended way, which is the most robust one for long term storage, is to use Flux.state in combination with a serialization format like JLD2.jl or BSON.jl.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Save a model:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> struct MyModel\n net\n end\n\njulia> Flux.@layer MyModel\n\njulia> MyModel() = MyModel(Chain(Dense(10 => 5, relu), Dense(5 => 2)));\n\njulia> model = MyModel()\nMyModel(\n Chain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n ),\n) # Total: 4 arrays, 67 parameters, 484 bytes.\n\njulia> model_state = Flux.state(model);\n\njulia> using JLD2\n\njulia> jldsave(\"mymodel.jld2\"; model_state)","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Load it again in a new session using Flux.loadmodel!:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux, JLD2\n\njulia> model_state = JLD2.load(\"mymodel.jld2\", \"model_state\");\n\njulia> model = MyModel(); # MyModel definition must be available\n\njulia> Flux.loadmodel!(model, model_state);","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"note: Note\nIf a saved model's parameters are stored on the GPU, the model will not load later on if there is no GPU support available. 
It's best to move your model to the CPU with cpu(model) before saving it.","category":"page"},{"location":"guide/saving/#Checkpointing","page":"Saving & Loading","title":"Checkpointing","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"In longer training runs it's a good idea to periodically save your model, so that you can resume if training is interrupted (for example, if there's a power cut). ","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> using JLD2\n\njulia> m = Chain(Dense(10 => 5, relu), Dense(5 => 2))\nChain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 67 parameters, 476 bytes.\n\njulia> for epoch in 1:10\n # ... train model ...\n jldsave(\"model-checkpoint.jld2\", model_state = Flux.state(m))\n end;","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"This overwrites \"model-checkpoint.jld2\" at the end of every epoch.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You can get more advanced by saving a series of models throughout training, for example","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"using Dates # provides now()\njldsave(\"model-$(now()).jld2\", model_state = Flux.state(m))","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"will produce a series of models like \"model-2018-03-06T02:57:10.41.jld2\". 
You could also store the current test set loss, so that it's easy to (for example) revert to an older copy of the model if it starts to overfit.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"jldsave(\"model-$(now()).jld2\", model_state = Flux.state(m), loss = testloss())","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Note that to resume a model's training, you might need to restore other stateful parts of your training loop. Possible examples are the optimiser state and the randomness used to partition the original data into the training and validation sets.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You can store the optimiser state alongside the model, to resume training exactly where you left off: ","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"model = MyModel()\nopt_state = Flux.setup(AdamW(), model)\n\n# ... train model ...\n\nmodel_state = Flux.state(model)\njldsave(\"checkpoint_epoch=42.jld2\"; model_state, opt_state)","category":"page"},{"location":"guide/saving/#Saving-Models-as-Julia-Structs","page":"Saving & Loading","title":"Saving Models as Julia Structs","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Models are just normal Julia structs, so it's fine to use any Julia storage format to save the struct as it is instead of saving the state returned by Flux.state. 
BSON.jl is particularly convenient for this, since it can also save anonymous functions, which are sometimes part of a model definition.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Save a model:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> model = Chain(Dense(10 => 5, NNlib.relu), Dense(5 => 2));\n\njulia> using BSON: @save\n\njulia> @save \"mymodel.bson\" model","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Load it again in a new session:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux, BSON\n\njulia> BSON.@load \"mymodel.bson\" model\n\njulia> model\nChain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 67 parameters, 476 bytes.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"warning: Warning\nSaving models this way could lead to compatibility issues across Julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long-term storage; use Flux.state instead.","category":"page"}] } diff --git a/previews/PR2535/tutorials/custom_layers/index.html b/previews/PR2535/tutorials/custom_layers/index.html index 210d4a629d..3b844d5025 100644 --- a/previews/PR2535/tutorials/custom_layers/index.html +++ b/previews/PR2535/tutorials/custom_layers/index.html @@ -104,4 +104,4 @@ # rms over all the mse ŷs = model(x) return sqrt(mean(Flux.mse(y, ŷ) for (y, ŷ) in zip(ys, ŷs))) -end
Note

This Split layer is available from the Fluxperimental.jl package.

+end
Note

This Split layer is available from the Fluxperimental.jl package.

diff --git a/previews/PR2535/tutorials/linear_regression/index.html b/previews/PR2535/tutorials/linear_regression/index.html index 6cf06234bc..ac678441b9 100644 --- a/previews/PR2535/tutorials/linear_regression/index.html +++ b/previews/PR2535/tutorials/linear_regression/index.html @@ -106,4 +106,4 @@ 27.1272f0

The loss went down significantly! It can be minimized further by choosing an even smaller δ.

Testing the Flux model

The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss.

julia> x_test_n = Flux.normalise(x_test);
 
 julia> loss(model, x_test_n, y_test)
-66.91015f0

The loss is not as small as the loss of the training data, but it looks good! This also shows that our model is not overfitting!


Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical.

Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users.

After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.

Info

Originally published on 21 November 2022, by Saransh Chopra.

+66.91015f0

The loss is not as small as the loss of the training data, but it looks good! This also shows that our model is not overfitting!


Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical.

Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users.

After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.

Info

Originally published on 21 November 2022, by Saransh Chopra.

diff --git a/previews/PR2535/tutorials/logistic_regression/index.html b/previews/PR2535/tutorials/logistic_regression/index.html index 6c99bd8b76..31a490e3f3 100644 --- a/previews/PR2535/tutorials/logistic_regression/index.html +++ b/previews/PR2535/tutorials/logistic_regression/index.html @@ -131,4 +131,4 @@ flux_accuracy(x, y) = 0.98 julia> flux_loss(flux_model, x, flux_y_onehot) -0.6952386604624324

We see a very similar final loss and accuracy.


Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset, and one hot encoded the labels. Next, we defined our model, the loss function, and the accuracy, all by ourselves.

Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then parallelly compared them with the functionalities provided by Flux!

Info

Originally published on 1st April 2023, by Saransh Chopra.

+0.6952386604624324

We see a very similar final loss and accuracy.


Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset, and one hot encoded the labels. Next, we defined our model, the loss function, and the accuracy, all by ourselves.

Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then parallelly compared them with the functionalities provided by Flux!

Info

Originally published on 1st April 2023, by Saransh Chopra.

diff --git a/previews/PR2535/tutorials/model_zoo/index.html b/previews/PR2535/tutorials/model_zoo/index.html index ef177105a5..59682cf4a8 100644 --- a/previews/PR2535/tutorials/model_zoo/index.html +++ b/previews/PR2535/tutorials/model_zoo/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash}); -
+