From 5370fff1048356bd4889e021ec4d86b49759fb32 Mon Sep 17 00:00:00 2001 From: Michael Abbott <32575566+mcabbott@users.noreply.github.com> Date: Tue, 3 Dec 2024 11:24:08 -0500 Subject: [PATCH 1/5] tweak quickstart --- docs/src/guide/models/quickstart.md | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/docs/src/guide/models/quickstart.md b/docs/src/guide/models/quickstart.md index a0c92e0ef3..665381122d 100644 --- a/docs/src/guide/models/quickstart.md +++ b/docs/src/guide/models/quickstart.md @@ -5,22 +5,20 @@ If you have used neural networks before, then this simple example might be helpf If you haven't, then you might prefer the [Fitting a Straight Line](overview.md) page. ```julia -# This will prompt if neccessary to install everything, including CUDA. -# For CUDA acceleration, also cuDNN.jl has to be installed in your environment. +# Install everything, including CUDA, and load packages: +using Pkg; Pkg.add(["Flux", "CUDA", "cuDNN", "ProgressMeter"]) using Flux, CUDA, Statistics, ProgressMeter +device = gpu_device() # function to move data and model to the GPU # Generate some data for the XOR problem: vectors of length 2, as columns of a matrix: noisy = rand(Float32, 2, 1000) # 2×1000 Matrix{Float32} truth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)] # 1000-element Vector{Bool} -# Use this object to move data and model to the GPU, if available -device = gpu_device() - # Define our model, a multi-layer perceptron with one hidden layer of size 3: model = Chain( Dense(2 => 3, tanh), # activation function inside layer BatchNorm(3), - Dense(3 => 2)) |> device # move model to GPU, if available + Dense(3 => 2)) |> device # move model to GPU, if one is available # The model encapsulates parameters, randomly initialised. Its initial output is: out1 = model(noisy |> device) |> cpu # 2×1000 Matrix{Float32} @@ -35,8 +33,9 @@ opt_state = Flux.setup(Flux.Adam(0.01), model) # will store optimiser momentum, # Training loop, using the whole data set 1000 times: losses = [] @showprogress for epoch in 1:1_000 - for (x, y) in loader - x, y = device((x, y)) + for xy_cpu in loader + # Unpack batch of data, and move to GPU: + x, y = xy_cpu |> device loss, grads = Flux.withgradient(model) do m # Evaluate model and loss inside gradient context: y_hat = m(x) @@ -100,8 +99,7 @@ Instead of calling [`gradient`](@ref Zygote.gradient) and [`update!`](@ref Flux. ```julia for epoch in 1:1_000 - Flux.train!(model, loader, opt_state) do m, x, y - x, y = device((x, y)) + Flux.train!(model, loader |> device, opt_state) do m, x, y y_hat = m(x) Flux.logitcrossentropy(y_hat, y) end From c76571e2046711ae350243f561e0f28a0df34a23 Mon Sep 17 00:00:00 2001 From: Michael Abbott <32575566+mcabbott@users.noreply.github.com> Date: Tue, 3 Dec 2024 11:35:56 -0500 Subject: [PATCH 2/5] avoid confusing line `model(noisy |> gpu) |> cpu` --- docs/src/guide/models/quickstart.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/src/guide/models/quickstart.md b/docs/src/guide/models/quickstart.md index 665381122d..1a8e4ca256 100644 --- a/docs/src/guide/models/quickstart.md +++ b/docs/src/guide/models/quickstart.md @@ -21,8 +21,8 @@ model = Chain( Dense(3 => 2)) |> device # move model to GPU, if one is available # The model encapsulates parameters, randomly initialised. 
Its initial output is: -out1 = model(noisy |> device) |> cpu # 2×1000 Matrix{Float32} -probs1 = softmax(out1) # normalise to get probabilities +out1 = model(noisy |> device) # 2×1000 Matrix{Float32}, or CuArray{Float32} +probs1 = softmax(out1) |> cpu # normalise to get probabilities (and move off GPU) # To train the model, we use batches of 64 samples, and one-hot encoding: target = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix @@ -48,9 +48,9 @@ end opt_state # parameters, momenta and output have all changed -out2 = model(noisy |> device) |> cpu # first row is prob. of true, second row p(false) -probs2 = softmax(out2) # normalise to get probabilities -mean((probs2[1,:] .> 0.5) .== truth) # accuracy 94% so far! +out2 = model(noisy |> device) # first row is prob. of true, second row p(false) +probs2 = softmax(out2) |> cpu # normalise to get probabilities +mean((probs2[1,:] .> 0.5) .== truth) # accuracy 94% so far! ``` ![](../../assets/quickstart/oneminute.png) From d8c3a9d9fbab1f5093d48f31fcce0e773178de57 Mon Sep 17 00:00:00 2001 From: Michael Abbott <32575566+mcabbott@users.noreply.github.com> Date: Tue, 3 Dec 2024 23:12:28 -0500 Subject: [PATCH 3/5] doc ref Flux.gradient --- docs/src/guide/models/quickstart.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/guide/models/quickstart.md b/docs/src/guide/models/quickstart.md index 1a8e4ca256..1ea8912ef4 100644 --- a/docs/src/guide/models/quickstart.md +++ b/docs/src/guide/models/quickstart.md @@ -95,7 +95,7 @@ Some things to notice in this example are: * The `do` block creates an anonymous function, as the first argument of `gradient`. Anything executed within this is differentiated. -Instead of calling [`gradient`](@ref Zygote.gradient) and [`update!`](@ref Flux.update!) separately, there is a convenience function [`train!`](@ref Flux.train!). If we didn't want anything extra (like logging the loss), we could replace the training loop with the following: +Instead of calling [`gradient`](@ref Flux.gradient) and [`update!`](@ref Flux.update!) separately, there is a convenience function [`train!`](@ref Flux.train!). If we didn't want anything extra (like logging the loss), we could replace the training loop with the following: ```julia for epoch in 1:1_000 From 3976afd619c94591cd0f76d365e4007e021361cb Mon Sep 17 00:00:00 2001 From: Michael Abbott <32575566+mcabbott@users.noreply.github.com> Date: Tue, 3 Dec 2024 23:13:10 -0500 Subject: [PATCH 4/5] moving data to GPU --- docs/src/guide/models/quickstart.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/src/guide/models/quickstart.md b/docs/src/guide/models/quickstart.md index 1ea8912ef4..790b6aafe7 100644 --- a/docs/src/guide/models/quickstart.md +++ b/docs/src/guide/models/quickstart.md @@ -106,5 +106,7 @@ for epoch in 1:1_000 end ``` -* In our simple example, we conveniently created the model has a [`Chain`](@ref Flux.Chain) of layers. +* Notice that the full dataset `noisy` lives on the CPU, and is moved to the GPU one batch at a time, by `xy_cpu |> device`. This is generally what you want for large datasets. Calling `loader |> device` similarly modifies the `DataLoader` to move one batch at a time. + +* In our simple example, we conveniently created the model has a [`Chain`](@ref Flux.Chain) of layers. For more complex models, you can define a custom struct `MyModel` containing layers and arrays and implement the call operator `(::MyModel)(x) = ...` to define the forward pass. This is all it is needed for Flux to work. 
Marking the struct with [`Flux.@layer`](@ref) will add some more functionality, like pretty printing and the ability to mark some internal fields as trainable or not (also see [`trainable`](@ref Optimisers.trainable)). From 234295eea559b84071a89a2c7841fc619fbeb1e4 Mon Sep 17 00:00:00 2001 From: Michael Abbott <32575566+mcabbott@users.noreply.github.com> Date: Wed, 4 Dec 2024 22:20:20 -0500 Subject: [PATCH 5/5] Update docs/src/guide/models/quickstart.md --- docs/src/guide/models/quickstart.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/src/guide/models/quickstart.md b/docs/src/guide/models/quickstart.md index 790b6aafe7..42e04b0f52 100644 --- a/docs/src/guide/models/quickstart.md +++ b/docs/src/guide/models/quickstart.md @@ -7,7 +7,8 @@ If you haven't, then you might prefer the [Fitting a Straight Line](overview.md) ```julia # Install everything, including CUDA, and load packages: using Pkg; Pkg.add(["Flux", "CUDA", "cuDNN", "ProgressMeter"]) -using Flux, CUDA, Statistics, ProgressMeter +using Flux, Statistics, ProgressMeter +using CUDA # optional device = gpu_device() # function to move data and model to the GPU # Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:
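
Taken together, the five patches above leave the quickstart's training code in roughly the shape below. This is a condensed sketch for reference rather than the full file: the `Flux.update!` and `push!(losses, loss)` steps are not part of the hunks shown and are filled in from standard Flux usage, and `using CUDA` is left commented out since PATCH 5/5 makes it optional.

```julia
using Flux, Statistics, ProgressMeter
# using CUDA            # optional, as in PATCH 5/5; needs CUDA.jl (plus cuDNN.jl installed) for an NVIDIA GPU

device = gpu_device()   # moves data and the model to the GPU, or is a no-op without one

# XOR data, as in the quickstart:
noisy = rand(Float32, 2, 1000)                                   # 2×1000 Matrix{Float32}
truth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)]  # 1000-element Vector{Bool}

model = Chain(Dense(2 => 3, tanh), BatchNorm(3), Dense(3 => 2)) |> device

target = Flux.onehotbatch(truth, [true, false])                  # 2×1000 OneHotMatrix
loader = Flux.DataLoader((noisy, target), batchsize=64, shuffle=true)
opt_state = Flux.setup(Flux.Adam(0.01), model)

losses = []
@showprogress for epoch in 1:1_000
    for xy_cpu in loader
        x, y = xy_cpu |> device   # move one batch at a time, as in PATCH 1/5
        loss, grads = Flux.withgradient(model) do m
            y_hat = m(x)
            Flux.logitcrossentropy(y_hat, y)
        end
        Flux.update!(opt_state, model, grads[1])   # not shown in the hunks above
        push!(losses, loss)                        # not shown in the hunks above
    end
end

probs = softmax(model(noisy |> device)) |> cpu   # first row is the probability of `true`
mean((probs[1,:] .> 0.5) .== truth)              # accuracy, about 94% in the quickstart
```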
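
With the same `loader` and `opt_state`, the shorter `Flux.train!` form edited by the last hunk of PATCH 1/5 reads as follows once the patch is applied. Passing `loader |> device` wraps the `DataLoader` so that each batch is moved to the GPU as it is consumed, the behaviour described by the new bullet in PATCH 4/5.

```julia
for epoch in 1:1_000
    Flux.train!(model, loader |> device, opt_state) do m, x, y
        y_hat = m(x)
        Flux.logitcrossentropy(y_hat, y)
    end
end
```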
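
PATCH 4/5 also extends the bullet about replacing the `Chain` with a custom struct whose call operator defines the forward pass, marked with `Flux.@layer`. A minimal sketch of that pattern, in which the `MyModel` name and its single `layers` field are purely illustrative, could look like this:

```julia
using Flux

struct MyModel                  # hypothetical struct, named as in the bullet's example
    layers::Chain
end

Flux.@layer MyModel             # opt in to pretty printing, trainable-field handling, etc.

(m::MyModel)(x) = m.layers(x)   # the call operator defines the forward pass

model2 = MyModel(Chain(Dense(2 => 3, tanh), Dense(3 => 2)))
model2(rand(Float32, 2, 5))                       # 2×5 output, like any built-in layer
opt_state2 = Flux.setup(Flux.Adam(0.01), model2)  # parameters inside `layers` are found automatically
```

As the bullet says, the struct works with Flux as soon as the call operator is defined; `Flux.@layer` then adds the nicer printing and the option to mark particular fields as trainable or not via `trainable`.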