In this tutorial, we will train neural closure models for the viscous Burgers equation in the Julia programming language. We here use Julia for ease of use, efficiency, and writing differentiable code to power scientific machine learning. This file is available as
- a commented Julia script,
- a markdown file,
- a Jupyter notebook.
Julia comes with many built-in features, including array functionality.
Additional functionality is provided in various packages, some of which are
available in the Standard Library (LinearAlgebra, Printf, Random,
SparseArrays). Others are available in the General Registry, and can be added
using the built-in package manager Pkg, e.g. `using Pkg; Pkg.add("Plots")`.
If you ran the Colab setup section, the packages should already be added.
If you cloned the NeuralClosure repository and are running this in VSCode,
there should be a file called `Project.toml`, specifying the dependencies.
This file specifies an environment, which can be activated.
If you are running this locally (not on Colab), you can install the
dependencies by uncommenting and running the following cell:
# using Pkg
# Pkg.instantiate()
Alternatively, you can add them manually:
# using Pkg
# Pkg.add([
# "ComponentArrays",
# "FFTW",
# "LinearAlgebra",
# "Lux",
# "NNlib",
# "Optimisers",
# "Plots",
# "Printf",
# "Random",
# "SparseArrays",
# "Zygote",
# ])
using ComponentArrays
using FFTW
using LinearAlgebra
using Lux
using NNlib
using Optimisers
using Plots
using Printf
using Random
using SparseArrays
using Zygote
Note that we have loaded the reverse mode AD framework Zygote. This package
provides the function `gradient`, which is able to differentiate functions
that we write, including our numerical solvers (to be defined in the next
section). Gradients allow us to perform gradient descent, and thus "train"
our neural networks.
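As a tiny illustration (just a sketch, unrelated to the solvers and networks defined later), here is the gradient of a simple sum-of-squares function:

# The gradient of x -> sum of squares is 2x.
# `gradient` returns a tuple with one entry per argument.
gradient(x -> sum(abs2, x), [1.0, 2.0, 3.0])  # ([2.0, 4.0, 6.0],)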
The deep learning framework Lux likes to toss random number generators (RNGs) around, for reproducible science. We therefore need to initialize an RNG. The seed makes sure that the same sequence of pseudo-random numbers is generated each time the session is restarted.
Random.seed!(123)
rng = Random.default_rng()
Deep learning functions usually use single precision floating point numbers
by default, as this is preferred on GPUs. Julia itself, on the other hand, is
slightly opinionated towards double precision floating point numbers, e.g.
`3.14`, `1 / 6`, and `2π` are all of type `Float64` (their single precision
counterparts would be the slightly more verbose `3.14f0`, `1f0 / 6` (or
`Float32(1 / 6)`), and `2f0π`). For simplicity, we will only use `Float64`.

The only place we will encounter `Float32` in this file is in the default
neural network weight initializers, so here is an alternative weight
initializer using double precision.
glorot_uniform_64(rng::AbstractRNG, dims...) = glorot_uniform(rng, Float64, dims...)
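As a quick check (the array size here is arbitrary), the custom initializer should produce `Float64` weights, while the default produces `Float32`:

eltype(glorot_uniform_64(rng, 3, 3))  # Float64
eltype(glorot_uniform(rng, 3, 3))     # Float32 (the Lux default)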
Consider the periodic domain $\Omega = [0, 1]$, on which we solve the viscous Burgers equation

$$
\frac{\partial u}{\partial t} = - \frac{\partial}{\partial x} \left( \frac{u^2}{2} \right) + \mu \frac{\partial^2 u}{\partial x^2},
$$

where $\mu > 0$ is the viscosity and the solution $u$ is periodic in space.

For simplicity, we will use a uniform discretization
$x = \left( \frac{n}{N} \right)_{n = 1}^N$,
with the additional "ghost" points $x_0 = 0$ overlapping with
$x_N$ and $x_{N + 1}$ overlapping with $x_1$.
The discrete solution values $u_n(t) \approx u(x_n, t)$ are collected in a vector
$u = (u_n)_{n = 1}^N$ and evolve according to the system of ODEs

$$
\frac{\mathrm{d} u}{\mathrm{d} t} = f(u)
$$

for all $t > 0$, where $f$ is a spatial discretization of the convective and
diffusive terms. Note that the periodic extension gives $u_0 = u_N$ and
$u_{N + 1} = u_1$, where the grid spacing is $\Delta x = 1 / N$.

Solving the equation
We start by defining the right hand side function for a discrete solution vector
`u` (or a matrix of such vectors). We define two spatial discretizations: a
standard central scheme `f_central`, and a shock-capturing scheme `f_shock`
that adds a nonlinear numerical viscosity 1. The macro `@.` makes sure that all
following operations are performed element-wise. Note that `circshift` here
acts along the first dimension, so `f` can be applied to multiple snapshots
at once (stored as columns in the matrix `u`). Note the semicolon `;` in the
function signature: it is used to separate positional arguments (here `u`)
from keyword arguments (here `μ`).
function f_central(u; μ)
Δx = 1 / size(u, 1)
u₊ = circshift(u, -1)
u₋ = circshift(u, 1)
@. -(u₊^2 - u₋^2) / 4Δx + μ * (u₊ - 2u + u₋) / Δx^2
end
function f_shock(u; μ)
Δx = 1 / size(u, 1)
u₊ = circshift(u, -1)
# Effective viscosity at the right face: physical viscosity plus a
# nonlinear numerical viscosity that becomes large near shocks
μ₊ = @. μ + Δx * abs(u + u₊) / 4 - Δx * (u₊ - u) / 12
# Numerical flux at the right face (convective and diffusive parts)
ϕ₊ = @. (u^2 + u * u₊ + u₊^2) / 6 - μ₊ * (u₊ - u) / Δx
# The left-face flux is the right-face flux of the left neighbour
ϕ₋ = circshift(ϕ₊, 1)
# Flux difference
@. -(ϕ₊ - ϕ₋) / Δx
end
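As a small sanity check (with arbitrary resolution and viscosity), both right hand sides can be evaluated on a matrix containing several snapshots as columns, since `circshift` with an integer shift only acts along the first dimension:

let
    nx = 64
    x = LinRange(0.0, 1.0, nx + 1)[2:end]
    u = [sin.(2π .* x) sin.(4π .* x)]  # two snapshots stored as columns
    # Both discretizations return one right hand side per column
    size(f_central(u; μ = 1.0e-3)), size(f_shock(u; μ = 1.0e-3))
end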
For time stepping, we do a simple explicit Runge-Kutta scheme (RK).

From a current state $u^0 = u(t)$, we compute the new state $u(t + \Delta t)$ through a sequence of stages: for $i = 1, \dots, s$,

$$
k_i = f(u^{(i - 1)}), \quad u^{(i)} = u^0 + \Delta t \sum_{j = 1}^{i} A_{i j} k_j,
$$

where $u^{(0)} = u^0$ and the final stage $u^{(s)}$ is taken as the new state $u(t + \Delta t)$.

A fourth order method (RK4) is given by the following coefficients:

$$
A = \begin{pmatrix}
\frac{1}{2} & 0 & 0 & 0 \\
0 & \frac{1}{2} & 0 & 0 \\
0 & 0 & 1 & 0 \\
\frac{1}{6} & \frac{2}{6} & \frac{2}{6} & \frac{1}{6}
\end{pmatrix}.
$$

Note: The Runge-Kutta coefficients are usually presented in a Butcher tableau, i.e. a shifted version of our $A$ together with two additional vectors $b$ and $c$. Since our system is autonomous and the RK scheme is explicit, we can write it in this simple form, with the quadrature weights $b$ stored in the last row of $A$.
The following function performs one RK4 time step. Note that we never modify
any vectors, only create new ones. The AD framework Zygote prefers it this
way. The splatting syntax `params...` lets us pass a variable number of
keyword arguments to the right hand side function `f` (for the above `f`
there is only one: `μ`). Similarly, `k...` splats the tuple `k`, but now
as positional arguments (without names).
function step_rk4(f, u₀, dt; params...)
A = [
1/2 0 0 0
0 1/2 0 0
0 0 1 0
1/6 2/6 2/6 1/6
]
u = u₀
k = ()
for i = 1:size(A, 1)
ki = f(u; params...)
k = (k..., ki)
u = u₀
for j = 1:i
u = u + dt * A[i, j] * k[j]
end
end
u
end
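A minimal usage sketch (resolution, viscosity, and time step chosen arbitrarily): one RK4 step on a sine wave, producing a new state of the same shape without modifying the input:

let
    nx = 64
    x = LinRange(0.0, 1.0, nx + 1)[2:end]
    u = sin.(2π .* x)
    unew = step_rk4(f_central, u, 1.0e-3; μ = 1.0e-3)
    size(unew) == size(u)  # `u` itself is left untouched
end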
Solving the ODE is done by chaining individual time steps. We here add the
option to call a callback function after each time step. Note that the path
to the final output `u` is obtained by passing the inputs `u₀` and
parameters `params` through a finite number of computational steps, each of
which should have a chain rule defined and recognized by the Zygote AD
framework. The ODE solution `u` should be differentiable (with respect to
`u₀` or `params`), as long as `f` is.
function solve_ode(f, u₀; dt, nt, callback = (u, t, i) -> nothing, ncallback = 1, params...)
t = 0.0
u = u₀
callback(u, t, 0)
for i = 1:nt
t += dt
u = step_rk4(f, u, dt; params...)
i % ncallback == 0 && callback(u, t, i)
end
u
end
For the initial conditions, we use the following random Fourier series:

$$
u_0(x) = \mathfrak{R} \sum_{k = -k_\text{max}}^{k_\text{max}} c_k \mathrm{e}^{2 \pi \mathrm{i} k x},
$$

where

- $\mathfrak{R}$ denotes the real part,
- $c_k = a_k d_k \mathrm{e}^{- 2 \pi \mathrm{i} b_k}$ are random Fourier series coefficients,
- $a_k \sim \mathcal{N}(0, 1)$ is a normally distributed random amplitude,
- $d_k = (1 + | k |)^{- 6 / 5}$ is a deterministic spectral decay profile, so that the large scale features dominate the initial flow,
- $b_k \sim \mathcal{U}(0, 1)$ is a uniform random phase shift between 0 and 1,
- $\mathrm{e}^{2 \pi \mathrm{i} k x}$ is a sinusoidal Fourier series basis function evaluated at the point $x \in \Omega$.

Note in particular that the constant coefficient $c_0$ is also random, so the initial conditions have a random non-zero mean velocity.

Since the same Fourier basis can be reused multiple times, we write a
function that creates multiple initial condition samples in one go. Each
discrete sample $u_0$ is stored as a column in the resulting matrix.
function create_initial_conditions(
nx,
nsample;
kmax = 16,
decay = k -> (1 + abs(k))^(-6 / 5),
)
# Fourier basis
basis = [exp(2π * im * k * x / nx) for x ∈ 1:nx, k ∈ -kmax:kmax]
# Fourier coefficients with random phase and amplitude
c = [randn() * exp(-2π * im * rand()) * decay(k) for k ∈ -kmax:kmax, _ ∈ 1:nsample]
# Random data samples (real-valued)
# Note the matrix product for summing over $k$
real.(basis * c)
end
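For example (with arbitrary sizes), three random initial condition samples on a grid of 64 points are stored as the columns of a 64 × 3 matrix of double precision values:

let
    u₀ = create_initial_conditions(64, 3)
    size(u₀), eltype(u₀)  # (64, 3) and Float64
end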
Let's test our method in action. The animation syntax will create one animated GIF from a collection of frames.
This is also the point where we have to provide some parameters, including:

- `nx`: Number of grid points
- `μ`: Viscosity
- `dt`: Time step increment
- `nt`: Number of time steps (final time is `nt * dt`)
- `ncallback`: Number of time steps between plot frames in the animation
let
nx = 1024
x = LinRange(0.0, 1.0, nx + 1)[2:end]
# Initial conditions (one sample vector)
u₀ = create_initial_conditions(nx, 1)
# u₀ = sin.(2π .* x)
# u₀ = sin.(4π .* x)
# u₀ = sin.(6π .* x)
# Time stepping
anim = Animation()
u = solve_ode(
f_shock,
# f_central,
u₀;
μ = 5.0e-4,
dt = 1.0e-4,
nt = 5000,
ncallback = 50,
callback = (u, t, i) -> frame(
anim,
plot(
x,
u;
xlabel = "x",
ylims = extrema(u₀),
# marker = :o,
title = @sprintf("Solution, t = %.3f", t)
),
),
)
gif(anim)
end
This is typical for the Burgers equation: The initial conditions merge to
form a shock, which is eventually dampened due to the viscosity. If we let
the simulation go on, diffusion will take over and we get a smooth solution,
which eventually decays towards a constant state.
The animation shows the velocity profile of a fluid.

- In which direction is the fluid moving?
- Rerun the above cell a couple of times to get different random initial conditions. Does the velocity always move in the same direction?
- Find a region where $u > 0$. In what direction is the curve moving?
- Find a region where $u < 0$. In what direction is the curve moving?
- Find a point where $u = 0$. Does this point move?
Choose one of the sine waves as initial conditions.
Change the resolution to `nx = 128`.
Run once with `f_shock` and once with `f_central`.
You can try with and without the `marker = :o` keyword to see
the discrete points better.

Questions:

- Is there something strange with `f_central`?
- On which side(s) of the shock is there a problem?
- Does this problem go away when you increase `nx`?
We now assume that we are only interested in the large scale structures of the
flow. To compute those, we would ideally like to use a coarser resolution
($N_\text{LES}$) than the resolution needed to resolve all the features of the
flow ($N_\text{DNS}$).

To define "large scales", we consider a discrete filter
$\Phi \in \mathbb{R}^{N_\text{LES} \times N_\text{DNS}}$, averaging multiple
DNS grid points into one LES point. The filtered DNS solution is
$\bar{u} = \Phi u$. It is governed by the equation

$$
\frac{\mathrm{d} \bar{u}}{\mathrm{d} t} = f(\bar{u}) + c(u, \bar{u}),
$$

where $f$ is adapted to the coarse resolution of $\bar{u}$ and
$c(u, \bar{u}) = \Phi f(u) - f(\Phi u)$ is the commutator error. We would like
to predict $c$ using $\bar{u}$ only, without access to the unfiltered solution $u$.

To close the equations, we approximate the unknown commutator error using a
closure model $m$ with parameters $\theta$:

$$
m(\bar{u}, \theta) \approx c(u, \bar{u}).
$$

We need to make two choices: the model architecture $m$ and the parameters
$\theta$. The resulting LES equation is

$$
\frac{\mathrm{d} \bar{v}}{\mathrm{d} t} = f(\bar{v}) + m(\bar{v}, \theta),
$$

where the LES solution $\bar{v}$ is an approximation to the filtered DNS
solution $\bar{u}$.
Since the 1D Burgers equation does not create "turbulence" in the same way as the 3D Navier-Stokes equations do, we choose to use a discretization that is "bad" on coarse grids (the LES grid), but still performs fine on fine grids (the DNS grid). This will give the neural network closure models some work to do.
dns(u; μ) = f_central(u; μ)
The following right hand side function includes the correction term, and thus constitutes the LES right hand side.
les(u; μ, m, θ) = dns(u; μ) + m(u, θ)
This generic function creates a data structure containing filtered DNS data,
commutator errors, and simulation parameters for a given filter `Φ`.
function create_data(nsample, Φ; μ, dt, nt, kmax = 10, decay = k -> (1 + abs(k))^(-6 / 5))
# Resolution
nx_les, nx_dns = size(Φ)
# Grids
x_les = LinRange(0.0, 1.0, nx_les + 1)[2:end]
x_dns = LinRange(0.0, 1.0, nx_dns + 1)[2:end]
# Output data
data = (;
# Filtered snapshots and commutator errors (including at t = 0)
u = zeros(nx_les, nsample, nt + 1),
c = zeros(nx_les, nsample, nt + 1),
# Simulation-specific parameters
dt,
μ,
)
# DNS Initial conditions
u₀ = create_initial_conditions(nx_dns, nsample; kmax, decay)
# Save filtered solution and commutator error after each DNS time step
function callback(u, t, i)
ubar = Φ * u
data.u[:, :, i+1] = ubar
data.c[:, :, i+1] = Φ * dns(u; μ) - dns(ubar; μ)
end
# Do DNS time stepping (save after each step)
u = solve_ode(dns, u₀; μ, dt, nt, callback, ncallback = 1)
# Return data
data
end
Now we set up an experiment. We need to decide on the following:

- Problem parameter: $\mu$
- LES resolution
- DNS resolution
- Discrete filter
- Number of initial conditions
- Simulation time: too short, and we won't have time to detect instabilities created by our model; too long, and most of the data will be too smooth for a closure model to be needed (due to viscosity)

In addition, we will split our data into

- Training data (for choosing $\theta$)
- Validation data (just for monitoring training and choosing when to stop)
- Testing data (for testing performance on unseen data)
# Resolution
nx_les = 64
nx_dns = 1024
# Grids
x_les = LinRange(0.0, 1.0, nx_les + 1)[2:end]
x_dns = LinRange(0.0, 1.0, nx_dns + 1)[2:end]
# Grid sizes
Δx_les = 1 / nx_les
Δx_dns = 1 / nx_dns
We will use a Gaussian filter kernel, truncated to zero outside of the interval $[-\frac{3}{2} \Delta_\Phi, \frac{3}{2} \Delta_\Phi]$, where $\Delta_\Phi$ is the filter width.
# Filter width
ΔΦ = 5 * Δx_les
# Filter kernel
gaussian(Δ, x) = sqrt(6 / π) / Δ * exp(-6x^2 / Δ^2)
top_hat(Δ, x) = (abs(x) ≤ Δ / 2) / Δ
kernel = gaussian
# Discrete filter matrix (with periodic extension and threshold for sparsity)
Φ = sum(-1:1) do z
d = @. x_les - x_dns' - z
@. kernel(ΔΦ, d) * (abs(d) ≤ 3 / 2 * ΔΦ)
end
Φ = Φ ./ sum(Φ; dims = 2) ## Normalize weights
Φ = sparse(Φ)
dropzeros!(Φ)
heatmap(Φ; yflip = true, xmirror = true, title = "Filter matrix")
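Since the rows of `Φ` are normalized to sum to one, filtering a constant field should return the same constant; this is a cheap check that the filter matrix is assembled as intended:

Φ * ones(nx_dns) ≈ ones(nx_les)  # Constants are preserved by the filter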
To illustrate the closure problem, we will run an LES simulation without a closure model. Consider the following setup:
let
μ = 5.0e-4
u = sin.(6π * x_dns)
ubar = Φ * u
vbar = ubar
dt = 1.0e-4
anim = Animation()
for it = 0:5000
# Time step: Skip first step to get initial plot
if it > 0
u = step_rk4(dns, u, dt; μ)
vbar = step_rk4(dns, vbar, dt; μ)
ubar = Φ * u
end
# Plot
if it % 50 == 0
sol = plot(;
ylims = (-1.0, 1.0),
legend = :topright,
title = @sprintf("Solution, t = %.3f", it * dt)
)
plot!(sol, x_dns, u; label = "u")
plot!(sol, x_les, ubar; label = "ū")
plot!(sol, x_les, vbar; label = "v̄")
c = plot(
x_les,
Φ * dns(u; μ) - dns(ubar; μ);
label = "c(u, ū)",
legend = :topright,
xlabel = "x",
ylims = (-10.0, 10.0),
)
fig = plot(sol, c; layout = (2, 1))
frame(anim, fig)
end
end
gif(anim)
end
We observe the following:

- The DNS solution $u$ contains shocks, but these are not visible in the filtered DNS solution $\bar{u}$.
- The LES solution $\bar{v}$ does try to resolve the shocks, but does a very bad job. The discretization $f_\text{central}$ is also creating oscillations.
- There is a clear mismatch between $\bar{u}$ (target) and $\bar{v}$ (LES prediction). This is because we did not take into account the commutator error (bottom curve).
- The commutator error is large where $u$ has a shock, and small everywhere else. The "subgrid content" for the 1D Burgers equation is the stress caused by the subgrid shocks.

The job of the closure model is to predict the bottom curve using the smooth orange top curve only.
We now create the training, validation, and testing datasets. We use a slightly different time step for testing to detect time step overfitting.
μ = 5.0e-4
data_train = create_data(10, Φ; μ, nt = 2000, dt = 1.0e-4);
data_valid = create_data(2, Φ; μ, nt = 500, dt = 1.3e-4);
data_test = create_data(3, Φ; μ, nt = 3000, dt = 1.1e-4);
The information about our problem is now fully contained in the data sets. We can now choose the closure model to solve the problem.
To choose $\theta$, we will minimize a loss function using a gradient based method. A first option is the "prior" loss function, which measures the error of the predicted commutator error itself:

$$
L^\text{prior}(\theta) = \frac{\| m(\bar{u}, \theta) - c(u, \bar{u}) \|^2}{\| c(u, \bar{u}) \|^2}.
$$

This loss function has a simple computational chain, mostly comprised of evaluating the neural network $m$ itself, but we don't need to specify any of that: Zygote figures it out just fine on its own.

We call this loss function "prior" since it only measures the error of the prediction itself, and not the effect this error has on the LES solution $\bar{v}_{\theta}$. Since instability in $\bar{v}_{\theta}$ is not directly detected in this loss function, we add a regularization term to penalize extremely large weights.
mean_squared_error(m, u, c, θ; λ) =
sum(abs2, m(u, θ) - c) / sum(abs2, c) + λ * sum(abs2, θ) / length(θ)
We will only use a random subset `nuse` of all `nsample * nt` solution
snapshots at each loss evaluation. This random sampling makes the loss
function stochastic; minimizing it using gradient descent is thus called
stochastic gradient descent.
The `Zygote.@ignore` macro just tells the AD engine Zygote not to complain
about the random index selection.
function create_randloss_commutator(m, data; nuse = 20, λ = 1.0e-8)
(; u, c) = data
u = reshape(u, size(u, 1), :)
c = reshape(c, size(c, 1), :)
nsample = size(u, 2)
@assert nsample ≥ nuse
function randloss(θ)
i = Zygote.@ignore sort(shuffle(1:nsample)[1:nuse])
uuse = Zygote.@ignore u[:, i]
cuse = Zygote.@ignore c[:, i]
mean_squared_error(m, uuse, cuse, θ; λ)
end
end
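To see the stochasticity in isolation, here is a self-contained sketch with a dummy model and random data (the names `dummy_data` and `dummy_m` are only for illustration): two evaluations with the same parameters generally give different values.

let
    dummy_data = (; u = randn(16, 4, 10), c = randn(16, 4, 10))
    dummy_m(u, θ) = θ[1] .* u  # a trivial "closure model" with one parameter
    randloss = create_randloss_commutator(dummy_m, dummy_data; nuse = 5)
    randloss([1.0]), randloss([1.0])  # two different values for the same θ
end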
Ideally, we want the LES simulation to produce the filtered DNS velocity
$\bar{u}$. A "posterior" loss function therefore measures the error in the
resulting LES trajectory itself:

$$
L^\text{post}(\theta) = \frac{1}{n_t} \sum_{i = 1}^{n_t} \frac{\| \bar{v}_{\theta}^i - \bar{u}^i \|^2}{\| \bar{u}^i \|^2},
$$

where $\bar{v}_{\theta}^i$ is the LES solution at time step $i$ obtained with
the closure parameters $\theta$.

For the posterior loss function, we provide the right hand side function
`model` (including closure), reference trajectories `u`, and model
parameters. We compute the error between the predicted and reference trajectories
at each time point.
function trajectory_loss(model, u; dt, params...)
nt = size(u, 3) - 1
loss = 0.0
v = u[:, :, 1]
for i = 1:nt
v = step_rk4(model, v, dt; params...)
ui = u[:, :, i+1]
loss += sum(abs2, v - ui) / sum(abs2, ui)
end
loss / nt
end
We also make a non-squared variant for error analysis.
function trajectory_error(model, u; dt, params...)
nt = size(u, 3) - 1
loss = 0.0
v = u[:, :, 1]
for i = 1:nt
v = step_rk4(model, v, dt; params...)
ui = u[:, :, i+1]
loss += norm(v - ui) / norm(ui)
end
loss / nt
end
To limit the length of the computational chain, we only unroll `n_unroll`
time steps at each loss evaluation. The time step from which to unroll is
chosen at random at each evaluation, as are the initial conditions (`nuse`).
The non-trainable parameters (e.g. the viscosity `μ`) are passed through the
keyword arguments `params`.
function create_randloss_trajectory(model, data; nuse = 1, n_unroll = 10, params...)
(; u, dt) = data
nsample = size(u, 2)
nt = size(u, 3)
@assert nt ≥ n_unroll
@assert nsample ≥ nuse
function randloss(θ)
isample = Zygote.@ignore sort(shuffle(1:nsample)[1:nuse])
istart = Zygote.@ignore rand(1:nt-n_unroll)
it = Zygote.@ignore istart:istart+n_unroll
uuse = Zygote.@ignore u[:, isample, it]
trajectory_loss(model, uuse; dt, params..., θ)
end
end
During training, we will monitor the error on the validation dataset with a callback. We will plot the history of the prior and posterior errors.
# Initial empty history
initial_callbackstate() = (; ihist = Int[], ehist_prior = zeros(0), ehist_post = zeros(0))
# Plot convergence
function plot_convergence(state, data)
e_post_ref = trajectory_error(dns, data.u; data.dt, data.μ)
fig = plot(; yscale = :log10, xlabel = "Iterations", title = "Relative error")
hline!(fig, [1.0]; color = 1, linestyle = :dash, label = "Prior: No model")
plot!(fig, state.ihist, state.ehist_prior; color = 1, label = "Prior: Model")
hline!(fig, [e_post_ref]; color = 2, linestyle = :dash, label = "Posterior: No model")
plot!(fig, state.ihist, state.ehist_post; color = 2, label = "Posterior: Model")
fig
end
# Create callback for given model and dataset
function create_callback(m, data; doplot = false)
(; u, c, dt, μ) = data
uu, cc = reshape(u, size(u, 1), :), reshape(c, size(c, 1), :)
function callback(i, θ, state)
(; ihist, ehist_prior, ehist_post) = state
eprior = norm(m(uu, θ) - cc) / norm(cc)
epost = trajectory_error(les, u; dt, μ, m, θ)
state = (;
ihist = vcat(ihist, i),
ehist_prior = vcat(ehist_prior, eprior),
ehist_post = vcat(ehist_post, epost),
)
doplot && display(plot_convergence(state, data))
@printf "Iteration %d,\t\tprior error: %.4g,\t\tposterior error: %.4g\n" i eprior epost
state
end
end
For training, we have to initialize an optimizer and a callback state (lots of state initialization in this session).
initial_trainstate(optimiser, θ) = (;
opt = Optimisers.setup(optimiser, θ),
θ,
callbackstate = initial_callbackstate(),
istart = 0,
)
"""
train(; loss, opt, θ, istart, niter, ncallback, callback, callbackstate)
Perform `niter` iterations of gradient descent on `loss(θ)` for the given optimizer state `opt`.
Every `ncallback` iterations, call `callback(i, θ, callbackstate)` and update the
callback state, where the iteration counter `i` starts at `istart`.
Return a named tuple `(; opt, θ, callbackstate, istart)` containing the updated
optimizer state, parameters, callback state, and iteration counter.
"""
function train(; loss, opt, θ, istart, niter, ncallback, callback, callbackstate)
for i = 1:niter
∇ = first(gradient(loss, θ))
opt, θ = Optimisers.update(opt, θ, ∇)
i % ncallback == 0 && (callbackstate = callback(istart + i, θ, callbackstate))
end
istart += niter
(; opt, θ, callbackstate, istart)
end
We are free to choose the model architecture. The following generic function
wraps a Lux `Chain` so that it can be called as `m(v, θ)`, hiding the (empty)
Lux state. Note: If the chain includes state-dependent layers such as `Dropout`
(which modify their RNGs at each evaluation), this wrapper should not be used.
function create_model(chain, rng)
# Create parameter vector and empty state
θ, state = Lux.setup(rng, chain)
# Convert nested named tuples of arrays to a ComponentArray,
# which behaves like a long vector
θ = ComponentArray(θ)
# Convenience wrapper for empty state in input and output
m(v, θ) = first(chain(v, θ, state))
# Return model and initial parameters
m, θ
end
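As a quick illustration (the layer sizes are arbitrary and unrelated to the closure models below), we can wrap a tiny chain and evaluate it like a plain function of the input and the parameter vector:

let
    tiny, θ_tiny = create_model(Chain(Dense(4 => 4, tanh), Dense(4 => 4)), rng)
    v = randn(Float32, 4, 3)   # 4 "grid points", 3 samples (default Lux weights are Float32)
    size(tiny(v, θ_tiny))      # (4, 3)
end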
A CNN is an interesting closure model for the following reasons:
- The parameters are sparse, since the kernels are reused for each output
- A convolutional layer can be seen as a discretized differential operator
- Translation invariance, a desired physical property of the commutator error we try to predict, is baked in.
However, it only uses local spatial information, whereas an ideal closure model could maybe recover some of the missing information in far-away values.
Note that we start by adding input channels, stored in a tuple of functions. The Burgers RHS has a square term, so maybe the closure model can make use of the same "structure". See Melchers 2.
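The mechanism is simple: each function in the `input_channels` tuple produces one channel. Here is a standalone sketch (with made-up sizes) mirroring the first layers of the chain defined below:

let
    input_channels = (u -> u, u -> u .^ 2)
    u = randn(8, 3)                              # 8 grid points, 3 samples
    y = reshape(u, size(u, 1), 1, size(u, 2))    # add a singleton channel dimension
    y = hcat(map(i -> i(y), input_channels)...)  # one channel per input function
    size(y)                                      # (8, 2, 3)
end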
"""
create_cnn(;
radii,
channels,
activations,
use_bias,
rng,
input_channels = (u -> u,),
)
Create CNN.
Keyword arguments:
- `radii`: Vector of kernel radii
- `channels`: Vector of layer output channel numbers
- `activations`: Vector of activation functions
- `use_bias`: Vector of indicators for using bias
- `rng`: Random number generator
- `input_channels`: Tuple of input channel constructors
Return `(cnn, θ)`, where `cnn(v, θ)` acts like a force on `v`.
"""
function create_cnn(;
radii,
channels,
activations,
use_bias,
rng,
input_channels = (u -> u,),
)
@assert channels[end] == 1 "A unique output channel is required"
# Add number of input channels
channels = [length(input_channels); channels]
# Padding length
padding = sum(radii)
# Create CNN
create_model(
Chain(
# Create singleton channel
u -> reshape(u, size(u, 1), 1, size(u, 2)),
# Create input channels
u -> hcat(map(i -> i(u), input_channels)...),
# Add padding so that output has same shape as commutator error
u -> pad_circular(u, padding; dims = 1),
# Some convolutional layers
(
Conv(
(2 * radii[i] + 1,),
channels[i] => channels[i+1],
activations[i];
use_bias = use_bias[i],
init_weight = glorot_uniform_64,
) for i ∈ eachindex(radii)
)...,
# Remove singleton output channel
u -> reshape(u, size(u, 1), size(u, 3)),
),
rng,
)
end
A common class of closure models consists of computing a turbulent viscosity $\mu_t(\bar{v}, \theta)$ (here predicted by a neural network) and adding a corresponding diffusive term $\frac{\partial}{\partial x} \left( \mu_t \frac{\partial \bar{v}}{\partial x} \right)$ to the right hand side.
create_eddy_viscosity(turb) =
function eddy_viscosity(u, θ)
Δx = 1 / size(u, 1)
u₊ = circshift(u, -1)
μ = turb(u, θ)
ϕ₊ = @. -μ * (u₊ - u) / Δx
ϕ₋ = circshift(ϕ₊, 1)
@. -(ϕ₊ - ϕ₋) / Δx
end
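To see the structure of this model, here is a sketch with a hypothetical constant-viscosity predictor (not one of the trained models below): for a constant turbulent viscosity, the closure reduces to a plain diffusion term.

let
    nx = 16
    u = randn(nx)
    # Hypothetical "network": predict the same viscosity value θ[1] everywhere
    m_const = create_eddy_viscosity((u, θ) -> fill(θ[1], size(u)))
    Δx = 1 / nx
    u₊, u₋ = circshift(u, -1), circshift(u, 1)
    reference = @. 1.0e-3 * (u₊ - 2u + u₋) / Δx^2  # standard diffusion stencil
    m_const(u, [1.0e-3]) ≈ reference
end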
A Fourier neural operator (FNO) 3 is a network composed of Fourier layers (FL).

A Fourier layer transforms its (multi-channel) input $x$ into the output

$$
z = \sigma\left( W x + \mathcal{F}_{k_\text{max}}^{-1}\left( R \, \mathcal{F}_{k_\text{max}}(x) \right) \right),
$$

where $W$ is a point-wise linear transformation in physical space (mixing the channels), $\mathcal{F}_{k_\text{max}}$ is the discrete Fourier transform truncated at a cut-off wavenumber $k_\text{max}$, $R$ is a learnable complex weight for each of the retained wavenumbers (also mixing the channels), and $\sigma$ is an activation function applied point-wise.
Note that a standard convolutional layer (CL) can also be written in spectral space, where the spatial convolution operation becomes a wavenumber element-wise product. The effective difference between the layers of an FNO and a CNN becomes the following:

- An FL does not include the bias of a CL.
- The spatial part of an FL corresponds to the central weight of the CL kernel.
- A CL is chosen to be sparse in physical space (local kernel with radius $r \ll N / 2$), and would therefore be dense in spectral space ($k_\text{max} = N / 2$).
- The spectral part of an FL is chosen to be sparse in spectral space ($k_\text{max} \ll N / 2$), and would therefore be dense in physical space (it can be written as a convolution stencil with radius $r = N / 2$).
Lux lets us define our own layer types. All functions should be "pure" (functional programming), meaning that the same inputs should produce the same outputs. In particular, this also applies to random number generation and state modification. The weights are stored in a vector outside the layer, while the layer itself contains information for weight initialization.
If you are not familiar with Julia, feel free to go quickly through the
following code cells. Just note that all variables have a type (e.g.
`kmax::Int` means that `kmax` is an integer), but most of the time we don't
have to declare types explicitly. Structures (`struct`s) can be
parametrized and specialized for the types we give them in the constructor
(e.g. `σ::A` means that a specialized version of the struct is compiled for
each activation function we give it, creating an optimized `FourierLayer` for
`σ = relu` where `A = typeof(relu)`, and a different version optimized for
`σ = tanh` where `A = typeof(tanh)`, etc.). Here our layer will have the type
`FourierLayer`, with a default and a custom constructor (two constructor
methods, the latter making use of the default).
struct FourierLayer{A,F} <: Lux.AbstractLuxLayer
kmax::Int
cin::Int
cout::Int
σ::A
init_weight::F
end
FourierLayer(kmax, ch::Pair{Int,Int}; σ = identity, init_weight = glorot_uniform_64) =
FourierLayer(kmax, first(ch), last(ch), σ, init_weight)
We also need to specify how to initialize the parameters and states. The
Fourier layer does not have any hidden states that are modified. The below
code adds methods to some existing Lux functions. These new methods are only
used when the functions encounter FourierLayer
inputs. For example, in our
current environment, we have this many methods for the function
Lux.initialparameters
(including Dense
, Conv
, etc.):
length(methods(Lux.initialparameters))
Now, when we add our own method, there should be one more in the method table.
Lux.initialparameters(rng::AbstractRNG, (; kmax, cin, cout, init_weight)::FourierLayer) = (;
spatial_weight = init_weight(rng, cout, cin),
spectral_weights = init_weight(rng, kmax + 1, cout, cin, 2),
)
Lux.initialstates(::AbstractRNG, ::FourierLayer) = (;)
Lux.parameterlength((; kmax, cin, cout)::FourierLayer) =
cout * cin + (kmax + 1) * 2 * cout * cin
Lux.statelength(::FourierLayer) = 0
# Pretty printing
function Base.show(io::IO, (; kmax, cin, cout, σ)::FourierLayer)
print(io, "FourierLayer(", kmax)
print(io, ", ", cin, " => ", cout)
print(io, "; σ = ", σ)
print(io, ")")
end
# One more method now
length(methods(Lux.initialparameters))
This is one of the advantages of Julia: As users we can extend functions from
other authors without modifying their package or being forced to "inherit"
their data structures (classes). This has created an interesting package
ecosystem. For example, ODE
solvers can be used with exotic
number types such as BigFloat
, dual numbers, or quaternions. Iterative
solvers
work out of the box with different array types, including various GPU
arrays, without actually
containing any GPU array specific code.
We now define how to pass inputs through a Fourier layer. In tensor notation, multiple samples can be processed at the same time. We therefore assume the following:
- Input size: `(nx, cin, nsample)`
- Output size: `(nx, cout, nsample)`
# This makes FourierLayers callable
function ((; kmax, cout, cin, σ)::FourierLayer)(x, params, state)
nx = size(x, 1)
# Destructure params
# The real and imaginary parts of R are stored in two separate channels
W = params.spatial_weight
W = reshape(W, 1, cout, cin)
R = params.spectral_weights
R = selectdim(R, 4, 1) .+ im .* selectdim(R, 4, 2)
# Spatial part (applied point-wise)
y = reshape(x, nx, 1, cin, :)
y = sum(W .* y; dims = 3)
y = reshape(y, nx, cout, :)
# Spectral part (applied mode-wise)
#
# Steps:
#
# - go to complex-valued spectral space
# - chop off high wavenumbers
# - multiply with weights mode-wise
# - pad with zeros to restore original shape
# - go back to real valued spatial representation
ikeep = 1:kmax+1
nkeep = kmax + 1
z = rfft(x, 1)
z = z[ikeep, :, :]
z = reshape(z, nkeep, 1, cin, :)
z = sum(R .* z; dims = 3)
z = reshape(z, nkeep, cout, :)
z = vcat(z, zeros(nx ÷ 2 + 1 - kmax - 1, size(z, 2), size(z, 3)))
z = irfft(z, nx, 1)
# Outer layer: Activation over combined spatial and spectral parts
# Note: Even though high wavenumbers are chopped off in `z` and may
# possibly not be present in the input at all, `σ` creates new high
# wavenumbers. High wavenumber functions may thus be represented using a
# sequence of Fourier layers. In this case, the `y`s are the only place
# where information contained in high input wavenumbers survive in a
# Fourier layer.
w = σ.(z .+ y)
# Fourier layer does not modify state
w, state
end
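Here is a quick shape check for a single Fourier layer (the sizes are arbitrary and only serve to illustrate the input and output conventions above):

let
    layer = FourierLayer(4, 2 => 3; σ = gelu)
    θ_fl, state_fl = Lux.setup(rng, layer)
    x = randn(16, 2, 5)            # (nx, cin, nsample)
    y, _ = layer(x, θ_fl, state_fl)
    size(y)                        # (nx, cout, nsample) = (16, 3, 5)
end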
We will chain some Fourier layers, with a final dense layer. As for the CNN, we allow for a tuple of predetermined input channels.
"""
create_fno(; channels, kmax, activations, rng, input_channels = (u -> u,))
Create FNO.
Keyword arguments:
- `channels`: Vector of output channel numbers
- `kmax`: Vector of cut-off wavenumbers
- `activations`: Vector of activation functions
- `rng`: Random number generator
- `input_channels`: Tuple of input channel constructors
Return `(fno, θ)`, where `fno(v, θ)` acts like a force on `v`.
"""
function create_fno(; channels, kmax, activations, rng, input_channels = (u -> u,))
# Add number of input channels
channels = [length(input_channels); channels]
# Model
create_model(
Chain(
# Create singleton channel
u -> reshape(u, size(u, 1), 1, size(u, 2)),
# Create input channels
u -> hcat(map(i -> i(u), input_channels)...),
# Some Fourier layers
(
FourierLayer(kmax[i], channels[i] => channels[i+1]; σ = activations[i]) for
i ∈ eachindex(kmax)
)...,
# Put channels in first dimension
u -> permutedims(u, (2, 1, 3)),
# Compress with a final dense layer
Dense(channels[end] => 2 * channels[end], gelu),
Dense(2 * channels[end] => 1; use_bias = false),
# Put channels back after spatial dimension
u -> permutedims(u, (2, 1, 3)),
# Remove singleton channel
u -> reshape(u, size(u, 1), size(u, 3)),
),
rng,
)
end
We now create a closure model. Note that the last activation is identity
, as we
don't want to restrict the output values. We can inspect the structure in the
wrapped Lux Chain
.
m_cnn, θ_cnn = create_cnn(;
radii = [2, 2, 2, 2],
channels = [8, 8, 8, 1],
activations = [leakyrelu, leakyrelu, leakyrelu, identity],
use_bias = [true, true, true, false],
input_channels = (u -> u, u -> u .^ 2),
rng,
)
m_cnn.chain
For the eddy viscosity, we use `gelu` as the final activation to keep the
predicted turbulent viscosity (approximately) non-negative, as we do not want
any anti-diffusion.
turb, θ_eddy = create_cnn(;
radii = [2, 2, 2, 2],
channels = [8, 8, 8, 1],
activations = [leakyrelu, leakyrelu, leakyrelu, gelu],
use_bias = [true, true, true, false],
input_channels = (u -> u, u -> u .^ 2),
rng,
)
θ_eddy = θ_eddy / 10
m_eddy = create_eddy_viscosity(turb)
m_eddy.turb.chain
m_fno, θ_fno = create_fno(;
channels = [5, 5, 5, 5],
kmax = [16, 16, 16, 8],
activations = [gelu, gelu, gelu, identity],
input_channels = (u -> u, u -> u .^ 2),
rng,
)
m_fno.chain
m, θ, label = m_cnn, θ_cnn, "CNN";
# m, θ, label = m_eddy, θ_eddy, "Eddy";
# m, θ, label = m_fno, θ_fno, "FNO";
Choose loss
loss_prior = create_randloss_commutator(m, data_train; nuse = 50);
loss_post =
create_randloss_trajectory(les, data_train; nuse = 3, n_unroll = 10, data_train.μ, m);
loss = loss_prior;
# loss = loss_post;
Initialize training state. Note that we have to provide an optimizer, here
Adam(η)
where η
is the learning rate 4. This optimizer exploits the
random nature of our loss function.
trainstate = initial_trainstate(Adam(1.0e-3), θ);
Model warm-up: trigger compilation and get an indication of the computational cost
loss(θ)
gradient(loss, θ);
@time loss(θ);
@time gradient(loss, θ);
Train the model. The cell below can be repeated to continue training where the
previous training session left off.
If you run this in a notebook, doplot = true
will create a lot of plots
below the cell.
trainstate = train(;
trainstate...,
loss,
niter = 1000,
ncallback = 20,
callback = create_callback(m, data_valid; doplot = false),
)
plot_convergence(trainstate.callbackstate, data_valid)
Final model weights
(; θ) = trainstate;
We will now make a comparison between our closure model, the baseline "no closure" model, and the reference testing data.
println("Relative posterior errors:")
e0 = trajectory_error(dns, data_test.u; data_test.dt, data_test.μ)
e = trajectory_error(les, data_test.u; data_test.dt, data_test.μ, m, θ)
println("m=0:\t$e0")
println("$label:\t$e")
Let's also plot the LES solutions with and without the closure model.
let
(; u, dt, μ) = data_test
u = u[:, 1, :]
nt = size(u, 2) - 1
v0 = u[:, 1]
v = u[:, 1]
anim = Animation()
for it = 0:nt
# Time step: Skip first step to get initial plot
if it > 0
v0 = step_rk4(dns, v0, dt; μ)
v = step_rk4(les, v, dt; μ, m, θ)
end
# Plot
if it % 50 == 0
fig = plot(;
ylims = extrema(u[:, 1]),
xlabel = "x",
title = @sprintf("Solution, t = %.3f", it * dt),
legend = :topright,
)
plot!(fig, x_les, u[:, it+1]; label = "Ref")
plot!(fig, x_les, v0; label = "m=0")
plot!(fig, x_les, v; label)
frame(anim, fig)
end
end
gif(anim)
end
To get confident with modeling ODE right hand sides using machine learning, the following exercises can be useful.
- Fit a closure model using the posterior loss function.
- Investigate the effect of the parameter `n_unroll`. Try for example `@time randloss(θ)` for `n_unroll = 10` and `n_unroll = 20` (execute `randloss` once first to trigger compilation).
- Discuss the statement "$L^\text{prior}$ and $L^\text{post}$ are almost the same when `n_unroll = 1`" with your neighbour. Are they exactly the same if we use forward Euler ($u^{n + 1} = u^n + \Delta t f(u^n)$) instead of RK4?
Recreate the CNN or FNO model without the additional square input channel
(prior physical knowledge).
The input should only have one singleton channel (pass the
keyword argument input_channels = (u -> u,)
to the constructor).
Do you expect this version to perform better or worse than with a square
channel?
- Observe that, if we really wanted to, we could skip the term $f(\bar{v})$ entirely, hoping that $m$ will be able to model it directly (in addition to the commutator error). The resulting model is
  $$ \frac{\mathrm{d} \bar{v}}{\mathrm{d} t} = m(\bar{v}, \theta). $$
  This is known as a Neural ODE (see Chen 5).
- Define a model that predicts the entire right hand side. This can be done by using the following little "hack":
  - Create your model: `m_entire_rhs, θ_entire_rhs = create_cnn(...)`
  - Define the effective closure model: `m_effective(u, θ) = m_entire_rhs(u, θ) - dns(u; μ)`
- Train the CNN or FNO for `m_effective`. Is the model able to represent the solution correctly?
- Make a new instance of the CNN closure, called `cnn_linear` with parameters `θ_cnn_linear`, which only has one convolutional layer. This model should still add the square input channel.
- Observe that the simple Burgers DNS RHS $f_\text{central}$ can actually be expressed in its entirety using this model, i.e.
  $$ f_\text{central}(u) = \mathop{\text{CNN}}(u, \theta). $$
  - What is the kernel radius?
  - Should there still be a nonlinear activation function?
  - What is the exact expression for the model weights and bias?
- "Improve" the discretization $f_\text{central}$:
  - Choose a kernel radius for `cnn_linear` that is larger than the one of `f_central`.
  - Train the model.
  - What does the resulting kernel stencil (`Array(θ_cnn_linear)`) look like? Does it resemble the one of `f_central`?
- Observe that if we really wanted to, we could do the same for the more complicated Burgers right hand side $f_\text{shock}$. Some extra difficulties in this version of $f$:
  - It contains an absolute value (in the numerical viscosity).
  - It contains cross terms $u_n u_{n + 1}$ (instead of element-wise squares $u_n^2$).

  If we use a CNN with two layers (instead of one) and with $\sigma(x) = \mathop{\text{relu}}(x) = \max(0, x)$ as an activation function between the two layers, then the absolute value can be written as the sum of two $\mathop{\text{relu}}$ functions with $|x| = \mathop{\text{relu}}(x) + \mathop{\text{relu}}(-x)$. Expressions on the same level without absolute value can also be written as a sum of two $\mathop{\text{relu}}$ functions with $x = \mathop{\text{relu}}(x) - \mathop{\text{relu}}(-x)$. For the cross term, we just add it in the input channel layer.

  So even a simple shock-capturing discretization like $f_\text{shock}$ can, in principle, be represented exactly by a small two-layer CNN.
Final question: What is a closure model? Are we just learning a better coarse discretization? Discuss with your neighbour.
This page was generated using Literate.jl.
Footnotes
-
Antony Jameson, Energy Estimates for Nonlinear Conservation Laws with Applications to Solutions of the Burgers Equation and One-Dimensional Viscous Flow in a Shock Tube by Central Difference Schemes, 18th AIAA Computational Fluid Dynamics Conference, 2007, https://doi.org/10.2514/6.2007-4620 ↩
-
Hugo Melchers, Daan Crommelin, Barry Koren, Vlado Menkovski, Benjamin Sanderse, Comparison of neural closure models for discretised PDEs, Computers & Mathematics with Applications, Volume 143, 2023, Pages 94-107, ISSN 0898-1221, https://doi.org/10.1016/j.camwa.2023.04.030. ↩
-
Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv:2010.08895, 2021. ↩
-
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arxiv:1412.6980, 2014. ↩
-
R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural Ordinary Differential Equations. arXiv:1806.07366, 2018. ↩