diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 000000000..7776482a9 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,9 @@ +################ +# Line endings # +################ +* text=auto + +################### +# GitHub Linguist # +################### +*.qmd linguist-language=Markdown diff --git a/404.qmd b/404.qmd index 7514919de..6c761d3f1 100755 --- a/404.qmd +++ b/404.qmd @@ -1,7 +1,7 @@ ---- -title: Page Not Found ---- - -The page you requested cannot be found (perhaps it was moved or renamed). - -You may want to visit [Get-Started Guide]({{< meta get-started >}}) or [Tutorials]({{< meta tutorials-intro >}}). +--- +title: Page Not Found +--- + +The page you requested cannot be found (perhaps it was moved or renamed). + +You may want to visit [Get-Started Guide]({{< meta get-started >}}) or [Tutorials]({{< meta tutorials-intro >}}). diff --git a/assets/images/turing-logo-wide.svg b/assets/images/turing-logo-wide.svg index 88907a1bd..af3eb6538 100755 --- a/assets/images/turing-logo-wide.svg +++ b/assets/images/turing-logo-wide.svg @@ -1,42 +1,42 @@ - - - - - - - - - - - - - - - - - image/svg+xml - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/assets/images/turing-logo.svg b/assets/images/turing-logo.svg index 252e14004..15e6979d4 100755 --- a/assets/images/turing-logo.svg +++ b/assets/images/turing-logo.svg @@ -1,28 +1,28 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/developers/compiler/design-overview/index.qmd b/developers/compiler/design-overview/index.qmd index 3645422e7..e389c44d0 100755 --- a/developers/compiler/design-overview/index.qmd +++ b/developers/compiler/design-overview/index.qmd @@ -1,306 +1,306 @@ ---- -title: Turing Compiler Design (Outdated) -engine: julia -aliases: - - 
../../../tutorials/docs-05-for-developers-compiler/index.html ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -In this section, the current design of Turing's model "compiler" is described which enables Turing to perform various types of Bayesian inference without changing the model definition. The "compiler" is essentially just a macro that rewrites the user's model definition to a function that generates a `Model` struct that Julia's dispatch can operate on and that Julia's compiler can successfully do type inference on for efficient machine code generation. - -# Overview - -The following terminology will be used in this section: - - - `D`: observed data variables conditioned upon in the posterior, - - `P`: parameter variables distributed according to the prior distributions, these will also be referred to as random variables, - - `Model`: a fully defined probabilistic model with input data - -`Turing`'s `@model` macro rewrites the user-provided function definition such that it can be used to instantiate a `Model` by passing in the observed data `D`. - -The following are the main jobs of the `@model` macro: - - 1. Parse `~` and `.~` lines, e.g. `y .~ Normal.(c*x, 1.0)` - 2. Figure out if a variable belongs to the data `D` and or to the parameters `P` - 3. Enable the handling of missing data variables in `D` when defining a `Model` and treating them as parameter variables in `P` instead - 4. Enable the tracking of random variables using the data structures `VarName` and `VarInfo` - 5. Change `~`/`.~` lines with a variable in `P` on the LHS to a call to `tilde_assume` or `dot_tilde_assume` - 6. Change `~`/`.~` lines with a variable in `D` on the LHS to a call to `tilde_observe` or `dot_tilde_observe` - 7. 
Enable type stable automatic differentiation of the model using type parameters - -## The model - -A `model::Model` is a callable struct that one can sample from by calling - -```{julia} -#| eval: false -(model::Model)([rng, varinfo, sampler, context]) -``` - -where `rng` is a random number generator (default: `Random.default_rng()`), `varinfo` is a data structure that stores information -about the random variables (default: `DynamicPPL.VarInfo()`), `sampler` is a sampling algorithm (default: `DynamicPPL.SampleFromPrior()`), -and `context` is a sampling context that can, e.g., modify how the log probability is accumulated (default: `DynamicPPL.DefaultContext()`). - -Sampling resets the log joint probability of `varinfo` and increases the evaluation counter of `sampler`. If `context` is a `LikelihoodContext`, -only the log likelihood of `D` will be accumulated, whereas with `PriorContext` only the log prior probability of `P` is. With the `DefaultContext` the log joint probability of both `P` and `D` is accumulated. - -The `Model` struct contains the four internal fields `f`, `args`, `defaults`, and `context`. -When `model::Model` is called, then the internal function `model.f` is called as `model.f(rng, varinfo, sampler, context, model.args...)` -(for multithreaded sampling, instead of `varinfo` a threadsafe wrapper is passed to `model.f`). -The positional and keyword arguments that were passed to the user-defined model function when the model was created are saved as a `NamedTuple` -in `model.args`. The default values of the positional and keyword arguments of the user-defined model functions, if any, are saved as a `NamedTuple` -in `model.defaults`. They are used for constructing model instances with different arguments by the `logprob` and `prob` string macros. -The `context` variable sets an evaluation context that can be used to control for instance whether log probabilities should be evaluated for the prior, likelihood, or joint probability. 
By default it is set to evaluate the log joint. - -# Example - -Let's take the following model as an example: - -```{julia} -#| eval: false -@model function gauss( - x=missing, y=1.0, ::Type{TV}=Vector{Float64} -) where {TV<:AbstractVector} - if x === missing - x = TV(undef, 3) - end - p = TV(undef, 2) - p[1] ~ InverseGamma(2, 3) - p[2] ~ Normal(0, 1.0) - @. x[1:2] ~ Normal(p[2], sqrt(p[1])) - x[3] ~ Normal() - return y ~ Normal(p[2], sqrt(p[1])) -end -``` - -The above call of the `@model` macro defines the function `gauss` with positional arguments `x`, `y`, and `::Type{TV}`, rewritten in -such a way that every call of it returns a `model::Model`. Note that only the function body is modified by the `@model` macro, and the -function signature is left untouched. It is also possible to implement models with keyword arguments such as - -```{julia} -#| eval: false -@model function gauss( - ::Type{TV}=Vector{Float64}; x=missing, y=1.0 -) where {TV<:AbstractVector} - return ... -end -``` - -This would allow us to generate a model by calling `gauss(; x = rand(3))`. - -If an argument has a default value `missing`, it is treated as a random variable. For variables which require an initialization because we -need to loop or broadcast over its elements, such as `x` above, the following needs to be done: - -```{julia} -#| eval: false -if x === missing - x = ... -end -``` - -Note that since `gauss` behaves like a regular function it is possible to define additional dispatches in a second step as well. For -instance, we could achieve the same behaviour by - -```{julia} -#| eval: false -@model function gauss(x, y=1.0, ::Type{TV}=Vector{Float64}) where {TV<:AbstractVector} - p = TV(undef, 2) - return ... 
-end - -function gauss(::Missing, y=1.0, ::Type{TV}=Vector{Float64}) where {TV<:AbstractVector} - return gauss(TV(undef, 3), y, TV) -end -``` - -If `x` is sampled as a whole from a distribution and not indexed, e.g., `x ~ Normal(...)` or `x ~ MvNormal(...)`, -there is no need to initialize it in an `if`-block. - -## Step 1: Break up the model definition - -First, the `@model` macro breaks up the user-provided function definition using `DynamicPPL.build_model_info`. This function -returns a dictionary consisting of: - - - `allargs_exprs`: The expressions of the positional and keyword arguments, without default values. - - `allargs_syms`: The names of the positional and keyword arguments, e.g., `[:x, :y, :TV]` above. - - `allargs_namedtuple`: An expression that constructs a `NamedTuple` of the positional and keyword arguments, e.g., `:((x = x, y = y, TV = TV))` above. - - `defaults_namedtuple`: An expression that constructs a `NamedTuple` of the default positional and keyword arguments, if any, e.g., `:((x = missing, y = 1, TV = Vector{Float64}))` above. - - `modeldef`: A dictionary with the name, arguments, and function body of the model definition, as returned by `MacroTools.splitdef`. - -## Step 2: Generate the body of the internal model function - -In a second step, `DynamicPPL.generate_mainbody` generates the main part of the transformed function body using the user-provided function body -and the provided function arguments, without default values, for figuring out if a variable denotes an observation or a random variable. -Hereby the function `DynamicPPL.generate_tilde` replaces the `L ~ R` lines in the model and the function `DynamicPPL.generate_dot_tilde` replaces -the `@. L ~ R` and `L .~ R` lines in the model. 
- -In the above example, `p[1] ~ InverseGamma(2, 3)` is replaced with something similar to - -```{julia} -#| eval: false -#= REPL[25]:6 =# -begin - var"##tmpright#323" = InverseGamma(2, 3) - var"##tmpright#323" isa Union{Distribution,AbstractVector{<:Distribution}} || throw( - ArgumentError( - "Right-hand side of a ~ must be subtype of Distribution or a vector of Distributions.", - ), - ) - var"##vn#325" = (DynamicPPL.VarName)(:p, ((1,),)) - var"##inds#326" = ((1,),) - p[1] = (DynamicPPL.tilde_assume)( - _rng, - _context, - _sampler, - var"##tmpright#323", - var"##vn#325", - var"##inds#326", - _varinfo, - ) -end -``` - -Here the first line is a so-called line number node that enables more helpful error messages by providing users with the exact location -of the error in their model definition. Then the right hand side (RHS) of the `~` is assigned to a variable (with an automatically generated name). -We check that the RHS is a distribution or an array of distributions, otherwise an error is thrown. -Next we extract a compact representation of the variable with its name and index (or indices). Finally, the `~` expression is replaced with -a call to `DynamicPPL.tilde_assume` since the compiler figured out that `p[1]` is a random variable using the following -heuristic: - - 1. If the symbol on the LHS of `~`, `:p` in this case, is not among the arguments to the model, `(:x, :y, :T)` in this case, it is a random variable. - 2. If the symbol on the LHS of `~`, `:p` in this case, is among the arguments to the model but has a value of `missing`, it is a random variable. - 3. If the value of the LHS of `~`, `p[1]` in this case, is `missing`, then it is a random variable. - 4. Otherwise, it is treated as an observation. - -The `DynamicPPL.tilde_assume` function takes care of sampling the random variable, if needed, and updating its value and the accumulated log joint -probability in the `_varinfo` object. 
If `L ~ R` is an observation, `DynamicPPL.tilde_observe` is called with the same arguments except the -random number generator `_rng` (since observations are never sampled). - -A similar transformation is performed for expressions of the form `@. L ~ R` and `L .~ R`. For instance, -`@. x[1:2] ~ Normal(p[2], sqrt(p[1]))` is replaced with - -```{julia} -#| eval: false -#= REPL[25]:8 =# -begin - var"##tmpright#331" = Normal.(p[2], sqrt.(p[1])) - var"##tmpright#331" isa Union{Distribution,AbstractVector{<:Distribution}} || throw( - ArgumentError( - "Right-hand side of a ~ must be subtype of Distribution or a vector of Distributions.", - ), - ) - var"##vn#333" = (DynamicPPL.VarName)(:x, ((1:2,),)) - var"##inds#334" = ((1:2,),) - var"##isassumption#335" = begin - let var"##vn#336" = (DynamicPPL.VarName)(:x, ((1:2,),)) - if !((DynamicPPL.inargnames)(var"##vn#336", _model)) || - (DynamicPPL.inmissings)(var"##vn#336", _model) - true - else - x[1:2] === missing - end - end - end - if var"##isassumption#335" - x[1:2] .= (DynamicPPL.dot_tilde_assume)( - _rng, - _context, - _sampler, - var"##tmpright#331", - x[1:2], - var"##vn#333", - var"##inds#334", - _varinfo, - ) - else - (DynamicPPL.dot_tilde_observe)( - _context, - _sampler, - var"##tmpright#331", - x[1:2], - var"##vn#333", - var"##inds#334", - _varinfo, - ) - end -end -``` - -The main difference in the expanded code between `L ~ R` and `@. L ~ R` is that the former doesn't assume `L` to be defined, it can be a new Julia variable in the scope, while the latter assumes `L` already exists. Moreover, `DynamicPPL.dot_tilde_assume` and `DynamicPPL.dot_tilde_observe` are called -instead of `DynamicPPL.tilde_assume` and `DynamicPPL.tilde_observe`. - -## Step 3: Replace the user-provided function body - -Finally, we replace the user-provided function body using `DynamicPPL.build_output`. This function uses `MacroTools.combinedef` to reassemble -the user-provided function with a new function body. 
In the modified function body an anonymous function is created whose function body -was generated in step 2 above and whose arguments are - - - a random number generator `_rng`, - - a model `_model`, - - a datastructure `_varinfo`, - - a sampler `_sampler`, - - a sampling context `_context`, - - and all positional and keyword arguments of the user-provided model function as positional arguments - without any default values. Finally, in the new function body a `model::Model` with this anonymous function as internal function is returned. - -# `VarName` - -In order to track random variables in the sampling process, `Turing` uses the `VarName` struct which acts as a random variable identifier generated at runtime. The `VarName` of a random variable is generated from the expression on the LHS of a `~` statement when the symbol on the LHS is in the set `P` of unobserved random variables. Every `VarName` instance has a type parameter `sym` which is the symbol of the Julia variable in the model that the random variable belongs to. For example, `x[1] ~ Normal()` will generate an instance of `VarName{:x}` assuming `x` is an unobserved random variable. Every `VarName` also has a field `indexing`, which stores the indices required to access the random variable from the Julia variable indicated by `sym` as a tuple of tuples. Each element of the tuple thereby contains the indices of one indexing operation (`VarName` also supports hierarchical arrays and range indexing). Some examples: - - - `x ~ Normal()` will generate a `VarName(:x, ())`. - - `x[1] ~ Normal()` will generate a `VarName(:x, ((1,),))`. - - `x[:,1] ~ MvNormal(zeros(2), I)` will generate a `VarName(:x, ((Colon(), 1),))`. - - `x[:,1][1+1] ~ Normal()` will generate a `VarName(:x, ((Colon(), 1), (2,)))`. 
- -The easiest way to manually construct a `VarName` is to use the `@varname` macro on an indexing expression, which will take the `sym` value from the actual variable name, and put the index values appropriately into the constructor. - -# `VarInfo` - -## Overview - -`VarInfo` is the data structure in `Turing` that facilitates tracking random variables and certain metadata about them that are required for sampling. For instance, the distribution of every random variable is stored in `VarInfo` because we need to know the support of every random variable when sampling using HMC for example. Random variables whose distributions have a constrained support are transformed using a bijector from [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl) so that the sampling happens in the unconstrained space. Different samplers require different metadata about the random variables. - -The definition of `VarInfo` in `Turing` is: - -```{julia} -#| eval: false -struct VarInfo{Tmeta, Tlogp} <: AbstractVarInfo - metadata::Tmeta - logp::Base.RefValue{Tlogp} - num_produce::Base.RefValue{Int} -end -``` - -Based on the type of `metadata`, the `VarInfo` is either aliased `UntypedVarInfo` or `TypedVarInfo`. `metadata` can be either a subtype of the union type `Metadata` or a `NamedTuple` of multiple such subtypes. Let `vi` be an instance of `VarInfo`. If `vi isa VarInfo{<:Metadata}`, then it is called an `UntypedVarInfo`. If `vi isa VarInfo{<:NamedTuple}`, then `vi.metadata` would be a `NamedTuple` mapping each symbol in `P` to an instance of `Metadata`. `vi` would then be called a `TypedVarInfo`. The other fields of `VarInfo` include `logp` which is used to accumulate the log probability or log probability density of the variables in `P` and `D`. `num_produce` keeps track of how many observations have been made in the model so far. This is incremented when running a `~` statement when the symbol on the LHS is in `D`. 
- -## `Metadata` - -The `Metadata` struct stores some metadata about the random variables sampled. This helps -query certain information about a variable such as: its distribution, which samplers -sample this variable, its value and whether this value is transformed to real space or -not. Let `md` be an instance of `Metadata`: - - - `md.vns` is the vector of all `VarName` instances. Let `vn` be an arbitrary element of `md.vns` - - `md.idcs` is the dictionary that maps each `VarName` instance to its index in - `md.vns`, `md.ranges`, `md.dists`, `md.orders` and `md.flags`. - - `md.vns[md.idcs[vn]] == vn`. - - `md.dists[md.idcs[vn]]` is the distribution of `vn`. - - `md.gids[md.idcs[vn]]` is the set of algorithms used to sample `vn`. This is used in - the Gibbs sampling process. - - `md.orders[md.idcs[vn]]` is the number of `observe` statements before `vn` is sampled. - - `md.ranges[md.idcs[vn]]` is the index range of `vn` in `md.vals`. - - `md.vals[md.ranges[md.idcs[vn]]]` is the linearized vector of values of corresponding to `vn`. - - `md.flags` is a dictionary of true/false flags. `md.flags[flag][md.idcs[vn]]` is the - value of `flag` corresponding to `vn`. - -Note that in order to make `md::Metadata` type stable, all the `md.vns` must have the same symbol and distribution type. However, one can have a single Julia variable, e.g. `x`, that is a matrix or a hierarchical array sampled in partitions, e.g. `x[1][:] ~ MvNormal(zeros(2), I); x[2][:] ~ MvNormal(ones(2), I)`. The symbol `x` can still be managed by a single `md::Metadata` without hurting the type stability since all the distributions on the RHS of `~` are of the same type. - -However, in `Turing` models one cannot have this restriction, so we must use a type unstable `Metadata` if we want to use one `Metadata` instance for the whole model. This is what `UntypedVarInfo` does. A type unstable `Metadata` will still work but will have inferior performance. 
- -To strike a balance between flexibility and performance when constructing the `spl::Sampler` instance, the model is first run by sampling the parameters in `P` from their priors using an `UntypedVarInfo`, i.e. a type unstable `Metadata` is used for all the variables. Then once all the symbols and distribution types have been identified, a `vi::TypedVarInfo` is constructed where `vi.metadata` is a `NamedTuple` mapping each symbol in `P` to a specialized instance of `Metadata`. So as long as each symbol in `P` is sampled from only one type of distributions, `vi::TypedVarInfo` will have fully concretely typed fields which brings out the peak performance of Julia. +--- +title: Turing Compiler Design (Outdated) +engine: julia +aliases: + - ../../../tutorials/docs-05-for-developers-compiler/index.html +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +In this section, the current design of Turing's model "compiler" is described which enables Turing to perform various types of Bayesian inference without changing the model definition. The "compiler" is essentially just a macro that rewrites the user's model definition to a function that generates a `Model` struct that Julia's dispatch can operate on and that Julia's compiler can successfully do type inference on for efficient machine code generation. + +# Overview + +The following terminology will be used in this section: + + - `D`: observed data variables conditioned upon in the posterior, + - `P`: parameter variables distributed according to the prior distributions, these will also be referred to as random variables, + - `Model`: a fully defined probabilistic model with input data + +`Turing`'s `@model` macro rewrites the user-provided function definition such that it can be used to instantiate a `Model` by passing in the observed data `D`. + +The following are the main jobs of the `@model` macro: + + 1. Parse `~` and `.~` lines, e.g. `y .~ Normal.(c*x, 1.0)` + 2. 
Figure out if a variable belongs to the data `D` and or to the parameters `P` + 3. Enable the handling of missing data variables in `D` when defining a `Model` and treating them as parameter variables in `P` instead + 4. Enable the tracking of random variables using the data structures `VarName` and `VarInfo` + 5. Change `~`/`.~` lines with a variable in `P` on the LHS to a call to `tilde_assume` or `dot_tilde_assume` + 6. Change `~`/`.~` lines with a variable in `D` on the LHS to a call to `tilde_observe` or `dot_tilde_observe` + 7. Enable type stable automatic differentiation of the model using type parameters + +## The model + +A `model::Model` is a callable struct that one can sample from by calling + +```{julia} +#| eval: false +(model::Model)([rng, varinfo, sampler, context]) +``` + +where `rng` is a random number generator (default: `Random.default_rng()`), `varinfo` is a data structure that stores information +about the random variables (default: `DynamicPPL.VarInfo()`), `sampler` is a sampling algorithm (default: `DynamicPPL.SampleFromPrior()`), +and `context` is a sampling context that can, e.g., modify how the log probability is accumulated (default: `DynamicPPL.DefaultContext()`). + +Sampling resets the log joint probability of `varinfo` and increases the evaluation counter of `sampler`. If `context` is a `LikelihoodContext`, +only the log likelihood of `D` will be accumulated, whereas with `PriorContext` only the log prior probability of `P` is. With the `DefaultContext` the log joint probability of both `P` and `D` is accumulated. + +The `Model` struct contains the four internal fields `f`, `args`, `defaults`, and `context`. +When `model::Model` is called, then the internal function `model.f` is called as `model.f(rng, varinfo, sampler, context, model.args...)` +(for multithreaded sampling, instead of `varinfo` a threadsafe wrapper is passed to `model.f`). 
+The positional and keyword arguments that were passed to the user-defined model function when the model was created are saved as a `NamedTuple` +in `model.args`. The default values of the positional and keyword arguments of the user-defined model functions, if any, are saved as a `NamedTuple` +in `model.defaults`. They are used for constructing model instances with different arguments by the `logprob` and `prob` string macros. +The `context` variable sets an evaluation context that can be used to control for instance whether log probabilities should be evaluated for the prior, likelihood, or joint probability. By default it is set to evaluate the log joint. + +# Example + +Let's take the following model as an example: + +```{julia} +#| eval: false +@model function gauss( + x=missing, y=1.0, ::Type{TV}=Vector{Float64} +) where {TV<:AbstractVector} + if x === missing + x = TV(undef, 3) + end + p = TV(undef, 2) + p[1] ~ InverseGamma(2, 3) + p[2] ~ Normal(0, 1.0) + @. x[1:2] ~ Normal(p[2], sqrt(p[1])) + x[3] ~ Normal() + return y ~ Normal(p[2], sqrt(p[1])) +end +``` + +The above call of the `@model` macro defines the function `gauss` with positional arguments `x`, `y`, and `::Type{TV}`, rewritten in +such a way that every call of it returns a `model::Model`. Note that only the function body is modified by the `@model` macro, and the +function signature is left untouched. It is also possible to implement models with keyword arguments such as + +```{julia} +#| eval: false +@model function gauss( + ::Type{TV}=Vector{Float64}; x=missing, y=1.0 +) where {TV<:AbstractVector} + return ... +end +``` + +This would allow us to generate a model by calling `gauss(; x = rand(3))`. + +If an argument has a default value `missing`, it is treated as a random variable. For variables which require an initialization because we +need to loop or broadcast over its elements, such as `x` above, the following needs to be done: + +```{julia} +#| eval: false +if x === missing + x = ... 
+end +``` + +Note that since `gauss` behaves like a regular function it is possible to define additional dispatches in a second step as well. For +instance, we could achieve the same behaviour by + +```{julia} +#| eval: false +@model function gauss(x, y=1.0, ::Type{TV}=Vector{Float64}) where {TV<:AbstractVector} + p = TV(undef, 2) + return ... +end + +function gauss(::Missing, y=1.0, ::Type{TV}=Vector{Float64}) where {TV<:AbstractVector} + return gauss(TV(undef, 3), y, TV) +end +``` + +If `x` is sampled as a whole from a distribution and not indexed, e.g., `x ~ Normal(...)` or `x ~ MvNormal(...)`, +there is no need to initialize it in an `if`-block. + +## Step 1: Break up the model definition + +First, the `@model` macro breaks up the user-provided function definition using `DynamicPPL.build_model_info`. This function +returns a dictionary consisting of: + + - `allargs_exprs`: The expressions of the positional and keyword arguments, without default values. + - `allargs_syms`: The names of the positional and keyword arguments, e.g., `[:x, :y, :TV]` above. + - `allargs_namedtuple`: An expression that constructs a `NamedTuple` of the positional and keyword arguments, e.g., `:((x = x, y = y, TV = TV))` above. + - `defaults_namedtuple`: An expression that constructs a `NamedTuple` of the default positional and keyword arguments, if any, e.g., `:((x = missing, y = 1, TV = Vector{Float64}))` above. + - `modeldef`: A dictionary with the name, arguments, and function body of the model definition, as returned by `MacroTools.splitdef`. + +## Step 2: Generate the body of the internal model function + +In a second step, `DynamicPPL.generate_mainbody` generates the main part of the transformed function body using the user-provided function body +and the provided function arguments, without default values, for figuring out if a variable denotes an observation or a random variable. 
+Hereby the function `DynamicPPL.generate_tilde` replaces the `L ~ R` lines in the model and the function `DynamicPPL.generate_dot_tilde` replaces +the `@. L ~ R` and `L .~ R` lines in the model. + +In the above example, `p[1] ~ InverseGamma(2, 3)` is replaced with something similar to + +```{julia} +#| eval: false +#= REPL[25]:6 =# +begin + var"##tmpright#323" = InverseGamma(2, 3) + var"##tmpright#323" isa Union{Distribution,AbstractVector{<:Distribution}} || throw( + ArgumentError( + "Right-hand side of a ~ must be subtype of Distribution or a vector of Distributions.", + ), + ) + var"##vn#325" = (DynamicPPL.VarName)(:p, ((1,),)) + var"##inds#326" = ((1,),) + p[1] = (DynamicPPL.tilde_assume)( + _rng, + _context, + _sampler, + var"##tmpright#323", + var"##vn#325", + var"##inds#326", + _varinfo, + ) +end +``` + +Here the first line is a so-called line number node that enables more helpful error messages by providing users with the exact location +of the error in their model definition. Then the right hand side (RHS) of the `~` is assigned to a variable (with an automatically generated name). +We check that the RHS is a distribution or an array of distributions, otherwise an error is thrown. +Next we extract a compact representation of the variable with its name and index (or indices). Finally, the `~` expression is replaced with +a call to `DynamicPPL.tilde_assume` since the compiler figured out that `p[1]` is a random variable using the following +heuristic: + + 1. If the symbol on the LHS of `~`, `:p` in this case, is not among the arguments to the model, `(:x, :y, :T)` in this case, it is a random variable. + 2. If the symbol on the LHS of `~`, `:p` in this case, is among the arguments to the model but has a value of `missing`, it is a random variable. + 3. If the value of the LHS of `~`, `p[1]` in this case, is `missing`, then it is a random variable. + 4. Otherwise, it is treated as an observation. 
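+
+As a concrete sketch of this heuristic, consider the following hypothetical model (the names `demo`, `m`, `x`, and `y` are illustrative, not taken from the example above; like the other listings, it is not evaluated here):
+
+```{julia}
+#| eval: false
+@model function demo(x, y=missing)
+    # Rule 1: `m` is not among the model arguments, so it is a random variable.
+    m ~ Normal(0, 1)
+    # Rule 2: `y` is an argument, but its value is `missing`, so it is a random variable.
+    y ~ Normal(m, 1)
+    # Rule 4: `x` is an argument with a non-`missing` value, so it is an observation.
+    return x ~ Normal(m, 1)
+end
+
+model = demo(1.5)  # `x = 1.5` is observed; `m` and `y` are treated as parameters in `P`
+```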
+ +The `DynamicPPL.tilde_assume` function takes care of sampling the random variable, if needed, and updating its value and the accumulated log joint +probability in the `_varinfo` object. If `L ~ R` is an observation, `DynamicPPL.tilde_observe` is called with the same arguments except the +random number generator `_rng` (since observations are never sampled). + +A similar transformation is performed for expressions of the form `@. L ~ R` and `L .~ R`. For instance, +`@. x[1:2] ~ Normal(p[2], sqrt(p[1]))` is replaced with + +```{julia} +#| eval: false +#= REPL[25]:8 =# +begin + var"##tmpright#331" = Normal.(p[2], sqrt.(p[1])) + var"##tmpright#331" isa Union{Distribution,AbstractVector{<:Distribution}} || throw( + ArgumentError( + "Right-hand side of a ~ must be subtype of Distribution or a vector of Distributions.", + ), + ) + var"##vn#333" = (DynamicPPL.VarName)(:x, ((1:2,),)) + var"##inds#334" = ((1:2,),) + var"##isassumption#335" = begin + let var"##vn#336" = (DynamicPPL.VarName)(:x, ((1:2,),)) + if !((DynamicPPL.inargnames)(var"##vn#336", _model)) || + (DynamicPPL.inmissings)(var"##vn#336", _model) + true + else + x[1:2] === missing + end + end + end + if var"##isassumption#335" + x[1:2] .= (DynamicPPL.dot_tilde_assume)( + _rng, + _context, + _sampler, + var"##tmpright#331", + x[1:2], + var"##vn#333", + var"##inds#334", + _varinfo, + ) + else + (DynamicPPL.dot_tilde_observe)( + _context, + _sampler, + var"##tmpright#331", + x[1:2], + var"##vn#333", + var"##inds#334", + _varinfo, + ) + end +end +``` + +The main difference in the expanded code between `L ~ R` and `@. L ~ R` is that the former doesn't assume `L` to be defined, it can be a new Julia variable in the scope, while the latter assumes `L` already exists. Moreover, `DynamicPPL.dot_tilde_assume` and `DynamicPPL.dot_tilde_observe` are called +instead of `DynamicPPL.tilde_assume` and `DynamicPPL.tilde_observe`. 
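+
+This distinction can be sketched with a small hypothetical model (the names `broadcast_demo`, `s`, and `x` are illustrative only): a plain `~` may introduce a new Julia variable into the scope, whereas the left-hand side of `.~` must already be defined so that it can be broadcast over.
+
+```{julia}
+#| eval: false
+@model function broadcast_demo(x)
+    # `s` need not exist beforehand: `~` introduces it as a new variable in the scope.
+    s ~ InverseGamma(2, 3)
+    # `x` must already be defined (here it is a model argument), since `.~`
+    # broadcasts the distribution over the existing elements of `x`.
+    return x .~ Normal(0, sqrt(s))
+end
+
+model = broadcast_demo(randn(10))  # every `x[i]` is an observation in `D`
+```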
+ +## Step 3: Replace the user-provided function body + +Finally, we replace the user-provided function body using `DynamicPPL.build_output`. This function uses `MacroTools.combinedef` to reassemble +the user-provided function with a new function body. In the modified function body an anonymous function is created whose function body +was generated in step 2 above and whose arguments are + + - a random number generator `_rng`, + - a model `_model`, + - a datastructure `_varinfo`, + - a sampler `_sampler`, + - a sampling context `_context`, + - and all positional and keyword arguments of the user-provided model function as positional arguments + without any default values. Finally, in the new function body a `model::Model` with this anonymous function as internal function is returned. + +# `VarName` + +In order to track random variables in the sampling process, `Turing` uses the `VarName` struct which acts as a random variable identifier generated at runtime. The `VarName` of a random variable is generated from the expression on the LHS of a `~` statement when the symbol on the LHS is in the set `P` of unobserved random variables. Every `VarName` instance has a type parameter `sym` which is the symbol of the Julia variable in the model that the random variable belongs to. For example, `x[1] ~ Normal()` will generate an instance of `VarName{:x}` assuming `x` is an unobserved random variable. Every `VarName` also has a field `indexing`, which stores the indices required to access the random variable from the Julia variable indicated by `sym` as a tuple of tuples. Each element of the tuple thereby contains the indices of one indexing operation (`VarName` also supports hierarchical arrays and range indexing). Some examples: + + - `x ~ Normal()` will generate a `VarName(:x, ())`. + - `x[1] ~ Normal()` will generate a `VarName(:x, ((1,),))`. + - `x[:,1] ~ MvNormal(zeros(2), I)` will generate a `VarName(:x, ((Colon(), 1),))`. 
+ - `x[:,1][1+1] ~ Normal()` will generate a `VarName(:x, ((Colon(), 1), (2,)))`.
+
+The easiest way to manually construct a `VarName` is to use the `@varname` macro on an indexing expression, which will take the `sym` value from the actual variable name, and put the index values appropriately into the constructor.
+
+# `VarInfo`
+
+## Overview
+
+`VarInfo` is the data structure in `Turing` that tracks random variables together with the metadata about them that is required for sampling. For instance, the distribution of every random variable is stored in `VarInfo` because we need to know the support of every random variable when sampling with HMC, for example. Random variables whose distributions have a constrained support are transformed using a bijector from [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl) so that the sampling happens in the unconstrained space. Different samplers require different metadata about the random variables.
+
+The definition of `VarInfo` in `Turing` is:
+
+```{julia}
+#| eval: false
+struct VarInfo{Tmeta, Tlogp} <: AbstractVarInfo
+    metadata::Tmeta
+    logp::Base.RefValue{Tlogp}
+    num_produce::Base.RefValue{Int}
+end
+```
+
+Based on the type of `metadata`, the `VarInfo` is aliased either `UntypedVarInfo` or `TypedVarInfo`. `metadata` can be either a subtype of the union type `Metadata` or a `NamedTuple` of multiple such subtypes. Let `vi` be an instance of `VarInfo`. If `vi isa VarInfo{<:Metadata}`, it is called an `UntypedVarInfo`. If `vi isa VarInfo{<:NamedTuple}`, then `vi.metadata` is a `NamedTuple` mapping each symbol in `P` to an instance of `Metadata`, and `vi` is called a `TypedVarInfo`. The other fields of `VarInfo` are `logp`, which is used to accumulate the log probability or log probability density of the variables in `P` and `D`, and `num_produce`, which keeps track of how many observations have been made in the model so far.
This is incremented when running a `~` statement when the symbol on the LHS is in `D`.
+
+## `Metadata`
+
+The `Metadata` struct stores metadata about the random variables sampled. This makes it possible to
+query information about a variable, such as its distribution, which samplers
+sample this variable, its value, and whether this value has been transformed to the real (unconstrained) space or
+not. Let `md` be an instance of `Metadata`:
+
+ - `md.vns` is the vector of all `VarName` instances. Let `vn` be an arbitrary element of `md.vns`.
+ - `md.idcs` is the dictionary that maps each `VarName` instance to its index in
+   `md.vns`, `md.ranges`, `md.dists`, `md.orders` and `md.flags`.
+ - `md.vns[md.idcs[vn]] == vn`.
+ - `md.dists[md.idcs[vn]]` is the distribution of `vn`.
+ - `md.gids[md.idcs[vn]]` is the set of algorithms used to sample `vn`. This is used in
+   the Gibbs sampling process.
+ - `md.orders[md.idcs[vn]]` is the number of `observe` statements before `vn` is sampled.
+ - `md.ranges[md.idcs[vn]]` is the index range of `vn` in `md.vals`.
+ - `md.vals[md.ranges[md.idcs[vn]]]` is the linearized vector of values corresponding to `vn`.
+ - `md.flags` is a dictionary of true/false flags. `md.flags[flag][md.idcs[vn]]` is the
+   value of `flag` corresponding to `vn`.
+
+Note that in order to make `md::Metadata` type stable, all the `md.vns` must have the same symbol and distribution type. However, one can have a single Julia variable, e.g. `x`, that is a matrix or a hierarchical array sampled in partitions, e.g. `x[1][:] ~ MvNormal(zeros(2), I); x[2][:] ~ MvNormal(ones(2), I)`. The symbol `x` can still be managed by a single `md::Metadata` without hurting the type stability since all the distributions on the RHS of `~` are of the same type.
+
+However, this restriction cannot be imposed on `Turing` models in general, so we must use a type unstable `Metadata` if we want to use one `Metadata` instance for the whole model. This is what `UntypedVarInfo` does.
A type unstable `Metadata` will still work but will have inferior performance. + +To strike a balance between flexibility and performance when constructing the `spl::Sampler` instance, the model is first run by sampling the parameters in `P` from their priors using an `UntypedVarInfo`, i.e. a type unstable `Metadata` is used for all the variables. Then once all the symbols and distribution types have been identified, a `vi::TypedVarInfo` is constructed where `vi.metadata` is a `NamedTuple` mapping each symbol in `P` to a specialized instance of `Metadata`. So as long as each symbol in `P` is sampled from only one type of distributions, `vi::TypedVarInfo` will have fully concretely typed fields which brings out the peak performance of Julia. diff --git a/developers/compiler/minituring-compiler/index.qmd b/developers/compiler/minituring-compiler/index.qmd index 605698694..c22894b17 100755 --- a/developers/compiler/minituring-compiler/index.qmd +++ b/developers/compiler/minituring-compiler/index.qmd @@ -1,295 +1,295 @@ ---- -title: "A Mini Turing Implementation I: Compiler" -engine: julia -aliases: - - ../../../tutorials/14-minituring/index.html ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -In this tutorial we develop a very simple probabilistic programming language. -The implementation is similar to [DynamicPPL](https://github.com/TuringLang/DynamicPPL.jl). -This is intentional as we want to demonstrate some key ideas from Turing's internal implementation. - -To make things easy to understand and to implement we restrict our language to a very simple subset of the language that Turing actually supports. -Defining an accurate syntax description is not our goal here, instead, we give a simple example and all similar programs should work. 
- -# Consider a probabilistic model defined by - -$$ -\begin{aligned} -a &\sim \operatorname{Normal}(0.5, 1^2) \\ -b &\sim \operatorname{Normal}(a, 2^2) \\ -x &\sim \operatorname{Normal}(b, 0.5^2) -\end{aligned} -$$ - -We assume that `x` is data, i.e., an observed variable. -In our small language this model will be defined as - -```{julia} -#| eval: false -@mini_model function m(x) - a ~ Normal(0.5, 1) - b ~ Normal(a, 2) - x ~ Normal(b, 0.5) - return nothing -end -``` - -Specifically, we demand that - - - all observed variables are arguments of the program, - - the model definition does not contain any control flow, - - all variables are scalars, and - - the function returns `nothing`. - -First, we import some required packages: - -```{julia} -using MacroTools, Distributions, Random, AbstractMCMC, MCMCChains -``` - -Before getting to the actual "compiler", we first build the data structure for the program trace. -A program trace for a probabilistic programming language needs to at least record the values of stochastic variables and their log-probabilities. - -```{julia} -struct VarInfo{V,L} - values::V - logps::L -end - -VarInfo() = VarInfo(Dict{Symbol,Float64}(), Dict{Symbol,Float64}()) - -function Base.setindex!(varinfo::VarInfo, (value, logp), var_id) - varinfo.values[var_id] = value - varinfo.logps[var_id] = logp - return varinfo -end -``` - -Internally, our probabilistic programming language works with two main functions: - - - `assume` for sampling unobserved variables and computing their log-probabilities, and - - `observe` for computing log-probabilities of observed variables (but not sampling them). - -For different inference algorithms we may have to use different sampling procedures and different log-probability computations. 
-For instance, in some cases we might want to sample all variables from their prior distributions and in other cases we might only want to compute the log-likelihood of the observations based on a given set of values for the unobserved variables. -Thus depending on the inference algorithm we want to use different `assume` and `observe` implementations. -We can achieve this by providing this `context` information as a function argument to `assume` and `observe`. - -**Note:** *Although the context system in this tutorial is inspired by DynamicPPL, it is very simplistic. -We expand this mini Turing example in the [contexts]({{}}) tutorial with some more complexity, to illustrate how and why contexts are central to Turing's design. For the full details one still needs to go to the actual source of DynamicPPL though.* - -Here we can see the implementation of a sampler that draws values of unobserved variables from the prior and computes the log-probability for every variable. - -```{julia} -struct SamplingContext{S<:AbstractMCMC.AbstractSampler,R<:Random.AbstractRNG} - rng::R - sampler::S -end - -struct PriorSampler <: AbstractMCMC.AbstractSampler end - -function observe(context::SamplingContext, varinfo, dist, var_id, var_value) - logp = logpdf(dist, var_value) - varinfo[var_id] = (var_value, logp) - return nothing -end - -function assume(context::SamplingContext{PriorSampler}, varinfo, dist, var_id) - sample = Random.rand(context.rng, dist) - logp = logpdf(dist, sample) - varinfo[var_id] = (sample, logp) - return sample -end; -``` - -Next we define the "compiler" for our simple programming language. -The term compiler is actually a bit misleading here since its only purpose is to transform the function definition in the `@mini_model` macro by - - - adding the context information (`context`) and the tracing data structure (`varinfo`) as additional arguments, and - - replacing tildes with calls to `assume` and `observe`. 
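Schematically, this rewrite turns the example model into a plain function like the one below. This is a hand-written sketch with stub handlers so it is self-contained; the real macro interpolates the `assume`/`observe` function objects directly rather than their names.

```julia
# Minimal stubs so the sketch runs on its own; they only mimic the handlers' shape.
struct Normal
    mu::Float64
    sigma::Float64
end
assume(context, varinfo, dist, var_id) = dist.mu            # stub: return the mean
observe(context, varinfo, dist, var_id, value) = nothing    # stub: no-op

# What `@mini_model function m(x) ... end` expands to, roughly:
function m(varinfo, context, x)
    a = assume(context, varinfo, Normal(0.5, 1), :a)
    b = assume(context, varinfo, Normal(a, 2), :b)
    observe(context, varinfo, Normal(b, 0.5), :x, x)
    return nothing
end

m(nothing, nothing, 3.0)  # returns nothing
```

Note how `x`, being a model argument, is routed to `observe`, while `a` and `b` are routed to `assume` and bound in the local scope.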
- -Afterwards, as usual the Julia compiler will just-in-time compile the model function when it is called. - -The manipulation of Julia expressions is an advanced part of the Julia language. -The [Julia documentation](https://docs.julialang.org/en/v1/manual/metaprogramming/) provides an introduction to and more details about this so-called metaprogramming. - -```{julia} -macro mini_model(expr) - return esc(mini_model(expr)) -end - -function mini_model(expr) - # Split the function definition into a dictionary with its name, arguments, body etc. - def = MacroTools.splitdef(expr) - - # Replace tildes in the function body with calls to `assume` or `observe` - def[:body] = MacroTools.postwalk(def[:body]) do sub_expr - if MacroTools.@capture(sub_expr, var_ ~ dist_) - if var in def[:args] - # If the variable is an argument of the model function, it is observed - return :($(observe)(context, varinfo, $dist, $(Meta.quot(var)), $var)) - else - # Otherwise it is unobserved - return :($var = $(assume)(context, varinfo, $dist, $(Meta.quot(var)))) - end - else - return sub_expr - end - end - - # Add `context` and `varinfo` arguments to the model function - def[:args] = vcat(:varinfo, :context, def[:args]) - - # Reassemble the function definition from its name, arguments, body etc. - return MacroTools.combinedef(def) -end; -``` - -For inference, we make use of the [AbstractMCMC interface](https://turinglang.github.io/AbstractMCMC.jl/dev/). -It provides a default implementation of a `sample` function for sampling a Markov chain. -The default implementation already supports e.g. sampling of multiple chains in parallel, thinning of samples, or discarding initial samples. - -The AbstractMCMC interface requires us to at least - - - define a model that is a subtype of `AbstractMCMC.AbstractModel`, - - define a sampler that is a subtype of `AbstractMCMC.AbstractSampler`, - - implement `AbstractMCMC.step` for our model and sampler. - -Thus here we define a `MiniModel` model. 
-In this model we store the model function and the observed data. - -```{julia} -struct MiniModel{F,D} <: AbstractMCMC.AbstractModel - f::F - data::D # a NamedTuple of all the data -end -``` - -In the Turing compiler, the model-specific `DynamicPPL.Model` is constructed automatically when calling the model function. -But for the sake of simplicity here we construct the model manually. - -To illustrate probabilistic inference with our mini language we implement an extremely simplistic Random-Walk Metropolis-Hastings sampler. -We hard-code the proposal step as part of the sampler and only allow normal distributions with zero mean and fixed standard deviation. -The Metropolis-Hastings sampler in Turing is more flexible. - -```{julia} -struct MHSampler{T<:Real} <: AbstractMCMC.AbstractSampler - sigma::T -end - -MHSampler() = MHSampler(1) - -function assume(context::SamplingContext{<:MHSampler}, varinfo, dist, var_id) - sampler = context.sampler - old_value = varinfo.values[var_id] - - # propose a random-walk step, i.e, add the current value to a random - # value sampled from a Normal distribution centered at 0 - value = rand(context.rng, Normal(old_value, sampler.sigma)) - logp = Distributions.logpdf(dist, value) - varinfo[var_id] = (value, logp) - - return value -end; -``` - -We need to define two `step` functions, one for the first step and the other for the following steps. -In the first step we sample values from the prior distributions and in the following steps we sample with the random-walk proposal. -The two functions are identified by the different arguments they take. - -```{julia} -# The fist step: Sampling from the prior distributions -function AbstractMCMC.step( - rng::Random.AbstractRNG, model::MiniModel, sampler::MHSampler; kwargs... -) - vi = VarInfo() - ctx = SamplingContext(rng, PriorSampler()) - model.f(vi, ctx, values(model.data)...) 
- return vi, vi -end - -# The following steps: Sampling with random-walk proposal -function AbstractMCMC.step( - rng::Random.AbstractRNG, - model::MiniModel, - sampler::MHSampler, - prev_state::VarInfo; # is just the old trace - kwargs..., -) - vi = prev_state - new_vi = deepcopy(vi) - ctx = SamplingContext(rng, sampler) - model.f(new_vi, ctx, values(model.data)...) - - # Compute log acceptance probability - # Since the proposal is symmetric the computation can be simplified - logα = sum(values(new_vi.logps)) - sum(values(vi.logps)) - - # Accept proposal with computed acceptance probability - if -randexp(rng) < logα - return new_vi, new_vi - else - return prev_state, prev_state - end -end; -``` - -To make it easier to analyze the samples and compare them with results from Turing, additionally we define a version of `AbstractMCMC.bundle_samples` for our model and sampler that returns a `MCMCChains.Chains` object of samples. - -```{julia} -function AbstractMCMC.bundle_samples( - samples, model::MiniModel, ::MHSampler, ::Any, ::Type{Chains}; kwargs... -) - # We get a vector of traces - values = [sample.values for sample in samples] - params = [key for key in keys(values[1]) if key ∉ keys(model.data)] - vals = reduce(hcat, [value[p] for value in values] for p in params) - # Composing the `Chains` data-structure, of which analyzing infrastructure is provided - chains = Chains(vals, params) - return chains -end; -``` - -Let us check how our mini probabilistic programming language works. -We define the probabilistic model: - -```{julia} -@mini_model function m(x) - a ~ Normal(0.5, 1) - b ~ Normal(a, 2) - x ~ Normal(b, 0.5) - return nothing -end; -``` - -We perform inference with data `x = 3.0`: - -```{julia} -sample(MiniModel(m, (x=3.0,)), MHSampler(), 1_000_000; chain_type=Chains, progress=false) -``` - -We compare these results with Turing. 
- -```{julia} -using Turing -using PDMats - -@model function turing_m(x) - a ~ Normal(0.5, 1) - b ~ Normal(a, 2) - x ~ Normal(b, 0.5) - return nothing -end - -sample(turing_m(3.0), MH(ScalMat(2, 1.0)), 1_000_000, progress=false) -``` - -As you can see, with our simple probabilistic programming language and custom samplers we get similar results as Turing. +--- +title: "A Mini Turing Implementation I: Compiler" +engine: julia +aliases: + - ../../../tutorials/14-minituring/index.html +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +In this tutorial we develop a very simple probabilistic programming language. +The implementation is similar to [DynamicPPL](https://github.com/TuringLang/DynamicPPL.jl). +This is intentional as we want to demonstrate some key ideas from Turing's internal implementation. + +To make things easy to understand and to implement we restrict our language to a very simple subset of the language that Turing actually supports. +Defining an accurate syntax description is not our goal here, instead, we give a simple example and all similar programs should work. + +# Consider a probabilistic model defined by + +$$ +\begin{aligned} +a &\sim \operatorname{Normal}(0.5, 1^2) \\ +b &\sim \operatorname{Normal}(a, 2^2) \\ +x &\sim \operatorname{Normal}(b, 0.5^2) +\end{aligned} +$$ + +We assume that `x` is data, i.e., an observed variable. +In our small language this model will be defined as + +```{julia} +#| eval: false +@mini_model function m(x) + a ~ Normal(0.5, 1) + b ~ Normal(a, 2) + x ~ Normal(b, 0.5) + return nothing +end +``` + +Specifically, we demand that + + - all observed variables are arguments of the program, + - the model definition does not contain any control flow, + - all variables are scalars, and + - the function returns `nothing`. 
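Before building the macro, it helps to see that `~` is ordinary Julia syntax: a `var ~ dist` expression parses as a call to the function `~`, which is exactly what lets the "compiler" below find and rewrite these statements. A pure-Base sketch of what the pattern match sees:

```julia
# `a ~ Normal(0.5, 1)` parses as an ordinary call expression `~(a, Normal(0.5, 1))`.
ex = :(a ~ Normal(0.5, 1))

ex.head             # :call
ex.args[1]          # :~  (the function being "called")
var, dist = ex.args[2], ex.args[3]
var                 # :a
dist                # :(Normal(0.5, 1))
```

`MacroTools.@capture(sub_expr, var_ ~ dist_)`, used later, performs this destructuring for us.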
+ +First, we import some required packages: + +```{julia} +using MacroTools, Distributions, Random, AbstractMCMC, MCMCChains +``` + +Before getting to the actual "compiler", we first build the data structure for the program trace. +A program trace for a probabilistic programming language needs to at least record the values of stochastic variables and their log-probabilities. + +```{julia} +struct VarInfo{V,L} + values::V + logps::L +end + +VarInfo() = VarInfo(Dict{Symbol,Float64}(), Dict{Symbol,Float64}()) + +function Base.setindex!(varinfo::VarInfo, (value, logp), var_id) + varinfo.values[var_id] = value + varinfo.logps[var_id] = logp + return varinfo +end +``` + +Internally, our probabilistic programming language works with two main functions: + + - `assume` for sampling unobserved variables and computing their log-probabilities, and + - `observe` for computing log-probabilities of observed variables (but not sampling them). + +For different inference algorithms we may have to use different sampling procedures and different log-probability computations. +For instance, in some cases we might want to sample all variables from their prior distributions and in other cases we might only want to compute the log-likelihood of the observations based on a given set of values for the unobserved variables. +Thus depending on the inference algorithm we want to use different `assume` and `observe` implementations. +We can achieve this by providing this `context` information as a function argument to `assume` and `observe`. + +**Note:** *Although the context system in this tutorial is inspired by DynamicPPL, it is very simplistic. +We expand this mini Turing example in the [contexts]({{}}) tutorial with some more complexity, to illustrate how and why contexts are central to Turing's design. 
For the full details one still needs to go to the actual source of DynamicPPL though.* + +Here we can see the implementation of a sampler that draws values of unobserved variables from the prior and computes the log-probability for every variable. + +```{julia} +struct SamplingContext{S<:AbstractMCMC.AbstractSampler,R<:Random.AbstractRNG} + rng::R + sampler::S +end + +struct PriorSampler <: AbstractMCMC.AbstractSampler end + +function observe(context::SamplingContext, varinfo, dist, var_id, var_value) + logp = logpdf(dist, var_value) + varinfo[var_id] = (var_value, logp) + return nothing +end + +function assume(context::SamplingContext{PriorSampler}, varinfo, dist, var_id) + sample = Random.rand(context.rng, dist) + logp = logpdf(dist, sample) + varinfo[var_id] = (sample, logp) + return sample +end; +``` + +Next we define the "compiler" for our simple programming language. +The term compiler is actually a bit misleading here since its only purpose is to transform the function definition in the `@mini_model` macro by + + - adding the context information (`context`) and the tracing data structure (`varinfo`) as additional arguments, and + - replacing tildes with calls to `assume` and `observe`. + +Afterwards, as usual the Julia compiler will just-in-time compile the model function when it is called. + +The manipulation of Julia expressions is an advanced part of the Julia language. +The [Julia documentation](https://docs.julialang.org/en/v1/manual/metaprogramming/) provides an introduction to and more details about this so-called metaprogramming. + +```{julia} +macro mini_model(expr) + return esc(mini_model(expr)) +end + +function mini_model(expr) + # Split the function definition into a dictionary with its name, arguments, body etc. 
+ def = MacroTools.splitdef(expr) + + # Replace tildes in the function body with calls to `assume` or `observe` + def[:body] = MacroTools.postwalk(def[:body]) do sub_expr + if MacroTools.@capture(sub_expr, var_ ~ dist_) + if var in def[:args] + # If the variable is an argument of the model function, it is observed + return :($(observe)(context, varinfo, $dist, $(Meta.quot(var)), $var)) + else + # Otherwise it is unobserved + return :($var = $(assume)(context, varinfo, $dist, $(Meta.quot(var)))) + end + else + return sub_expr + end + end + + # Add `context` and `varinfo` arguments to the model function + def[:args] = vcat(:varinfo, :context, def[:args]) + + # Reassemble the function definition from its name, arguments, body etc. + return MacroTools.combinedef(def) +end; +``` + +For inference, we make use of the [AbstractMCMC interface](https://turinglang.github.io/AbstractMCMC.jl/dev/). +It provides a default implementation of a `sample` function for sampling a Markov chain. +The default implementation already supports e.g. sampling of multiple chains in parallel, thinning of samples, or discarding initial samples. + +The AbstractMCMC interface requires us to at least + + - define a model that is a subtype of `AbstractMCMC.AbstractModel`, + - define a sampler that is a subtype of `AbstractMCMC.AbstractSampler`, + - implement `AbstractMCMC.step` for our model and sampler. + +Thus here we define a `MiniModel` model. +In this model we store the model function and the observed data. + +```{julia} +struct MiniModel{F,D} <: AbstractMCMC.AbstractModel + f::F + data::D # a NamedTuple of all the data +end +``` + +In the Turing compiler, the model-specific `DynamicPPL.Model` is constructed automatically when calling the model function. +But for the sake of simplicity here we construct the model manually. + +To illustrate probabilistic inference with our mini language we implement an extremely simplistic Random-Walk Metropolis-Hastings sampler. 
+We hard-code the proposal step as part of the sampler and only allow normal distributions with zero mean and fixed standard deviation. +The Metropolis-Hastings sampler in Turing is more flexible. + +```{julia} +struct MHSampler{T<:Real} <: AbstractMCMC.AbstractSampler + sigma::T +end + +MHSampler() = MHSampler(1) + +function assume(context::SamplingContext{<:MHSampler}, varinfo, dist, var_id) + sampler = context.sampler + old_value = varinfo.values[var_id] + + # propose a random-walk step, i.e, add the current value to a random + # value sampled from a Normal distribution centered at 0 + value = rand(context.rng, Normal(old_value, sampler.sigma)) + logp = Distributions.logpdf(dist, value) + varinfo[var_id] = (value, logp) + + return value +end; +``` + +We need to define two `step` functions, one for the first step and the other for the following steps. +In the first step we sample values from the prior distributions and in the following steps we sample with the random-walk proposal. +The two functions are identified by the different arguments they take. + +```{julia} +# The fist step: Sampling from the prior distributions +function AbstractMCMC.step( + rng::Random.AbstractRNG, model::MiniModel, sampler::MHSampler; kwargs... +) + vi = VarInfo() + ctx = SamplingContext(rng, PriorSampler()) + model.f(vi, ctx, values(model.data)...) + return vi, vi +end + +# The following steps: Sampling with random-walk proposal +function AbstractMCMC.step( + rng::Random.AbstractRNG, + model::MiniModel, + sampler::MHSampler, + prev_state::VarInfo; # is just the old trace + kwargs..., +) + vi = prev_state + new_vi = deepcopy(vi) + ctx = SamplingContext(rng, sampler) + model.f(new_vi, ctx, values(model.data)...) 
+ + # Compute log acceptance probability + # Since the proposal is symmetric the computation can be simplified + logα = sum(values(new_vi.logps)) - sum(values(vi.logps)) + + # Accept proposal with computed acceptance probability + if -randexp(rng) < logα + return new_vi, new_vi + else + return prev_state, prev_state + end +end; +``` + +To make it easier to analyze the samples and compare them with results from Turing, additionally we define a version of `AbstractMCMC.bundle_samples` for our model and sampler that returns a `MCMCChains.Chains` object of samples. + +```{julia} +function AbstractMCMC.bundle_samples( + samples, model::MiniModel, ::MHSampler, ::Any, ::Type{Chains}; kwargs... +) + # We get a vector of traces + values = [sample.values for sample in samples] + params = [key for key in keys(values[1]) if key ∉ keys(model.data)] + vals = reduce(hcat, [value[p] for value in values] for p in params) + # Composing the `Chains` data-structure, of which analyzing infrastructure is provided + chains = Chains(vals, params) + return chains +end; +``` + +Let us check how our mini probabilistic programming language works. +We define the probabilistic model: + +```{julia} +@mini_model function m(x) + a ~ Normal(0.5, 1) + b ~ Normal(a, 2) + x ~ Normal(b, 0.5) + return nothing +end; +``` + +We perform inference with data `x = 3.0`: + +```{julia} +sample(MiniModel(m, (x=3.0,)), MHSampler(), 1_000_000; chain_type=Chains, progress=false) +``` + +We compare these results with Turing. + +```{julia} +using Turing +using PDMats + +@model function turing_m(x) + a ~ Normal(0.5, 1) + b ~ Normal(a, 2) + x ~ Normal(b, 0.5) + return nothing +end + +sample(turing_m(3.0), MH(ScalMat(2, 1.0)), 1_000_000, progress=false) +``` + +As you can see, with our simple probabilistic programming language and custom samplers we get similar results as Turing. 
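Because this toy model is a chain of Gaussians, the posterior is also available in closed form, which gives an independent sanity check on both samplers. The following sketch (pure Base Julia, not part of either implementation) computes the exact posterior means of `a` and `b` given `x = 3.0` using standard Gaussian conjugacy:

```julia
x = 3.0

# Marginalizing a: b ~ Normal(0.5, sqrt(1^2 + 2^2)); likelihood x | b ~ Normal(b, 0.5).
prec_b = 1 / (1^2 + 2^2) + 1 / 0.5^2
mean_b = (0.5 / (1^2 + 2^2) + x / 0.5^2) / prec_b

# Marginalizing b: x | a ~ Normal(a, sqrt(2^2 + 0.5^2)); prior a ~ Normal(0.5, 1).
prec_a = 1 / 1^2 + 1 / (2^2 + 0.5^2)
mean_a = (0.5 / 1^2 + x / (2^2 + 0.5^2)) / prec_a

(mean_a, mean_b)  # ≈ (0.976, 2.881)
```

Both the mini-language sampler and Turing's `MH` should produce posterior means close to these values.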
diff --git a/developers/compiler/minituring-contexts/index.qmd b/developers/compiler/minituring-contexts/index.qmd index cbad94380..6468483ef 100755 --- a/developers/compiler/minituring-contexts/index.qmd +++ b/developers/compiler/minituring-contexts/index.qmd @@ -1,306 +1,306 @@ ---- -title: "A Mini Turing Implementation II: Contexts" -engine: julia -aliases: - - ../../../tutorials/16-contexts/index.html ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -In the [Mini Turing]({{< meta minituring >}}) tutorial we developed a miniature version of the Turing language, to illustrate its core design. A passing mention was made of contexts. In this tutorial we develop that aspect of our mini Turing language further to demonstrate how and why contexts are an important part of Turing's design. - -# Mini Turing expanded, now with more contexts - -If you haven't read [Mini Turing]({{< meta minituring >}}) yet, you should do that first. We start by repeating verbatim much of the code from there. Define the type for holding values for variables: - -```{julia} -import MacroTools, Random, AbstractMCMC -using Distributions: Normal, logpdf -using MCMCChains: Chains -using AbstractMCMC: sample - -struct VarInfo{V,L} - values::V - logps::L -end - -VarInfo() = VarInfo(Dict{Symbol,Float64}(), Dict{Symbol,Float64}()) - -function Base.setindex!(varinfo::VarInfo, (value, logp), var_id) - varinfo.values[var_id] = value - varinfo.logps[var_id] = logp - return varinfo -end -``` - -Define the macro that expands `~` expressions to calls to `assume` and `observe`: - -```{julia} -# Methods will be defined for these later. -function assume end -function observe end - -macro mini_model(expr) - return esc(mini_model(expr)) -end - -function mini_model(expr) - # Split the function definition into a dictionary with its name, arguments, body etc. 
- def = MacroTools.splitdef(expr) - - # Replace tildes in the function body with calls to `assume` or `observe` - def[:body] = MacroTools.postwalk(def[:body]) do sub_expr - if MacroTools.@capture(sub_expr, var_ ~ dist_) - if var in def[:args] - # If the variable is an argument of the model function, it is observed - return :($(observe)(context, varinfo, $dist, $(Meta.quot(var)), $var)) - else - # Otherwise it is unobserved - return :($var = $(assume)(context, varinfo, $dist, $(Meta.quot(var)))) - end - else - return sub_expr - end - end - - # Add `context` and `varinfo` arguments to the model function - def[:args] = vcat(:varinfo, :context, def[:args]) - - # Reassemble the function definition from its name, arguments, body etc. - return MacroTools.combinedef(def) -end - - -struct MiniModel{F,D} <: AbstractMCMC.AbstractModel - f::F - data::D # a NamedTuple of all the data -end -``` - -Define an example model: - -```{julia} -@mini_model function m(x) - a ~ Normal(0.5, 1) - b ~ Normal(a, 2) - x ~ Normal(b, 0.5) - return nothing -end; - -mini_m = MiniModel(m, (x=3.0,)) -``` - -Previously in the mini Turing case, at this point we defined `SamplingContext`, a structure that holds a random number generator and a sampler, and gets passed to `observe` and `assume`. We then used it to implement a simple Metropolis-Hastings sampler. - -The notion of a context may have seemed overly complicated just to implement the sampler, but there are other things we may want to do with a model than sample from the posterior. Having the context passing in place lets us do that without having to touch the above macro at all. For instance, let's say we want to evaluate the log joint probability of the model for a given set of data and parameters. Using a new context type we can use the previously defined `model` function, but change its behavior by changing what the `observe` and `assume` functions do. 
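Concretely, the log joint that such a context computes for the example model is just a sum of three Gaussian log-density terms. As a quick pure-Julia sanity check — the `normlogpdf` helper below is an illustrative stand-in for `logpdf(Normal(mu, sigma), x)`, not part of the tutorial's code:

```julia
# Gaussian log-density, written out explicitly.
normlogpdf(mu, sigma, x) = -0.5 * log(2pi * sigma^2) - (x - mu)^2 / (2 * sigma^2)

# Log joint of the example model at a = 0.5, b = 1.0 with observed x = 3.0:
lj = normlogpdf(0.5, 1, 0.5) +   # a ~ Normal(0.5, 1)
     normlogpdf(0.5, 2, 1.0) +   # b ~ Normal(a, 2)
     normlogpdf(1.0, 0.5, 3.0)   # x ~ Normal(b, 0.5)
# lj ≈ -10.788
```

This is the value that `logjoint(mini_m, (a=0.5, b=1.0))` should reproduce below.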
- - - -```{julia} -struct JointContext end - -function observe(context::JointContext, varinfo, dist, var_id, var_value) - logp = logpdf(dist, var_value) - varinfo[var_id] = (var_value, logp) - return nothing -end - -function assume(context::JointContext, varinfo, dist, var_id) - if !haskey(varinfo.values, var_id) - error("Can't evaluate the log probability if the variable $(var_id) is not set.") - end - var_value = varinfo.values[var_id] - logp = logpdf(dist, var_value) - varinfo[var_id] = (var_value, logp) - return var_value -end - -function logjoint(model, parameter_values::NamedTuple) - vi = VarInfo() - for (var_id, value) in pairs(parameter_values) - # Set the log prob to NaN for now. These will get overwritten when model.f is - # called with JointContext. - vi[var_id] = (value, NaN) - end - model.f(vi, JointContext(), values(model.data)...) - return sum(values(vi.logps)) -end - -logjoint(mini_m, (a=0.5, b=1.0)) -``` - -When using the `JointContext` no sampling whatsoever happens in calling `mini_m`. Rather only the log probability of each given variable value is evaluated. `logjoint` then sums these results to get the total log joint probability. - -We can similarly define a context for evaluating the log prior probability: - -```{julia} -struct PriorContext end - -function observe(context::PriorContext, varinfo, dist, var_id, var_value) - # Since we are evaluating the prior, the log probability of all the observations - # is set to 0. This has the effect of ignoring the likelihood. 
- varinfo[var_id] = (var_value, 0.0) - return nothing -end - -function assume(context::PriorContext, varinfo, dist, var_id) - if !haskey(varinfo.values, var_id) - error("Can't evaluate the log probability if the variable $(var_id) is not set.") - end - var_value = varinfo.values[var_id] - logp = logpdf(dist, var_value) - varinfo[var_id] = (var_value, logp) - return var_value -end - -function logprior(model, parameter_values::NamedTuple) - vi = VarInfo() - for (var_id, value) in pairs(parameter_values) - vi[var_id] = (value, NaN) - end - model.f(vi, PriorContext(), values(model.data)...) - return sum(values(vi.logps)) -end - -logprior(mini_m, (a=0.5, b=1.0)) -``` - -Notice that the definition of `assume(context::PriorContext, args...)` is identical to the one for `JointContext`, and `logprior` and `logjoint` are also identical except for the context type they create. There's clearly an opportunity here for some refactoring using abstract types, but that's outside the scope of this tutorial. Rather, the point here is to demonstrate that we can extract different sorts of things from our model by defining different context types, and specialising `observe` and `assume` for them. - - -## Contexts within contexts - -Let's use the above two contexts to provide a slightly more general definition of the `SamplingContext` and the Metropolis-Hastings sampler we wrote in the mini Turing tutorial. - -```{julia} -struct SamplingContext{S<:AbstractMCMC.AbstractSampler,R<:Random.AbstractRNG} - rng::R - sampler::S - subcontext::Union{PriorContext, JointContext} -end -``` - -The new aspect here is the `subcontext` field. Note that this is a context within a context! The idea is that we don't need to hard code how the MCMC sampler evaluates the log probability, but rather can pass that work onto the subcontext. This way the same sampler can be used to sample from either the joint or the prior distribution. 
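The delegation pattern can be seen in miniature with plain multiple dispatch, independent of any sampling machinery. In this hypothetical sketch (all names made up), the outer context owns the control flow and hands the log-probability bookkeeping to whichever subcontext it was given:

```julia
# Toy illustration of a context delegating to a subcontext.
struct JointLogp end
struct PriorLogp end

# The subcontext decides whether an observation's log-probability counts.
obs_logp(::JointLogp, logp) = logp
obs_logp(::PriorLogp, logp) = 0.0

struct OuterContext{C}
    subcontext::C
end

# The outer context does its own work, then delegates the bookkeeping.
record_observation(ctx::OuterContext, logp) = obs_logp(ctx.subcontext, logp)

record_observation(OuterContext(JointLogp()), -1.5)  # -1.5
record_observation(OuterContext(PriorLogp()), -1.5)  # 0.0
```

Swapping the subcontext changes what is computed without touching the outer context's code, which is exactly how `SamplingContext` below stays agnostic about joint versus prior evaluation.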
- -The methods for `SamplingContext` are largely as in the our earlier mini Turing case, except they now pass some of the work onto the subcontext: - -```{julia} -function observe(context::SamplingContext, args...) - # Sampling doesn't affect the observed values, so nothing to do here other than pass to - # the subcontext. - return observe(context.subcontext, args...) -end - -struct PriorSampler <: AbstractMCMC.AbstractSampler end - -function assume(context::SamplingContext{PriorSampler}, varinfo, dist, var_id) - sample = Random.rand(context.rng, dist) - varinfo[var_id] = (sample, NaN) - # Once the value has been sampled, let the subcontext handle evaluating the log - # probability. - return assume(context.subcontext, varinfo, dist, var_id) -end; - -# The subcontext field of the MHSampler determines which distribution this sampler -# samples from. -struct MHSampler{D, T<:Real} <: AbstractMCMC.AbstractSampler - sigma::T - subcontext::D -end - -MHSampler(subcontext) = MHSampler(1, subcontext) - -function assume(context::SamplingContext{<:MHSampler}, varinfo, dist, var_id) - sampler = context.sampler - old_value = varinfo.values[var_id] - - # propose a random-walk step, i.e, add the current value to a random - # value sampled from a Normal distribution centered at 0 - value = rand(context.rng, Normal(old_value, sampler.sigma)) - varinfo[var_id] = (value, NaN) - # Once the value has been sampled, let the subcontext handle evaluating the log - # probability. - return assume(context.subcontext, varinfo, dist, var_id) -end; - -# The following three methods are identical to before, except for passing -# `sampler.subcontext` to the context SamplingContext. -function AbstractMCMC.step( - rng::Random.AbstractRNG, model::MiniModel, sampler::MHSampler; kwargs... -) - vi = VarInfo() - ctx = SamplingContext(rng, PriorSampler(), sampler.subcontext) - model.f(vi, ctx, values(model.data)...) 
- return vi, vi -end - -function AbstractMCMC.step( - rng::Random.AbstractRNG, - model::MiniModel, - sampler::MHSampler, - prev_state::VarInfo; # is just the old trace - kwargs..., -) - vi = prev_state - new_vi = deepcopy(vi) - ctx = SamplingContext(rng, sampler, sampler.subcontext) - model.f(new_vi, ctx, values(model.data)...) - - # Compute log acceptance probability - # Since the proposal is symmetric the computation can be simplified - logα = sum(values(new_vi.logps)) - sum(values(vi.logps)) - - # Accept proposal with computed acceptance probability - if -Random.randexp(rng) < logα - return new_vi, new_vi - else - return prev_state, prev_state - end -end; - -function AbstractMCMC.bundle_samples( - samples, model::MiniModel, ::MHSampler, ::Any, ::Type{Chains}; kwargs... -) - # We get a vector of traces - values = [sample.values for sample in samples] - params = [key for key in keys(values[1]) if key ∉ keys(model.data)] - vals = reduce(hcat, [value[p] for value in values] for p in params) - # Composing the `Chains` data-structure, of which analyzing infrastructure is provided - chains = Chains(vals, params) - return chains -end; -``` - -We can use this to sample from the joint distribution just like before: - -```{julia} -sample(MiniModel(m, (x=3.0,)), MHSampler(JointContext()), 1_000_000; chain_type=Chains, progress=false) -``` - -or we can choose to sample from the prior instead - -```{julia} -sample(MiniModel(m, (x=3.0,)), MHSampler(PriorContext()), 1_000_000; chain_type=Chains, progress=false) -``` - -Of course, using an MCMC algorithm to sample from the prior is unnecessary and silly (`PriorSampler` exists, after all), but the point is to illustrate the flexibility of the context system. We could, for instance, use the same setup to implement an _Approximate Bayesian Computation_ (ABC) algorithm. - - -The use of contexts also goes far beyond just evaluating log probabilities and sampling. 
Some examples from Turing are - -* `FixedContext`, which fixes some variables to given values and removes them completely from the evaluation of any log probabilities. They power the `Turing.fix` and `Turing.unfix` functions. -* `ConditionContext` conditions the model on fixed values for some parameters. They are used by `Turing.condition` and `Turing.uncondition`, i.e. the `model | (parameter=value,)` syntax. The difference between `fix` and `condition` is whether the log probability for the corresponding variable is included in the overall log density. - -* `PriorExtractorContext` collects information about what the prior distribution of each variable is. -* `PrefixContext` adds prefixes to variable names, allowing models to be used within other models without variable name collisions. -* `PointwiseLikelihoodContext` records the log likelihood of each individual variable. -* `DebugContext` collects useful debugging information while executing the model. - -All of the above are what Turing calls _parent contexts_, which is to say that they all keep a subcontext just like our above `SamplingContext` did. Their implementations of `assume` and `observe` call the implementation of the subcontext once they are done doing their own work of fixing/conditioning/prefixing/etc. Contexts are often chained, so that e.g. a `DebugContext` may wrap within it a `PrefixContext`, which may in turn wrap a `ConditionContext`, etc. The only contexts that _don't_ have a subcontext in the Turing are the ones for evaluating the prior, likelihood, and joint distributions. These are called _leaf contexts_. - -The above version of mini Turing is still much simpler than the full Turing language, but the principles of how contexts are used are the same. 
+--- +title: "A Mini Turing Implementation II: Contexts" +engine: julia +aliases: + - ../../../tutorials/16-contexts/index.html +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +In the [Mini Turing]({{< meta minituring >}}) tutorial we developed a miniature version of the Turing language, to illustrate its core design. A passing mention was made of contexts. In this tutorial we develop that aspect of our mini Turing language further to demonstrate how and why contexts are an important part of Turing's design. + +# Mini Turing expanded, now with more contexts + +If you haven't read [Mini Turing]({{< meta minituring >}}) yet, you should do that first. We start by repeating verbatim much of the code from there. Define the type for holding values for variables: + +```{julia} +import MacroTools, Random, AbstractMCMC +using Distributions: Normal, logpdf +using MCMCChains: Chains +using AbstractMCMC: sample + +struct VarInfo{V,L} + values::V + logps::L +end + +VarInfo() = VarInfo(Dict{Symbol,Float64}(), Dict{Symbol,Float64}()) + +function Base.setindex!(varinfo::VarInfo, (value, logp), var_id) + varinfo.values[var_id] = value + varinfo.logps[var_id] = logp + return varinfo +end +``` + +Define the macro that expands `~` expressions to calls to `assume` and `observe`: + +```{julia} +# Methods will be defined for these later. +function assume end +function observe end + +macro mini_model(expr) + return esc(mini_model(expr)) +end + +function mini_model(expr) + # Split the function definition into a dictionary with its name, arguments, body etc. 
+ def = MacroTools.splitdef(expr) + + # Replace tildes in the function body with calls to `assume` or `observe` + def[:body] = MacroTools.postwalk(def[:body]) do sub_expr + if MacroTools.@capture(sub_expr, var_ ~ dist_) + if var in def[:args] + # If the variable is an argument of the model function, it is observed + return :($(observe)(context, varinfo, $dist, $(Meta.quot(var)), $var)) + else + # Otherwise it is unobserved + return :($var = $(assume)(context, varinfo, $dist, $(Meta.quot(var)))) + end + else + return sub_expr + end + end + + # Add `context` and `varinfo` arguments to the model function + def[:args] = vcat(:varinfo, :context, def[:args]) + + # Reassemble the function definition from its name, arguments, body etc. + return MacroTools.combinedef(def) +end + + +struct MiniModel{F,D} <: AbstractMCMC.AbstractModel + f::F + data::D # a NamedTuple of all the data +end +``` + +Define an example model: + +```{julia} +@mini_model function m(x) + a ~ Normal(0.5, 1) + b ~ Normal(a, 2) + x ~ Normal(b, 0.5) + return nothing +end; + +mini_m = MiniModel(m, (x=3.0,)) +``` + +Previously in the mini Turing case, at this point we defined `SamplingContext`, a structure that holds a random number generator and a sampler, and gets passed to `observe` and `assume`. We then used it to implement a simple Metropolis-Hastings sampler. + +The notion of a context may have seemed overly complicated just to implement the sampler, but there are other things we may want to do with a model than sample from the posterior. Having the context passing in place lets us do that without having to touch the above macro at all. For instance, let's say we want to evaluate the log joint probability of the model for a given set of data and parameters. Using a new context type we can use the previously defined `model` function, but change its behavior by changing what the `observe` and `assume` functions do. 
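Before moving on, the tilde rewrite performed by the macro above can be seen in isolation. The sketch below is not the macro's literal output (in particular, `context` and `varinfo` are inserted as bare placeholder symbols, and observed variables are not handled), but it shows the core `postwalk`/`@capture` transformation:

```julia
using MacroTools

# Rewrite a single tilde expression the way @mini_model does for unobserved
# variables: `a ~ Normal(0.5, 1)` becomes a call to `assume`.
expr = :(a ~ Normal(0.5, 1))
rewritten = MacroTools.postwalk(expr) do sub_expr
    if MacroTools.@capture(sub_expr, var_ ~ dist_)
        return :(assume(context, varinfo, $dist, $(Meta.quot(var))))
    else
        return sub_expr
    end
end
# rewritten is now the expression assume(context, varinfo, Normal(0.5, 1), :a)
```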
+ + + +```{julia} +struct JointContext end + +function observe(context::JointContext, varinfo, dist, var_id, var_value) + logp = logpdf(dist, var_value) + varinfo[var_id] = (var_value, logp) + return nothing +end + +function assume(context::JointContext, varinfo, dist, var_id) + if !haskey(varinfo.values, var_id) + error("Can't evaluate the log probability if the variable $(var_id) is not set.") + end + var_value = varinfo.values[var_id] + logp = logpdf(dist, var_value) + varinfo[var_id] = (var_value, logp) + return var_value +end + +function logjoint(model, parameter_values::NamedTuple) + vi = VarInfo() + for (var_id, value) in pairs(parameter_values) + # Set the log prob to NaN for now. These will get overwritten when model.f is + # called with JointContext. + vi[var_id] = (value, NaN) + end + model.f(vi, JointContext(), values(model.data)...) + return sum(values(vi.logps)) +end + +logjoint(mini_m, (a=0.5, b=1.0)) +``` + +When using the `JointContext` no sampling whatsoever happens in calling `mini_m`. Rather only the log probability of each given variable value is evaluated. `logjoint` then sums these results to get the total log joint probability. + +We can similarly define a context for evaluating the log prior probability: + +```{julia} +struct PriorContext end + +function observe(context::PriorContext, varinfo, dist, var_id, var_value) + # Since we are evaluating the prior, the log probability of all the observations + # is set to 0. This has the effect of ignoring the likelihood. 
+ varinfo[var_id] = (var_value, 0.0) + return nothing +end + +function assume(context::PriorContext, varinfo, dist, var_id) + if !haskey(varinfo.values, var_id) + error("Can't evaluate the log probability if the variable $(var_id) is not set.") + end + var_value = varinfo.values[var_id] + logp = logpdf(dist, var_value) + varinfo[var_id] = (var_value, logp) + return var_value +end + +function logprior(model, parameter_values::NamedTuple) + vi = VarInfo() + for (var_id, value) in pairs(parameter_values) + vi[var_id] = (value, NaN) + end + model.f(vi, PriorContext(), values(model.data)...) + return sum(values(vi.logps)) +end + +logprior(mini_m, (a=0.5, b=1.0)) +``` + +Notice that the definition of `assume(context::PriorContext, args...)` is identical to the one for `JointContext`, and `logprior` and `logjoint` are also identical except for the context type they create. There's clearly an opportunity here for some refactoring using abstract types, but that's outside the scope of this tutorial. Rather, the point here is to demonstrate that we can extract different sorts of things from our model by defining different context types, and specialising `observe` and `assume` for them. + + +## Contexts within contexts + +Let's use the above two contexts to provide a slightly more general definition of the `SamplingContext` and the Metropolis-Hastings sampler we wrote in the mini Turing tutorial. + +```{julia} +struct SamplingContext{S<:AbstractMCMC.AbstractSampler,R<:Random.AbstractRNG} + rng::R + sampler::S + subcontext::Union{PriorContext, JointContext} +end +``` + +The new aspect here is the `subcontext` field. Note that this is a context within a context! The idea is that we don't need to hard code how the MCMC sampler evaluates the log probability, but rather can pass that work onto the subcontext. This way the same sampler can be used to sample from either the joint or the prior distribution. 
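As a quick sanity check of the two leaf contexts above, the numbers returned by `logjoint` and `logprior` can be reproduced by hand from the three distributions in `m`. This is a self-contained sketch, assuming the same values `a = 0.5`, `b = 1.0`, and observation `x = 3.0` used earlier:

```julia
using Distributions: Normal, logpdf

a, b, x = 0.5, 1.0, 3.0
lp_a = logpdf(Normal(0.5, 1), a)  # prior term for a
lp_b = logpdf(Normal(a, 2), b)    # prior term for b
lp_x = logpdf(Normal(b, 0.5), x)  # likelihood term for the observation x

manual_logjoint = lp_a + lp_b + lp_x  # what logjoint(mini_m, (a=0.5, b=1.0)) computes
manual_logprior = lp_a + lp_b         # observe under PriorContext contributes 0.0
```

The difference between the two is exactly the log likelihood of the observation, which is what `PriorContext` zeroes out.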
+
+The methods for `SamplingContext` are largely as in our earlier mini Turing case, except that they now pass some of the work onto the subcontext:
+
+```{julia}
+function observe(context::SamplingContext, args...)
+    # Sampling doesn't affect the observed values, so there is nothing to do here other
+    # than pass to the subcontext.
+    return observe(context.subcontext, args...)
+end
+
+struct PriorSampler <: AbstractMCMC.AbstractSampler end
+
+function assume(context::SamplingContext{PriorSampler}, varinfo, dist, var_id)
+    sample = Random.rand(context.rng, dist)
+    varinfo[var_id] = (sample, NaN)
+    # Once the value has been sampled, let the subcontext handle evaluating the log
+    # probability.
+    return assume(context.subcontext, varinfo, dist, var_id)
+end;
+
+# The subcontext field of the MHSampler determines which distribution this sampler
+# samples from.
+struct MHSampler{D, T<:Real} <: AbstractMCMC.AbstractSampler
+    sigma::T
+    subcontext::D
+end
+
+MHSampler(subcontext) = MHSampler(1, subcontext)
+
+function assume(context::SamplingContext{<:MHSampler}, varinfo, dist, var_id)
+    sampler = context.sampler
+    old_value = varinfo.values[var_id]
+
+    # Propose a random-walk step, i.e., a draw from a Normal distribution
+    # centered at the current value.
+    value = rand(context.rng, Normal(old_value, sampler.sigma))
+    varinfo[var_id] = (value, NaN)
+    # Once the value has been sampled, let the subcontext handle evaluating the log
+    # probability.
+    return assume(context.subcontext, varinfo, dist, var_id)
+end;
+
+# The following three methods are identical to before, except that they now pass
+# `sampler.subcontext` on to the SamplingContext.
+function AbstractMCMC.step(
+    rng::Random.AbstractRNG, model::MiniModel, sampler::MHSampler; kwargs...
+)
+    vi = VarInfo()
+    ctx = SamplingContext(rng, PriorSampler(), sampler.subcontext)
+    model.f(vi, ctx, values(model.data)...)
+    return vi, vi
+end
+
+function AbstractMCMC.step(
+    rng::Random.AbstractRNG,
+    model::MiniModel,
+    sampler::MHSampler,
+    prev_state::VarInfo; # this is just the old trace
+    kwargs...,
+)
+    vi = prev_state
+    new_vi = deepcopy(vi)
+    ctx = SamplingContext(rng, sampler, sampler.subcontext)
+    model.f(new_vi, ctx, values(model.data)...)
+
+    # Compute the log acceptance probability. Since the proposal is symmetric,
+    # the computation simplifies to a difference of log densities.
+    logα = sum(values(new_vi.logps)) - sum(values(vi.logps))
+
+    # Accept the proposal with probability exp(logα)
+    if -Random.randexp(rng) < logα
+        return new_vi, new_vi
+    else
+        return prev_state, prev_state
+    end
+end;
+
+function AbstractMCMC.bundle_samples(
+    samples, model::MiniModel, ::MHSampler, ::Any, ::Type{Chains}; kwargs...
+)
+    # We get a vector of traces; extract the parameter draws, excluding the data.
+    values = [sample.values for sample in samples]
+    params = [key for key in keys(values[1]) if key ∉ keys(model.data)]
+    vals = reduce(hcat, [value[p] for value in values] for p in params)
+    # Construct the `Chains` data structure, which comes with analysis infrastructure.
+    chains = Chains(vals, params)
+    return chains
+end;
+```
+
+We can use this to sample from the joint distribution just like before:
+
+```{julia}
+sample(MiniModel(m, (x=3.0,)), MHSampler(JointContext()), 1_000_000; chain_type=Chains, progress=false)
+```
+
+or we can choose to sample from the prior instead:
+
+```{julia}
+sample(MiniModel(m, (x=3.0,)), MHSampler(PriorContext()), 1_000_000; chain_type=Chains, progress=false)
+```
+
+Of course, using an MCMC algorithm to sample from the prior is unnecessary and silly (`PriorSampler` exists, after all), but the point is to illustrate the flexibility of the context system. We could, for instance, use the same setup to implement an _Approximate Bayesian Computation_ (ABC) algorithm.
+
+
+The use of contexts also goes far beyond just evaluating log probabilities and sampling.
Some examples of contexts in Turing are:
+
+* `FixedContext` fixes some variables to given values and removes them completely from the evaluation of any log probabilities. It powers the `Turing.fix` and `Turing.unfix` functions.
+* `ConditionContext` conditions the model on fixed values for some parameters. It is used by `Turing.condition` and `Turing.uncondition`, i.e. the `model | (parameter=value,)` syntax. The difference between `fix` and `condition` is whether the log probability for the corresponding variable is included in the overall log density.
+* `PriorExtractorContext` collects information about what the prior distribution of each variable is.
+* `PrefixContext` adds prefixes to variable names, allowing models to be used within other models without variable name collisions.
+* `PointwiseLikelihoodContext` records the log likelihood of each individual variable.
+* `DebugContext` collects useful debugging information while executing the model.
+
+All of the above are what Turing calls _parent contexts_, which is to say that they each keep a subcontext, just like our `SamplingContext` above did. Their implementations of `assume` and `observe` call the implementation of the subcontext once they are done doing their own work of fixing/conditioning/prefixing/etc. Contexts are often chained, so that e.g. a `DebugContext` may wrap a `PrefixContext`, which may in turn wrap a `ConditionContext`, and so on. The only contexts in Turing that _don't_ have a subcontext are the ones for evaluating the prior, likelihood, and joint distributions. These are called _leaf contexts_.
+
+The above version of mini Turing is still much simpler than the full Turing language, but the principles of how contexts are used are the same.
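The parent/leaf pattern described above can be distilled into a self-contained toy, entirely separate from Turing's actual types: a parent context does a small piece of its own work and then delegates to its subcontext, and a leaf context terminates the chain.

```julia
abstract type AbstractContext end

# Leaf context: does the core work and holds no subcontext.
struct LeafContext <: AbstractContext end
process(::LeafContext, x) = x^2

# Parent context: records a message, then delegates to its subcontext.
struct LoggingContext{C<:AbstractContext} <: AbstractContext
    subcontext::C
    entries::Vector{String}
end

function process(ctx::LoggingContext, x)
    push!(ctx.entries, "processing $x")
    return process(ctx.subcontext, x)
end

# Parents can be chained arbitrarily deep; the leaf terminates the recursion.
entries = String[]
ctx = LoggingContext(LoggingContext(LeafContext(), entries), entries)
process(ctx, 3)  # returns 9, with two log entries recorded
```

Swapping the leaf, or inserting another parent into the chain, changes the behaviour without touching the rest of the stack, which is essentially how Turing swaps leaf contexts to evaluate the prior, likelihood, or joint density.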
diff --git a/developers/contributing/index.qmd b/developers/contributing/index.qmd index e2e5b12aa..00040e7e2 100755 --- a/developers/contributing/index.qmd +++ b/developers/contributing/index.qmd @@ -1,78 +1,78 @@ ---- -title: Contributing -aliases: - - ../../tutorials/docs-01-contributing-guide/index.html ---- - -Turing is an open-source project and is [hosted on GitHub](https://github.com/TuringLang). -We welcome contributions from the community in all forms large or small: bug reports, feature implementations, code contributions, or improvements to documentation or infrastructure are all extremely valuable. -We would also very much appreciate examples of models written using Turing. - -### How to get involved - -Our outstanding issues are tabulated on our [issue tracker](https://github.com/TuringLang/Turing.jl/issues). -Closing one of these may involve implementing new features, fixing bugs, or writing example models. - -You can also join the `#turing` channel on the [Julia Slack](https://julialang.org/slack/) and say hello! - -If you are new to open-source software, please see [GitHub's introduction](https://guides.github.com/introduction/flow/) or [Julia's contribution guide](https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md) on using version control for collaboration. - -### Documentation - -Each of the packages in the Turing ecosystem (see [Libraries](/library)) has its own documentation, which is typically found in the `docs` folder of the corresponding package. -For example, the source code for DynamicPPL's documentation can be found in [its repository](https://github.com/TuringLang/DynamicPPL.jl). - -The documentation for Turing.jl itself consists of the tutorials that you see on this website, and is built from the separate [`docs` repository](https://github.com/TuringLang/docs). 
-None of the documentation is generated from the [main Turing.jl repository](https://github.com/TuringLang/Turing.jl); in particular, the API that Turing exports does not currently form part of the documentation. - -Other sections of the website (anything that isn't a package, or a tutorial) – for example, the list of libraries – is built from the [`turinglang.github.io` repository](https://github.com/TuringLang/turinglang.github.io). - -### Tests - -Turing, like most software libraries, has a test suite. You can run the whole suite by running `julia --project=.` from the root of the Turing repository, and then running - -```julia -import Pkg; Pkg.test("Turing") -``` - -The test suite subdivides into files in the `test` folder, and you can run only some of them using commands like - -```julia -import Pkg; Pkg.test("Turing"; test_args=["optim", "hmc", "--skip", "ext"]) -``` - -This one would run all files with "optim" or "hmc" in their path, such as `test/optimisation/Optimisation.jl`, but not files with "ext" in their path. Alternatively, you can set these arguments as command line arguments when you run Julia - -```julia -julia --project=. -e 'import Pkg; Pkg.test(; test_args=ARGS)' -- optim hmc --skip ext -``` - -Or otherwise, set the global `ARGS` variable, and call `include("test/runtests.jl")`. - -### Style Guide - -Turing has a style guide, described below. -Reviewing it before making a pull request is not strictly necessary, but you may be asked to change portions of your code to conform with the style guide before it is merged. - -Most Turing code follows [Blue: a Style Guide for Julia](https://github.com/JuliaDiff/BlueStyle). -These conventions were created from a variety of sources including Python's [PEP8](http://legacy.python.org/dev/peps/pep-0008/), Julia's [Notes for Contributors](https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md), and Julia's [Style Guide](https://docs.julialang.org/en/v1/manual/style-guide/). 
- -#### Synopsis - - - Use 4 spaces per indentation level, no tabs. - - Try to adhere to a 92 character line length limit. - - Use upper camel case convention for [modules](https://docs.julialang.org/en/v1/manual/modules/) and [types](https://docs.julialang.org/en/v1/manual/types/). - - Use lower case with underscores for method names (note: Julia code likes to use lower case without underscores). - - Comments are good, try to explain the intentions of the code. - - Use whitespace to make the code more readable. - - No whitespace at the end of a line (trailing whitespace). - - Avoid padding brackets with spaces. ex. `Int64(value)` preferred over `Int64( value )`. - -#### A Word on Consistency - -When adhering to the Blue style, it's important to realize that these are guidelines, not rules. This is [stated best in the PEP8](http://legacy.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds): - -> A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important. - -> But most importantly: know when to be inconsistent – sometimes the style guide just doesn't apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask! - +--- +title: Contributing +aliases: + - ../../tutorials/docs-01-contributing-guide/index.html +--- + +Turing is an open-source project and is [hosted on GitHub](https://github.com/TuringLang). +We welcome contributions from the community in all forms large or small: bug reports, feature implementations, code contributions, or improvements to documentation or infrastructure are all extremely valuable. +We would also very much appreciate examples of models written using Turing. + +### How to get involved + +Our outstanding issues are tabulated on our [issue tracker](https://github.com/TuringLang/Turing.jl/issues). 
+Closing one of these may involve implementing new features, fixing bugs, or writing example models. + +You can also join the `#turing` channel on the [Julia Slack](https://julialang.org/slack/) and say hello! + +If you are new to open-source software, please see [GitHub's introduction](https://guides.github.com/introduction/flow/) or [Julia's contribution guide](https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md) on using version control for collaboration. + +### Documentation + +Each of the packages in the Turing ecosystem (see [Libraries](/library)) has its own documentation, which is typically found in the `docs` folder of the corresponding package. +For example, the source code for DynamicPPL's documentation can be found in [its repository](https://github.com/TuringLang/DynamicPPL.jl). + +The documentation for Turing.jl itself consists of the tutorials that you see on this website, and is built from the separate [`docs` repository](https://github.com/TuringLang/docs). +None of the documentation is generated from the [main Turing.jl repository](https://github.com/TuringLang/Turing.jl); in particular, the API that Turing exports does not currently form part of the documentation. + +Other sections of the website (anything that isn't a package, or a tutorial) – for example, the list of libraries – is built from the [`turinglang.github.io` repository](https://github.com/TuringLang/turinglang.github.io). + +### Tests + +Turing, like most software libraries, has a test suite. 
You can run the whole suite by running `julia --project=.` from the root of the Turing repository, and then running + +```julia +import Pkg; Pkg.test("Turing") +``` + +The test suite subdivides into files in the `test` folder, and you can run only some of them using commands like + +```julia +import Pkg; Pkg.test("Turing"; test_args=["optim", "hmc", "--skip", "ext"]) +``` + +This one would run all files with "optim" or "hmc" in their path, such as `test/optimisation/Optimisation.jl`, but not files with "ext" in their path. Alternatively, you can set these arguments as command line arguments when you run Julia + +```julia +julia --project=. -e 'import Pkg; Pkg.test(; test_args=ARGS)' -- optim hmc --skip ext +``` + +Or otherwise, set the global `ARGS` variable, and call `include("test/runtests.jl")`. + +### Style Guide + +Turing has a style guide, described below. +Reviewing it before making a pull request is not strictly necessary, but you may be asked to change portions of your code to conform with the style guide before it is merged. + +Most Turing code follows [Blue: a Style Guide for Julia](https://github.com/JuliaDiff/BlueStyle). +These conventions were created from a variety of sources including Python's [PEP8](http://legacy.python.org/dev/peps/pep-0008/), Julia's [Notes for Contributors](https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md), and Julia's [Style Guide](https://docs.julialang.org/en/v1/manual/style-guide/). + +#### Synopsis + + - Use 4 spaces per indentation level, no tabs. + - Try to adhere to a 92 character line length limit. + - Use upper camel case convention for [modules](https://docs.julialang.org/en/v1/manual/modules/) and [types](https://docs.julialang.org/en/v1/manual/types/). + - Use lower case with underscores for method names (note: Julia code likes to use lower case without underscores). + - Comments are good, try to explain the intentions of the code. + - Use whitespace to make the code more readable. 
+ - No whitespace at the end of a line (trailing whitespace). + - Avoid padding brackets with spaces. ex. `Int64(value)` preferred over `Int64( value )`. + +#### A Word on Consistency + +When adhering to the Blue style, it's important to realize that these are guidelines, not rules. This is [stated best in the PEP8](http://legacy.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds): + +> A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important. + +> But most importantly: know when to be inconsistent – sometimes the style guide just doesn't apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask! + diff --git a/developers/inference/abstractmcmc-interface/index.qmd b/developers/inference/abstractmcmc-interface/index.qmd index aa8cfc210..993936209 100755 --- a/developers/inference/abstractmcmc-interface/index.qmd +++ b/developers/inference/abstractmcmc-interface/index.qmd @@ -1,323 +1,323 @@ ---- -title: Interface Guide -engine: julia -aliases: - - ../../tutorials/docs-06-for-developers-interface/index.html ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -# The sampling interface - -Turing implements a sampling interface (hosted at [AbstractMCMC](https://github.com/TuringLang/AbstractMCMC.jl)) that is intended to provide a common framework for Markov chain Monte Carlo samplers. The interface presents several structures and functions that one needs to overload in order to implement an interface-compatible sampler. - -This guide will demonstrate how to implement the interface without Turing. - -## Interface overview - -Any implementation of an inference method that uses the AbstractMCMC interface should implement a subset of the following types and functions: - -1. 
A subtype of `AbstractSampler`, defined as a mutable struct containing state information or sampler parameters. -2. A function `sample_init!` which performs any necessary set-up (default: do not perform any set-up). -3. A function `step!` which returns a transition that represents a single draw from the sampler. -4. A function `transitions_init` which returns a container for the transitions obtained from the sampler (default: return a `Vector{T}` of length `N` where `T` is the type of the transition obtained in the first step and `N` is the number of requested samples). -5. A function `transitions_save!` which saves transitions to the container (default: save the transition of iteration `i` at position `i` in the vector of transitions). -6. A function `sample_end!` which handles any sampler wrap-up (default: do not perform any wrap-up). -7. A function `bundle_samples` which accepts the container of transitions and returns a collection of samples (default: return the vector of transitions). - -The interface methods with exclamation points are those that are intended to allow for state mutation. Any mutating function is meant to allow mutation where needed -- you might use: - -- `sample_init!` to run some kind of sampler preparation, before sampling begins. This could mutate a sampler's state. -- `step!` might mutate a sampler flag after each sample. -- `sample_end!` contains any wrap-up you might need to do. If you were sampling in a transformed space, this might be where you convert everything back to a constrained space. - -## Why do you have an interface? - -The motivation for the interface is to allow Julia's fantastic probabilistic programming language community to have a set of standards and common implementations so we can all thrive together. 
Markov chain Monte Carlo methods tend to have a very similar framework to one another, so a common interface should help excellent inference methods built in single-purpose packages see wider use across the community.
-
-## Implementing Metropolis-Hastings without Turing
-
-[Metropolis-Hastings](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) is often the first sampling method that people are exposed to. It is a very straightforward algorithm and is accordingly the easiest to implement, so it makes for a good example. In this section, you will learn how to use the types and functions listed above to implement the Metropolis-Hastings sampler using the MCMC interface.
-
-The full code for this implementation is housed in [AdvancedMH.jl](https://github.com/TuringLang/AdvancedMH.jl).
-
-### Imports
-
-Let's begin by importing the relevant libraries. We'll import `AbstractMCMC`, which contains the interface framework we'll fill out. We also need `Distributions`, `Random`, and `LinearAlgebra` (the latter for the identity scaling `I` used below).
-
-```{julia}
-# Import the relevant libraries.
-using AbstractMCMC: AbstractMCMC
-using Distributions
-using LinearAlgebra
-using Random
-```
-
-An interface extension (like the one we're writing right now) typically requires that you overload or implement several functions. Specifically, you should `import` the functions you intend to overload; here, the qualified `using AbstractMCMC: AbstractMCMC` above lets us extend the interface functions by writing their qualified names, e.g. `AbstractMCMC.step!`.
-
-From `Distributions`, we need `Sampleable`, `VariateForm`, and `ValueSupport`, three abstract types that define a distribution. Models in the interface are assumed to be subtypes of `Sampleable{VariateForm, ValueSupport}`. In this section our model is going to be extremely simple, so we will not end up using these except to make sure that the inference functions are dispatching correctly.
-
-### Sampler
-
-Let's begin our sampler definition by defining a sampler called `MetropolisHastings`, which is a subtype of `AbstractSampler`.
Correct typing is very important for proper interface implementation -- if you are missing a subtype, your method may not be dispatched to when you call `sample`. - -```{julia} -# Define a sampler type. -struct MetropolisHastings{T,D} <: AbstractMCMC.AbstractSampler - init_θ::T - proposal::D -end - -# Default constructors. -MetropolisHastings(init_θ::Real) = MetropolisHastings(init_θ, Normal(0, 1)) -function MetropolisHastings(init_θ::Vector{<:Real}) - return MetropolisHastings(init_θ, MvNormal(zero(init_θ), I)) -end -``` - -Above, we have defined a sampler that stores the initial parameterization of the prior, and a distribution object from which proposals are drawn. You can have a struct that has no fields, and simply use it for dispatching onto the relevant functions, or you can store a large amount of state information in your sampler. - -The general intuition for what to store in your sampler struct is that anything you may need to perform inference between samples but you don't want to store in a transition should go into the sampler struct. It's the only way you can carry non-sample related state information between `step!` calls. - -### Model - -Next, we need to have a model of some kind. A model is a struct that's a subtype of `AbstractModel` that contains whatever information is necessary to perform inference on your problem. In our case we want to know the mean and variance parameters for a standard Normal distribution, so we can keep our model to the log density of a Normal. - -Note that we only have to do this because we are not yet integrating the sampler with Turing -- Turing has a very sophisticated modelling engine that removes the need to define custom model structs. - -```{julia} -# Define a model type. Stores the log density function. -struct DensityModel{F<:Function} <: AbstractMCMC.AbstractModel - ℓπ::F -end -``` - -### Transition - -The next step is to define some transition which we will return from each `step!` call. 
We'll keep it simple by just defining a wrapper struct that contains the parameter draws and the log density of that draw: - -```{julia} -# Create a very basic Transition type, only stores the -# parameter draws and the log probability of the draw. -struct Transition{T,L} - θ::T - lp::L -end - -# Store the new draw and its log density. -Transition(model::DensityModel, θ) = Transition(θ, ℓπ(model, θ)) -``` - -`Transition` can now store any type of parameter, whether it's a vector of draws from multiple parameters or a single univariate draw. - -### Metropolis-Hastings - -Now it's time to get into the actual inference. We've defined all of the core pieces we need, but we need to implement the `step!` function which actually performs inference. - -As a refresher, Metropolis-Hastings implements a very basic algorithm: - -1. Pick some initial state, ``\theta_0``. - -2. For ``t`` in ``[1,N],`` do - - + Generate a proposal parameterization ``\theta^\prime_t \sim q(\theta^\prime_t \mid \theta_{t-1}).`` - - + Calculate the acceptance probability, ``\alpha = \text{min}\left[1,\frac{\pi(\theta'_t)}{\pi(\theta_{t-1})} \frac{q(\theta_{t-1} \mid \theta'_t)}{q(\theta'_t \mid \theta_{t-1})}) \right].`` - - + If ``U \le \alpha`` where ``U \sim [0,1],`` then ``\theta_t = \theta'_t.`` Otherwise, ``\theta_t = \theta_{t-1}.`` - -Of course, it's much easier to do this in the log space, so the acceptance probability is more commonly written as - -```{.cell-bg} -\log \alpha = \min\left[0, \log \pi(\theta'_t) - \log \pi(\theta_{t-1}) + \log q(\theta_{t-1} \mid \theta^\prime_t) - \log q(\theta\prime_t \mid \theta_{t-1}) \right]. -``` - -In interface terms, we should do the following: - -1. Make a new transition containing a proposed sample. -2. Calculate the acceptance probability. -3. If we accept, return the new transition, otherwise, return the old one. - -### Steps - -The `step!` function is the function that performs the bulk of your inference. 
In our case, we will implement two `step!` functions -- one for the very first iteration, and one for every subsequent iteration. - -```{julia} -#| eval: false -# Define the first step! function, which is called at the -# beginning of sampling. Return the initial parameter used -# to define the sampler. -function AbstractMCMC.step!( - rng::AbstractRNG, - model::DensityModel, - spl::MetropolisHastings, - N::Integer, - ::Nothing; - kwargs..., -) - return Transition(model, spl.init_θ) -end -``` - -The first `step!` function just packages up the initial parameterization inside the sampler, and returns it. We implicitly accept the very first parameterization. - -The other `step!` function performs the usual steps from Metropolis-Hastings. Included are several helper functions, `proposal` and `q`, which are designed to replicate the functions in the pseudocode above. - -- `proposal` generates a new proposal in the form of a `Transition`, which can be univariate if the value passed in is univariate, or it can be multivariate if the `Transition` given is multivariate. Proposals use a basic `Normal` or `MvNormal` proposal distribution. -- `q` returns the log density of one parameterization conditional on another, according to the proposal distribution. -- `step!` generates a new proposal, checks the acceptance probability, and then returns either the previous transition or the proposed transition. - - -```{julia} -#| eval: false -# Define a function that makes a basic proposal depending on a univariate -# parameterization or a multivariate parameterization. 
-function propose(spl::MetropolisHastings, model::DensityModel, θ::Real) - return Transition(model, θ + rand(spl.proposal)) -end -function propose(spl::MetropolisHastings, model::DensityModel, θ::Vector{<:Real}) - return Transition(model, θ + rand(spl.proposal)) -end -function propose(spl::MetropolisHastings, model::DensityModel, t::Transition) - return propose(spl, model, t.θ) -end - -# Calculates the probability `q(θ|θcond)`, using the proposal distribution `spl.proposal`. -q(spl::MetropolisHastings, θ::Real, θcond::Real) = logpdf(spl.proposal, θ - θcond) -function q(spl::MetropolisHastings, θ::Vector{<:Real}, θcond::Vector{<:Real}) - return logpdf(spl.proposal, θ - θcond) -end -q(spl::MetropolisHastings, t1::Transition, t2::Transition) = q(spl, t1.θ, t2.θ) - -# Calculate the density of the model given some parameterization. -ℓπ(model::DensityModel, θ) = model.ℓπ(θ) -ℓπ(model::DensityModel, t::Transition) = t.lp - -# Define the other step function. Returns a Transition containing -# either a new proposal (if accepted) or the previous proposal -# (if not accepted). -function AbstractMCMC.step!( - rng::AbstractRNG, - model::DensityModel, - spl::MetropolisHastings, - ::Integer, - θ_prev::Transition; - kwargs..., -) - # Generate a new proposal. - θ = propose(spl, model, θ_prev) - - # Calculate the log acceptance probability. - α = ℓπ(model, θ) - ℓπ(model, θ_prev) + q(spl, θ_prev, θ) - q(spl, θ, θ_prev) - - # Decide whether to return the previous θ or the new one. - if log(rand(rng)) < min(α, 0.0) - return θ - else - return θ_prev - end -end -``` - -### Chains - -In the default implementation, `sample` just returns a vector of all transitions. If instead you would like to obtain a `Chains` object (e.g., to simplify downstream analysis), you have to implement the `bundle_samples` function as well. It accepts the vector of transitions and returns a collection of samples. 
Fortunately, our `Transition` is incredibly simple, and we only need to build a little bit of functionality to accept custom parameter names passed in by the user. - -```{julia} -#| eval: false -# A basic chains constructor that works with the Transition struct we defined. -function AbstractMCMC.bundle_samples( - rng::AbstractRNG, - ℓ::DensityModel, - s::MetropolisHastings, - N::Integer, - ts::Vector{<:Transition}, - chain_type::Type{Any}; - param_names=missing, - kwargs..., -) - # Turn all the transitions into a vector-of-vectors. - vals = copy(reduce(hcat, [vcat(t.θ, t.lp) for t in ts])') - - # Check if we received any parameter names. - if ismissing(param_names) - param_names = ["Parameter $i" for i in 1:(length(first(vals)) - 1)] - end - - # Add the log density field to the parameter names. - push!(param_names, "lp") - - # Bundle everything up and return a Chains struct. - return Chains(vals, param_names, (internals=["lp"],)) -end -``` - -All done! - -You can even implement different output formats by implementing `bundle_samples` for different `chain_type`s, which can be provided as keyword argument to `sample`. As default `sample` uses `chain_type = Any`. - -### Testing the implementation - -Now that we have all the pieces, we should test the implementation by defining a model to calculate the mean and variance parameters of a Normal distribution. We can do this by constructing a target density function, providing a sample of data, and then running the sampler with `sample`. - -```{julia} -#| eval: false -# Generate a set of data from the posterior we want to estimate. -data = rand(Normal(5, 3), 30) - -# Define the components of a basic model. -insupport(θ) = θ[2] >= 0 -dist(θ) = Normal(θ[1], θ[2]) -density(θ) = insupport(θ) ? sum(logpdf.(dist(θ), data)) : -Inf - -# Construct a DensityModel. -model = DensityModel(density) - -# Set up our sampler with initial parameters. -spl = MetropolisHastings([0.0, 0.0]) - -# Sample from the posterior. 
-chain = sample(model, spl, 100000; param_names=["μ", "σ"]) -``` - -If all the interface functions have been extended properly, you should get an output from `display(chain)` that looks something like this: - - -```{.cell-bg} -Object of type Chains, with data of type 100000×3×1 Array{Float64,3} - -Iterations = 1:100000 -Thinning interval = 1 -Chains = 1 -Samples per chain = 100000 -internals = lp -parameters = μ, σ - -2-element Array{ChainDataFrame,1} - -Summary Statistics - -│ Row │ parameters │ mean │ std │ naive_se │ mcse │ ess │ r_hat │ -│ │ Symbol │ Float64 │ Float64 │ Float64 │ Float64 │ Any │ Any │ -├─────┼────────────┼─────────┼──────────┼────────────┼────────────┼─────────┼─────────┤ -│ 1 │ μ │ 5.33157 │ 0.854193 │ 0.0027012 │ 0.00893069 │ 8344.75 │ 1.00009 │ -│ 2 │ σ │ 4.54992 │ 0.632916 │ 0.00200146 │ 0.00534942 │ 14260.8 │ 1.00005 │ - -Quantiles - -│ Row │ parameters │ 2.5% │ 25.0% │ 50.0% │ 75.0% │ 97.5% │ -│ │ Symbol │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ -├─────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤ -│ 1 │ μ │ 3.6595 │ 4.77754 │ 5.33182 │ 5.89509 │ 6.99651 │ -│ 2 │ σ │ 3.5097 │ 4.09732 │ 4.47805 │ 4.93094 │ 5.96821 │ -``` - -It looks like we're extremely close to our true parameters of `Normal(5,3)`, though with a fairly high variance due to the low sample size. - -## Conclusion - -We've seen how to implement the sampling interface for general projects. Turing's interface methods are ever-evolving, so please open an issue at [AbstractMCMC](https://github.com/TuringLang/AbstractMCMC.jl) with feature requests or problems. 
+--- +title: Interface Guide +engine: julia +aliases: + - ../../tutorials/docs-06-for-developers-interface/index.html +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +# The sampling interface + +Turing implements a sampling interface (hosted at [AbstractMCMC](https://github.com/TuringLang/AbstractMCMC.jl)) that is intended to provide a common framework for Markov chain Monte Carlo samplers. The interface presents several structures and functions that one needs to overload in order to implement an interface-compatible sampler. + +This guide will demonstrate how to implement the interface without Turing. + +## Interface overview + +Any implementation of an inference method that uses the AbstractMCMC interface should implement a subset of the following types and functions: + +1. A subtype of `AbstractSampler`, defined as a mutable struct containing state information or sampler parameters. +2. A function `sample_init!` which performs any necessary set-up (default: do not perform any set-up). +3. A function `step!` which returns a transition that represents a single draw from the sampler. +4. A function `transitions_init` which returns a container for the transitions obtained from the sampler (default: return a `Vector{T}` of length `N` where `T` is the type of the transition obtained in the first step and `N` is the number of requested samples). +5. A function `transitions_save!` which saves transitions to the container (default: save the transition of iteration `i` at position `i` in the vector of transitions). +6. A function `sample_end!` which handles any sampler wrap-up (default: do not perform any wrap-up). +7. A function `bundle_samples` which accepts the container of transitions and returns a collection of samples (default: return the vector of transitions). + +The interface methods with exclamation points are those that are intended to allow for state mutation. 
Any mutating function is meant to allow mutation where needed -- you might use: + +- `sample_init!` to run some kind of sampler preparation, before sampling begins. This could mutate a sampler's state. +- `step!` might mutate a sampler flag after each sample. +- `sample_end!` contains any wrap-up you might need to do. If you were sampling in a transformed space, this might be where you convert everything back to a constrained space. + +## Why do you have an interface? + +The motivation for the interface is to allow Julia's fantastic probabilistic programming language community to have a set of standards and common implementations so we can all thrive together. Markov chain Monte Carlo methods tend to have a very similar framework to one another, and so a common interface should help inference methods built in single-purpose packages see wider use across the community. + +## Implementing Metropolis-Hastings without Turing + +[Metropolis-Hastings](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) is often the first sampling method that people are exposed to. It is a very straightforward algorithm and is accordingly the easiest to implement, so it makes for a good example. In this section, you will learn how to use the types and functions listed above to implement the Metropolis-Hastings sampler using the MCMC interface. + +The full code for this implementation is housed in [AdvancedMH.jl](https://github.com/TuringLang/AdvancedMH.jl). + +### Imports + +Let's begin by importing the relevant libraries. We'll import `AbstractMCMC`, which contains the interface framework we'll fill out. We also need `Distributions`, `Random`, and `LinearAlgebra` (the latter provides the identity matrix `I` used in one of the constructors below). + +```{julia} +# Import the relevant libraries. +using AbstractMCMC: AbstractMCMC +using Distributions +using Random +using LinearAlgebra +``` + +An interface extension (like the one we're writing right now) typically requires that you overload or implement several functions.
Specifically, you should `import` the functions you intend to overload. This next code block accomplishes that. + +From `Distributions`, we need `Sampleable`, `VariateForm`, and `ValueSupport`, three abstract types that define a distribution. Models in the interface are assumed to be subtypes of `Sampleable{VariateForm, ValueSupport}`. In this section our model is going to be extremely simple, so we will not end up using these except to make sure that the inference functions are dispatching correctly. + +### Sampler + +Let's begin our sampler definition by defining a sampler called `MetropolisHastings`, which is a subtype of `AbstractSampler`. Correct typing is very important for proper interface implementation -- if you are missing a subtype, your method may not be dispatched to when you call `sample`. + +```{julia} +# Define a sampler type. +struct MetropolisHastings{T,D} <: AbstractMCMC.AbstractSampler + init_θ::T + proposal::D +end + +# Default constructors. +MetropolisHastings(init_θ::Real) = MetropolisHastings(init_θ, Normal(0, 1)) +function MetropolisHastings(init_θ::Vector{<:Real}) + return MetropolisHastings(init_θ, MvNormal(zero(init_θ), I)) +end +``` + +Above, we have defined a sampler that stores the initial parameterization of the prior, and a distribution object from which proposals are drawn. You can have a struct that has no fields, and simply use it for dispatching onto the relevant functions, or you can store a large amount of state information in your sampler. + +The general intuition for what to store in your sampler struct is that anything you may need to perform inference between samples but you don't want to store in a transition should go into the sampler struct. It's the only way you can carry non-sample related state information between `step!` calls. + +### Model + +Next, we need to have a model of some kind.
A model is a struct that's a subtype of `AbstractModel` that contains whatever information is necessary to perform inference on your problem. In our case, we want to estimate the mean and standard deviation of a Normal distribution, so our model only needs to store the log density of a Normal. + +Note that we only have to do this because we are not yet integrating the sampler with Turing -- Turing has a very sophisticated modelling engine that removes the need to define custom model structs. + +```{julia} +# Define a model type. Stores the log density function. +struct DensityModel{F<:Function} <: AbstractMCMC.AbstractModel + ℓπ::F +end +``` + +### Transition + +The next step is to define some transition which we will return from each `step!` call. We'll keep it simple by just defining a wrapper struct that contains the parameter draws and the log density of that draw: + +```{julia} +# Create a very basic Transition type, only stores the +# parameter draws and the log probability of the draw. +struct Transition{T,L} + θ::T + lp::L +end + +# Store the new draw and its log density. +Transition(model::DensityModel, θ) = Transition(θ, ℓπ(model, θ)) +``` + +`Transition` can now store any type of parameter, whether it's a vector of draws from multiple parameters or a single univariate draw. + +### Metropolis-Hastings + +Now it's time to get into the actual inference. We've defined all of the core pieces we need, but we need to implement the `step!` function which actually performs inference. + +As a refresher, Metropolis-Hastings implements a very basic algorithm: + +1. Pick some initial state, ``\theta_0``. + +2.
For ``t`` in ``[1,N],`` do + + + Generate a proposal parameterization ``\theta^\prime_t \sim q(\theta^\prime_t \mid \theta_{t-1}).`` + + + Calculate the acceptance probability, ``\alpha = \min\left[1,\frac{\pi(\theta'_t)}{\pi(\theta_{t-1})} \frac{q(\theta_{t-1} \mid \theta'_t)}{q(\theta'_t \mid \theta_{t-1})} \right].`` + + + If ``U \le \alpha`` where ``U \sim \text{Uniform}(0,1),`` then ``\theta_t = \theta'_t.`` Otherwise, ``\theta_t = \theta_{t-1}.`` + +Of course, it's much easier to do this in the log space, so the acceptance probability is more commonly written as + +```{.cell-bg} +\log \alpha = \min\left[0, \log \pi(\theta'_t) - \log \pi(\theta_{t-1}) + \log q(\theta_{t-1} \mid \theta'_t) - \log q(\theta'_t \mid \theta_{t-1}) \right]. +``` + +In interface terms, we should do the following: + +1. Make a new transition containing a proposed sample. +2. Calculate the acceptance probability. +3. If we accept, return the new transition, otherwise, return the old one. + +### Steps + +The `step!` function is the function that performs the bulk of your inference. In our case, we will implement two `step!` functions -- one for the very first iteration, and one for every subsequent iteration. + +```{julia} +#| eval: false +# Define the first step! function, which is called at the +# beginning of sampling. Return the initial parameter used +# to define the sampler. +function AbstractMCMC.step!( + rng::AbstractRNG, + model::DensityModel, + spl::MetropolisHastings, + N::Integer, + ::Nothing; + kwargs..., +) + return Transition(model, spl.init_θ) +end +``` + +The first `step!` function just packages up the initial parameterization inside the sampler, and returns it. We implicitly accept the very first parameterization. + +The other `step!` function performs the usual steps from Metropolis-Hastings. Included are several helper functions, `propose` and `q`, which are designed to replicate the functions in the pseudocode above.
+ +- `propose` generates a new proposal in the form of a `Transition`, which can be univariate if the value passed in is univariate, or it can be multivariate if the `Transition` given is multivariate. Proposals use a basic `Normal` or `MvNormal` proposal distribution. +- `q` returns the log density of one parameterization conditional on another, according to the proposal distribution. +- `step!` generates a new proposal, checks the acceptance probability, and then returns either the previous transition or the proposed transition. + + +```{julia} +#| eval: false +# Define a function that makes a basic proposal depending on a univariate +# parameterization or a multivariate parameterization. +function propose(spl::MetropolisHastings, model::DensityModel, θ::Real) + return Transition(model, θ + rand(spl.proposal)) +end +function propose(spl::MetropolisHastings, model::DensityModel, θ::Vector{<:Real}) + return Transition(model, θ + rand(spl.proposal)) +end +function propose(spl::MetropolisHastings, model::DensityModel, t::Transition) + return propose(spl, model, t.θ) +end + +# Calculates the log density `q(θ|θcond)` using the proposal distribution `spl.proposal`. +q(spl::MetropolisHastings, θ::Real, θcond::Real) = logpdf(spl.proposal, θ - θcond) +function q(spl::MetropolisHastings, θ::Vector{<:Real}, θcond::Vector{<:Real}) + return logpdf(spl.proposal, θ - θcond) +end +q(spl::MetropolisHastings, t1::Transition, t2::Transition) = q(spl, t1.θ, t2.θ) + +# Calculate the log density of the model given some parameterization. +ℓπ(model::DensityModel, θ) = model.ℓπ(θ) +ℓπ(model::DensityModel, t::Transition) = t.lp + +# Define the other step function. Returns a Transition containing +# either a new proposal (if accepted) or the previous proposal +# (if not accepted). +function AbstractMCMC.step!( + rng::AbstractRNG, + model::DensityModel, + spl::MetropolisHastings, + ::Integer, + θ_prev::Transition; + kwargs..., +) + # Generate a new proposal.
+ θ = propose(spl, model, θ_prev) + + # Calculate the log acceptance probability. + α = ℓπ(model, θ) - ℓπ(model, θ_prev) + q(spl, θ_prev, θ) - q(spl, θ, θ_prev) + + # Decide whether to return the previous θ or the new one. + if log(rand(rng)) < min(α, 0.0) + return θ + else + return θ_prev + end +end +``` + +### Chains + +In the default implementation, `sample` just returns a vector of all transitions. If instead you would like to obtain a `Chains` object (e.g., to simplify downstream analysis), you have to implement the `bundle_samples` function as well. It accepts the vector of transitions and returns a collection of samples. Fortunately, our `Transition` is incredibly simple, and we only need to build a little bit of functionality to accept custom parameter names passed in by the user. + +```{julia} +#| eval: false +# A basic chains constructor that works with the Transition struct we defined. +function AbstractMCMC.bundle_samples( + rng::AbstractRNG, + ℓ::DensityModel, + s::MetropolisHastings, + N::Integer, + ts::Vector{<:Transition}, + chain_type::Type{Any}; + param_names=missing, + kwargs..., +) + # Stack the transitions into a matrix: one row per iteration, + # parameter columns first and the log density last. + vals = copy(reduce(hcat, [vcat(t.θ, t.lp) for t in ts])') + + # Check if we received any parameter names. + if ismissing(param_names) + param_names = ["Parameter $i" for i in 1:(size(vals, 2) - 1)] + end + + # Add the log density field to the parameter names. + push!(param_names, "lp") + + # Bundle everything up and return a Chains struct. + return Chains(vals, param_names, (internals=["lp"],)) +end +``` + +All done! + +You can even implement different output formats by implementing `bundle_samples` for different `chain_type`s, which can be provided as a keyword argument to `sample`. By default, `sample` uses `chain_type = Any`.
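+ +As a sketch of what an alternative output format could look like (a hypothetical example: the `Matrix` chain type below is our own choice rather than something AbstractMCMC defines, and it reuses the `Transition` struct from above), you could add a method that dispatches on a different `chain_type`: + +```{julia} +#| eval: false +# Hypothetical: return a plain matrix of draws instead of a Chains object. +# A caller would request this format via `sample(model, spl, N; chain_type=Matrix)`. +function AbstractMCMC.bundle_samples( + rng::AbstractRNG, + ℓ::DensityModel, + s::MetropolisHastings, + N::Integer, + ts::Vector{<:Transition}, + chain_type::Type{Matrix}; + kwargs..., +) + # One row per iteration; parameter columns first, log density last. + return copy(reduce(hcat, [vcat(t.θ, t.lp) for t in ts])') +end +``` + +Because dispatch selects the method from the `chain_type` argument, both formats can coexist without touching the sampler itself.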
+ +### Testing the implementation + +Now that we have all the pieces, we should test the implementation by defining a model to calculate the mean and variance parameters of a Normal distribution. We can do this by constructing a target density function, providing a sample of data, and then running the sampler with `sample`. + +```{julia} +#| eval: false +# Generate a set of data from the posterior we want to estimate. +data = rand(Normal(5, 3), 30) + +# Define the components of a basic model. +insupport(θ) = θ[2] >= 0 +dist(θ) = Normal(θ[1], θ[2]) +density(θ) = insupport(θ) ? sum(logpdf.(dist(θ), data)) : -Inf + +# Construct a DensityModel. +model = DensityModel(density) + +# Set up our sampler with initial parameters. +spl = MetropolisHastings([0.0, 0.0]) + +# Sample from the posterior. +chain = sample(model, spl, 100000; param_names=["μ", "σ"]) +``` + +If all the interface functions have been extended properly, you should get an output from `display(chain)` that looks something like this: + + +```{.cell-bg} +Object of type Chains, with data of type 100000×3×1 Array{Float64,3} + +Iterations = 1:100000 +Thinning interval = 1 +Chains = 1 +Samples per chain = 100000 +internals = lp +parameters = μ, σ + +2-element Array{ChainDataFrame,1} + +Summary Statistics + +│ Row │ parameters │ mean │ std │ naive_se │ mcse │ ess │ r_hat │ +│ │ Symbol │ Float64 │ Float64 │ Float64 │ Float64 │ Any │ Any │ +├─────┼────────────┼─────────┼──────────┼────────────┼────────────┼─────────┼─────────┤ +│ 1 │ μ │ 5.33157 │ 0.854193 │ 0.0027012 │ 0.00893069 │ 8344.75 │ 1.00009 │ +│ 2 │ σ │ 4.54992 │ 0.632916 │ 0.00200146 │ 0.00534942 │ 14260.8 │ 1.00005 │ + +Quantiles + +│ Row │ parameters │ 2.5% │ 25.0% │ 50.0% │ 75.0% │ 97.5% │ +│ │ Symbol │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ +├─────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤ +│ 1 │ μ │ 3.6595 │ 4.77754 │ 5.33182 │ 5.89509 │ 6.99651 │ +│ 2 │ σ │ 3.5097 │ 4.09732 │ 4.47805 │ 4.93094 │ 5.96821 │ +``` + +It 
looks like we're extremely close to our true parameters of `Normal(5,3)`, though with a fairly high variance due to the low sample size. + +## Conclusion + +We've seen how to implement the sampling interface for general projects. Turing's interface methods are ever-evolving, so please open an issue at [AbstractMCMC](https://github.com/TuringLang/AbstractMCMC.jl) with feature requests or problems. diff --git a/developers/inference/abstractmcmc-turing/index.qmd b/developers/inference/abstractmcmc-turing/index.qmd index bf1bd1489..6d313f232 100755 --- a/developers/inference/abstractmcmc-turing/index.qmd +++ b/developers/inference/abstractmcmc-turing/index.qmd @@ -1,329 +1,329 @@ ---- -title: How Turing Implements AbstractMCMC -engine: julia -aliases: - - ../../tutorials/docs-04-for-developers-abstractmcmc-turing/index.html ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -Prerequisite: [Interface guide]({{}}). - -## Introduction - -Consider the following Turing code block: - -```{julia} -using Turing - -@model function gdemo(x, y) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - x ~ Normal(m, sqrt(s²)) - return y ~ Normal(m, sqrt(s²)) -end - -mod = gdemo(1.5, 2) -alg = IS() -n_samples = 1000 - -chn = sample(mod, alg, n_samples, progress=false) -``` - -The function `sample` is part of the AbstractMCMC interface. As explained in the [interface guide]({{}}), building a sampling method that can be used by `sample` consists of overloading the structs and functions in `AbstractMCMC`. The interface guide also gives a standalone example of their implementation, [`AdvancedMH.jl`](). - -Turing sampling methods (most of which are written [here](https://github.com/TuringLang/Turing.jl/tree/master/src/mcmc)) also implement `AbstractMCMC`. Turing defines a particular architecture for `AbstractMCMC` implementations that enables working with models defined by the `@model` macro, and uses DynamicPPL as a backend.
The goal of this page is to describe this architecture, and how you would go about implementing your own sampling method in Turing, using Importance Sampling as an example. I don't go into all the details: for instance, I don't address selectors or parallelism. - -First, we explain how Importance Sampling works in the abstract. Consider the model defined in the first code block. Mathematically, it can be written: - -$$ -\begin{align*} -s &\sim \text{InverseGamma}(2, 3), \\ -m &\sim \text{Normal}(0, \sqrt{s}), \\ -x &\sim \text{Normal}(m, \sqrt{s}), \\ -y &\sim \text{Normal}(m, \sqrt{s}). -\end{align*} -$$ - -The **latent** variables are $s$ and $m$, the **observed** variables are $x$ and $y$. The model **joint** distribution $p(s,m,x,y)$ decomposes into the **prior** $p(s,m)$ and the **likelihood** $p(x,y \mid s,m).$ Since $x = 1.5$ and $y = 2$ are observed, the goal is to infer the **posterior** distribution $p(s,m \mid x,y).$ - -Importance Sampling produces independent samples $(s_i, m_i)$ from the prior distribution. It also outputs unnormalized weights - -$$ -w_i = \frac {p(x,y,s_i,m_i)} {p(s_i, m_i)} = p(x,y \mid s_i, m_i) -$$ - -such that the empirical distribution - -$$ -\sum_{i=1}^N \frac {w_i} {\sum_{j=1}^N w_j} \delta_{(s_i, m_i)} -$$ - -is a good approximation of the posterior. - -## 1. Define a Sampler - -Recall the last line of the above code block: - -```{julia} -chn = sample(mod, alg, n_samples, progress=false) -``` - -Here `sample` takes as arguments a **model** `mod`, an **algorithm** `alg`, and a **number of samples** `n_samples`, and returns an instance `chn` of `Chains` which can be analysed using the functions in `MCMCChains`. - -### Models - -To define a **model**, you declare a joint distribution on variables in the `@model` macro, and specify which variables are observed and which should be inferred, as well as the value of the observed variables.
Thus, when implementing Importance Sampling, - -```{julia} -mod = gdemo(1.5, 2) -``` - -creates an instance `mod` of the struct `Model`, which corresponds to the observations of a value of `1.5` for `x`, and a value of `2` for `y`. - -This is all handled by DynamicPPL, more specifically [here](https://github.com/TuringLang/DynamicPPL.jl/blob/master/src/model.jl). I will return to how models are used to inform sampling algorithms [below](#assumeobserve). - -### Algorithms - -An **algorithm** is just a sampling method: in Turing, it is a subtype of the abstract type `InferenceAlgorithm`. Defining an algorithm may require specifying a few high-level parameters. For example, "Hamiltonian Monte-Carlo" may be too vague, but "Hamiltonian Monte Carlo with 10 leapfrog steps per proposal and a stepsize of 0.01" is an algorithm. "Metropolis-Hastings" may be too vague, but "Metropolis-Hastings with proposal distribution `p`" is an algorithm. -Thus - -```{julia} -stepsize = 0.01 -L = 10 -alg = HMC(stepsize, L) -``` - -defines a Hamiltonian Monte-Carlo algorithm, an instance of `HMC`, which is a subtype of `InferenceAlgorithm`. - -In the case of Importance Sampling, there is no need to specify additional parameters: - -```{julia} -alg = IS() -``` - -defines an Importance Sampling algorithm, an instance of `IS`, a subtype of `InferenceAlgorithm`. - -When creating your own Turing sampling method, you must, therefore, build a subtype of `InferenceAlgorithm` corresponding to your method. - -### Samplers - -Samplers are **not** the same as algorithms. An algorithm is a generic sampling method, a sampler is an object that stores information about how algorithm and model interact during sampling, and is modified as sampling progresses. The `Sampler` struct is defined in DynamicPPL. - -Turing implements `AbstractMCMC`'s `AbstractSampler` with the `Sampler` struct defined in `DynamicPPL`. 
The most important attributes of an instance `spl` of `Sampler` are: - -- `spl.alg`: the sampling method used, an instance of a subtype of `InferenceAlgorithm` -- `spl.state`: information about the sampling process, see [below](#states) - -When you call `sample(mod, alg, n_samples)`, Turing first uses `mod` and `alg` to build an instance `spl` of `Sampler`, then calls the native `AbstractMCMC` function `sample(mod, spl, n_samples)`. - -When you define your own Turing sampling method, you must therefore build: - -- a **sampler constructor** that uses a model and an algorithm to initialize an instance of `Sampler`. For Importance Sampling: - -```{julia} -#| eval: false -function Sampler(alg::IS, model::Model, s::Selector) - info = Dict{Symbol,Any}() - state = ISState(model) - return Sampler(alg, info, s, state) -end -``` - -- a **state** struct implementing `AbstractSamplerState` corresponding to your method: we cover this in the following paragraph. - -### States - -Sampler states are subtypes of `AbstractSamplerState` and contain a `VarInfo` field called `vi`. The `vi` field contains all the important information about sampling: first and foremost, the values of all the samples, but also the distributions from which they are sampled, the names of model parameters, and other metadata. As we will see below, many important steps during sampling correspond to queries or updates to `spl.state.vi`. - -By default, you can use `SamplerState`, a concrete type defined in `inference/Inference.jl`, which extends `AbstractSamplerState` and has no field except for `vi`: - -```{julia} -#| eval: false -mutable struct SamplerState{VIType<:VarInfo} <: AbstractSamplerState - vi::VIType -end -``` - -When doing Importance Sampling, we care not only about the values of the samples but also their weights. We will see below that the weight of each sample is also added to `spl.state.vi`.
Moreover, the average - -$$ -\frac 1 N \sum_{j=1}^N w_i = \frac 1 N \sum_{j=1}^N p(x,y \mid s_i, m_i) -$$ - -of the sample weights is a particularly important quantity: - -- it is used to **normalize** the **empirical approximation** of the posterior distribution -- its logarithm is the importance sampling **estimate** of the **log evidence** $\log p(x, y)$ - -To avoid having to compute it over and over again, `is.jl`defines an IS-specific concrete type `ISState` for sampler states, with an additional field `final_logevidence` containing - -$$ -\log \frac 1 N \sum_{j=1}^N w_i. -$$ - -```{julia} -#| eval: false -mutable struct ISState{V<:VarInfo,F<:AbstractFloat} <: AbstractSamplerState - vi::V - final_logevidence::F -end - -# additional constructor -ISState(model::Model) = ISState(VarInfo(model), 0.0) -``` - -The following diagram summarizes the hierarchy presented above. - -```{dot} -//| echo: false -digraph G { - node [shape=box]; - - spl [label=Sampler
<:AbstractSampler>, style=rounded, xlabel="", shape=box]; - state [label=State
<:AbstractSamplerState>, style=rounded, xlabel="", shape=box]; - alg [label=Algorithm
<:InferenceAlgorithm>, style=rounded, xlabel="", shape=box]; - vi [label=VarInfo
<:AbstractVarInfo>, style=rounded, xlabel="", shape=box]; - placeholder1 [label="...", width=1]; - placeholder2 [label="...", width=1]; - placeholder3 [label="...", width=1]; - placeholder4 [label="...", width=1]; - - spl -> state; - spl -> alg; - spl -> placeholder1; - - state -> vi; - state -> placeholder2; - - alg -> placeholder3; - placeholder1 -> placeholder4; -} -``` - -## 2. Overload the functions used inside mcmcsample - -A lot of the things here are method-specific. However, Turing also has some functions that make it easier for you to implement these functions, for example. - -### Transitions - -`AbstractMCMC` stores information corresponding to each individual sample in objects called `transition`, but does not specify what the structure of these objects could be. You could decide to implement a type `MyTransition` for transitions corresponding to the specifics of your methods. However, there are many situations in which the only information you need for each sample is: - -- its value: $\theta$ -- log of the joint probability of the observed data and this sample: `lp` - -`Inference.jl` [defines](https://github.com/TuringLang/Turing.jl/blob/master/src/inference/Inference.jl#L103) a struct `Transition`, which corresponds to this default situation - -```{julia} -#| eval: false -struct Transition{T,F<:AbstractFloat} - θ::T - lp::F -end -``` - -It also [contains](https://github.com/TuringLang/Turing.jl/blob/master/src/inference/Inference.jl#L108) a constructor that builds an instance of `Transition` from an instance `spl` of `Sampler`: $\theta$ is `spl.state.vi` converted to a `namedtuple`, and `lp` is `getlogp(spl.state.vi)`. `is.jl` uses this default constructor at the end of the `step!` function [here](https://github.com/TuringLang/Turing.jl/blob/master/src/inference/is.jl#L58). 
- -### How `sample` works - -A crude summary, which ignores things like parallelism, is the following: - -`sample` calls `mcmcsample`, which calls - -- `sample_init!` to set things up -- `step!` repeatedly to produce multiple new transitions -- `sample_end!` to perform operations once all samples have been obtained -- `bundle_samples` to convert a vector of transitions into a more palatable type, for instance a `Chain`. - -You can, of course, implement all of these functions, but `AbstractMCMC` as well as Turing, also provide default implementations for simple cases. For instance, importance sampling uses the default implementations of `sample_init!` and `bundle_samples`, which is why you don't see code for them inside `is.jl`. - -## 3. Overload assume and observe - -The functions mentioned above, such as `sample_init!`, `step!`, etc., must, of course, use information about the model in order to generate samples! In particular, these functions may need **samples from distributions** defined in the model or to **evaluate the density of these distributions** at some values of the corresponding parameters or observations. - -For an example of the former, consider **Importance Sampling** as defined in `is.jl`. This implementation of Importance Sampling uses the model prior distribution as a proposal distribution, and therefore requires **samples from the prior distribution** of the model. Another example is **Approximate Bayesian Computation**, which requires multiple **samples from the model prior and likelihood distributions** in order to generate a single sample. - -An example of the latter is the **Metropolis-Hastings** algorithm. 
At every step of sampling from a target posterior - -$$ -p(\theta \mid x_{\text{obs}}), -$$ - -in order to compute the acceptance ratio, you need to **evaluate the model joint density** - -$$ -p\left(\theta_{\text{prop}}, x_{\text{obs}}\right) -$$ - -with $\theta_{\text{prop}}$ a sample from the proposal and $x_{\text{obs}}$ the observed data. - -This begs the question: how can these functions access model information during sampling? Recall that the model is stored as an instance `m` of `Model`. One of the attributes of `m` is the model evaluation function `m.f`, which is built by compiling the `@model` macro. Executing `f` runs the tilde statements of the model in order, and adds model information to the sampler (the instance of `Sampler` that stores information about the ongoing sampling process) at each step (see [here](https://turinglang.org/dev/docs/for-developers/compiler) for more information about how the `@model` macro is compiled). The DynamicPPL functions `assume` and `observe` determine what kind of information to add to the sampler for every tilde statement. - -Consider an instance `m` of `Model` and a sampler `spl`, with associated `VarInfo` `vi = spl.state.vi`. At some point during the sampling process, an AbstractMCMC function such as `step!` calls `m(vi, ...)`, which calls the model evaluation function `m.f(vi, ...)`. - - - for every tilde statement in the `@model` macro, `m.f(vi, ...)` returns model-related information (samples, value of the model density, etc.), and adds it to `vi`. How does it do that? 
- - + recall that the code for `m.f(vi, ...)` is automatically generated by compilation of the `@model` macro - - + for every tilde statement in the `@model` declaration, this code contains a call to `assume(vi, ...)` if the variable on the LHS of the tilde is a **model parameter to infer**, and `observe(vi, ...)` if the variable on the LHS of the tilde is an **observation** - - + in the file corresponding to your sampling method (ie in `Turing.jl/src/inference/.jl`), you have **overloaded** `assume` and `observe`, so that they can modify `vi` to include the information and samples that you care about! - - + at a minimum, `assume` and `observe` return the log density `lp` of the sample or observation. the model evaluation function then immediately calls `acclogp!!(vi, lp)`, which adds `lp` to the value of the log joint density stored in `vi`. - -Here's what `assume` looks like for Importance Sampling: - -```{julia} -#| eval: false -function DynamicPPL.assume(rng, spl::Sampler{<:IS}, dist::Distribution, vn::VarName, vi) - r = rand(rng, dist) - push!(vi, vn, r, dist, spl) - return r, 0 -end -``` - -The function first generates a sample `r` from the distribution `dist` (the right hand side of the tilde statement). It then adds `r` to `vi`, and returns `r` and 0. - -The `observe` function is even simpler: - -```{julia} -#| eval: false -function DynamicPPL.observe(spl::Sampler{<:IS}, dist::Distribution, value, vi) - return logpdf(dist, value) -end -``` - -It simply returns the density (in the discrete case, the probability) of the observed value under the distribution `dist`. - -## 4. Summary: Importance Sampling step by step - -We focus on the AbstractMCMC functions that are overridden in `is.jl` and executed inside `mcmcsample`: `step!`, which is called `n_samples` times, and `sample_end!`, which is executed once after those `n_samples` iterations. 
- - During the $i$-th iteration, `step!` does 3 things:
-
-   + `empty!!(spl.state.vi)`: remove information about the previous sample from the sampler's `VarInfo`
-
-   + `model(rng, spl.state.vi, spl)`: call the model evaluation function
-
-     * calls to `assume` add the samples from the prior $s_i$ and $m_i$ to `spl.state.vi`
-
-     * calls to `assume` or `observe` are followed by the line `acclogp!!(vi, lp)`, where `lp` is an output of `assume` and `observe`
-
-     * `lp` is set to 0 after `assume`, and to the value of the density at the observation after `observe`
-
-     * When all the tilde statements have been covered, `spl.state.vi.logp[]` is the sum of the `lp`, i.e., the likelihood $\log p(x, y \mid s_i, m_i) = \log p(x \mid s_i, m_i) + \log p(y \mid s_i, m_i)$ of the observations given the latent variable samples $s_i$ and $m_i$.
-
-   + `return Transition(spl)`: build a transition from the sampler, and return that transition
-
-     * the transition's `vi` field is simply `spl.state.vi`
-
-     * the `lp` field contains the likelihood `spl.state.vi.logp[]`
-
- - When the `n_samples` iterations are completed, `sample_end!` fills the `final_logevidence` field of `spl.state`
-
-   + It simply takes the logarithm of the average of the sample weights, using the log weights for numerical stability
+---
+title: How Turing Implements AbstractMCMC
+engine: julia
+aliases:
+  - ../../tutorials/docs-04-for-developers-abstractmcmc-turing/index.html
+---
+
+```{julia}
+#| echo: false
+#| output: false
+using Pkg;
+Pkg.instantiate();
+```
+
+Prerequisite: [Interface guide]({{}}).
+
+## Introduction
+
+Consider the following Turing code block:
+
+```{julia}
+using Turing
+
+@model function gdemo(x, y)
+    s² ~ InverseGamma(2, 3)
+    m ~ Normal(0, sqrt(s²))
+    x ~ Normal(m, sqrt(s²))
+    return y ~ Normal(m, sqrt(s²))
+end
+
+mod = gdemo(1.5, 2)
+alg = IS()
+n_samples = 1000
+
+chn = sample(mod, alg, n_samples, progress=false)
+```
+
+The function `sample` is part of the AbstractMCMC interface.
As explained in the [interface guide]({{}}), building a sampling method that can be used by `sample` consists in overloading the structs and functions in `AbstractMCMC`. The interface guide also gives a standalone example of their implementation, [`AdvancedMH.jl`]().
+
+Turing sampling methods (most of which are written [here](https://github.com/TuringLang/Turing.jl/tree/master/src/mcmc)) also implement `AbstractMCMC`. Turing defines a particular architecture for `AbstractMCMC` implementations that enables working with models defined by the `@model` macro, and uses DynamicPPL as a backend. The goal of this page is to describe this architecture, and how you would go about implementing your own sampling method in Turing, using Importance Sampling as an example. I don't go into all the details: for instance, I don't address selectors or parallelism.
+
+First, we explain how Importance Sampling works in the abstract. Consider the model defined in the first code block. Mathematically, it can be written:
+
+$$
+\begin{align*}
+s &\sim \text{InverseGamma}(2, 3), \\
+m &\sim \text{Normal}(0, \sqrt{s}), \\
+x &\sim \text{Normal}(m, \sqrt{s}), \\
+y &\sim \text{Normal}(m, \sqrt{s}).
+\end{align*}
+$$
+
+The **latent** variables are $s$ and $m$, the **observed** variables are $x$ and $y$. The model **joint** distribution $p(s,m,x,y)$ decomposes into the **prior** $p(s,m)$ and the **likelihood** $p(x,y \mid s,m).$ Since $x = 1.5$ and $y = 2$ are observed, the goal is to infer the **posterior** distribution $p(s,m \mid x,y).$
+
+Importance Sampling produces independent samples $(s_i, m_i)$ from the prior distribution. It also outputs unnormalized weights
+
+$$
+w_i = \frac {p(x,y,s_i,m_i)} {p(s_i, m_i)} = p(x,y \mid s_i, m_i)
+$$
+
+such that the empirical distribution
+
+$$
+\sum_{i = 1}^N \frac {w_i} {\sum_{j=1}^N w_j} \delta_{(s_i, m_i)}
+$$
+
+is a good approximation of the posterior.
+
+## 1.
Define a Sampler + +Recall the last line of the above code block: + +```{julia} +chn = sample(mod, alg, n_samples, progress=false) +``` + +Here `sample` takes as arguments a **model** `mod`, an **algorithm** `alg`, and a **number of samples** `n_samples`, and returns an instance `chn` of `Chains` which can be analysed using the functions in `MCMCChains`. + +### Models + +To define a **model**, you declare a joint distribution on variables in the `@model` macro, and specify which variables are observed and which should be inferred, as well as the value of the observed variables. Thus, when implementing Importance Sampling, + +```{julia} +mod = gdemo(1.5, 2) +``` + +creates an instance `mod` of the struct `Model`, which corresponds to the observations of a value of `1.5` for `x`, and a value of `2` for `y`. + +This is all handled by DynamicPPL, more specifically [here](https://github.com/TuringLang/DynamicPPL.jl/blob/master/src/model.jl). I will return to how models are used to inform sampling algorithms [below](#assumeobserve). + +### Algorithms + +An **algorithm** is just a sampling method: in Turing, it is a subtype of the abstract type `InferenceAlgorithm`. Defining an algorithm may require specifying a few high-level parameters. For example, "Hamiltonian Monte-Carlo" may be too vague, but "Hamiltonian Monte Carlo with 10 leapfrog steps per proposal and a stepsize of 0.01" is an algorithm. "Metropolis-Hastings" may be too vague, but "Metropolis-Hastings with proposal distribution `p`" is an algorithm. +Thus + +```{julia} +stepsize = 0.01 +L = 10 +alg = HMC(stepsize, L) +``` + +defines a Hamiltonian Monte-Carlo algorithm, an instance of `HMC`, which is a subtype of `InferenceAlgorithm`. + +In the case of Importance Sampling, there is no need to specify additional parameters: + +```{julia} +alg = IS() +``` + +defines an Importance Sampling algorithm, an instance of `IS`, a subtype of `InferenceAlgorithm`. 
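+
+To make the algorithm concrete before diving into Turing's architecture, here is a minimal sketch of the computation that `IS()` will eventually perform for `gdemo`, written directly against the distributions that Turing re-exports. This is illustrative only, not Turing's actual implementation, and the variable names (`logw`, `x_obs`, etc.) are made up: sample from the prior, weight each draw by the likelihood of the observations, and average the weights.
+
+```{julia}
+using Turing  # for the re-exported distributions; already loaded above
+
+x_obs, y_obs = 1.5, 2.0
+N = 1_000
+logw = map(1:N) do _
+    s = rand(InverseGamma(2, 3))         # draw (s, m) from the prior
+    m = rand(Normal(0, sqrt(s)))
+    logpdf(Normal(m, sqrt(s)), x_obs) +  # log weight = log-likelihood of the observations
+        logpdf(Normal(m, sqrt(s)), y_obs)
+end
+log(sum(exp.(logw)) / N)  # log of the average weight: an estimate of the log evidence
+```
+
+Running `sample(mod, IS(), N)` performs essentially this computation, while also recording the prior draws in a `Chains` object.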
+
+When creating your own Turing sampling method, you must, therefore, build a subtype of `InferenceAlgorithm` corresponding to your method.
+
+### Samplers
+
+Samplers are **not** the same as algorithms. An algorithm is a generic sampling method; a sampler is an object that stores information about how the algorithm and the model interact during sampling, and is modified as sampling progresses. Turing implements `AbstractMCMC`'s `AbstractSampler` with the `Sampler` struct defined in `DynamicPPL`. The most important attributes of an instance `spl` of `Sampler` are:
+
+- `spl.alg`: the sampling method used, an instance of a subtype of `InferenceAlgorithm`
+- `spl.state`: information about the sampling process, see [below](#states)
+
+When you call `sample(mod, alg, n_samples)`, Turing first uses `mod` and `alg` to build an instance `spl` of `Sampler`, then calls the native `AbstractMCMC` function `sample(mod, spl, n_samples)`.
+
+When you define your own Turing sampling method, you must therefore build:
+
+- a **sampler constructor** that uses a model and an algorithm to initialize an instance of `Sampler`. For Importance Sampling:
+
+```{julia}
+#| eval: false
+function Sampler(alg::IS, model::Model, s::Selector)
+    info = Dict{Symbol,Any}()
+    state = ISState(model)
+    return Sampler(alg, info, s, state)
+end
+```
+
+- a **state** struct implementing `AbstractSamplerState` corresponding to your method: we cover this in the following paragraph.
+
+### States
+
+A sampler's state is stored in `spl.state`; its `vi` field contains all the important information about sampling: first and foremost, the values of all the samples, but also the distributions from which they are sampled, the names of model parameters, and other metadata. As we will see below, many important steps during sampling correspond to queries or updates to `spl.state.vi`.
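+
+Putting the requirements above together, a hypothetical method `MyAlg` (every name here is made up for illustration; only the `Sampler` and `Selector` signatures follow the Importance Sampling example above) would pair an algorithm subtype with a state struct and a sampler constructor, roughly like this:
+
+```{julia}
+#| eval: false
+# Hypothetical sketch, not an actual Turing method.
+struct MyAlg <: InferenceAlgorithm end  # an algorithm with no tuning parameters
+
+# a state holding only the `vi` field described above
+mutable struct MyAlgState{V<:VarInfo} <: AbstractSamplerState
+    vi::V
+end
+
+function Sampler(alg::MyAlg, model::Model, s::Selector)
+    info = Dict{Symbol,Any}()
+    return Sampler(alg, info, s, MyAlgState(VarInfo(model)))
+end
+```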
+
+By default, you can use `SamplerState`, a concrete type defined in `inference/Inference.jl`, which extends `AbstractSamplerState` and has no field except for `vi`:
+
+```{julia}
+#| eval: false
+mutable struct SamplerState{VIType<:VarInfo} <: AbstractSamplerState
+    vi::VIType
+end
+```
+
+When doing Importance Sampling, we care not only about the values of the samples but also their weights. We will see below that the weight of each sample is also added to `spl.state.vi`. Moreover, the average
+
+$$
+\frac 1 N \sum_{i=1}^N w_i = \frac 1 N \sum_{i=1}^N p(x,y \mid s_i, m_i)
+$$
+
+of the sample weights is a particularly important quantity:
+
+- it is used to **normalize** the **empirical approximation** of the posterior distribution
+- its logarithm is the importance sampling **estimate** of the **log evidence** $\log p(x, y)$
+
+To avoid having to compute it over and over again, `is.jl` defines an IS-specific concrete type `ISState` for sampler states, with an additional field `final_logevidence` containing
+
+$$
+\log \frac 1 N \sum_{i=1}^N w_i.
+$$
+
+```{julia}
+#| eval: false
+mutable struct ISState{V<:VarInfo,F<:AbstractFloat} <: AbstractSamplerState
+    vi::V
+    final_logevidence::F
+end
+
+# additional constructor
+ISState(model::Model) = ISState(VarInfo(model), 0.0)
+```
+
+The following diagram summarizes the hierarchy presented above.
+
+```{dot}
+//| echo: false
+digraph G {
+    node [shape=box];
+
+    spl [label=Sampler
<:AbstractSampler>, style=rounded, xlabel="", shape=box]; + state [label=State
<:AbstractSamplerState>, style=rounded, xlabel="", shape=box]; + alg [label=Algorithm
<:InferenceAlgorithm>, style=rounded, xlabel="", shape=box]; + vi [label=VarInfo
<:AbstractVarInfo>, style=rounded, xlabel="", shape=box];
+    placeholder1 [label="...", width=1];
+    placeholder2 [label="...", width=1];
+    placeholder3 [label="...", width=1];
+    placeholder4 [label="...", width=1];
+
+    spl -> state;
+    spl -> alg;
+    spl -> placeholder1;
+
+    state -> vi;
+    state -> placeholder2;
+
+    alg -> placeholder3;
+    placeholder1 -> placeholder4;
+}
+```
+
+## 2. Overload the functions used inside mcmcsample
+
+A lot of what happens here is method-specific. However, `AbstractMCMC` and Turing also provide helper functions and default implementations that make these functions easier to write.
+
+### Transitions
+
+`AbstractMCMC` stores information corresponding to each individual sample in objects called `transition`, but does not specify what the structure of these objects could be. You could decide to implement a type `MyTransition` for transitions corresponding to the specifics of your methods. However, there are many situations in which the only information you need for each sample is:
+
+- its value: $\theta$
+- log of the joint probability of the observed data and this sample: `lp`
+
+`Inference.jl` [defines](https://github.com/TuringLang/Turing.jl/blob/master/src/inference/Inference.jl#L103) a struct `Transition`, which corresponds to this default situation:
+
+```{julia}
+#| eval: false
+struct Transition{T,F<:AbstractFloat}
+    θ::T
+    lp::F
+end
+```
+
+It also [contains](https://github.com/TuringLang/Turing.jl/blob/master/src/inference/Inference.jl#L108) a constructor that builds an instance of `Transition` from an instance `spl` of `Sampler`: $\theta$ is `spl.state.vi` converted to a `namedtuple`, and `lp` is `getlogp(spl.state.vi)`. `is.jl` uses this default constructor at the end of the `step!` function [here](https://github.com/TuringLang/Turing.jl/blob/master/src/inference/is.jl#L58).
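+
+The text above mentions defining your own `MyTransition`; as a toy, self-contained illustration (independent of Turing's actual internals, with made-up values), such a container is nothing more than a sample paired with its log joint density:
+
+```{julia}
+# Stand-alone mock-up of a transition container: a sample θ and its log density lp.
+struct MyTransition{T,F<:AbstractFloat}
+    θ::T    # the sample, e.g. a NamedTuple of parameter values
+    lp::F   # log joint density of the data and this sample
+end
+
+t = MyTransition((s = 1.1, m = 0.3), -2.7)
+t.θ.m, t.lp
+```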
+
+### How `sample` works
+
+A crude summary, which ignores things like parallelism, is the following:
+
+`sample` calls `mcmcsample`, which calls
+
+- `sample_init!` to set things up
+- `step!` repeatedly to produce multiple new transitions
+- `sample_end!` to perform operations once all samples have been obtained
+- `bundle_samples` to convert a vector of transitions into a more palatable type, for instance a `Chain`.
+
+You can, of course, implement all of these functions, but `AbstractMCMC` and Turing also provide default implementations for simple cases. For instance, importance sampling uses the default implementations of `sample_init!` and `bundle_samples`, which is why you don't see code for them inside `is.jl`.
+
+## 3. Overload assume and observe
+
+The functions mentioned above, such as `sample_init!`, `step!`, etc., must, of course, use information about the model in order to generate samples! In particular, these functions may need **samples from distributions** defined in the model or to **evaluate the density of these distributions** at some values of the corresponding parameters or observations.
+
+For an example of the former, consider **Importance Sampling** as defined in `is.jl`. This implementation of Importance Sampling uses the model prior distribution as a proposal distribution, and therefore requires **samples from the prior distribution** of the model. Another example is **Approximate Bayesian Computation**, which requires multiple **samples from the model prior and likelihood distributions** in order to generate a single sample.
+
+An example of the latter is the **Metropolis-Hastings** algorithm.
At every step of sampling from a target posterior + +$$ +p(\theta \mid x_{\text{obs}}), +$$ + +in order to compute the acceptance ratio, you need to **evaluate the model joint density** + +$$ +p\left(\theta_{\text{prop}}, x_{\text{obs}}\right) +$$ + +with $\theta_{\text{prop}}$ a sample from the proposal and $x_{\text{obs}}$ the observed data. + +This begs the question: how can these functions access model information during sampling? Recall that the model is stored as an instance `m` of `Model`. One of the attributes of `m` is the model evaluation function `m.f`, which is built by compiling the `@model` macro. Executing `f` runs the tilde statements of the model in order, and adds model information to the sampler (the instance of `Sampler` that stores information about the ongoing sampling process) at each step (see [here](https://turinglang.org/dev/docs/for-developers/compiler) for more information about how the `@model` macro is compiled). The DynamicPPL functions `assume` and `observe` determine what kind of information to add to the sampler for every tilde statement. + +Consider an instance `m` of `Model` and a sampler `spl`, with associated `VarInfo` `vi = spl.state.vi`. At some point during the sampling process, an AbstractMCMC function such as `step!` calls `m(vi, ...)`, which calls the model evaluation function `m.f(vi, ...)`. + + - for every tilde statement in the `@model` macro, `m.f(vi, ...)` returns model-related information (samples, value of the model density, etc.), and adds it to `vi`. How does it do that? 
+
+    + recall that the code for `m.f(vi, ...)` is automatically generated by compilation of the `@model` macro
+
+    + for every tilde statement in the `@model` declaration, this code contains a call to `assume(vi, ...)` if the variable on the LHS of the tilde is a **model parameter to infer**, and `observe(vi, ...)` if the variable on the LHS of the tilde is an **observation**
+
+    + in the file corresponding to your sampling method (i.e. in `Turing.jl/src/inference/.jl`), you have **overloaded** `assume` and `observe`, so that they can modify `vi` to include the information and samples that you care about!
+
+    + at a minimum, `assume` and `observe` return the log density `lp` of the sample or observation. The model evaluation function then immediately calls `acclogp!!(vi, lp)`, which adds `lp` to the value of the log joint density stored in `vi`.
+
+Here's what `assume` looks like for Importance Sampling:
+
+```{julia}
+#| eval: false
+function DynamicPPL.assume(rng, spl::Sampler{<:IS}, dist::Distribution, vn::VarName, vi)
+    r = rand(rng, dist)
+    push!(vi, vn, r, dist, spl)
+    return r, 0
+end
+```
+
+The function first generates a sample `r` from the distribution `dist` (the right-hand side of the tilde statement). It then adds `r` to `vi`, and returns `r` and 0.
+
+The `observe` function is even simpler:
+
+```{julia}
+#| eval: false
+function DynamicPPL.observe(spl::Sampler{<:IS}, dist::Distribution, value, vi)
+    return logpdf(dist, value)
+end
+```
+
+It simply returns the density (in the discrete case, the probability) of the observed value under the distribution `dist`.
+
+## 4. Summary: Importance Sampling step by step
+
+We focus on the AbstractMCMC functions that are overridden in `is.jl` and executed inside `mcmcsample`: `step!`, which is called `n_samples` times, and `sample_end!`, which is executed once after those `n_samples` iterations.
+ + - During the $i$-th iteration, `step!` does 3 things: + + + `empty!!(spl.state.vi)`: remove information about the previous sample from the sampler's `VarInfo` + + + `model(rng, spl.state.vi, spl)`: call the model evaluation function + + * calls to `assume` add the samples from the prior $s_i$ and $m_i$ to `spl.state.vi` + + * calls to `assume` or `observe` are followed by the line `acclogp!!(vi, lp)`, where `lp` is an output of `assume` and `observe` + + * `lp` is set to 0 after `assume`, and to the value of the density at the observation after `observe` + + * When all the tilde statements have been covered, `spl.state.vi.logp[]` is the sum of the `lp`, i.e., the likelihood $\log p(x, y \mid s_i, m_i) = \log p(x \mid s_i, m_i) + \log p(y \mid s_i, m_i)$ of the observations given the latent variable samples $s_i$ and $m_i$. + + + `return Transition(spl)`: build a transition from the sampler, and return that transition + + * the transition's `vi` field is simply `spl.state.vi` + + * the `lp` field contains the likelihood `spl.state.vi.logp[]` + + - When the `n_samples` iterations are completed, `sample_end!` fills the `final_logevidence` field of `spl.state` + + + It simply takes the logarithm of the average of the sample weights, using the log weights for numerical stability diff --git a/developers/inference/variational-inference/index.qmd b/developers/inference/variational-inference/index.qmd index d424b5ecc..0965b07c7 100755 --- a/developers/inference/variational-inference/index.qmd +++ b/developers/inference/variational-inference/index.qmd @@ -1,385 +1,385 @@ ---- -title: Variational Inference -engine: julia -aliases: - - ../../tutorials/docs-07-for-developers-variational-inference/index.html ---- - -# Overview - -In this post, we'll examine variational inference (VI), a family of approximate Bayesian inference methods. We will focus on one of the more standard VI methods, Automatic Differentiation Variational Inference (ADVI). 
-
-Here, we'll examine the theory behind VI, but if you're interested in using ADVI in Turing, [check out this tutorial]({{}}).
-
-# Motivation
-
-In Bayesian inference, one usually specifies a model as follows: given data $\\{x_i\\}_{i = 1}^n$,
-
-::: {.column-page}
-$$
-\begin{align*}
-  \text{prior:} \quad z &\sim p(z) \\
-  \text{likelihood:} \quad x_i &\overset{\text{i.i.d.}}{\sim} p(x \mid z) \quad \text{where} \quad i = 1, \dots, n
-\end{align*}
-$$
-:::
-
-where $\overset{\text{i.i.d.}}{\sim}$ denotes that the samples are independent and identically distributed. Our goal in Bayesian inference is then to find the _posterior_
-
-::: {.column-page}
-$$
-p(z \mid \\{ x\_i \\}\_{i = 1}^n) \propto p(z) \prod\_{i=1}^{n} p(x\_i \mid z).
-$$
-:::
-
-In general, one cannot obtain a closed form expression for $p(z \mid \\{ x\_i \\}\_{i = 1}^n)$, but one might still be able to _sample_ from $p(z \mid \\{ x\_i \\}\_{i = 1}^n)$ with guarantees of converging to the target posterior $p(z \mid \\{ x\_i \\}\_{i = 1}^n)$ as the number of samples goes to $\infty$, e.g. MCMC.
-
-As you are hopefully already aware, Turing.jl provides many methods with asymptotic exactness guarantees that we can apply to such a problem!
-
-Unfortunately, these unbiased samplers can be prohibitively expensive to run. As the model $p$ increases in complexity, the convergence of these unbiased samplers can slow down dramatically. Still, in the _infinite_ limit, these methods should converge to the true posterior! But infinity is fairly large, like, _at least_ more than 12, so this might take a while.
-
-In such a case, it might be desirable to sacrifice some of these asymptotic guarantees and instead _approximate_ the posterior $p(z \mid \\{ x_i \\}_{i = 1}^n)$ using some other model which we'll denote $q(z)$.
-
-There are multiple approaches to take in this case, one of which is **variational inference (VI)**.
-
-# Variational Inference (VI)
-
-In VI, we're looking to approximate $p(z \mid \\{ x_i \\}_{i = 1}^n )$ using some _approximate_ or _variational_ posterior $q(z)$.
-
-To approximate something you need a notion of what "close" means. In the context of probability densities, a standard such "measure" of closeness is the _Kullback-Leibler (KL) divergence_, though this is far from the only one. The KL-divergence is defined between two densities $q(z)$ and $p(z \mid \\{ x_i \\}_{i = 1}^n)$ as
-
-::: {.column-page}
-$$
-\begin{align*}
-  \mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) &= \int \log \left( \frac{q(z)}{\prod\_{i = 1}^n p(z \mid x\_i)} \right) q(z) \mathrm{d}{z} \\\\
-  &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) - \sum\_{i = 1}^n \log p(z \mid x\_i) \right] \\\\
-  &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] - \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(z \mid x\_i) \right].
-\end{align*}
-$$
-:::
-
-It's worth noting that unfortunately the KL-divergence is _not_ a metric/distance in the analysis sense, due to its lack of symmetry. On the other hand, it turns out that minimizing the KL-divergence is actually equivalent to maximizing the log-likelihood! Also, under reasonable restrictions on the densities at hand,
-
-::: {.column-page}
-$$
-\mathrm{D\_{KL}}\left(q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) = 0 \quad \iff \quad q(z) = p(z \mid \\{ x\_i \\}\_{i = 1}^n), \quad \forall z.
-$$
-:::
-
-Therefore one could (and we will) attempt to approximate $p(z \mid \\{ x_i \\}_{i = 1}^n)$ using a density $q(z)$ by minimizing the KL-divergence between these two!
-
-One can also show that $\mathrm{D_{KL}} \ge 0$, which we'll need later. Finally notice that the KL-divergence is only well-defined when in fact $q(z)$ is zero everywhere $p(z \mid \\{ x_i \\}_{i = 1}^n)$ is zero, i.e.
-
-::: {.column-page}
-$$
-\mathrm{supp}\left(q(z)\right) \subseteq \mathrm{supp}\left(p(z \mid x)\right).
-$$
-:::
-
-Otherwise, there might be a point $z_0 \sim q(z)$ such that $p(z_0 \mid \\{ x_i \\}_{i = 1}^n) = 0$, resulting in $\log\left(\frac{q(z_0)}{0}\right)$ which doesn't make sense!
-
-One major problem: as we can see in the definition of the KL-divergence, we need $p(z \mid \\{ x_i \\}_{i = 1}^n)$ for any $z$ if we want to compute the KL-divergence between this and $q(z)$. We don't have that. The entire reason we even do Bayesian inference is that we don't know the posterior! Clearly this isn't going to work. _Or is it?!_
-
-## Computing KL-divergence without knowing the posterior
-
-First off, recall that
-
-::: {.column-page}
-$$
-p(z \mid x\_i) = \frac{p(x\_i, z)}{p(x\_i)}
-$$
-:::
-
-so we can write
-
-::: {.column-page}
-$$
-\begin{align*}
-\mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] - \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) - \log p(x\_i) \right] \\
-  &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] - \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] + \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x_i) \right] \\
-  &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] - \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] + \sum\_{i = 1}^n \log p(x\_i),
-\end{align*}
-$$
-:::
-
-where in the last equality we used the fact that $p(x_i)$ is independent of $z$.
-
-Now you're probably thinking "Oh great! Now you've introduced $p(x_i)$ which we _also_ can't compute (in general)!". Woah. Calm down human. Let's do some more algebra.
The above expression can be rearranged to
-
-::: {.column-page}
-$$
-\mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) + \underbrace{\sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] - \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right]}\_{=: \mathrm{ELBO}(q)} = \underbrace{\sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i) \right]}\_{\text{constant}}.
-$$
-:::
-
-See? The right-hand side is _constant_ and, as we mentioned before, $\mathrm{D_{KL}} \ge 0$. What happens if we try to _maximize_ the term we just gave the completely arbitrary name $\mathrm{ELBO}$? Well, if $\mathrm{ELBO}$ goes up while $p(x_i)$ stays constant then $\mathrm{D_{KL}}$ _has to_ go down! That is, the $q(z)$ which _minimizes_ the KL-divergence is the same $q(z)$ which _maximizes_ $\mathrm{ELBO}(q)$:
-
-::: {.column-page}
-$$
-\underset{q}{\mathrm{argmin}} \ \mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) = \underset{q}{\mathrm{argmax}} \ \mathrm{ELBO}(q)
-$$
-:::
-
-where
-
-::: {.column-page}
-$$
-\begin{align*}
-\mathrm{ELBO}(q) &:= \left( \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] \right) - \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] \\
-  &= \left( \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] \right) + \mathbb{H}\left( q(z) \right)
-\end{align*}
-$$
-:::
-
-and $\mathbb{H} \left(q(z) \right)$ denotes the [(differential) entropy](https://www.wikiwand.com/en/Differential_entropy) of $q(z)$.
-
-Assuming the joint $p(x_i, z)$ and the entropy $\mathbb{H}\left(q(z)\right)$ are both tractable, we can use a Monte Carlo estimate for the remaining expectation.
This leaves us with the following tractable expression - -::: {.column-page} -$$ -\underset{q}{\mathrm{argmin}} \ \mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) \approx \underset{q}{\mathrm{argmax}} \ \widehat{\mathrm{ELBO}}(q) -$$ -::: - -where - -::: {.column-page} -$$ -\widehat{\mathrm{ELBO}}(q) = \frac{1}{m} \left( \sum\_{k = 1}^m \sum\_{i = 1}^n \log p(x\_i, z\_k) \right) + \mathbb{H} \left(q(z)\right) \quad \text{where} \quad z\_k \sim q(z) \quad \forall k = 1, \dots, m. -$$ -::: - -Hence, as long as we can sample from $q(z)$ somewhat efficiently, we can indeed minimize the KL-divergence! Neat, eh? - -Sidenote: in the case where $q(z)$ is tractable but $\mathbb{H} \left(q(z) \right)$ is _not_ , we can use an Monte-Carlo estimate for this term too but this generally results in a higher-variance estimate. - -Also, I fooled you real good: the ELBO _isn't_ an arbitrary name, hah! In fact it's an abbreviation for the **expected lower bound (ELBO)** because it, uhmm, well, it's the _expected_ lower bound (remember $\mathrm{D_{KL}} \ge 0$). Yup. - -## Maximizing the ELBO - -Finding the optimal $q$ over _all_ possible densities of course isn't feasible. Instead we consider a family of _parameterized_ densities $\mathscr{D}\_{\Theta}$ where $\Theta$ denotes the space of possible parameters. Each density in this family $q\_{\theta} \in \mathscr{D}\_{\Theta}$ is parameterized by a unique $\theta \in \Theta$. Moreover, we'll assume - - 1. $q\_{\theta}(z)$, i.e. evaluating the probability density $q$ at any point $z$, is differentiable - 2. $z \sim q\_{\theta}(z)$, i.e. the process of sampling from $q\_{\theta}(z)$, is differentiable - -(1) is fairly straight-forward, but (2) is a bit tricky. What does it even mean for a _sampling process_ to be differentiable? 
This is quite an interesting problem in its own right and would require something like a [50-page paper to properly review the different approaches (highly recommended read)](https://arxiv.org/abs/1906.10652). - -We're going to make use of a particular such approach which goes under a bunch of different names: _reparametrization trick_, _path derivative_, etc. This refers to making the assumption that all elements $q\_{\theta} \in \mathscr{Q}\_{\Theta}$ can be considered as reparameterizations of some base density, say $\bar{q}(z)$. That is, if $q\_{\theta} \in \mathscr{Q}\_{\Theta}$ then - -::: {.column-page} -$$ -z \sim q\_{\theta}(z) \quad \iff \quad z := g\_{\theta}(\tilde{z}) \quad \text{where} \quad \bar{z} \sim \bar{q}(z) -$$ -::: - -for some function $g\_{\theta}$ differentiable wrt. $\theta$. So all $q_{\theta} \in \mathscr{Q}\_{\Theta}$ are using the *same* reparameterization-function $g$ but each $q\_{\theta}$ correspond to different choices of $\theta$ for $f\_{\theta}$. - -Under this assumption we can differentiate the sampling process by taking the derivative of $g\_{\theta}$ wrt. $\theta$, and thus we can differentiate the entire $\widehat{\mathrm{ELBO}}(q\_{\theta})$ wrt. $\theta$! With the gradient available we can either try to solve for optimality either by setting the gradient equal to zero or maximize $\widehat{\mathrm{ELBO}}(q\_{\theta})$ stepwise by traversing $\mathscr{Q}\_{\Theta}$ in the direction of steepest ascent. For the sake of generality, we're going to go with the stepwise approach. - -With all this nailed down, we eventually reach the section on **Automatic Differentiation Variational Inference (ADVI)**. - -## Automatic Differentiation Variational Inference (ADVI) - -So let's revisit the assumptions we've made at this point: - - 1. The variational posterior $q\_{\theta}$ is in a parameterized family of densities denoted $\mathscr{Q}\_{\Theta}$, with $\theta \in \Theta$. - - 2. 
$\mathscr{Q}\_{\Theta}$ is a space of _reparameterizable_ densities with $\bar{q}(z)$ as the base-density. - - 3. The parameterization function $g\_{\theta}$ is differentiable wrt. $\theta$. - - 4. Evaluation of the probability density $q\_{\theta}(z)$ is differentiable wrt. $\theta$. - - 5. $\mathbb{H}\left(q\_{\theta}(z)\right)$ is tractable. - - 6. Evaluation of the joint density $p(x, z)$ is tractable and differentiable wrt. $z$ - - 7. The support of $q(z)$ is a subspace of the support of $p(z \mid x)$ : $\mathrm{supp}\left(q(z)\right) \subseteq \mathrm{supp}\left(p(z \mid x)\right)$. - -All of these are not *necessary* to do VI, but they are very convenient and results in a fairly flexible approach. One distribution which has a density satisfying all of the above assumptions _except_ (7) (we'll get back to this in second) for any tractable and differentiable $p(z \mid \\{ x\_i \\}\_{i = 1}^n)$ is the good ole' Gaussian/normal distribution: - -::: {.column-page} -$$ -z \sim \mathcal{N}(\mu, \Sigma) \quad \iff \quad z = g\_{\mu, L}(\bar{z}) := \mu + L^T \tilde{z} \quad \text{where} \quad \bar{z} \sim \bar{q}(z) := \mathcal{N}(1\_d, I\_{d \times d}) -$$ -::: - -where $\Sigma = L L^T,$ with $L$ obtained from the Cholesky-decomposition. Abusing notation a bit, we're going to write - -::: {.column-page} -$$ -\theta = (\mu, \Sigma) := (\mu\_1, \dots, \mu\_d, L\_{11}, \dots, L\_{1, d}, L\_{2, 1}, \dots, L\_{2, d}, \dots, L\_{d, 1}, \dots, L\_{d, d}). -$$ -::: - -With this assumption we finally have a tractable expression for $\widehat{\mathrm{ELBO}}(q_{\mu, \Sigma})$! Well, assuming (7) is holds. Since a Gaussian has non-zero probability on the entirety of $\mathbb{R}^d$, we also require $p(z \mid \\{ x_i \\}_{i = 1}^n)$ to have non-zero probability on all of $\mathbb{R}^d$. - -Though not necessary, we'll often make a *mean-field* assumption for the variational posterior $q(z)$, i.e. assume independence between the latent variables. 
In this case, we'll write - -::: {.column-page} -$$ -\theta = (\mu, \sigma^2) := (\mu\_1, \dots, \mu\_d, \sigma\_1^2, \dots, \sigma\_d^2). -$$ -::: - -### Examples - -As a (trivial) example we could apply the approach described above to is the following generative model for $p(z \mid \\{ x_i \\}\_{i = 1}^n)$: - -::: {.column-page} -$$ -\begin{align*} - m &\sim \mathcal{N}(0, 1) \\ - x\_i &\overset{\text{i.i.d.}}{=} \mathcal{N}(m, 1), \quad i = 1, \dots, n. -\end{align*} -$$ -::: - -In this case $z = m$ and we have the posterior defined $p(m \mid \\{ x\_i \\}\_{i = 1}^n) = p(m) \prod\_{i = 1}^n p(x\_i \mid m)$. Then the variational posterior would be - -::: {.column-page} -$$ -q\_{\mu, \sigma} = \mathcal{N}(\mu, \sigma^2), \quad \text{where} \quad \mu \in \mathbb{R}, \ \sigma^2 \in \mathbb{R}^{ + }. -$$ -::: - -And since prior of $m$, $\mathcal{N}(0, 1)$, has non-zero probability on the entirety of $\mathbb{R}$, same as $q(m)$, i.e. assumption (7) above holds, everything is fine and life is good. - -But what about this generative model for $p(z \mid \\{ x_i \\}_{i = 1}^n)$: - -::: {.column-page} -$$ -\begin{align*} - s &\sim \mathrm{InverseGamma}(2, 3), \\ - m &\sim \mathcal{N}(0, s), \\ - x\_i &\overset{\text{i.i.d.}}{=} \mathcal{N}(m, s), \quad i = 1, \dots, n, -\end{align*} -$$ -::: - -with posterior $p(s, m \mid \\{ x\_i \\}\_{i = 1}^n) = p(s) p(m \mid s) \prod\_{i = 1}^n p(x\_i \mid s, m)$ and the mean-field variational posterior $q(s, m)$ will be - -::: {.column-page} -$$ -q\_{\mu\_1, \mu\_2, \sigma\_1^2, \sigma\_2^2}(s, m) = p\_{\mathcal{N}(\mu\_1, \sigma\_1^2)}(s)\ p\_{\mathcal{N}(\mu\_2, \sigma\_2^2)}(m), -$$ -::: - -where we've denoted the evaluation of the probability density of a Gaussian as $p_{\mathcal{N}(\mu, \sigma^2)}(x)$. - -Observe that $\mathrm{InverseGamma}(2, 3)$ has non-zero probability only on $\mathbb{R}^{ + } := (0, \infty)$ which is clearly not all of $\mathbb{R}$ like $q(s, m)$ has, i.e. 
- -::: {.column-page} -$$ -\mathrm{supp} \left( q(s, m) \right) \not\subseteq \mathrm{supp} \left( p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right). -$$ -::: - -Recall from the definition of the KL-divergence that when this is the case, the KL-divergence isn't well defined. This gets us to the *automatic* part of ADVI. - -### "Automatic"? How? - -For a lot of the standard (continuous) densities $p$ we can actually construct a probability density $\tilde{p}$ with non-zero probability on all of $\mathbb{R}$ by *transforming* the "constrained" probability density $p$ to $\tilde{p}$. In fact, in these cases this is a one-to-one relationship. As we'll see, this helps solve the support-issue we've been going on and on about. - -#### Transforming densities using change of variables - -If we want to compute the probability of $x$ taking a value in some set $A \subseteq \mathrm{supp} \left( p(x) \right)$, we have to integrate $p(x)$ over $A$, i.e. - -::: {.column-page} -$$ -\mathbb{P}_p(x \in A) = \int_A p(x) \mathrm{d}x. -$$ -::: - -This means that if we have a differentiable bijection $f: \mathrm{supp} \left( q(x) \right) \to \mathbb{R}^d$ with differentiable inverse $f^{-1}: \mathbb{R}^d \to \mathrm{supp} \left( p(x) \right)$, we can perform a change of variables - -::: {.column-page} -$$ -\mathbb{P}\_p(x \in A) = \int\_{f^{-1}(A)} p \left(f^{-1}(y) \right) \ \left| \det \mathcal{J}\_{f^{-1}}(y) \right| \mathrm{d}y, -$$ -::: - -where $\mathcal{J}_{f^{-1}}(x)$ denotes the jacobian of $f^{-1}$ evaluated at $x$. Observe that this defines a probability distribution - -::: {.column-page} -$$ -\mathbb{P}\_{\tilde{p}}\left(y \in f^{-1}(A) \right) = \int\_{f^{-1}(A)} \tilde{p}(y) \mathrm{d}y, -$$ -::: - -since $f^{-1}\left(\mathrm{supp} (p(x)) \right) = \mathbb{R}^d$ which has probability 1. 
This probability distribution has *density* $\tilde{p}(y)$ with $\mathrm{supp} \left( \tilde{p}(y) \right) = \mathbb{R}^d$, defined - -::: {.column-page} -$$ -\tilde{p}(y) = p \left( f^{-1}(y) \right) \ \left| \det \mathcal{J}\_{f^{-1}}(y) \right| -$$ -::: - -or equivalently - -::: {.column-page} -$$ -\tilde{p} \left( f(x) \right) = \frac{p(x)}{\big| \det \mathcal{J}\_{f}(x) \big|} -$$ -::: - -due to the fact that - -::: {.column-page} -$$ -\big| \det \mathcal{J}\_{f^{-1}}(y) \big| = \big| \det \mathcal{J}\_{f}(x) \big|^{-1} -$$ -::: - -*Note: it's also necessary that the log-abs-det-jacobian term is non-vanishing. This can for example be accomplished by assuming $f$ to also be elementwise monotonic.* - -#### Back to VI - -So why is this is useful? Well, we're looking to generalize our approach using a normal distribution to cases where the supports don't match up. How about defining $q(z)$ by - -::: {.column-page} -$$ -\begin{align*} - \eta &\sim \mathcal{N}(\mu, \Sigma), \\\\ - z &= f^{-1}(\eta), -\end{align*} -$$ -::: - -where $f^{-1}: \mathbb{R}^d \to \mathrm{supp} \left( p(z \mid x) \right)$ is a differentiable bijection with differentiable inverse. Then $z \sim q_{\mu, \Sigma}(z) \implies z \in \mathrm{supp} \left( p(z \mid x) \right)$ as we wanted. The resulting variational density is - -::: {.column-page} -$$ -q\_{\mu, \Sigma}(z) = p\_{\mathcal{N}(\mu, \Sigma)}\left( f(z) \right) \ \big| \det \mathcal{J}\_{f}(z) \big|. -$$ -::: - -Note that the way we've constructed $q(z)$ here is basically a reverse of the approach we described above. Here we sample from a distribution with support on $\mathbb{R}$ and transform *to* $\mathrm{supp} \left( p(z \mid x) \right)$. 
- -If we want to write the ELBO explicitly in terms of $\eta$ rather than $z$, the first term in the ELBO becomes - -::: {.column-page} -$$ -\begin{align*} - \mathbb{E}\_{z \sim q_{\mu, \Sigma}(z)} \left[ \log p(x\_i, z) \right] &= \mathbb{E}\_{\eta \sim \mathcal{N}(\mu, \Sigma)} \Bigg[ \log \frac{p\left(x\_i, f^{-1}(\eta) \right)}{\big| \det \mathcal{J}_{f^{-1}}(\eta) \big|} \Bigg] \\ - &= \mathbb{E}\_{\eta \sim \mathcal{N}(\mu, \Sigma)} \left[ \log p\left(x\_i, f^{-1}(\eta) \right) \right] - \mathbb{E}\_{\eta \sim \mathcal{N}(\mu, \Sigma)} \left[ \left| \det \mathcal{J}\_{f^{-1}}(\eta) \right| \right]. -\end{align*} -$$ -::: - -The entropy is invariant under change of variables, thus $\mathbb{H} \left(q\_{\mu, \Sigma}(z)\right)$ is simply the entropy of the normal distribution which is known analytically. - -Hence, the resulting empirical estimate of the ELBO is - -::: {.column-page} -$$ -\begin{align*} -\widehat{\mathrm{ELBO}}(q\_{\mu, \Sigma}) &= \frac{1}{m} \left( \sum\_{k = 1}^m \sum\_{i = 1}^n \left(\log p\left(x\_i, f^{-1}(\eta_k)\right) - \log \big| \det \mathcal{J}\_{f^{-1}}(\eta\_k) \big| \right) \right) + \mathbb{H} \left(p\_{\mathcal{N}(\mu, \Sigma)}(z)\right) \\ -& \text{where} \quad z\_k \sim \mathcal{N}(\mu, \Sigma) \quad \forall k = 1, \dots, m -\end{align*}. -$$ -::: - -And maximizing this wrt. $\mu$ and $\Sigma$ is what's referred to as **Automatic Differentiation Variational Inference (ADVI)**! - -Now if you want to try it out, [check out the tutorial on how to use ADVI in Turing.jl]({{}})! +--- +title: Variational Inference +engine: julia +aliases: + - ../../tutorials/docs-07-for-developers-variational-inference/index.html +--- + +# Overview + +In this post, we'll examine variational inference (VI), a family of approximate Bayesian inference methods. We will focus on one of the more standard VI methods, Automatic Differentiation Variational Inference (ADVI). 
+
+Here, we'll examine the theory behind VI, but if you're interested in using ADVI in Turing, [check out this tutorial]({{}}).
+
+# Motivation
+
+In Bayesian inference, one usually specifies a model as follows: given data $\\{x_i\\}_{i = 1}^n$,
+
+::: {.column-page}
+$$
+\begin{align*}
+  \text{prior:} \quad z &\sim p(z) \\
+  \text{likelihood:} \quad x_i &\overset{\text{i.i.d.}}{\sim} p(x \mid z) \quad \text{where} \quad i = 1, \dots, n
+\end{align*}
+$$
+:::
+
+where $\overset{\text{i.i.d.}}{\sim}$ denotes that the samples are independently and identically distributed. Our goal in Bayesian inference is then to find the _posterior_
+
+::: {.column-page}
+$$
+p(z \mid \\{ x\_i \\}\_{i = 1}^n) \propto p(z) \prod\_{i=1}^{n} p(x\_i \mid z).
+$$
+:::
+
+In general, one cannot obtain a closed-form expression for $p(z \mid \\{ x\_i \\}\_{i = 1}^n)$, but one might still be able to _sample_ from it with guarantees of converging to the target posterior as the number of samples goes to $\infty$, e.g. via MCMC.
+
+As you are hopefully already aware, Turing.jl provides many methods with asymptotic exactness guarantees that we can apply to such a problem!
+
+Unfortunately, these unbiased samplers can be prohibitively expensive to run. As the model $p$ increases in complexity, the convergence of these unbiased samplers can slow down dramatically. Still, in the _infinite_ limit, these methods should converge to the true posterior! But infinity is fairly large, like, _at least_ more than 12, so this might take a while.
+
+In such a case, it might be desirable to sacrifice some of these asymptotic guarantees and instead _approximate_ the posterior $p(z \mid \\{ x_i \\}_{i = 1}^n)$ using some other model which we'll denote $q(z)$.
+
+There are multiple approaches to take in this case, one of which is **variational inference (VI)**. 
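To make the point above concrete, note that while the normalized posterior is out of reach, the *unnormalized* posterior $p(z) \prod_{i=1}^n p(x_i \mid z)$ is cheap to evaluate pointwise. A minimal numerical sketch (written in Python purely for illustration, since Turing's own tutorials use Julia; the model, data, and grid are all made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical conjugate model: z ~ N(0, 1), x_i | z ~ N(z, 1).
n = 20
x = rng.normal(1.0, 1.0, size=n)

def log_post_unnorm(z):
    # log p(z) + sum_i log p(x_i | z): no normalizing constant needed.
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0).sum()

grid = np.linspace(-3.0, 3.0, 6001)
vals = np.array([log_post_unnorm(z) for z in grid])
z_map = grid[np.argmax(vals)]

# For this conjugate pair the posterior is N(sum(x)/(n+1), 1/(n+1)),
# so the grid maximizer should sit near the analytic posterior mean.
print(z_map, x.sum() / (n + 1))
```

In one dimension a grid like this is enough; the whole point of VI is that in higher dimensions it is not, and we need a smarter way to approximate the posterior.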
+
+# Variational Inference (VI)
+
+In VI, we're looking to approximate $p(z \mid \\{ x_i \\}_{i = 1}^n )$ using some _approximate_ or _variational_ posterior $q(z)$.
+
+To approximate something you need a notion of what "close" means. In the context of probability densities, a standard such "measure" of closeness is the _Kullback-Leibler (KL) divergence_, though this is far from the only one. The KL-divergence is defined between two densities $q(z)$ and $p(z \mid \\{ x_i \\}_{i = 1}^n)$ as
+
+::: {.column-page}
+$$
+\begin{align*}
+  \mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) &= \int \log \left( \frac{q(z)}{\prod\_{i = 1}^n p(z \mid x\_i)} \right) q(z) \mathrm{d}{z} \\\\
+  &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) - \sum\_{i = 1}^n \log p(z \mid x\_i) \right] \\\\
+  &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] - \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(z \mid x\_i) \right].
+\end{align*}
+$$
+:::
+
+It's worth noting that unfortunately the KL-divergence is _not_ a metric/distance in the analysis sense, due to its lack of symmetry. On the other hand, it turns out that minimizing the KL-divergence is actually equivalent to maximizing the log-likelihood! Also, under reasonable restrictions on the densities at hand,
+
+::: {.column-page}
+$$
+\mathrm{D\_{KL}}\left(q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) = 0 \quad \iff \quad q(z) = p(z \mid \\{ x\_i \\}\_{i = 1}^n), \quad \forall z.
+$$
+:::
+
+Therefore one could (and we will) attempt to approximate $p(z \mid \\{ x_i \\}_{i = 1}^n)$ using a density $q(z)$ by minimizing the KL-divergence between these two!
+
+One can also show that $\mathrm{D_{KL}} \ge 0$, which we'll need later. Finally, notice that the KL-divergence is only well-defined when $q(z)$ is in fact zero everywhere $p(z \mid \\{ x_i \\}_{i = 1}^n)$ is zero, i.e.
+
+::: {.column-page}
+$$
+\mathrm{supp}\left(q(z)\right) \subseteq \mathrm{supp}\left(p(z \mid x)\right). 
+$$
+:::
+
+Otherwise, there might be a point $z_0 \sim q(z)$ such that $p(z_0 \mid \\{ x_i \\}_{i = 1}^n) = 0$, resulting in $\log\left(\frac{q(z)}{0}\right)$ which doesn't make sense!
+
+One major problem: as we can see in the definition of the KL-divergence, we need $p(z \mid \\{ x_i \\}_{i = 1}^n)$ for any $z$ if we want to compute the KL-divergence between this and $q(z)$. We don't have that. The entire reason we even do Bayesian inference is that we don't know the posterior! Clearly this isn't going to work. _Or is it?!_
+
+## Computing KL-divergence without knowing the posterior
+
+First off, recall that
+
+::: {.column-page}
+$$
+p(z \mid x\_i) = \frac{p(x\_i, z)}{p(x\_i)}
+$$
+:::
+
+so we can write
+
+::: {.column-page}
+$$
+\begin{align*}
+\mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] - \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) - \log p(x\_i) \right] \\
+  &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] - \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] + \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x_i) \right] \\
+  &= \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] - \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] + \sum\_{i = 1}^n \log p(x\_i),
+\end{align*}
+$$
+:::
+
+where in the last equality we used the fact that $p(x_i)$ is independent of $z$.
+
+Now you're probably thinking "Oh great! Now you've introduced $p(x_i)$ which we _also_ can't compute (in general)!". Woah. Calm down, human. Let's do some more algebra. 
The above expression can be rearranged to
+
+::: {.column-page}
+$$
+\mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) + \underbrace{\sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] - \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right]}\_{=: \mathrm{ELBO}(q)} = \underbrace{\sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i) \right]}\_{\text{constant}}.
+$$
+:::
+
+See? The right-hand side is _constant_ in $q$ and, as we mentioned before, $\mathrm{D_{KL}} \ge 0$. What happens if we try to _maximize_ the term we just gave the completely arbitrary name $\mathrm{ELBO}$? Well, if $\mathrm{ELBO}$ goes up while $p(x_i)$ stays constant then $\mathrm{D_{KL}}$ _has to_ go down! That is, the $q(z)$ which _minimizes_ the KL-divergence is the same $q(z)$ which _maximizes_ $\mathrm{ELBO}(q)$:
+
+::: {.column-page}
+$$
+\underset{q}{\mathrm{argmin}} \ \mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) = \underset{q}{\mathrm{argmax}} \ \mathrm{ELBO}(q)
+$$
+:::
+
+where
+
+::: {.column-page}
+$$
+\begin{align*}
+\mathrm{ELBO}(q) &:= \left( \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] \right) - \mathbb{E}\_{z \sim q(z)} \left[ \log q(z) \right] \\
+  &= \left( \sum\_{i = 1}^n \mathbb{E}\_{z \sim q(z)} \left[ \log p(x\_i, z) \right] \right) + \mathbb{H}\left( q(z) \right)
+\end{align*}
+$$
+:::
+
+and $\mathbb{H} \left(q(z) \right)$ denotes the [(differential) entropy](https://www.wikiwand.com/en/Differential_entropy) of $q(z)$.
+
+Assuming the joint $p(x_i, z)$ and the entropy $\mathbb{H}\left(q(z)\right)$ are both tractable, we can use a Monte-Carlo estimate for the remaining expectation. 
This leaves us with the following tractable expression
+
+::: {.column-page}
+$$
+\underset{q}{\mathrm{argmin}} \ \mathrm{D\_{KL}} \left( q(z), p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right) \approx \underset{q}{\mathrm{argmax}} \ \widehat{\mathrm{ELBO}}(q)
+$$
+:::
+
+where
+
+::: {.column-page}
+$$
+\widehat{\mathrm{ELBO}}(q) = \frac{1}{m} \left( \sum\_{k = 1}^m \sum\_{i = 1}^n \log p(x\_i, z\_k) \right) + \mathbb{H} \left(q(z)\right) \quad \text{where} \quad z\_k \sim q(z) \quad \forall k = 1, \dots, m.
+$$
+:::
+
+Hence, as long as we can sample from $q(z)$ somewhat efficiently, we can indeed minimize the KL-divergence! Neat, eh?
+
+Sidenote: in the case where $q(z)$ is tractable but $\mathbb{H} \left(q(z) \right)$ is _not_, we can use a Monte-Carlo estimate for this term too, but this generally results in a higher-variance estimate.
+
+Also, I fooled you real good: the ELBO _isn't_ an arbitrary name, hah! In fact it's an abbreviation for the **evidence lower bound (ELBO)** because it, uhmm, well, it's a lower bound on the (log) _evidence_ $\sum\_{i = 1}^n \log p(x\_i)$ (remember $\mathrm{D_{KL}} \ge 0$). Yup.
+
+## Maximizing the ELBO
+
+Finding the optimal $q$ over _all_ possible densities of course isn't feasible. Instead we consider a family of _parameterized_ densities $\mathscr{Q}\_{\Theta}$ where $\Theta$ denotes the space of possible parameters. Each density in this family $q\_{\theta} \in \mathscr{Q}\_{\Theta}$ is parameterized by a unique $\theta \in \Theta$. Moreover, we'll assume
+
+ 1. $q\_{\theta}(z)$, i.e. evaluating the probability density $q$ at any point $z$, is differentiable
+ 2. $z \sim q\_{\theta}(z)$, i.e. the process of sampling from $q\_{\theta}(z)$, is differentiable
+
+(1) is fairly straight-forward, but (2) is a bit tricky. What does it even mean for a _sampling process_ to be differentiable? 
This is quite an interesting problem in its own right and would require something like a [50-page paper to properly review the different approaches (highly recommended read)](https://arxiv.org/abs/1906.10652).
+
+We're going to make use of a particular such approach which goes under a bunch of different names: _reparametrization trick_, _path derivative_, etc. This refers to making the assumption that all elements $q\_{\theta} \in \mathscr{Q}\_{\Theta}$ can be considered as reparameterizations of some base density, say $\bar{q}(z)$. That is, if $q\_{\theta} \in \mathscr{Q}\_{\Theta}$ then
+
+::: {.column-page}
+$$
+z \sim q\_{\theta}(z) \quad \iff \quad z := g\_{\theta}(\bar{z}) \quad \text{where} \quad \bar{z} \sim \bar{q}(z)
+$$
+:::
+
+for some function $g\_{\theta}$ differentiable wrt. $\theta$. So all $q_{\theta} \in \mathscr{Q}\_{\Theta}$ are using the *same* reparameterization function $g$, but each $q\_{\theta}$ corresponds to a different choice of $\theta$ for $g\_{\theta}$.
+
+Under this assumption we can differentiate the sampling process by taking the derivative of $g\_{\theta}$ wrt. $\theta$, and thus we can differentiate the entire $\widehat{\mathrm{ELBO}}(q\_{\theta})$ wrt. $\theta$! With the gradient available we can try to solve for optimality by setting the gradient equal to zero, or we can maximize $\widehat{\mathrm{ELBO}}(q\_{\theta})$ stepwise by traversing $\mathscr{Q}\_{\Theta}$ in the direction of steepest ascent. For the sake of generality, we're going to go with the stepwise approach.
+
+With all this nailed down, we eventually reach the section on **Automatic Differentiation Variational Inference (ADVI)**.
+
+## Automatic Differentiation Variational Inference (ADVI)
+
+So let's revisit the assumptions we've made at this point:
+
+ 1. The variational posterior $q\_{\theta}$ is in a parameterized family of densities denoted $\mathscr{Q}\_{\Theta}$, with $\theta \in \Theta$.
+
+ 2. 
$\mathscr{Q}\_{\Theta}$ is a space of _reparameterizable_ densities with $\bar{q}(z)$ as the base-density.
+
+ 3. The parameterization function $g\_{\theta}$ is differentiable wrt. $\theta$.
+
+ 4. Evaluation of the probability density $q\_{\theta}(z)$ is differentiable wrt. $\theta$.
+
+ 5. $\mathbb{H}\left(q\_{\theta}(z)\right)$ is tractable.
+
+ 6. Evaluation of the joint density $p(x, z)$ is tractable and differentiable wrt. $z$.
+
+ 7. The support of $q(z)$ is a subset of the support of $p(z \mid x)$: $\mathrm{supp}\left(q(z)\right) \subseteq \mathrm{supp}\left(p(z \mid x)\right)$.
+
+Not all of these are *necessary* to do VI, but they are very convenient and result in a fairly flexible approach. One distribution which has a density satisfying all of the above assumptions _except_ (7) (we'll get back to this in a second) for any tractable and differentiable $p(z \mid \\{ x\_i \\}\_{i = 1}^n)$ is the good ole' Gaussian/normal distribution:
+
+::: {.column-page}
+$$
+z \sim \mathcal{N}(\mu, \Sigma) \quad \iff \quad z = g\_{\mu, L}(\bar{z}) := \mu + L^T \bar{z} \quad \text{where} \quad \bar{z} \sim \bar{q}(z) := \mathcal{N}(0\_d, I\_{d \times d})
+$$
+:::
+
+where $\Sigma = L L^T$, with $L$ obtained from the Cholesky decomposition. Abusing notation a bit, we're going to write
+
+::: {.column-page}
+$$
+\theta = (\mu, \Sigma) := (\mu\_1, \dots, \mu\_d, L\_{11}, \dots, L\_{1, d}, L\_{2, 1}, \dots, L\_{2, d}, \dots, L\_{d, 1}, \dots, L\_{d, d}).
+$$
+:::
+
+With this assumption we finally have a tractable expression for $\widehat{\mathrm{ELBO}}(q_{\mu, \Sigma})$! Well, assuming (7) holds. Since a Gaussian has non-zero probability on the entirety of $\mathbb{R}^d$, we also require $p(z \mid \\{ x_i \\}_{i = 1}^n)$ to have non-zero probability on all of $\mathbb{R}^d$.
+
+Though not necessary, we'll often make a *mean-field* assumption for the variational posterior $q(z)$, i.e. assume independence between the latent variables. 
In this case, we'll write
+
+::: {.column-page}
+$$
+\theta = (\mu, \sigma^2) := (\mu\_1, \dots, \mu\_d, \sigma\_1^2, \dots, \sigma\_d^2).
+$$
+:::
+
+### Examples
+
+As a (trivial) example, we could apply the approach described above to the following generative model for $p(z \mid \\{ x_i \\}\_{i = 1}^n)$:
+
+::: {.column-page}
+$$
+\begin{align*}
+  m &\sim \mathcal{N}(0, 1) \\
+  x\_i &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(m, 1), \quad i = 1, \dots, n.
+\end{align*}
+$$
+:::
+
+In this case $z = m$ and we have the posterior $p(m \mid \\{ x\_i \\}\_{i = 1}^n) \propto p(m) \prod\_{i = 1}^n p(x\_i \mid m)$. Then the variational posterior would be
+
+::: {.column-page}
+$$
+q\_{\mu, \sigma} = \mathcal{N}(\mu, \sigma^2), \quad \text{where} \quad \mu \in \mathbb{R}, \ \sigma^2 \in \mathbb{R}^{ + }.
+$$
+:::
+
+And since the prior of $m$, $\mathcal{N}(0, 1)$, has non-zero probability on the entirety of $\mathbb{R}$, just like $q(m)$, assumption (7) above holds and everything is fine; life is good.
+
+But what about this generative model for $p(z \mid \\{ x_i \\}_{i = 1}^n)$:
+
+::: {.column-page}
+$$
+\begin{align*}
+  s &\sim \mathrm{InverseGamma}(2, 3), \\
+  m &\sim \mathcal{N}(0, s), \\
+  x\_i &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(m, s), \quad i = 1, \dots, n,
+\end{align*}
+$$
+:::
+
+with posterior $p(s, m \mid \\{ x\_i \\}\_{i = 1}^n) \propto p(s) p(m \mid s) \prod\_{i = 1}^n p(x\_i \mid s, m)$ and the mean-field variational posterior $q(s, m)$ will be
+
+::: {.column-page}
+$$
+q\_{\mu\_1, \mu\_2, \sigma\_1^2, \sigma\_2^2}(s, m) = p\_{\mathcal{N}(\mu\_1, \sigma\_1^2)}(s)\ p\_{\mathcal{N}(\mu\_2, \sigma\_2^2)}(m),
+$$
+:::
+
+where we've denoted the evaluation of the probability density of a Gaussian as $p_{\mathcal{N}(\mu, \sigma^2)}(x)$.
+
+Observe that $\mathrm{InverseGamma}(2, 3)$ has non-zero probability only on $\mathbb{R}^{ + } := (0, \infty)$, which is clearly not all of $\mathbb{R}$ like $q(s, m)$ has, i.e. 
+
+::: {.column-page}
+$$
+\mathrm{supp} \left( q(s, m) \right) \not\subseteq \mathrm{supp} \left( p(z \mid \\{ x\_i \\}\_{i = 1}^n) \right).
+$$
+:::
+
+Recall from the definition of the KL-divergence that when this is the case, the KL-divergence isn't well defined. This gets us to the *automatic* part of ADVI.
+
+### "Automatic"? How?
+
+For a lot of the standard (continuous) densities $p$ we can actually construct a probability density $\tilde{p}$ with non-zero probability on all of $\mathbb{R}$ by *transforming* the "constrained" probability density $p$ to $\tilde{p}$. In fact, in these cases this is a one-to-one relationship. As we'll see, this helps solve the support-issue we've been going on and on about.
+
+#### Transforming densities using change of variables
+
+If we want to compute the probability of $x$ taking a value in some set $A \subseteq \mathrm{supp} \left( p(x) \right)$, we have to integrate $p(x)$ over $A$, i.e.
+
+::: {.column-page}
+$$
+\mathbb{P}_p(x \in A) = \int_A p(x) \mathrm{d}x.
+$$
+:::
+
+This means that if we have a differentiable bijection $f: \mathrm{supp} \left( p(x) \right) \to \mathbb{R}^d$ with differentiable inverse $f^{-1}: \mathbb{R}^d \to \mathrm{supp} \left( p(x) \right)$, we can perform a change of variables
+
+::: {.column-page}
+$$
+\mathbb{P}\_p(x \in A) = \int\_{f^{-1}(A)} p \left(f^{-1}(y) \right) \ \left| \det \mathcal{J}\_{f^{-1}}(y) \right| \mathrm{d}y,
+$$
+:::
+
+where $\mathcal{J}_{f^{-1}}(y)$ denotes the Jacobian of $f^{-1}$ evaluated at $y$. Observe that this defines a probability distribution
+
+::: {.column-page}
+$$
+\mathbb{P}\_{\tilde{p}}\left(y \in f^{-1}(A) \right) = \int\_{f^{-1}(A)} \tilde{p}(y) \mathrm{d}y,
+$$
+:::
+
+since $f\left(\mathrm{supp} (p(x)) \right) = \mathbb{R}^d$ which has probability 1. 
This probability distribution has *density* $\tilde{p}(y)$ with $\mathrm{supp} \left( \tilde{p}(y) \right) = \mathbb{R}^d$, defined
+
+::: {.column-page}
+$$
+\tilde{p}(y) = p \left( f^{-1}(y) \right) \ \left| \det \mathcal{J}\_{f^{-1}}(y) \right|
+$$
+:::
+
+or equivalently
+
+::: {.column-page}
+$$
+\tilde{p} \left( f(x) \right) = \frac{p(x)}{\big| \det \mathcal{J}\_{f}(x) \big|}
+$$
+:::
+
+due to the fact that
+
+::: {.column-page}
+$$
+\big| \det \mathcal{J}\_{f^{-1}}(y) \big| = \big| \det \mathcal{J}\_{f}(x) \big|^{-1}.
+$$
+:::
+
+*Note: it's also necessary that the log-abs-det-Jacobian term is non-vanishing. This can, for example, be accomplished by assuming $f$ to also be elementwise monotonic.*
+
+#### Back to VI
+
+So why is this useful? Well, we're looking to generalize our approach using a normal distribution to cases where the supports don't match up. How about defining $q(z)$ by
+
+::: {.column-page}
+$$
+\begin{align*}
+  \eta &\sim \mathcal{N}(\mu, \Sigma), \\\\
+  z &= f^{-1}(\eta),
+\end{align*}
+$$
+:::
+
+where $f^{-1}: \mathbb{R}^d \to \mathrm{supp} \left( p(z \mid x) \right)$ is a differentiable bijection with differentiable inverse. Then $z \sim q_{\mu, \Sigma}(z) \implies z \in \mathrm{supp} \left( p(z \mid x) \right)$ as we wanted. The resulting variational density is
+
+::: {.column-page}
+$$
+q\_{\mu, \Sigma}(z) = p\_{\mathcal{N}(\mu, \Sigma)}\left( f(z) \right) \ \big| \det \mathcal{J}\_{f}(z) \big|.
+$$
+:::
+
+Note that the way we've constructed $q(z)$ here is basically a reverse of the approach we described above. Here we sample from a distribution with support on $\mathbb{R}^d$ and transform *to* $\mathrm{supp} \left( p(z \mid x) \right)$. 
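The change-of-variables formula above can be sanity-checked numerically. A small sketch (Python purely for convenience; the particular $p$ and $f$ are just an illustrative choice): take $p$ to be a standard log-normal density on $(0, \infty)$ and $f = \log$, so $f^{-1} = \exp$ and $\big|\det \mathcal{J}_{f^{-1}}(y)\big| = e^y$. The transformed density $\tilde{p}(y) = p(e^y)\, e^y$ should then come out as exactly the standard normal density:

```python
import numpy as np
from scipy.stats import lognorm, norm

# Constrained density p with support (0, inf): a standard log-normal.
p = lognorm(s=1.0)

# f = log maps supp(p) onto R, so f^{-1} = exp and |det J_{f^-1}(y)| = e^y.
def p_tilde(y):
    return p.pdf(np.exp(y)) * np.exp(y)

ys = np.linspace(-5.0, 5.0, 1001)
dens = p_tilde(ys)

# For this pair, p_tilde equals the standard normal density, and it
# should integrate to (essentially) one over [-5, 5].
max_err = np.max(np.abs(dens - norm.pdf(ys)))
mass = np.sum(0.5 * (dens[:-1] + dens[1:]) * np.diff(ys))
print(max_err, mass)
```

This is the same mechanism, run in the "unconstraining" direction, that ADVI applies to each constrained latent variable.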
+
+If we want to write the ELBO explicitly in terms of $\eta$ rather than $z$, the first term in the ELBO becomes
+
+::: {.column-page}
+$$
+\begin{align*}
+  \mathbb{E}\_{z \sim q_{\mu, \Sigma}(z)} \left[ \log p(x\_i, z) \right] &= \mathbb{E}\_{\eta \sim \mathcal{N}(\mu, \Sigma)} \Bigg[ \log \frac{p\left(x\_i, f^{-1}(\eta) \right)}{\big| \det \mathcal{J}_{f^{-1}}(\eta) \big|} \Bigg] \\
+  &= \mathbb{E}\_{\eta \sim \mathcal{N}(\mu, \Sigma)} \left[ \log p\left(x\_i, f^{-1}(\eta) \right) \right] - \mathbb{E}\_{\eta \sim \mathcal{N}(\mu, \Sigma)} \left[ \log \left| \det \mathcal{J}\_{f^{-1}}(\eta) \right| \right].
+\end{align*}
+$$
+:::
+
+The entropy is invariant under change of variables, thus $\mathbb{H} \left(q\_{\mu, \Sigma}(z)\right)$ is simply the entropy of the normal distribution, which is known analytically.
+
+Hence, the resulting empirical estimate of the ELBO is
+
+::: {.column-page}
+$$
+\begin{align*}
+\widehat{\mathrm{ELBO}}(q\_{\mu, \Sigma}) &= \frac{1}{m} \left( \sum\_{k = 1}^m \sum\_{i = 1}^n \left(\log p\left(x\_i, f^{-1}(\eta_k)\right) - \log \big| \det \mathcal{J}\_{f^{-1}}(\eta\_k) \big| \right) \right) + \mathbb{H} \left(p\_{\mathcal{N}(\mu, \Sigma)}(z)\right) \\
+& \text{where} \quad \eta\_k \sim \mathcal{N}(\mu, \Sigma) \quad \forall k = 1, \dots, m.
+\end{align*}
+$$
+:::
+
+And maximizing this wrt. $\mu$ and $\Sigma$ is what's referred to as **Automatic Differentiation Variational Inference (ADVI)**!
+
+Now if you want to try it out, [check out the tutorial on how to use ADVI in Turing.jl]({{}})! 
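Before you go, the pieces above (reparameterized sampling plus the Monte-Carlo $\widehat{\mathrm{ELBO}}$) can be sketched numerically for the one-dimensional model from the examples, $m \sim \mathcal{N}(0, 1)$, $x_i \sim \mathcal{N}(m, 1)$. Python is used purely for illustration (Turing's tutorials use Julia), the data are synthetic, and no constrained transform is needed since the support is already all of $\mathbb{R}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from the model m ~ N(0, 1), x_i | m ~ N(m, 1).
n = 25
x = rng.normal(1.0, 1.0, size=n)

def log_joint(z):
    # log p(z) + sum_i log p(x_i | z), constants included.
    c = -0.5 * (n + 1) * np.log(2.0 * np.pi)
    return c - 0.5 * z**2 - 0.5 * np.sum((x - z) ** 2)

def elbo_hat(mu, sigma, m_samples=2000):
    # Reparameterization: z_k = mu + sigma * eps_k with eps_k ~ N(0, 1).
    eps = rng.normal(size=m_samples)
    zs = mu + sigma * eps
    mc = np.mean([log_joint(z) for z in zs])
    # Differential entropy of N(mu, sigma^2), known analytically.
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)
    return mc + entropy

# Conjugacy gives the exact posterior N(sum(x)/(n+1), 1/(n+1)); the
# ELBO estimate there should beat a q centred far away from it.
mu_star = x.sum() / (n + 1)
sd_star = np.sqrt(1.0 / (n + 1))
print(elbo_hat(mu_star, sd_star), elbo_hat(mu_star + 3.0, sd_star))
```

An actual ADVI run would now follow the gradient of this estimator wrt. $(\mu, \sigma)$; here the estimator alone is enough to see that the ELBO prefers the true posterior.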
diff --git a/theming/styles.css b/theming/styles.css
index b3927862f..2e55f2cdf 100755
--- a/theming/styles.css
+++ b/theming/styles.css
@@ -1,57 +1,57 @@
-/* css styles */
-/* .cell-output {
- background-color: #f1f3f5;
-} */
-
-/* .cell-output img {
- max-width: 100%;
- height: auto;
-} */
-
-/* .cell-output-display pre {
- word-break: break-wor !important;
- white-space: pre-wrap !important;
-}
- */
-
-.navbar a:hover {
- text-decoration: none;
-}
-
-.cell-output {
- border: 1px dashed;
-}
-
-.cell-bg {
- background-color: #f1f3f5;
-}
-
-.cell-output-stdout code {
- word-break: break-wor !important;
- white-space: pre-wrap !important;
-}
-
-
-.cell-output-display svg {
- height: fit-content;
- width: fit-content;
-}
-
-.cell-output-display img {
- max-width: 100%;
- max-height: 100%;
- object-fit: contain;
-}
-
-.nav-footer-center {
- display: flex;
- justify-content: center;
-}
-
-.dropdown-menu {
- text-align: center;
- min-width: 100px !important;
- border-radius: 5px;
- max-height: 250px;
- overflow: scroll;
+/* css styles */
+/* .cell-output {
+ background-color: #f1f3f5;
+} */
+
+/* .cell-output img {
+ max-width: 100%;
+ height: auto;
+} */
+
+/* .cell-output-display pre {
+ word-break: break-word !important;
+ white-space: pre-wrap !important;
+}
+ */
+
+.navbar a:hover {
+ text-decoration: none;
+}
+
+.cell-output {
+ border: 1px dashed;
+}
+
+.cell-bg {
+ background-color: #f1f3f5;
+}
+
+.cell-output-stdout code {
+ word-break: break-word !important;
+ white-space: pre-wrap !important;
+}
+
+
+.cell-output-display svg {
+ height: fit-content;
+ width: fit-content;
+}
+
+.cell-output-display img {
+ max-width: 100%;
+ max-height: 100%;
+ object-fit: contain;
+}
+
+.nav-footer-center {
+ display: flex;
+ justify-content: center;
+}
+
+.dropdown-menu {
+ text-align: center;
+ min-width: 100px !important;
+ border-radius: 5px;
+ max-height: 250px;
+ overflow: scroll;
}
\ No newline at end of file
diff --git a/tutorials/00-introduction/index.qmd
b/tutorials/00-introduction/index.qmd index 813ce19c3..ac4f981b2 100755 --- a/tutorials/00-introduction/index.qmd +++ b/tutorials/00-introduction/index.qmd @@ -1,231 +1,231 @@ ---- -title: "Introduction: Coin Flipping" -engine: julia -aliases: - - ../ ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -This is the first of a series of guided tutorials on the Turing language. -In this tutorial, we will use Bayesian inference to estimate the probability that a coin flip will result in heads, given a series of observations. - -### Setup - -First, let us load some packages that we need to simulate a coin flip: - -```{julia} -using Distributions - -using Random -Random.seed!(12); # Set seed for reproducibility -``` - -and to visualize our results. - -```{julia} -using StatsPlots -``` - -Note that Turing is not loaded here — we do not use it in this example. -Next, we configure the data generating model. Let us set the true probability that a coin flip turns up heads - -```{julia} -p_true = 0.5; -``` - -and set the number of coin flips we will show our model. - -```{julia} -N = 100; -``` - -We simulate `N` coin flips by drawing N random samples from the Bernoulli distribution with success probability `p_true`. The draws are collected in a variable called `data`: - -```{julia} -data = rand(Bernoulli(p_true), N); -``` - -Here are the first five coin flips: - -```{julia} -data[1:5] -``` - - -### Coin Flipping Without Turing - -The following example illustrates the effect of updating our beliefs with every piece of new evidence we observe. - -Assume that we are unsure about the probability of heads in a coin flip. To get an intuitive understanding of what "updating our beliefs" is, we will visualize the probability of heads in a coin flip after each observed evidence. - -We begin by specifying a prior belief about the distribution of heads and tails in a coin toss. 
Here we choose a [Beta](https://en.wikipedia.org/wiki/Beta_distribution) distribution as prior distribution for the probability of heads. Before any coin flip is observed, we assume a uniform distribution $\operatorname{U}(0, 1) = \operatorname{Beta}(1, 1)$ of the probability of heads. I.e., every probability is equally likely initially. - -```{julia} -prior_belief = Beta(1, 1); -``` - -With our priors set and our data at hand, we can perform Bayesian inference. - -This is a fairly simple process. We expose one additional coin flip to our model every iteration, such that the first run only sees the first coin flip, while the last iteration sees all the coin flips. In each iteration we update our belief to an updated version of the original Beta distribution that accounts for the new proportion of heads and tails. The update is particularly simple since our prior distribution is a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior). Note that a closed-form expression for the posterior (implemented in the `updated_belief` expression below) is not accessible in general and usually does not exist for more interesting models. - -```{julia} -function updated_belief(prior_belief::Beta, data::AbstractArray{Bool}) - # Count the number of heads and tails. - heads = sum(data) - tails = length(data) - heads - - # Update our prior belief in closed form (this is possible because we use a conjugate prior). - return Beta(prior_belief.α + heads, prior_belief.β + tails) -end - -# Show updated belief for increasing number of observations -@gif for n in 0:N - plot( - updated_belief(prior_belief, data[1:n]); - size=(500, 250), - title="Updated belief after $n observations", - xlabel="probability of heads", - ylabel="", - legend=nothing, - xlim=(0, 1), - fill=0, - α=0.3, - w=3, - ) - vline!([p_true]) -end -``` - -The animation above shows that with increasing evidence our belief about the probability of heads in a coin flip slowly adjusts towards the true value. 
-The orange line in the animation represents the true probability of seeing heads on a single coin flip, while the mode of the distribution shows what the model believes the probability of a heads is given the evidence it has seen. - -For the mathematically inclined, the $\operatorname{Beta}$ distribution is updated by adding each coin flip to the parameters $\alpha$ and $\beta$ of the distribution. -Initially, the parameters are defined as $\alpha = 1$ and $\beta = 1$. -Over time, with more and more coin flips, $\alpha$ and $\beta$ will be approximately equal to each other as we are equally likely to flip a heads or a tails. - -The mean of the $\operatorname{Beta}(\alpha, \beta)$ distribution is - -$$\operatorname{E}[X] = \dfrac{\alpha}{\alpha+\beta}.$$ - -This implies that the plot of the distribution will become centered around 0.5 for a large enough number of coin flips, as we expect $\alpha \approx \beta$. - -The variance of the $\operatorname{Beta}(\alpha, \beta)$ distribution is - -$$\operatorname{var}[X] = \dfrac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.$$ - -Thus the variance of the distribution will approach 0 with more and more samples, as the denominator will grow faster than will the numerator. -More samples means less variance. -This implies that the distribution will reflect less uncertainty about the probability of receiving a heads and the plot will become more tightly centered around 0.5 for a large enough number of coin flips. - -### Coin Flipping With Turing - -We now move away from the closed-form expression above. -We use **Turing** to specify the same model and to approximate the posterior distribution with samples. -To do so, we first need to load `Turing`. - -```{julia} -using Turing -``` - -Additionally, we load `MCMCChains`, a library for analyzing and visualizing the samples with which we approximate the posterior distribution. - -```{julia} -using MCMCChains -``` - -First, we define the coin-flip model using Turing. 
- -```{julia} -# Unconditioned coinflip model with `N` observations. -@model function coinflip(; N::Int) - # Our prior belief about the probability of heads in a coin toss. - p ~ Beta(1, 1) - - # Heads or tails of a coin are drawn from `N` independent and identically - # distributed Bernoulli distributions with success rate `p`. - y ~ filldist(Bernoulli(p), N) - - return y -end; -``` - -In the Turing model the prior distribution of the variable `p`, the probability of heads in a coin toss, and the distribution of the observations `y` are specified on the right-hand side of the `~` expressions. -The `@model` macro modifies the body of the Julia function `coinflip` and, e.g., replaces the `~` statements with internal function calls that are used for sampling. - -Here we defined a model that is not conditioned on any specific observations as this allows us to easily obtain samples of both `p` and `y` with - -```{julia} -rand(coinflip(; N)) -``` - -The model can be conditioned on some observations with `|`. -See the [documentation of the `condition` syntax](https://turinglang.github.io/DynamicPPL.jl/stable/api/#Condition-and-decondition) in `DynamicPPL.jl` for more details. -In the conditioned `model` the observations `y` are fixed to `data`. - -```{julia} -coinflip(y::AbstractVector{<:Real}) = coinflip(; N=length(y)) | (; y) - -model = coinflip(data); -``` - -After defining the model, we can approximate the posterior distribution by drawing samples from the distribution. -In this example, we use a [Hamiltonian Monte Carlo](https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo) sampler to draw these samples. -Other tutorials give more information on the samplers available in Turing and discuss their use for different models. 
- -```{julia} -sampler = NUTS(); -``` - -We approximate the posterior distribution with 1000 samples: - -```{julia} -chain = sample(model, sampler, 2_000, progress=false); -``` - -The `sample` function and common keyword arguments are explained more extensively in the documentation of [AbstractMCMC.jl](https://turinglang.github.io/AbstractMCMC.jl/dev/api/). - -After finishing the sampling process, we can visually compare the closed-form posterior distribution with the approximation obtained with Turing. - -```{julia} -histogram(chain) -``` - -Now we can build our plot: - -```{julia} -#| echo: false -@assert isapprox(mean(chain, :p), 0.5; atol=0.1) "Estimated mean of parameter p: $(mean(chain, :p)) - not in [0.4, 0.6]!" -``` - -```{julia} -# Visualize a blue density plot of the approximate posterior distribution using HMC (see Chain 1 in the legend). -density(chain; xlim=(0, 1), legend=:best, w=2, c=:blue) - -# Visualize a green density plot of the posterior distribution in closed-form. -plot!( - 0:0.01:1, - pdf.(updated_belief(prior_belief, data), 0:0.01:1); - xlabel="probability of heads", - ylabel="", - title="", - xlim=(0, 1), - label="Closed-form", - fill=0, - α=0.3, - w=3, - c=:lightgreen, -) - -# Visualize the true probability of heads in red. -vline!([p_true]; label="True probability", c=:red) -``` - -As we can see, the samples obtained with Turing closely approximate the true posterior distribution. -Hopefully this tutorial has provided an easy-to-follow, yet informative introduction to Turing's simpler applications. -More advanced usage is demonstrated in other tutorials. +--- +title: "Introduction: Coin Flipping" +engine: julia +aliases: + - ../ +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +This is the first of a series of guided tutorials on the Turing language. +In this tutorial, we will use Bayesian inference to estimate the probability that a coin flip will result in heads, given a series of observations. 
+
+### Setup
+
+First, let us load some packages that we need to simulate a coin flip:
+
+```{julia}
+using Distributions
+
+using Random
+Random.seed!(12); # Set seed for reproducibility
+```
+
+and to visualize our results.
+
+```{julia}
+using StatsPlots
+```
+
+Note that Turing is not loaded here — we do not use it in this example.
+Next, we configure the data-generating model. Let us set the true probability that a coin flip turns up heads
+
+```{julia}
+p_true = 0.5;
+```
+
+and set the number of coin flips we will show our model.
+
+```{julia}
+N = 100;
+```
+
+We simulate `N` coin flips by drawing `N` random samples from the Bernoulli distribution with success probability `p_true`. The draws are collected in a variable called `data`:
+
+```{julia}
+data = rand(Bernoulli(p_true), N);
+```
+
+Here are the first five coin flips:
+
+```{julia}
+data[1:5]
+```
+
+
+### Coin Flipping Without Turing
+
+The following example illustrates the effect of updating our beliefs with every piece of new evidence we observe.
+
+Assume that we are unsure about the probability of heads in a coin flip. To get an intuitive understanding of what "updating our beliefs" means, we will visualize the probability of heads in a coin flip after each new observation.
+
+We begin by specifying a prior belief about the distribution of heads and tails in a coin toss. Here we choose a [Beta](https://en.wikipedia.org/wiki/Beta_distribution) distribution as the prior distribution for the probability of heads. Before any coin flip is observed, we assume a uniform distribution $\operatorname{U}(0, 1) = \operatorname{Beta}(1, 1)$ of the probability of heads. That is, every probability of heads is initially equally likely.
+
+```{julia}
+prior_belief = Beta(1, 1);
+```
+
+With our priors set and our data at hand, we can perform Bayesian inference.
+
+This is a fairly simple process.
In each iteration, we expose one additional coin flip to our model, so that the first iteration only sees the first coin flip, while the last iteration sees all the coin flips. In each iteration we update our belief to a new Beta distribution that accounts for the observed proportion of heads and tails. The update is particularly simple since our prior distribution is a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior). Note that a closed-form expression for the posterior (implemented in the `updated_belief` function below) is not accessible in general and usually does not exist for more interesting models.
+
+```{julia}
+function updated_belief(prior_belief::Beta, data::AbstractArray{Bool})
+    # Count the number of heads and tails.
+    heads = sum(data)
+    tails = length(data) - heads
+
+    # Update our prior belief in closed form (this is possible because we use a conjugate prior).
+    return Beta(prior_belief.α + heads, prior_belief.β + tails)
+end
+
+# Show updated belief for increasing number of observations
+@gif for n in 0:N
+    plot(
+        updated_belief(prior_belief, data[1:n]);
+        size=(500, 250),
+        title="Updated belief after $n observations",
+        xlabel="probability of heads",
+        ylabel="",
+        legend=nothing,
+        xlim=(0, 1),
+        fill=0,
+        α=0.3,
+        w=3,
+    )
+    vline!([p_true])
+end
+```
+
+The animation above shows that with increasing evidence our belief about the probability of heads in a coin flip slowly adjusts towards the true value.
+The orange line in the animation represents the true probability of seeing heads on a single coin flip, while the mode of the distribution shows what the model believes the probability of heads is given the evidence it has seen.
+
+For the mathematically inclined, the $\operatorname{Beta}$ distribution is updated by adding each coin flip to the parameters $\alpha$ and $\beta$ of the distribution.
+Initially, the parameters are defined as $\alpha = 1$ and $\beta = 1$.
+Over time, with more and more coin flips, $\alpha$ and $\beta$ will be approximately equal to each other, as we are equally likely to flip heads or tails.
+
+The mean of the $\operatorname{Beta}(\alpha, \beta)$ distribution is
+
+$$\operatorname{E}[X] = \dfrac{\alpha}{\alpha+\beta}.$$
+
+This implies that the plot of the distribution will become centered around 0.5 for a large enough number of coin flips, as we expect $\alpha \approx \beta$.
+
+The variance of the $\operatorname{Beta}(\alpha, \beta)$ distribution is
+
+$$\operatorname{var}[X] = \dfrac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.$$
+
+Thus the variance of the distribution will approach 0 with more and more samples, as the denominator grows faster than the numerator.
+More samples mean less variance.
+This implies that the distribution will reflect less uncertainty about the probability of flipping heads, and the plot will become more tightly centered around 0.5 for a large enough number of coin flips.
+
+### Coin Flipping With Turing
+
+We now move away from the closed-form expression above.
+We use **Turing** to specify the same model and to approximate the posterior distribution with samples.
+To do so, we first need to load `Turing`.
+
+```{julia}
+using Turing
+```
+
+Additionally, we load `MCMCChains`, a library for analyzing and visualizing the samples with which we approximate the posterior distribution.
+
+```{julia}
+using MCMCChains
+```
+
+First, we define the coin-flip model using Turing.
+
+```{julia}
+# Unconditioned coinflip model with `N` observations.
+@model function coinflip(; N::Int)
+    # Our prior belief about the probability of heads in a coin toss.
+    p ~ Beta(1, 1)
+
+    # Heads or tails of a coin are drawn from `N` independent and identically
+    # distributed Bernoulli distributions with success rate `p`.
+    y ~ filldist(Bernoulli(p), N)
+
+    return y
+end;
+```
+
+In the Turing model, the prior distribution of the variable `p`, the probability of heads in a coin toss, and the distribution of the observations `y` are specified on the right-hand side of the `~` expressions.
+The `@model` macro modifies the body of the Julia function `coinflip` and, e.g., replaces the `~` statements with internal function calls that are used for sampling.
+
+Here we define a model that is not conditioned on any specific observations, as this allows us to easily obtain samples of both `p` and `y` with
+
+```{julia}
+rand(coinflip(; N))
+```
+
+The model can be conditioned on some observations with `|`.
+See the [documentation of the `condition` syntax](https://turinglang.github.io/DynamicPPL.jl/stable/api/#Condition-and-decondition) in `DynamicPPL.jl` for more details.
+In the conditioned `model` the observations `y` are fixed to `data`.
+
+```{julia}
+coinflip(y::AbstractVector{<:Real}) = coinflip(; N=length(y)) | (; y)
+
+model = coinflip(data);
+```
+
+After defining the model, we can approximate the posterior distribution by drawing samples from the distribution.
+In this example, we use a [Hamiltonian Monte Carlo](https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo) sampler to draw these samples.
+Other tutorials give more information on the samplers available in Turing and discuss their use for different models.
+
+```{julia}
+sampler = NUTS();
+```
+
+We approximate the posterior distribution with 2000 samples:
+
+```{julia}
+chain = sample(model, sampler, 2_000, progress=false);
+```
+
+The `sample` function and common keyword arguments are explained more extensively in the documentation of [AbstractMCMC.jl](https://turinglang.github.io/AbstractMCMC.jl/dev/api/).
+
+After finishing the sampling process, we can visually compare the closed-form posterior distribution with the approximation obtained with Turing.
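The closed-form side of that comparison can also be checked numerically on its own. Here is a minimal standalone sketch of the conjugate Beta update (illustrative only — the counts `h` and `t` below are hypothetical, not the `data` simulated above):

```julia
using Distributions

# Conjugate update: a Beta(α, β) prior observing h heads and t tails
# yields a Beta(α + h, β + t) posterior.
prior = Beta(1, 1)
h, t = 57, 43                        # hypothetical counts from 100 flips
posterior = Beta(prior.α + h, prior.β + t)

mean(posterior)                      # (1 + 57) / (2 + 100) = 58/102 ≈ 0.569

# With more observations the posterior tightens: compare against the
# posterior after only 10 flips (5 heads, 5 tails).
var(posterior) < var(Beta(prior.α + 5, prior.β + 5))
```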
+ +```{julia} +histogram(chain) +``` + +Now we can build our plot: + +```{julia} +#| echo: false +@assert isapprox(mean(chain, :p), 0.5; atol=0.1) "Estimated mean of parameter p: $(mean(chain, :p)) - not in [0.4, 0.6]!" +``` + +```{julia} +# Visualize a blue density plot of the approximate posterior distribution using HMC (see Chain 1 in the legend). +density(chain; xlim=(0, 1), legend=:best, w=2, c=:blue) + +# Visualize a green density plot of the posterior distribution in closed-form. +plot!( + 0:0.01:1, + pdf.(updated_belief(prior_belief, data), 0:0.01:1); + xlabel="probability of heads", + ylabel="", + title="", + xlim=(0, 1), + label="Closed-form", + fill=0, + α=0.3, + w=3, + c=:lightgreen, +) + +# Visualize the true probability of heads in red. +vline!([p_true]; label="True probability", c=:red) +``` + +As we can see, the samples obtained with Turing closely approximate the true posterior distribution. +Hopefully this tutorial has provided an easy-to-follow, yet informative introduction to Turing's simpler applications. +More advanced usage is demonstrated in other tutorials. diff --git a/tutorials/01-gaussian-mixture-model/index.qmd b/tutorials/01-gaussian-mixture-model/index.qmd index ffa3c4dbd..68bda2b37 100755 --- a/tutorials/01-gaussian-mixture-model/index.qmd +++ b/tutorials/01-gaussian-mixture-model/index.qmd @@ -1,413 +1,413 @@ ---- -title: Unsupervised Learning using Bayesian Mixture Models -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -The following tutorial illustrates the use of Turing for clustering data using a Bayesian mixture model. -The aim of this task is to infer a latent grouping (hidden structure) from unlabelled data. - -## Synthetic Data - -We generate a synthetic dataset of $N = 60$ two-dimensional points $x_i \in \mathbb{R}^2$ drawn from a Gaussian mixture model. 
-For simplicity, we use $K = 2$ clusters with - -- equal weights, i.e., we use mixture weights $w = [0.5, 0.5]$, and -- isotropic Gaussian distributions of the points in each cluster. - -More concretely, we use the Gaussian distributions $\mathcal{N}([\mu_k, \mu_k]^\mathsf{T}, I)$ with parameters $\mu_1 = -3.5$ and $\mu_2 = 0.5$. - -```{julia} -using Distributions -using FillArrays -using StatsPlots - -using LinearAlgebra -using Random - -# Set a random seed. -Random.seed!(3) - -# Define Gaussian mixture model. -w = [0.5, 0.5] -μ = [-3.5, 0.5] -mixturemodel = MixtureModel([MvNormal(Fill(μₖ, 2), I) for μₖ in μ], w) - -# We draw the data points. -N = 60 -x = rand(mixturemodel, N); -``` - -The following plot shows the dataset. - -```{julia} -scatter(x[1, :], x[2, :]; legend=false, title="Synthetic Dataset") -``` - -## Gaussian Mixture Model in Turing - -We are interested in recovering the grouping from the dataset. -More precisely, we want to infer the mixture weights, the parameters $\mu_1$ and $\mu_2$, and the assignment of each datum to a cluster for the generative Gaussian mixture model. - -In a Bayesian Gaussian mixture model with $K$ components each data point $x_i$ ($i = 1,\ldots,N$) is generated according to the following generative process. -First we draw the model parameters, i.e., in our example we draw parameters $\mu_k$ for the mean of the isotropic normal distributions and the mixture weights $w$ of the $K$ clusters. 
-We use standard normal distributions as priors for $\mu_k$ and a Dirichlet distribution with parameters $\alpha_1 = \cdots = \alpha_K = 1$ as prior for $w$: -$$ -\begin{aligned} -\mu_k &\sim \mathcal{N}(0, 1) \qquad (k = 1,\ldots,K)\\ -w &\sim \operatorname{Dirichlet}(\alpha_1, \ldots, \alpha_K) -\end{aligned} -$$ -After having constructed all the necessary model parameters, we can generate an observation by first selecting one of the clusters -$$ -z_i \sim \operatorname{Categorical}(w) \qquad (i = 1,\ldots,N), -$$ -and then drawing the datum accordingly, i.e., in our example drawing -$$ -x_i \sim \mathcal{N}([\mu_{z_i}, \mu_{z_i}]^\mathsf{T}, I) \qquad (i=1,\ldots,N). -$$ -For more details on Gaussian mixture models, we refer to Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Section 9. - -We specify the model with Turing. - -```{julia} -using Turing - -@model function gaussian_mixture_model(x) - # Draw the parameters for each of the K=2 clusters from a standard normal distribution. - K = 2 - μ ~ MvNormal(Zeros(K), I) - - # Draw the weights for the K clusters from a Dirichlet distribution with parameters αₖ = 1. - w ~ Dirichlet(K, 1.0) - # Alternatively, one could use a fixed set of weights. - # w = fill(1/K, K) - - # Construct categorical distribution of assignments. - distribution_assignments = Categorical(w) - - # Construct multivariate normal distributions of each cluster. - D, N = size(x) - distribution_clusters = [MvNormal(Fill(μₖ, D), I) for μₖ in μ] - - # Draw assignments for each datum and generate it from the multivariate normal distribution. - k = Vector{Int}(undef, N) - for i in 1:N - k[i] ~ distribution_assignments - x[:, i] ~ distribution_clusters[k[i]] - end - - return k -end - -model = gaussian_mixture_model(x); -``` - -We run a MCMC simulation to obtain an approximation of the posterior distribution of the parameters $\mu$ and $w$ and assignments $k$. 
-We use a `Gibbs` sampler that combines a [particle Gibbs](https://www.stats.ox.ac.uk/%7Edoucet/andrieu_doucet_holenstein_PMCMC.pdf) sampler for the discrete parameters (assignments $k$) and a Hamiltonion Monte Carlo sampler for the continuous parameters ($\mu$ and $w$). -We generate multiple chains in parallel using multi-threading. - -```{julia} -#| output: false -#| echo: false -setprogress!(false) -``` - -```{julia} -#| output: false -sampler = Gibbs(PG(100, :k), HMC(0.05, 10, :μ, :w)) -nsamples = 150 -nchains = 4 -burn = 10 -chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = burn); -``` - -::: {.callout-warning collapse="true"} -## Sampling With Multiple Threads -The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains -will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.](https://turinglang.org/dev/docs/using-turing/guide/#sampling-multiple-chains) -::: - -```{julia} -#| echo: false -let - # Verify that the output of the chain is as expected. - for i in MCMCChains.chains(chains) - # μ[1] and μ[2] can switch places, so we sort the values first. - chain = Array(chains[:, ["μ[1]", "μ[2]"], i]) - μ_mean = vec(mean(chain; dims=1)) - @assert isapprox(sort(μ_mean), μ; rtol=0.1) "Difference between estimated mean of μ ($(sort(μ_mean))) and data-generating μ ($μ) unexpectedly large!" - end -end -``` - -## Inferred Mixture Model - -After sampling we can visualize the trace and density of the parameters of interest. - -We consider the samples of the location parameters $\mu_1$ and $\mu_2$ for the two clusters. - -```{julia} -plot(chains[["μ[1]", "μ[2]"]]; legend=true) -``` - -It can happen that the modes of $\mu_1$ and $\mu_2$ switch between chains. 
-For more information see the [Stan documentation](https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html). This is because it's possible for either model parameter $\mu_k$ to be assigned to either of the corresponding true means, and this assignment need not be consistent between chains. - -That is, the posterior is fundamentally multimodal, and different chains can end up in different modes, complicating inference. -One solution here is to enforce an ordering on our $\mu$ vector, requiring $\mu_k > \mu_{k-1}$ for all $k$. -`Bijectors.jl` [provides](https://turinglang.org/Bijectors.jl/dev/transforms/#Bijectors.OrderedBijector) an easy transformation (`ordered()`) for this purpose: - -```{julia} -@model function gaussian_mixture_model_ordered(x) - # Draw the parameters for each of the K=2 clusters from a standard normal distribution. - K = 2 - μ ~ Bijectors.ordered(MvNormal(Zeros(K), I)) - # Draw the weights for the K clusters from a Dirichlet distribution with parameters αₖ = 1. - w ~ Dirichlet(K, 1.0) - # Alternatively, one could use a fixed set of weights. - # w = fill(1/K, K) - # Construct categorical distribution of assignments. - distribution_assignments = Categorical(w) - # Construct multivariate normal distributions of each cluster. - D, N = size(x) - distribution_clusters = [MvNormal(Fill(μₖ, D), I) for μₖ in μ] - # Draw assignments for each datum and generate it from the multivariate normal distribution. 
- k = Vector{Int}(undef, N) - for i in 1:N - k[i] ~ distribution_assignments - x[:, i] ~ distribution_clusters[k[i]] - end - return k -end - -model = gaussian_mixture_model_ordered(x); -``` - - -Now, re-running our model, we can see that the assigned means are consistent across chains: - -```{julia} -#| output: false -chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = burn); -``` - - -```{julia} -#| echo: false -let - # Verify that the output of the chain is as expected - for i in MCMCChains.chains(chains) - # μ[1] and μ[2] can no longer switch places. Check that they've found the mean - chain = Array(chains[:, ["μ[1]", "μ[2]"], i]) - μ_mean = vec(mean(chain; dims=1)) - @assert isapprox(sort(μ_mean), μ; rtol=0.4) "Difference between estimated mean of μ ($(sort(μ_mean))) and data-generating μ ($μ) unexpectedly large!" - end -end -``` - -```{julia} -plot(chains[["μ[1]", "μ[2]"]]; legend=true) -``` - -We also inspect the samples of the mixture weights $w$. - -```{julia} -plot(chains[["w[1]", "w[2]"]]; legend=true) -``` - -As the distributions of the samples for the parameters $\mu_1$, $\mu_2$, $w_1$, and $w_2$ are unimodal, we can safely visualize the density region of our model using the average values. - -```{julia} -# Model with mean of samples as parameters. -μ_mean = [mean(chains, "μ[$i]") for i in 1:2] -w_mean = [mean(chains, "w[$i]") for i in 1:2] -mixturemodel_mean = MixtureModel([MvNormal(Fill(μₖ, 2), I) for μₖ in μ_mean], w_mean) -contour( - range(-7.5, 3; length=1_000), - range(-6.5, 3; length=1_000), - (x, y) -> logpdf(mixturemodel_mean, [x, y]); - widen=false, -) -scatter!(x[1, :], x[2, :]; legend=false, title="Synthetic Dataset") -``` - -## Inferred Assignments -Finally, we can inspect the assignments of the data points inferred using Turing. -As we can see, the dataset is partitioned into two distinct groups. 
- -```{julia} -assignments = [mean(chains, "k[$i]") for i in 1:N] -scatter( - x[1, :], - x[2, :]; - legend=false, - title="Assignments on Synthetic Dataset", - zcolor=assignments, -) -``` - - -## Marginalizing Out The Assignments -We can write out the marginal posterior of (continuous) $w, \mu$ by summing out the influence of our (discrete) assignments $z_i$ from -our likelihood: -$$ -p(y \mid w, \mu ) = \sum_{k=1}^K w_k p_k(y \mid \mu_k) -$$ -In our case, this gives us: -$$ -p(y \mid w, \mu) = \sum_{k=1}^K w_k \cdot \operatorname{MvNormal}(y \mid \mu_k, I) -$$ - - -### Marginalizing By Hand -We could implement the above version of the Gaussian mixture model in Turing as follows: -First, Turing uses log-probabilities, so the likelihood above must be converted into log-space: -$$ -\log \left( p(y \mid w, \mu) \right) = \text{logsumexp} \left[\log (w_k) + \log(\operatorname{MvNormal}(y \mid \mu_k, I)) \right] -$$ - -Where we sum the components with `logsumexp` from the [`LogExpFunctions.jl` package](https://juliastats.org/LogExpFunctions.jl/stable/). -The manually incremented likelihood can be added to the log-probability with `Turing.@addlogprob!`, giving us the following model: - -```{julia} -#| output: false -using LogExpFunctions - -@model function gmm_marginalized(x) - K = 2 - D, N = size(x) - μ ~ Bijectors.ordered(MvNormal(Zeros(K), I)) - w ~ Dirichlet(K, 1.0) - dists = [MvNormal(Fill(μₖ, D), I) for μₖ in μ] - for i in 1:N - lvec = Vector(undef, K) - for k in 1:K - lvec[k] = (w[k] + logpdf(dists[k], x[:, i])) - end - Turing.@addlogprob! logsumexp(lvec) - end -end -``` - -::: {.callout-warning collapse="false"} -## Manually Incrementing Probablity - -When possible, use of `Turing.@addlogprob!` should be avoided, as it exists outside the -usual structure of a Turing model. In most cases, a custom distribution should be used instead. 
- -Here, the next section demonstrates the perfered method --- using the `MixtureModel` distribution we have seen already to -perform the marginalization automatically. -::: - - -### Marginalizing For Free With Distribution.jl's MixtureModel Implementation - -We can use Turing's `~` syntax with anything that `Distributions.jl` provides `logpdf` and `rand` methods for. It turns out that the -`MixtureModel` distribution it provides has, as its `logpdf` method, `logpdf(MixtureModel([Component_Distributions], weight_vector), Y)`, where `Y` can be either a single observation or vector of observations. - -In fact, `Distributions.jl` provides [many convenient constructors](https://juliastats.org/Distributions.jl/stable/mixture/) for mixture models, allowing further simplification in common special cases. - -For example, when mixtures distributions are of the same type, one can write: `~ MixtureModel(Normal, [(μ1, σ1), (μ2, σ2)], w)`, or when the weight vector is known to allocate probability equally, it can be ommited. - -The `logpdf` implementation for a `MixtureModel` distribution is exactly the marginalization defined above, and so our model becomes simply: - -```{julia} -#| output: false -@model function gmm_marginalized(x) - K = 2 - D, _ = size(x) - μ ~ Bijectors.ordered(MvNormal(Zeros(K), I)) - w ~ Dirichlet(K, 1.0) - x ~ MixtureModel([MvNormal(Fill(μₖ, D), I) for μₖ in μ], w) -end -model = gmm_marginalized(x); -``` - -As we've summed out the discrete components, we can perform inference using `NUTS()` alone. - -```{julia} -#| output: false -sampler = NUTS() -chains = sample(model, sampler, MCMCThreads(), nsamples, nchains; discard_initial = burn); -``` - - -```{julia} -#| echo: false -let - # Verify for marginalized model that the output of the chain is as expected - for i in MCMCChains.chains(chains) - # μ[1] and μ[2] can no longer switch places. 
Check that they've found the mean - chain = Array(chains[:, ["μ[1]", "μ[2]"], i]) - μ_mean = vec(mean(chain; dims=1)) - @assert isapprox(sort(μ_mean), μ; rtol=0.4) "Difference between estimated mean of μ ($(sort(μ_mean))) and data-generating μ ($μ) unexpectedly large!" - end -end -``` - -`NUTS()` significantly outperforms our compositional Gibbs sampler, in large part because our model is now Rao-Blackwellized thanks to -the marginalization of our assignment parameter. - -```{julia} -plot(chains[["μ[1]", "μ[2]"]], legend=true) -``` - -## Inferred Assignments - Marginalized Model -As we've summed over possible assignments, the associated parameter is no longer available in our chain. -This is not a problem, however, as given any fixed sample $(\mu, w)$, the assignment probability — $p(z_i \mid y_i)$ — can be recovered using Bayes rule: -$$ -p(z_i \mid y_i) = \frac{p(y_i \mid z_i) p(z_i)}{\sum_{k = 1}^K \left(p(y_i \mid z_i) p(z_i) \right)} -$$ - -This quantity can be computed for every $p(z = z_i \mid y_i)$, resulting in a probability vector, which is then used to sample -posterior predictive assignments from a categorial distribution. -For details on the mathematics here, see [the Stan documentation on latent discrete parameters](https://mc-stan.org/docs/stan-users-guide/latent-discrete.html). -```{julia} -#| output: false -function sample_class(xi, dists, w) - lvec = [(logpdf(d, xi) + log(w[i])) for (i, d) in enumerate(dists)] - rand(Categorical(softmax(lvec))) -end - -@model function gmm_recover(x) - K = 2 - D, N = size(x) - μ ~ Bijectors.ordered(MvNormal(Zeros(K), I)) - w ~ Dirichlet(K, 1.0) - dists = [MvNormal(Fill(μₖ, D), I) for μₖ in μ] - x ~ MixtureModel(dists, w) - # Return assignment draws for each datapoint. 
- return [sample_class(x[:, i], dists, w) for i in 1:N] -end -``` - -We sample from this model as before: - -```{julia} -#| output: false -model = gmm_recover(x) -chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = burn); -``` - -Given a sample from the marginalized posterior, these assignments can be recovered with: - -```{julia} -assignments = mean(generated_quantities(gmm_recover(x), chains)); -``` - -```{julia} -scatter( - x[1, :], - x[2, :]; - legend=false, - title="Assignments on Synthetic Dataset - Recovered", - zcolor=assignments, -) -``` +--- +title: Unsupervised Learning using Bayesian Mixture Models +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +The following tutorial illustrates the use of Turing for clustering data using a Bayesian mixture model. +The aim of this task is to infer a latent grouping (hidden structure) from unlabelled data. + +## Synthetic Data + +We generate a synthetic dataset of $N = 60$ two-dimensional points $x_i \in \mathbb{R}^2$ drawn from a Gaussian mixture model. +For simplicity, we use $K = 2$ clusters with + +- equal weights, i.e., we use mixture weights $w = [0.5, 0.5]$, and +- isotropic Gaussian distributions of the points in each cluster. + +More concretely, we use the Gaussian distributions $\mathcal{N}([\mu_k, \mu_k]^\mathsf{T}, I)$ with parameters $\mu_1 = -3.5$ and $\mu_2 = 0.5$. + +```{julia} +using Distributions +using FillArrays +using StatsPlots + +using LinearAlgebra +using Random + +# Set a random seed. +Random.seed!(3) + +# Define Gaussian mixture model. +w = [0.5, 0.5] +μ = [-3.5, 0.5] +mixturemodel = MixtureModel([MvNormal(Fill(μₖ, 2), I) for μₖ in μ], w) + +# We draw the data points. +N = 60 +x = rand(mixturemodel, N); +``` + +The following plot shows the dataset. 
+ +```{julia} +scatter(x[1, :], x[2, :]; legend=false, title="Synthetic Dataset") +``` + +## Gaussian Mixture Model in Turing + +We are interested in recovering the grouping from the dataset. +More precisely, we want to infer the mixture weights, the parameters $\mu_1$ and $\mu_2$, and the assignment of each datum to a cluster for the generative Gaussian mixture model. + +In a Bayesian Gaussian mixture model with $K$ components each data point $x_i$ ($i = 1,\ldots,N$) is generated according to the following generative process. +First we draw the model parameters, i.e., in our example we draw parameters $\mu_k$ for the mean of the isotropic normal distributions and the mixture weights $w$ of the $K$ clusters. +We use standard normal distributions as priors for $\mu_k$ and a Dirichlet distribution with parameters $\alpha_1 = \cdots = \alpha_K = 1$ as prior for $w$: +$$ +\begin{aligned} +\mu_k &\sim \mathcal{N}(0, 1) \qquad (k = 1,\ldots,K)\\ +w &\sim \operatorname{Dirichlet}(\alpha_1, \ldots, \alpha_K) +\end{aligned} +$$ +After having constructed all the necessary model parameters, we can generate an observation by first selecting one of the clusters +$$ +z_i \sim \operatorname{Categorical}(w) \qquad (i = 1,\ldots,N), +$$ +and then drawing the datum accordingly, i.e., in our example drawing +$$ +x_i \sim \mathcal{N}([\mu_{z_i}, \mu_{z_i}]^\mathsf{T}, I) \qquad (i=1,\ldots,N). +$$ +For more details on Gaussian mixture models, we refer to Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Section 9. + +We specify the model with Turing. + +```{julia} +using Turing + +@model function gaussian_mixture_model(x) + # Draw the parameters for each of the K=2 clusters from a standard normal distribution. + K = 2 + μ ~ MvNormal(Zeros(K), I) + + # Draw the weights for the K clusters from a Dirichlet distribution with parameters αₖ = 1. + w ~ Dirichlet(K, 1.0) + # Alternatively, one could use a fixed set of weights. 
+    # w = fill(1/K, K)
+
+    # Construct categorical distribution of assignments.
+    distribution_assignments = Categorical(w)
+
+    # Construct multivariate normal distributions of each cluster.
+    D, N = size(x)
+    distribution_clusters = [MvNormal(Fill(μₖ, D), I) for μₖ in μ]
+
+    # Draw assignments for each datum and generate it from the multivariate normal distribution.
+    k = Vector{Int}(undef, N)
+    for i in 1:N
+        k[i] ~ distribution_assignments
+        x[:, i] ~ distribution_clusters[k[i]]
+    end
+
+    return k
+end
+
+model = gaussian_mixture_model(x);
+```
+
+We run an MCMC simulation to obtain an approximation of the posterior distribution of the parameters $\mu$ and $w$, and of the assignments $k$.
+We use a `Gibbs` sampler that combines a [particle Gibbs](https://www.stats.ox.ac.uk/%7Edoucet/andrieu_doucet_holenstein_PMCMC.pdf) sampler for the discrete parameters (assignments $k$) and a Hamiltonian Monte Carlo sampler for the continuous parameters ($\mu$ and $w$).
+We generate multiple chains in parallel using multi-threading.
+
+```{julia}
+#| output: false
+#| echo: false
+setprogress!(false)
+```
+
+```{julia}
+#| output: false
+sampler = Gibbs(PG(100, :k), HMC(0.05, 10, :μ, :w))
+nsamples = 150
+nchains = 4
+burn = 10
+chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = burn);
+```
+
+::: {.callout-warning collapse="true"}
+## Sampling With Multiple Threads
+The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains
+will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.](https://turinglang.org/dev/docs/using-turing/guide/#sampling-multiple-chains)
+:::
+
+```{julia}
+#| echo: false
+let
+    # Verify that the output of the chain is as expected.
+    for i in MCMCChains.chains(chains)
+        # μ[1] and μ[2] can switch places, so we sort the values first.
+ chain = Array(chains[:, ["μ[1]", "μ[2]"], i]) + μ_mean = vec(mean(chain; dims=1)) + @assert isapprox(sort(μ_mean), μ; rtol=0.1) "Difference between estimated mean of μ ($(sort(μ_mean))) and data-generating μ ($μ) unexpectedly large!" + end +end +``` + +## Inferred Mixture Model + +After sampling we can visualize the trace and density of the parameters of interest. + +We consider the samples of the location parameters $\mu_1$ and $\mu_2$ for the two clusters. + +```{julia} +plot(chains[["μ[1]", "μ[2]"]]; legend=true) +``` + +It can happen that the modes of $\mu_1$ and $\mu_2$ switch between chains. +For more information see the [Stan documentation](https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html). This is because it's possible for either model parameter $\mu_k$ to be assigned to either of the corresponding true means, and this assignment need not be consistent between chains. + +That is, the posterior is fundamentally multimodal, and different chains can end up in different modes, complicating inference. +One solution here is to enforce an ordering on our $\mu$ vector, requiring $\mu_k > \mu_{k-1}$ for all $k$. +`Bijectors.jl` [provides](https://turinglang.org/Bijectors.jl/dev/transforms/#Bijectors.OrderedBijector) an easy transformation (`ordered()`) for this purpose: + +```{julia} +@model function gaussian_mixture_model_ordered(x) + # Draw the parameters for each of the K=2 clusters from a standard normal distribution. + K = 2 + μ ~ Bijectors.ordered(MvNormal(Zeros(K), I)) + # Draw the weights for the K clusters from a Dirichlet distribution with parameters αₖ = 1. + w ~ Dirichlet(K, 1.0) + # Alternatively, one could use a fixed set of weights. + # w = fill(1/K, K) + # Construct categorical distribution of assignments. + distribution_assignments = Categorical(w) + # Construct multivariate normal distributions of each cluster. 
+ D, N = size(x) + distribution_clusters = [MvNormal(Fill(μₖ, D), I) for μₖ in μ] + # Draw assignments for each datum and generate it from the multivariate normal distribution. + k = Vector{Int}(undef, N) + for i in 1:N + k[i] ~ distribution_assignments + x[:, i] ~ distribution_clusters[k[i]] + end + return k +end + +model = gaussian_mixture_model_ordered(x); +``` + + +Now, re-running our model, we can see that the assigned means are consistent across chains: + +```{julia} +#| output: false +chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = burn); +``` + + +```{julia} +#| echo: false +let + # Verify that the output of the chain is as expected + for i in MCMCChains.chains(chains) + # μ[1] and μ[2] can no longer switch places. Check that they've found the mean + chain = Array(chains[:, ["μ[1]", "μ[2]"], i]) + μ_mean = vec(mean(chain; dims=1)) + @assert isapprox(sort(μ_mean), μ; rtol=0.4) "Difference between estimated mean of μ ($(sort(μ_mean))) and data-generating μ ($μ) unexpectedly large!" + end +end +``` + +```{julia} +plot(chains[["μ[1]", "μ[2]"]]; legend=true) +``` + +We also inspect the samples of the mixture weights $w$. + +```{julia} +plot(chains[["w[1]", "w[2]"]]; legend=true) +``` + +As the distributions of the samples for the parameters $\mu_1$, $\mu_2$, $w_1$, and $w_2$ are unimodal, we can safely visualize the density region of our model using the average values. + +```{julia} +# Model with mean of samples as parameters. +μ_mean = [mean(chains, "μ[$i]") for i in 1:2] +w_mean = [mean(chains, "w[$i]") for i in 1:2] +mixturemodel_mean = MixtureModel([MvNormal(Fill(μₖ, 2), I) for μₖ in μ_mean], w_mean) +contour( + range(-7.5, 3; length=1_000), + range(-6.5, 3; length=1_000), + (x, y) -> logpdf(mixturemodel_mean, [x, y]); + widen=false, +) +scatter!(x[1, :], x[2, :]; legend=false, title="Synthetic Dataset") +``` + +## Inferred Assignments +Finally, we can inspect the assignments of the data points inferred using Turing. 
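Because each assignment `k[i]` takes values in $\{1, 2\}$, the per-datum posterior mean computed below acts as a soft cluster score: values near 1 or 2 indicate confident assignments, while values near 1.5 indicate ambiguity. A toy illustration with hypothetical draws (not taken from the chain above):

```julia
using Statistics

# Toy illustration with made-up posterior draws of a single assignment
# variable k ∈ {1, 2}; averaging the draws yields a soft cluster score.
draws_confident = [1, 1, 1, 1, 2]     # datum almost surely in cluster 1
draws_uncertain = [1, 2, 1, 2, 1, 2]  # datum on the boundary

mean(draws_confident)  # 1.2, leans towards cluster 1
mean(draws_uncertain)  # 1.5, ambiguous
```

The same averaging is what `mean(chains, "k[$i]")` performs over the sampled chain below.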
+As we can see, the dataset is partitioned into two distinct groups.
+
+```{julia}
+assignments = [mean(chains, "k[$i]") for i in 1:N]
+scatter(
+    x[1, :],
+    x[2, :];
+    legend=false,
+    title="Assignments on Synthetic Dataset",
+    zcolor=assignments,
+)
+```
+
+## Marginalizing Out The Assignments
+
+We can write out the marginal posterior of the (continuous) parameters $w$ and $\mu$ by summing the influence of our (discrete) assignments $z_i$ out of our likelihood:
+$$
+p(y \mid w, \mu) = \sum_{k=1}^K w_k p_k(y \mid \mu_k)
+$$
+In our case, this gives us:
+$$
+p(y \mid w, \mu) = \sum_{k=1}^K w_k \cdot \operatorname{MvNormal}(y \mid \mu_k, I)
+$$
+
+### Marginalizing By Hand
+
+We can implement this marginalized version of the Gaussian mixture model in Turing by hand.
+First, Turing works with log-probabilities, so the likelihood above must be converted into log-space:
+$$
+\log p(y \mid w, \mu) = \operatorname{logsumexp}_{k=1}^{K} \left[ \log w_k + \log \operatorname{MvNormal}(y \mid \mu_k, I) \right]
+$$
+Here we sum the components with `logsumexp` from the [`LogExpFunctions.jl` package](https://juliastats.org/LogExpFunctions.jl/stable/).
+The manually incremented likelihood can be added to the log-probability with `Turing.@addlogprob!`, giving us the following model (note that the weights enter on the log scale, as `log(w[k])`):
+
+```{julia}
+#| output: false
+using LogExpFunctions
+
+@model function gmm_marginalized(x)
+    K = 2
+    D, N = size(x)
+    μ ~ Bijectors.ordered(MvNormal(Zeros(K), I))
+    w ~ Dirichlet(K, 1.0)
+    dists = [MvNormal(Fill(μₖ, D), I) for μₖ in μ]
+    for i in 1:N
+        lvec = Vector(undef, K)
+        for k in 1:K
+            lvec[k] = log(w[k]) + logpdf(dists[k], x[:, i])
+        end
+        Turing.@addlogprob! logsumexp(lvec)
+    end
+end
+```
+
+::: {.callout-warning collapse="false"}
+## Manually Incrementing Probability
+
+When possible, use of `Turing.@addlogprob!` should be avoided, as it exists outside the
+usual structure of a Turing model. In most cases, a custom distribution should be used instead.
+
+Here, the next section demonstrates the preferred method: using the `MixtureModel` distribution we have seen already to
+perform the marginalization automatically.
+:::
+
+### Marginalizing For Free With Distributions.jl's MixtureModel Implementation
+
+We can use Turing's `~` syntax with anything for which `Distributions.jl` provides `logpdf` and `rand` methods. In particular, the
+`MixtureModel` distribution it provides has, as its `logpdf` method, `logpdf(MixtureModel([Component_Distributions], weight_vector), Y)`, where `Y` can be either a single observation or a vector of observations.
+
+In fact, `Distributions.jl` provides [many convenient constructors](https://juliastats.org/Distributions.jl/stable/mixture/) for mixture models, allowing further simplification in common special cases.
+
+For example, when the mixture's component distributions are all of the same type, one can write `~ MixtureModel(Normal, [(μ1, σ1), (μ2, σ2)], w)`, and when the weight vector is known to allocate probability equally, it can be omitted.
+
+The `logpdf` implementation for a `MixtureModel` distribution is exactly the marginalization defined above, and so our model simplifies to:
+
+```{julia}
+#| output: false
+@model function gmm_marginalized(x)
+    K = 2
+    D, _ = size(x)
+    μ ~ Bijectors.ordered(MvNormal(Zeros(K), I))
+    w ~ Dirichlet(K, 1.0)
+    x ~ MixtureModel([MvNormal(Fill(μₖ, D), I) for μₖ in μ], w)
+end
+model = gmm_marginalized(x);
+```
+
+As we've summed out the discrete components, we can perform inference using `NUTS()` alone.
+
+```{julia}
+#| output: false
+sampler = NUTS()
+chains = sample(model, sampler, MCMCThreads(), nsamples, nchains; discard_initial = burn);
+```
+
+```{julia}
+#| echo: false
+let
+    # Verify for marginalized model that the output of the chain is as expected
+    for i in MCMCChains.chains(chains)
+        # μ[1] and μ[2] can no longer switch places.
Check that they've found the mean
+        chain = Array(chains[:, ["μ[1]", "μ[2]"], i])
+        μ_mean = vec(mean(chain; dims=1))
+        @assert isapprox(sort(μ_mean), μ; rtol=0.4) "Difference between estimated mean of μ ($(sort(μ_mean))) and data-generating μ ($μ) unexpectedly large!"
+    end
+end
+```
+
+`NUTS()` significantly outperforms our compositional Gibbs sampler, in large part because our model is now Rao-Blackwellized thanks to
+the marginalization of our assignment parameter.
+
+```{julia}
+plot(chains[["μ[1]", "μ[2]"]], legend=true)
+```
+
+## Inferred Assignments - Marginalized Model
+
+As we've summed over the possible assignments, the associated parameter is no longer available in our chain.
+This is not a problem, however: given any fixed sample $(\mu, w)$, the assignment probability $p(z_i \mid y_i)$ can be recovered using Bayes' rule:
+$$
+p(z_i \mid y_i) = \frac{p(y_i \mid z_i) p(z_i)}{\sum_{k = 1}^K p(y_i \mid z_i = k) p(z_i = k)}
+$$
+
+Computing this for each possible value of $z_i$ yields a probability vector, which is then used to sample
+posterior predictive assignments from a categorical distribution.
+For details on the mathematics here, see [the Stan documentation on latent discrete parameters](https://mc-stan.org/docs/stan-users-guide/latent-discrete.html).
+
+```{julia}
+#| output: false
+function sample_class(xi, dists, w)
+    lvec = [(logpdf(d, xi) + log(w[i])) for (i, d) in enumerate(dists)]
+    rand(Categorical(softmax(lvec)))
+end
+
+@model function gmm_recover(x)
+    K = 2
+    D, N = size(x)
+    μ ~ Bijectors.ordered(MvNormal(Zeros(K), I))
+    w ~ Dirichlet(K, 1.0)
+    dists = [MvNormal(Fill(μₖ, D), I) for μₖ in μ]
+    x ~ MixtureModel(dists, w)
+    # Return assignment draws for each datapoint.
+ return [sample_class(x[:, i], dists, w) for i in 1:N] +end +``` + +We sample from this model as before: + +```{julia} +#| output: false +model = gmm_recover(x) +chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = burn); +``` + +Given a sample from the marginalized posterior, these assignments can be recovered with: + +```{julia} +assignments = mean(generated_quantities(gmm_recover(x), chains)); +``` + +```{julia} +scatter( + x[1, :], + x[2, :]; + legend=false, + title="Assignments on Synthetic Dataset - Recovered", + zcolor=assignments, +) +``` diff --git a/tutorials/02-logistic-regression/index.qmd b/tutorials/02-logistic-regression/index.qmd index 74392ace1..7594e0ea2 100755 --- a/tutorials/02-logistic-regression/index.qmd +++ b/tutorials/02-logistic-regression/index.qmd @@ -1,274 +1,274 @@ ---- -title: Bayesian Logistic Regression -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -[Bayesian logistic regression](https://en.wikipedia.org/wiki/Logistic_regression#Bayesian) is the Bayesian counterpart to a common tool in machine learning, logistic regression. -The goal of logistic regression is to predict a one or a zero for a given training item. -An example might be predicting whether someone is sick or ill given their symptoms and personal information. - -In our example, we'll be working to predict whether someone is likely to default with a synthetic dataset found in the `RDatasets` package. This dataset, `Defaults`, comes from R's [ISLR](https://cran.r-project.org/web/packages/ISLR/index.html) package and contains information on borrowers. - -To start, let's import all the libraries we'll need. - -```{julia} -# Import Turing and Distributions. -using Turing, Distributions - -# Import RDatasets. -using RDatasets - -# Import MCMCChains, Plots, and StatsPlots for visualizations and diagnostics. 
-using MCMCChains, Plots, StatsPlots - -# We need a logistic function, which is provided by StatsFuns. -using StatsFuns: logistic - -# Functionality for splitting and normalizing the data -using MLDataUtils: shuffleobs, stratifiedobs, rescale! - -# Set a seed for reproducibility. -using Random -Random.seed!(0); -``` - -## Data Cleaning & Set Up - -Now we're going to import our dataset. The first six rows of the dataset are shown below so you can get a good feel for what kind of data we have. - -```{julia} -# Import the "Default" dataset. -data = RDatasets.dataset("ISLR", "Default"); - -# Show the first six rows of the dataset. -first(data, 6) -``` - -Most machine learning processes require some effort to tidy up the data, and this is no different. We need to convert the `Default` and `Student` columns, which say "Yes" or "No" into 1s and 0s. Afterwards, we'll get rid of the old words-based columns. - -```{julia} -# Convert "Default" and "Student" to numeric values. -data[!, :DefaultNum] = [r.Default == "Yes" ? 1.0 : 0.0 for r in eachrow(data)] -data[!, :StudentNum] = [r.Student == "Yes" ? 1.0 : 0.0 for r in eachrow(data)] - -# Delete the old columns which say "Yes" and "No". -select!(data, Not([:Default, :Student])) - -# Show the first six rows of our edited dataset. -first(data, 6) -``` - -After we've done that tidying, it's time to split our dataset into training and testing sets, and separate the labels from the data. We separate our data into two halves, `train` and `test`. You can use a higher percentage of splitting (or a lower one) by modifying the `at = 0.05` argument. We have highlighted the use of only a 5% sample to show the power of Bayesian inference with small sample sizes. - -We must rescale our variables so that they are centered around zero by subtracting each column by the mean and dividing it by the standard deviation. Without this step, Turing's sampler will have a hard time finding a place to start searching for parameter estimates. 
To do this we will leverage `MLDataUtils`, which also lets us effortlessly shuffle our observations and perform a stratified split to get a representative test set. - -```{julia} -function split_data(df, target; at=0.70) - shuffled = shuffleobs(df) - return trainset, testset = stratifiedobs(row -> row[target], shuffled; p=at) -end - -features = [:StudentNum, :Balance, :Income] -numerics = [:Balance, :Income] -target = :DefaultNum - -trainset, testset = split_data(data, target; at=0.05) -for feature in numerics - μ, σ = rescale!(trainset[!, feature]; obsdim=1) - rescale!(testset[!, feature], μ, σ; obsdim=1) -end - -# Turing requires data in matrix form, not dataframe -train = Matrix(trainset[:, features]) -test = Matrix(testset[:, features]) -train_label = trainset[:, target] -test_label = testset[:, target]; -``` - -## Model Declaration - -Finally, we can define our model. - -`logistic_regression` takes four arguments: - - - `x` is our set of independent variables; - - `y` is the element we want to predict; - - `n` is the number of observations we have; and - - `σ` is the standard deviation we want to assume for our priors. - -Within the model, we create four coefficients (`intercept`, `student`, `balance`, and `income`) and assign a prior of normally distributed with means of zero and standard deviations of `σ`. We want to find values of these four coefficients to predict any given `y`. - -The `for` block creates a variable `v` which is the logistic function. We then observe the likelihood of calculating `v` given the actual label, `y[i]`. - -```{julia} -# Bayesian logistic regression (LR) -@model function logistic_regression(x, y, n, σ) - intercept ~ Normal(0, σ) - - student ~ Normal(0, σ) - balance ~ Normal(0, σ) - income ~ Normal(0, σ) - - for i in 1:n - v = logistic(intercept + student * x[i, 1] + balance * x[i, 2] + income * x[i, 3]) - y[i] ~ Bernoulli(v) - end -end; -``` - -## Sampling - -Now we can run our sampler. 
This time we'll use [`NUTS`](https://turinglang.org/stable/docs/library/#Turing.Inference.NUTS) to sample from our posterior. - -```{julia} -#| output: false -setprogress!(false) -``` - -```{julia} -#| output: false -# Retrieve the number of observations. -n, _ = size(train) - -# Sample using NUTS. -m = logistic_regression(train, train_label, n, 1) -chain = sample(m, NUTS(), MCMCThreads(), 1_500, 3) -``` - -```{julia} -#| echo: false -chain -``` - -::: {.callout-warning collapse="true"} -## Sampling With Multiple Threads -The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains -will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.](https://turinglang.org/dev/docs/using-turing/guide/#sampling-multiple-chains) -::: - -```{julia} -#| echo: false -let - mean_params = mean(chain) - @assert mean_params[:student, :mean] < 0.1 - @assert mean_params[:balance, :mean] > 1 -end -``` - -Since we ran multiple chains, we may as well do a spot check to make sure each chain converges around similar points. - -```{julia} -plot(chain) -``` - -```{julia} -#| echo: false -let - mean_params = mapreduce(hcat, mean(chain; append_chains=false)) do df - return df[:, :mean] - end - for i in (2, 3) - @assert mean_params[:, i] != mean_params[:, 1] - @assert isapprox(mean_params[:, i], mean_params[:, 1]; rtol=5e-2) - end -end -``` - -Looks good! - -We can also use the `corner` function from MCMCChains to show the distributions of the various parameters of our logistic regression. - -```{julia} -# The labels to use. -l = [:student, :balance, :income] - -# Use the corner function. Requires StatsPlots and MCMCChains. 
-corner(chain, l) -``` - -Fortunately the corner plot appears to demonstrate unimodal distributions for each of our parameters, so it should be straightforward to take the means of each parameter's sampled values to estimate our model to make predictions. - -## Making Predictions - -How do we test how well the model actually predicts whether someone is likely to default? We need to build a prediction function that takes the `test` object we made earlier and runs it through the average parameter calculated during sampling. - -The `prediction` function below takes a `Matrix` and a `Chain` object. It takes the mean of each parameter's sampled values and re-runs the logistic function using those mean values for every element in the test set. - -```{julia} -function prediction(x::Matrix, chain, threshold) - # Pull the means from each parameter's sampled values in the chain. - intercept = mean(chain[:intercept]) - student = mean(chain[:student]) - balance = mean(chain[:balance]) - income = mean(chain[:income]) - - # Retrieve the number of rows. - n, _ = size(x) - - # Generate a vector to store our predictions. - v = Vector{Float64}(undef, n) - - # Calculate the logistic function for each element in the test set. - for i in 1:n - num = logistic( - intercept .+ student * x[i, 1] + balance * x[i, 2] + income * x[i, 3] - ) - if num >= threshold - v[i] = 1 - else - v[i] = 0 - end - end - return v -end; -``` - -Let's see how we did! We run the test matrix through the prediction function, and compute the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error) (MSE) for our prediction. The `threshold` variable sets the sensitivity of the predictions. For example, a threshold of 0.07 will predict a defualt value of 1 for any predicted value greater than 0.07 and no default if it is less than 0.07. - -```{julia} -# Set the prediction threshold. -threshold = 0.07 - -# Make the predictions. 
-predictions = prediction(test, chain, threshold)
-
-# Calculate MSE for our test set.
-loss = sum((predictions - test_label) .^ 2) / length(test_label)
-```
-
-Perhaps more important is to see what percentage of defaults we correctly predicted. The code below simply counts defaults and predictions and presents the results.
-
-```{julia}
-defaults = sum(test_label)
-not_defaults = length(test_label) - defaults
-
-predicted_defaults = sum(test_label .== predictions .== 1)
-predicted_not_defaults = sum(test_label .== predictions .== 0)
-
-println("Defaults: $defaults
-    Predictions: $predicted_defaults
-    Percentage defaults correct $(predicted_defaults/defaults)")
-
-println("Not defaults: $not_defaults
-    Predictions: $predicted_not_defaults
-    Percentage non-defaults correct $(predicted_not_defaults/not_defaults)")
-```
-
-```{julia}
-#| echo: false
-let
-    percentage_correct = predicted_defaults / defaults
-    @assert 0.6 < percentage_correct
-end
-```
-
-The above shows that with a threshold of 0.07, we correctly predict a respectable portion of the defaults, and correctly identify most non-defaults. This is fairly sensitive to a choice of threshold, and you may wish to experiment with it.
-
+---
+title: Bayesian Logistic Regression
+engine: julia
+---
+
+```{julia}
+#| echo: false
+#| output: false
+using Pkg;
+Pkg.instantiate();
+```
+
+[Bayesian logistic regression](https://en.wikipedia.org/wiki/Logistic_regression#Bayesian) is the Bayesian counterpart to a common tool in machine learning, logistic regression.
+The goal of logistic regression is to predict a one or a zero for a given training item.
+An example might be predicting whether someone is sick or healthy given their symptoms and personal information.
+
+In our example, we'll be working to predict whether someone is likely to default, using a synthetic dataset found in the `RDatasets` package.
This dataset, `Defaults`, comes from R's [ISLR](https://cran.r-project.org/web/packages/ISLR/index.html) package and contains information on borrowers. + +To start, let's import all the libraries we'll need. + +```{julia} +# Import Turing and Distributions. +using Turing, Distributions + +# Import RDatasets. +using RDatasets + +# Import MCMCChains, Plots, and StatsPlots for visualizations and diagnostics. +using MCMCChains, Plots, StatsPlots + +# We need a logistic function, which is provided by StatsFuns. +using StatsFuns: logistic + +# Functionality for splitting and normalizing the data +using MLDataUtils: shuffleobs, stratifiedobs, rescale! + +# Set a seed for reproducibility. +using Random +Random.seed!(0); +``` + +## Data Cleaning & Set Up + +Now we're going to import our dataset. The first six rows of the dataset are shown below so you can get a good feel for what kind of data we have. + +```{julia} +# Import the "Default" dataset. +data = RDatasets.dataset("ISLR", "Default"); + +# Show the first six rows of the dataset. +first(data, 6) +``` + +Most machine learning processes require some effort to tidy up the data, and this is no different. We need to convert the `Default` and `Student` columns, which say "Yes" or "No" into 1s and 0s. Afterwards, we'll get rid of the old words-based columns. + +```{julia} +# Convert "Default" and "Student" to numeric values. +data[!, :DefaultNum] = [r.Default == "Yes" ? 1.0 : 0.0 for r in eachrow(data)] +data[!, :StudentNum] = [r.Student == "Yes" ? 1.0 : 0.0 for r in eachrow(data)] + +# Delete the old columns which say "Yes" and "No". +select!(data, Not([:Default, :Student])) + +# Show the first six rows of our edited dataset. +first(data, 6) +``` + +After we've done that tidying, it's time to split our dataset into training and testing sets, and separate the labels from the data. We separate our data into two halves, `train` and `test`. 
You can use a larger (or smaller) training fraction by modifying the `at = 0.05` argument below. We have highlighted the use of only a 5% sample to show the power of Bayesian inference with small sample sizes.
+
+We must rescale our variables so that they are centered around zero, by subtracting the mean of each column and dividing by its standard deviation. Without this step, Turing's sampler will have a hard time finding a place to start searching for parameter estimates. To do this we will leverage `MLDataUtils`, which also lets us effortlessly shuffle our observations and perform a stratified split to get a representative test set.
+
+```{julia}
+function split_data(df, target; at=0.70)
+    shuffled = shuffleobs(df)
+    return trainset, testset = stratifiedobs(row -> row[target], shuffled; p=at)
+end
+
+features = [:StudentNum, :Balance, :Income]
+numerics = [:Balance, :Income]
+target = :DefaultNum
+
+trainset, testset = split_data(data, target; at=0.05)
+for feature in numerics
+    μ, σ = rescale!(trainset[!, feature]; obsdim=1)
+    rescale!(testset[!, feature], μ, σ; obsdim=1)
+end
+
+# Turing requires data in matrix form, not dataframe
+train = Matrix(trainset[:, features])
+test = Matrix(testset[:, features])
+train_label = trainset[:, target]
+test_label = testset[:, target];
+```
+
+## Model Declaration
+
+Finally, we can define our model.
+
+`logistic_regression` takes four arguments:
+
+  - `x` is our set of independent variables;
+  - `y` is the element we want to predict;
+  - `n` is the number of observations we have; and
+  - `σ` is the standard deviation we want to assume for our priors.
+
+Within the model, we create four coefficients (`intercept`, `student`, `balance`, and `income`), each with a normally distributed prior with mean zero and standard deviation `σ`. We want to find values of these four coefficients that predict any given `y`.
+
+The `for` block creates a variable `v` which holds the output of the logistic function.
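For intuition, here is a minimal, self-contained sketch of that link function. The name `logistic_` is our own stand-in; the model below uses `StatsFuns.logistic`, which behaves the same way, squashing any real-valued linear score into a valid Bernoulli probability.

```julia
# Stand-in for StatsFuns.logistic: the inverse-logit link.
logistic_(x) = 1 / (1 + exp(-x))

logistic_(0.0)    # a score of 0 maps to probability 0.5
logistic_(4.0)    # large positive scores approach 1
logistic_(-4.0)   # large negative scores approach 0
```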
We then observe the likelihood of calculating `v` given the actual label, `y[i]`. + +```{julia} +# Bayesian logistic regression (LR) +@model function logistic_regression(x, y, n, σ) + intercept ~ Normal(0, σ) + + student ~ Normal(0, σ) + balance ~ Normal(0, σ) + income ~ Normal(0, σ) + + for i in 1:n + v = logistic(intercept + student * x[i, 1] + balance * x[i, 2] + income * x[i, 3]) + y[i] ~ Bernoulli(v) + end +end; +``` + +## Sampling + +Now we can run our sampler. This time we'll use [`NUTS`](https://turinglang.org/stable/docs/library/#Turing.Inference.NUTS) to sample from our posterior. + +```{julia} +#| output: false +setprogress!(false) +``` + +```{julia} +#| output: false +# Retrieve the number of observations. +n, _ = size(train) + +# Sample using NUTS. +m = logistic_regression(train, train_label, n, 1) +chain = sample(m, NUTS(), MCMCThreads(), 1_500, 3) +``` + +```{julia} +#| echo: false +chain +``` + +::: {.callout-warning collapse="true"} +## Sampling With Multiple Threads +The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains +will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.](https://turinglang.org/dev/docs/using-turing/guide/#sampling-multiple-chains) +::: + +```{julia} +#| echo: false +let + mean_params = mean(chain) + @assert mean_params[:student, :mean] < 0.1 + @assert mean_params[:balance, :mean] > 1 +end +``` + +Since we ran multiple chains, we may as well do a spot check to make sure each chain converges around similar points. + +```{julia} +plot(chain) +``` + +```{julia} +#| echo: false +let + mean_params = mapreduce(hcat, mean(chain; append_chains=false)) do df + return df[:, :mean] + end + for i in (2, 3) + @assert mean_params[:, i] != mean_params[:, 1] + @assert isapprox(mean_params[:, i], mean_params[:, 1]; rtol=5e-2) + end +end +``` + +Looks good! 
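Before moving on, it may help to see in isolation how the logistic link used in the model turns a linear predictor into a probability. The sketch below defines the same function that `StatsFuns.logistic` provides, applied to made-up coefficient values (illustrative only, not the posterior means from the chain).

```{julia}
# Standalone sketch of the logistic link used in the model above.
# logistic(t) = 1 / (1 + exp(-t)) maps any real number into (0, 1).
logistic(t) = 1 / (1 + exp(-t))

# Hypothetical (NOT fitted) coefficients: intercept, student, balance, income.
intercept, student, balance, income = -3.0, -0.5, 1.7, 0.1

# One standardized observation: a non-student with a high balance.
x_student, x_balance, x_income = 0.0, 2.0, 0.0

v = logistic(intercept + student * x_student + balance * x_balance + income * x_income)

# The output is a valid Bernoulli success probability.
@assert 0 < v < 1
@assert logistic(0) == 0.5
```

Because the link is monotone, larger linear predictors always map to larger default probabilities, which is what makes the coefficient signs interpretable.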
+ +We can also use the `corner` function from MCMCChains to show the distributions of the various parameters of our logistic regression. + +```{julia} +# The labels to use. +l = [:student, :balance, :income] + +# Use the corner function. Requires StatsPlots and MCMCChains. +corner(chain, l) +``` + +Fortunately the corner plot appears to demonstrate unimodal distributions for each of our parameters, so it should be straightforward to take the means of each parameter's sampled values to estimate our model to make predictions. + +## Making Predictions + +How do we test how well the model actually predicts whether someone is likely to default? We need to build a prediction function that takes the `test` object we made earlier and runs it through the average parameter calculated during sampling. + +The `prediction` function below takes a `Matrix` and a `Chain` object. It takes the mean of each parameter's sampled values and re-runs the logistic function using those mean values for every element in the test set. + +```{julia} +function prediction(x::Matrix, chain, threshold) + # Pull the means from each parameter's sampled values in the chain. + intercept = mean(chain[:intercept]) + student = mean(chain[:student]) + balance = mean(chain[:balance]) + income = mean(chain[:income]) + + # Retrieve the number of rows. + n, _ = size(x) + + # Generate a vector to store our predictions. + v = Vector{Float64}(undef, n) + + # Calculate the logistic function for each element in the test set. + for i in 1:n + num = logistic( + intercept .+ student * x[i, 1] + balance * x[i, 2] + income * x[i, 3] + ) + if num >= threshold + v[i] = 1 + else + v[i] = 0 + end + end + return v +end; +``` + +Let's see how we did! We run the test matrix through the prediction function, and compute the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error) (MSE) for our prediction. The `threshold` variable sets the sensitivity of the predictions. 
For example, a threshold of 0.07 will predict a default value of 1 for any predicted probability greater than 0.07 and no default if it is less than 0.07. + +```{julia} +# Set the prediction threshold. +threshold = 0.07 + +# Make the predictions. +predictions = prediction(test, chain, threshold) + +# Calculate MSE for our test set. +loss = sum((predictions - test_label) .^ 2) / length(test_label) +``` + +Perhaps more important is to see what percentage of defaults we correctly predicted. The code below simply counts defaults and predictions and presents the results. + +```{julia} +defaults = sum(test_label) +not_defaults = length(test_label) - defaults + +predicted_defaults = sum(test_label .== predictions .== 1) +predicted_not_defaults = sum(test_label .== predictions .== 0) + +println("Defaults: $defaults + Predictions: $predicted_defaults + Percentage defaults correct $(predicted_defaults/defaults)") + +println("Not defaults: $not_defaults + Predictions: $predicted_not_defaults + Percentage non-defaults correct $(predicted_not_defaults/not_defaults)") +``` + +```{julia} +#| echo: false +let + percentage_correct = predicted_defaults / defaults + @assert 0.6 < percentage_correct +end +``` + +The above shows that with a threshold of 0.07, we correctly predict a respectable portion of the defaults, and correctly identify most non-defaults. This is fairly sensitive to the choice of threshold, and you may wish to experiment with it. + +This tutorial has demonstrated how to use Turing to perform Bayesian logistic regression.
\ No newline at end of file diff --git a/tutorials/03-bayesian-neural-network/index.qmd b/tutorials/03-bayesian-neural-network/index.qmd index 1b2033224..133e83cc8 100755 --- a/tutorials/03-bayesian-neural-network/index.qmd +++ b/tutorials/03-bayesian-neural-network/index.qmd @@ -1,293 +1,293 @@ ---- -title: Bayesian Neural Networks -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -In this tutorial, we demonstrate how one can implement a Bayesian Neural Network using a combination of Turing and [Lux](https://github.com/LuxDL/Lux.jl), a suite of machine learning tools. We will use Lux to specify the neural network's layers and Turing to implement the probabilistic inference, with the goal of implementing a classification algorithm. - -We will begin with importing the relevant libraries. - -```{julia} -using Turing -using FillArrays -using Lux -using Plots -import Mooncake -using Functors - -using LinearAlgebra -using Random -``` - -Our goal here is to use a Bayesian neural network to classify points in an artificial dataset. -The code below generates data points arranged in a box-like pattern and displays a graph of the dataset we will be working with. 
- -```{julia} -# Number of points to generate -N = 80 -M = round(Int, N / 4) -rng = Random.default_rng() -Random.seed!(rng, 1234) - -# Generate artificial data -x1s = rand(rng, Float32, M) * 4.5f0; -x2s = rand(rng, Float32, M) * 4.5f0; -xt1s = Array([[x1s[i] + 0.5f0; x2s[i] + 0.5f0] for i in 1:M]) -x1s = rand(rng, Float32, M) * 4.5f0; -x2s = rand(rng, Float32, M) * 4.5f0; -append!(xt1s, Array([[x1s[i] - 5.0f0; x2s[i] - 5.0f0] for i in 1:M])) - -x1s = rand(rng, Float32, M) * 4.5f0; -x2s = rand(rng, Float32, M) * 4.5f0; -xt0s = Array([[x1s[i] + 0.5f0; x2s[i] - 5.0f0] for i in 1:M]) -x1s = rand(rng, Float32, M) * 4.5f0; -x2s = rand(rng, Float32, M) * 4.5f0; -append!(xt0s, Array([[x1s[i] - 5.0f0; x2s[i] + 0.5f0] for i in 1:M])) - -# Store all the data for later -xs = [xt1s; xt0s] -ts = [ones(2 * M); zeros(2 * M)] - -# Plot data points. -function plot_data() - x1 = map(e -> e[1], xt1s) - y1 = map(e -> e[2], xt1s) - x2 = map(e -> e[1], xt0s) - y2 = map(e -> e[2], xt0s) - - Plots.scatter(x1, y1; color="red", clim=(0, 1)) - return Plots.scatter!(x2, y2; color="blue", clim=(0, 1)) -end - -plot_data() -``` - -## Building a Neural Network - -The next step is to define a [feedforward neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network) where we express our parameters as distributions, and not single points as with traditional neural networks. -For this we will use `Dense` to define liner layers and compose them via `Chain`, both are neural network primitives from Lux. -The network `nn_initial` we created has two hidden layers with `tanh` activations and one output layer with sigmoid (`σ`) activation, as shown below. 
- -```{dot} -//| echo: false -graph G { - rankdir=LR; - nodesep=0.8; - ranksep=0.8; - node [shape=circle, fixedsize=true, width=0.8, style="filled", color=black, fillcolor="white", fontsize=12]; - - // Input layer - subgraph cluster_input { - node [label=""]; - input1; - input2; - style="rounded" - } - - // Hidden layers - subgraph cluster_hidden1 { - node [label=""]; - hidden11; - hidden12; - hidden13; - style="rounded" - } - - subgraph cluster_hidden2 { - node [label=""]; - hidden21; - hidden22; - style="rounded" - } - - // Output layer - subgraph cluster_output { - output1 [label=""]; - style="rounded" - } - - // Connections from input to hidden layer 1 - input1 -- hidden11; - input1 -- hidden12; - input1 -- hidden13; - input2 -- hidden11; - input2 -- hidden12; - input2 -- hidden13; - - // Connections from hidden layer 1 to hidden layer 2 - hidden11 -- hidden21; - hidden11 -- hidden22; - hidden12 -- hidden21; - hidden12 -- hidden22; - hidden13 -- hidden21; - hidden13 -- hidden22; - - // Connections from hidden layer 2 to output - hidden21 -- output1; - hidden22 -- output1; - - // Labels - labelloc="b"; - fontsize=17; - label="Input layer Hidden layers Output layer"; -} -``` - -The `nn_initial` is an instance that acts as a function and can take data as inputs and output predictions. -We will define distributions on the neural network parameters. - -```{julia} -# Construct a neural network using Lux -nn_initial = Chain(Dense(2 => 3, tanh), Dense(3 => 2, tanh), Dense(2 => 1, σ)) - -# Initialize the model weights and state -ps, st = Lux.setup(rng, nn_initial) - -Lux.parameterlength(nn_initial) # number of parameters in NN -``` - -The probabilistic model specification below creates a `parameters` variable, which has IID normal variables. The `parameters` vector represents all parameters of our neural net (weights and biases). - -```{julia} -# Create a regularization term and a Gaussian prior variance term. 
-alpha = 0.09 -sigma = sqrt(1.0 / alpha) -``` - -We also define a function to construct a named tuple from a vector of sampled parameters. -(We could use [`ComponentArrays`](https://github.com/jonniedie/ComponentArrays.jl) here and broadcast to avoid doing this, but this way avoids introducing an extra dependency.) - -```{julia} -function vector_to_parameters(ps_new::AbstractVector, ps::NamedTuple) - @assert length(ps_new) == Lux.parameterlength(ps) - i = 1 - function get_ps(x) - z = reshape(view(ps_new, i:(i + length(x) - 1)), size(x)) - i += length(x) - return z - end - return fmap(get_ps, ps) -end -``` - -To interface with external libraries it is often desirable to use the [`StatefulLuxLayer`](https://lux.csail.mit.edu/stable/api/Lux/utilities#Lux.StatefulLuxLayer) to automatically handle the neural network states. - -```{julia} -const nn = StatefulLuxLayer{true}(nn_initial, nothing, st) - -# Specify the probabilistic model. -@model function bayes_nn(xs, ts; sigma = sigma, ps = ps, nn = nn) - # Sample the parameters - nparameters = Lux.parameterlength(nn_initial) - parameters ~ MvNormal(zeros(nparameters), Diagonal(abs2.(sigma .* ones(nparameters)))) - - # Forward NN to make predictions - preds = Lux.apply(nn, xs, f32(vector_to_parameters(parameters, ps))) - - # Observe each prediction. - for i in eachindex(ts) - ts[i] ~ Bernoulli(preds[i]) - end -end -``` - -Inference can now be performed by calling `sample`. We use the `NUTS` Hamiltonian Monte Carlo sampler here. - -```{julia} -#| output: false -setprogress!(false) -``` - -```{julia} -# Perform inference. -n_iters = 2_000 -ch = sample(bayes_nn(reduce(hcat, xs), ts), NUTS(; adtype=AutoMooncake(; config=nothing)), n_iters); -``` - -Now we extract the parameter samples from the sampled chain as `θ` (this is of size `5000 x 20` where `5000` is the number of iterations and `20` is the number of parameters). -We'll use these primarily to determine how good our model's classifier is. 
- -```{julia} -# Extract all weight and bias parameters. -θ = MCMCChains.group(ch, :parameters).value; -``` - -## Prediction Visualization - -We can use [MAP estimation](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation) to classify our population by using the set of weights that provided the highest log posterior. - -```{julia} -# A helper to run the nn through data `x` using parameters `θ` -nn_forward(x, θ) = nn(x, vector_to_parameters(θ, ps)) - -# Plot the data we have. -fig = plot_data() - -# Find the index that provided the highest log posterior in the chain. -_, i = findmax(ch[:lp]) - -# Extract the max row value from i. -i = i.I[1] - -# Plot the posterior distribution with a contour plot -x1_range = collect(range(-6; stop=6, length=25)) -x2_range = collect(range(-6; stop=6, length=25)) -Z = [nn_forward([x1, x2], θ[i, :])[1] for x1 in x1_range, x2 in x2_range] -contour!(x1_range, x2_range, Z; linewidth=3, colormap=:seaborn_bright) -fig -``` - -The contour plot above shows that the MAP method is not too bad at classifying our data. - -Now we can visualize our predictions. - -$$ -p(\tilde{x} | X, \alpha) = \int_{\theta} p(\tilde{x} | \theta) p(\theta | X, \alpha) \approx \sum_{\theta \sim p(\theta | X, \alpha)}f_{\theta}(\tilde{x}) -$$ - -The `nn_predict` function takes the average predicted value from a network parameterized by weights drawn from the MCMC chain. - -```{julia} -# Return the average predicted value across -# multiple weights. -function nn_predict(x, θ, num) - num = min(num, size(θ, 1)) # make sure num does not exceed the number of samples - return mean([first(nn_forward(x, view(θ, i, :))) for i in 1:10:num]) -end -``` - -Next, we use the `nn_predict` function to predict the value at a sample of points where the `x1` and `x2` coordinates range between -6 and 6. 
As we can see below, we still have a satisfactory fit to our data, and more importantly, we can also see where the neural network is uncertain about its predictions much easier---those regions between cluster boundaries. - -```{julia} -# Plot the average prediction. -fig = plot_data() - -n_end = 1500 -x1_range = collect(range(-6; stop=6, length=25)) -x2_range = collect(range(-6; stop=6, length=25)) -Z = [nn_predict([x1, x2], θ, n_end)[1] for x1 in x1_range, x2 in x2_range] -contour!(x1_range, x2_range, Z; linewidth=3, colormap=:seaborn_bright) -fig -``` - -Suppose we are interested in how the predictive power of our Bayesian neural network evolved between samples. In that case, the following graph displays an animation of the contour plot generated from the network weights in samples 1 to 1,000. - -```{julia} -# Number of iterations to plot. -n_end = 500 - -anim = @gif for i in 1:n_end - plot_data() - Z = [nn_forward([x1, x2], θ[i, :])[1] for x1 in x1_range, x2 in x2_range] - contour!(x1_range, x2_range, Z; title="Iteration $i", clim=(0, 1)) -end every 5 -``` - -This has been an introduction to the applications of Turing and Lux in defining Bayesian neural networks. +--- +title: Bayesian Neural Networks +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +In this tutorial, we demonstrate how one can implement a Bayesian Neural Network using a combination of Turing and [Lux](https://github.com/LuxDL/Lux.jl), a suite of machine learning tools. We will use Lux to specify the neural network's layers and Turing to implement the probabilistic inference, with the goal of implementing a classification algorithm. + +We will begin with importing the relevant libraries. + +```{julia} +using Turing +using FillArrays +using Lux +using Plots +import Mooncake +using Functors + +using LinearAlgebra +using Random +``` + +Our goal here is to use a Bayesian neural network to classify points in an artificial dataset. 
+The code below generates data points arranged in a box-like pattern and displays a graph of the dataset we will be working with. + +```{julia} +# Number of points to generate +N = 80 +M = round(Int, N / 4) +rng = Random.default_rng() +Random.seed!(rng, 1234) + +# Generate artificial data +x1s = rand(rng, Float32, M) * 4.5f0; +x2s = rand(rng, Float32, M) * 4.5f0; +xt1s = Array([[x1s[i] + 0.5f0; x2s[i] + 0.5f0] for i in 1:M]) +x1s = rand(rng, Float32, M) * 4.5f0; +x2s = rand(rng, Float32, M) * 4.5f0; +append!(xt1s, Array([[x1s[i] - 5.0f0; x2s[i] - 5.0f0] for i in 1:M])) + +x1s = rand(rng, Float32, M) * 4.5f0; +x2s = rand(rng, Float32, M) * 4.5f0; +xt0s = Array([[x1s[i] + 0.5f0; x2s[i] - 5.0f0] for i in 1:M]) +x1s = rand(rng, Float32, M) * 4.5f0; +x2s = rand(rng, Float32, M) * 4.5f0; +append!(xt0s, Array([[x1s[i] - 5.0f0; x2s[i] + 0.5f0] for i in 1:M])) + +# Store all the data for later +xs = [xt1s; xt0s] +ts = [ones(2 * M); zeros(2 * M)] + +# Plot data points. +function plot_data() + x1 = map(e -> e[1], xt1s) + y1 = map(e -> e[2], xt1s) + x2 = map(e -> e[1], xt0s) + y2 = map(e -> e[2], xt0s) + + Plots.scatter(x1, y1; color="red", clim=(0, 1)) + return Plots.scatter!(x2, y2; color="blue", clim=(0, 1)) +end + +plot_data() +``` + +## Building a Neural Network + +The next step is to define a [feedforward neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network) where we express our parameters as distributions, and not single points as with traditional neural networks. +For this we will use `Dense` to define linear layers and compose them via `Chain`; both are neural network primitives from Lux. +The network `nn_initial` we created has two hidden layers with `tanh` activations and one output layer with sigmoid (`σ`) activation, as shown below.
+ +```{dot} +//| echo: false +graph G { + rankdir=LR; + nodesep=0.8; + ranksep=0.8; + node [shape=circle, fixedsize=true, width=0.8, style="filled", color=black, fillcolor="white", fontsize=12]; + + // Input layer + subgraph cluster_input { + node [label=""]; + input1; + input2; + style="rounded" + } + + // Hidden layers + subgraph cluster_hidden1 { + node [label=""]; + hidden11; + hidden12; + hidden13; + style="rounded" + } + + subgraph cluster_hidden2 { + node [label=""]; + hidden21; + hidden22; + style="rounded" + } + + // Output layer + subgraph cluster_output { + output1 [label=""]; + style="rounded" + } + + // Connections from input to hidden layer 1 + input1 -- hidden11; + input1 -- hidden12; + input1 -- hidden13; + input2 -- hidden11; + input2 -- hidden12; + input2 -- hidden13; + + // Connections from hidden layer 1 to hidden layer 2 + hidden11 -- hidden21; + hidden11 -- hidden22; + hidden12 -- hidden21; + hidden12 -- hidden22; + hidden13 -- hidden21; + hidden13 -- hidden22; + + // Connections from hidden layer 2 to output + hidden21 -- output1; + hidden22 -- output1; + + // Labels + labelloc="b"; + fontsize=17; + label="Input layer Hidden layers Output layer"; +} +``` + +The `nn_initial` is an instance that acts as a function and can take data as inputs and output predictions. +We will define distributions on the neural network parameters. + +```{julia} +# Construct a neural network using Lux +nn_initial = Chain(Dense(2 => 3, tanh), Dense(3 => 2, tanh), Dense(2 => 1, σ)) + +# Initialize the model weights and state +ps, st = Lux.setup(rng, nn_initial) + +Lux.parameterlength(nn_initial) # number of parameters in NN +``` + +The probabilistic model specification below creates a `parameters` variable, which has IID normal variables. The `parameters` vector represents all parameters of our neural net (weights and biases). + +```{julia} +# Create a regularization term and a Gaussian prior variance term. 
+alpha = 0.09 +sigma = sqrt(1.0 / alpha) +``` + +We also define a function to construct a named tuple from a vector of sampled parameters. +(We could use [`ComponentArrays`](https://github.com/jonniedie/ComponentArrays.jl) here and broadcast to avoid doing this, but this way avoids introducing an extra dependency.) + +```{julia} +function vector_to_parameters(ps_new::AbstractVector, ps::NamedTuple) + @assert length(ps_new) == Lux.parameterlength(ps) + i = 1 + function get_ps(x) + z = reshape(view(ps_new, i:(i + length(x) - 1)), size(x)) + i += length(x) + return z + end + return fmap(get_ps, ps) +end +``` + +To interface with external libraries it is often desirable to use the [`StatefulLuxLayer`](https://lux.csail.mit.edu/stable/api/Lux/utilities#Lux.StatefulLuxLayer) to automatically handle the neural network states. + +```{julia} +const nn = StatefulLuxLayer{true}(nn_initial, nothing, st) + +# Specify the probabilistic model. +@model function bayes_nn(xs, ts; sigma = sigma, ps = ps, nn = nn) + # Sample the parameters + nparameters = Lux.parameterlength(nn_initial) + parameters ~ MvNormal(zeros(nparameters), Diagonal(abs2.(sigma .* ones(nparameters)))) + +  # Forward NN to make predictions + preds = Lux.apply(nn, xs, f32(vector_to_parameters(parameters, ps))) + + # Observe each prediction. + for i in eachindex(ts) + ts[i] ~ Bernoulli(preds[i]) + end +end +``` + +Inference can now be performed by calling `sample`. We use the `NUTS` Hamiltonian Monte Carlo sampler here. + +```{julia} +#| output: false +setprogress!(false) +``` + +```{julia} +# Perform inference. +n_iters = 2_000 +ch = sample(bayes_nn(reduce(hcat, xs), ts), NUTS(; adtype=AutoMooncake(; config=nothing)), n_iters); +``` + +Now we extract the parameter samples from the sampled chain as `θ` (this is of size `2000 x 20`, where `2000` is the number of iterations, `n_iters`, and `20` is the number of parameters). +We'll use these primarily to determine how good our model's classifier is.
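The 20-parameter figure can be checked with plain arithmetic, without Lux: a `Dense(in => out)` layer carries `in * out` weights plus `out` biases, so for the `2 => 3 => 2 => 1` network used here:

```{julia}
# Parameter count of a Dense(in => out) layer: weights plus biases.
dense_params(nin, nout) = nin * nout + nout

# The 2 => 3 => 2 => 1 network built above.
total = dense_params(2, 3) + dense_params(3, 2) + dense_params(2, 1)

@assert dense_params(2, 3) == 9   # 6 weights + 3 biases
@assert total == 20               # matches Lux.parameterlength(nn_initial)
```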
+ +```{julia} +# Extract all weight and bias parameters. +θ = MCMCChains.group(ch, :parameters).value; +``` + +## Prediction Visualization + +We can use [MAP estimation](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation) to classify our population by using the set of weights that provided the highest log posterior. + +```{julia} +# A helper to run the nn through data `x` using parameters `θ` +nn_forward(x, θ) = nn(x, vector_to_parameters(θ, ps)) + +# Plot the data we have. +fig = plot_data() + +# Find the index that provided the highest log posterior in the chain. +_, i = findmax(ch[:lp]) + +# Extract the max row value from i. +i = i.I[1] + +# Plot the posterior distribution with a contour plot +x1_range = collect(range(-6; stop=6, length=25)) +x2_range = collect(range(-6; stop=6, length=25)) +Z = [nn_forward([x1, x2], θ[i, :])[1] for x1 in x1_range, x2 in x2_range] +contour!(x1_range, x2_range, Z; linewidth=3, colormap=:seaborn_bright) +fig +``` + +The contour plot above shows that the MAP method is not too bad at classifying our data. + +Now we can visualize our predictions. + +$$ +p(\tilde{x} | X, \alpha) = \int_{\theta} p(\tilde{x} | \theta) p(\theta | X, \alpha) \approx \sum_{\theta \sim p(\theta | X, \alpha)}f_{\theta}(\tilde{x}) +$$ + +The `nn_predict` function takes the average predicted value from a network parameterized by weights drawn from the MCMC chain. + +```{julia} +# Return the average predicted value across +# multiple weights. +function nn_predict(x, θ, num) + num = min(num, size(θ, 1)) # make sure num does not exceed the number of samples + return mean([first(nn_forward(x, view(θ, i, :))) for i in 1:10:num]) +end +``` + +Next, we use the `nn_predict` function to predict the value at a sample of points where the `x1` and `x2` coordinates range between -6 and 6. 
As we can see below, we still have a satisfactory fit to our data, and more importantly, we can also see where the neural network is uncertain about its predictions much easier---those regions between cluster boundaries. + +```{julia} +# Plot the average prediction. +fig = plot_data() + +n_end = 1500 +x1_range = collect(range(-6; stop=6, length=25)) +x2_range = collect(range(-6; stop=6, length=25)) +Z = [nn_predict([x1, x2], θ, n_end)[1] for x1 in x1_range, x2 in x2_range] +contour!(x1_range, x2_range, Z; linewidth=3, colormap=:seaborn_bright) +fig +``` + +Suppose we are interested in how the predictive power of our Bayesian neural network evolved between samples. In that case, the following graph displays an animation of the contour plot generated from the network weights in samples 1 to 1,000. + +```{julia} +# Number of iterations to plot. +n_end = 500 + +anim = @gif for i in 1:n_end + plot_data() + Z = [nn_forward([x1, x2], θ[i, :])[1] for x1 in x1_range, x2 in x2_range] + contour!(x1_range, x2_range, Z; title="Iteration $i", clim=(0, 1)) +end every 5 +``` + +This has been an introduction to the applications of Turing and Lux in defining Bayesian neural networks. diff --git a/tutorials/04-hidden-markov-model/index.qmd b/tutorials/04-hidden-markov-model/index.qmd index c96406c24..ae9148273 100755 --- a/tutorials/04-hidden-markov-model/index.qmd +++ b/tutorials/04-hidden-markov-model/index.qmd @@ -1,193 +1,193 @@ ---- -title: Bayesian Hidden Markov Models -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -This tutorial illustrates training Bayesian [Hidden Markov Models](https://en.wikipedia.org/wiki/Hidden_Markov_model) (HMM) using Turing. The main goals are learning the transition matrix, emission parameter, and hidden states. 
For a more rigorous academic overview on Hidden Markov Models, see [An introduction to Hidden Markov Models and Bayesian Networks](http://mlg.eng.cam.ac.uk/zoubin/papers/ijprai.pdf) (Ghahramani, 2001). - -In this tutorial, we assume there are $k$ discrete hidden states; the observations are continuous and normally distributed - centered around the hidden states. This assumption reduces the number of parameters to be estimated in the emission matrix. - -Let's load the libraries we'll need. We also set a random seed (for reproducibility) and the automatic differentiation backend to forward mode (more [here]({{}}) on why this is useful). - -```{julia} -# Load libraries. -using Turing, StatsPlots, Random - -# Set a random seed and use the forward_diff AD mode. -Random.seed!(12345678); -``` - -## Simple State Detection - -In this example, we'll use something where the states and emission parameters are straightforward. - -```{julia} -# Define the emission parameter. -y = [ - 1.0, - 1.0, - 1.0, - 1.0, - 1.0, - 1.0, - 2.0, - 2.0, - 2.0, - 2.0, - 2.0, - 2.0, - 3.0, - 3.0, - 3.0, - 3.0, - 3.0, - 3.0, - 3.0, - 2.0, - 2.0, - 2.0, - 2.0, - 1.0, - 1.0, - 1.0, - 1.0, - 1.0, - 1.0, - 1.0, -]; -N = length(y); -K = 3; - -# Plot the data we just made. -plot(y; xlim=(0, 30), ylim=(-1, 5), size=(500, 250)) -``` - -We can see that we have three states, one for each height of the plot (1, 2, 3). This height is also our emission parameter, so state one produces a value of one, state two produces a value of two, and so on. - -Ultimately, we would like to understand three major parameters: - - 1. The transition matrix. This is a matrix that assigns a probability of switching from one state to any other state, including the state that we are already in. - 2. The emission matrix, which describes a typical value emitted by some state. In the plot above, the emission parameter for state one is simply one. - 3. 
The state sequence is our understanding of what state we were actually in when we observed some data. This is very important in more sophisticated HMM models, where the emission value does not equal our state. - -With this in mind, let's set up our model. We are going to use some of our knowledge as modelers to provide additional information about our system. This takes the form of the prior on our emission parameter. - -$$ -m_i \sim \mathrm{Normal}(i, 0.5) \quad \text{where} \quad m = \{1,2,3\} -$$ - -Simply put, this says that we expect state one to emit values in a Normally distributed manner, where the mean of each state's emissions is that state's value. The variance of 0.5 helps the model converge more quickly — consider the case where we have a variance of 1 or 2. In this case, the likelihood of observing a 2 when we are in state 1 is actually quite high, as it is within a standard deviation of the true emission value. Applying the prior that we are likely to be tightly centered around the mean prevents our model from being too confused about the state that is generating our observations. - -The priors on our transition matrix are noninformative, using `T[i] ~ Dirichlet(ones(K)/K)`. The Dirichlet prior used in this way assumes that the state is likely to change to any other state with equal probability. As we'll see, this transition matrix prior will be overwritten as we observe data. - -```{julia} -# Turing model definition. -@model function BayesHmm(y, K) - # Get observation length. - N = length(y) - - # State sequence. - s = tzeros(Int, N) - - # Emission matrix. - m = Vector(undef, K) - - # Transition matrix. - T = Vector{Vector}(undef, K) - - # Assign distributions to each element - # of the transition matrix and the - # emission matrix. - for i in 1:K - T[i] ~ Dirichlet(ones(K) / K) - m[i] ~ Normal(i, 0.5) - end - - # Observe each point of the input. 
- s[1] ~ Categorical(K) - y[1] ~ Normal(m[s[1]], 0.1) - - for i in 2:N - s[i] ~ Categorical(vec(T[s[i - 1]])) - y[i] ~ Normal(m[s[i]], 0.1) - end -end; -``` - -We will use a combination of two samplers ([HMC](https://turinglang.org/dev/docs/library/#Turing.Inference.HMC) and [Particle Gibbs](https://turinglang.org/dev/docs/library/#Turing.Inference.PG)) by passing them to the [Gibbs](https://turinglang.org/dev/docs/library/#Turing.Inference.Gibbs) sampler. The Gibbs sampler allows for compositional inference, where we can utilize different samplers on different parameters. - -In this case, we use HMC for `m` and `T`, representing the emission and transition matrices respectively. We use the Particle Gibbs sampler for `s`, the state sequence. You may wonder why it is that we are not assigning `s` to the HMC sampler, and why it is that we need compositional Gibbs sampling at all. - -The parameter `s` is not a continuous variable. It is a vector of **integers**, and thus Hamiltonian methods like HMC and [NUTS](https://turinglang.org/dev/docs/library/#Turing.Inference.NUTS) won't work correctly. Gibbs allows us to apply the right tools to the best effect. If you are a particularly advanced user interested in higher performance, you may benefit from setting up your Gibbs sampler to use [different automatic differentiation]({{}}#compositional-sampling-with-differing-ad-modes) backends for each parameter space. - -Time to run our sampler. - -```{julia} -#| output: false -setprogress!(false) -``` - -```{julia} -g = Gibbs(HMC(0.01, 50, :m, :T), PG(120, :s)) -chn = sample(BayesHmm(y, 3), g, 1000); -``` - -Let's see how well our chain performed. -Ordinarily, using `display(chn)` would be a good first step, but we have generated a lot of parameters here (`s[1]`, `s[2]`, `m[1]`, and so on). -It's a bit easier to show how our model performed graphically. - -The code below generates an animation showing the graph of the data above, and the data our model generates in each sample. 
- -```{julia} -# Extract our m and s parameters from the chain. -m_set = MCMCChains.group(chn, :m).value -s_set = MCMCChains.group(chn, :s).value - -# Iterate through the MCMC samples. -Ns = 1:length(chn) - -# Make an animation. -animation = @gif for i in Ns - m = m_set[i, :] - s = Int.(s_set[i, :]) - emissions = m[s] - - p = plot( - y; - chn=:red, - size=(500, 250), - xlabel="Time", - ylabel="State", - legend=:topright, - label="True data", - xlim=(0, 30), - ylim=(-1, 5), - ) - plot!(emissions; color=:blue, label="Sample $i") -end every 3 -``` - -Looks like our model did a pretty good job, but we should also check to make sure our chain converges. A quick check is to examine whether the diagonal (representing the probability of remaining in the current state) of the transition matrix appears to be stationary. The code below extracts the diagonal and shows a traceplot of each persistence probability. - -```{julia} -# Index the chain with the persistence probabilities. -subchain = chn[["T[1][1]", "T[2][2]", "T[3][3]"]] - -plot(subchain; seriestype=:traceplot, title="Persistence Probability", legend=false) -``` - -A cursory examination of the traceplot above indicates that all three chains converged to something resembling -stationary. We can use the diagnostic functions provided by [MCMCChains](https://github.com/TuringLang/MCMCChains.jl) to engage in some more formal tests, like the Heidelberg and Welch diagnostic: - -```{julia} -heideldiag(MCMCChains.group(chn, :T))[1] -``` - -The p-values on the test suggest that we cannot reject the hypothesis that the observed sequence comes from a stationary distribution, so we can be reasonably confident that our transition matrix has converged to something reasonable. 
+---
+title: Bayesian Hidden Markov Models
+engine: julia
+---
+
+```{julia}
+#| echo: false
+#| output: false
+using Pkg;
+Pkg.instantiate();
+```
+
+This tutorial illustrates training Bayesian [Hidden Markov Models](https://en.wikipedia.org/wiki/Hidden_Markov_model) (HMMs) using Turing. The main goals are learning the transition matrix, emission parameter, and hidden states. For a more rigorous academic overview of Hidden Markov Models, see [An introduction to Hidden Markov Models and Bayesian Networks](http://mlg.eng.cam.ac.uk/zoubin/papers/ijprai.pdf) (Ghahramani, 2001).
+
+In this tutorial, we assume there are $K$ discrete hidden states; the observations are continuous and normally distributed, centered around the hidden states. This assumption reduces the number of parameters to be estimated in the emission matrix.
+
+Let's load the libraries we'll need, and set a random seed for reproducibility.
+
+```{julia}
+# Load libraries.
+using Turing, StatsPlots, Random
+
+# Set a random seed for reproducibility.
+Random.seed!(12345678);
+```
+
+## Simple State Detection
+
+In this example, we'll use a simple dataset in which the states and emission parameters are easy to read off by eye.
+
+```{julia}
+# Define the emission parameter.
+y = [
+    1.0,
+    1.0,
+    1.0,
+    1.0,
+    1.0,
+    1.0,
+    2.0,
+    2.0,
+    2.0,
+    2.0,
+    2.0,
+    2.0,
+    3.0,
+    3.0,
+    3.0,
+    3.0,
+    3.0,
+    3.0,
+    3.0,
+    2.0,
+    2.0,
+    2.0,
+    2.0,
+    1.0,
+    1.0,
+    1.0,
+    1.0,
+    1.0,
+    1.0,
+    1.0,
+];
+N = length(y);
+K = 3;
+
+# Plot the data we just made.
+plot(y; xlim=(0, 30), ylim=(-1, 5), size=(500, 250))
+```
+
+We can see that we have three states, one for each height of the plot (1, 2, 3). This height is also our emission parameter, so state one produces a value of one, state two produces a value of two, and so on.
+
+Ultimately, we would like to understand three major parameters:
+
+ 1. The transition matrix. This is a matrix that assigns a probability of switching from one state to any other state, including the state that we are already in.
+ 2. The emission matrix, which describes a typical value emitted by some state. In the plot above, the emission parameter for state one is simply one.
+ 3. The state sequence is our understanding of what state we were actually in when we observed some data. This is very important in more sophisticated HMMs, where the emission value does not equal our state.
+
+With this in mind, let's set up our model. We are going to use some of our knowledge as modelers to provide additional information about our system. This takes the form of the prior on our emission parameter.
+
+$$
+m_i \sim \mathrm{Normal}(i, 0.5) \quad \text{for} \quad i \in \{1,2,3\}
+$$
+
+Simply put, this says that we expect each state to emit normally distributed values, with the mean of each state's emissions equal to that state's value. The standard deviation of 0.5 helps the model converge more quickly; consider the case where we have a standard deviation of 1 or 2. In that case, the likelihood of observing a 2 when we are in state 1 is actually quite high, as it is within a standard deviation of the true emission value. Applying a prior that keeps emissions tightly centered around each state's mean prevents our model from being too confused about which state is generating our observations.
+
+The priors on our transition matrix are noninformative, using `T[i] ~ Dirichlet(ones(K)/K)`. The Dirichlet prior used in this way assumes that the state is equally likely to change to any other state. As we'll see, this prior on the transition matrix will be updated as we observe data.
+
+```{julia}
+# Turing model definition.
+@model function BayesHmm(y, K)
+    # Get observation length.
+    N = length(y)
+
+    # State sequence.
+    s = tzeros(Int, N)
+
+    # Emission matrix.
+    m = Vector(undef, K)
+
+    # Transition matrix.
+ T = Vector{Vector}(undef, K) + + # Assign distributions to each element + # of the transition matrix and the + # emission matrix. + for i in 1:K + T[i] ~ Dirichlet(ones(K) / K) + m[i] ~ Normal(i, 0.5) + end + + # Observe each point of the input. + s[1] ~ Categorical(K) + y[1] ~ Normal(m[s[1]], 0.1) + + for i in 2:N + s[i] ~ Categorical(vec(T[s[i - 1]])) + y[i] ~ Normal(m[s[i]], 0.1) + end +end; +``` + +We will use a combination of two samplers ([HMC](https://turinglang.org/dev/docs/library/#Turing.Inference.HMC) and [Particle Gibbs](https://turinglang.org/dev/docs/library/#Turing.Inference.PG)) by passing them to the [Gibbs](https://turinglang.org/dev/docs/library/#Turing.Inference.Gibbs) sampler. The Gibbs sampler allows for compositional inference, where we can utilize different samplers on different parameters. + +In this case, we use HMC for `m` and `T`, representing the emission and transition matrices respectively. We use the Particle Gibbs sampler for `s`, the state sequence. You may wonder why it is that we are not assigning `s` to the HMC sampler, and why it is that we need compositional Gibbs sampling at all. + +The parameter `s` is not a continuous variable. It is a vector of **integers**, and thus Hamiltonian methods like HMC and [NUTS](https://turinglang.org/dev/docs/library/#Turing.Inference.NUTS) won't work correctly. Gibbs allows us to apply the right tools to the best effect. If you are a particularly advanced user interested in higher performance, you may benefit from setting up your Gibbs sampler to use [different automatic differentiation]({{}}#compositional-sampling-with-differing-ad-modes) backends for each parameter space. + +Time to run our sampler. + +```{julia} +#| output: false +setprogress!(false) +``` + +```{julia} +g = Gibbs(HMC(0.01, 50, :m, :T), PG(120, :s)) +chn = sample(BayesHmm(y, 3), g, 1000); +``` + +Let's see how well our chain performed. 
+Ordinarily, using `display(chn)` would be a good first step, but we have generated a lot of parameters here (`s[1]`, `s[2]`, `m[1]`, and so on).
+It's a bit easier to show how our model performed graphically.
+
+The code below generates an animation showing the graph of the data above, and the data our model generates in each sample.
+
+```{julia}
+# Extract our m and s parameters from the chain.
+m_set = MCMCChains.group(chn, :m).value
+s_set = MCMCChains.group(chn, :s).value
+
+# Iterate through the MCMC samples.
+Ns = 1:length(chn)
+
+# Make an animation.
+animation = @gif for i in Ns
+    m = m_set[i, :]
+    s = Int.(s_set[i, :])
+    emissions = m[s]
+
+    p = plot(
+        y;
+        color=:red,
+        size=(500, 250),
+        xlabel="Time",
+        ylabel="State",
+        legend=:topright,
+        label="True data",
+        xlim=(0, 30),
+        ylim=(-1, 5),
+    )
+    plot!(emissions; color=:blue, label="Sample $i")
+end every 3
+```
+
+Looks like our model did a pretty good job, but we should also check to make sure our chain has converged. A quick check is to examine whether the diagonal (representing the probability of remaining in the current state) of the transition matrix appears to be stationary. The code below extracts the diagonal and shows a traceplot of each persistence probability.
+
+```{julia}
+# Index the chain with the persistence probabilities.
+subchain = chn[["T[1][1]", "T[2][2]", "T[3][3]"]]
+
+plot(subchain; seriestype=:traceplot, title="Persistence Probability", legend=false)
+```
+
+A cursory examination of the traceplot above indicates that all three chains converged to something resembling
+stationarity.
We can use the diagnostic functions provided by [MCMCChains](https://github.com/TuringLang/MCMCChains.jl) to engage in some more formal tests, like the Heidelberg and Welch diagnostic: + +```{julia} +heideldiag(MCMCChains.group(chn, :T))[1] +``` + +The p-values on the test suggest that we cannot reject the hypothesis that the observed sequence comes from a stationary distribution, so we can be reasonably confident that our transition matrix has converged to something reasonable. diff --git a/tutorials/05-linear-regression/index.qmd b/tutorials/05-linear-regression/index.qmd index 9bc5e5a84..7c5111f7e 100755 --- a/tutorials/05-linear-regression/index.qmd +++ b/tutorials/05-linear-regression/index.qmd @@ -1,249 +1,249 @@ ---- -title: Linear Regression -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -Turing is powerful when applied to complex hierarchical models, but it can also be put to task at common statistical procedures, like [linear regression](https://en.wikipedia.org/wiki/Linear_regression). -This tutorial covers how to implement a linear regression model in Turing. - -## Set Up - -We begin by importing all the necessary libraries. - -```{julia} -# Import Turing. -using Turing - -# Package for loading the data set. -using RDatasets - -# Package for visualization. -using StatsPlots - -# Functionality for splitting the data. -using MLUtils: splitobs - -# Functionality for constructing arrays with identical elements efficiently. -using FillArrays - -# Functionality for normalizing the data and evaluating the model predictions. -using StatsBase - -# Functionality for working with scaled identity matrices. -using LinearAlgebra - -# Set a seed for reproducibility. -using Random -Random.seed!(0); -``` - -```{julia} -#| output: false -setprogress!(false) -``` - -We will use the `mtcars` dataset from the [RDatasets](https://github.com/JuliaStats/RDatasets.jl) package. 
-`mtcars` contains a variety of statistics on different car models, including their miles per gallon, number of cylinders, and horsepower, among others. - -We want to know if we can construct a Bayesian linear regression model to predict the miles per gallon of a car, given the other statistics it has. -Let us take a look at the data we have. - -```{julia} -# Load the dataset. -data = RDatasets.dataset("datasets", "mtcars") - -# Show the first six rows of the dataset. -first(data, 6) -``` - -```{julia} -size(data) -``` - -The next step is to get our data ready for testing. We'll split the `mtcars` dataset into two subsets, one for training our model and one for evaluating our model. Then, we separate the targets we want to learn (`MPG`, in this case) and standardize the datasets by subtracting each column's means and dividing by the standard deviation of that column. The resulting data is not very familiar looking, but this standardization process helps the sampler converge far easier. - -```{julia} -# Remove the model column. -select!(data, Not(:Model)) - -# Split our dataset 70%/30% into training/test sets. -trainset, testset = map(DataFrame, splitobs(data; at=0.7, shuffle=true)) - -# Turing requires data in matrix form. -target = :MPG -train = Matrix(select(trainset, Not(target))) -test = Matrix(select(testset, Not(target))) -train_target = trainset[:, target] -test_target = testset[:, target] - -# Standardize the features. -dt_features = fit(ZScoreTransform, train; dims=1) -StatsBase.transform!(dt_features, train) -StatsBase.transform!(dt_features, test) - -# Standardize the targets. 
-dt_targets = fit(ZScoreTransform, train_target) -StatsBase.transform!(dt_targets, train_target) -StatsBase.transform!(dt_targets, test_target); -``` - -## Model Specification - -In a traditional frequentist model using [OLS](https://en.wikipedia.org/wiki/Ordinary_least_squares), our model might look like: - -$$ -\mathrm{MPG}_i = \alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i} -$$ - -where $\boldsymbol{\beta}$ is a vector of coefficients and $\boldsymbol{X}$ is a vector of inputs for observation $i$. The Bayesian model we are more concerned with is the following: - -$$ -\mathrm{MPG}_i \sim \mathcal{N}(\alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i}, \sigma^2) -$$ - -where $\alpha$ is an intercept term common to all observations, $\boldsymbol{\beta}$ is a coefficient vector, $\boldsymbol{X_i}$ is the observed data for car $i$, and $\sigma^2$ is a common variance term. - -For $\sigma^2$, we assign a prior of `truncated(Normal(0, 100); lower=0)`. -This is consistent with [Andrew Gelman's recommendations](http://www.stat.columbia.edu/%7Egelman/research/published/taumain.pdf) on noninformative priors for variance. -The intercept term ($\alpha$) is assumed to be normally distributed with a mean of zero and a variance of three. -This represents our assumptions that miles per gallon can be explained mostly by our assorted variables, but a high variance term indicates our uncertainty about that. -Each coefficient is assumed to be normally distributed with a mean of zero and a variance of 10. -We do not know that our coefficients are different from zero, and we don't know which ones are likely to be the most important, so the variance term is quite high. -Lastly, each observation $y_i$ is distributed according to the calculated `mu` term given by $\alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i}$. - -```{julia} -# Bayesian linear regression. -@model function linear_regression(x, y) - # Set variance prior. 
- σ² ~ truncated(Normal(0, 100); lower=0) - - # Set intercept prior. - intercept ~ Normal(0, sqrt(3)) - - # Set the priors on our coefficients. - nfeatures = size(x, 2) - coefficients ~ MvNormal(Zeros(nfeatures), 10.0 * I) - - # Calculate all the mu terms. - mu = intercept .+ x * coefficients - return y ~ MvNormal(mu, σ² * I) -end -``` - -With our model specified, we can call the sampler. We will use the No U-Turn Sampler ([NUTS](https://turinglang.org/stable/docs/library/#Turing.Inference.NUTS)) here. - -```{julia} -model = linear_regression(train, train_target) -chain = sample(model, NUTS(), 5_000) -``` - -We can also check the densities and traces of the parameters visually using the `plot` functionality. - -```{julia} -plot(chain) -``` - -It looks like all parameters have converged. - -```{julia} -#| echo: false -let - ess_df = ess(chain) - @assert minimum(ess_df[:, :ess]) > 500 "Minimum ESS: $(minimum(ess_df[:, :ess])) - not > 700" - @assert mean(ess_df[:, :ess]) > 2_000 "Mean ESS: $(mean(ess_df[:, :ess])) - not > 2000" - @assert maximum(ess_df[:, :ess]) > 3_500 "Maximum ESS: $(maximum(ess_df[:, :ess])) - not > 3500" -end -``` - -## Comparing to OLS - -A satisfactory test of our model is to evaluate how well it predicts. Importantly, we want to compare our model to existing tools like OLS. The code below uses the [GLM.jl](https://juliastats.org/GLM.jl/stable/) package to generate a traditional OLS multiple regression model on the same data as our probabilistic model. - -```{julia} -# Import the GLM package. -using GLM - -# Perform multiple regression OLS. -train_with_intercept = hcat(ones(size(train, 1)), train) -ols = lm(train_with_intercept, train_target) - -# Compute predictions on the training data set and unstandardize them. -train_prediction_ols = GLM.predict(ols) -StatsBase.reconstruct!(dt_targets, train_prediction_ols) - -# Compute predictions on the test data set and unstandardize them. 
-test_with_intercept = hcat(ones(size(test, 1)), test) -test_prediction_ols = GLM.predict(ols, test_with_intercept) -StatsBase.reconstruct!(dt_targets, test_prediction_ols); -``` - -The function below accepts a chain and an input matrix and calculates predictions. We use the samples of the model parameters in the chain starting with sample 200. - -```{julia} -# Make a prediction given an input vector. -function prediction(chain, x) - p = get_params(chain[200:end, :, :]) - targets = p.intercept' .+ x * reduce(hcat, p.coefficients)' - return vec(mean(targets; dims=2)) -end -``` - -When we make predictions, we unstandardize them so they are more understandable. - -```{julia} -# Calculate the predictions for the training and testing sets and unstandardize them. -train_prediction_bayes = prediction(chain, train) -StatsBase.reconstruct!(dt_targets, train_prediction_bayes) -test_prediction_bayes = prediction(chain, test) -StatsBase.reconstruct!(dt_targets, test_prediction_bayes) - -# Show the predictions on the test data set. -DataFrame(; MPG=testset[!, target], Bayes=test_prediction_bayes, OLS=test_prediction_ols) -``` - -Now let's evaluate the loss for each method, and each prediction set. We will use the mean squared error to evaluate loss, given by -$$ -\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n {(y_i - \hat{y_i})^2} -$$ -where $y_i$ is the actual value (true MPG) and $\hat{y_i}$ is the predicted value using either OLS or Bayesian linear regression. A lower SSE indicates a closer fit to the data. 
- -```{julia} -println( - "Training set:", - "\n\tBayes loss: ", - msd(train_prediction_bayes, trainset[!, target]), - "\n\tOLS loss: ", - msd(train_prediction_ols, trainset[!, target]), -) - -println( - "Test set:", - "\n\tBayes loss: ", - msd(test_prediction_bayes, testset[!, target]), - "\n\tOLS loss: ", - msd(test_prediction_ols, testset[!, target]), -) -``` - -```{julia} -#| echo: false -let - bayes_train_loss = msd(train_prediction_bayes, trainset[!, target]) - bayes_test_loss = msd(test_prediction_bayes, testset[!, target]) - ols_train_loss = msd(train_prediction_ols, trainset[!, target]) - ols_test_loss = msd(test_prediction_ols, testset[!, target]) - @assert bayes_train_loss < bayes_test_loss "Bayesian training loss ($bayes_train_loss) >= Bayesian test loss ($bayes_test_loss)" - @assert ols_train_loss < ols_test_loss "OLS training loss ($ols_train_loss) >= OLS test loss ($ols_test_loss)" - @assert isapprox(bayes_train_loss, ols_train_loss; rtol=0.01) "Difference between Bayesian training loss ($bayes_train_loss) and OLS training loss ($ols_train_loss) unexpectedly large!" - @assert isapprox(bayes_test_loss, ols_test_loss; rtol=0.05) "Difference between Bayesian test loss ($bayes_test_loss) and OLS test loss ($ols_test_loss) unexpectedly large!" -end -``` - -As we can see above, OLS and our Bayesian model fit our training and test data set about the same. +--- +title: Linear Regression +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +Turing is powerful when applied to complex hierarchical models, but it can also be put to task at common statistical procedures, like [linear regression](https://en.wikipedia.org/wiki/Linear_regression). +This tutorial covers how to implement a linear regression model in Turing. + +## Set Up + +We begin by importing all the necessary libraries. + +```{julia} +# Import Turing. +using Turing + +# Package for loading the data set. 
+using RDatasets + +# Package for visualization. +using StatsPlots + +# Functionality for splitting the data. +using MLUtils: splitobs + +# Functionality for constructing arrays with identical elements efficiently. +using FillArrays + +# Functionality for normalizing the data and evaluating the model predictions. +using StatsBase + +# Functionality for working with scaled identity matrices. +using LinearAlgebra + +# Set a seed for reproducibility. +using Random +Random.seed!(0); +``` + +```{julia} +#| output: false +setprogress!(false) +``` + +We will use the `mtcars` dataset from the [RDatasets](https://github.com/JuliaStats/RDatasets.jl) package. +`mtcars` contains a variety of statistics on different car models, including their miles per gallon, number of cylinders, and horsepower, among others. + +We want to know if we can construct a Bayesian linear regression model to predict the miles per gallon of a car, given the other statistics it has. +Let us take a look at the data we have. + +```{julia} +# Load the dataset. +data = RDatasets.dataset("datasets", "mtcars") + +# Show the first six rows of the dataset. +first(data, 6) +``` + +```{julia} +size(data) +``` + +The next step is to get our data ready for testing. We'll split the `mtcars` dataset into two subsets, one for training our model and one for evaluating our model. Then, we separate the targets we want to learn (`MPG`, in this case) and standardize the datasets by subtracting each column's means and dividing by the standard deviation of that column. The resulting data is not very familiar looking, but this standardization process helps the sampler converge far easier. + +```{julia} +# Remove the model column. +select!(data, Not(:Model)) + +# Split our dataset 70%/30% into training/test sets. +trainset, testset = map(DataFrame, splitobs(data; at=0.7, shuffle=true)) + +# Turing requires data in matrix form. 
+target = :MPG +train = Matrix(select(trainset, Not(target))) +test = Matrix(select(testset, Not(target))) +train_target = trainset[:, target] +test_target = testset[:, target] + +# Standardize the features. +dt_features = fit(ZScoreTransform, train; dims=1) +StatsBase.transform!(dt_features, train) +StatsBase.transform!(dt_features, test) + +# Standardize the targets. +dt_targets = fit(ZScoreTransform, train_target) +StatsBase.transform!(dt_targets, train_target) +StatsBase.transform!(dt_targets, test_target); +``` + +## Model Specification + +In a traditional frequentist model using [OLS](https://en.wikipedia.org/wiki/Ordinary_least_squares), our model might look like: + +$$ +\mathrm{MPG}_i = \alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i} +$$ + +where $\boldsymbol{\beta}$ is a vector of coefficients and $\boldsymbol{X}$ is a vector of inputs for observation $i$. The Bayesian model we are more concerned with is the following: + +$$ +\mathrm{MPG}_i \sim \mathcal{N}(\alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i}, \sigma^2) +$$ + +where $\alpha$ is an intercept term common to all observations, $\boldsymbol{\beta}$ is a coefficient vector, $\boldsymbol{X_i}$ is the observed data for car $i$, and $\sigma^2$ is a common variance term. + +For $\sigma^2$, we assign a prior of `truncated(Normal(0, 100); lower=0)`. +This is consistent with [Andrew Gelman's recommendations](http://www.stat.columbia.edu/%7Egelman/research/published/taumain.pdf) on noninformative priors for variance. +The intercept term ($\alpha$) is assumed to be normally distributed with a mean of zero and a variance of three. +This represents our assumptions that miles per gallon can be explained mostly by our assorted variables, but a high variance term indicates our uncertainty about that. +Each coefficient is assumed to be normally distributed with a mean of zero and a variance of 10. 
+We do not know that our coefficients are different from zero, and we don't know which ones are likely to be the most important, so the variance term is quite high.
+Lastly, each observation $y_i$ is distributed according to the calculated `mu` term given by $\alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i}$.
+
+```{julia}
+# Bayesian linear regression.
+@model function linear_regression(x, y)
+    # Set variance prior.
+    σ² ~ truncated(Normal(0, 100); lower=0)
+
+    # Set intercept prior.
+    intercept ~ Normal(0, sqrt(3))
+
+    # Set the priors on our coefficients.
+    nfeatures = size(x, 2)
+    coefficients ~ MvNormal(Zeros(nfeatures), 10.0 * I)
+
+    # Calculate all the mu terms.
+    mu = intercept .+ x * coefficients
+    return y ~ MvNormal(mu, σ² * I)
+end
+```
+
+With our model specified, we can call the sampler. We will use the No U-Turn Sampler ([NUTS](https://turinglang.org/stable/docs/library/#Turing.Inference.NUTS)) here.
+
+```{julia}
+model = linear_regression(train, train_target)
+chain = sample(model, NUTS(), 5_000)
+```
+
+We can also check the densities and traces of the parameters visually using the `plot` functionality.
+
+```{julia}
+plot(chain)
+```
+
+It looks like all parameters have converged.
+
+```{julia}
+#| echo: false
+let
+    ess_df = ess(chain)
+    @assert minimum(ess_df[:, :ess]) > 500 "Minimum ESS: $(minimum(ess_df[:, :ess])) - not > 500"
+    @assert mean(ess_df[:, :ess]) > 2_000 "Mean ESS: $(mean(ess_df[:, :ess])) - not > 2000"
+    @assert maximum(ess_df[:, :ess]) > 3_500 "Maximum ESS: $(maximum(ess_df[:, :ess])) - not > 3500"
+end
+```
+
+## Comparing to OLS
+
+A satisfactory test of our model is to evaluate how well it predicts. Importantly, we want to compare our model to existing tools like OLS. The code below uses the [GLM.jl](https://juliastats.org/GLM.jl/stable/) package to generate a traditional OLS multiple regression model on the same data as our probabilistic model.
+
+```{julia}
+# Import the GLM package.
+using GLM
+
+# Perform multiple regression OLS.
+train_with_intercept = hcat(ones(size(train, 1)), train)
+ols = lm(train_with_intercept, train_target)
+
+# Compute predictions on the training data set and unstandardize them.
+train_prediction_ols = GLM.predict(ols)
+StatsBase.reconstruct!(dt_targets, train_prediction_ols)
+
+# Compute predictions on the test data set and unstandardize them.
+test_with_intercept = hcat(ones(size(test, 1)), test)
+test_prediction_ols = GLM.predict(ols, test_with_intercept)
+StatsBase.reconstruct!(dt_targets, test_prediction_ols);
+```
+
+The function below accepts a chain and an input matrix and calculates predictions. We use the samples of the model parameters in the chain starting with sample 200.
+
+```{julia}
+# Make a prediction given an input vector.
+function prediction(chain, x)
+    p = get_params(chain[200:end, :, :])
+    targets = p.intercept' .+ x * reduce(hcat, p.coefficients)'
+    return vec(mean(targets; dims=2))
+end
+```
+
+When we make predictions, we unstandardize them so they are more understandable.
+
+```{julia}
+# Calculate the predictions for the training and testing sets and unstandardize them.
+train_prediction_bayes = prediction(chain, train)
+StatsBase.reconstruct!(dt_targets, train_prediction_bayes)
+test_prediction_bayes = prediction(chain, test)
+StatsBase.reconstruct!(dt_targets, test_prediction_bayes)
+
+# Show the predictions on the test data set.
+DataFrame(; MPG=testset[!, target], Bayes=test_prediction_bayes, OLS=test_prediction_ols)
+```
+
+Now let's evaluate the loss for each method, and each prediction set. We will use the mean squared error to evaluate loss, given by
+$$
+\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n {(y_i - \hat{y_i})^2}
+$$
+where $y_i$ is the actual value (true MPG) and $\hat{y_i}$ is the predicted value using either OLS or Bayesian linear regression. A lower MSE indicates a closer fit to the data.
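As a quick sanity check on the formula, the mean squared error can be transcribed directly; the `msd` function from StatsBase computes this same mean squared deviation. A small sketch with made-up numbers (not the tutorial's actual predictions):

```julia
# Direct transcription of the MSE formula above. StatsBase's `msd` computes
# the same quantity; the values here are illustrative, not model output.
mse(y, yhat) = sum(abs2, y .- yhat) / length(y)

y_true = [21.0, 22.8, 18.7]  # hypothetical true MPG values
y_pred = [20.5, 23.0, 19.2]  # hypothetical predictions
mse(y_true, y_pred)
```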
+ +```{julia} +println( + "Training set:", + "\n\tBayes loss: ", + msd(train_prediction_bayes, trainset[!, target]), + "\n\tOLS loss: ", + msd(train_prediction_ols, trainset[!, target]), +) + +println( + "Test set:", + "\n\tBayes loss: ", + msd(test_prediction_bayes, testset[!, target]), + "\n\tOLS loss: ", + msd(test_prediction_ols, testset[!, target]), +) +``` + +```{julia} +#| echo: false +let + bayes_train_loss = msd(train_prediction_bayes, trainset[!, target]) + bayes_test_loss = msd(test_prediction_bayes, testset[!, target]) + ols_train_loss = msd(train_prediction_ols, trainset[!, target]) + ols_test_loss = msd(test_prediction_ols, testset[!, target]) + @assert bayes_train_loss < bayes_test_loss "Bayesian training loss ($bayes_train_loss) >= Bayesian test loss ($bayes_test_loss)" + @assert ols_train_loss < ols_test_loss "OLS training loss ($ols_train_loss) >= OLS test loss ($ols_test_loss)" + @assert isapprox(bayes_train_loss, ols_train_loss; rtol=0.01) "Difference between Bayesian training loss ($bayes_train_loss) and OLS training loss ($ols_train_loss) unexpectedly large!" + @assert isapprox(bayes_test_loss, ols_test_loss; rtol=0.05) "Difference between Bayesian test loss ($bayes_test_loss) and OLS test loss ($ols_test_loss) unexpectedly large!" +end +``` + +As we can see above, OLS and our Bayesian model fit our training and test data set about the same. diff --git a/tutorials/06-infinite-mixture-model/index.qmd b/tutorials/06-infinite-mixture-model/index.qmd index 538c02ad4..8ed264f3d 100755 --- a/tutorials/06-infinite-mixture-model/index.qmd +++ b/tutorials/06-infinite-mixture-model/index.qmd @@ -1,284 +1,284 @@ ---- -title: Probabilistic Modelling using the Infinite Mixture Model -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -In many applications it is desirable to allow the model to adjust its complexity to the amount of data. 
Consider for example the task of assigning objects into clusters or groups. This task often involves the specification of the number of groups. However, oftentimes it is not known beforehand how many groups exist. Moreover, in some applications, e.g. modelling topics in text documents or grouping species, the number of examples per group is heavy-tailed. This makes it impossible to predefine the number of groups and requires the model to form new groups when data points from previously unseen groups are observed.
-
-A natural approach for such applications is the use of non-parametric models. This tutorial will introduce how to use the Dirichlet process in a mixture of infinitely many Gaussians using Turing. For further information on Bayesian nonparametrics and the Dirichlet process we refer to the [introduction by Zoubin Ghahramani](http://mlg.eng.cam.ac.uk/pub/pdf/Gha12.pdf) and the book "Fundamentals of Nonparametric Bayesian Inference" by Subhashis Ghosal and Aad van der Vaart.
-
-```{julia}
-using Turing
-```
-
-## Mixture Model
-
-Before introducing infinite mixture models in Turing, we will briefly review the construction of finite mixture models. Subsequently, we will define how to use the [Chinese restaurant process](https://en.wikipedia.org/wiki/Chinese_restaurant_process) construction of a Dirichlet process for non-parametric clustering.
-
-#### Two-Component Model
-
-First, consider the simple case of a mixture model with two Gaussian components with fixed covariance.
-The generative process of such a model can be written as:
-
-\begin{equation*}
-\begin{aligned}
-\pi_1 &\sim \mathrm{Beta}(a, b) \\
-\pi_2 &= 1-\pi_1 \\
-\mu_1 &\sim \mathrm{Normal}(\mu_0, \Sigma_0) \\
-\mu_2 &\sim \mathrm{Normal}(\mu_0, \Sigma_0) \\
-z_i &\sim \mathrm{Categorical}(\pi_1, \pi_2) \\
-x_i &\sim \mathrm{Normal}(\mu_{z_i}, \Sigma)
-\end{aligned}
-\end{equation*}
-
-where $\pi_1, \pi_2$ are the mixing weights of the mixture model, i.e.
$\pi_1 + \pi_2 = 1$, and $z_i$ is a latent assignment of the observation $x_i$ to a component (Gaussian).
-
-We can implement this model in Turing for 1D data as follows:
-
-```{julia}
-@model function two_model(x)
-    # Hyper-parameters
-    μ0 = 0.0
-    σ0 = 1.0
-
-    # Draw weights.
-    π1 ~ Beta(1, 1)
-    π2 = 1 - π1
-
-    # Draw locations of the components.
-    μ1 ~ Normal(μ0, σ0)
-    μ2 ~ Normal(μ0, σ0)
-
-    # Draw latent assignment.
-    z ~ Categorical([π1, π2])
-
-    # Draw observation from selected component.
-    if z == 1
-        x ~ Normal(μ1, 1.0)
-    else
-        x ~ Normal(μ2, 1.0)
-    end
-end
-```
-
-#### Finite Mixture Model
-
-If we have more than two components, this model can elegantly be extended using a Dirichlet distribution as prior for the mixing weights $\pi_1, \dots, \pi_K$. Note that the Dirichlet distribution is the multivariate generalization of the beta distribution. The resulting model can be written as:
-
-$$
-\begin{align}
-(\pi_1, \dots, \pi_K) &\sim \mathrm{Dirichlet}(K, \alpha) \\
-\mu_k &\sim \mathrm{Normal}(\mu_0, \Sigma_0), \;\; \forall k \\
-z &\sim \mathrm{Categorical}(\pi_1, \dots, \pi_K) \\
-x &\sim \mathrm{Normal}(\mu_z, \Sigma)
-\end{align}
-$$
-
-which resembles the model in the [Gaussian mixture model tutorial]({{}}) with a slightly different notation.
-
-## Infinite Mixture Model
-
-The question now arises: is there a generalization of a Dirichlet distribution for which the dimensionality $K$ is infinite, i.e. $K = \infty$?
-
-To implement an infinite Gaussian mixture model in Turing, we first need to load the `Turing.RandomMeasures` module. `RandomMeasures` contains a variety of tools useful in nonparametrics.
-
-```{julia}
-using Turing.RandomMeasures
-```
-
-We will now utilize the fact that one can integrate out the mixing weights in a Gaussian mixture model, allowing us to arrive at the Chinese restaurant process construction. See Carl E.
Rasmussen: [The Infinite Gaussian Mixture Model](https://www.seas.harvard.edu/courses/cs281/papers/rasmussen-1999a.pdf), NIPS (2000) for details.
-
-In fact, if the mixing weights are integrated out, the conditional prior for the latent variable $z$ is given by:
-
-$$
-p(z_i = k \mid z_{\not i}, \alpha) = \frac{n_k + \alpha/K}{N - 1 + \alpha}
-$$
-
-where $z_{\not i}$ are the latent assignments of all observations except observation $i$. Note that we use $n_k$ to denote the number of observations at component $k$ excluding observation $i$. The parameter $\alpha$ is the concentration parameter of the Dirichlet distribution used as prior over the mixing weights.
-
-#### Chinese Restaurant Process
-
-To obtain the Chinese restaurant process construction, we can now derive the conditional prior as $K \rightarrow \infty$.
-
-For $n_k > 0$ we obtain:
-
-$$
-p(z_i = k \mid z_{\not i}, \alpha) = \frac{n_k}{N - 1 + \alpha}
-$$
-
-and for all infinitely many clusters that are empty (combined) we get:
-
-$$
-p(z_i = k \mid z_{\not i}, \alpha) = \frac{\alpha}{N - 1 + \alpha}
-$$
-
-These equations show that the conditional prior for component assignments is proportional to the number of observations already assigned to each component, meaning that the Chinese restaurant process has a rich-get-richer property.
-
-To get a better understanding of this property, we can plot the cluster chosen for each new observation drawn from the conditional prior.
-
-```{julia}
-# Concentration parameter.
-α = 10.0
-
-# Random measure, e.g. Dirichlet process.
-rpm = DirichletProcess(α)
-
-# Cluster assignments for each observation.
-z = Vector{Int}()
-
-# Maximum number of observations we observe.
-Nmax = 500
-
-for i in 1:Nmax
-    # Number of observations per cluster.
-    K = isempty(z) ? 0 : maximum(z)
-    nk = Vector{Int}(map(k -> sum(z .== k), 1:K))
-
-    # Draw new assignment.
- push!(z, rand(ChineseRestaurantProcess(rpm, nk))) -end -``` - -```{julia} -using Plots - -# Plot the cluster assignments over time -@gif for i in 1:Nmax - scatter( - collect(1:i), - z[1:i]; - markersize=2, - xlabel="observation (i)", - ylabel="cluster (k)", - legend=false, - ) -end -``` - -Further, we can see that the number of clusters is logarithmic in the number of observations and data points. This is a side-effect of the "rich-get-richer" phenomenon, i.e. we expect large clusters and thus the number of clusters has to be smaller than the number of observations. - -$$ -\mathbb{E}[K \mid N] \approx \alpha \cdot log \big(1 + \frac{N}{\alpha}\big) -$$ - -We can see from the equation that the concentration parameter $\alpha$ allows us to control the number of clusters formed *a priori*. - -In Turing we can implement an infinite Gaussian mixture model using the Chinese restaurant process construction of a Dirichlet process as follows: - -```{julia} -@model function infiniteGMM(x) - # Hyper-parameters, i.e. concentration parameter and parameters of H. - α = 1.0 - μ0 = 0.0 - σ0 = 1.0 - - # Define random measure, e.g. Dirichlet process. - rpm = DirichletProcess(α) - - # Define the base distribution, i.e. expected value of the Dirichlet process. - H = Normal(μ0, σ0) - - # Latent assignment. - z = tzeros(Int, length(x)) - - # Locations of the infinitely many clusters. - μ = tzeros(Float64, 0) - - for i in 1:length(x) - - # Number of clusters. - K = maximum(z) - nk = Vector{Int}(map(k -> sum(z .== k), 1:K)) - - # Draw the latent assignment. - z[i] ~ ChineseRestaurantProcess(rpm, nk) - - # Create a new cluster? - if z[i] > K - push!(μ, 0.0) - - # Draw location of new cluster. - μ[z[i]] ~ H - end - - # Draw observation. - x[i] ~ Normal(μ[z[i]], 1.0) - end -end -``` - -We can now use Turing to infer the assignments of some data points. First, we will create some random data that comes from three clusters, with means of 0, -5, and 10. 
- -```{julia} -using Plots, Random - -# Generate some test data. -Random.seed!(1) -data = vcat(randn(10), randn(10) .- 5, randn(10) .+ 10) -data .-= mean(data) -data /= std(data); -``` - -Next, we'll sample from our posterior using SMC. - -```{julia} -#| output: false -setprogress!(false) -``` - -```{julia} -# MCMC sampling -Random.seed!(2) -iterations = 1000 -model_fun = infiniteGMM(data); -chain = sample(model_fun, SMC(), iterations); -``` - -Finally, we can plot the number of clusters in each sample. - -```{julia} -# Extract the number of clusters for each sample of the Markov chain. -k = map( - t -> length(unique(vec(chain[t, MCMCChains.namesingroup(chain, :z), :].value))), - 1:iterations, -); - -# Visualize the number of clusters. -plot(k; xlabel="Iteration", ylabel="Number of clusters", label="Chain 1") -``` - -If we visualize the histogram of the number of clusters sampled from our posterior, we observe that the model seems to prefer 3 clusters, which is the true number of clusters. Note that the number of clusters in a Dirichlet process mixture model is not limited a priori and will grow to infinity with probability one. However, if conditioned on data the posterior will concentrate on a finite number of clusters enforcing the resulting model to have a finite amount of clusters. It is, however, not given that the posterior of a Dirichlet process Gaussian mixture model converges to the true number of clusters, given that data comes from a finite mixture model. See Jeffrey Miller and Matthew Harrison: [A simple example of Dirichlet process mixture inconsitency for the number of components](https://arxiv.org/pdf/1301.2708.pdf) for details. - -```{julia} -histogram(k; xlabel="Number of clusters", legend=false) -``` - -One issue with the Chinese restaurant process construction is that the number of latent parameters we need to sample scales with the number of observations. It may be desirable to use alternative constructions in certain cases. 
Alternative methods of constructing a Dirichlet process can be employed via the following representations: - -Size-Biased Sampling Process - -$$ -j_k \sim \mathrm{Beta}(1, \alpha) \cdot \mathrm{surplus} -$$ - -Stick-Breaking Process -$$ -v_k \sim \mathrm{Beta}(1, \alpha) -$$ - -Chinese Restaurant Process -$$ -p(z_n = k | z_{1:n-1}) \propto \begin{cases} -\frac{m_k}{n-1+\alpha}, \text{ if } m_k > 0\\\ -\frac{\alpha}{n-1+\alpha} -\end{cases} -$$ - -For more details see [this article](https://www.stats.ox.ac.uk/%7Eteh/research/npbayes/Teh2010a.pdf). +--- +title: Probabilistic Modelling using the Infinite Mixture Model +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +In many applications it is desirable to allow the model to adjust its complexity to the amount of data. Consider for example the task of assigning objects into clusters or groups. This task often involves the specification of the number of groups. However, it is often not known beforehand how many groups exist. Moreover, in some applications, e.g. modelling topics in text documents or grouping species, the number of examples per group is heavy-tailed. This makes it impossible to predefine the number of groups and requires the model to form new groups when data points from previously unseen groups are observed. + +A natural approach for such applications is the use of non-parametric models. This tutorial will show how to use the Dirichlet process in a mixture of infinitely many Gaussians using Turing. For further information on Bayesian nonparametrics and the Dirichlet process we refer to the [introduction by Zoubin Ghahramani](http://mlg.eng.cam.ac.uk/pub/pdf/Gha12.pdf) and the book "Fundamentals of Nonparametric Bayesian Inference" by Subhashis Ghosal and Aad van der Vaart.
+ +```{julia} +using Turing +``` + +## Mixture Model + +Before introducing infinite mixture models in Turing, we will briefly review the construction of finite mixture models. Subsequently, we will define how to use the [Chinese restaurant process](https://en.wikipedia.org/wiki/Chinese_restaurant_process) construction of a Dirichlet process for non-parametric clustering. + +#### Two-Component Model + +First, consider the simple case of a mixture model with two Gaussian components with fixed covariance. +The generative process of such a model can be written as: + +\begin{equation*} +\begin{aligned} +\pi_1 &\sim \mathrm{Beta}(a, b) \\ +\pi_2 &= 1-\pi_1 \\ +\mu_1 &\sim \mathrm{Normal}(\mu_0, \Sigma_0) \\ +\mu_2 &\sim \mathrm{Normal}(\mu_0, \Sigma_0) \\ +z_i &\sim \mathrm{Categorical}(\pi_1, \pi_2) \\ +x_i &\sim \mathrm{Normal}(\mu_{z_i}, \Sigma) +\end{aligned} +\end{equation*} + +where $\pi_1, \pi_2$ are the mixing weights of the mixture model, i.e. $\pi_1 + \pi_2 = 1$, and $z_i$ is a latent assignment of the observation $x_i$ to a component (Gaussian). + +We can implement this model in Turing for 1D data as follows: + +```{julia} +@model function two_model(x) + # Hyper-parameters + μ0 = 0.0 + σ0 = 1.0 + + # Draw weights. + π1 ~ Beta(1, 1) + π2 = 1 - π1 + + # Draw locations of the components. + μ1 ~ Normal(μ0, σ0) + μ2 ~ Normal(μ0, σ0) + + # Draw latent assignment. + z ~ Categorical([π1, π2]) + + # Draw observation from selected component. + if z == 1 + x ~ Normal(μ1, 1.0) + else + x ~ Normal(μ2, 1.0) + end +end +``` + +#### Finite Mixture Model + +If we have more than two components, this model can elegantly be extended using a Dirichlet distribution as prior for the mixing weights $\pi_1, \dots, \pi_K$. Note that the Dirichlet distribution is the multivariate generalization of the beta distribution. 
The resulting model can be written as: + +$$ +\begin{align} +(\pi_1, \dots, \pi_K) &\sim \mathrm{Dirichlet}(K, \alpha) \\ +\mu_k &\sim \mathrm{Normal}(\mu_0, \Sigma_0), \;\; \forall k \\ +z &\sim \mathrm{Categorical}(\pi_1, \dots, \pi_K) \\ +x &\sim \mathrm{Normal}(\mu_z, \Sigma) +\end{align} +$$ + +which resembles the model in the [Gaussian mixture model tutorial]({{}}) with a slightly different notation. + +## Infinite Mixture Model + +The question now arises: is there a generalization of the Dirichlet distribution for which the dimensionality $K$ is infinite, i.e. $K = \infty$? + +But first, to implement an infinite Gaussian mixture model in Turing, we need to load the `Turing.RandomMeasures` module. `RandomMeasures` contains a variety of tools useful in nonparametrics. + +```{julia} +using Turing.RandomMeasures +``` + +We will now use the fact that one can integrate out the mixing weights in a Gaussian mixture model, which allows us to arrive at the Chinese restaurant process construction. See Carl E. Rasmussen: [The Infinite Gaussian Mixture Model](https://www.seas.harvard.edu/courses/cs281/papers/rasmussen-1999a.pdf), NIPS (2000) for details. + +In fact, if the mixing weights are integrated out, the conditional prior for the latent variable $z$ is given by: + +$$ +p(z_i = k \mid z_{\not i}, \alpha) = \frac{n_k + \alpha/K}{N - 1 + \alpha} +$$ + +where $z_{\not i}$ are the latent assignments of all observations except observation $i$. Note that we use $n_k$ to denote the number of observations at component $k$ excluding observation $i$. The parameter $\alpha$ is the concentration parameter of the Dirichlet distribution used as prior over the mixing weights. + +#### Chinese Restaurant Process + +To obtain the Chinese restaurant process construction, we can now derive the conditional prior if $K \rightarrow \infty$.
+ +For $n_k > 0$ we obtain: + +$$ +p(z_i = k \mid z_{\not i}, \alpha) = \frac{n_k}{N - 1 + \alpha} +$$ + +and for all the infinitely many empty clusters (combined) we get: + +$$ +p(z_i = k \mid z_{\not i}, \alpha) = \frac{\alpha}{N - 1 + \alpha} +$$ + +These equations show that the conditional prior probability of assigning an observation to an existing component is proportional to the number of observations already assigned to that component, meaning that the Chinese restaurant process has a "rich-get-richer" property. + +To get a better understanding of this property, we can plot the cluster chosen for each new observation drawn from the conditional prior. + +```{julia} +# Concentration parameter. +α = 10.0 + +# Random measure, e.g. Dirichlet process. +rpm = DirichletProcess(α) + +# Cluster assignments for each observation. +z = Vector{Int}() + +# Maximum number of observations we observe. +Nmax = 500 + +for i in 1:Nmax + # Number of observations per cluster. + K = isempty(z) ? 0 : maximum(z) + nk = Vector{Int}(map(k -> sum(z .== k), 1:K)) + + # Draw new assignment. + push!(z, rand(ChineseRestaurantProcess(rpm, nk))) +end +``` + +```{julia} +using Plots + +# Plot the cluster assignments over time +@gif for i in 1:Nmax + scatter( + collect(1:i), + z[1:i]; + markersize=2, + xlabel="observation (i)", + ylabel="cluster (k)", + legend=false, + ) +end +``` + +Further, we can see that the expected number of clusters grows only logarithmically with the number of observations. This is a side-effect of the "rich-get-richer" phenomenon: we expect large clusters, and thus the number of clusters has to be much smaller than the number of observations. + +$$ +\mathbb{E}[K \mid N] \approx \alpha \cdot \log \big(1 + \frac{N}{\alpha}\big) +$$ + +We can see from the equation that the concentration parameter $\alpha$ allows us to control the number of clusters formed *a priori*.
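We can check this approximation empirically by repeating the simulation above many times and comparing the average number of clusters with $\alpha \log(1 + N/\alpha)$. The sketch below reuses the `DirichletProcess` and `ChineseRestaurantProcess` types from `Turing.RandomMeasures`; the number of repetitions `nsim` is a choice made here for illustration.

```julia
using Turing.RandomMeasures, Statistics

# Compare the simulated average number of clusters with α·log(1 + N/α).
α, N, nsim = 10.0, 500, 100
rpm = DirichletProcess(α)

counts = map(1:nsim) do _
    z = Int[]
    for i in 1:N
        # Number of observations per cluster so far.
        K = isempty(z) ? 0 : maximum(z)
        nk = Vector{Int}(map(k -> sum(z .== k), 1:K))
        push!(z, rand(ChineseRestaurantProcess(rpm, nk)))
    end
    maximum(z)  # number of clusters after N observations
end

println("simulated E[K | N] ≈ ", mean(counts))
println("α · log(1 + N/α) ≈ ", α * log(1 + N / α))
```

The two numbers should agree up to Monte Carlo error, confirming the logarithmic growth of the number of clusters.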
+ +In Turing we can implement an infinite Gaussian mixture model using the Chinese restaurant process construction of a Dirichlet process as follows: + +```{julia} +@model function infiniteGMM(x) + # Hyper-parameters, i.e. concentration parameter and parameters of H. + α = 1.0 + μ0 = 0.0 + σ0 = 1.0 + + # Define random measure, e.g. Dirichlet process. + rpm = DirichletProcess(α) + + # Define the base distribution, i.e. expected value of the Dirichlet process. + H = Normal(μ0, σ0) + + # Latent assignment. + z = tzeros(Int, length(x)) + + # Locations of the infinitely many clusters. + μ = tzeros(Float64, 0) + + for i in 1:length(x) + + # Number of clusters. + K = maximum(z) + nk = Vector{Int}(map(k -> sum(z .== k), 1:K)) + + # Draw the latent assignment. + z[i] ~ ChineseRestaurantProcess(rpm, nk) + + # Create a new cluster? + if z[i] > K + push!(μ, 0.0) + + # Draw location of new cluster. + μ[z[i]] ~ H + end + + # Draw observation. + x[i] ~ Normal(μ[z[i]], 1.0) + end +end +``` + +We can now use Turing to infer the assignments of some data points. First, we will create some random data that comes from three clusters, with means of 0, -5, and 10. + +```{julia} +using Plots, Random + +# Generate some test data. +Random.seed!(1) +data = vcat(randn(10), randn(10) .- 5, randn(10) .+ 10) +data .-= mean(data) +data /= std(data); +``` + +Next, we'll sample from our posterior using SMC. + +```{julia} +#| output: false +setprogress!(false) +``` + +```{julia} +# MCMC sampling +Random.seed!(2) +iterations = 1000 +model_fun = infiniteGMM(data); +chain = sample(model_fun, SMC(), iterations); +``` + +Finally, we can plot the number of clusters in each sample. + +```{julia} +# Extract the number of clusters for each sample of the Markov chain. +k = map( + t -> length(unique(vec(chain[t, MCMCChains.namesingroup(chain, :z), :].value))), + 1:iterations, +); + +# Visualize the number of clusters. 
+plot(k; xlabel="Iteration", ylabel="Number of clusters", label="Chain 1") +``` + +If we visualize the histogram of the number of clusters sampled from our posterior, we observe that the model seems to prefer 3 clusters, which is the true number of clusters. Note that the number of clusters in a Dirichlet process mixture model is not limited a priori and will grow to infinity with probability one as more data is observed. However, conditioned on data, the posterior concentrates on a finite number of clusters, so the resulting model effectively uses only a finite number of clusters. It is, however, not guaranteed that the posterior of a Dirichlet process Gaussian mixture model converges to the true number of clusters, even if the data comes from a finite mixture model. See Jeffrey Miller and Matthew Harrison: [A simple example of Dirichlet process mixture inconsistency for the number of components](https://arxiv.org/pdf/1301.2708.pdf) for details. + +```{julia} +histogram(k; xlabel="Number of clusters", legend=false) +``` + +One issue with the Chinese restaurant process construction is that the number of latent parameters we need to sample scales with the number of observations. It may be desirable to use alternative constructions in certain cases. Alternative methods of constructing a Dirichlet process can be employed via the following representations: + +Size-Biased Sampling Process + +$$ +j_k \sim \mathrm{Beta}(1, \alpha) \cdot \mathrm{surplus} +$$ + +Stick-Breaking Process +$$ +v_k \sim \mathrm{Beta}(1, \alpha) +$$ + +Chinese Restaurant Process +$$ +p(z_n = k \mid z_{1:n-1}) \propto \begin{cases} +\frac{m_k}{n-1+\alpha} & \text{if } m_k > 0 \\ +\frac{\alpha}{n-1+\alpha} & \text{otherwise (new cluster)} +\end{cases} +$$ + +For more details see [this article](https://www.stats.ox.ac.uk/%7Eteh/research/npbayes/Teh2010a.pdf).
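To make the stick-breaking representation concrete, here is a minimal sketch: we repeatedly break off a $\mathrm{Beta}(1, \alpha)$-distributed fraction of the remaining stick to obtain the mixture weights. The truncation level `K`, the helper name `stick_breaking_weights`, and the use of `Distributions.Beta` directly (rather than `Turing.RandomMeasures`) are choices made here for illustration.

```julia
using Distributions, Random

# Truncated stick-breaking construction of Dirichlet process weights:
# v_k ~ Beta(1, α) and w_k = v_k * prod_{l<k} (1 - v_l).
function stick_breaking_weights(rng, α, K)
    v = rand(rng, Beta(1, α), K)
    w = similar(v)
    remaining = 1.0  # length of the stick still unbroken
    for k in 1:K
        w[k] = v[k] * remaining
        remaining *= 1 - v[k]
    end
    return w
end

w = stick_breaking_weights(Xoshiro(1), 10.0, 100)
sum(w)  # approaches 1 as the truncation level K grows
```

Unlike the Chinese restaurant process construction, the number of sampled variables here is controlled by the truncation level rather than by the number of observations.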
diff --git a/tutorials/07-poisson-regression/index.qmd b/tutorials/07-poisson-regression/index.qmd index dd25797e1..f2e233376 100755 --- a/tutorials/07-poisson-regression/index.qmd +++ b/tutorials/07-poisson-regression/index.qmd @@ -1,234 +1,234 @@ ---- -title: Bayesian Poisson Regression -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -This notebook is ported from the [example notebook](https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-poisson-regression.html) of PyMC3 on Poisson Regression. - -[Poisson Regression](https://en.wikipedia.org/wiki/Poisson_regression) is a technique commonly used to model count data. -Some of the applications include predicting the number of people defaulting on their loans or the number of cars running on a highway on a given day. -This example describes a method to implement the Bayesian version of this technique using Turing. - -We will generate the dataset that we will be working on which describes the relationship between number of times a person sneezes during the day with his alcohol consumption and medicinal intake. - -We start by importing the required libraries. - -```{julia} -#Import Turing, Distributions and DataFrames -using Turing, Distributions, DataFrames, Distributed - -# Import MCMCChain, Plots, and StatsPlots for visualizations and diagnostics. -using MCMCChains, Plots, StatsPlots - -# Set a seed for reproducibility. -using Random -Random.seed!(12); -``` - -# Generating data - -We start off by creating a toy dataset. We take the case of a person who takes medicine to prevent excessive sneezing. Alcohol consumption increases the rate of sneezing for that person. Thus, the two factors affecting the number of sneezes in a given day are alcohol consumption and whether the person has taken his medicine. Both these variable are taken as boolean valued while the number of sneezes will be a count valued variable. 
We also take into consideration that the interaction between the two boolean variables will affect the number of sneezes - -5 random rows are printed from the generated data to get a gist of the data generated. - -```{julia} -theta_noalcohol_meds = 1 # no alcohol, took medicine -theta_alcohol_meds = 3 # alcohol, took medicine -theta_noalcohol_nomeds = 6 # no alcohol, no medicine -theta_alcohol_nomeds = 36 # alcohol, no medicine - -# no of samples for each of the above cases -q = 100 - -#Generate data from different Poisson distributions -noalcohol_meds = Poisson(theta_noalcohol_meds) -alcohol_meds = Poisson(theta_alcohol_meds) -noalcohol_nomeds = Poisson(theta_noalcohol_nomeds) -alcohol_nomeds = Poisson(theta_alcohol_nomeds) - -nsneeze_data = vcat( - rand(noalcohol_meds, q), - rand(alcohol_meds, q), - rand(noalcohol_nomeds, q), - rand(alcohol_nomeds, q), -) -alcohol_data = vcat(zeros(q), ones(q), zeros(q), ones(q)) -meds_data = vcat(zeros(q), zeros(q), ones(q), ones(q)) - -df = DataFrame(; - nsneeze=nsneeze_data, - alcohol_taken=alcohol_data, - nomeds_taken=meds_data, - product_alcohol_meds=meds_data .* alcohol_data, -) -df[sample(1:nrow(df), 5; replace=false), :] -``` - -# Visualisation of the dataset - -We plot the distribution of the number of sneezes for the 4 different cases taken above. As expected, the person sneezes the most when he has taken alcohol and not taken his medicine. He sneezes the least when he doesn't consume alcohol and takes his medicine. 
- -```{julia} -# Data Plotting - -p1 = Plots.histogram( - df[(df[:, :alcohol_taken] .== 0) .& (df[:, :nomeds_taken] .== 0), 1]; - title="no_alcohol+meds", -) -p2 = Plots.histogram( - (df[(df[:, :alcohol_taken] .== 1) .& (df[:, :nomeds_taken] .== 0), 1]); - title="alcohol+meds", -) -p3 = Plots.histogram( - (df[(df[:, :alcohol_taken] .== 0) .& (df[:, :nomeds_taken] .== 1), 1]); - title="no_alcohol+no_meds", -) -p4 = Plots.histogram( - (df[(df[:, :alcohol_taken] .== 1) .& (df[:, :nomeds_taken] .== 1), 1]); - title="alcohol+no_meds", -) -plot(p1, p2, p3, p4; layout=(2, 2), legend=false) -``` - -We must convert our `DataFrame` data into the `Matrix` form as the manipulations that we are about are designed to work with `Matrix` data. We also separate the features from the labels which will be later used by the Turing sampler to generate samples from the posterior. - -```{julia} -# Convert the DataFrame object to matrices. -data = Matrix(df[:, [:alcohol_taken, :nomeds_taken, :product_alcohol_meds]]) -data_labels = df[:, :nsneeze] -data -``` - -We must recenter our data about 0 to help the Turing sampler in initialising the parameter estimates. So, normalising the data in each column by subtracting the mean and dividing by the standard deviation: - -```{julia} -# Rescale our matrices. -data = (data .- mean(data; dims=1)) ./ std(data; dims=1) -``` - -# Declaring the Model: Poisson Regression - -Our model, `poisson_regression` takes four arguments: - - - `x` is our set of independent variables; - - `y` is the element we want to predict; - - `n` is the number of observations we have; and - - `σ²` is the standard deviation we want to assume for our priors. - -Within the model, we create four coefficients (`b0`, `b1`, `b2`, and `b3`) and assign a prior of normally distributed with means of zero and standard deviations of `σ²`. We want to find values of these four coefficients to predict any given `y`. 
- -Intuitively, we can think of the coefficients as: - - - `b1` is the coefficient which represents the effect of taking alcohol on the number of sneezes; - - `b2` is the coefficient which represents the effect of taking in no medicines on the number of sneezes; - - `b3` is the coefficient which represents the effect of interaction between taking alcohol and no medicine on the number of sneezes; - -The `for` block creates a variable `theta` which is the weighted combination of the input features. We have defined the priors on these weights above. We then observe the likelihood of calculating `theta` given the actual label, `y[i]`. - -```{julia} -# Bayesian poisson regression (LR) -@model function poisson_regression(x, y, n, σ²) - b0 ~ Normal(0, σ²) - b1 ~ Normal(0, σ²) - b2 ~ Normal(0, σ²) - b3 ~ Normal(0, σ²) - for i in 1:n - theta = b0 + b1 * x[i, 1] + b2 * x[i, 2] + b3 * x[i, 3] - y[i] ~ Poisson(exp(theta)) - end -end; -``` - -# Sampling from the posterior - -We use the `NUTS` sampler to sample values from the posterior. We run multiple chains using the `MCMCThreads()` function to nullify the effect of a problematic chain. We then use the Gelman, Rubin, and Brooks Diagnostic to check the convergence of these multiple chains. - -```{julia} -#| output: false -# Retrieve the number of observations. -n, _ = size(data) - -# Sample using NUTS. - -num_chains = 4 -m = poisson_regression(data, data_labels, n, 10) -chain = sample(m, NUTS(), MCMCThreads(), 2_500, num_chains; discard_adapt=false, progress=false) -``` - -```{julia} -#| echo: false -chain -``` - -::: {.callout-warning collapse="true"} -## Sampling With Multiple Threads -The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains -will run sequentially, and you may notice a warning. 
For more information, see [the Turing documentation on sampling multiple chains.](https://turinglang.org/dev/docs/using-turing/guide/#sampling-multiple-chains) -::: - -# Viewing the Diagnostics - -We use the Gelman, Rubin, and Brooks Diagnostic to check whether our chains have converged. Note that we require multiple chains to use this diagnostic which analyses the difference between these multiple chains. - -We expect the chains to have converged. This is because we have taken sufficient number of iterations (1500) for the NUTS sampler. However, in case the test fails, then we will have to take a larger number of iterations, resulting in longer computation time. - -```{julia} -gelmandiag(chain) -``` - -From the above diagnostic, we can conclude that the chains have converged because the PSRF values of the coefficients are close to 1. - -So, we have obtained the posterior distributions of the parameters. We transform the coefficients and recover theta values by taking the exponent of the meaned values of the coefficients `b0`, `b1`, `b2` and `b3`. We take the exponent of the means to get a better comparison of the relative values of the coefficients. We then compare this with the intuitive meaning that was described earlier. - -```{julia} -# Taking the first chain -c1 = chain[:, :, 1] - -# Calculating the exponentiated means -b0_exp = exp(mean(c1[:b0])) -b1_exp = exp(mean(c1[:b1])) -b2_exp = exp(mean(c1[:b2])) -b3_exp = exp(mean(c1[:b3])) - -print("The exponent of the meaned values of the weights (or coefficients are): \n") -println("b0: ", b0_exp) -println("b1: ", b1_exp) -println("b2: ", b2_exp) -println("b3: ", b3_exp) -print("The posterior distributions obtained after sampling can be visualised as :\n") -``` - -Visualising the posterior by plotting it: - -```{julia} -plot(chain) -``` - -# Interpreting the Obtained Mean Values - -The exponentiated mean of the coefficient `b1` is roughly half of that of `b2`. 
This makes sense because in the data that we generated, the number of sneezes was more sensitive to the medicinal intake as compared to the alcohol consumption. We also get a weaker dependence on the interaction between the alcohol consumption and the medicinal intake as can be seen from the value of `b3`. - -# Removing the Warmup Samples - -As can be seen from the plots above, the parameters converge to their final distributions after a few iterations. -The initial values during the warmup phase increase the standard deviations of the parameters and are not required after we get the desired distributions. -Thus, we remove these warmup values and once again view the diagnostics. -To remove these warmup values, we take all values except the first 200. -This is because we set the second parameter of the NUTS sampler (which is the number of adaptations) to be equal to 200. - -```{julia} -chains_new = chain[201:end, :, :] -``` - -```{julia} -plot(chains_new) -``` - -As can be seen from the numeric values and the plots above, the standard deviation values have decreased and all the plotted values are from the estimated posteriors. The exponentiated mean values, with the warmup samples removed, have not changed by much and they are still in accordance with their intuitive meanings as described earlier. +--- +title: Bayesian Poisson Regression +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +This notebook is ported from the [example notebook](https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-poisson-regression.html) of PyMC3 on Poisson Regression. + +[Poisson Regression](https://en.wikipedia.org/wiki/Poisson_regression) is a technique commonly used to model count data. +Some of the applications include predicting the number of people defaulting on their loans or the number of cars running on a highway on a given day. 
+This example describes a method to implement the Bayesian version of this technique using Turing. + +We will generate the dataset that we will be working on, which describes the relationship between the number of times a person sneezes during the day and his alcohol consumption and medicinal intake. + +We start by importing the required libraries. + +```{julia} +# Import Turing, Distributions and DataFrames +using Turing, Distributions, DataFrames, Distributed + +# Import MCMCChain, Plots, and StatsPlots for visualizations and diagnostics. +using MCMCChains, Plots, StatsPlots + +# Set a seed for reproducibility. +using Random +Random.seed!(12); +``` + +# Generating data + +We start off by creating a toy dataset. We take the case of a person who takes medicine to prevent excessive sneezing. Alcohol consumption increases the rate of sneezing for that person. Thus, the two factors affecting the number of sneezes in a given day are alcohol consumption and whether the person has taken his medicine. Both of these variables are boolean-valued, while the number of sneezes is a count-valued variable. We also take into consideration that the interaction between the two boolean variables will affect the number of sneezes. + +Five random rows are printed from the generated data to give a gist of the data generated.
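In other words, each sneeze count is drawn from a Poisson distribution whose rate depends on the alcohol/medicine combination; the four rates match the values used in the code below:

$$
n_i \sim \mathrm{Poisson}(\theta), \qquad
\theta = \begin{cases}
1 & \text{no alcohol, medicine} \\
3 & \text{alcohol, medicine} \\
6 & \text{no alcohol, no medicine} \\
36 & \text{alcohol, no medicine}
\end{cases}
$$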
+ +```{julia} +theta_noalcohol_meds = 1 # no alcohol, took medicine +theta_alcohol_meds = 3 # alcohol, took medicine +theta_noalcohol_nomeds = 6 # no alcohol, no medicine +theta_alcohol_nomeds = 36 # alcohol, no medicine + +# no of samples for each of the above cases +q = 100 + +#Generate data from different Poisson distributions +noalcohol_meds = Poisson(theta_noalcohol_meds) +alcohol_meds = Poisson(theta_alcohol_meds) +noalcohol_nomeds = Poisson(theta_noalcohol_nomeds) +alcohol_nomeds = Poisson(theta_alcohol_nomeds) + +nsneeze_data = vcat( + rand(noalcohol_meds, q), + rand(alcohol_meds, q), + rand(noalcohol_nomeds, q), + rand(alcohol_nomeds, q), +) +alcohol_data = vcat(zeros(q), ones(q), zeros(q), ones(q)) +meds_data = vcat(zeros(q), zeros(q), ones(q), ones(q)) + +df = DataFrame(; + nsneeze=nsneeze_data, + alcohol_taken=alcohol_data, + nomeds_taken=meds_data, + product_alcohol_meds=meds_data .* alcohol_data, +) +df[sample(1:nrow(df), 5; replace=false), :] +``` + +# Visualisation of the dataset + +We plot the distribution of the number of sneezes for the 4 different cases taken above. As expected, the person sneezes the most when he has taken alcohol and not taken his medicine. He sneezes the least when he doesn't consume alcohol and takes his medicine. 
+ +```{julia} +# Data Plotting + +p1 = Plots.histogram( + df[(df[:, :alcohol_taken] .== 0) .& (df[:, :nomeds_taken] .== 0), 1]; + title="no_alcohol+meds", +) +p2 = Plots.histogram( + (df[(df[:, :alcohol_taken] .== 1) .& (df[:, :nomeds_taken] .== 0), 1]); + title="alcohol+meds", +) +p3 = Plots.histogram( + (df[(df[:, :alcohol_taken] .== 0) .& (df[:, :nomeds_taken] .== 1), 1]); + title="no_alcohol+no_meds", +) +p4 = Plots.histogram( + (df[(df[:, :alcohol_taken] .== 1) .& (df[:, :nomeds_taken] .== 1), 1]); + title="alcohol+no_meds", +) +plot(p1, p2, p3, p4; layout=(2, 2), legend=false) +``` + +We must convert our `DataFrame` data into the `Matrix` form, as the manipulations we are about to perform are designed to work with `Matrix` data. We also separate the features from the labels, which will later be used by the Turing sampler to generate samples from the posterior. + +```{julia} +# Convert the DataFrame object to matrices. +data = Matrix(df[:, [:alcohol_taken, :nomeds_taken, :product_alcohol_meds]]) +data_labels = df[:, :nsneeze] +data +``` + +We must recenter our data about 0 to help the Turing sampler initialise the parameter estimates. So, we normalise the data in each column by subtracting the mean and dividing by the standard deviation: + +```{julia} +# Rescale our matrices. +data = (data .- mean(data; dims=1)) ./ std(data; dims=1) +``` + +# Declaring the Model: Poisson Regression + +Our model, `poisson_regression`, takes four arguments: + + - `x` is our set of independent variables; + - `y` is the element we want to predict; + - `n` is the number of observations we have; and + - `σ²` is the standard deviation we want to assume for our priors. + +Within the model, we create four coefficients (`b0`, `b1`, `b2`, and `b3`), each assigned a normal prior with mean zero and standard deviation `σ²`. We want to find values of these four coefficients to predict any given `y`.
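Written out, the model we are about to declare is:

$$
\begin{align}
b_j &\sim \mathrm{Normal}(0, \sigma^2), \quad j = 0, \dots, 3 \\
\theta_i &= b_0 + b_1 x_{i,1} + b_2 x_{i,2} + b_3 x_{i,3} \\
y_i &\sim \mathrm{Poisson}(\exp(\theta_i))
\end{align}
$$

where the exponential serves as the inverse link function, mapping the linear predictor $\theta_i$ to a positive Poisson rate.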
+ +Intuitively, we can think of the coefficients as: + + - `b1` is the coefficient which represents the effect of taking alcohol on the number of sneezes; + - `b2` is the coefficient which represents the effect of taking in no medicines on the number of sneezes; + - `b3` is the coefficient which represents the effect of interaction between taking alcohol and no medicine on the number of sneezes; + +The `for` block creates a variable `theta` which is the weighted combination of the input features. We have defined the priors on these weights above. We then observe the likelihood of calculating `theta` given the actual label, `y[i]`. + +```{julia} +# Bayesian poisson regression (LR) +@model function poisson_regression(x, y, n, σ²) + b0 ~ Normal(0, σ²) + b1 ~ Normal(0, σ²) + b2 ~ Normal(0, σ²) + b3 ~ Normal(0, σ²) + for i in 1:n + theta = b0 + b1 * x[i, 1] + b2 * x[i, 2] + b3 * x[i, 3] + y[i] ~ Poisson(exp(theta)) + end +end; +``` + +# Sampling from the posterior + +We use the `NUTS` sampler to sample values from the posterior. We run multiple chains using the `MCMCThreads()` function to nullify the effect of a problematic chain. We then use the Gelman, Rubin, and Brooks Diagnostic to check the convergence of these multiple chains. + +```{julia} +#| output: false +# Retrieve the number of observations. +n, _ = size(data) + +# Sample using NUTS. + +num_chains = 4 +m = poisson_regression(data, data_labels, n, 10) +chain = sample(m, NUTS(), MCMCThreads(), 2_500, num_chains; discard_adapt=false, progress=false) +``` + +```{julia} +#| echo: false +chain +``` + +::: {.callout-warning collapse="true"} +## Sampling With Multiple Threads +The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains +will run sequentially, and you may notice a warning. 
For more information, see [the Turing documentation on sampling multiple chains.](https://turinglang.org/dev/docs/using-turing/guide/#sampling-multiple-chains) +::: + +# Viewing the Diagnostics + +We use the Gelman, Rubin, and Brooks diagnostic to check whether our chains have converged. Note that this diagnostic requires multiple chains, as it analyses the differences between them. + +We expect the chains to have converged because we have run the NUTS sampler for a sufficient number of iterations (2,500). If the test fails, we will have to run the sampler for a larger number of iterations, resulting in longer computation time. + +```{julia} +gelmandiag(chain) +``` + +From the above diagnostic, we can conclude that the chains have converged because the PSRF values of the coefficients are close to 1. + +So, we have obtained the posterior distributions of the parameters. We transform the coefficients by taking the exponent of the mean of each of `b0`, `b1`, `b2` and `b3`. We exponentiate the means to get a better comparison of the relative values of the coefficients. We then compare this with the intuitive meaning that was described earlier. + +```{julia} +# Taking the first chain +c1 = chain[:, :, 1] + +# Calculating the exponentiated means +b0_exp = exp(mean(c1[:b0])) +b1_exp = exp(mean(c1[:b1])) +b2_exp = exp(mean(c1[:b2])) +b3_exp = exp(mean(c1[:b3])) + +print("The exponentiated means of the weights (or coefficients) are: \n") +println("b0: ", b0_exp) +println("b1: ", b1_exp) +println("b2: ", b2_exp) +println("b3: ", b3_exp) +print("The posterior distributions obtained after sampling can be visualised as:\n") +``` + +We can visualise the posterior by plotting it: + +```{julia} +plot(chain) +``` + +# Interpreting the Obtained Mean Values + +The exponentiated mean of the coefficient `b1` is roughly half of that of `b2`. 
This makes sense because in the data that we generated, the number of sneezes was more sensitive to the medicinal intake as compared to the alcohol consumption. We also get a weaker dependence on the interaction between the alcohol consumption and the medicinal intake as can be seen from the value of `b3`. + +# Removing the Warmup Samples + +As can be seen from the plots above, the parameters converge to their final distributions after a few iterations. +The initial values during the warmup phase increase the standard deviations of the parameters and are not required after we get the desired distributions. +Thus, we remove these warmup values and once again view the diagnostics. +To remove these warmup values, we take all values except the first 200. +This is because we set the second parameter of the NUTS sampler (which is the number of adaptations) to be equal to 200. + +```{julia} +chains_new = chain[201:end, :, :] +``` + +```{julia} +plot(chains_new) +``` + +As can be seen from the numeric values and the plots above, the standard deviation values have decreased and all the plotted values are from the estimated posteriors. The exponentiated mean values, with the warmup samples removed, have not changed by much and they are still in accordance with their intuitive meanings as described earlier. diff --git a/tutorials/08-multinomial-logistic-regression/index.qmd b/tutorials/08-multinomial-logistic-regression/index.qmd index 122428661..216598c1c 100755 --- a/tutorials/08-multinomial-logistic-regression/index.qmd +++ b/tutorials/08-multinomial-logistic-regression/index.qmd @@ -1,233 +1,233 @@ ---- -title: Bayesian Multinomial Logistic Regression -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -[Multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) is an extension of logistic regression. 
Logistic regression is used to model problems in which there are exactly two possible discrete outcomes. Multinomial logistic regression is used to model problems in which there are two or more possible discrete outcomes. - -In our example, we'll be using the iris dataset. The iris multiclass problem aims to predict the species of a flower given measurements (in centimeters) of sepal length and width and petal length and width. There are three possible species: Iris setosa, Iris versicolor, and Iris virginica. - -To start, let's import all the libraries we'll need. - -```{julia} -# Load Turing. -using Turing - -# Load RDatasets. -using RDatasets - -# Load StatsPlots for visualizations and diagnostics. -using StatsPlots - -# Functionality for splitting and normalizing the data. -using MLDataUtils: shuffleobs, splitobs, rescale! - -# We need a softmax function which is provided by NNlib. -using NNlib: softmax - -# Functionality for constructing arrays with identical elements efficiently. -using FillArrays - -# Functionality for working with scaled identity matrices. -using LinearAlgebra - -# Set a seed for reproducibility. -using Random -Random.seed!(0); -``` - -## Data Cleaning & Set Up - -Now we're going to import our dataset. Twenty rows of the dataset are shown below so you can get a good feel for what kind of data we have. - -```{julia} -# Import the "iris" dataset. -data = RDatasets.dataset("datasets", "iris"); - -# Show twenty random rows. -data[rand(1:size(data, 1), 20), :] -``` - -In this data set, the outcome `Species` is currently coded as a string. We convert it to a numerical value by using indices `1`, `2`, and `3` to indicate species `setosa`, `versicolor`, and `virginica`, respectively. - -```{julia} -# Recode the `Species` column. 
-species = ["setosa", "versicolor", "virginica"] -data[!, :Species_index] = indexin(data[!, :Species], species) - -# Show twenty random rows of the new species columns -data[rand(1:size(data, 1), 20), [:Species, :Species_index]] -``` - -After we've done that tidying, it's time to split our dataset into training and testing sets, and separate the features and target from the data. Additionally, we must rescale our feature variables so that they are centered around zero by subtracting each column by the mean and dividing it by the standard deviation. Without this step, Turing's sampler will have a hard time finding a place to start searching for parameter estimates. - -```{julia} -# Split our dataset 50%/50% into training/test sets. -trainset, testset = splitobs(shuffleobs(data), 0.5) - -# Define features and target. -features = [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth] -target = :Species_index - -# Turing requires data in matrix and vector form. -train_features = Matrix(trainset[!, features]) -test_features = Matrix(testset[!, features]) -train_target = trainset[!, target] -test_target = testset[!, target] - -# Standardize the features. -μ, σ = rescale!(train_features; obsdim=1) -rescale!(test_features, μ, σ; obsdim=1); -``` - -## Model Declaration - -Finally, we can define our model `logistic_regression`. It is a function that takes three arguments where - - - `x` is our set of independent variables; - - `y` is the element we want to predict; - - `σ` is the standard deviation we want to assume for our priors. - -We select the `setosa` species as the baseline class (the choice does not matter). Then we create the intercepts and vectors of coefficients for the other classes against that baseline. 
More concretely, we create scalar intercepts `intercept_versicolor` and `intersept_virginica` and coefficient vectors `coefficients_versicolor` and `coefficients_virginica` with four coefficients each for the features `SepalLength`, `SepalWidth`, `PetalLength` and `PetalWidth`. We assume a normal distribution with mean zero and standard deviation `σ` as prior for each scalar parameter. We want to find the posterior distribution of these, in total ten, parameters to be able to predict the species for any given set of features. - -```{julia} -# Bayesian multinomial logistic regression -@model function logistic_regression(x, y, σ) - n = size(x, 1) - length(y) == n || - throw(DimensionMismatch("number of observations in `x` and `y` is not equal")) - - # Priors of intercepts and coefficients. - intercept_versicolor ~ Normal(0, σ) - intercept_virginica ~ Normal(0, σ) - coefficients_versicolor ~ MvNormal(Zeros(4), σ^2 * I) - coefficients_virginica ~ MvNormal(Zeros(4), σ^2 * I) - - # Compute the likelihood of the observations. - values_versicolor = intercept_versicolor .+ x * coefficients_versicolor - values_virginica = intercept_virginica .+ x * coefficients_virginica - for i in 1:n - # the 0 corresponds to the base category `setosa` - v = softmax([0, values_versicolor[i], values_virginica[i]]) - y[i] ~ Categorical(v) - end -end; -``` - -## Sampling - -Now we can run our sampler. This time we'll use [`NUTS`](https://turinglang.org/stable/docs/library/#Turing.Inference.NUTS) to sample from our posterior. - -```{julia} -#| output: false -setprogress!(false) -``` - -```{julia} -#| output: false -m = logistic_regression(train_features, train_target, 1) -chain = sample(m, NUTS(), MCMCThreads(), 1_500, 3) -``` - - -```{julia} -#| echo: false -chain -``` - -::: {.callout-warning collapse="true"} -## Sampling With Multiple Threads -The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. 
If you do not, the multiple chains -will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.]({{}}#sampling-multiple-chains) -::: - -Since we ran multiple chains, we may as well do a spot check to make sure each chain converges around similar points. - -```{julia} -plot(chain) -``` - -Looks good! - -We can also use the `corner` function from MCMCChains to show the distributions of the various parameters of our multinomial logistic regression. The corner function requires MCMCChains and StatsPlots. - -```{julia} -# Only plotting the first 3 coefficients due to a bug in Plots.jl -corner( - chain, - MCMCChains.namesingroup(chain, :coefficients_versicolor)[1:3]; -) -``` - -```{julia} -# Only plotting the first 3 coefficients due to a bug in Plots.jl -corner( - chain, - MCMCChains.namesingroup(chain, :coefficients_virginica)[1:3]; -) -``` - -Fortunately the corner plots appear to demonstrate unimodal distributions for each of our parameters, so it should be straightforward to take the means of each parameter's sampled values to estimate our model to make predictions. - -## Making Predictions - -How do we test how well the model actually predicts which of the three classes an iris flower belongs to? We need to build a `prediction` function that takes the test dataset and runs it through the average parameter calculated during sampling. - -The `prediction` function below takes a `Matrix` and a `Chains` object. It computes the mean of the sampled parameters and calculates the species with the highest probability for each observation. Note that we do not have to evaluate the `softmax` function since it does not affect the order of its inputs. - -```{julia} -function prediction(x::Matrix, chain) - # Pull the means from each parameter's sampled values in the chain. 
- intercept_versicolor = mean(chain, :intercept_versicolor) - intercept_virginica = mean(chain, :intercept_virginica) - coefficients_versicolor = [ - mean(chain, k) for k in MCMCChains.namesingroup(chain, :coefficients_versicolor) - ] - coefficients_virginica = [ - mean(chain, k) for k in MCMCChains.namesingroup(chain, :coefficients_virginica) - ] - - # Compute the index of the species with the highest probability for each observation. - values_versicolor = intercept_versicolor .+ x * coefficients_versicolor - values_virginica = intercept_virginica .+ x * coefficients_virginica - species_indices = [ - argmax((0, x, y)) for (x, y) in zip(values_versicolor, values_virginica) - ] - - return species_indices -end; -``` - -Let's see how we did! We run the test matrix through the prediction function, and compute the accuracy for our prediction. - -```{julia} -# Make the predictions. -predictions = prediction(test_features, chain) - -# Calculate accuracy for our test set. -mean(predictions .== testset[!, :Species_index]) -``` - -Perhaps more important is to see the accuracy per class. - -```{julia} -for s in 1:3 - rows = testset[!, :Species_index] .== s - println("Number of `", species[s], "`: ", count(rows)) - println( - "Percentage of `", - species[s], - "` predicted correctly: ", - mean(predictions[rows] .== testset[rows, :Species_index]), - ) -end -``` - -This tutorial has demonstrated how to use Turing to perform Bayesian multinomial logistic regression. +--- +title: Bayesian Multinomial Logistic Regression +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +[Multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) is an extension of logistic regression. Logistic regression is used to model problems in which there are exactly two possible discrete outcomes. Multinomial logistic regression is used to model problems in which there are two or more possible discrete outcomes. 
+ +In our example, we'll be using the iris dataset. The iris multiclass problem aims to predict the species of a flower given measurements (in centimeters) of sepal length and width and petal length and width. There are three possible species: Iris setosa, Iris versicolor, and Iris virginica. + +To start, let's import all the libraries we'll need. + +```{julia} +# Load Turing. +using Turing + +# Load RDatasets. +using RDatasets + +# Load StatsPlots for visualizations and diagnostics. +using StatsPlots + +# Functionality for splitting and normalizing the data. +using MLDataUtils: shuffleobs, splitobs, rescale! + +# We need a softmax function which is provided by NNlib. +using NNlib: softmax + +# Functionality for constructing arrays with identical elements efficiently. +using FillArrays + +# Functionality for working with scaled identity matrices. +using LinearAlgebra + +# Set a seed for reproducibility. +using Random +Random.seed!(0); +``` + +## Data Cleaning & Set Up + +Now we're going to import our dataset. Twenty rows of the dataset are shown below so you can get a good feel for what kind of data we have. + +```{julia} +# Import the "iris" dataset. +data = RDatasets.dataset("datasets", "iris"); + +# Show twenty random rows. +data[rand(1:size(data, 1), 20), :] +``` + +In this data set, the outcome `Species` is currently coded as a string. We convert it to a numerical value by using indices `1`, `2`, and `3` to indicate species `setosa`, `versicolor`, and `virginica`, respectively. + +```{julia} +# Recode the `Species` column. +species = ["setosa", "versicolor", "virginica"] +data[!, :Species_index] = indexin(data[!, :Species], species) + +# Show twenty random rows of the new species columns +data[rand(1:size(data, 1), 20), [:Species, :Species_index]] +``` + +After we've done that tidying, it's time to split our dataset into training and testing sets, and separate the features and target from the data. 
Additionally, we must rescale our feature variables so that they are centered around zero by subtracting the mean of each column and dividing by its standard deviation. Without this step, Turing's sampler will have a hard time finding a place to start searching for parameter estimates. + +```{julia} +# Split our dataset 50%/50% into training/test sets. +trainset, testset = splitobs(shuffleobs(data), 0.5) + +# Define features and target. +features = [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth] +target = :Species_index + +# Turing requires data in matrix and vector form. +train_features = Matrix(trainset[!, features]) +test_features = Matrix(testset[!, features]) +train_target = trainset[!, target] +test_target = testset[!, target] + +# Standardize the features. +μ, σ = rescale!(train_features; obsdim=1) +rescale!(test_features, μ, σ; obsdim=1); +``` + +## Model Declaration + +Finally, we can define our model `logistic_regression`. It is a function that takes three arguments where + + - `x` is our set of independent variables; + - `y` is the element we want to predict; + - `σ` is the standard deviation we want to assume for our priors. + +We select the `setosa` species as the baseline class (the choice does not matter). Then we create the intercepts and vectors of coefficients for the other classes against that baseline. More concretely, we create scalar intercepts `intercept_versicolor` and `intercept_virginica` and coefficient vectors `coefficients_versicolor` and `coefficients_virginica` with four coefficients each for the features `SepalLength`, `SepalWidth`, `PetalLength` and `PetalWidth`. We assume a normal distribution with mean zero and standard deviation `σ` as prior for each scalar parameter. We want to find the posterior distribution of these ten parameters to be able to predict the species for any given set of features. 
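+ +In equation form, a sketch of this model is given below, writing $\alpha_k$ for the intercepts, $\boldsymbol{\beta}_k$ for the coefficient vectors, and $\eta_{i,k}$ for the linear predictors (hypothetical shorthand for the variables in the code that follows): + +\begin{align} +\alpha_k &\sim \mathcal{N}(0, \sigma^2), \quad \boldsymbol{\beta}_k \sim \mathcal{N}(\mathbf{0}, \sigma^2 I), \quad k \in \{\text{versicolor}, \text{virginica}\} \\ +\eta_{i,k} &= \alpha_k + \mathbf{x}_i^\top \boldsymbol{\beta}_k, \qquad \eta_{i,\text{setosa}} = 0 \\ +P(y_i = k) &= \frac{\exp(\eta_{i,k})}{\sum_{k'} \exp(\eta_{i,k'})} +\end{align} + +Fixing $\eta_{i,\text{setosa}} = 0$ is what makes `setosa` the baseline class; under the softmax, only the differences between the linear predictors matter. 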
+ +```{julia} +# Bayesian multinomial logistic regression +@model function logistic_regression(x, y, σ) + n = size(x, 1) + length(y) == n || + throw(DimensionMismatch("number of observations in `x` and `y` is not equal")) + + # Priors of intercepts and coefficients. + intercept_versicolor ~ Normal(0, σ) + intercept_virginica ~ Normal(0, σ) + coefficients_versicolor ~ MvNormal(Zeros(4), σ^2 * I) + coefficients_virginica ~ MvNormal(Zeros(4), σ^2 * I) + + # Compute the likelihood of the observations. + values_versicolor = intercept_versicolor .+ x * coefficients_versicolor + values_virginica = intercept_virginica .+ x * coefficients_virginica + for i in 1:n + # the 0 corresponds to the base category `setosa` + v = softmax([0, values_versicolor[i], values_virginica[i]]) + y[i] ~ Categorical(v) + end +end; +``` + +## Sampling + +Now we can run our sampler. This time we'll use [`NUTS`](https://turinglang.org/stable/docs/library/#Turing.Inference.NUTS) to sample from our posterior. + +```{julia} +#| output: false +setprogress!(false) +``` + +```{julia} +#| output: false +m = logistic_regression(train_features, train_target, 1) +chain = sample(m, NUTS(), MCMCThreads(), 1_500, 3) +``` + + +```{julia} +#| echo: false +chain +``` + +::: {.callout-warning collapse="true"} +## Sampling With Multiple Threads +The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains +will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.]({{}}#sampling-multiple-chains) +::: + +Since we ran multiple chains, we may as well do a spot check to make sure each chain converges around similar points. + +```{julia} +plot(chain) +``` + +Looks good! + +We can also use the `corner` function from MCMCChains to show the distributions of the various parameters of our multinomial logistic regression. 
The corner function requires MCMCChains and StatsPlots. + +```{julia} +# Only plotting the first 3 coefficients due to a bug in Plots.jl +corner( + chain, + MCMCChains.namesingroup(chain, :coefficients_versicolor)[1:3]; +) +``` + +```{julia} +# Only plotting the first 3 coefficients due to a bug in Plots.jl +corner( + chain, + MCMCChains.namesingroup(chain, :coefficients_virginica)[1:3]; +) +``` + +Fortunately the corner plots appear to demonstrate unimodal distributions for each of our parameters, so it should be straightforward to take the means of each parameter's sampled values to estimate our model to make predictions. + +## Making Predictions + +How do we test how well the model actually predicts which of the three classes an iris flower belongs to? We need to build a `prediction` function that takes the test dataset and runs it through the average parameter calculated during sampling. + +The `prediction` function below takes a `Matrix` and a `Chains` object. It computes the mean of the sampled parameters and calculates the species with the highest probability for each observation. Note that we do not have to evaluate the `softmax` function since it does not affect the order of its inputs. + +```{julia} +function prediction(x::Matrix, chain) + # Pull the means from each parameter's sampled values in the chain. + intercept_versicolor = mean(chain, :intercept_versicolor) + intercept_virginica = mean(chain, :intercept_virginica) + coefficients_versicolor = [ + mean(chain, k) for k in MCMCChains.namesingroup(chain, :coefficients_versicolor) + ] + coefficients_virginica = [ + mean(chain, k) for k in MCMCChains.namesingroup(chain, :coefficients_virginica) + ] + + # Compute the index of the species with the highest probability for each observation. 
+ values_versicolor = intercept_versicolor .+ x * coefficients_versicolor + values_virginica = intercept_virginica .+ x * coefficients_virginica + species_indices = [ + argmax((0, x, y)) for (x, y) in zip(values_versicolor, values_virginica) + ] + + return species_indices +end; +``` + +Let's see how we did! We run the test matrix through the prediction function, and compute the accuracy for our prediction. + +```{julia} +# Make the predictions. +predictions = prediction(test_features, chain) + +# Calculate accuracy for our test set. +mean(predictions .== testset[!, :Species_index]) +``` + +Perhaps more important is to see the accuracy per class. + +```{julia} +for s in 1:3 + rows = testset[!, :Species_index] .== s + println("Number of `", species[s], "`: ", count(rows)) + println( + "Percentage of `", + species[s], + "` predicted correctly: ", + mean(predictions[rows] .== testset[rows, :Species_index]), + ) +end +``` + +This tutorial has demonstrated how to use Turing to perform Bayesian multinomial logistic regression. diff --git a/tutorials/09-variational-inference/index.qmd b/tutorials/09-variational-inference/index.qmd index eb7f16c0e..5ec412163 100755 --- a/tutorials/09-variational-inference/index.qmd +++ b/tutorials/09-variational-inference/index.qmd @@ -1,868 +1,868 @@ ---- -title: Variational inference (VI) in Turing.jl -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -In this post we'll have a look at what's know as **variational inference (VI)**, a family of _approximate_ Bayesian inference methods, and how to use it in Turing.jl as an alternative to other approaches such as MCMC. In particular, we will focus on one of the more standard VI methods called **Automatic Differentation Variational Inference (ADVI)**. - -Here we will focus on how to use VI in Turing and not much on the theory underlying VI. 
-If you are interested in understanding the mathematics you can checkout [our write-up]({{}}) or any other resource online (there a lot of great ones). - -Using VI in Turing.jl is very straight forward. -If `model` denotes a definition of a `Turing.Model`, performing VI is as simple as - -```{julia} -#| eval: false -m = model(data...) # instantiate model on the data -q = vi(m, vi_alg) # perform VI on `m` using the VI method `vi_alg`, which returns a `VariationalPosterior` -``` - -Thus it's no more work than standard MCMC sampling in Turing. - -To get a bit more into what we can do with `vi`, we'll first have a look at a simple example and then we'll reproduce the [tutorial on Bayesian linear regression]({{}}) using VI instead of MCMC. Finally we'll look at some of the different parameters of `vi` and how you for example can use your own custom variational family. - -We first import the packages to be used: - -```{julia} -using Random -using Turing -using Turing: Variational -using StatsPlots, Measures - -Random.seed!(42); -``` - -## Simple example: Normal-Gamma conjugate model - -The Normal-(Inverse)Gamma conjugate model is defined by the following generative process - -\begin{align} -s &\sim \mathrm{InverseGamma}(2, 3) \\ -m &\sim \mathcal{N}(0, s) \\ -x_i &\overset{\text{i.i.d.}}{=} \mathcal{N}(m, s), \quad i = 1, \dots, n -\end{align} - -Recall that *conjugate* refers to the fact that we can obtain a closed-form expression for the posterior. Of course one wouldn't use something like variational inference for a conjugate model, but it's useful as a simple demonstration as we can compare the result to the true posterior. 
- -First we generate some synthetic data, define the `Turing.Model` and instantiate the model on the data: - -```{julia} -# generate data -x = randn(2000); -``` - -```{julia} -@model function model(x) - s ~ InverseGamma(2, 3) - m ~ Normal(0.0, sqrt(s)) - for i in 1:length(x) - x[i] ~ Normal(m, sqrt(s)) - end -end; -``` - -```{julia} -# Instantiate model -m = model(x); -``` - -Now we'll produce some samples from the posterior using a MCMC method, which in constrast to VI is guaranteed to converge to the *exact* posterior (as the number of samples go to infinity). - -We'll produce 10 000 samples with 200 steps used for adaptation and a target acceptance rate of 0.65 - -If you don't understand what "adaptation" or "target acceptance rate" refers to, all you really need to know is that `NUTS` is known to be one of the most accurate and efficient samplers (when applicable) while requiring little to no hand-tuning to work well. - -```{julia} -#| output: false -setprogress!(false) -``` - -```{julia} -samples_nuts = sample(m, NUTS(), 10_000); -``` - -Now let's try VI. The most important function you need to now about to do VI in Turing is `vi`: - -```{julia} -@doc(Variational.vi) -``` - -Additionally, you can pass - - - an initial variational posterior `q`, for which we assume there exists a implementation of `update(::typeof(q), θ::AbstractVector)` returning an updated posterior `q` with parameters `θ`. - - a function mapping $\theta \mapsto q_{\theta}$ (denoted above `getq`) together with initial parameters `θ`. This provides more flexibility in the types of variational families that we can use, and can sometimes be slightly more convenient for quick and rough work. - -By default, i.e. when calling `vi(m, advi)`, Turing use a *mean-field* approximation with a multivariate normal as the base-distribution. Mean-field refers to the fact that we assume all the latent variables to be *independent*. 
This the "standard" ADVI approach; see [Automatic Differentiation Variational Inference (2016)](https://arxiv.org/abs/1603.00788) for more. In Turing, one can obtain such a mean-field approximation by calling `Variational.meanfield(model)` for which there exists an internal implementation for `update`: - -```{julia} -@doc(Variational.meanfield) -``` - -Currently the only implementation of `VariationalInference` available is `ADVI`, which is very convenient and applicable as long as your `Model` is differentiable with respect to the *variational parameters*, that is, the parameters of your variational distribution, e.g. mean and variance in the mean-field approximation. - -```{julia} -@doc(Variational.ADVI) -``` - -To perform VI on the model `m` using 10 samples for gradient estimation and taking 1000 gradient steps is then as simple as: - -```{julia} -# ADVI -advi = ADVI(10, 1000) -q = vi(m, advi); -``` - -Unfortunately, for such a small problem Turing's new `NUTS` sampler is *so* efficient now that it's not that much more efficient to use ADVI. So, so very unfortunate... - -With that being said, this is not the case in general. For very complex models we'll later find that `ADVI` produces very reasonable results in a much shorter time than `NUTS`. - -And one significant advantage of using `vi` is that we can sample from the resulting `q` with ease. 
In fact, the result of the `vi` call is a `TransformedDistribution` from Bijectors.jl, and it implements the Distributions.jl interface for a `Distribution`: - -```{julia} -q isa MultivariateDistribution -``` - -This means that we can call `rand` to sample from the variational posterior `q` - -```{julia} -histogram(rand(q, 1_000)[1, :]) -``` - -and `logpdf` to compute the log-probability - -```{julia} -logpdf(q, rand(q)) -``` - -Let's check the first and second moments of the data to see how our approximation compares to the point-estimates form the data: - -```{julia} -var(x), mean(x) -``` - -```{julia} -(mean(rand(q, 1000); dims=2)...,) -``` - -```{julia} -#| echo: false -let - v, m = (mean(rand(q, 2000); dims=2)...,) - @assert isapprox(v, 1.022; atol=0.1) "Mean of s (VI posterior, 1000 samples): $v" - @assert isapprox(m, -0.027; atol=0.03) "Mean of m (VI posterior, 1000 samples): $m" -end -``` - -That's pretty close! But we're Bayesian so we're not interested in *just* matching the mean. -Let's instead look the actual density `q`. 
- -For that we need samples: - -```{julia} -samples = rand(q, 10000); -size(samples) -``` - -```{julia} -p1 = histogram( - samples[1, :]; bins=100, normed=true, alpha=0.2, color=:blue, label="", ylabel="density" -) -density!(samples[1, :]; label="s (ADVI)", color=:blue, linewidth=2) -density!(samples_nuts, :s; label="s (NUTS)", color=:green, linewidth=2) -vline!([var(x)]; label="s (data)", color=:black) -vline!([mean(samples[1, :])]; color=:blue, label="") - -p2 = histogram( - samples[2, :]; bins=100, normed=true, alpha=0.2, color=:blue, label="", ylabel="density" -) -density!(samples[2, :]; label="m (ADVI)", color=:blue, linewidth=2) -density!(samples_nuts, :m; label="m (NUTS)", color=:green, linewidth=2) -vline!([mean(x)]; color=:black, label="m (data)") -vline!([mean(samples[2, :])]; color=:blue, label="") - -plot(p1, p2; layout=(2, 1), size=(900, 500), legend=true) -``` - -For this particular `Model`, we can in fact obtain the posterior of the latent variables in closed form. This allows us to compare both `NUTS` and `ADVI` to the true posterior $p(s, m \mid x_1, \ldots, x_n)$. - -*The code below is just work to get the marginals $p(s \mid x_1, \ldots, x_n)$ and $p(m \mid x_1, \ldots, x_n)$. Feel free to skip it.* - -```{julia} -# closed form computation of the Normal-inverse-gamma posterior -# based on "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy -function posterior(μ₀::Real, κ₀::Real, α₀::Real, β₀::Real, x::AbstractVector{<:Real}) - # Compute summary statistics - n = length(x) - x̄ = mean(x) - sum_of_squares = sum(xi -> (xi - x̄)^2, x) - - # Compute parameters of the posterior - κₙ = κ₀ + n - μₙ = (κ₀ * μ₀ + n * x̄) / κₙ - αₙ = α₀ + n / 2 - βₙ = β₀ + (sum_of_squares + n * κ₀ / κₙ * (x̄ - μ₀)^2) / 2 - - return μₙ, κₙ, αₙ, βₙ -end -μₙ, κₙ, αₙ, βₙ = posterior(0.0, 1.0, 2.0, 3.0, x) - -# marginal distribution of σ² -# cf. Eq. 
(90) in "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy -p_σ² = InverseGamma(αₙ, βₙ) -p_σ²_pdf = z -> pdf(p_σ², z) - -# marginal of μ -# Eq. (91) in "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy -p_μ = μₙ + sqrt(βₙ / (αₙ * κₙ)) * TDist(2 * αₙ) -p_μ_pdf = z -> pdf(p_μ, z) - -# posterior plots -p1 = plot() -histogram!(samples[1, :]; bins=100, normed=true, alpha=0.2, color=:blue, label="") -density!(samples[1, :]; label="s (ADVI)", color=:blue) -density!(samples_nuts, :s; label="s (NUTS)", color=:green) -vline!([mean(samples[1, :])]; linewidth=1.5, color=:blue, label="") -plot!(range(0.75, 1.35; length=1_001), p_σ²_pdf; label="s (posterior)", color=:red) -vline!([var(x)]; label="s (data)", linewidth=1.5, color=:black, alpha=0.7) -xlims!(0.75, 1.35) - -p2 = plot() -histogram!(samples[2, :]; bins=100, normed=true, alpha=0.2, color=:blue, label="") -density!(samples[2, :]; label="m (ADVI)", color=:blue) -density!(samples_nuts, :m; label="m (NUTS)", color=:green) -vline!([mean(samples[2, :])]; linewidth=1.5, color=:blue, label="") -plot!(range(-0.25, 0.25; length=1_001), p_μ_pdf; label="m (posterior)", color=:red) -vline!([mean(x)]; label="m (data)", linewidth=1.5, color=:black, alpha=0.7) -xlims!(-0.25, 0.25) - -plot(p1, p2; layout=(2, 1), size=(900, 500)) -``` - -## Bayesian linear regression example using ADVI - -This is simply a duplication of the tutorial on [Bayesian linear regression]({{}}) (much of the code is directly lifted), but now with the addition of an approximate posterior obtained using `ADVI`. - -As we'll see, there is really no additional work required to apply variational inference to a more complex `Model`. - -```{julia} -Random.seed!(1); -``` - -```{julia} -using FillArrays -using RDatasets - -using LinearAlgebra -``` - -```{julia} -# Import the "Default" dataset. -data = RDatasets.dataset("datasets", "mtcars"); - -# Show the first six rows of the dataset. 
-first(data, 6) -``` - -```{julia} -# Function to split samples. -function split_data(df, at=0.70) - r = size(df, 1) - index = Int(round(r * at)) - train = df[1:index, :] - test = df[(index + 1):end, :] - return train, test -end - -# A handy helper function to rescale our dataset. -function standardize(x) - return (x .- mean(x; dims=1)) ./ std(x; dims=1) -end - -function standardize(x, orig) - return (x .- mean(orig; dims=1)) ./ std(orig; dims=1) -end - -# Another helper function to unstandardize our datasets. -function unstandardize(x, orig) - return x .* std(orig; dims=1) .+ mean(orig; dims=1) -end - -function unstandardize(x, mean_train, std_train) - return x .* std_train .+ mean_train -end -``` - -```{julia} -# Remove the model column. -select!(data, Not(:Model)) - -# Split our dataset 70%/30% into training/test sets. -train, test = split_data(data, 0.7) -train_unstandardized = copy(train) - -# Standardize both datasets. -std_train = standardize(Matrix(train)) -std_test = standardize(Matrix(test), Matrix(train)) - -# Save dataframe versions of our dataset. -train_cut = DataFrame(std_train, names(data)) -test_cut = DataFrame(std_test, names(data)) - -# Create our labels. These are the values we are trying to predict. -train_label = train_cut[:, :MPG] -test_label = test_cut[:, :MPG] - -# Get the list of columns to keep. -remove_names = filter(x -> !in(x, ["MPG"]), names(data)) - -# Filter the test and train sets. -train = Matrix(train_cut[:, remove_names]); -test = Matrix(test_cut[:, remove_names]); -``` - -```{julia} -# Bayesian linear regression. -@model function linear_regression(x, y, n_obs, n_vars, ::Type{T}=Vector{Float64}) where {T} - # Set variance prior. - σ² ~ truncated(Normal(0, 100); lower=0) - - # Set intercept prior. - intercept ~ Normal(0, 3) - - # Set the priors on our coefficients. - coefficients ~ MvNormal(Zeros(n_vars), 10.0 * I) - - # Calculate all the mu terms. 
- mu = intercept .+ x * coefficients - return y ~ MvNormal(mu, σ² * I) -end; -``` - -```{julia} -n_obs, n_vars = size(train) -m = linear_regression(train, train_label, n_obs, n_vars); -``` - -## Performing VI - -First we define the initial variational distribution, or, equivalently, the family of distributions to consider. We're going to use the same mean-field approximation as Turing will use by default when we call `vi(m, advi)`, which we obtain by calling `Variational.meanfield`. This returns a `TransformedDistribution` with a `TuringDiagMvNormal` as the underlying distribution and the transformation mapping from the reals to the domain of the latent variables. - -```{julia} -q0 = Variational.meanfield(m) -typeof(q0) -``` - -```{julia} -advi = ADVI(10, 10_000) -``` - -Turing also provides a couple of different optimizers: - - - `TruncatedADAGrad` (default) - - `DecayedADAGrad` - as these are well-suited for problems with high-variance stochastic objectives, which is usually what the ELBO ends up being at different times in our optimization process. - -With that being said, thanks to Requires.jl, if we add a `using Flux` prior to `using Turing` we can also make use of all the optimizers in `Flux`, e.g. `ADAM`, without any additional changes to your code! For example: - -```{julia} -#| eval: false -using Flux, Turing -using Turing.Variational - -vi(m, advi; optimizer=Flux.ADAM()) -``` - -just works. - -For this problem we'll use the `DecayedADAGrad` from Turing: - -```{julia} -opt = Variational.DecayedADAGrad(1e-2, 1.1, 0.9) -``` - -```{julia} -q = vi(m, advi, q0; optimizer=opt) -typeof(q) -``` - -*Note: as mentioned before, we internally define a `update(q::TransformedDistribution{<:TuringDiagMvNormal}, θ::AbstractVector)` method which takes in the current variational approximation `q` together with new parameters `z` and returns the new variational approximation. 
This is required so that we can actually update the `Distribution` object after each optimization step.* - -*Alternatively, we can instead provide the mapping $\theta \mapsto q_{\theta}$ directly together with initial parameters using the signature `vi(m, advi, getq, θ_init)` as mentioned earlier. We'll see an explicit example of this later on!* - -To compute statistics for our approximation we need samples: - -```{julia} -z = rand(q, 10_000); -``` - -Now we can for example look at the average - -```{julia} -avg = vec(mean(z; dims=2)) -``` - -The vector has the same ordering as the model, e.g. in this case `σ²` has index `1`, `intercept` has index `2` and `coefficients` has indices `3:12`. If you forget or you might want to do something programmatically with the result, you can obtain the `sym → indices` mapping as follows: - -```{julia} -_, sym2range = bijector(m, Val(true)); -sym2range -``` - -For example, we can check the sample distribution and mean value of `σ²`: - -```{julia} -histogram(z[1, :]) -avg[union(sym2range[:σ²]...)] -``` - -```{julia} -avg[union(sym2range[:intercept]...)] -``` - -```{julia} -avg[union(sym2range[:coefficients]...)] -``` - -*Note: as you can see, this is slightly awkward to work with at the moment. We'll soon add a better way of dealing with this.* - -With a bit of work (this will be much easier in the future), we can also visualize the approximate marginals of the different variables, similar to `plot(chain)`: - -```{julia} -function plot_variational_marginals(z, sym2range) - ps = [] - - for (i, sym) in enumerate(keys(sym2range)) - indices = union(sym2range[sym]...) 
# <= array of ranges - if sum(length.(indices)) > 1 - offset = 1 - for r in indices - p = density( - z[r, :]; - title="$(sym)[$offset]", - titlefontsize=10, - label="", - ylabel="Density", - margin=1.5mm, - ) - push!(ps, p) - offset += 1 - end - else - p = density( - z[first(indices), :]; - title="$(sym)", - titlefontsize=10, - label="", - ylabel="Density", - margin=1.5mm, - ) - push!(ps, p) - end - end - - return plot(ps...; layout=(length(ps), 1), size=(500, 2000), margin=4.0mm) -end -``` - -```{julia} -plot_variational_marginals(z, sym2range) -``` - -And let's compare this to using the `NUTS` sampler: - -```{julia} -chain = sample(m, NUTS(), 10_000); -``` - -```{julia} -plot(chain; margin=12.00mm) -``` - -```{julia} -vi_mean = vec(mean(z; dims=2))[[ - union(sym2range[:coefficients]...)..., - union(sym2range[:intercept]...)..., - union(sym2range[:σ²]...)..., -]] -``` - -```{julia} -mcmc_mean = mean(chain, names(chain, :parameters))[:, 2] -``` - -```{julia} -plot(mcmc_mean; xticks=1:1:length(mcmc_mean), linestyle=:dot, label="NUTS") -plot!(vi_mean; linestyle=:dot, label="VI") -``` - -One thing we can look at is simply the squared error between the means: - -```{julia} -sum(abs2, mcmc_mean .- vi_mean) -``` - -That looks pretty good! But let's see how the predictive distributions looks for the two. - -## Prediction - -Similarily to the linear regression tutorial, we're going to compare to multivariate ordinary linear regression using the `GLM` package: - -```{julia} -# Import the GLM package. -using GLM - -# Perform multivariate OLS. -ols = lm( - @formula(MPG ~ Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb), train_cut -) - -# Store our predictions in the original dataframe. -train_cut.OLSPrediction = unstandardize(GLM.predict(ols), train_unstandardized.MPG) -test_cut.OLSPrediction = unstandardize(GLM.predict(ols, test_cut), train_unstandardized.MPG); -``` - -```{julia} -# Make a prediction given an input vector, using mean parameter values from a chain. 
-function prediction_chain(chain, x) - p = get_params(chain) - α = mean(p.intercept) - β = collect(mean.(p.coefficients)) - return α .+ x * β -end -``` - -```{julia} -# Make a prediction using samples from the variational posterior given an input vector. -function prediction(samples::AbstractVector, sym2ranges, x) - α = mean(samples[union(sym2ranges[:intercept]...)]) - β = vec(mean(samples[union(sym2ranges[:coefficients]...)]; dims=2)) - return α .+ x * β -end - -function prediction(samples::AbstractMatrix, sym2ranges, x) - α = mean(samples[union(sym2ranges[:intercept]...), :]) - β = vec(mean(samples[union(sym2ranges[:coefficients]...), :]; dims=2)) - return α .+ x * β -end -``` - -```{julia} -# Unstandardize the dependent variable. -train_cut.MPG = unstandardize(train_cut.MPG, train_unstandardized.MPG) -test_cut.MPG = unstandardize(test_cut.MPG, train_unstandardized.MPG); -``` - -```{julia} -# Show the first side rows of the modified dataframe. -first(test_cut, 6) -``` - -```{julia} -z = rand(q, 10_000); -``` - -```{julia} -# Calculate the predictions for the training and testing sets using the samples `z` from variational posterior -train_cut.VIPredictions = unstandardize( - prediction(z, sym2range, train), train_unstandardized.MPG -) -test_cut.VIPredictions = unstandardize( - prediction(z, sym2range, test), train_unstandardized.MPG -) - -train_cut.BayesPredictions = unstandardize( - prediction_chain(chain, train), train_unstandardized.MPG -) -test_cut.BayesPredictions = unstandardize( - prediction_chain(chain, test), train_unstandardized.MPG -); -``` - -```{julia} -vi_loss1 = mean((train_cut.VIPredictions - train_cut.MPG) .^ 2) -bayes_loss1 = mean((train_cut.BayesPredictions - train_cut.MPG) .^ 2) -ols_loss1 = mean((train_cut.OLSPrediction - train_cut.MPG) .^ 2) - -vi_loss2 = mean((test_cut.VIPredictions - test_cut.MPG) .^ 2) -bayes_loss2 = mean((test_cut.BayesPredictions - test_cut.MPG) .^ 2) -ols_loss2 = mean((test_cut.OLSPrediction - test_cut.MPG) .^ 2) - 
-println("Training set: - VI loss: $vi_loss1 - Bayes loss: $bayes_loss1 - OLS loss: $ols_loss1 -Test set: - VI loss: $vi_loss2 - Bayes loss: $bayes_loss2 - OLS loss: $ols_loss2") -``` - - -Interestingly the squared difference between true- and mean-prediction on the test-set is actually *better* for the mean-field variational posterior than for the "true" posterior obtained by MCMC sampling using `NUTS`. But, as Bayesians, we know that the mean doesn't tell the entire story. One quick check is to look at the mean predictions ± standard deviation of the two different approaches: - -```{julia} -z = rand(q, 1000); -preds = mapreduce(hcat, eachcol(z)) do zi - return unstandardize(prediction(zi, sym2range, test), train_unstandardized.MPG) -end - -scatter( - 1:size(test, 1), - mean(preds; dims=2); - yerr=std(preds; dims=2), - label="prediction (mean ± std)", - size=(900, 500), - markersize=8, -) -scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true") -xaxis!(1:size(test, 1)) -ylims!(10, 40) -title!("Mean-field ADVI (Normal)") -``` - -```{julia} -preds = mapreduce(hcat, 1:5:size(chain, 1)) do i - return unstandardize(prediction_chain(chain[i], test), train_unstandardized.MPG) -end - -scatter( - 1:size(test, 1), - mean(preds; dims=2); - yerr=std(preds; dims=2), - label="prediction (mean ± std)", - size=(900, 500), - markersize=8, -) -scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true") -xaxis!(1:size(test, 1)) -ylims!(10, 40) -title!("MCMC (NUTS)") -``` - -Indeed we see that the MCMC approach generally provides better uncertainty estimates than the mean-field ADVI approach! Good. So all the work we've done to make MCMC fast isn't for nothing. 
- -## Alternative: provide parameter-to-distribution instead of $q$ with `update` implemented - -As mentioned earlier, it's also possible to just provide the mapping $\theta \mapsto q_{\theta}$ rather than the variational family / initial variational posterior `q`, i.e. use the interface `vi(m, advi, getq, θ_init)` where `getq` is the mapping $\theta \mapsto q_{\theta}$ - -In this section we're going to construct a mean-field approximation to the model by hand using a composition of`Shift` and `Scale` from Bijectors.jl togheter with a standard multivariate Gaussian as the base distribution. - -```{julia} -using Bijectors -``` - -```{julia} -using Bijectors: Scale, Shift -``` - -```{julia} -d = length(q) -base_dist = Turing.DistributionsAD.TuringDiagMvNormal(zeros(d), ones(d)) -``` - -`bijector(model::Turing.Model)` is defined by Turing, and will return a `bijector` which takes you from the space of the latent variables to the real space. In this particular case, this is a mapping `((0, ∞) × ℝ × ℝ¹⁰) → ℝ¹²`. We're interested in using a normal distribution as a base-distribution and transform samples to the latent space, thus we need the inverse mapping from the reals to the latent space: - -```{julia} -to_constrained = inverse(bijector(m)); -``` - -```{julia} -function getq(θ) - d = length(θ) ÷ 2 - A = @inbounds θ[1:d] - b = @inbounds θ[(d + 1):(2 * d)] - - b = to_constrained ∘ Shift(b) ∘ Scale(exp.(A)) - - return transformed(base_dist, b) -end -``` - -```{julia} -q_mf_normal = vi(m, advi, getq, randn(2 * d)); -``` - -```{julia} -p1 = plot_variational_marginals(rand(q_mf_normal, 10_000), sym2range) # MvDiagNormal + Affine transformation + to_constrained -p2 = plot_variational_marginals(rand(q, 10_000), sym2range) # Turing.meanfield(m) - -plot(p1, p2; layout=(1, 2), size=(800, 2000)) -``` - -As expected, the fits look pretty much identical. 
- -But using this interface it becomes trivial to go beyond the mean-field assumption we made for the variational posterior, as we'll see in the next section. - -### Relaxing the mean-field assumption - -Here we'll instead consider the variational family to be a full non-diagonal multivariate Gaussian. As in the previous section we'll implement this by transforming a standard multivariate Gaussian using `Scale` and `Shift`, but now `Scale` will instead be using a lower-triangular matrix (representing the Cholesky of the covariance matrix of a multivariate normal) in contrast to the diagonal matrix we used in for the mean-field approximate posterior. - -```{julia} -# Using `ComponentArrays.jl` together with `UnPack.jl` makes our lives much easier. -using ComponentArrays, UnPack -``` - -```{julia} -proto_arr = ComponentArray(; L=zeros(d, d), b=zeros(d)) -proto_axes = getaxes(proto_arr) -num_params = length(proto_arr) - -function getq(θ) - L, b = begin - @unpack L, b = ComponentArray(θ, proto_axes) - LowerTriangular(L), b - end - # For this to represent a covariance matrix we need to ensure that the diagonal is positive. - # We can enforce this by zeroing out the diagonal and then adding back the diagonal exponentiated. 
- D = Diagonal(diag(L)) - A = L - D + exp(D) # exp for Diagonal is the same as exponentiating only the diagonal entries - - b = to_constrained ∘ Shift(b) ∘ Scale(A) - - return transformed(base_dist, b) -end -``` - -```{julia} -advi = ADVI(10, 20_000) -``` - -```{julia} -q_full_normal = vi( - m, advi, getq, randn(num_params); optimizer=Variational.DecayedADAGrad(1e-2) -); -``` - -Let's have a look at the learned covariance matrix: - -```{julia} -A = q_full_normal.transform.inner.a -``` - -```{julia} -heatmap(cov(A * A')) -``` - -```{julia} -zs = rand(q_full_normal, 10_000); -``` - -```{julia} -p1 = plot_variational_marginals(rand(q_mf_normal, 10_000), sym2range) -p2 = plot_variational_marginals(rand(q_full_normal, 10_000), sym2range) - -plot(p1, p2; layout=(1, 2), size=(800, 2000)) -``` - -So it seems like the "full" ADVI approach, i.e. no mean-field assumption, obtain the same modes as the mean-field approach but with greater uncertainty for some of the `coefficients`. This - -```{julia} -# Unfortunately, it seems like this has quite a high variance which is likely to be due to numerical instability, -# so we consider a larger number of samples. If we get a couple of outliers due to numerical issues, -# these kind affect the mean prediction greatly. 
-z = rand(q_full_normal, 10_000); -``` - -```{julia} -train_cut.VIFullPredictions = unstandardize( - prediction(z, sym2range, train), train_unstandardized.MPG -) -test_cut.VIFullPredictions = unstandardize( - prediction(z, sym2range, test), train_unstandardized.MPG -); -``` - -```{julia} -vi_loss1 = mean((train_cut.VIPredictions - train_cut.MPG) .^ 2) -vifull_loss1 = mean((train_cut.VIFullPredictions - train_cut.MPG) .^ 2) -bayes_loss1 = mean((train_cut.BayesPredictions - train_cut.MPG) .^ 2) -ols_loss1 = mean((train_cut.OLSPrediction - train_cut.MPG) .^ 2) - -vi_loss2 = mean((test_cut.VIPredictions - test_cut.MPG) .^ 2) -vifull_loss2 = mean((test_cut.VIFullPredictions - test_cut.MPG) .^ 2) -bayes_loss2 = mean((test_cut.BayesPredictions - test_cut.MPG) .^ 2) -ols_loss2 = mean((test_cut.OLSPrediction - test_cut.MPG) .^ 2) - -println("Training set: - VI loss: $vi_loss1 - Bayes loss: $bayes_loss1 - OLS loss: $ols_loss1 -Test set: - VI loss: $vi_loss2 - Bayes loss: $bayes_loss2 - OLS loss: $ols_loss2") -``` - -```{julia} -z = rand(q_mf_normal, 1000); -preds = mapreduce(hcat, eachcol(z)) do zi - return unstandardize(prediction(zi, sym2range, test), train_unstandardized.MPG) -end - -p1 = scatter( - 1:size(test, 1), - mean(preds; dims=2); - yerr=std(preds; dims=2), - label="prediction (mean ± std)", - size=(900, 500), - markersize=8, -) -scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true") -xaxis!(1:size(test, 1)) -ylims!(10, 40) -title!("Mean-field ADVI (Normal)") -``` - -```{julia} -z = rand(q_full_normal, 1000); -preds = mapreduce(hcat, eachcol(z)) do zi - return unstandardize(prediction(zi, sym2range, test), train_unstandardized.MPG) -end - -p2 = scatter( - 1:size(test, 1), - mean(preds; dims=2); - yerr=std(preds; dims=2), - label="prediction (mean ± std)", - size=(900, 500), - markersize=8, -) -scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true") -xaxis!(1:size(test, 1)) -ylims!(10, 40) 
-title!("Full ADVI (Normal)") -``` - -```{julia} -preds = mapreduce(hcat, 1:5:size(chain, 1)) do i - return unstandardize(prediction_chain(chain[i], test), train_unstandardized.MPG) -end - -p3 = scatter( - 1:size(test, 1), - mean(preds; dims=2); - yerr=std(preds; dims=2), - label="prediction (mean ± std)", - size=(900, 500), - markersize=8, -) -scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true") -xaxis!(1:size(test, 1)) -ylims!(10, 40) -title!("MCMC (NUTS)") -``` - -```{julia} -plot(p1, p2, p3; layout=(1, 3), size=(900, 250), label="") -``` - -Here we actually see that indeed both the full ADVI and the MCMC approaches does a much better job of quantifying the uncertainty of predictions for never-before-seen samples, with full ADVI seemingly *underestimating* the variance slightly compared to MCMC. - -So now you know how to do perform VI on your Turing.jl model! Great isn't it? +--- +title: Variational inference (VI) in Turing.jl +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +In this post we'll have a look at what's know as **variational inference (VI)**, a family of _approximate_ Bayesian inference methods, and how to use it in Turing.jl as an alternative to other approaches such as MCMC. In particular, we will focus on one of the more standard VI methods called **Automatic Differentation Variational Inference (ADVI)**. + +Here we will focus on how to use VI in Turing and not much on the theory underlying VI. +If you are interested in understanding the mathematics you can checkout [our write-up]({{}}) or any other resource online (there a lot of great ones). + +Using VI in Turing.jl is very straight forward. +If `model` denotes a definition of a `Turing.Model`, performing VI is as simple as + +```{julia} +#| eval: false +m = model(data...) 
# instantiate model on the data +q = vi(m, vi_alg) # perform VI on `m` using the VI method `vi_alg`, which returns a `VariationalPosterior` +``` + +Thus it's no more work than standard MCMC sampling in Turing. + +To get a bit more into what we can do with `vi`, we'll first have a look at a simple example and then we'll reproduce the [tutorial on Bayesian linear regression]({{}}) using VI instead of MCMC. Finally we'll look at some of the different parameters of `vi` and how you can, for example, use your own custom variational family. + +We first import the packages to be used: + +```{julia} +using Random +using Turing +using Turing: Variational +using StatsPlots, Measures + +Random.seed!(42); +``` + +## Simple example: Normal-Gamma conjugate model + +The Normal-(Inverse)Gamma conjugate model is defined by the following generative process: + +\begin{align} +s &\sim \mathrm{InverseGamma}(2, 3) \\ +m &\sim \mathcal{N}(0, s) \\ +x_i &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(m, s), \quad i = 1, \dots, n +\end{align} + +Recall that *conjugate* refers to the fact that we can obtain a closed-form expression for the posterior. Of course one wouldn't use something like variational inference for a conjugate model, but it's useful as a simple demonstration, since we can compare the result to the true posterior. + +First we generate some synthetic data, define the `Turing.Model` and instantiate the model on the data: + +```{julia} +# generate data +x = randn(2000); +``` + +```{julia} +@model function model(x) + s ~ InverseGamma(2, 3) + m ~ Normal(0.0, sqrt(s)) + for i in 1:length(x) + x[i] ~ Normal(m, sqrt(s)) + end +end; +``` + +```{julia} +# Instantiate model +m = model(x); +``` + +Now we'll produce some samples from the posterior using an MCMC method, which in contrast to VI is guaranteed to converge to the *exact* posterior (as the number of samples goes to infinity).
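VI, by contrast, turns posterior inference into an optimization problem: it searches a parametrized family of distributions $q_{\theta}$ for the member that maximizes the evidence lower bound (ELBO). This is a standard identity (not Turing-specific) relating the ELBO to the KL divergence from the true posterior:

\begin{align}
\mathrm{ELBO}(q_{\theta}) &= \mathbb{E}_{(s, m) \sim q_{\theta}}\big[\log p(x_{1:n}, s, m) - \log q_{\theta}(s, m)\big] \\
&= \log p(x_{1:n}) - \mathrm{KL}\big(q_{\theta}(s, m) \,\|\, p(s, m \mid x_{1:n})\big)
\end{align}

Since $\log p(x_{1:n})$ is constant in $\theta$, maximizing the ELBO minimizes the KL divergence to the posterior — but the optimum is still the best member *within the chosen family*, which is why VI remains approximate even with unlimited compute.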
+ +We'll produce 10,000 samples, with 200 steps used for adaptation and a target acceptance rate of 0.65. + +If you don't understand what "adaptation" or "target acceptance rate" refers to, all you really need to know is that `NUTS` is known to be one of the most accurate and efficient samplers (when applicable) while requiring little to no hand-tuning to work well. + +```{julia} +#| output: false +setprogress!(false) +``` + +```{julia} +samples_nuts = sample(m, NUTS(), 10_000); +``` + +Now let's try VI. The most important function you need to know about to do VI in Turing is `vi`: + +```{julia} +@doc(Variational.vi) +``` + +Additionally, you can pass + + - an initial variational posterior `q`, for which we assume there exists an implementation of `update(::typeof(q), θ::AbstractVector)` returning an updated posterior `q` with parameters `θ`. + - a function mapping $\theta \mapsto q_{\theta}$ (denoted above `getq`) together with initial parameters `θ`. This provides more flexibility in the types of variational families that we can use, and can sometimes be slightly more convenient for quick and rough work. + +By default, i.e. when calling `vi(m, advi)`, Turing uses a *mean-field* approximation with a multivariate normal as the base distribution. Mean-field refers to the fact that we assume all the latent variables to be *independent*. This is the "standard" ADVI approach; see [Automatic Differentiation Variational Inference (2016)](https://arxiv.org/abs/1603.00788) for more. In Turing, one can obtain such a mean-field approximation by calling `Variational.meanfield(model)`, for which there exists an internal implementation of `update`: + +```{julia} +@doc(Variational.meanfield) +``` + +Currently the only implementation of `VariationalInference` available is `ADVI`, which is very convenient and applicable as long as your `Model` is differentiable with respect to the *variational parameters*, that is, the parameters of your variational distribution, e.g.
mean and variance in the mean-field approximation. + +```{julia} +@doc(Variational.ADVI) +``` + +Performing VI on the model `m` using 10 samples for gradient estimation and taking 1000 gradient steps is then as simple as: + +```{julia} +# ADVI +advi = ADVI(10, 1000) +q = vi(m, advi); +``` + +Unfortunately, for such a small problem Turing's new `NUTS` sampler is *so* efficient now that it's not that much more efficient to use ADVI. So, so very unfortunate... + +With that being said, this is not the case in general. For very complex models we'll later find that `ADVI` produces very reasonable results in a much shorter time than `NUTS`. + +One significant advantage of using `vi` is that we can sample from the resulting `q` with ease. In fact, the result of the `vi` call is a `TransformedDistribution` from Bijectors.jl, and it implements the Distributions.jl interface for a `Distribution`: + +```{julia} +q isa MultivariateDistribution +``` + +This means that we can call `rand` to sample from the variational posterior `q` + +```{julia} +histogram(rand(q, 1_000)[1, :]) +``` + +and `logpdf` to compute the log-probability + +```{julia} +logpdf(q, rand(q)) +``` + +Let's check the first and second moments of the data to see how our approximation compares to the point estimates from the data: + +```{julia} +var(x), mean(x) +``` + +```{julia} +(mean(rand(q, 1000); dims=2)...,) +``` + +```{julia} +#| echo: false +let + v, m = (mean(rand(q, 2000); dims=2)...,) + @assert isapprox(v, 1.022; atol=0.1) "Mean of s (VI posterior, 2000 samples): $v" + @assert isapprox(m, -0.027; atol=0.03) "Mean of m (VI posterior, 2000 samples): $m" +end +``` + +That's pretty close! But we're Bayesian, so we're not interested in *just* matching the mean. +Let's instead look at the actual density `q`.
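As a quick aside, since `q` is a `TransformedDistribution`, the fitted variational parameters can in principle be read off directly instead of being estimated from samples. A rough sketch — note that the field names `dist` and `transform` here are Bijectors.jl internals, not a public API, and may differ between versions:

```{julia}
#| eval: false
q.dist         # underlying Gaussian over the unconstrained space ℝ²
mean(q.dist)   # variational means in unconstrained space
q.transform    # bijection mapping unconstrained values back to (s, m) ∈ (0, ∞) × ℝ
```

These means live in the *unconstrained* space, so they do not directly equal the posterior means of `s` and `m`; sampling with `rand(q, n)` and summarizing, as we do throughout this tutorial, works for any variational family and is the approach we stick to.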
+ +For that we need samples: + +```{julia} +samples = rand(q, 10000); +size(samples) +``` + +```{julia} +p1 = histogram( + samples[1, :]; bins=100, normed=true, alpha=0.2, color=:blue, label="", ylabel="density" +) +density!(samples[1, :]; label="s (ADVI)", color=:blue, linewidth=2) +density!(samples_nuts, :s; label="s (NUTS)", color=:green, linewidth=2) +vline!([var(x)]; label="s (data)", color=:black) +vline!([mean(samples[1, :])]; color=:blue, label="") + +p2 = histogram( + samples[2, :]; bins=100, normed=true, alpha=0.2, color=:blue, label="", ylabel="density" +) +density!(samples[2, :]; label="m (ADVI)", color=:blue, linewidth=2) +density!(samples_nuts, :m; label="m (NUTS)", color=:green, linewidth=2) +vline!([mean(x)]; color=:black, label="m (data)") +vline!([mean(samples[2, :])]; color=:blue, label="") + +plot(p1, p2; layout=(2, 1), size=(900, 500), legend=true) +``` + +For this particular `Model`, we can in fact obtain the posterior of the latent variables in closed form. This allows us to compare both `NUTS` and `ADVI` to the true posterior $p(s, m \mid x_1, \ldots, x_n)$. + +*The code below is just work to get the marginals $p(s \mid x_1, \ldots, x_n)$ and $p(m \mid x_1, \ldots, x_n)$. Feel free to skip it.* + +```{julia} +# closed form computation of the Normal-inverse-gamma posterior +# based on "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy +function posterior(μ₀::Real, κ₀::Real, α₀::Real, β₀::Real, x::AbstractVector{<:Real}) + # Compute summary statistics + n = length(x) + x̄ = mean(x) + sum_of_squares = sum(xi -> (xi - x̄)^2, x) + + # Compute parameters of the posterior + κₙ = κ₀ + n + μₙ = (κ₀ * μ₀ + n * x̄) / κₙ + αₙ = α₀ + n / 2 + βₙ = β₀ + (sum_of_squares + n * κ₀ / κₙ * (x̄ - μ₀)^2) / 2 + + return μₙ, κₙ, αₙ, βₙ +end +μₙ, κₙ, αₙ, βₙ = posterior(0.0, 1.0, 2.0, 3.0, x) + +# marginal distribution of σ² +# cf. Eq. 
(90) in "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy +p_σ² = InverseGamma(αₙ, βₙ) +p_σ²_pdf = z -> pdf(p_σ², z) + +# marginal of μ +# Eq. (91) in "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy +p_μ = μₙ + sqrt(βₙ / (αₙ * κₙ)) * TDist(2 * αₙ) +p_μ_pdf = z -> pdf(p_μ, z) + +# posterior plots +p1 = plot() +histogram!(samples[1, :]; bins=100, normed=true, alpha=0.2, color=:blue, label="") +density!(samples[1, :]; label="s (ADVI)", color=:blue) +density!(samples_nuts, :s; label="s (NUTS)", color=:green) +vline!([mean(samples[1, :])]; linewidth=1.5, color=:blue, label="") +plot!(range(0.75, 1.35; length=1_001), p_σ²_pdf; label="s (posterior)", color=:red) +vline!([var(x)]; label="s (data)", linewidth=1.5, color=:black, alpha=0.7) +xlims!(0.75, 1.35) + +p2 = plot() +histogram!(samples[2, :]; bins=100, normed=true, alpha=0.2, color=:blue, label="") +density!(samples[2, :]; label="m (ADVI)", color=:blue) +density!(samples_nuts, :m; label="m (NUTS)", color=:green) +vline!([mean(samples[2, :])]; linewidth=1.5, color=:blue, label="") +plot!(range(-0.25, 0.25; length=1_001), p_μ_pdf; label="m (posterior)", color=:red) +vline!([mean(x)]; label="m (data)", linewidth=1.5, color=:black, alpha=0.7) +xlims!(-0.25, 0.25) + +plot(p1, p2; layout=(2, 1), size=(900, 500)) +``` + +## Bayesian linear regression example using ADVI + +This is simply a duplication of the tutorial on [Bayesian linear regression]({{}}) (much of the code is directly lifted), but now with the addition of an approximate posterior obtained using `ADVI`. + +As we'll see, there is really no additional work required to apply variational inference to a more complex `Model`. + +```{julia} +Random.seed!(1); +``` + +```{julia} +using FillArrays +using RDatasets + +using LinearAlgebra +``` + +```{julia} +# Import the "Default" dataset. +data = RDatasets.dataset("datasets", "mtcars"); + +# Show the first six rows of the dataset. 
+first(data, 6) +``` + +```{julia} +# Function to split samples. +function split_data(df, at=0.70) + r = size(df, 1) + index = Int(round(r * at)) + train = df[1:index, :] + test = df[(index + 1):end, :] + return train, test +end + +# A handy helper function to rescale our dataset. +function standardize(x) + return (x .- mean(x; dims=1)) ./ std(x; dims=1) +end + +function standardize(x, orig) + return (x .- mean(orig; dims=1)) ./ std(orig; dims=1) +end + +# Another helper function to unstandardize our datasets. +function unstandardize(x, orig) + return x .* std(orig; dims=1) .+ mean(orig; dims=1) +end + +function unstandardize(x, mean_train, std_train) + return x .* std_train .+ mean_train +end +``` + +```{julia} +# Remove the model column. +select!(data, Not(:Model)) + +# Split our dataset 70%/30% into training/test sets. +train, test = split_data(data, 0.7) +train_unstandardized = copy(train) + +# Standardize both datasets. +std_train = standardize(Matrix(train)) +std_test = standardize(Matrix(test), Matrix(train)) + +# Save dataframe versions of our dataset. +train_cut = DataFrame(std_train, names(data)) +test_cut = DataFrame(std_test, names(data)) + +# Create our labels. These are the values we are trying to predict. +train_label = train_cut[:, :MPG] +test_label = test_cut[:, :MPG] + +# Get the list of columns to keep. +remove_names = filter(x -> !in(x, ["MPG"]), names(data)) + +# Filter the test and train sets. +train = Matrix(train_cut[:, remove_names]); +test = Matrix(test_cut[:, remove_names]); +``` + +```{julia} +# Bayesian linear regression. +@model function linear_regression(x, y, n_obs, n_vars, ::Type{T}=Vector{Float64}) where {T} + # Set variance prior. + σ² ~ truncated(Normal(0, 100); lower=0) + + # Set intercept prior. + intercept ~ Normal(0, 3) + + # Set the priors on our coefficients. + coefficients ~ MvNormal(Zeros(n_vars), 10.0 * I) + + # Calculate all the mu terms. 
+    mu = intercept .+ x * coefficients
+    return y ~ MvNormal(mu, σ² * I)
+end;
+```
+
+```{julia}
+n_obs, n_vars = size(train)
+m = linear_regression(train, train_label, n_obs, n_vars);
+```
+
+## Performing VI
+
+First we define the initial variational distribution, or, equivalently, the family of distributions to consider. We're going to use the same mean-field approximation as Turing will use by default when we call `vi(m, advi)`, which we obtain by calling `Variational.meanfield`. This returns a `TransformedDistribution` with a `TuringDiagMvNormal` as the underlying distribution and the transformation mapping from the reals to the domain of the latent variables.
+
+```{julia}
+q0 = Variational.meanfield(m)
+typeof(q0)
+```
+
+```{julia}
+advi = ADVI(10, 10_000)
+```
+
+Here the first argument is the number of Monte Carlo samples used to estimate the gradient of the ELBO in each step, and the second is the maximum number of optimization steps.
+
+Turing also provides a couple of different optimizers:
+
+ - `TruncatedADAGrad` (default)
+ - `DecayedADAGrad`
+
+These are well-suited for problems with high-variance stochastic objectives, which is usually what the ELBO ends up being at different times in our optimization process.
+
+With that being said, thanks to Requires.jl, if we add a `using Flux` prior to `using Turing` we can also make use of all the optimizers in `Flux`, e.g. `ADAM`, without any additional changes to our code! For example:
+
+```{julia}
+#| eval: false
+using Flux, Turing
+using Turing.Variational
+
+vi(m, advi; optimizer=Flux.ADAM())
+```
+
+just works.
+
+For this problem we'll use the `DecayedADAGrad` from Turing:
+
+```{julia}
+opt = Variational.DecayedADAGrad(1e-2, 1.1, 0.9)
+```
+
+```{julia}
+q = vi(m, advi, q0; optimizer=opt)
+typeof(q)
+```
+
+*Note: as mentioned before, we internally define an `update(q::TransformedDistribution{<:TuringDiagMvNormal}, θ::AbstractVector)` method which takes in the current variational approximation `q` together with new parameters `θ` and returns the new variational approximation. This is required so that we can actually update the `Distribution` object after each optimization step.*
+
+*Alternatively, we can instead provide the mapping $\theta \mapsto q_{\theta}$ directly together with initial parameters using the signature `vi(m, advi, getq, θ_init)` as mentioned earlier. We'll see an explicit example of this later on!*
+
+To compute statistics for our approximation we need samples:
+
+```{julia}
+z = rand(q, 10_000);
+```
+
+Now we can, for example, look at the average:
+
+```{julia}
+avg = vec(mean(z; dims=2))
+```
+
+The vector has the same ordering as the model, e.g. in this case `σ²` has index `1`, `intercept` has index `2`, and `coefficients` has indices `3:12`. If you forget the ordering, or you want to do something programmatically with the result, you can obtain the `sym → indices` mapping as follows:
+
+```{julia}
+_, sym2range = bijector(m, Val(true));
+sym2range
+```
+
+For example, we can check the sample distribution and mean value of `σ²`:
+
+```{julia}
+histogram(z[1, :])
+avg[union(sym2range[:σ²]...)]
+```
+
+```{julia}
+avg[union(sym2range[:intercept]...)]
+```
+
+```{julia}
+avg[union(sym2range[:coefficients]...)]
+```
+
+*Note: as you can see, this is slightly awkward to work with at the moment. We'll soon add a better way of dealing with this.*
+
+With a bit of work (this will be much easier in the future), we can also visualize the approximate marginals of the different variables, similar to `plot(chain)`:
+
+```{julia}
+function plot_variational_marginals(z, sym2range)
+    ps = []
+
+    for (i, sym) in enumerate(keys(sym2range))
+        indices = union(sym2range[sym]...)  # <= array of ranges
+        if sum(length.(indices)) > 1
+            offset = 1
+            for r in indices
+                p = density(
+                    z[r, :];
+                    title="$(sym)[$offset]",
+                    titlefontsize=10,
+                    label="",
+                    ylabel="Density",
+                    margin=1.5mm,
+                )
+                push!(ps, p)
+                offset += 1
+            end
+        else
+            p = density(
+                z[first(indices), :];
+                title="$(sym)",
+                titlefontsize=10,
+                label="",
+                ylabel="Density",
+                margin=1.5mm,
+            )
+            push!(ps, p)
+        end
+    end
+
+    return plot(ps...; layout=(length(ps), 1), size=(500, 2000), margin=4.0mm)
+end
+```
+
+```{julia}
+plot_variational_marginals(z, sym2range)
+```
+
+And let's compare this to using the `NUTS` sampler:
+
+```{julia}
+chain = sample(m, NUTS(), 10_000);
+```
+
+```{julia}
+plot(chain; margin=12.00mm)
+```
+
+```{julia}
+vi_mean = vec(mean(z; dims=2))[[
+    union(sym2range[:coefficients]...)...,
+    union(sym2range[:intercept]...)...,
+    union(sym2range[:σ²]...)...,
+]]
+```
+
+```{julia}
+mcmc_mean = mean(chain, names(chain, :parameters))[:, 2]
+```
+
+```{julia}
+plot(mcmc_mean; xticks=1:1:length(mcmc_mean), linestyle=:dot, label="NUTS")
+plot!(vi_mean; linestyle=:dot, label="VI")
+```
+
+One thing we can look at is simply the squared error between the means:
+
+```{julia}
+sum(abs2, mcmc_mean .- vi_mean)
+```
+
+That looks pretty good! But let's see how the predictive distributions look for the two.
+
+## Prediction
+
+Similarly to the linear regression tutorial, we're going to compare to multivariate ordinary linear regression using the `GLM` package:
+
+```{julia}
+# Import the GLM package.
+using GLM
+
+# Perform multivariate OLS.
+ols = lm(
+    @formula(MPG ~ Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb), train_cut
+)
+
+# Store our predictions in the original dataframe.
+train_cut.OLSPrediction = unstandardize(GLM.predict(ols), train_unstandardized.MPG)
+test_cut.OLSPrediction = unstandardize(GLM.predict(ols, test_cut), train_unstandardized.MPG);
+```
+
+```{julia}
+# Make a prediction given an input vector, using mean parameter values from a chain.
+function prediction_chain(chain, x)
+    p = get_params(chain)
+    α = mean(p.intercept)
+    β = collect(mean.(p.coefficients))
+    return α .+ x * β
+end
+```
+
+```{julia}
+# Make a prediction using samples from the variational posterior given an input vector.
+function prediction(samples::AbstractVector, sym2ranges, x)
+    α = mean(samples[union(sym2ranges[:intercept]...)])
+    β = vec(mean(samples[union(sym2ranges[:coefficients]...)]; dims=2))
+    return α .+ x * β
+end
+
+function prediction(samples::AbstractMatrix, sym2ranges, x)
+    α = mean(samples[union(sym2ranges[:intercept]...), :])
+    β = vec(mean(samples[union(sym2ranges[:coefficients]...), :]; dims=2))
+    return α .+ x * β
+end
+```
+
+```{julia}
+# Unstandardize the dependent variable.
+train_cut.MPG = unstandardize(train_cut.MPG, train_unstandardized.MPG)
+test_cut.MPG = unstandardize(test_cut.MPG, train_unstandardized.MPG);
+```
+
+```{julia}
+# Show the first six rows of the modified dataframe.
+first(test_cut, 6)
+```
+
+```{julia}
+z = rand(q, 10_000);
+```
+
+```{julia}
+# Calculate the predictions for the training and testing sets using the samples `z` from the variational posterior.
+train_cut.VIPredictions = unstandardize(
+    prediction(z, sym2range, train), train_unstandardized.MPG
+)
+test_cut.VIPredictions = unstandardize(
+    prediction(z, sym2range, test), train_unstandardized.MPG
+)
+
+train_cut.BayesPredictions = unstandardize(
+    prediction_chain(chain, train), train_unstandardized.MPG
+)
+test_cut.BayesPredictions = unstandardize(
+    prediction_chain(chain, test), train_unstandardized.MPG
+);
+```
+
+```{julia}
+vi_loss1 = mean((train_cut.VIPredictions - train_cut.MPG) .^ 2)
+bayes_loss1 = mean((train_cut.BayesPredictions - train_cut.MPG) .^ 2)
+ols_loss1 = mean((train_cut.OLSPrediction - train_cut.MPG) .^ 2)
+
+vi_loss2 = mean((test_cut.VIPredictions - test_cut.MPG) .^ 2)
+bayes_loss2 = mean((test_cut.BayesPredictions - test_cut.MPG) .^ 2)
+ols_loss2 = mean((test_cut.OLSPrediction - test_cut.MPG) .^ 2)
+
+println("Training set:
+    VI loss: $vi_loss1
+    Bayes loss: $bayes_loss1
+    OLS loss: $ols_loss1
+Test set:
+    VI loss: $vi_loss2
+    Bayes loss: $bayes_loss2
+    OLS loss: $ols_loss2")
+```
+
+Interestingly, the squared difference between the true values and the mean predictions on the test set is actually *better* for the mean-field variational posterior than for the "true" posterior obtained by MCMC sampling using `NUTS`. But, as Bayesians, we know that the mean doesn't tell the entire story. One quick check is to look at the mean predictions ± standard deviation of the two different approaches:
+
+```{julia}
+z = rand(q, 1000);
+preds = mapreduce(hcat, eachcol(z)) do zi
+    return unstandardize(prediction(zi, sym2range, test), train_unstandardized.MPG)
+end
+
+scatter(
+    1:size(test, 1),
+    mean(preds; dims=2);
+    yerr=std(preds; dims=2),
+    label="prediction (mean ± std)",
+    size=(900, 500),
+    markersize=8,
+)
+scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true")
+xaxis!(1:size(test, 1))
+ylims!(10, 40)
+title!("Mean-field ADVI (Normal)")
+```
+
+```{julia}
+preds = mapreduce(hcat, 1:5:size(chain, 1)) do i
+    return unstandardize(prediction_chain(chain[i], test), train_unstandardized.MPG)
+end
+
+scatter(
+    1:size(test, 1),
+    mean(preds; dims=2);
+    yerr=std(preds; dims=2),
+    label="prediction (mean ± std)",
+    size=(900, 500),
+    markersize=8,
+)
+scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true")
+xaxis!(1:size(test, 1))
+ylims!(10, 40)
+title!("MCMC (NUTS)")
+```
+
+Indeed we see that the MCMC approach generally provides better uncertainty estimates than the mean-field ADVI approach! Good. So all the work we've done to make MCMC fast isn't for nothing.
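Beyond eyeballing the error bars, a quick quantitative complement is to check the empirical coverage of the predictive intervals, i.e. the fraction of held-out observations that fall inside the mean ± 2 standard deviations band. The helper below is a self-contained sketch and not part of the original tutorial (`interval_coverage` and the toy data are purely illustrative), but the same function can be applied to the `preds` matrices and the unstandardized `test_label` computed above:

```julia
using Statistics

# Empirical coverage of the mean ± k*std band: the fraction of true values
# that fall inside the interval implied by the prediction samples.
# `preds` has one row per observation and one column per posterior draw.
function interval_coverage(preds::AbstractMatrix, truth::AbstractVector; k=2)
    μ = vec(mean(preds; dims=2))
    σ = vec(std(preds; dims=2))
    return mean((truth .>= μ .- k .* σ) .& (truth .<= μ .+ k .* σ))
end

# Toy check with standard normal draws: coverage should be close to 95%.
interval_coverage(randn(200, 5_000), randn(200))
```

A well-calibrated posterior predictive gives coverage close to the nominal level; an approximation that underestimates the variance shows up as coverage noticeably below it.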
+
+## Alternative: provide parameter-to-distribution instead of $q$ with `update` implemented
+
+As mentioned earlier, it's also possible to just provide the mapping $\theta \mapsto q_{\theta}$ rather than the variational family / initial variational posterior `q`, i.e. use the interface `vi(m, advi, getq, θ_init)` where `getq` is the mapping $\theta \mapsto q_{\theta}$.
+
+In this section we're going to construct a mean-field approximation to the model by hand using a composition of `Shift` and `Scale` from Bijectors.jl together with a standard multivariate Gaussian as the base distribution.
+
+```{julia}
+using Bijectors
+```
+
+```{julia}
+using Bijectors: Scale, Shift
+```
+
+```{julia}
+d = length(q)
+base_dist = Turing.DistributionsAD.TuringDiagMvNormal(zeros(d), ones(d))
+```
+
+`bijector(model::Turing.Model)` is defined by Turing, and will return a `bijector` which takes you from the space of the latent variables to the real space. In this particular case, this is a mapping `((0, ∞) × ℝ × ℝ¹⁰) → ℝ¹²`. We're interested in using a normal distribution as a base distribution and transforming samples to the latent space, thus we need the inverse mapping from the reals to the latent space:
+
+```{julia}
+to_constrained = inverse(bijector(m));
+```
+
+```{julia}
+function getq(θ)
+    d = length(θ) ÷ 2
+    A = @inbounds θ[1:d]
+    b = @inbounds θ[(d + 1):(2 * d)]
+
+    b = to_constrained ∘ Shift(b) ∘ Scale(exp.(A))
+
+    return transformed(base_dist, b)
+end
+```
+
+```{julia}
+q_mf_normal = vi(m, advi, getq, randn(2 * d));
+```
+
+```{julia}
+p1 = plot_variational_marginals(rand(q_mf_normal, 10_000), sym2range)  # MvDiagNormal + Affine transformation + to_constrained
+p2 = plot_variational_marginals(rand(q, 10_000), sym2range)  # Turing.meanfield(m)
+
+plot(p1, p2; layout=(1, 2), size=(800, 2000))
+```
+
+As expected, the fits look pretty much identical.
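It is worth pausing on why `getq` parameterizes the scale as `exp.(A)`: with a standard normal base distribution, shifting by `b` and scaling by `exp(a)` yields a Gaussian with mean `b` and standard deviation `exp(a)`, so the unconstrained parameters can be optimized freely while the standard deviations stay positive. A minimal self-contained sketch of this reparameterization (toy values, not part of the tutorial):

```julia
using Random, Statistics

Random.seed!(1)

# Unconstrained variational parameters: log-scale a and shift b.
a, b = -0.5, 2.0

# Transform standard normal samples: x = exp(a) * z + b is N(b, exp(2a)),
# so the standard deviation exp(a) is positive for any real a.
z = randn(100_000)
x = exp(a) .* z .+ b

(mean(x), std(x))  # ≈ (2.0, exp(-0.5) ≈ 0.61)
```

The full-covariance version in the next section plays the same trick, except the scale becomes a lower-triangular matrix whose diagonal is exponentiated.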
+
+But using this interface it becomes trivial to go beyond the mean-field assumption we made for the variational posterior, as we'll see in the next section.
+
+### Relaxing the mean-field assumption
+
+Here we'll instead consider the variational family to be a full non-diagonal multivariate Gaussian. As in the previous section we'll implement this by transforming a standard multivariate Gaussian using `Scale` and `Shift`, but now `Scale` will instead be using a lower-triangular matrix (representing the Cholesky of the covariance matrix of a multivariate normal) in contrast to the diagonal matrix we used for the mean-field approximate posterior.
+
+```{julia}
+# Using `ComponentArrays.jl` together with `UnPack.jl` makes our lives much easier.
+using ComponentArrays, UnPack
+```
+
+```{julia}
+proto_arr = ComponentArray(; L=zeros(d, d), b=zeros(d))
+proto_axes = getaxes(proto_arr)
+num_params = length(proto_arr)
+
+function getq(θ)
+    L, b = begin
+        @unpack L, b = ComponentArray(θ, proto_axes)
+        LowerTriangular(L), b
+    end
+    # For this to represent a covariance matrix we need to ensure that the diagonal is positive.
+    # We can enforce this by zeroing out the diagonal and then adding back the diagonal exponentiated.
+    D = Diagonal(diag(L))
+    A = L - D + exp(D)  # exp for Diagonal is the same as exponentiating only the diagonal entries
+
+    b = to_constrained ∘ Shift(b) ∘ Scale(A)
+
+    return transformed(base_dist, b)
+end
+```
+
+```{julia}
+advi = ADVI(10, 20_000)
+```
+
+```{julia}
+q_full_normal = vi(
+    m, advi, getq, randn(num_params); optimizer=Variational.DecayedADAGrad(1e-2)
+);
+```
+
+Let's have a look at the learned covariance matrix:
+
+```{julia}
+A = q_full_normal.transform.inner.a
+```
+
+```{julia}
+heatmap(cov(A * A'))
+```
+
+```{julia}
+zs = rand(q_full_normal, 10_000);
+```
+
+```{julia}
+p1 = plot_variational_marginals(rand(q_mf_normal, 10_000), sym2range)
+p2 = plot_variational_marginals(rand(q_full_normal, 10_000), sym2range)
+
+plot(p1, p2; layout=(1, 2), size=(800, 2000))
+```
+
+So it seems like the "full" ADVI approach, i.e. no mean-field assumption, obtains the same modes as the mean-field approach but with greater uncertainty for some of the `coefficients`.
+
+```{julia}
+# Unfortunately, it seems like this has quite a high variance which is likely to be due to numerical instability,
+# so we consider a larger number of samples. If we get a couple of outliers due to numerical issues,
+# these can affect the mean prediction greatly.
+z = rand(q_full_normal, 10_000);
+```
+
+```{julia}
+train_cut.VIFullPredictions = unstandardize(
+    prediction(z, sym2range, train), train_unstandardized.MPG
+)
+test_cut.VIFullPredictions = unstandardize(
+    prediction(z, sym2range, test), train_unstandardized.MPG
+);
+```
+
+```{julia}
+vi_loss1 = mean((train_cut.VIPredictions - train_cut.MPG) .^ 2)
+vifull_loss1 = mean((train_cut.VIFullPredictions - train_cut.MPG) .^ 2)
+bayes_loss1 = mean((train_cut.BayesPredictions - train_cut.MPG) .^ 2)
+ols_loss1 = mean((train_cut.OLSPrediction - train_cut.MPG) .^ 2)
+
+vi_loss2 = mean((test_cut.VIPredictions - test_cut.MPG) .^ 2)
+vifull_loss2 = mean((test_cut.VIFullPredictions - test_cut.MPG) .^ 2)
+bayes_loss2 = mean((test_cut.BayesPredictions - test_cut.MPG) .^ 2)
+ols_loss2 = mean((test_cut.OLSPrediction - test_cut.MPG) .^ 2)
+
+println("Training set:
+    VI loss: $vi_loss1
+    VI (full) loss: $vifull_loss1
+    Bayes loss: $bayes_loss1
+    OLS loss: $ols_loss1
+Test set:
+    VI loss: $vi_loss2
+    VI (full) loss: $vifull_loss2
+    Bayes loss: $bayes_loss2
+    OLS loss: $ols_loss2")
+```
+
+```{julia}
+z = rand(q_mf_normal, 1000);
+preds = mapreduce(hcat, eachcol(z)) do zi
+    return unstandardize(prediction(zi, sym2range, test), train_unstandardized.MPG)
+end
+
+p1 = scatter(
+    1:size(test, 1),
+    mean(preds; dims=2);
+    yerr=std(preds; dims=2),
+    label="prediction (mean ± std)",
+    size=(900, 500),
+    markersize=8,
+)
+scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true")
+xaxis!(1:size(test, 1))
+ylims!(10, 40)
+title!("Mean-field ADVI (Normal)")
+```
+
+```{julia}
+z = rand(q_full_normal, 1000);
+preds = mapreduce(hcat, eachcol(z)) do zi
+    return unstandardize(prediction(zi, sym2range, test), train_unstandardized.MPG)
+end
+
+p2 = scatter(
+    1:size(test, 1),
+    mean(preds; dims=2);
+    yerr=std(preds; dims=2),
+    label="prediction (mean ± std)",
+    size=(900, 500),
+    markersize=8,
+)
+scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true")
+xaxis!(1:size(test, 1))
+ylims!(10, 40)
+title!("Full ADVI (Normal)")
+```
+
+```{julia}
+preds = mapreduce(hcat, 1:5:size(chain, 1)) do i
+    return unstandardize(prediction_chain(chain[i], test), train_unstandardized.MPG)
+end
+
+p3 = scatter(
+    1:size(test, 1),
+    mean(preds; dims=2);
+    yerr=std(preds; dims=2),
+    label="prediction (mean ± std)",
+    size=(900, 500),
+    markersize=8,
+)
+scatter!(1:size(test, 1), unstandardize(test_label, train_unstandardized.MPG); label="true")
+xaxis!(1:size(test, 1))
+ylims!(10, 40)
+title!("MCMC (NUTS)")
+```
+
+```{julia}
+plot(p1, p2, p3; layout=(1, 3), size=(900, 250), label="")
+```
+
+Here we actually see that indeed both the full ADVI and the MCMC approaches do a much better job of quantifying the uncertainty of predictions for never-before-seen samples, with full ADVI seemingly *underestimating* the variance slightly compared to MCMC.
+
+So now you know how to perform VI on your Turing.jl model! Great, isn't it?
diff --git a/tutorials/10-bayesian-differential-equations/index.qmd b/tutorials/10-bayesian-differential-equations/index.qmd
index b7ff4ae0f..a43bc3e8f 100755
--- a/tutorials/10-bayesian-differential-equations/index.qmd
+++ b/tutorials/10-bayesian-differential-equations/index.qmd
@@ -1,365 +1,365 @@
----
-title: Bayesian Estimation of Differential Equations
-engine: julia
----
-
-```{julia}
-#| echo: false
-#| output: false
-using Pkg;
-Pkg.instantiate();
-```
-
-Most of the scientific community deals with the basic problem of trying to mathematically model the reality around them and this often involves dynamical systems. The general trend to model these complex dynamical systems is through the use of differential equations.
-Differential equation models often have non-measurable parameters.
-The popular “forward-problem” of simulation consists of solving the differential equations for a given set of parameters, the “inverse problem” to simulation, known as parameter estimation, is the process of utilizing data to determine these model parameters.
-Bayesian inference provides a robust approach to parameter estimation with quantified uncertainty. - -```{julia} -using Turing -using DifferentialEquations - -# Load StatsPlots for visualizations and diagnostics. -using StatsPlots - -using LinearAlgebra - -# Set a seed for reproducibility. -using Random -Random.seed!(14); -``` - -## The Lotka-Volterra Model - -The Lotka–Volterra equations, also known as the predator–prey equations, are a pair of first-order nonlinear differential equations. -These differential equations are frequently used to describe the dynamics of biological systems in which two species interact, one as a predator and the other as prey. -The populations change through time according to the pair of equations - -$$ -\begin{aligned} -\frac{\mathrm{d}x}{\mathrm{d}t} &= (\alpha - \beta y(t))x(t), \\ -\frac{\mathrm{d}y}{\mathrm{d}t} &= (\delta x(t) - \gamma)y(t) -\end{aligned} -$$ - -where $x(t)$ and $y(t)$ denote the populations of prey and predator at time $t$, respectively, and $\alpha, \beta, \gamma, \delta$ are positive parameters. - -We implement the Lotka-Volterra model and simulate it with parameters $\alpha = 1.5$, $\beta = 1$, $\gamma = 3$, and $\delta = 1$ and initial conditions $x(0) = y(0) = 1$. - -```{julia} -# Define Lotka-Volterra model. -function lotka_volterra(du, u, p, t) - # Model parameters. - α, β, γ, δ = p - # Current state. - x, y = u - - # Evaluate differential equations. - du[1] = (α - β * y) * x # prey - du[2] = (δ * x - γ) * y # predator - - return nothing -end - -# Define initial-value problem. -u0 = [1.0, 1.0] -p = [1.5, 1.0, 3.0, 1.0] -tspan = (0.0, 10.0) -prob = ODEProblem(lotka_volterra, u0, tspan, p) - -# Plot simulation. -plot(solve(prob, Tsit5())) -``` - -We generate noisy observations to use for the parameter estimation tasks in this tutorial. -With the [`saveat` argument](https://docs.sciml.ai/latest/basics/common_solver_opts/) we specify that the solution is stored only at `0.1` time units. 
-To make the example more realistic we add random normally distributed noise to the simulation. - -```{julia} -sol = solve(prob, Tsit5(); saveat=0.1) -odedata = Array(sol) + 0.8 * randn(size(Array(sol))) - -# Plot simulation and noisy observations. -plot(sol; alpha=0.3) -scatter!(sol.t, odedata'; color=[1 2], label="") -``` - -Alternatively, we can use real-world data from Hudson’s Bay Company records (an Stan implementation with slightly different priors can be found here: https://mc-stan.org/users/documentation/case-studies/lotka-volterra-predator-prey.html). - -## Direct Handling of Bayesian Estimation with Turing - -Previously, functions in Turing and DifferentialEquations were not inter-composable, so Bayesian inference of differential equations needed to be handled by another package called [DiffEqBayes.jl](https://github.com/SciML/DiffEqBayes.jl) (note that DiffEqBayes works also with CmdStan.jl, Turing.jl, DynamicHMC.jl and ApproxBayes.jl - see the [DiffEqBayes docs](https://docs.sciml.ai/latest/analysis/parameter_estimation/#Bayesian-Methods-1) for more info). - -Nowadays, however, Turing and DifferentialEquations are completely composable and we can just simulate differential equations inside a Turing `@model`. -Therefore, we write the Lotka-Volterra parameter estimation problem using the Turing `@model` macro as below: - -```{julia} -@model function fitlv(data, prob) - # Prior distributions. - σ ~ InverseGamma(2, 3) - α ~ truncated(Normal(1.5, 0.5); lower=0.5, upper=2.5) - β ~ truncated(Normal(1.2, 0.5); lower=0, upper=2) - γ ~ truncated(Normal(3.0, 0.5); lower=1, upper=4) - δ ~ truncated(Normal(1.0, 0.5); lower=0, upper=2) - - # Simulate Lotka-Volterra model. - p = [α, β, γ, δ] - predicted = solve(prob, Tsit5(); p=p, saveat=0.1) - - # Observations. 
- for i in 1:length(predicted) - data[:, i] ~ MvNormal(predicted[i], σ^2 * I) - end - - return nothing -end - -model = fitlv(odedata, prob) - -# Sample 3 independent chains with forward-mode automatic differentiation (the default). -chain = sample(model, NUTS(), MCMCSerial(), 1000, 3; progress=false) -``` - -The estimated parameters are close to the parameter values the observations were generated with. -We can also check visually that the chains have converged. - -```{julia} -plot(chain) -``` - -### Data retrodiction - -In Bayesian analysis it is often useful to retrodict the data, i.e. generate simulated data using samples from the posterior distribution, and compare to the original data (see for instance section 3.3.2 - model checking of McElreath's book "Statistical Rethinking"). -Here, we solve the ODE for 300 randomly picked posterior samples in the `chain`. -We plot the ensemble of solutions to check if the solution resembles the data. -The 300 retrodicted time courses from the posterior are plotted in gray, the noisy observations are shown as blue and red dots, and the green and purple lines are the ODE solution that was used to generate the data. - -```{julia} -plot(; legend=false) -posterior_samples = sample(chain[[:α, :β, :γ, :δ]], 300; replace=false) -for p in eachrow(Array(posterior_samples)) - sol_p = solve(prob, Tsit5(); p=p, saveat=0.1) - plot!(sol_p; alpha=0.1, color="#BBBBBB") -end - -# Plot simulation and noisy observations. -plot!(sol; color=[1 2], linewidth=1) -scatter!(sol.t, odedata'; color=[1 2]) -``` - -We can see that, even though we added quite a bit of noise to the data the posterior distribution reproduces quite accurately the "true" ODE solution. - -## Lotka-Volterra model without data of prey - -One can also perform parameter inference for a Lotka-Volterra model with incomplete data. -For instance, let us suppose we have only observations of the predators but not of the prey. 
-I.e., we fit the model only to the $y$ variable of the system without providing any data for $x$: - -```{julia} -@model function fitlv2(data::AbstractVector, prob) - # Prior distributions. - σ ~ InverseGamma(2, 3) - α ~ truncated(Normal(1.5, 0.5); lower=0.5, upper=2.5) - β ~ truncated(Normal(1.2, 0.5); lower=0, upper=2) - γ ~ truncated(Normal(3.0, 0.5); lower=1, upper=4) - δ ~ truncated(Normal(1.0, 0.5); lower=0, upper=2) - - # Simulate Lotka-Volterra model but save only the second state of the system (predators). - p = [α, β, γ, δ] - predicted = solve(prob, Tsit5(); p=p, saveat=0.1, save_idxs=2) - - # Observations of the predators. - data ~ MvNormal(predicted.u, σ^2 * I) - - return nothing -end - -model2 = fitlv2(odedata[2, :], prob) - -# Sample 3 independent chains. -chain2 = sample(model2, NUTS(0.45), MCMCSerial(), 5000, 3; progress=false) -``` - -Again we inspect the trajectories of 300 randomly selected posterior samples. - -```{julia} -plot(; legend=false) -posterior_samples = sample(chain2[[:α, :β, :γ, :δ]], 300; replace=false) -for p in eachrow(Array(posterior_samples)) - sol_p = solve(prob, Tsit5(); p=p, saveat=0.1) - plot!(sol_p; alpha=0.1, color="#BBBBBB") -end - -# Plot simulation and noisy observations. -plot!(sol; color=[1 2], linewidth=1) -scatter!(sol.t, odedata'; color=[1 2]) -``` - -Note that here the observations of the prey (blue dots) were not used in the parameter estimation! -Yet, the model can predict the values of $x$ relatively accurately, albeit with a wider distribution of solutions, reflecting the greater uncertainty in the prediction of the $x$ values. - -## Inference of Delay Differential Equations - -Here we show an example of inference with another type of differential equation: a Delay Differential Equation (DDE). -DDEs are differential equations where derivatives are function of values at an earlier point in time. -This is useful to model a delayed effect, like incubation time of a virus for instance. 
- -Here is a delayed version of the Lokta-Voltera system: - -$$ -\begin{aligned} -\frac{\mathrm{d}x}{\mathrm{d}t} &= \alpha x(t-\tau) - \beta y(t) x(t),\\ -\frac{\mathrm{d}y}{\mathrm{d}t} &= - \gamma y(t) + \delta x(t) y(t), -\end{aligned} -$$ - -where $\tau$ is a (positive) delay and $x(t-\tau)$ is the variable $x$ at an earlier time point $t - \tau$. - -The initial-value problem of the delayed system can be implemented as a [`DDEProblem`](https://diffeq.sciml.ai/stable/tutorials/dde_example/). -As described in the [DDE example](https://diffeq.sciml.ai/stable/tutorials/dde_example/), here the function `h` is the history function that can be used to obtain a state at an earlier time point. -Again we use parameters $\alpha = 1.5$, $\beta = 1$, $\gamma = 3$, and $\delta = 1$ and initial conditions $x(0) = y(0) = 1$. -Moreover, we assume $x(t) = 1$ for $t < 0$. - -```{julia} -function delay_lotka_volterra(du, u, h, p, t) - # Model parameters. - α, β, γ, δ = p - - # Current state. - x, y = u - # Evaluate differential equations - du[1] = α * h(p, t - 1; idxs=1) - β * x * y - du[2] = -γ * y + δ * x * y - - return nothing -end - -# Define initial-value problem. -p = (1.5, 1.0, 3.0, 1.0) -u0 = [1.0; 1.0] -tspan = (0.0, 10.0) -h(p, t; idxs::Int) = 1.0 -prob_dde = DDEProblem(delay_lotka_volterra, u0, h, tspan, p); -``` - -We generate observations by adding normally distributed noise to the results of our simulations. - -```{julia} -sol_dde = solve(prob_dde; saveat=0.1) -ddedata = Array(sol_dde) + 0.5 * randn(size(sol_dde)) - -# Plot simulation and noisy observations. -plot(sol_dde) -scatter!(sol_dde.t, ddedata'; color=[1 2], label="") -``` - -Now we define the Turing model for the Lotka-Volterra model with delay and sample 3 independent chains. - -```{julia} -@model function fitlv_dde(data, prob) - # Prior distributions. 
- σ ~ InverseGamma(2, 3) - α ~ truncated(Normal(1.5, 0.5); lower=0.5, upper=2.5) - β ~ truncated(Normal(1.2, 0.5); lower=0, upper=2) - γ ~ truncated(Normal(3.0, 0.5); lower=1, upper=4) - δ ~ truncated(Normal(1.0, 0.5); lower=0, upper=2) - - # Simulate Lotka-Volterra model. - p = [α, β, γ, δ] - predicted = solve(prob, MethodOfSteps(Tsit5()); p=p, saveat=0.1) - - # Observations. - for i in 1:length(predicted) - data[:, i] ~ MvNormal(predicted[i], σ^2 * I) - end -end - -model_dde = fitlv_dde(ddedata, prob_dde) - -# Sample 3 independent chains. -chain_dde = sample(model_dde, NUTS(), MCMCSerial(), 300, 3; progress=false) -``` - -```{julia} -plot(chain_dde) -``` - -Finally, plot trajectories of 300 randomly selected samples from the posterior. -Again, the dots indicate our observations, the colored lines are the "true" simulations without noise, and the gray lines are trajectories from the posterior samples. - -```{julia} -plot(; legend=false) -posterior_samples = sample(chain_dde[[:α, :β, :γ, :δ]], 300; replace=false) -for p in eachrow(Array(posterior_samples)) - sol_p = solve(prob_dde, MethodOfSteps(Tsit5()); p=p, saveat=0.1) - plot!(sol_p; alpha=0.1, color="#BBBBBB") -end - -# Plot simulation and noisy observations. -plot!(sol_dde; color=[1 2], linewidth=1) -scatter!(sol_dde.t, ddedata'; color=[1 2]) -``` - -The fit is pretty good even though the data was quite noisy to start. - -## Scaling to Large Models: Adjoint Sensitivities - -DifferentialEquations.jl's efficiency for large stiff models has been shown in multiple [benchmarks](https://github.com/SciML/DiffEqBenchmarks.jl). -To learn more about how to optimize solving performance for stiff problems you can take a look at the [docs](https://docs.sciml.ai/latest/tutorials/advanced_ode_example/). - -[Sensitivity analysis](https://docs.sciml.ai/latest/analysis/sensitivity/), or automatic differentiation (AD) of the solver, is provided by the DiffEq suite. 
-The model sensitivities are the derivatives of the solution with respect to the parameters. -Specifically, the local sensitivity of the solution to a parameter is defined by how much the solution would change by changes in the parameter. -Sensitivity analysis provides a cheap way to calculate the gradient of the solution which can be used in parameter estimation and other optimization tasks. - -The AD ecosystem in Julia allows you to switch between forward mode, reverse mode, source to source and other choices of AD and have it work with any Julia code. -For a user to make use of this within [SciML](https://sciml.ai), [high level interactions in `solve`](https://sensitivity.sciml.ai/dev/ad_examples/differentiating_ode/) automatically plug into those AD systems to allow for choosing advanced sensitivity analysis (derivative calculation) [methods](https://sensitivity.sciml.ai/dev/manual/differential_equation_sensitivities/). - -More theoretical details on these methods can be found at: https://docs.sciml.ai/latest/extras/sensitivity_math/. - -While these sensitivity analysis methods may seem complicated, using them is dead simple. -Here is a version of the Lotka-Volterra model using adjoint sensitivities. - -All we have to do is switch the AD backend to one of the adjoint-compatible backends (ReverseDiff or Zygote)! -Notice that on this model adjoints are slower. -This is because adjoints have a higher overhead on small parameter models and therefore we suggest using these methods only for models with around 100 parameters or more. -For more details, see https://arxiv.org/abs/1812.01892. - -```{julia} -using Zygote, SciMLSensitivity - -# Sample a single chain with 1000 samples using Zygote. -sample(model, NUTS(;adtype=AutoZygote()), 1000; progress=false) -``` - -If desired, we can control the sensitivity analysis method that is used by providing the `sensealg` keyword argument to `solve`. 
-Here we will not choose a `sensealg` and let it use the default choice: - -```{julia} -@model function fitlv_sensealg(data, prob) - # Prior distributions. - σ ~ InverseGamma(2, 3) - α ~ truncated(Normal(1.5, 0.5); lower=0.5, upper=2.5) - β ~ truncated(Normal(1.2, 0.5); lower=0, upper=2) - γ ~ truncated(Normal(3.0, 0.5); lower=1, upper=4) - δ ~ truncated(Normal(1.0, 0.5); lower=0, upper=2) - - # Simulate Lotka-Volterra model and use a specific algorithm for computing sensitivities. - p = [α, β, γ, δ] - predicted = solve(prob; p=p, saveat=0.1) - - # Observations. - for i in 1:length(predicted) - data[:, i] ~ MvNormal(predicted[i], σ^2 * I) - end - - return nothing -end; - -model_sensealg = fitlv_sensealg(odedata, prob) - -# Sample a single chain with 1000 samples using Zygote. -sample(model_sensealg, NUTS(;adtype=AutoZygote()), 1000; progress=false) -``` - -For more examples of adjoint usage on large parameter models, consult the [DiffEqFlux documentation](https://diffeqflux.sciml.ai/dev/). +--- +title: Bayesian Estimation of Differential Equations +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +Most of the scientific community deals with the basic problem of trying to mathematically model the reality around them and this often involves dynamical systems. The general trend to model these complex dynamical systems is through the use of differential equations. +Differential equation models often have non-measurable parameters. +The popular “forward-problem” of simulation consists of solving the differential equations for a given set of parameters, the “inverse problem” to simulation, known as parameter estimation, is the process of utilizing data to determine these model parameters. +Bayesian inference provides a robust approach to parameter estimation with quantified uncertainty. + +```{julia} +using Turing +using DifferentialEquations + +# Load StatsPlots for visualizations and diagnostics. 
+using StatsPlots + +using LinearAlgebra + +# Set a seed for reproducibility. +using Random +Random.seed!(14); +``` + +## The Lotka-Volterra Model + +The Lotka–Volterra equations, also known as the predator–prey equations, are a pair of first-order nonlinear differential equations. +These differential equations are frequently used to describe the dynamics of biological systems in which two species interact, one as a predator and the other as prey. +The populations change through time according to the pair of equations + +$$ +\begin{aligned} +\frac{\mathrm{d}x}{\mathrm{d}t} &= (\alpha - \beta y(t))x(t), \\ +\frac{\mathrm{d}y}{\mathrm{d}t} &= (\delta x(t) - \gamma)y(t) +\end{aligned} +$$ + +where $x(t)$ and $y(t)$ denote the populations of prey and predator at time $t$, respectively, and $\alpha, \beta, \gamma, \delta$ are positive parameters. + +We implement the Lotka-Volterra model and simulate it with parameters $\alpha = 1.5$, $\beta = 1$, $\gamma = 3$, and $\delta = 1$ and initial conditions $x(0) = y(0) = 1$. + +```{julia} +# Define Lotka-Volterra model. +function lotka_volterra(du, u, p, t) + # Model parameters. + α, β, γ, δ = p + # Current state. + x, y = u + + # Evaluate differential equations. + du[1] = (α - β * y) * x # prey + du[2] = (δ * x - γ) * y # predator + + return nothing +end + +# Define initial-value problem. +u0 = [1.0, 1.0] +p = [1.5, 1.0, 3.0, 1.0] +tspan = (0.0, 10.0) +prob = ODEProblem(lotka_volterra, u0, tspan, p) + +# Plot simulation. +plot(solve(prob, Tsit5())) +``` + +We generate noisy observations to use for the parameter estimation tasks in this tutorial. +With the [`saveat` argument](https://docs.sciml.ai/latest/basics/common_solver_opts/) we specify that the solution is stored only at `0.1` time units. +To make the example more realistic we add random normally distributed noise to the simulation. 
+
+```{julia}
+sol = solve(prob, Tsit5(); saveat=0.1)
+odedata = Array(sol) + 0.8 * randn(size(Array(sol)))
+
+# Plot simulation and noisy observations.
+plot(sol; alpha=0.3)
+scatter!(sol.t, odedata'; color=[1 2], label="")
+```
+
+Alternatively, we can use real-world data from Hudson’s Bay Company records (a Stan implementation with slightly different priors can be found [here](https://mc-stan.org/users/documentation/case-studies/lotka-volterra-predator-prey.html)).
+
+## Direct Handling of Bayesian Estimation with Turing
+
+Previously, functions in Turing and DifferentialEquations were not inter-composable, so Bayesian inference of differential equations needed to be handled by another package called [DiffEqBayes.jl](https://github.com/SciML/DiffEqBayes.jl) (note that DiffEqBayes also works with CmdStan.jl, Turing.jl, DynamicHMC.jl and ApproxBayes.jl - see the [DiffEqBayes docs](https://docs.sciml.ai/latest/analysis/parameter_estimation/#Bayesian-Methods-1) for more info).
+
+Nowadays, however, Turing and DifferentialEquations are completely composable and we can just simulate differential equations inside a Turing `@model`.
+Therefore, we write the Lotka-Volterra parameter estimation problem using the Turing `@model` macro as below:
+
+```{julia}
+@model function fitlv(data, prob)
+    # Prior distributions.
+    σ ~ InverseGamma(2, 3)
+    α ~ truncated(Normal(1.5, 0.5); lower=0.5, upper=2.5)
+    β ~ truncated(Normal(1.2, 0.5); lower=0, upper=2)
+    γ ~ truncated(Normal(3.0, 0.5); lower=1, upper=4)
+    δ ~ truncated(Normal(1.0, 0.5); lower=0, upper=2)
+
+    # Simulate Lotka-Volterra model.
+    p = [α, β, γ, δ]
+    predicted = solve(prob, Tsit5(); p=p, saveat=0.1)
+
+    # Observations.
+    for i in 1:length(predicted)
+        data[:, i] ~ MvNormal(predicted[i], σ^2 * I)
+    end
+
+    return nothing
+end
+
+model = fitlv(odedata, prob)
+
+# Sample 3 independent chains with forward-mode automatic differentiation (the default).
+chain = sample(model, NUTS(), MCMCSerial(), 1000, 3; progress=false)
+```
+
+The estimated parameters are close to the parameter values the observations were generated with.
+We can also check visually that the chains have converged.
+
+```{julia}
+plot(chain)
+```
+
+### Data retrodiction
+
+In Bayesian analysis it is often useful to retrodict the data, i.e. generate simulated data using samples from the posterior distribution, and compare to the original data (see, for instance, Section 3.3.2 on model checking in McElreath's book "Statistical Rethinking").
+Here, we solve the ODE for 300 randomly picked posterior samples in the `chain`.
+We plot the ensemble of solutions to check whether these solutions resemble the data.
+The 300 retrodicted time courses from the posterior are plotted in gray, the noisy observations are shown as blue and red dots, and the green and purple lines are the ODE solution that was used to generate the data.
+
+```{julia}
+plot(; legend=false)
+posterior_samples = sample(chain[[:α, :β, :γ, :δ]], 300; replace=false)
+for p in eachrow(Array(posterior_samples))
+    sol_p = solve(prob, Tsit5(); p=p, saveat=0.1)
+    plot!(sol_p; alpha=0.1, color="#BBBBBB")
+end
+
+# Plot simulation and noisy observations.
+plot!(sol; color=[1 2], linewidth=1)
+scatter!(sol.t, odedata'; color=[1 2])
+```
+
+We can see that, even though we added quite a bit of noise to the data, the posterior distribution reproduces the "true" ODE solution quite accurately.
+
+## Lotka-Volterra model without data of prey
+
+One can also perform parameter inference for a Lotka-Volterra model with incomplete data.
+For instance, let us suppose we have only observations of the predators but not of the prey.
+I.e., we fit the model only to the $y$ variable of the system without providing any data for $x$:
+
+```{julia}
+@model function fitlv2(data::AbstractVector, prob)
+    # Prior distributions.
+    σ ~ InverseGamma(2, 3)
+    α ~ truncated(Normal(1.5, 0.5); lower=0.5, upper=2.5)
+    β ~ truncated(Normal(1.2, 0.5); lower=0, upper=2)
+    γ ~ truncated(Normal(3.0, 0.5); lower=1, upper=4)
+    δ ~ truncated(Normal(1.0, 0.5); lower=0, upper=2)
+
+    # Simulate Lotka-Volterra model but save only the second state of the system (predators).
+    p = [α, β, γ, δ]
+    predicted = solve(prob, Tsit5(); p=p, saveat=0.1, save_idxs=2)
+
+    # Observations of the predators.
+    data ~ MvNormal(predicted.u, σ^2 * I)
+
+    return nothing
+end
+
+model2 = fitlv2(odedata[2, :], prob)
+
+# Sample 3 independent chains.
+chain2 = sample(model2, NUTS(0.45), MCMCSerial(), 5000, 3; progress=false)
+```
+
+Again we inspect the trajectories of 300 randomly selected posterior samples.
+
+```{julia}
+plot(; legend=false)
+posterior_samples = sample(chain2[[:α, :β, :γ, :δ]], 300; replace=false)
+for p in eachrow(Array(posterior_samples))
+    sol_p = solve(prob, Tsit5(); p=p, saveat=0.1)
+    plot!(sol_p; alpha=0.1, color="#BBBBBB")
+end
+
+# Plot simulation and noisy observations.
+plot!(sol; color=[1 2], linewidth=1)
+scatter!(sol.t, odedata'; color=[1 2])
+```
+
+Note that here the observations of the prey (blue dots) were not used in the parameter estimation!
+Yet, the model can predict the values of $x$ relatively accurately, albeit with a wider distribution of solutions, reflecting the greater uncertainty in the prediction of the $x$ values.
+
+## Inference of Delay Differential Equations
+
+Here we show an example of inference with another type of differential equation: a Delay Differential Equation (DDE).
+DDEs are differential equations where the derivatives are functions of values at an earlier point in time.
+This is useful to model a delayed effect, such as the incubation time of a virus.
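Before handing this to a full-featured solver, the core mechanic of a DDE can be seen in a tiny hand-rolled sketch (not part of the tutorial's model; the scalar equation and all names below are made up for illustration): forward-Euler integration of $x'(t) = -x(t-1)$ with history $x(t) = 1$ for $t \le 0$, where the delayed value is looked up from already-computed states.

```julia
# Minimal sketch (no packages): forward-Euler integration of the scalar DDE
#     x'(t) = -x(t - 1),  with history x(t) = 1 for t ≤ 0.
# The derivative at time t needs the state at the earlier time t - 1, which is
# exactly what the history function provides in a proper DDE solver.
dt = 0.01
nsteps = 200                  # integrate up to t = 2
xs = ones(nsteps + 1)         # xs[1] = x(0) = 1
for i in 1:nsteps
    t = (i - 1) * dt
    tdel = t - 1.0
    # Delayed state: from the history for t - 1 ≤ 0, otherwise from stored values.
    xdel = tdel <= 0 ? 1.0 : xs[round(Int, tdel / dt) + 1]
    xs[i + 1] = xs[i] + dt * (-xdel)
end
xs[end]                       # ≈ x(2); the exact piecewise solution gives x(2) = -1/2
```

With DifferentialEquations.jl, `MethodOfSteps(Tsit5())` automates exactly this delayed-state bookkeeping (to much higher accuracy), with the history function `h` playing the role of the hard-coded history above.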
+
+Here is a delayed version of the Lotka-Volterra system:
+
+$$
+\begin{aligned}
+\frac{\mathrm{d}x}{\mathrm{d}t} &= \alpha x(t-\tau) - \beta y(t) x(t),\\
+\frac{\mathrm{d}y}{\mathrm{d}t} &= - \gamma y(t) + \delta x(t) y(t),
+\end{aligned}
+$$
+
+where $\tau$ is a (positive) delay and $x(t-\tau)$ is the variable $x$ at an earlier time point $t - \tau$.
+
+The initial-value problem of the delayed system can be implemented as a [`DDEProblem`](https://diffeq.sciml.ai/stable/tutorials/dde_example/).
+As described in the [DDE example](https://diffeq.sciml.ai/stable/tutorials/dde_example/), here the function `h` is the history function that can be used to obtain a state at an earlier time point.
+Again we use parameters $\alpha = 1.5$, $\beta = 1$, $\gamma = 3$, and $\delta = 1$ and initial conditions $x(0) = y(0) = 1$.
+Moreover, we assume $x(t) = 1$ for $t < 0$, and fix the delay to $\tau = 1$ in the implementation below.
+
+```{julia}
+function delay_lotka_volterra(du, u, h, p, t)
+    # Model parameters.
+    α, β, γ, δ = p
+
+    # Current state.
+    x, y = u
+    # Evaluate differential equations (with delay τ = 1 for the prey).
+    du[1] = α * h(p, t - 1; idxs=1) - β * x * y
+    du[2] = -γ * y + δ * x * y
+
+    return nothing
+end
+
+# Define initial-value problem.
+p = (1.5, 1.0, 3.0, 1.0)
+u0 = [1.0; 1.0]
+tspan = (0.0, 10.0)
+h(p, t; idxs::Int) = 1.0
+prob_dde = DDEProblem(delay_lotka_volterra, u0, h, tspan, p);
+```
+
+We generate observations by adding normally distributed noise to the results of our simulations.
+
+```{julia}
+sol_dde = solve(prob_dde; saveat=0.1)
+ddedata = Array(sol_dde) + 0.5 * randn(size(sol_dde))
+
+# Plot simulation and noisy observations.
+plot(sol_dde)
+scatter!(sol_dde.t, ddedata'; color=[1 2], label="")
+```
+
+Now we define the Turing model for the Lotka-Volterra model with delay and sample 3 independent chains.
+
+```{julia}
+@model function fitlv_dde(data, prob)
+    # Prior distributions.
+ σ ~ InverseGamma(2, 3) + α ~ truncated(Normal(1.5, 0.5); lower=0.5, upper=2.5) + β ~ truncated(Normal(1.2, 0.5); lower=0, upper=2) + γ ~ truncated(Normal(3.0, 0.5); lower=1, upper=4) + δ ~ truncated(Normal(1.0, 0.5); lower=0, upper=2) + + # Simulate Lotka-Volterra model. + p = [α, β, γ, δ] + predicted = solve(prob, MethodOfSteps(Tsit5()); p=p, saveat=0.1) + + # Observations. + for i in 1:length(predicted) + data[:, i] ~ MvNormal(predicted[i], σ^2 * I) + end +end + +model_dde = fitlv_dde(ddedata, prob_dde) + +# Sample 3 independent chains. +chain_dde = sample(model_dde, NUTS(), MCMCSerial(), 300, 3; progress=false) +``` + +```{julia} +plot(chain_dde) +``` + +Finally, plot trajectories of 300 randomly selected samples from the posterior. +Again, the dots indicate our observations, the colored lines are the "true" simulations without noise, and the gray lines are trajectories from the posterior samples. + +```{julia} +plot(; legend=false) +posterior_samples = sample(chain_dde[[:α, :β, :γ, :δ]], 300; replace=false) +for p in eachrow(Array(posterior_samples)) + sol_p = solve(prob_dde, MethodOfSteps(Tsit5()); p=p, saveat=0.1) + plot!(sol_p; alpha=0.1, color="#BBBBBB") +end + +# Plot simulation and noisy observations. +plot!(sol_dde; color=[1 2], linewidth=1) +scatter!(sol_dde.t, ddedata'; color=[1 2]) +``` + +The fit is pretty good even though the data was quite noisy to start. + +## Scaling to Large Models: Adjoint Sensitivities + +DifferentialEquations.jl's efficiency for large stiff models has been shown in multiple [benchmarks](https://github.com/SciML/DiffEqBenchmarks.jl). +To learn more about how to optimize solving performance for stiff problems you can take a look at the [docs](https://docs.sciml.ai/latest/tutorials/advanced_ode_example/). + +[Sensitivity analysis](https://docs.sciml.ai/latest/analysis/sensitivity/), or automatic differentiation (AD) of the solver, is provided by the DiffEq suite. 
+The model sensitivities are the derivatives of the solution with respect to the parameters. +Specifically, the local sensitivity of the solution to a parameter is defined by how much the solution would change by changes in the parameter. +Sensitivity analysis provides a cheap way to calculate the gradient of the solution which can be used in parameter estimation and other optimization tasks. + +The AD ecosystem in Julia allows you to switch between forward mode, reverse mode, source to source and other choices of AD and have it work with any Julia code. +For a user to make use of this within [SciML](https://sciml.ai), [high level interactions in `solve`](https://sensitivity.sciml.ai/dev/ad_examples/differentiating_ode/) automatically plug into those AD systems to allow for choosing advanced sensitivity analysis (derivative calculation) [methods](https://sensitivity.sciml.ai/dev/manual/differential_equation_sensitivities/). + +More theoretical details on these methods can be found at: https://docs.sciml.ai/latest/extras/sensitivity_math/. + +While these sensitivity analysis methods may seem complicated, using them is dead simple. +Here is a version of the Lotka-Volterra model using adjoint sensitivities. + +All we have to do is switch the AD backend to one of the adjoint-compatible backends (ReverseDiff or Zygote)! +Notice that on this model adjoints are slower. +This is because adjoints have a higher overhead on small parameter models and therefore we suggest using these methods only for models with around 100 parameters or more. +For more details, see https://arxiv.org/abs/1812.01892. + +```{julia} +using Zygote, SciMLSensitivity + +# Sample a single chain with 1000 samples using Zygote. +sample(model, NUTS(;adtype=AutoZygote()), 1000; progress=false) +``` + +If desired, we can control the sensitivity analysis method that is used by providing the `sensealg` keyword argument to `solve`. 
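For illustration only (this fragment is not executed in the tutorial, and it reuses `prob` and `p` from the model above), such a choice could look like the following; `InterpolatingAdjoint` and `ZygoteVJP` are provided by SciMLSensitivity, and its documentation lists the currently recommended options:

```julia
using SciMLSensitivity

# Hypothetical variant of the solve call inside the model, requesting an
# interpolating adjoint with Zygote-based vector-Jacobian products:
predicted = solve(prob, Tsit5(); p=p, saveat=0.1,
                  sensealg=InterpolatingAdjoint(; autojacvec=ZygoteVJP()))
```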
+Here we will not choose a `sensealg` and let it use the default choice: + +```{julia} +@model function fitlv_sensealg(data, prob) + # Prior distributions. + σ ~ InverseGamma(2, 3) + α ~ truncated(Normal(1.5, 0.5); lower=0.5, upper=2.5) + β ~ truncated(Normal(1.2, 0.5); lower=0, upper=2) + γ ~ truncated(Normal(3.0, 0.5); lower=1, upper=4) + δ ~ truncated(Normal(1.0, 0.5); lower=0, upper=2) + + # Simulate Lotka-Volterra model and use a specific algorithm for computing sensitivities. + p = [α, β, γ, δ] + predicted = solve(prob; p=p, saveat=0.1) + + # Observations. + for i in 1:length(predicted) + data[:, i] ~ MvNormal(predicted[i], σ^2 * I) + end + + return nothing +end; + +model_sensealg = fitlv_sensealg(odedata, prob) + +# Sample a single chain with 1000 samples using Zygote. +sample(model_sensealg, NUTS(;adtype=AutoZygote()), 1000; progress=false) +``` + +For more examples of adjoint usage on large parameter models, consult the [DiffEqFlux documentation](https://diffeqflux.sciml.ai/dev/). diff --git a/tutorials/11-probabilistic-pca/index.qmd b/tutorials/11-probabilistic-pca/index.qmd index cb25bc93c..83c4923cb 100755 --- a/tutorials/11-probabilistic-pca/index.qmd +++ b/tutorials/11-probabilistic-pca/index.qmd @@ -1,383 +1,383 @@ ---- -title: Probabilistic Principal Component Analysis (p-PCA) -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -## Overview of PCA - -Principal component analysis (PCA) is a fundamental technique to analyse and visualise data. -It is an unsupervised learning method mainly used for dimensionality reduction. - -For example, we have a data matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$, and we would like to extract $k \ll D$ principal components which captures most of the information from the original matrix. -The goal is to understand $\mathbf{X}$ through a lower dimensional subspace (e.g. two-dimensional subspace for visualisation convenience) spanned by the principal components. 
- -In order to project the original data matrix into low dimensions, we need to find the principal directions where most of the variations of $\mathbf{X}$ lie in. -Traditionally, this is implemented via [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) which provides a robust and accurate computational framework for decomposing matrix into products of rotation-scaling-rotation matrices, particularly for large datasets(see an illustration [here](https://intoli.com/blog/pca-and-svd/)): - -$$ -\mathbf{X}_{N \times D} = \mathbf{U}_{N \times r} \times \boldsymbol{\Sigma}_{r \times r} \times \mathbf{V}^T_{r \times D} -$$ - -where $\Sigma_{r \times r}$ contains only $r := \operatorname{rank} \mathbf{X} \leq \min\{N,D\}$ non-zero singular values of $\mathbf{X}$. -If we pad $\Sigma$ with zeros and add arbitrary orthonormal columns to $\mathbf{U}$ and $\mathbf{V}$, we obtain the more compact form:[^1] - -$$ -\mathbf{X}_{N \times D} = \mathbf{U}_{N \times N} \mathbf{\Sigma}_{N \times D} \mathbf{V}_{D \times D}^T -$$ - -where $\mathbf{U}$ and $\mathbf{V}$ are unitary matrices (i.e. with orthonormal columns). -Such a decomposition always exists for any matrix. -Columns of $\mathbf{V}$ are the principal directions/axes. -The percentage of variations explained can be calculated using the ratios of singular values.[^3] - -Here we take a probabilistic perspective. -For more details and a mathematical derivation, we recommend Bishop's textbook (Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006). -The idea of proabilistic PCA is to find a latent variable $z$ that can be used to describe the hidden structure in a dataset.[^2] -Consider a data set $\mathbf{X}_{D \times N}=\{x_i\}$ with $i=1,2,...,N$ data points, where each data point $x_i$ is $D$-dimensional (i.e. $x_i \in \mathcal{R}^D$). -Note that, here we use the flipped version of the data matrix. 
We aim to represent the original $n$ dimensional vector using a lower dimensional a latent variable $z_i \in \mathcal{R}^k$. - -We first assume that each latent variable $z_i$ is normally distributed: - -$$ -z_i \sim \mathcal{N}(0, I) -$$ - -and the corresponding data point is generated via projection: - -$$ -x_i | z_i \sim \mathcal{N}(\mathbf{W} z_i + \boldsymbol{μ}, \sigma^2 \mathbf{I}) -$$ - -where the projection matrix $\mathbf{W}_{D \times k}$ accommodates the principal axes. -The above formula expresses $x_i$ as a linear combination of the basis columns in the projection matrix `W`, where the combination coefficients sit in `z_i` (they are the coordinats of `x_i` in the new $k$-dimensional space.). -We can also express the above formula in matrix form: $\mathbf{X}_{D \times N} \approx \mathbf{W}_{D \times k} \mathbf{Z}_{k \times N}$. -We are interested in inferring $\mathbf{W}$, $μ$ and $\sigma$. - -Classical PCA is the specific case of probabilistic PCA when the covariance of the noise becomes infinitesimally small, i.e. $\sigma^2 \to 0$. -Probabilistic PCA generalizes classical PCA, this can be seen by marginalizing out the the latent variable.[^2] - -## The gene expression example - -In the first example, we illustrate: - - - how to specify the probabilistic model and - - how to perform inference on $\mathbf{W}$, $\boldsymbol{\mu}$ and $\sigma$ using MCMC. - -We use simulated gemnome data to demonstrate these. -The simulation is inspired by biological measurement of expression of genes in cells, and each cell is characterized by different gene features. -While the human genome is (mostly) identical between all the cells in the body, there exist interesting differences in gene expression in different human tissues and disease conditions. -One way to investigate certain diseases is to look at differences in gene expression in cells from patients and healthy controls (usually from the same tissue). 
- -Usually, we can assume that the changes in gene expression only affect a subset of all genes (and these can be linked to diseases in some way). -One of the challenges for this kind of data is to explore the underlying structure, e.g. to make the connection between a certain state (healthy/disease) and gene expression. -This becomes difficult when the dimensions is very large (up to 20000 genes across 1000s of cells). So in order to find structure in this data, it is useful to project the data into a lower dimensional space. - -Regardless of the biological background, the more abstract problem formulation is to project the data living in high-dimensional space onto a representation in lower-dimensional space where most of the variation is concentrated in the first few dimensions. -We use PCA to explore underlying structure or pattern which may not necessarily be obvious from looking at the raw data itself. - -#### Step 1: configuration of dependencies - -First, we load the dependencies used. - -```{julia} -using Turing -using ReverseDiff -using LinearAlgebra, FillArrays - -# Packages for visualization -using DataFrames, StatsPlots, Measures - -# Set a seed for reproducibility. -using Random -Random.seed!(1789); -``` - -All packages used in this tutorial are listed here. -You can install them via `using Pkg; Pkg.add("package_name")`. - - -::: {.callout-caution} -## Package usages: -We use `DataFrames` for instantiating matrices, `LinearAlgebra` and `FillArrays` to perform matrix operations; -`Turing` for model specification and MCMC sampling, `ReverseDiff` for setting the automatic differentiation backend when sampling. -`StatsPlots` for visualising the resutls. `, Measures` for setting plot margin units. -As all examples involve sampling, for reproducibility we set a fixed seed using the `Random` standard library. -::: - -#### Step 2: Data generation - -Here, we simulate the biological gene expression problem described earlier. 
-We simulate 60 cells, each cell has 9 gene features. -This is a simplified problem with only a few cells and genes for demonstration purpose, which is not comparable to the complexity in real-life (e.g. thousands of features for each individual). -Even so, spotting the structures or patterns in a 9-feature space would be a challenging task; it would be nice to reduce the dimentionality using p-PCA. - -By design, we mannually divide the 60 cells into two groups. the first 3 gene features of the first 30 cells have mean 10, while those of the last 30 cells have mean 10. -These two groups of cells differ in the expression of genes. - -```{julia} -n_genes = 9 # D -n_cells = 60 # N - -# create a diagonal block like expression matrix, with some non-informative genes; -# not all features/genes are informative, some might just not differ very much between cells) -mat_exp = randn(n_genes, n_cells) -mat_exp[1:(n_genes ÷ 3), 1:(n_cells ÷ 2)] .+= 10 -mat_exp[(2 * (n_genes ÷ 3) + 1):end, (n_cells ÷ 2 + 1):end] .+= 10 -``` - -To visualize the $(D=9) \times (N=60)$ data matrix `mat_exp`, we use the `heatmap` plot. - -```{julia} -heatmap( - mat_exp; - c=:summer, - colors=:value, - xlabel="cell number", - yflip=true, - ylabel="gene feature", - yticks=1:9, - colorbar_title="expression", -) -``` - -Note that: - - 1. We have made distinct feature differences between these two groups of cells (it is fairly obvious from looking at the raw data), in practice and with large enough data sets, it is often impossible to spot the differences from the raw data alone. - 2. If you have some patience and compute resources you can increase the size of the dataset, or play around with the noise levels to make the problem increasingly harder. - -#### Step 3: Create the pPCA model - -Here we construct the probabilistic model `pPCA()`. -As per the p-PCA formula, we think of each row (i.e. 
each gene feature) following a $N=60$ dimensional multivariate normal distribution centered around the corresponding row of $\mathbf{W}_{D \times k} \times \mathbf{Z}_{k \times N} + \boldsymbol{\mu}_{D \times N}$. - -```{julia} -@model function pPCA(X::AbstractMatrix{<:Real}, k::Int) - # retrieve the dimension of input matrix X. - N, D = size(X) - - # weights/loadings W - W ~ filldist(Normal(), D, k) - - # latent variable z - Z ~ filldist(Normal(), k, N) - - # mean offset - μ ~ MvNormal(Eye(D)) - genes_mean = W * Z .+ reshape(μ, n_genes, 1) - return X ~ arraydist([MvNormal(m, Eye(N)) for m in eachcol(genes_mean')]) -end; -``` - -The function `pPCA()` accepts: - - 1. an data array $\mathbf{X}$ (with no. of instances x dimension no. of features, NB: it is a transpose of the original data matrix); - 2. an integer $k$ which indicates the dimension of the latent space (the space the original feature matrix is projected onto). - -Specifically: - - 1. it first extracts the dimension $D$ and number of instances $N$ of the input matrix; - 2. draw samples of each entries of the projection matrix $\mathbf{W}$ from a standard normal; - 3. draw samples of the latent variable $\mathbf{Z}_{k \times N}$ from an MND; - 4. draw samples of the offset $\boldsymbol{\mu}$ from an MND, assuming uniform offset for all instances; - 5. Finally, we iterate through each gene dimension in $\mathbf{X}$, and define an MND for the sampling distribution (i.e. likelihood). - -#### Step 4: Sampling-based inference of the pPCA model - -Here we aim to perform MCMC sampling to infer the projection matrix $\mathbf{W}_{D \times k}$, the latent variable matrix $\mathbf{Z}_{k \times N}$, and the offsets $\boldsymbol{\mu}_{N \times 1}$. - -We run the inference using the NUTS sampler, of which the chain length is set to be 500, target accept ratio 0.65 and initial stepsize 0.1. By default, the NUTS sampler samples 1 chain. 
-You are free to try [different samplers](https://turinglang.org/stable/docs/library/#samplers). - -```{julia} -#| output: false -setprogress!(false) -``` - -```{julia} -k = 2 # k is the dimension of the projected space, i.e. the number of principal components/axes of choice -ppca = pPCA(mat_exp', k) # instantiate the probabilistic model -chain_ppca = sample(ppca, NUTS(;adtype=AutoReverseDiff()), 500); -``` - -The samples are saved in the Chains struct `chain_ppca`, whose shape can be checked: - -```{julia} -size(chain_ppca) # (no. of iterations, no. of vars, no. of chains) = (500, 159, 1) -``` - -The Chains struct `chain_ppca` also contains the sampling info such as r-hat, ess, mean estimates, etc. -You can print it to check these quantities. - -#### Step 5: posterior predictive checks - -We try to reconstruct the input data using the posterior mean as parameter estimates. -We first retrieve the samples for the projection matrix `W` from `chain_ppca`. This can be done using the Julia `group(chain, parameter_name)` function. -Then we calculate the mean value for each element in $W$, averaging over the whole chain of samples. 
- -```{julia} -# Extract parameter estimates for predicting x - mean of posterior -W = reshape(mean(group(chain_ppca, :W))[:, 2], (n_genes, k)) -Z = reshape(mean(group(chain_ppca, :Z))[:, 2], (k, n_cells)) -μ = mean(group(chain_ppca, :μ))[:, 2] - -mat_rec = W * Z .+ repeat(μ; inner=(1, n_cells)) -``` - -```{julia} -heatmap( - mat_rec; - c=:summer, - colors=:value, - xlabel="cell number", - yflip=true, - ylabel="gene feature", - yticks=1:9, - colorbar_title="expression", -) -``` - -We can quantitatively check the absolute magnitudes of the column average of the gap between `mat_exp` and `mat_rec`: - -```{julia} -diff_matrix = mat_exp .- mat_rec -for col in 4:6 - @assert abs(mean(diff_matrix[:, col])) <= 0.5 -end -``` - -We observe that, using posterior mean, the recovered data matrix `mat_rec` has values align with the original data matrix - particularly the same pattern in the first and last 3 gene features are captured, which implies the inference and p-PCA decomposition are successful. -This is satisfying as we have just projected the original 9-dimensional space onto a 2-dimensional space - some info has been cut off in the projection process, but we haven't lost any important info, e.g. the key differences between the two groups. -The is the desirable property of PCA: it picks up the principal axes along which most of the (original) data variations cluster, and remove those less relevant. -If we choose the reduced space dimension $k$ to be exactly $D$ (the original data dimension), we would recover exactly the same original data matrix `mat_exp`, i.e. all information will be preserved. - -Now we have represented the original high-dimensional data in two dimensions, without lossing the key information about the two groups of cells in the input data. -Finally, the benefits of performing PCA is to analyse and visualise the dimension-reduced data in the projected, low-dimensional space. 
-we save the dimension-reduced matrix $\mathbf{Z}$ as a `DataFrame`, rename the columns and visualise the first two dimensions. - -```{julia} -df_pca = DataFrame(Z', :auto) -rename!(df_pca, Symbol.(["z" * string(i) for i in collect(1:k)])) -df_pca[!, :type] = repeat([1, 2]; inner=n_cells ÷ 2) - -scatter(df_pca[:, :z1], df_pca[:, :z2]; xlabel="z1", ylabel="z2", group=df_pca[:, :type]) -``` - -We see the two groups are well separated in this 2-D space. -As an unsupervised learning method, performing PCA on this dataset gives membership for each cell instance. -Another way to put it: 2 dimensions is enough to capture the main structure of the data. - -#### Further extension: automatic choice of the number of principal components with ARD - -A direct question arises from above practice is: how many principal components do we want to keep, in order to sufficiently represent the latent structure in the data? -This is a very central question for all latent factor models, i.e. how many dimensions are needed to represent that data in the latent space. -In the case of PCA, there exist a lot of heuristics to make that choice. -For example, We can tune the number of principal components using empirical methods such as cross-validation based some criteria such as MSE between the posterior predicted (e.g. mean predictions) data matrix and the original data matrix or the percentage of variation explained [^3]. - -For p-PCA, this can be done in an elegant and principled way, using a technique called *Automatic Relevance Determination* (ARD). -ARD can help pick the correct number of principal directions by regularizing the solution space using a parameterized, data-dependent prior distribution that effectively prunes away redundant or superfluous features [^4]. -Essentially, we are using a specific prior over the factor loadings $\mathbf{W}$ that allows us to prune away dimensions in the latent space. The prior is determined by a precision hyperparameter $\alpha$. 
Here, smaller values of $\alpha$ correspond to more important components. -You can find more details about this in, for example, Bishop (2006) [^5]. - -```{julia} -@model function pPCA_ARD(X) - # Dimensionality of the problem. - N, D = size(X) - - # latent variable Z - Z ~ filldist(Normal(), D, N) - - # weights/loadings w with Automatic Relevance Determination part - α ~ filldist(Gamma(1.0, 1.0), D) - W ~ filldist(MvNormal(zeros(D), 1.0 ./ sqrt.(α)), D) - - mu = (W' * Z)' - - tau ~ Gamma(1.0, 1.0) - return X ~ arraydist([MvNormal(m, 1.0 / sqrt(tau)) for m in eachcol(mu)]) -end; -``` - -Instead of drawing samples of each entry in $\mathbf{W}$ from a standard normal, this time we repeatedly draw $D$ samples from the $D$-dimensional MND, forming a $D \times D$ matrix $\mathbf{W}$. -This matrix is a function of $\alpha$ as the samples are drawn from the MND parameterized by $\alpha$. -We also introduce a hyper-parameter $\tau$ which is the precision in the sampling distribution. -We also re-paramterise the sampling distribution, i.e. each dimension across all instances is a 60-dimensional multivariate normal distribution. Re-parameterisation can sometimes accelrate the sampling process. - -We instantiate the model and ask Turing to sample from it using NUTS sampler. The sample trajectories of $\alpha$ is plotted using the `plot` function from the package `StatsPlots`. - -```{julia} -ppca_ARD = pPCA_ARD(mat_exp') # instantiate the probabilistic model -chain_ppcaARD = sample(ppca_ARD, NUTS(;adtype=AutoReverseDiff()), 500) # sampling -plot(group(chain_ppcaARD, :α); margin=6.0mm) -``` - -Again, we do some inference diagnostics. -Here we look at the convergence of the chains for the $α$ parameter. -This parameter determines the relevance of individual components. -We see that the chains have converged and the posterior of the $\alpha$ parameters is centered around much smaller values in two instances. 
-In the following, we will use the mean of the small values to select the *relevant* dimensions (remember that, smaller values of $\alpha$ correspond to more important components.). -We can clearly see from the values of $\alpha$ that there should be two dimensions (corresponding to $\bar{\alpha}_3=\bar{\alpha}_5≈0.05$) for this dataset. - -```{julia} -# Extract parameter mean estimates of the posterior -W = permutedims(reshape(mean(group(chain_ppcaARD, :W))[:, 2], (n_genes, n_genes))) -Z = permutedims(reshape(mean(group(chain_ppcaARD, :Z))[:, 2], (n_genes, n_cells)))' -α = mean(group(chain_ppcaARD, :α))[:, 2] -plot(α; label="α") -``` - -We can inspect `α` to see which elements are small (i.e. high relevance). -To do this, we first sort `α` using `sortperm()` (in ascending order by default), and record the indices of the first two smallest values (among the $D=9$ $\alpha$ values). -After picking the desired principal directions, we extract the corresponding subset loading vectors from $\mathbf{W}$, and the corresponding dimensions of $\mathbf{Z}$. -We obtain a posterior predicted matrix $\mathbf{X} \in \mathbb{R}^{2 \times 60}$ as the product of the two sub-matrices, and compare the recovered info with the original matrix. - -```{julia} -α_indices = sortperm(α)[1:2] -k = size(α_indices)[1] -X_rec = W[:, α_indices] * Z[α_indices, :] - -df_rec = DataFrame(X_rec', :auto) -heatmap( - X_rec; - c=:summer, - colors=:value, - xlabel="cell number", - yflip=true, - ylabel="gene feature", - yticks=1:9, - colorbar_title="expression", -) -``` - -We observe that, the data in the original space is recovered with key information, the distinct feature values in the first and last three genes for the two cell groups, are preserved. -We can also examine the data in the dimension-reduced space, i.e. the selected components (rows) in $\mathbf{Z}$. 
- -```{julia} -df_pro = DataFrame(Z[α_indices, :]', :auto) -rename!(df_pro, Symbol.(["z" * string(i) for i in collect(1:k)])) -df_pro[!, :type] = repeat([1, 2]; inner=n_cells ÷ 2) -scatter( - df_pro[:, 1], df_pro[:, 2]; xlabel="z1", ylabel="z2", color=df_pro[:, "type"], label="" -) -``` - -This plot is very similar to the low-dimensional plot above, with the *relevant* dimensions chosen based on the values of $α$ via ARD. -When you are in doubt about the number of dimensions to project onto, ARD might provide an answer to that question. - -## Final comments. - -p-PCA is a linear map which linearly transforms the data between the original and projected spaces. -It can also thought as a matrix factorisation method, in which $\mathbf{X}=(\mathbf{W} \times \mathbf{Z})^T$. The projection matrix can be understood as a new basis in the projected space, and $\mathbf{Z}$ are the new coordinates. - - -[^1]: Gilbert Strang, *Introduction to Linear Algebra*, 5th Ed., Wellesley-Cambridge Press, 2016. -[^2]: Probabilistic PCA by TensorFlow, "https://www.tensorflow.org/probability/examples/Probabilistic_PCA". -[^3]: Gareth M. James, Daniela Witten, Trevor Hastie, Robert Tibshirani, *An Introduction to Statistical Learning*, Springer, 2013. -[^4]: David Wipf, Srikantan Nagarajan, *A New View of Automatic Relevance Determination*, NIPS 2007. -[^5]: Christopher Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006. +--- +title: Probabilistic Principal Component Analysis (p-PCA) +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +## Overview of PCA + +Principal component analysis (PCA) is a fundamental technique to analyse and visualise data. +It is an unsupervised learning method mainly used for dimensionality reduction. 
+
+For example, suppose we have a data matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$, and we would like to extract $k \ll D$ principal components which capture most of the information from the original matrix.
+The goal is to understand $\mathbf{X}$ through a lower dimensional subspace (e.g. a two-dimensional subspace for visualisation convenience) spanned by the principal components.
+
+In order to project the original data matrix into low dimensions, we need to find the principal directions in which most of the variation of $\mathbf{X}$ lies.
+Traditionally, this is implemented via [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition), which provides a robust and accurate computational framework for decomposing a matrix into products of rotation-scaling-rotation matrices, particularly for large datasets (see an illustration [here](https://intoli.com/blog/pca-and-svd/)):
+
+$$
+\mathbf{X}_{N \times D} = \mathbf{U}_{N \times r} \times \boldsymbol{\Sigma}_{r \times r} \times \mathbf{V}^T_{r \times D}
+$$
+
+where $\Sigma_{r \times r}$ contains only the $r := \operatorname{rank} \mathbf{X} \leq \min\{N,D\}$ non-zero singular values of $\mathbf{X}$.
+If we pad $\Sigma$ with zeros and add arbitrary orthonormal columns to $\mathbf{U}$ and $\mathbf{V}$, we obtain the full form:[^1]
+
+$$
+\mathbf{X}_{N \times D} = \mathbf{U}_{N \times N} \mathbf{\Sigma}_{N \times D} \mathbf{V}_{D \times D}^T
+$$
+
+where $\mathbf{U}$ and $\mathbf{V}$ are unitary matrices (i.e. their columns are orthonormal).
+Such a decomposition always exists for any matrix.
+The columns of $\mathbf{V}$ are the principal directions/axes.
+The percentage of variance explained can be calculated using the ratios of the squared singular values.[^3]
+
+Here we take a probabilistic perspective.
+For more details and a mathematical derivation, we recommend Bishop's textbook (Christopher M. Bishop, *Pattern Recognition and Machine Learning*, 2006).
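+
+As an aside, the SVD picture above is easy to check numerically.
+The sketch below is illustrative only (it skips the usual mean-centring step, and the matrix `A` and helper `var_explained` are made-up names) and uses Julia's built-in `LinearAlgebra.svd`:
+
+```{julia}
+using LinearAlgebra, Random
+
+Random.seed!(42)
+A = randn(100, 5) * randn(5, 9)  # a 100 × 9 matrix of rank (at most) 5
+
+F = svd(A)                       # thin SVD: A ≈ U * Diagonal(S) * Vt
+@assert A ≈ F.U * Diagonal(F.S) * F.Vt
+
+# Proportion of variance explained by the leading j directions:
+# the ratio of the squared singular values.
+var_explained(S, j) = sum(abs2, S[1:j]) / sum(abs2, S)
+var_explained(F.S, 5)            # ≈ 1.0, since A has rank 5
+```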
+The idea of probabilistic PCA is to find a latent variable $z$ that can be used to describe the hidden structure in a dataset.[^2]
+Consider a data set $\mathbf{X}_{D \times N}=\{x_i\}$ with $i=1,2,...,N$ data points, where each data point $x_i$ is $D$-dimensional (i.e. $x_i \in \mathbb{R}^D$).
+Note that here we use the transposed version of the data matrix. We aim to represent each original $D$-dimensional vector using a lower-dimensional latent variable $z_i \in \mathbb{R}^k$.
+
+We first assume that each latent variable $z_i$ is normally distributed:
+
+$$
+z_i \sim \mathcal{N}(0, I)
+$$
+
+and the corresponding data point is generated via projection:
+
+$$
+x_i | z_i \sim \mathcal{N}(\mathbf{W} z_i + \boldsymbol{μ}, \sigma^2 \mathbf{I})
+$$
+
+where the columns of the projection matrix $\mathbf{W}_{D \times k}$ span the principal axes.
+The above formula expresses $x_i$ as a linear combination of the basis columns of the projection matrix `W`, where the combination coefficients sit in `z_i` (they are the coordinates of `x_i` in the new $k$-dimensional space).
+We can also express the above formula in matrix form: $\mathbf{X}_{D \times N} \approx \mathbf{W}_{D \times k} \mathbf{Z}_{k \times N}$.
+We are interested in inferring $\mathbf{W}$, $μ$ and $\sigma$.
+
+Classical PCA is the specific case of probabilistic PCA in which the covariance of the noise becomes infinitesimally small, i.e. $\sigma^2 \to 0$.
+Probabilistic PCA generalizes classical PCA; this can be seen by marginalizing out the latent variable.[^2]
+
+## The gene expression example
+
+In the first example, we illustrate:
+
+ - how to specify the probabilistic model and
+ - how to perform inference on $\mathbf{W}$, $\boldsymbol{\mu}$ and $\sigma$ using MCMC.
+
+We use simulated genome data to demonstrate these.
+The simulation is inspired by biological measurement of expression of genes in cells, and each cell is characterized by different gene features.
+While the human genome is (mostly) identical between all the cells in the body, there exist interesting differences in gene expression in different human tissues and disease conditions.
+One way to investigate certain diseases is to look at differences in gene expression in cells from patients and healthy controls (usually from the same tissue).
+
+Usually, we can assume that the changes in gene expression only affect a subset of all genes (and these can be linked to diseases in some way).
+One of the challenges for this kind of data is to explore the underlying structure, e.g. to make the connection between a certain state (healthy/disease) and gene expression.
+This becomes difficult when the dimension is very large (up to 20000 genes across 1000s of cells). So in order to find structure in this data, it is useful to project the data into a lower dimensional space.
+
+Regardless of the biological background, the more abstract problem formulation is to project the data living in a high-dimensional space onto a representation in a lower-dimensional space where most of the variation is concentrated in the first few dimensions.
+We use PCA to explore underlying structure or patterns which may not necessarily be obvious from looking at the raw data itself.
+
+#### Step 1: configuration of dependencies
+
+First, we load the dependencies used.
+
+```{julia}
+using Turing
+using ReverseDiff
+using LinearAlgebra, FillArrays
+
+# Packages for visualization
+using DataFrames, StatsPlots, Measures
+
+# Set a seed for reproducibility.
+using Random
+Random.seed!(1789);
+```
+
+All packages used in this tutorial are listed here.
+You can install them via `using Pkg; Pkg.add("package_name")`.
+
+
+::: {.callout-caution}
+## Package usage:
+We use `DataFrames` for instantiating matrices, `LinearAlgebra` and `FillArrays` to perform matrix operations;
+`Turing` for model specification and MCMC sampling, and `ReverseDiff` for setting the automatic differentiation backend when sampling.
+We also use `StatsPlots` for visualising the results and `Measures` for setting plot margin units.
+As all examples involve sampling, for reproducibility we set a fixed seed using the `Random` standard library.
+:::
+
+#### Step 2: Data generation
+
+Here, we simulate the biological gene expression problem described earlier.
+We simulate 60 cells, each with 9 gene features.
+This is a simplified problem with only a few cells and genes for demonstration purposes, which is not comparable to the complexity of real-life data (e.g. thousands of features for each individual).
+Even so, spotting structures or patterns in a 9-feature space would be a challenging task; it would be nice to reduce the dimensionality using p-PCA.
+
+By design, we manually divide the 60 cells into two groups: the first 3 gene features of the first 30 cells have mean 10, while the last 3 gene features of the last 30 cells have mean 10.
+These two groups of cells differ in the expression of genes.
+
+```{julia}
+n_genes = 9 # D
+n_cells = 60 # N
+
+# Create a diagonal-block-like expression matrix with some non-informative genes;
+# not all features/genes are informative, some might just not differ very much between cells.
+mat_exp = randn(n_genes, n_cells)
+mat_exp[1:(n_genes ÷ 3), 1:(n_cells ÷ 2)] .+= 10
+mat_exp[(2 * (n_genes ÷ 3) + 1):end, (n_cells ÷ 2 + 1):end] .+= 10
+```
+
+To visualize the $(D=9) \times (N=60)$ data matrix `mat_exp`, we use the `heatmap` plot.
+
+```{julia}
+heatmap(
+    mat_exp;
+    c=:summer,
+    colors=:value,
+    xlabel="cell number",
+    yflip=true,
+    ylabel="gene feature",
+    yticks=1:9,
+    colorbar_title="expression",
+)
+```
+
+Note that:
+
+ 1. We have made distinct feature differences between these two groups of cells (so it is fairly obvious from looking at the raw data); in practice, and with large enough data sets, it is often impossible to spot the differences from the raw data alone.
+ 2. If you have some patience and compute resources, you can increase the size of the dataset, or play around with the noise levels, to make the problem increasingly harder.
+
+#### Step 3: Create the pPCA model
+
+Here we construct the probabilistic model `pPCA()`.
+As per the p-PCA formula, we think of each row (i.e. each gene feature) as following an $N=60$ dimensional multivariate normal distribution centered around the corresponding row of $\mathbf{W}_{D \times k} \times \mathbf{Z}_{k \times N} + \boldsymbol{\mu}_{D \times N}$.
+
+```{julia}
+@model function pPCA(X::AbstractMatrix{<:Real}, k::Int)
+    # retrieve the dimensions of the input matrix X.
+    N, D = size(X)
+
+    # weights/loadings W
+    W ~ filldist(Normal(), D, k)
+
+    # latent variable z
+    Z ~ filldist(Normal(), k, N)
+
+    # mean offset
+    μ ~ MvNormal(Eye(D))
+    genes_mean = W * Z .+ reshape(μ, D, 1)
+    return X ~ arraydist([MvNormal(m, Eye(N)) for m in eachcol(genes_mean')])
+end;
+```
+
+The function `pPCA()` accepts:
+
+ 1. a data array $\mathbf{X}$ (rows are instances, columns are features; NB: this is the transpose of the original data matrix);
+ 2. an integer $k$ which indicates the dimension of the latent space (the space the original feature matrix is projected onto).
+
+Specifically:
+
+ 1. it first extracts the number of instances $N$ and the dimension $D$ of the input matrix;
+ 2. draws samples of each entry of the projection matrix $\mathbf{W}$ from a standard normal;
+ 3. draws samples of the latent variable $\mathbf{Z}_{k \times N}$ from a multivariate normal distribution (MND);
+ 4. draws samples of the offset $\boldsymbol{\mu}$ from an MND, assuming a uniform offset for all instances;
+ 5. finally, iterates through each gene dimension in $\mathbf{X}$ and defines an MND for the sampling distribution (i.e. the likelihood).
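+
+To make the generative story behind `pPCA()` concrete, here is a small standalone simulation of steps 2-5 with made-up parameter values (purely illustrative; it plays no role in the inference below, and it uses a local RNG so the global seed set above is untouched):
+
+```{julia}
+using Random: MersenneTwister  # Random is already loaded above
+rng = MersenneTwister(0)       # local RNG, keeps the global stream intact
+
+# Hypothetical sizes: D features, N instances, k latent dimensions.
+D_sim, N_sim, k_sim = 9, 60, 2
+W_sim = randn(rng, D_sim, k_sim)   # step 2: loadings
+Z_sim = randn(rng, k_sim, N_sim)   # step 3: latent coordinates
+μ_sim = randn(rng, D_sim)          # step 4: per-feature offset
+# step 5: likelihood - data = projection + offset + unit-variance noise
+X_sim = W_sim * Z_sim .+ μ_sim .+ randn(rng, D_sim, N_sim)
+size(X_sim)                        # (9, 60), the same layout as `mat_exp`
+```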
+
+#### Step 4: Sampling-based inference of the pPCA model
+
+Here we aim to perform MCMC sampling to infer the projection matrix $\mathbf{W}_{D \times k}$, the latent variable matrix $\mathbf{Z}_{k \times N}$, and the offsets $\boldsymbol{\mu}_{D \times 1}$.
+
+We run the inference using the NUTS sampler, with a chain length of 500 and the default target acceptance ratio of 0.65. By default, `sample` draws a single chain.
+You are free to try [different samplers](https://turinglang.org/stable/docs/library/#samplers).
+
+```{julia}
+#| output: false
+setprogress!(false)
+```
+
+```{julia}
+k = 2 # k is the dimension of the projected space, i.e. the number of principal components/axes of choice
+ppca = pPCA(mat_exp', k) # instantiate the probabilistic model
+chain_ppca = sample(ppca, NUTS(; adtype=AutoReverseDiff()), 500);
+```
+
+The samples are saved in the Chains struct `chain_ppca`, whose shape can be checked:
+
+```{julia}
+size(chain_ppca) # (no. of iterations, no. of vars, no. of chains) = (500, 159, 1)
+```
+
+The Chains struct `chain_ppca` also contains sampling diagnostics such as R-hat, ESS, and mean estimates.
+You can print it to check these quantities.
+
+#### Step 5: posterior predictive checks
+
+We try to reconstruct the input data using the posterior mean as the parameter estimate.
+We first retrieve the samples for the projection matrix `W` from `chain_ppca` using the `group(chain, parameter_name)` function.
+Then we calculate the mean value for each element in $W$, averaging over the whole chain of samples.
+
+```{julia}
+# Extract parameter estimates for predicting x - mean of posterior
+W = reshape(mean(group(chain_ppca, :W))[:, 2], (n_genes, k))
+Z = reshape(mean(group(chain_ppca, :Z))[:, 2], (k, n_cells))
+μ = mean(group(chain_ppca, :μ))[:, 2]
+
+mat_rec = W * Z .+ repeat(μ; inner=(1, n_cells))
+```
+
+```{julia}
+heatmap(
+    mat_rec;
+    c=:summer,
+    colors=:value,
+    xlabel="cell number",
+    yflip=true,
+    ylabel="gene feature",
+    yticks=1:9,
+    colorbar_title="expression",
+)
+```
+
+We can quantitatively check the absolute magnitudes of the column averages of the gap between `mat_exp` and `mat_rec`:
+
+```{julia}
+diff_matrix = mat_exp .- mat_rec
+for col in 4:6
+    @assert abs(mean(diff_matrix[:, col])) <= 0.5
+end
+```
+
+We observe that, using the posterior mean, the recovered data matrix `mat_rec` has values that align with the original data matrix - in particular, the same pattern in the first and last 3 gene features is captured, which implies that the inference and the p-PCA decomposition are successful.
+This is satisfying as we have just projected the original 9-dimensional space onto a 2-dimensional space - some information has been cut off in the projection process, but we haven't lost any important information, e.g. the key differences between the two groups.
+This is the desirable property of PCA: it picks up the principal axes along which most of the variation in the original data lies, and discards the less relevant ones.
+If we chose the reduced space dimension $k$ to be exactly $D$ (the original data dimension), we would recover the original data matrix `mat_exp` exactly, i.e. all information would be preserved.
+
+Now we have represented the original high-dimensional data in two dimensions, without losing the key information about the two groups of cells in the input data.
+Finally, a key benefit of performing PCA is that we can analyse and visualise the dimension-reduced data in the projected, low-dimensional space.
+We save the dimension-reduced matrix $\mathbf{Z}$ as a `DataFrame`, rename the columns and visualise the first two dimensions.
+
+```{julia}
+df_pca = DataFrame(Z', :auto)
+rename!(df_pca, Symbol.(["z" * string(i) for i in collect(1:k)]))
+df_pca[!, :type] = repeat([1, 2]; inner=n_cells ÷ 2)
+
+scatter(df_pca[:, :z1], df_pca[:, :z2]; xlabel="z1", ylabel="z2", group=df_pca[:, :type])
+```
+
+We see that the two groups are well separated in this 2-D space.
+Even though PCA is an unsupervised learning method that never sees the labels, the group membership of each cell instance can be read off from the projection.
+Another way to put it: 2 dimensions is enough to capture the main structure of the data.
+
+#### Further extension: automatic choice of the number of principal components with ARD
+
+A natural question arising from the above is: how many principal components do we want to keep, in order to sufficiently represent the latent structure in the data?
+This is a central question for all latent factor models, i.e. how many dimensions are needed to represent the data in the latent space.
+In the case of PCA, there exist a lot of heuristics to make that choice.
+For example, we can tune the number of principal components using empirical methods such as cross-validation, based on criteria such as the MSE between the posterior predicted (e.g. mean predictions) data matrix and the original data matrix, or the percentage of variation explained [^3].
+
+For p-PCA, this can be done in an elegant and principled way, using a technique called *Automatic Relevance Determination* (ARD).
+ARD can help pick the correct number of principal directions by regularizing the solution space using a parameterized, data-dependent prior distribution that effectively prunes away redundant or superfluous features [^4].
+Essentially, we are using a specific prior over the factor loadings $\mathbf{W}$ that allows us to prune away dimensions in the latent space. The prior is determined by a precision hyperparameter $\alpha$. Here, smaller values of $\alpha$ correspond to more important components.
+You can find more details about this in, for example, Bishop (2006) [^5].
+
+```{julia}
+@model function pPCA_ARD(X)
+    # Dimensionality of the problem.
+    N, D = size(X)
+
+    # latent variable Z
+    Z ~ filldist(Normal(), D, N)
+
+    # weights/loadings w with Automatic Relevance Determination part
+    α ~ filldist(Gamma(1.0, 1.0), D)
+    W ~ filldist(MvNormal(zeros(D), 1.0 ./ sqrt.(α)), D)
+
+    mu = (W' * Z)'
+
+    tau ~ Gamma(1.0, 1.0)
+    return X ~ arraydist([MvNormal(m, 1.0 / sqrt(tau)) for m in eachcol(mu)])
+end;
+```
+
+Instead of drawing samples of each entry in $\mathbf{W}$ from a standard normal, this time we repeatedly draw $D$ samples from a $D$-dimensional MND, forming a $D \times D$ matrix $\mathbf{W}$.
+This matrix is a function of $\alpha$, as the samples are drawn from the MND parameterized by $\alpha$.
+We also introduce the hyper-parameter $\tau$, which is the precision of the sampling distribution.
+We also re-parameterise the sampling distribution, i.e. each gene dimension, across all instances, follows a 60-dimensional multivariate normal distribution. Re-parameterisation can sometimes accelerate the sampling process.
+
+We instantiate the model and ask Turing to sample from it using the NUTS sampler. The sample trajectories of $\alpha$ are plotted using the `plot` function from the package `StatsPlots`.
+
+```{julia}
+ppca_ARD = pPCA_ARD(mat_exp') # instantiate the probabilistic model
+chain_ppcaARD = sample(ppca_ARD, NUTS(; adtype=AutoReverseDiff()), 500) # sampling
+plot(group(chain_ppcaARD, :α); margin=6.0mm)
+```
+
+Again, we do some inference diagnostics.
+Here we look at the convergence of the chains for the $α$ parameter.
+This parameter determines the relevance of individual components.
+We see that the chains have converged and the posterior of the $\alpha$ parameters is centered around much smaller values in two instances.
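+
+As a quick standalone illustration of why a large precision prunes a component (hypothetical numbers, unrelated to the fitted model above): a loading drawn with a large $\alpha$ is pinned near zero, while a small $\alpha$ leaves it free to carry signal.
+
+```{julia}
+# For a prior w ~ Normal(0, 1/√α), the standard deviation of w is 1/√α.
+α_relevant, α_pruned = 0.05, 100.0
+std_relevant = 1 / sqrt(α_relevant)  # ≈ 4.47: the component can carry signal
+std_pruned = 1 / sqrt(α_pruned)      # = 0.1: the component is effectively switched off
+(std_relevant, std_pruned)
+```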
+In the following, we will use the mean of the small values to select the *relevant* dimensions (remember that smaller values of $\alpha$ correspond to more important components).
+We can clearly see from the values of $\alpha$ that there should be two dimensions (corresponding to $\bar{\alpha}_3=\bar{\alpha}_5≈0.05$) for this dataset.
+
+```{julia}
+# Extract parameter mean estimates of the posterior
+W = permutedims(reshape(mean(group(chain_ppcaARD, :W))[:, 2], (n_genes, n_genes)))
+Z = permutedims(reshape(mean(group(chain_ppcaARD, :Z))[:, 2], (n_genes, n_cells)))'
+α = mean(group(chain_ppcaARD, :α))[:, 2]
+plot(α; label="α")
+```
+
+We can inspect `α` to see which elements are small (i.e. of high relevance).
+To do this, we first sort `α` using `sortperm()` (in ascending order by default), and record the indices of the two smallest values (among the $D=9$ $\alpha$ values).
+After picking the desired principal directions, we extract the corresponding loading vectors from $\mathbf{W}$, and the corresponding dimensions of $\mathbf{Z}$.
+We obtain a posterior predicted matrix $\mathbf{X}_{rec} \in \mathbb{R}^{9 \times 60}$ as the product of the two sub-matrices, and compare the recovered information with the original matrix.
+
+```{julia}
+α_indices = sortperm(α)[1:2]
+k = size(α_indices)[1]
+X_rec = W[:, α_indices] * Z[α_indices, :]
+
+df_rec = DataFrame(X_rec', :auto)
+heatmap(
+    X_rec;
+    c=:summer,
+    colors=:value,
+    xlabel="cell number",
+    yflip=true,
+    ylabel="gene feature",
+    yticks=1:9,
+    colorbar_title="expression",
+)
+```
+
+We observe that the key information of the original data is recovered: the distinct feature values in the first and last three genes for the two cell groups are preserved.
+We can also examine the data in the dimension-reduced space, i.e. the selected components (rows) in $\mathbf{Z}$.
+
+```{julia}
+df_pro = DataFrame(Z[α_indices, :]', :auto)
+rename!(df_pro, Symbol.(["z" * string(i) for i in collect(1:k)]))
+df_pro[!, :type] = repeat([1, 2]; inner=n_cells ÷ 2)
+scatter(
+    df_pro[:, 1], df_pro[:, 2]; xlabel="z1", ylabel="z2", color=df_pro[:, "type"], label=""
+)
+```
+
+This plot is very similar to the low-dimensional plot above, with the *relevant* dimensions chosen based on the values of $α$ via ARD.
+When you are in doubt about the number of dimensions to project onto, ARD might provide an answer to that question.
+
+## Final comments
+
+p-PCA is a linear model which maps the data linearly between the original and projected spaces.
+It can also be thought of as a matrix factorisation method, in which $\mathbf{X}=(\mathbf{W} \times \mathbf{Z})^T$. The projection matrix can be understood as a new basis in the projected space, and $\mathbf{Z}$ contains the new coordinates.
+
+
+[^1]: Gilbert Strang, *Introduction to Linear Algebra*, 5th Ed., Wellesley-Cambridge Press, 2016.
+[^2]: Probabilistic PCA by TensorFlow, <https://www.tensorflow.org/probability/examples/Probabilistic_PCA>.
+[^3]: Gareth M. James, Daniela Witten, Trevor Hastie, Robert Tibshirani, *An Introduction to Statistical Learning*, Springer, 2013.
+[^4]: David Wipf, Srikantan Nagarajan, *A New View of Automatic Relevance Determination*, NIPS 2007.
+[^5]: Christopher Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.
diff --git a/tutorials/12-gplvm/index.qmd b/tutorials/12-gplvm/index.qmd
index 82ee096cf..387637ad4 100755
--- a/tutorials/12-gplvm/index.qmd
+++ b/tutorials/12-gplvm/index.qmd
@@ -1,234 +1,234 @@
----
-title: Gaussian Process Latent Variable Model
-engine: julia
----
-
-```{julia}
-#| echo: false
-#| output: false
-using Pkg;
-Pkg.instantiate();
-```
-
-In a previous tutorial, we have discussed latent variable models, in particular probabilistic principal component analysis (pPCA).
-Here, we show how we can extend the mapping provided by pPCA to non-linear mappings between input and output. -For more details about the Gaussian Process Latent Variable Model (GPLVM), -we refer the reader to the [original publication](https://jmlr.org/papers/v6/lawrence05a.html) and a [further extension](http://proceedings.mlr.press/v9/titsias10a/titsias10a.pdf). - -In short, the GPVLM is a dimensionality reduction technique that allows us to embed a high-dimensional dataset in a lower-dimensional embedding. -Importantly, it provides the advantage that the linear mappings from the embedded space can be non-linearised through the use of Gaussian Processes. - -### Let's start by loading some dependencies. - -```{julia} -#| eval: false -using Turing -using AbstractGPs -using FillArrays -using LaTeXStrings -using Plots -using RDatasets -using ReverseDiff -using StatsBase - -using LinearAlgebra -using Random - -Random.seed!(1789); -``` - -We demonstrate the GPLVM with a very small dataset: [Fisher's Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set). -This is mostly for reasons of run time, so the tutorial can be run quickly. -As you will see, one of the major drawbacks of using GPs is their speed, -although this is an active area of research. -We will briefly touch on some ways to speed things up at the end of this tutorial. -We transform the original data with non-linear operations in order to demonstrate the power of GPs to work on non-linear relationships, while keeping the problem reasonably small. 
- -```{julia} -#| eval: false -data = dataset("datasets", "iris") -species = data[!, "Species"] -index = shuffle(1:150) -# we extract the four measured quantities, -# so the dimension of the data is only d=4 for this toy example -dat = Matrix(data[index, 1:4]) -labels = data[index, "Species"] - -# non-linearize data to demonstrate ability of GPs to deal with non-linearity -dat[:, 1] = 0.5 * dat[:, 1] .^ 2 + 0.1 * dat[:, 1] .^ 3 -dat[:, 2] = dat[:, 2] .^ 3 + 0.2 * dat[:, 2] .^ 4 -dat[:, 3] = 0.1 * exp.(dat[:, 3]) - 0.2 * dat[:, 3] .^ 2 -dat[:, 4] = 0.5 * log.(dat[:, 4]) .^ 2 + 0.01 * dat[:, 3] .^ 5 - -# normalize data -dt = fit(ZScoreTransform, dat; dims=1); -StatsBase.transform!(dt, dat); -``` - -We will start out by demonstrating the basic similarity between pPCA (see the tutorial on this topic) and the GPLVM model. -Indeed, pPCA is basically equivalent to running the GPLVM model with an automatic relevance determination (ARD) linear kernel. - -First, we re-introduce the pPCA model (see the tutorial on pPCA for details) - -```{julia} -#| eval: false -@model function pPCA(x) - # Dimensionality of the problem. - N, D = size(x) - # latent variable z - z ~ filldist(Normal(), D, N) - # weights/loadings W - w ~ filldist(Normal(), D, D) - mu = (w * z)' - for d in 1:D - x[:, d] ~ MvNormal(mu[:, d], I) - end - return nothing -end; -``` - -We define two different kernels, a simple linear kernel with an Automatic Relevance Determination transform and a -squared exponential kernel. - - -```{julia} -#| eval: false -linear_kernel(α) = LinearKernel() ∘ ARDTransform(α) -sekernel(α, σ) = σ * SqExponentialKernel() ∘ ARDTransform(α); -``` - -And here is the GPLVM model. -We create separate models for the two types of kernel. - -```{julia} -#| eval: false -@model function GPLVM_linear(Y, K) - # Dimensionality of the problem. 
- N, D = size(Y) - # K is the dimension of the latent space - @assert K <= D - noise = 1e-3 - - # Priors - α ~ MvLogNormal(MvNormal(Zeros(K), I)) - Z ~ filldist(Normal(), K, N) - mu ~ filldist(Normal(), N) - - gp = GP(linear_kernel(α)) - gpz = gp(ColVecs(Z), noise) - Y ~ filldist(MvNormal(mu, cov(gpz)), D) - - return nothing -end; - -@model function GPLVM(Y, K) - # Dimensionality of the problem. - N, D = size(Y) - # K is the dimension of the latent space - @assert K <= D - noise = 1e-3 - - # Priors - α ~ MvLogNormal(MvNormal(Zeros(K), I)) - σ ~ LogNormal(0.0, 1.0) - Z ~ filldist(Normal(), K, N) - mu ~ filldist(Normal(), N) - - gp = GP(sekernel(α, σ)) - gpz = gp(ColVecs(Z), noise) - Y ~ filldist(MvNormal(mu, cov(gpz)), D) - - return nothing -end; -``` - -```{julia} -#| eval: false -# Standard GPs don't scale very well in n, so we use a small subsample for the purpose of this tutorial -n_data = 40 -# number of features to use from dataset -n_features = 4 -# latent dimension for GP case -ndim = 4; -``` - -```{julia} -#| eval: false -ppca = pPCA(dat[1:n_data, 1:n_features]) -chain_ppca = sample(ppca, NUTS{Turing.ReverseDiffAD{true}}(), 1000); -``` - -```{julia} -#| eval: false -# we extract the posterior mean estimates of the parameters from the chain -z_mean = reshape(mean(group(chain_ppca, :z))[:, 2], (n_features, n_data)) -scatter(z_mean[1, :], z_mean[2, :]; group=labels[1:n_data], xlabel=L"z_1", ylabel=L"z_2") -``` - -We can see that the pPCA fails to distinguish the groups. -In particular, the `setosa` species is not clearly separated from `versicolor` and `virginica`. -This is due to the non-linearities that we introduced, as without them the two groups can be clearly distinguished -using pPCA (see the pPCA tutorial). - -Let's try the same with our linear kernel GPLVM model. 
- -```{julia} -#| eval: false -gplvm_linear = GPLVM_linear(dat[1:n_data, 1:n_features], ndim) -chain_linear = sample(gplvm_linear, NUTS{Turing.ReverseDiffAD{true}}(), 500); -``` - -```{julia} -#| eval: false -# we extract the posterior mean estimates of the parameters from the chain -z_mean = reshape(mean(group(chain_linear, :Z))[:, 2], (n_features, n_data)) -alpha_mean = mean(group(chain_linear, :α))[:, 2] - -alpha1, alpha2 = partialsortperm(alpha_mean, 1:2; rev=true) -scatter( - z_mean[alpha1, :], - z_mean[alpha2, :]; - group=labels[1:n_data], - xlabel=L"z_{\mathrm{ard}_1}", - ylabel=L"z_{\mathrm{ard}_2}", -) -``` - -We can see that similar to the pPCA case, the linear kernel GPLVM fails to distinguish between the two groups -(`setosa` on the one hand, and `virginica` and `verticolor` on the other). - -Finally, we demonstrate that by changing the kernel to a non-linear function, we are able to separate the data again. - -```{julia} -#| eval: false -gplvm = GPLVM(dat[1:n_data, 1:n_features], ndim) -chain_gplvm = sample(gplvm, NUTS{Turing.ReverseDiffAD{true}}(), 500); -``` - -```{julia} -#| eval: false -# we extract the posterior mean estimates of the parameters from the chain -z_mean = reshape(mean(group(chain_gplvm, :Z))[:, 2], (ndim, n_data)) -alpha_mean = mean(group(chain_gplvm, :α))[:, 2] - -alpha1, alpha2 = partialsortperm(alpha_mean, 1:2; rev=true) -scatter( - z_mean[alpha1, :], - z_mean[alpha2, :]; - group=labels[1:n_data], - xlabel=L"z_{\mathrm{ard}_1}", - ylabel=L"z_{\mathrm{ard}_2}", -) -``` - -```{julia} -#| eval: false -let - @assert abs( - mean(z_mean[alpha1, labels[1:n_data] .== "setosa"]) - - mean(z_mean[alpha1, labels[1:n_data] .!= "setosa"]), - ) > 1 -end -``` - +--- +title: Gaussian Process Latent Variable Model +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +In a previous tutorial, we have discussed latent variable models, in particular probabilistic principal component analysis (pPCA). 
+Here, we show how we can extend the mapping provided by pPCA to non-linear mappings between input and output.
+For more details about the Gaussian Process Latent Variable Model (GPLVM),
+we refer the reader to the [original publication](https://jmlr.org/papers/v6/lawrence05a.html) and a [further extension](http://proceedings.mlr.press/v9/titsias10a/titsias10a.pdf).
+
+In short, the GPLVM is a dimensionality reduction technique that allows us to embed a high-dimensional dataset in a lower-dimensional space.
+Importantly, it provides the advantage that the linear mappings from the embedded space can be non-linearised through the use of Gaussian processes.
+
+### Let's start by loading some dependencies.
+
+```{julia}
+#| eval: false
+using Turing
+using AbstractGPs
+using FillArrays
+using LaTeXStrings
+using Plots
+using RDatasets
+using ReverseDiff
+using StatsBase
+
+using LinearAlgebra
+using Random
+
+Random.seed!(1789);
+```
+
+We demonstrate the GPLVM with a very small dataset: [Fisher's Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set).
+This is mostly for reasons of run time, so the tutorial can be run quickly.
+As you will see, one of the major drawbacks of using GPs is their speed,
+although this is an active area of research.
+We will briefly touch on some ways to speed things up at the end of this tutorial.
+We transform the original data with non-linear operations in order to demonstrate the power of GPs to work on non-linear relationships, while keeping the problem reasonably small.
+ +```{julia} +#| eval: false +data = dataset("datasets", "iris") +species = data[!, "Species"] +index = shuffle(1:150) +# we extract the four measured quantities, +# so the dimension of the data is only d=4 for this toy example +dat = Matrix(data[index, 1:4]) +labels = data[index, "Species"] + +# non-linearize data to demonstrate ability of GPs to deal with non-linearity +dat[:, 1] = 0.5 * dat[:, 1] .^ 2 + 0.1 * dat[:, 1] .^ 3 +dat[:, 2] = dat[:, 2] .^ 3 + 0.2 * dat[:, 2] .^ 4 +dat[:, 3] = 0.1 * exp.(dat[:, 3]) - 0.2 * dat[:, 3] .^ 2 +dat[:, 4] = 0.5 * log.(dat[:, 4]) .^ 2 + 0.01 * dat[:, 3] .^ 5 + +# normalize data +dt = fit(ZScoreTransform, dat; dims=1); +StatsBase.transform!(dt, dat); +``` + +We will start out by demonstrating the basic similarity between pPCA (see the tutorial on this topic) and the GPLVM model. +Indeed, pPCA is basically equivalent to running the GPLVM model with an automatic relevance determination (ARD) linear kernel. + +First, we re-introduce the pPCA model (see the tutorial on pPCA for details) + +```{julia} +#| eval: false +@model function pPCA(x) + # Dimensionality of the problem. + N, D = size(x) + # latent variable z + z ~ filldist(Normal(), D, N) + # weights/loadings W + w ~ filldist(Normal(), D, D) + mu = (w * z)' + for d in 1:D + x[:, d] ~ MvNormal(mu[:, d], I) + end + return nothing +end; +``` + +We define two different kernels, a simple linear kernel with an Automatic Relevance Determination transform and a +squared exponential kernel. + + +```{julia} +#| eval: false +linear_kernel(α) = LinearKernel() ∘ ARDTransform(α) +sekernel(α, σ) = σ * SqExponentialKernel() ∘ ARDTransform(α); +``` + +And here is the GPLVM model. +We create separate models for the two types of kernel. + +```{julia} +#| eval: false +@model function GPLVM_linear(Y, K) + # Dimensionality of the problem. 
+    N, D = size(Y)
+    # K is the dimension of the latent space
+    @assert K <= D
+    noise = 1e-3
+
+    # Priors
+    α ~ MvLogNormal(MvNormal(Zeros(K), I))
+    Z ~ filldist(Normal(), K, N)
+    mu ~ filldist(Normal(), N)
+
+    gp = GP(linear_kernel(α))
+    gpz = gp(ColVecs(Z), noise)
+    Y ~ filldist(MvNormal(mu, cov(gpz)), D)
+
+    return nothing
+end;
+
+@model function GPLVM(Y, K)
+    # Dimensionality of the problem.
+    N, D = size(Y)
+    # K is the dimension of the latent space
+    @assert K <= D
+    noise = 1e-3
+
+    # Priors
+    α ~ MvLogNormal(MvNormal(Zeros(K), I))
+    σ ~ LogNormal(0.0, 1.0)
+    Z ~ filldist(Normal(), K, N)
+    mu ~ filldist(Normal(), N)
+
+    gp = GP(sekernel(α, σ))
+    gpz = gp(ColVecs(Z), noise)
+    Y ~ filldist(MvNormal(mu, cov(gpz)), D)
+
+    return nothing
+end;
+```
+
+```{julia}
+#| eval: false
+# Standard GPs don't scale very well in n, so we use a small subsample for the purpose of this tutorial
+n_data = 40
+# number of features to use from dataset
+n_features = 4
+# latent dimension for GP case
+ndim = 4;
+```
+
+```{julia}
+#| eval: false
+ppca = pPCA(dat[1:n_data, 1:n_features])
+chain_ppca = sample(ppca, NUTS(; adtype=AutoReverseDiff()), 1000);
+```
+
+```{julia}
+#| eval: false
+# we extract the posterior mean estimates of the parameters from the chain
+z_mean = reshape(mean(group(chain_ppca, :z))[:, 2], (n_features, n_data))
+scatter(z_mean[1, :], z_mean[2, :]; group=labels[1:n_data], xlabel=L"z_1", ylabel=L"z_2")
+```
+
+We can see that pPCA fails to distinguish the groups.
+In particular, the `setosa` species is not clearly separated from `versicolor` and `virginica`.
+This is due to the non-linearities that we introduced, as without them the two groups can be clearly distinguished
+using pPCA (see the pPCA tutorial).
+
+Let's try the same with our linear kernel GPLVM model.
+
+```{julia}
+#| eval: false
+gplvm_linear = GPLVM_linear(dat[1:n_data, 1:n_features], ndim)
+chain_linear = sample(gplvm_linear, NUTS{Turing.ReverseDiffAD{true}}(), 500);
+```
+
+```{julia}
+#| eval: false
+# we extract the posterior mean estimates of the parameters from the chain
+z_mean = reshape(mean(group(chain_linear, :Z))[:, 2], (n_features, n_data))
+alpha_mean = mean(group(chain_linear, :α))[:, 2]
+
+alpha1, alpha2 = partialsortperm(alpha_mean, 1:2; rev=true)
+scatter(
+    z_mean[alpha1, :],
+    z_mean[alpha2, :];
+    group=labels[1:n_data],
+    xlabel=L"z_{\mathrm{ard}_1}",
+    ylabel=L"z_{\mathrm{ard}_2}",
+)
+```
+
+We can see that, similar to the pPCA case, the linear kernel GPLVM fails to distinguish between the two groups
+(`setosa` on the one hand, and `virginica` and `versicolor` on the other).
+
+Finally, we demonstrate that by changing to a non-linear kernel, we are able to separate the data again.
+
+```{julia}
+#| eval: false
+gplvm = GPLVM(dat[1:n_data, 1:n_features], ndim)
+chain_gplvm = sample(gplvm, NUTS{Turing.ReverseDiffAD{true}}(), 500);
+```
+
+```{julia}
+#| eval: false
+# we extract the posterior mean estimates of the parameters from the chain
+z_mean = reshape(mean(group(chain_gplvm, :Z))[:, 2], (ndim, n_data))
+alpha_mean = mean(group(chain_gplvm, :α))[:, 2]
+
+alpha1, alpha2 = partialsortperm(alpha_mean, 1:2; rev=true)
+scatter(
+    z_mean[alpha1, :],
+    z_mean[alpha2, :];
+    group=labels[1:n_data],
+    xlabel=L"z_{\mathrm{ard}_1}",
+    ylabel=L"z_{\mathrm{ard}_2}",
+)
+```
+
+```{julia}
+#| eval: false
+let
+    @assert abs(
+        mean(z_mean[alpha1, labels[1:n_data] .== "setosa"]) -
+        mean(z_mean[alpha1, labels[1:n_data] .!= "setosa"]),
+    ) > 1
+end
+```
+
+Now, the split between the two groups is visible again.
\ No newline at end of file diff --git a/tutorials/13-seasonal-time-series/index.qmd b/tutorials/13-seasonal-time-series/index.qmd index 7cd9cde6d..4d3545f5c 100755 --- a/tutorials/13-seasonal-time-series/index.qmd +++ b/tutorials/13-seasonal-time-series/index.qmd @@ -1,348 +1,348 @@ ---- -title: Bayesian Time Series Analysis -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -In time series analysis we are often interested in understanding how various real-life circumstances impact our quantity of interest. -These can be, for instance, season, day of week, or time of day. -To analyse this it is useful to decompose time series into simpler components (corresponding to relevant circumstances) -and infer their relevance. -In this tutorial we are going to use Turing for time series analysis and learn about useful ways to decompose time series. - -# Modelling time series - -Before we start coding, let us talk about what exactly we mean with time series decomposition. -In a nutshell, it is a divide-and-conquer approach where we express a time series as a sum or a product of simpler series. -For instance, the time series $f(t)$ can be decomposed into a sum of $n$ components - -$$f(t) = \sum_{i=1}^n f_i(t),$$ - -or we can decompose $g(t)$ into a product of $m$ components - -$$g(t) = \prod_{i=1}^m g_i(t).$$ - -We refer to this as *additive* or *multiplicative* decomposition respectively. -This type of decomposition is great since it lets us reason about individual components, which makes encoding prior information and interpreting model predictions very easy. -Two common components are *trends*, which represent the overall change of the time series (often assumed to be linear), -and *cyclic effects* which contribute oscillating effects around the trend. -Let us simulate some data with an additive linear trend and oscillating effects. 
- -```{julia} -using Turing -using FillArrays -using StatsPlots - -using LinearAlgebra -using Random -using Statistics - -Random.seed!(12345) - -true_sin_freq = 2 -true_sin_amp = 5 -true_cos_freq = 7 -true_cos_amp = 2.5 -tmax = 10 -β_true = 2 -α_true = -1 -tt = 0:0.05:tmax -f₁(t) = α_true + β_true * t -f₂(t) = true_sin_amp * sinpi(2 * t * true_sin_freq / tmax) -f₃(t) = true_cos_amp * cospi(2 * t * true_cos_freq / tmax) -f(t) = f₁(t) + f₂(t) + f₃(t) - -plot(f, tt; label="f(t)", title="Observed time series", legend=:topleft, linewidth=3) -plot!( - [f₁, f₂, f₃], - tt; - label=["f₁(t)" "f₂(t)" "f₃(t)"], - style=[:dot :dash :dashdot], - linewidth=1, -) -``` - -Even though we use simple components, combining them can give rise to fairly complex time series. -In this time series, cyclic effects are just added on top of the trend. -If we instead multiply the components the cyclic effects cause the series to oscillate -between larger and larger values, since they get scaled by the trend. - -```{julia} -g(t) = f₁(t) * f₂(t) * f₃(t) - -plot(g, tt; label="f(t)", title="Observed time series", legend=:topleft, linewidth=3) -plot!([f₁, f₂, f₃], tt; label=["f₁(t)" "f₂(t)" "f₃(t)"], linewidth=1) -``` - -Unlike $f$, $g$ oscillates around $0$ since it is being multiplied with sines and cosines. -To let a multiplicative decomposition oscillate around the trend we could define it as -$\tilde{g}(t) = f₁(t) * (1 + f₂(t)) * (1 + f₃(t)),$ -but for convenience we will leave it as is. -The inference machinery is the same for both cases. - -# Model fitting - -Having discussed time series decomposition, let us fit a model to the time series above and recover the true parameters. -Before building our model, we standardise the time axis to $[0, 1]$ and subtract the max of the time series. -This helps convergence while maintaining interpretability and the correct scales for the cyclic components. 
- -```{julia} -σ_true = 0.35 -t = collect(tt[begin:3:end]) -t_min, t_max = extrema(t) -x = (t .- t_min) ./ (t_max - t_min) -yf = f.(t) .+ σ_true .* randn(size(t)) -yf_max = maximum(yf) -yf = yf .- yf_max - -scatter(x, yf; title="Standardised data", legend=false) -``` - -Let us now build our model. -We want to assume a linear trend, and cyclic effects. -Encoding a linear trend is easy enough, but what about cyclical effects? -We will take a scattergun approach, and create multiple cyclical features using both sine and cosine functions and let our inference machinery figure out which to keep. -To do this, we define how long a one period should be, and create features in reference to said period. -How long a period should be is problem dependent, but as an example let us say it is $1$ year. -If we then find evidence for a cyclic effect with a frequency of 2, that would mean a biannual effect. A frequency of 4 would mean quarterly etc. -Since we are using synthetic data, we are simply going to let the period be 1, which is the entire length of the time series. - -```{julia} -freqs = 1:10 -num_freqs = length(freqs) -period = 1 -cyclic_features = [sinpi.(2 .* freqs' .* x ./ period) cospi.(2 .* freqs' .* x ./ period)] - -plot_freqs = [1, 3, 5] -freq_ptl = plot( - cyclic_features[:, plot_freqs]; - label=permutedims(["sin(2π$(f)x)" for f in plot_freqs]), - title="Cyclical features subset", -) -``` - -Having constructed the cyclical features, we can finally build our model. The model we will implement looks like this - -$$ -f(t) = \alpha + \beta_t t + \sum_{i=1}^F \beta_{\sin{},i} \sin{}(2\pi f_i t) + \sum_{i=1}^F \beta_{\cos{},i} \cos{}(2\pi f_i t), -$$ - -with a Gaussian likelihood $y \sim \mathcal{N}(f(t), \sigma^2)$. -For convenience we are treating the cyclical feature weights $\beta_{\sin{},i}$ and $\beta_{\cos{},i}$ the same in code and weight them with $\beta_c$. 
-And just because it is so easy, we parameterise our model with the operation with which to apply the cyclic effects. -This lets us use the exact same code for both additive and multiplicative models. -Finally, we plot prior predictive samples to make sure our priors make sense. - -```{julia} -@model function decomp_model(t, c, op) - α ~ Normal(0, 10) - βt ~ Normal(0, 2) - βc ~ MvNormal(Zeros(size(c, 2)), I) - σ ~ truncated(Normal(0, 0.1); lower=0) - - cyclic = c * βc - trend = α .+ βt .* t - μ = op(trend, cyclic) - y ~ MvNormal(μ, σ^2 * I) - return (; trend, cyclic) -end - -y_prior_samples = mapreduce(hcat, 1:100) do _ - rand(decomp_model(t, cyclic_features, +)).y -end -plot(t, y_prior_samples; linewidth=1, alpha=0.5, color=1, label="", title="Prior samples") -scatter!(t, yf; color=2, label="Data") -``` - -With the model specified and with a reasonable prior we can now let Turing decompose the time series for us! - -```{julia} -function mean_ribbon(samples) - qs = quantile(samples) - low = qs[:, Symbol("2.5%")] - up = qs[:, Symbol("97.5%")] - m = mean(samples)[:, :mean] - return m, (m - low, up - m) -end - -function get_decomposition(model, x, cyclic_features, chain, op) - chain_params = Turing.MCMCChains.get_sections(chain, :parameters) - return generated_quantities(model(x, cyclic_features, op), chain_params) -end - -function plot_fit(x, y, decomp, ymax) - trend = mapreduce(x -> x.trend, hcat, decomp) - cyclic = mapreduce(x -> x.cyclic, hcat, decomp) - - trend_plt = plot( - x, - trend .+ ymax; - color=1, - label=nothing, - alpha=0.2, - title="Trend", - xlabel="Time", - ylabel="f₁(t)", - ) - ls = [ones(length(t)) t] \ y - α̂, β̂ = ls[1], ls[2:end] - plot!( - trend_plt, - t, - α̂ .+ t .* β̂ .+ ymax; - label="Least squares trend", - color=5, - linewidth=4, - ) - - scatter!(trend_plt, x, y .+ ymax; label=nothing, color=2, legend=:topleft) - cyclic_plt = plot( - x, - cyclic; - color=1, - label=nothing, - alpha=0.2, - title="Cyclic effect", - xlabel="Time", - 
ylabel="f₂(t)", - ) - return trend_plt, cyclic_plt -end - -chain = sample(decomp_model(x, cyclic_features, +) | (; y=yf), NUTS(), 2000, progress=false) -yf_samples = predict(decomp_model(x, cyclic_features, +), chain) -m, conf = mean_ribbon(yf_samples) -predictive_plt = plot( - t, - m .+ yf_max; - ribbon=conf, - label="Posterior density", - title="Posterior decomposition", - xlabel="Time", - ylabel="f(t)", -) -scatter!(predictive_plt, t, yf .+ yf_max; color=2, label="Data", legend=:topleft) - -decomp = get_decomposition(decomp_model, x, cyclic_features, chain, +) -decomposed_plt = plot_fit(t, yf, decomp, yf_max) -plot(predictive_plt, decomposed_plt...; layout=(3, 1), size=(700, 1000)) -``` - -```{julia} -#| echo: false -let - @assert mean(ess(chain)[:, :ess]) > 500 "Mean ESS: $(mean(ess(chain)[:, :ess])) - not > 500" - lower_quantile = m .- conf[1] # 2.5% quantile - upper_quantile = m .+ conf[2] # 97.5% quantile - @assert mean(lower_quantile .≤ yf .≤ upper_quantile) ≥ 0.9 "Surprisingly few observations in predicted 95% interval: $(mean(lower_quantile .≤ yf .≤ upper_quantile))" -end -``` - -Inference is successful and the posterior beautifully captures the data. -We see that the least squares linear fit deviates somewhat from the posterior trend. -Since our model takes cyclic effects into account separately, -we get a better estimate of the true overall trend than if we would have just fitted a line. -But what frequency content did the model identify? - - -```{julia} -function plot_cyclic_features(βsin, βcos) - labels = reshape(["freq = $i" for i in freqs], 1, :) - colors = collect(freqs)' - style = reshape([i <= 10 ? 
:solid : :dash for i in 1:length(labels)], 1, :) - sin_features_plt = density( - βsin[:, :, 1]; - title="Sine features posterior", - label=labels, - ylabel="Density", - xlabel="Weight", - color=colors, - linestyle=style, - legend=nothing, - ) - cos_features_plt = density( - βcos[:, :, 1]; - title="Cosine features posterior", - ylabel="Density", - xlabel="Weight", - label=nothing, - color=colors, - linestyle=style, - ) - - return seasonal_features_plt = plot( - sin_features_plt, - cos_features_plt; - layout=(2, 1), - size=(800, 600), - legend=:outerright, - ) -end - -βc = Array(group(chain, :βc)) -plot_cyclic_features(βc[:, begin:num_freqs, :], βc[:, (num_freqs + 1):end, :]) -``` - -Plotting the posterior over the cyclic features reveals that the model managed to extract the true frequency content. - -Since we wrote our model to accept a combining operator, we can easily run the same analysis for a multiplicative model. - -```{julia} -yg = g.(t) .+ σ_true .* randn(size(t)) - -y_prior_samples = mapreduce(hcat, 1:100) do _ - rand(decomp_model(t, cyclic_features, .*)).y -end -plot(t, y_prior_samples; linewidth=1, alpha=0.5, color=1, label="", title="Prior samples") -scatter!(t, yf; color=2, label="Data") -``` - -```{julia} -chain = sample(decomp_model(x, cyclic_features, .*) | (; y=yg), NUTS(), 2000, progress=false) -yg_samples = predict(decomp_model(x, cyclic_features, .*), chain) -m, conf = mean_ribbon(yg_samples) -predictive_plt = plot( - t, - m; - ribbon=conf, - label="Posterior density", - title="Posterior decomposition", - xlabel="Time", - ylabel="g(t)", -) -scatter!(predictive_plt, t, yg; color=2, label="Data", legend=:topleft) - -decomp = get_decomposition(decomp_model, x, cyclic_features, chain, .*) -decomposed_plt = plot_fit(t, yg, decomp, 0) -plot(predictive_plt, decomposed_plt...; layout=(3, 1), size=(700, 1000)) -``` - -```{julia} -#| echo: false -let - @assert mean(ess(chain)[:, :ess]) > 500 "Mean ESS: $(mean(ess(chain)[:, :ess])) - not > 500" - 
lower_quantile = m .- conf[1] # 2.5% quantile - upper_quantile = m .+ conf[2] # 97.5% quantile - @assert mean(lower_quantile .≤ yg .≤ upper_quantile) ≥ 0.9 "Surprisingly few observations in predicted 95% interval: $(mean(lower_quantile .≤ yg .≤ upper_quantile))" -end -``` - -The model fits! What about the infered cyclic components? - -```{julia} -βc = Array(group(chain, :βc)) -plot_cyclic_features(βc[:, begin:num_freqs, :], βc[:, (num_freqs + 1):end, :]) -``` - -While multiplicative model fits to the data, it does not recover the true parameters for this dataset. - -# Wrapping up - -In this tutorial we have seen how to implement and fit time series models using additive and multiplicative decomposition. -We also saw how to visualise the model fit, and how to interpret learned cyclical components. +--- +title: Bayesian Time Series Analysis +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +In time series analysis we are often interested in understanding how various real-life circumstances impact our quantity of interest. +These can be, for instance, season, day of week, or time of day. +To analyse this it is useful to decompose time series into simpler components (corresponding to relevant circumstances) +and infer their relevance. +In this tutorial we are going to use Turing for time series analysis and learn about useful ways to decompose time series. + +# Modelling time series + +Before we start coding, let us talk about what exactly we mean with time series decomposition. +In a nutshell, it is a divide-and-conquer approach where we express a time series as a sum or a product of simpler series. +For instance, the time series $f(t)$ can be decomposed into a sum of $n$ components + +$$f(t) = \sum_{i=1}^n f_i(t),$$ + +or we can decompose $g(t)$ into a product of $m$ components + +$$g(t) = \prod_{i=1}^m g_i(t).$$ + +We refer to this as *additive* or *multiplicative* decomposition respectively. 
+This type of decomposition is great since it lets us reason about individual components, which makes encoding prior information and interpreting model predictions very easy. +Two common components are *trends*, which represent the overall change of the time series (often assumed to be linear), +and *cyclic effects* which contribute oscillating effects around the trend. +Let us simulate some data with an additive linear trend and oscillating effects. + +```{julia} +using Turing +using FillArrays +using StatsPlots + +using LinearAlgebra +using Random +using Statistics + +Random.seed!(12345) + +true_sin_freq = 2 +true_sin_amp = 5 +true_cos_freq = 7 +true_cos_amp = 2.5 +tmax = 10 +β_true = 2 +α_true = -1 +tt = 0:0.05:tmax +f₁(t) = α_true + β_true * t +f₂(t) = true_sin_amp * sinpi(2 * t * true_sin_freq / tmax) +f₃(t) = true_cos_amp * cospi(2 * t * true_cos_freq / tmax) +f(t) = f₁(t) + f₂(t) + f₃(t) + +plot(f, tt; label="f(t)", title="Observed time series", legend=:topleft, linewidth=3) +plot!( + [f₁, f₂, f₃], + tt; + label=["f₁(t)" "f₂(t)" "f₃(t)"], + style=[:dot :dash :dashdot], + linewidth=1, +) +``` + +Even though we use simple components, combining them can give rise to fairly complex time series. +In this time series, cyclic effects are just added on top of the trend. +If we instead multiply the components the cyclic effects cause the series to oscillate +between larger and larger values, since they get scaled by the trend. + +```{julia} +g(t) = f₁(t) * f₂(t) * f₃(t) + +plot(g, tt; label="f(t)", title="Observed time series", legend=:topleft, linewidth=3) +plot!([f₁, f₂, f₃], tt; label=["f₁(t)" "f₂(t)" "f₃(t)"], linewidth=1) +``` + +Unlike $f$, $g$ oscillates around $0$ since it is being multiplied with sines and cosines. +To let a multiplicative decomposition oscillate around the trend we could define it as +$\tilde{g}(t) = f₁(t) * (1 + f₂(t)) * (1 + f₃(t)),$ +but for convenience we will leave it as is. +The inference machinery is the same for both cases. 
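+
+As a quick sketch (not used in the rest of the tutorial), the oscillate-around-the-trend variant $\tilde{g}$ can be defined directly from the components above and compared against the trend:
+
+```{julia}
+# Hypothetical variant: shifting each cyclic component by 1 makes the
+# product oscillate around the trend f₁ instead of around 0.
+g̃(t) = f₁(t) * (1 + f₂(t)) * (1 + f₃(t))
+
+plot(g̃, tt; label="g̃(t)", title="Multiplicative series around the trend", legend=:topleft, linewidth=3)
+plot!(f₁, tt; label="f₁(t)", linewidth=1)
+```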
+
+# Model fitting
+
+Having discussed time series decomposition, let us fit a model to the time series above and recover the true parameters.
+Before building our model, we standardise the time axis to $[0, 1]$ and subtract the max of the time series.
+This helps convergence while maintaining interpretability and the correct scales for the cyclic components.
+
+```{julia}
+σ_true = 0.35
+t = collect(tt[begin:3:end])
+t_min, t_max = extrema(t)
+x = (t .- t_min) ./ (t_max - t_min)
+yf = f.(t) .+ σ_true .* randn(size(t))
+yf_max = maximum(yf)
+yf = yf .- yf_max
+
+scatter(x, yf; title="Standardised data", legend=false)
+```
+
+Let us now build our model.
+We want to assume a linear trend and cyclic effects.
+Encoding a linear trend is easy enough, but what about cyclical effects?
+We will take a scattergun approach: create multiple cyclical features using both sine and cosine functions, and let our inference machinery figure out which to keep.
+To do this, we define how long one period should be, and create features relative to that period.
+How long a period should be is problem dependent, but as an example let us say it is $1$ year.
+If we then find evidence for a cyclic effect with a frequency of 2, that would mean a twice-yearly effect; a frequency of 4 would mean a quarterly effect, etc.
+Since we are using synthetic data, we simply let the period be 1, which is the entire length of the time series.
+
+```{julia}
+freqs = 1:10
+num_freqs = length(freqs)
+period = 1
+cyclic_features = [sinpi.(2 .* freqs' .* x ./ period) cospi.(2 .* freqs' .* x ./ period)]
+
+plot_freqs = [1, 3, 5]
+freq_ptl = plot(
+    cyclic_features[:, plot_freqs];
+    label=permutedims(["sin(2π$(f)x)" for f in plot_freqs]),
+    title="Cyclical features subset",
+)
+```
+
+Having constructed the cyclical features, we can finally build our model. The model we will implement is
+
+$$
+f(t) = \alpha + \beta_t t + \sum_{i=1}^F \beta_{\sin{},i} \sin{}(2\pi f_i t) + \sum_{i=1}^F \beta_{\cos{},i} \cos{}(2\pi f_i t),
+$$
+
+with a Gaussian likelihood $y \sim \mathcal{N}(f(t), \sigma^2)$.
+For convenience we treat the cyclical feature weights $\beta_{\sin{},i}$ and $\beta_{\cos{},i}$ the same in code and collect them in $\beta_c$.
+And because it is so easy, we also parameterise our model by the operation used to combine the trend and the cyclic effects.
+This lets us use the exact same code for both additive and multiplicative models.
+Finally, we plot prior predictive samples to make sure our priors make sense.
+
+```{julia}
+@model function decomp_model(t, c, op)
+    α ~ Normal(0, 10)
+    βt ~ Normal(0, 2)
+    βc ~ MvNormal(Zeros(size(c, 2)), I)
+    σ ~ truncated(Normal(0, 0.1); lower=0)
+
+    cyclic = c * βc
+    trend = α .+ βt .* t
+    μ = op(trend, cyclic)
+    y ~ MvNormal(μ, σ^2 * I)
+    return (; trend, cyclic)
+end
+
+y_prior_samples = mapreduce(hcat, 1:100) do _
+    rand(decomp_model(t, cyclic_features, +)).y
+end
+plot(t, y_prior_samples; linewidth=1, alpha=0.5, color=1, label="", title="Prior samples")
+scatter!(t, yf; color=2, label="Data")
+```
+
+With the model specified and a reasonable prior, we can now let Turing decompose the time series for us!
+ +```{julia} +function mean_ribbon(samples) + qs = quantile(samples) + low = qs[:, Symbol("2.5%")] + up = qs[:, Symbol("97.5%")] + m = mean(samples)[:, :mean] + return m, (m - low, up - m) +end + +function get_decomposition(model, x, cyclic_features, chain, op) + chain_params = Turing.MCMCChains.get_sections(chain, :parameters) + return generated_quantities(model(x, cyclic_features, op), chain_params) +end + +function plot_fit(x, y, decomp, ymax) + trend = mapreduce(x -> x.trend, hcat, decomp) + cyclic = mapreduce(x -> x.cyclic, hcat, decomp) + + trend_plt = plot( + x, + trend .+ ymax; + color=1, + label=nothing, + alpha=0.2, + title="Trend", + xlabel="Time", + ylabel="f₁(t)", + ) + ls = [ones(length(t)) t] \ y + α̂, β̂ = ls[1], ls[2:end] + plot!( + trend_plt, + t, + α̂ .+ t .* β̂ .+ ymax; + label="Least squares trend", + color=5, + linewidth=4, + ) + + scatter!(trend_plt, x, y .+ ymax; label=nothing, color=2, legend=:topleft) + cyclic_plt = plot( + x, + cyclic; + color=1, + label=nothing, + alpha=0.2, + title="Cyclic effect", + xlabel="Time", + ylabel="f₂(t)", + ) + return trend_plt, cyclic_plt +end + +chain = sample(decomp_model(x, cyclic_features, +) | (; y=yf), NUTS(), 2000, progress=false) +yf_samples = predict(decomp_model(x, cyclic_features, +), chain) +m, conf = mean_ribbon(yf_samples) +predictive_plt = plot( + t, + m .+ yf_max; + ribbon=conf, + label="Posterior density", + title="Posterior decomposition", + xlabel="Time", + ylabel="f(t)", +) +scatter!(predictive_plt, t, yf .+ yf_max; color=2, label="Data", legend=:topleft) + +decomp = get_decomposition(decomp_model, x, cyclic_features, chain, +) +decomposed_plt = plot_fit(t, yf, decomp, yf_max) +plot(predictive_plt, decomposed_plt...; layout=(3, 1), size=(700, 1000)) +``` + +```{julia} +#| echo: false +let + @assert mean(ess(chain)[:, :ess]) > 500 "Mean ESS: $(mean(ess(chain)[:, :ess])) - not > 500" + lower_quantile = m .- conf[1] # 2.5% quantile + upper_quantile = m .+ conf[2] # 97.5% quantile + @assert 
mean(lower_quantile .≤ yf .≤ upper_quantile) ≥ 0.9 "Surprisingly few observations in predicted 95% interval: $(mean(lower_quantile .≤ yf .≤ upper_quantile))" +end +``` + +Inference is successful and the posterior beautifully captures the data. +We see that the least squares linear fit deviates somewhat from the posterior trend. +Since our model takes cyclic effects into account separately, +we get a better estimate of the true overall trend than if we would have just fitted a line. +But what frequency content did the model identify? + + +```{julia} +function plot_cyclic_features(βsin, βcos) + labels = reshape(["freq = $i" for i in freqs], 1, :) + colors = collect(freqs)' + style = reshape([i <= 10 ? :solid : :dash for i in 1:length(labels)], 1, :) + sin_features_plt = density( + βsin[:, :, 1]; + title="Sine features posterior", + label=labels, + ylabel="Density", + xlabel="Weight", + color=colors, + linestyle=style, + legend=nothing, + ) + cos_features_plt = density( + βcos[:, :, 1]; + title="Cosine features posterior", + ylabel="Density", + xlabel="Weight", + label=nothing, + color=colors, + linestyle=style, + ) + + return seasonal_features_plt = plot( + sin_features_plt, + cos_features_plt; + layout=(2, 1), + size=(800, 600), + legend=:outerright, + ) +end + +βc = Array(group(chain, :βc)) +plot_cyclic_features(βc[:, begin:num_freqs, :], βc[:, (num_freqs + 1):end, :]) +``` + +Plotting the posterior over the cyclic features reveals that the model managed to extract the true frequency content. + +Since we wrote our model to accept a combining operator, we can easily run the same analysis for a multiplicative model. 
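+
+Operator arguments like this work because, in Julia, `+` is an ordinary function value, and since Julia 1.6 dotted operators such as `.*` are first-class values too (they become a `Base.BroadcastFunction`). A tiny self-contained illustration (the `combine` helper is hypothetical, not part of the tutorial code):
+
+```{julia}
+# Stand-in for the `op` argument of `decomp_model`: the same code runs
+# additively or multiplicatively depending on which operator is passed in.
+combine(trend, cyclic, op) = op(trend, cyclic)
+
+combine([1, 2, 3], [10, 20, 30], +)   # [11, 22, 33] (additive)
+combine([1, 2, 3], [10, 20, 30], .*)  # [10, 40, 90] (multiplicative)
+```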
+
+```{julia}
+yg = g.(t) .+ σ_true .* randn(size(t))
+
+y_prior_samples = mapreduce(hcat, 1:100) do _
+    rand(decomp_model(t, cyclic_features, .*)).y
+end
+plot(t, y_prior_samples; linewidth=1, alpha=0.5, color=1, label="", title="Prior samples")
+scatter!(t, yg; color=2, label="Data")
+```
+
+```{julia}
+chain = sample(decomp_model(x, cyclic_features, .*) | (; y=yg), NUTS(), 2000, progress=false)
+yg_samples = predict(decomp_model(x, cyclic_features, .*), chain)
+m, conf = mean_ribbon(yg_samples)
+predictive_plt = plot(
+    t,
+    m;
+    ribbon=conf,
+    label="Posterior density",
+    title="Posterior decomposition",
+    xlabel="Time",
+    ylabel="g(t)",
+)
+scatter!(predictive_plt, t, yg; color=2, label="Data", legend=:topleft)
+
+decomp = get_decomposition(decomp_model, x, cyclic_features, chain, .*)
+decomposed_plt = plot_fit(t, yg, decomp, 0)
+plot(predictive_plt, decomposed_plt...; layout=(3, 1), size=(700, 1000))
+```
+
+```{julia}
+#| echo: false
+let
+    @assert mean(ess(chain)[:, :ess]) > 500 "Mean ESS: $(mean(ess(chain)[:, :ess])) - not > 500"
+    lower_quantile = m .- conf[1] # 2.5% quantile
+    upper_quantile = m .+ conf[2] # 97.5% quantile
+    @assert mean(lower_quantile .≤ yg .≤ upper_quantile) ≥ 0.9 "Surprisingly few observations in predicted 95% interval: $(mean(lower_quantile .≤ yg .≤ upper_quantile))"
+end
+```
+
+The model fits! What about the inferred cyclic components?
+
+```{julia}
+βc = Array(group(chain, :βc))
+plot_cyclic_features(βc[:, begin:num_freqs, :], βc[:, (num_freqs + 1):end, :])
+```
+
+While the multiplicative model fits the data, it does not recover the true parameters for this dataset.
+
+# Wrapping up
+
+In this tutorial we have seen how to implement and fit time series models using additive and multiplicative decomposition.
+We also saw how to visualise the model fit, and how to interpret learned cyclical components.
diff --git a/tutorials/15-gaussian-processes/index.qmd b/tutorials/15-gaussian-processes/index.qmd index 6b5bb2062..3a3c86e8d 100755 --- a/tutorials/15-gaussian-processes/index.qmd +++ b/tutorials/15-gaussian-processes/index.qmd @@ -1,169 +1,169 @@ ---- -title: Gaussian Processes -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -[JuliaGPs](https://github.com/JuliaGaussianProcesses/#welcome-to-juliagps) packages integrate well with Turing.jl because they implement the Distributions.jl -interface. -You should be able to understand what is going on in this tutorial if you know what a GP is. -For a more in-depth understanding of the -[JuliaGPs](https://github.com/JuliaGaussianProcesses/#welcome-to-juliagps) functionality -used here, please consult the -[JuliaGPs](https://github.com/JuliaGaussianProcesses/#welcome-to-juliagps) docs. - -In this tutorial, we will model the putting dataset discussed in Chapter 21 of -[Bayesian Data Analysis](http://www.stat.columbia.edu/%7Egelman/book/). -The dataset comprises the result of measuring how often a golfer successfully gets the ball -in the hole, depending on how far away from it they are. -The goal of inference is to estimate the probability of any given shot being successful at a -given distance. - -### Let's download the data and take a look at it: - -```{julia} -using CSV, DataDeps, DataFrames - -ENV["DATADEPS_ALWAYS_ACCEPT"] = true -register( - DataDep( - "putting", - "Putting data from BDA", - "http://www.stat.columbia.edu/~gelman/book/data/golf.dat", - "fc28d83896af7094d765789714524d5a389532279b64902866574079c1a977cc", - ), -) - -fname = joinpath(datadep"putting", "golf.dat") -df = CSV.read(fname, DataFrame; delim=' ', ignorerepeated=true) -df[1:5, :] -``` - -We've printed the first 5 rows of the dataset (which comprises only 19 rows in total). -Observe it has three columns: - - 1. `distance` -- how far away from the hole. 
I'll refer to `distance` as `d` throughout the rest of this tutorial - 2. `n` -- how many shots were taken from a given distance - 3. `y` -- how many shots were successful from a given distance - -We will use a Binomial model for the data, whose success probability is parametrised by a -transformation of a GP. Something along the lines of: -$$ -\begin{aligned} -f & \sim \operatorname{GP}(0, k) \\ -y_j \mid f(d_j) & \sim \operatorname{Binomial}(n_j, g(f(d_j))) \\ -g(x) & := \frac{1}{1 + e^{-x}} -\end{aligned} -$$ - -To do this, let's define our Turing.jl model: - -```{julia} -using AbstractGPs, LogExpFunctions, Turing - -@model function putting_model(d, n; jitter=1e-4) - v ~ Gamma(2, 1) - l ~ Gamma(4, 1) - f = GP(v * with_lengthscale(SEKernel(), l)) - f_latent ~ f(d, jitter) - y ~ product_distribution(Binomial.(n, logistic.(f_latent))) - return (fx=f(d, jitter), f_latent=f_latent, y=y) -end -``` - -We first define an `AbstractGPs.GP`, which represents a distribution over functions, and -is entirely separate from Turing.jl. -We place a prior over its variance `v` and length-scale `l`. -`f(d, jitter)` constructs the multivariate Gaussian comprising the random variables -in `f` whose indices are in `d` (plus a bit of independent Gaussian noise with variance -`jitter` -- see [the docs](https://juliagaussianprocesses.github.io/AbstractGPs.jl/dev/api/#FiniteGP-and-AbstractGP) -for more details). -`f(d, jitter)` has the type `AbstractMvNormal`, and is the bit of AbstractGPs.jl that implements the -Distributions.jl interface, so it's legal to put it on the right-hand side -of a `~`. -From this you should deduce that `f_latent` is distributed according to a multivariate -Gaussian. -The remaining lines comprise standard Turing.jl code that is encountered in other tutorials -and Turing documentation. - -Before performing inference, we might want to inspect the prior that our model places over -the data, to see whether there is anything obviously wrong. 
-These kinds of prior predictive checks are straightforward to perform using Turing.jl, since -it is possible to sample from the prior easily by just calling the model: - -```{julia} -m = putting_model(Float64.(df.distance), df.n) -m().y -``` - -We make use of this to see what kinds of datasets we simulate from the prior: - -```{julia} -using Plots - -function plot_data(d, n, y, xticks, yticks) - ylims = (0, round(maximum(n), RoundUp; sigdigits=2)) - margin = -0.5 * Plots.mm - plt = plot(; xticks=xticks, yticks=yticks, ylims=ylims, margin=margin, grid=false) - bar!(plt, d, n; color=:red, label="", alpha=0.5) - bar!(plt, d, y; label="", color=:blue, alpha=0.7) - return plt -end - -# Construct model and run some prior predictive checks. -m = putting_model(Float64.(df.distance), df.n) -hists = map(1:20) do j - xticks = j > 15 ? :auto : nothing - yticks = rem(j, 5) == 1 ? :auto : nothing - return plot_data(df.distance, df.n, m().y, xticks, yticks) -end -plot(hists...; layout=(4, 5)) -``` - -In this case, the only prior knowledge I have is that the proportion of successful shots -ought to decrease monotonically as the distance from the hole increases, which should show -up in the data as the blue lines generally go down as we move from left to right on each -graph. -Unfortunately, there is not a simple way to enforce monotonicity in the samples from a GP, -and we can see this in some of the plots above, so we must hope that we have enough data to -ensure that this relationship holds approximately under the posterior. -In any case, you can judge for yourself whether you think this is the most useful -visualisation that we can perform -- if you think there is something better to look at, -please let us know! - -Moving on, we generate samples from the posterior using the default `NUTS` sampler. 
-We'll make use of [ReverseDiff.jl](https://github.com/JuliaDiff/ReverseDiff.jl), as it has -better performance than [ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl/) on -this example. See Turing.jl's docs on Automatic Differentiation for more info. - - -```{julia} -using Random, ReverseDiff - -m_post = m | (y=df.y,) -chn = sample(Xoshiro(123456), m_post, NUTS(; adtype=AutoReverseDiff()), 1_000, progress=false) -``` - -We can use these samples and the `posterior` function from `AbstractGPs` to sample from the -posterior probability of success at any distance we choose: - -```{julia} -d_pred = 1:0.2:21 -samples = map(generated_quantities(m_post, chn)[1:10:end]) do x - return logistic.(rand(posterior(x.fx, x.f_latent)(d_pred, 1e-4))) -end -p = plot() -plot!(d_pred, reduce(hcat, samples); label="", color=:blue, alpha=0.2) -scatter!(df.distance, df.y ./ df.n; label="", color=:red) -``` - -We can see that the general trend is indeed down as the distance from the hole increases, -and that if we move away from the data, the posterior uncertainty quickly inflates. -This suggests that the model is probably going to do a reasonable job of interpolating -between observed data, but less good a job at extrapolating to larger distances. +--- +title: Gaussian Processes +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +[JuliaGPs](https://github.com/JuliaGaussianProcesses/#welcome-to-juliagps) packages integrate well with Turing.jl because they implement the Distributions.jl +interface. +You should be able to understand what is going on in this tutorial if you know what a GP is. +For a more in-depth understanding of the +[JuliaGPs](https://github.com/JuliaGaussianProcesses/#welcome-to-juliagps) functionality +used here, please consult the +[JuliaGPs](https://github.com/JuliaGaussianProcesses/#welcome-to-juliagps) docs. 
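+
+To make the "implements the Distributions.jl interface" point concrete before we start, here is a minimal sketch (illustrative only, and not part of the putting analysis): a GP restricted to a finite set of inputs behaves like a multivariate normal, so the generic `rand` and `logpdf` just work on it.
+
+```{julia}
+#| eval: false
+using AbstractGPs
+
+f = GP(SEKernel())  # a distribution over functions
+x = 0.0:0.5:2.0
+fx = f(x, 1e-6)     # f at finitely many inputs, plus jitter: an AbstractMvNormal
+y = rand(fx)        # sample, as with any Distributions.jl distribution
+logpdf(fx, y)       # log density of that sample
+```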
+ +In this tutorial, we will model the putting dataset discussed in Chapter 21 of +[Bayesian Data Analysis](http://www.stat.columbia.edu/%7Egelman/book/). +The dataset comprises the result of measuring how often a golfer successfully gets the ball +in the hole, depending on how far away from it they are. +The goal of inference is to estimate the probability of any given shot being successful at a +given distance. + +### Let's download the data and take a look at it: + +```{julia} +using CSV, DataDeps, DataFrames + +ENV["DATADEPS_ALWAYS_ACCEPT"] = true +register( + DataDep( + "putting", + "Putting data from BDA", + "http://www.stat.columbia.edu/~gelman/book/data/golf.dat", + "fc28d83896af7094d765789714524d5a389532279b64902866574079c1a977cc", + ), +) + +fname = joinpath(datadep"putting", "golf.dat") +df = CSV.read(fname, DataFrame; delim=' ', ignorerepeated=true) +df[1:5, :] +``` + +We've printed the first 5 rows of the dataset (which comprises only 19 rows in total). +Observe it has three columns: + + 1. `distance` -- how far away from the hole. I'll refer to `distance` as `d` throughout the rest of this tutorial + 2. `n` -- how many shots were taken from a given distance + 3. `y` -- how many shots were successful from a given distance + +We will use a Binomial model for the data, whose success probability is parametrised by a +transformation of a GP. 
Something along the lines of: +$$ +\begin{aligned} +f & \sim \operatorname{GP}(0, k) \\ +y_j \mid f(d_j) & \sim \operatorname{Binomial}(n_j, g(f(d_j))) \\ +g(x) & := \frac{1}{1 + e^{-x}} +\end{aligned} +$$ + +To do this, let's define our Turing.jl model: + +```{julia} +using AbstractGPs, LogExpFunctions, Turing + +@model function putting_model(d, n; jitter=1e-4) + v ~ Gamma(2, 1) + l ~ Gamma(4, 1) + f = GP(v * with_lengthscale(SEKernel(), l)) + f_latent ~ f(d, jitter) + y ~ product_distribution(Binomial.(n, logistic.(f_latent))) + return (fx=f(d, jitter), f_latent=f_latent, y=y) +end +``` + +We first define an `AbstractGPs.GP`, which represents a distribution over functions, and +is entirely separate from Turing.jl. +We place a prior over its variance `v` and length-scale `l`. +`f(d, jitter)` constructs the multivariate Gaussian comprising the random variables +in `f` whose indices are in `d` (plus a bit of independent Gaussian noise with variance +`jitter` -- see [the docs](https://juliagaussianprocesses.github.io/AbstractGPs.jl/dev/api/#FiniteGP-and-AbstractGP) +for more details). +`f(d, jitter)` has the type `AbstractMvNormal`, and is the bit of AbstractGPs.jl that implements the +Distributions.jl interface, so it's legal to put it on the right-hand side +of a `~`. +From this you should deduce that `f_latent` is distributed according to a multivariate +Gaussian. +The remaining lines comprise standard Turing.jl code that is encountered in other tutorials +and Turing documentation. + +Before performing inference, we might want to inspect the prior that our model places over +the data, to see whether there is anything obviously wrong. 
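Before doing so, a quick sanity check (not part of the model itself): the link function $g$ defined above is exactly the `logistic` function exported by LogExpFunctions, and it maps any real-valued GP output into a valid success probability in $(0, 1)$:

```{julia}
using LogExpFunctions

# g(x) = 1 / (1 + exp(-x)) is `logistic`; it is monotone and bounded in (0, 1).
@assert logistic(0.0) == 0.5
@assert logistic(-10.0) < logistic(10.0)
@assert all(0 .< logistic.([-5.0, 0.0, 5.0]) .< 1)
```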
+These kinds of prior predictive checks are straightforward to perform using Turing.jl, since +it is possible to sample from the prior easily by just calling the model: + +```{julia} +m = putting_model(Float64.(df.distance), df.n) +m().y +``` + +We make use of this to see what kinds of datasets we simulate from the prior: + +```{julia} +using Plots + +function plot_data(d, n, y, xticks, yticks) + ylims = (0, round(maximum(n), RoundUp; sigdigits=2)) + margin = -0.5 * Plots.mm + plt = plot(; xticks=xticks, yticks=yticks, ylims=ylims, margin=margin, grid=false) + bar!(plt, d, n; color=:red, label="", alpha=0.5) + bar!(plt, d, y; label="", color=:blue, alpha=0.7) + return plt +end + +# Construct model and run some prior predictive checks. +m = putting_model(Float64.(df.distance), df.n) +hists = map(1:20) do j + xticks = j > 15 ? :auto : nothing + yticks = rem(j, 5) == 1 ? :auto : nothing + return plot_data(df.distance, df.n, m().y, xticks, yticks) +end +plot(hists...; layout=(4, 5)) +``` + +In this case, the only prior knowledge I have is that the proportion of successful shots +ought to decrease monotonically as the distance from the hole increases, which should show +up in the data as the blue lines generally go down as we move from left to right on each +graph. +Unfortunately, there is not a simple way to enforce monotonicity in the samples from a GP, +and we can see this in some of the plots above, so we must hope that we have enough data to +ensure that this relationship holds approximately under the posterior. +In any case, you can judge for yourself whether you think this is the most useful +visualisation that we can perform -- if you think there is something better to look at, +please let us know! + +Moving on, we generate samples from the posterior using the default `NUTS` sampler. 
+We'll make use of [ReverseDiff.jl](https://github.com/JuliaDiff/ReverseDiff.jl), as it has +better performance than [ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl/) on +this example. See Turing.jl's docs on Automatic Differentiation for more info. + + +```{julia} +using Random, ReverseDiff + +m_post = m | (y=df.y,) +chn = sample(Xoshiro(123456), m_post, NUTS(; adtype=AutoReverseDiff()), 1_000, progress=false) +``` + +We can use these samples and the `posterior` function from `AbstractGPs` to sample from the +posterior probability of success at any distance we choose: + +```{julia} +d_pred = 1:0.2:21 +samples = map(generated_quantities(m_post, chn)[1:10:end]) do x + return logistic.(rand(posterior(x.fx, x.f_latent)(d_pred, 1e-4))) +end +p = plot() +plot!(d_pred, reduce(hcat, samples); label="", color=:blue, alpha=0.2) +scatter!(df.distance, df.y ./ df.n; label="", color=:red) +``` + +We can see that the general trend is indeed down as the distance from the hole increases, +and that if we move away from the data, the posterior uncertainty quickly inflates. +This suggests that the model is probably going to do a reasonable job of interpolating +between observed data, but less good a job at extrapolating to larger distances. 
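If desired, one could also overlay the pointwise posterior mean of the sampled curves on the plot above. This is a small sketch, assuming the `samples`, `d_pred`, and `p` variables defined above are still in scope:

```{julia}
using Statistics

# Pointwise average of the sampled success-probability curves.
mean_curve = vec(mean(reduce(hcat, samples); dims=2))
plot!(p, d_pred, mean_curve; color=:blue, linewidth=2, label="posterior mean")
```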
diff --git a/tutorials/_metadata.yml b/tutorials/_metadata.yml index bfb360691..3956cf9c8 100755 --- a/tutorials/_metadata.yml +++ b/tutorials/_metadata.yml @@ -1,18 +1,18 @@ -format: - html: - toc-title: "Table of Contents" - code-fold: false - code-overflow: scroll -execute: - echo: true - output: true -include-in-header: - - text: | - +format: + html: + toc-title: "Table of Contents" + code-fold: false + code-overflow: scroll +execute: + echo: true + output: true +include-in-header: + - text: | + diff --git a/tutorials/docs-09-using-turing-advanced/index.qmd b/tutorials/docs-09-using-turing-advanced/index.qmd index 7cd76bdb0..c526e87e4 100755 --- a/tutorials/docs-09-using-turing-advanced/index.qmd +++ b/tutorials/docs-09-using-turing-advanced/index.qmd @@ -1,11 +1,11 @@ ---- -title: Advanced Usage -engine: julia ---- - -This page has been separated into new sections. Please update any bookmarks you might have: - - - [Custom Distributions]({{< meta usage-custom-distribution >}}) - - [Modifying the Log Probability]({{< meta usage-modifying-logprob >}}) - - [Defining a Model without `@model`]({{< meta dev-model-manual >}}) - - [Reparametrization and Generated Quantities]({{< meta usage-generated-quantities >}}) +--- +title: Advanced Usage +engine: julia +--- + +This page has been separated into new sections. 
Please update any bookmarks you might have: + + - [Custom Distributions]({{< meta usage-custom-distribution >}}) + - [Modifying the Log Probability]({{< meta usage-modifying-logprob >}}) + - [Defining a Model without `@model`]({{< meta dev-model-manual >}}) + - [Reparametrization and Generated Quantities]({{< meta usage-generated-quantities >}}) diff --git a/tutorials/docs-10-using-turing-autodiff/index.qmd b/tutorials/docs-10-using-turing-autodiff/index.qmd index 48541ab5a..b46bb2beb 100755 --- a/tutorials/docs-10-using-turing-autodiff/index.qmd +++ b/tutorials/docs-10-using-turing-autodiff/index.qmd @@ -1,76 +1,76 @@ ---- -title: Automatic Differentiation -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -## Switching AD Modes - -Turing currently supports four automatic differentiation (AD) backends for sampling: [ForwardDiff](https://github.com/JuliaDiff/ForwardDiff.jl) for forward-mode AD; and [Mooncake](https://github.com/compintell/Mooncake.jl), [ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl), and [Zygote](https://github.com/FluxML/Zygote.jl) for reverse-mode AD. -`ForwardDiff` is automatically imported by Turing. To utilize `Mooncake`, `Zygote`, or `ReverseDiff` for AD, users must explicitly import them with `import Mooncake`, `import Zygote` or `import ReverseDiff`, alongside `using Turing`. - -As of Turing version v0.30, the global configuration flag for the AD backend has been removed in favour of [`AdTypes.jl`](https://github.com/SciML/ADTypes.jl), allowing users to specify the AD backend for individual samplers independently. -Users can pass the `adtype` keyword argument to the sampler constructor to select the desired AD backend, with the default being `AutoForwardDiff(; chunksize=0)`. - -For `ForwardDiff`, pass `adtype=AutoForwardDiff(; chunksize)` to the sampler constructor. A `chunksize` of `nothing` permits the chunk size to be automatically determined. 
For more information regarding the selection of `chunksize`, please refer to [related section of `ForwardDiff`'s documentation](https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Configuring-Chunk-Size). - -For `ReverseDiff`, pass `adtype=AutoReverseDiff()` to the sampler constructor. An additional keyword argument called `compile` can be provided to `AutoReverseDiff`. It specifies whether to pre-record the tape only once and reuse it later (`compile` is set to `false` by default, which means no pre-recording). This can substantially improve performance, but risks silently incorrect results if not used with care. - - - -Pre-recorded tapes should only be used if you are absolutely certain that the sequence of operations performed in your code does not change between different executions of your model. - -Thus, e.g., in the model definition and all implicitly and explicitly called functions in the model, all loops should be of fixed size, and `if`-statements should consistently execute the same branches. -For instance, `if`-statements with conditions that can be determined at compile time or conditions that depend only on fixed properties of the model, e.g. fixed data. -However, `if`-statements that depend on the model parameters can take different branches during sampling; hence, the compiled tape might be incorrect. -Thus you must not use compiled tapes when your model makes decisions based on the model parameters, and you should be careful if you compute functions of parameters that those functions do not have branching which might cause them to execute different code for different values of the parameter. - -For `Zygote`, pass `adtype=AutoZygote()` to the sampler constructor. - -And the previously used interface functions including `ADBackend`, `setadbackend`, `setsafe`, `setchunksize`, and `setrdcache` are deprecated and removed. 
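To make the caveat above concrete, here is a sketch (a hypothetical model, not from the Turing test suite) of the kind of parameter-dependent branching that makes compiled tapes unsafe:

```{julia}
using Turing, ReverseDiff

# The branch taken depends on the value of the parameter `m`, so a tape
# recorded for one value of `m` can silently replay the wrong branch
# when `m` later crosses zero during sampling.
@model function branchy(x)
    m ~ Normal(0, 1)
    if m > 0
        x ~ Normal(m, 1)
    else
        x ~ Normal(-m, 2)
    end
end

# Unsafe for this model: NUTS(; adtype=AutoReverseDiff(; compile=true))
# Safe for this model:   NUTS(; adtype=AutoReverseDiff())
```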
- -## Compositional Sampling with Differing AD Modes - -Turing supports intermixed automatic differentiation methods for different variable spaces. The snippet below shows using `ForwardDiff` to sample the mean (`m`) parameter, and using `ReverseDiff` for the variance (`s`) parameter: - -```{julia} -using Turing -using ReverseDiff - -# Define a simple Normal model with unknown mean and variance. -@model function gdemo(x, y) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - x ~ Normal(m, sqrt(s²)) - return y ~ Normal(m, sqrt(s²)) -end - -# Sample using Gibbs and varying autodiff backends. -c = sample( - gdemo(1.5, 2), - Gibbs( - HMC(0.1, 5, :m; adtype=AutoForwardDiff(; chunksize=0)), - HMC(0.1, 5, :s²; adtype=AutoReverseDiff(false)), - ), - 1000, - progress=false, -) -``` - -Generally, reverse-mode AD, for instance `ReverseDiff`, is faster when sampling from variables of high dimensionality (greater than 20), while forward-mode AD, for instance `ForwardDiff`, is more efficient for lower-dimension variables. This functionality allows those who are performance sensitive to fine tune their automatic differentiation for their specific models. - -If the differentiation method is not specified in this way, Turing will default to using whatever the global AD backend is. -Currently, this defaults to `ForwardDiff`. 
- -The most reliable way to ensure you are using the fastest AD that works for your problem is to benchmark them using [`TuringBenchmarking`](https://github.com/TuringLang/TuringBenchmarking.jl): - -```{julia} -using TuringBenchmarking -benchmark_model(gdemo(1.5, 2), adbackends=[AutoForwardDiff(), AutoReverseDiff()]) -``` +--- +title: Automatic Differentiation +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +## Switching AD Modes + +Turing currently supports four automatic differentiation (AD) backends for sampling: [ForwardDiff](https://github.com/JuliaDiff/ForwardDiff.jl) for forward-mode AD; and [Mooncake](https://github.com/compintell/Mooncake.jl), [ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl), and [Zygote](https://github.com/FluxML/Zygote.jl) for reverse-mode AD. +`ForwardDiff` is automatically imported by Turing. To utilize `Mooncake`, `Zygote`, or `ReverseDiff` for AD, users must explicitly import them with `import Mooncake`, `import Zygote` or `import ReverseDiff`, alongside `using Turing`. + +As of Turing version v0.30, the global configuration flag for the AD backend has been removed in favour of [`AdTypes.jl`](https://github.com/SciML/ADTypes.jl), allowing users to specify the AD backend for individual samplers independently. +Users can pass the `adtype` keyword argument to the sampler constructor to select the desired AD backend, with the default being `AutoForwardDiff(; chunksize=0)`. + +For `ForwardDiff`, pass `adtype=AutoForwardDiff(; chunksize)` to the sampler constructor. A `chunksize` of `nothing` permits the chunk size to be automatically determined. For more information regarding the selection of `chunksize`, please refer to [related section of `ForwardDiff`'s documentation](https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Configuring-Chunk-Size). + +For `ReverseDiff`, pass `adtype=AutoReverseDiff()` to the sampler constructor. 
An additional keyword argument called `compile` can be provided to `AutoReverseDiff`. It specifies whether to pre-record the tape only once and reuse it later (`compile` is set to `false` by default, which means no pre-recording). This can substantially improve performance, but risks silently incorrect results if not used with care. + + + +Pre-recorded tapes should only be used if you are absolutely certain that the sequence of operations performed in your code does not change between different executions of your model. + +Thus, e.g., in the model definition and all implicitly and explicitly called functions in the model, all loops should be of fixed size, and `if`-statements should consistently execute the same branches. +For instance, `if`-statements with conditions that can be determined at compile time or conditions that depend only on fixed properties of the model, e.g. fixed data. +However, `if`-statements that depend on the model parameters can take different branches during sampling; hence, the compiled tape might be incorrect. +Thus you must not use compiled tapes when your model makes decisions based on the model parameters, and you should be careful if you compute functions of parameters that those functions do not have branching which might cause them to execute different code for different values of the parameter. + +For `Zygote`, pass `adtype=AutoZygote()` to the sampler constructor. + +And the previously used interface functions including `ADBackend`, `setadbackend`, `setsafe`, `setchunksize`, and `setrdcache` are deprecated and removed. + +## Compositional Sampling with Differing AD Modes + +Turing supports intermixed automatic differentiation methods for different variable spaces. The snippet below shows using `ForwardDiff` to sample the mean (`m`) parameter, and using `ReverseDiff` for the variance (`s`) parameter: + +```{julia} +using Turing +using ReverseDiff + +# Define a simple Normal model with unknown mean and variance. 
+@model function gdemo(x, y) + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + x ~ Normal(m, sqrt(s²)) + return y ~ Normal(m, sqrt(s²)) +end + +# Sample using Gibbs and varying autodiff backends. +c = sample( + gdemo(1.5, 2), + Gibbs( + HMC(0.1, 5, :m; adtype=AutoForwardDiff(; chunksize=0)), + HMC(0.1, 5, :s²; adtype=AutoReverseDiff(false)), + ), + 1000, + progress=false, +) +``` + +Generally, reverse-mode AD, for instance `ReverseDiff`, is faster when sampling from variables of high dimensionality (greater than 20), while forward-mode AD, for instance `ForwardDiff`, is more efficient for lower-dimension variables. This functionality allows those who are performance sensitive to fine tune their automatic differentiation for their specific models. + +If the differentiation method is not specified in this way, Turing will default to using whatever the global AD backend is. +Currently, this defaults to `ForwardDiff`. + +The most reliable way to ensure you are using the fastest AD that works for your problem is to benchmark them using [`TuringBenchmarking`](https://github.com/TuringLang/TuringBenchmarking.jl): + +```{julia} +using TuringBenchmarking +benchmark_model(gdemo(1.5, 2), adbackends=[AutoForwardDiff(), AutoReverseDiff()]) +``` diff --git a/tutorials/docs-11-using-turing-dynamichmc/index.qmd b/tutorials/docs-11-using-turing-dynamichmc/index.qmd index 6eb0e50e1..8802f5c3d 100755 --- a/tutorials/docs-11-using-turing-dynamichmc/index.qmd +++ b/tutorials/docs-11-using-turing-dynamichmc/index.qmd @@ -1,36 +1,36 @@ ---- -title: Using DynamicHMC -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -Turing supports the use of [DynamicHMC](https://github.com/tpapp/DynamicHMC.jl) as a sampler through the `DynamicNUTS` function. - -To use the `DynamicNUTS` function, you must import the `DynamicHMC` package as well as Turing. 
Turing does not formally require `DynamicHMC` but will include additional functionality if both packages are present. - -Here is a brief example: - -### How to apply `DynamicNUTS`: - -```{julia} -# Import Turing and DynamicHMC. -using DynamicHMC, Turing - -# Model definition. -@model function gdemo(x, y) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - x ~ Normal(m, sqrt(s²)) - return y ~ Normal(m, sqrt(s²)) -end - -# Pull 2,000 samples using DynamicNUTS. -dynamic_nuts = externalsampler(DynamicHMC.NUTS()) -chn = sample(gdemo(1.5, 2.0), dynamic_nuts, 2000, progress=false) +--- +title: Using DynamicHMC +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +Turing supports the use of [DynamicHMC](https://github.com/tpapp/DynamicHMC.jl) as a sampler through the `DynamicNUTS` function. + +To use the `DynamicNUTS` function, you must import the `DynamicHMC` package as well as Turing. Turing does not formally require `DynamicHMC` but will include additional functionality if both packages are present. + +Here is a brief example: + +### How to apply `DynamicNUTS`: + +```{julia} +# Import Turing and DynamicHMC. +using DynamicHMC, Turing + +# Model definition. +@model function gdemo(x, y) + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + x ~ Normal(m, sqrt(s²)) + return y ~ Normal(m, sqrt(s²)) +end + +# Pull 2,000 samples using DynamicNUTS. 
+dynamic_nuts = externalsampler(DynamicHMC.NUTS()) +chn = sample(gdemo(1.5, 2.0), dynamic_nuts, 2000, progress=false) ``` \ No newline at end of file diff --git a/tutorials/docs-12-using-turing-guide/index.qmd b/tutorials/docs-12-using-turing-guide/index.qmd index 565440296..2558b6a13 100755 --- a/tutorials/docs-12-using-turing-guide/index.qmd +++ b/tutorials/docs-12-using-turing-guide/index.qmd @@ -1,534 +1,534 @@ ---- -title: "Core Functionality" -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -This article provides an overview of the core functionality in Turing.jl, which are likely to be used across a wide range of models. - -## Basics - -### Introduction - -A probabilistic program is Julia code wrapped in a `@model` macro. It can use arbitrary Julia code, but to ensure correctness of inference it should not have external effects or modify global state. Stack-allocated variables are safe, but mutable heap-allocated objects may lead to subtle bugs when using task copying. By default Libtask deepcopies `Array` and `Dict` objects when copying task to avoid bugs with data stored in mutable structure in Turing models. - -To specify distributions of random variables, Turing programs should use the `~` notation: - -`x ~ distr` where `x` is a symbol and `distr` is a distribution. If `x` is undefined in the model function, inside the probabilistic program, this puts a random variable named `x`, distributed according to `distr`, in the current scope. `distr` can be a value of any type that implements `rand(distr)`, which samples a value from the distribution `distr`. If `x` is defined, this is used for conditioning in a style similar to [Anglican](https://probprog.github.io/anglican/index.html) (another PPL). In this case, `x` is an observed value, assumed to have been drawn from the distribution `distr`. The likelihood is computed using `logpdf(distr,y)`. 
The observe statements should be arranged so that every possible run traverses all of them in exactly the same order. This is equivalent to demanding that they are not placed inside stochastic control flow. - -Available inference methods include Importance Sampling (IS), Sequential Monte Carlo (SMC), Particle Gibbs (PG), Hamiltonian Monte Carlo (HMC), Hamiltonian Monte Carlo with Dual Averaging (HMCDA) and The No-U-Turn Sampler (NUTS). - -### Simple Gaussian Demo - -Below is a simple Gaussian demo illustrate the basic usage of Turing.jl. - - -```{julia} -# Import packages. -using Turing -using StatsPlots - -# Define a simple Normal model with unknown mean and variance. -@model function gdemo(x, y) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - x ~ Normal(m, sqrt(s²)) - return y ~ Normal(m, sqrt(s²)) -end -``` - -Note: As a sanity check, the prior expectation of `s²` is `mean(InverseGamma(2, 3)) = 3/(2 - 1) = 3` and the prior expectation of `m` is 0. This can be easily checked using `Prior`: - -```{julia} -#| output: false -setprogress!(false) -``` - -```{julia} -p1 = sample(gdemo(missing, missing), Prior(), 100000) -``` - -We can perform inference by using the `sample` function, the first argument of which is our probabilistic program and the second of which is a sampler. - -```{julia} -# Run sampler, collect results. -c1 = sample(gdemo(1.5, 2), SMC(), 1000) -c2 = sample(gdemo(1.5, 2), PG(10), 1000) -c3 = sample(gdemo(1.5, 2), HMC(0.1, 5), 1000) -c4 = sample(gdemo(1.5, 2), Gibbs(PG(10, :m), HMC(0.1, 5, :s²)), 1000) -c5 = sample(gdemo(1.5, 2), HMCDA(0.15, 0.65), 1000) -c6 = sample(gdemo(1.5, 2), NUTS(0.65), 1000) -``` - -The arguments for each sampler are: - - - SMC: number of particles. - - PG: number of particles, number of iterations. - - HMC: leapfrog step size, leapfrog step numbers. - - Gibbs: component sampler 1, component sampler 2, ... - - HMCDA: total leapfrog length, target accept ratio. 
- - NUTS: number of adaptation steps (optional), target accept ratio. - -More information about each sampler can be found in [Turing.jl's API docs](https://turinglang.org/Turing.jl). - -The `MCMCChains` module (which is re-exported by Turing) provides plotting tools for the `Chain` objects returned by a `sample` function. See the [MCMCChains](https://github.com/TuringLang/MCMCChains.jl) repository for more information on the suite of tools available for diagnosing MCMC chains. - -```{julia} -#| eval: false -# Summarise results -describe(c3) - -# Plot results -plot(c3) -savefig("gdemo-plot.png") -``` - -### Modelling Syntax Explained - -Using this syntax, a probabilistic model is defined in Turing. The model function generated by Turing can then be used to condition the model onto data. Subsequently, the sample function can be used to generate samples from the posterior distribution. - -In the following example, the defined model is conditioned to the data (arg*1 = 1, arg*2 = 2) by passing (1, 2) to the model function. - -```{julia} -#| eval: false -@model function model_name(arg_1, arg_2) - return ... -end -``` - -The conditioned model can then be passed onto the sample function to run posterior inference. - -```{julia} -#| eval: false -model_func = model_name(1, 2) -chn = sample(model_func, HMC(..)) # Perform inference by sampling using HMC. -``` - -The returned chain contains samples of the variables in the model. - -```{julia} -#| eval: false -var_1 = mean(chn[:var_1]) # Taking the mean of a variable named var_1. -``` - -The key (`:var_1`) can be a `Symbol` or a `String`. For example, to fetch `x[1]`, one can use `chn[Symbol("x[1]")]` or `chn["x[1]"]`. -If you want to retrieve all parameters associated with a specific symbol, you can use `group`. As an example, if you have the -parameters `"x[1]"`, `"x[2]"`, and `"x[3]"`, calling `group(chn, :x)` or `group(chn, "x")` will return a new chain with only `"x[1]"`, `"x[2]"`, and `"x[3]"`. 
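A small self-contained sketch of these accessors, using a hypothetical model whose only purpose is to produce a vector-valued parameter:

```{julia}
using Turing

@model function vecdemo()
    x ~ filldist(Normal(), 3)
end

chn = sample(vecdemo(), Prior(), 100; progress=false)

chn[Symbol("x[1]")]  # fetch one component by Symbol
chn["x[1]"]          # equivalently, by String
group(chn, :x)       # a new chain containing x[1], x[2], and x[3]
```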
- -Turing does not have a declarative form. More generally, the order in which you place the lines of a `@model` macro matters. For example, the following example works: - -```{julia} -# Define a simple Normal model with unknown mean and variance. -@model function model_function(y) - s ~ Poisson(1) - y ~ Normal(s, 1) - return y -end - -sample(model_function(10), SMC(), 100) -``` - -But if we switch the `s ~ Poisson(1)` and `y ~ Normal(s, 1)` lines, the model will no longer sample correctly: - -```{julia} -#| eval: false -# Define a simple Normal model with unknown mean and variance. -@model function model_function(y) - y ~ Normal(s, 1) - s ~ Poisson(1) - return y -end - -sample(model_function(10), SMC(), 100) -``` - -### Sampling Multiple Chains - -Turing supports distributed and threaded parallel sampling. To do so, call `sample(model, sampler, parallel_type, n, n_chains)`, where `parallel_type` can be either `MCMCThreads()` or `MCMCDistributed()` for thread and parallel sampling, respectively. - -Having multiple chains in the same object is valuable for evaluating convergence. Some diagnostic functions like `gelmandiag` require multiple chains. - -If you do not want parallelism or are on an older version Julia, you can sample multiple chains with the `mapreduce` function: - -```{julia} -#| eval: false -# Replace num_chains below with however many chains you wish to sample. -chains = mapreduce(c -> sample(model_fun, sampler, 1000), chainscat, 1:num_chains) -``` - -The `chains` variable now contains a `Chains` object which can be indexed by chain. To pull out the first chain from the `chains` object, use `chains[:,:,1]`. The method is the same if you use either of the below parallel sampling methods. 
- -#### Multithreaded sampling - -If you wish to perform multithreaded sampling, you can call `sample` with the following signature: - -```{julia} -#| eval: false -using Turing - -@model function gdemo(x) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - - for i in eachindex(x) - x[i] ~ Normal(m, sqrt(s²)) - end -end - -model = gdemo([1.5, 2.0]) - -# Sample four chains using multiple threads, each with 1000 samples. -sample(model, NUTS(), MCMCThreads(), 1000, 4) -``` - -Be aware that Turing cannot add threads for you -- you must have started your Julia instance with multiple threads to experience any kind of parallelism. See the [Julia documentation](https://docs.julialang.org/en/v1/manual/parallel-computing/#man-multithreading-1) for details on how to achieve this. - -#### Distributed sampling - -To perform distributed sampling (using multiple processes), you must first import `Distributed`. - -Process parallel sampling can be done like so: - -```{julia} -#| eval: false -# Load Distributed to add processes and the @everywhere macro. -using Distributed - -# Load Turing. -using Turing - -# Add four processes to use for sampling. -addprocs(4; exeflags="--project=$(Base.active_project())") - -# Initialize everything on all the processes. -# Note: Make sure to do this after you've already loaded Turing, -# so each process does not have to precompile. -# Parallel sampling may fail silently if you do not do this. -@everywhere using Turing - -# Define a model on all processes. -@everywhere @model function gdemo(x) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - - for i in eachindex(x) - x[i] ~ Normal(m, sqrt(s²)) - end -end - -# Declare the model instance everywhere. -@everywhere model = gdemo([1.5, 2.0]) - -# Sample four chains using multiple processes, each with 1000 samples. -sample(model, NUTS(), MCMCDistributed(), 1000, 4) -``` - -### Sampling from an Unconditional Distribution (The Prior) - -Turing allows you to sample from a declared model's prior. 
If you wish to draw a chain from the prior to inspect your prior distributions, you can simply run - -```{julia} -#| eval: false -chain = sample(model, Prior(), n_samples) -``` - -You can also run your model (as if it were a function) from the prior distribution, by calling the model without specifying inputs or a sampler. In the below example, we specify a `gdemo` model which returns two variables, `x` and `y`. The model includes `x` and `y` as arguments, but calling the function without passing in `x` or `y` means that Turing's compiler will assume they are missing values to draw from the relevant distribution. The `return` statement is necessary to retrieve the sampled `x` and `y` values. -Assign the function with `missing` inputs to a variable, and Turing will produce a sample from the prior distribution. - -```{julia} -@model function gdemo(x, y) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - x ~ Normal(m, sqrt(s²)) - y ~ Normal(m, sqrt(s²)) - return x, y -end -``` - -Assign the function with `missing` inputs to a variable, and Turing will produce a sample from the prior distribution. - -```{julia} -# Samples from p(x,y) -g_prior_sample = gdemo(missing, missing) -g_prior_sample() -``` - -### Sampling from a Conditional Distribution (The Posterior) - -#### Treating observations as random variables - -Inputs to the model that have a value `missing` are treated as parameters, aka random variables, to be estimated/sampled. This can be useful if you want to simulate draws for that parameter, or if you are sampling from a conditional distribution. 
Turing supports the following syntax: - -```{julia} -@model function gdemo(x, ::Type{T}=Float64) where {T} - if x === missing - # Initialize `x` if missing - x = Vector{T}(undef, 2) - end - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - for i in eachindex(x) - x[i] ~ Normal(m, sqrt(s²)) - end -end - -# Construct a model with x = missing -model = gdemo(missing) -c = sample(model, HMC(0.01, 5), 500) -``` - -Note the need to initialize `x` when missing since we are iterating over its elements later in the model. The generated values for `x` can be extracted from the `Chains` object using `c[:x]`. - -Turing also supports mixed `missing` and non-`missing` values in `x`, where the missing ones will be treated as random variables to be sampled while the others get treated as observations. For example: - -```{julia} -@model function gdemo(x) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - for i in eachindex(x) - x[i] ~ Normal(m, sqrt(s²)) - end -end - -# x[1] is a parameter, but x[2] is an observation -model = gdemo([missing, 2.4]) -c = sample(model, HMC(0.01, 5), 500) -``` - -#### Default Values - -Arguments to Turing models can have default values much like how default values work in normal Julia functions. For instance, the following will assign `missing` to `x` and treat it as a random variable. If the default value is not `missing`, `x` will be assigned that value and will be treated as an observation instead. - - -```{julia} -using Turing - -@model function generative(x=missing, ::Type{T}=Float64) where {T<:Real} - if x === missing - # Initialize x when missing - x = Vector{T}(undef, 10) - end - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - for i in 1:length(x) - x[i] ~ Normal(m, sqrt(s²)) - end - return s², m -end - -m = generative() -chain = sample(m, HMC(0.01, 5), 1000) -``` - -#### Access Values inside Chain - -You can access the values inside a chain several ways: - - 1. Turn them into a `DataFrame` object - 2. 
Use their raw `AxisArray` form - 3. Create a three-dimensional `Array` object - -For example, let `c` be a `Chain`: - - 1. `DataFrame(c)` converts `c` to a `DataFrame`, - 2. `c.value` retrieves the values inside `c` as an `AxisArray`, and - 3. `c.value.data` retrieves the values inside `c` as a 3D `Array`. - -#### Variable Types and Type Parameters - -The element type of a vector (or matrix) of random variables should match the `eltype` of its prior distribution, `<: Integer` for discrete distributions and `<: AbstractFloat` for continuous distributions. Moreover, if the continuous random variable is to be sampled using a Hamiltonian sampler, the vector's element type needs to either be: - -1. `Real` to enable auto-differentiation through the model which uses special number types that are sub-types of `Real`, or - -2. Some type parameter `T` defined in the model header using the type parameter syntax, e.g. `function gdemo(x, ::Type{T} = Float64) where {T}`. - -Similarly, when using a particle sampler, the Julia variable used should either be: - -1. An `Array`, or - -2. An instance of some type parameter `T` defined in the model header using the type parameter syntax, e.g. `function gdemo(x, ::Type{T} = Vector{Float64}) where {T}`. - -### Querying Probabilities from Model or Chain - -Turing offers three functions: [`loglikelihood`](https://turinglang.org/DynamicPPL.jl/dev/api/#StatsAPI.loglikelihood), [`logprior`](https://turinglang.org/DynamicPPL.jl/dev/api/#DynamicPPL.logprior), and [`logjoint`](https://turinglang.org/DynamicPPL.jl/dev/api/#DynamicPPL.logjoint) to query the log-likelihood, log-prior, and log-joint probabilities of a model, respectively. 
- -Let's look at a simple model called `gdemo`: - -```{julia} -@model function gdemo0() - s ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s)) - return x ~ Normal(m, sqrt(s)) -end -``` - -If we observe x to be 1.0, we can condition the model on this datum using the [`condition`](https://turinglang.org/DynamicPPL.jl/dev/api/#AbstractPPL.condition) syntax: - -```{julia} -model = gdemo0() | (x=1.0,) -``` - -Now, let's compute the log-likelihood of the observation given specific values of the model parameters, `s` and `m`: - -```{julia} -loglikelihood(model, (s=1.0, m=1.0)) -``` - -We can easily verify that value in this case: - -```{julia} -logpdf(Normal(1.0, 1.0), 1.0) -``` - -We can also compute the log-prior probability of the model for the same values of s and m: - -```{julia} -logprior(model, (s=1.0, m=1.0)) -``` - -```{julia} -logpdf(InverseGamma(2, 3), 1.0) + logpdf(Normal(0, sqrt(1.0)), 1.0) -``` - -Finally, we can compute the log-joint probability of the model parameters and data: - -```{julia} -logjoint(model, (s=1.0, m=1.0)) -``` - -```{julia} -logpdf(Normal(1.0, 1.0), 1.0) + -logpdf(InverseGamma(2, 3), 1.0) + -logpdf(Normal(0, sqrt(1.0)), 1.0) -``` - -Querying with `Chains` object is easy as well: - -```{julia} -chn = sample(model, Prior(), 10) -``` - -```{julia} -loglikelihood(model, chn) -``` - -### Maximum likelihood and maximum a posterior estimates - -Turing also has functions for estimating the maximum aposteriori and maximum likelihood parameters of a model. This can be done with - -```{julia} -mle_estimate = maximum_likelihood(model) -map_estimate = maximum_a_posteriori(model) -``` - -For more details see the [mode estimation page]({{}}). - -## Beyond the Basics - -### Compositional Sampling Using Gibbs - -Turing.jl provides a Gibbs interface to combine different samplers. For example, one can combine an `HMC` sampler with a `PG` sampler to run inference for different parameters in a single model as below. 
- -```{julia} -@model function simple_choice(xs) - p ~ Beta(2, 2) - z ~ Bernoulli(p) - for i in 1:length(xs) - if z == 1 - xs[i] ~ Normal(0, 1) - else - xs[i] ~ Normal(2, 1) - end - end -end - -simple_choice_f = simple_choice([1.5, 2.0, 0.3]) - -chn = sample(simple_choice_f, Gibbs(HMC(0.2, 3, :p), PG(20, :z)), 1000) -``` - -The `Gibbs` sampler can be used to specify unique automatic differentiation backends for different variable spaces. Please see the [Automatic Differentiation]({{}}) article for more. - -For more details of compositional sampling in Turing.jl, please check the corresponding [paper](https://proceedings.mlr.press/v84/ge18b.html). - -### Working with filldist and arraydist - -Turing provides `filldist(dist::Distribution, n::Int)` and `arraydist(dists::AbstractVector{<:Distribution})` as a simplified interface to construct product distributions, e.g., to model a set of variables that share the same structure but vary by group. - -#### Constructing product distributions with filldist - -The function `filldist` provides a general interface to construct product distributions over distributions of the same type and parameterisation. -Note that, in contrast to the product distribution interface provided by Distributions.jl (`Product`), `filldist` supports product distributions over univariate or multivariate distributions. - -Example usage: - -```{julia} -@model function demo(x, g) - k = length(unique(g)) - a ~ filldist(Exponential(), k) # = Product(fill(Exponential(), k)) - mu = a[g] - return x .~ Normal.(mu) -end -``` - -#### Constructing product distributions with `arraydist` - -The function `arraydist` provides a general interface to construct product distributions over distributions of varying type and parameterisation. -Note that in contrast to the product distribution interface provided by Distributions.jl (`Product`), `arraydist` supports product distributions over univariate or multivariate distributions. 
- -Example usage: - -```{julia} -@model function demo(x, g) - k = length(unique(g)) - a ~ arraydist([Exponential(i) for i in 1:k]) - mu = a[g] - return x .~ Normal.(mu) -end -``` - -### Working with MCMCChains.jl - -Turing.jl wraps its samples using `MCMCChains.Chain` so that all the functions working for `MCMCChains.Chain` can be re-used in Turing.jl. Two typical functions are `MCMCChains.describe` and `MCMCChains.plot`, which can be used as follows for an obtained chain `chn`. For more information on `MCMCChains`, please see the [GitHub repository](https://github.com/TuringLang/MCMCChains.jl). - -```{julia} -describe(chn) # Lists statistics of the samples. -plot(chn) # Plots statistics of the samples. -``` - -There are numerous functions in addition to `describe` and `plot` in the `MCMCChains` package, such as those used in convergence diagnostics. For more information on the package, please see the [GitHub repository](https://github.com/TuringLang/MCMCChains.jl). - -### Changing Default Settings - -Some of Turing.jl's default settings can be changed for better usage. - -#### AD Chunk Size - -ForwardDiff (Turing's default AD backend) uses forward-mode chunk-wise AD. The chunk size can be set manually by `AutoForwardDiff(;chunksize=new_chunk_size)`. - -#### AD Backend - -Turing supports four automatic differentiation (AD) packages in the back end during sampling. The default AD backend is [ForwardDiff](https://github.com/JuliaDiff/ForwardDiff.jl) for forward-mode AD. Three reverse-mode AD backends are also supported, namely [Mooncake](https://github.com/compintell/Mooncake.jl), [Zygote](https://github.com/FluxML/Zygote.jl) and [ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl). `Mooncake`, `Zygote`, and `ReverseDiff` also require the user to explicitly load them using `import Mooncake`, `import Zygote`, or `import ReverseDiff` next to `using Turing`. 
 - -For more information on Turing's automatic differentiation backend, please see the [Automatic Differentiation]({{}}) article. - -#### Progress Logging - -`Turing.jl` uses ProgressLogging.jl to log the sampling progress. Progress -logging is enabled as default but might slow down inference. It can be turned on -or off by setting the keyword argument `progress` of `sample` to `true` or `false`. -Moreover, you can enable or disable progress logging globally by calling `setprogress!(true)` or `setprogress!(false)`, respectively. - -Turing uses heuristics to select an appropriate visualization backend. If you -use Jupyter notebooks, the default backend is -[ConsoleProgressMonitor.jl](https://github.com/tkf/ConsoleProgressMonitor.jl). -In all other cases, progress logs are displayed with -[TerminalLoggers.jl](https://github.com/c42f/TerminalLoggers.jl). Alternatively, -if you provide a custom visualization backend, Turing uses it instead of the -default backend. +--- +title: "Core Functionality" +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +This article provides an overview of the core functionality in Turing.jl, which is likely to be used across a wide range of models. + +## Basics + +### Introduction + +A probabilistic program is Julia code wrapped in a `@model` macro. It can use arbitrary Julia code, but to ensure correctness of inference it should not have external effects or modify global state. Stack-allocated variables are safe, but mutable heap-allocated objects may lead to subtle bugs when using task copying. By default, Libtask deepcopies `Array` and `Dict` objects when copying a task, to avoid bugs with data stored in mutable structures in Turing models. + +To specify distributions of random variables, Turing programs should use the `~` notation: + +`x ~ distr` where `x` is a symbol and `distr` is a distribution. 
If `x` is undefined in the model function, this introduces a random variable named `x`, distributed according to `distr`, into the current scope. `distr` can be a value of any type that implements `rand(distr)`, which samples a value from the distribution `distr`. If `x` is defined, this is used for conditioning in a style similar to [Anglican](https://probprog.github.io/anglican/index.html) (another PPL). In this case, `x` is an observed value, assumed to have been drawn from the distribution `distr`. The likelihood is computed using `logpdf(distr, x)`. The observe statements should be arranged so that every possible run traverses all of them in exactly the same order. This is equivalent to demanding that they are not placed inside stochastic control flow. + +Available inference methods include Importance Sampling (IS), Sequential Monte Carlo (SMC), Particle Gibbs (PG), Hamiltonian Monte Carlo (HMC), Hamiltonian Monte Carlo with Dual Averaging (HMCDA) and the No-U-Turn Sampler (NUTS). + +### Simple Gaussian Demo + +Below is a simple Gaussian demo illustrating the basic usage of Turing.jl. + + +```{julia} +# Import packages. +using Turing +using StatsPlots + +# Define a simple Normal model with unknown mean and variance. +@model function gdemo(x, y) + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + x ~ Normal(m, sqrt(s²)) + return y ~ Normal(m, sqrt(s²)) +end +``` + +Note: As a sanity check, the prior expectation of `s²` is `mean(InverseGamma(2, 3)) = 3/(2 - 1) = 3` and the prior expectation of `m` is 0. This can be easily checked using `Prior`: + +```{julia} +#| output: false +setprogress!(false) +``` + +```{julia} +p1 = sample(gdemo(missing, missing), Prior(), 100000) +``` + +We can perform inference by using the `sample` function, the first argument of which is our probabilistic program and the second of which is a sampler. + +```{julia} +# Run sampler, collect results. 
+c1 = sample(gdemo(1.5, 2), SMC(), 1000) +c2 = sample(gdemo(1.5, 2), PG(10), 1000) +c3 = sample(gdemo(1.5, 2), HMC(0.1, 5), 1000) +c4 = sample(gdemo(1.5, 2), Gibbs(PG(10, :m), HMC(0.1, 5, :s²)), 1000) +c5 = sample(gdemo(1.5, 2), HMCDA(0.15, 0.65), 1000) +c6 = sample(gdemo(1.5, 2), NUTS(0.65), 1000) +``` + +The arguments for each sampler are: + + - SMC: number of particles. + - PG: number of particles, number of iterations. + - HMC: leapfrog step size, number of leapfrog steps. + - Gibbs: component sampler 1, component sampler 2, ... + - HMCDA: total leapfrog length, target accept ratio. + - NUTS: number of adaptation steps (optional), target accept ratio. + +More information about each sampler can be found in [Turing.jl's API docs](https://turinglang.org/Turing.jl). + +The `MCMCChains` module (which is re-exported by Turing) provides plotting tools for the `Chain` objects returned by a `sample` function. See the [MCMCChains](https://github.com/TuringLang/MCMCChains.jl) repository for more information on the suite of tools available for diagnosing MCMC chains. + +```{julia} +#| eval: false +# Summarise results +describe(c3) + +# Plot results +plot(c3) +savefig("gdemo-plot.png") +``` + +### Modelling Syntax Explained + +Using this syntax, a probabilistic model is defined in Turing. The model function generated by Turing can then be used to condition the model on data. Subsequently, the `sample` function can be used to generate samples from the posterior distribution. + +In the following example, the defined model is conditioned on the data (`arg_1 = 1`, `arg_2 = 2`) by passing `(1, 2)` to the model function. + +```{julia} +#| eval: false +@model function model_name(arg_1, arg_2) + return ... +end +``` + +The conditioned model can then be passed to the `sample` function to run posterior inference. + +```{julia} +#| eval: false +model_func = model_name(1, 2) +chn = sample(model_func, HMC(..)) # Perform inference by sampling using HMC. 
+``` + +The returned chain contains samples of the variables in the model. + +```{julia} +#| eval: false +var_1 = mean(chn[:var_1]) # Taking the mean of a variable named var_1. +``` + +The key (`:var_1`) can be a `Symbol` or a `String`. For example, to fetch `x[1]`, one can use `chn[Symbol("x[1]")]` or `chn["x[1]"]`. +If you want to retrieve all parameters associated with a specific symbol, you can use `group`. As an example, if you have the +parameters `"x[1]"`, `"x[2]"`, and `"x[3]"`, calling `group(chn, :x)` or `group(chn, "x")` will return a new chain with only `"x[1]"`, `"x[2]"`, and `"x[3]"`. + +Turing does not have a declarative form. More generally, the order in which you place the lines of a `@model` macro matters. For example, the following model works: + +```{julia} +# Define a simple Normal model with unknown mean and variance. +@model function model_function(y) + s ~ Poisson(1) + y ~ Normal(s, 1) + return y +end + +sample(model_function(10), SMC(), 100) +``` + +But if we switch the `s ~ Poisson(1)` and `y ~ Normal(s, 1)` lines, the model will no longer sample correctly: + +```{julia} +#| eval: false +# Define a simple Normal model with unknown mean and variance. +@model function model_function(y) + y ~ Normal(s, 1) + s ~ Poisson(1) + return y +end + +sample(model_function(10), SMC(), 100) +``` + +### Sampling Multiple Chains + +Turing supports distributed and threaded parallel sampling. To do so, call `sample(model, sampler, parallel_type, n, n_chains)`, where `parallel_type` can be either `MCMCThreads()` or `MCMCDistributed()` for multithreaded and distributed sampling, respectively. + +Having multiple chains in the same object is valuable for evaluating convergence. Some diagnostic functions like `gelmandiag` require multiple chains. 
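+ +For example, the Gelman, Rubin, and Brooks diagnostic can be computed from a multi-chain `Chains` object. The following is a minimal sketch, assuming `model` is any instantiated Turing model: + +```{julia} +#| eval: false +# Sample four chains in parallel, then run the convergence diagnostic, +# which requires at least two chains. +chains = sample(model, NUTS(), MCMCThreads(), 1000, 4) +gelmandiag(chains) +``` 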
+ +If you do not want parallelism or are on an older version of Julia, you can sample multiple chains with the `mapreduce` function: + +```{julia} +#| eval: false +# Replace num_chains below with however many chains you wish to sample. +chains = mapreduce(c -> sample(model_fun, sampler, 1000), chainscat, 1:num_chains) +``` + +The `chains` variable now contains a `Chains` object which can be indexed by chain. To pull out the first chain from the `chains` object, use `chains[:,:,1]`. The method is the same if you use either of the below parallel sampling methods. + +#### Multithreaded sampling + +If you wish to perform multithreaded sampling, you can call `sample` with the following signature: + +```{julia} +#| eval: false +using Turing + +@model function gdemo(x) + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + + for i in eachindex(x) + x[i] ~ Normal(m, sqrt(s²)) + end +end + +model = gdemo([1.5, 2.0]) + +# Sample four chains using multiple threads, each with 1000 samples. +sample(model, NUTS(), MCMCThreads(), 1000, 4) +``` + +Be aware that Turing cannot add threads for you -- you must have started your Julia instance with multiple threads to experience any kind of parallelism. See the [Julia documentation](https://docs.julialang.org/en/v1/manual/parallel-computing/#man-multithreading-1) for details on how to achieve this. + +#### Distributed sampling + +To perform distributed sampling (using multiple processes), you must first import `Distributed`. + +Process parallel sampling can be done like so: + +```{julia} +#| eval: false +# Load Distributed to add processes and the @everywhere macro. +using Distributed + +# Load Turing. +using Turing + +# Add four processes to use for sampling. +addprocs(4; exeflags="--project=$(Base.active_project())") + +# Initialize everything on all the processes. +# Note: Make sure to do this after you've already loaded Turing, +# so each process does not have to precompile. +# Parallel sampling may fail silently if you do not do this. 
+@everywhere using Turing + +# Define a model on all processes. +@everywhere @model function gdemo(x) + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + + for i in eachindex(x) + x[i] ~ Normal(m, sqrt(s²)) + end +end + +# Declare the model instance everywhere. +@everywhere model = gdemo([1.5, 2.0]) + +# Sample four chains using multiple processes, each with 1000 samples. +sample(model, NUTS(), MCMCDistributed(), 1000, 4) +``` + +### Sampling from an Unconditional Distribution (The Prior) + +Turing allows you to sample from a declared model's prior. If you wish to draw a chain from the prior to inspect your prior distributions, you can simply run + +```{julia} +#| eval: false +chain = sample(model, Prior(), n_samples) +``` + +You can also run your model (as if it were a function) from the prior distribution, by calling the model without specifying inputs or a sampler. In the example below, we specify a `gdemo` model which returns two variables, `x` and `y`. The model includes `x` and `y` as arguments, but calling the function without passing in `x` or `y` means that Turing's compiler will treat them as missing values to be drawn from the relevant distribution. The `return` statement is necessary to retrieve the sampled `x` and `y` values. + +```{julia} +@model function gdemo(x, y) + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + x ~ Normal(m, sqrt(s²)) + y ~ Normal(m, sqrt(s²)) + return x, y +end +``` + +Assign the function with `missing` inputs to a variable, and Turing will produce a sample from the prior distribution. 
+ +```{julia} +# Samples from p(x,y) +g_prior_sample = gdemo(missing, missing) +g_prior_sample() +``` + +### Sampling from a Conditional Distribution (The Posterior) + +#### Treating observations as random variables + +Inputs to the model that have a value `missing` are treated as parameters, aka random variables, to be estimated/sampled. This can be useful if you want to simulate draws for that parameter, or if you are sampling from a conditional distribution. Turing supports the following syntax: + +```{julia} +@model function gdemo(x, ::Type{T}=Float64) where {T} + if x === missing + # Initialize `x` if missing + x = Vector{T}(undef, 2) + end + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + for i in eachindex(x) + x[i] ~ Normal(m, sqrt(s²)) + end +end + +# Construct a model with x = missing +model = gdemo(missing) +c = sample(model, HMC(0.01, 5), 500) +``` + +Note the need to initialize `x` when missing since we are iterating over its elements later in the model. The generated values for `x` can be extracted from the `Chains` object using `c[:x]`. + +Turing also supports mixed `missing` and non-`missing` values in `x`, where the missing ones will be treated as random variables to be sampled while the others get treated as observations. For example: + +```{julia} +@model function gdemo(x) + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + for i in eachindex(x) + x[i] ~ Normal(m, sqrt(s²)) + end +end + +# x[1] is a parameter, but x[2] is an observation +model = gdemo([missing, 2.4]) +c = sample(model, HMC(0.01, 5), 500) +``` + +#### Default Values + +Arguments to Turing models can have default values much like how default values work in normal Julia functions. For instance, the following will assign `missing` to `x` and treat it as a random variable. If the default value is not `missing`, `x` will be assigned that value and will be treated as an observation instead. 
+ + +```{julia} +using Turing + +@model function generative(x=missing, ::Type{T}=Float64) where {T<:Real} + if x === missing + # Initialize x when missing + x = Vector{T}(undef, 10) + end + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + for i in 1:length(x) + x[i] ~ Normal(m, sqrt(s²)) + end + return s², m +end + +m = generative() +chain = sample(m, HMC(0.01, 5), 1000) +``` + +#### Access Values inside Chain + +You can access the values inside a chain several ways: + + 1. Turn them into a `DataFrame` object + 2. Use their raw `AxisArray` form + 3. Create a three-dimensional `Array` object + +For example, let `c` be a `Chain`: + + 1. `DataFrame(c)` converts `c` to a `DataFrame`, + 2. `c.value` retrieves the values inside `c` as an `AxisArray`, and + 3. `c.value.data` retrieves the values inside `c` as a 3D `Array`. + +#### Variable Types and Type Parameters + +The element type of a vector (or matrix) of random variables should match the `eltype` of its prior distribution, `<: Integer` for discrete distributions and `<: AbstractFloat` for continuous distributions. Moreover, if the continuous random variable is to be sampled using a Hamiltonian sampler, the vector's element type needs to either be: + +1. `Real` to enable auto-differentiation through the model which uses special number types that are sub-types of `Real`, or + +2. Some type parameter `T` defined in the model header using the type parameter syntax, e.g. `function gdemo(x, ::Type{T} = Float64) where {T}`. + +Similarly, when using a particle sampler, the Julia variable used should either be: + +1. An `Array`, or + +2. An instance of some type parameter `T` defined in the model header using the type parameter syntax, e.g. `function gdemo(x, ::Type{T} = Vector{Float64}) where {T}`. 
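+ +As a sketch of the second option (a hypothetical model, not from the original text), the type parameter `T` lets a particle sampler substitute its own container type for the latent vector: + +```{julia} +#| eval: false +@model function pdemo(y, ::Type{T}=Vector{Float64}) where {T} + # `T` defaults to Vector{Float64} but can be replaced by the sampler. + z = T(undef, length(y)) + for i in eachindex(y) + z[i] ~ Normal() + y[i] ~ Normal(z[i], 1.0) + end +end + +sample(pdemo([0.1, 0.2]), PG(20), 100) +``` 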
+ +### Querying Probabilities from Model or Chain + +Turing offers three functions: [`loglikelihood`](https://turinglang.org/DynamicPPL.jl/dev/api/#StatsAPI.loglikelihood), [`logprior`](https://turinglang.org/DynamicPPL.jl/dev/api/#DynamicPPL.logprior), and [`logjoint`](https://turinglang.org/DynamicPPL.jl/dev/api/#DynamicPPL.logjoint) to query the log-likelihood, log-prior, and log-joint probabilities of a model, respectively. + +Let's look at a simple model called `gdemo`: + +```{julia} +@model function gdemo0() + s ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s)) + return x ~ Normal(m, sqrt(s)) +end +``` + +If we observe `x` to be 1.0, we can condition the model on this datum using the [`condition`](https://turinglang.org/DynamicPPL.jl/dev/api/#AbstractPPL.condition) syntax: + +```{julia} +model = gdemo0() | (x=1.0,) +``` + +Now, let's compute the log-likelihood of the observation given specific values of the model parameters, `s` and `m`: + +```{julia} +loglikelihood(model, (s=1.0, m=1.0)) +``` + +We can easily verify that value in this case: + +```{julia} +logpdf(Normal(1.0, 1.0), 1.0) +``` + +We can also compute the log-prior probability of the model for the same values of `s` and `m`: + +```{julia} +logprior(model, (s=1.0, m=1.0)) +``` + +```{julia} +logpdf(InverseGamma(2, 3), 1.0) + logpdf(Normal(0, sqrt(1.0)), 1.0) +``` + +Finally, we can compute the log-joint probability of the model parameters and data: + +```{julia} +logjoint(model, (s=1.0, m=1.0)) +``` + +```{julia} +logpdf(Normal(1.0, 1.0), 1.0) + +logpdf(InverseGamma(2, 3), 1.0) + +logpdf(Normal(0, sqrt(1.0)), 1.0) +``` + +Querying with a `Chains` object is easy as well: + +```{julia} +chn = sample(model, Prior(), 10) +``` + +```{julia} +loglikelihood(model, chn) +``` + +### Maximum likelihood and maximum a posteriori estimates + +Turing also has functions for estimating the maximum a posteriori and maximum likelihood parameters of a model. 
This can be done with + +```{julia} +mle_estimate = maximum_likelihood(model) +map_estimate = maximum_a_posteriori(model) +``` + +For more details see the [mode estimation page]({{}}). + +## Beyond the Basics + +### Compositional Sampling Using Gibbs + +Turing.jl provides a Gibbs interface to combine different samplers. For example, one can combine an `HMC` sampler with a `PG` sampler to run inference for different parameters in a single model as below. + +```{julia} +@model function simple_choice(xs) + p ~ Beta(2, 2) + z ~ Bernoulli(p) + for i in 1:length(xs) + if z == 1 + xs[i] ~ Normal(0, 1) + else + xs[i] ~ Normal(2, 1) + end + end +end + +simple_choice_f = simple_choice([1.5, 2.0, 0.3]) + +chn = sample(simple_choice_f, Gibbs(HMC(0.2, 3, :p), PG(20, :z)), 1000) +``` + +The `Gibbs` sampler can be used to specify unique automatic differentiation backends for different variable spaces. Please see the [Automatic Differentiation]({{}}) article for more. + +For more details of compositional sampling in Turing.jl, please check the corresponding [paper](https://proceedings.mlr.press/v84/ge18b.html). + +### Working with filldist and arraydist + +Turing provides `filldist(dist::Distribution, n::Int)` and `arraydist(dists::AbstractVector{<:Distribution})` as a simplified interface to construct product distributions, e.g., to model a set of variables that share the same structure but vary by group. + +#### Constructing product distributions with filldist + +The function `filldist` provides a general interface to construct product distributions over distributions of the same type and parameterisation. +Note that, in contrast to the product distribution interface provided by Distributions.jl (`Product`), `filldist` supports product distributions over univariate or multivariate distributions. 
+ +Example usage: + +```{julia} +@model function demo(x, g) + k = length(unique(g)) + a ~ filldist(Exponential(), k) # = Product(fill(Exponential(), k)) + mu = a[g] + return x .~ Normal.(mu) +end +``` + +#### Constructing product distributions with `arraydist` + +The function `arraydist` provides a general interface to construct product distributions over distributions of varying type and parameterisation. +Note that in contrast to the product distribution interface provided by Distributions.jl (`Product`), `arraydist` supports product distributions over univariate or multivariate distributions. + +Example usage: + +```{julia} +@model function demo(x, g) + k = length(unique(g)) + a ~ arraydist([Exponential(i) for i in 1:k]) + mu = a[g] + return x .~ Normal.(mu) +end +``` + +### Working with MCMCChains.jl + +Turing.jl wraps its samples using `MCMCChains.Chain` so that all the functions working for `MCMCChains.Chain` can be re-used in Turing.jl. Two typical functions are `MCMCChains.describe` and `MCMCChains.plot`, which can be used as follows for an obtained chain `chn`. For more information on `MCMCChains`, please see the [GitHub repository](https://github.com/TuringLang/MCMCChains.jl). + +```{julia} +describe(chn) # Lists statistics of the samples. +plot(chn) # Plots statistics of the samples. +``` + +There are numerous functions in addition to `describe` and `plot` in the `MCMCChains` package, such as those used in convergence diagnostics. For more information on the package, please see the [GitHub repository](https://github.com/TuringLang/MCMCChains.jl). + +### Changing Default Settings + +Some of Turing.jl's default settings can be changed for better usage. + +#### AD Chunk Size + +ForwardDiff (Turing's default AD backend) uses forward-mode chunk-wise AD. The chunk size can be set manually by `AutoForwardDiff(;chunksize=new_chunk_size)`. + +#### AD Backend + +Turing supports four automatic differentiation (AD) packages in the back end during sampling. 
The default AD backend is [ForwardDiff](https://github.com/JuliaDiff/ForwardDiff.jl) for forward-mode AD. Three reverse-mode AD backends are also supported, namely [Mooncake](https://github.com/compintell/Mooncake.jl), [Zygote](https://github.com/FluxML/Zygote.jl) and [ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl). `Mooncake`, `Zygote`, and `ReverseDiff` also require the user to explicitly load them using `import Mooncake`, `import Zygote`, or `import ReverseDiff` next to `using Turing`. + +For more information on Turing's automatic differentiation backend, please see the [Automatic Differentiation]({{}}) article. + +#### Progress Logging + +`Turing.jl` uses ProgressLogging.jl to log the sampling progress. Progress +logging is enabled by default but might slow down inference. It can be turned on +or off by setting the keyword argument `progress` of `sample` to `true` or `false`. +Moreover, you can enable or disable progress logging globally by calling `setprogress!(true)` or `setprogress!(false)`, respectively. + +Turing uses heuristics to select an appropriate visualization backend. If you +use Jupyter notebooks, the default backend is +[ConsoleProgressMonitor.jl](https://github.com/tkf/ConsoleProgressMonitor.jl). +In all other cases, progress logs are displayed with +[TerminalLoggers.jl](https://github.com/c42f/TerminalLoggers.jl). Alternatively, +if you provide a custom visualization backend, Turing uses it instead of the +default backend. 
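+ +For example, progress logging can be switched off globally or for a single call. This is a minimal sketch, assuming `model` is any instantiated Turing model: + +```{julia} +#| eval: false +# Disable progress bars for all subsequent `sample` calls... +setprogress!(false) + +# ...or for one call only, via the `progress` keyword argument. +chn = sample(model, NUTS(), 1000; progress=false) +``` 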
diff --git a/tutorials/docs-13-using-turing-performance-tips/index.qmd b/tutorials/docs-13-using-turing-performance-tips/index.qmd index 4eb9ba42b..241deb989 100755 --- a/tutorials/docs-13-using-turing-performance-tips/index.qmd +++ b/tutorials/docs-13-using-turing-performance-tips/index.qmd @@ -1,150 +1,150 @@ ---- -title: Performance Tips -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -This section briefly summarises a few common techniques to ensure good performance when using Turing. -We refer to [the Julia documentation](https://docs.julialang.org/en/v1/manual/performance-tips/index.html) for general techniques to ensure good performance of Julia programs. - -## Use multivariate distributions - -It is generally preferable to use multivariate distributions if possible. - -The following example: - -```{julia} -using Turing -@model function gmodel(x) - m ~ Normal() - for i in 1:length(x) - x[i] ~ Normal(m, 0.2) - end -end -``` - -can be directly expressed more efficiently using a simple transformation: - -```{julia} -using FillArrays - -@model function gmodel(x) - m ~ Normal() - return x ~ MvNormal(Fill(m, length(x)), 0.04 * I) -end -``` - -## Choose your AD backend - -Automatic differentiation (AD) makes it possible to use modern, efficient gradient-based samplers like NUTS and HMC, and that means a good AD system is incredibly important. Turing currently -supports several AD backends, including [ForwardDiff](https://github.com/JuliaDiff/ForwardDiff.jl) (the default), -[Mooncake](https://github.com/compintell/Mooncake.jl), -[Zygote](https://github.com/FluxML/Zygote.jl), and -[ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl). - -For many common types of models, the default ForwardDiff backend performs great, and there is no need to worry about changing it. 
However, if you need more speed, you can try -different backends via the standard [ADTypes](https://github.com/SciML/ADTypes.jl) interface by passing an `AbstractADType` to the sampler with the optional `adtype` argument, e.g. -`NUTS(adtype = AutoZygote())`. See [Automatic Differentiation]({{}}) for details. Generally, `adtype = AutoForwardDiff()` is likely to be the fastest and most reliable for models with -few parameters (say, less than 20 or so), while reverse-mode backends such as `AutoZygote()` or `AutoReverseDiff()` will perform better for models with many parameters or linear algebra -operations. If in doubt, it's easy to try a few different backends to see how they compare. - -### Special care for Zygote - -Note that Zygote will not perform well if your model contains `for`-loops, due to the way reverse-mode AD is implemented in these packages. Zygote also cannot differentiate code -that contains mutating operations. If you can't implement your model without `for`-loops or mutation, `ReverseDiff` will be a better, more performant option. In general, though, -vectorized operations are still likely to perform best. - -Avoiding loops can be done using `filldist(dist, N)` and `arraydist(dists)`. `filldist(dist, N)` creates a multivariate distribution that is composed of `N` identical and independent -copies of the univariate distribution `dist` if `dist` is univariate, or it creates a matrix-variate distribution composed of `N` identical and independent copies of the multivariate -distribution `dist` if `dist` is multivariate. `filldist(dist, N, M)` can also be used to create a matrix-variate distribution from a univariate distribution `dist`. `arraydist(dists)` -is similar to `filldist` but it takes an array of distributions `dists` as input. Writing a [custom distribution](advanced) with a custom adjoint is another option to avoid loops. 
- -### Special care for ReverseDiff with a compiled tape - -For large models, the fastest option is often ReverseDiff with a compiled tape, specified as `adtype=AutoReverseDiff(true)`. However, it is important to note that if your model contains any -branching code, such as `if`-`else` statements, **the gradients from a compiled tape may be inaccurate, leading to erroneous results**. If you use this option for the (considerable) speedup it -can provide, make sure to check your code. It's also a good idea to verify your gradients with another backend. - -## Ensure that types in your model can be inferred - -For efficient gradient-based inference, e.g. using HMC, NUTS or ADVI, it is important to ensure the types in your model can be inferred. - -The following example with abstract types - -```{julia} -@model function tmodel(x, y) - p, n = size(x) - params = Vector{Real}(undef, n) - for i in 1:n - params[i] ~ truncated(Normal(); lower=0) - end - - a = x * params - return y ~ MvNormal(a, I) -end -``` - -can be transformed into the following representation with concrete types: - -```{julia} -@model function tmodel(x, y, ::Type{T}=Float64) where {T} - p, n = size(x) - params = Vector{T}(undef, n) - for i in 1:n - params[i] ~ truncated(Normal(); lower=0) - end - - a = x * params - return y ~ MvNormal(a, I) -end -``` - -Alternatively, you could use `filldist` in this example: - -```{julia} -@model function tmodel(x, y) - params ~ filldist(truncated(Normal(); lower=0), size(x, 2)) - a = x * params - return y ~ MvNormal(a, I) -end -``` - -Note that you can use `@code_warntype` to find types in your model definition that the compiler cannot infer. -They are marked in red in the Julia REPL. 
- -For example, consider the following simple program: - -```{julia} -@model function tmodel(x) - p = Vector{Real}(undef, 1) - p[1] ~ Normal() - p = p .+ 1 - return x ~ Normal(p[1]) -end -``` - -We can use - -```{julia} -#| eval: false -using Random - -model = tmodel(1.0) - -@code_warntype model.f( - model, - Turing.VarInfo(model), - Turing.SamplingContext( - Random.default_rng(), Turing.SampleFromPrior(), Turing.DefaultContext() - ), - model.args..., -) -``` - -to inspect type inference in the model. +--- +title: Performance Tips +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +This section briefly summarises a few common techniques to ensure good performance when using Turing. +We refer to [the Julia documentation](https://docs.julialang.org/en/v1/manual/performance-tips/index.html) for general techniques to ensure good performance of Julia programs. + +## Use multivariate distributions + +It is generally preferable to use multivariate distributions if possible. + +The following example: + +```{julia} +using Turing +@model function gmodel(x) + m ~ Normal() + for i in 1:length(x) + x[i] ~ Normal(m, 0.2) + end +end +``` + +can be directly expressed more efficiently using a simple transformation: + +```{julia} +using FillArrays + +@model function gmodel(x) + m ~ Normal() + return x ~ MvNormal(Fill(m, length(x)), 0.04 * I) +end +``` + +## Choose your AD backend + +Automatic differentiation (AD) makes it possible to use modern, efficient gradient-based samplers like NUTS and HMC, and that means a good AD system is incredibly important. Turing currently +supports several AD backends, including [ForwardDiff](https://github.com/JuliaDiff/ForwardDiff.jl) (the default), +[Mooncake](https://github.com/compintell/Mooncake.jl), +[Zygote](https://github.com/FluxML/Zygote.jl), and +[ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl). 
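To make the backend selection concrete, here is a small hedged sketch (the toy model, the `progress=false` flag, and the variable names are our own illustrative choices, not from the text; the reverse-mode example assumes ReverseDiff.jl is installed):

```julia
using Turing

# A toy model purely for illustration.
@model function toy_ad()
    m ~ Normal()
    # Observe a fixed data point.
    return 1.5 ~ Normal(m, 1.0)
end

# Default forward-mode backend (ForwardDiff):
chain_fwd = sample(toy_ad(), NUTS(; adtype=AutoForwardDiff()), 100; progress=false)

# A reverse-mode backend; requires ReverseDiff.jl to be installed.
using ReverseDiff
chain_rev = sample(toy_ad(), NUTS(; adtype=AutoReverseDiff()), 100; progress=false)
```

The same `adtype` keyword works for `HMC` and the other gradient-based samplers.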
+ +For many common types of models, the default ForwardDiff backend performs well, and there is no need to worry about changing it. However, if you need more speed, you can try +different backends via the standard [ADTypes](https://github.com/SciML/ADTypes.jl) interface by passing an `AbstractADType` to the sampler with the optional `adtype` argument, e.g. +`NUTS(adtype = AutoZygote())`. See [Automatic Differentiation]({{}}) for details. Generally, `adtype = AutoForwardDiff()` is likely to be the fastest and most reliable for models with +few parameters (say, fewer than 20 or so), while reverse-mode backends such as `AutoZygote()` or `AutoReverseDiff()` will perform better for models with many parameters or linear algebra +operations. If in doubt, it's easy to try a few different backends to see how they compare. + +### Special care for Zygote + +Note that Zygote will not perform well if your model contains `for`-loops, due to the way reverse-mode AD is implemented in Zygote. Zygote also cannot differentiate code +that contains mutating operations. If you can't implement your model without `for`-loops or mutation, `ReverseDiff` will be a better, more performant option. In general, though, +vectorized operations are still likely to perform best. + +Avoiding loops can be done using `filldist(dist, N)` and `arraydist(dists)`. `filldist(dist, N)` creates a multivariate distribution composed of `N` identical and independent +copies of the univariate distribution `dist` if `dist` is univariate, or a matrix-variate distribution composed of `N` identical and independent copies of the multivariate +distribution `dist` if `dist` is multivariate. `filldist(dist, N, M)` can also be used to create a matrix-variate distribution from a univariate distribution `dist`. `arraydist(dists)` +is similar to `filldist`, but it takes an array of distributions `dists` as input.
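As a hedged sketch of these helpers (the model names and the standard deviation of 1 are our own, for illustration only), the following three models place the same prior on `x`:

```julia
using Turing

# Three equivalent priors on a length-N vector `x`; the vectorised forms
# avoid an explicit Julia loop, which helps reverse-mode AD backends.
@model function loop_version(x)
    m ~ Normal()
    for i in eachindex(x)
        x[i] ~ Normal(m, 1.0)
    end
end

@model function filldist_version(x)
    m ~ Normal()
    # N independent copies of Normal(m, 1) bundled as one multivariate distribution.
    return x ~ filldist(Normal(m, 1.0), length(x))
end

@model function arraydist_version(x)
    m ~ Normal()
    # arraydist accepts an arbitrary array of (possibly different) distributions.
    return x ~ arraydist([Normal(m, 1.0) for _ in eachindex(x)])
end
```

For reverse-mode backends, the `filldist` and `arraydist` forms typically differentiate much faster than the explicit loop.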
Writing a [custom distribution](advanced) with a custom adjoint is another option to avoid loops. + +### Special care for ReverseDiff with a compiled tape + +For large models, the fastest option is often ReverseDiff with a compiled tape, specified as `adtype=AutoReverseDiff(true)`. However, it is important to note that if your model contains any +branching code, such as `if`-`else` statements, **the gradients from a compiled tape may be inaccurate, leading to erroneous results**. If you use this option for the (considerable) speedup it +can provide, make sure to check your code. It's also a good idea to verify your gradients with another backend. + +## Ensure that types in your model can be inferred + +For efficient gradient-based inference, e.g. using HMC, NUTS or ADVI, it is important to ensure the types in your model can be inferred. + +The following example with abstract types + +```{julia} +@model function tmodel(x, y) + p, n = size(x) + params = Vector{Real}(undef, n) + for i in 1:n + params[i] ~ truncated(Normal(); lower=0) + end + + a = x * params + return y ~ MvNormal(a, I) +end +``` + +can be transformed into the following representation with concrete types: + +```{julia} +@model function tmodel(x, y, ::Type{T}=Float64) where {T} + p, n = size(x) + params = Vector{T}(undef, n) + for i in 1:n + params[i] ~ truncated(Normal(); lower=0) + end + + a = x * params + return y ~ MvNormal(a, I) +end +``` + +Alternatively, you could use `filldist` in this example: + +```{julia} +@model function tmodel(x, y) + params ~ filldist(truncated(Normal(); lower=0), size(x, 2)) + a = x * params + return y ~ MvNormal(a, I) +end +``` + +Note that you can use `@code_warntype` to find types in your model definition that the compiler cannot infer. +They are marked in red in the Julia REPL. 
+ +For example, consider the following simple program: + +```{julia} +@model function tmodel(x) + p = Vector{Real}(undef, 1) + p[1] ~ Normal() + p = p .+ 1 + return x ~ Normal(p[1]) +end +``` + +We can use + +```{julia} +#| eval: false +using Random + +model = tmodel(1.0) + +@code_warntype model.f( + model, + Turing.VarInfo(model), + Turing.SamplingContext( + Random.default_rng(), Turing.SampleFromPrior(), Turing.DefaultContext() + ), + model.args..., +) +``` + +to inspect type inference in the model. diff --git a/tutorials/docs-15-using-turing-sampler-viz/index.qmd b/tutorials/docs-15-using-turing-sampler-viz/index.qmd index 5dbfbf094..9b31eab8b 100755 --- a/tutorials/docs-15-using-turing-sampler-viz/index.qmd +++ b/tutorials/docs-15-using-turing-sampler-viz/index.qmd @@ -1,197 +1,197 @@ ---- -title: Sampler Visualization -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -## Introduction - -### The Code - -For each sampler, we will use the same code to plot sampler paths. The block below loads the relevant libraries and defines a function for plotting the sampler's trajectory across the posterior. - -The Turing model definition used here is not especially practical, but it is designed in such a way as to produce visually interesting posterior surfaces to show how different samplers move along the distribution. - -```{julia} -ENV["GKS_ENCODING"] = "utf-8" # Allows the use of unicode characters in Plots.jl -using Plots -using StatsPlots -using Turing -using Random -using Bijectors - -# Set a seed. -Random.seed!(0) - -# Define a strange model. -@model function gdemo(x) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - bumps = sin(m) + cos(m) - m = m + 5 * bumps - for i in eachindex(x) - x[i] ~ Normal(m, sqrt(s²)) - end - return s², m -end - -# Define our data points. -x = [1.5, 2.0, 13.0, 2.1, 0.0] - -# Set up the model call, sample from the prior. -model = gdemo(x) - -# Evaluate surface at coordinates. 
-evaluate(m1, m2) = logjoint(model, (m=m2, s²=invlink.(Ref(InverseGamma(2, 3)), m1))) - -function plot_sampler(chain; label="") - # Extract values from chain. - val = get(chain, [:s², :m, :lp]) - ss = link.(Ref(InverseGamma(2, 3)), val.s²) - ms = val.m - lps = val.lp - - # How many surface points to sample. - granularity = 100 - - # Range start/stop points. - spread = 0.5 - σ_start = minimum(ss) - spread * std(ss) - σ_stop = maximum(ss) + spread * std(ss) - μ_start = minimum(ms) - spread * std(ms) - μ_stop = maximum(ms) + spread * std(ms) - σ_rng = collect(range(σ_start; stop=σ_stop, length=granularity)) - μ_rng = collect(range(μ_start; stop=μ_stop, length=granularity)) - - # Make surface plot. - p = surface( - σ_rng, - μ_rng, - evaluate; - camera=(30, 65), - # ticks=nothing, - colorbar=false, - color=:inferno, - title=label, - ) - - line_range = 1:length(ms) - - scatter3d!( - ss[line_range], - ms[line_range], - lps[line_range]; - mc=:viridis, - marker_z=collect(line_range), - msw=0, - legend=false, - colorbar=false, - alpha=0.5, - xlabel="σ", - ylabel="μ", - zlabel="Log probability", - title=label, - ) - - return p -end; -``` - -```{julia} -#| output: false -setprogress!(false) -``` - -## Samplers - -### Gibbs - -Gibbs sampling tends to exhibit a "jittery" trajectory. The example below combines `HMC` and `PG` sampling to traverse the posterior. - -```{julia} -c = sample(model, Gibbs(HMC(0.01, 5, :s²), PG(20, :m)), 1000) -plot_sampler(c) -``` - -### HMC - -Hamiltonian Monte Carlo (HMC) sampling is a typical sampler to use, as it tends to be fairly good at converging in a efficient manner. It can often be tricky to set the correct parameters for this sampler however, and the `NUTS` sampler is often easier to run if you don't want to spend too much time fiddling with step size and and the number of steps to take. Note however that `HMC` does not explore the positive values μ very well, likely due to the leapfrog and step size parameter settings. 
- -```{julia} -c = sample(model, HMC(0.01, 10), 1000) -plot_sampler(c) -``` - -### HMCDA - -The HMCDA sampler is an implementation of the Hamiltonian Monte Carlo with Dual Averaging algorithm found in the paper "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo" by Hoffman and Gelman (2011). The paper can be found on [arXiv](https://arxiv.org/abs/1111.4246) for the interested reader. - -```{julia} -c = sample(model, HMCDA(200, 0.65, 0.3), 1000) -plot_sampler(c) -``` - -### MH - -Metropolis-Hastings (MH) sampling is one of the earliest Markov Chain Monte Carlo methods. MH sampling does not "move" a lot, unlike many of the other samplers implemented in Turing. Typically a much longer chain is required to converge to an appropriate parameter estimate. - -The plot below only uses 1,000 iterations of Metropolis-Hastings. - -```{julia} -c = sample(model, MH(), 1000) -plot_sampler(c) -``` - -As you can see, the MH sampler doesn't move parameter estimates very often. - -### NUTS - -The No U-Turn Sampler (NUTS) is an implementation of the algorithm found in the paper "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo" by Hoffman and Gelman (2011). The paper can be found on [arXiv](https://arxiv.org/abs/1111.4246) for the interested reader. - -NUTS tends to be very good at traversing complex posteriors quickly. - - -```{julia} -c = sample(model, NUTS(0.65), 1000) -plot_sampler(c) -``` - -The only parameter that needs to be set other than the number of iterations to run is the target acceptance rate. In the Hoffman and Gelman paper, they note that a target acceptance rate of 0.65 is typical. - -Here is a plot showing a very high acceptance rate. Note that it appears to "stick" to a mode and is not particularly good at exploring the posterior as compared to the 0.65 target acceptance ratio case. 
- -```{julia} -c = sample(model, NUTS(0.95), 1000) -plot_sampler(c) -``` - -An exceptionally low acceptance rate will show very few moves on the posterior: - -```{julia} -c = sample(model, NUTS(0.2), 1000) -plot_sampler(c) -``` - -### PG - -The Particle Gibbs (PG) sampler is an implementation of an algorithm from the paper "Particle Markov chain Monte Carlo methods" by Andrieu, Doucet, and Holenstein (2010). The interested reader can learn more [here](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9868.2009.00736.x). - -The two parameters are the number of particles, and the number of iterations. The plot below shows the use of 20 particles. - -```{julia} -c = sample(model, PG(20), 1000) -plot_sampler(c) -``` - -Next, we plot using 50 particles. - -```{julia} -c = sample(model, PG(50), 1000) -plot_sampler(c) -``` +--- +title: Sampler Visualization +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +## Introduction + +### The Code + +For each sampler, we will use the same code to plot sampler paths. The block below loads the relevant libraries and defines a function for plotting the sampler's trajectory across the posterior. + +The Turing model definition used here is not especially practical, but it is designed in such a way as to produce visually interesting posterior surfaces to show how different samplers move along the distribution. + +```{julia} +ENV["GKS_ENCODING"] = "utf-8" # Allows the use of unicode characters in Plots.jl +using Plots +using StatsPlots +using Turing +using Random +using Bijectors + +# Set a seed. +Random.seed!(0) + +# Define a strange model. +@model function gdemo(x) + s² ~ InverseGamma(2, 3) + m ~ Normal(0, sqrt(s²)) + bumps = sin(m) + cos(m) + m = m + 5 * bumps + for i in eachindex(x) + x[i] ~ Normal(m, sqrt(s²)) + end + return s², m +end + +# Define our data points. +x = [1.5, 2.0, 13.0, 2.1, 0.0] + +# Set up the model call, sample from the prior. 
+model = gdemo(x) + +# Evaluate surface at coordinates. +evaluate(m1, m2) = logjoint(model, (m=m2, s²=invlink.(Ref(InverseGamma(2, 3)), m1))) + +function plot_sampler(chain; label="") + # Extract values from chain. + val = get(chain, [:s², :m, :lp]) + ss = link.(Ref(InverseGamma(2, 3)), val.s²) + ms = val.m + lps = val.lp + + # How many surface points to sample. + granularity = 100 + + # Range start/stop points. + spread = 0.5 + σ_start = minimum(ss) - spread * std(ss) + σ_stop = maximum(ss) + spread * std(ss) + μ_start = minimum(ms) - spread * std(ms) + μ_stop = maximum(ms) + spread * std(ms) + σ_rng = collect(range(σ_start; stop=σ_stop, length=granularity)) + μ_rng = collect(range(μ_start; stop=μ_stop, length=granularity)) + + # Make surface plot. + p = surface( + σ_rng, + μ_rng, + evaluate; + camera=(30, 65), + # ticks=nothing, + colorbar=false, + color=:inferno, + title=label, + ) + + line_range = 1:length(ms) + + scatter3d!( + ss[line_range], + ms[line_range], + lps[line_range]; + mc=:viridis, + marker_z=collect(line_range), + msw=0, + legend=false, + colorbar=false, + alpha=0.5, + xlabel="σ", + ylabel="μ", + zlabel="Log probability", + title=label, + ) + + return p +end; +``` + +```{julia} +#| output: false +setprogress!(false) +``` + +## Samplers + +### Gibbs + +Gibbs sampling tends to exhibit a "jittery" trajectory. The example below combines `HMC` and `PG` sampling to traverse the posterior. + +```{julia} +c = sample(model, Gibbs(HMC(0.01, 5, :s²), PG(20, :m)), 1000) +plot_sampler(c) +``` + +### HMC + +Hamiltonian Monte Carlo (HMC) sampling is a typical sampler to use, as it tends to be fairly good at converging in an efficient manner. It can often be tricky to set the correct parameters for this sampler, however, and the `NUTS` sampler is often easier to run if you don't want to spend too much time fiddling with step size and the number of steps to take.
Note however that `HMC` does not explore the positive values μ very well, likely due to the leapfrog and step size parameter settings. + +```{julia} +c = sample(model, HMC(0.01, 10), 1000) +plot_sampler(c) +``` + +### HMCDA + +The HMCDA sampler is an implementation of the Hamiltonian Monte Carlo with Dual Averaging algorithm found in the paper "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo" by Hoffman and Gelman (2011). The paper can be found on [arXiv](https://arxiv.org/abs/1111.4246) for the interested reader. + +```{julia} +c = sample(model, HMCDA(200, 0.65, 0.3), 1000) +plot_sampler(c) +``` + +### MH + +Metropolis-Hastings (MH) sampling is one of the earliest Markov Chain Monte Carlo methods. MH sampling does not "move" a lot, unlike many of the other samplers implemented in Turing. Typically a much longer chain is required to converge to an appropriate parameter estimate. + +The plot below only uses 1,000 iterations of Metropolis-Hastings. + +```{julia} +c = sample(model, MH(), 1000) +plot_sampler(c) +``` + +As you can see, the MH sampler doesn't move parameter estimates very often. + +### NUTS + +The No U-Turn Sampler (NUTS) is an implementation of the algorithm found in the paper "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo" by Hoffman and Gelman (2011). The paper can be found on [arXiv](https://arxiv.org/abs/1111.4246) for the interested reader. + +NUTS tends to be very good at traversing complex posteriors quickly. + + +```{julia} +c = sample(model, NUTS(0.65), 1000) +plot_sampler(c) +``` + +The only parameter that needs to be set other than the number of iterations to run is the target acceptance rate. In the Hoffman and Gelman paper, they note that a target acceptance rate of 0.65 is typical. + +Here is a plot showing a very high acceptance rate. 
Note that it appears to "stick" to a mode and is not particularly good at exploring the posterior as compared to the 0.65 target acceptance ratio case. + +```{julia} +c = sample(model, NUTS(0.95), 1000) +plot_sampler(c) +``` + +An exceptionally low acceptance rate will show very few moves on the posterior: + +```{julia} +c = sample(model, NUTS(0.2), 1000) +plot_sampler(c) +``` + +### PG + +The Particle Gibbs (PG) sampler is an implementation of an algorithm from the paper "Particle Markov chain Monte Carlo methods" by Andrieu, Doucet, and Holenstein (2010). The interested reader can learn more [here](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9868.2009.00736.x). + +The two parameters are the number of particles, and the number of iterations. The plot below shows the use of 20 particles. + +```{julia} +c = sample(model, PG(20), 1000) +plot_sampler(c) +``` + +Next, we plot using 50 particles. + +```{julia} +c = sample(model, PG(50), 1000) +plot_sampler(c) +``` diff --git a/tutorials/docs-16-using-turing-external-samplers/index.qmd b/tutorials/docs-16-using-turing-external-samplers/index.qmd index 93a91b63f..a5362bed9 100755 --- a/tutorials/docs-16-using-turing-external-samplers/index.qmd +++ b/tutorials/docs-16-using-turing-external-samplers/index.qmd @@ -1,179 +1,179 @@ ---- -title: Using External Samplers -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -## Using External Samplers on Turing Models - -`Turing` provides several wrapped samplers from external sampling libraries, e.g., HMC samplers from `AdvancedHMC`. -These wrappers allow new users to seamlessly sample statistical models without leaving `Turing` -However, these wrappers might only sometimes be complete, missing some functionality from the wrapped sampling library. -Moreover, users might want to use samplers currently not wrapped within `Turing`. 
- -For these reasons, `Turing` also makes running external samplers on Turing models easy without any necessary modifications or wrapping! -Throughout, we will use a 10-dimensional Neal's funnel as a running example:: - -```{julia} -# Import libraries. -using Turing, Random, LinearAlgebra - -d = 10 -@model function funnel() - θ ~ Truncated(Normal(0, 3), -3, 3) - z ~ MvNormal(zeros(d - 1), exp(θ) * I) - return x ~ MvNormal(z, I) -end -``` - -Now we sample the model to generate some observations, which we can then condition on. - -```{julia} -(; x) = rand(funnel() | (θ=0,)) -model = funnel() | (; x); -``` - -Users can use any sampler algorithm to sample this model if it follows the `AbstractMCMC` API. -Before discussing how this is done in practice, giving a high-level description of the process is interesting. -Imagine that we created an instance of an external sampler that we will call `spl` such that `typeof(spl)<:AbstractMCMC.AbstractSampler`. -In order to avoid type ambiguity within Turing, at the moment it is necessary to declare `spl` as an external sampler to Turing `espl = externalsampler(spl)`, where `externalsampler(s::AbstractMCMC.AbstractSampler)` is a Turing function that types our external sampler adequately. - -An excellent point to start to show how this is done in practice is by looking at the sampling library `AdvancedMH` ([`AdvancedMH`'s GitHub](https://github.com/TuringLang/AdvancedMH.jl)) for Metropolis-Hastings (MH) methods. -Let's say we want to use a random walk Metropolis-Hastings sampler without specifying the proposal distributions. -The code below constructs an MH sampler using a multivariate Gaussian distribution with zero mean and unit variance in `d` dimensions as a random walk proposal. 
- -```{julia} -# Importing the sampling library -using AdvancedMH -rwmh = AdvancedMH.RWMH(d) -``` - -```{julia} -#| output: false -setprogress!(false) -``` - -Sampling is then as easy as: - - -```{julia} -chain = sample(model, externalsampler(rwmh), 10_000) -``` - -## Going beyond the Turing API - -As previously mentioned, the Turing wrappers can often limit the capabilities of the sampling libraries they wrap. -`AdvancedHMC`[^1] ([`AdvancedHMC`'s GitHub](https://github.com/TuringLang/AdvancedHMC.jl)) is a clear example of this. A common practice when performing HMC is to provide an initial guess for the mass matrix. -However, the native HMC sampler within Turing only allows the user to specify the type of the mass matrix despite the two options being possible within `AdvancedHMC`. -Thankfully, we can use Turing's support for external samplers to define an HMC sampler with a custom mass matrix in `AdvancedHMC` and then use it to sample our Turing model. - -We will use the library `Pathfinder`[^2] ((`Pathfinder`'s GitHub)[https://github.com/mlcolab/Pathfinder.jl]) to construct our estimate of mass matrix. -`Pathfinder` is a variational inference algorithm that first finds the maximum a posteriori (MAP) estimate of a target posterior distribution and then uses the trace of the optimization to construct a sequence of multivariate normal approximations to the target distribution. -In this process, `Pathfinder` computes an estimate of the mass matrix the user can access. - -The code below shows this can be done in practice. - -```{julia} -using AdvancedHMC, Pathfinder -# Running pathfinder -draws = 1_000 -result_multi = multipathfinder(model, draws; nruns=8) - -# Estimating the metric -inv_metric = result_multi.pathfinder_results[1].fit_distribution.Σ -metric = DenseEuclideanMetric(Matrix(inv_metric)) - -# Creating an AdvancedHMC NUTS sampler with the custom metric. 
-n_adapts = 1000 # Number of adaptation steps -tap = 0.9 # Large target acceptance probability to deal with the funnel structure of the posterior -nuts = AdvancedHMC.NUTS(tap; metric=metric) - -# Sample -chain = sample(model, externalsampler(nuts), 10_000; n_adapts=1_000) -``` - -## Using new inference methods - -So far we have used Turing's support for external samplers to go beyond the capabilities of the wrappers. -We want to use this support to employ a sampler not supported within Turing's ecosystem yet. -We will use the recently developed Micro-Cannoncial Hamiltonian Monte Carlo (MCHMC) sampler to showcase this. -MCHMC[[^3],[^4]] ((MCHMC's GitHub)[https://github.com/JaimeRZP/MicroCanonicalHMC.jl]) is HMC sampler that uses one single Hamiltonian energy level to explore the whole parameter space. -This is achieved by simulating the dynamics of a microcanonical Hamiltonian with an additional noise term to ensure ergodicity. - -Using this as well as other inference methods outside the Turing ecosystem is as simple as executing the code shown below: - -```{julia} -using MicroCanonicalHMC -# Create MCHMC sampler -n_adapts = 1_000 # adaptation steps -tev = 0.01 # target energy variance -mchmc = MCHMC(n_adapts, tev; adaptive=true) - -# Sample -chain = sample(model, externalsampler(mchmc), 10_000) -``` - -The only requirement to work with `externalsampler` is that the provided `sampler` must implement the AbstractMCMC.jl-interface [INSERT LINK] for a `model` of type `AbstractMCMC.LogDensityModel` [INSERT LINK]. - -As previously stated, in order to use external sampling libraries within `Turing` they must follow the `AbstractMCMC` API. -In this section, we will briefly dwell on what this entails. -First and foremost, the sampler should be a subtype of `AbstractMCMC.AbstractSampler`. 
-Second, the stepping function of the MCMC algorithm must be made defined using `AbstractMCMC.step` and follow the structure below: - -```{julia} -#| eval: false -# First step -function AbstractMCMC.step{T<:AbstractMCMC.AbstractSampler}( - rng::Random.AbstractRNG, - model::AbstractMCMC.LogDensityModel, - spl::T; - kwargs..., -) - [...] - return transition, sample -end - -# N+1 step -function AbstractMCMC.step{T<:AbstractMCMC.AbstractSampler}( - rng::Random.AbstractRNG, - model::AbstractMCMC.LogDensityModel, - sampler::T, - state; - kwargs..., -) - [...] - return transition, sample -end -``` - -There are several characteristics to note in these functions: - - - There must be two `step` functions: - - + A function that performs the first step and initializes the sampler. - + A function that performs the following steps and takes an extra input, `state`, which carries the initialization information. - - - The functions must follow the displayed signatures. - - The output of the functions must be a transition, the current state of the sampler, and a sample, what is saved to the MCMC chain. - -The last requirement is that the transition must be structured with a field `θ`, which contains the values of the parameters of the model for said transition. -This allows `Turing` to seamlessly extract the parameter values at each step of the chain when bundling the chains. -Note that if the external sampler produces transitions that Turing cannot parse, the bundling of the samples will be different or fail. 
- -For practical examples of how to adapt a sampling library to the `AbstractMCMC` interface, the readers can consult the following libraries: - - - [AdvancedMH](https://github.com/TuringLang/AdvancedMH.jl/blob/458a602ac32a8514a117d4c671396a9ba8acbdab/src/mh-core.jl#L73-L115) - - [AdvancedHMC](https://github.com/TuringLang/AdvancedHMC.jl/blob/762e55f894d142495a41a6eba0eed9201da0a600/src/abstractmcmc.jl#L102-L170) - - [MicroCanonicalHMC](https://github.com/JaimeRZP/MicroCanonicalHMC.jl/blob/master/src/abstractmcmc.jl) - - -[^1]: Xu et al., [AdvancedHMC.jl: A robust, modular and efficient implementation of advanced HMC algorithms](http://proceedings.mlr.press/v118/xu20a/xu20a.pdf), 2019 -[^2]: Zhang et al., [Pathfinder: Parallel quasi-Newton variational inference](https://arxiv.org/abs/2108.03782), 2021 -[^3]: Robnik et al, [Microcanonical Hamiltonian Monte Carlo](https://arxiv.org/abs/2212.08549), 2022 -[^4]: Robnik and Seljak, [Langevine Hamiltonian Monte Carlo](https://arxiv.org/abs/2303.18221), 2023 +--- +title: Using External Samplers +engine: julia +--- + +```{julia} +#| echo: false +#| output: false +using Pkg; +Pkg.instantiate(); +``` + +## Using External Samplers on Turing Models + +`Turing` provides several wrapped samplers from external sampling libraries, e.g., HMC samplers from `AdvancedHMC`. +These wrappers allow new users to seamlessly sample statistical models without leaving `Turing`. +However, these wrappers might not always be complete, and may be missing some functionality from the wrapped sampling library. +Moreover, users might want to use samplers that are currently not wrapped within `Turing`. + +For these reasons, `Turing` also makes it easy to run external samplers on Turing models, without any modifications or wrapping! +Throughout, we will use a 10-dimensional Neal's funnel as a running example: + +```{julia} +# Import libraries.
+using Turing, Random, LinearAlgebra + +d = 10 +@model function funnel() + θ ~ Truncated(Normal(0, 3), -3, 3) + z ~ MvNormal(zeros(d - 1), exp(θ) * I) + return x ~ MvNormal(z, I) +end +``` + +Now we sample the model to generate some observations, which we can then condition on. + +```{julia} +(; x) = rand(funnel() | (θ=0,)) +model = funnel() | (; x); +``` + +Users can use any sampling algorithm to sample this model if it follows the `AbstractMCMC` API. +Before discussing how this is done in practice, it is helpful to give a high-level description of the process. +Imagine that we created an instance of an external sampler that we will call `spl` such that `typeof(spl)<:AbstractMCMC.AbstractSampler`. +In order to avoid type ambiguity within Turing, at the moment it is necessary to declare `spl` as an external sampler to Turing via `espl = externalsampler(spl)`, where `externalsampler(s::AbstractMCMC.AbstractSampler)` is a Turing function that gives our external sampler the appropriate type. + +An excellent place to start showing how this is done in practice is the sampling library `AdvancedMH` ([`AdvancedMH`'s GitHub](https://github.com/TuringLang/AdvancedMH.jl)) for Metropolis-Hastings (MH) methods. +Let's say we want to use a random walk Metropolis-Hastings sampler without specifying the proposal distributions. +The code below constructs an MH sampler using a multivariate Gaussian distribution with zero mean and unit variance in `d` dimensions as a random walk proposal. + +```{julia} +# Importing the sampling library +using AdvancedMH +rwmh = AdvancedMH.RWMH(d) +``` + +```{julia} +#| output: false +setprogress!(false) +``` + +Sampling is then as easy as: + + +```{julia} +chain = sample(model, externalsampler(rwmh), 10_000) +``` + +## Going beyond the Turing API + +As previously mentioned, the Turing wrappers can often limit the capabilities of the sampling libraries they wrap.
+`AdvancedHMC`[^1] ([`AdvancedHMC`'s GitHub](https://github.com/TuringLang/AdvancedHMC.jl)) is a clear example of this. A common practice when performing HMC is to provide an initial guess for the mass matrix. +However, the native HMC sampler within Turing only allows the user to specify the type of the mass matrix, even though both options are possible within `AdvancedHMC`. +Thankfully, we can use Turing's support for external samplers to define an HMC sampler with a custom mass matrix in `AdvancedHMC` and then use it to sample our Turing model. + +We will use the library `Pathfinder`[^2] ([`Pathfinder`'s GitHub](https://github.com/mlcolab/Pathfinder.jl)) to construct our estimate of the mass matrix. +`Pathfinder` is a variational inference algorithm that first finds the maximum a posteriori (MAP) estimate of a target posterior distribution and then uses the trace of the optimization to construct a sequence of multivariate normal approximations to the target distribution. +In this process, `Pathfinder` computes an estimate of the mass matrix, which the user can access. + +The code below shows how this can be done in practice. + +```{julia} +using AdvancedHMC, Pathfinder +# Running pathfinder +draws = 1_000 +result_multi = multipathfinder(model, draws; nruns=8) + +# Estimating the metric +inv_metric = result_multi.pathfinder_results[1].fit_distribution.Σ +metric = DenseEuclideanMetric(Matrix(inv_metric)) + +# Creating an AdvancedHMC NUTS sampler with the custom metric. +n_adapts = 1_000 # Number of adaptation steps +tap = 0.9 # Large target acceptance probability to deal with the funnel structure of the posterior +nuts = AdvancedHMC.NUTS(tap; metric=metric) + +# Sample +chain = sample(model, externalsampler(nuts), 10_000; n_adapts=n_adapts) +``` + +## Using new inference methods + +So far we have used Turing's support for external samplers to go beyond the capabilities of the wrappers. +Now we want to use this support to employ a sampler that is not yet supported within Turing's ecosystem.
+We will use the recently developed Micro-Canonical Hamiltonian Monte Carlo (MCHMC) sampler to showcase this.
+MCHMC[^3][^4] ([MCHMC's GitHub](https://github.com/JaimeRZP/MicroCanonicalHMC.jl)) is an HMC sampler that uses a single Hamiltonian energy level to explore the whole parameter space.
+This is achieved by simulating the dynamics of a microcanonical Hamiltonian with an additional noise term to ensure ergodicity.
+
+Using this as well as other inference methods outside the Turing ecosystem is as simple as executing the code shown below:
+
+```{julia}
+using MicroCanonicalHMC
+# Create MCHMC sampler
+n_adapts = 1_000 # adaptation steps
+tev = 0.01 # target energy variance
+mchmc = MCHMC(n_adapts, tev; adaptive=true)
+
+# Sample
+chain = sample(model, externalsampler(mchmc), 10_000)
+```
+
+The only requirement to work with `externalsampler` is that the provided `sampler` must implement the AbstractMCMC.jl interface [INSERT LINK] for a `model` of type `AbstractMCMC.LogDensityModel` [INSERT LINK].
+
+As previously stated, in order to use external sampling libraries within `Turing` they must follow the `AbstractMCMC` API.
+In this section, we will briefly dwell on what this entails.
+First and foremost, the sampler should be a subtype of `AbstractMCMC.AbstractSampler`.
+Second, the stepping function of the MCMC algorithm must be defined using `AbstractMCMC.step` and follow the structure below:
+
+```{julia}
+#| eval: false
+# First step
+function AbstractMCMC.step(
+    rng::Random.AbstractRNG,
+    model::AbstractMCMC.LogDensityModel,
+    spl::T;
+    kwargs...,
+) where {T<:AbstractMCMC.AbstractSampler}
+    [...]
+    return transition, sample
+end
+
+# N+1 step
+function AbstractMCMC.step(
+    rng::Random.AbstractRNG,
+    model::AbstractMCMC.LogDensityModel,
+    sampler::T,
+    state;
+    kwargs...,
+) where {T<:AbstractMCMC.AbstractSampler}
+    [...]
+    return transition, sample
+end
+```
+
+There are several characteristics to note in these functions:
+
+  - There must be two `step` functions:
+
+      + A function that performs the first step and initializes the sampler.
+      + A function that performs the following steps and takes an extra input, `state`, which carries the initialization information.
+
+  - The functions must follow the displayed signatures.
+  - The output of the functions must be a transition (the current state of the sampler) and a sample (what is saved to the MCMC chain).
+
+The last requirement is that the transition must be structured with a field `θ`, which contains the values of the parameters of the model for said transition.
+This allows `Turing` to seamlessly extract the parameter values at each step of the chain when bundling the chains.
+Note that if the external sampler produces transitions that Turing cannot parse, the bundling of the samples will be different or fail.
+
+For practical examples of how to adapt a sampling library to the `AbstractMCMC` interface, readers can consult the following libraries:
+
+  - [AdvancedMH](https://github.com/TuringLang/AdvancedMH.jl/blob/458a602ac32a8514a117d4c671396a9ba8acbdab/src/mh-core.jl#L73-L115)
+  - [AdvancedHMC](https://github.com/TuringLang/AdvancedHMC.jl/blob/762e55f894d142495a41a6eba0eed9201da0a600/src/abstractmcmc.jl#L102-L170)
+  - [MicroCanonicalHMC](https://github.com/JaimeRZP/MicroCanonicalHMC.jl/blob/master/src/abstractmcmc.jl)
+
+
+[^1]: Xu et al., [AdvancedHMC.jl: A robust, modular and efficient implementation of advanced HMC algorithms](http://proceedings.mlr.press/v118/xu20a/xu20a.pdf), 2019
+[^2]: Zhang et al., [Pathfinder: Parallel quasi-Newton variational inference](https://arxiv.org/abs/2108.03782), 2021
+[^3]: Robnik et al., [Microcanonical Hamiltonian Monte Carlo](https://arxiv.org/abs/2212.08549), 2022
+[^4]: Robnik and Seljak, [Langevin Hamiltonian Monte Carlo](https://arxiv.org/abs/2303.18221), 2023
diff --git 
a/tutorials/docs-17-mode-estimation/index.qmd b/tutorials/docs-17-mode-estimation/index.qmd index f3eba5a6b..6319219a6 100755 --- a/tutorials/docs-17-mode-estimation/index.qmd +++ b/tutorials/docs-17-mode-estimation/index.qmd @@ -1,124 +1,124 @@ ---- -title: Mode Estimation -engine: julia ---- - -```{julia} -#| echo: false -#| output: false -using Pkg; -Pkg.instantiate(); -``` - -After defining a statistical model, in addition to sampling from its distributions, one may be interested in finding the parameter values that maximise for instance the posterior distribution density function or the likelihood. This is called mode estimation. Turing provides support for two mode estimation techniques, [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) (MLE) and [maximum a posterior](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation) (MAP) estimation. - -To demonstrate mode estimation, let us load Turing and declare a model: - -```{julia} -using Turing - -@model function gdemo(x) - s² ~ InverseGamma(2, 3) - m ~ Normal(0, sqrt(s²)) - - for i in eachindex(x) - x[i] ~ Normal(m, sqrt(s²)) - end -end -``` - -Once the model is defined, we can construct a model instance as we normally would: - -```{julia} -# Instantiate the gdemo model with our data. -data = [1.5, 2.0] -model = gdemo(data) -``` - -Finding the maximum aposteriori or maximum likelihood parameters is as simple as - -```{julia} -# Generate a MLE estimate. -mle_estimate = maximum_likelihood(model) - -# Generate a MAP estimate. -map_estimate = maximum_a_posteriori(model) -``` - -The estimates are returned as instances of the `ModeResult` type. It has the fields `values` for the parameter values found and `lp` for the log probability at the optimum, as well as `f` for the objective function and `optim_result` for more detailed results of the optimisation procedure. 
- -```{julia} -@show mle_estimate.values -@show mle_estimate.lp; -``` - -## Controlling the optimisation process - -Under the hood `maximum_likelihood` and `maximum_a_posteriori` use the [Optimization.jl](https://github.com/SciML/Optimization.jl) package, which provides a unified interface to many other optimisation packages. By default Turing typically uses the [LBFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) method from [Optim.jl](https://github.com/JuliaNLSolvers/Optim.jl) to find the mode estimate, but we can easily change that: - -```{julia} -using OptimizationOptimJL: NelderMead -@show maximum_likelihood(model, NelderMead()) - -using OptimizationNLopt: NLopt.LD_TNEWTON_PRECOND_RESTART -@show maximum_likelihood(model, LD_TNEWTON_PRECOND_RESTART()); -``` - -The above are just two examples, Optimization.jl supports [many more](https://docs.sciml.ai/Optimization/stable/). - -We can also help the optimisation by giving it a starting point we know is close to the final solution, or by specifying an automatic differentiation method - -```{julia} -using ADTypes: AutoReverseDiff -import ReverseDiff -maximum_likelihood( - model, NelderMead(); initial_params=[0.1, 2], adtype=AutoReverseDiff() -) -``` - -When providing values to arguments like `initial_params` the parameters are typically specified in the order in which they appear in the code of the model, so in this case first `s²` then `m`. More precisely it's the order returned by `Turing.Inference.getparams(model, Turing.VarInfo(model))`. - -We can also do constrained optimisation, by providing either intervals within which the parameters must stay, or costraint functions that they need to respect. 
For instance, here's how one can find the MLE with the constraint that the variance must be less than 0.01 and the mean must be between -1 and 1.: - -```{julia} -maximum_likelihood(model; lb=[0.0, -1.0], ub=[0.01, 1.0]) -``` - -The arguments for lower (`lb`) and upper (`ub`) bounds follow the arguments of `Optimization.OptimizationProblem`, as do other parameters for providing [constraints](https://docs.sciml.ai/Optimization/stable/tutorials/constraints/), such as `cons`. Any extraneous keyword arguments given to `maximum_likelihood` or `maximum_a_posteriori` are passed to `Optimization.solve`. Some often useful ones are `maxiters` for controlling the maximum number of iterations and `abstol` and `reltol` for the absolute and relative convergence tolerances: - -```{julia} -badly_converged_mle = maximum_likelihood( - model, NelderMead(); maxiters=10, reltol=1e-9 -) -``` - -We can check whether the optimisation converged using the `optim_result` field of the result: - -```{julia} -@show badly_converged_mle.optim_result; -``` - -For more details, such as a full list of possible arguments, we encourage the reader to read the docstring of the function `Turing.Optimisation.estimate_mode`, which is what `maximum_likelihood` and `maximum_a_posteriori` call, and the documentation of [Optimization.jl](https://docs.sciml.ai/Optimization/stable/). - -## Analyzing your mode estimate - -Turing extends several methods from `StatsBase` that can be used to analyze your mode estimation results. Methods implemented include `vcov`, `informationmatrix`, `coeftable`, `params`, and `coef`, among others. - -For example, let's examine our ML estimate from above using `coeftable`: - -```{julia} -using StatsBase: coeftable -coeftable(mle_estimate) -``` - -Standard errors are calculated from the Fisher information matrix (inverse Hessian of the log likelihood or log joint). 
- Note that standard errors calculated in this way may not always be appropriate for MAP estimates, so please be cautious in interpreting them.
-
-## Sampling with the MAP/MLE as initial states
-
-You can begin sampling your chain from an MLE/MAP estimate by extracting the vector of parameter values and providing it to the `sample` function with the keyword `initial_params`. For example, here is how to sample from the full posterior using the MAP estimate as the starting point:
-
-```{julia}
-#| eval: false
-map_estimate = maximum_a_posteriori(model)
-chain = sample(model, NUTS(), 1_000; initial_params=map_estimate.values.array)
-```
+---
+title: Mode Estimation
+engine: julia
+---
+
+```{julia}
+#| echo: false
+#| output: false
+using Pkg;
+Pkg.instantiate();
+```
+
+After defining a statistical model, in addition to sampling from its distributions, one may be interested in finding the parameter values that maximise, for instance, the posterior density function or the likelihood. This is called mode estimation. Turing provides support for two mode estimation techniques, [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) (MLE) and [maximum a posteriori](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation) (MAP) estimation.
+
+To demonstrate mode estimation, let us load Turing and declare a model:
+
+```{julia}
+using Turing
+
+@model function gdemo(x)
+    s² ~ InverseGamma(2, 3)
+    m ~ Normal(0, sqrt(s²))
+
+    for i in eachindex(x)
+        x[i] ~ Normal(m, sqrt(s²))
+    end
+end
+```
+
+Once the model is defined, we can construct a model instance as we normally would:
+
+```{julia}
+# Instantiate the gdemo model with our data.
+data = [1.5, 2.0]
+model = gdemo(data)
+```
+
+Finding the maximum a posteriori or maximum likelihood parameters is as simple as:
+
+```{julia}
+# Generate an MLE estimate.
+mle_estimate = maximum_likelihood(model)
+
+# Generate a MAP estimate.
+map_estimate = maximum_a_posteriori(model)
+```
+
+The estimates are returned as instances of the `ModeResult` type. It has the fields `values` for the parameter values found and `lp` for the log probability at the optimum, as well as `f` for the objective function and `optim_result` for more detailed results of the optimisation procedure.
+
+```{julia}
+@show mle_estimate.values
+@show mle_estimate.lp;
+```
+
+## Controlling the optimisation process
+
+Under the hood `maximum_likelihood` and `maximum_a_posteriori` use the [Optimization.jl](https://github.com/SciML/Optimization.jl) package, which provides a unified interface to many other optimisation packages. By default Turing typically uses the [LBFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) method from [Optim.jl](https://github.com/JuliaNLSolvers/Optim.jl) to find the mode estimate, but we can easily change that:
+
+```{julia}
+using OptimizationOptimJL: NelderMead
+@show maximum_likelihood(model, NelderMead())
+
+using OptimizationNLopt: NLopt.LD_TNEWTON_PRECOND_RESTART
+@show maximum_likelihood(model, LD_TNEWTON_PRECOND_RESTART());
+```
+
+The above are just two examples; Optimization.jl supports [many more](https://docs.sciml.ai/Optimization/stable/).
+
+We can also help the optimisation by giving it a starting point we know is close to the final solution, or by specifying an automatic differentiation method:
+
+```{julia}
+using ADTypes: AutoReverseDiff
+import ReverseDiff
+maximum_likelihood(
+    model, NelderMead(); initial_params=[0.1, 2], adtype=AutoReverseDiff()
+)
+```
+
+When providing values to arguments like `initial_params`, the parameters are typically specified in the order in which they appear in the code of the model, so in this case first `s²`, then `m`. More precisely, it's the order returned by `Turing.Inference.getparams(model, Turing.VarInfo(model))`.
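+
+If in doubt, this order can be inspected directly. Below is a minimal sketch using the `getparams` call mentioned above (it assumes the `model` defined earlier is in scope; the exact form of the returned value may vary between Turing versions):
+
+```{julia}
+#| eval: false
+# List the parameters in the order Turing uses for `initial_params`.
+# For the gdemo model above, s² should come first and m second.
+Turing.Inference.getparams(model, Turing.VarInfo(model))
+```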
+
+We can also do constrained optimisation, by providing either intervals within which the parameters must stay, or constraint functions that they need to respect. For instance, here's how one can find the MLE with the constraint that the variance must be less than 0.01 and the mean must be between -1 and 1:
+
+```{julia}
+maximum_likelihood(model; lb=[0.0, -1.0], ub=[0.01, 1.0])
+```
+
+The arguments for lower (`lb`) and upper (`ub`) bounds follow the arguments of `Optimization.OptimizationProblem`, as do other parameters for providing [constraints](https://docs.sciml.ai/Optimization/stable/tutorials/constraints/), such as `cons`. Any extraneous keyword arguments given to `maximum_likelihood` or `maximum_a_posteriori` are passed to `Optimization.solve`. Commonly useful ones are `maxiters` for controlling the maximum number of iterations, and `abstol` and `reltol` for the absolute and relative convergence tolerances:
+
+```{julia}
+badly_converged_mle = maximum_likelihood(
+    model, NelderMead(); maxiters=10, reltol=1e-9
+)
+```
+
+We can check whether the optimisation converged using the `optim_result` field of the result:
+
+```{julia}
+@show badly_converged_mle.optim_result;
+```
+
+For more details, such as a full list of possible arguments, we encourage the reader to read the docstring of the function `Turing.Optimisation.estimate_mode`, which is what `maximum_likelihood` and `maximum_a_posteriori` call, and the documentation of [Optimization.jl](https://docs.sciml.ai/Optimization/stable/).
+
+## Analyzing your mode estimate
+
+Turing extends several methods from `StatsBase` that can be used to analyze your mode estimation results. Methods implemented include `vcov`, `informationmatrix`, `coeftable`, `params`, and `coef`, among others.
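+
+These extractors are all called the same way, directly on the `ModeResult`. A minimal sketch, assuming the `mle_estimate` computed above is in scope and using only the methods listed above:
+
+```{julia}
+#| eval: false
+using StatsBase: coef, vcov, informationmatrix
+coef(mle_estimate)              # estimated parameter values
+vcov(mle_estimate)              # variance-covariance matrix of the estimates
+informationmatrix(mle_estimate) # the information matrix at the optimum
+```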
+ +For example, let's examine our ML estimate from above using `coeftable`: + +```{julia} +using StatsBase: coeftable +coeftable(mle_estimate) +``` + +Standard errors are calculated from the Fisher information matrix (inverse Hessian of the log likelihood or log joint). Note that standard errors calculated in this way may not always be appropriate for MAP estimates, so please be cautious in interpreting them. + +## Sampling with the MAP/MLE as initial states + +You can begin sampling your chain from an MLE/MAP estimate by extracting the vector of parameter values and providing it to the `sample` function with the keyword `initial_params`. For example, here is how to sample from the full posterior using the MAP estimate as the starting point: + +```{julia} +#| eval: false +map_estimate = maximum_a_posteriori(model) +chain = sample(model, NUTS(), 1_000; initial_params=map_estimate.values.array) +```
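+
+The same pattern works if you prefer to start from the MLE instead. A minimal sketch, assuming the `model` defined above:
+
+```{julia}
+#| eval: false
+mle_estimate = maximum_likelihood(model)
+chain = sample(model, NUTS(), 1_000; initial_params=mle_estimate.values.array)
+```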