diff --git a/docs/make.jl b/docs/make.jl index fa1bd451..b209f699 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -19,12 +19,9 @@ makedocs( "Defining (PO)MDP Models" => [ "def_pomdp.md", - "static.md", "interfaces.md", - "dynamics.md", ], - "Writing Solvers" => [ "def_solver.md", "offline_solver.md", diff --git a/docs/src/api.md b/docs/src/api.md index e021065b..5b1d52ae 100644 --- a/docs/src/api.md +++ b/docs/src/api.md @@ -59,6 +59,14 @@ convert_a convert_o ``` +### Type Inference + +```@docs +statetype +actiontype +obstype +``` + ### Distributions and Spaces ```@docs @@ -93,21 +101,3 @@ value Simulator simulate ``` - -## Other - -The following functions are not part of the API for specifying and solving POMDPs, but are included in the package. - -### Type Inference - -```@docs -statetype -actiontype -obstype -``` - -### Utility Tools - -```@docs -add_registry -``` diff --git a/docs/src/concepts.md b/docs/src/concepts.md index 87de57eb..64ab5feb 100644 --- a/docs/src/concepts.md +++ b/docs/src/concepts.md @@ -24,31 +24,26 @@ The code components of the POMDPs.jl ecosystem relevant to problems and solvers An MDP is a mathematical framework for sequential decision making under uncertainty, and where all of the uncertainty arises from outcomes that are partially random and partially under the control of a decision -maker. Mathematically, an MDP is a tuple (S,A,T,R), where S is the state -space, A is the action space, T is a transition function defining the +maker. Mathematically, an MDP is a tuple ``(S,A,T,R,\gamma)``, where ``S`` is the state +space, ``A`` is the action space, ``T`` is a transition function defining the probability of transitioning to each state given the state and action at -the previous time, and R is a reward function mapping every possible -transition (s,a,s') to a real reward value. For more information see a +the previous time, and ``R`` is a reward function mapping every possible +transition ``(s,a,s')`` to a real reward value. Finally, ``\gamma`` is a discount factor that defines the relative weighting of current and future rewards. +For more information see a textbook such as \[1\]. In POMDPs.jl an MDP is represented by a concrete subtype of the [`MDP`](@ref) abstract type and a set of methods that -define each of its components. S and A are defined by implementing -[`states`](@ref) and [`actions`](@ref) for your specific [`MDP`](@ref) -subtype. R is by implementing [`reward`](@ref), and T is defined by implementing [`transition`](@ref) if the [*explicit*](@ref defining_pomdps) interface is used or [`gen`](@ref) if the [*generative*](@ref defining_pomdps) interface is used. +define each of its components as described in the [problem definition section](@ref defining_pomdps). A POMDP is a more general sequential decision making problem in which the agent is not sure what state they are in. The state is only partially observable by the decision making agent. Mathematically, a -POMDP is a tuple (S,A,T,R,O,Z) where S, A, T, and R are the same as with -MDPs, Z is the agent's observation space, and O defines the probability +POMDP is a tuple ``(S,A,T,R,O,Z,\gamma)`` where ``S``, ``A``, ``T``, ``R``, and ``\gamma`` have the same meaning as in an MDP, ``Z`` is the agent's observation space, and ``O`` defines the probability of receiving each observation at a transition. 
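+
+In this notation, the observation model can be written as the conditional distribution below: the probability of receiving observation ``o`` when action ``a`` causes a transition from ``s`` to ``s'`` (a standard formalization of the description above; many problems condition the observation only on ``a`` and ``s'``):
+```math
+O(o \mid s, a, s')
+```
+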
In POMDPs.jl, a POMDP is represented by a concrete subtype of the [`POMDP`](@ref) abstract type, -`Z` may be defined by the [`observations`](@ref) function (though an -explicit definition is often not required), and `O` is defined by -implementing [`observation`](@ref) if the [*explicit*](@ref defining_pomdps) interface is used or [`gen`](@ref) if the [*generative*](@ref defining_pomdps) interface is used. +and the methods described in the [problem definition section](@ref defining_pomdps). POMDPs.jl contains additional functions for defining optional problem behavior -such as a [discount factor](@ref Discount-Factor) or a set of [terminal states](@ref Terminal-States). - +such as an [initial state distribution](@ref Initial-state-distribution) or [terminal states](@ref Terminal-states). More information can be found in the [Defining POMDPs](@ref defining_pomdps) section. ## Beliefs and Updaters diff --git a/docs/src/def_pomdp.md b/docs/src/def_pomdp.md index cb7127a9..328a9fa4 100644 --- a/docs/src/def_pomdp.md +++ b/docs/src/def_pomdp.md @@ -1,47 +1,431 @@ # [Defining POMDPs and MDPs](@id defining_pomdps) -## Consider starting with one of these packages +As described in the [Concepts and Architecture](@ref) section, an MDP is defined by the state space, action space, transition distributions, reward function, and discount factor, ``(S,A,T,R,\gamma)``. A POMDP also includes the observation space, and observation probability distributions, for a definition of ``(S,A,T,R,O,Z,\gamma)``. A problem definition in POMDPs.jl consists of an implicit or explicit definition of each of these elements. -Since POMDPs.jl was designed with performance and flexibility as first priorities, the interface is larger than needed to express most simple problems. For this reason, several packages and tools have been created to help users implement problems quickly. It is often easiest for new users to start with one of these. +It is possible to define a (PO)MDP with a more traditional [object-oriented approach](@ref Object-oriented) in which the user defines a new type to represent the (PO)MDP and methods of [interface functions](@ref API-Documentation) to define the tuple elements. However, the [QuickPOMDPs package](https://github.com/JuliaPOMDP/QuickPOMDPs.jl) provides a more concise way to get started, using keyword arguments instead of new types and methods. Essentially each keyword argument defines a corresponding [POMDPs api function](@ref API-Documentation). Since the important concepts are the same for the object oriented approach and the QuickPOMDP approach, we will use the latter for this discussion. -- [QuickPOMDPs.jl](https://github.com/JuliaPOMDP/QuickPOMDPs.jl) provides structures for concisely defining simple POMDPs without object-oriented programming. -- [POMDPExamples.jl](https://github.com/JuliaPOMDP/POMDPExamples.jl) provides tutorials for defining problems. -- [The Tabular(PO)MDP model](https://github.com/JuliaPOMDP/POMDPExamples.jl/blob/master/notebooks/Defining-a-tabular-POMDP.ipynb) from [POMDPModels.jl](https://github.com/JuliaPOMDP/POMDPModels.jl) allows users to define POMDPs with matrices for the transitions, observations and rewards. -- The [`gen`](@ref) function is the easiest way to wrap a pre-existing simulator from another project or written in another programming language so that it can be used with POMDPs.jl solvers and simulators. 
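+
+For example, the correspondence is direct: a QuickPOMDP keyword such as `discount` plays the same role as a method of the matching interface function defined on a problem type (a minimal sketch of the correspondence; `MyPOMDP` is a hypothetical placeholder type used only for illustration):
+```julia
+import POMDPs
+
+# hypothetical placeholder problem type, for illustration only
+struct MyPOMDP <: POMDPs.POMDP{String, String, String} end
+
+# object-oriented counterpart of passing `discount = 0.95` to QuickPOMDP
+POMDPs.discount(m::MyPOMDP) = 0.95
+```
+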
+This guide has three parts: First, it explains a very simple example (the Tiger POMDP), then uses a more complex example to illustrate the broader capabilities of the interface. Finally, some alternative ways of defining (PO)MDPs are discussed. -## Overview +!!! note + This guide assumes that you are comfortable programming in Julia, especially familiar with various ways of defining [*anonymous functions*](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions). Users should consult the [Julia documentation](https://docs.julialang.org) to learn more about programming in Julia. + +## [A Basic Example: The Tiger POMDP](@id tiger) + +In the first section of this guide, we will explain a QuickPOMDP implementation of a very simple problem: the classic Tiger POMDP\[1\]. In the tiger POMDP, the agent is tasked with escaping from a room. There are two doors leading out of the room. Behind one of the doors is a tiger, and behind the other is sweet, sweet freedom. If the agent opens the door and finds the tiger, it gets eaten (and receives a reward of -100). If the agent opens the other door, it escapes and receives a reward of 10. The agent can also listen. Listening gives a noisy measurement of which door the tiger is hiding behind. Listening gives the agent the correct location of the tiger 85% of the time. The agent receives a reward of -1 for listening. The complete implementation looks like this: + +```jldoctest tiger; output=false, filter=r"QuickPOMDP.*" +using QuickPOMDPs: QuickPOMDP +using POMDPModelTools: Deterministic, Uniform, SparseCat + +m = QuickPOMDP( + states = ["left", "right"], + actions = ["left", "right", "listen"], + observations = ["left", "right"], + discount = 0.95, + + transition = function (s, a) + if a == "listen" + return Deterministic(s) # tiger stays behind the same door + else # a door is opened + return Uniform(["left", "right"]) # reset + end + end, + + observation = function (a, sp) + if a == "listen" + if sp == "left" + return SparseCat(["left", "right"], [0.85, 0.15]) # sparse categorical + else + return SparseCat(["right", "left"], [0.85, 0.15]) + end + else + return Uniform(["left", "right"]) + end + end, + + reward = function (s, a) + if a == "listen" + return -1.0 + elseif s == a # the tiger was found + return -100.0 + else # the tiger was escaped + return 10.0 + end + end, + + initialstate = Uniform(["left", "right"]), +); + +# output +QuickPOMDP +``` + +The next sections explain how each of the elements of the POMDP tuple are defined in this implementation: + +#### State, action and observation spaces + +In this example, each state, action, and observation is a `String`. The state, action and observation spaces (``S``, ``A``, and ``O``), are defined with the `states`, `actions` and `observations` keyword arguments. In this case, they are simply `Vector`s containing all the elements in the space. + +#### Transition and observation distributions + +The `transition` and `observation` keyword arguments are used to define the transition distribution, ``T``, and observation distribution, ``Z``, respectively. These models are defined using functions that return [*distribution objects* (more info below)](@ref Commonly-used-distributions). The transition function takes state and action arguments and returns a distribution of the resulting next state. The observation function takes in an action and the resulting next state (`sp`, short for "s prime") and returns the distribution of the observation emitted at this state. 
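+
+The distributions returned by these functions can also be sampled and queried directly, which is a convenient way to sanity-check a model (a minimal sketch, assuming the QuickPOMDP `m` defined above is in scope; `pdf` is part of the distribution interface discussed later in this guide):
+```julia
+using POMDPs: transition, observation
+using Distributions: pdf # assumes the POMDPModelTools distributions extend Distributions.pdf
+using Random: MersenneTwister
+
+rng = MersenneTwister(1)
+
+d = transition(m, "left", "listen")   # Deterministic("left"): the tiger stays put
+rand(rng, d)                          # always returns "left"
+
+z = observation(m, "listen", "left")  # SparseCat(["left", "right"], [0.85, 0.15])
+pdf(z, "left")                        # 0.85
+```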
+ +#### Reward function + +The `reward` keyword argument defines ``R``. It is a function that takes in a state and action and returns a number. + +#### Discount and initial state distribution + +The discount factor, ``\gamma``, is defined with the `discount` keyword, and is simply a number between 0 and 1. The initial state distribution, `b_0`, is defined with the `initialstate` argument, and is a [distribution object](@ref Commonly-used-distributions). + +The example above shows a complete implementation of a very simple discrete-space POMDP. However, POMDPs.jl is capable of concisely expressing much more complex models with continuous and hybrid spaces. The guide below introduces a more complex example to fully explain the ways that a POMDP can be defined. + +## Guide to Defining POMDPs + +### [A more complex example: A partially-observable mountain car](@id po-mountaincar) + +[Mountain car](https://en.wikipedia.org/wiki/Mountain_car_problem) is a classic problem in reinforcement learning. A car starts in a valley between two hills, and must reach the goal at the top of the hill to the right ([see wikipedia for image](https://en.wikipedia.org/wiki/Mountain_car_problem)). The actions are left and right acceleration and neutral and the state consists of the car's position and velocity. In this partially-observable version, there is a small amount of acceleration noise and observations are normally-distributed noisy measurements of the position. This problem can be implemented as follows: + +```jldoctest mountaincar; output=false, filter=r"QuickPOMDP.*" +import QuickPOMDPs: QuickPOMDP +import POMDPModelTools: ImplicitDistribution +import Distributions: Normal + +mountaincar = QuickPOMDP( + actions = [-1., 0., 1.], + obstype = Float64, + discount = 0.95, + + transition = function (s, a) + ImplicitDistribution() do rng + x, v = s + vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng) + vp = clamp(vp, -0.07, 0.07) + xp = x + vp + return (xp, vp) + end + end, + + observation = (a, sp) -> Normal(sp[1], 0.15), + + reward = function (s, a, sp) + if sp[1] > 0.5 + return 100.0 + else + return -1.0 + end + end, + + initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)), + isterminal = s -> s[1] > 0.5 +) + +# output +QuickPOMDP +``` + +The following sections provide a detailed guide to defining the components of a POMDP using this example and the [tiger pomdp](@ref tiger) further above. + +### [State, action, and observation spaces](@id space_representation) + +In POMDPs.jl, a state, action, or observation can be represented by any Julia object, for example an integer, a floating point number, a string or `Symbol`, or a vector. For example, in the tiger problem, the states are `String`s, and in the mountaincar problem, the state is a `Tuple` of two floating point numbers, and the actions and observations are floating point numbers. These types are usually inferred from the space or initial state distribution definitions. + +!!! warn + + Objects representing individual states, actions, and observations should not be altered once they are created, since they may be used as dictionary keys or stored in histories. Hence it is usually best to use immutable objects such as integers or [`StaticArray`s](https://github.com/JuliaArrays/StaticArrays.jl). + +The state, action, and observation spaces are defined with the `states`, `actions`, and `observations` Quick(PO)MDP keyword arguments. The simplest way to define these spaces is with a `Vector` of states, e.g. 
`states = ["left", "right"]` in the tiger problem. More complicated spaces, such as vector spaces and other continuous, uncountable, or hybrid sets can be defined with custom objects that adhere to the [space interface](@ref space-interface). However, it should be noted that, for many solvers, *an explicit enumeration of the state and observation spaces is not needed*. Instead, it is sufficient to specify the state or observation *type* using the `statetype` or `obstype` arguments, e.g. `obstype = Float64` in the mountaincar problem.
+
+!!! tip
+
+    If you are having a difficult time representing the state or observation space, it is likely that you will not be able to use a solver that requires an explicit representation. It is usually best to omit that space from the definition and try solvers to see if they work.
+
+#### [State- or belief-dependent action spaces](@id state-dep-action)
+
+In some problems, the set of allowable actions depends on the state or belief. This can be implemented by providing a function of the state or belief to the `actions` argument, e.g. if action `1` can only be taken in state `1` of an MDP (while actions `2` and `3` are always available), you might use
+```jldoctest ; output=false, filter=r".* \(generic function.*\)"
+actions = function (s)
+    if s == 1
+        return [1,2,3]
+    else
+        return [2,3]
+    end
+end
+
+# output
+#1 (generic function with 1 method)
+```
+
+Similarly, in a POMDP, you may wish to only allow action `1` if the belief `b` assigns a nonzero probability to state `1`. This can be accomplished with
+```jldoctest ; output=false
+actions = function (b)
+    if pdf(b, 1) > 0.0
+        return [1,2,3]
+    else
+        return [2,3]
+    end
+end
+
+# output
+#1 (generic function with 1 method)
+```
+
+### Transition and observation distributions
+
+The transition and observation distributions are specified through *functions that return distributions*. A distribution object implements parts of the [distribution interface](@ref Distributions), most importantly a [`rand`](@ref) function that provides a way to sample the distribution and, for explicit distributions, a [`pdf`](@ref) function that evaluates the probability mass or density of a given outcome. In most simple cases, you will be able to use a pre-defined distribution like the ones listed below, but occasionally you will define your own for more complex problems.
+
+!!! tip
+    Since the `transition` and `observation` functions return distributions, you should not call `rand` within these functions (unless it is within an `ImplicitDistribution` sampling function (see below)).
+
+The `transition` function takes in a state `s` and action `a` and returns a distribution object that defines the distribution of next states given that the current state is `s` and the action is `a`, that is ``T(s' | s, a)``. Similarly, the `observation` function takes in the action `a` and the next state `sp` and returns a distribution object defining ``O(z | a, s')``.
+
+!!! note
+    It is also possible to define the `observation` function in terms of the previous state `s`, along with `a` and `sp`. This is necessary, for example, when the observation is a measurement of change in state, e.g. `sp - s`. However, some solvers may use the `a, sp` method (and hence cannot solve problems where the observation is conditioned on ``s`` and ``s'``). Since providing an `a, sp` method *automatically* defines the `s, a, sp` method, problem writers should usually define only the `a, sp` method, and only define the `s, a, sp` method if it is necessary.
Except for special performance cases, problem writers should *never* need to define both methods. + +#### Commonly-used distributions + +In most cases, the following pre-defined distributions found in the [POMDPModelTools](https://github.com/JuliaPOMDP/POMDPModelTools.jl) and [Distributions](https://github.com/JuliaStats/Distributions.jl) packages will be sufficient to define models. + +##### `Deterministic` -The expressive nature of POMDPs.jl gives problem writers the flexibility to write their problem in many forms. -Custom POMDP problems are defined by implementing the functions specified by the POMDPs API. +The `Deterministic` distribution should be used when there is no randomness in the state or observation given the state and action inputs. This commonly occurs when the new state is a deterministic function of the state and action or the state stays the same, for example when the action is `"listen"` in the [tiger example](@ref tiger) above, the transition function returns `Deterministic(s)`. -In this guide, the interface is divided into two sections: functions that define static properties of the problem, and functions that describe the dynamics - how the states, observations and rewards change over time. There are two ways of specifying the dynamic behavior of a POMDP. The problem definition may include a mixture of *explicit* definitions of probability distributions, or *generative* definitions that simulate states and observations without explicitly defining the distributions. In scientific papers explicit definitions are often written as ``T(s' | s, a)`` for transitions and ``O(o | s, a, s')`` for observations, while a generative definition might be expressed as ``s', o, r = G(s, a)`` (or ``s', r = G(s,a)`` for an MDP). +##### `SparseCat` -## What do I need to implement? +In discrete POMDPs, it is common for the state or observation to have a few possible outcomes with specified probabilities. This can be represented with a sparse categorical `SparseCat` distribution that takes a list of outcomes and a list of associated probabilities as arguments. For instance, in the tiger example above, when the action is `"listen"`, there is an 85% chance of receiving the correct observation. Thus if the state is `"left"`, the observation distribution is `SparseCat(["left", "right"], [0.85, 0.15])`, and `SparseCat(["right", "left"], [0.85, 0.15])` if the state is `"right"`. -Because of the wide variety or problems and solvers that POMDPs.jl interfaces with, the question of which functions from the interface need to be implemented does not have a short answer for all cases. In general, a problem will be defined by implementing a combination of functions. +Another example where `SparseCat` distributions are useful is in grid-world problems, where there is a high probability of transitioning along the direction of the action, a low probability of transitioning to other adjacent states, and zero probability of transitioning to any other states. -Specifically, a problem writer will need to define -- Explicit or generative definitions for - - the state transition model, - - the reward function, and - - the observation model. -- Functions to define some other properties of the problem such as the state, action, and observation spaces, which states are terminal, etc. +##### `Uniform` -The precise answer for which functions need to be implemented depends on two factors: problem complexity and which solver will be used. -In particular, 2 questions should be asked: -1. 
Is it difficult or impossible to specify a probability distribution explicitly? -2. What solvers will be used to solve this, and what are their requirements? +Another common case is a uniform distribution over a space or set of outcomes. This can be represented with a `Uniform` object that takes a set of outcomes as an argument. For example, the initial state distribution in the tiger problem is represented with `Uniform(["left", "right"])` indicating that both states are equally likely. -If the answer to (1) is yes, then a generative definition should be used. Question (2) should be answered by reading about the solvers and trying to run them. Some solvers have specified their requirements using the [POMDPLinter package](https://github.com/JuliaPOMDP/POMDPLinter.jl), however, these requirements are written separately from the solver code, and often the best way is to write a simple prototype problem and running the solver until all `MethodError`s have been fixed. +##### Distributions.jl + +If the states or observations have numerical or vector values, the [Distributions.jl package](https://github.com/JuliaStats/Distributions.jl) provides a suite of suitable distributions. For example, the observation function in the [partially-observable mountain car example above](@ref po-mountaincar), +```julia +observation = (a, sp) -> Normal(sp[1], 0.15) +``` +returns a `Normal` distribution from this package with a mean that depends on the car's location (the first element of state `sp`) and a standard deviation of 0.15. + +##### `ImplicitDistribution` + +In many cases, especially when the state or observation spaces are continuous or hybrid, it is difficult or impossible to specify the probability density explicitly. Fortunately, many solvers for these problems do not require explicit density information and instead need only samples from the distribution. In this case, an "implicit distribution" or "generative model" is sufficient. In POMDPs.jl, this can be represented using an [`ImplicitDistribution` object](https://juliapomdp.github.io/POMDPModelTools.jl/stable/distributions/#POMDPModelTools.ImplicitDistribution). + +The argument to an `ImplicitDistribution` constructor is a function that takes a random number generator as an argument and returns a sample from the distribution. To see how this works, we'll look at an example inspired by the [mountaincar](@ref po-mountaincar) initial state distribution. +Samples from this distribution are position-velocity tuples where the velocity is always zero, but the position is uniformly distributed between -0.2 and 0. Consider the following code: +```jldoctest +using Random: MersenneTwister +using POMDPModelTools: ImplicitDistribution + +rng = MersenneTwister(1) + +d = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)) +rand(rng, d) +# output +(-0.04720666913240939, 0.0) +``` +Here, `rng` is the random number generator. When `rand(rng, d)` is called, the sampling function, `rng -> (-0.2*rand(rng), 0.0)`, is called to generate a state. The sampling function uses `rng` to generate a random number between 0 and 1 (`rand(rng)`), multiplies it by -0.2 to get the position, and creates a tuple with the position and a velocity of `0.0` and returns an initial state that might be, for instance `(-0.11, 0.0)`. Any time that a solver, belief updater, or simulator needs an initial state for the problem, it will be sampled in this way. !!! note + The random number generator is a subtype of `AbstractRNG`. 
It is important to use this random number generator for all calls to `rand` in the sample function for reproducible results. Moreover some solvers use specialized random number generators that allow them to reduce variance. + +It is also common to use Julia's [`do` block syntax](https://docs.julialang.org/en/v1/manual/functions/#Do-Block-Syntax-for-Function-Arguments) to define more complex sampling functions. For instance the transition function in the mountaincar example returns an ImplicitDistribution with a sampling function that (1) generates a new noisy velocity through a `randn` call, then (2) clamps the velocity, and finally (3) integrates the position with Euler's method: +```julia +transition = function (s, a) + ImplicitDistribution() do rng + x, v = s + vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng) + vp = clamp(vp, -0.07, 0.07) + xp = x + vp + return (xp, vp) + end +end +``` +Because of the nonlinear clamp operation, it would be difficult to represent this distribution explicitly. + +##### Custom distributions + +If none of the distributions above are suitable, for example if you need to represent an explicit distribution with hybrid support, it is not difficult to define your own distributions by implementing the functions in the [distribution interface](@ref Distributions). + +### Reward functions + +The reward function maps a combination of state, action, and observation arguments to the reward for a step. For instance, the reward function in the mountaincar problem, +```julia +reward = function (s, a, sp) + if sp[1] > 0.5 + return 100.0 + else + return -1.0 + end +end +``` +takes in the previous state, `s`, the action, `a`, and the resulting state, `sp` and returns a large positive reward if the resulting position, `sp[1]`, is beyond a threshold (note the coupling of the terminal reward) and a small negative reward on all other steps. If the reward in the problem is stochastic, the `reward` function implemented in POMDPs.jl should return the mean reward. + +There are two possible reward function argument signatures that a problem-writer might consider implementing for an MDP: `(s, a)` and `(s, a, sp)`. For a POMDP, there is an additional version, `(s, a, sp, o)`. The `(s, a, sp)` version is useful when transition to a terminal state results in a reward, and the `(s, a, sp, o)` version is useful for cases when the reward is associated with an observation, such as a negative reward for the stress caused by a medical diagnostic test that indicates the possibility of a disease. **Problem writers should implement the version with the fewest number of arguments possible**, since the versions with more arguments are automatically provided to solvers and simulators if a version with fewer arguments is implemented. + +In rare cases, it may make sense to implement two or more versions of the function, for example if a solver requires `(s, a)`, but the user wants an observation-dependent reward to show up in simulation. It is OK to implement two methods of the reward function as long as the following relationships hold: ``R(s, a) = E_{s'\sim T(s'|s,a)}[R(s, a, s')]`` and ``R(s, a, s') = E_{o \sim Z(o | s, a, s')}[R(s, a, s', o)]``. That is, the versions with fewer arguments *must* be expectations of versions with more arguments. + +### Other Components + +#### Discount factors + +The `discount` keyword argument is simply a number between 0 and 1 used to discount rewards in the future. 
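+
+Concretely, the discount factor enters the objective that solvers seek to maximize, the expected discounted sum of rewards (a standard definition, stated here only for reference):
+```math
+E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]
+```
+where ``r_t`` is the reward received at step ``t``. Values of ``\gamma`` close to 1 place more weight on future rewards, while smaller values emphasize immediate rewards.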
+
+#### Initial state distribution
+
+The `initialstate` argument should be a distribution object (see [above](@ref Commonly-used-distributions)) that defines the initial state distribution (and the initial belief for POMDPs).
+
+#### Terminal states
+
+The function supplied to the `isterminal` keyword argument defines which states in the POMDP are terminal. The function should take a state as an argument and return `true` if the state is terminal and `false` otherwise. For example, in the mountaincar example above, `isterminal = s -> s[1] > 0.5` indicates that all states where the position, `s[1]`, is greater than 0.5 are terminal.
+
+It is assumed that the system will take no further steps once it has reached a terminal state. Since reward is assigned for taking steps, no additional reward can be accumulated from a terminal state. Consequently, the most important property of terminal states is that *the value of a terminal state is always zero*. Many solvers leverage this property for efficiency. As in the mountaincar example, any reward for reaching a terminal state should therefore be collected on the step that transitions *into* that state.
+
+## Other ways to define a (PO)MDP
+
+Besides the Quick(PO)MDP approach above, there are several alternative ways to define (PO)MDP models:
+
+### Object-oriented
+
+First, it is possible to create your own (PO)MDP types and implement the components of the POMDP directly as methods of [POMDPs.jl interface functions](@ref API-Documentation). This approach can be thought of as the "low-level" way to define a POMDP, and the QuickPOMDP as merely a syntactic convenience. There are a few things that make this object-oriented approach more cumbersome than the QuickPOMDP approach, but the structure is similar. For example, the [tiger](@ref tiger) QuickPOMDP shown above can be implemented as follows:
+
+```jldoctest; output=false
+import POMDPs
+using POMDPs: POMDP
+using POMDPModelTools: Deterministic, Uniform, SparseCat
+
+struct TigerPOMDP <: POMDP{String, String, String}
+    p_correct::Float64
+    indices::Dict{String, Int}
+
+    TigerPOMDP(p_correct=0.85) = new(p_correct, Dict("left"=>1, "right"=>2, "listen"=>3))
+end
+
+POMDPs.states(m::TigerPOMDP) = ["left", "right"]
+POMDPs.actions(m::TigerPOMDP) = ["left", "right", "listen"]
+POMDPs.observations(m::TigerPOMDP) = ["left", "right"]
+POMDPs.discount(m::TigerPOMDP) = 0.95
+POMDPs.stateindex(m::TigerPOMDP, s) = m.indices[s]
+POMDPs.actionindex(m::TigerPOMDP, a) = m.indices[a]
+POMDPs.obsindex(m::TigerPOMDP, o) = m.indices[o]
+
+function POMDPs.transition(m::TigerPOMDP, s, a)
+    if a == "listen"
+        return Deterministic(s) # tiger stays behind the same door
+    else # a door is opened
+        return Uniform(["left", "right"]) # reset
+    end
+end
+
+function POMDPs.observation(m::TigerPOMDP, a, sp)
+    if a == "listen"
+        if sp == "left"
+            return SparseCat(["left", "right"], [m.p_correct, 1.0-m.p_correct])
+        else
+            return SparseCat(["right", "left"], [m.p_correct, 1.0-m.p_correct])
+        end
+    else
+        return Uniform(["left", "right"])
+    end
+end
+
+function POMDPs.reward(m::TigerPOMDP, s, a)
+    if a == "listen"
+        return -1.0
+    elseif s == a # the tiger was found
+        return -100.0
+    else # the tiger was escaped
+        return 10.0
+    end
+end
+
+POMDPs.initialstate(m::TigerPOMDP) = Uniform(["left", "right"])
+# output
+```
+
+It is easy to see that the new methods are similar to the keyword arguments in the QuickPOMDP approach, except that every function has an initial `m` argument of the newly created POMDP type.
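+
+One consequence, discussed further below, is that the same problem structure can be instantiated with different parameters (a small usage sketch, assuming the definitions above have been evaluated):
+```julia
+m = TigerPOMDP()        # default: an 85% chance of hearing the tiger's location correctly
+noisy = TigerPOMDP(0.6) # a harder instance with less informative listening
+
+POMDPs.observation(noisy, "listen", "left")
+# returns SparseCat(["left", "right"], [0.6, 0.4])
+```
+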
There are several differences from the QuickPOMDP approach: First, the POMDP is represented by a new `struct` that is a subtype of `POMDP{S,A,O}`. The state, action, and observation types must be specified as the `S`, `A`, and `O` parameters of the [`POMDP`](@ref) abstract type. Second, this new `struct` may contain problem-specific fields, which makes it easy for others to construct POMDPs that have the same structure but different parameters. For example, in the code above, the `struct` has a `p_correct` parameter that specifies the probability of receiving a correct observation when the "listen" action is taken. The final and most cumbersome difference between this object-oriented approach and using QuickPOMDPs is that the user must implement [`stateindex`](@ref), [`actionindex`](@ref), and [`obsindex`](@ref) to map states, actions, and observations to appropriate indices so that data such as values can be stored and accessed efficiently in vectors.
+
+### Using a single generative function instead of separate ``T``, ``Z``, and ``R``
+
+In some cases, you may wish to use a simulator that generates the next state, observation, and/or reward (``s'``, ``o``, and ``r``) simultaneously. This is sometimes called a "generative model".
+
+For example, if you are working on an autonomous driving POMDP, the car may travel for one or more seconds in between POMDP decision steps during which it may accumulate reward and observation measurements. In this case, it might be very difficult to create a reward or observation function based on ``s``, ``a``, and ``s'``. For such situations, the `gen` function is an alternative to `transition`, `observation`, and `reward`. `gen` should take in state, action, and random number generator arguments and return a [`NamedTuple`](https://docs.julialang.org/en/v1/manual/types/#Named-Tuple-Types) with keys `sp` (for "s-prime", the next state), `o`, and `r`. The [mountaincar example above](@ref po-mountaincar) can be implemented with `gen` as follows:
+```jldoctest; output=false, filter=r"QuickPOMDP.*"
+using QuickPOMDPs: QuickPOMDP
+using POMDPModelTools: ImplicitDistribution
+
+mountaincar = QuickPOMDP(
+    actions = [-1., 0., 1.],
+    obstype = Float64,
+    discount = 0.95,
+
+    gen = function (s, a, rng)
+        x, v = s
+        vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng)
+        vp = clamp(vp, -0.07, 0.07)
+        xp = x + vp
+        if xp > 0.5
+            r = 100.0
+        else
+            r = -1.0
+        end
+        o = xp + 0.15*randn(rng)
+        return (sp=(xp, vp), o=o, r=r)
+    end,
+
+    initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)),
+    isterminal = s -> s[1] > 0.5
+)
+# output
+QuickPOMDP
+```
+
+!!! tip
+    `gen` is not tied to the QuickPOMDP approach; it can also be used in the object-oriented paradigm.
+
+!!! tip
+    It is possible to mix and match `gen` with `transition`, `observation`, and `reward`. For example, if the `gen` function returns a `NamedTuple` with `sp` and `r` keys, POMDPs.jl will try to use `gen` to generate states and rewards and the `observation` function to generate observations.
+
+!!! note
+    Implementing `gen` instead of `transition`, `observation`, and `reward` will limit which solvers you can use; for example, it is impossible to use a solver that requires an explicit transition distribution.
+
+### Tabular
+
+Finally, it is sometimes convenient to define (PO)MDPs with tables that define the transition and observation probabilities and rewards. In this case, the states, actions, and observations must simply be integers.
+ +The code below is a tabular implementation of the [tiger example](@ref tiger) with the states, actions, and observations mapped to the following integers: + +|integer | state, action, or observation +|--------|-------- +|1 | "left" +|2 | "right" +|3 | "listen" + +```jldoctest tabular; output=false, filter=r"TabularPOMDP.*" +using POMDPModels: TabularPOMDP + +T = zeros(2,3,2) +T[:,:,1] = [1. 0.5 0.5; + 0. 0.5 0.5] +T[:,:,2] = [0. 0.5 0.5; + 1. 0.5 0.5] + +O = zeros(2,3,2) +O[:,:,1] = [0.85 0.5 0.5; + 0.15 0.5 0.5] +O[:,:,2] = [0.15 0.5 0.5; + 0.85 0.5 0.5] - If a particular function is required by a solver but seems very difficult to implement for a particular problem, one should consider carefully whether the algorithm is capable of solving that problem. For example, if a problem has a complex hybrid state space, it will be more difficult to define [`states`](@ref), but it is also true that solvers that require [`states`](@ref) such as SARSOP or IncrementalPruning, will usually not be able to solve such a problem, and solvers that can handle it, like ARDESPOT or MCVI, usually will not call [`states`](@ref). +R = [-1. -100. 10.; + -1. 10. -100.] -## Outline +m = TabularPOMDP(T, R, O, 0.95) +# output +TabularPOMDP([1.0 0.5 0.5; 0.0 0.5 0.5] -The following pages provide more details on specific parts of the interface: +[0.0 0.5 0.5; 1.0 0.5 0.5], [-1.0 -100.0 10.0; -1.0 10.0 -100.0], [0.85 0.5 0.5; 0.15 0.5 0.5] -- [Static Properties](@ref static) -- [Spaces and Distributions](@ref) -- [Dynamics](@ref dynamics) +[0.15 0.5 0.5; 0.85 0.5 0.5], 0.95) +``` +Here `T` is a ``|S| \times |A| \times |S|`` array representing the transition probabilities, with `T[sp, a, s]` `` = T(s' | s, a)``. Similarly, `O` is an ``|O| \times |A| \times |S|`` encoding the observation distribution with `O[o, a, sp]` `` = Z(o | a, s')``, and `R` is a ``|S| \times |A|`` matrix that encodes the reward function. 0.95 is the discount factor. diff --git a/docs/src/dynamics.md b/docs/src/dynamics.md deleted file mode 100644 index 8222da44..00000000 --- a/docs/src/dynamics.md +++ /dev/null @@ -1,43 +0,0 @@ -# [Defining (PO)MDP Dynamics](@id dynamics) - -The dynamics of a (PO)MDP define how states, observations, and rewards are generated at each time step. One way to visualize the structure of (PO)MDP is with a *dynamic decision network* (DDN) (see for example [*Decision Making under Uncertainty* by Kochenderfer et al.](https://ieeexplore.ieee.org/book/7288640) or [this webpage](https://artint.info/html/ArtInt_229.html) for more discussion of dynamic decision networks). - -The POMDPs.jl DDN models are shown below: - -| Standard MDP DDN | Standard POMDP DDN | -|:---:|:---:| -|![MDP DDN](figures/mdp_ddn.svg) | ![POMDP DDN](figures/pomdp_ddn.svg) | - -!!! note - - In order to provide additional flexibility, these DDNs have `:s`→`:o`, `:sp`→`:r` and `:o`→`:r` edges that are typically absent from the DDNs traditionally used in the (PO)MDP literature. Traditional (PO)MDP algorithms are compatible with these DDNs because only ``R(s,a)``, the expectation of ``R(s, a, s', o)`` over all ``s'`` and ``o`` is needed to make optimal decisions. - -The task of defining the dynamics of a (PO)MDP consists of defining a model for each of the nodes in the DDN. Models for each node can either be implemented separately through the [`transition`](@ref), [`observation`](@ref), and [`reward`](@ref) functions, or together with the [`gen`](@ref) function. 
- -## Separate definitions (explicit or generative) - -- [`transition`](@ref)`(pomdp, s, a)` defines the state transition probability distribution for state `s` and action `a`. This defines an explicit model for the `:sp` DDN node. -- [`observation`](@ref)`(pomdp, [s,] a, sp)` defines the observation distribution given that action `a` was taken and the state is now `sp` (The observation can optionally depend on `s` - see docstring). This defines an explicit model for the `:o` DDN node. -- [`reward`](@ref)`(pomdp, s, a[, sp[, o]])` defines the reward, which is a deterministic function of the state and action (and optionally `sp` and `o` - see docstring). This defines an explicit model for the `:r` DDN node. - -[`transition`](@ref) and [`observation`](@ref) should return distribution objects that implement part or all of the [distribution interface](@ref Distributions). Some predefined distributions can be found in [Distributions.jl](https://github.com/JuliaStats/Distributions.jl) or [POMDPModelTools.jl](https://github.com/JuliaPOMDP/POMDPModelTools.jl), or custom types that represent distributions appropriate for the problem may be created. - -!!! tip - - To define a *generative* model for one of these components, use an [POMDPModelTools.jl ImplicitDistribution object](https://juliapomdp.github.io/POMDPModelTools.jl/stable/distributions/#Implicit) - -!!! note - - There is no requirement that a problem defined using the explicit interface be discrete; it is straightforward to define continuous POMDPs with the explicit interface, provided that the distributions have some finite parameterization. - -## Combined generative definition - -If the state, observation, and reward are generated simultaneously, a new method of the [`gen`](@ref) function should be implemented to return the state, observation and reward in a single `NamedTuple`. - -### Examples - -An example of defining a problem using separate functions can be found at: -[https://github.com/JuliaPOMDP/POMDPExamples.jl/blob/master/notebooks/Defining-a-POMDP-with-the-Explicit-Interface.ipynb](https://github.com/JuliaPOMDP/POMDPExamples.jl/blob/master/notebooks/Defining-a-POMDP-with-the-Explicit-Interface.ipynb) - -An example of defining a problem with a combined `gen` function can be found at: -[https://github.com/JuliaPOMDP/POMDPExamples.jl/blob/master/notebooks/Defining-a-POMDP-with-the-Generative-Interface.ipynb](https://github.com/JuliaPOMDP/POMDPExamples.jl/blob/master/notebooks/Defining-a-POMDP-with-the-Generative-Interface.ipynb) diff --git a/docs/src/index.md b/docs/src/index.md index 39940ab1..afc833ec 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -18,12 +18,13 @@ The list of solver and support packages is maintained at the [POMDPs.jl Readme]( ## Documentation Outline -Documentation comes in four forms: -1. How-to examples are available in the [POMDPExamples package](https://github.com/JuliaPOMDP/POMDPExamples.jl) and in pages in this document with "Example" in the title. -2. An explanatory guide is available in the sections outlined below. +Documentation comes in three forms: +1. An explanatory guide is available in the sections outlined below. +2. How-to examples are available in the [POMDPExamples package](https://github.com/JuliaPOMDP/POMDPExamples.jl) and in pages in this document with "Example" in the title. 3. Reference docstrings for the entire interface are available in the [API Documentation](@ref) section. 
-When updating these documents, make sure this is synced with [docs/make.jl](https://github.com/JuliaPOMDP/POMDPs.jl/blob/master/docs/make.jl)!! +!!! note + When updating these documents, make sure this is synced with [docs/make.jl](https://github.com/JuliaPOMDP/POMDPs.jl/blob/master/docs/make.jl)!! ### Basics @@ -34,7 +35,8 @@ Pages = ["install.md", "get_started.md", "concepts.md"] ### Defining POMDP Models ```@contents -Pages = [ "def_pomdp.md", "static.md", "interfaces.md", "dynamics.md"] +Pages = [ "def_pomdp.md", "interfaces.md"] +Depth = 3 ``` ### Writing Solvers and Updaters diff --git a/docs/src/static.md b/docs/src/static.md deleted file mode 100644 index 29dbdde2..00000000 --- a/docs/src/static.md +++ /dev/null @@ -1,58 +0,0 @@ -# [Defining Static (PO)MDP Properties](@id static) - -The definition of a (PO)MDP includes several static properties, which are defined with the functions listed in this section. This section is an overview, with links to the docstrings for detailed usage information. - -To use most solvers, it is only necessary to implement a few of these functions. - -## Spaces - -The state, action and observation spaces are defined by the following functions: - -- [`states`](@ref)`(pomdp)` -- [`actions`](@ref)`(pomdp[, s])` -- [`observations`](@ref)`(pomdp)` - -The object returned by these functions should implement part or all of the [interface for spaces](@ref space-interface). For discrete problems, a vector is appropriate. - -It is often important to limit the action space based on the current state, belief, or observation. -This can be accomplished with the [`actions`](@ref)`(m, s)` or [`actions`](@ref)`(m, b)` function. -See [Histories associated with a belief](@ref) and the [`history`](@ref) and [`currentobs`](@ref) docstrings for more information. - -## Initial Distributions - -[`initialstate`](@ref)`(pomdp)` should return the distribution of the initial state, either as an explicit distribution (e.g. a `POMDPModelTools.SparseCat`) that conforms to the [distribution interface](@ref Distributions) or with a `POMDPModelTools.ImplicitDistribution` to easily specify a function to sample from the space. - -[`initialobs`](@ref)`(pomdp, state)` is used to return the distribution of the initial observation in occasional cases where the policy expects an initial observation rather than an initial belief, e.g. in a reinforcement learning setting. It is not used in a standard POMDP simulation. - -## Discount Factor - -[`discount`](@ref)`(pomdp)` should return a number between 0 and 1 to define the discount factor. - -## Terminal States - -If a problem has terminal states, they can be specified using the [`isterminal`](@ref) function. If a state `s` is terminal [`isterminal`](@ref)`(pomdp, s)` should return `true`, otherwise it should return `false`. - -In POMDPs.jl, no actions can be taken from terminal states, and no additional rewards can be collected, thus, the value function for a terminal state is zero. POMDPs.jl does not have a mechanism for defining terminal rewards apart from the [`reward`](@ref) function, so the problem should be defined so that any terminal rewards are collected as the system transitions into a terminal state. - -## Indexing - -For discrete problems, some solvers rely on a fast method for finding the index of the states, actions, or observations in an ordered list. These indexing functions can be implemented as -- [`stateindex`](@ref)`(pomdp, s)` -- [`actionindex`](@ref)`(pomdp, a)` -- [`obsindex`](@ref)`(pomdp, o)` - -!!! 
note - - The converse mapping (from indices to states) is not part of the POMDPs interface. A solver will typically create a vector containing all the states to define it. - -!!! note - - There is no requirement that the object returned by the [space functions](@ref Spaces) above respect the same ordering as the `index` functions. The `index` functions are the *sole definition* of ordering of the states. The `POMDPModelTools` package contains convenience functions for constructing a list of states that respects the ordering specified by the `index` functions. For example, `POMDPModelTools.ordered_states` returns an `AbstractVector` of the states in the order specified by `stateindex`. - -## Conversion to vector types - -Some solvers (notably those that involve deep learning) rely on the ability to represent states, actions, and observations as vectors. To define a mapping between vectors and custom problem-specific representations, implement the following functions (see docstring for signature): - -- [`convert_s`](@ref) -- [`convert_a`](@ref) -- [`convert_o`](@ref)