As described in the Concepts and Architecture section, an MDP is defined by the state space, action space, transition distributions, reward function, and discount factor, $(S,A,T,R,\gamma)$. A POMDP additionally includes the observation space and the observation probability distributions, yielding the tuple $(S,A,T,R,O,Z,\gamma)$. A problem definition in POMDPs.jl consists of an implicit or explicit definition of each of these elements.
It is possible to define a (PO)MDP with a more traditional object-oriented approach, in which the user defines a new type to represent the (PO)MDP and methods of the interface functions to define the tuple elements. However, the QuickPOMDPs package provides a more concise way to get started, using keyword arguments instead of new types and methods. Essentially, each keyword argument defines a corresponding POMDPs.jl interface function. Since the important concepts are the same for the object-oriented approach and the QuickPOMDP approach, we will use the latter for this discussion.
This guide has three parts: first, it explains a very simple example (the Tiger POMDP); then it uses a more complex example to illustrate the broader capabilities of the interface; finally, it discusses some alternative ways of defining (PO)MDPs.
This guide assumes that you are comfortable programming in Julia and, in particular, that you are familiar with the various ways of defining anonymous functions. Consult the Julia documentation to learn more about programming in Julia.
In the first section of this guide, we will explain a QuickPOMDP implementation of a very simple problem: the classic Tiger POMDP. In the Tiger POMDP, the agent is tasked with escaping from a room. There are two doors leading out of the room. Behind one of the doors is a tiger, and behind the other is sweet, sweet freedom. If the agent opens the door and finds the tiger, it gets eaten (and receives a reward of -100). If the agent opens the other door, it escapes and receives a reward of 10. The agent can also listen, which gives a noisy measurement of which door the tiger is hiding behind: the measurement reports the correct location of the tiger 85% of the time, and the agent receives a reward of -1 for listening. The complete implementation looks like this:
```julia
using QuickPOMDPs: QuickPOMDP
using POMDPTools: Deterministic, Uniform, SparseCat

m = QuickPOMDP(
    states = ["left", "right"],
    actions = ["left", "right", "listen"],
    observations = ["left", "right"],
    discount = 0.95,

    transition = function (s, a)
        if a == "listen"
            return Deterministic(s) # tiger stays behind the same door
        else # a door is opened
            return Uniform(["left", "right"]) # reset
        end
    end,

    observation = function (a, sp)
        if a == "listen"
            if sp == "left"
                return SparseCat(["left", "right"], [0.85, 0.15]) # sparse categorical distribution
            else
                return SparseCat(["right", "left"], [0.85, 0.15])
            end
        else
            return Uniform(["left", "right"])
        end
    end,

    reward = function (s, a)
        if a == "listen"
            return -1.0
        elseif s == a # the tiger was found
            return -100.0
        else # the agent escaped
            return 10.0
        end
    end,

    initialstate = Uniform(["left", "right"]),
);
```
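Each of these keyword arguments defines the corresponding POMDPs.jl interface function, so the resulting model can be queried directly. For example, the following quick check (a sketch assuming the `m` constructed above) inspects the transition and observation models through the standard interface:

```julia
using POMDPs: transition, observation
using Random: MersenneTwister

transition(m, "left", "listen")      # Deterministic("left"): the tiger stays put when listening
d = observation(m, "listen", "left") # SparseCat(["left", "right"], [0.85, 0.15])
rand(MersenneTwister(1), d)          # a sample observation; "left" with probability 0.85
```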
The next sections explain how each element of the POMDP tuple is defined in this implementation:
In this example, each state, action, and observation is a `String`. The state, action, and observation spaces ($S$, $A$, and $O$) are defined with the `states`, `actions`, and `observations` keyword arguments. In this case, they are simply `Vector`s containing all the elements in the space.
The `transition` and `observation` keyword arguments are used to define the transition distribution, $T$, and observation distribution, $Z$, respectively. These models are defined using functions that return distribution objects (more info below). The transition function takes state and action arguments and returns a distribution of the resulting next state. The observation function takes in an action and the resulting next state (`sp`, short for "s prime") and returns the distribution of the observation emitted at this state.
The `reward` keyword argument defines $R$. It is a function that takes in a state and action and returns a number.
The discount factor, $\gamma$, is defined with the `discount` keyword, and is simply a number between 0 and 1. The initial state distribution, $b_0$, is defined with the `initialstate` argument, and is a distribution object.
The example above shows a complete implementation of a very simple discrete-space POMDP. However, POMDPs.jl is capable of concisely expressing much more complex models with continuous and hybrid spaces. The guide below introduces a more complex example to fully explain the ways that a POMDP can be defined.
Mountain car is a classic problem in reinforcement learning. A car starts in a valley between two hills and must reach the goal at the top of the hill on the right (see Wikipedia for an illustration). The actions are left acceleration, right acceleration, and neutral, and the state consists of the car's position and velocity. In this partially observable version, there is a small amount of acceleration noise, and observations are normally distributed noisy measurements of the position. This problem can be implemented as follows:
```julia
import QuickPOMDPs: QuickPOMDP
import POMDPTools: ImplicitDistribution
import Distributions: Normal

mountaincar = QuickPOMDP(
    actions = [-1., 0., 1.],
    obstype = Float64,
    discount = 0.95,

    transition = function (s, a)
        ImplicitDistribution() do rng
            x, v = s
            vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng)
            vp = clamp(vp, -0.07, 0.07)
            xp = x + vp
            return (xp, vp)
        end
    end,

    observation = (a, sp) -> Normal(sp[1], 0.15),

    reward = function (s, a, sp)
        if sp[1] > 0.5
            return 100.0
        else
            return -1.0
        end
    end,

    initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)),
    isterminal = s -> s[1] > 0.5
)
```
The following sections provide a detailed guide to defining the components of a POMDP, using this example and the Tiger POMDP above.
In POMDPs.jl, a state, action, or observation can be represented by any Julia object, such as an integer, a floating point number, a string or `Symbol`, or a vector. For example, in the tiger problem, the states are `String`s, while in the mountaincar problem, the state is a `Tuple` of two floating point numbers and the actions and observations are floating point numbers. These types are usually inferred from the space or initial state distribution definitions.
Objects representing individual states, actions, and observations should not be altered once they are created, since they may be used as dictionary keys or stored in histories. Hence it is usually best to use immutable objects such as integers or `StaticArray`s. If the states need to be mutable (e.g. aggregate types with vectors in them), make sure the states are not actually mutated and that `hash` and `==` are implemented for them (see AutoHashEquals.jl).
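For instance, the sketch below shows one way to follow this advice. The `CarState` and `WarehouseState` types are hypothetical examples, and the StaticArrays.jl and AutoHashEquals.jl packages are assumed to be installed.

```julia
using StaticArrays: SVector
using AutoHashEquals: @auto_hash_equals

# An immutable state type built from immutable fields is safe to use as a
# dictionary key or to store in simulation histories.
struct CarState
    position::SVector{2, Float64}
    velocity::SVector{2, Float64}
end

# If a field must be a mutable container (here a Vector), generate hash and ==
# from the field values so the state still behaves like a value, and take care
# never to mutate the vector after the state has been created.
@auto_hash_equals struct WarehouseState
    robot::Tuple{Int, Int}
    box_positions::Vector{Tuple{Int, Int}}
end
```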
The state, action, and observation spaces are defined with the `states`, `actions`, and `observations` Quick(PO)MDP keyword arguments. The simplest way to define these spaces is with a `Vector` of states, e.g. `states = ["left", "right"]` in the tiger problem. More complicated spaces, such as vector spaces and other continuous, uncountable, or hybrid sets, can be defined with custom objects that adhere to the space interface. However, it should be noted that, for many solvers, an explicit enumeration of the state and observation spaces is not needed. Instead, it is sufficient to specify the state or observation type using the `statetype` or `obstype` arguments, e.g. `obstype = Float64` in the mountaincar problem.
If you are having a difficult time representing the state or observation space, it is likely that you will not be able to use a solver that requires an explicit representation. It is usually best to omit that space from the definition and try solvers to see if they work.
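Note that a discrete space does not have to be written out literally; any `Vector` works. For instance, a hypothetical 5x5 grid world could build its state space with a comprehension:

```julia
# Hypothetical 5x5 grid world: the state space is a Vector of (x, y) tuples.
grid_states = vec([(x, y) for x in 1:5, y in 1:5])
# This Vector could then be passed as `states = grid_states` in a Quick(PO)MDP.
```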
In some problems, the set of allowable actions depends on the state or belief. This can be implemented by providing a function of the state or belief to the `actions` argument. For example, if in an MDP you can only take action `1` in state `1`, but can take the full action space, `1`, `2`, and `3`, in all other states, you might use
```julia
# The default value s = nothing allows actions(mdp) to be called without a state argument.
actions = function (s = nothing)
    if s == 1
        return [1]       # the restricted action set for state 1
    else
        return [1, 2, 3] # the full action space
    end
end
```
Similarly, in a POMDP, you may wish to only allow action `1` if the belief `b` assigns a nonzero probability to state `1`. This can be accomplished with
```julia
actions = function (b)
    if pdf(b, 1) > 0.0
        return [1, 2, 3]
    else
        return [2, 3]
    end
end
```
The transition and observation distributions are specified through functions that return distributions. A distribution object implements parts of the distribution interface, most importantly a `rand` function that provides a way to sample the distribution and, for explicit distributions, a `pdf` function that evaluates the probability mass or density of a given outcome. In most simple cases, you will be able to use a pre-defined distribution like the ones listed below, but occasionally you will define your own for more complex problems.
Since the `transition` and `observation` functions return distributions, you should not call `rand` within these functions (unless it is within an `ImplicitDistribution` sampling function; see below).
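As a quick illustration of the two parts of this interface, consider the Distributions.jl `Normal` distribution used later in the mountaincar observation model:

```julia
using Distributions: Normal, pdf
using Random: MersenneTwister

d = Normal(0.2, 0.15)        # e.g. the observation distribution when the car is at x = 0.2
rand(MersenneTwister(1), d)  # sample an observation from the distribution
pdf(d, 0.2)                  # evaluate the probability density of observing exactly 0.2
```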
The `transition` function takes in a state `s` and action `a` and returns a distribution object that defines the distribution of next states given that the current state is `s` and the action is `a`, that is, $T(s' | s, a)$. Similarly, the `observation` function takes in the action `a` and the next state `sp` and returns a distribution object defining $Z(o | a, s')$.
It is also possible to define the `observation` function in terms of the previous state `s`, along with `a` and `sp`. This is necessary, for example, when the observation is a measurement of the change in state, e.g. `sp - s`. However, some solvers may use the `(a, sp)` method (and hence cannot solve problems where the observation is conditioned on both $s$ and $s'$). Since providing an `(a, sp)` method automatically defines the `(s, a, sp)` method, problem writers should usually define only the `(a, sp)` method, and only define the `(s, a, sp)` method if it is necessary. Except for special performance cases, problem writers should never need to define both methods.
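For example, a hypothetical variant of the mountaincar problem in which the sensor measures how far the car moved, rather than where it is, would need the three-argument form (a sketch in the same keyword-argument style used above, not part of the model defined earlier):

```julia
using Distributions: Normal

# The observation is a noisy measurement of the change in position (sp - s),
# so the observation distribution must be conditioned on the previous state s.
observation = (s, a, sp) -> Normal(sp[1] - s[1], 0.01)
```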
In most cases, the following pre-defined distributions found in the POMDPTools and Distributions packages will be sufficient to define models.
The `Deterministic` distribution should be used when there is no randomness in the state or observation given the state and action inputs. This commonly occurs when the new state is a deterministic function of the state and action, or when the state stays the same; for example, when the action is `"listen"` in the tiger example above, the transition function returns `Deterministic(s)`.
In discrete POMDPs, it is common for the state or observation to have a few possible outcomes with specified probabilities. This can be represented with a sparse categorical (`SparseCat`) distribution that takes a list of outcomes and a list of associated probabilities as arguments. For instance, in the tiger example above, when the action is `"listen"`, there is an 85% chance of receiving the correct observation. Thus, if the state is `"left"`, the observation distribution is `SparseCat(["left", "right"], [0.85, 0.15])`, and it is `SparseCat(["right", "left"], [0.85, 0.15])` if the state is `"right"`.
Another example where `SparseCat` distributions are useful is in grid-world problems, where there is a high probability of transitioning along the direction of the action, a low probability of transitioning to other adjacent states, and zero probability of transitioning to any other states.
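A minimal sketch of such a transition function is shown below; the grid world itself (states as `(x, y)` tuples, actions as unit steps, and the 0.8/0.2 probabilities) is hypothetical and not part of any package:

```julia
using POMDPTools: SparseCat

# Hypothetical 10x10 grid world: the state is an (x, y) tuple and the action is
# a unit step such as (0, 1). The agent moves in the intended direction with
# probability 0.8 and stays in place with probability 0.2; every other state
# has zero probability and is simply omitted from the SparseCat.
function grid_transition(s, a)
    intended = (clamp(s[1] + a[1], 1, 10), clamp(s[2] + a[2], 1, 10))
    return SparseCat([intended, s], [0.8, 0.2])
end

grid_transition((1, 1), (0, 1)) # SparseCat over [(1, 2), (1, 1)] with probabilities [0.8, 0.2]
```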
Another common case is a uniform distribution over a space or set of outcomes. This can be represented with a `Uniform` object that takes a set of outcomes as an argument. For example, the initial state distribution in the tiger problem is represented with `Uniform(["left", "right"])`, indicating that both states are equally likely.
If the states or observations have numerical or vector values, the Distributions.jl package provides a suite of suitable distributions. For example, the observation function in the partially observable mountain car example above, `observation = (a, sp) -> Normal(sp[1], 0.15)`, returns a `Normal` distribution from this package with a mean that depends on the car's location (the first element of the state `sp`) and a standard deviation of 0.15.
In many cases, especially when the state or observation spaces are continuous or hybrid, it is difficult or impossible to specify the probability density explicitly. Fortunately, many solvers for these problems do not require explicit density information and instead need only samples from the distribution. In this case, an "implicit distribution" or "generative model" is sufficient. In POMDPs.jl, this can be represented using an `ImplicitDistribution` object.
The argument to an `ImplicitDistribution` constructor is a function that takes a random number generator as an argument and returns a sample from the distribution. To see how this works, we'll look at an example inspired by the mountaincar initial state distribution. Samples from this distribution are position-velocity tuples where the velocity is always zero, but the position is uniformly distributed between -0.2 and 0. Consider the following code:
```julia
using Random: MersenneTwister
using POMDPTools: ImplicitDistribution

rng = MersenneTwister(1)

d = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0))
rand(rng, d)

# output

(-0.04720666913240939, 0.0)
```
Here, `rng` is the random number generator. When `rand(rng, d)` is called, the sampling function, `rng -> (-0.2*rand(rng), 0.0)`, is called to generate a state. The sampling function uses `rng` to generate a random number between 0 and 1 (`rand(rng)`), multiplies it by -0.2 to get the position, and creates a tuple of the position and a velocity of `0.0`, returning an initial state that might be, for instance, `(-0.11, 0.0)`. Any time that a solver, belief updater, or simulator needs an initial state for the problem, it will be sampled in this way.
The random number generator is a subtype of `AbstractRNG`. It is important to use this random number generator for all calls to `rand` in the sampling function to obtain reproducible results. Moreover, some solvers use specialized random number generators that allow them to reduce variance. See also the "What if I don't use the `rng` argument?" FAQ.
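As a quick illustration of the reproducibility point, the same sampling function produces the same sample when given identically seeded generators:

```julia
using Random: MersenneTwister
using POMDPTools: ImplicitDistribution

d = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0))

# Because the sampling function draws all of its randomness from rng,
# re-seeding the generator reproduces the sample exactly.
rand(MersenneTwister(1), d) == rand(MersenneTwister(1), d) # true
```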
It is also common to use Julia's `do` block syntax to define more complex sampling functions. For instance, the transition function in the mountaincar example returns an `ImplicitDistribution` with a sampling function that (1) generates a new noisy velocity through a `randn` call, then (2) clamps the velocity, and finally (3) integrates the position with Euler's method:
```julia
transition = function (s, a)
    ImplicitDistribution() do rng
        x, v = s
        vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng)
        vp = clamp(vp, -0.07, 0.07)
        xp = x + vp
        return (xp, vp)
    end
end
```
Because of the nonlinear clamp operation, it would be difficult to represent this distribution explicitly.
If none of the distributions above are suitable, for example if you need to represent an explicit distribution with hybrid support, it is not difficult to define your own distributions by implementing the functions in the distribution interface.
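For example, here is a minimal sketch of a custom explicit distribution over integer states. The `GoalOrUniform` type is hypothetical, and it is assumed that extending `Base.rand` and `Distributions.pdf` is sufficient for the solvers you plan to use; consult the distribution interface documentation for the full list of functions (e.g. `support`, `mode`, `mean`).

```julia
using Random: AbstractRNG
import Distributions

# Hypothetical distribution over integer states 1..n: the "goal" state n has
# probability p, and the remaining mass is spread uniformly over states 1..n-1.
struct GoalOrUniform
    n::Int
    p::Float64
end

# Sampling: solvers and simulators call rand(rng, d).
Base.rand(rng::AbstractRNG, d::GoalOrUniform) =
    rand(rng) < d.p ? d.n : rand(rng, 1:d.n-1)

# Explicit probability mass and support, assumed here to be the
# Distributions.jl functions that solvers will call.
Distributions.pdf(d::GoalOrUniform, s::Int) = s == d.n ? d.p : (1 - d.p)/(d.n - 1)
Distributions.support(d::GoalOrUniform) = 1:d.n
```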
The reward function maps a combination of state, action, and observation arguments to the reward for a step. For instance, the reward function in the mountaincar problem,
```julia
reward = function (s, a, sp)
    if sp[1] > 0.5
        return 100.0
    else
        return -1.0
    end
end
```
takes in the previous state, `s`, the action, `a`, and the resulting state, `sp`, and returns a large positive reward if the resulting position, `sp[1]`, is beyond a threshold (note that this is the same condition used for `isterminal`) and a small negative reward on all other steps. If the reward in the problem is stochastic, the `reward` function implemented in POMDPs.jl should return the mean reward.
There are two possible reward function argument signatures that a problem writer might consider implementing for an MDP: `(s, a)` and `(s, a, sp)`. For a POMDP, there is an additional version, `(s, a, sp, o)`. The `(s, a, sp)` version is useful when a transition to a terminal state results in a reward, and the `(s, a, sp, o)` version is useful for cases when the reward is associated with an observation, such as a negative reward for the stress caused by a medical diagnostic test that indicates the possibility of a disease. Problem writers should implement the version with the fewest arguments possible, since the versions with more arguments are automatically provided to solvers and simulators if a version with fewer arguments is implemented.
In rare cases, it may make sense to implement two or more versions of the function, for example if a solver requires `(s, a)`, but the user wants an observation-dependent reward to show up in simulation. It is OK to implement two methods of the reward function as long as the following relationships hold: $R(s, a) = E_{s'\sim T(s'|s,a)}[R(s, a, s')]$ and $R(s, a, s') = E_{o \sim Z(o | s, a, s')}[R(s, a, s', o)]$. That is, the versions with fewer arguments must be expectations of the versions with more arguments.
The `discount` keyword argument is simply a number between 0 and 1 used to discount rewards in the future.
The `initialstate` argument should be a distribution object (see above) that defines the initial state distribution (and the initial belief for POMDPs).
The function supplied to the `isterminal` argument defines which states in the POMDP are terminal. The function should take a state as an argument and return `true` if the state is terminal and `false` otherwise. For example, in the mountaincar example above, `isterminal = s -> s[1] > 0.5` indicates that all states where the position, `s[1]`, is greater than 0.5 are terminal.
It is assumed that the system will take no further steps once it has reached a terminal state. Since reward is assigned for taking steps, no additional reward can be accumulated from a terminal state. Consequently, the most important property of terminal states is that the value of a terminal state is always zero; many solvers leverage this property for efficiency. As in the mountaincar example, a reward for reaching a goal should therefore be collected on the transition into the terminal state (e.g. with the `(s, a, sp)` reward signature) rather than from the terminal state itself.
Besides the Quick(PO)MDP approach above, there are several alternative ways to define (PO)MDP models:
First, it is possible to create your own (PO)MDP types and implement the components of the POMDP directly as methods of POMDPs.jl interface functions. This approach can be thought of as the "low-level" way to define a POMDP, and the QuickPOMDP as merely a syntactic convenience. There are a few things that make this object-oriented approach more cumbersome than the QuickPOMDP approach, but the structure is similar. For example, the tiger QuickPOMDP shown above can be implemented as follows:
```julia
import POMDPs
using POMDPs: POMDP
using POMDPTools: Deterministic, Uniform, SparseCat

struct TigerPOMDP <: POMDP{String, String, String}
    p_correct::Float64
    indices::Dict{String, Int}

    TigerPOMDP(p_correct=0.85) = new(p_correct, Dict("left"=>1, "right"=>2, "listen"=>3))
end

POMDPs.states(m::TigerPOMDP) = ["left", "right"]
POMDPs.actions(m::TigerPOMDP) = ["left", "right", "listen"]
POMDPs.observations(m::TigerPOMDP) = ["left", "right"]
POMDPs.discount(m::TigerPOMDP) = 0.95
POMDPs.stateindex(m::TigerPOMDP, s) = m.indices[s]
POMDPs.actionindex(m::TigerPOMDP, a) = m.indices[a]
POMDPs.obsindex(m::TigerPOMDP, o) = m.indices[o]

function POMDPs.transition(m::TigerPOMDP, s, a)
    if a == "listen"
        return Deterministic(s) # tiger stays behind the same door
    else # a door is opened
        return Uniform(["left", "right"]) # reset
    end
end

function POMDPs.observation(m::TigerPOMDP, a, sp)
    if a == "listen"
        if sp == "left"
            return SparseCat(["left", "right"], [m.p_correct, 1.0-m.p_correct])
        else
            return SparseCat(["right", "left"], [m.p_correct, 1.0-m.p_correct])
        end
    else
        return Uniform(["left", "right"])
    end
end

function POMDPs.reward(m::TigerPOMDP, s, a)
    if a == "listen"
        return -1.0
    elseif s == a # the tiger was found
        return -100.0
    else # the agent escaped
        return 10.0
    end
end

POMDPs.initialstate(m::TigerPOMDP) = Uniform(["left", "right"])
```
It is easy to see that the new methods are similar to the keyword arguments in the QuickPOMDP approach, except that every function has an initial `m` argument of the newly created POMDP type. There are several differences from the QuickPOMDP approach: First, the POMDP is represented by a new `struct` that is a subtype of `POMDP{S,A,O}`. The state, action, and observation types must be specified as the `S`, `A`, and `O` parameters of the `POMDP` abstract type. Second, this new `struct` may contain problem-specific fields, which makes it easy for others to construct POMDPs that have the same structure but different parameters. For example, in the code above, the `struct` has a `p_correct` parameter that specifies the probability of receiving a correct observation when the "listen" action is taken. The final and most cumbersome difference between this object-oriented approach and using QuickPOMDPs is that the user must implement `stateindex`, `actionindex`, and `obsindex` to map states, actions, and observations to appropriate indices so that data such as values can be stored and accessed efficiently in vectors.
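For example, with the struct above, a tiger problem with a noisier sensor is just a different constructor call:

```julia
noisy_tiger = TigerPOMDP(0.7) # listening is only correct 70% of the time
POMDPs.observation(noisy_tiger, "listen", "left") # SparseCat(["left", "right"], [0.7, 0.3])
```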
In some cases, you may wish to use a simulator that generates the next state, observation, and/or reward ($s'$, $o$, and $r$) simultaneously. This is sometimes called a "generative model".
For example, if you are working on an autonomous driving POMDP, the car may travel for one or more seconds in between POMDP decision steps, during which it may accumulate reward and observation measurements. In this case it might be very difficult to create a `reward` or `observation` function based on $s$, $a$, and $s'$ arguments.
For situations like this, `gen` is an alternative to `transition`, `observation`, and `reward`. The `gen` function should take in state, action, and random number generator arguments and return a `NamedTuple` with keys `sp` (for "s-prime", the next state), `o`, and `r`. The mountaincar example above can be implemented with `gen` as shown below.
```julia
using QuickPOMDPs: QuickPOMDP
using POMDPTools: ImplicitDistribution

mountaincar = QuickPOMDP(
    actions = [-1., 0., 1.],
    obstype = Float64,
    discount = 0.95,

    gen = function (s, a, rng)
        x, v = s
        vp = v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng)
        vp = clamp(vp, -0.07, 0.07)
        xp = x + vp
        if xp > 0.5
            r = 100.0
        else
            r = -1.0
        end
        o = xp + 0.15*randn(rng)
        return (sp=(xp, vp), o=o, r=r)
    end,

    initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)),
    isterminal = s -> s[1] > 0.5
)
```
`gen` is not tied to the QuickPOMDP approach; it can also be used in the object-oriented paradigm.

It is possible to mix and match `gen` with `transition`, `observation`, and `reward`. For example, if the `gen` function returns a `NamedTuple` with `sp` and `r` keys, POMDPs.jl will try to use `gen` to generate states and rewards and the `observation` function to generate observations.
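A sketch of the mountaincar model with the dynamics and reward in `gen` and the observation kept as an explicit distribution is shown below (this assumes that QuickPOMDP accepts this combination of arguments, as described above):

```julia
using QuickPOMDPs: QuickPOMDP
using POMDPTools: ImplicitDistribution
using Distributions: Normal

mountaincar_mixed = QuickPOMDP(
    actions = [-1., 0., 1.],
    obstype = Float64,
    discount = 0.95,

    # gen provides only the next state and reward...
    gen = function (s, a, rng)
        x, v = s
        vp = clamp(v + a*0.001 + cos(3*x)*-0.0025 + 0.0002*randn(rng), -0.07, 0.07)
        xp = x + vp
        return (sp=(xp, vp), r=(xp > 0.5 ? 100.0 : -1.0))
    end,

    # ...so observations are generated from this explicit distribution.
    observation = (a, sp) -> Normal(sp[1], 0.15),

    initialstate = ImplicitDistribution(rng -> (-0.2*rand(rng), 0.0)),
    isterminal = s -> s[1] > 0.5
)
```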
Implementing `gen` instead of `transition`, `observation`, and `reward` will limit which solvers you can use; for example, it is impossible to use a solver that requires an explicit transition distribution.
Finally, it is sometimes convenient to define (PO)MDPs with tables that define the transition and observation probabilities and rewards. In this case, the states, actions, and observations must simply be integers.
The code below is a tabular implementation of the tiger example with the states, actions, and observations mapped to the following integers:
| integer | state or observation | action |
|---------|----------------------|--------|
| 1       | "left"               | "listen" |
| 2       | "right"              | "left" (open the left door) |
| 3       |                      | "right" (open the right door) |
```julia
using POMDPModels: TabularPOMDP

T = zeros(2,3,2)
T[:,:,1] = [1. 0.5 0.5;
            0. 0.5 0.5]
T[:,:,2] = [0. 0.5 0.5;
            1. 0.5 0.5]

O = zeros(2,3,2)
O[:,:,1] = [0.85 0.5 0.5;
            0.15 0.5 0.5]
O[:,:,2] = [0.15 0.5 0.5;
            0.85 0.5 0.5]

R = [-1. -100. 10.;
     -1. 10. -100.]

m = TabularPOMDP(T, R, O, 0.95)
```
Here `T` is a $|S| \times |A| \times |S|$ array representing the transition probabilities, with `T[sp, a, s]` $= T(s' | s, a)$. Similarly, `O` is an $|O| \times |A| \times |S|$ array encoding the observation distribution, with `O[o, a, sp]` $= Z(o | a, s')$, and `R` is a $|S| \times |A|$ matrix that encodes the reward function, with `R[s, a]` $= R(s, a)$. The final argument, 0.95, is the discount factor.
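Since each column `T[:, a, s]` is a probability distribution over next states (and likewise each `O[:, a, sp]` over observations), a quick sanity check, not required by the interface, is that these columns sum to one:

```julia
all(sum(T, dims=1) .≈ 1.0) # true: every T[:, a, s] is a valid distribution over s'
all(sum(O, dims=1) .≈ 1.0) # true: every O[:, a, sp] is a valid distribution over o
```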