diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index ea179894..928cd3bb 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-07-22T19:52:09","documenter_version":"1.5.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-07-28T23:42:42","documenter_version":"1.5.0"}} \ No newline at end of file diff --git a/dev/POMDPTools/beliefs/index.html b/dev/POMDPTools/beliefs/index.html index 916bcb1b..b119f830 100644 --- a/dev/POMDPTools/beliefs/index.html +++ b/dev/POMDPTools/beliefs/index.html @@ -4,4 +4,4 @@ initial_observation = rand(rng, initialobs(pomdp, s0)) initial_obs_vec = fill(initial_observation, 5) hr = HistoryRecorder(rng=rng, max_steps=100) -hist = simulate(hr, pomdp, policy, up, initial_obs_vec, s0)source

Previous Observation

POMDPTools.BeliefUpdaters.PreviousObservationUpdaterType

Updater that stores the most recent observation as the belief. If an initial distribution is provided, it will pass that as the initial belief.

source

Nothing Updater

POMDPTools.BeliefUpdaters.NothingUpdaterType

An updater useful for when a belief is not necessary (i.e. for a random policy). update always returns nothing.

source
+hist = simulate(hr, pomdp, policy, up, initial_obs_vec, s0)source

Previous Observation

POMDPTools.BeliefUpdaters.PreviousObservationUpdaterType

Updater that stores the most recent observation as the belief. If an initial distribution is provided, it will pass that as the initial belief.

source

Nothing Updater

POMDPTools.BeliefUpdaters.NothingUpdaterType

An updater useful for when a belief is not necessary (i.e. for a random policy). update always returns nothing.

source
diff --git a/dev/POMDPTools/common_rl/index.html b/dev/POMDPTools/common_rl/index.html index 250aa43f..9cadd75a 100644 --- a/dev/POMDPTools/common_rl/index.html +++ b/dev/POMDPTools/common_rl/index.html @@ -13,4 +13,4 @@ planner = solve(POMCPSolver(), m) a = action(planner, initialstate(m))

You can also use the constructors listed below to manually convert between the interfaces.

Environment Wrapper Types

Since the standard reinforcement learning environment interface offers less information about the internal workings of the environment than the POMDPs.jl interface, MDPs and POMDPs created from these environments will have limited functionality. There are two types of (PO)MDP types that can wrap an environment:

Generative model wrappers

If the state and setstate! CommonRLInterface functions are provided, then the environment can be wrapped in a RLEnvMDP or RLEnvPOMDP and the POMDPs.jl generative model interface will be available.

Opaque wrappers

If the state and setstate! are not provided, then the resulting POMDP or MDP can only be simulated. This case is represented using the OpaqueRLEnvPOMDP and OpaqueRLEnvMDP wrappers. From the POMDPs.jl perspective, the state of the opaque (PO)MDP is just an integer wrapped in an OpaqueRLEnvState. This keeps track of the "age" of the environment so that POMDPs.jl actions that attempt to interact with the environment at a different age are invalid.

Constructors

Creating RL environments from MDPs and POMDPs

POMDPTools.CommonRLIntegration.MDPCommonRLEnvType
MDPCommonRLEnv(m, [s])
 MDPCommonRLEnv{RLO}(m, [s])

Create a CommonRLInterface environment from MDP m; optionally specify the state 's'.

The RLO parameter can be used to specify a type to convert the observation to. By default, this is AbstractArray. Use Any to disable conversion.

source
POMDPTools.CommonRLIntegration.POMDPCommonRLEnvType
POMDPCommonRLEnv(m, [s], [o])
-POMDPCommonRLEnv{RLO}(m, [s], [o])

Create a CommonRLInterface environment from POMDP m; optionally specify the state 's' and observation 'o'.

The RLO and RLS parameters can be used to specify types to convert the observation and state to. By default, this is AbstractArray. Use Any to disable conversion.

source

Creating MDPs and POMDPs from RL environments

POMDPTools.CommonRLIntegration.RLEnvMDPType
RLEnvMDP(env; discount=1.0)

Create an MDP by wrapping a CommonRLInterface.AbstractEnv. state and setstate! from CommonRLInterface must be provided, and the POMDPs generative model functionality will be provided.

source
POMDPTools.CommonRLIntegration.RLEnvPOMDPType
RLEnvPOMDP(env; discount=1.0)

Create an POMDP by wrapping a CommonRLInterface.AbstractEnv. state and setstate! from CommonRLInterface must be provided, and the POMDPs generative model functionality will be provided.

source
POMDPTools.CommonRLIntegration.OpaqueRLEnvMDPType
OpaqueRLEnvMDP(env; discount=1.0)

Wrap a CommonRLInterface.AbstractEnv in an MDP object. The state will be an OpaqueRLEnvState and only simulation will be supported.

source
POMDPTools.CommonRLIntegration.OpaqueRLEnvPOMDPType
OpaqueRLEnvPOMDP(env; discount=1.0)

Wrap a CommonRLInterface.AbstractEnv in an POMDP object. The state will be an OpaqueRLEnvState and only simulation will be supported.

source
+POMDPCommonRLEnv{RLO}(m, [s], [o])

Create a CommonRLInterface environment from POMDP m; optionally specify the state 's' and observation 'o'.

The RLO and RLS parameters can be used to specify types to convert the observation and state to. By default, this is AbstractArray. Use Any to disable conversion.

source

Creating MDPs and POMDPs from RL environments

POMDPTools.CommonRLIntegration.RLEnvMDPType
RLEnvMDP(env; discount=1.0)

Create an MDP by wrapping a CommonRLInterface.AbstractEnv. state and setstate! from CommonRLInterface must be provided, and the POMDPs generative model functionality will be provided.

source
POMDPTools.CommonRLIntegration.RLEnvPOMDPType
RLEnvPOMDP(env; discount=1.0)

Create an POMDP by wrapping a CommonRLInterface.AbstractEnv. state and setstate! from CommonRLInterface must be provided, and the POMDPs generative model functionality will be provided.

source
POMDPTools.CommonRLIntegration.OpaqueRLEnvMDPType
OpaqueRLEnvMDP(env; discount=1.0)

Wrap a CommonRLInterface.AbstractEnv in an MDP object. The state will be an OpaqueRLEnvState and only simulation will be supported.

source
POMDPTools.CommonRLIntegration.OpaqueRLEnvPOMDPType
OpaqueRLEnvPOMDP(env; discount=1.0)

Wrap a CommonRLInterface.AbstractEnv in an POMDP object. The state will be an OpaqueRLEnvState and only simulation will be supported.

source
diff --git a/dev/POMDPTools/distributions/index.html b/dev/POMDPTools/distributions/index.html index 67838a99..1d7c6026 100644 --- a/dev/POMDPTools/distributions/index.html +++ b/dev/POMDPTools/distributions/index.html @@ -8,4 +8,4 @@ end td = transition(MyMDP(), 1.0, 1) -rand(td) # will return a number near 2source

Bool Distribution

POMDPTools.POMDPDistributions.BoolDistributionType
BoolDistribution(p_true)

Create a distribution over Boolean values (true or false).

p_true is the probability of the true outcome; the probability of false is 1-p_true.

source

Deterministic

POMDPTools.POMDPDistributions.DeterministicType
Deterministic(value)

Create a deterministic distribution over only one value.

This is intended to be used when a distribution is required, but the outcome is deterministic. It is equivalent to a Kronecker Delta distribution.

source

Uniform

POMDPTools.POMDPDistributions.UniformType
Uniform(collection)

Create a uniform categorical distribution over a collection of objects.

The objects in the collection must be unique (this is tested on construction), and will be stored in a Set. To avoid this overhead, use UnsafeUniform.

source
POMDPTools.POMDPDistributions.UnsafeUniformType
UnsafeUniform(collection)

Create a uniform categorical distribution over a collection of objects.

No checks are performed to ensure uniqueness or check whether an object is actually in the set when evaluating the pdf.

source

Pretty Printing

POMDPTools.POMDPDistributions.showdistributionFunction
showdistribution([io], [mime], d)

Show a UnicodePlots.barplot representation of a distribution.

Keyword Arguments

  • title::String=string(typeof(d))*" distribution": title for the barplot.
source
+rand(td) # will return a number near 2source

Bool Distribution

POMDPTools.POMDPDistributions.BoolDistributionType
BoolDistribution(p_true)

Create a distribution over Boolean values (true or false).

p_true is the probability of the true outcome; the probability of false is 1-p_true.

source

Deterministic

POMDPTools.POMDPDistributions.DeterministicType
Deterministic(value)

Create a deterministic distribution over only one value.

This is intended to be used when a distribution is required, but the outcome is deterministic. It is equivalent to a Kronecker Delta distribution.

source

Uniform

POMDPTools.POMDPDistributions.UniformType
Uniform(collection)

Create a uniform categorical distribution over a collection of objects.

The objects in the collection must be unique (this is tested on construction), and will be stored in a Set. To avoid this overhead, use UnsafeUniform.

source
POMDPTools.POMDPDistributions.UnsafeUniformType
UnsafeUniform(collection)

Create a uniform categorical distribution over a collection of objects.

No checks are performed to ensure uniqueness or check whether an object is actually in the set when evaluating the pdf.

source

Pretty Printing

POMDPTools.POMDPDistributions.showdistributionFunction
showdistribution([io], [mime], d)

Show a UnicodePlots.barplot representation of a distribution.

Keyword Arguments

  • title::String=string(typeof(d))*" distribution": title for the barplot.
source
diff --git a/dev/POMDPTools/index.html b/dev/POMDPTools/index.html index e0bfe39a..67f2c1b1 100644 --- a/dev/POMDPTools/index.html +++ b/dev/POMDPTools/index.html @@ -1,2 +1,2 @@ -POMDPTools: the standard library for POMDPs.jl · POMDPs.jl

POMDPTools: the standard library for POMDPs.jl

The POMDPs.jl package does nothing more than define an interface or language for interacting with and solving (PO)MDPs; it does not contain any implementations. In practice, defining and solving POMDPs is made vastly easier if some commonly-used structures are provided. The POMDPTools package contains these implementations. Thus, the relationship between POMDPs.jl and POMDPTools is similar to the relationship between a programming language and its standard library.

The POMDPTools package source code is hosted in the POMDPs.jl github repository in the lib/POMDPTools directory.

The contents of the library are outlined below:

+POMDPTools: the standard library for POMDPs.jl · POMDPs.jl

POMDPTools: the standard library for POMDPs.jl

The POMDPs.jl package does nothing more than define an interface or language for interacting with and solving (PO)MDPs; it does not contain any implementations. In practice, defining and solving POMDPs is made vastly easier if some commonly-used structures are provided. The POMDPTools package contains these implementations. Thus, the relationship between POMDPs.jl and POMDPTools is similar to the relationship between a programming language and its standard library.

The POMDPTools package source code is hosted in the POMDPs.jl github repository in the lib/POMDPTools directory.

The contents of the library are outlined below:

diff --git a/dev/POMDPTools/model/index.html b/dev/POMDPTools/model/index.html index e6596ba1..6fe9d6f6 100644 --- a/dev/POMDPTools/model/index.html +++ b/dev/POMDPTools/model/index.html @@ -39,4 +39,4 @@ # output --15.0source

Utility Types

Terminal State

TerminalState and its singleton instance terminalstate are available to use for a terminal state in concert with another state type. It has the appropriate type promotion logic to make its use with other types friendly, similar to nothing and missing.

Note

NOTE: This is NOT a replacement for the standard POMDPs.jl isterminal function, though isterminal is implemented for the type. It is merely a convenient type to use for terminal states.

Warning

WARNING: Early tests (August 2018) suggest that the Julia 1.0 compiler will not be able to efficiently implement union splitting in cases as complex as POMDPs, so using a Union for the state type of a problem can currently have a large overhead.

POMDPTools.ModelTools.TerminalStateType
TerminalState

A type with no fields whose singleton instance terminalstate is used to represent a terminal state with no additional information.

This type has the appropriate promotion logic implemented to function like Missing when added to arrays, etc.

Note that terminal states NEED NOT be of type TerminalState. You can define any state to be terminal by implementing the appropriate isterminal method. Solvers and simulators SHOULD NOT check for this type, but should instead check using isterminal.

source
POMDPTools.ModelTools.terminalstateConstant
terminalstate

The singleton instance of type TerminalState representing a terminal state.

source
+-15.0source

Utility Types

Terminal State

TerminalState and its singleton instance terminalstate are available to use for a terminal state in concert with another state type. It has the appropriate type promotion logic to make its use with other types friendly, similar to nothing and missing.

Note

NOTE: This is NOT a replacement for the standard POMDPs.jl isterminal function, though isterminal is implemented for the type. It is merely a convenient type to use for terminal states.

Warning

WARNING: Early tests (August 2018) suggest that the Julia 1.0 compiler will not be able to efficiently implement union splitting in cases as complex as POMDPs, so using a Union for the state type of a problem can currently have a large overhead.

POMDPTools.ModelTools.TerminalStateType
TerminalState

A type with no fields whose singleton instance terminalstate is used to represent a terminal state with no additional information.

This type has the appropriate promotion logic implemented to function like Missing when added to arrays, etc.

Note that terminal states NEED NOT be of type TerminalState. You can define any state to be terminal by implementing the appropriate isterminal method. Solvers and simulators SHOULD NOT check for this type, but should instead check using isterminal.

source
POMDPTools.ModelTools.terminalstateConstant
terminalstate

The singleton instance of type TerminalState representing a terminal state.

source
diff --git a/dev/POMDPTools/policies/index.html b/dev/POMDPTools/policies/index.html index 663c1457..9a1c4c1b 100644 --- a/dev/POMDPTools/policies/index.html +++ b/dev/POMDPTools/policies/index.html @@ -38,4 +38,4 @@ evaluate(m::MDP, p::Policy; rewardfunction=POMDPs.reward)

Calculate the value for a policy on an MDP using the approach in equation 4.2.2 of Kochenderfer, Decision Making Under Uncertainty, 2015.

Returns a DiscreteValueFunction, which maps states to values.

Example

using POMDPTools, POMDPModels
 m = SimpleGridWorld()
 u = evaluate(m, FunctionPolicy(x->:left))
-u([1,1]) # value of always moving left starting at state [1,1]
source +u([1,1]) # value of always moving left starting at state [1,1]source diff --git a/dev/POMDPTools/simulators/index.html b/dev/POMDPTools/simulators/index.html index 9eb32cdb..25010974 100644 --- a/dev/POMDPTools/simulators/index.html +++ b/dev/POMDPTools/simulators/index.html @@ -77,4 +77,4 @@ m = SimpleGridWorld() simulate(ds, m, RandomPolicy(m))
POMDPTools.Simulators.DisplaySimulatorType
DisplaySimulator(;kwargs...)

Create a simulator that displays each step of a simulation.

Given a POMDP or MDP model m, this simulator roughly works like

for step in stepthrough(m, ...)
     display(render(m, step))
-end

Keyword Arguments

  • display::AbstractDisplay: the display to use for the first argument to the display function. If this is nothing, display(...) will be called without an AbstractDisplay argument.
  • render_kwargs::NamedTuple: keyword arguments for POMDPTools.render(...)
  • max_fps::Number=10: maximum number of frames to be displayed per second - sleep will be used to skip extra time, so this is not designed for high precision
  • predisplay::Function: function to call before every call to display(...). The only argument to this function will be the display (if it is specified) or nothing
  • extra_initial::Bool=false: if true, display an extra step at the beginning with only elements t, sp, and bp for POMDPs (this can be useful to see the initial state if render displays only sp and not s).
  • extra_final::Bool=true: iftrue, display an extra step at the end with only elementst,done,s, andbfor POMDPs (this can be useful to see the final state ifrenderdisplays onlysand notsp`).
  • max_steps::Integer: maximum number of steps to run for
  • spec::NTuple{Symbol}: specification of what step elements to display (see eachstep)
  • rng::AbstractRNG: random number generator

See the POMDPSimulators documentation for more tips about using specific displays.

source

Display-specific tips

The following tips may be helpful when using particular displays.

Jupyter notebooks

By default, in a Jupyter notebook, the visualizations of all steps are displayed in the output box one after another. To make the output animated instead, where the image is overwritten at each step, one may use

DisplaySimulator(predisplay=(d)->IJulia.clear_output(true))

ElectronDisplay

By default, ElectronDisplay will open a new window for each new step. To prevent this, use

ElectronDisplay.CONFIG.single_window = true
+end

Keyword Arguments

See the POMDPSimulators documentation for more tips about using specific displays.

source

Display-specific tips

The following tips may be helpful when using particular displays.

Jupyter notebooks

By default, in a Jupyter notebook, the visualizations of all steps are displayed in the output box one after another. To make the output animated instead, where the image is overwritten at each step, one may use

DisplaySimulator(predisplay=(d)->IJulia.clear_output(true))

ElectronDisplay

By default, ElectronDisplay will open a new window for each new step. To prevent this, use

ElectronDisplay.CONFIG.single_window = true
diff --git a/dev/POMDPTools/testing/index.html b/dev/POMDPTools/testing/index.html index ec8d8a33..5deb5fd7 100644 --- a/dev/POMDPTools/testing/index.html +++ b/dev/POMDPTools/testing/index.html @@ -5,4 +5,4 @@ using POMDPModels solver = YourSolver(# initialize with parameters #) -test_solver(solver, BabyPOMDP())source +test_solver(solver, BabyPOMDP())source diff --git a/dev/POMDPTools/visualization/index.html b/dev/POMDPTools/visualization/index.html index 21efe7af..a93416f7 100644 --- a/dev/POMDPTools/visualization/index.html +++ b/dev/POMDPTools/visualization/index.html @@ -4,4 +4,4 @@ step::NamedTuple end -POMDPTools.render(mdp, step) = MyProblemVisualization(mdp, step)

and then implement custom show methods, e.g.

show(io::IO, mime::MIME"text/html", v::MyProblemVisualization)
+POMDPTools.render(mdp, step) = MyProblemVisualization(mdp, step)

and then implement custom show methods, e.g.

show(io::IO, mime::MIME"text/html", v::MyProblemVisualization)
diff --git a/dev/api/index.html b/dev/api/index.html index b8ce3513..ac36e22d 100644 --- a/dev/api/index.html +++ b/dev/api/index.html @@ -33,4 +33,4 @@ state_distribution::Any) initialize_belief(updater::Updater, belief::Any)

Returns a belief that can be updated using updater that has similar distribution to state_distribution or belief.

The conversion may be lossy. This function is also idempotent, i.e. there is a default implementation that passes the belief through when it is already the correct type: initialize_belief(updater::Updater, belief) = belief

source
POMDPs.historyFunction
history(b)

Return the action-observation history associated with belief b.

The history should be an AbstractVector, Tuple, (or similar object that supports indexing with end) full of NamedTuples with keys :a and :o, i.e. history(b)[end][:a] should be the last action taken leading up to b, and history(b)[end][:o] should be the last observation received.

It is acceptable to return only part of the history if that is all that is available, but it should always end with the current observation. For example, it would be acceptable to return a structure containing only the last three observations in a length 3 Vector{NamedTuple{(:o,),Tuple{O}}.

source
POMDPs.currentobsFunction
currentobs(b)

Return the latest observation associated with belief b.

If a solver or updater implements history(b) for a belief type, currentobs has a default implementation.

source

Policy and Solver Functions

POMDPs.solveFunction
solve(solver::Solver, problem::POMDP)

Solves the POMDP using method associated with solver, and returns a policy.

source
POMDPs.updaterFunction
updater(policy::Policy)

Returns a default Updater appropriate for a belief type that policy p can use

source
POMDPs.actionFunction
action(policy::Policy, x)

Returns the action that the policy deems best for the current state or belief, x.

x is a generalized information state - can be a state in an MDP, a distribution in POMDP, or another specialized policy-dependent representation of the information needed to choose an action.

source
POMDPs.valueFunction
value(p::Policy, s)
 value(p::Policy, s, a)

Returns the utility value from policy p given the state (or belief), or state-action (or belief-action) pair.

The state-action version is commonly referred to as the Q-value.

source

Simulator

POMDPs.SimulatorType

Base type for an object defining how simulations should be carried out.

source
POMDPs.simulateFunction
simulate(sim::Simulator, m::POMDP, p::Policy, u::Updater=updater(p), b0=initialstate(m), s0=rand(b0))
-simulate(sim::Simulator, m::MDP, p::Policy, s0=rand(initialstate(m)))

Run a simulation using the specified policy.

The return type is flexible and depends on the simulator. Simulations should adhere to the Simulation Standard.

source
+simulate(sim::Simulator, m::MDP, p::Policy, s0=rand(initialstate(m)))

Run a simulation using the specified policy.

The return type is flexible and depends on the simulator. Simulations should adhere to the Simulation Standard.

source diff --git a/dev/concepts/index.html b/dev/concepts/index.html index b7de6424..237d540c 100644 --- a/dev/concepts/index.html +++ b/dev/concepts/index.html @@ -1,2 +1,2 @@ -Concepts and Architecture · POMDPs.jl

Concepts and Architecture

POMDPs.jl aims to coordinate the development of three software components: 1) a problem, 2) a solver, 3) an experiment. Each of these components has a set of abstract types associated with it and a set of functions that allow a user to define each component's behavior in a standardized way. An outline of the architecture is shown below.

concepts

The MDP and POMDP types are associated with the problem definition. The Solver and Policy types are associated with the solver or decision-making agent. Typically, the Updater type is also associated with the solver, but a solver may sometimes be used with an updater that was implemented separately. The Simulator type is associated with the experiment.

The code components of the POMDPs.jl ecosystem relevant to problems and solvers are shown below. The arrows represent the flow of information from the problems to the solvers. The figure shows the two interfaces that form POMDPs.jl - Explicit and Generative. Details about these interfaces can be found in the section on Defining POMDPs.

interface_relationships

POMDPs and MDPs

An MDP is a mathematical framework for sequential decision making under uncertainty, and where all of the uncertainty arises from outcomes that are partially random and partially under the control of a decision maker. Mathematically, an MDP is a tuple $(S,A,T,R,\gamma)$, where $S$ is the state space, $A$ is the action space, $T$ is a transition function defining the probability of transitioning to each state given the state and action at the previous time, and $R$ is a reward function mapping every possible transition $(s,a,s')$ to a real reward value. Finally, $\gamma$ is a discount factor that defines the relative weighting of current and future rewards. For more information see a textbook such as [1]. In POMDPs.jl an MDP is represented by a concrete subtype of the MDP abstract type and a set of methods that define each of its components as described in the problem definition section.

A POMDP is a more general sequential decision making problem in which the agent is not sure what state they are in. The state is only partially observable by the decision making agent. Mathematically, a POMDP is a tuple $(S,A,T,R,O,Z,\gamma)$ where $S$, $A$, $T$, $R$, and $\gamma$ have the same meaning as in an MDP, $O$ is the agent's observation space, and $Z$ defines the probability of receiving each observation at a transition. In POMDPs.jl, a POMDP is represented by a concrete subtype of the POMDP abstract type, and the methods described in the problem definition section.

POMDPs.jl contains additional functions for defining optional problem behavior such as an initial state distribution or terminal states. More information can be found in the Defining POMDPs section.

Beliefs and Updaters

In a POMDP domain, the decision-making agent does not have complete information about the state of the problem, so the agent can only make choices based on its "belief" about the state. In the POMDP literature, the term "belief" is typically defined to mean a probability distribution over all possible states of the system. However, in practice, the agent often makes decisions based on an incomplete or lossy record of past observations that has a structure much different from a probability distribution. For example, if the agent is represented by a finite-state controller, as is the case for Monte-Carlo Value Iteration [2], the belief is the controller state, which is a node in a graph. Another example is an agent represented by a recurrent neural network. In this case, the agent's belief is the state of the network. In order to accommodate a wide variety of decision-making approaches in POMDPs.jl, we use the term "belief" to denote the set of information that the agent makes a decision on, which could be an exact state distribution, an action-observation history, a set of weighted particles, or the examples mentioned before. In code, the belief can be represented by any built-in or user-defined type.

When an action is taken and a new observation is received, the belief is updated by the belief updater. In code, a belief updater is represented by a concrete subtype of the Updater abstract type, and the update(updater, belief, action, observation) function defines how the belief is updated when a new observation is received.

Although the agent may use a specialized belief structure to make decisions, the information initially given to the agent about the state of the problem is usually most conveniently represented as a state distribution, thus the initialize_belief function is provided to convert a state distribution to a specialized belief structure that an updater can work with.

In many cases, the belief structure is closely related to the solution technique, so it will be implemented by the programmer who writes the solver. In other cases, the agent can use a variety of belief structures to make decisions, so a domain-specific updater implemented by the programmer that wrote the problem description may be appropriate. Finally, some advanced generic belief updaters such as particle filters may be implemented by a third party. The convenience function updater(policy) can be used to get a suitable default updater for a policy, however many policies can work with other updaters.

For more information on implementing a belief updater, see Defining a Belief Updater

Solvers and Policies

Sequential decision making under uncertainty involves both online and offline calculations. In the broad sense, the term "solver" as used in the node in the figure at the top of the page refers to the software package that performs the calculations at both of these times. However, the code is broken up into two pieces, the solver that performs calculations offline and the policy that performs calculations online.

In the abstract, a policy is a mapping from every belief that an agent might take to an action. A policy is represented in code by a concrete subtype of the Policy abstract type. The programmer implements action to describe what computations need to be done online. For an online solver such as POMCP, all of the decision computation occurs within action while for an offline solver like SARSOP, there is very little computation within action. See Interacting with Policies for more information.

The offline portion of the computation is carried out by the solver, which is represented by a concrete subtype of the Solver abstract type. Computations occur within the solve function. For an offline solver like SARSOP, nearly all of the decision computation occurs within this function, but for some online solvers such as POMCP, solve merely embeds the problem in the policy.

Simulators

A simulator defines a way to run one or more simulations. It is represented by a concrete subtype of the Simulator abstract type and the simulation is an implemention of simulate. Depending on the simulator, simulate may return a variety of data about the simulation, such as the discounted reward or the state history. All simulators should perform simulations consistent with the Simulation Standard.

[1] Decision Making Under Uncertainty: Theory and Application by Mykel J. Kochenderfer, MIT Press, 2015

[2] Bai, H., Hsu, D., & Lee, W. S. (2014). Integrated perception and planning in the continuous space: A POMDP approach. The International Journal of Robotics Research, 33(9), 1288-1302

+Concepts and Architecture · POMDPs.jl

Concepts and Architecture

POMDPs.jl aims to coordinate the development of three software components: 1) a problem, 2) a solver, 3) an experiment. Each of these components has a set of abstract types associated with it and a set of functions that allow a user to define each component's behavior in a standardized way. An outline of the architecture is shown below.

concepts

The MDP and POMDP types are associated with the problem definition. The Solver and Policy types are associated with the solver or decision-making agent. Typically, the Updater type is also associated with the solver, but a solver may sometimes be used with an updater that was implemented separately. The Simulator type is associated with the experiment.

The code components of the POMDPs.jl ecosystem relevant to problems and solvers are shown below. The arrows represent the flow of information from the problems to the solvers. The figure shows the two interfaces that form POMDPs.jl - Explicit and Generative. Details about these interfaces can be found in the section on Defining POMDPs.

interface_relationships

POMDPs and MDPs

An MDP is a mathematical framework for sequential decision making under uncertainty, and where all of the uncertainty arises from outcomes that are partially random and partially under the control of a decision maker. Mathematically, an MDP is a tuple $(S,A,T,R,\gamma)$, where $S$ is the state space, $A$ is the action space, $T$ is a transition function defining the probability of transitioning to each state given the state and action at the previous time, and $R$ is a reward function mapping every possible transition $(s,a,s')$ to a real reward value. Finally, $\gamma$ is a discount factor that defines the relative weighting of current and future rewards. For more information see a textbook such as [1]. In POMDPs.jl an MDP is represented by a concrete subtype of the MDP abstract type and a set of methods that define each of its components as described in the problem definition section.

A POMDP is a more general sequential decision making problem in which the agent is not sure what state they are in. The state is only partially observable by the decision making agent. Mathematically, a POMDP is a tuple $(S,A,T,R,O,Z,\gamma)$ where $S$, $A$, $T$, $R$, and $\gamma$ have the same meaning as in an MDP, $O$ is the agent's observation space, and $Z$ defines the probability of receiving each observation at a transition. In POMDPs.jl, a POMDP is represented by a concrete subtype of the POMDP abstract type, and the methods described in the problem definition section.

POMDPs.jl contains additional functions for defining optional problem behavior such as an initial state distribution or terminal states. More information can be found in the Defining POMDPs section.

Beliefs and Updaters

In a POMDP domain, the decision-making agent does not have complete information about the state of the problem, so the agent can only make choices based on its "belief" about the state. In the POMDP literature, the term "belief" is typically defined to mean a probability distribution over all possible states of the system. However, in practice, the agent often makes decisions based on an incomplete or lossy record of past observations that has a structure much different from a probability distribution. For example, if the agent is represented by a finite-state controller, as is the case for Monte-Carlo Value Iteration [2], the belief is the controller state, which is a node in a graph. Another example is an agent represented by a recurrent neural network. In this case, the agent's belief is the state of the network. In order to accommodate a wide variety of decision-making approaches in POMDPs.jl, we use the term "belief" to denote the set of information that the agent makes a decision on, which could be an exact state distribution, an action-observation history, a set of weighted particles, or the examples mentioned before. In code, the belief can be represented by any built-in or user-defined type.

When an action is taken and a new observation is received, the belief is updated by the belief updater. In code, a belief updater is represented by a concrete subtype of the Updater abstract type, and the update(updater, belief, action, observation) function defines how the belief is updated when a new observation is received.

Although the agent may use a specialized belief structure to make decisions, the information initially given to the agent about the state of the problem is usually most conveniently represented as a state distribution, thus the initialize_belief function is provided to convert a state distribution to a specialized belief structure that an updater can work with.

In many cases, the belief structure is closely related to the solution technique, so it will be implemented by the programmer who writes the solver. In other cases, the agent can use a variety of belief structures to make decisions, so a domain-specific updater implemented by the programmer that wrote the problem description may be appropriate. Finally, some advanced generic belief updaters such as particle filters may be implemented by a third party. The convenience function updater(policy) can be used to get a suitable default updater for a policy, however many policies can work with other updaters.

For more information on implementing a belief updater, see Defining a Belief Updater

Solvers and Policies

Sequential decision making under uncertainty involves both online and offline calculations. In the broad sense, the term "solver" as used in the node in the figure at the top of the page refers to the software package that performs the calculations at both of these times. However, the code is broken up into two pieces, the solver that performs calculations offline and the policy that performs calculations online.

In the abstract, a policy is a mapping from every belief that an agent might take to an action. A policy is represented in code by a concrete subtype of the Policy abstract type. The programmer implements action to describe what computations need to be done online. For an online solver such as POMCP, all of the decision computation occurs within action while for an offline solver like SARSOP, there is very little computation within action. See Interacting with Policies for more information.

The offline portion of the computation is carried out by the solver, which is represented by a concrete subtype of the Solver abstract type. Computations occur within the solve function. For an offline solver like SARSOP, nearly all of the decision computation occurs within this function, but for some online solvers such as POMCP, solve merely embeds the problem in the policy.

Simulators

A simulator defines a way to run one or more simulations. It is represented by a concrete subtype of the Simulator abstract type and the simulation is an implemention of simulate. Depending on the simulator, simulate may return a variety of data about the simulation, such as the discounted reward or the state history. All simulators should perform simulations consistent with the Simulation Standard.

[1] Decision Making Under Uncertainty: Theory and Application by Mykel J. Kochenderfer, MIT Press, 2015

[2] Bai, H., Hsu, D., & Lee, W. S. (2014). Integrated perception and planning in the continuous space: A POMDP approach. The International Journal of Robotics Research, 33(9), 1288-1302

diff --git a/dev/def_pomdp/index.html b/dev/def_pomdp/index.html index b80749d6..9864b6ab 100644 --- a/dev/def_pomdp/index.html +++ b/dev/def_pomdp/index.html @@ -196,4 +196,4 @@ R = [-1. -100. 10.; -1. 10. -100.] -m = TabularPOMDP(T, R, O, 0.95)

Here T is a $|S| \times |A| \times |S|$ array representing the transition probabilities, with T[sp, a, s] $= T(s' | s, a)$. Similarly, O is an $|O| \times |A| \times |S|$ encoding the observation distribution with O[o, a, sp] $= Z(o | a, s')$, and R is a $|S| \times |A|$ matrix that encodes the reward function. 0.95 is the discount factor.

+m = TabularPOMDP(T, R, O, 0.95)

Here T is a $|S| \times |A| \times |S|$ array representing the transition probabilities, with T[sp, a, s] $= T(s' | s, a)$. Similarly, O is an $|O| \times |A| \times |S|$ encoding the observation distribution with O[o, a, sp] $= Z(o | a, s')$, and R is a $|S| \times |A|$ matrix that encodes the reward function. 0.95 is the discount factor.

diff --git a/dev/def_solver/index.html b/dev/def_solver/index.html index 125d4550..9e26fdd5 100644 --- a/dev/def_solver/index.html +++ b/dev/def_solver/index.html @@ -1,2 +1,2 @@ -Solvers · POMDPs.jl

Solvers

Defining a solver involves creating or using four pieces of code:

  1. A subtype of Solver that holds the parameters and configuration options for the solver.
  2. A subtype of Policy that holds all of the data needed to choose actions online.
  3. A method of solve that takes the Solver and a (PO)MDP as arguments, performs all of the offline computations for solving the problem, and returns the policy.
  4. A method of action that takes in the policy and a state or belief and returns an action.

In many cases, items 2 and 4 can be satisfied with an off-the-shelf Policy from the POMDPTools package. also contains many tools that are useful for defining solvers in a robust, concise, and readable manner.

Online and Offline Solvers

Generally, solvers can be grouped into two categories: Offline solvers that do most of their computational work before interacting with the environment, and online solvers that do their work online as each new state or observation is encountered. Although offline and online solvers both use the exact same Solver, solve, Policy, action structure, the work of defining online and offline solvers is focused on different portions.

For an offline solver, most of the implementation effort will be spent on the [solve] function, and an off-the-shelf policy from POMDPTools will typically be used.

For an online solver, the solve function typically does little or no work, but merely creates a Policy object that will carry out computation online. It is typical in POMDPs.jl to use the term "Planner" to name a Policy object for an online solver that carries out a large amount of computation ("planning") at interaction time. In this case most of the effort will be focused on implementing the action method for the "Planner" Policy type.

Examples

Solver implementation is most clearly explained through examples. The following sections contain examples of both online and offline solver definitions:

+Solvers · POMDPs.jl

Solvers

Defining a solver involves creating or using four pieces of code:

  1. A subtype of Solver that holds the parameters and configuration options for the solver.
  2. A subtype of Policy that holds all of the data needed to choose actions online.
  3. A method of solve that takes the Solver and a (PO)MDP as arguments, performs all of the offline computations for solving the problem, and returns the policy.
  4. A method of action that takes in the policy and a state or belief and returns an action.

In many cases, items 2 and 4 can be satisfied with an off-the-shelf Policy from the POMDPTools package. also contains many tools that are useful for defining solvers in a robust, concise, and readable manner.

Online and Offline Solvers

Generally, solvers can be grouped into two categories: Offline solvers that do most of their computational work before interacting with the environment, and online solvers that do their work online as each new state or observation is encountered. Although offline and online solvers both use the exact same Solver, solve, Policy, action structure, the work of defining online and offline solvers is focused on different portions.

For an offline solver, most of the implementation effort will be spent on the [solve] function, and an off-the-shelf policy from POMDPTools will typically be used.

For an online solver, the solve function typically does little or no work, but merely creates a Policy object that will carry out computation online. It is typical in POMDPs.jl to use the term "Planner" to name a Policy object for an online solver that carries out a large amount of computation ("planning") at interaction time. In this case most of the effort will be focused on implementing the action method for the "Planner" Policy type.

Examples

Solver implementation is most clearly explained through examples. The following sections contain examples of both online and offline solver definitions:

diff --git a/dev/def_updater/index.html b/dev/def_updater/index.html index 6605bbcf..eef7b6e5 100644 --- a/dev/def_updater/index.html +++ b/dev/def_updater/index.html @@ -29,4 +29,4 @@ b = Any[POMDPModels.BoolDistribution(0.0), false, false] b = Any[POMDPModels.BoolDistribution(0.0), false, false, false, false] b = Any[POMDPModels.BoolDistribution(0.0), false, false, false, false, true, false] -b = Any[POMDPModels.BoolDistribution(0.0), false, false, false, false, true, false, true, false] +b = Any[POMDPModels.BoolDistribution(0.0), false, false, false, false, true, false, true, false] diff --git a/dev/example_defining_problems/index.html b/dev/example_defining_problems/index.html index 0607f280..252898da 100644 --- a/dev/example_defining_problems/index.html +++ b/dev/example_defining_problems/index.html @@ -247,4 +247,4 @@ discount = 0.9 -tabular_crying_baby_pomdp = TabularPOMDP(T, R, O, discount) +tabular_crying_baby_pomdp = TabularPOMDP(T, R, O, discount) diff --git a/dev/example_gridworld_mdp/index.html b/dev/example_gridworld_mdp/index.html index 483b8338..c51ea0d8 100644 --- a/dev/example_gridworld_mdp/index.html +++ b/dev/example_gridworld_mdp/index.html @@ -272,30 +272,30 @@ solver = ValueIterationSolver(; max_iterations=100, belres=1e-3, verbose=true) # Solve for an optimal policy -vi_policy = POMDPs.solve(solver, mdp)
[Iteration 1   ] residual:         10 | iteration runtime:      0.217 ms, (  0.000217 s total)
-[Iteration 2   ] residual:        6.3 | iteration runtime:      0.285 ms, (  0.000502 s total)
-[Iteration 3   ] residual:       4.53 | iteration runtime:      0.233 ms, (  0.000735 s total)
-[Iteration 4   ] residual:       3.21 | iteration runtime:      0.215 ms, (   0.00095 s total)
-[Iteration 5   ] residual:       2.31 | iteration runtime:      0.223 ms, (   0.00117 s total)
-[Iteration 6   ] residual:       1.62 | iteration runtime:      0.215 ms, (   0.00139 s total)
-[Iteration 7   ] residual:       1.24 | iteration runtime:      0.226 ms, (   0.00161 s total)
-[Iteration 8   ] residual:       1.06 | iteration runtime:      0.227 ms, (   0.00184 s total)
-[Iteration 9   ] residual:      0.865 | iteration runtime:      0.223 ms, (   0.00206 s total)
-[Iteration 10  ] residual:      0.657 | iteration runtime:      0.211 ms, (   0.00227 s total)
-[Iteration 11  ] residual:      0.545 | iteration runtime:      0.231 ms, (   0.00251 s total)
-[Iteration 12  ] residual:      0.455 | iteration runtime:      0.214 ms, (   0.00272 s total)
-[Iteration 13  ] residual:      0.378 | iteration runtime:      0.214 ms, (   0.00293 s total)
-[Iteration 14  ] residual:      0.306 | iteration runtime:      0.213 ms, (   0.00315 s total)
-[Iteration 15  ] residual:      0.211 | iteration runtime:      0.211 ms, (   0.00336 s total)
-[Iteration 16  ] residual:      0.132 | iteration runtime:      0.212 ms, (   0.00357 s total)
-[Iteration 17  ] residual:     0.0778 | iteration runtime:      0.211 ms, (   0.00378 s total)
-[Iteration 18  ] residual:     0.0437 | iteration runtime:      0.217 ms, (     0.004 s total)
-[Iteration 19  ] residual:     0.0237 | iteration runtime:      0.223 ms, (   0.00422 s total)
-[Iteration 20  ] residual:     0.0125 | iteration runtime:      0.213 ms, (   0.00443 s total)
-[Iteration 21  ] residual:    0.00649 | iteration runtime:      0.211 ms, (   0.00465 s total)
-[Iteration 22  ] residual:    0.00332 | iteration runtime:      0.213 ms, (   0.00486 s total)
-[Iteration 23  ] residual:    0.00167 | iteration runtime:      0.208 ms, (   0.00507 s total)
-[Iteration 24  ] residual:   0.000834 | iteration runtime:      0.209 ms, (   0.00528 s total)

We can now use the policy to compute the optimal action for a given state:

s = GridWorldState(9, 2)
+vi_policy = POMDPs.solve(solver, mdp)
[Iteration 1   ] residual:         10 | iteration runtime:      0.242 ms, (  0.000242 s total)
+[Iteration 2   ] residual:        6.3 | iteration runtime:      0.215 ms, (  0.000457 s total)
+[Iteration 3   ] residual:       4.53 | iteration runtime:      0.207 ms, (  0.000663 s total)
+[Iteration 4   ] residual:       3.21 | iteration runtime:      0.209 ms, (  0.000873 s total)
+[Iteration 5   ] residual:       2.31 | iteration runtime:      0.210 ms, (   0.00108 s total)
+[Iteration 6   ] residual:       1.62 | iteration runtime:      0.209 ms, (   0.00129 s total)
+[Iteration 7   ] residual:       1.24 | iteration runtime:      0.207 ms, (    0.0015 s total)
+[Iteration 8   ] residual:       1.06 | iteration runtime:      0.205 ms, (    0.0017 s total)
+[Iteration 9   ] residual:      0.865 | iteration runtime:      0.205 ms, (   0.00191 s total)
+[Iteration 10  ] residual:      0.657 | iteration runtime:      0.204 ms, (   0.00211 s total)
+[Iteration 11  ] residual:      0.545 | iteration runtime:      0.205 ms, (   0.00232 s total)
+[Iteration 12  ] residual:      0.455 | iteration runtime:      0.206 ms, (   0.00252 s total)
+[Iteration 13  ] residual:      0.378 | iteration runtime:      0.205 ms, (   0.00273 s total)
+[Iteration 14  ] residual:      0.306 | iteration runtime:      0.205 ms, (   0.00293 s total)
+[Iteration 15  ] residual:      0.211 | iteration runtime:      0.210 ms, (   0.00314 s total)
+[Iteration 16  ] residual:      0.132 | iteration runtime:      0.206 ms, (   0.00335 s total)
+[Iteration 17  ] residual:     0.0778 | iteration runtime:      0.227 ms, (   0.00358 s total)
+[Iteration 18  ] residual:     0.0437 | iteration runtime:      0.220 ms, (    0.0038 s total)
+[Iteration 19  ] residual:     0.0237 | iteration runtime:      0.208 ms, (   0.00401 s total)
+[Iteration 20  ] residual:     0.0125 | iteration runtime:      0.208 ms, (   0.00421 s total)
+[Iteration 21  ] residual:    0.00649 | iteration runtime:      0.204 ms, (   0.00442 s total)
+[Iteration 22  ] residual:    0.00332 | iteration runtime:      0.209 ms, (   0.00463 s total)
+[Iteration 23  ] residual:    0.00167 | iteration runtime:      0.209 ms, (   0.00484 s total)
+[Iteration 24  ] residual:   0.000834 | iteration runtime:      0.204 ms, (   0.00504 s total)

We can now use the policy to compute the optimal action for a given state:

s = GridWorldState(9, 2)
 @show action(vi_policy, s)
:up
s = GridWorldState(8, 3)
 @show action(vi_policy, s)
:right

Solving the Grid World MDP (MCTS)

Similar to the process with Value Iteration, we can solve the MDP using MCTS. We will use the MCTSSolver from the MCTS package.

# Initialize the problem (we have already done this, but just calling it again for completeness in the example)
 mdp = GridWorldMDP()
@@ -400,4 +400,4 @@
  2 | →  →  →  →  →  →  →  →  ↑  ↑ |
  1 | →  →  →  →  →  →  ↑  ↑  ↑  ↑ |
    ------------------------------
-    1  2  3  4  5  6  7  8  9  10

Seeing a Policy In Action

Another useful tool is to view the policy in action by creating a gif of a simulation. To accomplish this, we could use POMDPGifs. To use POMDPGifs, we need to extend the POMDPTools.render function to GridWorldMDP. Please reference Gallery of POMDPs.jl Problems for examples of this process.

+ 1 2 3 4 5 6 7 8 9 10

Seeing a Policy In Action

Another useful tool is to view the policy in action by creating a gif of a simulation. To accomplish this, we could use POMDPGifs. To use POMDPGifs, we need to extend the POMDPTools.render function to GridWorldMDP. Please reference Gallery of POMDPs.jl Problems for examples of this process.

diff --git a/dev/example_simulations/index.html b/dev/example_simulations/index.html index 29f0793d..10f6de8b 100644 --- a/dev/example_simulations/index.html +++ b/dev/example_simulations/index.html @@ -24,41 +24,41 @@ Step 2 b = sated => 0.989010989010989, hungry => 0.010989010989010992 s = :sated -a = :feed +a = :ignore o = :quiet -r = -5.0 -r_sum = -5.5 +r = 0.0 +r_sum = -0.5 Step 3 -b = sated => 1.0, hungry => 0.0 +b = sated => 0.9732977303070761, hungry => 0.026702269692923903 s = :sated a = :sing o = :quiet r = -0.5 -r_sum = -6.0 +r_sum = -1.0 Step 4 -b = sated => 0.989010989010989, hungry => 0.010989010989010992 +b = sated => 0.98603826327417, hungry => 0.013961736725829966 s = :sated a = :sing o = :quiet r = -0.5 -r_sum = -6.5

Rollout Simulations

While stepthrough is a flexible and convenient tool for many user-facing demonstrations, it is often less error-prone to use the standard simulate function with a Simulator object. The simplest Simulator is the RolloutSimulator. It simply runs a simulation and returns the discounted reward.

policy = RandomPolicy(explicit_crying_baby_pomdp)
+r_sum = -1.5

Rollout Simulations

While stepthrough is a flexible and convenient tool for many user-facing demonstrations, it is often less error-prone to use the standard simulate function with a Simulator object. The simplest Simulator is the RolloutSimulator. It simply runs a simulation and returns the discounted reward.

policy = RandomPolicy(explicit_crying_baby_pomdp)
 sim = RolloutSimulator(max_steps=10)
 r_sum = simulate(sim, explicit_crying_baby_pomdp, policy)
-println("Total discounted reward: $r_sum")
Total discounted reward: -52.32876863450001

Recording Histories

Sometimes it is important to record the entire history of a simulation for further examination. This can be accomplished with a HistoryRecorder.

policy = RandomPolicy(tabular_crying_baby_pomdp)
+println("Total discounted reward: $r_sum")
Total discounted reward: -50.384737040000005

Recording Histories

Sometimes it is important to record the entire history of a simulation for further examination. This can be accomplished with a HistoryRecorder.

policy = RandomPolicy(tabular_crying_baby_pomdp)
 hr = HistoryRecorder(max_steps=5)
-history = simulate(hr, tabular_crying_baby_pomdp, policy, DiscreteUpdater(tabular_crying_baby_pomdp), Deterministic(1))

The history object produced by a HistoryRecorder is a SimHistory, documented in the POMDPTools simulator section Histories. The information in this object can be accessed in several ways. For example, there is a function:

discounted_reward(history)
-30.4755

Accessor functions like state_hist and action_hist can also be used to access parts of the history:

state_hist(history)
6-element Vector{Int64}:
- 2
+history = simulate(hr, tabular_crying_baby_pomdp, policy, DiscreteUpdater(tabular_crying_baby_pomdp), Deterministic(1))

The history object produced by a HistoryRecorder is a SimHistory, documented in the POMDPTools simulator section Histories. The information in this object can be accessed in several ways. For example, there is a function:

discounted_reward(history)
-13.145

Accessor functions like state_hist and action_hist can also be used to access parts of the history:

state_hist(history)
6-element Vector{Int64}:
  1
  1
  1
  1
- 1
collect(action_hist(history))
5-element Vector{Int64}:
- 1
  1
+ 1
collect(action_hist(history))
5-element Vector{Int64}:
  1
  1
+ 3
+ 2
  1

Keeping track of which states, actions, and observations belong together can be tricky (for example, since there is a starting state, and ending state, but no action is taken from the ending state, the list of actions has a different length than the list of states). It is often better to think of histories in terms of steps that include both starting and ending states.

The most powerful function for accessing the information in a SimHistory is the eachstep function which returns an iterator through named tuples representing each step in the history. The eachstep function is similar to the stepthrough function above except that it iterates through the immutable steps of a previously simulated history instead of conducting the simulation as the for loop is being carried out.

r_sum = 0.0
 step = 0
 for step_i in eachstep(sim_history, "b,s,a,o,r")
@@ -75,11 +75,11 @@
 end
 end # hide
Step 1
 step_i.b = sated => 1.0, hungry => 0.0
-step_i.s = 2
+step_i.s = 1
 step_i.a = 1
 step_i.o = 2
-step_i.r = -15.0
-r_sum = -15.0
+step_i.r = -5.0
+r_sum = -5.0
 
 Step 2
 step_i.b = sated => 1.0, hungry => 0.0
@@ -87,31 +87,31 @@
 step_i.a = 1
 step_i.o = 2
 step_i.r = -5.0
-r_sum = -20.0
+r_sum = -10.0
 
 Step 3
 step_i.b = sated => 1.0, hungry => 0.0
 step_i.s = 1
-step_i.a = 1
+step_i.a = 3
 step_i.o = 2
-step_i.r = -5.0
-r_sum = -25.0
+step_i.r = 0.0
+r_sum = -10.0
 
 Step 4
-step_i.b = sated => 1.0, hungry => 0.0
+step_i.b = sated => 0.9759036144578314, hungry => 0.024096385542168676
 step_i.s = 1
-step_i.a = 1
+step_i.a = 2
 step_i.o = 2
-step_i.r = -5.0
-r_sum = -30.0
+step_i.r = -0.5
+r_sum = -10.5
 
 Step 5
-step_i.b = sated => 1.0, hungry => 0.0
+step_i.b = sated => 0.9863347314301176, hungry => 0.01366526856988229
 step_i.s = 1
 step_i.a = 1
 step_i.o = 2
 step_i.r = -5.0
-r_sum = -35.0

Parallel Simulations

It is often useful to evaluate a policy by running many simulations. The parallel simulator is the most effective tool for this. To use the parallel simulator, first create a list of Sim objects, each of which contains all of the information needed to run a simulation. Then then run the simulations using run_parallel, which will return a DataFrame with the results.

In this example, we will compare the performance of the policies we computed in the Using Different Solvers section (i.e. sarsop_policy, pomcp_planner, and heuristic_policy). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 Sim objects of each policy to the list.

using DataFrames
+r_sum = -15.5

Parallel Simulations

It is often useful to evaluate a policy by running many simulations. The parallel simulator is the most effective tool for this. To use the parallel simulator, first create a list of Sim objects, each of which contains all of the information needed to run a simulation. Then then run the simulations using run_parallel, which will return a DataFrame with the results.

In this example, we will compare the performance of the policies we computed in the Using Different Solvers section (i.e. sarsop_policy, pomcp_planner, and heuristic_policy). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 Sim objects of each policy to the list.

using DataFrames
 using StatsBase: std
 
 # Defining paramters for the simulations
@@ -178,4 +178,4 @@
 
 # Calculate the mean and confidence interval for each policy
 grouped_df = groupby(data, :policy)
-result = combine(grouped_df, :reward => mean_and_ci => AsTable)
4×3 DataFrame
Rowpolicymeanci
String?Float64Float64
1sarsop-14.62641.81814
2pomcp-18.69041.57649
3heuristic-16.46382.03513
4random-30.42012.64208

By default, the parallel simulator only returns the reward from each simulation, but more information can be gathered by specifying a function to analyze the Sim-history pair and record additional statistics. Reference the POMDPTools simulator section for more information (Specifying information to be recorded).

+result = combine(grouped_df, :reward => mean_and_ci => AsTable)
4×3 DataFrame
Rowpolicymeanci
String?Float64Float64
1sarsop-14.62641.81814
2pomcp-18.69041.57649
3heuristic-15.82342.06755
4random-30.42012.64208

By default, the parallel simulator only returns the reward from each simulation, but more information can be gathered by specifying a function to analyze the Sim-history pair and record additional statistics. Reference the POMDPTools simulator section for more information (Specifying information to be recorded).

diff --git a/dev/example_solvers/index.html b/dev/example_solvers/index.html index 06d49732..6fec65cd 100644 --- a/dev/example_solvers/index.html +++ b/dev/example_solvers/index.html @@ -25,19 +25,19 @@ For solve(::QMDP.QMDPSolver, ::POMDP): [No additional requirements] For solve(::ValueIterationSolver, ::Union{MDP,POMDP}) (in solve(::QMDP.QMDPSolver, ::POMDP)): - [✔] discount(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("871216c0-2fd4-4cb4-9ecd-379649deab37"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) - [✔] transition(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("871216c0-2fd4-4cb4-9ecd-379649deab37"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol, ::Symbol) - [✔] reward(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("871216c0-2fd4-4cb4-9ecd-379649deab37"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol, ::Symbol, ::Symbol) - [✔] stateindex(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("871216c0-2fd4-4cb4-9ecd-379649deab37"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) - [✔] actionindex(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("871216c0-2fd4-4cb4-9ecd-379649deab37"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) - [✔] actions(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("871216c0-2fd4-4cb4-9ecd-379649deab37"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) + [✔] discount(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("ecf5fe92-676e-48a1-a5ed-d76c8d6c052e"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) + [✔] transition(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("ecf5fe92-676e-48a1-a5ed-d76c8d6c052e"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol, ::Symbol) + [✔] reward(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("ecf5fe92-676e-48a1-a5ed-d76c8d6c052e"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol, ::Symbol, ::Symbol) + [✔] stateindex(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("ecf5fe92-676e-48a1-a5ed-d76c8d6c052e"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) + [✔] actionindex(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("ecf5fe92-676e-48a1-a5ed-d76c8d6c052e"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) + [✔] actions(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("ecf5fe92-676e-48a1-a5ed-d76c8d6c052e"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) [✔] length(::Array{Symbol1}) [✔] support(::Deterministic{Symbol}) [✔] pdf(::Deterministic{Symbol}, ::Symbol) For ordered_states(::Union{MDP,POMDP}) (in solve(::ValueIterationSolver, ::Union{MDP,POMDP})): - [✔] states(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("871216c0-2fd4-4cb4-9ecd-379649deab37"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) + [✔] states(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("ecf5fe92-676e-48a1-a5ed-d76c8d6c052e"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) For ordered_actions(::Union{MDP,POMDP}) (in solve(::ValueIterationSolver, ::Union{MDP,POMDP})): - [✔] actions(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("871216c0-2fd4-4cb4-9ecd-379649deab37"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) + [✔] actions(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("ecf5fe92-676e-48a1-a5ed-d76c8d6c052e"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) Explicit Crying Baby POMDP INFO: POMDPLinter requirements for solve(::QMDP.QMDPSolver, ::POMDP) and dependencies. ([✔] = implemented correctly; [X] = not implemented; [?] = could not determine) @@ -152,4 +152,4 @@ @show [a1, a2]
2-element Vector{Symbol}:
  :feed
- :ignore
+ :sing diff --git a/dev/examples/index.html b/dev/examples/index.html index 31adb14c..0fc7bea9 100644 --- a/dev/examples/index.html +++ b/dev/examples/index.html @@ -1,2 +1,2 @@ -Examples · POMDPs.jl

Examples

This section contains examples of how to use POMDPs.jl. For specific informaiton about the interface and functions used in the examples, please reference the correpsonding area in the documenation or the API Documentation.

The examples are organized by topic. The exmaples are designed to build through each step. First, we have to define a POMDP. Then we need to solve the POMDP to get a policy. Finally, we can simulate the policy to see how it performs. The examples are designed to be exeucted in order. For example, the examples in Simulations Examples assume that the POMDPs defined in the Defining a POMDP section have been defined and we have a policy we would like to simulate that we computed in the Using Different Solvers section.

The GridWorld MDP Tutorial section is a standalone example that does not require any of the other examples.

Outline

+Examples · POMDPs.jl

Examples

This section contains examples of how to use POMDPs.jl. For specific informaiton about the interface and functions used in the examples, please reference the correpsonding area in the documenation or the API Documentation.

The examples are organized by topic. The exmaples are designed to build through each step. First, we have to define a POMDP. Then we need to solve the POMDP to get a policy. Finally, we can simulate the policy to see how it performs. The examples are designed to be exeucted in order. For example, the examples in Simulations Examples assume that the POMDPs defined in the Defining a POMDP section have been defined and we have a policy we would like to simulate that we computed in the Using Different Solvers section.

The GridWorld MDP Tutorial section is a standalone example that does not require any of the other examples.

Outline

diff --git a/dev/faq/index.html b/dev/faq/index.html index 9369f030..9418cdb9 100644 --- a/dev/faq/index.html +++ b/dev/faq/index.html @@ -14,4 +14,4 @@ end end -POMDPs.reward(m, s, a) = rdict[(s, a)]

Why do I need to put type assertions pomdp::POMDP into the function signature?

Specifying the type in your function signature allows Julia to call the appropriate function when your custom type is passed into it. For example if a POMDPs.jl solver calls states on the POMDP that you passed into it, the correct states function will only get dispatched if you specified that the states function you wrote works with your POMDP type. Because Julia supports multiple-dispatch, these type assertion are a way for doing object-oriented programming in Julia.

+POMDPs.reward(m, s, a) = rdict[(s, a)]

Why do I need to put type assertions pomdp::POMDP into the function signature?

Specifying the type in your function signature allows Julia to call the appropriate function when your custom type is passed into it. For example if a POMDPs.jl solver calls states on the POMDP that you passed into it, the correct states function will only get dispatched if you specified that the states function you wrote works with your POMDP type. Because Julia supports multiple-dispatch, these type assertion are a way for doing object-oriented programming in Julia.

diff --git a/dev/gallery/index.html b/dev/gallery/index.html index 2d41d9cb..1d6a20b9 100644 --- a/dev/gallery/index.html +++ b/dev/gallery/index.html @@ -187,4 +187,4 @@ sim = GifSimulator(; filename="examples/TagPOMDP.gif", max_steps=50, rng=MersenneTwister(1), show_progress=false) saved_gif = simulate(sim, pomdp, policy) -println("gif saved to: $(saved_gif.filename)")
gif saved to: examples/TagPOMDP.gif

To add new examples, please submit a pull request to the POMDPs.jl repository with changes made to the gallery.md file in docs/src/. Please include the creation of a gif in the code snippet. The gif should be generated during the creation of the documentation using @eval and saved in the docs/src/examples/ directory. The gif should be named problem_name.gif where problem_name is the name of the problem. The gif can then be included using ![problem_name](examples/problem_name.gif).

+println("gif saved to: $(saved_gif.filename)")
gif saved to: examples/TagPOMDP.gif

To add new examples, please submit a pull request to the POMDPs.jl repository with changes made to the gallery.md file in docs/src/. Please include the creation of a gif in the code snippet. The gif should be generated during the creation of the documentation using @eval and saved in the docs/src/examples/ directory. The gif should be named problem_name.gif where problem_name is the name of the problem. The gif can then be included using ![problem_name](examples/problem_name.gif).

diff --git a/dev/get_started/index.html b/dev/get_started/index.html index c77ea3da..da453620 100644 --- a/dev/get_started/index.html +++ b/dev/get_started/index.html @@ -13,4 +13,4 @@ init_dist = initialstate(pomdp) # from POMDPModels hr = HistoryRecorder(max_steps=100) # from POMDPTools hist = simulate(hr, pomdp, policy, belief_updater, init_dist) # run 100 step simulation -println("reward: $(discounted_reward(hist))")

The first part of the code loads the desired packages and initializes the problem and the solver. Next, we compute a POMDP policy. Lastly, we evaluate the results.

There are a few things to mention here. First, the TigerPOMDP type implements all the functions required by QMDPSolver to compute a policy. Second, each policy has a default updater (essentially a filter used to update the belief of the POMDP). To learn more about Updaters check out the Concepts and Architecture section.

+println("reward: $(discounted_reward(hist))")

The first part of the code loads the desired packages and initializes the problem and the solver. Next, we compute a POMDP policy. Lastly, we evaluate the results.

There are a few things to mention here. First, the TigerPOMDP type implements all the functions required by QMDPSolver to compute a policy. Second, each policy has a default updater (essentially a filter used to update the belief of the POMDP). To learn more about Updaters check out the Concepts and Architecture section.

diff --git a/dev/index.html b/dev/index.html index db70a621..8f567d4c 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -POMDPs.jl · POMDPs.jl

POMDPs.jl

A Julia interface for defining, solving and simulating partially observable Markov decision processes and their fully observable counterparts.

Package and Ecosystem Features

  • General interface that can handle problems with discrete and continuous state/action/observation spaces
  • A number of popular state-of-the-art solvers implemented for use out-of-the-box
  • Tools that make it easy to define problems and simulate solutions
  • Simple integration of custom solvers into the existing interface

Available Packages

The POMDPs.jl package contains only the interface used for expressing and solving Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). The POMDPTools package acts as a "standard library" for the POMDPs.jl interface, providing implementations of commonly-used components such as policies, belief updaters, distributions, and simulators. The list of solver and support packages maintained by the JuliaPOMDP community is available at the POMDPs.jl Readme.

Documentation Outline

Documentation comes in three forms:

  1. An explanatory guide is available in the sections outlined below.
  2. How-to examples are available throughout this documentation with specicic examples in Examples and Gallery of POMDPs.jl Problems.
  3. Reference docstrings for the entire POMDPs.jl interface are available in the API Documentation section.
Note

When updating these documents, make sure this is synced with docs/make.jl!!

Basics

Defining POMDP Models

Writing Solvers and Updaters

Analyzing Results

POMDPTools - the standard library for POMDPs.jl

Reference

+POMDPs.jl · POMDPs.jl

POMDPs.jl

A Julia interface for defining, solving and simulating partially observable Markov decision processes and their fully observable counterparts.

Package and Ecosystem Features

  • General interface that can handle problems with discrete and continuous state/action/observation spaces
  • A number of popular state-of-the-art solvers implemented for use out-of-the-box
  • Tools that make it easy to define problems and simulate solutions
  • Simple integration of custom solvers into the existing interface

Available Packages

The POMDPs.jl package contains only the interface used for expressing and solving Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). The POMDPTools package acts as a "standard library" for the POMDPs.jl interface, providing implementations of commonly-used components such as policies, belief updaters, distributions, and simulators. The list of solver and support packages maintained by the JuliaPOMDP community is available at the POMDPs.jl Readme.

Documentation Outline

Documentation comes in three forms:

  1. An explanatory guide is available in the sections outlined below.
  2. How-to examples are available throughout this documentation with specicic examples in Examples and Gallery of POMDPs.jl Problems.
  3. Reference docstrings for the entire POMDPs.jl interface are available in the API Documentation section.
Note

When updating these documents, make sure this is synced with docs/make.jl!!

Basics

Defining POMDP Models

Writing Solvers and Updaters

Analyzing Results

POMDPTools - the standard library for POMDPs.jl

Reference

diff --git a/dev/install/index.html b/dev/install/index.html index 2602b84e..ed5a3ca7 100644 --- a/dev/install/index.html +++ b/dev/install/index.html @@ -1,3 +1,3 @@ Installation · POMDPs.jl

Installation

If you have a running Julia distribution (Julia 0.4 or greater), you have everything you need to install POMDPs.jl. To install the package, simply run the following from the Julia REPL:

import Pkg
-Pkg.add("POMDPs") # installs the POMDPs.jl package

Some auxiliary packages and older versions of solvers may be found in the JuliaPOMDP registry. To install this registry, run:

using Pkg; pkg"registry add https://github.com/JuliaPOMDP/Registry"

Note: to use this registry, JuliaPro users must also run edit(normpath(Sys.BINDIR,"..","etc","julia","startup.jl")), comment out the line ENV["DISABLE_FALLBACK"] = "true", save the file, and restart JuliaPro as described in this issue.

+Pkg.add("POMDPs") # installs the POMDPs.jl package

Some auxiliary packages and older versions of solvers may be found in the JuliaPOMDP registry. To install this registry, run:

using Pkg; pkg"registry add https://github.com/JuliaPOMDP/Registry"

Note: to use this registry, JuliaPro users must also run edit(normpath(Sys.BINDIR,"..","etc","julia","startup.jl")), comment out the line ENV["DISABLE_FALLBACK"] = "true", save the file, and restart JuliaPro as described in this issue.

diff --git a/dev/interfaces/index.html b/dev/interfaces/index.html index 4df7ffa3..d4820143 100644 --- a/dev/interfaces/index.html +++ b/dev/interfaces/index.html @@ -1,2 +1,2 @@ -Spaces and Distributions · POMDPs.jl

Spaces and Distributions

Two important components of the definitions of MDPs and POMDPs are spaces, which specify the possible states, actions, and observations in a problem and distributions, which define probability distributions. In order to provide for maximum flexibility spaces and distributions may be of any type (i.e. there are no abstract base types). Solvers and simulators will interact with space and distribution types using the functions defined below.

Spaces

A space object should contain the information needed to define the set of all possible states, actions or observations. The implementation will depend on the attributes of the elements. For example, if the space is continuous, the space object may only contain the limits of the continuous range. In the case of a discrete problem, a vector containing all states is appropriate for representing a space.

The following functions may be called on a space object (Click on a function to read its documentation):

Distributions

A distribution object represents a probability distribution.

The following functions may be called on a distribution object (Click on a function to read its documentation):

You can find some useful pre-made distribution objects in Distributions.jl or POMDPTools.

  • 1Distributions should support both rand(rng::AbstractRNG, d) and rand(d). The recommended way to do this is by implmenting Base.rand(rng::AbstractRNG, s::Random.SamplerTrivial{<:YourDistribution}) from the julia rand interface.
+Spaces and Distributions · POMDPs.jl

Spaces and Distributions

Two important components of the definitions of MDPs and POMDPs are spaces, which specify the possible states, actions, and observations in a problem and distributions, which define probability distributions. In order to provide for maximum flexibility spaces and distributions may be of any type (i.e. there are no abstract base types). Solvers and simulators will interact with space and distribution types using the functions defined below.

Spaces

A space object should contain the information needed to define the set of all possible states, actions or observations. The implementation will depend on the attributes of the elements. For example, if the space is continuous, the space object may only contain the limits of the continuous range. In the case of a discrete problem, a vector containing all states is appropriate for representing a space.

The following functions may be called on a space object (Click on a function to read its documentation):

Distributions

A distribution object represents a probability distribution.

The following functions may be called on a distribution object (Click on a function to read its documentation):

You can find some useful pre-made distribution objects in Distributions.jl or POMDPTools.

  • 1Distributions should support both rand(rng::AbstractRNG, d) and rand(d). The recommended way to do this is by implmenting Base.rand(rng::AbstractRNG, s::Random.SamplerTrivial{<:YourDistribution}) from the julia rand interface.
diff --git a/dev/offline_solver/index.html b/dev/offline_solver/index.html index 73477acc..aa457fc4 100644 --- a/dev/offline_solver/index.html +++ b/dev/offline_solver/index.html @@ -70,4 +70,4 @@ @assert action(policy, Deterministic(TIGER_LEFT)) == TIGER_OPEN_RIGHT @assert action(policy, Deterministic(TIGER_RIGHT)) == TIGER_OPEN_LEFT -@assert action(policy, Uniform(states(tiger))) == TIGER_LISTEN +@assert action(policy, Uniform(states(tiger))) == TIGER_LISTEN diff --git a/dev/online_solver/index.html b/dev/online_solver/index.html index ad518a30..f18f280a 100644 --- a/dev/online_solver/index.html +++ b/dev/online_solver/index.html @@ -56,4 +56,4 @@ @assert action(planner, Deterministic(TIGER_LEFT)) == TIGER_OPEN_RIGHT @assert action(planner, Deterministic(TIGER_RIGHT)) == TIGER_OPEN_LEFT -# note action(planner, Uniform(states(tiger))) is not very reliable with this number of samples +# note action(planner, Uniform(states(tiger))) is not very reliable with this number of samples diff --git a/dev/policy_interaction/index.html b/dev/policy_interaction/index.html index 670717e0..573ebb67 100644 --- a/dev/policy_interaction/index.html +++ b/dev/policy_interaction/index.html @@ -1,2 +1,2 @@ -Interacting with Policies · POMDPs.jl

Interacting with Policies

A solution to a POMDP is a policy that maps beliefs or action-observation histories to actions. In POMDPs.jl, these are represented by Policy objects. See Solvers and Policies for more information about what a policy can represent in general.

One common task in evaluating POMDP solutions is examining the policies themselves. Since the internal representation of a policy is an esoteric implementation detail, it is best to interact with policies through the action and value interface functions. There are three relevant methods

  • action(policy, s) returns the best action (or one of the best) for the given state or belief.
  • value(policy, s) returns the expected sum of future rewards if the policy is executed.
  • value(policy, s, a) returns the "Q-value", that is, the expected sum of rewards if action a is taken on the next step and then the policy is executed.

Note that the quantities returned by these functions are what the policy/solver expects to be the case after its (usually approximate) computations; they may be far from the true value if the solution is not exactly optimal.

+Interacting with Policies · POMDPs.jl

Interacting with Policies

A solution to a POMDP is a policy that maps beliefs or action-observation histories to actions. In POMDPs.jl, these are represented by Policy objects. See Solvers and Policies for more information about what a policy can represent in general.

One common task in evaluating POMDP solutions is examining the policies themselves. Since the internal representation of a policy is an esoteric implementation detail, it is best to interact with policies through the action and value interface functions. There are three relevant methods

  • action(policy, s) returns the best action (or one of the best) for the given state or belief.
  • value(policy, s) returns the expected sum of future rewards if the policy is executed.
  • value(policy, s, a) returns the "Q-value", that is, the expected sum of rewards if action a is taken on the next step and then the policy is executed.

Note that the quantities returned by these functions are what the policy/solver expects to be the case after its (usually approximate) computations; they may be far from the true value if the solution is not exactly optimal.

diff --git a/dev/run_simulation/index.html b/dev/run_simulation/index.html index f8092d30..000972e1 100644 --- a/dev/run_simulation/index.html +++ b/dev/run_simulation/index.html @@ -1,3 +1,3 @@ Running Simulations · POMDPs.jl

Running Simulations

Running a simulation consists of two steps, creating a simulator and calling the simulate function. For example, given a POMDP or MDP model m, and a policy p, one can use the RolloutSimulator from POMDPTools to find the accumulated discounted reward from a single simulated trajectory as follows:

sim = RolloutSimulator()
-r = simulate(sim, m, p)

More inputs, such as a belief updater, initial state, initial belief, etc. may be specified as arguments to simulate. See the docstring for simulate and the appropriate "Input" sections in the Simulation Standard page for more information.

More examples can be found in the Simulations Examples section. A variety of simulators that return more information and interact in different ways can be found in POMDPTools.

+r = simulate(sim, m, p)

More inputs, such as a belief updater, initial state, initial belief, etc. may be specified as arguments to simulate. See the docstring for simulate and the appropriate "Input" sections in the Simulation Standard page for more information.

More examples can be found in the Simulations Examples section. A variety of simulators that return more information and interact in different ways can be found in POMDPTools.

diff --git a/dev/simulation/index.html b/dev/simulation/index.html index 771e9c80..6dd4473a 100644 --- a/dev/simulation/index.html +++ b/dev/simulation/index.html @@ -21,4 +21,4 @@ d *= discount(mdp) end

In terms of the explicit interface, the @gen macro above expands to the equivalent of:

    sp = rand(transition(pomdp, s, a))
     r = reward(pomdp, s, a, sp)
-    s = sp
+ s = sp