Hey @jmuchovej, this is an intriguing idea. Creating a softmax policy should be fairly straightforward, and it would not require any changes to `stepthrough` to simulate. But it sounds like your goal is to output something other than the history. Can you describe what the inputs and outputs of the function you are proposing would be? If you want the likelihood of each action under the policy, you could return it from the policy. I am generally happy to add optional outputs.
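For example, a rough sketch of what I have in mind (the names `SoftmaxQPolicy` and `action_distribution` are made up for illustration, not existing API, and this assumes the wrapped policy implements `value(p, s, a)`, i.e., an offline planner with Q-estimates):

```julia
using POMDPs
using POMDPTools: SparseCat
using Random

# Hypothetical softmax (Boltzmann) wrapper around a policy with Q-estimates.
struct SoftmaxQPolicy{M, P<:Policy} <: Policy
    m::M                   # the (PO)MDP, needed for actions(m, s)
    base::P                # underlying policy implementing value(base, s, a)
    temperature::Float64   # τ: higher = more random, lower = closer to argmax
end

# Likelihood of each action in state (or belief) s, returned as a categorical distribution.
function action_distribution(p::SoftmaxQPolicy, s)
    as = collect(actions(p.m, s))
    qs = [value(p.base, s, a) for a in as]
    ws = exp.((qs .- maximum(qs)) ./ p.temperature)   # subtract max for numerical stability
    return SparseCat(as, ws ./ sum(ws))
end

# Sampling an action just draws from that distribution.
POMDPs.action(p::SoftmaxQPolicy, s) = rand(Random.default_rng(), action_distribution(p, s))
```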
---
In computational social cognition (an area of cognitive psychology that builds computational models of "theory of mind", i.e., inferring others' beliefs and rewards from action sequences), we use [PO]MDPs quite frequently. A common task is to compute the likelihood of an action sequence under a given reward function (and/or beliefs, if using POMDPs).
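Concretely, the quantity I want to compute is something like (my notation):

$$
P(a_{1:T} \mid R) = \prod_{t=1}^{T} \pi_\tau(a_t \mid s_t; R),
\qquad
\pi_\tau(a \mid s; R) = \frac{\exp\big(Q_R(s, a) / \tau\big)}{\sum_{a'} \exp\big(Q_R(s, a') / \tau\big)}
$$

where $Q_R$ is the optimal Q-function under the candidate reward function $R$, $\tau$ is the temperature, and beliefs $b_t$ take the place of states $s_t$ for POMDPs.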
To do this, the general process is to do everything in `stepthrough`, except act according to a probabilistic variant of the optimal policy: we convert it into a stochastic policy via a softmax over Q-values with some temperature (τ or β) reflecting the "rationality" of the agent. I believe there is a pretty strong assumption that offline planners are used (so all state/belief-action pairs have Q-estimates), but it could also be that I've just never seen/used online planners for this.

I've implemented this pretty often, but it's a fair bit of boilerplate code that I have to repeat (a rough sketch of it is at the end of this comment), so I think it would be cool to have in `POMDPs.jl`.

What do you think? (I'm happy to implement this and submit a PR, but figured it would be best to discuss before opening an issue/PR.)
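For reference, the boilerplate I keep rewriting looks roughly like this (just a sketch with made-up names; it assumes the solved policy implements `value(p, s, a)`, i.e., exposes Q-estimates):

```julia
using POMDPs

# Boilerplate: log-likelihood of an observed (state, action) sequence under a
# softmax policy built from the Q-estimates of an already-solved policy.
function trajectory_loglikelihood(m, policy::Policy, trajectory; temperature=1.0)
    ll = 0.0
    for (s, a) in trajectory
        as = collect(actions(m, s))
        qs = [value(policy, s, ai) for ai in as] ./ temperature
        qmax = maximum(qs)
        logZ = qmax + log(sum(exp.(qs .- qmax)))    # stable log-sum-exp over actions
        ll += qs[findfirst(isequal(a), as)] - logZ  # log softmax probability of the observed action
    end
    return ll
end
```

In practice I call something like this once per candidate reward function (with `policy = solve(solver, m)` from an offline solver whose policy supports `value(p, s, a)`) and compare the resulting log-likelihoods.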