-
This is assuming the rewards are auto-normalized already, right? Otherwise, the user needs to be mindful of the magnitude of each reward term before assigning weights. But yeah, other than that, it makes sense to me.
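For illustration, here's a quick numeric sketch of why the magnitudes matter (all values are made up, not from an actual simulation):

```python
# Hypothetical per-episode totals for each reward term.
found_cheese_total = 1.0        # one-time bonus, at most once per episode
took_step_total = -1.0 * 500    # per-step penalty accrued over 500 steps

# With equal weights, took_step dominates found_cheese by ~500x, so
# without normalization the user has to compensate in the weights:
weights = {"found_cheese": 500.0, "took_step": 1.0}
```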
-
@maxpumperla @brettskymind - I was in the process of adding tests for reward terms and weights and realized we don't have a good way to specify chosen reward terms. And the weights are an array that requires the reward terms to always be in the same order. What do you think about switching to this for setting the weights? The keys are the same as in the get_reward function:

```python
weights = {
    "found_cheese": 10.0,
    "took_step": 1.0
}
```

If the user only wants to include one term:

```python
weights = {
    "found_cheese": 1.0,
}
```

The weights param is optional. If weights isn't included at all, all the terms are summed together with weight=1.0, as it does now.
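To make the proposed semantics concrete, here's a minimal sketch of how the optional weights dict could be applied to the terms returned by get_reward. Note that combine_reward and the example values are hypothetical, not existing Pathmind API:

```python
def combine_reward(terms, weights=None):
    # terms: dict of term name -> value, as returned by get_reward.
    # weights: optional dict of term name -> weight; terms missing
    # from weights are excluded entirely.
    if weights is None:
        # Current behavior: sum every term with weight 1.0.
        return sum(terms.values())
    return sum(weights[name] * terms[name] for name in weights)

# Example terms for one step of the cheese-finding simulation:
terms = {"found_cheese": 1.0, "took_step": -1.0}

combine_reward(terms, {"found_cheese": 10.0, "took_step": 1.0})  # 9.0
combine_reward(terms, {"found_cheese": 1.0})                     # 1.0 (took_step excluded)
combine_reward(terms)                                            # 0.0 (all terms, weight 1.0)
```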
-
Looks good to me too! I am just confused about this part:
So does this mean all the metrics have to be 'gather' based? What if a user wants to write a reward function like this:
-
@slinlee sorry for showing up late to the party. I'm a little confused by your reasoning, but hopefully we can clear that all up.
-
Let's talk about how reward terms will be used in the webapp with Pathmind Python simulations. This is a proposal that I'm looking for feedback on.
cc @fionnachan @maxpumperla @ejunprung @alexamakarov @kepricon @chrisvnicholson @brettskymind @johnnyL7 @EvgeniyEA @SaharEs
Here is an example of simple reward terms in a Pathmind model, with two elements: `found_cheese` and `took_step`.
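As a rough sketch of what such a reward-terms function could look like (assuming the simulation exposes a found_cheese flag; the real Pathmind signature and state access may differ):

```python
def get_reward(sim):
    # Each dict key is a named reward term; these names are what
    # the webapp would list for selection and weighting.
    return {
        "found_cheese": 1.0 if sim.found_cheese else 0.0,  # one-time bonus
        "took_step": -1.0,                                  # per-step penalty
    }
```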
This is what the webapp's reward terms UI looks like for AnyLogic simulations:
In AnyLogic models, users provide a list of variables they are interested in tracking. These are the simulation metrics, or reward variables, and they are often raw values from the simulation like `goodsProduced`, `cost`, and `throughput`.
Users can pick from the reward variables to minimize or maximize. Behind the scenes, this generates a code snippet for each reward term, like:

```java
reward = after.found_cheese - before.found_cheese;
```
This means we can also let users enter custom reward terms with Java code and do basic validation that they're using variables that are available.

Here is what would change for Pathmind simulations:
For Pathmind simulations, the code snippets are already written by the users in their Python simulation. The term names would be displayed in the webapp so that users can (1) choose whether to include each term, and (2) assign importance weights. This is the limit of reward shaping without going back to the Python simulation.
Again, I appreciate any feedback or proposals if you were thinking of something different. Thanks!