-
This is assuming the rewards are auto-normalized already, right? Otherwise, the user needs to be mindful of the magnitude of each reward term before assigning weights. But yeah, other than that, it makes sense to me.
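For illustration, here's a quick numeric sketch of why the magnitudes matter (all values are made up, not from an actual simulation):

```python
# Hypothetical per-episode totals for each reward term.
found_cheese_total = 1.0        # one-time bonus, at most once per episode
took_step_total = -1.0 * 500    # per-step penalty accrued over 500 steps

# With equal weights, took_step dominates found_cheese by ~500x, so
# without normalization the user has to compensate in the weights:
weights = {"found_cheese": 500.0, "took_step": 1.0}
```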
-
@maxpumperla @brettskymind - I was in the process of adding tests for reward terms and weights and realized we don't have a good way to specify chosen reward terms. And the weights are an array that requires the reward terms to always be in the same order. What do you think about switching to this for setting the weights? The keys are the same as in the get_reward function:

```python
weights = {
    "found_cheese": 10.0,
    "took_step": 1.0
}
```

If the user only wants to include one term:

```python
weights = {
    "found_cheese": 1.0,
}
```

The weights param is optional. If weights isn't included at all, all the terms are summed together with weight=1.0, as it does now.
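To make the proposed semantics concrete, here's a minimal sketch of how the optional weights dict could be applied to the terms returned by get_reward. Note that combine_reward and the example values are hypothetical, not existing Pathmind API:

```python
def combine_reward(terms, weights=None):
    # terms: dict of term name -> value, as returned by get_reward.
    # weights: optional dict of term name -> weight; terms missing
    # from weights are excluded entirely.
    if weights is None:
        # Current behavior: sum every term with weight 1.0.
        return sum(terms.values())
    return sum(weights[name] * terms[name] for name in weights)

# Example terms for one step of the cheese-finding simulation:
terms = {"found_cheese": 1.0, "took_step": -1.0}

combine_reward(terms, {"found_cheese": 10.0, "took_step": 1.0})  # 9.0
combine_reward(terms, {"found_cheese": 1.0})                     # 1.0 (took_step excluded)
combine_reward(terms)                                            # 0.0 (all terms, weight 1.0)
```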
-
Looks good to me too! I am just confused about this part:
So does this mean all the metrics have to be 'gather' based? What if a user wants to write a reward function like this:
-
@slinlee sorry for showing up late to the party. I'm a little confused by your reasoning, but hopefully we can clear that all up.
-
Let's talk about how reward terms will be used in the webapp with Pathmind Python simulations. This is a proposal that I'm looking for feedback on.
cc @fionnachan @maxpumperla @ejunprung @alexamakarov @kepricon @chrisvnicholson @brettskymind @johnnyL7 @EvgeniyEA @SaharEs
Here is an example of simple reward terms in a Pathmind model, with two elements: `found_cheese` and `took_step`.
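As a rough sketch of what such a reward-terms function could look like (assuming the simulation exposes a found_cheese flag; the real Pathmind signature and state access may differ):

```python
def get_reward(sim):
    # Each dict key is a named reward term; these names are what
    # the webapp would list for selection and weighting.
    return {
        "found_cheese": 1.0 if sim.found_cheese else 0.0,  # one-time bonus
        "took_step": -1.0,                                  # per-step penalty
    }
```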
This is what the webapp's reward terms UI looks like for AnyLogic simulations:
In AnyLogic models, users provide a list of variables they are interested in tracking. These are the simulation metrics, or reward variables, and they are often raw values from the simulation like `goodsProduced`, `cost`, and `throughput`.
Users can pick from the reward variables to minimize or maximize. Behind the scenes, this generates a code snippet for each reward term, like:

```java
reward = after.found_cheese - before.found_cheese;
```
This means we can also let users enter custom reward terms with Java code and do basic validation that they're using variables that are available.

Here is what would change for Pathmind simulations:
For Pathmind simulations, the code snippets are already written by the users in their Python simulation. The term names would be displayed in the webapp so that users can (1) choose whether to include each term, and (2) assign importance weights. This is the limit of reward shaping without going back to the Python simulation.
Again, I appreciate any feedback or proposals if you were thinking of something different. Thanks!