I've found something odd and I'd like to know whether I'm missing something or whether this is expected behavior.
In the buffers, you define the action_log_probs to have "act_shape" as their last dimension (https://github.com/marlbenchmark/on-policy/blob/d53c4902cf2c291c93ced2c42c621371982ca2eb/onpolicy/utils/shared_buffer.py#L79C9-L80C100).
With continuous actions, this means the last dimension of action_log_probs equals the dimension of the action. But the actual log probability of an action is a single value: when actions are evaluated, the model outputs one value per action, and we then store it in an array of shape (ep_len, n_rollouts, n_agents, act_dim), which broadcasts that single value across act_dim.
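To make the broadcasting concrete, here is a minimal sketch (my own illustration, not code from the repo) of a diagonal Gaussian policy: the joint log probability of a continuous action is a single scalar, and assigning it into a buffer slot whose last dimension is act_dim silently copies the same value act_dim times.

```python
import numpy as np

# Hypothetical sketch (names are mine): log probability of a continuous
# action under a standard diagonal Gaussian policy, N(0, I).
rng = np.random.default_rng(0)
act_dim = 3
action = rng.standard_normal(act_dim)

# Joint log prob of a diagonal Gaussian = sum of per-dimension log probs,
# i.e. one scalar regardless of act_dim.
log_prob = np.sum(-0.5 * action**2 - 0.5 * np.log(2 * np.pi))

# A buffer slot declared with act_shape as its last dimension broadcasts
# that scalar: every one of the act_dim entries ends up holding it.
buffer_slot = np.empty(act_dim)
buffer_slot[:] = log_prob
assert np.allclose(buffer_slot, log_prob)
```

So no information is lost, the value is just stored redundantly.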
Now, this doesn't actually cause any problem during training, so I guess the shape was chosen to fit the needs of other action spaces (multi-discrete, perhaps?).
And I suppose that, for continuous actions only, I could replace "act_shape" with "1" in the dimensions of action_log_probs in the buffer.
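A quick sketch of what I mean (assumed shapes, not the repo's actual code): with a trailing dimension of 1, each (step, rollout, agent) slot holds exactly one scalar log prob, and downstream operations like the PPO importance ratio still broadcast fine.

```python
import numpy as np

# Hypothetical buffer with the trailing dimension set to 1 instead of act_dim.
ep_len, n_rollouts, n_agents = 5, 2, 3
old_log_probs = np.zeros((ep_len, n_rollouts, n_agents, 1))

# One scalar log prob per (step, rollout, agent) fits exactly, no broadcast:
old_log_probs[0, 0, 0, 0] = -1.23

# The PPO-style importance ratio exp(new - old) works elementwise as before:
new_log_probs = np.full_like(old_log_probs, -1.0)
ratio = np.exp(new_log_probs - old_log_probs)
assert ratio.shape == old_log_probs.shape
```

The stored information is identical either way; the act_dim version just replicates each scalar.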
Have I understood this correctly? Or is there something I'm missing?
Thank you!