Incorrect implementation of REINFORCE algorithm #3
Comments
Hi Ehtsham, I believe you might be misreading this; please correct me if I misunderstood you. Here we implement the most naive form of REINFORCE, which weights the total log-prob of the trajectory by the total reward, i.e. the gradient estimate is R(tau) * sum_t grad log pi(a_t | s_t). Best, Martin
Hi Martin, you correctly understand my point, and thanks for elaborating on the implementation; it is exactly what I also interpreted. I understood that the log_prob tensor is only batch_size x 1 and gets weighted by the total reward of the trajectory. This naive implementation implies that the log-probability term of an action taken at time 't' has a weight that includes rewards generated prior to time 't', whereas the returns are supposed to be computed from time 't' onwards.
I think that for finite horizons, both approaches end up estimating the same quantity. You're describing the nested version, which is derived by differentiating the Bellman equation, whereas we use the non-recursive version derived directly from the flat form of the reward expectation; see e.g. the last equation of slide 8 here. We chose the latter for simplicity.
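To make the comparison concrete, here is a small NumPy sketch (names are illustrative, not from the repository) of the two estimators under discussion, given per-step rewards and per-step gradients of the log-probs for a single trajectory:

```python
import numpy as np

def flat_estimator(log_prob_grads, rewards):
  # "Flat" REINFORCE: weight the summed log-prob gradients by the
  # total episode reward, i.e. (sum_t r_t) * sum_t grad log pi(a_t|s_t).
  # log_prob_grads: [T, num_params], rewards: [T].
  total_return = np.sum(rewards)
  return total_return * np.sum(log_prob_grads, axis=0)

def reward_to_go_estimator(log_prob_grads, rewards):
  # Reward-to-go REINFORCE: weight each step's log-prob gradient by the
  # return from that step onwards, i.e. sum_t (sum_{t'>=t} r_{t'}) * grad log pi(a_t|s_t).
  returns_to_go = np.cumsum(rewards[::-1])[::-1]
  return np.sum(returns_to_go[:, None] * log_prob_grads, axis=0)
```

Both have the same expectation for finite horizons (the cross terms in the flat version that pair the step-t log-prob gradient with rewards earned before step t vanish in expectation), but the reward-to-go version typically has lower variance.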
Thanks Martin, I'm not sure the two would be equivalent. For example, it's not clear how one would study the impact of discounting in this implementation. I'll try to contribute the other implementation to the project :)
Oh, one more thing I realized when thinking this through. I suspect that the recursive gradient estimator might not be correct when the agent suffers from an aliased state representation (e.g. in a POMDP with an incomplete belief-state representation, as is the case in this example), since the Bellman equation holds only in a local sense; see e.g. Theorem 1 in this paper. In this regime, the state transition probabilities become policy-dependent and will introduce additional terms into the derivative. This could lead to interesting bias-variance trade-offs. I'd be curious to see what happens if you have the time to experiment with it.
last_state = tf_runtime.execute(num_steps=horizon - 1)
last_metric_value = last_state['metrics state'].get(metric_to_optimize)
log_prob = last_state['slate docs_log_prob_accum'].get('doc_ranks')
objective = -tf.tensordot(tf.stop_gradient(last_metric_value), log_prob, 1)
In the code snippet above (interest_evolution_simulation.py, lines 45-48),
the cumulative log_prob accumulated across the entire episode is weighted by the total cumulative reward from the start to the end of the episode.
This seems incorrect, as the log_prob at step 't' should be weighted by the cumulative reward from step 't' onwards, not from the start.
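For what it's worth, here is a rough sketch of what a reward-to-go weighted objective could look like. The tensor names are hypothetical: the current runtime only exposes the episode-accumulated log-prob and the final metric value, so this would require access to per-step log-probs and per-step rewards. It also shows where a discount factor would enter.

```python
import tensorflow as tf

def reward_to_go_objective(step_log_probs, step_rewards, gamma=1.0):
  """Sketch of a reward-to-go REINFORCE objective (hypothetical inputs).

  Args:
    step_log_probs: [horizon, batch_size] log-probs of the chosen slates.
    step_rewards: [horizon, batch_size] per-step rewards.
    gamma: discount factor (1.0 recovers the undiscounted case).

  Returns:
    A scalar objective to minimize.
  """
  horizon = tf.shape(step_rewards)[0]
  # Discount each reward from the start of the episode: r_t * gamma^t.
  discounts = tf.pow(gamma, tf.cast(tf.range(horizon), step_rewards.dtype))
  discounted = step_rewards * discounts[:, None]
  # Reverse cumulative sum: element t holds sum_{t' >= t} gamma^{t'} r_{t'}.
  returns_to_go = tf.cumsum(discounted, axis=0, reverse=True)
  # Weight each step's log-prob by its (stop-gradient) return-to-go.
  weighted = tf.stop_gradient(returns_to_go) * step_log_probs
  return -tf.reduce_mean(tf.reduce_sum(weighted, axis=0))
```

With gamma = 1.0 this has the same expected gradient as the current flat objective for finite horizons, but the per-step weighting should reduce variance and gives a natural place to study discounting.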