
Incorrect implementation of REINFORCE algorithm #3

Open
ehtsham opened this issue Mar 28, 2022 · 5 comments

ehtsham commented Mar 28, 2022

last_state = tf_runtime.execute(num_steps=horizon - 1)
last_metric_value = last_state['metrics state'].get(metric_to_optimize)
log_prob = last_state['slate docs_log_prob_accum'].get('doc_ranks')
objective = -tf.tensordot(tf.stop_gradient(last_metric_value), log_prob, 1)

In the above code snippet (interest_evolution_simulation.py, lines 45-48),
the cumulative log_prob (accumulated over the entire episode) is weighted by the total cumulative reward from the start to the end of the episode.
This seems incorrect: the log_prob at step 't' should be weighted by the cumulative reward from step 't' onwards, not from the start.

@mmladenov-google

Hi Ehtsham,

I believe you might be misreading this, please correct me if I misunderstood you:

Here we implement the most naive form of REINFORCE, which weights the total log-prob of the trajectory by the total reward, i.e. $\left(\sum_t R(a_t)\right)\cdot\left(\sum_t \log p_\theta(a_t)\right)$. log_prob is thus a tensor of shape batch_size x 1 containing the sum of all action log-probs in an individual trajectory, and last_metric_value accordingly contains the sum of all rewards (again batch_size x 1). I believe you're interpreting log_prob as a tensor of shape batch_size x T containing cumulative sums of log-probs, which would have been the case had we called tf_runtime.trajectory instead of tf_runtime.execute, in which case the reward tensor would have been adjusted as you suggest. Is my reading correct?

Best,
Martin
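
For concreteness, here is a minimal standalone sketch of the trajectory-level estimator described in the comment above. It uses plain TensorFlow with invented per-step tensors (per_step_rewards, per_step_log_probs) rather than the library's actual runtime outputs, so treat it as an illustration of the math, not of the codebase's API.

import tensorflow as tf

# Illustrative inputs (not the library's actual tensors):
#   per_step_rewards:   [batch_size, T] reward received at each step
#   per_step_log_probs: [batch_size, T] log-prob of the action taken at each step
def naive_reinforce_objective(per_step_rewards, per_step_log_probs):
  # Trajectory-level REINFORCE: (sum_t R(a_t)) * (sum_t log p_theta(a_t)).
  total_reward = tf.reduce_sum(per_step_rewards, axis=1)      # [batch_size]
  total_log_prob = tf.reduce_sum(per_step_log_probs, axis=1)  # [batch_size]
  # As in the snippet from the issue, the reward factor is treated as a constant
  # weight (stop_gradient) and the weighted log-probs are reduced over the batch.
  return -tf.reduce_sum(tf.stop_gradient(total_reward) * total_log_prob)

Note the single per-trajectory reward weight: in this form no per-step alignment between rewards and log-probs is needed.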

ehtsham commented Mar 29, 2022

Hi Martin, you've understood my point correctly, and thanks for elaborating on the implementation; it is exactly how I interpreted it as well. I understood that the log_prob tensor is only batch_size x 1 and gets weighted by the total reward of the trajectory. This naive implementation implies that the log-probability term of an action taken at time 't' carries a weight that includes rewards generated prior to time 't', whereas the returns are supposed to be computed from time 't' onwards.
Yes, the code should use the tf_runtime.trajectory function instead, although more changes might be needed so that the cumulative reward for each action at time 't' is computed from that time onwards (currently that is not happening).
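
A sketch of the reward-to-go weighting described above, again with invented [batch_size, T] per-step tensors; whether tf_runtime.trajectory returns exactly this format is an assumption here, not a claim about the library.

import tensorflow as tf

def reward_to_go_reinforce_objective(per_step_rewards, per_step_log_probs):
  # Causal REINFORCE: weight log p_theta(a_t) by the return from step t onwards.
  # returns_to_go[:, t] = sum_{t' >= t} r_{t'}, via a reversed cumulative sum.
  returns_to_go = tf.cumsum(per_step_rewards, axis=1, reverse=True)
  weighted = tf.stop_gradient(returns_to_go) * per_step_log_probs
  return -tf.reduce_sum(weighted)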

@mmladenov-google

I think that for the case of finite horizons, both approaches end up estimating the same quantity. You're describing the nested version, which is derived by differentiating the Bellman equation, whereas we use the non-recursive version derived directly from the flat form of the reward expectation (see e.g. the last equation of slide 8 here). We chose the latter for simplicity.
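
For reference, the standard argument for why the two estimators agree in expectation over a finite horizon (sketched here from the usual likelihood-ratio derivation, assuming the environment dynamics do not depend on $\theta$; not quoted from the slides):

$$\nabla_\theta\,\mathbb{E}\Big[\sum_t R(a_t)\Big] = \mathbb{E}\Big[\Big(\sum_t R(a_t)\Big)\Big(\sum_t \nabla_\theta \log p_\theta(a_t)\Big)\Big] = \mathbb{E}\Big[\sum_t \nabla_\theta \log p_\theta(a_t)\sum_{t' \ge t} R(a_{t'})\Big],$$

since for $t' < t$ the cross term vanishes: $\mathbb{E}\big[\nabla_\theta \log p_\theta(a_t)\,R(a_{t'})\big] = \mathbb{E}\big[R(a_{t'})\,\mathbb{E}[\nabla_\theta \log p_\theta(a_t) \mid \text{history up to } t]\big] = 0$, because the expected score is zero given the past.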

ehtsham commented Mar 29, 2022

Thanks Martin, I'm not sure the two would be equivalent. For example, it's not clear how one would study the impact of discounting in this implementation. I'll try to contribute the other implementation to the project :)
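
To make the discounting point concrete, one common way a discount factor enters is through the return-to-go weights (illustrative notation only; $\gamma$ is not a parameter of this codebase):

$$\sum_t \nabla_\theta \log p_\theta(a_t)\,\Big(\sum_{t' \ge t} \gamma^{\,t'-t} r_{t'}\Big),$$

which reduces to the undiscounted reward-to-go estimator at $\gamma = 1$ but has no direct counterpart in the flat trajectory-level objective.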

@mmladenov-google

Oh, one more thing I realized when thinking this through. I suspect that the recursive gradient estimator might not be correct in the case where the agent suffers from an aliased state representation (e.g. in a POMDP with an incomplete belief-state representation, as is the case in this example), since the Bellman equation holds only in a local sense; see e.g. Theorem 1 in this paper. In this regime, the state transition probabilities become policy-dependent and will introduce additional terms in the derivative. This could lead to interesting bias-variance trade-offs. I'd be curious to see what happens if you have the time to experiment with it.
