minor changes
HyperPotatoNeo committed Dec 9, 2024
1 parent 144b969 commit f3725d3
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions ft.html
@@ -258,7 +258,7 @@ <h2 class="title is-3" id="method">Relative trajectory balance</h2>
</div>
<div class="content has-text-justified">
<p>
-Here, \( Z_{\phi} \) is a learnable normalization constant. By aligning the trajectory probabilities in this manner, RTB facilitates unbiased sampling from the desired posterior distribution \( p^{\text{post}}(\mathbf{x}) \propto p_\theta(\mathbf{x}) r(\mathbf{x}) \), effectively incorporating the constraints imposed by \( r(\mathbf{x}) \) into the diffusion model's generative process.
+Here, \( Z_{\phi} \) is a learnable normalization constant. Satisfying the RTB constraint (minimizing loss to 0) for all diffusion trajectories facilitates unbiased sampling from the desired posterior distribution \( p^{\text{post}}(\mathbf{x}) \propto p_\theta(\mathbf{x}) r(\mathbf{x}) \).
</p>
</div>
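As a quick aside for readers of this diff: the constraint described above amounts to a squared log-ratio over full denoising trajectories that is driven to zero. Below is a minimal PyTorch-style sketch of a loss of that form; the argument names and the `log_Z` parameter are illustrative assumptions, not the repository's actual API.

```python
import torch

def rtb_loss(log_Z, log_p_post_traj, log_p_prior_traj, log_r_x):
    """Squared log-ratio between Z_phi * p_post(trajectory) and
    r(x) * p_theta(trajectory); it is zero exactly when the RTB
    constraint holds for the sampled trajectory."""
    delta = log_Z + log_p_post_traj - (log_r_x + log_p_prior_traj)
    return delta.pow(2).mean()

# log_Z is the (learnable) log of the normalization constant Z_phi.
log_Z = torch.nn.Parameter(torch.zeros(()))
```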

@@ -313,7 +313,7 @@ <h3 class="title is-4" id="results">Diffusion language models</h3>

<h3 class="title is-4" id="results">Offline RL</h3>
<p>
-An important problem in offline RL is KL regularized policy extraction using the behavior policy as prior, and the trained Q function obtained using an off-the-shelf Q-learning algorithm. Diffusion policies are expressive and can model highly multimodal behavior policies. Given this diffusion prior \(mu(a|s)\) and a Q function trained with IQL \(Q(s,a)\), we use RTB to obtain the KL regularized optimal policy of the form \(\pi^*(a|s) \propto \mu(a|s)e^{Q(s,a)}\). We match state of the art results in the D4RL benchmark.
+An important problem in offline RL is KL regularized policy extraction using the behavior policy as prior, and the trained Q function obtained using an off-the-shelf Q-learning algorithm. Diffusion policies are expressive and can model highly multimodal behavior policies. Given this diffusion prior \(\mu(a|s)\) and a Q function trained with IQL \(Q(s,a)\), we use RTB to obtain the KL regularized optimal policy of the form \(\pi^*(a|s) \propto \mu(a|s)e^{Q(s,a)}\). We match state of the art results in the D4RL benchmark.
</p>
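To make the composition concrete, the hypothetical sketch below plugs the IQL value into the same squared log-ratio objective with \( \log r(s,a) = Q(s,a) \), so the optimum is \( \pi^*(a|s) \propto \mu(a|s)e^{Q(s,a)} \). The `sample_trajectory` and `log_prob_trajectory` helpers are assumed interfaces used for exposition, not functions from the actual codebase.

```python
def offline_rl_rtb_step(log_Z, policy_model, behavior_prior, q_net, state):
    # Sample a denoising trajectory for an action from the policy being
    # fine-tuned, keeping the summed per-step log-probabilities
    # (assumed helper, returns (trajectory, log-probability)).
    traj, log_p_post = policy_model.sample_trajectory(state)
    # Score the same trajectory under the frozen behavior prior mu(a|s).
    log_p_prior = behavior_prior.log_prob_trajectory(state, traj)
    # KL-regularized policy extraction uses log r(s, a) = Q(s, a), so
    # minimizing the loss targets pi*(a|s) ∝ mu(a|s) * exp(Q(s, a)).
    log_r = q_net(state, traj[-1])
    return rtb_loss(log_Z, log_p_post, log_p_prior, log_r)
```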
<div class="content has-text-justified"></div>
<center>
