
# Optimal Control vs. Reinforcement Learning (RL) 📘

## Introduction to RL 🚀

Reinforcement Learning (RL) is a type of machine learning in which an agent learns to make decisions by taking actions in an environment so as to maximize a reward (equivalently, to minimize a cost, the convention used in the formulas below). In the context of optimal control, RL offers various algorithms and methods for finding the best control policy. Here are some of the popular RL methods:

### 1. Model-based RL 🧠

In model-based RL, the agent learns a model of the environment dynamics, $f_\theta$, by minimizing the one-step prediction error:

$$ \min_\theta \|x_{n+1} - f_\theta(x_n, u_n)\|_2^2 $$
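As a minimal sketch, assuming a linear model class and synthetic transition data (the dynamics, noise level, and shapes here are all hypothetical), this objective can be minimized in closed form by least squares:

```python
import numpy as np

# Hypothetical true dynamics x_{n+1} = A x_n + B u_n + noise,
# from which we collect transitions and then fit a linear f_theta.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

X, U, Xn = [], [], []
x = np.zeros(2)
for _ in range(500):
    u = rng.normal(size=1)
    x_next = A_true @ x + B_true @ u + 0.01 * rng.normal(size=2)
    X.append(x); U.append(u); Xn.append(x_next)
    x = x_next

# f_theta(x, u) = theta^T [x; u]; least squares minimizes
# sum_n ||x_{n+1} - f_theta(x_n, u_n)||_2^2 exactly.
Z = np.hstack([np.array(X), np.array(U)])                 # features [x; u]
theta, *_ = np.linalg.lstsq(Z, np.array(Xn), rcond=None)  # (3, 2) parameters
A_hat, B_hat = theta.T[:, :2], theta.T[:, 2:]
print("A_hat ≈ A_true:", np.allclose(A_hat, A_true, atol=0.05))
```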

🔴 Cons:

- The agent might learn unnecessary details about the environment.
- Even after the model is learned, an optimal control problem must still be solved with it.

### 2. Q-learning 🎮

Q-learning is a value-based method in which the agent learns the Q-function; it is closely related to dynamic programming:

$$ V_{n-1}(x) = \min_u \underbrace{\left[\, l(x,u) + V_n(f(x,u)) \,\right]}_{Q(x,u)} $$
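A minimal sketch of this backward recursion on a hypothetical discrete chain (states, actions, and costs are all made up for illustration):

```python
import numpy as np

# Hypothetical chain: states 0..S-1, actions move left/right,
# stage and terminal costs penalize distance from a goal state.
S, N, goal = 10, 15, 7
actions = [-1, +1]

def f(x, u):                     # deterministic dynamics, clipped to the chain
    return min(max(x + u, 0), S - 1)

def l(x, u):                     # stage cost
    return abs(x - goal)

V = np.array([abs(x - goal) for x in range(S)], dtype=float)  # terminal V_N

# Backward recursion: Q(x,u) = l(x,u) + V_n(f(x,u)), V_{n-1}(x) = min_u Q(x,u)
for n in reversed(range(N)):
    Q = np.array([[l(x, u) + V[f(x, u)] for u in actions] for x in range(S)])
    V = Q.min(axis=1)

print("V_0:", V)                 # optimal cost-to-go from each start state
```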

🔴 Cons:

- Does not generalize to new tasks.
- Suffers from high bias.
- Tends to overestimate Q-values; common remedies include maintaining multiple Q-estimates or adjusting the update rate (see the sketch after this list).
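A sketch of the multiple-Q-estimate remedy, stated in the usual reward-maximization convention with hypothetical tables `q1` and `q2`: the bootstrap target takes the minimum of two independent estimates, which counteracts the optimism introduced by the `argmax`.

```python
import numpy as np

# Hypothetical tables illustrating a clipped double-Q target
# (reward-maximization convention): two independent estimates
# counteract the overestimation introduced by the argmax.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.99
q1 = rng.normal(size=(n_states, n_actions))
q2 = rng.normal(size=(n_states, n_actions))

s_next, r = 2, 1.0
a_next = q1[s_next].argmax()                  # action picked by one estimate
target = r + gamma * min(q1[s_next, a_next],  # value taken as the minimum
                         q2[s_next, a_next])  # of the two estimates
```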

### 3. Policy Gradient 📈

In Policy Gradient methods, the agent directly optimizes the policy, parameterized by $\theta$.

$$ \min_\theta J = \sum_{n=1}^{N-1} l(x_n, u_\theta(x_n)) + l_N(x_N) \quad \text{s.t.} \quad x_{n+1} = f(x_n, u_\theta(x_n)) $$

The objective is rewritten as an expectation over trajectories $\tau$:

$$ \min_\theta E_{p(\tau;\theta)}[J(\tau)] = \min_\theta \int_{\text{all trajectories}} J(\tau)\, p(\tau;\theta)\, d\tau $$

And its gradient:

$$ \nabla_\theta E_{p(\tau;\theta)}[J(\tau)] = E_{p(\tau;\theta)}[J(\tau)\nabla_\theta \log(p(\tau;\theta))] $$

This gradient can be estimated by Monte Carlo sampling of trajectories (the REINFORCE estimator).
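As a minimal sketch of this estimator, consider a hypothetical one-step problem with a 1-D Gaussian "policy" $u \sim \mathcal{N}(\theta, \sigma^2)$ and cost $J(u) = (u - 3)^2$ (the problem, step size, and sample count are all made up for illustration):

```python
import numpy as np

# Score-function (REINFORCE-style) estimate of grad_theta E[J(u)]
# for u ~ N(theta, sigma^2) and cost J(u) = (u - 3)^2.
rng = np.random.default_rng(0)
theta, sigma, lr = 0.0, 1.0, 0.05

for step in range(2000):
    u = rng.normal(theta, sigma, size=64)     # sample one-step "trajectories"
    J = (u - 3.0) ** 2                        # cost of each sample
    score = (u - theta) / sigma**2            # grad_theta log p(u; theta)
    grad = np.mean((J - J.mean()) * score)    # batch-mean baseline cuts variance
    theta -= lr * grad                        # gradient *descent*, since J is a cost

print("theta ≈", theta)                       # should approach 3.0
```

Subtracting the batch mean as a baseline leaves the gradient estimate essentially unbiased while substantially reducing its variance.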

🔴 Cons:

- Low sample efficiency.
- Instability during learning.
- High variance. Solutions include trust-region methods, scheduling the covariance, gradient clipping, and using an advantage function.

### 4. Actor-Critic 🎭

In Actor-Critic methods, the Monte Carlo return estimate in the policy gradient is replaced with a learned value function (the critic), typically a neural network. Alternatively, in the continuous-action case, an actor network can be trained to optimize a learned Q-network directly.
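A minimal sketch of the idea on the same hypothetical one-step problem as above: a scalar critic `v` estimates the expected cost and replaces the raw Monte Carlo return as a baseline (both the problem and the learning rates are made up for illustration).

```python
import numpy as np

# Minimal actor-critic sketch on the hypothetical one-step problem:
# the critic v estimates the expected cost E[J] and serves as the
# baseline in the actor's score-function gradient.
rng = np.random.default_rng(0)
theta, v = 0.0, 0.0                  # actor mean, critic estimate of E[J]
sigma, lr_actor, lr_critic = 1.0, 0.05, 0.1

for step in range(2000):
    u = rng.normal(theta, sigma, size=64)            # sample actions
    J = (u - 3.0) ** 2                               # observed costs
    advantage = J - v                                # critic-based advantage
    score = (u - theta) / sigma**2                   # grad_theta log p(u; theta)
    theta -= lr_actor * np.mean(advantage * score)   # actor: descend on cost
    v += lr_critic * np.mean(J - v)                  # critic: regress toward E[J]

print("theta ≈", theta, " v ≈", v)   # theta → 3.0, v → residual variance
```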

🟢 Pros:

- Combines the benefits of value-based and policy-based methods.
- Can handle continuous action spaces.

## Conclusion 🌟

With a good simulator and an accurate cost function, many RL problems can be largely solved. The choice of method depends on the specific problem, available data, and computational resources.