I try to explain modern Reinforcement Learning (RL) concepts in the era of Large Language Models (LLMs) through clean code and simple examples.
Note:
- The code is not optimized for performance, but for clarity.
- The examples are not meant to be practical but rather to illustrate the concepts.
For the algorithms, I use a simple decoder-only Transformer as the policy model and a simple environment (FrozenLake) as the task. In principle, the files could be adapted fairly easily to LLM reasoning tasks, i.e., to training a model to acquire reasoning capabilities.
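To make the task concrete, here is a minimal sketch of a deterministic FrozenLake in plain Python. It is not the environment code used by the files here: it assumes the standard 4x4 map and the Gymnasium action numbering, leaves out slipping, and the names `MAP`, `MOVES`, `step`, and `rollout` are hypothetical. A fixed action sequence stands in for the policy model, which would normally pick each action.

```python
# Minimal deterministic FrozenLake sketch (assumption: no slipping).
# S = start, F = frozen (safe), H = hole (episode ends), G = goal (reward 1).
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
# Action ids follow the Gymnasium convention: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Return (next_state, reward, done) for one move on the grid."""
    n = len(MAP)
    row, col = divmod(state, n)
    d_row, d_col = MOVES[action]
    # Moves that would leave the grid keep the agent on the border tile.
    row = min(max(row + d_row, 0), n - 1)
    col = min(max(col + d_col, 0), n - 1)
    tile = MAP[row][col]
    reward = 1.0 if tile == "G" else 0.0
    done = tile in "HG"
    return row * n + col, reward, done


def rollout(actions, start=0):
    """Play a fixed action sequence; return (state, action, reward) triples and the final state."""
    state, trajectory = start, []
    for action in actions:
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory, state


if __name__ == "__main__":
    # down, down, right, right, down, right reaches the goal tile.
    trajectory, final_state = rollout([1, 1, 2, 2, 1, 2])
    print(final_state, trajectory[-1])
```

The reward is sparse: only the final step into the goal pays out, which is what makes even this small grid a useful test for policy-gradient methods.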
The PPO algorithm is implemented in the ppo.py file.
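As a rough illustration of the idea at the heart of PPO, the clipped surrogate objective can be sketched in a few lines of plain Python. This is not taken from ppo.py: the function name `ppo_clip_loss`, its arguments, and the default `clip_eps=0.2` are assumptions for illustration, and a real implementation would operate on tensors and usually add value and entropy terms.

```python
import math


def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples.

    new_logps / old_logps: log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing the loss maximizes the objective.
    return -total / len(advantages)


if __name__ == "__main__":
    # The ratio is 2 here, so a positive advantage is capped at 1 + clip_eps.
    print(ppo_clip_loss([math.log(0.5)], [math.log(0.25)], [1.0]))
```

The clipping removes the incentive to move the policy far from the one that generated the data when doing so would increase the objective, while still penalizing changes that make things worse.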
To be implemented.