NO FREE LUNCH IN RL
April 5 2026
RL is hard. A lot of people have expressed frustration at the lack of progress in the field over the past decade. From Andrej Karpathy on Hacker News:
There's already an impressive number of blogs detailing the RL struggle, including:
- Alex Irpan: Deep Reinforcement Learning Doesn't Work Yet
- George Hotz: RL is dumb and doesn't work
- Andy Jones: Debugging RL, Without the Agonizing Pain
- Dwarkesh Patel: RL is even more information inefficient than you thought
Why RL is hard
Hopefully you get the point. RL is hard. Very hard. These are some of the best people in the field, and even they struggle with it. Indeed, there are many problems that make it extremely hard to make progress.
Superhuman Performance
You might ask yourself, is all of this trouble worth it?
Remember, there is no such thing as a free lunch. This pain comes with a large upside, consistently bringing us step-changes:
- Training language models to follow instructions with human feedback. Also known as RLHF, which is what put the Chat in ChatGPT.
- Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. This was the big reasoning breakthrough from OpenAI, later revealed by DeepSeek R1 to be powered by Group Relative Policy Optimization (GRPO), an advanced RL policy gradient method.
- Superhuman play in Chess, Atari, and Go from a single model, with MuZero
- Cute soccer robots

This plot sums it up well:

Best Practices
If you go through Andy Jones's article you'll find lots of best practices:
- Use a really large batch size
- Mix your vectorized envs
- Work from a reference implementation
- Loss curves are a red herring
- Unit test the tricky bits (see the sketch after this list)
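To make that last point concrete, here is a minimal sketch of unit testing one of the classic tricky bits: generalized advantage estimation (GAE). The function, hyperparameters, and expected values are illustrative assumptions, not code from the original article.

```python
# Minimal sketch: unit test GAE against a small hand-computed trajectory.
import numpy as np

def gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages by iterating backwards over a trajectory."""
    advantages = np.zeros_like(rewards)
    last_adv = 0.0
    next_values = np.append(values[1:], next_value)
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        last_adv = delta + gamma * lam * not_done * last_adv
        advantages[t] = last_adv
    return advantages

def test_gae_matches_hand_computed_values():
    adv = gae(
        rewards=np.array([1.0, 1.0]),
        values=np.array([0.5, 0.5]),
        next_value=0.0,
        dones=np.array([0.0, 1.0]),
        gamma=0.9,
        lam=0.8,
    )
    # Worked by hand: A_1 = 1.0 - 0.5 = 0.5,
    # A_0 = (1.0 + 0.9 * 0.5 - 0.5) + 0.9 * 0.8 * 0.5 = 1.31
    np.testing.assert_allclose(adv, np.array([1.31, 0.5]), rtol=1e-6)
```

A two-step trajectory you can verify on paper catches sign errors, off-by-one bootstrapping bugs, and done-flag mistakes long before they show up (or silently don't) in a loss curve.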
Some things that I can recommend:
- Use JAX. vmap + grad + jit + random key splitting == awesome (see the sketch after this list)
- Use Weights & Biases sweeps
- Log video to wandb
- Tune PPO
- Minimum 3 seeds per config
- Use vectorized environments like Brax, Craftax, GCRL
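To illustrate the JAX recommendation, here is a minimal sketch of the vmap + grad + jit + key-splitting combination on a toy linear policy with a REINFORCE-style loss. The policy, loss, and shapes are illustrative assumptions, not code from the article or from Brax/Craftax.

```python
# Minimal sketch: jit the update, grad the loss, vmap over per-env keys.
import jax
import jax.numpy as jnp

def policy_logits(params, obs):
    # Toy linear policy: params is an (obs_dim, num_actions) matrix.
    return obs @ params

def loss_fn(params, obs, actions, advantages):
    # REINFORCE-style surrogate: -log pi(a|s) * advantage, averaged over the batch.
    log_probs = jax.nn.log_softmax(policy_logits(params, obs))
    chosen = jnp.take_along_axis(log_probs, actions[:, None], axis=1)[:, 0]
    return -jnp.mean(chosen * advantages)

@jax.jit
def update(params, obs, actions, advantages, lr=1e-2):
    # grad + jit: differentiate the loss and compile the whole update step.
    grads = jax.grad(loss_fn)(params, obs, actions, advantages)
    return params - lr * grads

# Random key splitting: one key per parallel environment keeps runs reproducible.
key = jax.random.PRNGKey(0)
key, subkey = jax.random.split(key)
env_keys = jax.random.split(subkey, 8)  # one key per vectorized environment

# vmap: draw one fake observation per environment in a single vectorized call.
obs = jax.vmap(lambda k: jax.random.normal(k, (4,)))(env_keys)

params = jnp.zeros((4, 2))                  # obs_dim=4, num_actions=2
actions = jnp.zeros((8,), dtype=jnp.int32)  # pretend these came from a rollout
advantages = jnp.ones((8,))
params = update(params, obs, actions, advantages)
```

The same pattern scales up: vmapping over per-environment keys is what makes vectorized environments like Brax and Craftax cheap, and jit keeps the whole rollout-and-update loop on the accelerator.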
Tune Your PPO Baselines
The point of this article was to uncover some of the reasons why it is so difficult to work on RL. I believe the pain is worth it, though: done right, the results are magical. Most of the recent progress has come from policy gradient methods such as PPO and its LLM adaptations. An actionable takeaway for researchers: switch to JAX and make sure your PPO baselines are properly tuned.
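As a concrete starting point, here is a minimal sketch of what tuning a PPO baseline can look like with a Weights & Biases sweep. The hyperparameter names, ranges, and the train_ppo entry point are illustrative assumptions, not values recommended by the article.

```python
# Minimal sketch: a W&B sweep over the usual PPO suspects.
import wandb

sweep_config = {
    "method": "random",  # or "bayes" / "grid"
    "metric": {"name": "episodic_return", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "clip_coef": {"values": [0.1, 0.2, 0.3]},
        "ent_coef": {"values": [0.0, 0.001, 0.01]},
        "gae_lambda": {"values": [0.9, 0.95, 0.99]},
        "num_envs": {"values": [64, 256, 1024]},  # large batches tend to help
        "seed": {"values": [0, 1, 2]},            # sweep over seeds as well
    },
}

def train_ppo():
    # Hypothetical training entry point: reads wandb.config, runs PPO, and logs
    # metrics (and rollout videos) back to the run.
    run = wandb.init()
    config = run.config
    # ... build vectorized envs, train, wandb.log({"episodic_return": ...}) ...
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="ppo-baselines")
wandb.agent(sweep_id, function=train_ppo, count=50)
```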