
NO FREE LUNCH IN RL

April 5 2026

RL is hard. A lot of people have expressed frustration at the lack of progress across the field over the past decade. From Andrej Karpathy on Hacker News:


If it makes you feel any better, I've been doing this for a while and it took me last ~6 weeks to get a from-scratch policy gradients implementation to work 50% of the time on a bunch of RL problems. And I also have a GPU cluster available to me, and a number of friends I get lunch with every day who've been in the area for the last few years.

There's already an impressive number of blogs detailing the RL struggle, including:


- Alex Irpan: Deep Reinforcement Learning Doesn't Work Yet

- George Hotz: RL is dumb and doesn't work

- Andy Jones: Debugging RL, Without the Agonizing Pain

- Dwarkesh Patel: RL is even more information inefficient than you thought



Why RL is hard



Hopefully you get the point. RL is hard. Very hard. These are some of the best people in the field having trouble with it. Indeed, there are many problems that make it extremely hard to make progress:


  1. Distribution shift (non-stationarity)
  2. The credit assignment problem, as there are no true labels
  3. Data is dependent on your model's action selections
  4. The deadly triad: off-policy + function approximation + bootstrapping
  5. Individual runs can take a long time, even with multiple GPUs. Some papers require up to 2B frames to reach maximum performance; at 60 FPS that is 385.8 days of real-time experience (see the quick check after this list). This might be fine, but then the policy doesn't transfer to any other environment
  6. Depending on the random seed, runs are known to simply die
  7. We don't have any good introspection or debugging tools
  8. RL is very reliant on hyperparameter tuning
  9. Implementations have to be incredibly precise.
  10. There is a large formalism-implementation gap
  11. Many papers have under-tuned baselines. How are there so many papers claiming to beat PPO, yet nobody uses their methods?
  12. Paper results are hard to reproduce, hence the push towards rewriting methods in JAX, as seen with PureJaxRL and Unifloral
  13. Even long rollouts may only yield 1 bit of feedback
  14. Manually defined rewards aren't scalable, and intrinsic rewards to aid exploration feel like a hack. We also see this pain point arising in RLVR
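
A quick sanity check of the sample-cost arithmetic in point 5 (plain back-of-the-envelope maths, nothing RL-specific):

```python
# How long is 2B frames of real-time experience at 60 FPS?
frames = 2_000_000_000
fps = 60
days = frames / fps / 3600 / 24
print(f"{days:.1f} days")  # -> 385.8 days, i.e. over a year of wall-clock interaction
```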


Superhuman Performance



You might ask yourself, is all of this trouble worth it?


Remember, there is no such thing as a free lunch. This pain comes with a large upside, consistently bringing us step changes:


- Training language models to follow instructions with human feedback. Also known as RLHF, which is what put the Chat in ChatGPT.

- Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. This was the big breakthrough from OpenAI, and DeepSeek R1 revealed it could be powered by Group Relative Policy Optimization (GRPO), an advanced policy-gradient method.

- Superhuman play at Chess, Atari, and Go from a single model, with MuZero

- Cute soccer robots

This plot sums it up well:




Best Practices



If you go through the Andy Jones article you'll find lots of best practices:

- Use a really large batch size

- Mix your vectorized envs

- Work from a reference implementation

- Loss curves are a red herring

- Unit test the tricky bits
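
To make "unit test the tricky bits" concrete, here is a minimal sketch (a toy example of my own, not taken from the post) that checks a reward-to-go helper against hand-computed values:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Reward-to-go: G_t = r_t + gamma * G_{t+1}."""
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def test_discounted_returns():
    rewards = np.array([1.0, 0.0, 2.0])
    # Hand-computed with gamma = 0.5: G_2 = 2.0, G_1 = 0 + 0.5*2.0 = 1.0, G_0 = 1 + 0.5*1.0 = 1.5
    expected = np.array([1.5, 1.0, 2.0])
    np.testing.assert_allclose(discounted_returns(rewards, gamma=0.5), expected)

test_discounted_returns()
```

Off-by-one errors in return and advantage computations are exactly the kind of silent bug that makes a correct-looking PPO implementation underperform.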


Some things that I can recommend:

- Use JAX. vmap + grad + jit + random key splitting == awesome (see the sketch after this list)

- Use Weights & Biases sweeps

- Log video to wandb

- Tune PPO

- Minimum 3 seeds per config

- Use vectorized environments like Brax, Craftax, GCRL
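
To make the JAX recommendation concrete, here is a minimal sketch (a toy example, not tied to any particular library above) of the jit + grad + vmap + key-splitting pattern, used to run several seeds in parallel; the quadratic loss is just a stand-in for a real policy-gradient objective:

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Stand-in objective; replace with your actual policy-gradient loss.
    return jnp.mean((batch @ params) ** 2)

@jax.jit
def train_step(params, batch, lr=1e-2):
    # One SGD step: grad differentiates the loss w.r.t. params, jit compiles the step.
    grads = jax.grad(loss_fn)(params, batch)
    return params - lr * grads

def init_params(key):
    return jax.random.normal(key, (8,))

# Split one master key into three independent seeds (minimum 3 seeds per config).
keys = jax.random.split(jax.random.PRNGKey(0), 3)
params = jax.vmap(init_params)(keys)  # shape (3, 8): one parameter vector per seed

batch = jax.random.normal(jax.random.PRNGKey(1), (32, 8))
# vmap maps the step over the seed axis while sharing the same batch across seeds.
params = jax.vmap(train_step, in_axes=(0, None))(params, batch)
```

JAX-native environments like Brax and Craftax fit the same pattern, since the environment step can live inside the jitted training loop as well.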



Tune Your PPO Baselines



The point of this article was to uncover some of the reasons why it is so difficult to work on RL. However, I believe this pain is worth it, as it brings magical results when done right. Most of the progress has been made with policy gradient methods such as PPO and their LLM adaptations. I think the actionable insight for researchers is to switch to JAX and make sure you properly tune your PPO baselines.
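
As a starting point, here is a minimal sketch of what that tuning could look like as a Weights & Biases sweep over common PPO knobs; `train_ppo` is a hypothetical entry point you would supply yourself, and the ranges below are illustrative rather than recommended values:

```python
import wandb

# Sweep over the PPO hyperparameters that typically matter most.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "episode_return", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "clip_coef": {"values": [0.1, 0.2, 0.3]},
        "gae_lambda": {"values": [0.9, 0.95, 0.99]},
        "ent_coef": {"values": [0.0, 0.01]},
        "num_envs": {"values": [64, 256, 1024]},
        "seed": {"values": [0, 1, 2]},  # at least 3 seeds per config
    },
}

def train_ppo():
    # Hypothetical training entry point: read hyperparameters from wandb.config,
    # run your PPO loop, and log the evaluation metric the sweep optimises.
    with wandb.init() as run:
        cfg = run.config
        episode_return = 0.0  # placeholder: replace with PPO training + evaluation using cfg
        # wandb.Video(frames, fps=30) is handy for logging rollout videos alongside metrics.
        wandb.log({"episode_return": episode_return, "lr": cfg.learning_rate})

sweep_id = wandb.sweep(sweep_config, project="ppo-baselines")
wandb.agent(sweep_id, function=train_ppo, count=50)
```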