NOTES ON RL AND PILCO
nov 28 2025
MSRE
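An observable error at each timestep:
$$G_t - \hat{v}(S_t, \mathbf{w})$$
Mean squared return error in the on-policy case:
$$\overline{RE}(\mathbf{w}) = \mathbb{E}\big[(G_t - \hat{v}(S_t, \mathbf{w}))^2\big] = \overline{VE}(\mathbf{w}) + \mathbb{E}\big[(G_t - v_\pi(S_t))^2\big]$$
So we have a learnable value error,
$$\overline{VE}(\mathbf{w}) = \mathbb{E}\big[(v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}))^2\big]$$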
It measures how far our learned value function is from the true value function. It depends on w and can, in principle, be reduced by more data, a better architecture, better optimization, or a richer function class. What architecture do we know of that loves data? Transformers. A lot of people think that transformers are incompatible with RL, but I think they can serve as the backbone for powerful value functions.
We also have a second term, a variance that does not depend on the parameter vector. This can be viewed as the "return variance": the unavoidable randomness in the return, which comes from the environment (and any stochasticity in the policy), not from your value function. This is why minimizing RE is equivalent to minimizing VE: the extra noise term doesn't change with w, so gradient descent ignores it. Can we take advantage of knowing the return variance?
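A quick numerical check of this decomposition, as a minimal sketch. The three-state problem, the per-state noise levels, and the names `v_pi`, `v_hat`, and `noise_std` are all made up for illustration; the only point is that the return error splits into a value error plus a term that never touches the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 3 states, each with a known true value and a
# known amount of return noise. Returns are G = v_pi(s) + noise.
v_pi = np.array([1.0, -0.5, 2.0])       # true values (normally unknown)
noise_std = np.array([0.5, 1.0, 2.0])   # per-state std of the return
v_hat = np.array([0.8, 0.0, 1.0])       # some learned estimate

n = 200_000
states = rng.integers(0, 3, size=n)     # uniform state distribution
returns = v_pi[states] + noise_std[states] * rng.standard_normal(n)

re = np.mean((returns - v_hat[states]) ** 2)        # mean squared return error
ve = np.mean((v_pi[states] - v_hat[states]) ** 2)   # value error, depends on v_hat
ret_var = np.mean((returns - v_pi[states]) ** 2)    # return variance, no v_hat in it

print(f"RE = {re:.3f}, VE + return variance = {ve + ret_var:.3f}")
```

Changing `v_hat` moves `re` and `ve` by the same amount and leaves `ret_var` where it was, which is the sense in which the optimizer cannot see the second term.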
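The gradient of the RE is
$$\nabla_{\mathbf{w}} \overline{RE}(\mathbf{w}) = -2\,\mathbb{E}\big[(G_t - \hat{v}(S_t, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})\big]$$
We can write the return as
$$G_t = v_\pi(S_t) + \epsilon_t, \qquad \mathbb{E}[\epsilon_t \mid S_t] = 0$$
Then the sampled gradient is
$$-2\,(v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}) \;-\; 2\,\epsilon_t\,\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
The first part is the "true" gradient of the value error. The second part is mean-zero pure noise coming from return randomness. If we somehow had access to $v_\pi(S_t)$, the gradient would be
$$-2\,(v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
which has zero noise from $\epsilon_t$. Since we don't know $v_\pi$, this suggests a strategy: use a lower-variance estimate instead of the raw Monte Carlo return, like TD(0), n-step returns, TD($\lambda$), or GAE.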
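To see how much of the sampled gradient is noise, here is a small sketch with a single state and a linear value estimate. Everything in it (the feature `x`, the weight `w`, `v_true`, `return_std`) is a hypothetical stand-in chosen for illustration, and the "oracle" gradient uses the true value, which a real agent would not have.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: one state with scalar feature x, linear value v_hat = w * x.
x = 1.5
w = 0.2
v_true = 1.0        # true value of the state, normally unknown
return_std = 2.0    # std of the Monte Carlo return around v_true

n = 100_000
G = v_true + return_std * rng.standard_normal(n)

# Sampled gradient of (G - v_hat)^2 with respect to w, one per sampled return.
grad_mc = -2.0 * (G - w * x) * x
# Gradient if the target were the true value itself: no return noise at all.
grad_oracle = -2.0 * (v_true - w * x) * x

print(f"mean of sampled gradients: {grad_mc.mean():.3f}")
print(f"oracle gradient:           {grad_oracle:.3f}")
print(f"std of sampled gradients:  {grad_mc.std():.3f}")
```

Both point in the same direction on average; the Monte Carlo version just carries a standard deviation of about `2 * return_std * abs(x)` per sample on top of it.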
WHY DON'T WE KNOW V_PI? By definition it is the expected discounted sum of all future rewards if we start in state s and follow the policy forever,
$$v_\pi(s) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big]$$
The key word is expected: the average over all possible future trajectories under pi, weighted by their probabilities. To know $v_\pi$ exactly, we'd need the full environment dynamics. We can learn this with a model for $p(s', r \mid s, a)$.
We can see that removing mean-zero noise from the target doesn't change the optimum, only the signal-to-noise ratio (bootstrapped targets like TD trade some bias for this). We know that low variance results in improved learning for RL algorithms, so perhaps there is something we can do to alleviate this term. Additionally, we can now do uncertainty-aware RL where we care about risk and explicitly model the aleatoric component. We can also add this signal into exploration where we have epistemic uncertainty about the value function or the dynamics.
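Is the second term of the mean squared return error a manifestation of the predictive uncertainty in the posterior predictive? The Bayesian predictive distribution:
$$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid x_*, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$
where $p(y_* \mid x_*, \theta)$ is the aleatoric uncertainty and $p(\theta \mid \mathcal{D})$ is the epistemic uncertainty.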
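One way to make the question concrete is the law of total variance applied to the predictive distribution. The notation here is an assumption for illustration: $y_*$ is the predicted target, $x_*$ the test input, $\theta$ the model parameters, and $\mathcal{D}$ the data.

```latex
\mathrm{Var}[y_* \mid x_*, \mathcal{D}]
  = \underbrace{\mathbb{E}_{\theta \mid \mathcal{D}}\big[\mathrm{Var}[y_* \mid x_*, \theta]\big]}_{\text{aleatoric}}
  + \underbrace{\mathrm{Var}_{\theta \mid \mathcal{D}}\big[\mathbb{E}[y_* \mid x_*, \theta]\big]}_{\text{epistemic}}
```

Under this reading, the return variance term $\mathbb{E}[(G_t - v_\pi(S_t))^2] = \mathbb{E}[\mathrm{Var}[G_t \mid S_t]]$ lines up with the aleatoric part. The value error is only loosely analogous to the epistemic part, since it is the squared error of a point estimate rather than a posterior variance.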
Squared Exponential (RBF) Kernel
Encodes the assumption that nearby inputs should have strongly correlated outputs and that the function is very smooth.
For input vectors $x, x' \in \mathbb{R}^D$, the isotropic squared exponential kernel is
$$k(x, x') = \sigma_f^2 \exp\Big(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\Big)$$
$\sigma_f^2$ is the signal variance (overall scale of function values).
$\ell$ = length-scale (how quickly the function changes with distance)
Intuition
if two points are close, the kernel value is near its maximum, the signal variance $\sigma_f^2$, so the model thinks their outputs are highly correlated
if they are far apart, the kernel value decays toward 0, which means their outputs are almost independent
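The intuition above can be checked directly. This is a minimal sketch of the isotropic squared exponential kernel; the function name `rbf_kernel` and its parameter names are chosen here for illustration.

```python
import numpy as np

def rbf_kernel(x, x2, signal_var=1.0, length_scale=1.0):
    """Isotropic squared exponential kernel between two input vectors."""
    x = np.asarray(x, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    sq_dist = np.sum((x - x2) ** 2)
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

a = np.array([0.0, 0.0])
same = rbf_kernel(a, a, signal_var=2.0)           # identical inputs: equals signal_var
near = rbf_kernel(a, [0.1, 0.0], signal_var=2.0)  # close inputs: just below signal_var
far = rbf_kernel(a, [5.0, 0.0], signal_var=2.0)   # distant inputs: close to zero

print(same, near, far)
```

Shrinking `length_scale` makes the kernel fall off faster, so the same two points count as "far apart" sooner.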
ARD (Automatic Relevance Determination)
A Bayesian mechanism that lets the model learn which features matter, usually by giving each feature its own hyperparameter controlling how much it can affect the model.
ARD modifies the squared exponential to give each input dimension $d$ its own length-scale $\ell_d$:
$$k(x, x') = \sigma_f^2 \exp\Big(-\frac{1}{2}\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\ell_d^2}\Big)$$
Why this gives "relevance determination"
if $\ell_d$ is small, small changes in $x_d$ matter a lot, so dimension $d$ is important
if $\ell_d$ is very large, changes in $x_d$ barely affect the kernel, so dimension $d$ is unimportant
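A small sketch of the ARD effect, with one short and one very long length-scale. The function name `ard_kernel` and the specific length-scales are illustrative choices, not values from any fitted model.

```python
import numpy as np

def ard_kernel(x, x2, signal_var, length_scales):
    """Squared exponential kernel with one length-scale per input dimension."""
    x = np.asarray(x, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    scaled_diff = (x - x2) / np.asarray(length_scales, dtype=float)
    return signal_var * np.exp(-0.5 * np.sum(scaled_diff**2))

length_scales = [0.5, 100.0]   # dimension 0 is "relevant", dimension 1 is not
origin = [0.0, 0.0]

move_dim0 = ard_kernel(origin, [1.0, 0.0], 1.0, length_scales)  # kernel drops sharply
move_dim1 = ard_kernel(origin, [0.0, 1.0], 1.0, length_scales)  # kernel barely changes

print(move_dim0, move_dim1)
```

The same unit step costs almost all of the correlation along dimension 0 and almost none along dimension 1, which is how a learned large length-scale switches a feature off.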
PILCO
PILCO v2
PILCO is extraordinarily data efficient and is SOTA for low dimensional data. It takes just 10 samples to learn how to control Cartpole, which has a 4 dimensional continuous vector as input (cart position, cart velocity, pole angle, pole angular velocity) and a discrete binary output: 0 for push cart left and 1 for push cart right.
However, PILCO scales very poorly to high dimensional data, and hasn't been run on any Atari games. I propose that we combine two powerful forces: deep learning for a powerful mapping from high-dimensional inputs to rich low-dimensional latent vectors, which we then feed to PILCO for control.
So conceptually: we use a NN to solve the image-state problem, and then let PILCO do its classic data-efficient model-based RL thing in that latent space.
More formally,
where
CHAT WARNING: if you try to learn the encoder only from PILCO's GP marginal likelihood and the final control reward signal, then you're in trouble. You may not have enough data to shape the representation. It suggests to pretrain with unsupervised/self-supervised objectives or train encoder+GP jointly with a strong auxiliary predictive loss (predict or ), not just control return.
PILCO COMPUTATIONAL PAIN: GP complexity is $O(n^3)$ in the number of state transitions. We'll likely need sparse/variational GPs or fewer training points. This might be where the looping buffer comes into play: we can discard points once we feel confident about them. Chat also warns that backpropagating through PILCO's analytical expected reward and through the encoder can get numerically fiddly.
Action Spaces
For virtual desktop control the input space will be [num keys + x, y, dx, dy, click]
An optimization for the keyboard could be a mapping of ascii characters to a number and then the model predicts the ascii number it wants. Then we have an additional key for NOOP.
Optimized keyboard : . This is the number of ascii plus noop action.
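A minimal sketch of the keyboard encoding described above. It assumes the full 7-bit ASCII range (128 codes) plus one NOOP index, which is an assumption made here for illustration; the names `NUM_ASCII`, `NOOP`, `encode_key`, and `decode_key` are hypothetical.

```python
NUM_ASCII = 128                  # assumption: full 7-bit ASCII range
NOOP = NUM_ASCII                 # one extra index reserved for "press nothing"
NUM_KEY_ACTIONS = NUM_ASCII + 1  # size of the discrete keyboard head

def encode_key(ch):
    """Map a single ASCII character (or None for no-op) to an action index."""
    if ch is None:
        return NOOP
    code = ord(ch)
    if code >= NUM_ASCII:
        raise ValueError(f"not an ASCII character: {ch!r}")
    return code

def decode_key(index):
    """Map an action index back to a character, or None for the no-op."""
    if not 0 <= index < NUM_KEY_ACTIONS:
        raise ValueError(f"action index out of range: {index}")
    return None if index == NOOP else chr(index)

print(encode_key("a"), encode_key(None), decode_key(65))
```

With this layout the model predicts one categorical over `NUM_KEY_ACTIONS` classes instead of a separate output per physical key.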
ALE has 18 discrete actions: 1 no-op, 1 fire, 8 directions, and 8 direction+fire combinations
Why PILCO OOMs RL
PILCO squeezes way more constraints out of a single transition than typical RL.
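For action $u_t$, state $x_t$, next state $x_{t+1}$, and cost $c_t$.
raw data is $(x_t, u_t, x_{t+1})$, which says that the true dynamics at point $(x_t, u_t)$ maps to $x_{t+1}$
The key question is what the learner does with this one sample. The answer is very different in PILCO vs typical RL: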
What RL does:
In the canonical model-free RL setup (Q-learning, A2C, PPO), the sample primarily changes the agent's beliefs about what is good or bad in that specific neighborhood of state-action space. It influences other states only indirectly, via neural net function approximation generalization and slow bootstrapping of values through many episodes. Put differently, in model-free RL each sample is mostly about "how good was it to do this thing here", plus some fuzzy generalization. It does not explicitly use the fact that this sample also teaches you about the dynamics of the environment in a reusable way. So one transition ≈ one local learning nudge.
What PILCO does:
PILCO first says, "This transition is a concrete measurement of the dynamics: $x_{t+1} = f(x_t, u_t)$. Let me update my dynamics model with it". The dynamics are modeled with a Gaussian process, $f \sim \mathcal{GP}$. Because of the GP's smoothness prior, that one data point says not only "at the state-action pair $(x_t, u_t)$ the next state is about $x_{t+1}$" but also "in a whole neighborhood around $(x_t, u_t)$, the dynamics must be similar to what I just saw, unless the data says otherwise".
That is already a huge increase in constraints: one sample constrains infinitely many nearby inputs in a principled way (through the kernel).
Then PILCO:
So this single transition now influences:
The predicted next state distribution at $(x_t, u_t)$
The predicted trajectories for a large family of possible policies
The expected long-term cost of the policies
The gradient of that cost w.r.t. policy parameters
So in PILCO, one transition adjusts your belief about the dynamics, and that revised belief is used in every imagined rollout of every policy you consider.
This sample is used again and again to predict the evolution of x wherever relevant.
Why uncertainty modeling matters for signal vs noise
If you naively built a deterministic model and rolled it forward forever, small errors would explode and the imagined trajectories would become nonsense. Then each sample would contribute not just signal but a ton of misleading information (model bias).
This is what comma suffers from
and dreamer
PILCO's GP gives you a mean and variance
Where you have lots of data, the model is confident: small variance
Where you don't, the model is uncertain: large variance
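The mean-and-variance behaviour can be seen with a bare-bones exact GP regression. This is a generic textbook sketch, not PILCO's actual dynamics model: the 1-D inputs, the sine targets, the noise level, and the helper names `rbf` and `gp_posterior` are all assumptions for illustration.

```python
import numpy as np

def rbf(A, B, signal_var=1.0, length_scale=1.0):
    """Squared exponential kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(X, y, X_star, noise_var=1e-2):
    """Posterior mean and variance of the latent function at X_star."""
    K = rbf(X, X) + noise_var * np.eye(len(X))
    K_s = rbf(X, X_star)
    L = np.linalg.cholesky(K)                            # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf(X_star, X_star)) - np.sum(v**2, axis=0)
    return mean, var

X = np.array([[-1.0], [0.0], [1.0]])   # where we have data
y = np.sin(X[:, 0])
X_star = np.array([[0.1], [6.0]])      # one query near the data, one far away

mean, var = gp_posterior(X, y, X_star)
print("mean:", mean)
print("variance:", var)
```

Near the training points the variance collapses toward the noise level; at the far query the mean falls back to the prior (zero) and the variance returns to the prior signal variance.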
When PILCO computes expected long-term cost, it integrates over that uncertainty.
Policies that rely on poorly known regions get penalized
Policies that keep the system in well-understood regions look better
This means that the unknown parts of the dynamics don't "scream" as confidently as known parts. The algorithm effectively says: "This sample tells me something reliable here. Other regions are uncertain so I won't overtrust them". This keeps the learning signal more truthful and reduces contamination by model errors.
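Integrating a cost over state uncertainty can be done in closed form when the state is Gaussian and the cost is a saturating one (one minus a Gaussian bump). The sketch below is a 1-D illustration under those assumptions, with a Monte Carlo check; the function name `expected_saturating_cost`, the target, and the cost width are hypothetical.

```python
import numpy as np

def expected_saturating_cost(mu, var, target=0.0, width=1.0):
    """E[1 - exp(-(x - target)^2 / (2 width^2))] for scalar x ~ N(mu, var)."""
    s2 = width**2 + var
    return 1.0 - np.sqrt(width**2 / s2) * np.exp(-0.5 * (mu - target) ** 2 / s2)

# Monte Carlo check of the closed form for one uncertain state.
rng = np.random.default_rng(2)
mu, var = 0.3, 0.5
samples = mu + np.sqrt(var) * rng.standard_normal(200_000)
mc = np.mean(1.0 - np.exp(-0.5 * samples**2))
exact = expected_saturating_cost(mu, var)

# Same predicted mean (right on target), different confidence.
confident = expected_saturating_cost(0.0, 0.01)
uncertain = expected_saturating_cost(0.0, 1.0)

print(f"closed form {exact:.4f} vs Monte Carlo {mc:.4f}")
print(f"on target, confident: {confident:.4f}, uncertain: {uncertain:.4f}")
```

Two predictions with the same mean at the target get different expected costs: the uncertain one is charged more, so a policy whose success depends on a poorly known region looks worse. Because the expectation is a smooth function of `mu` and `var`, it can also be differentiated analytically.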
Analytic gradients vs sampled credit assignment
How does the algorithm decide which parameter change improves performance?
Model-free policy gradient: estimates how return changes when you wiggle the policy parameters. These estimates are noisy because each trajectory is random; you need lots of them to get a clear direction.
PILCO: has a differentiable model of the dynamics and cost (in expectation). It takes derivatives by propagating them through the GP model and the cost over time. Since we know analytically how the expected cost depends on the parameters, we can find how to change those parameters with far fewer noisy samples than if we tried to estimate the relationship purely from random trials.
One of the key differences with Gaussian Processes is that they do inference in function space, not weight space. A GP is a probability distribution over functions, and conditioning on data gives a distribution over the functions that fit a set of points.
Additionally, they are non-parametric, aka $\infty$-parametric. In a parametric model, the parameters are the information bottleneck between training and test data; a GP has no such fixed-size bottleneck. It is an unbounded model (its capacity grows with data) pruned by data.
It is also probabilistic. This is good because decision making needs well-calibrated predictive uncertainty. Standard neural networks often do not provide well-calibrated uncertainty estimates out of the box.
PILCO doesn't look like Q-learning, TD-learning, actor-critic, or REINFORCE. Why do we still call it reinforcement learning?
The broad, standard definition of RL is:
"Learning to choose actions over time to minimize expected cumulative cost (or maximize reward) from interaction with an environment, without a supervised target for the correct actions"
PILCO fits that completely.
So PILCO doesn't do temporal difference learning, Q-function learning, or value-iteration-style dynamic programming. Those are specific algorithms within RL. RL as a field is much broader than TD/Q. It includes: direct policy search methods (REINFORCE, black-box optimization in policy space), model-based optimal control with learned models, and Bayesian RL. PILCO is in the model-based, policy search corner of that space. So no TD doesn't mean "not RL", it just means "not value-based RL".
TODO
Implement PILCO and verify it works on Cartpole. Grab the old MATLAB version and then convert it to Torch. Then try to do it on tinygrad and see which is faster.
Still need to figure out how to do internal reward and how to remove redundant points so that compute time doesn't blow up (exact GP cost grows cubically with the number of points).
Then get it to work on Pong.
Then try to play it on two games at once. Or sequentially.
Then start scaling and doing the fancy stuff like pseudo-inputs. Attach a CNN encoder to it. Sherman-Morrison and Nyström approximation.