← go back

NOTES ON RL AND PILCO

nov 28 2025

MSRE

An observable error at each timestep:

(GTv^(St,w))2(G_T - \hat{v}(S_t, w))^2

Mean squared error return in the on-policy case:

RE(w)=Eμ[(Gtv^(St,w))2]\overline{\text{RE}}(w) = \mathbb{E}_{\mu}[(G_t - \hat{v}(S_t, w))^2] =Eμ[((vπ(St)v^(st,w))+(Gtvπ(St)))2]= \mathbb{E}_{\mu}[((v_{\pi}(S_t) - \hat{v}(s_t, w))+(G_t - v_{\pi}(S_t)))^2] =Eμ[(vπ(St)v^(st,w))2+(Gtvπ(St))2]= \mathbb{E}_{\mu}[(v_{\pi}(S_t) - \hat{v}(s_t, w))^2+(G_t - v_{\pi}(S_t))^2] =Eμ[(vπ(St)v^(st,w))2]+Eμ[(Gtvπ(St))2]= \mathbb{E}_{\mu}[(v_{\pi}(S_t) - \hat{v}(s_t, w))^2] + \mathbb{E}_{\mu}[(G_t - v_{\pi}(S_t))^2] =VE(w)+Eμ[(Gtvπ(St))2]= \overline{\text{VE}}(w)+ \mathbb{E}_{\mu}[(G_t - v_{\pi}(S_t))^2]

So we have a learnable value error,

VE(w)\overline{\text{VE}}(w)

It measures how far our learned value function is from the true value function. It depends on w and can, in principle, be reduced by more data, a better architecture, better optimization, or a rich function class. What architecture do we know of that loves data? Transformers. A lot of people think that transformers are incompatible with RL, but I think they can serve as the backbone for powerful value functions.

We also have a second variance term that does not depend on the parameter vector. This can be viewed as the "return variance". It depends entirely on the environment randomness, not your value function. It represents the unavoidable randomness in the return that does not depend on your weights. So this is why minimizing RE is equivalent to minimizing VE, the extra noise term doesn't change with w so gradient descent ignores it. Can we take advantage of the knowledge of the return variance?

The gradient of the RE is

wRE(w)=E[2(Gtv^(St,w))wv^(St,w)]\nabla_w \overline{\text{RE}}(w) = \mathbb{E}[-2(G_t - \hat{v}(S_t, w))\nabla_w \hat{v}(S_t, w)]

We can write the return as

Gt=vπ(St)+ϵt,E[ϵtSt]=0G_t = v_{\pi}(S_t) + \epsilon_t, \mathbb{E}[\epsilon_t | S_t] = 0

Then the sampled gradient is

gt(w)=2(Gtv^)wv^=2(vπv^)wv^2ϵtwv^g_t(w) = -2(G_t - \hat{v})\nabla_w \hat{v} = -2(v_{\pi} - \hat{v})\nabla_w \hat{v} - 2 \epsilon_t \nabla_w \hat{v}

The first part is the "true" gradient of the valye error. The second part is mean zero pure noise coming from return randomness. If we somehow had access to vπ(St)v_{\pi}(S_t), the gradient would be

g~t(w)=2(vπ(St)v^(St,w))wv^(St,w)\tilde{g}_t(w) = -2(v_{\pi}(S_t) - \hat{v}(S_t, w)) \nabla_w \hat{v}(S_t, w)

which has zero noise from ϵt\epsilon_t. Since we don't know vpiv_{pi}, this suggests a strategy to use a lower variance estimate instead of the raw monte carlo return, like TD(0), n-step, td(lambda), GAE.

WHY DON"T WE KNOW V_PI? By definition it is the expected discounted sum of all future rewards if we start in state s and follow the policy forever,

vπ(s)=E[GtSt=s]v_{\pi}(s) = \mathbb{E}[G_t | S_t = s]

The key word is expected, the averafe of all possible future trajectories under pi weighted by their probabilities. To know vpiv_{pi} exactly, we'd need the full environment dynamics. We can learn this with a model for p(s,rs,a)p(s^{\prime}, r | s, a).

We can see that variance reduction doesn't change the optimum, only the signal-to-noise ratio. We know that the low variance results in improved learning for RL algorithms, so perhaps there is something we can do to alievate this term. Additionally, we can now do uncertainty aware RL where we care about risk and explicitly model the aleatoric component. We can also add this signal into exploration where we have epistemic uncertainty about vπv_{\pi} or qπq_{\pi}.

Is the mean squared error return second term a manifestation of the predictive uncertainty in the predictive posterior? The Bayesian Predictive distribution:

p(yx,X,Y)=N(y;μTφ(x),σ2+φ(x)TΣφ(x))p(y^* | x^*, X, Y) = \mathcal{N}(y^*; \mu^{\prime T} \varphi(x^*), \sigma^2 + \varphi(x^*)^T \Sigma^{\prime} \varphi(x^*))

where σ2\sigma^2 is the aleatoric uncertainty and φ(x)TΣφ(x)\varphi(x^*)^T \Sigma^{\prime} \varphi(x^*) is the epistemic uncertainty.

Squared Exponential (RBF) Kernel

Encodes the assumption that nearby inputs should have strong correlated outputs and that the function is very smooth.

For input vectors x,xRDx, x^\prime \in \mathbb{R}^D, the isotropic squared exponential kernel is

k(x,x)=σ2exp(xx22l2)k(x, x^\prime) = \sigma^2 \text{exp}(-\frac{||x-x^\prime||^2}{2l^2})

σ2\sigma^2 is the signal variance (overall scale of function values).

ll = length-scale (how quickly the function changes with distance)

Intuition

  • if two points x,xx, x^\prime are close, the kernel value is near σ2\sigma^2 then the model thinks their outputs are highly correlated

  • if they are far apart, the kernel value decays toward 0, which means their outputs are almost independent

ARD (Automatic Relevance Determination)

A bayesian mechanism that lets the model learn which features matter, usually by giving each feature its own hyperparameter controlling how much it can affect the model

ARD modifies the squared exponential to give each input dimension its own length-scale

k(x,x)=σ2exp(12Σd=1D(xdxd)2ld2)k(x, x^\prime) = \sigma^2 \text{exp}(-\frac{1}{2} \Sigma_{d=1}^D \frac{(x_d - x_d^\prime)^2}{l_d^2})

Why this gives "relevance determination"

  • if ldl_d is small, small changes in xdx_d matter a lot, so dimension dd is important

  • if ldl_d is very large, changes in xdx_d barely affect the kernel, so the dimension is unimportant

PILCO

PILCO v2

PILCO has extraordinary processing capablility and is SOTA for low dimensional data. It takes just 10 samples to learn how to control Cartpole, which has a 4 dimensional continuous vector as input (cart position, cart velocity, pole angle, pole angular velocity) and a discrete binary output: 0 for push cart left and 1 for push cart right.

However, PILCO scaled very poorly to high dimensional data, and hasn't been ran on any atari games. I propose that we combine two powerful forces: deep learning for powerful mapping from high dims to rich low dim latent vectors, and then feed this to PILCO for control.

So conceptually: we use a NN to solve the images \to state problem, and then let PILCO do its classic data efficient model-based rl thing in that latent space.

More formally,

  1. Encoder: a neural net maps pixels to a low dimensional latent state
zt=eϕ(ot)z_t = e_{\phi}(o_t)
  1. GP dynamics model: Run PILCO on this latent state
zt+1=f(zt,at)+ϵz_{t+1} = f(z_t, a_t) + \epsilon

where fGPf \sim GP

  1. Policy: πθ(zt)\pi_{\theta}(z_t) maps latent state to actions, optimized via PILCO's usual analytic gradients through the GP dynamics

CHAT WARNING: if you try to learn the encoder only from PILCO's GP marginal likelihood and the final control reward signal, then you're in trouble. You may not have enough data to shape the representation. It suggests to pretrain with unsupervised/self-supervised objectives or train encoder+GP jointly with a strong auxiliary predictive loss (predict zt+1z_{t+1} or ot+1o_{t+1}), not just control return.

PILCO COMPUTATIONAL PAIN: GP complexity is O(N3)O(N^3) in number of state transitions. We'll likely need space/variational GPs or fewer training points. This might be where the looping buffer comes into play, we can discard points once we feel confident about them. chat also warns that backpropagating through PILCO's analytical expected reward and through the encoder can get numerically fiddly.

Action Spaces

For virtual desktop control the input space will be [num keys + x, y, dx, dy, click]

  • An optimization for the keyboard could be a mapping of ascii characters to a number and then the model predicts the ascii number it wants. Then we have an additional key for NOOP.

  • Optimized keyboard : A=[a[0,127+1]]\mathcal{A} = [a \in [0,127+1]]. This is the number of ascii plus noop action.

ALE has 18 discrete actions: 17 directions + 1 no op

  1. NOOP (do nothing)
  2. FIRE
  3. UP
  4. RIGHT
  5. LEFT
  6. DOWN
  7. UPRIGHT
  8. UPLEFT
  9. DOWNRIGHT
  10. DOWNLEFT
  11. RIGHTFIRE
  12. LEFTFIRE
  13. DOWNFIRE
  14. UPFIRE
  15. UPRIGHTFIRE
  16. UPLEFTFIRE
  17. DOWNRIGHTFIRE
  18. DOWNLEFTFIRE

Why PILCO OOMs RL

PILCO squeezes way more constraints out of a single transition than typical RL.

For action utu_t, state xtx_t, next state xt+1x_{t+1}, and cost cc.

raw data is (xt,ut,xt+1,ctx_t, u_t, x_{t+1}, c_t), which says that the true dynamics at point (xt,utx_t, u_t) maps to xt+1x_{t+1}

The key questions, The answer is very different in PILCO vs typical RL:

  1. Where does this information get used?
  2. How many different decisions does it influence?

What RL does:

  1. Updates a value or q-function like target=ct+γV(xt+1)\text{target} = c_t+\gamma V(x_{t+1})
  2. Maybe updates the policy a bit in the direction "if this transition looked good, favor u_t in state x_t. if bad, disfavor it"

In the canonical model free RL setup (q-learning, a2c, ppo), the sample primarily changes the agent's beliefs about what is good/bad in that specific neighborhood of state-action space. It influences other states only indirectly via neural net function approximation generalization and slow bootstrapping of values through many episodes. Put differently, in model-free RL, each sample is mostly about "how good was it to do this thing here" plus some fuzzy generalization. It does not explicitly use the fact that this sample also teach you about the dynamics of the environment in a reusable way. So one transition \approx one local learning nudge.

What PILCO does:

PILCO first says, "This transition is a concrete measurement of the dynamics: xt+1=f(xt,ut)+noisex_{t+1} = f(x_t, u_t) + \text{noise}. Let me update my dynamics model with it". The dynamics are modeled with a Gaussian process fGP(smooth kernel)f \sim \text{GP(smooth kernel)}. Because of the GP's smoothness prior, that one data point says not only "at (xt,utx_t, u_t) the next state is about (xt+1x_{t+1})" but also "in a whole neighborhood around (xt,utx_t, u_t), the dynamics must be similar to what i just saw, unless the data says otherwise".

That is already a huge increase in constraints: one sample constrains infinitely many nearby inputs in a principles way (through the kernel).

Then PILCO:

  1. Uses the updated GP to simulate how the system will evolve under any candidate policy, across many time steps, from the start distribution.
  2. For each candidate policy, it computes (approximately) the entire trajectory distribution of states over time.
  3. From that, it computes the expected cumulative cost of that policy
  4. It differentiates that expected cost w.r.t. the policy parameters to get a gradient

So this single transition now influences:

  • The predicted next state distribution at (xt,utx_t, u_t)

  • The predicted trajectories for a large family of possible policies

  • The expected long-term cost of the policies

  • The gradient of that cost w.r.t. policy parameters

So in PILCO, one transition adjusts your belief about the dynamics, and that revised belief is used in every imagined rollout of every policy you consider.

This sample is used again and again to predict the evolution of x wherever relevant.

Why uncertainty modeling matters for signal vs noise

If you naively built a deterministic model and rolled it forward forever, small errors would explode and the imagined trajectories would become nonsense. Then each sample would contribute not just signal but a ton of misleading infromation (model bias)

  • This is what comma suffers from

  • and dreamer

PILCO's GP gives you a mean and variance

  • Where you have lots of data, the model is confident \to small variance

  • Where you don't, the model is uncertain \to large variance

When PILCO computes expected long-term cost, it integrates over that uncertainty.

  • Policies that rely on poorly known regions get penalized

  • Policies that keep the system in well-understood regions look better

This means taht the unknown parts of the dynamics don't "scream" as confidently as known parts. The algorithm effectively says: "This sample tells me something reliable here. Other regions are uncertain so I won't overtrust them". This keeps the learning signal more truthful and reduces contamination by model errors.

Analytic gradients vs sampled credit assignment

How does the algorithm decide which parameter change improves performance?

  • Model free policy gradient: Estimates how return changes when you wiggle the policy parameters. These estimates are noisy because each trajectory is random; you need lots of them to get a clear direction

  • PILCO: Has a differentiable model of the dynamics and cost (in expectation). It takes derivatives E[cost]Θ\frac{\partial \mathbb{E}[cost]}{\partial \Theta} by propogating derivatives through the GP model and the cost over time. Since we know how a quantity depends on parameters analytically, we can find how to change those parameters with far fewer noisy samples than if you try to estimate the relationship purely from random trials.

Weight space vs function space

One of the key differences with Gaussian Processes is that they optimize in function space, not weight space. So they are a probability distribution over possible functions that fit a set of points.

Aditionally, they are non-parametric, aka \infty-parametric. The model parameters are the information bottleneck between training and test data. training dataΘtest data\text{training data} \to \Theta \to \text{test data}. It is an unbounded model (Θ\Theta grows with data) pruned by data

It is also probabilistic. This is good because decision making needs well-calibrated predictive uncertainty. Neural networks do not provide well calibrated uncertainty estimates.

Why PILCO is considered RL at all

PILCO doesn't look like Q-learning, TD-learning, actor-critic, or REINFORCE. Why do we still call it reinforcement learning?

The broad, standard definition, of RL is:

"Learning to choose actions over time to minimize expected cumulative cost (or maximize reward) from interaction with an environment, without a supervised target for the correct actions"

PILCO fits that completely.

  1. Interaction-based learning. It intereacts with the environmnet: executes a policy on the real system, collects transitions. There is no supervised "ground truth" for the optimal actions-only, only costs from the environment
  2. Sequential decision making / control. It chooses a policy mapping states \to actions. It evaluates policies by their long-term cumulative cost under the learned dynamics model.
  3. Data-driven improvement loop: it uses observed data to improve its model and its policy in a closed loop. Collection data \to fit/upgrade dynamics model \to update policy \to collection more data \to ...

So PILCO doesn't do temporal difference learning, q-function learning, value-teration-style dynamic programming. Those are specific algorithms within RL. RL as a field is much broader than TD/Q. It includes: Direct policy search methods (REINFORCE, black-box optimization in policy space), Model-based optimal control with learned models, Bayesian RL. PILCO is in the model-based, policy search corner of that space. SO no TD doesn't mean "not RL", it just meas "not value-based RL"

TODO

Implement PILCO and verify it works on Cartpole. Grab the old matlab version and then convert it to Torch. Then try to do it on tinygrad and see which is faster.

Still need to figure out how to do internal reward and how to remove redundant points so that compute time doesn't grow exponentially.

  • Idea is to summarise dataset by small number (M) psuedo data. Then we can have constant or logarithmic scaling and choose M to be within a suitable range for available compute.

Then get it to work on Pong.

Then try to play it on two games at once. Or sequentially.

Then start scaling and doing the fancy stuff like psuedo inputs. Attach a CNN encoder to it. Sherman Morrison and Nystrom Approximation.