NOTES ON RL AND PILCO | KEVIN BUHLER

MSRE

An observable error at each timestep:

(G_T - \hat{v}(S_t, w))^2

Mean squared error return in the on-policy case:

\overline{\text{RE}}(w) = \mathbb{E}_{\mu}[(G_t - \hat{v}(S_t, w))^2]

= \mathbb{E}_{\mu}[((v_{\pi}(S_t) - \hat{v}(s_t, w))+(G_t - v_{\pi}(S_t)))^2]

= \mathbb{E}_{\mu}[(v_{\pi}(S_t) - \hat{v}(s_t, w))^2+(G_t - v_{\pi}(S_t))^2]

= \mathbb{E}_{\mu}[(v_{\pi}(S_t) - \hat{v}(s_t, w))^2] + \mathbb{E}_{\mu}[(G_t - v_{\pi}(S_t))^2]

= \overline{\text{VE}}(w)+ \mathbb{E}_{\mu}[(G_t - v_{\pi}(S_t))^2]

So we have a learnable value error,

\overline{\text{VE}}(w)

It measures how far our learned value function is from the true value function. It depends on w and can, in principle, be reduced by more data, a better architecture, better optimization, or a rich function class. What architecture do we know of that loves data? Transformers. A lot of people think that transformers are incompatible with RL, but I think they can serve as the backbone for powerful value functions.

We also have a second variance term that does not depend on the parameter vector. This can be viewed as the "return variance". It depends entirely on the environment randomness, not your value function. It represents the unavoidable randomness in the return that does not depend on your weights. So this is why minimizing RE is equivalent to minimizing VE, the extra noise term doesn't change with w so gradient descent ignores it. Can we take advantage of the knowledge of the return variance?

The gradient of the RE is

\nabla_w \overline{\text{RE}}(w) = \mathbb{E}[-2(G_t - \hat{v}(S_t, w))\nabla_w \hat{v}(S_t, w)]

We can write the return as

G_t = v_{\pi}(S_t) + \epsilon_t, \mathbb{E}[\epsilon_t | S_t] = 0

Then the sampled gradient is

g_t(w) = -2(G_t - \hat{v})\nabla_w \hat{v} = -2(v_{\pi} - \hat{v})\nabla_w \hat{v} - 2 \epsilon_t \nabla_w \hat{v}

The first part is the "true" gradient of the valye error. The second part is mean zero pure noise coming from return randomness. If we somehow had access to $v_{\pi}(S_t)$ , the gradient would be

\tilde{g}_t(w) = -2(v_{\pi}(S_t) - \hat{v}(S_t, w)) \nabla_w \hat{v}(S_t, w)

which has zero noise from $\epsilon_t$ . Since we don't know $v_{pi}$ , this suggests a strategy to use a lower variance estimate instead of the raw monte carlo return, like TD(0), n-step, td(lambda), GAE.

WHY DON"T WE KNOW V_PI? By definition it is the expected discounted sum of all future rewards if we start in state s and follow the policy forever,

v_{\pi}(s) = \mathbb{E}[G_t | S_t = s]

The key word is expected, the averafe of all possible future trajectories under pi weighted by their probabilities. To know $v_{pi}$ exactly, we'd need the full environment dynamics. We can learn this with a model for $p(s^{\prime}, r | s, a)$ .

We can see that variance reduction doesn't change the optimum, only the signal-to-noise ratio. We know that the low variance results in improved learning for RL algorithms, so perhaps there is something we can do to alievate this term. Additionally, we can now do uncertainty aware RL where we care about risk and explicitly model the aleatoric component. We can also add this signal into exploration where we have epistemic uncertainty about $v_{\pi}$ or $q_{\pi}$ .

Is the mean squared error return second term a manifestation of the predictive uncertainty in the predictive posterior? The Bayesian Predictive distribution:

p(y^* | x^*, X, Y) = \mathcal{N}(y^*; \mu^{\prime T} \varphi(x^*), \sigma^2 + \varphi(x^*)^T \Sigma^{\prime} \varphi(x^*))

where $\sigma^2$ is the aleatoric uncertainty and $\varphi(x^*)^T \Sigma^{\prime} \varphi(x^*)$ is the epistemic uncertainty.

Squared Exponential (RBF) Kernel

Encodes the assumption that nearby inputs should have strong correlated outputs and that the function is very smooth.

For input vectors $x, x^\prime \in \mathbb{R}^D$ , the isotropic squared exponential kernel is

k(x, x^\prime) = \sigma^2 \text{exp}(-\frac{||x-x^\prime||^2}{2l^2})

$\sigma^2$ is the signal variance (overall scale of function values).

$l$ = length-scale (how quickly the function changes with distance)

Intuition

if two points $x, x^\prime$ are close, the kernel value is near $\sigma^2$ then the model thinks their outputs are highly correlated
if they are far apart, the kernel value decays toward 0, which means their outputs are almost independent

ARD (Automatic Relevance Determination)

A bayesian mechanism that lets the model learn which features matter, usually by giving each feature its own hyperparameter controlling how much it can affect the model

ARD modifies the squared exponential to give each input dimension its own length-scale

k(x, x^\prime) = \sigma^2 \text{exp}(-\frac{1}{2} \Sigma_{d=1}^D \frac{(x_d - x_d^\prime)^2}{l_d^2})

Why this gives "relevance determination"

if $l_d$ is small, small changes in $x_d$ matter a lot, so dimension $d$ is important
if $l_d$ is very large, changes in $x_d$ barely affect the kernel, so the dimension is unimportant

PILCO

PILCO v2

PILCO has extraordinary processing capablility and is SOTA for low dimensional data. It takes just 10 samples to learn how to control Cartpole, which has a 4 dimensional continuous vector as input (cart position, cart velocity, pole angle, pole angular velocity) and a discrete binary output: 0 for push cart left and 1 for push cart right.

However, PILCO scaled very poorly to high dimensional data, and hasn't been ran on any atari games. I propose that we combine two powerful forces: deep learning for powerful mapping from high dims to rich low dim latent vectors, and then feed this to PILCO for control.

So conceptually: we use a NN to solve the images $\to$ state problem, and then let PILCO do its classic data efficient model-based rl thing in that latent space.

More formally,

Encoder: a neural net maps pixels to a low dimensional latent state

z_t = e_{\phi}(o_t)

GP dynamics model: Run PILCO on this latent state

z_{t+1} = f(z_t, a_t) + \epsilon

where $f \sim GP$

Policy: $\pi_{\theta}(z_t)$ maps latent state to actions, optimized via PILCO's usual analytic gradients through the GP dynamics

CHAT WARNING: if you try to learn the encoder only from PILCO's GP marginal likelihood and the final control reward signal, then you're in trouble. You may not have enough data to shape the representation. It suggests to pretrain with unsupervised/self-supervised objectives or train encoder+GP jointly with a strong auxiliary predictive loss (predict $z_{t+1}$ or $o_{t+1}$ ), not just control return.

PILCO COMPUTATIONAL PAIN: GP complexity is $O(N^3)$ in number of state transitions. We'll likely need space/variational GPs or fewer training points. This might be where the looping buffer comes into play, we can discard points once we feel confident about them. chat also warns that backpropagating through PILCO's analytical expected reward and through the encoder can get numerically fiddly.

Action Spaces

For virtual desktop control the input space will be [num keys + x, y, dx, dy, click]

An optimization for the keyboard could be a mapping of ascii characters to a number and then the model predicts the ascii number it wants. Then we have an additional key for NOOP.
Optimized keyboard : $\mathcal{A} = [a \in [0,127+1]]$ . This is the number of ascii plus noop action.

ALE has 18 discrete actions: 17 directions + 1 no op

NOOP (do nothing)
FIRE
UP
RIGHT
LEFT
DOWN
UPRIGHT
UPLEFT
DOWNRIGHT
DOWNLEFT
RIGHTFIRE
LEFTFIRE
DOWNFIRE
UPFIRE
UPRIGHTFIRE
UPLEFTFIRE
DOWNRIGHTFIRE
DOWNLEFTFIRE

Why PILCO OOMs RL

PILCO squeezes way more constraints out of a single transition than typical RL.

For action $u_t$ , state $x_t$ , next state $x_{t+1}$ , and cost $c$ .

raw data is ( $x_t, u_t, x_{t+1}, c_t$ ), which says that the true dynamics at point ( $x_t, u_t$ ) maps to $x_{t+1}$

The key questions, The answer is very different in PILCO vs typical RL:

Where does this information get used?
How many different decisions does it influence?

What RL does:

Updates a value or q-function like $\text{target} = c_t+\gamma V(x_{t+1})$
Maybe updates the policy a bit in the direction "if this transition looked good, favor u_t in state x_t. if bad, disfavor it"

In the canonical model free RL setup (q-learning, a2c, ppo), the sample primarily changes the agent's beliefs about what is good/bad in that specific neighborhood of state-action space. It influences other states only indirectly via neural net function approximation generalization and slow bootstrapping of values through many episodes. Put differently, in model-free RL, each sample is mostly about "how good was it to do this thing here" plus some fuzzy generalization. It does not explicitly use the fact that this sample also teach you about the dynamics of the environment in a reusable way. So one transition $\approx$ one local learning nudge.

What PILCO does:

PILCO first says, "This transition is a concrete measurement of the dynamics: $x_{t+1} = f(x_t, u_t) + \text{noise}$ . Let me update my dynamics model with it". The dynamics are modeled with a Gaussian process $f \sim \text{GP(smooth kernel)}$ . Because of the GP's smoothness prior, that one data point says not only "at ( $x_t, u_t$ ) the next state is about ( $x_{t+1}$ )" but also "in a whole neighborhood around ( $x_t, u_t$ ), the dynamics must be similar to what i just saw, unless the data says otherwise".

That is already a huge increase in constraints: one sample constrains infinitely many nearby inputs in a principles way (through the kernel).

Then PILCO:

Uses the updated GP to simulate how the system will evolve under any candidate policy, across many time steps, from the start distribution.
For each candidate policy, it computes (approximately) the entire trajectory distribution of states over time.
From that, it computes the expected cumulative cost of that policy
It differentiates that expected cost w.r.t. the policy parameters to get a gradient

So this single transition now influences:

The predicted next state distribution at ( $x_t, u_t$ )
The predicted trajectories for a large family of possible policies
The expected long-term cost of the policies
The gradient of that cost w.r.t. policy parameters

So in PILCO, one transition adjusts your belief about the dynamics, and that revised belief is used in every imagined rollout of every policy you consider.

This sample is used again and again to predict the evolution of x wherever relevant.

Why uncertainty modeling matters for signal vs noise

If you naively built a deterministic model and rolled it forward forever, small errors would explode and the imagined trajectories would become nonsense. Then each sample would contribute not just signal but a ton of misleading infromation (model bias)

This is what comma suffers from
and dreamer

PILCO's GP gives you a mean and variance

Where you have lots of data, the model is confident $\to$ small variance
Where you don't, the model is uncertain $\to$ large variance

When PILCO computes expected long-term cost, it integrates over that uncertainty.

Policies that rely on poorly known regions get penalized
Policies that keep the system in well-understood regions look better

This means taht the unknown parts of the dynamics don't "scream" as confidently as known parts. The algorithm effectively says: "This sample tells me something reliable here. Other regions are uncertain so I won't overtrust them". This keeps the learning signal more truthful and reduces contamination by model errors.

Analytic gradients vs sampled credit assignment

How does the algorithm decide which parameter change improves performance?

Model free policy gradient: Estimates how return changes when you wiggle the policy parameters. These estimates are noisy because each trajectory is random; you need lots of them to get a clear direction
PILCO: Has a differentiable model of the dynamics and cost (in expectation). It takes derivatives $\frac{\partial \mathbb{E}[cost]}{\partial \Theta}$ by propogating derivatives through the GP model and the cost over time. Since we know how a quantity depends on parameters analytically, we can find how to change those parameters with far fewer noisy samples than if you try to estimate the relationship purely from random trials.

Weight space vs function space

One of the key differences with Gaussian Processes is that they optimize in function space, not weight space. So they are a probability distribution over possible functions that fit a set of points.

Aditionally, they are non-parametric, aka $\infty$ -parametric. The model parameters are the information bottleneck between training and test data. $\text{training data} \to \Theta \to \text{test data}$ . It is an unbounded model ( $\Theta$ grows with data) pruned by data

It is also probabilistic. This is good because decision making needs well-calibrated predictive uncertainty. Neural networks do not provide well calibrated uncertainty estimates.

Why PILCO is considered RL at all

PILCO doesn't look like Q-learning, TD-learning, actor-critic, or REINFORCE. Why do we still call it reinforcement learning?

The broad, standard definition, of RL is:

"Learning to choose actions over time to minimize expected cumulative cost (or maximize reward) from interaction with an environment, without a supervised target for the correct actions"

PILCO fits that completely.

Interaction-based learning. It intereacts with the environmnet: executes a policy on the real system, collects transitions. There is no supervised "ground truth" for the optimal actions-only, only costs from the environment
Sequential decision making / control. It chooses a policy mapping states $\to$ actions. It evaluates policies by their long-term cumulative cost under the learned dynamics model.
Data-driven improvement loop: it uses observed data to improve its model and its policy in a closed loop. Collection data $\to$ fit/upgrade dynamics model $\to$ update policy $\to$ collection more data $\to$ ...

So PILCO doesn't do temporal difference learning, q-function learning, value-teration-style dynamic programming. Those are specific algorithms within RL. RL as a field is much broader than TD/Q. It includes: Direct policy search methods (REINFORCE, black-box optimization in policy space), Model-based optimal control with learned models, Bayesian RL. PILCO is in the model-based, policy search corner of that space. SO no TD doesn't mean "not RL", it just meas "not value-based RL"

TODO

Implement PILCO and verify it works on Cartpole. Grab the old matlab version and then convert it to Torch. Then try to do it on tinygrad and see which is faster.

Still need to figure out how to do internal reward and how to remove redundant points so that compute time doesn't grow exponentially.

Idea is to summarise dataset by small number (M) psuedo data. Then we can have constant or logarithmic scaling and choose M to be within a suitable range for available compute.

Then get it to work on Pong.

Then try to play it on two games at once. Or sequentially.

Then start scaling and doing the fancy stuff like psuedo inputs. Attach a CNN encoder to it. Sherman Morrison and Nystrom Approximation.