← go back

IMPORTANT AI DERIVATIONS

may 26 2026

Deriving the most important formulas in AI from first principles.

Policy Gradient: where θ\theta is the policy parameters, η(s)\eta(s) is the unnormalized state visitation count counting the expected number of time steps spent in state ss per episode, μ(s)\mu(s) is the on-policy stationary distribution over states normalized into a probability distribution. Derivation from RL book.

vπ(s)=[aπ(as)qπ(s,a)]=a[π(as)qπ(s,a)+π(as)π(s,a)]=a[π(as)qπ(s,a)+π(as)s,rp(s,rs,a)(r+vπ(s))]=a[π(as)qπ(s,a)+π(as)s,rp(s,rs,a)vπ(s)]=a[π(as)qπ(s,a)+π(as)s,rp(s,rs,a)a[π(as)qπ(s,a)+π(as)sp(ss,a)vπ(s)]]=xSk=0Pr(sx,k,π)aπ(ax)qπ(x,a)J(θ)=vπ(s0)=s(k=0Pr(s0s,k,π))aπ(as)qπ(s,a)=sη(s)aπ(as)qπ(s,a)=sη(s)sη(s)sη(s)aπ(as)qπ(s,a)=sη(s)sμ(s)aπ(as)qπ(s,a)sμ(s)aπ(as)qπ(s,a)\begin{aligned} \nabla v_\pi(s) &= \nabla[\sum_a \pi(a|s) q_{\pi}(s,a)] \\ &= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \nabla_{\pi}(s,a)] \\ &= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \nabla \sum_{s',r} p(s',r|s,a)(r+v_{\pi}(s')) ] \\ &= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \sum_{s',r} p(s',r|s,a)\nabla v_{\pi}(s') ] \\ &= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \sum_{s',r} p(s',r|s,a) \sum_{a'} [ \nabla \pi(a'|s') q_{\pi}(s',a') + \pi(a'|s') \sum_{s''} p(s''|s',a') \nabla v_{\pi}(s'') ]] \\ &= \sum_{x\in \mathcal{S}} \sum^\infty_{k=0} Pr(s \rightarrow x, k, \pi) \sum_a \nabla \pi(a|x) q_{\pi}(x,a) \\ \nabla J(\theta) &= v_{\pi}(s_0) \\ &= \sum_s (\sum_{k=0}^{\infty} Pr(s_0 \rightarrow s, k, \pi) ) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\ &= \sum_s \eta(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\ &= \sum_{s'} \eta(s') \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\ &= \sum_s \eta(s) \sum_s \mu(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\ &\propto \sum_s \mu(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \end{aligned}

Kullback-Leibler Divergence:

I(x)=logp(x)H(P)=Ex P[logp(x)]=xp(s)logp(x)H(P,Q)=xp(s)logq(x)DKL(PQ)=H(P,Q)H(P)=xp(x)logq(x)+xp(x)logp(x)=xp(x)logp(x)q(x)\begin{aligned} I(x) &= - \log p(x) \\ H(P) &= \mathbb{E}_{x~P} [-\log p(x) ] = - \sum_x p(s) \log p(x) \\ H(P,Q) &= - \sum_x p(s) \log q(x) \\ D_{KL} (P || Q) &= H(P,Q) - H(P) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x)\\ &= \sum_x p(x) \log \frac{p(x)}{q(x)} \end{aligned}

Variational Lower Bound (ELBO):

We know that from Bayes rule that

p(x)=p(x,z)p(zx) p(x) = \frac{p(x,z)}{p(z|x)}

If we assume that p(x)p(x) is a constant, we can take an expectation and it won't change anything since an expectation of a constant is a constant. So lets take with expectation to q(z|x)

p(x)=Eq(zx)[p(x,z)p(zx)] p(x) = \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} ] p(x)=Eq(zx)[p(x,z)p(zx)q(zx)q(zx)] p(x) = \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ] logp(x)=logEq(zx)[p(x,z)p(zx)q(zx)q(zx)] \log p(x) = \log \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ]

Since p(x)p(x) is a constant the log gets pushed in

logp(x)=Eq(zx)[logp(x,z)p(zx)q(zx)q(zx)] \log p(x) = \mathbb{E}_{q(z|x)}[\log \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ]

First term is the ELBO L\mathcal{L} and the second is KL divergence

logp(x)=Eq(zx)[logp(x,z)logq(zx)]+Eq(zx)[logq(zx)p(zx)] \log p(x) = \mathbb{E}_{q(z|x)}[\log p(x,z) - \log q(z|x)] + \mathbb{E}_{q(z|x)}[\log \frac{q(z|x)}{p(z|x)}] logp(x)=L+KL(q(zx)p(zx)) \log p(x) = \mathcal{L} + KL(q(z|x) || p(z|x))

Equivalently

KL(q(zx)p(zx))=logp(x)L KL(q(z|x) || p(z|x)) = \log p(x) - \mathcal{L}

The KL divergence is always non-negative and p(x)p(x) is constant. Beautifully, this means that maximizing L\mathcal{L} means KL(q(zx)p(zx))KL(q(z|x) || p(z|x)) is minimized and pushed closer to the true posterior.

REINFORCE

Derivation from UCL x Deepmind 2021 slides:

θ=θaπθ(a)E[RtAt=a]=θaπθ(a)q(a)=aq(a)θπθ(a)=aq(a)πθ(a)πθ(a)θπθ(a)=aπθ(a)q(a)θπθ(a)πθ(a)=E[Rtθπθ(At)πθ(At)]=E[Rtθlogπθ(At)]\begin{aligned} \nabla_{\theta} &= \nabla_{\theta} \sum_a \pi_{\theta}(a) \mathbb{E}[R_t | A_t = a] \\ &= \nabla_{\theta} \sum_a \pi_{\theta}(a) q(a) \\ &= \sum_a q(a) \nabla_{\theta} \pi_{\theta}(a) \\ &= \sum_a q(a) \frac{ \pi_{\theta}(a)}{ \pi_{\theta}(a)} \nabla_{\theta} \pi_{\theta}(a) \\ &= \sum_a \pi_{\theta}(a) q(a) \frac{\nabla_{\theta} \pi_{\theta}(a)}{ \pi_{\theta}(a)} \\ &= \mathbb{E}[R_t \frac{\nabla_{\theta} \pi_{\theta}(A_t)}{ \pi_{\theta}(A_t)} ] \\ &= \mathbb{E}[R_t \nabla_{\theta} \log \pi_{\theta}(A_t) ] \end{aligned}

Backprop

This somes from "Learning representations by back-propagating errors"

Starting with energy function of E=12cj(yc,jdj,c)2E = \frac{1}{2} \sum_c \sum_j (y_{c,j}-d_{j,c})^2

Differentiating,

Eyj=yjdj\frac{\partial E}{\partial y_j} = y_j - d_j

By chain rule

Exj=Eyjdyjdxj\frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \frac{d y_j}{d x_j} =Eyjyj(1yj)= \frac{\partial E}{\partial y_j} \cdot y_j (1-y_j)

This expression tells us how a change in input xjx_j will affect the error. With this we can compute the error affected by changing the network's weights

Ewji=Exjxjwji \frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_j} \cdot \frac{\partial x_j}{\partial w_{ji}} =Exjyj= \frac{\partial E}{\partial x_j} \cdot y_j

You can now update the weights with

Δw=ϵEw\Delta w = - \epsilon \frac{\partial E}{\partial w}

Upper Confidence Bounds

UCB is an algorithm that defines an upper bound Ut(a)U_t(a) for each action value, which represents the uncertainty such that q(a)Qt(a)+Ut(a)q(a) \leq Q_t(a) + U_t(a). It is optimal in terms of least regret, and actions are chosen such that

at=arg maxaAQt(a)+Ut(a)a_t = \argmax_{a \in \mathcal{A}} Q_t(a) + U_t(a)

Hoeffding's Inequality states that for X1,...,XnX_1, ..., X_n i.i.d random vriables in [0,1] with mean μ=E[X]\mu = \mathbb{E}[X] and Xtˉ=1ni=1nXu\bar{X_t} = \frac{1}{n} \sum_{i=1}^n X_u be the sample mean, then:

p(Xnˉ+uμ)e2nu2p(\bar{X_n} + u \leq \mu) \leq e^{-2nu^2}

One can apply this to the rewards of a bandit:

p(Q(a)>Qt(a)^+Ut(a))e2Nt(a)Ut(a)2\begin{aligned} p(Q(a) > \hat{Q_t(a)} + U_t(a)) \leq e^{-2N_t(a)U_t(a)^2} \end{aligned}

So essentially we should pick a probability for pp such that it is equal to e2Nt(a)Ut(a)2e^{-2N_t(a)U_t(a)^2} and then solve for the uncertainty Ut(a)U_t(a).

e2Nt(a)Ut(a)2=p2Nt(a)Ut(a)2=logpUt(a)2=logp2Nt(a)Ut(a)=logp2Nt(a)\begin{aligned} e^{-2N_t(a)U_t(a)^2} &= p \\ 2N_t(a)U_t(a)^2 &= -\log p \\ U_t(a)^2 &= \frac{-\log p}{2N_t(a)} \\ U_t(a) &= \sqrt{ \frac{- \log p}{2N_t(a)} } \end{aligned}

Now we need to reduce p as we observe more rewards. UCB1 reduces p=t4p=t^{-4} which ensures we select the optimal action as tt \rightarrow \infty and achieves logarithmic asymptotic total regret

Ut(a)=logt42Nt(a)Ut(a)=4logt2Nt(a)Ut(a)=2logtNt(a)\begin{aligned} U_t(a) &= \sqrt{ \frac{- \log t^{-4}}{2N_t(a)} } \\ U_t(a) &= \sqrt{ \frac{4 \log t}{2N_t(a)} } \\ U_t(a) &= \sqrt{ \frac{2 \log t}{N_t(a)} } \end{aligned}