IMPORTANT AI DERIVATIONS

Deriving the most important formulas in AI from first principles.

Policy Gradient: where $\theta$ is the policy parameters, $\eta(s)$ is the unnormalized state visitation count counting the expected number of time steps spent in state $s$ per episode, $\mu(s)$ is the on-policy stationary distribution over states normalized into a probability distribution. Derivation from RL book.

\begin{aligned} \nabla v_\pi(s) &= \nabla[\sum_a \pi(a|s) q_{\pi}(s,a)] \\ &= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \nabla_{\pi}(s,a)] \\ &= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \nabla \sum_{s',r} p(s',r|s,a)(r+v_{\pi}(s')) ] \\ &= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \sum_{s',r} p(s',r|s,a)\nabla v_{\pi}(s') ] \\ &= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \sum_{s',r} p(s',r|s,a) \sum_{a'} [ \nabla \pi(a'|s') q_{\pi}(s',a') + \pi(a'|s') \sum_{s''} p(s''|s',a') \nabla v_{\pi}(s'') ]] \\ &= \sum_{x\in \mathcal{S}} \sum^\infty_{k=0} Pr(s \rightarrow x, k, \pi) \sum_a \nabla \pi(a|x) q_{\pi}(x,a) \\ \nabla J(\theta) &= v_{\pi}(s_0) \\ &= \sum_s (\sum_{k=0}^{\infty} Pr(s_0 \rightarrow s, k, \pi) ) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\ &= \sum_s \eta(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\ &= \sum_{s'} \eta(s') \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\ &= \sum_s \eta(s) \sum_s \mu(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\ &\propto \sum_s \mu(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \end{aligned}

Kullback-Leibler Divergence:

\begin{aligned} I(x) &= - \log p(x) \\ H(P) &= \mathbb{E}_{x~P} [-\log p(x) ] = - \sum_x p(s) \log p(x) \\ H(P,Q) &= - \sum_x p(s) \log q(x) \\ D_{KL} (P || Q) &= H(P,Q) - H(P) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x)\\ &= \sum_x p(x) \log \frac{p(x)}{q(x)} \end{aligned}

Variational Lower Bound (ELBO):

We know that from Bayes rule that

p(x) = \frac{p(x,z)}{p(z|x)}

If we assume that $p(x)$ is a constant, we can take an expectation and it won't change anything since an expectation of a constant is a constant. So lets take with expectation to q(z|x)

p(x) = \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} ]

p(x) = \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ]

\log p(x) = \log \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ]

Since $p(x)$ is a constant the log gets pushed in

\log p(x) = \mathbb{E}_{q(z|x)}[\log \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ]

First term is the ELBO $\mathcal{L}$ and the second is KL divergence

\log p(x) = \mathbb{E}_{q(z|x)}[\log p(x,z) - \log q(z|x)] + \mathbb{E}_{q(z|x)}[\log \frac{q(z|x)}{p(z|x)}]

\log p(x) = \mathcal{L} + KL(q(z|x) || p(z|x))

Equivalently

KL(q(z|x) || p(z|x)) = \log p(x) - \mathcal{L}

The KL divergence is always non-negative and $p(x)$ is constant. Beautifully, this means that maximizing $\mathcal{L}$ means $KL(q(z|x) || p(z|x))$ is minimized and pushed closer to the true posterior.

REINFORCE

Derivation from UCL x Deepmind 2021 slides:

\begin{aligned} \nabla_{\theta} &= \nabla_{\theta} \sum_a \pi_{\theta}(a) \mathbb{E}[R_t | A_t = a] \\ &= \nabla_{\theta} \sum_a \pi_{\theta}(a) q(a) \\ &= \sum_a q(a) \nabla_{\theta} \pi_{\theta}(a) \\ &= \sum_a q(a) \frac{ \pi_{\theta}(a)}{ \pi_{\theta}(a)} \nabla_{\theta} \pi_{\theta}(a) \\ &= \sum_a \pi_{\theta}(a) q(a) \frac{\nabla_{\theta} \pi_{\theta}(a)}{ \pi_{\theta}(a)} \\ &= \mathbb{E}[R_t \frac{\nabla_{\theta} \pi_{\theta}(A_t)}{ \pi_{\theta}(A_t)} ] \\ &= \mathbb{E}[R_t \nabla_{\theta} \log \pi_{\theta}(A_t) ] \end{aligned}

Backprop

This somes from "Learning representations by back-propagating errors"

Starting with energy function of $E = \frac{1}{2} \sum_c \sum_j (y_{c,j}-d_{j,c})^2$

Differentiating,

$\frac{\partial E}{\partial y_j} = y_j - d_j$

By chain rule

$\frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \frac{d y_j}{d x_j}$ $= \frac{\partial E}{\partial y_j} \cdot y_j (1-y_j)$

This expression tells us how a change in input $x_j$ will affect the error. With this we can compute the error affected by changing the network's weights

$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_j} \cdot \frac{\partial x_j}{\partial w_{ji}}$ $= \frac{\partial E}{\partial x_j} \cdot y_j$

You can now update the weights with

$\Delta w = - \epsilon \frac{\partial E}{\partial w}$

Upper Confidence Bounds

UCB is an algorithm that defines an upper bound $U_t(a)$ for each action value, which represents the uncertainty such that $q(a) \leq Q_t(a) + U_t(a)$ . It is optimal in terms of least regret, and actions are chosen such that

a_t = \argmax_{a \in \mathcal{A}} Q_t(a) + U_t(a)

Hoeffding's Inequality states that for $X_1, ..., X_n$ i.i.d random vriables in [0,1] with mean $\mu = \mathbb{E}[X]$ and $\bar{X_t} = \frac{1}{n} \sum_{i=1}^n X_u$ be the sample mean, then:

p(\bar{X_n} + u \leq \mu) \leq e^{-2nu^2}

One can apply this to the rewards of a bandit:

\begin{aligned} p(Q(a) > \hat{Q_t(a)} + U_t(a)) \leq e^{-2N_t(a)U_t(a)^2} \end{aligned}

So essentially we should pick a probability for $p$ such that it is equal to $e^{-2N_t(a)U_t(a)^2}$ and then solve for the uncertainty $U_t(a)$ .

\begin{aligned} e^{-2N_t(a)U_t(a)^2} &= p \\ 2N_t(a)U_t(a)^2 &= -\log p \\ U_t(a)^2 &= \frac{-\log p}{2N_t(a)} \\ U_t(a) &= \sqrt{ \frac{- \log p}{2N_t(a)} } \end{aligned}

Now we need to reduce p as we observe more rewards. UCB1 reduces $p=t^{-4}$ which ensures we select the optimal action as $t \rightarrow \infty$ and achieves logarithmic asymptotic total regret

\begin{aligned} U_t(a) &= \sqrt{ \frac{- \log t^{-4}}{2N_t(a)} } \\ U_t(a) &= \sqrt{ \frac{4 \log t}{2N_t(a)} } \\ U_t(a) &= \sqrt{ \frac{2 \log t}{N_t(a)} } \end{aligned}