Deriving the most important formulas in AI from first principles.
Policy Gradient : where θ \theta θ is the policy parameters, η ( s ) \eta(s) η ( s ) is the unnormalized state visitation count counting the expected number of time steps spent in state s s s per episode, μ ( s ) \mu(s) μ ( s ) is the on-policy stationary distribution over states normalized into a probability distribution. Derivation from RL book.
∇ v π ( s ) = ∇ [ ∑ a π ( a ∣ s ) q π ( s , a ) ] = ∑ a [ ∇ π ( a ∣ s ) q π ( s , a ) + π ( a ∣ s ) ∇ π ( s , a ) ] = ∑ a [ ∇ π ( a ∣ s ) q π ( s , a ) + π ( a ∣ s ) ∇ ∑ s ′ , r p ( s ′ , r ∣ s , a ) ( r + v π ( s ′ ) ) ] = ∑ a [ ∇ π ( a ∣ s ) q π ( s , a ) + π ( a ∣ s ) ∑ s ′ , r p ( s ′ , r ∣ s , a ) ∇ v π ( s ′ ) ] = ∑ a [ ∇ π ( a ∣ s ) q π ( s , a ) + π ( a ∣ s ) ∑ s ′ , r p ( s ′ , r ∣ s , a ) ∑ a ′ [ ∇ π ( a ′ ∣ s ′ ) q π ( s ′ , a ′ ) + π ( a ′ ∣ s ′ ) ∑ s ′ ′ p ( s ′ ′ ∣ s ′ , a ′ ) ∇ v π ( s ′ ′ ) ] ] = ∑ x ∈ S ∑ k = 0 ∞ P r ( s → x , k , π ) ∑ a ∇ π ( a ∣ x ) q π ( x , a ) ∇ J ( θ ) = v π ( s 0 ) = ∑ s ( ∑ k = 0 ∞ P r ( s 0 → s , k , π ) ) ∑ a ∇ π ( a ∣ s ) q π ( s , a ) = ∑ s η ( s ) ∑ a ∇ π ( a ∣ s ) q π ( s , a ) = ∑ s ′ η ( s ′ ) ∑ s η ( s ) ∑ s ′ η ( s ′ ) ∑ a ∇ π ( a ∣ s ) q π ( s , a ) = ∑ s η ( s ) ∑ s μ ( s ) ∑ a ∇ π ( a ∣ s ) q π ( s , a ) ∝ ∑ s μ ( s ) ∑ a ∇ π ( a ∣ s ) q π ( s , a ) \begin{aligned}
\nabla v_\pi(s) &= \nabla[\sum_a \pi(a|s) q_{\pi}(s,a)] \\
&= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \nabla_{\pi}(s,a)] \\
&= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \nabla \sum_{s',r} p(s',r|s,a)(r+v_{\pi}(s')) ] \\
&= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \sum_{s',r} p(s',r|s,a)\nabla v_{\pi}(s') ] \\
&= \sum_a [ \nabla \pi(a|s) q_{\pi}(s,a) + \pi(a|s) \sum_{s',r} p(s',r|s,a) \sum_{a'} [ \nabla \pi(a'|s') q_{\pi}(s',a') + \pi(a'|s') \sum_{s''} p(s''|s',a') \nabla v_{\pi}(s'') ]] \\
&= \sum_{x\in \mathcal{S}} \sum^\infty_{k=0} Pr(s \rightarrow x, k, \pi) \sum_a \nabla \pi(a|x) q_{\pi}(x,a) \\
\nabla J(\theta) &= v_{\pi}(s_0) \\
&= \sum_s (\sum_{k=0}^{\infty} Pr(s_0 \rightarrow s, k, \pi) ) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\
&= \sum_s \eta(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\
&= \sum_{s'} \eta(s') \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\
&= \sum_s \eta(s) \sum_s \mu(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a) \\
&\propto \sum_s \mu(s) \sum_a \nabla \pi(a|s) q_{\pi}(s,a)
\end{aligned} ∇ v π ( s ) ∇ J ( θ ) = ∇ [ a ∑ π ( a ∣ s ) q π ( s , a )] = a ∑ [ ∇ π ( a ∣ s ) q π ( s , a ) + π ( a ∣ s ) ∇ π ( s , a )] = a ∑ [ ∇ π ( a ∣ s ) q π ( s , a ) + π ( a ∣ s ) ∇ s ′ , r ∑ p ( s ′ , r ∣ s , a ) ( r + v π ( s ′ ))] = a ∑ [ ∇ π ( a ∣ s ) q π ( s , a ) + π ( a ∣ s ) s ′ , r ∑ p ( s ′ , r ∣ s , a ) ∇ v π ( s ′ )] = a ∑ [ ∇ π ( a ∣ s ) q π ( s , a ) + π ( a ∣ s ) s ′ , r ∑ p ( s ′ , r ∣ s , a ) a ′ ∑ [ ∇ π ( a ′ ∣ s ′ ) q π ( s ′ , a ′ ) + π ( a ′ ∣ s ′ ) s ′′ ∑ p ( s ′′ ∣ s ′ , a ′ ) ∇ v π ( s ′′ )]] = x ∈ S ∑ k = 0 ∑ ∞ P r ( s → x , k , π ) a ∑ ∇ π ( a ∣ x ) q π ( x , a ) = v π ( s 0 ) = s ∑ ( k = 0 ∑ ∞ P r ( s 0 → s , k , π )) a ∑ ∇ π ( a ∣ s ) q π ( s , a ) = s ∑ η ( s ) a ∑ ∇ π ( a ∣ s ) q π ( s , a ) = s ′ ∑ η ( s ′ ) s ∑ ∑ s ′ η ( s ′ ) η ( s ) a ∑ ∇ π ( a ∣ s ) q π ( s , a ) = s ∑ η ( s ) s ∑ μ ( s ) a ∑ ∇ π ( a ∣ s ) q π ( s , a ) ∝ s ∑ μ ( s ) a ∑ ∇ π ( a ∣ s ) q π ( s , a )
Kullback-Leibler Divergence :
I ( x ) = − log p ( x ) H ( P ) = E x P [ − log p ( x ) ] = − ∑ x p ( s ) log p ( x ) H ( P , Q ) = − ∑ x p ( s ) log q ( x ) D K L ( P ∣ ∣ Q ) = H ( P , Q ) − H ( P ) = − ∑ x p ( x ) log q ( x ) + ∑ x p ( x ) log p ( x ) = ∑ x p ( x ) log p ( x ) q ( x ) \begin{aligned}
I(x) &= - \log p(x) \\
H(P) &= \mathbb{E}_{x~P} [-\log p(x) ] = - \sum_x p(s) \log p(x) \\
H(P,Q) &= - \sum_x p(s) \log q(x) \\
D_{KL} (P || Q) &= H(P,Q) - H(P) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x)\\
&= \sum_x p(x) \log \frac{p(x)}{q(x)}
\end{aligned} I ( x ) H ( P ) H ( P , Q ) D K L ( P ∣∣ Q ) = − log p ( x ) = E x P [ − log p ( x )] = − x ∑ p ( s ) log p ( x ) = − x ∑ p ( s ) log q ( x ) = H ( P , Q ) − H ( P ) = − x ∑ p ( x ) log q ( x ) + x ∑ p ( x ) log p ( x ) = x ∑ p ( x ) log q ( x ) p ( x )
Variational Lower Bound (ELBO) :
We know that from Bayes rule that
p ( x ) = p ( x , z ) p ( z ∣ x ) p(x) = \frac{p(x,z)}{p(z|x)} p ( x ) = p ( z ∣ x ) p ( x , z )
If we assume that p ( x ) p(x) p ( x ) is a constant, we can take an expectation and it won't change anything since an expectation of a constant is a constant. So lets take with expectation to q(z|x)
p ( x ) = E q ( z ∣ x ) [ p ( x , z ) p ( z ∣ x ) ] p(x) = \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} ] p ( x ) = E q ( z ∣ x ) [ p ( z ∣ x ) p ( x , z ) ]
p ( x ) = E q ( z ∣ x ) [ p ( x , z ) p ( z ∣ x ) q ( z ∣ x ) q ( z ∣ x ) ] p(x) = \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ] p ( x ) = E q ( z ∣ x ) [ p ( z ∣ x ) p ( x , z ) q ( z ∣ x ) q ( z ∣ x ) ]
log p ( x ) = log E q ( z ∣ x ) [ p ( x , z ) p ( z ∣ x ) q ( z ∣ x ) q ( z ∣ x ) ] \log p(x) = \log \mathbb{E}_{q(z|x)}[ \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ] log p ( x ) = log E q ( z ∣ x ) [ p ( z ∣ x ) p ( x , z ) q ( z ∣ x ) q ( z ∣ x ) ]
Since p ( x ) p(x) p ( x ) is a constant the log gets pushed in
log p ( x ) = E q ( z ∣ x ) [ log p ( x , z ) p ( z ∣ x ) q ( z ∣ x ) q ( z ∣ x ) ] \log p(x) = \mathbb{E}_{q(z|x)}[\log \frac{p(x,z)}{p(z|x)} \frac{q(z|x)}{q(z|x)} ] log p ( x ) = E q ( z ∣ x ) [ log p ( z ∣ x ) p ( x , z ) q ( z ∣ x ) q ( z ∣ x ) ]
First term is the ELBO L \mathcal{L} L and the second is KL divergence
log p ( x ) = E q ( z ∣ x ) [ log p ( x , z ) − log q ( z ∣ x ) ] + E q ( z ∣ x ) [ log q ( z ∣ x ) p ( z ∣ x ) ] \log p(x) = \mathbb{E}_{q(z|x)}[\log p(x,z) - \log q(z|x)] + \mathbb{E}_{q(z|x)}[\log \frac{q(z|x)}{p(z|x)}] log p ( x ) = E q ( z ∣ x ) [ log p ( x , z ) − log q ( z ∣ x )] + E q ( z ∣ x ) [ log p ( z ∣ x ) q ( z ∣ x ) ]
log p ( x ) = L + K L ( q ( z ∣ x ) ∣ ∣ p ( z ∣ x ) ) \log p(x) = \mathcal{L} + KL(q(z|x) || p(z|x)) log p ( x ) = L + K L ( q ( z ∣ x ) ∣∣ p ( z ∣ x ))
Equivalently
K L ( q ( z ∣ x ) ∣ ∣ p ( z ∣ x ) ) = log p ( x ) − L KL(q(z|x) || p(z|x)) = \log p(x) - \mathcal{L} K L ( q ( z ∣ x ) ∣∣ p ( z ∣ x )) = log p ( x ) − L
The KL divergence is always non-negative and p ( x ) p(x) p ( x ) is constant. Beautifully, this means that maximizing L \mathcal{L} L means K L ( q ( z ∣ x ) ∣ ∣ p ( z ∣ x ) ) KL(q(z|x) || p(z|x)) K L ( q ( z ∣ x ) ∣∣ p ( z ∣ x )) is minimized and pushed closer to the true posterior.
REINFORCE
Derivation from UCL x Deepmind 2021 slides:
∇ θ = ∇ θ ∑ a π θ ( a ) E [ R t ∣ A t = a ] = ∇ θ ∑ a π θ ( a ) q ( a ) = ∑ a q ( a ) ∇ θ π θ ( a ) = ∑ a q ( a ) π θ ( a ) π θ ( a ) ∇ θ π θ ( a ) = ∑ a π θ ( a ) q ( a ) ∇ θ π θ ( a ) π θ ( a ) = E [ R t ∇ θ π θ ( A t ) π θ ( A t ) ] = E [ R t ∇ θ log π θ ( A t ) ] \begin{aligned}
\nabla_{\theta} &= \nabla_{\theta} \sum_a \pi_{\theta}(a) \mathbb{E}[R_t | A_t = a] \\
&= \nabla_{\theta} \sum_a \pi_{\theta}(a) q(a) \\
&= \sum_a q(a) \nabla_{\theta} \pi_{\theta}(a) \\
&= \sum_a q(a) \frac{ \pi_{\theta}(a)}{ \pi_{\theta}(a)} \nabla_{\theta} \pi_{\theta}(a) \\
&= \sum_a \pi_{\theta}(a) q(a) \frac{\nabla_{\theta} \pi_{\theta}(a)}{ \pi_{\theta}(a)} \\
&= \mathbb{E}[R_t \frac{\nabla_{\theta} \pi_{\theta}(A_t)}{ \pi_{\theta}(A_t)} ] \\
&= \mathbb{E}[R_t \nabla_{\theta} \log \pi_{\theta}(A_t) ]
\end{aligned} ∇ θ = ∇ θ a ∑ π θ ( a ) E [ R t ∣ A t = a ] = ∇ θ a ∑ π θ ( a ) q ( a ) = a ∑ q ( a ) ∇ θ π θ ( a ) = a ∑ q ( a ) π θ ( a ) π θ ( a ) ∇ θ π θ ( a ) = a ∑ π θ ( a ) q ( a ) π θ ( a ) ∇ θ π θ ( a ) = E [ R t π θ ( A t ) ∇ θ π θ ( A t ) ] = E [ R t ∇ θ log π θ ( A t )]
Backprop
This somes from "Learning representations by back-propagating errors"
Starting with energy function of E = 1 2 ∑ c ∑ j ( y c , j − d j , c ) 2 E = \frac{1}{2} \sum_c \sum_j (y_{c,j}-d_{j,c})^2 E = 2 1 ∑ c ∑ j ( y c , j − d j , c ) 2
Differentiating,
∂ E ∂ y j = y j − d j \frac{\partial E}{\partial y_j} = y_j - d_j ∂ y j ∂ E = y j − d j
By chain rule
∂ E ∂ x j = ∂ E ∂ y j d y j d x j \frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \frac{d y_j}{d x_j} ∂ x j ∂ E = ∂ y j ∂ E d x j d y j
= ∂ E ∂ y j ⋅ y j ( 1 − y j ) = \frac{\partial E}{\partial y_j} \cdot y_j (1-y_j) = ∂ y j ∂ E ⋅ y j ( 1 − y j )
This expression tells us how a change in input x j x_j x j will affect the error. With this we can compute the error affected by changing the network's weights
∂ E ∂ w j i = ∂ E ∂ x j ⋅ ∂ x j ∂ w j i \frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_j} \cdot \frac{\partial x_j}{\partial w_{ji}} ∂ w ji ∂ E = ∂ x j ∂ E ⋅ ∂ w ji ∂ x j
= ∂ E ∂ x j ⋅ y j = \frac{\partial E}{\partial x_j} \cdot y_j = ∂ x j ∂ E ⋅ y j
You can now update the weights with
Δ w = − ϵ ∂ E ∂ w \Delta w = - \epsilon \frac{\partial E}{\partial w} Δ w = − ϵ ∂ w ∂ E
Upper Confidence Bounds
UCB is an algorithm that defines an upper bound U t ( a ) U_t(a) U t ( a ) for each action value, which represents the uncertainty such that q ( a ) ≤ Q t ( a ) + U t ( a ) q(a) \leq Q_t(a) + U_t(a) q ( a ) ≤ Q t ( a ) + U t ( a ) . It is optimal in terms of least regret, and actions are chosen such that
a t = arg max a ∈ A Q t ( a ) + U t ( a ) a_t = \argmax_{a \in \mathcal{A}} Q_t(a) + U_t(a) a t = a ∈ A arg max Q t ( a ) + U t ( a )
Hoeffding's Inequality states that for X 1 , . . . , X n X_1, ..., X_n X 1 , ... , X n i.i.d random vriables in [0,1] with mean μ = E [ X ] \mu = \mathbb{E}[X] μ = E [ X ] and X t ˉ = 1 n ∑ i = 1 n X u \bar{X_t} = \frac{1}{n} \sum_{i=1}^n X_u X t ˉ = n 1 ∑ i = 1 n X u be the sample mean, then:
p ( X n ˉ + u ≤ μ ) ≤ e − 2 n u 2 p(\bar{X_n} + u \leq \mu) \leq e^{-2nu^2} p ( X n ˉ + u ≤ μ ) ≤ e − 2 n u 2
One can apply this to the rewards of a bandit:
p ( Q ( a ) > Q t ( a ) ^ + U t ( a ) ) ≤ e − 2 N t ( a ) U t ( a ) 2 \begin{aligned}
p(Q(a) > \hat{Q_t(a)} + U_t(a)) \leq e^{-2N_t(a)U_t(a)^2}
\end{aligned} p ( Q ( a ) > Q t ( a ) ^ + U t ( a )) ≤ e − 2 N t ( a ) U t ( a ) 2
So essentially we should pick a probability for p p p such that it is equal to e − 2 N t ( a ) U t ( a ) 2 e^{-2N_t(a)U_t(a)^2} e − 2 N t ( a ) U t ( a ) 2 and then solve for the uncertainty U t ( a ) U_t(a) U t ( a ) .
e − 2 N t ( a ) U t ( a ) 2 = p 2 N t ( a ) U t ( a ) 2 = − log p U t ( a ) 2 = − log p 2 N t ( a ) U t ( a ) = − log p 2 N t ( a ) \begin{aligned}
e^{-2N_t(a)U_t(a)^2} &= p \\
2N_t(a)U_t(a)^2 &= -\log p \\
U_t(a)^2 &= \frac{-\log p}{2N_t(a)} \\
U_t(a) &= \sqrt{ \frac{- \log p}{2N_t(a)} }
\end{aligned} e − 2 N t ( a ) U t ( a ) 2 2 N t ( a ) U t ( a ) 2 U t ( a ) 2 U t ( a ) = p = − log p = 2 N t ( a ) − log p = 2 N t ( a ) − log p
Now we need to reduce p as we observe more rewards. UCB1 reduces p = t − 4 p=t^{-4} p = t − 4 which ensures we select the optimal action as t → ∞ t \rightarrow \infty t → ∞ and achieves logarithmic asymptotic total regret
U t ( a ) = − log t − 4 2 N t ( a ) U t ( a ) = 4 log t 2 N t ( a ) U t ( a ) = 2 log t N t ( a ) \begin{aligned}
U_t(a) &= \sqrt{ \frac{- \log t^{-4}}{2N_t(a)} } \\
U_t(a) &= \sqrt{ \frac{4 \log t}{2N_t(a)} } \\
U_t(a) &= \sqrt{ \frac{2 \log t}{N_t(a)} }
\end{aligned} U t ( a ) U t ( a ) U t ( a ) = 2 N t ( a ) − log t − 4 = 2 N t ( a ) 4 log t = N t ( a ) 2 log t