The Statistics of Stochastic Gradients

analysis · Published March 12, 2023

My collaborators Anna Golubeva and Michael Buchhold and I wanted to characterize, as completely as possible, the statistics of the noise in stochastic gradient descent (Robbins and Monro 1951). Our main objective was to obtain the rule by which the probability distribution over network parameters \(P_t(w)\) evolves, given that the parameters \(w^i_t\) follow the update equation:

\[ w^i_{t+1} = w^i_{t}- \eta g^{i}_{t\,\alpha}\rho^{\alpha}_t \] where \[ \rho^{\alpha}_t = \frac{1}{b_t}I^{\alpha}_t, \] with \(b_t = \vert B_t\vert\) the size of the minibatch and \[ I^{\alpha}_t = \begin{cases} 1 \,\textrm{ if } \alpha\in B_t \\ 0\,\textrm{ otherwise}. \end{cases} \] We assume that the examples are sampled without replacement from the dataset, and therefore \(\rho^{\alpha}\) is drawn from the hypergeometric distribution.

(Were we to sample with replacement, \(\rho^{\alpha}\) would instead be drawn from the binomial distribution.)
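As a quick sanity check of these distributional claims, here is a minimal Monte Carlo sketch (the values of \(N\), \(b\), and the number of trials are arbitrary illustrative choices, not values from the text). It samples minibatches without replacement and confirms the hypergeometric mean \(\langle \rho^{\alpha}\rangle = \frac{1}{b}\cdot\frac{b}{N} = \frac{1}{N}\):

```python
# A minimal Monte Carlo sanity check (a sketch: N, b, and the number of
# trials are arbitrary illustrative choices).
import numpy as np

rng = np.random.default_rng(0)
N, b, trials = 20, 5, 200_000

rho_mean = np.zeros(N)
for _ in range(trials):
    batch = rng.choice(N, size=b, replace=False)  # minibatch B_t, no replacement
    I = np.zeros(N)
    I[batch] = 1.0                                # indicator I_alpha
    rho_mean += I / b                             # rho_alpha = I_alpha / b
rho_mean /= trials

# Hypergeometric mean: <I_alpha> = b/N, hence <rho_alpha> = 1/N.
print(np.abs(rho_mean - 1 / N).max())             # ~0 up to Monte Carlo error
```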

We picked up this useful way of splitting the stochastic gradient into a weight-dependent piece \(g\) and a weight-independent piece \(\rho\) from (Wu et al. 2020).

The only reason there is any noise at all in this process is the random selection and shuffling of minibatches, which induces a noise distribution on the space of weights. We want to understand what kind of noise that is. So, before deriving the dynamical equation for \(P_t(w)\), let us write down an expression for the moment generating function of \(\rho^{\alpha}\), and then see how this translates into a generating function for the stochastic gradient noise.

The subtlety here lies in the fact that \[ I^n_{\alpha} =I_{\alpha},\,\,n>0 \] which in turn means that \[ \rho^{n}_{\alpha} =\frac{1}{b^n}I^n_{\alpha} = \frac{1}{b^{n-1}}\left(\frac{1}{b}I_{\alpha}\right)= \frac{1}{b^{n-1}}\rho_{\alpha},\,\,n>0. \]

So the following relations hold: \[ \langle \rho^m_{\alpha}\rangle = \frac{1}{b^{m-1}}\langle\rho_{\alpha} \rangle. \]

\[ \langle\rho^{m}_{\alpha_1}\rho_{\alpha_2}\cdots \rho_{\alpha_k} \rangle = \frac{1}{b^{m-1}}\langle\rho_{\alpha_1}\rho_{\alpha_2}\cdots\rho_{\alpha_k} \rangle \]

where \(\alpha_i\) are all distinct.

Now, the most general correlation function can be written as \[ \langle \rho^{m_1}_{\alpha_1}\rho^{m_2}_{\alpha_2}\cdots \rho^{m_k}_{\alpha_k}\rangle =\frac{1}{b^{(\sum^k_{j=1}m_j)-k}}\langle \rho_{\alpha_1}\rho_{\alpha_2}\cdots \rho_{\alpha_k}\rangle \] where all \(\alpha_i\) are distinct.
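Since these reduction identities drive everything that follows, a small numerical check doesn't hurt. The sketch below (arbitrary parameters) verifies \(\langle \rho^{m}_{\alpha_1}\rho_{\alpha_2}\rangle = b^{-(m-1)}\langle\rho_{\alpha_1}\rho_{\alpha_2}\rangle\) by direct sampling:

```python
# Monte Carlo check of the power-reduction identity
# <rho_1^m rho_2> = b^{-(m-1)} <rho_1 rho_2>   (a sketch; parameters arbitrary).
import numpy as np

rng = np.random.default_rng(1)
N, b, m, trials = 12, 4, 3, 500_000

acc_pow, acc_lin = 0.0, 0.0
for _ in range(trials):
    I = np.zeros(N)
    I[rng.choice(N, size=b, replace=False)] = 1.0
    rho = I / b
    acc_pow += rho[0] ** m * rho[1]   # estimates <rho_1^m rho_2>
    acc_lin += rho[0] * rho[1]        # estimates <rho_1 rho_2>

print(acc_pow / trials)                   # direct estimate
print(acc_lin / trials / b ** (m - 1))    # reduced estimate; should agree
```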

A useful object to introduce at this point is the generalized Kronecker delta, a gadget we picked up from (Ondo 2017): \[ \delta^{\mu_1\cdots \mu_p}_{\nu_1\cdots \nu_p} \equiv \delta^{\mu_1}_{[\nu_1}\cdots \delta^{\mu_p}_{\nu_p]}. \] It has the property that \[ \delta_{\nu_{1} \ldots \nu_{p}}^{\mu_{1} \ldots \mu_{p}}=\left\{\begin{array}{cl} +1 & \text { if } \nu_{1} \ldots \nu_{p} \text { are distinct integers and are an even permutation of } \mu_{1} \ldots \mu_{p} \\ -1 & \text { if } \nu_{1} \ldots \nu_{p} \text { are distinct integers and are an odd permutation of } \mu_{1} \ldots \mu_{p} \\ 0 & \text { in all other cases. } \end{array}\right. \] If we take the absolute value of this object, we get \[ \vert\delta_{\nu_{1} \ldots \nu_{p}}^{\mu_{1} \ldots \mu_{p}}\vert \equiv \vert\delta\vert^{\mu_1\cdots \mu_p}_{\nu_1\cdots \nu_p}=\left\{\begin{array}{cl} +1 & \text { if } \nu_{1} \ldots \nu_{p} \text { are distinct integers and are some permutation of } \mu_{1} \ldots \mu_{p} \\ 0 & \text { in all other cases, } \end{array}\right. \] which has the property that \[ \langle\rho_{\mu_1}\cdots \rho_{\mu_p} \rangle \vert \delta \vert^{\mu_1\cdots\mu_p}_{\nu_1\cdots \nu_p} = \begin{cases}\langle \rho_{\nu_1}\cdots \rho_{\nu_p}\rangle \text{ if all } \nu_i \text{ are distinct} \\ 0\text{ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,otherwise.}\end{cases} \]

In the generating function for connected correlation functions, we can therefore write the contribution of the correlation functions of \(\rho\) with all distinct indices as: \[ W_{distinct}[j]=\langle \rho_{\alpha}\rangle j^{\alpha} + \frac{1}{2}\vert\delta\vert^{\alpha_1\alpha_2}_{\mu_1 \mu_2}\langle \rho_{\alpha_1}\rho_{\alpha_2}\rangle j^{\mu_1}j^{\mu_2}+\cdots +\frac{1}{k!}\vert \delta\vert^{\alpha_1\cdots \alpha_k}_{\mu_1\cdots\mu_k}\langle \rho_{\alpha_1}\cdots \rho_{\alpha_k}\rangle j^{\mu_1}\cdots j^{\mu_k}+\cdots \]

This expansion does not capture the contributions to the lower-point correlators coming from the higher-point ones. For example, there is a contribution to the two-point correlator coming from the case where the two indices are the same: \[ \langle \rho_{\alpha}\rho_{\alpha}\rangle = \langle \rho^2_{\alpha} \rangle = \frac{1}{b}\langle \rho_{\alpha}\rangle. \] In fact, the mean receives contributions from every higher-order correlator: \[ \langle \rho_{\alpha}\underbrace{\cdots}_{k-2}\rho_{\alpha} \rangle = \langle\rho^k_{\alpha} \rangle = \frac{1}{b^{k-1}}\langle \rho_{\alpha}\rangle. \] In the generating function, all the terms proportional to \(\langle \rho_{\alpha}\rangle\) therefore take the form: \[ \langle \rho_{\alpha} \rangle \left(j^{\alpha}+\sum^{\infty}_{k=2}\frac{1}{k!b^{k-1}}(j^{\alpha})^k\right) = \langle \rho_{\alpha}\rangle\left(\sum^{\infty}_{k=1}\frac{1}{k!b^{k-1}}(j^{\alpha})^k\right). \]
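This one-index resummation can be checked symbolically: for a single index, \(\rho = I/b\) with \(P(I=1)=b/N\), so the exact moment series \(\langle e^{j\rho}\rangle - 1\) must equal \(\langle\rho\rangle \sum_{k\geq 1} j^k/(k!\,b^{k-1})\). A sketch using sympy:

```python
# Symbolic check of the single-index resummation (a sketch using sympy):
# <exp(j rho)> - 1  should equal  <rho> * sum_{k>=1} j^k / (k! b^{k-1}).
import sympy as sp

j, b, N = sp.symbols('j b N', positive=True)
k = sp.Symbol('k', integer=True, positive=True)

mean_rho = 1 / N  # <rho_alpha> = (1/b)(b/N)
series = mean_rho * b * sp.summation((j / b) ** k / sp.factorial(k), (k, 1, sp.oo))
mgf = 1 - b / N + (b / N) * sp.exp(j / b)  # exact <exp(j rho)>

print(sp.simplify(series - (mgf - 1)))  # expect 0
```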
Repeating this exercise for the two-point function gives us: \[ \frac{1}{2!}|\delta|^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho_{\mu_1}\rho_{\mu_2}\rangle j^{\alpha_1}j^{\alpha_2}+ \frac{1}{2!}\sum^{\infty}_{k_2=2}\frac{1}{(k_2+1)!}\vert \delta \vert ^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho_{\mu_1}\rho^{k_2}_{\mu_2}\rangle j^{\alpha_1}(j^{\alpha_2})^{k_2}+ \frac{1}{2!}\sum^{\infty}_{k_1=2}\frac{1}{(k_1+1)!}\vert \delta \vert ^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho^{k_1}_{\mu_1}\rho_{\mu_2}\rangle (j^{\alpha_1})^{k_1}j^{\alpha_2}+ \] \[ +\frac{1}{2!}\sum^{\infty}_{k_1=2}\sum^{\infty}_{k_2=2}|\delta|^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho^{k_1}_{\mu_1}\rho^{k_2}_{\mu_2}\rangle \frac{(j^{\alpha_1})^{k_1}}{(k_1+1)!}\frac{(j^{\alpha_2})^{k_2}}{(k_2+1)!} = \] \[ =\frac{1}{2!}|\delta|^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho_{\mu_1}\rho_{\mu_2}\rangle j^{\alpha_1}j^{\alpha_2}+ \frac{1}{2!}\sum^{\infty}_{k_2=2}\frac{1}{(k_2+1)!}\vert \delta \vert ^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho_{\mu_1}\rho_{\mu_2}\rangle \frac{1}{b^{k_2-1}}j^{\alpha_1}(j^{\alpha_2})^{k_2}+ \frac{1}{2!}\sum^{\infty}_{k_1=2}\frac{1}{(k_1+1)!}\vert \delta \vert ^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho_{\mu_1}\rho_{\mu_2}\rangle \frac{1}{b^{k_1-1}}(j^{\alpha_1})^{k_1}j^{\alpha_2}+ \] \[ +\frac{1}{2!}\sum^{\infty}_{k_1=2}\sum^{\infty}_{k_2=2}|\delta|^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho_{\mu_1}\rho_{\mu_2}\rangle \frac{1}{b^{k_1+k_2-2}}\frac{(j^{\alpha_1})^{k_1}}{(k_1+1)!}\frac{(j^{\alpha_2})^{k_2}}{(k_2+1)!} \]

\[ = \frac{1}{2!} \vert \delta\vert^{\mu_1\mu_2}_{\alpha_1\alpha_2}\langle \rho_{\mu_1}\rho_{\mu_2}\rangle\bigg(j^{\alpha_1}+b\sum^{\infty}_{k_1=2}\frac{1}{(k_1+1)!}\left(\frac{j^{\alpha_1}}{b}\right)^{k_1}\bigg)\bigg(j^{\alpha_2}+b\sum^{\infty}_{k_2=2}\frac{1}{(k_2+1)!}\left(\frac{j^{\alpha_2}}{b}\right)^{k_2}\bigg) \]

We see a pattern here. For the one-point correlator (which I will denote \(c=1\)), i.e. the mean, the source was modified as \[ j^{\alpha}\rightarrow j^{\alpha}+ b\sum^{\infty}_{k=2}\frac{1}{k!}\left(\frac{j^{\alpha}}{b}\right)^k. \] Then for \(c=2\): \[ j^{\alpha}\rightarrow j^{\alpha}+ b\sum^{\infty}_{k=2}\frac{1}{(k+1)!}\left(\frac{j^{\alpha}}{b}\right)^k. \] The shift in the factorial comes about because contributions to the one-point function first appear at the two-point level, which gives us a term with \(1/2!\), whereas for the two-point function the first nontrivial descent is from the three-point function, so the first contributing term carries a coefficient of \(1/3!\). Proceeding in this fashion, we find that in the order-\(c\) correlator the source is replaced as \[ j^{\alpha}\rightarrow j^{\alpha}+ b \sum^{\infty}_{k=2}\frac{1}{(k+(c-1))!}\left(\frac{j^{\alpha}}{b}\right)^{k}. \] Therefore the full generating function, taking into account the descent from higher to lower correlators that follows from the fact that the indicator raised to any power equals itself, is given by \[ W[j]=\sum^{\infty}_{c=1}\frac{1}{c!}\vert\delta\vert^{\mu_1\cdots \mu_c}_{\alpha_1\cdots \alpha_c}\langle\rho_{\mu_1}\cdots \rho_{\mu_c}\rangle \times \] \[ \left(j^{\alpha_1}+b\sum^{\infty}_{k_1=2}\frac{1}{(k_1+(c-1))!}\left(\frac{j^{\alpha_1}}{b}\right)^{k_1}\right) \] \[ \cdots \left(j^{\alpha_c}+b\sum^{\infty}_{k_c=2}\frac{1}{(k_c+(c-1))!}\left(\frac{j^{\alpha_c}}{b}\right)^{k_c}\right) \]

All that is left to do is to write down the correlators \(\langle\rho_{\mu_1}\cdots \rho_{\mu_c}\rangle\) themselves.

For all distinct \(\alpha_i\), the correlator \(\langle \rho_{\alpha_1}\cdots \rho_{\alpha_c} \rangle\) is given by \[ \langle \rho_{\alpha_1}\cdots \rho_{\alpha_c}\rangle =\frac{1}{b^c}\frac{\binom{N-c}{b-c}}{\binom{N}{b}} \mathbf{1}_{\alpha_1}\cdots \mathbf{1}_{\alpha_c}, \] since the probability that \(c\) given examples all land in the minibatch is \(\binom{N-c}{b-c}/\binom{N}{b}\). So: \[ W[j]= \sum^{\infty}_{c=1}\frac{1}{c!}\frac{1}{b^c}\frac{\binom{N-c}{b-c}}{\binom{N}{b}}\sum^N_{\alpha_1=1}\left(j^{\alpha_1}+b\sum^{\infty}_{k_1=2}\frac{1}{(k_1+(c-1))!}\left(\frac{j^{\alpha_1}}{b}\right)^{k_1}\right)\cdots \] \[\cdots \sum^N_{\alpha_c=1}\left(j^{\alpha_c}+b\sum^{\infty}_{k_c=2}\frac{1}{(k_c+(c-1))!}\left(\frac{j^{\alpha_c}}{b}\right)^{k_c}\right) \]
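The closed form for the distinct-index correlator is easy to test numerically. A minimal sketch (arbitrary \(N\), \(b\), and \(c\)) that estimates a three-point correlator by sampling and compares it with \(\frac{1}{b^c}\binom{N-c}{b-c}/\binom{N}{b}\):

```python
# Monte Carlo check of the distinct-index correlator
# <rho_1 rho_2 rho_3> = (1/b^c) C(N-c, b-c) / C(N, b) for c = 3
# (a sketch; parameters arbitrary).
import math
import numpy as np

rng = np.random.default_rng(2)
N, b, c, trials = 15, 6, 3, 400_000

acc = 0.0
for _ in range(trials):
    I = np.zeros(N)
    I[rng.choice(N, size=b, replace=False)] = 1.0
    rho = I / b
    acc += rho[0] * rho[1] * rho[2]   # three distinct indices

exact = math.comb(N - c, b - c) / math.comb(N, b) / b**c
print(acc / trials, exact)            # should agree to Monte Carlo accuracy
```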

Now, what we really want to track are the correlation functions of the stochastic gradients themselves. We can accomplish this by injecting the per-example gradient matrix through a delta function:

\[ Z_{\textrm{grad-noise}}[g\cdot J]=\int \mathcal{D}\sigma\,\delta^N(\sigma^{\alpha}-g^{\alpha}_i J^i)e^{-W[\sigma]}, \] and then exponentiating the delta function with the help of an auxiliary field \(\pi\): \[ =\int \mathcal{D}\sigma \mathcal{D}\pi\,e^{i\pi_{\alpha}(\sigma^{\alpha}-g^\alpha_iJ^i)}e^{-W[\sigma]}. \] Integrating out \(\pi\) reproduces the delta function, and performing the \(\sigma\) integral then yields \(e^{-W[g\cdot J]}\): the correlators of the stochastic gradient noise are those of \(\rho\), with the sources contracted against the per-example gradients.

Kramers-Moyal expansion

The derivation in this section very closely follows the ones in (Leen and Moody 1992) and (Orr and Leen 1992). The time evolution of \(P_t(w)\) is described by the Kolmogorov equation \[ P_{t+1}(w) = \int d^D w'~T(w'\rightarrow w)~P_t(w')\,, \] where \(T(w'\rightarrow w)\) is the transition probability density. For a given \(\rho\), it is simply a Dirac delta function enforcing the weight update equation: \[ T(w'\rightarrow w~\vert~\rho) = \delta^D\left(w - w' + \eta g^\alpha \rho_\alpha \right). \]
Note that the gradient \(g^\alpha\) is evaluated at \(w'\). The total transition probability is obtained from this conditional transition probability by averaging over \(\rho\): \[ T(w'\rightarrow w) = \int d^{N}\rho\, P(\rho)\, T(w'\rightarrow w \vert \rho) \] \[ \equiv \bigg\langle \delta^D\left( w - w' + \eta g^\alpha \rho_\alpha \right) \bigg\rangle_{P(\rho)}\,. \] Next, we expand this expression as a power series in \(\eta\) in order to obtain an expression in terms of the moments of \(\rho\):

\[ \bigg\langle \delta^D\left(w - w' + \eta g^\alpha \rho_\alpha \right)\bigg\rangle_{P(\rho)} = \] \[ = \delta^D(w - w') + \langle \eta g_i^\alpha \rho_\alpha \rangle_{P(\rho)} \frac{\partial}{\partial w'_i}~\delta^D(w - w') \] \[ + \frac{1}{2!}\left\langle \left(\eta g_{i_1}^{\alpha_1}\rho_{\alpha_1}\right) \left(\eta g_{i_2}^{\alpha_2}\rho_{\alpha_2}\right) \right\rangle_{P(\rho)} \frac{\partial^2}{\partial w'_{i_1}\partial w'_{i_2}}\delta^D(w - w') + \cdots \] \[ + \frac{1}{k!}\bigg\langle \left(\eta g_{i_1}^{\alpha_1}\rho_{\alpha_1}\right) \cdots \left(\eta g_{i_k}^{\alpha_k}\rho_{\alpha_k}\right) \bigg\rangle_{P(\rho)}~\frac{\partial^k}{\partial w'_{i_1}\cdots\partial w'_{i_k}}~\delta^D(w - w') + \cdots \]

We shift the derivatives from \(w'\) (the weights at the previous time step) to \(w\) (the weights at the current time step) by integrating by parts, using the relation \[ \int dx~f(x)~\partial_x \delta(x-y) = - \int dx~\partial_x f(x)~\delta(x-y), \] which costs a factor of \((-1)\) per derivative.

Plugging this into our previous expression, we obtain: \[ P_{t+1}(w_{t+1})-P_{t}(w_{t+1}) = \] \[\sum^{\infty }_{c=1}\frac{(-1)^c\eta^c}{c!} \frac{\partial^c}{\partial w^{i_1}_{t+1}\cdots \partial w^{i_c}_{t+1}} \bigg(\bigg\langle\left( g_{i_1,t+1}^{\alpha_1} \rho_{\alpha_1}\right)\cdots\left( g_{i_c,t+1}^{\alpha_c} \rho_{\alpha_c}\right) \bigg\rangle_{P(\rho)} P_t(w_{t+1})\bigg). \]

Let us now drop the time-step index and reduce this to a sum over distinct correlators. \[ \left(g^{\alpha_1}_{j_1}g^{\alpha_2}_{j_2}\ldots g^{\alpha_c}_{j_c}\right)\langle \rho_{\alpha_1}\cdots \rho_{\alpha_c}\rangle = \frac{1}{b^{c-1}}\left(g^{\alpha}_{j_1}\cdots g^{\alpha}_{j_c}\right)\langle \rho_{\alpha} \rangle+ \] \[ +\frac{\binom{c}{c-1}}{b^{c-2}}\left(g^{\gamma}_{(j_1}g^{\delta}_{j_2}\cdots g^{\delta}_{j_c)}\right)\vert \delta \vert^{\alpha\beta}_{\gamma\delta} \langle\rho_{\alpha}\rho_{\beta} \rangle + \] \[ +\cdots +\frac{\binom{c}{c-M}}{b^{c-M}}\left(g^{\gamma_1}_{(j_1}g^{\gamma_2}_{j_2}\cdots g^{\gamma_{M}}_{j_M}\cdots g^{\gamma_M}_{j_c)}\right)\vert\delta \vert^{\alpha_1\cdots \alpha_M}_{\gamma_1\cdots \gamma_M}\langle \rho_{\alpha_1}\cdots \rho_{\alpha_M}\rangle+\cdots \] We can compactly write this as: \[ \left(g^{\alpha_1}_{j_1}g^{\alpha_2}_{j_2}\ldots g^{\alpha_c}_{j_c}\right)\langle \rho_{\alpha_1}\cdots \rho_{\alpha_c}\rangle =\sum^{c}_{I =1} \frac{1}{b^{c-I}}\binom{c}{c-I}\left(g^{\gamma_1}_{(j_1}g^{\gamma_2}_{j_2}\cdots g^{\gamma_I}_{j_I}g^{\gamma_{I}}_{j_{I+1}}\cdots g^{\gamma_I}_{j_{c})}\right)\vert \delta\vert ^{\alpha_1\cdots \alpha_I}_{\gamma_1\cdots \gamma_I}\langle\rho_{\alpha_1}\cdots \rho_{\alpha_I} \rangle, \] where the \(I\)-th term collects the contributions in which \(c-I+1\) of the example indices coincide. Using \[ \langle \rho_{\alpha_1}\cdots \rho_{\alpha_k}\rangle =\frac{1}{b^k}\frac{\binom{N-k}{b-k}}{\binom{N}{b}} \mathbf{1}_{\alpha_1}\cdots \mathbf{1}_{\alpha_k}, \] we have \[ \left(g^{\alpha_1}_{j_1}g^{\alpha_2}_{j_2}\ldots g^{\alpha_c}_{j_c}\right)\langle \rho_{\alpha_1}\cdots \rho_{\alpha_c}\rangle = \] \[ \sum^c_{I=1}\frac{1}{b^{c-I}}\binom{c}{c-I}\left(g^{\gamma_1}_{(j_1}\mathbf{1}_{\gamma_1}g^{\gamma_2}_{j_2}\mathbf{1}_{\gamma_2}\cdots \mathbf{1}_{\gamma_I}\left(g^{\gamma_I}_{j_I}g^{\gamma_{I}}_{j_{I+1}}\cdots g^{\gamma_I}_{j_{c})}\right)\right)\frac{1}{b^I}\frac{\binom{N-I}{b-I}}{\binom{N}{b}} \]

\[ = \sum^c_{I=1}\frac{N^{I}}{b^c}\binom{c}{c-I}\frac{\binom{N-I}{b-I}}{\binom{N}{b}}\bigg(\partial_{(j_1}\Phi\cdots \partial_{j_{I-1}}\Phi\sum^N_{\alpha=1}\left(g^{\alpha}_{j_{I}}\cdots g^{\alpha}_{j_c)}\right)\bigg), \] where \(\Phi\) denotes the mean loss over the dataset, so that \(\sum^N_{\alpha=1} g^{\alpha}_{j} = N\,\partial_{j}\Phi\).
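Before assembling the final equation, we can spot-check the \(c=2\) case of this reduction: the second moment of the minibatch gradient should split into a distinct-index piece and a coincident-index piece that descends via \(\langle\rho^2_{\alpha}\rangle = \langle\rho_{\alpha}\rangle/b\). A sketch with random scalars standing in for the per-example gradients \(g^{\alpha}\) (all parameters arbitrary):

```python
# Spot check of the c = 2 reduction (a sketch: the per-example "gradients"
# are random scalars standing in for g^alpha; all parameters arbitrary).
import math
import numpy as np

rng = np.random.default_rng(3)
N, b, trials = 10, 4, 300_000
g = rng.normal(size=N)                # per-example scalar gradients g^alpha

acc = 0.0
for _ in range(trials):
    I = np.zeros(N)
    I[rng.choice(N, size=b, replace=False)] = 1.0
    acc += (g @ (I / b)) ** 2         # (sum_alpha g^alpha rho_alpha)^2
empirical = acc / trials

pair = math.comb(N - 2, b - 2) / math.comb(N, b) / b**2  # <rho_a rho_b>, a != b
distinct = pair * (g.sum() ** 2 - (g**2).sum())          # sum over a != b
coincident = (1 / b) * (1 / N) * (g**2).sum()            # <rho^2> = <rho>/b, <rho> = 1/N
print(empirical, distinct + coincident)                  # should agree
```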

Therefore, we now have the exact Fokker-Planck equation for stochastic gradient descent:

\[ P_{t+1}(w)-P_t(w)= \] \[ \sum_{c=1}^{\infty} \frac{(-1)^{c}}{c!}\left(\frac{\eta}{b}\right)^{c} \sum_{j_{1}, \ldots, j_{c}=1}^{D} \frac{\partial^{c}}{\partial w_{j_{1}} \partial w_{j_{2}} \cdots \partial w_{j_{c}}}\] \[\bigg(\sum^c_{I=1}N^{I}\binom{c}{c-I}\frac{\binom{N-I}{b-I}}{\binom{N}{b}}\bigg(\partial_{(j_1}\Phi\cdots \partial_{j_{I-1}}\Phi\sum^N_{\alpha=1}\left(g^{\alpha}_{j_{I}}\cdots g^{\alpha}_{j_c)}\right)\bigg)P_t(w)\bigg) \]
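As a last sanity check, the \(c=1\) (drift) term says the mean one-step update is the full-batch gradient step, \(\langle w_{t+1}-w_t\rangle = -\frac{\eta}{N}\sum_\alpha g^\alpha = -\eta\,\partial\Phi\), since \(\langle\rho_\alpha\rangle = 1/N\). A sketch (random scalar stand-ins for \(g^\alpha\), parameters arbitrary):

```python
# Monte Carlo check of the c = 1 (drift) term: the mean one-step SGD update
# equals the full-batch gradient step (a sketch; parameters arbitrary).
import numpy as np

rng = np.random.default_rng(4)
N, b, eta, trials = 10, 4, 0.1, 200_000
g = rng.normal(size=N)                    # per-example scalar gradients g^alpha

updates = np.empty(trials)
for t in range(trials):
    batch = rng.choice(N, size=b, replace=False)
    updates[t] = -eta * g[batch].mean()   # -eta * sum_alpha g^alpha rho_alpha

print(updates.mean(), -eta * g.mean())    # should agree: <rho_alpha> = 1/N
```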

References

Leen, Todd, and John Moody. 1992. “Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria.” Advances in Neural Information Processing Systems 5.
Ondo, Nicholas Andrew. 2017. “Exotic Gauge Theories of Spin-2 Fields.” PhD thesis, Imperial College London.
Orr, Genevieve, and Todd Leen. 1992. “Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times.” Advances in Neural Information Processing Systems 5.
Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” The Annals of Mathematical Statistics 22 (3): 400–407. https://doi.org/10.1214/aoms/1177729586.
Wu, Jingfeng, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, and Zhanxing Zhu. 2020. “On the Noisy Gradient Descent That Generalizes as SGD.” In ICML.