Autoencoders are among the simplest neural networks for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations—a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm, which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders trained on realistic datasets such as CIFAR10, thus establishing shallow autoencoders as an instance of the recently observed Gaussian universality.
1. Introduction
Autoencoders are a class of neural networks trained to reconstruct their inputs by minimising the distance between an input $x$ and the network output $\hat{x}$. The key idea for learning good features with autoencoders is to make the intermediate layer in the middle of the network (significantly) smaller than the input dimension. This 'bottleneck' forces the network to develop a compressed representation of its inputs, together with an encoder and a decoder. Autoencoders of this type are called undercomplete, see figure 1, and provide an ideal framework to study feature learning, which appears to be a key property of deep learning, yet remains poorly understood from a theoretical point of view.
How well can shallow autoencoders with a single hidden layer of K neurons perform on a reconstruction task? For linear autoencoders, Eckart and Young [1] established that the optimal reconstruction error is obtained by a network whose weights are given by (a rotation of) the K leading principal components of the data, i.e. the K leading eigenvectors of the input covariance matrix. The reconstruction error of such a network is therefore called the PCA error. How linear autoencoders learn these components when trained using stochastic gradient descent (SGD) was first studied by Bourlard and Kamp [2] and Baldi and Hornik [3]. A series of recent works analysed linear autoencoders in more detail, focusing on the loss landscape [4], the convergence and implicit bias of SGD dynamics [5–7], and on recovering the exact principal components [8–10].
Adding a non-linearity to autoencoders is a key step on the way to more realistic models of neural networks, which crucially rely on non-linearities in their hidden layers to extract good features from data. On the other hand, non-linearities pose a challenge for theoretical analysis. In the context of high-dimensional models of learning, an additional complication arises from analysing non-linear models trained on inputs with non-trivial correlations, which are a fundamental prerequisite for unsupervised learning. The first step in this direction was taken by Nguyen [11], who analysed overcomplete autoencoders where the number of hidden neurons grows polynomially with the input dimension, focusing on the special case of weight-tied autoencoders, where encoder and decoder have the same weights.
Here, we leverage recent developments in the analysis of supervised learning on complex input distributions to study the learning dynamics of undercomplete, non-linear autoencoders with both tied and untied weights. While their reconstruction error is also bounded below by the PCA reconstruction error [2, 3], we will see that their learning dynamics are rich and raise a series of questions: can non-linear autoencoders trained with SGD reach the PCA error, and if so, how do they do it? How do the representations they learn depend on the architecture of the network, or on the learning rule used to train them?
We describe our setup in detail in section 2. Our main contributions can be summarised as follows:
1. we derive a set of asymptotically exact equations that describe the generalisation dynamics of shallow, non-linear autoencoders trained using online SGD (section 3.1);
2. using these equations, we show how autoencoders learn important features of their data sequentially (section 3.2); we analyse the long-time dynamics of learning and highlight the importance of untying the weights of decoder and encoder (section 3.3); and we show that training the biases is necessary for ReLU autoencoders (section 3.4);
3. we illustrate how a modification of vanilla SGD breaking the rotational symmetry of neurons yields the exact principal components of the data (section 3.5);
4. we finally show that the equations also capture the learning dynamics of non-linear autoencoders trained on more realistic datasets such as CIFAR10 with great accuracy, and discuss connections with recent results on Gaussian universality in neural network models in section 4.
Reproducibility. We provide code to solve the dynamical equations of section 3.1 and to reproduce our plots at https://github.com/mariaref/NonLinearShallowAE.
2. Setup
Architecture. We study the performance and learning dynamics of shallow non-linear autoencoders with K neurons in the hidden layer. Given a D-dimensional input $x \in \mathbb{R}^D$, the output of the autoencoder is given by

$$\hat{x} = \sum_{k=1}^{K} v_k\, g\!\left(\frac{w_k \cdot x}{\sqrt{D}}\right), \qquad (1)$$
where $w_k, v_k \in \mathbb{R}^D$ are the encoder and decoder weights of the kth hidden neuron, respectively, and $g$ is a non-linear activation function. We study this model in the thermodynamic limit where we let the input dimension $D \to \infty$ while keeping the number of hidden neurons K finite, as shown in figure 1(a). The performance of a given autoencoder is measured by the population reconstruction mean-squared error (pmse),

$$\mathrm{pmse}(w, v) \equiv \mathbb{E}\,\lVert x - \hat{x} \rVert^2, \qquad (2)$$
where the expectation is taken with respect to the data distribution. Nguyen [11] analysed shallow autoencoders in a complementary mean-field limit, where the number of hidden neurons K grows polynomially with the input dimension. Here, by focusing on the case K < D, we study the regime where the hidden layer acts as a bottleneck. Nguyen [11] considers tied autoencoders, where a single set of weights is shared between encoder and decoder, $v_k = w_k$. We will see in section 3.3 that untying the weights is critical to achieve a non-trivial reconstruction error in the regime K < D.
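For concreteness, the forward pass (1) and a Monte Carlo estimate of the reconstruction error can be sketched in a few lines of NumPy. The snippet below is purely illustrative: tanh stands in for a generic sigmoidal activation, and the weight initialisation is an arbitrary choice rather than the one used in our experiments.

```python
import numpy as np

def forward(X, W, V, g=np.tanh):
    """Forward pass of a shallow autoencoder, x_hat = g(X W^T / sqrt(D)) V.

    X: (n, D) batch of inputs, W: (K, D) encoder weights, V: (K, D) decoder weights.
    """
    D = X.shape[1]
    lam = X @ W.T / np.sqrt(D)        # local fields lambda_k, shape (n, K)
    return g(lam) @ V                 # reconstructions, shape (n, D)

def pmse(X, W, V, g=np.tanh):
    """Monte Carlo estimate of the reconstruction mean-squared error on a batch."""
    X_hat = forward(X, W, V, g)
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# usage: K = 3 neurons, D = 1000 input dimensions
rng = np.random.default_rng(0)
D, K, n = 1000, 3, 512
W = rng.standard_normal((K, D))       # encoder weights, entries O(1)
V = rng.standard_normal((K, D)) / D   # small initial decoder weights (arbitrary scale)
X = rng.standard_normal((n, D))
print(pmse(X, W, V))
```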
Data model. We derive our theoretical results for inputs drawn from a spiked Wishart model [13, 14], which has been widely studied in statistics to analyse the performance of unsupervised learning algorithms. In this model, inputs are sampled according to

$$x_\mu = A c_\mu + \sigma \xi_\mu, \qquad (3)$$

where µ is an index that runs over the inputs in the training set, $\xi_\mu$ is a D-dimensional vector whose elements are drawn i.i.d. from a standard Gaussian distribution, and σ > 0. The matrix $A \in \mathbb{R}^{D \times M}$ has elements of order $1/\sqrt{D}$ and is fixed for all inputs, while we sample $c_\mu \in \mathbb{R}^M$ from some distribution. Different choices for A and the distribution of c allow modelling of Gaussian mixtures, sparse codes, and non-negative sparse coding.
Spectral properties of the inputs. The spectrum of the covariance of the inputs, $\Omega \equiv \mathbb{E}\,x x^\top$, is determined by A and c. It governs the dynamics and the performance of autoencoders (by construction, we have $\Omega = A\,\mathbb{E}[c c^\top] A^\top + \sigma^2 \mathbb{1}_D$). If we set A = 0 in equation (3), inputs would be i.i.d. Gaussian vectors with an average covariance of $\sigma^2 \delta_{ij}$, where δij is the Kronecker delta. The empirical covariance matrix of a finite data set has a Marchenko–Pastur distribution with a scale controlled by σ [14, 15]. By sampling the mth element of c as a zero-mean Gaussian with a suitably chosen variance, we ensure that the input–input covariance has M outlier eigenvalues with the columns of A as the corresponding eigenvectors. The remaining D − M eigenvalues of the empirical covariance, i.e. the bulk, still follow the Marchenko–Pastur distribution. We further ensure that the reconstruction task for the autoencoder is well-posed by letting the outlier eigenvalues grow with D, resulting in spectra such as the one shown in figure 1(b). If the largest eigenvalues were of order one, an autoencoder with a finite number of neurons K could not obtain a pmse better than random guessing as the input dimension $D \to \infty$.
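The following sketch shows one way to sample such a dataset and to check the separation between bulk and outlier eigenvalues. Letting the outliers grow as $\sqrt{D}$ is one admissible choice, and the specific outlier strengths and variable names are ours.

```python
import numpy as np

def make_spiked_data(n, D, M, sigma=1.0, rho=None, rng=None):
    """Sample inputs x = A c + sigma * xi from a spiked covariance model.

    A has orthonormal columns (the planted eigenvectors); the variance of c_m sets
    the outlier eigenvalues. Here they grow as sqrt(D), an illustrative choice that
    keeps the reconstruction task well-posed in high dimensions.
    """
    rng = rng or np.random.default_rng(0)
    rho = rho if rho is not None else np.sqrt(D) * np.arange(M, 0, -1)  # outlier strengths
    A = np.linalg.qr(rng.standard_normal((D, M)))[0]   # orthonormal D x M matrix
    c = rng.standard_normal((n, M)) * np.sqrt(rho)     # latent coefficients
    xi = rng.standard_normal((n, D))                   # isotropic Gaussian noise
    return c @ A.T + sigma * xi, A, rho

X, A, rho = make_spiked_data(n=5000, D=500, M=3)
eigs = np.linalg.eigvalsh(X.T @ X / X.shape[0])
print(eigs[-4:])   # the top three eigenvalues separate clearly from the bulk
```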
Training. We train the autoencoder on the quadratic error in the one-pass (or online) limit of SGD, where a previously unseen input x is presented to the network at each step of SGD. This limit is extensively used to analyse the dynamics of non-linear networks for both supervised and unsupervised learning, and it has been shown experimentally to capture the dynamics of deep networks trained with data augmentation well [16]. The SGD weight increments for the network parameters are then given by:
where κ is the regularisation (weight decay) constant. In order to obtain a well-defined limit as $D \to \infty$, we rescale the learning rates of encoder and decoder with appropriate powers of D (see appendix B), for some fixed η > 0.
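A minimal sketch of one step of online SGD on the squared reconstruction error is given below. The gradients follow from the architecture (1), but the per-layer learning-rate scalings and the weight-decay handling are illustrative choices, not the exact rescaling used in the analysis.

```python
import numpy as np

def sgd_step(x, W, V, g, dg, lr=0.1, weight_decay=0.0):
    """One online SGD step on 0.5 * ||x_hat - x||^2 for a single input x.

    W, V: (K, D) encoder / decoder weights; g, dg: activation and its derivative.
    The per-layer learning-rate scalings below are illustrative only.
    """
    D = x.shape[0]
    lam = W @ x / np.sqrt(D)                 # local fields, shape (K,)
    err = g(lam) @ V - x                     # reconstruction error, shape (D,)
    grad_V = np.outer(g(lam), err)           # dL/dV, shape (K, D)
    grad_W = np.outer((V @ err) * dg(lam), x / np.sqrt(D))  # dL/dW, shape (K, D)
    V -= lr * (grad_V + weight_decay * V)
    W -= lr / D * (grad_W + weight_decay * W)   # slower encoder rate (illustrative)
    return W, V
```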
3. Results
3.1. Dynamical equations to describe feature learning
The starting point for deriving the dynamical equations is the observation that the pmse defined in equation (2) depends on the inputs only via their low-dimensional projections onto the network's weights, the local fields

$$\lambda_k \equiv \frac{w_k \cdot x}{\sqrt{D}}, \qquad \nu_k \equiv \frac{v_k \cdot x}{D}. \qquad (5)$$
This allows us to replace the high-dimensional expectation over the inputs in equation (2) with a low-dimensional expectation over the local fields λ and ν. For Gaussian inputs drawn from (3), the local fields are jointly Gaussian and their distribution is entirely characterised by their second moments:
where Ω is the covariance of the inputs. These macroscopic overlaps, often referred to as order parameters in the statistical physics of learning, are, together with the overlap of the decoder weights, sufficient to evaluate the pmse. To analyse the dynamics of learning, the goal is then to derive a closed set of differential equations which describe how the order parameters evolve as the network is trained using SGD (4). Solving these equations then yields the performance of the network at all times. Note that feature learning requires that the weights of the encoder and the decoder move far away from their initial values. Hence, we cannot resort to the framework of the neural tangent kernel or lazy learning [17, 18].
Instead, here we build on an approach that was pioneered by Saad and Solla [19, 20] and Riegler and Biehl [21] to analyse supervised learning in two-layer networks with uncorrelated inputs; Saad [22] provides a summary of the ensuing work. We face two new challenges when analysing autoencoders in this setup. First, autoencoders cannot be formulated in the usual teacher-student setup, where a student network is trained on inputs whose label is provided by a fixed teacher network, because it is impossible to choose a teacher autoencoder which exactly implements the identity function x → x with K < D. Likewise, a teacher autoencoder with random weights, the typical choice in supervised learning, is not an option since such an autoencoder does not implement the identity function, either. Instead, we bypass introducing a teacher and develop a description starting from the data properties. The second challenge is then to handle the non-trivial correlations in the inputs (3), where we build on recent advances in supervised learning on structured data [23–26].
3.1.1. Derivation: a sketch.
Here, we sketch the derivation of the equation of motion for one of the order parameters; a complete derivation for this and all other order parameters is given in appendix B.
We introduce an order-parameter density that depends on the eigenvalue ρ and on the normalised number of steps $t \equiv \mu/D$, which we interpret as a continuous time variable in the limit $D \to \infty$,
where $\mathbb{1}\{\cdot\}$ is the indicator function and the limit is taken after the thermodynamic limit $D \to \infty$. Similar densities are introduced for the order parameters R and Q, respectively. Now, inserting the SGD update for v (4) into the expression of the density, and evaluating the expectation over a fresh sample x, we obtain the dynamical equation:
where we write $(k \leftrightarrow l)$ to denote the same term with the indices k and l permuted. The order parameter is recovered by integrating the density over the spectral density of the covariance matrix (9).
We can finally close the equations by evaluating the remaining averages, such as $\mathbb{E}\,[\nu_k\, g(\lambda_l)]$. For some activation functions, such as ReLU or the sigmoidal activation $g(x) = \mathrm{erf}(x/\sqrt{2})$, these integrals have an analytical closed form. In any case, the averages only depend on the second moments of the jointly Gaussian random variables (λ, ν), which are given by the order parameters, allowing us to close the equations on the order parameters.
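As an example of such a closed form, for the sigmoidal choice $g(x) = \mathrm{erf}(x/\sqrt{2})$ the two-point average is a classic Gaussian integral with an explicit expression; the snippet below cross-checks it against Monte Carlo sampling. The exact expressions used in our analysis are collected in appendix D.

```python
import numpy as np
from scipy.special import erf

def I2_erf(c11, c12, c22):
    """E[g(a) g(b)] for g(x) = erf(x / sqrt(2)) and (a, b) zero-mean Gaussian with
    covariance [[c11, c12], [c12, c22]] -- a standard closed-form Gaussian integral."""
    return (2 / np.pi) * np.arcsin(c12 / np.sqrt((1 + c11) * (1 + c22)))

# Monte Carlo cross-check: the average depends only on the second moments.
rng = np.random.default_rng(0)
C = np.array([[1.3, 0.4], [0.4, 0.8]])
a, b = rng.multivariate_normal([0, 0], C, size=1_000_000).T
print(np.mean(erf(a / np.sqrt(2)) * erf(b / np.sqrt(2))),
      I2_erf(*C[np.triu_indices(2)]))
```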
We derived these equations using heuristic methods from statistical physics. While we conjecture that they are exact, we leave it to future work to prove their asymptotic correctness in the limit . We anticipate that techniques to prove the correctness of dynamical equations for supervised learning [27, 28] should extend to this case after controlling for the spectral dependence of order parameters.
3.1.2. Separation between bulk modes and principal components.
We can exploit the separation of bulk and outlier eigenvalues to decompose the integral (9) into an integral over the bulk of the spectrum and a sum over the outliers. This results in a decomposition of each overlap into a bulk contribution and a spike contribution:
and likewise for other order parameters. Leveraging the fact that the bulk eigenvalues are of order , and that we take the limit, we can write the equation for as:
On the other hand, the M outlier eigenvalues, i.e. the spikes, grow with D. We thus need to keep track of all the terms in their equations of motion. Analogously to equation (10), these read:
Note that the spike and bulk contributions are coupled only through the expectations of the local fields. A similar derivation, which we defer to appendix B.3, can be carried out for the remaining order parameters.
3.1.3. Empirical verification.
In figure 2, we plot the pmse of an autoencoder with K = 3 hidden neurons during training on a dataset drawn from equation (3) with M = 3. We plot the pmse both as obtained from training the network using SGD (4) (crosses) and as obtained from integrating the equations of motion that we just derived (lines). The theoretical description in terms of bulk and spike modes captures the training dynamics well, even at an intermediate input dimension of D = 1000. In the following, we analyse these equations, and hence the behaviour of the autoencoder during training, in more detail.
3.2. Nonlinear autoencoders learn principal components sequentially
The dynamical equation of the encoder–encoder overlap, equation (B.28), takes the form
The learning rate of each mode is thus rescaled by the corresponding eigenvalue, suggesting that the principal components corresponding to different modes are learnt at different speeds, with the modes corresponding to the largest eigenvalues being learnt the fastest. For the other order parameters, all the terms in the equations of motion (B.24) and (B.29) are rescaled by ρ except for one:
Indeed, the pmse of the sigmoidal autoencoder shown in figure 2 goes through several sudden decreases, preceded by plateaus where the pmse is quasi-stationary. Each transition is related to the network 'picking up' an additional principal component. By comparing the error of the network on the plateaus to the PCA reconstruction error with an increasing number of principal components (solid horizontal lines in figure 2), we confirm that the plateaus correspond to stationary points where the network has learnt (a rotation of) the leading principal components. Whether or not the plateaus are clearly visible in the curves depends on the separation between the eigenvalues of the leading eigenvectors.
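The plateau values can be read off from the spectrum alone, since the rank-k PCA reconstruction error is, up to the overall normalisation of the pmse, the total variance left in the discarded modes. A short sketch in our notation:

```python
import numpy as np

def pca_plateau_errors(eigvals, k_max):
    """Rank-k PCA reconstruction errors: the variance left in the discarded modes.

    eigvals: eigenvalues of the input covariance, in any order; returns the errors
    for k = 0, ..., k_max.
    """
    lam = np.sort(np.asarray(eigvals))[::-1]     # sort in descending order
    total = lam.sum()
    return np.array([total - lam[:k].sum() for k in range(k_max + 1)])

# e.g. a bulk of eigenvalues near 1 plus three outliers: successive plateau values
eigs = np.concatenate(([60.0, 35.0, 20.0], np.ones(997)))
print(pca_plateau_errors(eigs, 3))   # [1112., 1052., 1017., 997.]
```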
This sequential learning of principal components has also been observed in several other models of learning. It appears in unsupervised learning rules such as Sanger's rule [29] (also known as generalised Hebbian learning, cf appendix A).
The evolution of the bulk order parameters obeys
As a consequence, in the early stages of training and to first approximation, the bulk components of the order parameters follow an exponential decay towards 0. The characteristic time of this decay is given by an expectation which only depends on the other order parameters. In addition, the last equation implies that the evolution of the bulk contribution is entirely determined by the dynamics of the spike modes.
3.3. The long-time dynamics of learning: align, then rescale
At long training times, we expect a sigmoidal autoencoder with untied weights to have retrieved the leading PCA subspace, since it achieves the PCA reconstruction error. This motivates an ansatz for the dynamical equations where the network weights are proportional to the eigenvectors of the covariance matrix Ω,
where the proportionality constants are scaling factors that evolve in time. With this ansatz, the order parameters are diagonal, and we are left with a reduced set of 2K equations describing the dynamics of the encoder and decoder scaling constants, which are given in equation (C.8) together with their detailed derivation.
We first verify the validity of the reduced equations (C.8) in describing the long-time dynamics of sigmoidal autoencoders. In figure 3(a), we show the generalisation dynamics of a sigmoidal autoencoder starting from random initial conditions (blue). We see two phases of learning: after an initial exponential decay of the pmse, the pmse decays as a power law. This latter decay is well captured by the evolution of the pmse of a model which we initialised with weights aligned to the leading PCs. We deduce that during the exponential decay of the error, the weights align to the leading PCA directions and stay aligned during the power-law decay of the error.
After having recovered the suitable subspace, the network adjusts the norm of its weights to decrease the error. A look at the dynamics of the scale parameters during this second phase of learning reveals that the encoder's weights shrink, while the decoder's weights grow, cf figure 3(b). Together, these changes lead to the power-law decay of the pmse. Note that we chose a dataset with only one leading eigenvalue so as to guarantee that the AE recovers the leading principal component itself, rather than a rotation of it. A scaling analysis of the evolution of the scaling constants shows that the encoder weights decay, and the decoder weights grow, as power laws with time, see figures 3(c) and (d).
We can understand this behaviour by recalling the linearity of the activation function around the origin, where $g(x) \approx g'(0)\,x$. By shrinking the encoder's weights, the autoencoder thus performs a linear transformation of its inputs, despite the non-linearity. It recovers the performance of the linear network given by PCA if the decoder's weights grow correspondingly, so that the reconstructed input has the same norm as the input x.
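A quick numerical illustration of this mechanism, using $g(x) = \mathrm{erf}(x/\sqrt{2})$ as an example of a sigmoidal activation: shrinking the argument by a factor a and multiplying the output by $1/(a\,g'(0))$ recovers the identity map as a → 0.

```python
import numpy as np
from scipy.special import erf

g = lambda x: erf(x / np.sqrt(2))
gprime0 = np.sqrt(2 / np.pi)          # g'(0) for this sigmoid

z = np.linspace(-2, 2, 5)
for a in [1.0, 0.1, 0.01]:
    # encoder shrinks the argument by a; decoder grows by 1/(a * g'(0)) to compensate
    print(a, np.max(np.abs(g(a * z) / (a * gprime0) - z)))   # error vanishes as a -> 0
```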
We can find a relation between the scaling constants of encoder and decoder at long times by using the ansatz (19) in the expression of the pmse:
where the second term is simply the rank-K PCA error. The first term should be minimised to achieve the PCA error, i.e. each of its summands should be minimised for all k. For a linear autoencoder, we find that the product of the encoder and decoder scaling constants is fixed: as expected, any rescaling of the encoder's weights needs to be compensated in the decoder. For a sigmoidal autoencoder instead, we find
Note that for small encoder weights we recover the linear scaling.
The importance of untying the weights for sigmoidal autoencoders. The need to let the encoder and decoder weights shrink, resp. grow, to achieve the PCA error makes learning impossible in sigmoidal autoencoders with tied weights, where $v_k = w_k$. Such an autoencoder evidently cannot perform the rescaling required to enter the linear regime of its activation function, and is therefore not able to achieve the PCA error. Even worse, the learning dynamics shown in figure 4(a) show that a sigmoidal autoencoder with tied weights (blue) hardly achieves a reconstruction error better than chance, in stark contrast to the same network with untied weights (red).
3.4. The importance of the bias in ReLU autoencoders
Sigmoidal autoencoders achieve the PCA error by exploiting the linear region of their activation function. We can hence expect that, in order to successfully train a ReLU AE, which is linear for all positive arguments, it is necessary to add biases at the hidden layer. The error curves for ReLU autoencoders shown in figure 4(b) indeed show that ReLU autoencoders achieve an error close to the PCA error only if the weights are untied and the biases are trained. If the biases are not trained, we found that a ReLU autoencoder with K hidden neurons generally achieves a reconstruction error roughly equal to the PCA error with K/2 components, presumably because only about half of the nodes have a positive pre-activation and can exploit the linear region of ReLU. Training the biases consistently resulted in large, positive biases at the end of training, figure 4(c). The large biases, however, result in a small residual error that negatively affects the final performance of a ReLU network when compared to the PCA performance, as can be seen from linearising the output of the ReLU autoencoder for large positive biases,

$$\hat{x} = \sum_{k=1}^{K} v_k\,\mathrm{ReLU}\!\left(\lambda_k + b_k\right) \approx \sum_{k=1}^{K} v_k\left(\lambda_k + b_k\right),$$

where $b_k$ is the bias of the kth neuron.
3.5. Breaking the symmetry of SGD yields the exact principal components
Linear autoencoders do not retrieve the exact principal components of the inputs, but some rotation of them due to the rotational symmetry of the square loss [32]. While this does not affect performance, it would still be desirable to obtain the leading eigenvectors exactly to identify the important features in the data.
The symmetry between neurons can be broken in a number of ways. In linear autoencoders, previous work considered applying different amounts of regularisation to each neuron [4, 9, 33, 34] or optimising a modified loss [10]. In unsupervised learning, Sanger's rule [29] breaks the symmetry by making the update of the weights of the first neuron independent of all other neurons; the update of the second neuron only depends on the first neuron, etc. This idea was extended to linear autoencoders by [9, 10]. We can easily extend this approach to non-linear autoencoders by training the decoder weights following
where the sum now only runs up to k instead of K. The update for the encoder weights $w_k$ is unchanged from equation (4), except that the error term is replaced by the correspondingly truncated reconstruction error.
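The following sketch implements this truncated decoder update in NumPy. The prefactors and the learning-rate scaling are illustrative and do not reproduce the exact constants of equation (23).

```python
import numpy as np

def truncated_decoder_step(x, W, V, g, lr=0.1):
    """Sanger-style truncated update: neuron k is trained to reconstruct x using only
    the partial output of neurons 1..k, which breaks the rotational symmetry."""
    D = x.shape[0]
    act = g(W @ x / np.sqrt(D))                     # g(lambda_k), shape (K,)
    partial = np.cumsum(act[:, None] * V, axis=0)   # partial reconstructions sum_{j<=k} v_j g(lambda_j)
    err = partial - x[None, :]                      # truncated error for each k, shape (K, D)
    V -= lr * act[:, None] * err                    # gradient-like update per neuron
    return V
```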
We can appreciate the effect of this change by considering a fixed point of the vanilla SGD updates (4), with encoder and decoder weights $(w_k^*, v_k^*)$. Multiplying this solution by an orthogonal matrix, i.e. rotating the weights of the hidden neurons into each other, yields another fixed point, since
for the decoder, and likewise for the encoder weights. For truncated SGD (23), the underbrace term in equation (24) is no longer the identity and the rotated weights are no longer a fixed point.
We show the pmse of a sigmoidal autoencoder trained with vanilla and truncated SGD in figure 5(a), together with the theoretical prediction from a modified set of dynamical equations. The theory captures both learning curves perfectly, and we can see that using truncated SGD incurs no performance penalty: both autoencoders achieve the PCA reconstruction error. Note that the reconstruction error can increase temporarily while training with the truncated algorithm, since the updates do not minimise the square loss directly. In this experiment, we trained the networks on a synthetic dataset where the eigenvectors are chosen to be sinusoidal (black lines in figure 5(b)). As desired, the weights of the network trained with truncated SGD converge to the exact principal components, while vanilla SGD only recovers a linear combination of them.
The advantage of recovering the exact principal components is illustrated with the partial reconstructions of a test image from the FashionMNIST database [35]. We train autoencoders with K = 64 neurons to reconstruct FashionMNIST images using the vanilla and truncated SGD algorithms until convergence. We show the reconstruction using all the neurons of the networks in the left column of figure 5(d). Since the networks achieve essentially the same performance, both reconstructions look similar. The partial reconstruction using only the first five neurons of the networks shown on the right are very different: while the partial reconstruction from truncated SGD is close to the original image, the reconstruction of the vanilla autoencoder shows traces of pants mixed with those of a sweatshirt.
4. Representation learning on realistic data under standard conditions
We have derived a series of results on the training of non-linear autoencoders in the case of synthetic, Gaussian datasets. Most of our simplifying assumptions do not hold for real datasets, which contain only a finite number of samples that are finite-dimensional and not normally distributed. Furthermore, we had to choose certain scalings of the pre-activations (5) and of the learning rates to ensure the existence of a well-defined asymptotic limit $D \to \infty$. Here, we show that the dynamical equations of section 3.1 describe the dynamics of an autoencoder trained on more realistic images rather well, even if we do not perform any explicit rescalings in the simulations.
Realistic data. In figure 6, we show with crosses the pmse of autoencoders of increasing size trained on 10k greyscale images from CIFAR10 [12]. The solid lines are obtained by integrating the equations of motion of appendix B using the empirical covariance of the data.
Relation to the Gaussian equivalence. The fact that the learning dynamics of the autoencoders trained on CIFAR10 are well captured by an equivalent Gaussian model is an example of the Gaussian equivalence principle, or Gaussian universality, which has received a lot of attention recently in the context of supervised learning with random features [36–38] or one- and two-layer neural networks [24, 25, 39, 40]. These works showed that the performance of these different neural networks is asymptotically well captured by an appropriately chosen Gaussian model for the data. While previous work on Gaussian equivalence in two-layer networks crucially assumes that the network has a number of hidden neurons K that is small compared to the input dimension, here we find that the Gaussian equivalence of autoencoders extends to essentially any number of hidden neurons, as we show in figure 8 in the appendix.
Robustness of the dynamical equations to rescalings. The scaling of the local fields in equation (5), and hence of the order parameters, is necessary to ensure that in the high-dimensional limit $D \to \infty$, all three terms in the pmse are non-vanishing. As we describe in the derivation, the rescaling of the learning rates is then necessary to obtain a set of closed equations. Nevertheless, we verified that the conclusions of the paper are not an artefact of these scalings. To this end, we trained a sigmoidal AE on grayscale CIFAR10 using SGD without rescaling the eigenvalues of the inputs, the pre-activations, or the learning rates. As we show in figure 7, we recover the three key phenomena described in the paper: autoencoders with tied weights fail to achieve rank-1 PCA error (orange); autoencoders with untied weights achieve rank-K PCA error (blue); and for sigmoidal autoencoders, the encoder weights shrink while the decoder weights grow to reach the linear regime (inset). These results underline that our theoretical results describe AEs under 'standard conditions'.
Acknowledgments
We thank Galen Reeves for illuminating discussions. M R thanks the Data Science group at SISSA for its hospitality during a visit, where part of this research was performed. M R acknowledges funding from the French Agence Nationale de la Recherche under Grant ANR-19-P3IA-0001 PRAIRIE.
Appendix A: Online learning algorithms for PCA
We briefly review a number of unsupervised learning algorithms for principal component analysis, leading to Sanger's rule which is the inspiration for the truncated SGD algorithm of section 3.5.
Setup. We consider inputs with first two moments equal to
We work in the thermodynamic limit $D \to \infty$.
Hebbian learning. The Hebbian learning rule allows one to obtain an estimate of the leading PC by considering a linear model $\lambda = w \cdot x / \sqrt{D}$ and the loss function $\mathcal{L} = -\lambda^2/2$. It updates the weights as:
In other words, the Hebbian learning rule tries to maximise the second moment of the pre-activation λ. It makes intuitive sense that this update yields an estimate of the first principal component. Consider the average of the loss over the inputs:
If the weight vector w is equal to the τth eigenvector of Ω, the average loss is proportional to $-\rho_\tau$, and the loss is minimised by converging to the eigenvector with the largest eigenvalue. Also note that, similar to what we find in the main text, this result suggests that the leading eigenvalue should grow with D in the large-D limit. The Hebbian rule has the well-known deficit that it imposes no bound on the growth of the weight vector. A natural remedy is to introduce some form of weight decay such that
Oja's rule [41] offers a smart choice for the weight decay constant κ. Consider the same linear model trained with a Hebbian rule and a weight decay proportional to λ²:
The purpose of this choice can be appreciated by substituting the linear model and setting the average update to zero, which yields (dropping the constants)
which is precisely the eigenvalue equation for Ω.
Oja's update rule is then derived from Hebb's rule by adding a normalisation to the Hebbian update (A.2),
with p an integer (in Oja's original paper, p = 2). This learning rule can be seen as a power-iteration update. Substituting the linear model for y, we see that the numerator corresponds, on average, to repeated multiplication of the weight vector with the average covariance matrix. Expanding equation (A.7) around η = 0 yields Oja's algorithm (while also assuming that the weight vector is normalised to one). One can show that Oja's rule indeed converges to the PC with the largest eigenvalue by allowing the learning rate to vary with time and imposing the mild conditions $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$.
Sanger's rule [29] is also known as generalised Hebbian learning in the literature and extends the idea behind Oja's rule to networks with multiple outputs. It allows estimation of the first few leading eigenvectors using the update rule
Note that the second sum extends only up to k; the dynamics of the kth weight vector hence only depends on the weight vectors with index $j \le k$. This dependence is one of the similarities of Sanger's rule with the Gram–Schmidt procedure for orthogonalising a set of vectors (cf section 0.6.4 of [42]). The dynamics of Sanger's rule in the online setting were studied by [30, 43]. Sanger's rule reduces to Oja's rule for K = 1.
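For reference, a minimal online implementation of Sanger's rule is given below. For simplicity it omits the 1/√D field scaling used elsewhere in this appendix, and the dataset in the usage example is our own toy construction.

```python
import numpy as np

def sanger_step(x, W, lr=1e-2):
    """One online step of Sanger's rule (generalised Hebbian learning).

    W: (K, D) weight matrix; row k converges to the kth leading eigenvector of the
    input covariance. For K = 1 this reduces to Oja's rule."""
    y = W @ x                                        # outputs y_k = w_k . x
    # subtract, for each neuron k, the reconstruction built from neurons j <= k
    back = np.tril(np.outer(y, y)) @ W               # row k: sum_{j<=k} y_k y_j w_j
    W += lr * (np.outer(y, x) - back)
    return W

# usage: recover the top-2 planted directions in D = 20 dimensions
rng = np.random.default_rng(0)
D, K = 20, 2
U = np.linalg.qr(rng.standard_normal((D, K)))[0]     # planted orthonormal directions
W = 0.01 * rng.standard_normal((K, D))
for _ in range(20000):
    x = U @ (rng.standard_normal(K) * np.array([3.0, 2.0])) + 0.1 * rng.standard_normal(D)
    W = sanger_step(x, W, lr=1e-3)
print(np.abs(W @ U))                                 # close to the identity (up to signs)
```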
Appendix B: Online learning in autoencoders
Here we derive the set of equations tracking the dynamics of shallow non-linear autoencoders trained in the one-pass limit of stochastic gradient descent (SGD). These equations are derived for Gaussian inputs drawn from the generative model equation (3), but as we discuss in section 4, they also capture accurately the dynamics of training on real data.
B.1. Statics
The starting point of the analysis is the definition of the test error (2) and the identification of order parameters, i.e. low dimensional overlaps of high dimensional vectors,
First we introduce the local fields corresponding to the encoder's and decoder's weights, $\lambda_k \equiv w_k \cdot x / \sqrt{D}$ and $\nu_k \equiv v_k \cdot x / D$,
as well as the order parameter tracking the overlap between the decoder weights:
Note the unusual scaling of νk with D; the intuition here is that in shallow autoencoders, the second-layer weights will be strongly correlated with the eigenvectors of the input–input covariance, and hence with the inputs, requiring the scaling $1/D$ instead of $1/\sqrt{D}$. This scaling is also the one that yields a set of self-consistent equations in the limit $D \to \infty$. The generalisation error becomes
Crucially, the local fields λ and ν are jointly Gaussian since the inputs are Gaussian. Since the fields have zero mean, the pmse can be written in terms of their second moments only:
The full expressions of the resulting Gaussian averages in terms of the order parameters are given, for various activation functions, in appendix D. Note the different scalings of these overlaps with D; these are a direct consequence of the different scalings of the local fields λk and νk. In order to derive the equations of motion, it is useful to introduce the decomposition of Ω in its eigenbasis:
The eigenvectors are normalised to one. We further define the rotation of any D-dimensional vector onto this basis. The order parameters are then written:
B.2. Averages over non-linear functions of weakly correlated variables
We briefly recall expressions to compute averages over non-linear functions of weakly correlated random variables that were also used by [24]. These expressions will be used extensively in the following derivation. Consider x, y, two weakly correlated, jointly Gaussian random variables with covariance matrix
where the off-diagonal element, proportional to ε, encodes the weak correlation between x and y. Then, the expectation of the product of any two real-valued functions of x and y is given, to leading order in ε, by:
Similarly, for three random variables with x1 and x2 weakly correlated with x3 but not with each other, one can compute the expectation of the product of three real-valued functions to leading order in ε as:
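As a sanity check of the two-variable expansion, note that in our notation, for zero-mean jointly Gaussian (x, y) with unit variances and small cross-covariance ε, one has E[f(x)h(y)] ≈ E[f(x)]E[h(y)] + ε E[f′(x)]E[h′(y)]; the snippet below verifies this numerically for an arbitrary choice of f and h.

```python
import numpy as np

# Weak-correlation expansion: for zero-mean jointly Gaussian (x, y) with
# Var x = Var y = 1 and small cross-covariance eps,
#   E[f(x) h(y)] ~= E[f(x)] E[h(y)] + eps * E[f'(x)] E[h'(y)] + O(eps^2).
rng = np.random.default_rng(0)
eps, n = 0.05, 2_000_000
x = rng.standard_normal(n)
y = eps * x + np.sqrt(1 - eps**2) * rng.standard_normal(n)   # Cov(x, y) = eps

f, df = np.tanh, lambda z: 1 / np.cosh(z) ** 2
h, dh = lambda z: z**3, lambda z: 3 * z**2

lhs = np.mean(f(x) * h(y))
rhs = np.mean(f(x)) * np.mean(h(y)) + eps * np.mean(df(x)) * np.mean(dh(y))
print(lhs, rhs)    # agree up to O(eps^2) and Monte Carlo error
```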
B.3. Derivation of dynamical equations
At the µth step of training, the SGD update for the rotated weights reads
To keep the notation concise, we drop the weight decay term in the following analysis. We can now compute the update of the various order parameters by inserting the above into their definitions. The stochastic process described by the resulting equations concentrates, in the $D \to \infty$ limit, onto its expectation over the input distribution. By performing the average over a fresh sample x, we therefore obtain a closed set of deterministic ODEs tracking the dynamics of training in the high-dimensional limit. We also show in simulations that these equations capture the dynamics well already at finite dimension D.
Update of the overlap of the decoder's weights. Let us start with the update equation for the decoder overlap, which is easily obtained by using the SGD update (B.14),
By taking the expectation over a fresh sample x and the limit of one-pass SGD, we obtain a continuous-time equation in the normalised number of steps $t \equiv \mu/D$, which we interpret as a continuous time variable:
where the notation $(k \leftrightarrow l)$ indicates an additional term identical to the first except for the exchange of the indices k and l. This equation requires evaluating averages of the type $\mathbb{E}\,[\nu_k\, g(\lambda_l)]$ and $\mathbb{E}\,[g(\lambda_k)\, g(\lambda_l)]$, which are two-dimensional Gaussian integrals with covariance matrix given by the order parameters. We give their expressions in appendix D. We also note that the second-order term is sub-leading in the high-dimensional limit we work in.
Update of the overlaps involving the covariance. The update equation is obtained similarly to the previous one, by using the SGD update (B.14) and taking the high-dimensional limit,
To make progress, we have to evaluate expectations involving the rotated inputs. For this purpose, it is crucial to notice that the rotated inputs $x_\tau$ are only weakly correlated with the local fields:
Then, we can compute the expectation using the results of appendix B.2, i.e. equation (B.12), thus obtaining:
Inserting the above expression into the update yields:
Notice that a term appears in this equation which cannot simply be expressed in terms of the order parameters. Similar terms appear in the equations of the other order parameters. To close the equations, we are thus led to introduce order parameter densities in the next step.
Order parameters as integrals over densities. To proceed, we introduce order parameter densities which depend on ρ and on the normalised number of steps $t \equiv \mu/D$:
where $\mathbb{1}\{\cdot\}$ is the indicator function and the limit is taken after the thermodynamic limit $D \to \infty$. The order parameters are obtained by integrating these densities over the spectrum of the input–input covariance matrix:
It follows that by tracking the dynamics of these densities, we obtain the dynamics of the corresponding overlaps.
Dynamics of the first density. Its dynamics follows straightforwardly from equation (B.21):
where we defined the rescaled eigenvalue. Here again, straightforward algebra shows that the second-order term is sub-leading in the limit of high input dimensions.
Dynamics of the encoder density. In a similar way, we compute its dynamical equation by using the encoder weight update in equation (B.14) and taking the expectation over the input distribution,
In the above, the following expectations appear:
To compute these, we use our results for weakly correlated variables from appendix B.2 and obtain:
Using these results in the equation, we find:
Note that, as explained in the main text, in order to find a well-defined limit we rescale the learning rate of the encoder's weights with an appropriate power of D.
Dynamics of the remaining density. Lastly, we obtain its equation of motion in the same way as the two previous ones: we use equation (B.14) and evaluate the expectation over a fresh sample x.
We can evaluate the expectations as before (appendix B.2), and thus obtain the equation of motion:
B.4. Simplification of the equations for spiked covariance matrices
In the case of the synthetic dataset defined in equation (3), the spectrum of the covariance matrix decomposes into a bulk of small, order-one eigenvalues with continuous support, and a few outlier eigenvalues that grow with D (see figure 1). This allows us to obtain, for all order parameters, M equations controlling the evolution of the spike modes and one equation controlling the evolution of the bulk. Indeed, we can simplify the equations of motion of the bulk modes by neglecting the sub-leading terms. Doing so leads to:
We can define bulk order parameters that collect the contributions from the bulk modes:
Note that even though the individual bulk densities are small, we are integrating over a large number (O(D)) of modes, so the result is of order one. The equation for these overlaps is thus obtained as the integrated form of equation (B.31):
The order parameters can be decomposed into the contribution from the bulk eigenvalues and that of the spikes as:
and similarly for the other order parameters. The M spike modes obey the full equations (B.24), (B.28) and (B.29). In particular, it is clear from equation (B.33) that for any non-zero weight decay constant κ, the bulk contribution decays to 0 on a characteristic time κ⁻¹. The equations for the other bulk order parameters likewise result in an exponential decay of the bulk modes towards 0.
Appendix C: Reduced equations for long-time dynamics of learning
In this section we illustrate how to use the equations of motion in order to study the long training time dynamics of shallow non-linear autoencoders. At sufficiently long times, we have seen that the weights of the network span the leading PC subspace of the covariance matrix. We restrict to the case in which the eigenvectors are recovered directly. The case in which a rotation of them is found follows straightforwardly. We also restrict to the matched case in which K = M.
Consider the following ansatz on the weight configuration: each encoder and decoder weight vector is proportional to the corresponding eigenvector of the covariance, with dynamical scaling constants that control the norms of the weights. Using this ansatz, the order parameters take the form:
Using the above, we can rewrite the pmse as:
The first observation is that the generalisation error reaches a minimum when the term in brackets, $f_k$, is minimised. Minimising it leads to a relation between the scaling constants at which the pmse is equal to the PCA reconstruction error.
For a linear activation function, we have
The pmse is a function only of the product of the encoder and decoder scaling constants, and is minimal and equal to the PCA reconstruction error whenever:
The above is nothing but the intuitive result that for a linear autoencoder, any rescaling of the encoder's weights can be compensated by the decoder's weights.
For a sigmoidal autoencoder instead, the expressions of the integrals given in appendix D yield:
Differentiating $f_k$ with respect to the encoder scaling constant and requiring the derivative to vanish, we find the equation:
We point out that this solution is independent of the signs of the scaling constants, as recovering the eigenvectors or minus the eigenvectors is equivalent. Inserting this solution into the expression for the pmse, we find that the autoencoder reaches the same reconstruction error as the one achieved by PCA.
Ansatz in the equations of motion. The ansatz equation (19) results in order parameters of the form:
Inserting these expressions into the full dynamical equations of motion (B.24), (B.28) and (B.29) allows us to find a simplified set of 2K ODEs for the scaling constants of encoder and decoder:
The validity of the equations is verified in figure 3, where we compare the result of integrating them to simulations. We provide a Mathematica notebook which allows one to derive these equations in the GitHub repository of this paper.
To validate our calculations for autoencoders with finite D, we trained an autoencoder from three distinct initial conditions: a random one; an informed configuration in which the weights are initialised as in equation (19); and a perfect one, in which additionally the scaling constants are initialised from the minimal solution of equation (C.6). The first observation is that the pmse of the network started with perfect initial conditions remains at the PCA reconstruction error, thus validating equation (C.6). Since we have a noiseless dataset, the PCA error is given by the numerical error; we note, however, that the same conclusion carries over to noisy datasets with finite PCA reconstruction error. In figure 10, we plot the pmse of the randomly initialised and informed autoencoders on the left, as well as the evolution of the overlaps between the weights and the eigenvectors during training on the right. We can see that the network with random initial weights learns in two phases: first, the pmse decays exponentially until it reaches the error of the aligned network. Then, its pmse coincides with that of the informed network and the error decays as a power law. This shows that during the first phase of exponential learning, the network recovers the leading PC subspace; this occurs early on in training. In the second phase, the network adjusts the norm of its weights to reach PCA performance. As discussed in the main text, this is achieved in the sigmoidal network by shrinking the encoder's weights, so as to recover the linear regime of the activation function, while growing the decoder's weights to keep the norm of the reconstruction similar to that of the input.
Appendix D: Analytical formula of the integral
In this subsection we define the covariance matrices of the variables appearing in the expectations, for example $\mathbb{E}\,[g(\lambda_k)\, g(\lambda_l)]$.
Sigmoidal activation function
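The explicit expressions are not reproduced here. For reference, the standard closed forms for $g(x) = \mathrm{erf}(x/\sqrt{2})$ read, for zero-mean jointly Gaussian local fields with second moments denoted by C,

$$
\mathbb{E}\!\left[g(\lambda_k)\, g(\lambda_l)\right]
  = \frac{2}{\pi}\,
    \arcsin\!\frac{C_{\lambda_k \lambda_l}}
                  {\sqrt{\bigl(1 + C_{\lambda_k \lambda_k}\bigr)\bigl(1 + C_{\lambda_l \lambda_l}\bigr)}},
\qquad
\mathbb{E}\!\left[\nu_l\, g(\lambda_k)\right]
  = \sqrt{\frac{2}{\pi}}\,
    \frac{C_{\nu_l \lambda_k}}{\sqrt{1 + C_{\lambda_k \lambda_k}}}.
$$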
Linear activation function
ReLU activation function