The dynamics of representation learning in shallow, non-linear autoencoders (2024)

Autoencoders are the simplest neural networks for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations—a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm, which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders trained on realistic datasets such as CIFAR10, thus establishing shallow autoencoders as an instance of the recently observed Gaussian universality.

1. Introduction

Autoencoders are a class of neural networks trained to reconstruct their inputs by minimising the distance between an input $\boldsymbol{x}$ and the network output $\hat{\boldsymbol{x}}$. The key idea for learning good features with autoencoders is to make the intermediate layer in the middle of the network (significantly) smaller than the input dimension. This 'bottleneck' forces the network to develop a compressed representation of its inputs, together with an encoder and a decoder. Autoencoders of this type are called undercomplete, see figure 1, and provide an ideal framework to study feature learning, which appears to be a key property of deep learning, yet remains poorly understood from a theoretical point of view.

[Figure 1. (a) A shallow autoencoder with a bottleneck of K hidden neurons. (b) Spectrum of the input covariance, with a Marchenko–Pastur bulk and M outlier eigenvalues.]

How well can shallow autoencoders with a single hidden layer of neurons perform on a reconstruction task? For linear autoencoders, Eckart and Young [1] established that the optimal reconstruction error is obtained by a network whose weights are given by (a rotation of) the K leading principal components of the data, i.e. the K leading eigenvectors of the input covariance matrix. The reconstruction error of such a network is therefore called the PCA error. How linear autoencoders learn these components when trained using stochastic gradient descent (SGD) was first studied by Bourlard and Kamp [2] and Baldi and Hornik [3]. A series of recent works analysed linear autoencoders in more detail, focusing on the loss landscape [4], the convergence and implicit bias of SGD dynamics [5–7], and on recovering the exact principal components [8–10].

Adding a non-linearity to autoencoders is a key step on the way to more realistic models of neural networks, which crucially rely on non-linearities in their hidden layers to extract good features from data. On the other hand, non-linearities pose a challenge for theoretical analysis. In the context of high-dimensional models of learning, an additional complication arises from analysing non-linear models trained on inputs with non-trivial correlations, which are a fundamental prerequisite for unsupervised learning. The first step in this direction was taken by Nguyen [11], who analysed overcomplete autoencoders where the number of hidden neurons grows polynomially with the input dimension, focusing on the special case of weight-tied autoencoders, where encoder and decoder have the same weights.

Here, we leverage recent developments in the analysis of supervised learning on complex input distributions to study the learning dynamics of undercomplete, non-linear autoencoders with both tied and untied weights. While their reconstruction error is also limited by the PCA reconstruction error [2, 3], we will see that their learning dynamics is rich and raises a series of questions: can non-linear autoencoders trained with SGD reach the PCA error, and if so, how do they do it? How do the representations they learn depend on the architecture of the network, or on the learning rule used to train them?

We describe our setup in detail in section 2. Our main contributions can be summarised as follows:

  1. we derive a set of asymptotically exact equations that describe the generalisation dynamics of shallow, non-linear autoencoders trained using online SGD (section 3.1);

  2. using these equations, we show how autoencoders learn important features of their data sequentially (section 3.2); we analyse the long-time dynamics of learning and highlight the importance of untying the weights of decoder and encoder (section 3.3); and we show that training the biases is necessary for ReLU autoencoders (section 3.4);

  3. we illustrate how a modification of vanilla SGD breaking the rotational symmetry of neurons yields the exact principal components of the data (section 3.5);

  4. we finally show that the equations also capture the learning dynamics of non-linear autoencoders trained on more realistic datasets such as CIFAR10 with great accuracy, and we discuss connections with recent results on Gaussian universality in neural network models in section 4.

Reproducibility. We provide code to solve the dynamical equations of section 3.1 and to reproduce our plots at https://github.com/mariaref/NonLinearShallowAE.

2. Setup

Architecture. We study the performance and learning dynamics of shallow non-linear autoencoders with K neurons in the hidden layer. Given a D-dimensional input $\boldsymbol{x} \in \mathbb{R}^D$, the output of the autoencoder is given by

$$\hat{\boldsymbol{x}} = \sum_{k=1}^{K} \boldsymbol{v}^{k}\, g\!\left(\frac{\boldsymbol{w}^{k}\cdot\boldsymbol{x}}{\sqrt{D}}\right), \qquad (1)$$

where $\boldsymbol{w}^{k}, \boldsymbol{v}^{k} \in \mathbb{R}^D$ are the encoder and decoder weights of the kth hidden neuron, respectively, and $g: \mathbb{R} \to \mathbb{R}$ is a non-linear function. We study this model in the thermodynamic limit where we let the input dimension $D \to \infty$ while keeping the number of hidden neurons K finite, as shown in figure 1(a). The performance of a given autoencoder is measured by the population reconstruction mean-squared error,

$$\mathrm{pmse} \equiv \mathbb{E}\,\big\lVert \boldsymbol{x} - \hat{\boldsymbol{x}} \big\rVert^{2}, \qquad (2)$$

where the expectation $\mathbb{E}$ is taken with respect to the data distribution. Nguyen [11] analysed shallow autoencoders in a complementary mean-field limit, where the number of hidden neurons K grows polynomially with the input dimension. Here, by focusing on K < D, we study the regime where the hidden layer is a bottleneck. Nguyen [11] considers tied autoencoders, where a single set of weights is shared between encoder and decoder, $\boldsymbol{v}^{k} = \boldsymbol{w}^{k}$. We will see in section 3.3 that untying the weights is critical to achieve a non-trivial reconstruction error in the regime K < D.
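To make the setup concrete, here is a minimal NumPy sketch of the forward pass of equation (1) and a Monte-Carlo estimate of the pmse. The function names, the erf non-linearity and the absence of biases are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from scipy.special import erf

def forward(X, W, V):
    """Shallow autoencoder of eq. (1): xhat = sum_k v^k g(w^k . x / sqrt(D)).

    X: (N, D) inputs, W: (K, D) encoder weights, V: (K, D) decoder weights.
    """
    D = X.shape[1]
    lam = X @ W.T / np.sqrt(D)        # pre-activations lambda_k
    return erf(lam / np.sqrt(2)) @ V  # g(x) = erf(x / sqrt(2)), a sigmoidal choice

def pmse(X, W, V):
    """Monte-Carlo estimate of the population reconstruction error of eq. (2)."""
    return np.mean(np.sum((X - forward(X, W, V)) ** 2, axis=1))
```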

Data model. We derive our theoretical results for inputs drawn from a spiked Wishart model [13, 14], which has been widely studied in statistics to analyse the performance of unsupervised learning algorithms. In this model, inputs are sampled according to

$$\boldsymbol{x}^{\mu} = \boldsymbol{A}\,\boldsymbol{c}^{\mu} + \sigma\,\boldsymbol{\xi}^{\mu}, \qquad (3)$$

where µ is an index that runs over the inputs in the training set, $\boldsymbol{\xi}^{\mu}$ is a D-dimensional vector whose elements are drawn i.i.d. from a standard Gaussian distribution, and σ > 0. The matrix $\boldsymbol{A} \in \mathbb{R}^{D \times M}$ is fixed for all inputs, while we sample $\boldsymbol{c}^{\mu} \in \mathbb{R}^{M}$ from some distribution. Different choices for $\boldsymbol{A}$ and the distribution of $\boldsymbol{c}$ allow modelling of Gaussian mixtures, sparse codes, and non-negative sparse coding.

Spectral properties of the inputs. The spectrum of the covariance of the inputs is determined by $\boldsymbol{A}$ and $\boldsymbol{c}$, and it governs the dynamics and the performance of autoencoders (by construction, the inputs have zero mean). If we set $\boldsymbol{A} = 0$ in equation (3), the inputs would be i.i.d. Gaussian vectors with an average covariance of $\sigma^2 \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. The empirical covariance matrix of a finite data set then has a Marchenko–Pastur distribution with a scale controlled by σ [14, 15]. By sampling the mth element of $\boldsymbol{c}$ as a Gaussian with a suitably chosen variance, we ensure that the input–input covariance has M outlier eigenvalues with the columns of $\boldsymbol{A}$ as the corresponding eigenvectors. The remaining D − M eigenvalues of the empirical covariance, i.e. the bulk, still follow the Marchenko–Pastur distribution. We further ensure that the reconstruction task for the autoencoder is well-posed by letting the outlier eigenvalues grow with the input dimension D, resulting in spectra such as the one shown in figure 1(b). If the largest eigenvalues were of order one, an autoencoder with a finite number of neurons K could not obtain a pmse better than random guessing as the input dimension $D \to \infty$.
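A minimal sketch of one way to draw inputs from the spiked Wishart model of equation (3). The orthogonal columns of A, the Gaussian coefficients c, and the specific scaling used here to create outlier eigenvalues that grow with D are illustrative assumptions.

```python
import numpy as np

def sample_inputs(n, D, M, sigma=1.0, spike=None, rng=None):
    """Draw n inputs x = A c + sigma * xi from the spiked Wishart model, eq. (3).

    A has M fixed, orthogonal columns (the planted directions), rescaled to norm
    sqrt(D) so that the planted eigenvalues of the covariance are of order D and
    separate from the O(1) Marchenko-Pastur bulk.
    """
    rng = rng or np.random.default_rng(0)
    spike = spike if spike is not None else np.linspace(1.0, 2.0, M)
    A, _ = np.linalg.qr(rng.standard_normal((D, M)))   # orthonormal columns
    A *= np.sqrt(D)                                    # columns of norm sqrt(D)
    c = rng.standard_normal((n, M)) * np.sqrt(spike)   # coefficients c^mu
    xi = rng.standard_normal((n, D))                   # Gaussian noise xi^mu
    return c @ A.T + sigma * xi
```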

Training. We train the autoencoder on the quadratic error in the one-pass (or online) limit of SGD, where a previously unseen input x is presented to the network at each step of SGD. This limit is extensively used to analyse the dynamics of non-linear networks for both supervised and unsupervised learning, and it has been shown experimentally to capture the dynamics of deep networks trained with data augmentation well [16]. The SGD weight increments for the network parameters are then given by:

$$\boldsymbol{w}^{k,\mu+1} = \boldsymbol{w}^{k,\mu} - \eta_w \left[\, g'(\lambda_k^{\mu})\,\frac{\boldsymbol{v}^{k,\mu}\cdot\big(\hat{\boldsymbol{x}}^{\mu} - \boldsymbol{x}^{\mu}\big)}{\sqrt{D}}\;\boldsymbol{x}^{\mu} + \kappa\,\boldsymbol{w}^{k,\mu}\right],$$

$$\boldsymbol{v}^{k,\mu+1} = \boldsymbol{v}^{k,\mu} - \eta_v \left[\, g(\lambda_k^{\mu})\,\big(\hat{\boldsymbol{x}}^{\mu} - \boldsymbol{x}^{\mu}\big) + \kappa\,\boldsymbol{v}^{k,\mu}\right], \qquad (4)$$

where $\lambda_k^{\mu} \equiv \boldsymbol{w}^{k,\mu}\cdot\boldsymbol{x}^{\mu}/\sqrt{D}$ and κ is the ℓ2 regularisation constant. In order to obtain a well-defined limit as $D \to \infty$, we rescale the learning rates of the encoder and the decoder with appropriate powers of D in terms of a single fixed constant η > 0.
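For concreteness, a sketch of a single step of this online SGD. The erf activation is an illustrative choice, the gradients follow from the quadratic loss and the architecture of equation (1), and the rescaling of the learning rates with D discussed above is left to the caller.

```python
import numpy as np
from scipy.special import erf

def g(x):
    return erf(x / np.sqrt(2))                 # sigmoidal activation (illustrative)

def dg(x):
    return np.sqrt(2 / np.pi) * np.exp(-x ** 2 / 2)

def sgd_step(x, W, V, lr_w, lr_v, weight_decay=0.0):
    """One step of online SGD on the quadratic reconstruction loss for a single
    fresh input x (shape (D,)); W, V are the (K, D) encoder/decoder weights."""
    D = x.shape[0]
    lam = W @ x / np.sqrt(D)                   # pre-activations lambda_k
    xhat = V.T @ g(lam)                        # reconstruction, eq. (1)
    err = xhat - x                             # gradient of 1/2 ||xhat - x||^2 wrt xhat
    grad_V = np.outer(g(lam), err)             # dL/dV
    grad_W = np.outer(dg(lam) * (V @ err), x) / np.sqrt(D)   # dL/dW
    W -= lr_w * (grad_W + weight_decay * W)
    V -= lr_v * (grad_V + weight_decay * V)
    return W, V
```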

3. Results

3.1. Dynamical equations to describe feature learning

The starting point for deriving the dynamical equations is the observation that the pmse defined in equation (2) depends on the inputs only via their low-dimensional projections onto the network's weights,

$$\lambda_k \equiv \frac{\boldsymbol{w}^{k}\cdot\boldsymbol{x}}{\sqrt{D}}, \qquad \nu_k \equiv \frac{\boldsymbol{v}^{k}\cdot\boldsymbol{x}}{D}. \qquad (5)$$

This allows us to replace the high-dimensional expectation over the inputs in equation (2) with a K-dimensional expectation over the local fields $\lambda_k$ and $\nu_k$. For Gaussian inputs drawn from (3), the fields $(\lambda, \nu)$ are jointly Gaussian and their distribution is entirely characterised by their second moments:

$$Q^{1}_{kl} \equiv \mathbb{E}\,[\lambda_k\lambda_l] = \frac{\boldsymbol{w}^{k}\,\Omega\,\boldsymbol{w}^{l}}{D},$$

$$R^{1}_{kl} \equiv \mathbb{E}\,[\nu_k\lambda_l] = \frac{\boldsymbol{v}^{k}\,\Omega\,\boldsymbol{w}^{l}}{D^{3/2}},$$

$$T^{1}_{kl} \equiv \mathbb{E}\,[\nu_k\nu_l] = \frac{\boldsymbol{v}^{k}\,\Omega\,\boldsymbol{v}^{l}}{D^{2}},$$

where $\Omega \equiv \mathbb{E}\,[\boldsymbol{x}\boldsymbol{x}^{\top}]$ is the covariance of the inputs. These macroscopic overlaps, often referred to as order parameters in the statistical physics of learning, are sufficient to evaluate the pmse. To analyse the dynamics of learning, the goal is then to derive a closed set of differential equations that describe how the order parameters evolve as the network is trained using SGD (4). Solving these equations then yields the performance of the network at all times. Note that feature learning requires that the weights of the encoder and the decoder move far away from their initial values; hence, we cannot resort to the framework of the neural tangent kernel or lazy learning [17, 18].

Instead, here we build on an approach that was pioneered by Saad and Solla [19, 20] and Riegler and Biehl [21] to analyse supervised learning in two-layer networks with uncorrelated inputs; Saad [22] provides a summary of the ensuing work. We face two new challenges when analysing autoencoders in this setup. First, autoencoders cannot be formulated in the usual teacher–student setup, where a student network is trained on inputs whose label is provided by a fixed teacher network, because it is impossible to choose a teacher autoencoder that exactly implements the identity function x → x with K < D. Likewise, a teacher autoencoder with random weights, the typical choice in supervised learning, is not an option, since such an autoencoder does not implement the identity function either. Instead, we bypass introducing a teacher and develop a description starting from the properties of the data. The second challenge is then to handle the non-trivial correlations in the inputs (3), where we build on recent advances in supervised learning on structured data [23–26].
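As an aside, in a finite-dimensional simulation the order parameters introduced above can be measured directly from the weights and the (empirical) input covariance; a sketch, using the field scalings assumed in equation (5):

```python
import numpy as np

def order_parameters(W, V, Omega):
    """Overlaps of encoder (W, shape (K, D)) and decoder (V, shape (K, D)) weights
    through the input covariance Omega, with the 1/sqrt(D) and 1/D field scalings
    assumed in eq. (5): Q1 = E[lambda lambda], R1 = E[nu lambda], T1 = E[nu nu]."""
    D = Omega.shape[0]
    Q1 = W @ Omega @ W.T / D
    R1 = V @ Omega @ W.T / D ** 1.5
    T1 = V @ Omega @ V.T / D ** 2
    return Q1, R1, T1
```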

3.1.1. Derivation: a sketch.

Here, we sketch the derivation of the equation of motion for the decoder–decoder order parameter; a complete derivation for it and all other order parameters is given in appendix B. We define the eigen-decomposition of the covariance matrix $\Omega$ in terms of its eigenvalues $\rho_\tau$ and eigenvectors $\boldsymbol{\Gamma}_\tau$, and the rotation of any vector $\boldsymbol{u} \in \mathbb{R}^D$ onto this basis as $u_\tau \equiv \boldsymbol{\Gamma}_\tau\cdot\boldsymbol{u}/\sqrt{D}$; the eigenvectors are orthogonal and suitably normalised. Using this decomposition, we can re-write $T^1$ and its update at step µ as:

[equation: $T^1$ rewritten in the eigenbasis of Ω, together with its SGD update at step µ]

We introduce the order-parameter density t(ρ), which depends on ρ and on the normalised number of steps µ/D, interpreted as a continuous time variable in the limit $D \to \infty$,

[equation: definition of the density t(ρ) as a sum over the eigenmodes τ whose eigenvalue $\rho_\tau$ falls into a small window around ρ]

where $\mathbb{1}$ denotes the indicator function and the limit of vanishing window size is taken after the thermodynamic limit $D \to \infty$. Similarly, r(ρ) and q(ρ) are introduced as the order-parameter densities corresponding to $R^1$ and $Q^1$, respectively. Now, inserting the update of v (4) into the expression for t(ρ) and evaluating the expectation over a fresh sample x, we obtain the dynamical equation:

[equation: the equation of motion for the density t(ρ)]

where we write (k ↔ l) to denote the same term with the indices k and l permuted. The order parameter is recovered by integrating the density t(ρ) over the spectral density $\mu_\Omega(\rho)$ of the covariance matrix (equation (9)).

We can finally close the equations by evaluating the remaining averages such as $\mathbb{E}\,[g(\lambda_k)\,g(\lambda_l)]$. For some activation functions, such as ReLU or the sigmoidal activation $g(x) = \mathrm{erf}(x/\sqrt{2})$, these integrals have an analytical closed form. In any case, the averages only depend on the second moments of the jointly Gaussian variables $(\lambda, \nu)$, which are given by the order parameters, allowing us to close the equations on the order parameters.
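As an example of such a closed form (a standard Gaussian integral used throughout the online-learning literature, not specific to this paper): for jointly Gaussian, zero-mean fields $u$ and $v$ with second moments $c_{uu}$, $c_{uv}$, $c_{vv}$ and $g(x) = \mathrm{erf}(x/\sqrt{2})$,

$$\mathbb{E}\,\big[g(u)\,g(v)\big] \;=\; \frac{2}{\pi}\,\arcsin\!\left(\frac{c_{uv}}{\sqrt{(1+c_{uu})(1+c_{vv})}}\right).$$

Replacing $u, v$ by the local fields and the $c$'s by the corresponding order parameters gives the closed-form averages that enter the equations of motion.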

We derived these equations using heuristic methods from statistical physics. While we conjecture that they are exact, we leave it to future work to prove their asymptotic correctness in the limit $D \to \infty$. We anticipate that the techniques used to prove the correctness of dynamical equations for supervised learning [27, 28] should extend to this case after controlling for the spectral dependence of the order parameters.

3.1.2. Separation between bulk modes and principal components.

We can exploit the separation of bulk and outlier eigenvalues to decompose the integral (9) into an integral over the bulk of the spectrum and a sum over the outliers. This results in a decomposition of the overlap $T^1$ as:

[equations: $T^1$ split into a bulk contribution, an integral over the Marchenko–Pastur bulk of the spectrum, and a spike contribution, a sum over the M outlier modes]

and likewise for the other order parameters. Leveraging the fact that the bulk eigenvalues are of order one, and that we take the $D \to \infty$ limit, we can write the equation for the bulk contribution as:

[equation (10): the simplified equation of motion for the bulk contribution]

On the other hand, the M outlier eigenvalues, i.e. the spikes, grow with D. We thus need to keep track of all the terms in their equations of motion. Similarly to equation (10), they are written:

[equation: the equation of motion for the spike contributions]

Note that the evolution of the bulk contributions is coupled to that of the spike contributions only through the expectations of the local fields. A similar derivation, which we defer to appendix B.3, can be carried out for $R^1$ and $Q^1$.

3.1.3. Empirical verification.

In figure 2, we plot the pmse of an autoencoder with K = 3 hidden neurons during training on a dataset drawn from equation (3) with M = 3. We show the pmse both as obtained from training the network using SGD (4) (crosses) and as obtained from integrating the equations of motion we just derived (lines). The agreement shows that the theoretical description in terms of bulk and spike modes captures the training dynamics well, even at an intermediate input dimension of D = 1000. In the following, we analyse these equations, and hence the behaviour of the autoencoder during training, in more detail.

[Figure 2. pmse of a sigmoidal autoencoder with K = 3 hidden neurons trained on data drawn from equation (3) with M = 3: simulations (crosses) against the integrated equations of motion (lines). Horizontal lines indicate the PCA reconstruction error with an increasing number of principal components.]

3.2. Non-linear autoencoders learn principal components sequentially

The dynamical equation of the encoder–encoder overlap $Q^1$, equation (B.28), takes the form

[equation: the equation of motion for the density q(ρ), in which every term is multiplied by the eigenvalue ρ]

The learning rate of each mode of $Q^1$ is thus rescaled by the corresponding eigenvalue, suggesting that the principal components corresponding to the different modes are learnt at different speeds, with the modes corresponding to the largest eigenvalues being learnt fastest. For the other order parameters $R^1$ and $T^1$, all the terms in the equations of motion (B.24) and (B.29) are rescaled by ρ except for one:

[equations: the equations of motion for r(ρ) and t(ρ), each containing a single term that is not rescaled by ρ]

Indeed, the pmse of the sigmoidal autoencoder shown in figure 2 goes through several sudden decreases, each preceded by a plateau during which the order parameters are quasi-stationary. Each transition is related to the network 'picking up' an additional principal component. By comparing the error of the network on the plateaus to the PCA reconstruction error with an increasing number of principal components (solid horizontal lines in figure 2), we confirm that the plateaus correspond to stationary points where the network has learnt (a rotation of) the first leading principal components. Whether or not the plateaus are clearly visible in the curves depends on the separation between the eigenvalues of the leading eigenvectors.

This sequential learning of principal components has also been observed in several other models of learning. It appears in unsupervised learning rules such as Sanger's rule [29] (also known as generalised Hebbian learning, cf appendix A), which was analysed by Biehl and Schlösser [30]. Gidel et al [31] and Gunasekar et al [8] found a sudden transition from the initial to the final error in linear models of the form $\hat{\boldsymbol{x}} = W\boldsymbol{x}$, but step-wise transitions for factorised models, i.e. linear autoencoders of the form $\hat{\boldsymbol{x}} = W_2 W_1 \boldsymbol{x}$. Saxe et al [6] also highlighted the sequential learning of principal components, sorted by their singular values.

The evolution of the bulk order parameters obeys

[equations: the equations of motion for the bulk contributions to the order parameters]

As a consequence, in the early stages of training and to first approximation, two of the bulk contributions follow an exponential decay towards 0. The characteristic time of this decay is given by an expectation over the local fields that depends only on the order parameters. In addition, the last equation implies that the evolution of the remaining bulk contribution is entirely determined by the dynamics of the spike modes.

3.3. The long-time dynamics of learning: align, then rescale

At long training times, we expect a sigmoidal autoencoder with untied weights to have retrieved the leading PCA subspace, since it achieves the PCA reconstruction error. This motivates an ansatz for the dynamical equations in which the network weights are proportional to the eigenvectors of the covariance Ω,

$$\boldsymbol{w}^{k} = \alpha_k\,\boldsymbol{\Gamma}_k, \qquad \boldsymbol{v}^{k} = \beta_k\,\boldsymbol{\Gamma}_k, \qquad k = 1, \ldots, K, \qquad (19)$$

where $\alpha_k$ and $\beta_k$ are scaling constants that evolve in time. With this ansatz, the order parameters are diagonal, and we are left with a reduced set of 2K equations describing the dynamics of $\alpha_k$ and $\beta_k$, which are given in equation (C.8) together with their detailed derivation.

We first verify the validity of the reduced equations (C.8) in describing the long-time dynamics of sigmoidal autoencoders. In figure 3(a), we show the generalisation dynamics of a sigmoidal autoencoder starting from random initial conditions (blue). We see two phases of learning: after an initial exponential decay of the pmse, the pmse decays as a power law. This latter decay is well captured by the evolution of the pmse of a model initialised with weights aligned to the leading PCs. We deduce that during the exponential decay of the error, the weights align to the leading PCA directions, and they stay aligned during the power-law decay of the error.

[Figure 3. (a) pmse of a sigmoidal autoencoder trained from random initial conditions (blue) and of a model initialised with weights aligned to the leading principal components. (b) Evolution of the encoder and decoder scale parameters during the second phase of learning. (c), (d) Power-law behaviour of the encoder and decoder weights with training time.]

After having recovered the right subspace, the network adjusts the norm of its weights to decrease the error further. A look at the dynamics of the scale parameters $\alpha_k$ and $\beta_k$ during this second phase of learning reveals that the encoder's weights shrink, while the decoder's weights grow, cf figure 3(b). Together, these changes lead to the power-law decay of the pmse. Note that we chose a dataset with only one leading eigenvalue so as to guarantee that the autoencoder recovers the leading principal component, rather than a rotation of it. A scaling analysis of the evolution of the scaling constants $\alpha_k$ and $\beta_k$ shows that the encoder weights decay, and the decoder weights grow, as power laws in time, see figures 3(c) and (d).

We can understand this behaviour by recalling that the sigmoidal activation is linear around the origin, where $g(x) \approx g'(0)\,x$. By shrinking the encoder's weights, the autoencoder thus performs a linear transformation of its inputs, despite the non-linearity. It recovers the linear performance of a network given by PCA if the decoder's weights grow correspondingly, so that the reconstructed input $\hat{\boldsymbol{x}}$ has the same norm as the input x.
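This can be made explicit in one line under the ansatz of equation (19), using the scale factors $\alpha_k, \beta_k$ introduced above (a sketch, keeping only the leading term of $g$ around the origin):

$$\hat{\boldsymbol{x}} \;=\; \sum_{k=1}^{K} \beta_k\,\boldsymbol{\Gamma}_k\; g\!\left(\alpha_k\,\frac{\boldsymbol{\Gamma}_k\cdot\boldsymbol{x}}{\sqrt{D}}\right) \;\xrightarrow[\;\alpha_k \to 0\;]{}\; \sum_{k=1}^{K} g'(0)\,\alpha_k\,\beta_k\;\frac{\boldsymbol{\Gamma}_k\cdot\boldsymbol{x}}{\sqrt{D}}\;\boldsymbol{\Gamma}_k,$$

i.e. a rank-K linear projection onto the leading eigenvectors; the products $\alpha_k\beta_k$ must stay finite as $\alpha_k$ shrinks, which is exactly the compensation between encoder and decoder discussed next.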

We can find a relation between the scaling constants $\alpha_k$ and $\beta_k$ at long times by using the ansatz of equation (19) in the expression for the pmse:

[equation: the pmse evaluated under the ansatz (19), written as a sum of a term that depends on $\alpha_k$ and $\beta_k$ and the rank-K PCA error]

where the second term is simply the rank-K PCA error. The first term has to be minimised to achieve the PCA error, i.e. it has to vanish for all k. For a linear autoencoder, we find that the product of the two scaling constants is fixed: as expected, any rescaling of the encoder's weights needs to be compensated in the decoder. For a sigmoidal autoencoder instead, we find

[equation: the corresponding relation between $\alpha_k$ and $\beta_k$ for the sigmoidal activation]

Note that at small $\alpha_k$ we recover the linear scaling.

The importance of untying the weights for sigmoidal autoencoders. The need to let the encoder weights shrink and the decoder weights grow to achieve the PCA error makes learning impossible in sigmoidal autoencoders with tied weights, where $\boldsymbol{v}^{k} = \boldsymbol{w}^{k}$. Such an autoencoder evidently cannot perform the rescaling required to enter the linear regime of its activation function, and is therefore not able to achieve the PCA error. Even worse, the learning dynamics shown in figure 4(a) show that a sigmoidal autoencoder with tied weights (blue) hardly achieves a reconstruction error better than chance, in stark contrast to the same network with untied weights (red).

[Figure 4. (a) pmse of sigmoidal autoencoders with tied (blue) and untied (red) weights. (b) pmse of ReLU autoencoders with and without trained biases, compared to the PCA error. (c) Evolution of the biases of a ReLU autoencoder during training.]

3.4. The importance of the bias in ReLU autoencoders

Sigmoidal autoencoders achieve the PCA error by exploiting the linear region of their activation function. We can hence expect that in order to successfully train a ReLU autoencoder, which is linear for all positive arguments, it is necessary to add biases at the hidden layer. The error curves for ReLU autoencoders shown in figure 4(b) indeed show that ReLU autoencoders achieve an error close to the PCA error only if the weights are untied and the biases are trained. If the biases are not trained, we found that a ReLU autoencoder with K hidden neurons generally achieves a reconstruction error roughly equal to the PCA error with K/2 components, presumably because only about half of the nodes have a positive pre-activation and can exploit the linear region of ReLU. Training the biases consistently resulted in large, positive biases at the end of training, figure 4(c). The large biases, however, result in a small residual error that negatively affects the final performance of a ReLU network when compared to the PCA performance, as can be seen by linearising the output of the ReLU autoencoder:

$$\hat{\boldsymbol{x}} = \sum_{k=1}^{K}\boldsymbol{v}^{k}\,\max\!\big(0,\,\lambda_k + b_k\big)\;\approx\;\sum_{k=1}^{K}\boldsymbol{v}^{k}\,\lambda_k \;+\; \sum_{k=1}^{K} b_k\,\boldsymbol{v}^{k},$$

where $b_k$ is the bias of the kth neuron.

3.5. Breaking the symmetry of SGD yields the exact principal components

Linear autoencoders do not retrieve the exact principal components of the inputs, but some rotation of them due to the rotational symmetry of the square loss [32]. While this does not affect performance, it would still be desirable to obtain the leading eigenvectors exactly to identify the important features in the data.

The symmetry between neurons can be broken in a number of ways. In linear autoencoders, previous work considered applying different amounts of regularisation to each neuron [4, 9, 33, 34] or optimising a modified loss [10]. In unsupervised learning, Sanger's rule [29] breaks the symmetry by making the update of the weights of the first neuron independent of all other neurons; the update of the second neuron only depends on the first neuron, etc. This idea was extended to linear autoencoders by [9, 10]. We can easily extend this approach to non-linear autoencoders by training the decoder weights following

$$\boldsymbol{v}^{k,\mu+1} = \boldsymbol{v}^{k,\mu} + \eta_v\, g(\lambda_k^{\mu})\left(\boldsymbol{x}^{\mu} - \sum_{j \leqslant k} \boldsymbol{v}^{j,\mu}\, g(\lambda_j^{\mu})\right), \qquad (23)$$

where the sum now only runs up to k instead of K. The update for the encoder weights $\boldsymbol{w}^{k}$ is unchanged from equation (4), except that the error term is replaced by the same truncated error $\boldsymbol{x} - \sum_{j \leqslant k}\boldsymbol{v}^{j}\,g(\lambda_j)$.
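A minimal NumPy sketch of this truncated update; the erf activation, the function name and the explicit learning rates are illustrative choices, and the cumulative sum implements the 'sum up to k' structure.

```python
import numpy as np
from scipy.special import erf

def truncated_sgd_step(x, W, V, lr_w, lr_v):
    """One step of the symmetry-breaking ('truncated') SGD update, cf. eq. (23).

    Neuron k is only trained against the reconstruction built from neurons
    1..k, as in Sanger's rule. W, V have shape (K, D); x has shape (D,).
    """
    D = x.shape[0]
    lam = W @ x / np.sqrt(D)                           # pre-activations lambda_k
    act = erf(lam / np.sqrt(2))                        # g(lambda_k)
    dact = np.sqrt(2 / np.pi) * np.exp(-lam ** 2 / 2)  # g'(lambda_k)
    partial = np.cumsum(act[:, None] * V, axis=0)      # sum_{j<=k} v^j g(lambda_j)
    err = x[None, :] - partial                         # truncated error for each k
    grad_w = (dact * np.einsum('kd,kd->k', V, err))[:, None] * x[None, :] / np.sqrt(D)
    V += lr_v * act[:, None] * err
    W += lr_w * grad_w
    return W, V
```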

We can appreciate the effect of this change by considering a fixed point of the vanilla SGD updates of equation (4), with encoder and decoder weights $W^{*}$ and $V^{*}$ for which the expected updates vanish. Multiplying this solution by an orthogonal matrix $O \in \mathbb{R}^{K\times K}$ acting on the hidden neurons, i.e. $W^{*} \to O\,W^{*}$ and $V^{*} \to O\,V^{*}$, yields another fixed point, since

[equation (24): the expected decoder update evaluated at the rotated weights, with the factor that reduces to the identity highlighted by an underbrace]

for the decoder, and likewise for the encoder weights. For truncated SGD (23), the underbraced term in equation (24) is no longer the identity, and the rotated weights are no longer a fixed point.

We show the pmse of a sigmoidal autoencoder trained with vanilla and truncated SGD in figure 5(a), together with the theoretical predictions from a modified set of dynamical equations. The theory captures both learning curves perfectly, and we can see that using truncated SGD incurs no performance penalty: both autoencoders achieve the PCA reconstruction error. Note that the reconstruction error can increase temporarily while training with the truncated algorithm, since the updates do not minimise the square loss directly. In this experiment, we trained the networks on a synthetic dataset whose eigenvectors are chosen to be sinusoidal (black lines in figure 5(b)). As desired, the weights of the network trained with truncated SGD converge to the exact principal components, while vanilla SGD only recovers a linear combination of them.

[Figure 5. (a) pmse of a sigmoidal autoencoder trained with vanilla and truncated SGD, together with the theoretical predictions. (b) Weights learnt by both algorithms, compared to the sinusoidal eigenvectors of the synthetic dataset (black lines). (d) Full and partial (first five neurons) reconstructions of a FashionMNIST test image for both algorithms.]

The advantage of recovering the exact principal components is illustrated with the partial reconstructions of a test image from the FashionMNIST database [35]. We train autoencoders with K = 64 neurons to reconstruct FashionMNIST images using the vanilla and truncated SGD algorithms until convergence. We show the reconstruction using all the neurons of the networks in the left column of figure 5(d). Since the networks achieve essentially the same performance, both reconstructions look similar. The partial reconstruction using only the first five neurons of the networks shown on the right are very different: while the partial reconstruction from truncated SGD is close to the original image, the reconstruction of the vanilla autoencoder shows traces of pants mixed with those of a sweatshirt.

4. Representation learning on realistic data under standard conditions

We have derived a series of results on the training of non-linear autoencoders in the case of synthetic, Gaussian datasets. Most of our simplifying assumptions are not valid on real datasets, because they contain only a finite number of samples, and because these samples are finite-dimensional and not normally distributed. Furthermore, we had to choose certain scalings of the pre-activations (5) and of the learning rates to ensure the existence of a well-defined asymptotic limit $D \to \infty$. Here, we show that the dynamical equations of section 3.1 describe the dynamics of an autoencoder trained on more realistic images rather well, even if we do not perform any explicit rescalings in the simulations.

Realistic data. In figure 6, we show with crosses the pmse of autoencoders of increasing size trained on 10k greyscale images from CIFAR10 [12]. The solid lines are obtained by integrating the equations of motion of appendix B, using the empirical covariance of the training images. The accuracy with which the equations describe the learning dynamics at all times implies that the low-dimensional projections λ and ν introduced in equation (5) are very close to being normally distributed. This is by no means obvious, since the images clearly do not follow a normal distribution and the weights of the autoencoder are obtained from training on those images, making them correlated with the data. Instead, figure 6 suggests that, from the point of view of a shallow, sigmoidal autoencoder, the inputs can be replaced by Gaussian vectors with the same mean and covariance without changing the dynamics of the autoencoder. In figure 8 in the appendix, we show that this behaviour persists for ReLU autoencoders and, as would be expected, for linear autoencoders. Nguyen [11] observed a similar phenomenon for tied-weight autoencoders with a number of hidden neurons that grows polynomially with the input dimension D. We also verified that several insights developed in the case of Gaussian data carry over to real data. In particular, we demonstrate in figure 9 that sigmoidal autoencoders require untied weights to learn, that ReLU networks require trained biases, and that the principal components of realistic data are also learnt sequentially.
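The comparison underlying this observation can be sketched as follows: train the same autoencoder once on the real inputs and once on a 'Gaussian clone' of the dataset with matched mean and covariance (here `X_real` is assumed to be an (N, D) array of, e.g., flattened greyscale CIFAR10 images; the loading of the data is omitted).

```python
import numpy as np

def gaussian_clone(X_real, n_samples, rng=None):
    """Gaussian-equivalent dataset: same mean and covariance as X_real."""
    rng = rng or np.random.default_rng(0)
    mean = X_real.mean(axis=0)
    cov = np.cov(X_real, rowvar=False)   # empirical D x D covariance
    return rng.multivariate_normal(mean, cov, size=n_samples)
```

Comparing the pmse curves obtained on `X_real` and on its clone is the test of Gaussian universality discussed above.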

[Figure 6. pmse of sigmoidal autoencoders of increasing size trained on 10k greyscale CIFAR10 images (crosses), compared to the prediction obtained by integrating the equations of motion with the empirical covariance of the training images (solid lines).]

Relation to the Gaussian equivalence. The fact that the learning dynamics of autoencoders trained on CIFAR10 is well captured by an equivalent Gaussian model is an example of the Gaussian equivalence principle, or Gaussian universality, which has received a lot of attention recently in the context of supervised learning with random features [36–38] or one- and two-layer neural networks [24, 25, 39, 40]. These works showed that the performance of these different neural networks is asymptotically well captured by an appropriately chosen Gaussian model for the data. While previous work on Gaussian equivalence in two-layer networks crucially assumes that the number of hidden neurons K is small compared to the input dimension, here we find that the Gaussian equivalence of autoencoders extends to essentially any number of hidden neurons, as we show in figure 8 in the appendix.

Robustness of the dynamical equations to rescalings. The scaling of the local fields in equation (5), and hence of the order parameters, is necessary to ensure that in the high-dimensional limit $D \to \infty$ all three terms in the pmse are non-vanishing. As we describe in the derivation, the rescaling of the learning rates is then necessary to obtain a closed set of equations. Nevertheless, we verified that the conclusions of the paper are not an artefact of these scalings. To this end, we trained a sigmoidal autoencoder on greyscale CIFAR10 using SGD without rescaling the eigenvalues of the inputs, the pre-activations, or the learning rates. As we show in figure 7, we recover the three key phenomena described in the paper: autoencoders with tied weights fail to achieve the rank-1 PCA error (orange); autoencoders with untied weights achieve the rank-K PCA error (blue); and for sigmoidal autoencoders, the encoder weights shrink while the decoder weights grow to reach the linear regime (inset). These results underline that our theoretical results describe autoencoders under 'standard conditions'.

[Figure 7. pmse of a sigmoidal autoencoder trained on greyscale CIFAR10 without any rescalings, with tied (orange) and untied (blue) weights; the inset shows the evolution of the encoder and decoder weight norms.]


Acknowledgments

We thank Galen Reeves for illuminating discussions. M R thanks the Data Science group at SISSA for its hospitality during a visit, where part of this research was performed. M R acknowledges funding from the French Agence Nationale de la Recherche under Grant ANR-19-P3IA-0001 PRAIRIE.

Appendix A: Online learning algorithms for PCA

We briefly review a number of unsupervised learning algorithms for principal component analysis, leading to Sanger's rule which is the inspiration for the truncated SGD algorithm of section 3.5.

Setup. We consider inputs $\boldsymbol{x} \in \mathbb{R}^D$ with the first two moments equal to

$$\mathbb{E}\,\boldsymbol{x} = 0, \qquad \mathbb{E}\,\boldsymbol{x}\boldsymbol{x}^{\top} = \Omega. \qquad (A.1)$$

We work in the thermodynamic limit $D \to \infty$.

Hebbian learning. The Hebbian learning rule yields an estimate of the leading PC by considering a linear model $y = \lambda \equiv \boldsymbol{w}\cdot\boldsymbol{x}/\sqrt{D}$ with $\boldsymbol{w} \in \mathbb{R}^D$ and the loss function $L = -\tfrac{1}{2}\lambda^{2}$. It updates the weights as:

$$\boldsymbol{w}^{\mu+1} = \boldsymbol{w}^{\mu} + \eta\,\lambda^{\mu}\,\frac{\boldsymbol{x}^{\mu}}{\sqrt{D}}. \qquad (A.2)$$

In other words, the Hebbian learning rule tries to maximise the second moment of the pre-activation λ. It makes intuitive sense that this update yields an estimate of the first principal component: consider the average of the loss over the inputs,

$$\mathbb{E}\,L = -\frac{1}{2}\,\mathbb{E}\,\lambda^{2} = -\frac{\boldsymbol{w}\,\Omega\,\boldsymbol{w}}{2D}.$$

If the weight vector w is equal to the τth eigenvector of Ω, the loss is proportional to the negative of the corresponding eigenvalue, and it is therefore minimised by converging to the eigenvector with the largest eigenvalue. Note also that, similar to what we find in the main text, this result suggests that the leading eigenvalue should grow with D in the large-D limit. The Hebbian rule has the well-known deficit that it imposes no bound on the growth of the weight vector. A natural remedy is to introduce some form of weight decay, such that

$$\boldsymbol{w}^{\mu+1} = \boldsymbol{w}^{\mu} + \eta\left(\lambda^{\mu}\,\frac{\boldsymbol{x}^{\mu}}{\sqrt{D}} - \kappa\,\boldsymbol{w}^{\mu}\right).$$

Oja's rule. Oja [41] offers a smart choice for the weight-decay constant κ. Consider the same linear model trained with a Hebbian rule and weight decay $\kappa = \lambda^{2}$:

$$\boldsymbol{w}^{\mu+1} = \boldsymbol{w}^{\mu} + \eta\,\lambda^{\mu}\left(\frac{\boldsymbol{x}^{\mu}}{\sqrt{D}} - \lambda^{\mu}\,\boldsymbol{w}^{\mu}\right).$$

The purpose of this choice can be appreciated by substituting the linear model and setting the update rule to zero, which yields (dropping the constants)

$$\Omega\,\boldsymbol{w} = \big(\boldsymbol{w}\,\Omega\,\boldsymbol{w}\big)\,\boldsymbol{w},$$

which is precisely the eigenvalue equation for Ω.

Oja's update rule is then derived from Hebb's rule by adding a normalisation to the Hebbian update (A.2),

$$\boldsymbol{w}^{\mu+1} = \frac{\boldsymbol{w}^{\mu} + \eta\, y^{\mu}\, \boldsymbol{x}^{\mu}}{\Big(\sum_i \big|w_i^{\mu} + \eta\, y^{\mu}\, x_i^{\mu}\big|^{p}\Big)^{1/p}}, \qquad (A.7)$$

with p an integer (in Oja's original paper, p = 2). This learning rule can be seen as a power-iteration update: substituting for y, we see that the numerator corresponds, on average, to repeated multiplication of the weight vector by the average covariance matrix. Expanding equation (A.7) around η = 0 yields Oja's algorithm (while also assuming that the weight vector is normalised to one). One can show that Oja's rule indeed converges to the PC with the largest eigenvalue by allowing the learning rate to vary with time and imposing the mild conditions

$$\sum_{\mu} \eta_{\mu} = \infty, \qquad \sum_{\mu} \eta_{\mu}^{2} < \infty.$$

Sanger's rule [29] is also known as generalised Hebbian learning in the literature and extends the idea behind Oja's rule to networks with multiple outputs $y_k = \boldsymbol{w}^{k}\cdot\boldsymbol{x}/\sqrt{D}$, with k = 1, …, K. It allows estimation of the first few leading eigenvectors using the update rule

$$\boldsymbol{w}^{k,\mu+1} = \boldsymbol{w}^{k,\mu} + \eta\, y_k^{\mu}\left(\frac{\boldsymbol{x}^{\mu}}{\sqrt{D}} - \sum_{j \leqslant k} y_j^{\mu}\,\boldsymbol{w}^{j,\mu}\right).$$

Note that the second sum extends only up to k; the dynamics of the kth weight vector hence only depends on the weight vectors $\boldsymbol{w}^{j}$ with j ⩽ k. This dependence is one of the similarities between Sanger's rule and the Gram–Schmidt procedure for orthogonalising a set of vectors (cf section 0.6.4 of [42]). The dynamics of Sanger's rule in the online setting were studied by [30, 43]. Sanger's rule reduces to Oja's rule for K = 1.
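A compact NumPy sketch of Sanger's rule for K output neurons; the 1/√D scaling of the outputs follows the convention used above, and the function name is an illustrative choice.

```python
import numpy as np

def sanger_step(x, W, lr):
    """One online step of Sanger's rule (generalised Hebbian learning).

    W has shape (K, D); row k converges to the k-th leading eigenvector of the
    input covariance. The cumulative sum implements the 'sum over j <= k' term
    that breaks the symmetry between neurons.
    """
    D = x.shape[0]
    y = W @ x / np.sqrt(D)                     # outputs y_k
    back = np.cumsum(y[:, None] * W, axis=0)   # sum_{j<=k} y_j w^j
    W += lr * y[:, None] * (x[None, :] / np.sqrt(D) - back)
    return W
```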

Appendix B: Online learning in autoencoders

Here we derive the set of equations tracking the dynamics of shallow non-linear autoencoders trained in the one-pass limit of stochastic gradient descent (SGD). These equations are derived for Gaussian inputs x drawn from the generative model of equation (3) but, as we discuss in section 4, they also accurately capture the dynamics of training on real data.

B.1. Statics

The starting point of the analysis is the definition of the test error (2) and the identification of the order parameters, i.e. low-dimensional overlaps of high-dimensional vectors,

[equations: definitions of the order parameters as rescaled overlaps of the encoder and decoder weight vectors]

First we introduce the local fields corresponding to the encoder and decoder's weights,

$$\lambda_k \equiv \frac{\boldsymbol{w}^{k}\cdot\boldsymbol{x}}{\sqrt{D}}, \qquad \nu_k \equiv \frac{\boldsymbol{v}^{k}\cdot\boldsymbol{x}}{D},$$

as well as the order parameter tracking the overlap between the decoder weights:

[equation: definition of the overlap of the decoder weights]

Note the unusual scaling of $\nu_k$ with D; the intuition here is that in shallow autoencoders, the second-layer weights will be strongly correlated with the eigenvectors of the input–input covariance, and hence with the inputs, requiring the scaling 1/D instead of $1/\sqrt{D}$. This scaling is also the one that yields a set of self-consistent equations in the limit $D \to \infty$. The generalisation error becomes

$$\mathrm{pmse} = \mathbb{E}\,\lVert\boldsymbol{x}\rVert^{2} \;-\; 2D\sum_{k}\mathbb{E}\,\big[\nu_k\, g(\lambda_k)\big] \;+\; \sum_{k,l}\boldsymbol{v}^{k}\cdot\boldsymbol{v}^{l}\;\mathbb{E}\,\big[g(\lambda_k)\,g(\lambda_l)\big].$$

Crucially, the local fields λ and ν are jointly Gaussian since the inputs are Gaussian, so the pmse can be written in terms of the second moments of these fields only:

$$Q^{1}_{kl} \equiv \mathbb{E}\,[\lambda_k\lambda_l] = \frac{\boldsymbol{w}^{k}\,\Omega\,\boldsymbol{w}^{l}}{D},$$

$$R^{1}_{kl} \equiv \mathbb{E}\,[\nu_k\lambda_l] = \frac{\boldsymbol{v}^{k}\,\Omega\,\boldsymbol{w}^{l}}{D^{3/2}},$$

$$T^{1}_{kl} \equiv \mathbb{E}\,[\nu_k\nu_l] = \frac{\boldsymbol{v}^{k}\,\Omega\,\boldsymbol{v}^{l}}{D^{2}}.$$

The full expressions of the averages $\mathbb{E}\,[\nu_k\,g(\lambda_l)]$ and $\mathbb{E}\,[g(\lambda_k)\,g(\lambda_l)]$ in terms of the order parameters are given, for various activation functions, in appendix D. Note the different scalings of these overlaps with D; they are a direct consequence of the different scalings of the local fields $\lambda_k$ and $\nu_k$. In order to derive the equations of motion, it is useful to introduce the decomposition of Ω in its eigenbasis:

$$\Omega = \frac{1}{D}\sum_{\tau=1}^{D}\rho_{\tau}\,\boldsymbol{\Gamma}_{\tau}\,\boldsymbol{\Gamma}_{\tau}^{\top}.$$

The eigenvectors $\boldsymbol{\Gamma}_{\tau}$ are orthogonal and normalised such that $\boldsymbol{\Gamma}_{\tau}\cdot\boldsymbol{\Gamma}_{\tau'} = D\,\delta_{\tau\tau'}$. We further define the rotation of any vector $\boldsymbol{u} \in \mathbb{R}^{D}$ onto this basis as $u_{\tau} \equiv \boldsymbol{\Gamma}_{\tau}\cdot\boldsymbol{u}/\sqrt{D}$. The order parameters are then written:

[equation: the order parameters rewritten as sums over the eigenmodes τ of the rotated weights, weighted by the eigenvalues $\rho_\tau$]

B.2. Averages over non-linear functions of weakly correlated variables

We briefly recall expressions for computing averages of non-linear functions of weakly correlated random variables, which were also used in [24]; they will be used extensively in the following derivation. Consider two weakly correlated, jointly Gaussian random variables x and y with covariance matrix

$$C = \begin{pmatrix} c_{11} & \epsilon\, c_{12} \\ \epsilon\, c_{12} & c_{22} \end{pmatrix},$$

with $\epsilon \ll 1$ encoding the weak correlation between x and y. Then the expectation of the product of any two real-valued functions $f_1$ and $f_2$ is given, to leading order in ε, by:

$$\mathbb{E}\,\big[f_1(x)\,f_2(y)\big] = \mathbb{E}\,\big[f_1(x)\big]\;\mathbb{E}\,\big[f_2(y)\big] \;+\; \epsilon\, c_{12}\;\mathbb{E}\,\big[f_1'(x)\big]\;\mathbb{E}\,\big[f_2'(y)\big] \;+\; \mathcal{O}\!\left(\epsilon^{2}\right). \qquad \text{(B.12)}$$

Similarly, for three jointly Gaussian random variables $x_1, x_2, x_3$, with $x_1$ and $x_2$ weakly correlated with $x_3$ but not necessarily with each other, i.e. $\mathbb{E}\,[x_1 x_3] = \epsilon\,c_{13}$ and $\mathbb{E}\,[x_2 x_3] = \epsilon\,c_{23}$ while $\mathbb{E}\,[x_1 x_2] = c_{12}$, one can compute the expectation of the product of three real-valued functions $f_1, f_2, f_3$ to leading order in ε as:

$$\mathbb{E}\,\big[f_1(x_1)\,f_2(x_2)\,f_3(x_3)\big] = \mathbb{E}\,\big[f_1(x_1)\,f_2(x_2)\big]\,\mathbb{E}\,\big[f_3(x_3)\big] \;+\; \epsilon\,c_{13}\,\mathbb{E}\,\big[f_1'(x_1)\,f_2(x_2)\big]\,\mathbb{E}\,\big[f_3'(x_3)\big] \;+\; \epsilon\,c_{23}\,\mathbb{E}\,\big[f_1(x_1)\,f_2'(x_2)\big]\,\mathbb{E}\,\big[f_3'(x_3)\big] \;+\; \mathcal{O}\!\left(\epsilon^{2}\right).$$
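A quick Monte-Carlo check of the two-variable expansion (B.12) above; the test functions and the covariance values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, c11, c22, c12 = 0.05, 1.0, 1.5, 0.8
cov = np.array([[c11, eps * c12], [eps * c12, c22]])
x, y = rng.multivariate_normal([0, 0], cov, size=2_000_000).T

f1, df1 = np.tanh, lambda u: 1 - np.tanh(u) ** 2   # arbitrary test functions
f2, df2 = lambda u: u ** 3, lambda u: 3 * u ** 2

lhs = np.mean(f1(x) * f2(y))
rhs = np.mean(f1(x)) * np.mean(f2(y)) + eps * c12 * np.mean(df1(x)) * np.mean(df2(y))
print(lhs, rhs)   # the two estimates agree up to O(eps^2) and Monte-Carlo error
```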

B.3. Derivation of dynamical equations

At the µth step of training, the SGD update for the rotated weights reads

[equations (B.14): the SGD updates of equation (4) rewritten for the rotated weights $w^{k}_{\tau}$ and $v^{k}_{\tau}$]

To keep the notation concise, we drop the weight-decay term in the following analysis. We can now compute the update of the various order parameters by inserting the above into their definitions. The stochastic process described by the resulting equations concentrates, in the $D \to \infty$ limit, on its expectation over the input distribution. By performing the average over a fresh sample x, we therefore obtain a closed set of deterministic ODEs tracking the dynamics of training in the high-dimensional limit. We also show in simulations that these equations capture the dynamics well even at finite dimension.

Update of the overlap of the decoder's weights. Let us start with the update equation for the decoder–decoder overlap, which is easily obtained using the SGD update (B.14),

[equation: the discrete update of the decoder–decoder overlap at step µ]

By taking the expectation over a fresh sample x and the $D \to \infty$ limit of one-pass SGD, we obtain an equation that is continuous in the normalised number of steps µ/D, which we interpret as a continuous time variable:

[equation: the continuous-time equation of motion for the decoder–decoder overlap]

where the notation (k ↔ l) indicates an additional term identical to the first, except for the exchange of the indices k and l. This equation requires evaluating the averages $\mathbb{E}\,[g(\lambda_k)\,g(\lambda_l)]$ and $\mathbb{E}\,[\nu_k\, g(\lambda_l)]$, which are two-dimensional Gaussian integrals with covariance matrix given by the order parameters. We give their expressions in appendix D. We also note that the second-order term in the learning rate is sub-leading in the high-dimensional limit we work in.

Update of the encoder–decoder overlap. Its update equation is obtained similarly, by using the SGD update (B.14) and taking the high-dimensional limit,

[equation: the update of the encoder–decoder overlap]

To make progress, we have to evaluate averages involving the rotated inputs. For this purpose, it is crucial to notice that the rotated inputs $x_{\tau}$ are only weakly correlated with the local fields:

[equation: the covariances between the rotated input $x_{\tau}$ and the local fields λ and ν, which are small at large D]

Then, we can compute the expectation $\mathbb{E}\,[x_{\tau}\, g(\lambda_l)]$ using the results of appendix B.2, i.e. equation (B.12) with $f_1$ the identity function and $f_2 = g$, thus obtaining:

The dynamics of representation learning in shallow, non-linear autoencoders (265)

Inserting the above expression into the update of this order parameter yields:

The dynamics of representation learning in shallow, non-linear autoencoders (267)

Notice that this equation contains a term which cannot be simply expressed in terms of the order parameters alone. Similar terms appear in the equations for the other overlaps. To close the equations, we are thus led to introduce order-parameter densities in the next step.

Order parameters as integrals over densities. To proceed, we introduce order-parameter densities, which depend on the eigenvalue ρ of the input covariance matrix and on the normalised number of steps:

The dynamics of representation learning in shallow, non-linear autoencoders (274)

where the indicator function restricts the sum to eigenvalues close to ρ, and the corresponding limit is taken after the thermodynamic limit. The order parameters are obtained by integrating these densities over the spectrum of the input–input covariance matrix:

The dynamics of representation learning in shallow, non-linear autoencoders (277)

It follows that, by tracking the dynamics of these densities, we obtain the dynamics of the corresponding overlaps.
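The fact that an overlap is recovered by integrating a mode-resolved density against the spectrum can be checked directly at finite D. The short Python sketch below uses our own notation and normalisation, which need not coincide with those used above; it verifies that a quadratic overlap of a weight vector with the covariance matrix equals the sum over eigenmodes of the eigenvalue times the squared projection of the weight on the corresponding eigenvector.

import numpy as np

D, K = 500, 2
rng = np.random.default_rng(0)

# A covariance matrix with a non-trivial spectrum (Wishart) and its eigendecomposition.
A = rng.standard_normal((D, 2 * D))
Omega = A @ A.T / (2 * D)
rho, Gamma = np.linalg.eigh(Omega)             # eigenvalues rho_tau, eigenvectors in the columns of Gamma

W = rng.standard_normal((K, D)) / np.sqrt(D)   # an encoder weight matrix (illustrative initialisation)

proj = Gamma.T @ W[0]                          # projections of w_0 on the eigenvectors
density = proj ** 2                            # mode-resolved contribution of w_0 (up to normalisation)

Q_direct = W[0] @ Omega @ W[0]                 # overlap computed directly
Q_from_density = np.sum(rho * density)         # overlap as an integral of the density over the spectrum
print(Q_direct, Q_from_density)                # identical up to floating-point error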

Dynamics of t. The dynamics of t is obtained straightforwardly from equation (B.21) as:

The dynamics of representation learning in shallow, non-linear autoencoders (282)

where we defined the rescaled eigenvalue ρ. Here again, straightforward algebra shows that the second-order term is sub-leading in the limit of high input dimensions.

Dynamics of the encoder density. In a similar way, we compute the dynamical equation for the density associated with the encoder's weights by using the update of the encoder's weights in equation (B.14) and taking the expectation over the input distribution,

The dynamics of representation learning in shallow, non-linear autoencoders (286)

The above involves the expectations:

The dynamics of representation learning in shallow, non-linear autoencoders (287)

To compute these, we use our results for weakly correlated variables from appendix B.2 and obtain:

The dynamics of representation learning in shallow, non-linear autoencoders (288)

Inserting these results into the equation of motion, we find:

The dynamics of representation learning in shallow, non-linear autoencoders (290)

Note that, as explained in the main text, in order to obtain a well-defined limit of high input dimension, we rescale the learning rate of the encoder's weights with the input dimension.

Dynamics of the remaining density. Lastly, we obtain the equation for the remaining order-parameter density in the same way as the two previous ones: we use equation (B.14) and evaluate the expectation over a fresh sample x.

The dynamics of representation learning in shallow, non-linear autoencoders (295)

We can evaluate these expectations as before, using the results of appendix B.2. We thus obtain the equation of motion as:

The dynamics of representation learning in shallow, non-linear autoencoders (297)

B.4.Simplification of the equations for spiked covariance matrices

In the case of the synthetic dataset defined in the main text, the spectrum of the covariance matrix decomposes into a bulk of small eigenvalues with continuous support and a few outlier eigenvalues of much larger order (see figure 1). This allows us to obtain, for every order parameter, M equations controlling the evolution of the spike modes and one equation controlling the evolution of the bulk. Indeed, we can simplify the equations of motion for the bulk modes by neglecting sub-leading terms. Doing so leads to:

The dynamics of representation learning in shallow, non-linear autoencoders (303)

We can define bulk order parameters, which collect the contribution of the bulk modes, as:

The dynamics of representation learning in shallow, non-linear autoencoders (304)

Note that even though the integrands are of order 1/D, since we are integrating over a large number, O(D), of modes, the result is of order 1. The equation for these overlaps is thus obtained as the integrated form of equation (B.31):

The dynamics of representation learning in shallow, non-linear autoencoders (306)

The order parameters can be decomposed into the contributions from the bulk eigenvalues and from the spikes as:

The dynamics of representation learning in shallow, non-linear autoencoders (307)

and similarly for the other order parameters. The M spike modes obey the full equations (B.24), (B.28) and (B.29). In particular, it is clear from equation (B.33) that for any non-zero weight-decay constant κ, the bulk contribution decays to 0 on a characteristic time κ⁻¹. Further note that the equations for the remaining bulk order parameters also result in an exponential decay of the bulk modes towards 0.
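The bulk/outlier structure that underlies this simplification is easy to visualise numerically. The Python sketch below builds one common spiked-covariance construction, M orthonormal signal directions with large variance on top of isotropic noise, and prints the resulting spectrum; the parametrisation and the variance values are illustrative assumptions and need not match the synthetic dataset used above.

import numpy as np

D, M, n = 400, 3, 4000
rng = np.random.default_rng(1)

A = np.linalg.qr(rng.standard_normal((D, M)))[0]       # orthonormal spike directions
spike_vars = np.array([40.0, 25.0, 10.0])              # variances of the spike modes (assumed values)
X = (rng.standard_normal((n, M)) * np.sqrt(spike_vars)) @ A.T + rng.standard_normal((n, D))

Omega = X.T @ X / n                                    # empirical covariance
rho = np.linalg.eigvalsh(Omega)[::-1]                  # eigenvalues, largest first
print("outliers:", rho[:M])                            # M large outlier eigenvalues
print("bulk edge:", rho[M], "smallest:", rho[-1])      # bulk of small eigenvalues with continuous support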

Appendix C: Reduced equations for long-time dynamics of learning

In this section, we illustrate how to use the equations of motion to study the long-time training dynamics of shallow non-linear autoencoders. At sufficiently long times, we have seen that the weights of the network span the leading PC subspace of the covariance matrix. We restrict ourselves to the case in which the eigenvectors are recovered directly; the case in which a rotation of them is found follows straightforwardly. We also restrict ourselves to the matched case K = M.

Consider the following ansatz for the weight configuration: the encoder and decoder weight vectors are taken proportional to the leading eigenvectors of the covariance matrix, with dynamical constants controlling the norms of the weights. Using this ansatz, the order parameters take the form:

The dynamics of representation learning in shallow, non-linear autoencoders (317)

Using the above, we can rewrite the generalisation error as:

The dynamics of representation learning in shallow, non-linear autoencoders (319)

The first observation is that the generalisation error reaches a minimum when the term in brackets, fk, is minimised. Minimising it leads to a condition on the scaling constants at which the generalisation error is equal to the PCA reconstruction error.
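The PCA reconstruction error used as the reference point here can be computed directly from the spectrum of the input covariance: it is the sum of the eigenvalues discarded by the rank-K projection, up to the overall prefactor chosen in the definition of the reconstruction loss. A minimal Python check on a toy covariance of our own choosing:

import numpy as np

def pca_error(Omega, K):
    # Rank-K PCA reconstruction error: sum of the discarded eigenvalues
    # (without the overall normalisation of the loss).
    rho = np.linalg.eigvalsh(Omega)[::-1]      # eigenvalues, largest first
    return rho[K:].sum()

rng = np.random.default_rng(2)
D, K, n = 100, 4, 20000
C = rng.standard_normal((D, D)) / np.sqrt(D)
Omega = C @ C.T + np.eye(D)                    # a positive-definite toy covariance
X = rng.multivariate_normal(np.zeros(D), Omega, size=n)
U = np.linalg.eigh(Omega)[1][:, -K:]           # leading K eigenvectors
residual = X - X @ U @ U.T                     # residual of the rank-K projection
print(pca_error(Omega, K), (residual ** 2).sum(axis=1).mean())   # agree up to sampling error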

For a linear activation function, we have

The dynamics of representation learning in shallow, non-linear autoencoders (323)

In this case, the generalisation error is a function only of the product of the encoder and decoder scaling constants, and is minimal and equal to the PCA reconstruction error whenever:

The dynamics of representation learning in shallow, non-linear autoencoders (326)

The above is nothing but the intuitive result that for a linear autoencoder, any rescaling of the encoder's weights can be compensated by the decoder's weights.
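This invariance is immediate to verify numerically; the following Python lines check that multiplying the encoder by a constant c and dividing the decoder by the same constant leaves the output of a linear autoencoder unchanged (the matrices and the value of c are arbitrary illustrative choices).

import numpy as np

rng = np.random.default_rng(3)
D, K, c = 50, 3, 7.5
W = rng.standard_normal((K, D))                # encoder
V = rng.standard_normal((D, K))                # decoder
x = rng.standard_normal(D)
out = V @ (W @ x)
out_rescaled = (V / c) @ ((c * W) @ x)         # rescale encoder up, decoder down
print(np.allclose(out, out_rescaled))          # True: only the product of the two scales matters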

For a sigmoidal autoencoder, instead, the expressions for the integrals given in appendix D yield:

The dynamics of representation learning in shallow, non-linear autoencoders (327)

Differentiating fk with respect to the scaling constants and requiring the derivative to vanish, we find the equation:

The dynamics of representation learning in shallow, non-linear autoencoders (329)

We point out that this solution is independent of the signs of the scaling constants, since recovering the eigenvectors or minus the eigenvectors is equivalent. Replacing this solution in the expression for the generalisation error, we find that the autoencoder reaches the same reconstruction error as the one achieved by PCA.
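The mechanism at play, shrinking the argument of the sigmoid so that it operates in its linear regime while the decoder grows to compensate, can be illustrated with a one-dimensional toy computation. The Python sketch below uses tanh as a stand-in sigmoid and an optimal scalar decoder; the activation and normalisations of the analysis above may differ.

import numpy as np

rng = np.random.default_rng(4)
z = np.sqrt(5.0) * rng.standard_normal(1_000_000)     # a single 'principal component' with variance 5

for a in (2.0, 0.5, 0.1, 0.02):                       # encoder scale
    gz = np.tanh(a * z)
    b = np.mean(z * gz) / np.mean(gz ** 2)            # optimal decoder scale for this encoder scale
    err = np.mean((z - b * gz) ** 2)
    print(a, b, err)                                  # the error decreases towards the linear optimum as a shrinks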

Ansatz in the equations of motion. The ansatz of equation (19) results in order parameters of the form:

The dynamics of representation learning in shallow, non-linear autoencoders (335)

Inserting these expressions into the full dynamical equations of motion, equations (B.24), (B.28) and (B.29), allows us to find a simplified set of 2K ODEs for the scaling constants of the encoder and the decoder:

The dynamics of representation learning in shallow, non-linear autoencoders (338)

The validity of these equations is verified in figure 3, where we compare the result of integrating them with simulations. We provide a Mathematica notebook that allows one to derive these equations in the Github repository of this paper.

To validate our calculations for autoencoders with finite D, we trained an autoencoder from three distinct initial conditions: a random one; an informed configuration, in which the weights are initialised as in equation (19); and a perfect one, in which, additionally, the scaling constants are initialised from the minimal-error solution of equation (C.6). The first observation is that the generalisation error of the network started with perfect initial conditions remains at the PCA reconstruction error, thus validating equation (C.6). Since we have a noiseless dataset, the PCA error is given by the numerical error; we note, however, that the same conclusion carries over to noisy datasets with finite PCA reconstruction error. In figure 10, we plot the generalisation error of the randomly initialised and informed autoencoders on the left, as well as the evolution of the overlaps between the weights and the eigenvectors during training on the right. We can see that the network with random initial weights learns in two phases: first, the generalisation error decays exponentially until it reaches the error of the aligned network. Then, its error coincides with that of the informed network and decays as a power law. This shows that during the first phase of exponential learning, the network recovers the leading PC subspace; this occurs early in training. In the second phase, the network adjusts the norms of the weights to reach PCA performance. As discussed in the main text, the sigmoidal network achieves this by shrinking the encoder's weights, so as to recover the linear regime of the activation function, while keeping the norm of the reconstruction similar to that of the input by growing the decoder's weights.

Figure 10. Left: generalisation error of the randomly initialised and informed autoencoders. Right: evolution of the overlaps between the weights and the eigenvectors during training.

Appendix D: Analytical formula of the integral

In this appendix, the covariance matrix of the variables appearing in each expectation is given by the corresponding order parameters; we list below the resulting closed-form expressions for the different activation functions.
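The closed-form expressions listed below can be checked numerically with a short Monte Carlo routine that samples the local fields from a Gaussian whose covariance is built from the order parameters. The routine and the example values below are our own; tanh is used as a stand-in sigmoid.

import numpy as np

def gaussian_expectation(f, C, n=1_000_000, seed=0):
    # Monte Carlo estimate of E[f(u, v)] for (u, v) ~ N(0, C), with C the 2x2
    # covariance matrix built from the order parameters.
    rng = np.random.default_rng(seed)
    u, v = rng.multivariate_normal(np.zeros(2), C, size=n).T
    return np.mean(f(u, v))

C = np.array([[1.3, 0.4],
              [0.4, 0.9]])                                        # illustrative covariance values
print(gaussian_expectation(lambda u, v: np.tanh(u) * v, C))       # a sigmoidal-type integral
print(gaussian_expectation(lambda u, v: u * v, C))                # linear case: approaches C[0, 1]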

Sigmoidal activation function

The dynamics of representation learning in shallow, non-linear autoencoders (355)

The dynamics of representation learning in shallow, non-linear autoencoders (356)

The dynamics of representation learning in shallow, non-linear autoencoders (357)

The dynamics of representation learning in shallow, non-linear autoencoders (358)

The dynamics of representation learning in shallow, non-linear autoencoders (359)

Linear activation function

The dynamics of representation learning in shallow, non-linear autoencoders (360)

The dynamics of representation learning in shallow, non-linear autoencoders (361)

The dynamics of representation learning in shallow, non-linear autoencoders (362)

ReLU activation function

The dynamics of representation learning in shallow, non-linear autoencoders (363)

The dynamics of representation learning in shallow, non-linear autoencoders (364)

