Learning & Symmetry Group: 1. Generalization and Symmetry

Carles Gelada; Jacob Buckman

1.1) The Learning of Functions

At its core, machine learning is about fitting a function to data. There is a target function \(f: X\to Y\) between two spaces that we’d like to learn. The input space \(X\) might be text sequences, images encoded as arrays of pixels, sound waves, etc. Similarly, the output space \(Y\) could be numbers, text, images, etc.

For example, one might want to learn a function which takes as input an image of a person, and outputs an estimate of that person’s height: for this application, the input space would be the space of images, and the output space would be the space of positive real numbers. In image reconstruction, the input is a corrupted/low-quality image and the output is the original, high-quality version, so image reconstruction is a mapping from the space of images to itself. In audio transcription, the input is a person speaking and the output is the text of what was said, so audio transcription maps from the space of sound waves to the space of text strings. If we are instead interested in audio synthesis, we can simply reverse this, and instead learn a mapping from the space of text strings to the space of sound waves. You get the idea. This is a very flexible approach, and it can model all sorts of tasks.

One particularly interesting setting is learning functions \(f : \R^2 \to \R^3\). This setting is useful for building intuition, because these functions can be rendered as an image. Each input \((x_1,x_2)\in \R^2\) gives the coordinates of a particular pixel, and each output \((y_1, y_2, y_3) \in \R^3\) gives the RGB color of that pixel.

The central challenge of machine learning is to discover \(f\) when we only have knowledge of its outputs on a set \(\{x_0, \cdots, x_n \} = S\subset X\), which we call the support set. Concretely, we have access to dataset of input-output pairs \((x_0, f(x_0) ), \cdots, (x_n, f(x_n))\). A way to think about the dataset is that we are only allowed to query the outputs of \(f\) on inputs that belong to \(S\). So \(f|_S\), the restriction of \(f\) to \(S\), is a mathematically convinient way to talk about the dataset.¹

For example, if we are trying to learn a function \(f : \R^2 \to \R^3\), the support \(S\) would only contain a subset of all possible coordinates. Below, the image on the left shows a dataset \(f|_S\) whose support \(S\) is a patch of pixels. The job of a machine learning algorithm would be to use \(f|_S\) to reconstruct the rest of the image function \(f\) seen on the right.

The basic approach of machine learning is to construct a large space of functions, and then find a function in this space which agrees with \(f\) on the support \(S\). For this to work well, we must be confident that the space of functions is large enough that it contains \(f\) (or at least a close approximation). The most common way to construct such function spaces is via a parameterization \(h: X \times \R^p \to Y\), where \(\R^p\) is the space of parameters. By picking different values \(w\in \R^p\) we can control the behavior of the function \(h_w: X \to Y\), defined as \(h_w(x) = h(x, w)\). Two examples of function spaces constructed this way are:

Polynomials of degree \(d\). We define \(h: (\R \times \R^{d+1}) \to \R\) as \(h(x, w) = w_0 + w_1 x + \cdots + x^d w_d\).
Neural networks. We define \(h(x, w) = w_l \sigma( w_{l-1} \sigma(\cdots \sigma(w_1 x)))\) where \(l\) is the number of layers, \(w_i \in \R^{h_{i+1}\times h_i}\) is a linear map from \(\R^{s_{i}}\) to \(\R^{s_{i+1}}\) and where \(\sigma\) is any pointwise nonlinearity, like tanh or ReLU.

Once we have chosen our parameterized function space \(h\), our learning algorithm finds a value of the parameters \(w\in \R^p\) such that \(h_w\) fits the dataset \(f|_S\). With that aim in mind, we pick a differentiable loss function \(L(f|_S, w)\) which measures how successful \(h_w\) is at fitting the values of \(f\) on the points \(S\). For example, if \(Y\) is a scalar, a common way to define the loss function is \(L(f|_S, w) = \frac{1}{|S|} \sum_{i=1}^{|S|} \left(h_w(x_i) - f(x_i)\right)^2\). A loss function must take its minumum possible value if and only if \(h_w(x_i) = f(x_i), \; \forall x_i \in S\), i.e., when \(h_w\) fits the dataset perfectly.

Learning is often described as the problem of finding the optimal parameters, defined as: \[\underset{{w\in \R^p}}{\text{argmin}} \: L(f|_S, w)\] But in the context of modern machine learning, this is a fundamentally flawed definition. This is because, for the function spaces \(h\) that we use in practice, there is a vast set of parameters \(w\) that all equivalently minimize the loss, but which produce functions \(h_w\) that behave completely differently outside of \(S\). The implications of this will be discussed in greater depth in Section 1.3; for now, all we need to understand is that it’s not enough to talk about the argmin, we need to be thinking about which argmin \(w\) we pick. To understand a machine learning algorithm, it is important to look carefully at the specific method used to choose a \(w\) that fits the data.

In any modern deep learning scenario, that method is gradient descent.² It works by changing the weights through time according to the differential equation \(\frac{d w}{d t} = -\nabla_w L(f|_S, w)\), where \(\nabla_w L(f|_S, w)\) is the gradient of \(L\) w.r.t. the weights \(w\). The gradient can be thought of as the best linear approximation to the loss function at the point \(w\): it is a vector pointing in the direction where the loss function increases the fastest, and with a magnitude proportional how fast the loss function is increasing. Since the goal of our algorithm is to decrease the loss, we update the parameters in the opposite direction of this vector.

As with any differential equation, one must specify initial conditions. In our case, this means choosing parameters \(w_0\) at \(t=0\). Without loss of generality, we can simply assume that \(w_0=0\) because any other choice of initial conditions can be absorbed into the the parametrization \(h\). Since lot of the objects we’ll be considering depend on \(w_0\), by absorbing the initial conditions into \(h\), we avoid having to keep track an extra variable \(w_0\). A properly-constructed loss function will result in a differential equation on \(w_t\) that converges to a point \(w_\infty\). We’ll refer to \(w_\infty\) as the solution learned by gradient decent.

A dataset \(f|_S\). A parameterization \(h: X \times \R^p \to Y\). A loss function \(L(f|_S, w)\). And the path \(\frac{\part w}{\part t}\) taken by gradient descent. That is all we are going to be studying. Because somehow, this extremely simple setup is the core of the recent revolution in artificial intelligence.

1.2) The Magic of Generalization

Ultimately, the point of learning a function \(h_w\) is that we’d like to know \(f(x)\) for \(x\not\in S\), i.e., points outside of the support. (After all, if we wanted to know \(f(x)\) for \(x\in S\), we could just look it up using the dataset.) So, we find \(w\) for which \(h(x_i, w) = f(x_i), \;\forall x_i \in S\) by minimizing \(L(f|_S,w)\) with gradient descent, and we hope that this results in an \(h_w\) which is similar to \(f\) elsewhere. If it is, we say the model generalizes.

A way to measure how well a model \(h_w\) generalizes is by checking if it fits the data on a test support \(S'\subset X\), where a test support is a collection of points \(x'_i \in S'\) for which we also know the values \(f(x'_i)\), but which we never used during the training of the parameters \(w\). After we’ve found some parameters \(w\) that fit \(f|_S\), we measure whether \(h_w(x'_i) = f(x'_i)\) on \(x'_i \in S'\), and the more points it gets correct, the better it generalizes. ³

A useful tool for thinking about generalization is to consider a partition of \(X\) into its correct and error subsets, denoted \(X_c\) and \(X_e\) respectively, where \(X_c = \{x\in X \: |\ h_w(x) = f(x) \}\) is the subset where \(h_w\) correctly fits \(f\) and \(X_e = X-X_c\) is the region where it makes errors. Of course, these are purely theoretical objects, since knowing the regions \(X_c, X_e\) requires acccess to the unknown function \(f\). Yet, they give us a natural way to think about generalization: a model is generalizing well when \(X_c\) is much larger than \(S\). This setup makes it evident that generalization is not naturally a scalar quantity: different models might make correct predictions on equally-large but completely different regions of \(X\). In terms of test-sets, this is like saying that how well a model seems to generalize is dependent on the test points \(S'\subset X\) on which you choose measure it.

Ultimately, generalization is about giving something to the learning process and receiving more in return. We only know \(f|_S\), and in return we receive \(f|_{X_c}\). The fact that \(X_c\) is larger than \(S\) might not seem impressive at first glance; after all, even if we pick the parameters \(w\) at random, there might still be some points in \(X - S\) where the predictions of \(h_w\) are correct by pure coincidence. But a learning algorithm that produces correct regions \(X_c\) which are consistently much larger than \(S\) is something powerful indeed.

However, there is something very important to note about generalization. Since our target \(f\) could be an arbitrary function, when we only have access to the dataset \(f|_S\), the only points \(x\) for which we can be absolutely sure of the value \(f(x)\) are the points in \(S\) themselves. For \(x\not\in S\), \(f(x)\) could take any value in \(Y\), and thus no matter how we construct our function space or what learning algorithm we use, there will always be functions \(f\) for which \(X_c = S\). That is: for any learning algorithm, we can always construct \(f,S\) for which the function learned by the algorithm gets all the predictions wrong outside of \(S\). Thus, for an arbitrary target function \(f\), generalization is impossible. This idea is often discussed as the no-free-lunch principle.

But the reason the field of ML has exploded is that we’ve found an approach which generalizes incredibly well on a vast variety of real-world tasks. How can we square these two facts? The key is a crucial assumption underpinning the no-free-lunch theorem: it only applies when \(f\) is arbitrary. If we restrict our interest to a specific subset of functions or supports with some useful structure, the no-free-lunch principle does not apply. We can therefore resolve this apparent contradiction by noting that the space of “real-world tasks” to which we apply machine learning is far smaller than the space of all possible functions. To explain generalization, then, requires both characterizing the set of all real-world tasks, and demonstrating that our learning algorithms are guaranteed to generalize on this set.

To really appreciate the scope of the mystery we are trying to solve, it’s worth empasizing the astronomical generalization abilities that deep neural networks consistently demonstate. Consider a task like playing chess, where the input is a board position, and the output is a move. The number of positions one might encouter in the game vastly exeeds the number of atoms in the universe. Yet, a neural network trained to imitate the moves selected by human players on just a few million board positions will predict good moves on almost any situation it encouters when playing. The overwhelming majority of these will be brand-new, never-before-seen positions. The neural network has taken a few examples of board positions and moves, and used them to choose good moves on a space larger than the number of atoms of the universe!

Another way to apprciate the reach of the generalization abilities of neural networks is by exploring the behavior of text-to-image models like DALLE-2.


“Photo of hip hop cow in a denim jacket recording a hit single in the studio”	“An Italian town made of pasta, tomatoes, basil and parmesian”	“Spider-Man from Ancient Rome”

While the datasets used in training are as large as technologically viable, the few hundred million text-image pairs that are included in the dataset are negilgible compared to the sheer combinatorial complexity of the space of text inputs, which is of course much larger than even the number of positions in the game of chess.

How can neural networks trained with gradient descent possibly return so much more than they are given? That is the fundamental mystery of generalization.

1.3) The Overparametrized Regime

Before discussing the ability of neural networks to generalize outside of the support \(S\), it is worthwhile to first consider why gradient descent is even able to find parameters which correctly fit \(f|_S\). You’ll see how thinking about this question will make the mystery of generalization seem even more daunting.

An important fact to keep in mind is that modern ML tends to use parameter spaces \(\R^p\) which are of much higher dimensionality than \(|S|\), the number of datapoints in our dataset. If we are using a polynomial function space, we can use Lagrange polynomials to show that for any function \(f\), if \(|S| < p\), there will always exist parameters \(w\in \R^p\) with \(h_w(x) = f(x)\) for all \(x\in S\). More over, gradient descent (with a reasonably defined loss function) is guaranteed to find one of these solutions. We have some similar results for neural networks: an immediate consequence of the universal function approximation theorems is that there exist parameters \(w\) that fit the dataset if \(p\) is large enough. Showing that gradient descent would find this solution is a different story though: as far as we know, no such results exist. ⁴ But our lack of theoretical understanding is not such a big problem because this is a question that can be answered empirically.

In perhaps one of the most important and insightful experiments of deep learning, Zhang et al. took a variety of neural network architectures used in computer vision, and trained them to fit a few variations of ImageNet, a popular image classification dataset. The settings they consider are:

Standard ImageNet. The inputs were real-world images – pictures of cats, dogs, etc. The outputs are the correct labels given by ImageNet.
ImageNet with random labels. The inputs were once again images from ImageNet dataset. However, the output labels were sampled at random from the set of all possible labels. (Sampling happened just once at the start of training.)
Random inputs, random outputs. The input images are sampled according to a uniform distribution, essentially generating white noise. The outputs are also selected at random. (Once again, sampling happened just once at the start of training, after which the dataset was held fixed.)

The result of the experiment was conclusive: the neural networks were reliably capable of fitting all three datasets. This indicates that in practice, where \(p\) is much larger than \(|S|\), neural networks trained with gradient descent are flexible enough to fit essentially anything.

To recap: for simple cases like polynomial learning, it’s easy to show that if \(p > |S|\), gradient descent is mathematically guaranteed find a solution that perfectly fits \(f|_S\) for any function \(f: X\to Y\). For neural networks, mathematical results are lacking, but the same statement has been explored empirically, and systematically found to be true.

Accepting this premise leads us to an inescapable realization, which we touched on earlier. In the overparametrized regime, low loss does not imply generaliation. This is because if we can find parameters \(w\) to fit any \(f|_S\), then there must also exist parameters \(w\) that fit \(f|_S\) while getting almost everything else wrong.

To see why this must be true, imagine a fourth experiment, in which \(f|_S\) contains all the images in ImageNet but also a second, much larger set of images (since ImageNet has 1M images, this second set could have 100M). In our new dataset, the labels for the ImageNet images are assigned correctly, but the labels for second set of images are assigned at random. As was the case for the other experiments, if we train a large enough neural network, gradient descent will find parameters that fit the entire dataset perfectly: both the correctly-labeled ImageNet images and the randomly-labeled second set. Of course, the resulting model will generalize terribly. After all, it was mostly trained on noise. Thus, these parameters perfectly fit the original ImageNet task but fail to generalize!

The crucial realization is that gradient descent does much more than find low-loss solutions. When we are learning on real-world tasks, there is a sea of parameters that fit \(f|_S\) but which make completely incorrect predictions elsewhere. And somehow, gradient descent manages to find the \(h_w\) that actually generalizes! How can gradient descent possibly know which solution to choose?

To be clear, the point of this section is not to say that real world problems are always in the overparametrized regime (although it is certainly the most common case). It’s perfectly possible to train an image model with a few million parameters on a dataset with a billion images. The real point is that we appreciate something very deep about neural networks by observing that they are capable of both generalizing and fitting arbitrary datasets. It unequivocally demonstrates that gradient descent must play a crucial role in generalization, far beyond its role in minimizing the loss function \(L(f|_S, w)\). And in the overparameterized regime, this role is given center stage.

Therefore, in this line of work, we will always assume that we are overparametrized (i.e. that we have enough parameters that we could fit any \(f|_S\) on the support \(S\)), and also that \(f = h_w\) for some \(w\) (i.e. we don’t want to be disacted by “capacity” concerns). This gives as a mathematically elegant setting where we can study the magic of generalization.

1.4) The Goal of Generalization Theory

It’s not entirely obvious what it would even mean for us to have “a theory of how neural networks learn”. In the natural sciences, such as physics, the objective when developing a mathematical theory usually consists of finding a set of equations that accurately model the real world. But we already have the equations that govern how neural networks learn (namely, the equations of gradient descent and of the parametrization \(h\)). Instead, we are trying to study an emergent phenomena of the learning dynamics: the fact that models trained in this manner generalize well on real-world tasks. We aim to understand how neural networks learn to behave in unseen situations, how they “form beliefs” about the world.

One natural starting point is to ask:

Which combinations of architecture \(h\), support \(S\) and target function \(f\) result in gradient descent perfectly learning the target function?

In the next article, we will study this question in a simplified setting (i.e. not neural networks), and we will see that one can gain a lot of insight about the behavior of gradient descent. For example, we will see that in certain settings, gradient descent behaves like a projection operator. But in a sense, this is an empty question because is easy to check the answer for any particular \(h,f,S\). After all, we know the learning equations, so we can use them to compute \(w_\infty\) and measure whether \(f=h_{w_\infty}\). Yes, this tells us in a particular situation whether or not we generalize, but doesn’t provide any deeper insight into the underlying causes of generalization.

Another angle of attack is to ask:

What is it about certain functions \(f\) that makes them learnable?

Sometimes, we can learn \(f\) with vastly less datapoints than one would have the right to expect (i.e. \(|S|\ll|X|\)). As it turns out, the function that maps images to the categories “hot dog” and “not hot dog” is one of these learnable funcitons. But why? Similarly, what is it about human written text that makes it possible to predict the next character? What is it about games like chess that makes it possible to learn a good strategy?

This line of questioning follows naturally from our earlier discussion on the no-free-lunch principle. We know no algorithm can generalize for all target functions \(f\) and supports \(S\); yet, any succesful application of ML is an example of a situation where this generalization can be seen. Any theory explaining the generalization observed from our algorithms must first identify what is special about these problem settings.

Research into machine learning is motivated by its success on natural problems. But functions originating in the real world are not always conducive to mathematical research, because we can only ever access them via data, never knowing the ground-truth \(f\). But we can also construct functions \(f\) that are learnable and for which we have a complete mathematical description: from simple tasks like learning to perform arithmetic operations, to more complex tasks like learning to emulate a Python interpreter.

Ultimately, the test of a good theory is whether it helps us build previously impossible things, or whether we can predict outcomes that we couldn’t before. That is why we strive to ground our research on generalization in concrete empirical objectives. Doing so will help us avoid fooling ourselves into thinking we are making progress when we are just writting useless equations. Here are some examples of such objectives:

Certify a trained neural network. On a synthetic task like multiplying digits, can we take a trained neural network and prove that it has learned the correct mathematical struture of the problem, guaranteeing that it gives the correct predictions on all possible inputs?
Prove that neural networks learn certain broad classes of structured functions, under a broad class of datasets. Can we find a broad category of functions \(f\) and supports \(S\) for which we can show gradient descent will learn \(f\)?
Just looking at \(f|_S\), can we be confident that \(h(w_\infty) = f\)? If you are given the list \(1,2,3,4,5,6,7,8,9,10,11\) you will be able to continue the sequence, but if instead you were given the list \(1,6,11,2,9,10,3,7,8,4,5\) you’d wouldn’t be able to do so (at least, not with any degree of confidence). Analogously, a neural network may sometimes be trained on a dataset for which it can confidently generalize, and other times be given a dataset with no structure or patterns at all. Can we construct an algorithm that can tell the difference?
In what regions of \(X\) can we be confident that the outputs of \(f\) and \(h_{w_\infty}\) match? This is just an extension of the previous question, but one of great practical importance. Very often, we have models that generalize well, but nonetheless fail to learn \(f\) perfectly. In other words, the correct set \(X_c\) might be very large, but \(X_c \not=X\). We’d like to know on which inputs \(x \in X\) we can trust the predictions \(h(x,w_\infty)\). Of course, we are just describing a neural network uncertainty method, but coming at it from a different angle than is typical. The central question isn’t really “where should we be uncertain about the outputs of a neural network?”. Instead, we should be asking: “why would we ever trust the outputs of \(h(x,w_\infty)\) for \(x \not\in S\)?”

1.5) Symmetry and Generalization

In this section, we’ll switch from describing the problem, to describing our approach to solving it. Our central hypothesis is that the thing that makes functions learnable is symmetry. The symmetry we are referring to is, intuitively, the exact same idea that you learned in grade school. This idea has a rigorous mathematical definition given by the field of group theory.

At a high level, we can think of a group \(G\) as a collection of objects \(g\in G\) that transform a space \(X\), sending every \(x\in X\) to another point \(g\rhd x\in X\). We say that \(G\) acts on \(X\). A \(G\) symmetric funciton is one for which \(f(x) = f(g\rhd x)\) for all \(g\in G\). Groups help us abstract away the redundant information in these functions. We can gain intuition by looking at images with symmetry.


Image 1	Image 2

The reflection transformation \(r\) sends the pixel cordinates \((x_1,x_2)\) to \(r\rhd (x_1, x_2) = (-x_1, x_2)\). Since an image of a butterfly has this symmetry, we only needed the data on the left half (Image 1) to reconstruct the full image (Image 2). The RGB value for any pixel coordinates \((x_1, x_2)\) on the right half will be the same as the RGB value of \(r\rhd (x_1, x_2)\) which we can determine by looking at Image 1.

For another example, take a look at this tiling pattern from Seville. The tile seen on Image 1 can be used to construct the entire image on the left by placing shifted copies alongside one another. These shift movements also correspond to the group \(G= \Z^2\) with points \(g = (g_1, g_2) \in G\) which transforms points \((x_1,x_2)\) to \(g\rhd (x_1, x_2) = (x_1 + g_1, x_2+ g_2)\).


Image 1	Image 2

Furthermore, note how even a single tile (Image 1) is itself full of symmetry. In the case of the tile, it has the same group of symmetries as the dodecagon, a group called \(D_{12}\) composed of 12 rotations and 12 reflections. We again only really would need small subset of all the pixels in Image 1 together with the group \(D_{12}\) to reconstruct the whole tile.

These examples should make it clear how symmetry can be used to generalize. If we know the function \(f\) has symmetries \(G\), having a point in the support \(x\in S\) not only informs us about the value \(f(x)\), but also tells us \(f(g\rhd x),\;\forall g\in G\). We are using \(G\) to “fill in” missing data. But all we’ve done is kick the can down the road. Sure, if we knew that the target function \(f\) had symmetry \(G\), we could use \(G\) to know the value of \(f\) on points outside of \(S\). But how can we possibly know that \(G\) is a symmetry group of \(f\)? If symmetries really are the key to understanding generalization, there can only be one answer: \(f|_S\) can provide information about the symmetries of \(f\). Moreover, gradient descent must be using this information implicitly to choose \(w\), selecting a \(h_w\) that matches not just the values of \(f|_S\), but its symmetries as well.

The idea that the dataset can convey symmetry is, in fact, quite intuitive. Consider a new dataset for the butterfly function:

From the top half of this dataset, we have strong evidence that the image is symmetric under reflections. On the bottom half, we only have partial information on the function, but nothing that we observe contradicts the idea that the function has reflective symmetry. By combining the values observed on the bottom-left with the symmetry observed on the top, we can deduce the missing values on the bottom-right.

This idea is deeply connected to Occam’s Razor: “if you have competing theories, the simpler one is better.” Concretely, the compteting theories are all the functions \(h_w\) that fit \(f|_S\), and the notion of simplicity is related to how symmetric \(h_w\) is. We believe this perspecive might provide a path to achieving our third goal, “can we be confident we’ve learned the right function \(h(w_\infty) = f\), just from the dataset?”. Perhaps we can be confident if there is only one highly symmetric \(h_w\) that fits the dataset.

Much of what we will be doing in the following articles is to define precisely what it means for \(f|_S\) to contain symmetry. Learning via gradient descent will be seen to be equivalent to doing two things: first, extracting a group \(G\) out of the support \(S\), then using \(f|_S\) together with \(G\) to fill in the missing data, and produce the final learned funciton \(h_{w_\infty}\).

1.6) Symmetries in the Real World

In the section above, we have given examples of symmetry using functions \(f : \R^2 \to \R^3\), which can be visualized as images. Of course, most machine learning applications don’t deal with function spaces whose input is 2-dimensional. In computer vision, for example, the input to our function space might be an arrray of 100x100 RGB pixels, and so the dimensionality of the input space \(X\) is 30,000. To build intuition, consider the following dataset for an image classification task:

Hotdog	Not hotdog

Note that these images play a very different role to the ones we’ve been considering. Before, the an image represented an entire function: each input was an \(x,y\) pixel coordinate, and each output was an RGB color. But here, each image is only a single point in input space \(X\). The function \(f: \R^{30,000} \to \{-1, 1 \}\) maps any image to one of two categories.

What do symmetries on this space look like? Before, the group elements \(g\in G\) sent each coordinate to another coordinate in a way that preserves the pixel’s color; now, \(g\in G\) sends each image to another image while preserving its class. Let’s build some intuition about what symmetries the function that classifies hotdogs might have. What type of movements of the 30,000-dimensional space preserve the class of the images?

One example of such a symmetry might be “add ketchup”. For any image in the dataset, adding or removing ketchup will not change its class: a hotdog is still a hotdog with ketchup on it, and the Eiffel Tower is still not a hotdog, even if we slather it in condiments.

Not hotdog	Also not hotdog

Other symmetric transformations include changing the camera angle, rotating the subject 180 degrees, bringing the center into soft focus, changing the time of day, changing the contents of the background…

We could go on and on, filling endless pages stating all the ways one can modify an image in a way that preserves its category, and wouldn’t even make a dent in the full list. But through analysis of a dataset of millions of images, these incredibly complicated symmetries start to become evident, and are picked up by gradient descent. Each time one is discovered, it unlocks a whole new region of generalization.

(Well…at least, that is our theory.)

1.7) Conclusions

We’ve discussed how exceptional the phenomenon of generalization of machine learning models (and of neural networks in particular) really is. And we have seen how experiments showing that neural networks are overparameterized imply that the key to understanding generalization lies in the way gradient descent selects a solution among the many that minimize the loss in the train set.

We’ve given some intuition on how a dataset could do more than inform us about that values of \(f\) on the points of the support \(S\), it could also provide evidence that the function \(f\) has symmetry. And once we have a symmetry \(G\) of \(f\), we can use it to predict the value of \(f(g\rhd x)\) for all points \(x\in S\) and group elements \(g\in G\).

What we haven’t yet shown is any connection between gradient descent and symmetry. Our next article will take a look at some simple ML problems, and we will see that when the target function \(f\) has some symmetry and the support \(S\) contains points that make this apparent, gradient descent does indeed learn functions that have that same symmetry.

Some might raise concerns that this definition of machine learning is too narrow. For example, it doesn’t allow for cases where \(f\) outputs a distribution, and so the dataset is composed of examples \((x_i, y_i)\) where \(y_i\) is merely sampled from \(f(x_i)\). Although that setting is important, we choose to ignore it for now; the deterministic case is simpler, and the issues we study are deeper than the superficial difference between trying to recover a function or a distribution. Understanding one will be the key to understanding the other. Similar tradeoffs are made through the document, prioritizing simplicity over generality as long as the simplification does not impact the central issues.↩︎
For simplicity, we discuss only vanilla, full-batch, continuous-time gradient descent. In practice it is more common to use variants such as discrete-time SGD, momentum, Adam, etc.↩︎
In this discussion we consider fitting to be a binary yes/no: either a prediction is correct, or it is not. This is a coarse notion of fitting, and inadequate for settings where perfect answers are impossible. However, it can easily be generalized to these instances. For example, when the target lives in \(\mathbb{R}\), we might prefer a more relaxed notion of approximate fitting that e.g. considers a point to be correctly fit whenever some scalar error is below a threshold.↩︎
Outside of limiting cases with specialized parameterizations, such as NTK.↩︎

Comment on this article Share:

1. Generalization and Symmetry