Shallow neural networks as Wasserstein gradient flows
Motivation
Artificial neural networks (ANNs) consist of layers of artificial "neurons" which take in information from the previous layer and output information to neurons in the next layer. Gradient descent is a common method for updating the weights of each neuron based on training data. While in practice every layer of a neural network has only finitely many neurons, it is beneficial to consider a neural network layer with infinitely many neurons, for the sake of developing a theory that explains how ANNs work. In particular, from this viewpoint the process of updating the neuron weights for a shallow neural network can be described by a Wasserstein gradient flow.
Single Layer Neural Networks
See also: Mathematics of Artificial Neural Networks
Discrete Formulation
Let us introduce the mathematical framework and notation for a neural network with a single hidden layer.[1] Let <math>\Omega \subseteq \mathbb{R}^d</math> be open. The set <math>\Omega</math> represents the space of inputs into the network. There is some unknown function <math>f: \Omega \rightarrow \mathbb{R}</math> which we would like to approximate. Let <math>N</math> be the number of neurons in the hidden layer. Define <math>f_N: \Omega \rightarrow \mathbb{R}</math> to be given by
:<math> f_N(x) = \frac{1}{N}\sum_{i=1}^{N} \sigma(x, \omega_i), </math>
where <math>\sigma: \Omega \times D \rightarrow \mathbb{R}</math> is a fixed activation function and <math>D</math> is a space of possible parameters <math>\omega_i</math>. The goal is to use training data to repeatedly update the weights <math>\omega_1, \ldots, \omega_N</math> based on how close <math>f_N</math> is to the function <math>f</math>. More concretely, we want to find weights <math>\omega_1, \ldots, \omega_N</math> that minimize the loss function:
:<math> L(\omega_1, \ldots, \omega_N) = \int_\Omega |f(x) - f_N(x)|^2 \, dx. </math>
A standard way to choose and update the weights is to start with a random choice of weights and perform gradient descent on these parameters. Unfortunately, this problem is non-convex, so gradient descent may fail to reach a global minimizer. It turns out that, in practice, neural networks are surprisingly good at finding a minimizer anyway. A nicer minimization problem that may provide insight into how neural networks work is a neural network model with infinitely many neurons.
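As a concrete illustration of the discrete formulation, here is a minimal numerical sketch of the finite-neuron model and a few gradient descent steps. The specific choices below are assumptions for the sake of the example, not taken from the references: the activation <math>\sigma(x,\omega) = a\max(wx+b,0)</math> with <math>\omega = (a,w,b)</math>, a one-dimensional input space, a Monte Carlo approximation of the <math>L^2</math> loss on sampled points, and finite-difference gradients.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Target function f on Omega = (0, 1) that the network should approximate (illustrative choice).
f = lambda x: np.sin(2 * np.pi * x)

N = 50                                  # number of hidden neurons
omega = rng.normal(size=(N, 3))         # each row is one parameter omega_i = (a_i, w_i, b_i)

def f_N(x, omega):
    """Finite-neuron model: average of N simple neurons a_i * max(w_i x + b_i, 0)."""
    a, w, b = omega[:, 0], omega[:, 1], omega[:, 2]
    return np.maximum(np.outer(x, w) + b, 0.0) @ a / len(a)

def loss(omega, x):
    """Monte Carlo approximation of the L^2 loss over Omega using sampled points x."""
    return np.mean((f(x) - f_N(x, omega)) ** 2)

def grad_loss(omega, x, eps=1e-6):
    """Finite-difference gradient of the loss with respect to all parameters."""
    base = loss(omega, x)
    g = np.zeros_like(omega)
    for idx in np.ndindex(omega.shape):
        pert = omega.copy()
        pert[idx] += eps
        g[idx] = (loss(pert, x) - base) / eps
    return g

x_train = rng.uniform(0, 1, size=200)   # sampled "training data" in Omega
print("initial loss:", loss(omega, x_train))
step_size = 0.1
for step in range(100):
    omega -= step_size * grad_loss(omega, x_train)   # gradient descent on the parameters
print("final loss:", loss(omega, x_train))
</syntaxhighlight>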
Continuous Formulation
For the continuous formulation (i.e. when <math>N \rightarrow \infty</math>), we rephrase the above mathematical framework. In this case, it no longer makes sense to look for weights that minimize the loss function. We instead look for a probability measure <math>\mu \in \mathcal{P}(D)</math> such that
:<math> f_\mu(x) = \int_D \sigma(x, \omega) \, d\mu(\omega) </math>
minimizes the loss function:
:<math> \int_\Omega |f(x) - f_\mu(x)|^2 \, dx. </math>
Here <math>\sigma(x, \omega)</math> is an activation function with parameter <math>\omega \in D</math>.
Note that by restricting the choice of <math>\mu</math> to probability measures of the form <math>\mu = \frac{1}{N}\sum_{i=1}^{N}\delta_{\omega_i}</math>, the above minimization problem recovers the case with finitely many neurons as well, so the continuous formulation generalizes the discrete one, as the computation below shows.
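Indeed, with the notation introduced above, plugging the empirical measure of the parameters into <math>f_\mu</math> gives back the finite-neuron network:
:<math> \mu = \frac{1}{N}\sum_{i=1}^{N}\delta_{\omega_i} \quad \Longrightarrow \quad f_\mu(x) = \int_D \sigma(x,\omega)\, d\mu(\omega) = \frac{1}{N}\sum_{i=1}^{N}\sigma(x,\omega_i) = f_N(x). </math>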
To avoid overfitting the network to the training data, a potential term is added to the loss function. For the remainder of this article, we define the loss function to be:
:<math> F(\mu) = \int_\Omega |f(x) - f_\mu(x)|^2 \, dx + \int_D V(\omega) \, d\mu(\omega) </math>
for a convex potential function <math>V: D \rightarrow \mathbb{R}</math>. Often we choose <math>V(\omega) = |\omega|^2</math>. In fact, <math>F</math> is convex (along linear interpolations of measures), in contrast to the objective in the finite-neuron case.
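To see why <math>F</math> is convex along linear interpolations (a sketch, assuming the form of <math>F</math> written above): the map <math>\mu \mapsto f_\mu</math> is linear, so for <math>\mu_t = (1-t)\mu_0 + t\mu_1</math> we have <math>f_{\mu_t} = (1-t)f_{\mu_0} + t f_{\mu_1}</math>, and convexity of <math>g \mapsto \int_\Omega |f-g|^2 \, dx</math> gives
:<math> \int_\Omega |f - f_{\mu_t}|^2 \, dx \leq (1-t)\int_\Omega |f - f_{\mu_0}|^2 \, dx + t\int_\Omega |f - f_{\mu_1}|^2 \, dx, </math>
while the potential term <math>\int_D V \, d\mu_t</math> is linear in <math>t</math>. Hence <math>F(\mu_t) \leq (1-t)F(\mu_0) + tF(\mu_1)</math>.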
Gradient Flow
When we work in <math>\mathbb{R}^n</math>, the gradient flow of a differentiable function <math>f: \mathbb{R}^n \rightarrow \mathbb{R}</math> starting at a point <math>x_0 \in \mathbb{R}^n</math> is a curve <math>x: [0,\infty) \rightarrow \mathbb{R}^n</math> satisfying the differential equation
:<math> x'(t) = -\nabla f(x(t)), \quad x(0) = x_0, </math>
where <math> \nabla f </math> is the gradient of <math>f</math>.[2]
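As a simple worked example (not taken from the references), take <math>f(x) = \frac{1}{2}|x|^2</math>. Then <math>\nabla f(x) = x</math>, the gradient flow equation becomes <math>x'(t) = -x(t)</math>, and its solution is
:<math> x(t) = e^{-t}x_0, \qquad f(x(t)) = \tfrac{1}{2}e^{-2t}|x_0|^2, </math>
so the flow moves toward the minimizer <math>x = 0</math> and the value of <math>f</math> decreases along the curve.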
Crucially, the gradient flow heads in the direction that decreases the value of <math> f </math> the fastest. We would like to use this nice property of gradient flow in our setting with the functional <math> F </math>. However, it is not immediately straightforward how to do this, since <math> F </math> is defined on the space of probability measures, rather than on <math> \mathbb{R}^n </math>, so the usual gradient is not defined. Before we generalize the notion of gradient flow, recall that <math> \nabla f </math> is the unique function from <math> \mathbb{R}^n </math> to <math> \mathbb{R}^n </math> such that
:<math>\langle\nabla f , v \rangle = D_v f </math>,
where <math> D_v f </math> is the directional derivative of <math> f </math> in the direction <math> v </math>. Motivated by this and using the Riemannian structure of the space of probability measures, we can define a notion of gradient for our energy functional <math> F </math>.
We can define the gradient flow in a Hilbert space <math> \mathcal{H} </math> of a convex and lower semi-continuous functional <math> \Phi </math>. An absolutely continuous curve <math>x: [0, \infty) \rightarrow \mathcal{H} </math> is a gradient flow for <math> \Phi </math> starting at <math> x_0 \in \mathcal{H} </math> if
:<math> x(0) = x_0 </math> and <math> \dot{x}(t) \in -\partial \Phi(x(t)) </math> for almost every <math> t>0 </math>,
where <math> \partial \Phi </math> is the subdifferential of <math> \Phi </math>.[3]
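Numerically, a gradient flow is approximated by discretizing in time. The following is a minimal sketch, assuming the smooth finite-dimensional case <math>\mathcal{H} = \mathbb{R}^n</math>, where the subdifferential reduces to the ordinary gradient: an explicit Euler step of size <math>\tau</math> is exactly the gradient descent update <math>x_{k+1} = x_k - \tau \nabla f(x_k)</math>. The function and step size below are illustrative choices.
<syntaxhighlight lang="python">
import numpy as np

def gradient_flow_euler(grad_f, x0, tau=0.01, steps=1000):
    """Explicit Euler discretization of x'(t) = -grad_f(x(t)).

    Each step is one gradient descent update with step size tau, so the
    discrete trajectory approximates the continuous gradient flow curve.
    """
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x - tau * grad_f(x)
        traj.append(x.copy())
    return np.array(traj)

# Example: f(x) = 0.5 * |x|^2 has gradient x, and its gradient flow is
# x(t) = exp(-t) * x0 (compare with the worked example above).
traj = gradient_flow_euler(lambda x: x, x0=[1.0, -2.0], tau=0.01, steps=500)
t_final = 0.01 * 500
print(traj[-1])                                  # Euler approximation at time t_final
print(np.exp(-t_final) * np.array([1.0, -2.0]))  # exact solution, close to the above
</syntaxhighlight>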
Wasserstein Distance
Let us first define the (<math>p</math>th) Wasserstein distance between two probability measures.[4] This can be defined for probability measures on any separable metric space <math>(X, d)</math>. Let <math>\mu, \nu \in \mathcal{P}(X)</math> and let <math>\Gamma(\mu, \nu)</math> denote the space of transport plans from <math>\mu</math> to <math>\nu</math>. Define the <math>p</math>th Wasserstein distance from <math>\mu</math> to <math>\nu</math> to be:
:<math> W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times X} d(x,y)^p \, d\gamma(x,y) \right)^{1/p}, </math>
where <math>d(x,y)</math> is the distance between <math>x</math> and <math>y</math> in the metric space <math>X</math>. In this context, <math>X</math> is just a subset of <math>\mathbb{R}^m</math> for some <math>m</math>, with the Euclidean metric. Often <math>p = 2</math>.
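In one special case the infimum over transport plans can be computed directly: for two equally weighted empirical measures on the real line, the optimal plan matches the sorted samples. The sketch below illustrates this; the one-dimensional setting, equal sample sizes, and Gaussian samples are assumptions made only for the example.
<syntaxhighlight lang="python">
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two equally weighted empirical measures
    mu = (1/n) sum delta_{x_i} and nu = (1/n) sum delta_{y_i} on the real line.

    In one dimension the optimal transport plan pairs order statistics,
    so sorting both samples attains the infimum over plans.
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert len(x) == len(y), "this sketch assumes the same number of points"
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
mu_samples = rng.normal(0.0, 1.0, size=500)   # samples approximating mu
nu_samples = rng.normal(2.0, 1.0, size=500)   # samples approximating nu
print(wasserstein_1d(mu_samples, nu_samples, p=2))  # roughly 2 for this mean shift
</syntaxhighlight>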
First Variation / Functional Derivative
The first variation (or functional derivative) of a functional from the space of probability measures to <math>\mathbb{R}</math> measures how the value of the functional changes under small perturbations of the measure.
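One common way to make this precise (a sketch following the standard optimal transport literature, which may differ from the convention intended here): the first variation <math>\frac{\delta F}{\delta \mu}(\mu)</math> of <math>F</math> at <math>\mu</math> is a function on the underlying space such that
:<math> \frac{d}{d\varepsilon} F(\mu + \varepsilon \chi)\Big|_{\varepsilon = 0} = \int \frac{\delta F}{\delta \mu}(\mu) \, d\chi </math>
for perturbations <math>\chi</math> with <math>\int d\chi = 0</math> such that <math>\mu + \varepsilon \chi</math> remains a probability measure for small <math>\varepsilon</math>.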
Wasserstein Subdifferential
Wasserstein Gradient Flow
Main Results
Consistency Between Infinite and Finite Cases
References
- ↑ X. Fernandez-Real and A. Figalli, The Continuous Formulation of Shallow Neural Networks as Wasserstein-Type Gradient Flows
- ↑ G. Schiebinger, Gradient Flow in Wasserstein Space
- ↑ A. Figalli, F. Glaudo, An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows
- ↑ L. Ambrosio, N. Gigli, G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures