Shallow neural networks as Wasserstein gradient flows: Difference between revisions

Latest revision as of 02:33, 20 March 2022

Motivation

Artificial neural networks (ANNs) consist of layers of artificial "neurons" which take in information from the previous layer and output information to neurons in the next layer. Gradient descent is a common method for updating the weights of each neuron based on training data. While in practice every layer of a neural network has only finitely many neurons, it is beneficial to consider a neural network layer with infinitely many neurons, for the sake of developing a theory that explains how ANNs work. In particular, from this viewpoint the process of updating the neuron weights for a shallow neural network can be described by a Wasserstein gradient flow.

Single Layer Neural Networks

Discrete Formulation

Let us introduce the mathematical framework and notation for a neural network with a single hidden layer.^[1] Let $D\subset \mathbb {R} ^{d}$ be open. The set $D$ represents the space of inputs into the network. There is some unknown function $f:D\rightarrow \mathbb {R}$ which we would like to approximate. Let $N\in \mathbb {N}$ be the number of neurons in the hidden layer. Define

F_{N}:D\times \Omega ^{N}\rightarrow \mathbb {R}

be given by

F_{N}(x,\omega _{1},\dots ,\omega _{N},\theta _{1},\dots ,\theta _{N})={\frac {1}{N}}\sum _{i=1}^{N}\omega _{i}h(\theta _{i},x)

where $h$ is a fixed activation function and $\Omega ^{N}$ is a space of possible parameters $(\omega ,\theta )=(\omega _{1},\dots ,\omega _{N},\theta _{1},\dots ,\theta _{N})$. The goal is to use training data to repeatedly update the weights $\omega _{i}$ and $\theta _{i}$ based on how close $f_{N,\omega ,\theta }:=F_{N}(\cdot ,\omega _{1},\dots ,\omega _{N},\theta _{1},\dots ,\theta _{N})$ is to the function $f$. More concretely, we want to find $\omega ,\theta$ that minimize the loss function:

l(f,f_{N,\omega ,\theta }):={\frac {1}{2}}\int _{D}|f(x)-f_{N,\omega ,\theta }(x)|^{2}dx
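As a minimal sketch of the discrete formulation, the following Python snippet evaluates $F_{N}$ and a Monte Carlo estimate of the loss. The activation $h(\theta ,x)=\tanh(\theta \cdot x)$, the names `h`, `F_N` and `loss`, and the uniform sampling of $D$ (with known volume) are assumptions made for illustration; the framework above does not fix them.

```python
import numpy as np

def h(theta, x):
    """Assumed activation: tanh(theta_i . x_m) for every neuron i and input m.

    theta: (N, d) inner weights, x: (M, d) inputs. Returns an (M, N) array.
    """
    return np.tanh(x @ theta.T)

def F_N(x, omega, theta):
    """Evaluate (1/N) * sum_i omega_i * h(theta_i, x) on a batch of inputs.

    omega: (N,) outer weights. Returns an array of shape (M,).
    """
    return h(theta, x) @ omega / omega.shape[0]

def loss(f, omega, theta, x_samples, volume):
    """Monte Carlo estimate of (1/2) * integral over D of |f - f_N|^2 dx.

    x_samples: (M, d) points drawn uniformly from D; volume: Lebesgue measure of D.
    """
    residual = f(x_samples) - F_N(x_samples, omega, theta)
    return 0.5 * volume * np.mean(residual ** 2)
```

Here `f` is expected to map an `(M, d)` array of inputs to an `(M,)` array of target values.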

A standard way to choose and update the weights is to start with a random choice of weights ${\bar {\omega }},{\bar {\theta }}$ and perform gradient descent on these parameters. Unfortunately, this problem is non-convex, so gradient descent is not guaranteed to reach a global minimizer. In practice, however, neural networks trained this way are surprisingly good at finding one. A neural network model with infinitely many neurons gives a nicer minimization problem that may provide insight into how neural networks work.

Continuous Formulation

For the continuous formulation (i.e. when $N=\infty$), we rephrase the above mathematical framework. In this case, it no longer makes sense to look for weights $\omega ,\theta$ that minimize the loss function. We instead look for a probability measure $\mu \in {\mathcal {P}}(\Omega )$ such that

f_{\mu }(x):=\int _{\Omega }\Phi (\xi ,x)d\mu (\xi )

minimizes the loss function:

F(\mu ):={\frac {1}{2}}\int _{D}(f-f_{\mu })^{2}dx.

Here $\Phi (\xi ,x)$ is an activation function with parameter $\xi =(\omega ,\theta )\in \Omega$. We will in fact restrict choices of $\mu$ to probability measures with finite second moment, denoted ${\mathcal {P}}_{2}(\Omega )$. This technicality ensures that the 2-Wasserstein distance between any two such measures is finite, so that it is indeed a metric.

Note that by restricting choices of $\mu$ to probability measures of the form $\mu _{N}={\frac {1}{N}}\sum _{i=1}^{N}\delta _{\xi _{i}}$, the above minimization problem reduces to the case with finitely many neurons, so the continuous formulation generalizes the discrete one.
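To see this concretely, here is a short worked computation, assuming (as the notation $\xi =(\omega ,\theta )$ suggests) that the activation has the form $\Phi (\xi ,x)=\omega h(\theta ,x)$ and $\xi _{i}=(\omega _{i},\theta _{i})$. Integrating against the empirical measure gives

f_{\mu _{N}}(x)=\int _{\Omega }\Phi (\xi ,x)d\mu _{N}(\xi )={\frac {1}{N}}\sum _{i=1}^{N}\Phi (\xi _{i},x)={\frac {1}{N}}\sum _{i=1}^{N}\omega _{i}h(\theta _{i},x)=F_{N}(x,\omega _{1},\dots ,\omega _{N},\theta _{1},\dots ,\theta _{N}),

so under this assumption $F(\mu _{N})$ coincides with the discrete loss $l(f,f_{N,\omega ,\theta })$.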

To avoid overfitting the network to the training data, a regularizing potential term is added to the loss function. For the remainder of this article, we define the loss function $F$ to be:

F(\mu ):={\frac {1}{2}}\int _{D}\left(f(x)-\int _{\Omega }\Phi (\xi ,x)d\mu (\xi )\right)^{2}dx+\int _{\Omega }V(\xi )d\mu (\xi )

for a convex potential function $V:\Omega \rightarrow \mathbb {R}$. Often we choose $V(\xi )={\frac {\lambda }{2}}|\xi |^{2}$. In fact, $F$ is convex (along linear interpolations), in contrast to the objective function in the finite neuron case.
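The convexity claim can be checked directly; the following is a sketch of the argument. For $\mu _{0},\mu _{1}\in {\mathcal {P}}_{2}(\Omega )$ and $t\in [0,1]$, set $\mu _{t}=(1-t)\mu _{0}+t\mu _{1}$. Both $\mu \mapsto \int _{\Omega }\Phi (\xi ,x)d\mu (\xi )$ and $\mu \mapsto \int _{\Omega }Vd\mu$ are linear in $\mu$, so

\int _{\Omega }\Phi (\xi ,x)d\mu _{t}(\xi )=(1-t)\int _{\Omega }\Phi (\xi ,x)d\mu _{0}(\xi )+t\int _{\Omega }\Phi (\xi ,x)d\mu _{1}(\xi ),\qquad \int _{\Omega }Vd\mu _{t}=(1-t)\int _{\Omega }Vd\mu _{0}+t\int _{\Omega }Vd\mu _{1}.

Since $s\mapsto {\frac {1}{2}}|f(x)-s|^{2}$ is convex for each fixed $x$, integrating over $D$ and adding the potential term gives

F(\mu _{t})\leq (1-t)F(\mu _{0})+tF(\mu _{1}).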

Gradient Flow

When $X\subseteq \mathbb {R} ^{n}$ , the gradient flow of a differentiable function $g:X\rightarrow \mathbb {R}$ starting at a point $x_{0}$ is a curve $x(t):[0,T)\rightarrow X$ satisfying the differential equation

{\frac {d}{dt}}x(t)=-\nabla g(x(t)),\quad x(0)=x_{0},

where $\nabla g$ is the gradient of g.^[2]
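For intuition, a gradient flow in $\mathbb {R} ^{n}$ can be approximated by explicit Euler steps, which is exactly gradient descent with a small step size. The sketch below is illustrative only: the function name `gradient_flow_euler`, the step size, and the test function $g(x)={\frac {1}{2}}|x|^{2}$ (whose exact flow is $x(t)=x_{0}e^{-t}$) are assumptions, not part of the source.

```python
import numpy as np

def gradient_flow_euler(grad_g, x0, step=0.01, n_steps=1000):
    """Approximate the curve x'(t) = -grad g(x(t)), x(0) = x0, by explicit Euler.

    Returns an array of shape (n_steps + 1, n) holding the discrete trajectory.
    """
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        x = x - step * grad_g(x)   # one Euler step = one gradient descent step
        path.append(x.copy())
    return np.array(path)

# Example: g(x) = 0.5 * |x|^2 has gradient x, so the flow contracts toward 0.
path = gradient_flow_euler(lambda x: x, x0=[1.0, -2.0])
```

Along the computed trajectory the value of $g$ decreases at every step, mirroring the defining property of the continuous flow.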

Crucially, the gradient flow heads in the direction that decreases the value of $g$ the fastest. We would like to use this property of gradient flow in our setting with the functional $F$. However, it is not immediately clear how to do this, since $F$ is defined on the space of probability measures rather than on $\mathbb {R} ^{n}$, so the usual gradient is not defined. Before we generalize the notion of gradient flow, recall that $\nabla g$ is the unique function from $\mathbb {R} ^{n}$ to $\mathbb {R} ^{n}$ such that

\langle \nabla g,v\rangle =D_{v}g

for every $v\in \mathbb {R} ^{n}$, where $D_{v}g$ is the directional derivative of $g$ in the direction $v$. Motivated by this and using the Riemannian structure of the space of probability measures, we can define a notion of gradient for our loss functional $F$.

Note that one can define gradient flows in a general Hilbert space.

Wasserstein Gradient / Subdifferential

We are looking for an element $\nabla _{W_{2}}F(\mu _{*})$ in the tangent space of ${\mathcal {P}}_{2}(\Omega )$ at $\mu _{*}$ such that

\langle \nabla _{W_{2}}F(\mu _{*}),{\frac {d}{dt}}\mu (t)|_{t=0}\rangle _{W_{2}}=\lim _{h\rightarrow 0}{\frac {F(\mu (h))-F(\mu _{*})}{h}}

for any absolutely continuous curve $\mu (t)$ in ${\mathcal {P}}_{2}(\Omega )$ with $\mu (0)=\mu _{*}$ .

We claim that in fact

\nabla _{W_{2}}F(\mu _{*})=-\nabla \cdot (\mu _{*}\nabla {\frac {\delta F}{\delta \mu _{*}}})

where ${\frac {\delta F}{\delta \mu _{*}}}$ is the first variation of $F$ at $\mu _{*}$^[3] (also called the functional derivative).

We provide a formal argument for this equality that makes the most sense when $\mu$ is absolutely continuous with respect to the Lebesgue measure. In this case, we can think of $\mu$ as a density function with $d\mu (x)=\mu (x)dx$ . We can further assume that $\mu$ is in $L^{2}(\Omega )$ . Therefore the same notion of gradient for $L^{2}$ exists for $F$ , and in fact

\nabla _{L^{2}}F(\mu )={\frac {\delta F}{\delta \mu }}

Thus by definition of $L^{2}$ gradient and $L^{2}$ inner product, we should have

\lim _{h\rightarrow 0}{\frac {F(\mu (h))-F(\mu _{*})}{h}}=\langle {\frac {\delta F}{\delta \mu _{*}}},{\frac {d}{dt}}\mu (t)|_{t=0}\rangle _{L^{2}}

=\int {\frac {\delta F}{\delta \mu _{*}}}{\frac {d}{dt}}\mu (t)|_{t=0}dx

By the continuity equation $\partial _{t}\mu (t)+\nabla \cdot (v(t)\mu (t))=0$ and the divergence theorem (assuming the boundary terms vanish),

\int {\frac {\delta F}{\delta \mu _{*}}}{\frac {d}{dt}}\mu (t)|_{t=0}dx=-\int {\frac {\delta F}{\delta \mu _{*}}}\nabla \cdot (v_{*}\mu _{*})dx=\int \nabla {\frac {\delta F}{\delta \mu _{*}}}\cdot v_{*}\mu _{*}dx=\int \nabla {\frac {\delta F}{\delta \mu _{*}}}\cdot v_{*}d\mu _{*},

where $v_{*}$ is the unique velocity field of the curve $\mu (t)$ at $t=0$, given by the equivalence of absolutely continuous curves and solutions of the continuity equation. Now by the definition of the inner product structure of $({\mathcal {P}}_{2}(\mathbb {R} ^{d}),W_{2})$, we have

\int \nabla {\frac {\delta F}{\delta \mu _{*}}}\cdot v_{*}d\mu _{*}=\langle -\nabla \cdot (\mu _{*}\nabla {\frac {\delta F}{\delta \mu _{*}}}),{\frac {d}{dt}}\mu (t)|_{t=0}\rangle _{W_{2}}.

Therefore we have

\nabla _{W_{2}}F(\mu _{*})=-\nabla \cdot (\mu _{*}\nabla {\frac {\delta F}{\delta \mu _{*}}}),

so the gradient flow for $F$ is an absolutely continuous curve of probability measures $\mu _{t}$ such that

\partial _{t}\mu _{t}=\nabla \cdot (\mu _{t}\nabla {\frac {\delta F}{\delta \mu _{t}}}).

Now we actually compute this gradient flow for our loss functional $F$ .

Computing the Wasserstein Gradient Flow for F

First, we compute the first variation (or functional derivative) of $F$, viewed as a function from ${\mathcal {P}}(\Omega )$ to $\mathbb {R}$:

{\frac {\delta F}{\delta \mu }}(\mu _{*})=\int _{D}\Phi (\cdot ,x)\left[\int _{\Omega }\Phi ({\bar {\xi }},x)d\mu _{*}({\bar {\xi }})-f(x)\right]dx+V
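A brief sketch of where this formula comes from: perturb $\mu _{*}$ in the direction $\chi =\nu -\mu _{*}$ for some $\nu \in {\mathcal {P}}(\Omega )$ (the notation $\chi ,\nu ,\varepsilon$ is introduced here only for illustration) and differentiate at $\varepsilon =0$. Since the inner integral is linear in the measure,

{\frac {d}{d\varepsilon }}F(\mu _{*}+\varepsilon \chi )|_{\varepsilon =0}=\int _{D}\left[\int _{\Omega }\Phi ({\bar {\xi }},x)d\mu _{*}({\bar {\xi }})-f(x)\right]\int _{\Omega }\Phi (\xi ,x)d\chi (\xi )dx+\int _{\Omega }V(\xi )d\chi (\xi )

=\int _{\Omega }\left(\int _{D}\Phi (\xi ,x)\left[\int _{\Omega }\Phi ({\bar {\xi }},x)d\mu _{*}({\bar {\xi }})-f(x)\right]dx+V(\xi )\right)d\chi (\xi ).

The function of $\xi$ in the outer parentheses is, by definition, the first variation of $F$ at $\mu _{*}$.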

Therefore we have

\nabla {\frac {\delta F}{\delta \mu }}(\mu _{*})=\int _{D}\nabla _{\xi }\Phi (\cdot ,x)\left[\int _{\Omega }\Phi ({\bar {\xi }},x)d\mu _{*}({\bar {\xi }})-f(x)\right]dx+\nabla V.

Hence the Wasserstein gradient flow is

\partial _{t}\mu _{t}=\operatorname {div} \left(\mu _{t}\nabla {\frac {\delta F}{\delta \mu }}(\mu _{t})\right).
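One common way to get a feel for this equation is a particle discretization: if $\mu _{t}={\frac {1}{N}}\sum _{i=1}^{N}\delta _{\xi _{i}(t)}$, the flow formally corresponds to each particle moving with velocity $-\nabla {\frac {\delta F}{\delta \mu }}(\mu _{t})(\xi _{i})$, which is gradient descent on the finite-neuron loss up to a rescaling of time by $N$. The sketch below is only an illustration under stated assumptions: $d=1$, $D=(-1,1)$, $\Phi (\xi ,x)=\omega \tanh(\theta x)$, $V(\xi )={\frac {\lambda }{2}}|\xi |^{2}$, midpoint quadrature in $x$, explicit Euler in time, and the hypothetical names `particle_flow_step` and `energy`.

```python
import numpy as np

def particle_flow_step(xi, f_vals, x_grid, dx, lam, step):
    """One explicit Euler step of d(xi_i)/dt = -grad(dF/dmu)(mu)(xi_i).

    xi: (N, 2) particles (omega_i, theta_i); x_grid: (M,) quadrature nodes in D;
    f_vals: (M,) values f(x_grid); dx: quadrature weight; lam: weight of V.
    """
    omega, theta = xi[:, 0], xi[:, 1]
    act = np.tanh(np.outer(theta, x_grid))        # (N, M): tanh(theta_i * x_m)
    f_mu = (omega[:, None] * act).mean(axis=0)    # (M,): integral of Phi against mu
    residual = f_mu - f_vals                      # (M,): f_mu(x) - f(x)
    # Integrals over D of grad_xi Phi(xi_i, x) * residual(x), one row per particle.
    grad_omega = (act * residual).sum(axis=1) * dx
    grad_theta = (omega[:, None] * x_grid * (1.0 - act ** 2) * residual).sum(axis=1) * dx
    grad = np.stack([grad_omega, grad_theta], axis=1) + lam * xi  # add grad V = lam * xi
    return xi - step * grad

def energy(xi, f_vals, x_grid, dx, lam):
    """Quadrature approximation of F(mu_N) for the empirical measure of the particles."""
    omega, theta = xi[:, 0], xi[:, 1]
    f_mu = (omega[:, None] * np.tanh(np.outer(theta, x_grid))).mean(axis=0)
    fit = 0.5 * np.sum((f_vals - f_mu) ** 2) * dx
    potential = 0.5 * lam * np.mean(np.sum(xi ** 2, axis=1))
    return fit + potential

# Usage: 50 particles fitting an assumed target f(x) = sin(pi * x) on D = (-1, 1).
rng = np.random.default_rng(0)
M = 200
dx = 2.0 / M
x_grid = -1.0 + dx * (np.arange(M) + 0.5)         # midpoint rule nodes
f_vals = np.sin(np.pi * x_grid)
xi = rng.normal(size=(50, 2))
energy_start = energy(xi, f_vals, x_grid, dx, lam=0.01)
for _ in range(300):
    xi = particle_flow_step(xi, f_vals, x_grid, dx, lam=0.01, step=0.01)
energy_end = energy(xi, f_vals, x_grid, dx, lam=0.01)
```

For a sufficiently small step size the discrete energy should decrease along the iteration, which is the particle-level counterpart of $F$ decreasing along its gradient flow.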

Open Problems

Generalization

Generalization error measures how well a model performs on new data that is not in the training data set. In practice, networks trained by gradient descent generalize well, but providing a theoretical framework that guarantees generalization is an open problem. Wang, Meng, Chen and Liu (2021) made progress by studying the implicit regularization of the algorithm, and provide a framework for convergence.^[4]

References

[1] X. Fernandez-Real and A. Figalli, The Continuous Formulation of Shallow Neural Networks as Wasserstein-Type Gradient Flows

[2] G. Schiebinger, Gradient Flow in Wasserstein Space

[3] L. Ambrosio, N. Gigli, G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures

[4] B. Wang, Q. Meng, W. Chen, T. Liu, The Implicit Regularization for Adaptive Optimization Algorithms on Homogeneous Neural Networks

@@ Line 37: / Line 37: @@
 :<math> F(\mu) : = \frac{1}{2} \int_{D} (f - f_\mu)^2 dx </math>.
-Here <math>\Phi(\xi, x)</math> is an activation function with parameter <math> \xi = (\omega, \theta) \in \Omega </math>.
+Here <math>\Phi(\xi, x)</math> is an activation function with parameter <math> \xi = (\omega, \theta) \in \Omega </math>. We will in fact restrict choices of <math> \mu </math> to probability measures with finite second moment, denoted <math> \mathcal{P}_2(\Omega) </math>. This is a small technicality to ensure that the Wasserstein metric is indeed a metric.
 Note that by restricting choices of <math> \mu </math> to probability measures of the form <math> \mu_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{\xi_i} </math>, the above minimization problem generalizes to case with finitely many neurons as well.
@@ Line 49: / Line 49: @@
 ==Gradient Flow==
-When <math> X \subseteq \mathbb{R}^n </math>, the gradient flow of a differentiable function <math> f: X \rightarrow \mathbb{R} </math> starting at a point <math> x_0 </math> is a curve <math>x(t): [0, T) \rightarrow X </math> satisfying the differential equation
+When <math> X \subseteq \mathbb{R}^n </math>, the gradient flow of a differentiable function <math> g: X \rightarrow \mathbb{R} </math> starting at a point <math> x_0 </math> is a curve <math>x(t): [0, T) \rightarrow X </math> satisfying the differential equation
-:<math> \frac{d}{dt}x(t) = - \nabla f (x(t)), x(0)=x_0 </math>.
+:<math> \frac{d}{dt}x(t) = - \nabla g (x(t)), x(0)=x_0 </math>.
-where <math> \nabla f</math> is the gradient of f.<ref name="Schiebinger"/>
+where <math> \nabla g</math> is the gradient of g.<ref name="Schiebinger"/>
-Crucially, the gradient flow heads in the direction that decreases the value of <math> f </math> the fastest. We would like to use this nice property of gradient flow in our setting with the functional  <math> F </math>. However, it is not immediately straightforward how to do this, since <math> F </math> is defined on the space of probability measures, rather than on <math> \mathbb{R}^n </math>, so the usual gradient is not defined. Before we generalize the notion of gradient flow, recall that <math> \nabla f </math> is the unique function such that
+Crucially, the gradient flow heads in the direction that decreases the value of <math> g </math> the fastest. We would like to use this nice property of gradient flow in our setting with the functional  <math> F </math>. However, it is not immediately straightforward how to do this, since <math> F </math> is defined on the space of probability measures, rather than on <math> \mathbb{R}^n </math>, so the usual gradient is not defined. Before we generalize the notion of gradient flow, recall that <math> \nabla g </math> is the unique function from <math> \mathbb{R}^n </math> to <math> \mathbb{R}^n </math> such that
-:<math>\langle\nabla f , v \rangle = \rm{d} f(v) </math>.
+:<math>\langle\nabla g , v \rangle = D_v g </math>,
-:<math> \nabla f (x(t)) = \nabla f \cdot \dot{x}(t) </math>.
+where <math> D_v g </math> is the directional derivative of <math> g </math> in the direction <math> v </math>. Motivated by this and using the [http://34.106.105.83/wiki/Formal_Riemannian_Structure_of_the_Wasserstein_metric Riemannian structure of the space of probability measures], we can define the a notion of gradient for our loss functional <math> F </math>.
-Recall that for a vector field <math> V, \rm{div}V = - \nabla \cdot V </math>.
+Note that one can define [http://34.106.105.83/wiki/Gradient_flows_in_Hilbert_spaces gradient flows in a general Hilbert space].
-Motivated by this and using the notion of a subdifferential we can define the [http://34.106.105.83/wiki/Gradient_flows_in_Hilbert_spaces gradient flow in a Hilbert space] <math> \mathcal{H} </math> of a convex and lower semi-continuous functional <math> \Phi </math>. An absolutely continuous curve <math>x: [0, \infty) \rightarrow \mathcal{H} </math> is a gradient flow for <math> \Phi </math> starting at <math> x_0 \in \mathcal{H} </math> if
+===Wasserstein Gradient / Subdifferential ===
-: <math> x(0)= x_0, </math> and <math> \dot{x}(t) \in -\partial \Phi(x(t)) </math> for almost every <math> t>0 </math>
+We are looking for an element <math> \nabla_{W_2} F (\mu) </math> in the tangent space of <math>\mathcal{P}_2(\Omega) </math> at <math> \mu</math> such that
-where <math> \partial \Phi </math> is the subdifferential of <math> \Phi </math>. <ref name="Glaudo"/>
+:<math> \langle  \nabla_{W_2} F (\mu_*), \frac{d}{dt}\mu(t)|_{t=0} \rangle_{W_2} = \lim_{h \rightarrow 0} \frac{F(\mu(h)) - F(\mu_*)}{h}</math>
-In order to define the gradient in <math>\mathcal{P}(\Omega) </math>, we also need to utilize the [http://34.106.105.83/wiki/Formal_Riemannian_Structure_of_the_Wasserstein_metric Riemannian structure of the space of probability measures].
+for any absolutely continuous curve <math> \mu(t) </math> in <math>\mathcal{P}_2(\Omega) </math> with <math> \mu(0)=\mu_* </math>.
+We claim that in fact
-===Wasserstein Distance===
+:<math> \nabla_{W_2} F (\mu_*) = - \nabla \cdot (\mu_* \nabla \frac{\delta F}{\delta \mu_*}) </math>
-Let us first define the (pth) Wasserstein distance between two probability measures. <ref name="Ambrosio"/> This can be defined for probability measures in any separable metric space. Let <math> \mu, \nu\in \mathcal{P}(\Omega) </math> and let <math> \Gamma(\mu, \nu) </math> denote the space of transport plans from <math> \mu </math> to <math> \nu </math>. Define the pth Wasserstein distance from <math> \mu </math> to <math> \nu </math> to be:
-:<math> W_p(\mu,\nu) := \min \{ \int_{\Omega \times \Omega} d(\xi_1,\xi_2)^p d\gamma(\xi_1,\xi_2): \gamma \in \Gamma(\mu,\nu)\} </math>
+where <math> \frac{\delta F}{\delta \mu_*} </math> is the first variation of <math> F </math> at <math> \mu_* </math> <ref name="Ambrosio" /> (also called the [https://en.wikipedia.org/wiki/Functional_derivative#cite_note-ParrYangP246A.2-3 functional derivative]).
-where <math> d(\xi_1,\xi_2) </math> is the distance between  <math> \xi_1 </math> and <math> \xi_2 </math> in the metric space <math> \Omega \times \Omega </math>. In this context, <math> \Omega \times \Omega </math> is just a subset of <math> \mathbb{R}^{d} </math> for some <math> d </math> with the euclidean metric. Often <math> p = 2 </math>.
+We provide a formal argument for this equality that makes the most sense when <math> \mu </math> is absolutely continuous with respect to the Lebesgue measure. In this case, we can think of <math> \mu </math> as a density function with <math> d \mu(x) = \mu(x) dx </math>. We can further assume that <math> \mu </math> is in <math> L^2(\Omega) </math>. Therefore the same notion of gradient for <math> L^2 </math> exists for <math> F </math>, and in fact
-===First Variation===
+:<math> \nabla_{L^2}F(\mu) = \frac{\delta F}{\delta \mu} </math>
-===Wasserstein Subdifferential===
+Thus by definition of <math> L^2 </math> gradient and <math> L^2 </math> inner product, we should have
-===Wasserstein Gradient Flow===
+:<math>\lim_{h \rightarrow 0} \frac{F(\mu(h)) - F(\mu_*)}{h}= \langle \frac{\delta F}{\delta \mu_*} , \frac{d}{dt}\mu(t)|_{t=0} \rangle_{L^2}</math>
-:<math> \partial_t \mu_t = \rm{div}( \mu_t \nabla \frac{\delta F}{\delta \mu} (\mu_t)) </math>
+:<math> = \int \frac{\delta F}{\delta \mu_*} \frac{d}{dt}\mu(t)|_{t=0} dx </math>
-==Main Results==
+by the continuity equation for <math>\mu(t) </math> and divergence theorem,
-===Consistency Between Infinite and Finite Cases===
+:<math> \int \frac{\delta F}{\delta \mu_*} \frac{d}{dt}\mu(t)|_{t=0} dx = - \int \frac{\delta F}{\delta \mu_*} (\nabla \cdot (v_*\mu_*)) dx= \int \nabla \frac{\delta F}{\delta \mu_*} (v_*\mu_*)dx = \int \nabla \frac{\delta F}{\delta \mu_*} v_* d\mu_*</math>
+where <math> v_* </math> is the unique velocity vector corresponding to <math> \mu_* </math> by the equivalence of absolutely continuous curves and solutions of [http://34.106.105.83/wiki/The_continuity_equation_and_Benamour_Brenier_formula the continuity equation]. Now by the definition of the inner product structure of <math> (\mathcal{P}_2(\mathbb{R}^d), W_2)</math>, we have
+:<math>\int \nabla \frac{\delta F}{\delta \mu_*} v_* d\mu_* = \langle - \nabla \cdot (\mu \nabla \frac{\delta F}{\delta \mu_*}),  \frac{d}{dt}\mu(t)|_{t=0} \rangle_{W_2}.</math>
+Therefore we have
+:<math>\nabla_{W_2}F(\mu_*)= - \nabla \cdot (\mu \nabla \frac{\delta F}{\delta \mu_*}).</math>
+so the gradient flow for <math> F </math> is an absolutely continuous curve of probability measure <math> \mu_t </math> such that
+:<math> \partial_t \mu_t  = \nabla \cdot (\mu_t \nabla \frac{\delta F}{\delta \mu_t}) </math>.
+Now we actually compute this gradient flow for our loss functional <math> F</math>.
+===Computing the Wasserstein Gradient Flow for F ===
+First, we compute the first variation ( or [https://en.wikipedia.org/wiki/Functional_derivative functional derivative]) of a function from <math>\mathcal{P}(\Omega) </math> to <math> \mathbb{R} </math> of <math> F </math>
+:<math> \frac{\delta F}{\delta \mu}(\mu_*) = \int_D \Phi( \cdot, x) [\int_\Omega \Phi(\bar{\xi},x)d \mu_*(\bar{xi}) - f(x) ]dx + V </math>
+Therefore we have
+: <math> \nabla \frac{\delta F}{\delta \mu}(\mu_*) = \int_D \nabla_{\xi} \Phi( \cdot, x) [\int_\Omega \Phi(\bar{\xi},x)d \mu_*(\bar{xi}) - f(x) ]dx + \nabla V .</math>
+Hence the  Wasserstein gradient flow is
+:<math> \partial_t \mu_t = \rm{div}( \mu_t \nabla \frac{\delta F}{\delta \mu} (\mu_t)) </math>.
+==Open Problems==
+===Generalization===
+[https://en.wikipedia.org/wiki/Generalization_error Generalization error] is how well a model generalizes for a new data not in your training data set. In practice, gradient descent does generalize really well, but it is an open problem to provide a theoretical framework to guarantee generalization. Wang, Meng, Chen and Liu 2021 made progress on studying the implicit regularization of the algorithm, and provide a framework for convergence. <ref name="Wang" />
 ==References==
@@ Line 100: / Line 133: @@
 <ref name="Schiebinger"> [https://personal.math.ubc.ca/~geoff/courses/W2019T1/Lecture16.pdf G. Schiebinger, ''Gradient Flow in Wasserstein Space''] </ref>
-<ref name="Glaudo"> [https://www.maa.org/press/maa-reviews/an-invitation-to-optimal-transport-wasserstein-distances-and-gradient-flows A. Figalli, F. Glaudo, ''An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows''] </ref>
+<ref name="Wang"> [https://arxiv.org/pdf/2012.06244.pdf B. Wang, Q. Meng, W. Chen, T. Liu,  ''The Implicit Regularization for Adaptive Optimization Algorithms on Homogeneous Neural Networks''] </ref>
 </references>
