Shallow neural networks as Wasserstein gradient flows: Difference between revisions

Revision as of 19:14, 10 February 2022

Artificial neural networks (ANNs) consist of layers of artificial "neurons" which take in information from the previous layer and output information to neurons in the next layer. Gradient descent is a common method for updating the weights of each neuron based on training data. While in practice every layer of a neural network has only finitely many neurons, it is beneficial to consider a neural network layer with infinitely many neurons, for the sake of developing a theory that explains how ANNs work. In particular, from this viewpoint the process of updating the neuron weights for a shallow neural network can be described by a Wasserstein gradient flow.

Motivation

Shallow Neural Networks

Let us introduce the mathematical framework and notation for a neural network with a single hidden layer. Let $D\subset \mathbb {R} ^{d}$ be open. The set $D$ represents the space of inputs into the network. There is some unknown function $f:D\rightarrow \mathbb {R}$ which we would like to approximate. Let $N\in \mathbb {N}$ be the number of neurons in the hidden layer. Define

F_{N}:D\times \Omega ^{N}\rightarrow \mathbb {R}

be given by

F_{N}(x,\omega _{1},\dots ,\omega _{N},\theta _{1},\dots ,\theta _{N})={\frac {1}{N}}\sum _{i=1}^{N}\omega _{i}h(\theta _{i},x)

where $h$ is a fixed activation function and $\Omega$ is a space of possible parameters $(\omega ,\theta )$. The goal is to use training data to repeatedly update the weights $\omega _{i}$ and $\theta _{i}$ based on how close $f_{N,\omega ,\theta }:=F_{N}(\cdot ,\omega _{1},\dots ,\omega _{N},\theta _{1},\dots ,\theta _{N})$ is to the function $f$. More concretely, we want to find $\omega ,\theta$ that minimize the loss function:

l(f,f_{N,\omega ,\theta }):={\frac {1}{2}}\int _{D}|f(x)-f_{N,\omega ,\theta }(x)|^{2}dx
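To make these definitions concrete, the following minimal sketch evaluates $F_{N}$ and approximates the loss $l$ numerically. It relies on assumptions that the text does not fix: $D=(0,1)\subset \mathbb {R}$ (so $d=1$), the activation $h(\theta ,x)=\tanh(\theta x)$, the target $f(x)=\sin(2\pi x)$, and a midpoint rule in place of the exact integral. All function names are hypothetical.

```python
import numpy as np

def h(theta, x):
    # Assumed activation h(theta, x) = tanh(theta * x); the text leaves h generic.
    return np.tanh(theta * x)

def shallow_net(x, omega, theta):
    # F_N(x, omega, theta) = (1/N) * sum_i omega_i * h(theta_i, x)
    # x has shape (M,), omega and theta have shape (N,); the result has shape (M,).
    return np.mean(omega[None, :] * h(theta[None, :], x[:, None]), axis=1)

def loss(f_target, omega, theta, num_points=200):
    # Midpoint-rule approximation of (1/2) * int_0^1 |f(x) - f_N(x)|^2 dx,
    # assuming D = (0, 1).
    x = (np.arange(num_points) + 0.5) / num_points
    residual = f_target(x) - shallow_net(x, omega, theta)
    return 0.5 * np.mean(residual ** 2)

def f_target(x):
    # Assumed target function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
omega = rng.normal(size=50)   # random outer weights omega_i
theta = rng.normal(size=50)   # random inner weights theta_i
print(loss(f_target, omega, theta))
```

The $1/N$ normalization appears as `np.mean` over the neuron axis, so the same code works for any number of neurons.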

A standard way to choose and update the weights is to start with a random choice of weights ${\bar {\omega }},{\bar {\theta }}$ and perform gradient descent on these parameters. Unfortunately, this problem is in general non-convex, so gradient descent is not guaranteed to reach a global minimizer. To avoid this issue, it is useful to instead study a neural network model with infinitely many neurons.
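The gradient descent procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the method of the cited paper: it takes $D=(0,1)$, $h(\theta ,x)=\tanh(\theta x)$, the target $f(x)=\sin(2\pi x)$, a midpoint grid in place of the integral, and a step size proportional to $N$ to offset the $1/N$ factor in the gradients. The function names, step size, and iteration count are hypothetical choices.

```python
import numpy as np

def loss_and_grads(x, y, omega, theta):
    # Discretized loss (1/2) * mean_m |y_m - f_N(x_m)|^2 and its gradients,
    # for the assumed activation h(theta, x) = tanh(theta * x).
    n = omega.size
    act = np.tanh(theta[None, :] * x[:, None])             # shape (M, N)
    residual = y - np.mean(omega[None, :] * act, axis=1)   # f - f_N on the grid
    loss = 0.5 * np.mean(residual ** 2)
    # Chain rule; each gradient carries the 1/N factor from the network.
    grad_omega = -np.mean(residual[:, None] * act, axis=0) / n
    grad_theta = -np.mean(
        residual[:, None] * omega[None, :] * (1.0 - act ** 2) * x[:, None],
        axis=0,
    ) / n
    return loss, grad_omega, grad_theta

def gradient_descent(x, y, omega, theta, step, num_steps):
    # Plain gradient descent on (omega, theta); records the loss before each
    # step and once more after the final step.
    losses = []
    for _ in range(num_steps):
        loss, g_omega, g_theta = loss_and_grads(x, y, omega, theta)
        losses.append(loss)
        omega = omega - step * g_omega
        theta = theta - step * g_theta
    losses.append(loss_and_grads(x, y, omega, theta)[0])
    return omega, theta, losses

num_neurons = 20
x_grid = (np.arange(200) + 0.5) / 200        # midpoint grid on the assumed D = (0, 1)
y_grid = np.sin(2 * np.pi * x_grid)          # assumed target f
rng = np.random.default_rng(0)
omega0 = rng.normal(size=num_neurons)        # random initial weights
theta0 = rng.normal(size=num_neurons)
omega, theta, losses = gradient_descent(
    x_grid, y_grid, omega0, theta0, step=0.05 * num_neurons, num_steps=200
)
print(losses[0], losses[-1])
```

Because the loss is non-convex in $(\omega ,\theta )$, a run like this only shows that the loss decreases from its initial value; it gives no guarantee that the limit is a global minimizer.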

Continuous Formulation

For the continuous formulation (i.e. when $N=\infty$), we rephrase the above mathematical framework. In this case, it no longer makes sense to look for weights $\omega ,\theta$ that minimize the loss function. We instead look for a probability measure $\mu \in {\mathcal {P}}(\Omega )$ such that

f_{\mu }(x):=\int _{\Omega }\Phi (\xi ,x)d\mu (\xi )

minimizes the loss function:

F(\mu ):={\frac {1}{2}}\int _{D}(f-f_{\mu })^{2}dx.
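The finite network can be recovered as a special case of this formulation, which is a useful sanity check. The computation below assumes that $\xi =(\omega ,\theta )$ and $\Phi (\xi ,x)=\omega h(\theta ,x)$, an identification suggested by the finite-$N$ definition but not stated explicitly in the text, and takes $\mu _{N}$ to be the empirical measure of the $N$ parameter pairs.

```latex
\mu_N := \frac{1}{N}\sum_{i=1}^{N}\delta_{(\omega_i,\theta_i)}
\quad\Longrightarrow\quad
f_{\mu_N}(x) = \int_{\Omega}\Phi(\xi,x)\,d\mu_N(\xi)
= \frac{1}{N}\sum_{i=1}^{N}\omega_i h(\theta_i,x)
= f_{N,\omega,\theta}(x),
\qquad
F(\mu_N) = l(f,f_{N,\omega,\theta}).
```

Under this assumption, minimizing $F$ over empirical measures with $N$ atoms is the same problem as minimizing $l$ over the weights $\omega ,\theta$.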

Minimization Problem

Wasserstein Gradient Flow

Main Results

Consistency Between Infinite and Finite Cases

References

[Figalli-1] Xavier Fernandez-Real and Alessio Figalli, The Continuous Formulation of Shallow Neural Networks as Wasserstein-Type Gradient Flows
