Shallow neural networks as Wasserstein gradient flows

^[1]

Artificial neural networks (ANNs) consist of layers of artificial "neurons" which take in information from the previous layer and output information to neurons in the next layer. Gradient descent is a common method for updating the weights of each neuron based on training data. While in practice every layer of a neural network has only finitely many neurons, it is beneficial to consider a neural network layer with infinitely many neurons, for the sake of developing a theory that explains how ANNs work. In particular, from this viewpoint the process of updating the neuron weights for a shallow neural network can be described by a Wasserstein gradient flow.

Motivation

Shallow Neural Networks

Let us introduce the mathematical framework and notation for a neural network with a single hidden layer. Let $D\subset \mathbb {R} ^{d}$ be open. The set $D$ represents the space of inputs into the network. There is some unknown function $f:D\rightarrow \mathbb {R}$ which we would like to approximate. Let $N\in \mathbb {N}$ be the number of neurons in the hidden layer. Define

F_{N}:D\times \Omega ^{N}\rightarrow \mathbb {R}

be given by

F_{N}(x,\omega _{1},\dots ,\omega _{N},\theta _{1},\dots ,\theta _{N})={\frac {1}{N}}\sum _{i=1}^{N}\omega _{i}h(\theta _{i},x)

where $h$ is a fixed activation function and $\Omega$ is a space of possible parameters $(\omega ,\theta )$. The goal is to use training data to repeatedly update the weights $\omega _{i}$ and $\theta _{i}$ based on how close $f_{N,\omega ,\theta }:=F_{N}(\cdot ,\omega _{1},\dots ,\omega _{N},\theta _{1},\dots ,\theta _{N})$ is to the function $f$. More concretely, we want to find $\omega ,\theta$ that minimize the loss function:

l(f,f_{N,\omega ,\theta }):={\frac {1}{2}}\int _{D}|f(x)-f_{N,\omega ,\theta }(x)|^{2}dx
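To make the definitions concrete, the following is a minimal sketch of $f_{N,\omega ,\theta }$ and the loss $l$ in Python. Several choices here are assumptions made only for illustration and do not come from the text: the activation is taken to be $h(\theta ,x)=\tanh(\theta \cdot x)$, the input domain is $D=(0,1)$ (so $d=1$), and the integral over $D$ is approximated by a midpoint rule. The names `h`, `f_N`, and `loss` are hypothetical.

```python
import numpy as np

def h(theta, x):
    # Hypothetical activation (an assumption): h(theta, x) = tanh(theta . x).
    # theta has shape (N, d), x has shape (M, d); the result has shape (M, N).
    return np.tanh(x @ theta.T)

def f_N(x, omega, theta):
    # f_N(x) = (1/N) * sum_i omega_i * h(theta_i, x), evaluated at M points.
    return h(theta, x) @ omega / omega.shape[0]

def loss(f, omega, theta, num_points=1000):
    # Midpoint-rule approximation of (1/2) * integral over D = (0, 1)
    # of |f(x) - f_N(x)|^2 dx. D has length 1, so the mean is the integral.
    x = ((np.arange(num_points) + 0.5) / num_points).reshape(-1, 1)
    residual = f(x[:, 0]) - f_N(x, omega, theta)
    return 0.5 * np.mean(residual ** 2)
```

For instance, with all $\omega _{i}=0$ the network output is identically zero, so for $f(x)=x$ the loss reduces to ${\frac {1}{2}}\int _{0}^{1}x^{2}dx={\frac {1}{6}}$, which the sketch reproduces up to quadrature error.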

A standard way to choose and update the weights is to start with a random choice of weights ${\bar {\omega }},{\bar {\theta }}$ and perform gradient descent on these parameters. Unfortunately, this problem is in general non-convex, so gradient descent is not guaranteed to reach a global minimizer. To avoid this issue, it is useful to instead study a neural network model with infinitely many neurons.
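The gradient descent procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of the cited reference: it assumes $d=1$, $D=(0,1)$, the activation $h(\theta ,x)=\tanh(\theta x)$, a midpoint-rule discretization of the loss, and a step size scaled by $N$ to compensate for the ${\frac {1}{N}}$ factor in the network. The function name `train` and all default values are hypothetical.

```python
import numpy as np

def train(f, N=50, steps=500, lr=0.1, num_points=200, seed=0):
    # Random initial weights (omega_bar, theta_bar); standard normal is an assumption.
    rng = np.random.default_rng(seed)
    omega = rng.normal(size=N)
    theta = rng.normal(size=N)

    # Midpoints of a uniform grid on D = (0, 1), used to discretize the loss.
    x = (np.arange(num_points) + 0.5) / num_points
    target = f(x)

    history = []
    for _ in range(steps):
        act = np.tanh(np.outer(x, theta))        # h(theta_i, x_m), shape (M, N)
        residual = act @ omega / N - target      # f_N(x_m) - f(x_m), shape (M,)
        history.append(0.5 * np.mean(residual ** 2))

        # Gradients of the discretized loss with respect to omega_i and theta_i.
        grad_omega = residual @ act / (N * num_points)
        grad_theta = (residual @ ((1.0 - act ** 2) * x[:, None])) * omega / (N * num_points)

        # Step size scaled by N (an assumed convention, see the lead-in).
        omega -= lr * N * grad_omega
        theta -= lr * N * grad_theta
    return omega, theta, history
```

Calling `train(lambda x: np.sin(2 * np.pi * x))` returns the final weights together with the loss recorded before each step. Because the problem is non-convex, a decreasing loss history does not by itself show that a global minimizer has been reached.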

Continuous Formulation

For the continuous formulation (i.e. when $N=\infty$), we rephrase the above mathematical framework. In this case, it no longer makes sense to look for weights $\omega ,\theta$ that minimize the loss function. We instead look for a probability measure $\mu \in {\mathcal {P}}(\Omega )$ such that

f_{\mu }(x):=\int _{\Omega }\Phi (\xi ,x)d\mu (\xi )

minimizes the loss function:

F(\mu ):={\frac {1}{2}}\int _{D}(f-f_{\mu })^{2}dx
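As a consistency check, suppose (this is an assumption, since $\Phi$ is not defined explicitly here) that $\Phi (\xi ,x)=\omega h(\theta ,x)$ for $\xi =(\omega ,\theta )\in \Omega$. Take $\mu$ to be the empirical measure of $N$ parameter pairs,

\mu _{N}:={\frac {1}{N}}\sum _{i=1}^{N}\delta _{(\omega _{i},\theta _{i})}

Then integrating against $\mu _{N}$ gives

f_{\mu _{N}}(x)=\int _{\Omega }\Phi (\xi ,x)d\mu _{N}(\xi )={\frac {1}{N}}\sum _{i=1}^{N}\omega _{i}h(\theta _{i},x)=f_{N,\omega ,\theta }(x)

so that $F(\mu _{N})=l(f,f_{N,\omega ,\theta })$. Under this assumption, the finite network with $N$ neurons corresponds to the special case in which $\mu$ is a uniform average of $N$ Dirac masses.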

Minimization Problem

Wasserstein Gradient Flow

Main Results

Consistency Between Infinite and Finite Cases

References

[1] Xavier Fernandez-Real and Alessio Figalli, The Continuous Formulation of Shallow Neural Networks as Wasserstein-Type Gradient Flows
