2 layer neural networks as Wasserstein gradient flows

From Optimal Transport Wiki

Revision as of 04:20, 10 February 2022

[1]

Artificial neural networks (ANNs) consist of layers of artificial "neurons" which take in information from the previous layer and output information to neurons in the next layer. Gradient descent is a common method for updating the weights of each neuron based on training data. While in practice every layer of a neural network has only finitely many neurons, it is beneficial to consider a neural network layer with infinitely many neurons, for the sake of developing a theory that explains how ANNs work. In particular, from this viewpoint the process of updating the neuron weights for a shallow neural network can be described by a Wasserstein gradient flow.

Motivation

Shallow Neural Networks

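Let us introduce the mathematical framework and notation for a neural network with a single hidden layer. Let <math> D </math> be an open subset of Euclidean space. The set <math> D </math> represents the space of inputs into the network. There is some unknown function <math> f </math> on <math> D </math> which we would like to approximate. Let <math> N </math> be the number of neurons in the hidden layer. Define <math> F_N </math> by

: <math> F_N(x, \omega_1, \dots, \omega_N,\theta_1, \dots, \theta_N) = \frac{1}{N} \sum_{i=1}^N \omega_i h(\theta_i,x) </math>

where <math> h </math> is a fixed activation function. The goal is to repeatedly update the weights <math> \omega_i </math> and the parameters <math> \theta_i </math> according to how close <math> f_{N, \omega, \theta} := F_N( \cdot,  \omega_1, \dots, \omega_N,\theta_1, \dots, \theta_N) </math> is to the function <math> f </math> on the training data. More concretely, we want to find <math> \omega, \theta </math> that minimize the loss function:

: <math> l(f,f_{N, \omega, \theta}) := \frac{1}{2} \int_{D} |f(x)-f_{N,\omega,\theta}(x)|^2dx </math>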
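As an illustration, the shallow network and its loss can be sketched numerically. The following is a minimal sketch rather than a definitive implementation, and it rests on several assumptions that are not part of the article: the input space is taken to be the interval (0, 1), the activation is chosen as tanh(a x + b) with each parameter written as a pair (a, b), the target function is sin(2 pi x), the integral is approximated by a midpoint rule, and only the outer weights are updated by plain gradient descent. All function and variable names are hypothetical.

```python
import numpy as np

def h(theta, x):
    """Assumed activation unit: tanh(a*x + b), where each theta_i = (a_i, b_i)."""
    a, b = theta[:, 0], theta[:, 1]
    return np.tanh(np.outer(x, a) + b)  # shape (len(x), N)

def F_N(x, omega, theta):
    """F_N(x, omega, theta) = (1/N) * sum_i omega_i * h(theta_i, x)."""
    return h(theta, x) @ omega / omega.shape[0]

def loss(f, omega, theta, grid):
    """Midpoint-rule approximation of (1/2) * int_D |f - f_N|^2 dx on D = (0, 1)."""
    dx = 1.0 / grid.shape[0]
    residual = f(grid) - F_N(grid, omega, theta)
    return 0.5 * np.sum(residual ** 2) * dx

def omega_gradient(f, omega, theta, grid):
    """Gradient of the discretized loss with respect to the weights omega."""
    dx = 1.0 / grid.shape[0]
    residual = f(grid) - F_N(grid, omega, theta)
    return -(h(theta, grid).T @ residual) * dx / omega.shape[0]

# Hypothetical setup: N neurons with random initial weights and parameters.
rng = np.random.default_rng(0)
N = 50
omega = rng.normal(size=N)
theta = rng.normal(size=(N, 2))
grid = (np.arange(200) + 0.5) / 200      # midpoints of 200 cells in (0, 1)
f = lambda x: np.sin(2 * np.pi * x)      # assumed target function

initial_loss = loss(f, omega, theta, grid)

# Gradient descent on omega only; the step is scaled by N to offset
# the 1/N factor that appears in the gradient.
learning_rate = 0.5
for _ in range(100):
    omega = omega - learning_rate * N * omega_gradient(f, omega, theta, grid)

final_loss = loss(f, omega, theta, grid)
print("loss before:", initial_loss, "loss after:", final_loss)
```

Because the discretized loss is quadratic in the weights when the parameters are held fixed, a sufficiently small step size lowers the loss at every iteration. Updating the parameters as well would require the derivative of the activation with respect to them, which depends on the activation chosen.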


Continuous Formulation

Minimization Problem

Wasserstein Gradient Flow

Main Results

References