Machine Learning: Difference between revisions

From Optimal Transport Wiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 38: Line 38:
Recall that the EM distance is both continuous and differentiable, this implies that the critic should be trained to toward optimality.  As the critic is trained more, the gradient (Wasserstein) becomes more reliable.   
Recall that the EM distance is both continuous and differentiable, this implies that the critic should be trained to toward optimality.  As the critic is trained more, the gradient (Wasserstein) becomes more reliable.   


Restricted Bolzman Machines (RMB): These are probabilistic graphical models and can obtain hierarchical features at multiple levels.  An RBM can learn a probability distribution over a given set of inputs, and they were originally created under the name "Harmonim" by Paul Smolensky in 1986.  These machines are able to learn complex and multi-scale distributions based on empirical data. The RBM generally has a layer of latent variables with a given probability distribution.  This probability distribution is defined over a set of binary observed variables with a parameter that is to be learned.
===Restricted Bolzman Machines (RMB)===  
These are probabilistic graphical models and can obtain hierarchical features at multiple levels.  An RBM can learn a probability distribution over a given set of inputs, and they were originally created under the name "Harmonim" by Paul Smolensky in 1986.  These machines are able to learn complex and multi-scale distributions based on empirical data. The RBM generally has a layer of latent variables with a given probability distribution.  This probability distribution is defined over a set of binary observed variables with a parameter that is to be learned.


Learning of model parameters is usually facilitated by Kullback-Lieber divergence (divergence from the training sample to the model).  In research by Montavan, it is assumed that a metric can be established between observations.  This was used to define the Wasserstein distance between the training sample distribution that the distribution formed by the Boltzman machine.
Learning of model parameters is usually facilitated by Kullback-Lieber divergence (divergence from the training sample to the model).  In research by Montavan, it is assumed that a metric can be established between observations.  This was used to define the Wasserstein distance between the training sample distribution that the distribution formed by the Boltzman machine.

Revision as of 01:39, 10 June 2020

Optimal Transport Methods Machine Learning

Introduction

Optimal transport concepts applied to machine learning applications can also be referred to as computational Optimal Transport (OT). At its core, machine learning focuses on making comparisons between complex objects. To properly measure these similarities, a metric is needed, which is a distance function.

Optimal transport respects the underlying structure and geometry of a problem while providing a framework for comparing probability distributions. Optimal transport methods have received attention from researchers in fields as varied as economics, statistics, and quantum mechanics. The categories that OT methods can be divided into include learning, domain adaptation, Bayesian inference, and hypothesis testing.

Learning Methods

These methods have used transport-based distances in a wide range of research contexts. These contexts range from language analysis to image recognition. To a certain extent, the geometry underlying optimal transport models helps to generate faster algorithms as well as establish data connections. Optimal transport models still have limitations, however, as some structure may not be captured. This structure can be described as either intrinsic or extrinsic. It is intrinsic if the distributions reflect structured objects (e.g. images with segments)n ad it can be characterized as extrinsic if there is additional information that creates structure (such as groupings) (Alvarez-Melis). The following are some types of models that fall into the learning methods in OT category.

Graph-based Semi-supervised Learning

Effective approach for classification from a large variety of domains, including image and text classification. It is possible to use graph-based algorithms, and is often useful for unlabeled data. To imagine what comprises this type of model, consider a graph with multiple nodes. These graph nodes are usually associated with a simple label, like a real number. Expand this concept to consider more complex node labels, such as probability distributions or histograms. A concrete example of this would be a set of traffic cameras distributed across a geographical area, each of which produces a histogram or a distribution of commuter traffic over a given period of time. Many other interesting research questions exhibit a similar structure (e.g. climate data, rankings of restaurants, etc.). Optimal transport-inspired methods have used the two-Wasserstein distance between distributions to analyze models of this type (Solomon). The work of Solomon, et. al. in this area led to additional insights by Gao, et. al, where Wassertstein propagation on graphs was used as the basis for a message-passing algorithm. This technique facilitated a generalization that could be applied to hypergraphs through Wasserstein barycenters (Gao).

Generative Adversarial Networks (GAN)

A class of frameworks for machine learning developed in 2014 by Ian Goodfellow (Goodfellow). Machine learning frameworks where two neural networks are used compete against each other in a game-theoretic sense. If a GAN is trained on a given set of training data, it can produced a new set of data that is at the surface similar to the original set of data (for example, it can create new, similar images based on an initial training set of images). These techniques have been used in semi-supervised learning. They offer increased flexibility in use of the objective function, including the use of f-divergences and Jensen-Shannon (Arjovsky).

Wasserstein GAN (WGAN)

This is classified as a type of unsupervised learning, which uses a minimization of the distance between data distribution contained in the training set and the distribution of the observed data. In certain cases this produces a more stable training process. It minimizes an approximation of the earth mover (EM) distance. The optimization problem associated with the WGAN was shown to be sound by Arjovsky. This type of GAN reduces some of the problems associated with GAN approaches, including a reduction in the need to balance the training of the generator and the discriminator (Arjovsky). The WGAN also estimates the EM distance in a continuous manner via training the discriminator (Arjovsky). The following is an algorithm implementing the WGAN approach (Arjovsky).

Algorithm 1: Wasserstein Generative Adversarial Networks (WGAN)

   Require :  learning rate,  clipping parameter,  batch size,   number of iterations of critic per generator iteration. 
   Require :  initial critic parameters,  initial generator's parameters.
   1: while  has not converged do
   2:    for  do
   3:          Sample  a batch from the real data.
   4:          Sample  a batch of prior samples.
   5:          
   6:              
   7:          
   8:    end for
   9:       Sample  a batch of prior samples.
   10:      
   11:      
   12: end while     
   

Recall that the EM distance is both continuous and differentiable, this implies that the critic should be trained to toward optimality. As the critic is trained more, the gradient (Wasserstein) becomes more reliable.

Restricted Bolzman Machines (RMB)

These are probabilistic graphical models and can obtain hierarchical features at multiple levels. An RBM can learn a probability distribution over a given set of inputs, and they were originally created under the name "Harmonim" by Paul Smolensky in 1986. These machines are able to learn complex and multi-scale distributions based on empirical data. The RBM generally has a layer of latent variables with a given probability distribution. This probability distribution is defined over a set of binary observed variables with a parameter that is to be learned.

Learning of model parameters is usually facilitated by Kullback-Lieber divergence (divergence from the training sample to the model). In research by Montavan, it is assumed that a metric can be established between observations. This was used to define the Wasserstein distance between the training sample distribution that the distribution formed by the Boltzman machine.

Entropy-regularized Wasserstein loss: This has been used for multi-label classification. It is characterized by a relaxation of the transport problem which addresses unnormalized measure. It does this be replacing the equality constraints with soft penalties with respect to KL- divergence.

Domain Adaptation

In this case the goal is to learn about or extrapolate from one domain to another, often by finding domain-invariant representations (County). https://arxiv.org/pdf/1507.00504.pdf In cases where these sorts of models are useful the distributions are generally different but related. This is a technique that is often used to transfer information based on labelled data to unlabeled data. For example, in some cases the labels are available in the source domain, but the classification needs to be conducted on the target. If the classifier is trained on the source domain, the data will have performance deficiencies in the targeted domain (Flamary).

By obtaining the best transportation plan connecting the probability distributions of source and target domains, estimates of learning samples are estimated. The transformation is non-linear and invertible. This allows for the use of a variety of machine learning methods that can be used on the transformed dataset. Regularized, unsupervised models have been used, as well as Joint Class Proportion and Optimal Transport (JCPOT) to address multi-source domain adaptation. The following is a summary of an algorithm implemented by Redko:

Algorithm 2: Joint Class Proportion and Optimal (JCPOT)

   1: Parameters:  maxIter, 
   2: , 
   3: 
   4: while  < maxIter and  > threshold do
   5:   
   6:        
   7:   
   8:   
   9:   

This algorithm describes the optimization problem solved in Redko, et al which used operators defined such that proportions between the source and target domains are equal. A Wasserstein barycenter problem was used to achieve this (Benamou, Redko).

References