Optimal Transport Methods Machine Learning

Introduction

Optimal transport concepts applied to machine learning applications can also be referred to as computational Optimal Transport (OT). At its core, machine learning focuses on making comparisons between complex objects. To properly measure these similarities, a metric is needed, which is a distance function.

Optimal transport respects the underlying structure and geometry of a problem while providing a framework for comparing probability distributions. Optimal transport methods have received attention from researchers in fields as varied as economics, statistics, and quantum mechanics. The categories that applications of OT methods in a machine learning context can be divided into include learning, domain adaptation, Bayesian inference, and hypothesis testing.^[1]

Learning Methods

These methods have used transport-based distances in a wide range of research contexts. These contexts range from language analysis to image recognition. To a certain extent, the geometry underlying optimal transport models helps to generate faster algorithms as well as establish data connections. Optimal transport models still have limitations, however, as some structure may not be captured. This structure can be described as either intrinsic or extrinsic. It is intrinsic if the distributions reflect structured objects (e.g. images with segments)n ad it can be characterized as extrinsic if there is additional information that creates structure (such as groupings) ^[2]. The following are some types of models that fall into the learning methods in OT category.

Graph-based Semi-supervised Learning

Effective approach for classification from a large variety of domains, including image and text classification. It is possible to use graph-based algorithms, and is often useful for unlabeled data. To imagine what comprises this type of model, consider a graph with multiple nodes. These graph nodes are usually associated with a simple label, like a real number. Expand this concept to consider more complex node labels, such as probability distributions or histograms. A concrete example of this would be a set of traffic cameras distributed across a geographical area, each of which produces a histogram or a distribution of commuter traffic over a given period of time. Many other interesting research questions exhibit a similar structure (e.g. climate data, rankings of restaurants, etc.). Optimal transport-inspired methods have used the two-Wasserstein distance between distributions to analyze models of this type ^[3] The work of Solomon, et. al. in this area led to additional insights by Gao, et. al, where Wassertstein propagation on graphs was used as the basis for a message-passing algorithm. This technique facilitated a generalization that could be applied to hypergraphs through Wasserstein barycenters ^[4].

Generative Adversarial Networks (GAN)

A class of frameworks for machine learning developed in 2014 by Ian Goodfellow (Goodfellow). Machine learning frameworks where two neural networks are used compete against each other in a game-theoretic sense. If a GAN is trained on a given set of training data, it can produced a new set of data that is at the surface similar to the original set of data (for example, it can create new, similar images based on an initial training set of images). These techniques have been used in semi-supervised learning. They offer increased flexibility in use of the objective function, including the use of f-divergences and Jensen-Shannon ^[5].

Wasserstein GAN (WGAN)

This is classified as a type of unsupervised learning, which uses a minimization of the distance between data distribution contained in the training set and the distribution of the observed data. In certain cases this produces a more stable training process. It minimizes an approximation of the earth mover (EM) distance. The optimization problem associated with the WGAN was shown to be sound by Arjovsky. This type of GAN reduces some of the problems associated with GAN approaches, including a reduction in the need to balance the training of the generator and the discriminator ^[5]. The WGAN also estimates the EM distance in a continuous manner via training the discriminator ^[5]. The following is an algorithm implementing the WGAN approach ^[5].

Algorithm 1: Wasserstein Generative Adversarial Networks (WGAN)

   Require :  $\alpha$  learning rate,  $c$  clipping parameter,  $m$  batch size,  $n_{\text{critic }}$   number of iterations of critic per generator iteration. 
   Require :  $w_{0},$  initial critic parameters,  $\theta _{0},$  initial generator's parameters.
   1: while  $\theta$  has not converged do
   2:    for  $t=0,\ldots ,n_{\text{critic }}$  do
   3:          Sample  $\left\{x^{(i)}\right\}_{i=1}^{m}\sim \mathbb {P} _{r}$  a batch from the real data.
   4:          Sample  $\left\{z^{(i)}\right\}_{i=1}^{m}\sim p(z)$  a batch of prior samples.
   5:           $g_{w}\leftarrow \nabla _{w}\left[{\frac {1}{m}}\sum _{i=1}^{m}f_{w}\left(x^{(i)}\right)-{\frac {1}{m}}\sum _{i=1}^{m}f_{w}\left(g_{\theta }\left(z^{(i)}\right)\right)\right]$ 
   6:           $w\leftarrow w+\alpha \cdot \operatorname {RMSProp} \left(w,g_{w}\right)$     
   7:           $w\leftarrow \operatorname {clip} (w,-c,c)$ 
   8:    end for
   9:       Sample  $\left\{z^{(i)}\right\}_{i=1}^{m}\sim p(z)$  a batch of prior samples.
   10:       $g_{\theta }\leftarrow -\nabla _{\theta }{\frac {1}{m}}\sum _{i=1}^{m}f_{w}\left(g_{\theta }\left(z^{(i)}\right)\right)$ 
   11:       $\theta \leftarrow \theta -\alpha \cdot \operatorname {RMSProp} \left(\theta ,g_{\theta }\right)$ 
   12: end while

Recall that the EM distance is both continuous and differentiable, this implies that the critic should be trained to toward optimality. As the critic is trained more, the gradient (Wasserstein) becomes more reliable. The Wasserstein GAN can have problems due to the enforcement of the Lipschitz constraint the critic, which required weight clipping (reflected in algorithm) ^[6]

Restricted Bolzman Machines (RMB)

These are probabilistic graphical models and can obtain hierarchical features at multiple levels. An RBM can learn a probability distribution over a given set of inputs, and they were originally created under the name "Harmonim" by Smolensky in 1986. These machines are able to learn complex and multi-scale distributions based on empirical data. The RBM generally has a layer of latent variables with a given probability distribution. This probability distribution is defined over a set of binary observed variables with a parameter that is to be learned.

Learning of model parameters is usually facilitated by Kullback-Lieber divergence (divergence from the training sample to the model). In research by Montavan, it is assumed that a metric can be established between observations ^[7]. This was used to define the Wasserstein distance between the training sample distribution that the distribution formed by the Boltzman machine.

Entropy-regularized Wasserstein Loss

This has been used for multi-label classification. It is characterized by a relaxation of the transport problem which addresses unnormalized measure. It does this by replacing the equality constraints with soft penalties with respect to KL-divergence.^[8]

Domain Adaptation

In this case the goal is to learn about or extrapolate from one domain to another, often by finding domain-invariant representations ^[9] In cases where these sorts of models are useful the distributions are generally different but related. This is a technique that is often used to transfer information based on labelled data to unlabeled data. For example, in some cases the labels are available in the source domain, but the classification needs to be conducted on the target. If the classifier is trained on the source domain, the data will have performance deficiencies in the targeted domain ^[10].

By obtaining the best transportation plan connecting the probability distributions of source and target domains, estimates of learning samples are estimated. The transformation is non-linear and invertible. This allows for the use of a variety of machine learning methods that can be used on the transformed dataset. Regularized, unsupervised models have been used, as well as Joint Class Proportion and Optimal Transport (JCPOT) to address multi-source domain adaptation. ^[8]The following is a summary of an algorithm implemented by Redko, et al:^[11]

Algorithm 2: Joint Class Proportion and Optimal (JCPOT)

   1: Parameters:  $\epsilon ,$  maxIter,  $\forall k\left(\mathrm {C} ^{(k)}and\lambda ^{(k)}\right)$ 
   2:  $cpt\leftarrow 0$ ,  $err\leftarrow \infty$ 
   3:  $\zeta ^{(k)}\leftarrow \exp \left(-{\frac {\mathbf {C} ^{(k)}}{\epsilon }}\right),\quad \forall k$ 
   4: while  $cpt$  < maxIter and  $err$  > threshold do
   5:    $\zeta ^{(k)}\leftarrow \operatorname {diag} \left({\frac {\mathbf {m} }{c(k)\mathbf {1} }}\right)\zeta ^{(k)},\quad \forall k$ 
   6:    $\mathbf {h} ^{(cpt)}\leftarrow exp(\sum _{k=1}^{K}\lambda ^{(k)}\log(\mathbf {D} _{1}^{(k)}\zeta ^{(k)}1))$      
   7:    $\zeta ^{(k)}\leftarrow \zeta ^{(k)}\operatorname {diag} \left({\frac {\mathbf {D} _{2}^{(k)}\mathbf {h} }{\zeta ^{(k)}\mathbf {1} }}\right),\quad \forall k$ 
   8:    $\operatorname {err} \leftarrow \left\|\mathbf {h} ^{(cpt)}-\mathbf {h} ^{(cpt-1)}\right\|_{2}$ 
   9:    $\quad cpt\leftarrow cpt+1$

This algorithm describes the optimization problem solved in Redko, et al which used operators defined such that proportions between the source and target domains are equal. A Wasserstein barycenter problem was used to achieve this (Benamou, Redko).

Bayesian Inference

An important aspect of Bayesian inference is calculating expectations given a posterior probability function. This creates a complicated problem, mainly because of multidimensional integrals. The method usually used to solve these problems is Monte Carlo Markov Chains, which in turn introduce complications (slow convergence to empirical expectation, lack of sample independence) ^[1]. The need for these complicated MCMC models was bypassed in the work of El Moselhy and Marzouk ^[12] In a transport based method, Bayesian inference was formulated in terms of optimal transport theory, and the existence and uniqueness of a measure-preserving map was established. The map was subsequently parametrized and via the solution of an optimization problem. This approach helps bypass many of the problems associated with the MCMC models.

Hypothesis Testing

Additional Resources

Recent research on cutting edge topics relating to optimal transport applications to machine learning can be found at the following site: https://www.msri.org/workshops/928.

Tutorials on the many methods used can be found at the following: Optimal Transport for Machine Learning ^[10].

References

↑ ^1.0 ^1.1 Kolouri, Soheil, Serim Park, Matthew Thorpe, Dejan Stepcey, Gustavo Rohde. Optimal Mass Transport: Signal Processing and Machine-Learning Applications. IEEE Signal Process Mag. 2017 Jul. 34(4): 43-59.
↑ Alvarez-Melis, David, Et al. Structured Optimal Transport. Proceedings of the 21st International Conference on Artifi- cial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84.
↑ Solomon, J., et al. Wasserstein Propagation for Semi-Supervised Learning. Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copy- right 2014.
↑ Gao, Tingran, et al. Wasserstein Soft Label Propagation on Hypergraphs: Algorithm and Generalization Error Bounds. Univ. of Chicago. 2019
↑ ^5.0 ^5.1 ^5.2 ^5.3 Arjovsky, Martin, et al. Wasserstein GAN. arXiv:1701.07875v3 [stat.ML] 6 Dec 2017
↑ Gulrajani, I., et al. Improved Training of Wasserstein GANs. arXiv:1704.00028v3 [cs.LG] 25 Dec 2017.
↑ Montavon, G., Klaus-Robert Muller, Marco Cuturi. Wasserstein Training of Restricted Boltzmann Machines. 30th Conference on Neural Information Processing Systems (NIPS 2016)
↑ ^8.0 ^8.1 Peyre, Gabriel, and Marco Cuturi, "Computational Optimal Transport", Github, https://optimaltransport.github.io/pdf/ComputationalOT.pdf.
↑ Courty, Nicolas, et al. Optimal Transport for Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. X, No. X, Jan XX. 2018
↑ ^10.0 ^10.1 Flamary, R. Domain Adaptation with Optimal Transport: From Mapping to Learning with Joint Distribution Cite error: Invalid <ref> tag; name "Flamary" defined multiple times with different content
↑ Redko, I., et al. Optimal Transport for Multi-source Domain Adaptation under Target Shift. arXiv:1803.04899v1 [stat.ML] 13 Mar 2018 et al.
↑ El Moselhy, Tarek A., Youssef M Marzouk. Bayesian Inference with Optimal Maps. Journal of Computational Physics 231(23). September 2011.

[Kolouri-1] 1.0 ^1.1 Kolouri, Soheil, Serim Park, Matthew Thorpe, Dejan Stepcey, Gustavo Rohde. Optimal Mass Transport: Signal Processing and Machine-Learning Applications. IEEE Signal Process Mag. 2017 Jul. 34(4): 43-59.

[Alvarez-Melis-2] Alvarez-Melis, David, Et al. Structured Optimal Transport. Proceedings of the 21st International Conference on Artifi- cial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84.

[Solomon-3] Solomon, J., et al. Wasserstein Propagation for Semi-Supervised Learning. Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copy- right 2014.

[Gao-4] Gao, Tingran, et al. Wasserstein Soft Label Propagation on Hypergraphs: Algorithm and Generalization Error Bounds. Univ. of Chicago. 2019

[Arjovsky-5] 5.0 ^5.1 ^5.2 ^5.3 Arjovsky, Martin, et al. Wasserstein GAN. arXiv:1701.07875v3 [stat.ML] 6 Dec 2017

[Gulrajani-6] Gulrajani, I., et al. Improved Training of Wasserstein GANs. arXiv:1704.00028v3 [cs.LG] 25 Dec 2017.

[Montavon-7] Montavon, G., Klaus-Robert Muller, Marco Cuturi. Wasserstein Training of Restricted Boltzmann Machines. 30th Conference on Neural Information Processing Systems (NIPS 2016)

[Peyre-8] 8.0 ^8.1 Peyre, Gabriel, and Marco Cuturi, "Computational Optimal Transport", Github, https://optimaltransport.github.io/pdf/ComputationalOT.pdf.

[Courty-9] Courty, Nicolas, et al. Optimal Transport for Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. X, No. X, Jan XX. 2018

[Flamary-10] 10.0 ^10.1 Flamary, R. Domain Adaptation with Optimal Transport: From Mapping to Learning with Joint Distribution Cite error: Invalid <ref> tag; name "Flamary" defined multiple times with different content

[Redko-11] Redko, I., et al. Optimal Transport for Multi-source Domain Adaptation under Target Shift. arXiv:1803.04899v1 [stat.ML] 13 Mar 2018 et al.

[El_Moselhy-12] El Moselhy, Tarek A., Youssef M Marzouk. Bayesian Inference with Optimal Maps. Journal of Computational Physics 231(23). September 2011.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

Machine Learning

Contents

Optimal Transport Methods Machine Learning

Introduction

Learning Methods

Graph-based Semi-supervised Learning

Generative Adversarial Networks (GAN)

Wasserstein GAN (WGAN)

Algorithm 1: Wasserstein Generative Adversarial Networks (WGAN)

Restricted Bolzman Machines (RMB)

Entropy-regularized Wasserstein Loss

Domain Adaptation

Algorithm 2: Joint Class Proportion and Optimal (JCPOT)

Bayesian Inference

Hypothesis Testing

Additional Resources

References

Navigation menu

Machine Learning

Optimal Transport Methods Machine Learning

Introduction

Learning Methods

Graph-based Semi-supervised Learning

Generative Adversarial Networks (GAN)

Wasserstein GAN (WGAN)

Algorithm 1: Wasserstein Generative Adversarial Networks (WGAN)

Restricted Bolzman Machines (RMB)

Entropy-regularized Wasserstein Loss

Domain Adaptation

Algorithm 2: Joint Class Proportion and Optimal (JCPOT)

Bayesian Inference

Hypothesis Testing

Additional Resources

References

Navigation menu

Search