== Entropic Regularization ==
Due to the convex structure of the Kantorovich problem, it is natural to turn to methods that, upon discretizing, exploit the fact that we aim to solve a linear programming problem. We replace our measures <math> \mu, \nu </math> with <math> \mu = \sum_{i=1}^{N}a_{i}\delta_{x_{i}} </math> and <math> \nu = \sum_{j=1}^{M}b_{j}\delta_{y_{j}} </math> for some appropriately determined <math> a_{i},b_{j} </math> and fixed <math> N, M </math>. The resulting minimization problem becomes:


<math> \min \left( \sum_{i,j}c_{ij}\alpha_{ij} : \alpha_{ij} \geq 0, \sum_{i}\alpha_{ij} = b_{j}, \sum_{j}\alpha_{ij} = a_{i} \right)</math>
which we may approach using classical methods that exploit the structure of the resulting linear system.
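For concreteness, here is a minimal sketch of this discrete linear program using SciPy's <code>linprog</code> (the helper name <code>discrete_kantorovich</code> and the dense constraint matrices are illustrative assumptions, not part of the source; the weights <math> a_{i}, b_{j} </math> must share the same total mass for feasibility):
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linprog

def discrete_kantorovich(a, b, C):
    # Flatten alpha (N x M) into a vector and encode the marginal
    # constraints sum_j alpha_ij = a_i and sum_i alpha_ij = b_j.
    N, M = C.shape
    row_sums = np.kron(np.eye(N), np.ones((1, M)))   # N row constraints
    col_sums = np.kron(np.ones((1, N)), np.eye(M))   # M column constraints
    res = linprog(C.ravel(),
                  A_eq=np.vstack([row_sums, col_sums]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(N, M)
</syntaxhighlight>
This works for moderate <math> N, M </math>, but the dense constraint matrix grows quickly, which motivates the regularized approach below.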
== The Entropic Regularization ==
In studying a slight variant of this problem, for fixed <math> \varepsilon > 0 </math> we may instead focus on tackling
<math> \min \left( \sum_{i,j}(c_{ij}\alpha_{ij} + \varepsilon \alpha_{ij}\log(\alpha_{ij})) : \alpha_{ij} \geq 0, \sum_{i}\alpha_{ij} = b_{j}, \sum_{j}\alpha_{ij} = a_{i} \right) </math>
whose minimizers converge to a solution of the original problem as <math> \varepsilon \rightarrow 0 </math>. Rewriting the objective function at hand using
<math> c_{ij}\alpha_{ij} + \varepsilon \alpha_{ij}\log(\alpha_{ij}) = \varepsilon \alpha_{ij}\log(\frac{\alpha_{ij}}{\eta_{ij}}),\ \ \ \sum_{i,j}(c_{ij}\alpha_{ij} + \varepsilon \alpha_{ij}\log(\alpha_{ij})) = \varepsilon \textbf{KL}(\alpha| \eta),  </math>
where <math> \eta_{ij} = e^{-c_{ij}/\varepsilon} </math> and <math> \textbf{KL} </math> denotes the ''Kullback–Leibler'' divergence (based on the relative entropy).
Thus this minimization problem reads as the projection, with respect to this divergence, of the point <math> \eta \in \mathbb{R}^{N \times M} </math> onto the set of constraints
<math> C = C^{x} \cap C^{y} = \{\alpha \in \mathbb{R}^{N\times M} : \sum_{j} \alpha_{ij} = a_{i}\} \cap \{\alpha \in \mathbb{R}^{N\times M} : \sum_{i} \alpha_{ij} = b_{j}\}. </math>
In particular, the Kullback–Leibler divergence has the property that the projection onto the intersection of two affine subspaces such as <math> C^{x}, C^{y} </math> can be obtained by alternating projections. This is done by defining <math> \alpha^{2k+1} = \mathrm{Proj}_{C^{x}}(\alpha^{2k}) </math> and <math> \alpha^{2k+2} = \mathrm{Proj}_{C^{y}}(\alpha^{2k+1}) </math> with <math> \alpha^{0} = \eta </math> and taking the limit as <math> k \rightarrow \infty </math>.
The problem thus reduces to projecting onto one of these affine subspaces, which can be done explicitly. Given <math> \beta </math>, we wish to solve
<math> \min \left( \sum_{i,j}\alpha_{ij}\log\left( \frac{\alpha_{ij}}{\beta_{ij}}\right) : \sum_{j} \alpha_{ij} = a_{i} \right), </math>
whose solution is obtained by rescaling each row of <math> \beta </math>, namely <math> \alpha_{ij} = a_{i}\beta_{ij}/\sum_{k}\beta_{ik} </math>, as in the sketch below.
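A minimal sketch of this row projection, written in Python with NumPy (the helper name <code>project_rows</code> is illustrative, not part of the source):
<syntaxhighlight lang="python">
import numpy as np

def project_rows(beta, a):
    # KL projection of beta onto C^x = {alpha : sum_j alpha_ij = a_i}:
    # rescale each row of beta so that its sum equals a_i.
    return beta * (a / beta.sum(axis=1))[:, None]
</syntaxhighlight>
The projection onto <math> C^{y} </math> is obtained symmetrically by rescaling columns.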
== Algorithm ==
*Start from <math>\alpha^{0} = \eta </math>
*Define <math> \alpha_{ij}^{2k+1} = p_{i}\alpha_{ij}^{2k} </math> with <math> p_{i} = a_{i}/\sum_{j}\alpha_{ij}^{2k} </math>
*Then define <math> \alpha_{ij}^{2k+2} = q_{j}\alpha_{ij}^{2k+1} </math> with <math> q_{j} = b_{j}/\sum_{i}\alpha_{ij}^{2k+1} </math>
The limit we observe as <math> k \rightarrow \infty </math> provides the solution to the regularized problem. This scheme is called the iterative proportional fitting procedure ('''IPFP'''). Thanks to the entropic regularization, the algorithm enjoys the following advantages:
*the projections can be computed explicitly at every step, since we only ever project onto <math> C^{x} </math> or <math> C^{y} </math>
*it is quite memory-efficient, since only the cumulative products of the coefficients <math> p_{i}, q_{j} </math> need to be updated at each step, allowing us to store only <math> N + M </math> values as opposed to <math> NM </math>
*it can be easily parallelized, given the separability of the structure at any given step; a sketch of the full iteration is given below.
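A minimal sketch of the full iteration under the conventions above, written in Python with NumPy (the function name <code>ipfp</code> and the fixed iteration count are illustrative assumptions; in practice one would iterate until the marginals match to a desired tolerance):
<syntaxhighlight lang="python">
import numpy as np

def ipfp(a, b, C, eps, n_iter=1000):
    # Gibbs kernel eta_ij = exp(-c_ij / eps); may underflow for very small eps.
    eta = np.exp(-C / eps)
    # u and v accumulate the products of the row and column coefficients
    # p_i and q_j, so only N + M numbers are updated per iteration.
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (eta @ v)      # projection onto C^x: fix the row sums to a_i
        v = b / (eta.T @ u)    # projection onto C^y: fix the column sums to b_j
    # Recover the transport plan alpha = diag(u) eta diag(v).
    return u[:, None] * eta * v[None, :]
</syntaxhighlight>
Storing only <code>u</code> and <code>v</code> between iterations is exactly the <math> N + M </math> memory saving noted above.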
== References ==
*[https://link-springer-com.proxy.library.ucsb.edu:9443/content/pdf/10.1007%2F978-3-319-20828-2.pdf F. Santambrogio, ''Optimal Transport for Applied Mathematicians'', Chapter 6.]
