The Wasserstein distance seems to have been reinvented more often than the wheel. It pops up under various names, including “Earth Mover’s Distance”, and “Mallows Distance”. One reason people keep reinventing it is probably because it comes in quite handy when comparing two probability distributions.

There are two interpretations for the Wasserstein distance, and one involves putting stuff in a wheelbarrow and lugging it about. We’ll start with that one.

Say your company owns one mine and one factory. The mine is at location {x} and the factory at location {y} in Flatland, ie {x,y\in\mathbb{R}}. You are VP for Logistics, and your job is to carry the ore from the factory to the mine. Your company kindly provided you with a wheelbarrow for that purpose.

How much work to you have to do, in the physical sense? If you carry {\alpha} tons of ore in a wheelbarrow, the total amount of work done is {|x-y|\times\alpha}, i.e. distance times mass.

Now let’s say you have two mines (at {x_{1}} and {x_{2}}) producing amount {\alpha_{1}} and {\alpha_{2}}, and two factories (at {y_{1}} and {y_{2}}) that require amounts {\beta_{1}} and {\beta_{2}}, and assume {\alpha_{1}+\alpha_{2}=\beta_{1}+\beta_{2}}. You need both factories to get the amount they want. You also want to minimise how much work you have to do. This latter quantity is given by a simple weighted sum:

\displaystyle  w=\sum_{i,j}\left|x_{i}-y_{j}\right|\times m_{ij}=\sum d_{ij}m_{ij}

in which {m_{ij}} is the amount of stuff going from mine {i} to factory {j}.

What we can already guess at this stage is that unless each factory is exactly opposite a mine, and consumes exactly what that mine produces, there’s going to be some work involved. The minimal amount of work you can do is given by:

\displaystyle  \min\,\sum d_{ij}m_{ij}

subject to the following conditions

  1. {\sum_{j}m_{ij}=\alpha_{i}} (a mine can’t give more than it has)
  2. {\sum_{i}m_{ij}=\beta_{i}} (a factory doesn’t take in more than it needs)
  3. {m_{ij}\geq0} for rather plain and boring reasons.

We immediately recognise a classical linear program (linear objective function, linear constraints), and we know that no closed-form solution is available in the general case.

Mines and factories are all well and good, but what does this have to do with probability distributions? Well, if we now assume that {\sum\alpha_{i}=1} and {\sum\beta_{i}=1}, our mines and factories problem corresponds to having two discrete densities, {p(x)} and {q(y)}, such that:

\displaystyle  p(x)=\sum\alpha_{i}\delta(x-x_{i})

and similarly:

\displaystyle  q(y)=\sum\beta_{i}\delta(y-y_{i})

The {\delta} are Dirac “point-masses” at the locations of the factories and mines. At this stage there’s an obvious extension to continuous densities that we won’t get into.

The Wasserstein distance is the minimum work you have to do to “transport” {p(x)} onto {q(y)}. Why does this make sense as a distance between {p} and {q}? One intuition is that before we noted that having no work to do, and therefore a distance of 0, meant that “each factory is exactly opposite a mine, and consumes exactly what that mine produces”. This just means {p=q}