Markov Chain Monte Carlo is not the most intuitive concept in statistics: that an iterative algorithm can asymptotically generate dependent samples from a certain target distribution (and what exactly this means) takes a bit of getting used to. One analogy I’ve found very helpful is to think of MCMC as shuffling a deck of cards.

Let’s say that we have a deck of k cards. We’ll label the cards 1,2,…,k because we don’t want to bother with suits and colours. What does it mean for cards to be shuffled? It means that there’s no way to tell in which order they’ll come, i.e., every permutation of the deck is equally likely.

We start out from a sorted deck: e.g., the cards are ordered (1,2,…,k). The goal will be to randomly rearrange the cards such that no order is more likely than any other. One way to proceed is to take out the first card, and insert it at a random position in the deck. If we do this a whole bunch of times, this should be sufficient to produce the desired effect: the deck might be in just about any order with equal probability.

The first thing to note is that this shuffling technique can be described as a Markov process. The state space is the space of all permutations on 1..k. The “move” (using MCMC jargon) consists in taking out the first card and reinserting it at random. Since the move depends only on the current state, we can compute the probability distribution over the next state given only the current state: this makes it Markov.

This description makes several key MCMC concepts (somewhat) intuitively evident.

Convergence in distribution: If we start from a fully sorted set and you ask me to guess what the first card is, I’ll know it’s a 1. If you apply the shuffling move once and ask me to guess again, there will be some uncertainty: it is likely to be a 2, but it could also be 1 in case you reinserted 1 in the first slot. At the third step there is again a little bit more randomness and I won’t be able to guess as precisely. If you shuffle enough times I won’t have a clue any more: the first card might be any of 1…k with about equal probability.
In MCMC parlance, the probability distribution of the Markov process after a large number of moves is for all instance and purposes be the uniform distribution over the state space. Each successive move takes us closer to our target distribution.

Invariant distribution: you’ve probably read that MCMC moves should leave the target distribution invariant. That’s the case here: if you have an deck of cards in your hands, and I don’t know what order the cards are in, it does not matter if you shuffle the cards once more, I still won’t know any better. The uniform distribution over possible orderings is invariant to a shuffling move.

Dependent samples: in MCMC the state of the chain is measured at successive iterations. Each measured state is called a “sample”, but it is clear from the example that the samples are not independent: it is relatively easy to guess what the state at time t+1 is if one knows what the state at t is. Another way of viewing this is that the chain moves *locally* in the state space: it does not allow for radical changes.

Speed of convergence: some MCMC moves are better than others. To shuffle the card, a more radical strategy consists in throwing them down the stairs and picking them up again in the order you find them. We may start from a fully ordered set as before but after just one of these shuffling moves it’s clear I won’t do a very good job of predicting which card is on top of the stack. If the cards are sticky, there will still be some dependence on the previous step but is clear that convergence to the uniform distribution happens a lot faster. Viewed another way: our first method couldn’t go from state (1,2,3) to state (3,2,1) in one step. Our second can, and will produce states that are less dependent on one another.

Estimating moments of the distribution from samples: in MCMC we estimate some aspects of the target distribution from the samples, for example the mean of the target distribution is estimated from the mean of the samples. In our case let’s say that we want to estimate the average label of the i’th card (which should be (1+2+...+k )/k ): we could do this by noting the value of the i’th card at each iteration of the process and averaging afterwards. It’s clear that over many iterations no particular card will have more of a chance to be in the i-th spot than the other, so that the sample average will be approximately correct. What may be less clear is that the sample average will converge to the true average much more slowly than if we had *independent* samples from the uniform distribution. Let’s say we start from an ordered set, which will have the card with value i in spot i. It’s quite likely that making one “move” will either not affect the position of the i’th card, or that it will be replaced by one of its neighbours. It’s not at all unlikely that even taking 10 successive samples we’d still get some really stupid value for the average, because the chain moves slowly in the state space. The likelihood of having large deviations from the true value would be much lower with independent samples.

The last thing that I’d like to mention is that although here our MCMC chain can take us from a probability distribution that’s fully concentrated (we know the cards are sorted) to one that’s maximally uncertain (the uniform distribution over possible permutations), it’s easy to do just the reverse by having sorting moves instead of shuffling moves. The next step is to make a chain that converges to any possible distribution – exactly what generic MCMC methods achieve.