Andrew Gelman weighs in on the Wishart issue discussed here previously – should people be using (scaled) inverse Wishart priors? The basic problem that prompted my post and Matt Simpson’s follow-up was that the inverse Wishart prior makes correlation and scale dependent, so that when looking at two effects \beta_{1} and \beta_{2} you might get a different estimate of their correlation depending on how large they are. There’s a good way to avoid the problem, described in a paper by Barnard, McCulloch, and Meng. I was wondering why it’s not the default choice, and Andrew’s answer is that using this much better prior slows things down, which I can well believe is true. I also agree with Andrew that the problem can be worked around. However, I still think that one really should be careful about advocating the scaled inverse-Wishart as a default choice, even though it is the convenient one.

The reason is that end users of a statistical method don’t expect the method to mix up two things they think of as independent. We think of correlation and scale as being two different things, and we can come up with reasonable constraints on what they should be. Losing this independence is a fairly significant sacrifice to computational convenience.
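To make the issue concrete, here is a minimal sketch in Python (using numpy and scipy, with a 2×2 covariance matrix and hyperparameters chosen purely for illustration). It draws from an inverse-Wishart prior and checks how the implied scale and correlation co-vary, then does the same for a separation-style prior in the spirit of Barnard, McCulloch, and Meng, where scales and correlation get independent priors by construction.

```python
# Sketch: how the inverse-Wishart prior ties scale and correlation together,
# versus a "separation strategy" prior where they are drawn independently.
# (2x2 case, arbitrary hyperparameters, purely for illustration.)
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
n_draws = 20_000

# --- Inverse-Wishart prior on the full covariance matrix ---
iw_draws = invwishart.rvs(df=3, scale=np.eye(2), size=n_draws, random_state=0)
sd1 = np.sqrt(iw_draws[:, 0, 0])
rho = iw_draws[:, 0, 1] / np.sqrt(iw_draws[:, 0, 0] * iw_draws[:, 1, 1])
print("IW:         corr(log sd, |rho|) =",
      np.corrcoef(np.log(sd1), np.abs(rho))[0, 1])

# --- Separation strategy: Sigma = diag(sigma) R diag(sigma) ---
# Scales and correlation get independent priors (lognormal and uniform here,
# just as an example), so by construction they cannot interfere.
sigma = rng.lognormal(mean=0.0, sigma=1.0, size=(n_draws, 2))
r = rng.uniform(-1.0, 1.0, size=n_draws)   # 2x2 case: one free correlation
print("Separation: corr(log sd, |rho|) =",
      np.corrcoef(np.log(sigma[:, 0]), np.abs(r))[0, 1])
```

Under the inverse-Wishart draws the first printed correlation comes out clearly non-zero (large standard deviations tend to go together with correlations near ±1), whereas the separation prior gives essentially zero, as it should.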

I can think of two more “priors of convenience” that have similar issues, which I should say are well known to people in the field. One is the Lasso-style prior, where coefficients in a regression are given a Laplace prior:

p(\beta_{i}|\lambda)\propto\exp\left(-\lambda\left|\beta_{i}\right|\right)

The effect of a high value of \lambda is to make the maximum a posteriori estimate of the \beta’s sparse, i.e. with many coefficients set to 0. The Lasso is amazingly convenient from the computational point of view, but it confuses scale and sparsity, which should remain independent: increasing \lambda will kill some coefficients but will also shrink the remaining ones. Here too there are workarounds, such as scaling the data matrix, but you need to know what you are doing. On the other hand, fixing the problem the clean way is a lot harder computationally, so maybe the trade-off is worth it in most applications.
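As a small illustration of the scale/sparsity entanglement, here is a sketch using scikit-learn’s Lasso on simulated data; its alpha parameter plays the role of \lambda up to a constant. Increasing alpha zeroes out more coefficients and, at the same time, shrinks the surviving ones.

```python
# Sketch: the Lasso penalty controls sparsity and scale with a single knob.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.concatenate([[3.0, -2.0, 1.5, 1.0, 0.5], np.zeros(p - 5)])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

for alpha in [0.01, 0.1, 0.5, 1.0]:
    fit = Lasso(alpha=alpha).fit(X, y)
    survivors = fit.coef_[fit.coef_ != 0]
    print(f"alpha={alpha:4}: {np.sum(fit.coef_ == 0)} zeros, "
          f"mean |beta| of survivors = {np.abs(survivors).mean():.2f}")
```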

Another problematic prior from that point of view is the random walk Gauss-Markov process used as a substitute for a Gaussian process prior. Gaussian processes let you express very nicely a prior distribution over a space of functions, and you can parameterise them so that one parameter expresses how much you expect your function to vary vertically (scale) and another over what time-scale you expect it to vary (correlation length). These are completely orthogonal quantities in all the applications I can think of, and it is very good that the prior keeps them that way. The bad news is that Gaussian processes are computationally expensive, and the Gauss-Markov approximation leads to much, much, much faster inference (see INLA). The approximation comes at a cost, though: the Gauss-Markov prior confuses scale and correlation length, which I always find a pain.
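A minimal numerical sketch of the contrast, assuming a squared-exponential GP kernel and, as the simplest Gauss-Markov example, a stationary AR(1) process (the random walk being its non-stationary limit). In the GP, the vertical scale and the correlation length are separate knobs; in the AR(1), the single parameter \phi enters both the marginal variance and the correlation length, so you cannot move one without moving the other.

```python
import numpy as np

t = np.arange(100.0)

def se_covariance(t, s, ell):
    """Squared-exponential kernel: k(t, t') = s^2 exp(-(t - t')^2 / (2 ell^2))."""
    d = t[:, None] - t[None, :]
    return s**2 * np.exp(-0.5 * (d / ell) ** 2)

# GP: changing ell changes how fast correlation decays, not the marginal variance.
for ell in [2.0, 10.0]:
    K = se_covariance(t, s=1.0, ell=ell)
    print(f"GP    ell={ell:4}: marginal var = {K[0, 0]:.2f}, "
          f"corr at lag 5 = {K[0, 5] / K[0, 0]:.2f}")

# AR(1): x_t = phi * x_{t-1} + eps_t, eps_t ~ N(0, tau^2).
# Marginal variance tau^2 / (1 - phi^2) and lag-k correlation phi^k
# both depend on phi, so scale and correlation length are entangled.
tau = 1.0
for phi in [0.8, 0.98]:
    marg_var = tau**2 / (1 - phi**2)
    print(f"AR(1) phi={phi}: marginal var = {marg_var:.2f}, "
          f"corr at lag 5 = {phi**5:.2f}")
```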

Frequentists have a point when they criticise Bayesians who argue that priors are fantastic because they allow you to express useful prior knowledge, and then turn around and use a conjugate prior because that’s what’s convenient. If we really want to allow routine use of Bayesian inference, we have to work more on adapting the computational methods to the prior rather than the other way around. I admit it is easier said than done.