Andrew Gelman weighs in on the Wishart issue discussed here previously: should people be using (scaled) inverse Wishart priors? The basic problem that prompted my post and Matt Simpson’s follow-up was that the inverse Wishart prior makes correlation and scale dependent, so that when looking at two effects you might get a different estimate of their correlation depending on how large they are. There’s a good way to avoid the problem, described in a paper by Barnard, McCulloch, and Meng. I was wondering why it’s not the default choice, and Andrew’s answer is that using this much better prior slows things down, which I can well believe. I also agree with Andrew that the problem can be worked around. However, I still think that one should be careful about advocating the scaled inverse-Wishart as a default choice, even though it is the convenient one.
The reason is that end users of a statistical method don’t expect the method to mix up two things they think of as independent. We think of correlation and scale as being two different things, and we can come up with reasonable constraints on what they should be. Losing this independence is a fairly significant sacrifice to computational convenience.
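To make the dependence concrete, here is a minimal simulation sketch. It assumes a 2x2 inverse Wishart with 3 degrees of freedom and an identity scale matrix (the setting where the marginal prior on the correlation is uniform); the function names are made up for illustration, and the draws are built from plain NumPy rather than any particular library's sampler.

```python
import numpy as np

def sample_inverse_wishart(n_draws, dim=2, df=3, seed=0):
    """Draw inverse Wishart(df, I) matrices by inverting Wishart draws."""
    rng = np.random.default_rng(seed)
    # A Wishart(df, I) draw is X'X with X a (df x dim) standard normal matrix.
    x = rng.standard_normal((n_draws, df, dim))
    wishart = np.einsum("nki,nkj->nij", x, x)
    return np.linalg.inv(wishart)

def scale_correlation_dependence(n_draws=20000, seed=0):
    """Pearson correlation between log standard deviation and |correlation|."""
    sigma = sample_inverse_wishart(n_draws, seed=seed)
    sd1 = np.sqrt(sigma[:, 0, 0])
    sd2 = np.sqrt(sigma[:, 1, 1])
    rho = sigma[:, 0, 1] / (sd1 * sd2)
    return np.corrcoef(np.log(sd1), np.abs(rho))[0, 1]

if __name__ == "__main__":
    # A value clearly above zero means large scales go with strong correlations.
    print(scale_correlation_dependence())
```

If scale and correlation were independent under the prior, this number would hover around zero. Under these assumptions it comes out clearly positive: draws with large variances tend to be the ones with correlations far from zero.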
I can think of two more “priors of convenience” that have similar issues, which I should say are well known to people in the field. One is the Lasso-style prior, where coefficients in a regression are given a Laplace prior:
$p(\beta_i) \propto \exp(-\lambda |\beta_i|)$
The effect of having a high value of $\lambda$ is to make the maximum a posteriori estimate of the $\beta_i$’s sparse, i.e. with many coefficients set to 0. The Lasso is amazingly convenient from the computational point of view, but it confuses scale and sparsity, which should stay independent things: increasing $\lambda$ will kill some coefficients but will also shrink the remaining ones. Here also there are work-arounds, such as scaling the data matrix, but you need to know what you are doing. On the other hand, fixing the problem the clean way is a lot harder computationally, so maybe the trade-off is worth it in most applications.
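The double effect is easiest to see in a special case. The sketch below assumes an orthonormal design and unit noise variance, where the MAP estimate under a Laplace prior with rate `lam` reduces to soft-thresholding of the least-squares estimates; the numbers and the function name are invented for illustration.

```python
import numpy as np

def laplace_map_estimate(z, lam):
    """MAP estimate under a Laplace prior (orthonormal design, unit noise).

    Minimises 0.5 * (b - z)**2 + lam * abs(b) for each coordinate,
    which is the soft-thresholding operator.
    """
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

if __name__ == "__main__":
    z = np.array([3.0, -0.5, 1.5])  # least-squares estimates (made-up numbers)
    for lam in (0.0, 1.0, 2.0):
        print(lam, laplace_map_estimate(z, lam))
```

Raising `lam` from 0 to 1 zeroes the small coefficient, but it also pulls the two surviving coefficients towards zero by the same amount. A single parameter controls both sparsity and shrinkage.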
Another problematic prior from that point of view is the random walk Gauss-Markov process as a substitute for a Gaussian process prior. Gaussian processes let you express very nicely a prior distribution over a space of functions, and you can parameterise them so that one parameter expresses how much you expect your function to vary vertically (scale) and another how much you expect it to vary over time (correlation length). These are completely orthogonal quantities in all the applications I can think of, and it is very good that the prior keeps them that way. The bad news is that Gaussian processes are complex, computationally speaking, and the Gauss-Markov approximation leads to much, much, much faster inference (see INLA). The approximation comes at a cost: the Gauss-Markov prior confuses scale and correlation length, which I always find a pain.
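Here is a small sketch of what I mean, under assumptions of my own choosing: a squared-exponential kernel stands in for "a Gaussian process prior", and a plain random walk started at zero stands in for the Gauss-Markov substitute. The function and parameter names are illustrative only.

```python
import numpy as np

def se_covariance(s, t, scale, length):
    """Squared-exponential GP covariance between times s and t."""
    return scale**2 * np.exp(-0.5 * ((s - t) / length) ** 2)

def se_correlation(s, t, scale, length):
    """Correlation implied by the squared-exponential covariance."""
    var_s = se_covariance(s, s, scale, length)
    var_t = se_covariance(t, t, scale, length)
    return se_covariance(s, t, scale, length) / np.sqrt(var_s * var_t)

def random_walk_covariance(s, t, innovation_var):
    """Covariance of a random walk started at zero: one knob only."""
    return innovation_var * min(s, t)

if __name__ == "__main__":
    # Changing the scale leaves the correlation at lag 1 untouched...
    print(se_correlation(0.0, 1.0, 1.0, 2.0), se_correlation(0.0, 1.0, 5.0, 2.0))
    # ...and changing the length leaves the marginal variance untouched.
    print(se_covariance(0.0, 0.0, 3.0, 1.0), se_covariance(0.0, 0.0, 3.0, 10.0))
    # The random walk has a single parameter that sets the expected squared
    # change over any lag h (innovation_var * h).
    print(random_walk_covariance(2.0, 5.0, 0.5))
```

In the GP case the two parameters can be set independently. In the random walk, the single innovation variance determines both how far the function wanders and how fast it does so.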
Frequentists have a point when they criticise Bayesians who argue that priors are fantastic because they allow you to express useful prior knowledge, and then turn around and use a conjugate prior because that’s what’s convenient. If we really want to allow routine use of Bayesian inference, we have to work more on adapting the computational methods to the prior rather than the other way around. I admit it is easier said than done.
Hmmm. I don’t really agree with you here Simon! (Warning: this is long and could miss any number of points)
- I don’t think “need[ing] to know what you are doing” is *ever* a downside of an algorithm! This goes with what appears to be the underlying assumption on these two posts that default (or conjugate) priors are chosen to be “uninformative”. (which you’ve shown to be untrue). ((Well, you’ve shown that there are assumptions encoded in these priors, which is not necessarily the same thing as them being informative for the problem at hand))
- Also, the Laplace prior is a “prior of convenience” only because it matches with a particular frequentist technique. So any issues that it has actually come not from ‘prior choice’ but from matching a dodgy method!
- On that point, I don’t think the people who invented quasi-likelihoods get to complain about suboptimal ‘methods of convenience’. They risk sounding silly.
- The GMRF vs GRF thing is correct, I guess. But it’s easy to fix. We know exactly how the “scaling parameter” (tau in the SPDE paper) and the “range” parameter (kappa) are related, so you can of course parameterise them independently. We don’t, but you’re welcome to. It’s also worth pointing out that your argument only makes sense for stationary fields (which is usually a fairly optimistic assumption) and for nonstationary fields this is a non-issue.
- I also don’t agree that thinking of things as being separate entities is a reason to model them as independent. The classic example here would be range and smoothness of a random field. Depending on how you observe it, they may not be distinguishable, yet logically they are quite different things. So independence is a bad assumption.
- For the Scaled-IW vs the BMM prior argument, coming from the point of ignorance (not having read the BMM paper), I would definitely say that literally anything that requires you to work with correlation matrices instead of general positive definite matrices is going to be horrible and slow. From the pictures of Matt Simpson’s website, the SIW prior seems to work *fine* by the metrics that he (and you) have chosen. As it smooths out the problem and doesn’t explode the running time, I would say that it’s a pretty good choice.
- It also strikes me that the area that the IW prior has trouble with is in a “weird” bit of the parameter space – when the correlations are near +/- 1, you’re on the absolute edge of the parameter space. There are two things here: 1) do I really care what’s happening out there? (and that is very much problem dependent) and 2) I don’t know what should happen out here. This second point is again problem dependent and brings up the whole world of pain that is the concept of a “non-informative” prior being a very problem dependent notion. As far as I’m concerned (although I’ve been known to be out on weird limbs with these types of opinions before), we just want a prior that doesn’t get in the way (if we’re looking for something “non-informative”). So I’m just not sold on the idea that this behaviour “at infinity” is really so devastating. At least not devastating enough to dump the SIW prior in favour of working with some truly terrible structures.
Hi Dan! Thanks for your comment. I guess I was trying to make two related points: for the sake of computational convenience, certain priors
a) have fewer degrees of freedom than they should (I mean “degree of freedom” in the engineering sense of the word)
b) embed a kind of dependence that makes no sense as a default choice
I’m ready to admit that point (b) comes mostly from armchair psychology, but it seems reasonable to me. To use a clumsy mechanical metaphor, if you have a car with a gearbox and a steering wheel, you don’t expect that turning the steering wheel is going to shift gears. My original point was that the IW prior does a version of that, and using the sensible alternative put forward by Barnard, McCulloch and Meng doesn’t seem like such a computational ordeal *in small dimensions*, although admittedly I haven’t tried yet. I imagine that for high dimensional correlation matrices, you really want a prior that is seriously informative, so it’s kind of a separate issue.
On the range vs. smoothness issue, I haven’t thought much about it, but I don’t see any cases of actual data in the fields I work in where you could distinguish between the two (except in the extreme case where the distinction is between continuous or not). It’s more of an academic distinction: psychologically, you end up asking yourself whether you expect f(x) to be close in value to f(x+blah).
I didn’t know you could retain the scaling vs. range distinction in GMRFs, that’d be a very useful feature, I think.
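As a rough sketch of the reparameterisation Dan mentions, one can convert between (kappa, tau) and a (standard deviation, range) pair. This assumes the usual Matérn relations quoted for the SPDE approach: marginal variance Γ(ν) / (Γ(ν + d/2) (4π)^(d/2) κ^(2ν) τ²) and an approximate range of sqrt(8ν)/κ. The function names, and the defaults ν = 1 and d = 2, are my own choices for illustration.

```python
import math

def spde_from_sigma_range(sigma, corr_range, nu=1.0, dim=2):
    """Map (marginal sd, range) to (kappa, tau) under the assumed relations."""
    kappa = math.sqrt(8.0 * nu) / corr_range
    const = math.gamma(nu) / (math.gamma(nu + dim / 2.0) * (4.0 * math.pi) ** (dim / 2.0))
    tau = math.sqrt(const / (kappa ** (2.0 * nu) * sigma**2))
    return kappa, tau

def sigma_range_from_spde(kappa, tau, nu=1.0, dim=2):
    """Inverse map: (kappa, tau) to (marginal sd, range)."""
    const = math.gamma(nu) / (math.gamma(nu + dim / 2.0) * (4.0 * math.pi) ** (dim / 2.0))
    sigma = math.sqrt(const / (kappa ** (2.0 * nu) * tau**2))
    corr_range = math.sqrt(8.0 * nu) / kappa
    return sigma, corr_range

if __name__ == "__main__":
    kappa, tau = spde_from_sigma_range(sigma=2.0, corr_range=10.0)
    print(kappa, tau)
    print(sigma_range_from_spde(kappa, tau))  # recovers (2.0, 10.0) up to rounding
```

With this mapping, priors could be placed on the standard deviation and the range separately and then converted, which would keep the two notions apart in the way I was asking for.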
Finally, I agree with you that frequentists use a lot of fairly ad-hoc stuff, and tend to assume n=Inf whenever convenient. The problem for Bayesians is that compared to frequentists we have one more component that could blow up in our face, namely, the prior. Finding a good default prior for everyday users is a tough problem. Maximum likelihood, except in well-documented special cases like mixture models, is less likely to go really wrong when confronted with users who don’t know as much as we’d like them to. That doesn’t necessarily mean default priors have to be non-informative, although for things like routine regressions and GLMs a bit of resistance to how the data are scaled is a good thing to have. Users don’t expect results to change if they express, say, height in centimeters rather than inches. Ideally you’d have to do large scale studies with actual end users to see what sort of prior doesn’t blow up in spectacular ways. Not sure that’s going to happen.
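To illustrate the centimeters-versus-inches point, here is a toy sketch with made-up numbers. It assumes a one-predictor regression through the origin with unit noise variance, and it uses a fixed Gaussian prior on the slope (swapped in for simplicity; it is not one of the priors discussed above).

```python
import numpy as np

def fitted_values(x, y, prior_precision):
    """Posterior-mode fit of y = beta * x with a N(0, 1/prior_precision) prior.

    Unit noise variance is assumed; prior_precision=0 gives least squares.
    """
    beta = (x @ y) / (x @ x + prior_precision)
    return beta * x

if __name__ == "__main__":
    height_cm = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])  # centred, made-up
    height_in = height_cm / 2.54
    y = np.array([-5.2, -2.4, 0.1, 2.6, 4.9])

    # Least squares: the fitted values do not depend on the unit.
    print(fitted_values(height_cm, y, 0.0), fitted_values(height_in, y, 0.0))
    # Fixed prior on the slope: the fitted values change with the unit.
    print(fitted_values(height_cm, y, 10.0), fitted_values(height_in, y, 10.0))
```

The least-squares fit is identical in both units, while the fit under the fixed prior is not, because the same prior on the slope means something different once the predictor is rescaled.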