While playing around with Bayesian methods for random effects models, it occured to me that inverse-Wishart priors can really bite you in the bum. Inverse Wishart-priors are popular priors over covariance functions. People like them priors because they are conjugate to a Gaussian likelihood, i.e, if you have data with each :

so that the ‘s are correlated Gaussian vectors, and you wish to infer the correlation matrix **S**, then putting an inverse-Wishart prior on **S **is convenient because the posterior distribution is very easy to sample from and its mean can be computed analytically.

The inverse-Wishart works like this: a Wishart distribution produces random positive definite matrices by first producing Gaussian vectors:

the Wishart sample is -times the sample covariance matrix:

which is why for large Wishart samples will look like . is the number of degrees of freedom, and we want to keep it low if the prior is to be relatively noninformative.

A matrix **S **has inverse Wishart distribution if its inverse has Wishart distribution . Again, since for large the inverse will look like , for large **S **will look like . More generally, the mean of the inverse Wishart is , where is the dimension.

Let’s say we want to do Bayesian inference for the correlation of two Gaussian variables. Essentially, that’s a special case of what we started out with: datapoints , with each a 2-dimensional vector containing the observations for the two Gaussian variables. We are interested in one aspect of , namely the marginal for the correlation coefficient, i.e. .

What are generally reasonable assumptions about ? I think we don’t want the prior to be too sensitive to scale – we don’t know how big or small the variances of the components are going to be, but we might have a reasonable range. As far as the correlation coefficient is concerned, it’s fairly rare to have variables that are perfectly correlated, so we might want our prior to shrink correlation coefficients a bit.

A good default choice might then be to take , with **I **the identity matrix. Under the prior , and if we simulate from the prior we get a marginal distribution for the log-variance that looks like:

which is suitably spread over several orders of magnitude. The marginal distribution for the correlation coefficient is:

which is uniform – not what we wanted (we’d like very high absolute correlations to be less probable). That’s not the main problem with this prior, though. The *big *problem is this:

For each sample from the prior, I’m plotting the log-variance of the first component against the correlation coefficient. There’s a clear lack of independence, which is even easier to see in the conditional distribution of the correlation coefficient. Below, the histogram of the correlation coefficient conditioned on both variances being more than one (“High variance”), or not (“Low variance”):

What we see here is a strong dependence between variance and correlation: the prior says that high variance implies high correlation, and low variance implies low-to-moderate correlation. This is a disaster for inference, because it means that correlation will tend to be exagerated if variance is higher than expected, which is the opposite of the shrinkage behaviour we’d like to see.

There are better priors for covariance matrices out there, but sometimes you might be stuck with the Wishart for computational reasons (for example, it’s the only choice you have in INLA for random effects). An option is to estimate the variances first, then tweak the inverse-Wishart prior to have the right scale. Increasing the value of will provide correlation shrinkage. From a Bayesian point of view this is moderately dirty, but preferable to just sticking with the default choice (and see here for a prior choice with good frequentist properties). Kass & Natarajan (2006) is a much more sophisticated version of this strategy.

Below, some R code to reproduce the plots:

</pre> require(plyr) require(ggplot2) require(mvtnorm) #Given a 2x2 covariance matrix, compute corr. coeff ccm <- function(M) { M[2,1]/prod(sqrt(diag(M))) } #Generate n samples from the prior rprior <- function(n=1,r=3,M=diag(rep(1,2))) { Minv <- solve(M) rlply(n,chol2inv(chol(rwish(r,Minv)))) } #Wishart samples rwish <- function(r,R) { X <- rmvnorm(r,sig=R) t(X)%*%X } plot.marginal.variance <- function(df) { p <- ggplot(df,aes(var1))+geom_histogram(aes(y=..density..))+scale_x_log() p+theme_bw()+labs(x="\nVariance of the first component",y="Density\n") } plot.marginal.correlation <- function(df) { p <- ggplot(df,aes(cor))+geom_histogram(aes(y=..density..)) p+theme_bw()+labs(x="\nCorrelation coefficient",y="Density\n")+scale_x_continuous(lim=c(-1,1)) } plot.corr.var <- function(df) { p <- ggplot(df,aes(var1,cor))+geom_point()+scale_x_log() p+theme_bw()+labs(x="\nVariance of the first component",y="Correlation coeffient\n") } plot.conditional.corr <- function(df) { df <- mutate(df,high.var=factor((var1>1 & var2 > 1), labels=c("Low variance","High variance"))) p <- ggplot(df,aes(cor))+geom_histogram(aes(y=..density..))+facet_wrap(~ high.var) p+theme_bw()+labs(x="\nCorrelation coefficient",y="Density\n")+scale_x_continuous(lim=c(-1,1)) } df <- ldply(rprior(2000),function(M) data.frame(var1=M[1,1],var2=M[2,2],cor=ccm(M)))

Hi – I am somewhat new to Bayesian analysis and was wondering exactly what alternative priors you would recommend for the multivariate normal, and whether your objections also apply to the *scaled* inverse Wishart. Thanks for your thoughts!

Hi Chris. Good point. Using the scaled inverse Wishart doesn’t change anything, the standard deviations of the invidual coefficients and their covariance are still dependent. My answer would be to use a prior that models the standard deviations and the correlations separately, so that you can express things like “I don’t expect my coefficients to be too large but I expect them to be correlated”. I’ve just googled a bit and I’ve found a very nice article by Barnard, McCulloch and Meng making this exact point.

Essentially you can express a correlation matrix as Sigma = diag(s)*C*diag(s), where s is a vector of standard deviations and C is a correlation matrix. This means you can write down your prior as p(Sigma) = p(s)*p(C), and you’re free to make sensible choices for these two components. Check out the article for details.

I don’t see why this isn’t the default in most statistical software, honestly.

Thank you very much! Andrew Gelman seems to be under the impression that the scaled inverse Wishart *does* alleviate these problems, owing to it being a more general class of models. He said that the following post was relevant, although it’s a bit over my head:

http://andrewgelman.com/2006/09/modeling_the_gr/

As far as I can tell from the blog post what Adnrew Gelman is recommending is a redundant parameterisation Sigma = diag(s)*V*diag(s), where V is a general covariance matrix taken from an inverse-Wishart. This means the actual standard deviations in Sigma are going to depend both on s and on V, so that there’s still no a priori separation between standard deviations and correlations. The solution recommended by Barnard, McCulloch and Meng is much cleaner, I think.

I have a followup to your post on my blog (link below) seeing if the scaled inverse-Wishart improves on the inverse Wishart.

http://www.themattsimpson.com/2012/08/20/prior-distributions-for-covariance-matrices-the-scaled-inverse-wishart-prior/

Thanks Matt! Note to everyone who might have Googled their way here: go read Matt’s post.

( Lewandowski, Kurowicka, and Joe 2009) explores an alternative way of sampling correlation matrices (and by scaling, covariance matrices). (Sorry for the Elsevier paywalled link.)

You can use their strategy to construct more reasonable priors which shrink toward the identity matrix as you desire.

A (very) recent paper actually proposes a hierarchical prior based on the inverse-Wishart distribution

http://ba.stat.cmu.edu/journal/forthcoming/huang.pdf

“Simple Marginally Noninformative Prior Distributions for Covariance Matrices”

-Alan Huang and M. P. Wand

Bayesian Analysis

I may be missing something really simple, but what is the mean of an inverse wishart process when the dimensionality is 2 and nu is 3? From the formula you gave M/(nu-p-1), it shouldn’t exist!

That’s correct, nu should be greater than p – 1.

Hey, a small tip: rwish is already in R core: http://svn.r-project.org/R/trunk/src/library/stats/src/rWishart.c

Why an inverse-Wishart prior may be a good idea:

In financial modeling, it’s been noted that during “normal” times, asset returns have relatively low correlations, but in the (fat!) tail of a financial crisis, asset return correlations go to one. So the inverse-Wishart behavior in a hierarchical setting is perfect: black swan large deviations from normal (in *two* senses of normal 🙂 get modeled as large variances, and high correlations come along for free. (Since there’s no such thing as a financial anti-crisis, some kind of folding would be useful.)

So if you need to model a situation in which, occasionally, everything goes to shit simultaneously, the inverse-Wishart could be a great idea!

So in the context of fitting mixture models for clustering, if one uses the IW prior on the covariance of the components, does this imply that larger clusters (higher variance) will tend to be elliptical (high correlation) whereas smaller clusters (low variance) will tend to be spherical?

That could well be the case, I haven’t checked. Maybe the data are strong enough to override this particular bias.

it appears that there is a mistake on line 46 and possibly extending into line 47. The parenthesis are not balanced. The R code does NOT run as is. I wonder if you care to fix the problem. I am very interested in trying it out. Thanks

You’re right, one line got mangled somehow, thanks. I’ve edited it, should work now.