Attention conservation notice: Self-promotion of an academic talk, based on a three-year-old paper, on the arcana of how Bayesian methods work from a frequentist perspective.
Because it is snowing relentlessly and the occasional bout of merely freezing air is a blessed relief, I will be escaping to a balmier clime next week: Cambridgeshire.
"When Bayesians Can't Handle the Truth", statistics seminar, Cambridge University: Abstract: There are elegant results on the consistency of Bayesian updating for well-specified models facing IID or Markovian data, but both completely correct models and fully observed states are vanishingly rare. In this talk, I give conditions for posterior convergence that hold when the prior excludes the truth, which may have complex dependencies. The key dynamical assumption is the convergence of time-averaged log likelihoods (Shannon-McMillan-Breiman property). The main statistical assumption is a building into the prior a form of capacity control related to the method of sieves. With these, I derive posterior and predictive convergence, and a large deviations principle for the posterior, even in infinite-dimensional hypothesis spaces; and clarify role of the prior and of model averaging as regularization devices. (Paper)
Place and time: 1 February 2013, 4-5 pm in MR 12, CMS
And my views on all this hoisted from archives:
(1) Brad DeLong : Cosma Shalizi Waterboards the Rev. Dr. Thomas Bayes: Bayesian inference gone horribly wrong. Cosma Shalizi:
Some Bayesian Finger-Puzzle Exercises, or: Often Wrong, Never In Doubt: The theme here is to construct some simple yet pointed examples where Bayesian inference goes wrong, though the data-generating processes are well-behaved, and the priors look harmless enough. In reality, however, there is no such thing as a prior without bias, and in these examples the bias is so strong that Bayesian learning reaches absurd conclusions....
Example 1:... The posterior estimate of the mean [of the generating process] thus wanders from being close to +1 to being close to -1 and back erratically, hardly ever spending time near zero, even though (from the law of large numbers) the sample mean [of the sufficient statistic] converges to [the true mean of the generating process of] zero....
Example 2:... As we get more and more data, the sample mean converges almost surely to zero (by the law of large numbers), which here drives the mean and variance of the posterior to zero almost surely as well. In other words, the Bayesian becomes dogmatically certain that the data are distributed according to a standard Gaussian with mean 0 and variance 1. This is so even though the sample variance almost surely converges to the true variance, which is 2. This Bayesian, then, is certain that the data are really not that variable, and any time now will start settling down....
It is a violation of the Geneva Convention to force a Bayesian statistician to begin analysis with a prior that places a weight of zero on the true underlying generating process, isn't it?
(2) Brad DeLong : Cosma Shalizi Takes Me to Probability School. Or Is It Philosophy School?: After I accuse Cosma Shalizi of waterboarding the Rev. Dr. Thomas Bayes, he responds:
Cosma Shalizi: Cosma Shalizi Waterboards the Rev. Dr. Thomas Bayes: Hoisted from Comments: I am relieved to learn that the true model of the world is always already known to every competent statistical inquirer, since otherwise it could not be given positive prior weight. I would ask, however, when our model set became complete? And further, when did people stop using models which they knew were at best convenient but tractable approximations?
Less snarkily, these two examples are out-takes from what I like to think is a fairly serious paper on Bayesian non-parametrics with mis-specified models and dependent data:
described less technically here:
The examples were simple sanity-checks on my theorems, and I posted them because they amused me.
Thus Cosma Shalizi takes me to probability school. Or perhaps he takes me to philosophy school.
It is not clear…
Let me give an example simpler than one of the ones Cosma Shalizi gave. Rosencrantz is flipping a coin. Guildenstern is watching and is calling out "heads" or "tails." It is a fair coin--half the time it comes up heads, and half the time it comes up tails. Before Rosencrantz starts flipping, Guildenstern's beliefs about what the next flip of the coin will bring are accurate: he thinks that there is a 50% chance that the next flip of the coin will be heads and a 50% chance that the next flip of the coin will be tails.
Because Guildenstern starts with correct beliefs about what the odds are for the next flip of the coin, you might think that there is nothing for Guildenstern to "learn"--that as Rosencrantz flips, Guildenstern will retain his initial belief that the odds are 50-50 that the next flip of the coin will be heads or tails.
But there is a problem: Guildenstern is not a human being but rather is a Bayesian AI, and Guildenstern is certain that the coin is biased: its prior is such that it thinks that there is a 50% chance it is dealing with a coin that lands heads 3/4 of the time, and a 50% chance it is dealing with a coin that lands tails 3/4 of the time, and its initial prediction that the next flip is equally likely to be heads or tails depends on that initial 50-50 split.
What happens as Rosencrantz starts flipping? The likelihood ratio for an H-biased as opposed to a T-biased coin is 3^z, where z=h-t and h is the number of heads and t is the number of tails flipped, which means that the posterior probabilities assigned by Guildenstern after h heads and t tails are:
P(H | z) = 3^z/(3^z + 1)
P(T | z) = 1/(3^z + 1)
And the estimate that the next flip will be heads is:
(3/4)P(H | z) + (1/4)P(T | z) = (3^(z+1) + 1)/(4(3^z + 1))
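For readers who want the intermediate step, here is a short derivation of the formulas above, written in LaTeX. It assumes only what the example states: a 50-50 prior over a coin that lands heads 3/4 of the time and one that lands tails 3/4 of the time, with h heads and t tails observed and z = h - t.

```latex
\begin{align*}
\frac{P(\text{data} \mid H)}{P(\text{data} \mid T)}
  &= \frac{(3/4)^{h}\,(1/4)^{t}}{(1/4)^{h}\,(3/4)^{t}}
   = \frac{3^{h}}{3^{t}} = 3^{z}, \\
P(H \mid z)
  &= \frac{\tfrac12 \cdot 3^{z}}{\tfrac12 \cdot 3^{z} + \tfrac12 \cdot 1}
   = \frac{3^{z}}{3^{z} + 1}, \qquad
P(T \mid z) = \frac{1}{3^{z} + 1}, \\
\tfrac34\,P(H \mid z) + \tfrac14\,P(T \mid z)
  &= \frac{3 \cdot 3^{z} + 1}{4\,(3^{z} + 1)}
   = \frac{3^{z+1} + 1}{4\,(3^{z} + 1)}.
\end{align*}
```

The posterior depends on the data only through z, not through the total number of flips n.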
If the numbers of heads and tails are equal, then Guildenstern (correctly) forecasts that the odds on the next flip are 50-50. If the n flips Rosencrantz has performed have seen two more heads than tails--no matter how big n is--then Guildenstern is 90% certain that it is dealing with an H-biased coin and thinks that the chance the next flip will be heads is 70%. If the n flips Rosencrantz has performed have seen ten more heads than tails--again, no matter how many flips n there have been--Guildenstern is 99.9983% sure that it is dealing with an H-biased coin and will forecast the odds of a head on the next flip at 74.9992%.
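A minimal sketch for evaluating these posterior and forecast formulas at particular values of z, using exact rational arithmetic. The function names `posterior_h` and `forecast_heads` are hypothetical, chosen for this illustration; the formulas are the ones given in the example.

```python
from fractions import Fraction


def posterior_h(z: int) -> Fraction:
    """P(H-biased | z) under a 50-50 prior, where z = heads - tails."""
    # Likelihood ratio of the H-biased coin against the T-biased coin is 3^z.
    ratio = Fraction(3) ** z
    return ratio / (ratio + 1)


def forecast_heads(z: int) -> Fraction:
    """Predictive probability that the next flip is heads, given z."""
    p_h = posterior_h(z)
    return Fraction(3, 4) * p_h + Fraction(1, 4) * (1 - p_h)


if __name__ == "__main__":
    for z in (0, 2, 10, -2):
        print(f"z={z:+d}  P(H|z)={float(posterior_h(z)):.6f}  "
              f"P(next flip heads)={float(forecast_heads(z)):.6f}")
```

Exact fractions are used so that values such as 9/10 and 7/10 at z = 2 come out without floating-point rounding; they are converted to floats only for printing.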
How will Guildenstern's beliefs behave over time? Well, this passage from Shalizi's more complex example applies:
Three-Toed Sloth: The sufficient statistic z [for P(H)]... follows an unbiased random walk, meaning that as n grows it tends to get further and further away from [zero], with a typical size growing roughly like n^(1/2). It does keep returning to the origin, at intervals dictated by the arc sine law, but it spends more and more of its time very far away from it. The posterior estimate of [the probability of] an H-biased coin thus wanders from being close to +1 to being close to [0] and back erratically, hardly ever spending time near zero, even though (from the law of large numbers) the sample mean [fraction of heads] converges to zero...
So Guildenstern spends all of its time being nearly dead certain that it is dealing with an H-biased coin or a T-biased coin--switching its belief occasionally--even though there is almost surely never any statistically significant evidence for H-bias against the null hypothesis that the coin is fair and 50-50. There is an allowable set of beliefs for Guildenstern that will lead it to make the right 50-50 forecast of the odds on the next flip: if Guildenstern simply continues to believe that there is no evidence either way for H-bias or T-bias. But Guildenstern's beliefs are not those beliefs and do not converge to those beliefs: look far enough out into the future and you see that Guildenstern is almost sure either that the coin is H-biased or that the coin is T-biased, and has virtually no chance of being unsure about the direction in which the bias lies.
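This behavior can be checked with a small simulation. The sketch below flips a fair coin, tracks z, and reports the fraction of time that Guildenstern's posterior for H-bias lies between 1/4 and 3/4, which happens exactly when |z| <= 1. The function name `undecided_fraction`, the cutoff |z| <= 1, and the seed are all assumptions made for this illustration, not anything from the original posts.

```python
import random


def undecided_fraction(n_flips: int, seed: int = 0) -> float:
    """Fraction of the first n_flips after which |z| <= 1.

    With |z| <= 1 the posterior 3^z / (3^z + 1) lies in [1/4, 3/4],
    i.e. Guildenstern is not yet strongly committed to either bias.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    z = 0                      # heads minus tails so far
    undecided = 0
    for _ in range(n_flips):
        z += 1 if rng.random() < 0.5 else -1  # fair coin: the true process
        if abs(z) <= 1:
            undecided += 1
    return undecided / n_flips


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, undecided_fraction(n))
```

Since the time a simple random walk spends near the origin grows only on the order of n^(1/2), the printed fraction should typically shrink as n grows, though any single run is noisy.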
Thus Guildenstern's processing of the data is not sensible, is not smart, is not rational, is not human--but it is Bayesian. For positive values of z, Guildenstern thinks:
there are fewer heads than I would expect for an H-biased coin, but this could never come about with a T-biased coin; the coin must be H-biased: I am sure of it.
A sensible agent, a smart agent, a rational agent, a human agent would think:
Hmmm. Right now I am sure that the coin is T-biased, but 100 flips ago I was sure that the coin was H-biased. I know that as I get more evidence my beliefs should be converging to the truth, but they don't seem to be converging at all. Something is wrong.
But there is nothing in the Bayesian agent's little brain to allow it to reason from the failure of its beliefs to converge to the conclusion that there is something badly wrong here.
It does seem, intuitively, that Bayesian Guildenstern should be able to make good forecasts. The prior that Guildenstern started with does admit of beliefs that would lead to accurate forecasts of the next coin flip: all Guildenstern has to do is to doubt that it has enough information to decide about the bias of the coin. Indeed, Guildenstern's initial beliefs generate the right forecast of probabilities for the next flip. So, given that Guildenstern starts out with a set of beliefs that supports and generates the "right" forecast probabilities, given that there really isn't enough information to decide about the bias of the coin--there can't be, for the coin is not biased--and given that Bayesian learning is a kind of learning, why doesn't Guildenstern simply keep its original beliefs and keep making good forecasts?
Shalizi has identified a case in which it seems at first glance that a Bayesian agent sees enough information and has a flexible-enough and comprehensive-enough prior that it should be able to learn enough to make good predictions, but in which it cannot in fact do so.
Shalizi has gotten his Bayesian AI Guildenstern to confess. But has he done so by legitimate means? Or by waterboarding? The probability theory question, or perhaps the philosophy question, is: Is this a problem for the Bayesian way of looking at the world? Or only a demonstration that torture elicits confessions?