How sure are you?
Oddly, even though it is one of the most fundamental questions we can ask, we are often not very good at accurately judging, or reporting, how certain we are of our own knowledge of the world.
Spiegelhalter talked variously about the vagaries of the different ways of interpreting probabilities (particularly the frequentist, Bayesian and 'metaphysical' interpretations, of which more below) and how they relate to our understanding of information in crucial areas like healthcare. He then went on to illustrate (with an excellent practical example in which he involved the whole group) exactly how our understanding of certainty can be quantified (or at least put in a useful frame) and used for the training and adjustment of accurate forecasts.
The 'metaphysical' interpretation involves conceiving of possible futures, in an attempt to personalize risk. The article linked to above references this interactive demonstration which you might like to play around with, and maybe even use as the basis for a little experiment of your own. His argument is that the best way of getting people to understand the risks involved in individual lifestyle choices (like the purchase and use of statins) is to phrase it in terms of what could happen to them over 'multiple lives'.
If I am told there is a 10% chance I will suffer a heart attack or stroke in the next 10 years, lowered to 8% by taking statins, then my understanding of this changes according to whether I think of it as, say, a 20% cut in the chance I suffer such an event; or as one avoided heart attack for every 50 people taking statins.
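The equivalence of these framings can be checked with a few lines of arithmetic; the variable names below are mine, and only the 10% and 8% figures come from the example above.

```python
# The same statin example expressed three ways.
baseline_risk = 0.10  # 10-year risk of heart attack or stroke, untreated
treated_risk = 0.08   # risk while taking statins

absolute_reduction = baseline_risk - treated_risk        # 2 percentage points
relative_reduction = absolute_reduction / baseline_risk  # the '20% cut'
number_needed_to_treat = 1 / absolute_reduction          # 50 people per event avoided

print(f"Absolute risk reduction: {absolute_reduction:.0%}")
print(f"Relative risk reduction: {relative_reduction:.0%}")
print(f"Number needed to treat: {number_needed_to_treat:.0f}")
```

The same 2-in-100 absolute difference sounds very different as a "20% cut", which is the juxtaposition effect discussed next.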
As ever, the mode of presentation of statistical information is of prime importance: juxtaposing statistics lets an intuitive sense of relative magnitudes dominate over any attempt at a full understanding of, say, the worth of a treatment.
He contends that thinking in terms of 'future mes' is the most natural and valuable way to present what are, by their very nature, subjective statistics and risks: in 2 out of every 100 lives I lead, I would avoid a heart attack because I took statins.
The main body of the talk, however, was devoted to understanding the concept of a scoring rule. The idea is that, given some questions, and a subject who is answering them and providing an estimate of their own confidence in the result, you want to know how to reward (or penalise) the subject in order to get the most reliable information out of them about their own confidence.
For example, consider a meteorologist forecasting the weather: if they predict rain with some stated percentage of confidence, how do you give them an incentive to be truthful, and how do you assess their accuracy?
The basic idea, stripped down to the simplest case of a binary event (e.g. will it rain tomorrow?), is to allocate some score S to each probabilistic forecast (or set of forecasts); in fact, we always assume we have several forecasts. The overall score should be highest when the forecaster most reliably declares their uncertainty accurately, and lowest when the forecaster routinely makes incorrect forecasts with high confidence.
Specifically, we call a scoring rule a proper scoring rule if the expected score is maximised (or more generally optimised) if and only if the actual probability of the event is exactly equal to the forecast probability.
For example, take a test with the quadratic score S(q, x) = 1 - (x - q)^2 (the Brier score, after the man who introduced it around 1950), where x is 0 or 1 according to whether or not the event occurs and q is the estimated probability that it does. Then if the forecaster believes that there is a probability p that it does occur, the expected score (from his point of view) is p(1 - (1 - q)^2) + (1 - p)(1 - q^2), which is equal to 1 - p(1 - p) - (q - p)^2. So clearly, the forecaster maximizes his expected payoff by setting his projected likelihood q to exactly what he truly believes it to be, p.
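A short numerical sketch confirms this. Assuming the quadratic score S(q, x) = 1 - (x - q)^2, and an arbitrary example belief p = 0.7 (names and values below are mine), a grid search finds the expected score maximised exactly at q = p:

```python
# Forecaster's expected quadratic (Brier) score when reporting q
# while privately believing the event has probability p.
def expected_score(q, p):
    return p * (1 - (1 - q) ** 2) + (1 - p) * (1 - q ** 2)

p = 0.7  # example true belief
qs = [i / 100 for i in range(101)]
best_q = max(qs, key=lambda q: expected_score(q, p))
print(best_q)  # 0.7 -- honesty is the best policy under a proper rule
```

The grid is coarse, but since the expected score is 1 - p(1 - p) - (q - p)^2, the maximum genuinely sits at q = p.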
Not all scoring rules are proper; indeed, not even all likely-seeming candidates are. For example, you can check that the linear score S(q, x) = 1 - |x - q| encourages forecasters to exaggerate (i.e. to always pick q to be either 0 or 1).
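This exaggeration is easy to see numerically. Taking the linear score S(q, x) = 1 - |x - q| as the improper rule (names and the example belief p = 0.7 below are mine), the expected score is linear in q, so it is maximised at an extreme regardless of the forecaster's true belief:

```python
# Expected linear score: linear in q, so the optimum is always at an endpoint.
def expected_linear(q, p):
    return p * (1 - (1 - q)) + (1 - p) * (1 - q)

p = 0.7  # true belief
qs = [i / 100 for i in range(101)]
best_q = max(qs, key=lambda q: expected_linear(q, p))
print(best_q)  # 1.0 -- an exaggerated report, despite believing only 0.7
```

Algebraically the expected score is (1 - p) + (2p - 1)q, so any p above 1/2 pushes the optimal report all the way to q = 1.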
You can easily conduct an experiment using a scoring rule with any group of people - give them a multiple-choice test (two options each time for these simple rules) and get them to write down a 'confidence' probability for their answer - say 5/10 for a total guess and 10/10 if they are entirely certain. Then give them the corresponding score according to whether their answer was correct and their stated confidence, and you are left with a set of overall scores indicating who was the most honest and reliable answerer. This works very well (especially with punishing negative scores for high-confidence mistakes) as both a party game and statistical exercise.
A sample set of scores derived from the quadratic rule (listed as confidence, score if correct, score if incorrect) is:
- 5: +0, -0
- 6: +9, -11
- 7: +16, -24
- 8: +21, -39
- 9: +24, -56
- 10: +25, -75
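The table above appears to be the quadratic score rescaled as 100·S - 75, so that a pure 5/10 guess scores exactly zero either way; a quick check (the function name is mine):

```python
# Regenerate the party-game score table from the scaled quadratic rule.
def table_score(confidence_tenths, correct):
    q = confidence_tenths / 10       # stated confidence as a probability
    x = 1 if correct else 0          # outcome indicator
    return round(100 * (1 - (x - q) ** 2) - 75)

for c in range(5, 11):
    print(c, table_score(c, True), table_score(c, False))
```

Running this reproduces all six rows, including the steep -75 penalty for a confidently wrong answer.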
A few different scoring rules are illustrated here.
In the long run, i.e. over several forecasts and experiments, we can 'decompose' (see this article also) the average Brier score above into three components: Uncertainty, Reliability and Resolution. (Note that the linked article uses a Brier score which is lower for better forecasts; this corresponds to dropping the leading 1 in the definition above and negating the remaining expression.)
- Uncertainty provides a measure of how naturally diverse the events are (maximised when the event occurs 1/2 of the time for a binary event like we are considering). The Brier score is higher (worse) when the uncertainty is naturally larger, reflecting the fact that it is more difficult to give accurate predictions in this case.
- Reliability shows how far the forecast probabilities were, on average, from the observed frequencies; a figure of 0 indicates perfect calibration, whilst larger values show more deviation.
- Resolution shows how much the forecasts differ from the overall base rate. Note that one can achieve perfect reliability simply by constantly predicting the climatic average; resolution is higher when the forecasts are 'more definite' than this, and it is subtracted from the score, so higher resolution improves (lowers) it. Resolution precisely equals (and hence cancels out) the uncertainty term when the predictions are always definite (0 or 1) and always correct.
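The decomposition can be sketched in a few lines. The following is a minimal implementation of this standard decomposition of the lower-is-better Brier score (the forecast data are made up for illustration); it verifies the identity Brier = Reliability - Resolution + Uncertainty:

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Split the mean Brier score into (reliability, resolution, uncertainty)."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    # Group occasions by the forecast probability that was issued.
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)
    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2
                      for f, os in bins.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2
                     for os in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Made-up example: eight forecasts at two confidence levels.
forecasts = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
outcomes  = [1,   1,   1,   0,   0,   0,   0,   1]

rel, res, unc = brier_decomposition(forecasts, outcomes)
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(round(rel - res + unc, 6) == round(brier, 6))  # the identity holds
```

With a 50% base rate the uncertainty term is at its maximum of 0.25, and the two middle terms show how much of the score is calibration error versus genuine discrimination.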
Spiegelhalter also noted that one can actually justify the construction of a semi-frequentist, semi-Bayesian theory of probability - and in fact derive the standard axioms of probability as theorems in this alternative, 'deeper' formulation of the subject.
It's possibly worth considering the slightly more fickle aspects of this method, which encroach on the field of game theory as one considers less local tactics and perhaps even the psychological aspects of the situation. Most notably, maximising expected score may not lead to optimal behaviour, and once human error and our tendency towards biased confidence are included, the situation becomes much less clear. Scoring rules are fairly heavily used (as the terminology and examples suggest) in meteorology and similar areas, but the field seems to be studied rather less than one might hope. Still, something to do over the holidays...
Spiegelhalter is the Winton
Professor of the Public Understanding of Risk here at the
University of Cambridge. He works largely with public-sector outreach
and educational projects and the media, but also teaches an advanced
course in Applied Bayesian Statistics, and retains a research interest
in several practical areas of statistics, especially healthcare. He also
runs the website Understanding
Uncertainty, a resource for easy-to-read statistical treatments of various topics. He was
talking at the Trinity Mathematical Society
(TMS) on Monday 15th of February 2010.