Target audience: You will want to know some basic probability.

Here's a moderately common problem: you make a bunch of observations, each falling into one of the buckets or bins of a histogram, and perhaps each carrying a weight too (typically because you know some kinds of observation are more or less common in the real world than they are in your sampling). You draw the histogram. Where do the error bars go?

Suppose you are taking N observations in total, and pick some particular bin, k. Let S_k be the sum of the weights of all the observations/events lying in this bin,

S_{k}=\sum_{j=1}^{N_{k}}W_{j}

Here we have two sets of random variables - N_k is the number of observations lying in the bin, whilst W_j is the weight of each sample in the bin. We want to work out the variance of S_k.
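
To make the setup concrete, here is a minimal Mathematica sketch (the numbers, the uniform weights and the variable names are all made up purely for illustration; n plays the role of N above, since N itself is a reserved symbol in Mathematica). Events with values on [0,1] are split into ten equal bins, and S_k is just the total weight landing in bin k.

n = 1000;                                (* total number of events *)
values = RandomReal[{0, 1}, n];          (* where each event lands *)
weights = RandomReal[{0.5, 1.5}, n];     (* hypothetical event weights *)
bins = Ceiling[10 values];               (* bin index 1..10 for each event *)
sk = Table[Total[Pick[weights, bins, k]], {k, 10}]   (* S_k for each of the ten bins *)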

Aside: As discussed in the paper mentioned below, it's actually debatable whether you should use this variance to obtain error bars. We'll focus here on the explicit problem of obtaining the variance of the above sum in terms of the exact distributions of the incoming data, and just assume we can approximate it in the usual ways given some data, and use this for error bars.

Here are two ways of looking at the problem:


The Silly Way - Random Sums

The complicated way to do this involves using the so-called random sum formulae. It's surprisingly difficult to find things using this name online, so we'll go through the derivation of the relevant case. (You can look up Wald's Equation for more information.)

Derivation of the random sum results

To make clear what we are doing, we assume that X_i for i = 1, 2, 3, ... are independent, identically distributed random variables with finite mean \mu and finite variance \sigma^2. We are then given a random variable M, independent of the X_i, with some finite mean and variance \mu_M and \sigma_M^2, and want to talk about the distribution of the sum of the first M of these variables.

Consider first the mean of the random sum X_1 + ... + X_M. This is pretty easy to work out using a so-called conditional expectation - the idea is that the expected value of an expression f(M, X_1, ...) depending on a random variable M can be worked out by first calculating the expected value F(m) of f given that M = m, and then working out the expectation of F(M), so long as M is independent of the other variables. (This is quite easy to prove, and is known as the law of total expectation, amongst other things.) Using this,

\mathbb{E}\left[X_1+X_2+\cdots+X_M\right]=\mathbb{E}\left[\mathbb{E}\left[X_1+X_2+\cdots+X_n | M=n\right]\right]

But if we know how many terms there are in the sum, the expectation is easy to work out! Given M = n the inner expectation is just n\mu, so we end up with just \mathbb{E}\left[M\mu\right]. But now we're done, because

\mathbb{E}\left[M\mu\right] = \mathbb{E}\left[M\right]\mu = \mu_M\mu

... which is not a very flabbergasting result.

What about the variance? Let's start by attacking the expectation of the squared sum

\mathbb{E}\left[\left(X_{1}+X_{2}+\cdots+X_{M}\right)^{2}\right]=\mathbb{E}\left[\sum_{j=1}^{M}X_{j}^{2}+2\sum_{1\le i<j\le M}X_{i}X_{j}\right]

and then using the same trick as before, we can write this as

\mathbb{E}\left[M\mathbb{E}\left[X_{1}^{2}\right]+2\times\frac{M\left(M-1\right)}{2}\mathbb{E}\left[X_{1}X_{2}\right]\right]=\mathbb{E}\left[M\left(\sigma^{2}+\mu^{2}\right)+M\left(M-1\right)\mu^{2}\right]

where we've used the expression for the variance of X_1 to work out the first term and the independence of these random variables to simplify the second term. Then, finally, we use similar formulae for M to recover

\mu_{M}\sigma^{2}+\left(\sigma_{M}^{2}+\mu_{M}^{2}\right)\mu^{2}

From this, and the earlier result \mathbb{E}\left[\sum\right]=\mu_{M}\mu for the expectation of the sum, subtracting the squared mean \mu_{M}^{2}\mu^{2} gives our answer:

\mathrm{var}\left[\sum\right]=\mu_{M}\sigma^{2}+\sigma_{M}^{2}\mu^{2}

which you probably wouldn't have guessed!

Summary of random sum formulae

The two results we have proven are

\boxed{\mathbb{E}\left[\sum_{j=1}^{M}X_{j}\right]=\mu_{M}\mu}

\boxed{\mathrm{var}\left[\sum_{j=1}^{M}X_{j}\right]=\mu_{M}\sigma^{2}+\sigma_{M}^{2}\mu^{2}}
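
If you want to convince yourself numerically, here is a quick Monte-Carlo check of the two boxed formulae with a made-up choice of distributions - M Poisson with mean 20 and X uniform on [0,3], so that \mu_M=\sigma_M^2=20, \mu=3/2 and \sigma^2=3/4:

(* Monte-Carlo check of the random sum formulae for M ~ Poisson(20), X ~ Uniform(0,3) *)
sums = Table[
   Total[RandomReal[{0, 3}, RandomVariate[PoissonDistribution[20]]]],
   {100000}];
{Mean[sums], 20*(3/2)}                    (* both should be close to 30 *)
{Variance[sums], 20*(3/4) + 20*(3/2)^2}   (* both should be close to 60 *)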

Note that we can explicitly see here how the two sources of error (\sigma^2,\sigma_M^2) identified above contribute to the total error. We can deduce that the spread in the weights dominates when

\frac{\sigma}{\mu} \gg \frac{\sigma_M}{\sqrt{\mu_M}}

and so on. (We will see below that, for the histogram, the right-hand side is approximated by \sqrt{1-N_k/N}, where N_k/N is the probability of an event lying in the bin. Hence typically, when this fraction is fairly small, the larger source of error is essentially determined by whether \sigma\gtrless\mu.)
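
For a concrete illustration (with made-up numbers), a bin containing 10% of all the data has

\frac{\sigma_M}{\sqrt{\mu_M}}\approx\sqrt{1-\frac{N_k}{N}}=\sqrt{0.9}\approx0.95

so the spread in the weights dominates the error bar roughly whenever \sigma exceeds \mu.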

Application to the histogram problem

In the histogram problem, we can easily come up with estimators of \mu, \sigma^2, \mu_M and \sigma_M^2 (the random count M here being N_k) and use these to estimate the variance of the above sum. The estimators of the two means are simply the obvious \mu\approx\sum W_j/N_k and \mu_M\approx N_k. For the variance of the W_i, we can use the sample variance in the large-N_k limit, which is \left(\sum W_{i}^{2}-N_{k}\left(\sum W_{i}/N_{k}\right)^{2}\right)/N_{k}. Finally, for the variance of N_k, note that N_k is essentially the number of successes in N trials with success probability estimated by N_k/N, so the multinomial (binomial) distribution gives a variance of roughly N_{k}\left(1-N_{k}/N\right).
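
In code, reusing the hypothetical sketch from the top of the post (and taking bin k = 3, say), these plug-in estimators read:

w = Pick[weights, bins, 3];            (* weights of the events landing in the bin *)
nk = Length[w];
mu = Total[w]/nk;                      (* estimated mean weight *)
sigma2 = (Total[w^2] - nk*mu^2)/nk;    (* large-nk sample variance of the weights *)
muM = nk;                              (* estimated mean of the bin count *)
sigmaM2 = nk*(1 - nk/n);               (* multinomial estimate of the count variance *)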

Phew. Plugging this into the boxed formula above, we get

N_{k}\times\frac{\sum W_{i}^{2}-N_{k}\left(\sum W_{i}/N_{k}\right)^{2}}{N_{k}}+N_{k}\left(1-\frac{N_{k}}{N}\right)\times\frac{\left(\sum W_{i}\right)^{2}}{N_{k}^{2}}

... which simplifies to...

\boxed{\mathrm{var}\left[\sum_{j=1}^{N_{k}}W_{j}\right]\approx\sum W_{i}^{2}-\frac{1}{N}\left(\sum W_{i}\right)^{2}}

... at which point a mathematician thinks something like "crap". Why? Because that formula is way too simple for that horrible mess to have been the best way to derive it! The particularly conspicuous thing about this equation is... can you see?

The thing that really strikes me, given how we got here, is that it doesn't mention N_k (except in the limits of the sum, but that somehow doesn't seem to count). So can we avoid this mess altogether?
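
Incidentally, if you would rather not do that last simplification by hand, a one-line symbolic check - with s1, s2, nk and n standing for \sum W_i, \sum W_i^2, N_k and N - confirms it:

Clear[s1, s2, nk, n];   (* treat these as purely symbolic here *)
expr = nk*(s2 - nk*(s1/nk)^2)/nk + nk*(1 - nk/n)*(s1^2/nk^2);
Simplify[expr - (s2 - s1^2/n)]   (* returns 0 *)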

The Sensible Way

Well, we were led down the path of the above derivation by our idea that there are two sources of error in our bin total - how many samples ended up in it, and what the typical weight of a sample in that bin is. Can we avoid this division, given that it looks like we don't need to think like this?

Actually, yes we can, and it's so much simpler to think about it this way it makes the above look like a bad joke.

The message of the formula we found is that N, not N_k, is what describes 'how much data' we have collected in working out the bin total. So instead of thinking about our N_k weights W_j, let's think about N weights, N_k of which are the original W_j, and the rest of which are... zero!

This makes perfect sense, really - every sample from the original population contributes to the sum of weights in this bin, it's just that lots of them contribute zero. Sort of perverse, I guess, but it allows us to write the sum of weights in the bin as

S_{k}=\sum_{j=1}^{N}W_{j}^{\prime}

where lots of the W'_j happen to be zero. Why is this a good idea? Because we no longer have any mention of N_k, and more importantly no horrible random sums.

Derivation of the answer

There's almost nothing to write now - we have a sum of N independent random variables, and we want to estimate its variance, which is just N times the variance of a single term. We already used the relevant sample estimate above (for the variance of the W_j) - it gives

\sum W_{i}^{\prime 2}-\frac{1}{N}\left(\sum W'_{i}\right)^{2}

... and when you remember that W'_j = 0 except for the ones which fall in this bin, you realize we just got the answer. Already. Yep.

\boxed{\mathrm{var}\left[S_k\right]\approx\sum W_{i}^{2}-\frac{1}{N}\left(\sum W_{i}\right)^{2}}

And now you know.
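
In code the trick is almost nothing, too. Reusing the hypothetical n, weights and bins from the sketch at the top, for (say) bin k = 3:

inBin = Boole[# == 3] & /@ bins;                 (* 1 for events in bin 3, 0 otherwise *)
wPadded = weights*inBin;                         (* N weights, zero outside the bin *)
sumW = Total[wPadded];                           (* the bin total S_3 *)
varW = Total[wPadded^2] - Total[wPadded]^2/n;    (* its estimated variance *)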

Equal weights case

The above all reduces, in the case W_j = 1, to

\boxed{\mathrm{var}\left[N_k\right]\approx N_{k}-\frac{N_{k}^{2}}{N}=N_{k}\left(1-\frac{N_{k}}{N}\right)}

as used in the longer derivation above - that is, it reduces to the multinomial (binomial) variance of the bin count.

Rare events - Poisson distribution

In accordance with the so-called law of small numbers or the law of rare events (or, more formally, the Poisson limit theorem), if the probability of an event is very small, then the Poisson distribution approximates the binomial distribution governing whether that event occurs. This manifests itself here when N_k/N is small, as then the above variance is approximately N_k. This leads to the common practice in high energy physics etc. of plotting error bars of N_k \pm \sqrt{N_k}.
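
As a quick sanity check with made-up numbers: for 10^4 events and a bin probability of 1%, the multinomial variance is already within a percent of the Poisson value N_k itself:

nTot = 10^4; p = 0.01;       (* made-up total event count and bin probability *)
{nTot*p*(1 - p), nTot*p}     (* multinomial vs Poisson variance: {99., 100.} *)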

See this blog (and the referenced paper) for discussions of the validity of using \pm \sqrt{N_k} error bars.

Try it out

The following Mathematica code generates bigN (N above, 10000 here by default) samples from a single histogram bucket - the distribution used puts 30% of data points in this bucket, with weights uniformly distributed on [0,1].

It creates metaSamples such histograms, and then compares the actual standard deviation of the bin totals with the (average) error estimate from the above formula.

If you make metaSamples large enough - or just run the For loop repeatedly to gather more data - you should find the difference between the predicted and observed errors dropping to under 1%. (Of course, since N is finite, and so on, one does not expect exact agreement.)

meanDats = {};
varDats = {};

bigN = 10000;
metaSamples = 2500;
For[j = 1, j <= metaSamples, j++,
 (* draw bigN events uniformly on [0,1]; the ~30% with value >= 0.7 land in our bin,
    and are rescaled to give weights uniform on [0,1] *)
 dat = RandomReal[{0, 1}, bigN];
 dat = (Select[dat, (# >= 0.7) &] - 0.7)/0.3;
 AppendTo[meanDats, Total[dat]];                          (* observed bin total *)
 AppendTo[varDats, Total[dat^2] - Total[dat]^2/bigN];]    (* predicted variance of the total *)

Print["Observed standard deviation of data: ", 
 StandardDeviation[meanDats]]
Print["Mean predicted error: ", Mean[varDats] // Sqrt]
Print["Percentage difference: ", 
 100*Abs[Sqrt[Mean[varDats]] - StandardDeviation[meanDats]]/
   StandardDeviation[meanDats], "%"]
Print["Data points: ", Length[meanDats]]
Print["Standard deviation of predicted error: ", 
 StandardDeviation[
  varDats // Sqrt], ", which is ", (StandardDeviation[
     varDats // Sqrt]/(Mean[varDats] // Sqrt))*100, "% of the estimated error"]

After collecting 7500 (meta)samples, I obtained the following output:

Observed standard deviation of data: 27.9623
Mean predicted error: 27.8385
Percentage difference: 0.442649%
Data points: 7500
Standard deviation of predicted error: 0.260689, which is 0.936431% of the estimated error

So it certainly seems that this formula provides an excellent estimate of the error. Note also that the predicted error from a single run typically deviates from the mean predicted error by less than 1%, so an error estimate based on a single run is also very respectable.
