# Histogram Errors

*Target audience:* You will want to know some basic probability.

**Warning:** NOT YET PROOFREAD

Here's a moderately common problem: you make a bunch of observations of data, falling into some buckets or bins of a histogram, and maybe the observations carry different weights too (typically because you know some events are more or less common in the real world than in your sampling). You draw a histogram. Where do the error bars go?

Suppose you are taking *N* observations in total, and pick some particular bin, *k*. Let *S _{k}* be the sum of the weights of all the observations/events lying in this bin. Now here we have two sets of random variables: *N _{k}* is the number of observations lying in this bin, whilst *W _{j}* is the weight of each sample in the bin. We want to work out the variance of *S _{k}*.

*Aside:* As discussed in the paper mentioned below, it's actually debatable whether you should use this variance to obtain error bars. We'll focus here on the explicit problem of obtaining the variance of the above sum in terms of the exact distributions of the incoming data, and just assume we can approximate it in the usual ways given some data, and use this for error bars.

Here are two ways of looking at the problem:

## The Silly Way - Random Sums

The complicated way to do this involves using the so-called *random sum formulae*. It's surprisingly difficult to find things using this name online, so we'll go through the derivation of the relevant case. (You can look up Wald's Equation for more information.)

### Derivation of the random sum results

To make clear what we are doing, we assume that *X _{k}* for *k* = 1, 2, 3, ... are **independent, identically distributed** random variables with finite mean *μ* and finite variance *σ ^{2}*. Then we are given a random variable *M* with some finite mean and variance *μ _{M}* and *σ _{M}^{2}*, and want to talk about the distribution of the sum of *M* of these variables.

Consider first the mean of the random sum *X _{1}* + ... + *X _{M}*. This is pretty easy to work out using a so-called *conditional expectation* - the idea is that the expected value of an expression *f(M, X _{1}, ...)* depending on a random variable *M* can be worked out by first calculating the expected value *F(m)* of *f* given that *M* = *m*, and then working out the expectation of *F(M)*, so long as *M* is independent of the other variables. (This is quite easy to prove, and is known as the law of total expectation, amongst other things.) Using this,

$$\mathbb{E}[X_1 + \cdots + X_M] = \mathbb{E}[F(M)] \quad \text{where} \quad F(m) = \mathbb{E}[X_1 + \cdots + X_m].$$

But if we know how many terms there are in the sum, the expectation is easy to work out! We end up with just *F(m)* = *mμ*. But now we're done, because

$$\mathbb{E}[X_1 + \cdots + X_M] = \mathbb{E}[M\mu] = \mu\,\mu_M$$

... which is not a very flabbergasting result.
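As a quick Monte Carlo sanity check of this mean formula, here is a short Python sketch. The particular distributions of *X* and *M* below are illustrative choices, not anything from the problem above:

```python
import random

random.seed(0)

# Monte Carlo check of E[X_1 + ... + X_M] = mu * mu_M.
# Hypothetical choices: X ~ Uniform(0, 2), so mu = 1;
# M ~ Binomial(10, 0.5), so mu_M = 5.
trials = 200_000
total = 0.0
for _ in range(trials):
    m = sum(random.random() < 0.5 for _ in range(10))     # draw M
    total += sum(random.uniform(0, 2) for _ in range(m))  # draw the random sum

empirical = total / trials
predicted = 1.0 * 5.0  # mu * mu_M
print(empirical, predicted)
```

The empirical mean should land very close to *μ μ _{M}* = 5.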

What about the variance? Let's start by attacking the expectation of the squared sum

$$\mathbb{E}\left[(X_1 + \cdots + X_M)^2\right]$$

and then using the same trick as before, we can write this as

$$\mathbb{E}\left[(X_1 + \cdots + X_m)^2\right] = m\sigma^2 + m^2\mu^2$$

where we've used the expression for the variance of *X _{1}* to work out the first term and the independence of these random variables to simplify the second term. Then, finally, we use similar formulae for *M* to recover

$$\mathbb{E}\left[(X_1 + \cdots + X_M)^2\right] = \sigma^2\,\mu_M + \mu^2\,\mathbb{E}[M^2] = \sigma^2\,\mu_M + \mu^2\left(\sigma_M^2 + \mu_M^2\right).$$

From this, and the above expression for the expectation of the sum, we have our answer:

$$\operatorname{Var}(X_1 + \cdots + X_M) = \sigma^2\,\mu_M + \mu^2\,\sigma_M^2$$

which you probably wouldn't have guessed!

### Summary of random sum formulae

The two results we have proven are

$$\mathbb{E}[X_1 + \cdots + X_M] = \mu\,\mu_M \qquad \text{and} \qquad \operatorname{Var}(X_1 + \cdots + X_M) = \sigma^2\,\mu_M + \mu^2\,\sigma_M^2.$$

Note that we can explicitly see here how the two sources of error (the spread *σ ^{2}* of the values being summed, and the spread *σ _{M}^{2}* of the number of terms) identified above contribute to the total error. We can deduce that error in the estimate of the weight dominates when

$$\frac{\sigma^2}{\mu^2} > \frac{\sigma_M^2}{\mu_M}$$

and so on. (We will see below that the right-hand side is approximated by $1 - N_k/N$ for the histogram, where the last ratio is the probability of an event lying in the bin. Hence typically, when this ratio is fairly small, the larger source of error is determined by whether *σ* > *μ*.)
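The variance formula can be checked numerically the same way as the mean; again, the distributions below are illustrative choices only:

```python
import random

random.seed(1)

# Monte Carlo check of Var(X_1 + ... + X_M) = sigma^2 mu_M + mu^2 sigma_M^2.
# Hypothetical choices: X ~ Uniform(0, 2), so mu = 1 and sigma^2 = 1/3;
# M ~ Binomial(20, 0.25), so mu_M = 5 and sigma_M^2 = 20 * 0.25 * 0.75 = 3.75.
trials = 100_000
sums = []
for _ in range(trials):
    m = sum(random.random() < 0.25 for _ in range(20))
    sums.append(sum(random.uniform(0, 2) for _ in range(m)))

mean_s = sum(sums) / trials
empirical_var = sum((s - mean_s) ** 2 for s in sums) / trials
predicted_var = (1 / 3) * 5 + 1 ** 2 * 3.75  # sigma^2 mu_M + mu^2 sigma_M^2
print(empirical_var, predicted_var)
```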

### Application to the histogram problem

In the histogram problem, we can easily come up with some estimators of *μ*, *σ ^{2}*, *μ _{N}* and *σ _{N}^{2}* and use these to estimate the variance of the above sum. The estimators of the means are simply the obvious

$$\hat{\mu} = \frac{1}{N_k}\sum_{j} W_j \qquad \text{and} \qquad \hat{\mu}_N = N_k.$$

For the variance of the *W _{j}*, we can use the sample variance in the large *N _{k}* limit, which is

$$\hat{\sigma}^2 = \frac{1}{N_k}\sum_{j} W_j^2 - \left(\frac{1}{N_k}\sum_{j} W_j\right)^2.$$

Finally, for the variance of *N _{k}*, we note that we want essentially the variance of the fraction *N _{k}/N* of data points in this bin, which suggests using the multinomial distribution, which would indicate the variance of *N _{k}* is $N_k\left(1 - N_k/N\right)$.

Phew. Plugging this into the formula above, we get

$$\operatorname{Var}(S_k) \approx \left[\frac{1}{N_k}\sum_{j} W_j^2 - \left(\frac{1}{N_k}\sum_{j} W_j\right)^2\right] N_k + \left(\frac{1}{N_k}\sum_{j} W_j\right)^2 N_k\left(1 - \frac{N_k}{N}\right)$$

... which simplifies to...

$$\operatorname{Var}(S_k) \approx \sum_{j} W_j^2 - \frac{1}{N}\left(\sum_{j} W_j\right)^2 = \sum_{j} W_j^2 - \frac{S_k^2}{N}$$

... at which point a mathematician thinks something like "crap". Why? Because that formula is **way** too simple for that horrible mess to have been the best way to derive it! The particularly conspicuous thing about this equation is... can you see?

The thing that really strikes me, given how we got here, is that it doesn't mention *N _{k}* (except in the limits of the sum, but that somehow doesn't seem to count). So can we avoid this mess altogether?
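If you would rather not push the simplification through by hand, this Python sketch (with hypothetical values of *N*, *N _{k}* and random weights) plugs the estimators into the random-sum variance formula and compares the result with the simple expression; the agreement is exact up to rounding, for any data:

```python
import random

random.seed(2)

# Plug the estimators into Var(S) = sigma^2 mu_N + mu^2 sigma_N^2 and compare
# with the simple expression sum(W^2) - (sum W)^2 / N.
# N and N_k here are hypothetical.
N = 1000
Nk = 237
W = [random.uniform(0, 1) for _ in range(Nk)]  # weights landing in the bin

s1 = sum(W)
s2 = sum(w * w for w in W)
mu_hat = s1 / Nk                    # estimated mean weight
sigma2_hat = s2 / Nk - mu_hat ** 2  # estimated weight variance
var_random_sum = sigma2_hat * Nk + mu_hat ** 2 * Nk * (1 - Nk / N)
var_simple = s2 - s1 ** 2 / N
print(var_random_sum, var_simple)
```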

## The Sensible Way

Well, we were led down the path of the above derivation by our idea that there are two sources of error in our bin total - how many samples ended up in it, and what the typical weight of a sample in that bin is. Can we avoid this division, given that it looks like we don't need to think like this?

Actually, yes we can, and it's so much simpler to think about it this way it makes the above look like a bad joke.

The message of the formula we found is that *N*, not *N _{k}*, is what describes 'how much data' we have collected in working out the bin total. So instead of thinking about our *N _{k}* weights *W _{j}*, let's think about *N* weights, *N _{k}* of which are the original *W _{j}*, and the rest of which are... zero!

This makes perfect sense, really - *every* sample from the original population contributes to the sum of weights in this bin, it's just that lots of them contribute zero. Sort of perverse, I guess, but it allows us to write the sum of weights in the bin as

$$S_k = \sum_{j=1}^{N} W'_j$$

where lots of the *W' _{j}* happen to be zero. Why is this a good idea? Because we no longer have any mention of *N _{k}*, and more importantly **no horrible random sums**.

### Derivation of the answer

There's almost nothing to write, now - we have a sum of *N* random variables, and we want to estimate the variance. We already used the formula for that (when estimating the variance of the *W _{j}* above) - it's

$$\operatorname{Var}(S_k) \approx N \left[\frac{1}{N}\sum_{j=1}^{N} {W'_j}^2 - \left(\frac{1}{N}\sum_{j=1}^{N} W'_j\right)^2\right] = \sum_{j=1}^{N} {W'_j}^2 - \frac{S_k^2}{N}$$

... and when you remember that *W' _{j}* = 0 except for the ones which fall in this bin, you realize we just got the answer. Already. Yep.
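As a sketch, here is the zero-padded-weights viewpoint written out in Python for some hypothetical data, confirming that it gives the one-line formula directly:

```python
import random

random.seed(3)

# The 'sensible way' as code: every one of the N samples contributes a weight
# to the bin, zero if it misses. The usual estimate for the variance of a sum
# of N i.i.d. variables, N * (sample variance of W'), then equals
# sum(W'^2) - S_k^2 / N exactly.
# Hypothetical data: 30% of samples hit the bin, with Uniform(0, 1) weights.
N = 10_000
Wp = [random.uniform(0, 1) if random.random() < 0.3 else 0.0 for _ in range(N)]

S_k = sum(Wp)
var_est = sum(w * w for w in Wp) - S_k ** 2 / N
# The same thing via N * (population variance of the W'):
mean_wp = S_k / N
var_alt = N * (sum((w - mean_wp) ** 2 for w in Wp) / N)
print(S_k, var_est, var_alt)
```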

And now you know.

## Equal weights case

The above all reduces, in the case *W _{j}* = 1, to

$$\operatorname{Var}(S_k) \approx N_k - \frac{N_k^2}{N} = N_k\left(1 - \frac{N_k}{N}\right)$$

as stated in the longer derivation - that is, it reduces to the variance of the multinomial distribution.

### Rare events - Poisson distribution

In accordance with the so-called law of small numbers or the law of rare events (or more formally, the Poisson limit theorem), if the probability of an event is very small, then the Poisson distribution approximates the binomial distribution corresponding to whether that event occurs. This manifests itself if we consider *N _{k}/N* small, as then the above variance is approximately *N _{k}*, the variance of a Poisson variable with mean *N _{k}*. This leads to the common practice in high energy physics etc. of plotting error bars which are $\pm\sqrt{N_k}$.

_{k}See this blog (and the referenced paper) for discussions of the validity of using error bars.

## Try it out

The following Mathematica code generates `bigN` (*N* above, and 10000 here by default) samples from a histogram bucket - the distribution given has 30% of data points in this bucket, with weights in a uniform [0,1] distribution. It creates `metaSamples` such histograms, and then compares the actual standard deviation of the gathered distribution with the (average) estimate of the variance according to the above formula.

If you make `metaSamples` large enough - or just run the for loop repeatedly to gather more data - you should find the difference between the predicted error and observed error decaying to at least under 1%. (Of course, since *N* is finite, and so on, one does not expect exact agreement.)

```mathematica
meanDats = {}; varDats = {}; bigN = 10000; metaSamples = 2500;
For[j = 1, j <= metaSamples, j++,
 dat = RandomReal[{0, 1}, bigN];
 dat = (Select[dat, (# >= 0.7) &] - 0.7)/0.3;
 AppendTo[meanDats, Total[dat]];
 AppendTo[varDats, Total[dat^2] - Total[dat]^2/bigN];]
Print["Observed standard deviation of data: ", StandardDeviation[meanDats]]
Print["Mean predicted error: ", Mean[varDats] // Sqrt]
Print["Percentage difference: ",
 100*Abs[Sqrt[Mean[varDats]] - StandardDeviation[meanDats]]/
   StandardDeviation[meanDats], "%"]
Print["Data points: ", Length[meanDats]]
Print["Standard deviation of predicted error: ",
 StandardDeviation[varDats // Sqrt], ", which is ",
 (StandardDeviation[varDats // Sqrt]/(Mean[varDats] // Sqrt))*100,
 "% of the estimated error"]
```

After collecting 7500 (meta)samples, I obtained the following output:

```
Observed standard deviation of data: 27.9623
Mean predicted error: 27.8385
Percentage difference: 0.442649%
Data points: 7500
Standard deviation of predicted error: 0.260689, which is 0.936431% of the estimated error
```

So it certainly seems that this formula provides an excellent estimate of the error. Note that the predicted error from a single run typically deviates from the mean predicted error by less than 1%, so using the estimate from a single run also gives a very respectable error bar.
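For readers without Mathematica, here is a rough Python port of the same experiment - a sketch, not the original code, using fewer meta-samples for speed and the standard-library `random` and `statistics` modules in place of the Mathematica built-ins:

```python
import random
import statistics

random.seed(4)

# Each run draws bigN uniform samples, keeps the ~30% landing in the bucket
# with rescaled Uniform(0, 1) weights, and records both the bin total and the
# predicted variance sum(W^2) - S_k^2 / N.
bigN, metaSamples = 10_000, 300  # fewer meta-samples than the text, for speed
totals, var_ests = [], []
for _ in range(metaSamples):
    w = [(x - 0.7) / 0.3
         for x in (random.random() for _ in range(bigN)) if x >= 0.7]
    s = sum(w)
    totals.append(s)
    var_ests.append(sum(x * x for x in w) - s * s / bigN)

observed = statistics.stdev(totals)           # spread of the bin totals
predicted = statistics.mean(var_ests) ** 0.5  # mean predicted error
print(observed, predicted)
```

With only 300 meta-samples the observed spread is noisier than in the output above, but the two numbers should still agree to within a few percent.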