Why Benford's Law works
From AJS.COM
Benford's Law is an interesting math puzzle as well as a mathematical law. It is easy to state, but most people find it hard to understand. Here, I'll attempt to explain what it is and why it works.
Definition: Benford's Law states that any sufficiently complex system which produces numbers (e.g. one arising from a statistically significant number of independent systems) will show a skewed distribution of leading digits: 1 is the most common, 2 the next most common, and so on down to 9.
Here's the exact distribution of first digits according to Wikipedia:
- 1: 30.1%
- 2: 17.6%
- 3: 12.5%
- 4: 9.7%
- 5: 7.9%
- 6: 6.7%
- 7: 5.8%
- 8: 5.1%
- 9: 4.6%
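Those percentages come from a short formula. The article doesn't spell it out, so take this as a side note: the usual statement of Benford's Law gives the share of leading digit d as log10(1 + 1/d). Here's a minimal sketch that reproduces the numbers above; the function name `benford_share` is just one I picked for illustration.

```python
import math


def benford_share(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


# Print each leading digit next to its expected share, rounded to 0.1%.
for digit in range(1, 10):
    print(f"{digit}  {benford_share(digit):.1%}")
```

The nine shares add up to 100%, which is a handy sanity check on the formula.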
Example
There's more to it, but that's the basic definition. Here's an example: suppose you measure the height of all buildings in the world in centimeters, take the leading digit of every height, and count up how many of each digit you saw (1,000,000 centimeters thus being counted as "1" and so on). You will find that 1 is the most frequent: about 30% of the measurements will start with a 1, and by the time you get down to 9, only about 5% will.
So why is that a problem?
It's not always clear at first why this is a problem. So building heights have a leading digit of 1 more often than 2. Is that so odd? Well, actually, it is. If you just picked random numbers, evenly distributed amongst the integers, then the leading digits would be evenly mixed. If you set some arbitrary upper bound (houses can't be any taller than the tallest building), then there would be some skew, based on that largest number, but it's still not going to match Benford's Law. It's not so hard to buy that buildings have some sort of bias toward numbers that start with 1, but why is the same exact bias present in the distance between stars in a galaxy or the prices in the stock market? These things can't all be weighted the same way, can they? Are we seeing some underlying mechanism in all these things? Well, sort of...
What you missed
What you missed is that there are really two numbers being picked every time you take a measurement, not one. The first number limits the "domain" and the second number is the "answer". You tend to think that seemingly random numbers can be anywhere between 0 and infinity, but they cannot, because infinity isn't actually a number. Any given building might have been limited in its height by local zoning, cost, structural realities, etc. So, any given building could not have been taller than some value. That value defines the domain.
Why it works
Let's call the domain d and the measurement that you take n. The measurement (height of a building, in our example) is going to be any number between 1 and d (we don't call a structure of height 0 a "building", so we don't count 0 as a valid answer). Now, if the leading digit of d happens to be 5, then 1, 2, 3, 4, and 5 will be more common as leading digits than 6, 7, 8, and 9.
Want that proved? Say d is 54cm. You could have found that the building was 1cm tall or between 10 and 19cm tall. That's 11 possible outcomes that would have a leading 1. There are also 11 possible outcomes each for a leading 2, 3, and 4. However, only 5 and 50-54 have a leading 5, so that's 6 possible leading 5s. Each of the remaining digits (6, 7, 8, and 9) leads exactly one possible outcome.
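If you'd rather not count by hand, here's a small sketch that does the tally for any upper bound. The function name `leading_digit_counts` is made up for this example.

```python
from collections import Counter


def leading_digit_counts(d: int) -> dict[int, int]:
    """Count how many of the integers 1..d start with each digit 1-9."""
    counts = Counter(int(str(n)[0]) for n in range(1, d + 1))
    return {digit: counts[digit] for digit in range(1, 10)}


print(leading_digit_counts(54))
# {1: 11, 2: 11, 3: 11, 4: 11, 5: 6, 6: 1, 7: 1, 8: 1, 9: 1}
```

Try other bounds and you'll see the same shape: the digits below the bound's leading digit tie, the bound's own leading digit gets a partial share, and the rest get very little.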
That's pretty simple, but 1 isn't the most common result. Instead, 1, 2, 3, and 4 are equally common. Ah, but that's why I said that you select two numbers every time, not one. You select a new d every time, and a new n every time. So the first time, you selected d=54, but the second time you select d=72. We can quickly see that with a d of 72, the leading digit of n has a higher chance of being any one of 1-6 than of being 7, and a higher chance of being 7 than of being 8 or 9. If you keep doing this, over and over again, you might as well be asking something like this:
- Pick a number between 1 and 5
- Pick a number between 1 and 7
- Pick a number between 1 and 3
- Pick a number between 1 and 9
- Pick a number between 1 and 1
... and so on. You can see that 1 is a possible outcome for every question, but 2 can be the answer to all but the last one, so 2 is the second most likely answer. This continues on until 9. It's always possible that you choose a number between 1 and a bound made up entirely of 9s (1-9 in our earlier examples), but those numbers have an even distribution of first digits, so this just smooths out the results a bit. It doesn't change the fact that 1 is still the most common, and so on.
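To put numbers on the five questions in the list, here's a rough sketch that works out the exact chance of each answer. It assumes each of the five questions is equally likely to be the one you're asked, which is my reading of the list rather than something stated outright. The function name `outcome_probability` is made up for the example.

```python
from fractions import Fraction

# Upper bounds taken from the five "pick a number" questions in the list.
bounds = [5, 7, 3, 9, 1]


def outcome_probability(value: int, bounds: list[int]) -> Fraction:
    """Chance of getting `value` when one bound is chosen at random
    (assumed equally likely) and a number is picked evenly from 1..bound."""
    reachable = [b for b in bounds if b >= value]
    return sum((Fraction(1, b) for b in reachable), Fraction(0)) / len(bounds)


for value in range(1, 10):
    p = outcome_probability(value, bounds)
    print(value, p, f"{float(p):.3f}")
```

With only five bounds some neighbors tie (2 and 3, for instance), but the chances never go up as the answer gets bigger, and 1 comes out well ahead.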
So, Benford's Law works as long as you have a d that varies quite a lot, and in the real world, the domain of a function usually varies because there are multiple underlying functions. In our example, the height of a building could be limited by any number of factors, and the most significant one to the builder might not be obvious. If all buildings were built as tall as current materials would allow, then you might not see the same distribution, but as you know, the reasons for a building's height are more complex than that.
I lied
I'm actually over-simplifying slightly. Selecting a single d at each step actually produces a distribution of first digits that's not quite right. Here's the result of 1,000,000 trials:
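- 1: 24.085% (240846 trials)
- 2: 18.391% (183914 trials)
- 3: 14.544% (145438 trials)
- 4: 11.748% (117476 trials)
- 5: 9.500% (95004 trials)
- 6: 7.614% (76136 trials)
- 7: 6.049% (60489 trials)
- 8: 4.644% (46442 trials)
- 9: 3.425% (34255 trials)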
If we think back to the building example, it becomes easy to see why this isn't working. In that example, a building's height was the result of not one, but many factors, and the one that won out was the one that limited the height the most. So we need to select several d at each step, representing the domains of several functions, but only apply the smallest. Here's what we get when we select 10 d at each step, and use the smallest:
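- 1: 29.760% (297601 trials)
- 2: 16.714% (167136 trials)
- 3: 12.162% (121624 trials)
- 4: 9.818% (98182 trials)
- 5: 8.184% (81844 trials)
- 6: 7.072% (70719 trials)
- 7: 6.172% (61718 trials)
- 8: 5.376% (53759 trials)
- 9: 4.742% (47417 trials)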
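If you want to try this yourself, here's a rough sketch of that kind of experiment. It is not the program that produced the figures above. The largest possible d (`max_limit`, set to 1,000,000 here), the fixed seed, and the names `simulate` and `limits_per_trial` are all my own assumptions.

```python
import random
from collections import Counter


def simulate(trials: int, limits_per_trial: int,
             max_limit: int = 10**6, seed: int = 0) -> dict[int, float]:
    """Share of each leading digit of n over `trials` rounds.

    Each round draws `limits_per_trial` candidate limits from 1..max_limit
    (an assumed range), keeps the smallest as d, then picks n from 1..d.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        d = min(rng.randint(1, max_limit) for _ in range(limits_per_trial))
        n = rng.randint(1, d)
        counts[int(str(n)[0])] += 1
    return {digit: counts[digit] / trials for digit in range(1, 10)}


# One limit per step versus the smallest of ten limits per step.
for k in (1, 10):
    shares = simulate(trials=50_000, limits_per_trial=k)
    print(k, " ".join(f"{digit}:{share:.1%}" for digit, share in shares.items()))
```

With these assumptions the results should land in the same neighborhood as the two sets of figures above, though the exact numbers will shift with the seed, the trial count, and the upper limit you choose.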
As you can see, this very nearly matches the distribution anticipated by Benford's Law. The more complex the system (the more values of d you select at each step), the more closely the distribution will match Benford's Law.
What can you do with this
It turns out that you can often detect tampering with complex systems by looking for violations of Benford's Law. If you see significant deviations from the expected frequency of first digits, then you are probably looking at a system with a limited number of underlying functions... such as one person making up a bunch of numbers that sound random to them.
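Here's one way you might check a set of numbers, as a sketch rather than a real audit tool. It uses a chi-square statistic to compare the observed first digits against the Benford shares. The article doesn't prescribe this test, so that choice is mine. The function name `benford_chi_square` is made up, the code assumes positive integers, and 15.507 is the standard 5% cutoff for 8 degrees of freedom.

```python
import math
from collections import Counter

# Chi-square critical value at the 5% level with 8 degrees of freedom.
CRITICAL_5_PERCENT_DF8 = 15.507


def benford_chi_square(values) -> float:
    """Chi-square statistic of the leading digits of positive integers
    against the Benford shares log10(1 + 1/d)."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    statistic = 0.0
    for digit in range(1, 10):
        expected = total * math.log10(1 + 1 / digit)
        statistic += (counts[digit] - expected) ** 2 / expected
    return statistic


powers_of_two = [2**k for k in range(1, 1001)]  # known to follow Benford closely
flat_digits = list(range(100, 1000))            # every leading digit equally often

for name, data in (("powers of two", powers_of_two), ("100..999", flat_digits)):
    stat = benford_chi_square(data)
    verdict = "deviates" if stat > CRITICAL_5_PERCENT_DF8 else "consistent"
    print(f"{name}: {stat:.2f} ({verdict})")
```

A large statistic is a reason to look closer, not proof of tampering: numbers that come from a narrow range can miss the Benford distribution for perfectly honest reasons.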
External links and other resources
- Benford's Law on Wikipedia
- Assessing Data Authenticity with Benford's Law on ISACA
