Benford’s Law

· April 1, 2011

Benford’s Law is not an exciting new John Nettles-based detective show, but an interesting observation about the distribution of first digits in sets of numbers originating from various processes. It says, roughly, that in a big collection of data you should expect a number to start with 1 about 30% of the time, but with 9 only about 5% of the time. Precisely, the expected proportion for a given leading digit can be worked out as:

<?php
function benford($num) {
        return log10(1+1/$num);
}
?>
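As a quick sanity check on that formula, the nine proportions should add up to 1, because the logarithms telescope to log10(10). A minimal sketch that computes the expression inline rather than calling `benford()`, so it runs on its own (the `$expected` array is just a name chosen for this example):

```php
<?php
// Expected proportion for each leading digit d: log10(1 + 1/d).
$expected = array();
foreach (range(1, 9) as $d) {
        $expected[$d] = log10(1 + 1 / $d);
        echo $d, " - ", number_format($expected[$d], 3), PHP_EOL;
}
// The terms telescope: log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10) = 1.
echo "Sum - ", number_format(array_sum($expected), 3), PHP_EOL;
```

The first line printed is the 30% figure for a leading 1, and the last digit line is the roughly 5% figure for a leading 9.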

Real data does tend to fit this pretty well. For example, leaping onto data.gov.uk at random and grabbing a dataset (in this case a list of spending by the Science and Technology Facilities Council), I can compare the first digits to Benford’s expected ones. I grabbed the Amount column out of the April 2010 data and put it into a text file, one amount per line:

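<?php
// Assumes benford() from the snippet above is defined in the same script.
$fh = fopen("data.txt", 'r');
$score = array();
$total = 0;
$nums = range(1, 9);
// Count up appearances of leading digits.
// Note: $total counts every line read, including lines skipped below
// (blank lines, or anything not starting with 1-9), so the proportions
// can add up to slightly less than 1.
while(($data = fgets($fh)) !== false) {
        $total++;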
        $digit = substr(trim($data), 0, 1);
        if(!in_array($digit, $nums)) {
                continue;
        }
        if(!isset($score[$digit])) {
                $score[$digit] = 0;
        }
        $score[$digit]++;
}
arsort($score);
echo "# - Data  - Benford", PHP_EOL;
foreach($score as $digit => $count) {
        echo    "$digit - ",
                number_format($count/$total, 3),
                " - ",
                number_format(benford($digit), 3),
                PHP_EOL;
}
?>

We get a pretty clear match:

# - Data  - Benford
1 - 0.273 - 0.301
2 - 0.181 - 0.176
3 - 0.114 - 0.125
4 - 0.107 - 0.097
5 - 0.088 - 0.079
6 - 0.070 - 0.067
7 - 0.055 - 0.058
8 - 0.050 - 0.051
9 - 0.047 - 0.046

Graph of the STFC data versus Benford’s Law
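To put a rough number on how close the match is, one simple option is the mean absolute difference between the observed and expected proportions. This is only a sketch: the observed figures are copied by hand from the output above, and the measure itself is a choice made for this illustration rather than anything the original script computes.

```php
<?php
// Observed proportions copied from the output above.
$observed = array(
        1 => 0.273, 2 => 0.181, 3 => 0.114,
        4 => 0.107, 5 => 0.088, 6 => 0.070,
        7 => 0.055, 8 => 0.050, 9 => 0.047,
);
// Absolute gap between each observed share and Benford's log10(1 + 1/d).
$diffs = array();
foreach ($observed as $digit => $share) {
        $diffs[$digit] = abs($share - log10(1 + 1 / $digit));
}
$mad = array_sum($diffs) / count($diffs);
echo "Mean absolute difference: ", number_format($mad, 4), PHP_EOL;
echo "Largest gap is at digit ", array_search(max($diffs), $diffs), PHP_EOL;
```

The figure is most useful for comparing one dataset against another; no cut-off for a "good" fit is assumed here.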

This is fun, because if someone makes up a data set, it probably won’t follow this distribution. Accountants use this to flag potentially fraudulent entries. If there is a reporting limit at £3000 within a certain company where fraud is going on, there will probably be more dodgy transactions at £2999, for example, which will throw off the stats. More advanced checking goes further into the digits rather than just considering the initial one. As always, there’s plenty more on the law on Wikipedia.
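Going further into the digits uses the same expression: put a two-digit number in place of the single digit and you get the expected share of values that start with those two digits. A rough sketch of that idea, where `benford_leading` is a name made up for this example:

```php
<?php
// Expected proportion of numbers whose leading digits are $n,
// e.g. $n = 29 for amounts like 2999 or 29.50. Hypothetical helper name.
function benford_leading($n) {
        return log10(1 + 1 / $n);
}

// Share of amounts expected to start with "29".
echo "29 - ", number_format(benford_leading(29), 4), PHP_EOL;

// Summing 20..29 recovers the single-digit figure for a leading 2.
$two = 0.0;
foreach (range(20, 29) as $n) {
        $two += benford_leading($n);
}
echo "2x - ", number_format($two, 3), PHP_EOL;
```

So only about 1.5% of amounts would be expected to begin with 29, and a pile of £2999 entries well above that share would stand out.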