Drill #17 - Benford's Law

Monday, November 12 - Due at the end of the day

The purpose of this exercise is to give you a little practice using arrays and to demonstrate Benford's Law.  Given a collection of naturally occurring numbers - for example, the lengths of rivers around the world, enrollment at UAA over the years, or the number of votes for a candidate by precinct - you might expect these numbers to have nothing in common.

But what if you looked at the distribution of the leading digit of these numbers?  In other words, count the number of times 1 is the leading digit, 2 is the leading digit, etc.  A natural assumption is that all of these digits are equally likely, so there should be about an 11% chance (1 in 9) of seeing any one value from 1-9 as the leading digit.

What we find instead is quite remarkable! 1 is the leading digit about 30% of the time, and the probability drops to around 5% for the digit 9.  This distribution holds for all of these examples despite the disparate sources.  One application of this phenomenon is verification of voting records - if the number of votes by precinct doesn't match the expected distribution then that raises questions about vote tampering.
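The percentages quoted above agree with the usual statement of Benford's Law, which gives the probability of leading digit d as log10(1 + 1/d). The short sketch below prints those expected values so you can compare them with your program's output later. The class and method names (BenfordExpected, expectedPercent) are made up for this illustration and are not part of the starter code.

```java
// Sketch only: prints the expected leading-digit percentages under Benford's Law.
// BenfordExpected and expectedPercent are illustrative names, not part of the drill's starter code.
class BenfordExpected
{
	// Expected percentage for leading digit d (1-9): 100 * log10(1 + 1/d)
	static double expectedPercent(int d)
	{
		return 100.0 * Math.log10(1.0 + 1.0 / d);
	}

	public static void main(String[] args)
	{
		for (int d = 1; d <= 9; d++)
		{
			// %.1f rounds to one decimal place; %% prints a literal percent sign
			System.out.printf("Digit %d: %.1f%%%n", d, expectedPercent(d));
		}
	}
}
```

Real data sets will not match these values exactly, so treat them as a rough target rather than a pass/fail check.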


In this exercise your task is to complete a program that uses an array to compute how often, as a percentage, each digit 1-9 appears as the first digit in the following data sources (save them to your Java project folder):

  1. enrollments.txt  - Enrollment by course section at UAA in Spring 2010.  The original data is here.
  2. livejournal.txt - Number of new posts per day at LiveJournal.com.  Data from here.
  3. internethosts.txt - Number of hosts on the internet since 1981.

Start with this Java code which loads in the file for you and defines some variables. It also extracts the first digit out of each number in the file:


import java.util.Scanner;
import java.io.FileInputStream;

public class Benford
{
	public static void main(String[] args)
	{
		Scanner inputStream = null;
		int[] firstDigitCount = new int[10];  // Counts how many times 0 is the first digit, 1 is the first digit, etc.
		// *** TO DO :
		// You will need to add a variable to count how many numbers are read in

		// Initialize digit counts to zero
		for (int i = 0; i < firstDigitCount.length; i++)
		{
			firstDigitCount[i] = 0;
		}

		// Loop through the file, read in each number one at a time,
		// extract the first digit, then count up the number of times we see
		// each digit in the firstDigitCount array
		try
		{
			// Pick a file to load, after your program works try changing
			// this to the other files
			inputStream = new Scanner(new FileInputStream("enrollments.txt"));
			while (inputStream.hasNextLine())
			{
				String s = inputStream.nextLine();
				// Get the first digit; subtracts the ascii code for '0'
				int firstDigit = s.charAt(0) - '0';

				// *** TO DO HERE
				// At this point, firstDigit contains the first digit of the
				// number read from the file. Use this as an index into the firstDigitCount
				// array, so if firstDigit = 3 then firstDigitCount[3] is incremented, if
				// firstDigit = 4 then firstDigitCount[4] is incremented, etc.
				// You don't need a bunch of different if statements to do this!
				//
				// Also increment by one the count of how many numbers are read in

			}
			inputStream.close();

		}
		catch (Exception e)
		{
			System.out.println("There was a problem with the file...");
		}

		// At this point the loop has ended.
		// *** TO DO HERE
		// Add a loop that calculates and prints out the percentage of times
		// that 1 was the first digit, that 2 was the first digit, etc. all the
		// way up to digit 9.
		//
		// For example, if firstDigitCount[1] = 30 and the total number of digits
		// counted is 100, then the program should output something like:
		//
		//     The first digit of 1 appeared 30.0000 percent of the time.
		//
		// You don't have to control for the number of digits after the decimal point.

	}
}

The program uses an array named firstDigitCount, and you have to complete the program to make use of it. The intent is for the array to count how many times each digit appears as the first digit of a number read from the file:

  firstDigitCount[0] - Number of times 0 is the leading digit (will be 0 because none of the numbers starts with a 0)
  firstDigitCount[1] - Number of times 1 is the leading digit
     ...
  firstDigitCount[9] - Number of times 9 is the leading digit

For example, if the file has 10 numbers and 3 of them start with the digit 2, then firstDigitCount[2] should end up with the number 3. You will have to write the code that increments the appropriate value in the array. You will also have to define a variable that counts up how many total numbers were read in, so you can calculate a percentage.
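One way to picture the counting step is shown below, using a small made-up array of strings in place of the file. This is a sketch of the general idea (using the digit itself as the array index), not the required solution. The class name, method name, and sample values are all invented for the illustration.

```java
// Sketch only: counts leading digits for a hard-coded sample instead of a file.
// DigitIndexSketch, countFirstDigits, and the sample values are illustrative assumptions.
class DigitIndexSketch
{
	// Returns an array where element d holds how many strings start with digit d.
	// Assumes every string is non-empty and begins with a character '0'-'9'.
	static int[] countFirstDigits(String[] numbers)
	{
		int[] counts = new int[10];   // Java initializes int arrays to all zeros
		for (String s : numbers)
		{
			int firstDigit = s.charAt(0) - '0';   // '2' - '0' gives 2, and so on
			counts[firstDigit]++;                 // the digit is the index; no if statements needed
		}
		return counts;
	}

	public static void main(String[] args)
	{
		// Ten made-up numbers, three of which start with the digit 2
		String[] sample = {"120", "245", "19", "2300", "87", "2", "31", "1024", "650", "77"};
		int[] counts = countFirstDigits(sample);
		System.out.println("Numbers starting with 2: " + counts[2]);   // prints 3
		System.out.println("Total numbers read: " + sample.length);   // prints 10
	}
}
```

In the drill itself there is no array of strings to take the length of, so the total has to be kept in a separate counter that goes up by one each time through the while loop.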

In the section after the loop, add code to compute the percentage each digit appears and output it. For example, if firstDigitCount[1] is 30 and firstDigitCount[2] is 15, and 100 numbers were read in, then the program should output something like:

        Digit 1 appeared 30.0% of the time.
        Digit 2 appeared 15.0% of the time.
        ...

You don't have to worry about rounding the numbers nicely.
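One thing to watch for in the percentage calculation is integer division: if the count and the total are both ints, an expression such as count / total * 100 comes out as 0 whenever count is smaller than total. The sketch below avoids this by multiplying by 100.0 first, which makes the arithmetic use doubles. The class name, method name, and counts are invented for the illustration, and this is only one reasonable way to write it.

```java
// Sketch only: turns digit counts into percentages without losing the fraction.
// PercentSketch, percent, and the sample counts are illustrative assumptions.
class PercentSketch
{
	// 100.0 is a double, so the whole expression is evaluated in floating point
	static double percent(int count, int total)
	{
		return 100.0 * count / total;
	}

	public static void main(String[] args)
	{
		int[] firstDigitCount = new int[10];
		firstDigitCount[1] = 30;   // made-up counts matching the example in the text
		firstDigitCount[2] = 15;
		int total = 100;           // made-up total of numbers read in

		// Start at 1 because no number begins with the digit 0
		for (int d = 1; d <= 9; d++)
		{
			System.out.println("Digit " + d + " appeared " + percent(firstDigitCount[d], total) + "% of the time.");
		}
	}
}
```

With these counts the first two lines of output are "Digit 1 appeared 30.0% of the time." and "Digit 2 appeared 15.0% of the time."; the remaining digits show 0.0% because their counts were left at zero.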