Wednesday, August 31, 2011

Beware of Econometricians Bearing Spreadsheets

"Let's not kid ourselves: the most widely used piece of
software for statistics is Excel"
(B. D. Ripley, RSS Conference, 2002)

What a sad state of affairs! Sad, but true when you think of all of the number crunching going on in those corporate towers.

With the billions of dollars that are at stake when some of those spreadsheets are being used by the uninitiated, you'd think (and hope) that the calculations are squeaky clean in terms of reliability. Unfortunately, you'd be wrong!

A huge number of reputable studies over the years - ranging from McCullough (1998, 1999), to the special section in Computational Statistics & Data Analysis in 2008 - have pointed out some of the numerical inaccuracies in various releases of some widely used spreadsheets. With reputations at stake, and the potential for litigation, you'd again think (and hope) that by now the purveyors of such software would be on the ball. Not so, it seems!

A recent, and very comprehensive study by Almiron et al. (2010) in the open-access Journal of Statistical Software is a must-read if you're feeling tempted to use spreadsheets for any statistical/econometric analysis that matters. In other words, for any such analysis!

Would you trust a spreadsheet that claims to use a particular, highly regarded algorithm for generating random numbers, when it has been proven that this claim is false? What about a spreadsheet that, when purportedly computing the cumulative distribution function for the Poisson distribution, fails to compute even the first digit correctly? Regrettably, you (and a few million other people) have probably been falling prey to these errors for a very long time!

The study undertaken by Almiron et al. uses a range of different data-sets, and deals with the performances of five spreadsheets (Calc, Excel, Gnumeric, NeoOffice, and Oleo) running on up to three different operating systems:



i386 / Windows
i386 / Ubuntu
amd64 / Mac OS
Calc
2.4.1
 X
X


3.0.1
X
X
X
Excel
2007
X



2008



X
Gnumeric
1.8.3


X


1.9.1
X


NeoOffice
2.2.5



X

3.0



X
Oleo
1.99.16


X


The authors checked the number of correct significant digits computed by the spreadsheets when calculating the sample mean, standard deviation, and first-order autocorrelation; the F-statistic in one-way ANOVA tests; linear and non-linear regression; and a range of distribution functions. As I've mentioned already, the quality of the spreadsheets' algorithms  for computing pseudo-random numbers is also analyzed.

Read the results and weep! And don't presume that you get what you pay for!

Of course, econometricians are much too careful to fall for this! Apparently not. A fairly recent text book by Barreto and Howland (2006), is extremely well motivated - it uses Monte Carlo simulation to teach basic econometrics. I like the concept a lot. Unfortunately, the book's sub-title leaves me a bit concerned - see the reference below.

The bottom line to all of this? Almiron et al. (2010, p.25) :
"Finally, as a rule of the thumb, every user should be aware that spreadsheets have serious limitations. Other platforms are advisable, being currently R the most dependable FLOSS (Free/Libre Open Source Software, see Almiron et al. 2009)."
I couldn't have said it better myself!


Note: The links to the following references may require that your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.

References

Almiron, M. G., E. S. Almeida and  M. N. Miranda (2009). The reliability of statistical functions in four software packages freely used in numerical computation. Brazilian Journal of Probability and Statistics, 23, 107-119.

Almiron, M. G., B. Lopes, A. L. Oliviera, A. C. Medeiros and A. C. Frery (2010). On the numerical accuracy of spreadsheets. Journal of Statistical Software, 34 (4), 1-29.

Barreto, H. and F. M. Howland (2006). Introductory Econometrics: Using Monte Carlo Simulation with Microsoft Excel®. Cambridge University Press, Cambridge.

McCullough, B. D.  (1998). Assessing the reliability of statistical software: Part I. The American Statistician, 52, 358-366.

McCullough, B. D.  (1999). Assessing the reliability of statistical software: Part II. The American Statistician, 53, 149-159.


© 2011, David E. Giles

No comments:

Post a Comment