Tuesday, March 15, 2011

Modelling Flowers on the Wall

It probably wasn't what the Statler Brothers actually had in mind as they soared to Number 4 on the U.S. Billboard Hot 100 and Number 1 on the Canadian RPM Top Singles charts in 1965, but around this time of year the good citizens of our town engage in what its organisers describe as a "light-hearted" week-long activity known as The Victoria Flower Count. In short, this event involves its devotees, and multitudes of coerced school children, in "counting" (we'd call it "estimating") the number of flowers already in bloom in the gardens and parks of the region. Flowering daffodils, snowdrops, heather, and the like are carefully counted as each local municipality in the Greater Victoria region vies to be top dog in their contribution to the grand total count - amounting to 260,457,579 colourful blooms this year.

The 2011 Flower Count (FC from here on) was held from 1 to 7 March, though historically it's often been held at times ranging from late February to late March. We feel that it's O.K. to bend the rules (sorry - vary the dates) just a little if it helps in making a point - something that you may have noticed already in this blog. Apparently, the idea of the FC is to celebrate the impending demise of our long, dark and bitter three weeks, or so, of winter. Much more importantly, it gives us another chance to annoy the heck out of our friends and relatives who live elsewhere in Canada where they are still checking their calendars and looking forward to half-time in the snow-shovelling game to which they claim to be devoted.

I thought that this FC might provide some interesting data that I could sift through. It probably does, but getting hold of the numbers for past years is not as easy as you might think. I Googled everything under the sun, and eventually came up with what seem to be reliable (and disturbingly exact) numbers for 1996 and 2001 to 2011. At that point I decided to get serious because the clock was ticking, so on Sunday 6 March I emailed the contact person at The Greater Victoria Chamber of Commerce - the primary organizing body for the FC - as follows:

"Hi: I wonder if someone would be able to supply me with data for the numbers of flowers counted in the Victoria Flower Count, back over the years - preferably to when it began, if possible. I see that there are one or two such numbers on your website, but I am hoping to obtain more information to help with a small study that I am planning to undertake. If there is someone else who I should be contacting with this request, please let me know. Thank you, in anticipation. Sincerely, Dave Giles."
The reply came at 9:10 the next morning - it was friendly and unnervingly precise, but not quite what I had hoped for:

"Good Morning, The flower count is currently at 166,670,111. Warm Regards, (name suppressed to protect the innocent)............ etc."
Interestingly, the closing time for the FC was 3:00 p.m. that day (a Monday), so apparently 93,787,468 blooms were reported in the space of just under 6 hours. I guess it had been a busy weekend for someone. Or maybe lots of our selfless little friends skipped Monday's classes for the greater good?

Such are the joys of trying to gather data! Undaunted, I've taken a close look at the numbers that I actually have to hand. That didn't take very long - there are precious few of them, but they're too interesting and large (over 21 billion in 2010) to ignore, as you can see:

You can find these data in an Excel file on the "Data" page for this blog.

Now, for those of you who are just starting out in pursuit of an econometric lifestyle (and who among us doesn't aspire to that!), there are several interesting lessons that we can draw from these data:
  1. The word "data" is plural (the singular counterpart is "datum") - please remember that.
  2. Always plot your data before undertaking any analysis of the numbers.
  3. It can be difficult, in practice, to distinguish measurement or recording errors from genuine "outliers" (extreme values) in a set of data.
  4. Outliers can wreak havoc with statistical inference unless you recognize them and deal with them appropriately.
  5. If your sample is "complete", and no values are missing, then it's a miracle, and you should immediately celebrate in the manner of your choice.
  6. Missing values can often be viewed as a blessing in disguise - they provide an opportunity to be creative.
  7. In the case of time-series data (which is what we have here), you can usually do something about the "gaps" in the series, and often this is going to be better than just "truncating" the series so as to eliminate them.
I'll leave you to write a note to yourself about point #1, and put it on your refrigerator door. Let's focus on the other points, more or less in order. Point #2: it's a good thing that I plotted the data because Figure 1 suggests that 2010 was an interesting year - botanically speaking, at least. Perhaps the FC was held a little later last year? No, it turns out that it was actually held from 28 February to 4 March. (Originally, I was hoping to obtain the FC dates for the other years, but regrettably, this was not to be.) Was Spring sprung earlier than usual last year? Well, potentially that gets us into a discussion of climate change, and there is NO WAY that I'm going there in this town! I'd just end up with obnoxious emails and 'phone calls telling me that I don't have the right to speak on such matters because I'm not a climate scientist, or some such sanctimonious nonsense.

So (point #3), we're going to have to live with that number of 21,691,666,716 for 2010 - at least for the moment. Did they cook the books (sorry, make a recording error), or is it a genuine outlier? I applied the Grubbs (1969) test for a positive outlier to the 2010 observation, using the only complete (unbroken) sample I have, comprising data from 2001 to 2011. The test statistic is 2.782, and the 5% (1%) one-sided critical values for n = 11 are 2.34 (2.48). The statistic exceeds both critical values, suggesting that we should reject the hypothesis that the 2010 value is a genuine (if extreme) draw from the same normal population as the other observations, in favour of the alternative that it is something else - a measurement error. Strictly speaking, the Grubbs test assumes normality of the data. Ignoring the 2010 observation and applying various tests for normality, we get p-values in excess of 50% for the Cramér-von Mises (W²), Watson (U²) and Anderson-Darling (A²) tests, and a p-value in excess of 10% for the Lilliefors (D) test. None of these tests rejects normality, so I'd be justified in treating the outrageous observation for 2010 as a mistake of some sort, rather than a genuine, but very extreme, observation. This means that I should drop it from my sample.
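For anyone who'd like to see the mechanics of the Grubbs statistic, here's a minimal sketch in Python. It is an illustration only, not the EViews code used for this post: the numbers in `sample` are made up (they are NOT the FC data), the function name is hypothetical, and the critical value is simply the 1% figure quoted above for n = 11.

```python
import math


def grubbs_max_statistic(values):
    """Grubbs statistic for the largest observation: (max - mean) / s."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation, using the usual n - 1 divisor.
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return (max(values) - mean) / s


# Hypothetical sample (not the Flower Count data): ten ordinary values
# and one that is wildly out of line with the rest.
sample = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 14.0, 13.0, 15.0, 90.0]

g = grubbs_max_statistic(sample)
critical_1pct = 2.48  # one-sided 1% critical value for n = 11, as quoted in the post
is_flagged = g > critical_1pct

print(f"G = {g:.3f}; exceeds the 1% critical value: {is_flagged}")
```

One thing worth knowing: the statistic can never exceed (n - 1)/√n, which is about 3.02 when n = 11. So in a small sample even an absurdly extreme value produces a fairly modest-looking G.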

Suppose that I wanted to estimate the trend in the FC numbers. I have very few data, but at worst I have the complete data from 2001 to 2009, and we usually think of a trend in a time-series as something that emerges over 10+ years, so I could stretch a point and just go with the "complete" sample. Here's what I get when I fit a linear trend to the 2001 - 2009 data, using OLS:

(The EViews workfile for all of the econometric analysis for this post is available on the "Code" page for this blog.)

So, there's a downward trend in the success of the FC, but it is utterly insignificant (F = 0.36; p-value = 0.56). What would have happened if I had included the data for 2010 (and then for 2011 as well, of course)? The OLS results are quite different numerically, though they're still statistically insignificant:

Oh dear! Point #4 well taken! We've seen the havoc, so how do we deal with it? Just suppose that we were uneasy about dropping the 2010 observation - perhaps because of the small sample that was used to apply the Grubbs test. In that case we'd be treating this observation as a genuine outlier, and the obvious thing to do is to fit the trend line using Least Absolute Deviations (LAD) regression, instead of OLS. This is a special case of "Quantile Regression", using the 0.5 quantile (50th percentile). The LAD regression line estimates the conditional median (rather than the conditional mean) of the data, so using LAD here should produce a result that is relatively "robust" to the extreme observation in 2010. Here's what we get using the sample for 2001 to 2011:

Using the robust regression gets us back to something rather like the OLS trend when the 2010 observation is dropped, but not quite - that observation is given some weight, but not as much as when we use OLS. Again, the trend is not statistically significant.
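To see why LAD shrugs off a single wild value while OLS doesn't, here's a small Python sketch. Again, this is only an illustration under stated assumptions: the series is hypothetical (an exact downward line with one spike planted in 2010, not the FC data), the function names are made up, and the LAD fit uses a brute-force search rather than the quantile regression routine in EViews. The search relies on the fact that, with one regressor, at least one LAD-optimal line passes through two of the data points.

```python
from itertools import combinations


def ols_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def lad_line(x, y):
    """LAD fit of y = a + b*x by checking every line through two data points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a candidate
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # Sum of absolute residuals for this candidate line.
        loss = sum(abs(yi - intercept - slope * xi) for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical data: a steady decline of one unit per year, plus one huge spike.
years = list(range(2001, 2012))
counts = [30.0 - 1.0 * (t - 2001) for t in years]
counts[years.index(2010)] = 500.0

ols_intercept, ols_slope = ols_line(years, counts)
lad_intercept, lad_slope = lad_line(years, counts)

print(f"OLS slope: {ols_slope:.3f}")
print(f"LAD slope: {lad_slope:.3f}")
```

In this contrived example the spike drags the OLS slope from negative to positive, while the LAD line stays on the underlying trend. Real data are never this tidy, of course, which is why LAD gave the 2010 observation some weight in the results above.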

Now, whether the observation for 2010 should be treated as "missing" or not, I'm still missing the data for 1997, 1998, 1999 and 2000. No miracles here anywhere on the horizon at this stage, so we can forget about celebrating (point #5)! What can I do instead? Well, I've sometimes seen people come up with the following supposedly bright idea (point #6). First, fit the model (the trend in our case) using just the sample observations that are not "missing". That's 1996, 2001 to 2009, and 2011 in our case. Then use this fitted model to predict (impute) the "missing" values - those for 1997 to 2000, and 2010, in our case. Finally, fit the model (the trend) using the "filled in" series (actual plus imputed) over the period 1996 to 2011. It sounds almost too good to be true! And it is! Let's see what happens if we do this with the FC data. First, here is the basic regression based on all of the "non-missing" data (with Newey-West standard errors):

FLOWERS = 491774.00 - 243.64 YEAR + residual ;  n = 11 ;  R² = 0.19
         (242183.00) (120.82)

Now, here's the corresponding result when we use the imputed data:

FLOWERS = 491774.00 - 243.64 YEAR + residual ;  n = 16 ;  R² = 0.31
         (129617.10) (64.75)

So what went on here? Well, in statistics/econometrics you never get something for nothing! Here we used the available data, and no additional information, to impute the missing values. Then we used the original data, and the values imputed using these data alone, to fit the final trend. Imputing the missing values in this particular way added absolutely no information to what we had already in the "complete" sample - so, of course, the coefficient estimates didn't change. You can easily check that this must be the case, mathematically, if you don't like the intuition. If you have difficulty with this, check out Kmenta (1986, pp. 379-388). His is one of the few texts that discuss this issue. Of course, the standard errors, R², etc. changed because the residuals (and the sample mean of FLOWERS) are different across the two models - notice that n = 11 in the first case, and n = 16 for the second regression.

Just make sure that you don't fall into this trap!
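If you'd rather see the trap than take it on trust, here's a minimal Python sketch of the "impute, then re-fit" exercise. The assumptions are all mine: the series is hypothetical (it just has the same pattern of gaps as the FC data), the function name is made up, and the standard errors are the conventional OLS ones rather than the Newey-West ones reported above.

```python
import math


def ols_trend(x, y):
    """Fit y = a + b*x by OLS; returns (a, b, conventional std. error of b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return a, b, se_b


# Hypothetical series (NOT the FC data) observed in 1996, 2001-2009 and 2011.
obs_years = [1996] + list(range(2001, 2010)) + [2011]
obs_values = [5200.0, 3900.0, 4100.0, 2800.0, 3600.0, 3000.0,
              2500.0, 3100.0, 2200.0, 2700.0, 1900.0]
gap_years = [1997, 1998, 1999, 2000, 2010]

# Step 1: fit the trend to the observed data only.
a_obs, b_obs, se_obs = ols_trend(obs_years, obs_values)

# Step 2: "fill in" the gaps with predictions from that same trend.
imputed = [a_obs + b_obs * t for t in gap_years]

# Step 3: re-fit the trend to the observed-plus-imputed series.
a_all, b_all, se_all = ols_trend(obs_years + gap_years, obs_values + imputed)

print(f"n = 11: slope = {b_obs:.4f}, s.e. = {se_obs:.4f}")
print(f"n = 16: slope = {b_all:.4f}, s.e. = {se_all:.4f}")
```

The imputed points lie exactly on the first fitted line, so they contribute zero residuals and the least squares solution can't move. All that changes is that the same residual sum of squares gets spread over more "observations", so the standard error shrinks and the results look more precise than they have any right to.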

Of course, if we had additional relevant variables (information) that could have been included when fitting the model that was used to impute the missing values, but were not used in fitting the final trend, then the story would have changed. By the way, the fact that we are fitting a linear trend here has nothing to do with this basic message. It's also worth pointing out that there's an old (but recently revived) econometrics literature on fitting regressions using additional, relevant, variables to impute missing values - see Chow & Lin (1971), Giles (1986), Santos Silva & Cardoso (2001) and Lahiri et al. (2010), for example. This can be extremely helpful in cases such as ours, but also if you have a mixture of (say) annual time-series and quarterly time-series that you want to use in the same regression model. You don't necessarily have to convert the quarterly data to annual data (and so throw away a lot of potentially valuable information) in order to proceed; and you can even test for unit roots successfully when there are missing observations (Ryan & Giles, 1998). Just keep point #7 in mind!

So, after all of this I guess we're left with the OLS trend line based on the non-missing data: 1996, 2001 to 2009, and 2011. Recall that the estimated intercept and slope coefficients are statistically significant at the 10% level. If this trend line is used to project forward from 2011 then you find that by the year 2019, we'll be asking each other: "Where Have all the Flowers Gone?"
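The arithmetic behind that projection is easy to check. Here's a short Python sketch that uses only the intercept and slope reported above for the n = 11 regression; the variable names are mine, and nothing here goes beyond extrapolating the fitted straight line.

```python
# Coefficients from the OLS trend fitted to the non-missing data (n = 11).
intercept = 491774.00
slope = -243.64

# The fitted line hits zero where intercept + slope * year = 0.
zero_year = -intercept / slope

# Fitted values on either side of the crossing point.
fitted_2018 = intercept + slope * 2018
fitted_2019 = intercept + slope * 2019

print(f"Fitted trend reaches zero at year {zero_year:.2f}")
print(f"Fitted value in 2018: {fitted_2018:.2f}")
print(f"Fitted value in 2019: {fitted_2019:.2f}")
```

The fitted line crosses zero part-way through 2018, so the projected count for 2019 is negative. Needless to say, extrapolating a linear trend this far beyond such a small sample is for entertainment purposes only.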

Note: The links to the following references will be helpful only if your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.

References

Barnett, V. and T. Lewis (1998). Outliers in Statistical Data, 3rd ed. Wiley, New York.

Chow, G. C. and A-L. Lin (1971). Best linear unbiased interpolation, distribution, and extrapolation of time series by related series. Review of Economics and Statistics, 53, 372-375.

Giles, D. E. A. (1986). Missing measurements and estimator inefficiency in linear regression: a generalization. Journal of Quantitative Economics, 2, 87-91.

Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11, 1-21.

Huber, P. J. and E. M. Ronchetti (2009). Robust Statistics, 2nd ed. Wiley, New York.

Kmenta, J. (1986). Elements of Econometrics, 2nd ed. Macmillan, New York.

Lahiri, W., A. A. Haug and A. Garces-Ozanne (2010). Estimating quarterly GDP data for the South Pacific island nations. Singapore Economic Review, in press.

Rousseeuw, P. J. and A. M. Leroy (2003). Robust Regression and Outlier Detection. Wiley, New York.

Ryan, K. F. and D. E. A. Giles (1998). Testing for unit roots in economic time-series with missing observations. In T. B. Fomby and R. C. Hill (eds.), Advances in Econometrics. JAI Press, Greenwich CT, 203-242.

Santos Silva, J. M. C. and F. N. Cardoso (2001). The Chow-Lin method using dynamic models. Economic Modelling, 18, 269-280.


© 2011, David E. Giles
