Tuesday, May 21, 2013

Variance Estimators That Minimize MSE

In this post I'm going to look at alternative estimators for the variance of a population. The following discussion builds on a recent post, and once again it's really directed at students. Well, for the most part.

Actually, some of the results relating to populations that are non-Normal probably won't be familiar to a lot of readers. In fact, I can't think of a reference for where these results have been assembled in this way previously. So, I think there's some novelty here. But we'll get to that in due course.

I can just imagine you smacking your lips in anticipation!
To get things started, let's suppose that we're using simple random sampling to get our n data-points, and that this sample is being drawn from a population that's Normal, with a (finite) mean of μ and a (finite) variance of σ2. Let x* denote the sample average: x* = (1 / n)Σxi, where the range of summation here (and everywhere below) is from 1 to n.

Let's consider the family of estimators of σ2:

                       sk2 = (1 / k)Σ[(xi - x*)2],

where "k" is a positive number. Now, to be very clear, I'm not suggesting that we should necessarily restrict our attention to estimators that happen to be in this family - especially when we move away from the case where the population is Normal. I'll come back to this point towards the end of this post.

Clearly, if k = (n - 1), we just have the usual unbiased estimator for σ2, which for simplicity we'll call s2. If k = n, we have the mean squared deviation of the sample,  sn2 , which is a downward-biased estimator of σ2.
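In case it helps to see this family in action, here's a minimal Python sketch (the function name and the simulated data are just my own illustrative choices):

    import numpy as np

    def s_k_squared(x, k):
        """Divide the centred sum of squares by an arbitrary positive k."""
        x = np.asarray(x, dtype=float)
        return np.sum((x - x.mean()) ** 2) / k

    rng = np.random.default_rng(42)
    x = rng.normal(loc=0.0, scale=1.0, size=20)   # a sample of n = 20 from N(0, 1)
    n = len(x)

    s2  = s_k_squared(x, n - 1)   # k = n - 1: the usual unbiased estimator
    sn2 = s_k_squared(x, n)       # k = n: the (downward-biased) mean squared deviation

    # numpy's own variance function reproduces both special cases
    print(s2, np.var(x, ddof=1))
    print(sn2, np.var(x, ddof=0))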

However, we all know that unbiasedness isn't everything!

Often, we look at our potential estimators and evaluate them in the context of some sort of loss function. If this loss function is quadratic, then the expected loss (or "risk") of an estimator is its Mean Squared Error (MSE). You'll recall that the MSE of an estimator is just the sum of its variance and the square of its bias.

Let's compare the unbiased estimator, s2, and the biased estimator, sn2, in terms of MSE.

Because we're using simple random sampling from a Normal population, we know that the statistic (n - 1)s2 / σ2 follows a Chi-square distribution with (n - 1) degrees of freedom. So, s2 is distributed as [σ2 / (n - 1)]χ2(n - 1). We also know that the mean of a Chi-square random variable equals its degrees of freedom, and its variance is twice those degrees of freedom. So, E[s2] = σ2, and Var.(s2) = 2σ4 / (n - 1).
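If you'd like to convince yourself of those two moments numerically, a quick Monte Carlo check in Python will do it (the sample size, value of σ2, and number of replications below are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(123)
    n, sigma2, reps = 10, 4.0, 200_000

    # `reps` independent samples of size n from a Normal population with variance sigma2
    samples = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(reps, n))
    s2 = samples.var(axis=1, ddof=1)            # the unbiased estimator, sample by sample

    print("E[s^2]  :", s2.mean(), "  (theory:", sigma2, ")")
    print("Var[s^2]:", s2.var(),  "  (theory:", 2 * sigma2**2 / (n - 1), ")")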

The first of these two results also holds if the population is non-Normal, but the second result doesn't hold, as I discussed in this earlier post.

Next, noting that sn2 = (n - 1)s2 / n, it follows that:

      E[sn2] = [(n - 1) / n]σ2 ;
      Bias[sn2] = E[sn2] - σ2 = - (σ2 / n) ;

and
     Var.[sn2] = 2σ4(n - 1) / n2.

So, the MSE of sn2 is given by the expression,

     MSE(sn2) = Var.[sn2] + (Bias[sn2])2 = σ4 (2n - 1) / n2.

Because s2 is unbiased, its MSE is just its variance, so MSE(s2) = 2σ4 / (n - 1). Noting that MSE(sn2) = [(n - 1) / n]MSE(s2) - (σ4 / n2), we see immediately that MSE(sn2) < MSE(s2), for any finite sample size, n. This can be seen in the following chart, drawn for σ = 1. (Of course, the two estimators, and their MSEs, coincide when the sample size is infinitely large.)
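If you'd like to generate such a chart for yourself, a short Python sketch along the following lines (using numpy and matplotlib; the grid of sample sizes is an arbitrary choice) will do the job:

    import numpy as np
    import matplotlib.pyplot as plt

    sigma2 = 1.0
    n = np.arange(2, 31)

    mse_s2  = 2 * sigma2**2 / (n - 1)           # unbiased estimator, divisor (n - 1)
    mse_sn2 = sigma2**2 * (2 * n - 1) / n**2    # divisor-n estimator

    plt.plot(n, mse_s2, marker="o", label="MSE(s^2)")
    plt.plot(n, mse_sn2, marker="o", label="MSE(s_n^2)")
    plt.xlabel("sample size, n")
    plt.ylabel("MSE")
    plt.legend()
    plt.show()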
Although sn2 dominates s2, in terms of having smaller MSE for every value of σ2 and every (finite) sample size, it's still not the best we can do within the family of estimators we're considering! Let's go back to this class of estimators and ask, "what value of k will lead to the estimator with the smallest possible MSE among all members of this class?"

We can answer this question by using a bit of simple calculus. We'll write out the expression for the MSE of sk2, and it will be some function of "k". Then we'll differentiate this function with respect to "k", set the derivative to zero, and then solve for the value of k (say k*). We should then check the sign of the second derivative to make sure that k* actually minimizes the MSE, rather than maximizes it!

So, here goes ........

Notice that we can write a typical member of our family of estimators as

      sk2 = (1 / k)Σ[(xi - x*)2] = [(n - 1) / k]s2 .

So, using the results that E[s2] = σ2, and Var.(s2) = 2σ4 / (n - 1), we get:

    E[sk2] = [(n - 1) / k]σ2 ;
    Bias[sk2] = E[sk2] - σ2 = [(n - 1 - k) / k]σ2;
and
    Var.[sk2] = 2σ4(n - 1) / k2.

The MSE of sk2 is given by the expression,

     M = MSE(sk2) = Var.[sk2] + (Bias[sk2])2 = (σ4 /k2)[2(n - 1) + (n - 1 - k)2].

Differentiating M with respect to "k", and setting this derivative to zero, yields the solution, k* = (n + 1). You can easily check that k* minimizes (not maximizes) M.
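If you'd rather let the computer do the calculus, here's a small sympy check (entirely optional, and not needed for anything that follows) which reproduces k* = (n + 1) and confirms the second-order condition:

    import sympy as sp

    k, n, sigma2 = sp.symbols('k n sigma2', positive=True)

    # MSE of s_k^2 under Normality, written as a function of k
    M = (sigma2**2 / k**2) * (2 * (n - 1) + (n - 1 - k)**2)

    print(sp.solve(sp.diff(M, k), k))                     # [n + 1]
    print(sp.simplify(sp.diff(M, k, 2).subs(k, n + 1)))   # positive, so k* = n + 1 is a minimum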

So, within this family that we've been considering, the minimum MSE (MMSE) estimator of σ2 is the estimator,

      sn+12 = (1 / (n + 1))Σ[(xi - x*)2] .

This is certainly a well-known result. It also extends naturally to the situation where we're estimating the variance of the (Normally distributed) error term in a linear regression model. In that case the MMSE estimator of this variance is (1 / (n - p + 2))Σei2, where ei is the ith OLS residual, and p is the number of coefficients in the model.
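Here's a small numpy sketch of that regression case (the design matrix, coefficient values, and error variance below are purely illustrative choices of mine):

    import numpy as np

    rng = np.random.default_rng(7)
    n, p = 50, 3                                   # n observations, p coefficients (incl. intercept)

    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    beta = np.array([1.0, 2.0, -0.5])
    y = X @ beta + rng.normal(scale=1.5, size=n)   # true error variance = 2.25

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat                           # OLS residuals
    rss = e @ e

    print("divisor (n - p)    :", rss / (n - p))        # the usual unbiased estimator
    print("divisor (n - p + 2):", rss / (n - p + 2))    # the MMSE divisor under Normal errors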

So far, so good!

Now, let's connect with the earlier post that I mentioned above, and see how all of this works out if we have a population that's non-Normal. We'll retain the simple random sampling, though. As was discussed in that post, in general the variance of s2 is given by:

         Var.[s2] = (1 / n)[μ4 - (n - 3)μ22 / (n - 1)]  = V, say.

Here, μ2 and μ4 are the second and fourth central moments of the population distribution. Recall that μ2 is the population variance, and for the result immediately above to hold the first four moments of the distribution must exist.
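A quick simulation check of this variance formula may be reassuring. The sketch below uses an Exponential population (my choice; its central moments also appear later in this post):

    import numpy as np

    rng = np.random.default_rng(2013)
    n, theta, reps = 8, 2.0, 300_000

    # Exponential with mean theta: mu2 = theta^2 and mu4 = 9 * theta^4
    mu2, mu4 = theta**2, 9 * theta**4
    V_theory = (1 / n) * (mu4 - (n - 3) * mu2**2 / (n - 1))

    samples = rng.exponential(scale=theta, size=(reps, n))
    s2 = samples.var(axis=1, ddof=1)

    print("simulated Var[s^2]:", s2.var(), "   theory:", V_theory)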

Let's extend this variance expression to members of the family, sk2. Then we'll work out the expression for the MSE of such estimators for a non-normal population. Finally, this will allow us to derive the MMSE estimator in this family for any population distribution - not just for the Normal population that we dealt with earlier in this post.

Once again, we'll begin by using the fact that we can write:

      sk2 = (1 / k)Σ[(xi - x*)2] = [(n - 1) / k]s2 .

The estimator, s2, is still unbiased for σ2 even in the non-Normal case, so we still have the results:

      E[sk2] = [(n - 1) / k]σ2 ; and  Bias[sk2] = [(n - 1 - k) / k]σ2 =  [(n - 1 - k) / k]μ2

as before.

Also,

       Var.[sk2] = [(n - 1) / k]2 Var.[s2] = [(n - 1) / k]2(1 / n)[μ4 - (n - 3)μ22 / (n - 1)] ,

and so the MSE of sk2 is given by:

         MSE(sk2) =  [(n - 1 - k) / k]2μ22 + V[(n - 1) / k]2,

where V = Var.[s2] is defined above.

Going through the calculus once again, it's easy to show (I used to hate that statement in textbooks) that the value of "k" for which the MSE is minimized is:

        k** = (n - 1)(V + μ22) / μ22,

where V = (1 / n)[μ4 - (n - 3)μ22 / (n - 1)].

Check: For the Normal distribution, μ4 = 3μ22, and so k** = (n + 1) = k*, as before.
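Putting the general result into a small Python function makes that check (and the examples that follow) easy to reproduce; the function name is just my own label:

    def k_star_star(n, mu2, mu4):
        """The minimum-MSE divisor, given the population's second and fourth central moments."""
        V = (1 / n) * (mu4 - (n - 3) * mu2**2 / (n - 1))
        return (n - 1) * (V + mu2**2) / mu2**2

    # Normal check: mu4 = 3 * mu2^2 should give k** = n + 1, whatever the value of mu2
    n, mu2 = 10, 2.5
    print(k_star_star(n, mu2, 3 * mu2**2))   # 11.0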

Having gone to all of this effort, let's finish up by illustrating the optimal k** values for a small selection of other population distributions:

Uniform, continuous on [a , b]

μ2 = (b - a)2 / 12  ;  μ4 = (b - a)4 / 80
k** = (n - 2) + (3 / n) + 9(n - 1) / (5n)

Standard Student's-t, with v degrees of freedom

μ2 = v / (v - 2)  ;  μ4 = 3v2 / [(v - 2)(v - 4)]
k** = (n - 2) + (3 / n) + 3(n - 1)(v - 2) / [n(v - 4)]  ;  for v > 4

χ2, with v degrees of freedom

μ2 = 2v  ;  μ4 = 12v(v + 4)
k** = (n - 2) + (3 / n) + 3(n - 1)(v + 4) / (nv)

Exponential, with mean θ

μ2 = θ2  ;  μ4 = 9θ4  
k** = (n2 + 7n - 6) / n

Poisson, with parameter λ

μ2 = λ  ;  μ4 = λ(3λ + 1)
k** = (n - 2) + (3 / n) + (n - 1)(3λ + 1) / (nλ)
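If you'd like to verify these entries numerically, the following sketch plugs each distribution's μ2 and μ4 into the general formula for k** and compares the result with the closed-form expressions above (the parameter values are arbitrary choices of mine):

    def k_star_star(n, mu2, mu4):
        V = (1 / n) * (mu4 - (n - 3) * mu2**2 / (n - 1))
        return (n - 1) * (V + mu2**2) / mu2**2

    n = 10

    # Uniform on [a, b]: note that k** is free of a and b
    a, b = 0.0, 4.0
    mu2, mu4 = (b - a)**2 / 12, (b - a)**4 / 80
    print(k_star_star(n, mu2, mu4), (n - 2) + 3/n + 9*(n - 1) / (5*n))

    # Student's t with v degrees of freedom (v > 4)
    v = 6.0
    mu2, mu4 = v / (v - 2), 3*v**2 / ((v - 2)*(v - 4))
    print(k_star_star(n, mu2, mu4), (n - 2) + 3/n + 3*(n - 1)*(v - 2) / (n*(v - 4)))

    # Chi-square with v degrees of freedom
    v = 5.0
    mu2, mu4 = 2*v, 12*v*(v + 4)
    print(k_star_star(n, mu2, mu4), (n - 2) + 3/n + 3*(n - 1)*(v + 4) / (n*v))

    # Exponential with mean theta
    theta = 2.0
    mu2, mu4 = theta**2, 9*theta**4
    print(k_star_star(n, mu2, mu4), (n**2 + 7*n - 6) / n)

    # Poisson with parameter lam
    lam = 3.0
    mu2, mu4 = lam, lam*(3*lam + 1)
    print(k_star_star(n, mu2, mu4), (n - 2) + 3/n + (n - 1)*(3*lam + 1) / (n*lam))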

Now, a final word of caution.

Yes, setting k = k** in the case of each of these non-Normal populations, and then estimating the variance by using the statistic sk2 = (1 / k)Σ[(xi - x*)2], will ensure that your "estimator" is MMSE within this particular family of estimators. However, this doesn't mean that it's the "best", or even a feasible, estimator to use.

For instance, consider the last example where the population is Poisson. In that case, the population mean and variance are both  λ. The MLE for  λ is the sample average, x*. Therefore, x* is also the MLE for the population variance. Actually, x* is the "minimum variance unbiased" (MVUE) estimator of λ. The statistic s2 is also an unbiased estimator of λ, but it is inefficient relative to x*.  So x* dominates s2 in terms of MSE. See here for a nice discussion.
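A small simulation makes the point. The sketch below (with arbitrary choices of n and λ) compares the MSEs of x*, s2, and the "MMSE" member of our family; notice that computing k** here requires the true λ:

    import numpy as np

    rng = np.random.default_rng(99)
    n, lam, reps = 15, 3.0, 200_000

    samples = rng.poisson(lam, size=(reps, n))
    xbar = samples.mean(axis=1)                 # the MLE / MVUE of lambda
    s2 = samples.var(axis=1, ddof=1)            # unbiased, but inefficient, for lambda

    # the MMSE divisor needs the true lambda -- exactly the thing we're trying to estimate
    k2 = (n - 2) + 3/n + (n - 1)*(3*lam + 1) / (n*lam)
    sk2 = samples.var(axis=1, ddof=0) * n / k2  # sum of squared deviations, divided by k**

    for name, est in [("x*", xbar), ("s^2", s2), ("s_k**^2 (infeasible)", sk2)]:
        print(name, "MSE:", np.mean((est - lam)**2))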

If we were to try to implement our MMSE estimator of the variance in this case, we'd be trying to estimate λ. However, k** is a function of λ. We'd need to know the population variance in order to obtain the MMSE estimator of that parameter! You can see that the same issue applies to the Student's-t and χ2 examples given above, but it's not an issue with the other two examples.

Sometimes, MMSE estimators simply aren't "feasible". They're functions of the unknown parameters we're trying to estimate. They're not really estimators at all!



© 2013, David E. Giles
