We want to test hypotheses about the set of true wine scores. In particular, we wish to test whether there are no differences between the scores of the wines. More formally, we wish to test the null hypothesis $H_0$ that all of the true scores are equal.
Figure 3 shows the beginnings of making such an inference for the data in Figure 2. That these distributions overlap so little shows that we are extremely likely to correctly reject $H_0$ using only four scores for each wine, as long as there is such a large difference between at least one pair of true scores. We would like to do this more formally, however.
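Returning to the sum of the squared deviations of our data, we can decompose it as a sum of the squares of our regression errors and the squared deviations of our model from the data mean. Writing $y_{ij}$ for the $j$th score of wine $i$, $\bar{y}_i$ for the sample mean of wine $i$ (the model's prediction), and $\bar{y}$ for the grand mean of all scores:

$$\sum_{i}\sum_{j} (y_{ij} - \bar{y})^2 = \sum_{i}\sum_{j} (y_{ij} - \bar{y}_i)^2 + \sum_{i}\sum_{j} (\bar{y}_i - \bar{y})^2$$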
Notice that all of these terms are proportional to variances. The one on the left is proportional to the variance of our data about the “grand mean.” The first one on the right is proportional to the variance of our data about the sample mean of each wine (called the “within-group variance”). And the last one is proportional to the variance of our predicted data about the grand mean (called the “between-group variance”). The expectations of the terms on the right are:
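If, for all wines and scores, the error is iid with zero mean and variance $\sigma^2$, then for $K$ wines with $N$ scores each the above become

$$E[\text{within-group term}] = K(N-1)\,\sigma^2, \qquad E[\text{between-group term}] = (K-1)\,\sigma^2 + N\sum_{i=1}^{K}(\mu_i - \bar{\mu})^2,$$

where $\mu_i$ is the true score of wine $i$ and $\bar{\mu}$ is the mean of the true scores.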
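The decomposition and the expected values of its two right-hand terms can be checked numerically. The following is a minimal sketch, not the code behind the figures in this series: the true scores, the noise level `sigma`, and the helper name `sums_of_squares` are hypothetical choices made only for illustration, and the errors are assumed to be iid Gaussian.

```python
import random

def sums_of_squares(groups):
    """Return (total, within-group, between-group) sums of squares.

    Assumes every wine has the same number of scores.
    """
    n = len(groups[0])
    all_scores = [y for g in groups for y in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]
    ss_total = sum((y - grand_mean) ** 2 for y in all_scores)
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)
    return ss_total, ss_within, ss_between

rng = random.Random(0)
true_scores = [5.0, 6.0, 7.0, 9.0]  # hypothetical true scores of 4 wines
sigma = 1.0                         # hypothetical error standard deviation
n_scores = 4                        # scores per wine
trials = 20000

mean_within = 0.0
mean_between = 0.0
max_gap = 0.0  # largest violation of total = within + between
for _ in range(trials):
    groups = [[mu + rng.gauss(0.0, sigma) for _ in range(n_scores)]
              for mu in true_scores]
    ss_total, ss_within, ss_between = sums_of_squares(groups)
    max_gap = max(max_gap, abs(ss_total - (ss_within + ss_between)))
    mean_within += ss_within / trials
    mean_between += ss_between / trials

k = len(true_scores)
mu_bar = sum(true_scores) / k
expected_within = k * (n_scores - 1) * sigma ** 2
expected_between = ((k - 1) * sigma ** 2
                    + n_scores * sum((mu - mu_bar) ** 2 for mu in true_scores))

print("largest decomposition gap:", max_gap)
print("within-group:  simulated", mean_within, "expected", expected_within)
print("between-group: simulated", mean_between, "expected", expected_between)
```

Setting all entries of `true_scores` to the same value removes the last term of the between-group expectation, which is the situation under the null hypothesis.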
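If in addition $H_0$ is in effect, then the true scores are all equal, the between-group term carries no contribution from differences between true scores, and we expect these two terms to be equal once each is divided by its degrees of freedom. Hence, we wish to compute our estimates of these quantities and see if their ratio is close to one.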
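More formally, with $H_0$ in effect and the errors iid Gaussian with zero mean and variance $\sigma^2$, the ratio of the between-group term to the within-group term, each divided by its degrees of freedom, follows an F-distribution. For $K$ wines with $N$ scores each,

$$F = \frac{\text{between-group term}/(K-1)}{\text{within-group term}/\bigl(K(N-1)\bigr)} \sim F_{K-1,\,K(N-1)}.$$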
Hence, we compute the statistic $F$, and see whether the probability of achieving a value as or more extreme in an F-distribution with $K-1$ and $K(N-1)$ degrees of freedom (for $K$ wines with $N$ scores each) exceeds our significance level $\alpha$. If it does not, we reject $H_0$.
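As a rough illustration of this procedure, the sketch below computes the F-statistic for a small table of scores and estimates the tail probability by simulating the statistic under the null hypothesis. The scores are made up for the example (they are not the data of Fig. 2 or Table 1), `f_statistic` is a hypothetical helper name, and `alpha = 0.05` is an assumed significance level. Simulating with standard normal errors suffices because, with Gaussian errors and equal true scores, the statistic does not depend on the common mean or on the error variance.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F-statistic for equal-size groups."""
    k, n = len(groups), len(groups[0])
    group_means = [sum(g) / n for g in groups]
    grand_mean = sum(group_means) / k  # valid because group sizes are equal
    ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    ms_between = ss_between / (k - 1)        # K - 1 degrees of freedom
    ms_within = ss_within / (k * (n - 1))    # K(N - 1) degrees of freedom
    return ms_between / ms_within

# Hypothetical scores: 4 wines (rows), 4 scores each.
scores = [
    [5, 6, 5, 7],
    [6, 7, 8, 6],
    [7, 6, 8, 8],
    [8, 7, 9, 8],
]
observed_f = f_statistic(scores)

# Estimate P(F >= observed_f) under the null hypothesis by simulation.
rng = random.Random(0)
trials = 20000
as_extreme = 0
for _ in range(trials):
    null_scores = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
    if f_statistic(null_scores) >= observed_f:
        as_extreme += 1
p_value_estimate = as_extreme / trials

alpha = 0.05  # assumed significance level
print("F =", observed_f)
print("estimated p-value =", p_value_estimate)
print("reject the null hypothesis:", p_value_estimate < alpha)
```

A simulated tail probability is only as fine-grained as the number of trials allows, so for very large statistics an exact evaluation of the F-distribution is preferable.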
For the results shown in Fig. 2, the F-statistic is 20.36 with 3 degrees of freedom in the numerator and 12 degrees of freedom in the denominator. The probability of seeing a statistic at least that large, given $H_0$ and iid Gaussian errors with zero mean and variance $\sigma^2$, is below $0.001$. We are thus compelled to reject $H_0$. For the results in Table 1, the F-statistic is 9.06, and its $p$-value also falls below our level of statistical significance $\alpha$. We are thus also compelled to reject $H_0$, under the limitations imposed by our measurement model and our assumptions on the errors.
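The tail probability for a reported statistic can be evaluated directly from the F-distribution. The sketch below does so with only the standard library, by integrating the regularized incomplete beta function with Simpson's rule; `f_tail_probability` is a hypothetical helper written for this example, and it assumes both degrees of freedom are at least 2. In practice a library routine such as `scipy.stats.f.sf` would normally be used instead.

```python
import math

def f_tail_probability(f_value, df_num, df_den, steps=2000):
    """P(F >= f_value) for an F-distribution with (df_num, df_den) degrees of freedom.

    Uses P(F >= f) = I_x(df_den / 2, df_num / 2) with
    x = df_den / (df_den + df_num * f), where I_x is the regularized
    incomplete beta function, evaluated here by Simpson's rule.
    Assumes df_num >= 2 and df_den >= 2; `steps` must be even.
    """
    a = df_den / 2.0
    b = df_num / 2.0
    x = df_den / (df_den + df_num * f_value)

    def integrand(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    h = x / steps
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    integral = total * h / 3.0
    beta = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return integral / beta

# The statistic reported for Fig. 2: F = 20.36 with 3 and 12 degrees of freedom.
p_reported = f_tail_probability(20.36, 3, 12)
print("tail probability:", p_reported)
```

The statistic of 9.06 reported for Table 1 could be handled the same way once its degrees of freedom are supplied.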
It is interesting to see whether there are major discrepancies in the result of hypothesis testing when using a measurement model that does not take into consideration that the responses are integers. The figure above compares the $p$-values observed for the resulting F-statistic from many simulations of 4 scores of 4 wines with $H_0$ in effect, and with the errors iid zero-mean Gaussian. We compare the results when we restrict measurements to be integers to those without such a restriction, for several true wine parameters. When the true parameter is an integer, the two appear to be quite in agreement. In other words, our $p$-value does reflect the probability of making a type 1 error, i.e., rejecting $H_0$ when it is actually true. When the true parameter is not an integer, we see that all $p$-values occur more frequently than we expect them to, and so our $p$-value underestimates the probability of making a type 1 error.
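A comparison of this kind can be sketched as follows. This is not the simulation behind the figure: the error standard deviation of 1, the integer range 0 to 10, the true scores 5 and 5.5, and the helper names are all assumptions made for illustration. Instead of collecting full $p$-value histograms, the sketch counts how often the F-statistic exceeds 3.49, the tabulated critical value of an F-distribution with 3 and 12 degrees of freedom at a significance level of 0.05.

```python
import math
import random

def f_statistic(groups):
    """One-way ANOVA F-statistic for equal-size groups."""
    k, n = len(groups), len(groups[0])
    group_means = [sum(g) / n for g in groups]
    grand_mean = sum(group_means) / k
    ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    if ss_within == 0.0:
        # Integer scores can be identical within every wine.
        return math.inf if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))

def rejection_rate(true_score, sigma, integer_scores, trials, rng):
    """Fraction of simulated experiments that reject a true null hypothesis."""
    critical_value = 3.49  # tabulated F(3, 12) critical value at alpha = 0.05
    rejections = 0
    for _ in range(trials):
        groups = []
        for _wine in range(4):
            scores = [true_score + rng.gauss(0.0, sigma) for _score in range(4)]
            if integer_scores:
                # Hypothetical scale: round, then clip to the integers 0..10.
                scores = [float(min(10, max(0, round(s)))) for s in scores]
            groups.append(scores)
        if f_statistic(groups) > critical_value:
            rejections += 1
    return rejections / trials

rng = random.Random(1)
trials = 20000
rate_continuous = rejection_rate(5.0, 1.0, False, trials, rng)
rate_integer_on_grid = rejection_rate(5.0, 1.0, True, trials, rng)
rate_integer_off_grid = rejection_rate(5.5, 1.0, True, trials, rng)

print("continuous scores, true score 5.0:", rate_continuous)
print("integer scores,    true score 5.0:", rate_integer_on_grid)
print("integer scores,    true score 5.5:", rate_integer_off_grid)
```

Without the integer restriction the rejection rate should sit near 0.05; how far the integer-restricted rates drift from that value depends on the assumed noise level and true score.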
In the next, and final, part, I will reveal the fatal flaw.