One-Factor ANOVA with R

Data Description

The female cuckoo lays her eggs in other birds' nests. The "foster parents" are usually deceived, probably because the cuckoo's eggs are similar in size to their own.

The data file "cuckoo.txt" at http://ramanujan.math.trinity.edu/ekwessi/misc/cuckoo.txt contains the lengths of cuckoo eggs (in millimeters) found in the nests of three species: Hedge Sparrow, Robin, and Wren.

We would like to use ANOVA to find out whether there is any significant difference between the mean lengths of cuckoo eggs found in the nests of these three species.

The response variable here is "Length", the factor is "Species", and the treatments (or levels) are Hedge Sparrow, Robin, and Wren.

Analysis

> cuckoo <- read.table(file="http://ramanujan.math.trinity.edu/ekwessi/misc/cuckoo.txt", header=TRUE)   # Load the data set into the R workspace
> head(cuckoo)                                                                                           # Show the first 6 rows


       Species Length
1 HedgeSparrow   22.0
2 HedgeSparrow   23.0
3 HedgeSparrow   20.9
4 HedgeSparrow   23.8
5 HedgeSparrow   25.0
6 HedgeSparrow   24.0

> boxplot(Length ~ Species, data=cuckoo, col=c("green","red","yellow"), xlab="Species",
          ylab="Length of cuckoo eggs (mm)", main="Comparative boxplots")  # Boxplots for a first look at the differences between groups



> anova <- aov(Length ~ Species, data=cuckoo)      # Fit the one-factor ANOVA model
> summary(anova)                                   # Summary of the results, with type I sums of squares


            Df Sum Sq Mean Sq F value   Pr(>F)    
Species      2   29.6  14.799   21.73 3.31e-07 ***
Residuals   42   28.6   0.681                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
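
The entries of this table are linked by simple arithmetic. The short R sketch below rebuilds the mean squares, the F value and the p-value from the sums of squares and degrees of freedom alone. The sums of squares are written to three decimals (29.598 and 28.598), as reported by the drop1 output for the same model; the variable names are illustrative only.

```r
# Rebuild the F test of the one-factor ANOVA table from its sums of squares.
ss_species <- 29.598   # between-species sum of squares
ss_resid   <- 28.598   # residual (within-species) sum of squares
df_species <- 2        # number of species minus 1
df_resid   <- 42       # number of eggs minus number of species

ms_species <- ss_species / df_species   # Mean Sq for Species
ms_resid   <- ss_resid / df_resid       # Mean Sq for Residuals
f_value    <- ms_species / ms_resid     # F value

# Upper-tail probability of the F distribution with (2, 42) degrees of freedom
p_value <- pf(f_value, df_species, df_resid, lower.tail = FALSE)

print(round(c(MeanSq_Species = ms_species, MeanSq_Resid = ms_resid, F = f_value), 3))
print(signif(p_value, 3))
```

A large F value means that the variation between the species means is large compared with the variation within each species.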

> drop1(anova, ~., test="F")   # Type III sums of squares, as in SAS and SPSS (identical to type I here, because the model has a single factor)


Single term deletions

Model:
Length ~ Species
        Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>               28.598 -14.399                      
Species  2    29.598 58.196  13.572  21.734 3.314e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
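
The RSS and AIC columns can also be checked by hand. For a linear model, R reports the AIC up to an additive constant as n*log(RSS/n) + 2*p, where n is the number of observations and p the number of estimated mean parameters. The sketch below assumes n = 45, which is inferred from the 42 residual degrees of freedom plus the 3 species means; the helper name aic_lm is hypothetical.

```r
# AIC of a linear model up to an additive constant: n * log(RSS / n) + 2 * p
aic_lm <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 45                     # assumed: 42 residual df + 3 species means

rss_full <- 28.598          # RSS of Length ~ Species (row <none>)
rss_null <- 58.196          # RSS after deleting Species (intercept only)

aic_full <- aic_lm(rss_full, n, p = 3)   # three species means
aic_null <- aic_lm(rss_null, n, p = 1)   # one overall mean

# Deleting Species raises the RSS by exactly the Species sum of squares
ss_species <- rss_null - rss_full

print(round(c(AIC_full = aic_full, AIC_null = aic_null, SumSq_Species = ss_species), 3))
```

The model that keeps Species has the smaller AIC, which agrees with the F test.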

Post-Hoc Analysis

>TukeyHSD(anova)                   # Pairwise Comparison using Tukey Honestly Significant Difference (HSD)


Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Length ~ Species, data = cuckoo)

$Species
                       diff       lwr        upr     p adj
Robin-HedgeSparrow -0.49375 -1.227416  0.2399161 0.2424181
Wren-HedgeSparrow  -1.93000 -2.674991 -1.1850087 0.0000004
Wren-Robin         -1.43625 -2.156755 -0.7157449 0.0000518
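
Each interval above has the form diff +/- q * sqrt(MSE/2 * (1/n_i + 1/n_j)), where q is the 95% quantile of the studentized range for 3 groups and 42 residual degrees of freedom (the Tukey-Kramer form for unequal group sizes). The sketch below recomputes the first interval. The group sizes 14 (HedgeSparrow) and 16 (Robin) are not shown anywhere in the output; they are an assumption, chosen because they are compatible with 45 observations and reproduce the printed interval. The helper name tukey_halfwidth is hypothetical.

```r
# Half-width of a Tukey-Kramer confidence interval for one pair of group means.
tukey_halfwidth <- function(mse, n_i, n_j, n_groups, df_resid, level = 0.95) {
  q <- qtukey(level, nmeans = n_groups, df = df_resid)  # studentized range quantile
  q * sqrt(mse / 2 * (1 / n_i + 1 / n_j))
}

mse <- 28.598 / 42     # residual mean square from the ANOVA table

# ASSUMED group sizes: 14 HedgeSparrow nests and 16 Robin nests
hw <- tukey_halfwidth(mse, n_i = 14, n_j = 16, n_groups = 3, df_resid = 42)

diff_rh <- -0.49375    # Robin minus HedgeSparrow, from the TukeyHSD output
print(c(lwr = diff_rh - hw, upr = diff_rh + hw))
```

An interval that contains 0, as this one does, corresponds to an adjusted p-value above 0.05.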

Interpretation of the results:

1. The boxplots suggest that the typical length of cuckoo eggs differs from one host species' nest to another.
2. The p-value (3.31e-07) of the overall ANOVA is very small, which means that the null hypothesis that the mean lengths of cuckoo eggs are all equal is not plausible.
    The data therefore support the alternative hypothesis that at least two of the mean lengths differ, but the overall test does not indicate which ones.
3. The Tukey HSD pairwise comparisons show a significant difference between the lengths of cuckoo eggs found in the nests of Robin and Wren, and in the nests of Hedge Sparrow and Wren.
    On the other hand, there is no statistically significant difference between the lengths of cuckoo eggs found in the nests of Hedge Sparrow and Robin, which confirms the impression given by the boxplots.

Remarks

1. Other pairwise comparison methods are available, such as Fisher's Least Significant Difference (LSD) method and the Bonferroni method.
2. The Tukey method is well suited here because it is specifically designed for comparing all pairs of means of normal populations while controlling the family-wise error rate.
3. Before relying on these results, it is worth checking whether the ANOVA assumptions are met.
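
As a sketch of the alternatives named in remark 1, base R's pairwise.t.test can report unadjusted pairwise p-values (in the spirit of Fisher's LSD) or Bonferroni-adjusted ones. To keep the example self-contained it runs on a simulated stand-in data frame called sim, whose values are randomly generated and are not the cuckoo measurements; with the real data one would pass cuckoo$Length and cuckoo$Species instead.

```r
# Simulated stand-in data, NOT the cuckoo measurements (illustration only)
set.seed(1)
sim <- data.frame(
  Species = factor(rep(c("HedgeSparrow", "Robin", "Wren"), each = 15)),
  Length  = c(rnorm(15, 23, 1), rnorm(15, 22.5, 1), rnorm(15, 21, 1))
)

# Pairwise t tests with a pooled standard deviation (the default)
lsd  <- pairwise.t.test(sim$Length, sim$Species, p.adjust.method = "none")
bonf <- pairwise.t.test(sim$Length, sim$Species, p.adjust.method = "bonferroni")

print(lsd$p.value)    # unadjusted p-values
print(bonf$p.value)   # each p-value multiplied by 3 comparisons, capped at 1
```

The Bonferroni p-values are never smaller than the unadjusted ones, which is the price paid for controlling the family-wise error rate.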

Checking Assumptions

1. Normality: In practice, it is better to check whether the residuals of the ANOVA are normally distributed.


> res <- residuals(anova)   # Extract the residuals
> qqnorm(res)               # Normal QQ plot
> qqline(res)               # Add the reference line

The plot shows that the normality assumption is not gravely violated, since most points lie close to the line.

Moreover, the Anderson-Darling test does not reject normality (p-value = 0.3375).

> library(nortest)     # ad.test() is provided by the nortest package
> ad.test(res)



Anderson-Darling normality test

data:  res
A = 0.406, p-value = 0.3375

2. Equal variance: This assumption can be checked by looking at the plot of the residuals versus the fitted values.

This assumption appears not to be gravely violated. 
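
The plot of residuals versus fitted values mentioned above can be produced along the following lines. The sketch runs on a simulated stand-in data frame called sim (randomly generated values, not the cuckoo measurements) so that it is self-contained; with the real data, the fitted aov object from the analysis would be used instead. The group standard deviations and Bartlett's test are an optional supplement that is not part of the original analysis.

```r
# Simulated stand-in data, NOT the cuckoo measurements (illustration only)
set.seed(1)
sim <- data.frame(
  Species = factor(rep(c("HedgeSparrow", "Robin", "Wren"), each = 15)),
  Length  = c(rnorm(15, 23, 1), rnorm(15, 22.5, 1), rnorm(15, 21, 1))
)
fit <- aov(Length ~ Species, data = sim)

# Residuals versus fitted values: the vertical spread should look similar
# for each fitted value (one fitted value per species)
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals versus fitted values")
abline(h = 0, lty = 2)

# Optional numerical checks of equal variance
group_sd <- tapply(sim$Length, sim$Species, sd)      # one standard deviation per species
bt <- bartlett.test(Length ~ Species, data = sim)    # H0: all group variances are equal
print(group_sd)
print(bt)
```

Bartlett's test is itself sensitive to departures from normality, so it is best read together with the plot rather than on its own.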

Conclusion:


It is good practice to redo ANOVA procedures using nonparametric approaches when assumptions seem to have been violated (in this case, the Kruskal-Wallis test shows that the pairwise comparison results are preserved).
If similar results are found using nonparametric procedures, then the violations probably had only a minor effect on the overall results of the parametric ANOVA.
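
A nonparametric re-analysis of this kind could be run along the following lines. kruskal.test gives the overall rank-based test, and pairwise.wilcox.test is one possible rank-based follow-up for the pairs; the text does not say which follow-up was used, so that choice and the Holm adjustment are assumptions. The sketch runs on a simulated stand-in data frame called sim (randomly generated values, not the cuckoo measurements); with the real data, cuckoo would take its place.

```r
# Simulated stand-in data, NOT the cuckoo measurements (illustration only)
set.seed(1)
sim <- data.frame(
  Species = factor(rep(c("HedgeSparrow", "Robin", "Wren"), each = 15)),
  Length  = c(rnorm(15, 23, 1), rnorm(15, 22.5, 1), rnorm(15, 21, 1))
)

# Overall test: do the three groups share the same distribution of lengths?
kw <- kruskal.test(Length ~ Species, data = sim)
print(kw)

# Pairwise rank-sum tests with Holm-adjusted p-values (assumed follow-up)
pw <- pairwise.wilcox.test(sim$Length, sim$Species, p.adjust.method = "holm")
print(pw$p.value)
```

If the pairs flagged here match those flagged by TukeyHSD, the parametric conclusions are unlikely to depend much on the normality and equal-variance assumptions.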