Barley Data
In barley.rdata
are results from an experiment where the response is yield of barley pr 1000 square meter, and the factors sorts of barley, soil types, types of fertilizers. In addition was the experiment done in two different geographical areas (sites).
- Assume a two factor model including the main effects of
sort
andsoil
and their interaction. State the model and explain all parameters under a sum-to-zero parametrization.
The model can be written as:
\[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}\]
where, \(\epsilon \sim N(0, \sigma^2)\)
The parametrization restrictions for “sum-to-zero”" are:
\[ \begin{aligned} \sum_i{\alpha_i} = 0, & \sum_j{\beta_j} = 0, & \sum_{i}{(\alpha\beta)_{ij}} = 0 \text{ and } \sum_{j}{(\alpha\beta)_{ij}} = 0 \end{aligned} \]
- Fit the model in R and estimate all parameters
options(contrasts = c("contr.sum", "contr.poly"))
mod1 <- lm(Yield ~ sort * soil, data = Barley)
summod1 <- summary(mod1)
summod1$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 380.9 10.1 37.55 1.70e-25
sort(1) -23.6 10.1 -2.33 2.73e-02
soil(1) -43.1 10.1 -4.25 2.14e-04
sort(1):soil(1) 13.8 10.1 1.36 1.84e-01
summod1$sigma ^ 2
[1] 3293
From the model summaries we find the estimated parameters to be:
- The estimated overall mean yield: \(\hat{\mu} = 380.94\)
- The estimated effects of sort 1 and 2: \(\hat{\alpha}_1=\) -23.625, \(\hat{\alpha}_2 = -\hat{\alpha}_1 =\) 23.625
- The estimated effects of soil 1 and 2: \(\hat{\beta}_1=\) -43.125, \(\hat{\beta}_2 = -\hat{\beta}_1 =\) 43.125
- The estimated interaction effects of sort and soil: \(\hat{\alpha\beta}_{11}=\) 13.812, \(\hat{\alpha\beta}_{12}=-\hat{\alpha\beta}_{11}\) -13.812, \(\hat{\alpha\beta}_{21}=-\hat{\alpha\beta}_{11}\) -13.812, \(\hat{\alpha\beta}_{22}=-\hat{\alpha\beta}_{21}=-\hat{\alpha\beta}_{12}=\) 13.812
- The estimated within
sort:soil
levels unexplained variability: \(\hat{\sigma}^2=MSE=\) 3293.21
- Make an interaction plot and try to conclude about the presence of interaction from the plot. How would you explain interaction effect in this example for a person with experience in agriculture, but minimal statistical experience?
plot(allEffects(mod1,confidence.level = .95))
From the plot, it seems that a farmer can expect an increase in yield when using sort 2 instead of sort 1 for both types of soil, and there may be a higher gain of sort 2 over sort 1 for soil-type 2 than for soil type 1. If this is so, there is a so-called interaction effect of sort and soil on yield.
- Perform a hypothesis test for the interaction effect. Conclusion?
Anova(mod1, type = "III")
Anova Table (Type III tests)
Response: Yield
Sum Sq Df F value Pr(>F)
(Intercept) 4643628 1 1410.06 < 2e-16 ***
sort 17861 1 5.42 0.02731 *
soil 59513 1 18.07 0.00021 ***
sort:soil 6105 1 1.85 0.18419
Residuals 92210 28
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction between soil
and sort
is insignificant (high p-value). This result indicates that the slight non-parallel tendency of the lines in the interaction plot may be due to random errors. More data would be needed to verify any interaction effect.
- Fit a four factor model with sort, soil, fert and site as factors and with all possible 2-factor, 3-factor and 4-factor interactions. Use
Anova()
from the car-package with argumenttype="III"
to produce an ANOVA table. Type III means that all factors are tested as if they were the last factor to be added to the model. Are there any higher order significant effects? Compare the R-squared and the adjusted R-squared values. What can you conclude from these?
mod2 <- lm(Yield ~ sort * soil * fert * site, data = Barley)
Anova(mod2, type = "III")
Anova Table (Type III tests)
Response: Yield
Sum Sq Df F value Pr(>F)
(Intercept) 4643628 1 4503.19 < 2e-16 ***
sort 17861 1 17.32 0.00073 ***
soil 59512 1 57.71 0.0000011 ***
fert 2813 1 2.73 0.11812
site 1458 1 1.41 0.25176
sort:soil 6105 1 5.92 0.02707 *
sort:fert 2080 1 2.02 0.17472
soil:fert 33930 1 32.90 0.0000306 ***
sort:site 23220 1 22.52 0.00022 ***
soil:site 595 1 0.58 0.45849
fert:site 105 1 0.10 0.75364
sort:soil:fert 2245 1 2.18 0.15953
sort:soil:site 1405 1 1.36 0.26029
sort:fert:site 7442 1 7.22 0.01622 *
soil:fert:site 12 1 0.01 0.91370
sort:soil:fert:site 406 1 0.39 0.53914
Residuals 16499 16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2)$r.squared
[1] 0.906
summary(mod2)$adj.r.squared
[1] 0.818
There is only one interaction effect significant at 5% levels, namely the
sort:fert:site
interaction. However, the least squares estimator is prone to inflated variance estimates and difficulties in finding significance when the number of variables k
rises compared to the number of observations N
and if variables are highy correlated. To many variables to estimate from a limited number of observations leaves few degrees of freedom to SSE, and tests with low power. Lack of significance may be an over-fitting problem. By removing some of the least significant variables, the problem is reduced, and a reduced model with significant effects may be identified.
The large difference between \(R^2\) and \(R_{adj}^2\) is also an indication of an over-fitted model.
- Fit also a reduced model without site, but all other effects up to 3rd order
interactions. Perform a partial F-test with the
anova()
function to test whether site (and all its interactions with the others) should be excluded from the model (See also Exercise set 3 and Ex-3).
mod3 <- lm(Yield ~ sort * soil * fert, data = Barley)
anova(mod2, mod3)
Analysis of Variance Table
Model 1: Yield ~ sort * soil * fert * site
Model 2: Yield ~ sort * soil * fert
Res.Df RSS Df Sum of Sq F Pr(>F)
1 16 16499
2 24 51142 -8 -34644 4.2 0.0071 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova
result for partial F-test shows that site is needed to a certain degree in the model. The low p-value reject the hypothesis that there is no effect of site
on Yield
.
- In the
mixlm
-package there are convenient functions for performing automatic model selection by eitherbackward()
elimination of factors from a "full-model, byforward()
addition of factors from a minimal model with only intercept, or a combinedstepwise()
function which combines both forward and backward addition/elimination.
For the backward function the least significant factor is removed from the model in each step if the p-value is larger than testlevel alpha
. If all factors are significant at any step, the procedure stops. The elimination obeys the so-called marginality principle (hierarchy of factors) which states that any lower order effect or interaction should not be removed from the model if it is part of a higher order significant interaction. This is a good principle for practical data analysis.
Try to run the backward()
function on the model object you created above with all four factors and interactions. Use alpha=0.05
as test level. Use the Step-information from the output to explain which factors being excluded at each step. What is the final reduced model?
mod4 <- backward(mod2, alpha = 0.05)
Backward elimination, alpha-to-remove: 0.05
Full model: Yield ~ sort * soil * fert * site
<environment: 0x7fac315e75c8>
Step RSS AIC R2pred Cp F value Pr(>F)
sort:soil:fert:site 1 16905 231 0.659 14.4 0.39 0.54
soil:fert:site 2 16918 229 0.696 12.4 0.01 0.91
sort:soil:site 3 18322 229 0.704 11.8 1.49 0.24
soil:site 4 18917 228 0.724 10.3 0.62 0.44
sort:soil:fert 5 21162 230 0.720 10.5 2.37 0.14
In the first step, the 4th order interaction is removed. In 2nd, 3rd and 4th step some 3rd order and a 2nd order interactions are removed. After removing interaction between sort
,soil
and site
in the 5th step, the resulting model is,
Yield ~ sort + soil + fert + site + sort:soil + sort:fert + soil:fert +
sort:site + fert:site + sort:fert:site
- Fit the reduced model and perform a model check using residual analysis.
The backward
function has fitted the reduced model (her named mod4
):
mod4 <- lm(Yield ~ sort + soil + fert + site + sort:soil +
sort:fert + soil:fert + sort:site + fert:site + sort:fert:site,
data = Barley)
par(mfrow = c(2,2))
plot(mod4)
The residual plots give no evidence of any problems with the model assumptions.
- Make an interaction effects plot using the following code:
eff <- allEffects(mod4, confidence.level = .95)
plot(eff[3])
eff <- allEffects(mod4, confidence.level = .95)
plot(eff[3])
Try to explain the plot. Why do you think the interaction between sort, fertilizer and site was significant?
The plot shows that, when changing sort from 1 to 2, a farmer can expect different change in average Yield for site:1
and site:2
and the difference is not same when using fertilizer:1
and fertilizer:2
.