Exam 2017
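library(nlme)
library(mixlm)    # assumed source of r(), Anova(), simple.glht() and cld() used below
library(tibble)   # as_tibble()
library(magrittr) # %>%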
load("_data/Pear2011.Rdata")
load("_data/Pear.Rdata")
load("_data/Reindeer.Rdata")
Exercise 1
The quality of fruit is often judged by its index of refraction (REF), which is measured with a refractometer. The REF value is related to the amount of sugar in a liquid solution. In an experiment at Bioforsk in Ås, the REF values of four sorts of pear were monitored. The data for 2011, containing 6 replicates for each sort of pear, are given in Table 1a in the Appendix.
Define an ANOVA model suitable for analyzing these data in order to investigate the potential difference between the 4 sorts of pear with respect to the expected REF level. State the model assumptions and parameter restrictions (sum-to-zero) and interpret the model parameters.
Based on Table 1b in the Appendix give answers to the following questions:
What are the estimated values of all unknown model parameters?
Test the significance of “sort” at the 5% level. Interpret the result in light of the topic of the study.
Consider the output from the Tukey test given in Table 1c. If we test with an overall error rate of 5%, which pear sorts differ significantly from each other with respect to their expected REF level? When and why may it be important to use Tukey tests instead of ordinary pairwise contrasts in ANOVA?
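As a minimal sketch, the sum-to-zero parametrization asked for in this exercise can be reproduced in base R from the values printed in Table 1a. The object names `pear`, `fit_sum` and `effects_all` are illustrative and not part of the exam; the level order is set by hand to match Table 1b, and `contr.sum` is assumed to correspond to the parametrization R Commander used there.

```r
# REF values typed in from Table 1a (6 replicates per sort)
ref <- c(12.1, 12.1, 12.4, 11.8, 11.6, 12.4,   # KvedeA
         12.4, 11.5, 11.4, 11.4, 10.8, 11.3,   # kvedeadams
         12.7, 12.7, 12.3, 13.8, 12.5, 11.3,   # KvedeC
         11.3, 11.2, 12.1, 11.3, 10.8, 11.4)   # Pyrodwarf
sorts <- c("KvedeA", "kvedeadams", "KvedeC", "Pyrodwarf")
pear  <- data.frame(REF  = ref,
                    Sort = factor(rep(sorts, each = 6), levels = sorts))

# One-way ANOVA with sum-to-zero effects: the intercept is the grand mean
fit_sum <- lm(REF ~ Sort, data = pear, contrasts = list(Sort = "contr.sum"))

# Only three effects are estimated; the fourth is minus the sum of the others
alpha       <- unname(coef(fit_sum)[-1])
effects_all <- setNames(c(alpha, -sum(alpha)), sorts)

# F statistic for the factor Sort
f_stat <- anova(fit_sum)["Sort", "F value"]

round(coef(fit_sum)[1], 3)
round(effects_all, 3)
round(f_stat, 2)
```

The effect of the last level (Pyrodwarf) is not printed in Table 1b; under the sum-to-zero restriction it follows from the other three.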
Exercise 2
The data in Exercise 1 were part of a larger study that was repeated over three consecutive years, from 2009 to 2011. The scientist analyzed the data across all three years, which gave the R output in Table 2 in the Appendix. Write a short report (about 1-2 pages) based on the R output in which you explain the model, its assumptions and the hypothesis tests, and draw conclusions from the results of the study.
Exercise 3
A student studied reindeer at two different locations in northern Norway. One site was close to a wind farm near the coast (Kjøllefjord) and the other site was inland on the Finnmark plateau. A random sample of 10 reindeer was taken at each site, and the student counted the number of steps each reindeer took in a 4-minute period (variable “steps”) as the response variable and as an indicator of stress level. The student also registered the number of calves for each reindeer (0 or 1). The data are displayed in Table 3a in the Appendix.
In Table 3b the results from a Poisson regression are given (with reference level parametrization). Briefly explain the model that has been assumed for the analysis and give reasons why a Poisson model is a reasonable choice in this case.
Based on the output, find the estimated expected step counts for two reindeer, one with and one without a calf.
In Table 3c the results of a deviance test comparing two models are given. Briefly describe the test procedure and use the output to test the hypothesis that reindeer close to the wind farm (coast) have a different expected step count (stress level) than reindeer inland.
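The quantities asked for in this exercise can be checked with a short base-R sketch that rebuilds the data from Table 3a. The object names (`reindeer`, `glm_calf`, `glm_full`, `new_deer`) are illustrative, and `calf` is coded as a factor on the assumption that this matches the `calf(1)` term in Table 3b.

```r
# Data typed in from Table 3a
reindeer <- data.frame(
  steps = c(19, 23, 15, 16, 17, 16, 14, 10, 12, 13,
            21, 20, 26, 25, 17, 18, 17, 12, 21, 19),
  calf  = factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
                   0, 0, 0, 0, 0, 0, 1, 1, 1, 1)),
  site  = factor(rep(c("Inland", "Coast"), each = 10))
)

# Poisson regression with log link: log(E[steps]) = b0 + b1 * calf
glm_calf <- glm(steps ~ calf, family = poisson(link = "log"), data = reindeer)
glm_full <- glm(steps ~ calf * site, family = poisson(link = "log"), data = reindeer)

# Expected step counts on the response scale, i.e. exp(b0) and exp(b0 + b1)
new_deer       <- data.frame(calf = factor(c(0, 1)))
expected_steps <- predict(glm_calf, newdata = new_deer, type = "response")

# Deviance test: the drop in residual deviance is compared with a
# chi-square distribution whose df is the number of extra parameters
dev_diff <- deviance(glm_calf) - deviance(glm_full)
df_diff  <- df.residual(glm_calf) - df.residual(glm_full)
p_value  <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)

round(expected_steps, 1)
round(c(deviance = dev_diff, df = df_diff, p = p_value), 2)
```

With a single categorical predictor, the fitted expected counts coincide with the observed group means of `steps`.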
Exercise 4
In a beer-liking study, 9 women tasted 9 different beer types: 3 types of Ale (A1, A2, A3), 3 types of Strong Lager (SL1, SL2, SL3) and 3 types of Lager (L1, L2, L3). They gave each beer a score between 1 (dislike) and 7 (like), and the data are given in Table 4a. A cluster analysis with the single linkage method was conducted on the Euclidean distances between the beer type scores.
Briefly describe the single linkage method and illustrate it by finding the missing number in the bottom part of Table 4b (the distance between SL3 and the cluster L1/L3).
Based on the dendrogram in Figure 4, answer the questions below:
Which class of beers (Ale, Strong Lager or Lager) seems to show highest internal similarity with regard to liking?
Which of the nine beers is least similar to the others in liking?
If you were to put the nine beers into three clusters based on the single linkage clustering, at approximately what distance (height) would you cut the dendrogram, and which beers would make up the three clusters?
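The single linkage update referred to in the first question of this exercise can be sketched in base R from the scores in Table 4a. Under single linkage, the distance from an observation to a merged cluster is the smallest of its distances to the cluster members. The object names `beer`, `d`, `d_SL3_L1L3` and `hc` are illustrative.

```r
# Liking scores typed in from Table 4a (rows = beers, columns = women K1..K9)
beer <- rbind(
  A1  = c(3, 1, 3, 1, 5, 7, 2, 1, 4),
  A2  = c(6, 5, 7, 5, 3, 5, 4, 5, 6),
  A3  = c(6, 6, 3, 2, 2, 2, 7, 6, 6),
  SL1 = c(6, 6, 4, 3, 2, 5, 5, 1, 6),
  SL2 = c(5, 2, 6, 2, 3, 5, 2, 6, 2),
  SL3 = c(5, 2, 7, 2, 6, 7, 4, 7, 5),
  L1  = c(4, 5, 7, 4, 5, 4, 6, 6, 4),
  L2  = c(5, 6, 5, 5, 5, 3, 5, 5, 6),
  L3  = c(5, 6, 6, 3, 6, 3, 6, 7, 5)
)
colnames(beer) <- paste0("K", 1:9)

# Euclidean distances between beers, as in the top part of Table 4b
d <- as.matrix(dist(beer, method = "euclidean"))

# Single linkage: distance from SL3 to the merged cluster {L1, L3}
# is the minimum of the two original distances
d_SL3_L1L3 <- min(d["SL3", "L1"], d["SL3", "L3"])
round(d_SL3_L1L3, 2)

# Full single linkage clustering; the first merge height is the
# smallest pairwise distance in the matrix
hc <- hclust(dist(beer), method = "single")
round(hc$height[1], 2)
```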
Appendix to exam STAT340
Table 1a: The pear data from 2011
do.call(cbind, split(Pear2011$REF, Pear2011$Sort)) %>%
  as_tibble()
# A tibble: 6 x 4
  KvedeA kvedeadams KvedeC Pyrodwarf
   <dbl>      <dbl>  <dbl>     <dbl>
1   12.1       12.4   12.7      11.3
2   12.1       11.5   12.7      11.2
3   12.4       11.4   12.3      12.1
4   11.8       11.4   13.8      11.3
5   11.6       10.8   12.5      10.8
6   12.4       11.3   11.3      11.4
Table 1b: R Commander output for exercise 1b
LinearModel.1 <- lm(REF ~ Sort, data = Pear2011)
summary(LinearModel.1)
Call:
lm(formula = REF ~ Sort, data = Pear2011)
Residuals:
   Min     1Q Median     3Q    Max
-1.250 -0.188 -0.050  0.150  1.250
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        11.858      0.112  106.09   <2e-16 ***
Sort(KvedeA)        0.208      0.194    1.08   0.2947
Sort(kvedeadams)   -0.392      0.194   -2.02   0.0566 .
Sort(KvedeC)        0.692      0.194    3.57   0.0019 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.548 on 20 degrees of freedom
Multiple R-squared: 0.483, Adjusted R-squared: 0.405
F-statistic: 6.23 on 3 and 20 DF, p-value: 0.00367
Anova(LinearModel.1, type = "III")
Anova Table (Type III tests)
Response: REF
            Sum Sq Df  F value Pr(>F)
(Intercept)   3375  1 11255.86 <2e-16 ***
Sort             6  3     6.23 0.0037 **
Residuals        6 20
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 1c: Tukey test results for exercise 1d.
simple.glht(LinearModel.1, "Sort")
Simultaneous Confidence Intervals and Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = REF ~ Sort, data = Pear2011)
Quantile = 2.8
Minimum significant difference = 0.885
95% confidence level
Linear Hypotheses:
                      Lower Center  Upper Std.Err t value  P(>t)
KvedeA-kvedeadams    -0.285  0.600  1.485   0.316    1.90 0.2604
KvedeA-KvedeC        -1.368 -0.483  0.402   0.316   -1.53 0.4398
KvedeA-Pyrodwarf     -0.168  0.717  1.602   0.316    2.27 0.1397
kvedeadams-KvedeC    -1.968 -1.083 -0.198   0.316   -3.43 0.0131 *
kvedeadams-Pyrodwarf -0.768  0.117  1.002   0.316    0.37 0.9823
KvedeC-Pyrodwarf      0.315  1.200  2.085   0.316    3.80 0.0057 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
cld(simple.glht(LinearModel.1,'Sort', level = 0.95))
Tukey's HSD
Alpha: 0.05
           Mean G1 G2
KvedeC     12.6  A
KvedeA     12.1  A  B
kvedeadams 11.5     B
Pyrodwarf  11.3     B
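The `Quantile` and `Minimum significant difference` lines in Table 1c can be recovered from the residual standard deviation in Table 1b, assuming they are based on the studentized range distribution for 4 groups and 20 residual degrees of freedom. A minimal base-R sketch (all variable names are illustrative):

```r
s      <- 0.548  # residual standard deviation from Table 1b
n_rep  <- 6      # replicates per sort
k      <- 4      # number of sorts
df_res <- 20     # residual degrees of freedom

# Standard error of a difference between two group means
se_diff <- s * sqrt(2 / n_rep)

# Tukey critical value on the t scale: studentized range quantile / sqrt(2)
tukey_q <- qtukey(0.95, nmeans = k, df = df_res) / sqrt(2)

# Smallest absolute mean difference declared significant at overall level 5%
msd <- tukey_q * se_diff

round(c(se_diff = se_diff, quantile = tukey_q, msd = msd), 3)
```

Two sorts are declared different at the overall 5% level when the absolute difference between their estimated means exceeds `msd`, which is equivalent to their simultaneous confidence interval excluding zero.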
Table 2: R output for exercise 2
Pear$Year <- as.factor(Pear$Year)
LinearModel.2 <- lm(REF ~ Sort * r(Year), data = Pear)
Anova(LinearModel.2, type = "III")
Analysis of variance (unrestricted model)
Response: REF
          Mean Sq Sum Sq Df F value Pr(>F)
Sort         3.10   9.31  3   12.12 0.0059
Year         2.24   4.47  2    8.73 0.0167
Sort:Year    0.26   1.54  6    0.77 0.5956
Residuals    0.33  19.92 60       -      -
            Err.term(s) Err.df  VC(SS)
1 Sort              (3)      6   fixed
2 Year              (3)      6  0.0825
3 Sort:Year         (4)     60 -0.0127
4 Residuals           -      -  0.3321
(VC = variance component)
Expected mean squares
Sort      (4) + 6 (3) + 18 Q[1]
Year      (4) + 6 (3) + 24 (2)
Sort:Year (4) + 6 (3)
Residuals (4)
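The F ratios and variance components in Table 2 follow from the expected mean squares: each effect is tested against the term listed under `Err.term(s)`, and each variance component is obtained by equating mean squares to their expectations. The sketch below recomputes them from the rounded sums of squares in the table, so small deviations in the last digit are expected. It assumes that (2), (3) and (4) denote the variance components of Year, Sort:Year and Residuals; all variable names are illustrative.

```r
# Sums of squares and degrees of freedom as printed in Table 2
ss  <- c(Sort = 9.31, Year = 4.47, SortYear = 1.54, Residuals = 19.92)
dfs <- c(Sort = 3,    Year = 2,    SortYear = 6,    Residuals = 60)
ms  <- ss / dfs

# F ratios: Sort and Year are tested against Sort:Year (error term 3),
# Sort:Year against Residuals (error term 4)
f_sort <- ms[["Sort"]]     / ms[["SortYear"]]
f_year <- ms[["Year"]]     / ms[["SortYear"]]
f_int  <- ms[["SortYear"]] / ms[["Residuals"]]
p_sort <- pf(f_sort, dfs[["Sort"]], dfs[["SortYear"]], lower.tail = FALSE)

# Variance components from the expected mean squares:
#   E[MS Sort:Year] = (4) + 6 (3)            => (3) = (MS_int  - MS_res) / 6
#   E[MS Year]      = (4) + 6 (3) + 24 (2)   => (2) = (MS_year - MS_int) / 24
vc_int  <- (ms[["SortYear"]] - ms[["Residuals"]]) / 6
vc_year <- (ms[["Year"]]     - ms[["SortYear"]])  / 24
vc_res  <- ms[["Residuals"]]

round(c(F_Sort = f_sort, F_Year = f_year, F_SortYear = f_int), 2)
round(c(Year = vc_year, SortYear = vc_int, Residuals = vc_res), 4)
```

A negative estimate, as for Sort:Year here, can arise with this moment-based method when the interaction mean square is smaller than the residual mean square.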
Table 3a: Data for exercise 3
Reindeer
   steps calf   site
1     19    0 Inland
2     23    0 Inland
3     15    0 Inland
4     16    0 Inland
5     17    1 Inland
6     16    1 Inland
7     14    1 Inland
8     10    1 Inland
9     12    1 Inland
10    13    1 Inland
11    21    0  Coast
12    20    0  Coast
13    26    0  Coast
14    25    0  Coast
15    17    0  Coast
16    18    0  Coast
17    17    1  Coast
18    12    1  Coast
19    21    1  Coast
20    19    1  Coast
Table 3b: Output from R Commander for exercise 3a
GLM.1 <- glm(steps ~ calf, family = poisson(log), data = Reindeer)
GLM.2 <- glm(steps ~ calf * site, family = poisson(log), data = Reindeer)
summary(GLM.1)
Call:
glm(formula = steps ~ calf, family = poisson(log), data = Reindeer)
Deviance Residuals:
   Min     1Q Median     3Q    Max
-1.399 -0.724 -0.113  0.523  1.433
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   2.9957     0.0707   42.37   <2e-16 ***
calf(1)      -0.2810     0.1078   -2.61   0.0091 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 20.314 on 19 degrees of freedom
Residual deviance: 13.451 on 18 degrees of freedom
AIC: 111.1
Number of Fisher Scoring iterations: 4
Table 3c: Deviance test for exercise 3c
anova(GLM.1, GLM.2, test = "Chisq")
Analysis of Deviance Table
Model 1: steps ~ calf
Model 2: steps ~ calf * site
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        18       13.4
2        16       10.4  2     3.04     0.22
Table 4a: Beer liking data for 9 women
    K1 K2 K3 K4 K5 K6 K7 K8 K9
A1   3  1  3  1  5  7  2  1  4
A2   6  5  7  5  3  5  4  5  6
A3   6  6  3  2  2  2  7  6  6
SL1  6  6  4  3  2  5  5  1  6
SL2  5  2  6  2  3  5  2  6  2
SL3  5  2  7  2  6  7  4  7  5
L1   4  5  7  4  5  4  6  6  4
L2   5  6  5  5  5  3  5  5  6
L3   5  6  6  3  6  3  6  7  5
Table 4b: Euclidean distance matrix before clustering (top), and after step 1 of single linkage (bottom). Note that “L1/L3” denotes the cluster of L1 and L3.
dist(dta)
       A1   A2   A3  SL1  SL2  SL3   L1   L2
A2   9.43
A3  11.09 6.78
SL1  8.06 5.66 6.32
SL2  7.21 6.40 8.77 8.54
SL3  8.00 6.08 9.22 9.22 5.29
L1   9.59 4.36 6.56 7.42 6.32 5.48
L2   9.70 3.87 5.39 5.92 7.75 7.21 3.74
L3  10.58 5.39 5.57 7.94 7.48 6.16 2.83 3.46
plot(hclust(dist(dta), method = "single"),
     main = "Cluster Dendrogram for Solution HClust.12",
     sub = "Method=single, Distance=euclidian",
     xlab = "Observation Number in dataset female.beer")