Exam 2017

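library(nlme)
library(tidyverse)  # assumed: supplies %>% and the tibble conversion used for Table 1a
library(mixlm)      # assumed: supplies r(), Anova(), simple.glht() and cld() used below
load("_data/Pear2011.Rdata")
load("_data/Pear.Rdata")
load("_data/Reindeer.Rdata")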

Exercise 1

The quality of fruit is often judged by the index of refraction (REF), which is measured with a refractometer. The REF value is related to the amount of sugar in a liquid solution. In an experiment at Bioforsk at Ås, the REF values of four sorts of pear were monitored. The data for 2011, with 6 replicates for each sort of pear, are given in Table 1a in the Appendix.

Define an ANOVA model suitable for analyzing these data in order to investigate the potential difference between the 4 sorts of pear with respect to the expected REF level. State the model assumptions and parameter restrictions (sum-to-zero) and interpret the model parameters.
Based on Table 1b in the Appendix, answer the following questions:
1. What are the estimated values of all unknown model parameters?
2. Test the significance of “sort” at the 5% level. Interpret the result in light of the topic of the study.
Consider the output from the Tukey test given in Table 1c. If we test with an overall error rate of 5%, which pear sorts differ significantly from each other with respect to their expected REF level? When and why may it be important to use Tukey tests instead of regular pairwise contrasts in ANOVA?
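The sum-to-zero parametrization asked for in Exercise 1 can be checked with a minimal base R sketch. It assumes the REF values are typed in from Table 1a as printed there (rounded), and the names `pear`, `fit`, `mu_hat` and `alpha_hat` are illustrative only; the exam output in Table 1b was produced with different tooling and labels the coefficients differently.

```r
# REF values typed in from Table 1a (6 replicates per sort)
pear <- data.frame(
  Sort = factor(rep(c("KvedeA", "kvedeadams", "KvedeC", "Pyrodwarf"), each = 6),
                levels = c("KvedeA", "kvedeadams", "KvedeC", "Pyrodwarf")),
  REF = c(12.1, 12.1, 12.4, 11.8, 11.6, 12.4,   # KvedeA
          12.4, 11.5, 11.4, 11.4, 10.8, 11.3,   # kvedeadams
          12.7, 12.7, 12.3, 13.8, 12.5, 11.3,   # KvedeC
          11.3, 11.2, 12.1, 11.3, 10.8, 11.4)   # Pyrodwarf
)

# One-way ANOVA with sum-to-zero contrasts: in this balanced design the
# intercept is the grand mean and each coefficient is a sort's deviation from it
fit <- lm(REF ~ Sort, data = pear, contrasts = list(Sort = "contr.sum"))

mu_hat    <- unname(coef(fit)[1])
alpha_hat <- unname(coef(fit)[-1])

# The effect of the last sort is not estimated directly;
# it follows from the restriction that the effects sum to zero
alpha_hat <- c(alpha_hat, -sum(alpha_hat))
names(alpha_hat) <- levels(pear$Sort)

mu_hat          # estimated grand mean
alpha_hat       # estimated sort effects
sigma(fit)^2    # estimated error variance
anova(fit)      # F-test for Sort
```

With these contrasts, the three printed effects in Table 1b correspond to the first three entries of `alpha_hat`, and the fourth is recovered from the restriction.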

Exercise 2

The data in Exercise 1 were in fact part of a larger study that was repeated over three consecutive years, from 2009 to 2011. The scientist analysed the data across all three years, which gave the R output in Table 2 in the Appendix. Write a short report (~1-2 pages) based on the R output in which you explain the model, its assumptions and the hypothesis tests, and draw conclusions about the results of the study.

Exercise 3

A student studied reindeer at two different locations in northern Norway. One site was close to a wind farm near the coast (Kjøllefjord) and the other site was inland on the Finnmark plateau. A random sample of 10 reindeer was drawn at each site, and the student counted the number of steps each reindeer took in a 4-minute period (variable “steps”) as a response variable and as an indicator of stress level. The student also registered the number of calves for each reindeer (0 or 1). The data are displayed in Table 3a in the Appendix.

In Table 3b the results from a Poisson regression are given (with reference-level parametrization). Briefly explain the model that has been assumed for the analysis and give reasons why a Poisson model is a reasonable choice in this case.
Based on the output, find the estimated expected step counts for two reindeer, one with and one without a calf.
Table 3c shows the results of a deviance test comparing two models. Briefly describe the test procedure and use the output to test the hypothesis that reindeer close to the wind farm (coast) have a different expected step count (stress level) than reindeer inland.
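The two calculations in Exercise 3 (back-transforming the log-link coefficients and comparing nested models by deviance) can be sketched in base R. This is a sketch under the assumption that the data are typed in from Table 3a and that `calf` is treated as a factor with 0 as the reference level; the names `reindeer`, `glm1` and `glm2` are illustrative and differ from the objects in the Appendix.

```r
# Step counts typed in from Table 3a
reindeer <- data.frame(
  steps = c(19, 23, 15, 16, 17, 16, 14, 10, 12, 13,
            21, 20, 26, 25, 17, 18, 17, 12, 21, 19),
  calf  = factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
                   0, 0, 0, 0, 0, 0, 1, 1, 1, 1)),
  site  = factor(rep(c("Inland", "Coast"), each = 10))
)

# Poisson regression with log link; calf = 0 is the reference level
glm1 <- glm(steps ~ calf, family = poisson(link = "log"), data = reindeer)
glm2 <- glm(steps ~ calf * site, family = poisson(link = "log"), data = reindeer)

# Expected counts are obtained by back-transforming the linear predictor
b <- unname(coef(glm1))
mu_no_calf <- exp(b[1])
mu_calf    <- exp(b[1] + b[2])

# Deviance test: the drop in residual deviance between the nested models is
# compared with a chi-squared distribution on the difference in residual df
dev_diff <- deviance(glm1) - deviance(glm2)
df_diff  <- df.residual(glm1) - df.residual(glm2)
p_value  <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)

c(no_calf = mu_no_calf, calf = mu_calf)
c(deviance = dev_diff, df = df_diff, p = p_value)
```

The larger model adds both the site main effect and the calf-by-site interaction, so the test has 2 degrees of freedom and addresses whether site matters at all, given calf status.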

Exercise 4

In a beer liking study, 9 women tasted 9 different beer types: 3 types of Ale (A1, A2, A3), 3 types of Strong Lager (SL1, SL2, SL3) and 3 types of Lager (L1, L2, L3). They gave each beer a score between 1 (dislike) and 7 (like); the data are given in Table 4a. A cluster analysis with the single linkage method was conducted on the Euclidean distances between the beer type scores.

Briefly describe the single linkage method and illustrate it by finding the missing number in the bottom part of Table 4b (the distance between SL3 and the cluster L1/L3).
Based on the dendrogram in Figure 4, answer the questions below:
1. Which class of beers (Ale, Strong Lager or Lager) seems to show the highest internal similarity with regard to liking?
2. Which of the nine beers is least similar to the others in liking?
3. If you were to group the nine beers into three clusters based on the single linkage clustering, at approximately what distance (height) would you cut the dendrogram, and which beers would make up the three clusters?
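The single linkage update rule can be illustrated with a short base R sketch. It assumes the scores are typed in from Table 4a; the names `beer`, `dm`, `d_SL3_L1L3` and `hc` are illustrative and not part of the exam material.

```r
# Liking scores typed in from Table 4a (rows = beers, columns = the nine women)
beer <- rbind(
  A1  = c(3, 1, 3, 1, 5, 7, 2, 1, 4),
  A2  = c(6, 5, 7, 5, 3, 5, 4, 5, 6),
  A3  = c(6, 6, 3, 2, 2, 2, 7, 6, 6),
  SL1 = c(6, 6, 4, 3, 2, 5, 5, 1, 6),
  SL2 = c(5, 2, 6, 2, 3, 5, 2, 6, 2),
  SL3 = c(5, 2, 7, 2, 6, 7, 4, 7, 5),
  L1  = c(4, 5, 7, 4, 5, 4, 6, 6, 4),
  L2  = c(5, 6, 5, 5, 5, 3, 5, 5, 6),
  L3  = c(5, 6, 6, 3, 6, 3, 6, 7, 5)
)

d  <- dist(beer)      # Euclidean distances between beers (top of Table 4b)
dm <- as.matrix(d)

# Single linkage: the distance from an object to a cluster is the smallest
# distance from that object to any member of the cluster
d_SL3_L1L3 <- min(dm["SL3", "L1"], dm["SL3", "L3"])
d_SL3_L1L3

# Full single linkage clustering and a three-cluster cut of the tree
hc <- hclust(d, method = "single")
hc$height             # merge heights, in the order the merges happen
cutree(hc, k = 3)     # cluster membership when the tree is cut into 3 groups
```

Comparing `hc$height` with the dendrogram shows between which two merge heights the tree has to be cut to leave exactly three clusters.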

Appendix to exam STAT340

Table 1a: The pear data from 2011

do.call(cbind, split(Pear2011$REF, Pear2011$Sort)) %>% 
  as_tibble()

# A tibble: 6 x 4
  KvedeA kvedeadams KvedeC Pyrodwarf
   <dbl>      <dbl>  <dbl>     <dbl>
1   12.1       12.4   12.7      11.3
2   12.1       11.5   12.7      11.2
3   12.4       11.4   12.3      12.1
4   11.8       11.4   13.8      11.3
5   11.6       10.8   12.5      10.8
6   12.4       11.3   11.3      11.4

Table 1b: R Commander output for exercise 1b

LinearModel.1 <- lm(REF ~ Sort, data = Pear2011)
summary(LinearModel.1)


Call:
lm(formula = REF ~ Sort, data = Pear2011)

Residuals:
   Min     1Q Median     3Q    Max 
-1.250 -0.188 -0.050  0.150  1.250 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        11.858      0.112  106.09   <2e-16 ***
Sort(KvedeA)        0.208      0.194    1.08   0.2947    
Sort(kvedeadams)   -0.392      0.194   -2.02   0.0566 .  
Sort(KvedeC)        0.692      0.194    3.57   0.0019 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 0.548 on 20 degrees of freedom
Multiple R-squared: 0.483,
Adjusted R-squared: 0.405 
F-statistic: 6.23 on 3 and 20 DF,  p-value: 0.00367

Anova(LinearModel.1, type = "III")

Anova Table (Type III tests)

Response: REF
            Sum Sq Df  F value Pr(>F)    
(Intercept)   3375  1 11255.86 <2e-16 ***
Sort             6  3     6.23 0.0037 ** 
Residuals        6 20                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 1c: Tukey test results for exercise 1d.

simple.glht(LinearModel.1, "Sort")


     Simultaneous Confidence Intervals and Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = REF ~ Sort, data = Pear2011)

Quantile = 2.8 
Minimum significant difference = 0.885
95% confidence level
 
Linear Hypotheses:
                      Lower Center  Upper Std.Err t value  P(>t)   
KvedeA-kvedeadams    -0.285  0.600  1.485   0.316    1.90 0.2604   
KvedeA-KvedeC        -1.368 -0.483  0.402   0.316   -1.53 0.4398   
KvedeA-Pyrodwarf     -0.168  0.717  1.602   0.316    2.27 0.1397   
kvedeadams-KvedeC    -1.968 -1.083 -0.198   0.316   -3.43 0.0131 * 
kvedeadams-Pyrodwarf -0.768  0.117  1.002   0.316    0.37 0.9823   
KvedeC-Pyrodwarf      0.315  1.200  2.085   0.316    3.80 0.0057 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

cld(simple.glht(LinearModel.1,'Sort', level = 0.95))

Tukey's HSD
Alpha: 0.05

           Mean G1 G2
KvedeC     12.6  A   
KvedeA     12.1  A  B
kvedeadams 11.5     B
Pyrodwarf  11.3     B

Table 2: R output for exercise 2

Pear$Year <- as.factor(Pear$Year)
LinearModel.2 <- lm(REF ~ Sort * r(Year), data = Pear)
Anova(LinearModel.2, type = "III")

Analysis of variance (unrestricted model)
Response: REF
          Mean Sq Sum Sq Df F value Pr(>F)
Sort         3.10   9.31  3   12.12 0.0059
Year         2.24   4.47  2    8.73 0.0167
Sort:Year    0.26   1.54  6    0.77 0.5956
Residuals    0.33  19.92 60       -      -

            Err.term(s) Err.df  VC(SS)
1 Sort              (3)      6   fixed
2 Year              (3)      6  0.0825
3 Sort:Year         (4)     60 -0.0127
4 Residuals           -      -  0.3321
(VC = variance component)

          Expected mean squares
Sort      (4) + 6 (3) + 18 Q[1]
Year      (4) + 6 (3) + 24 (2) 
Sort:Year (4) + 6 (3)          
Residuals (4)

Table 3a: Data for exercise 3

Reindeer

   steps calf   site
1     19    0 Inland
2     23    0 Inland
3     15    0 Inland
4     16    0 Inland
5     17    1 Inland
6     16    1 Inland
7     14    1 Inland
8     10    1 Inland
9     12    1 Inland
10    13    1 Inland
11    21    0  Coast
12    20    0  Coast
13    26    0  Coast
14    25    0  Coast
15    17    0  Coast
16    18    0  Coast
17    17    1  Coast
18    12    1  Coast
19    21    1  Coast
20    19    1  Coast

Table 3b: Output from R Commander for exercise 3a

GLM.1 <- glm(steps ~ calf, family = poisson(log), data = Reindeer)
GLM.2 <- glm(steps ~ calf * site, family = poisson(log), data = Reindeer)
summary(GLM.1)


Call:
glm(formula = steps ~ calf, family = poisson(log), data = Reindeer)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.399  -0.724  -0.113   0.523   1.433  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.9957     0.0707   42.37   <2e-16 ***
calf(1)      -0.2810     0.1078   -2.61   0.0091 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 20.314  on 19  degrees of freedom
Residual deviance: 13.451  on 18  degrees of freedom
AIC: 111.1

Number of Fisher Scoring iterations: 4

Table 3c: Deviance test for exercise 3c

anova(GLM.1, GLM.2, test = "Chisq")

Analysis of Deviance Table

Model 1: steps ~ calf
Model 2: steps ~ calf * site
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        18       13.4                     
2        16       10.4  2     3.04     0.22

Table 4a: Beer liking data for 9 women

    K1 K2 K3 K4 K5 K6 K7 K8 K9
A1   3  1  3  1  5  7  2  1  4
A2   6  5  7  5  3  5  4  5  6
A3   6  6  3  2  2  2  7  6  6
SL1  6  6  4  3  2  5  5  1  6
SL2  5  2  6  2  3  5  2  6  2
SL3  5  2  7  2  6  7  4  7  5
L1   4  5  7  4  5  4  6  6  4
L2   5  6  5  5  5  3  5  5  6
L3   5  6  6  3  6  3  6  7  5

Table 4b: Euclidean distance matrix before clustering (top), and after step 1 of single linkage (bottom). Note that “L1/L3” denotes the cluster of L1 and L3.

dist(dta)

       A1    A2    A3   SL1   SL2   SL3    L1    L2
A2   9.43                                          
A3  11.09  6.78                                    
SL1  8.06  5.66  6.32                              
SL2  7.21  6.40  8.77  8.54                        
SL3  8.00  6.08  9.22  9.22  5.29                  
L1   9.59  4.36  6.56  7.42  6.32  5.48            
L2   9.70  3.87  5.39  5.92  7.75  7.21  3.74      
L3  10.58  5.39  5.57  7.94  7.48  6.16  2.83  3.46

plot(hclust(dist(dta), method = "single"), 
     main = "Cluster Dendrogram for Solution HClust.12",
     sub = "Method=single, Distance=euclidian",
     xlab = "Observation Number in dataset female.beer")

Figure 4: Dendrogram from single linkage clustering of female beer liking scores