Exercises and Diet

The data and some examples are found at statistics.ats.ucla.edu

The data called exer, consists of people who were randomly assigned to two different diets: low-fat and not low-fat and three different types of exercise: at rest, walking leisurely and running. Their pulse rate was measured at three different time points during their assigned exercise: at 1 minute, 15 minutes and 30 minutes.

Upon downloading the data, you can load and plot the data by

plot(exer, outer = TRUE, aspect = "fill", key = NULL)

You will see six plots with heading’s like 2:1 which means diet=2 and exercises type=1. In each plot we have three repeated measurements for each of 5 people.

In these exercises you will be asked to answer some questions. Use the R-outputs to the extent your find it necessary.

Below is a output from a linear model fitted to the data. State the model that has been fitted and the assumptions made. Which parametrization is used?

options(contrasts = c("contr.sum", "contr.poly"))
mod1 <- lm(pulse ~ (time + diet + exertype) ^ 2, data = exer)
summary(mod1)


Call:
lm(formula = pulse ~ (time + diet + exertype)^2, data = exer)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.233  -5.088   0.683   4.913  26.933 

Coefficients:
                    Estimate Std. Error t value   Pr(>|t|)    
(Intercept)           88.400      2.100   42.09    < 2e-16 ***
time                   5.650      0.972    5.81 0.00000012 ***
diet(1)               -0.178      2.100   -0.08    0.93276    
exertype(1)            1.233      2.970    0.42    0.67911    
exertype(2)            4.000      2.970    1.35    0.18192    
time:diet(1)          -1.783      0.972   -1.83    0.07036 .  
time:exertype(1)      -5.050      1.375   -3.67    0.00043 ***
time:exertype(2)      -4.250      1.375   -3.09    0.00275 ** 
diet(1):exertype(1)    2.244      1.123    2.00    0.04899 *  
diet(1):exertype(2)    2.011      1.123    1.79    0.07703 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 7.53 on 80 degrees of freedom
Multiple R-squared: 0.769,
Adjusted R-squared: 0.743 
F-statistic: 29.6 on 9 and 80 DF,  p-value: <2e-16

anova(mod1)

Analysis of Variance Table

Response: pulse
              Df Sum Sq Mean Sq F value      Pr(>F)    
time           1   1915    1915   33.77 0.000000121 ***
diet           1   1262    1262   22.25 0.000010015 ***
exertype       2   8326    4163   73.39     < 2e-16 ***
time:diet      1    191     191    3.36      0.0704 .  
time:exertype  2   2601    1301   22.93 0.000000013 ***
diet:exertype  2    816     408    7.19      0.0013 ** 
Residuals     80   4538      57                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

<ANS/>

We can write the model in this way:

$\begin{aligned} {p u l s e}_{i j k l} = μ + & β {t i m e}_{l} + {d i e t}_{j} + {e x e r t y p e}_{k} + β {t i m e}_{l} \cdot {e x e r t y p e}_{k} + \\ β {t i m e}_{l} \cdot {d i e t}_{j l} + {d i e t}_{j} \cdot {e x e r t y p e}_{k} + ϵ_{i j k l} \end{aligned}$

Where we assume that the errors are identically and independently distributed (iid) as:

$ϵ_{i j k l} \sim N (0, σ^{2})$ for all $j = 1, 2$ (For Diet); $k = 1, 2, 3$ (For Exercise Type) and $l = 1, 2, 3$ (For individual time points)

In orther words it is a linear model with main effects of time, diet and exercise type, plus all second-order interactions between these.

Remark: From the ANOVA-table output we may observe that time has only one degree of freedom (thus one parameter is estimated for the effect of time) which means that time is assumed to be a continuous effect. If time alternatively was modelled as a categorical variable, it would have had 2 degrees of freedom in the ANOVA table (the number of time-levels minus one).

Further, a sum-to-zero parametrization is used for all categorical effects, for instance, we assume that the two diet effects sum to zero, hence the estimate of diet(2) is: diet(2) = -diet(1) = -(-0.1778) = 0.1778

The three exercise type effects likewise, that is: exertype(3) = -(exertype(1) + exertype(2)) = -(1.2333 + 4.0000) = -5.233

For interactions between categorical effects, the effects sum to zero if we keep one variable fixed at one level and sum across the levels of the other. E.g. for diet 1 the interaction effects with exercise types 1, 2 and 3 sum to zero, and so on. That is: we would from the ouput find:

diet(1):exertype(3) = -(diet(1):exertype(1) + diet(1):exertype(2)) = -(2.2444 + 2.0111) = -4.256

However, it also means that for fixed level 3 for exercise type, the diet effects should sum to zero. That is:

diet(2):exertype(3) = - diet(1):exertype(3) = -(-4.256) = 4.256

For interactions between the continuous variable time and a catergorical variable, the effects sum to zero across the levels of the categorical variable if we keep time fixed.

Hence: time:diet(2) = -time:diet(1) = -(-1.783) = 1.783

and time:exertype(3) = -(time:exertype(1) + time:exertype(2)) = -(-5.05 - 4.25)= 9.30

Write up the fitted model for pulse as a function of time for the special case of diet=1 and exertype=1. Do the same for diet=2 and exertype=3.

<ANS/>

The fitted model for diet=1 and exertype=1 is found directly by selecting the estimated coefficients from the output for which diet=1 and exertype=1:

$\begin{aligned} \hat{p u l s e} & = 88.4 + 5.65 t i m e - 0.178 + 1.233 - 1.783 t i m e - 5.05 t i m e + 2.244 \\ = 91.699 - 1.183 t i m e \end{aligned}$

Similarly, using the estimates found above through the sum-to-zero parametrizations, the fitted model for diet=2 and exertype=3 becomes:

$\begin{aligned} \hat{p u l s e} & = 87.6 + 16.733 t i m e \end{aligned}$

What is the estimated noise variance? Give an interpretation of the estimated time effect (5.65) for a person with limited statistics knowledge.

<ANS/>

The estimated noise variance is ${\hat{σ}}^{2}$ = MSE= 56.724. The estimate of time variable which is 5.65 refers to the average change in pulse per unit increase in time, for persons being part of this study.

The anova table gives significant interaction between time and exertype. Use the graphs above to explain why there is an interaction between these two variables.

<ANS/>

Since the interaction term is significant, for each individual, the line representing the pulse measured at three time points are inclined different (not parallel) which also says that the pulse for an individual under different exercise type changes differently over time.

Are there any problems related to the model assumptions for the model fitted in a. ?

<ANS/>

Model 1 is based on the assumption that all error terms (and hence all observations) are independent. Since we know that there are three repeated observations on each person, we should expect that observations made on the same person are dependent. For instance, a person with low puls at time 1 is expected to also have (relatively) low puls at time points2 and 3. The Figure below confirms this tendency, especially for exercise types 1 and 2. We see that the residuals of any given person tend to be either above or below the fitted lines.

Another analysis was performed as shown below. Describe the difference between mod2 and mod1. Are there any extra parameters estimated? Why is mod2 probably a more reasonable model than mod1?

mod2 <- lme(pulse ~ (time + diet + exertype) ^ 2, 
            random = ~1 | id, data = exer)
summary(mod2)

Linear mixed-effects model fit by REML
 Data: exer 
  AIC BIC logLik
  610 639   -293

Random effects:
 Formula: ~1 | id
        (Intercept) Residual
StdDev:        3.79     6.62

Fixed effects: pulse ~ (time + diet + exertype)^2 
                Value Std.Error DF t-value p-value
(Intercept)      88.4     1.971 56    44.8  0.0000
time              5.6     0.854 56     6.6  0.0000
diet1            -0.2     1.971 24    -0.1  0.9289
exertype1         1.2     2.788 24     0.4  0.6621
exertype2         4.0     2.788 24     1.4  0.1642
time:diet1       -1.8     0.854 56    -2.1  0.0414
time:exertype1   -5.1     1.208 56    -4.2  0.0001
time:exertype2   -4.3     1.208 56    -3.5  0.0009
diet1:exertype1   2.2     1.390 24     1.6  0.1193
diet1:exertype2   2.0     1.390 24     1.4  0.1607
 Correlation: 
                (Intr) time   diet1  exrty1 exrty2 tm:dt1 tm:xr1 tm:xr2
time            -0.867                                                 
diet1            0.000  0.000                                          
exertype1        0.000  0.000  0.000                                   
exertype2        0.000  0.000  0.000 -0.500                            
time:diet1       0.000  0.000 -0.867  0.000  0.000                     
time:exertype1   0.000  0.000  0.000 -0.867  0.433  0.000              
time:exertype2   0.000  0.000  0.000  0.433 -0.867  0.000 -0.500       
diet1:exertype1  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
diet1:exertype2  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
                dt1:x1
time                  
diet1                 
exertype1             
exertype2             
time:diet1            
time:exertype1        
time:exertype2        
diet1:exertype1       
diet1:exertype2 -0.500

Standardized Within-Group Residuals:
   Min     Q1    Med     Q3    Max 
-2.576 -0.566 -0.038  0.458  3.475 

Number of Observations: 90
Number of Groups: 30

<ANS/>

Model 2 assumes a random intercept for each individual and the random intercept is assumed to follow $N (0, σ_{α}^{2})$ . So, $σ_{α}$ is an extra parameters to be estimated. From model 2, we can obtain a separate fitted line for each individual, which is more resonable and which take better into account that we may have dependence between observations made on the same person.

For models containing random effects there is a function ranef in the nlme-package which displays the predicted random effects. The first 5 random effects are shown below with “rowname” equal to the id number of a given person. Explain what it means that person with id=4 has a random effect of -3.802.

ranef(mod2)[1:5,,drop = F]

  (Intercept)
4      -3.802
1      -1.653
5       0.992
2       1.157
3       3.306

<ANS/>

This means that person with id=4 has on average 3.802 units lower pulse than the average person, keeping all other variables fixed.

A third model was analysed by:

library(nlme)
mod3 <- lme(pulse ~ (time + diet + exertype) ^ 2, 
            random = ~1 | id, correlation = corAR1(), 
            data = exer)
summary(mod3)

Linear mixed-effects model fit by REML
 Data: exer 
  AIC BIC logLik
  611 642   -293

Random effects:
 Formula: ~1 | id
        (Intercept) Residual
StdDev:        1.45     7.47

Correlation Structure: AR(1)
 Formula: ~1 | id 
 Parameter estimate(s):
  Phi 
0.254 
Fixed effects: pulse ~ (time + diet + exertype)^2 
                Value Std.Error DF t-value p-value
(Intercept)      88.2     2.099 56    42.0  0.0000
time              5.6     0.933 56     6.1  0.0000
diet1            -0.2     2.099 24    -0.1  0.9259
exertype1         1.4     2.968 24     0.5  0.6421
exertype2         4.0     2.968 24     1.4  0.1861
time:diet1       -1.8     0.933 56    -1.9  0.0611
time:exertype1   -5.1     1.320 56    -3.8  0.0003
time:exertype2   -4.2     1.320 56    -3.2  0.0021
diet1:exertype1   2.2     1.357 24     1.7  0.1111
diet1:exertype2   2.1     1.357 24     1.5  0.1430
 Correlation: 
                (Intr) time   diet1  exrty1 exrty2 tm:dt1 tm:xr1 tm:xr2
time            -0.889                                                 
diet1            0.000  0.000                                          
exertype1        0.000  0.000  0.000                                   
exertype2        0.000  0.000  0.000 -0.500                            
time:diet1       0.000  0.000 -0.889  0.000  0.000                     
time:exertype1   0.000  0.000  0.000 -0.889  0.445  0.000              
time:exertype2   0.000  0.000  0.000  0.445 -0.889  0.000 -0.500       
diet1:exertype1  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
diet1:exertype2  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
                dt1:x1
time                  
diet1                 
exertype1             
exertype2             
time:diet1            
time:exertype1        
time:exertype2        
diet1:exertype1       
diet1:exertype2 -0.500

Standardized Within-Group Residuals:
   Min     Q1    Med     Q3    Max 
-2.470 -0.656  0.108  0.634  3.595 

Number of Observations: 90
Number of Groups: 30

How does mod 3 differ from mod2 in its model assumptions? Are there any extra parameters estimated?

<ANS/>

Apart from all the assumption in Model 2, Model 3 also assumes a first order autocorrelation between the error terms. In simple words, observation far from each other tend to be less correlated.

Which of mod2 and mod3 is to prefer. Explain why based on the output below.

anova(mod2, mod3)

     Model df AIC BIC logLik   Test L.Ratio p-value
mod2     1 12 610 639   -293                       
mod3     2 13 611 642   -293 1 vs 2   0.702   0.402

<ANS/>

The anova() function used here returns the result of a likelihood-ratio test (a chi-square test). The null-hypothesis for the test is that the two models fit the data equally well. The p-value is large, and we retain the null-hypothesis. Hence, the extra refinement introduced in mod3 is not improving the model fit significantly, and we may keep the simpler mod2.