Dataset: mtcars
We will use an old data set from 1974 on gasoline consumption for various cars, which is part of the datasets package in R.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
As you can see, there are multiple variables. If you have the datasets package, you may look at the help file for the data by:
?datasets::mtcars
Ex-1: Model Fitting
- Fit a multiple linear regression model with mpg (Miles/Gallon) as response variable and wt (Weight) and cyl (Number of cylinders) as predictors. Call the model object “cars1”.
cars1 <- lm(mpg ~ cyl + wt, data = mtcars)
- Check the model assumptions by residual analysis.
par(mfrow = c(1,2)) #This creates a layout for plots, one row and two columns
plot(cars1, which = c(1,2)) #Only the two first residual plots
- Give a summary of the results and compute the ANOVA-table.
summary(cars1)
Call:
lm(formula = mpg ~ cyl + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.289 -1.551 -0.468 1.574 6.100
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.686 1.715 23.14 < 2e-16 ***
cyl -1.508 0.415 -3.64 0.00106 **
wt -3.191 0.757 -4.22 0.00022 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 2.57 on 29 degrees of freedom
Multiple R-squared: 0.83,
Adjusted R-squared: 0.819
F-statistic: 70.9 on 2 and 29 DF, p-value: 0.00000000000681
anova(cars1)
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
cyl 1 818 818 124.0 0.0000000000054 ***
wt 1 117 117 17.8 0.00022 ***
Residuals 29 191 7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Give a short report on the results.
Both cyl and wt appear to be highly significant predictors for mpg. The estimated effects are negative, implying that the mileage decreases as both weight and number of cylinders increase, which is a reasonable result. The \(R^2\) is 0.83; hence about 83% of the variability in mileage is explained by the linear relationship with cyl and wt. The residual plot of fitted values versus residuals gives an indication of a non-linear relationship, which may be a result of non-linear dependencies or missing explanatory variable(s). The normal probability plot is more or less OK.
Ex-2: Indicator variable
The am variable is an indicator variable for the transmission system of the cars, 0 = automatic, 1 = manual. Run the following model in R:
cars2 <- lm(mpg ~ cyl + wt*am, data = mtcars)
- Write up the assumed model which has been run here. Also write up the estimated models for automatic and manual transmission, respectively.
The model is:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2\cdot x_3 + \epsilon\]
where \(y\) = mpg, \(x_1\) = cyl, \(x_2\) = wt, \(x_3\) = am and \(\epsilon \sim N(0, \sigma^2)\).
The fitted model from R is:
summary(cars2)
Call:
lm(formula = mpg ~ cyl + wt * am, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.462 -1.491 -0.788 1.396 5.350
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.283 2.796 12.26 0.0000000000015 ***
cyl -1.181 0.380 -3.11 0.0044 **
wt -2.369 0.824 -2.87 0.0078 **
am 11.939 3.845 3.10 0.0044 **
wt:am -4.197 1.312 -3.20 0.0035 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 2.26 on 27 degrees of freedom
Multiple R-squared: 0.877,
Adjusted R-squared: 0.859
F-statistic: 48.1 on 4 and 27 DF, p-value: 0.00000000000664
For automatic transmission (am = \(x_3\) = 0) we have the estimated model:
\[\hat{y} = 34.28 - 1.18x_1 - 2.37x_2\]
For manual transmission (am = \(x_3\) = 1) we have
\[\hat{y} = 34.28 - 1.18x_1 - 2.37x_2 + 11.94\cdot 1 - 4.20x_2\cdot 1\]
\[= 46.22 - 1.18x_1 - 6.57x_2\]
We observe that the negative effect of weight on mileage is larger for manual transmission than for automatic.
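The two estimated equations above can also be extracted directly from the fitted coefficient vector. A small base-R sketch (the object names fit, b, auto and manual are just for illustration):

```r
# Build the per-group equations from the coefficients of the interaction model
fit <- lm(mpg ~ cyl + wt * am, data = mtcars)
b <- coef(fit)
# Automatic (am = 0): only the "main" coefficients apply
auto <- c(intercept = unname(b["(Intercept)"]),
          cyl = unname(b["cyl"]),
          wt  = unname(b["wt"]))
# Manual (am = 1): am shifts the intercept, wt:am shifts the wt slope
manual <- c(intercept = unname(b["(Intercept)"] + b["am"]),
            cyl = unname(b["cyl"]),
            wt  = unname(b["wt"] + b["wt:am"]))
round(auto, 2)    # intercept 34.28, cyl -1.18, wt -2.37
round(manual, 2)  # intercept 46.22, cyl -1.18, wt -6.57
```

This reproduces the two equations written out above without any manual arithmetic.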
Ex-3: Comparing models - Partial F-test
From the p-values we observe that transmission gives a significant addition to the intercept
and to the effect of weight, respectively. These p-values correspond to testing each effect GIVEN that all other variables are included in the model.
Sometimes we would rather test several effects jointly. For instance, should we add both transmission (am) AND the interaction between transmission and weight (wt:am) to the model? This is a joint test of the significance of transmission in the model.
To accomplish this we may compare the fits of cars2 (a full model) with cars1 (a reduced model), since the difference between these models is exactly the transmission effects. This is called a partial F-test (Fisher test), where we test whether the SSE has decreased significantly as we go from the reduced model to the full model. The partial F-test may be run in RStudio by:
anova(cars1, cars2)
Analysis of Variance Table
Model 1: mpg ~ cyl + wt
Model 2: mpg ~ cyl + wt * am
Res.Df RSS Df Sum of Sq F Pr(>F)
1 29 191
2 27 138 2 52.7 5.13 0.013 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- From the output we see that the test statistic is an F-statistic. Do you think there is a significant effect of transmission?
A lengthy answer:
We are here really testing the hypotheses:
\[H_0: \beta_3 = \beta_4 = 0\] versus the alternative that at least one of them is different from zero.
We reject the null-hypothesis at test-level \(\alpha\) if
\[F = \frac{(SSE_\text{red.mod}-SSE_\text{full.mod})/r}{MSE_\text{full.mod}}\] is larger than \(F_{\alpha, r, n-p}\), where \(r\) is the difference in the degrees of freedom for SSE for the two models (here \(r=2\)), and \(n-p\) are the degrees of freedom for SSE of the full model (here \(n-p=27\)).
From the output we have (note RSS = SSE): \[F = \frac{(191.17 - 138.51)/2}{138.51/27} = 5.13\]
We reject at level \(\alpha=0.05\) if this observed F is larger than \(F_{0.05, 2, 27}\). At this point we could look this up in a Fisher-table, or alternatively compute this quantile of the Fisher distribution by:
qf(0.05, 2, 27, lower.tail = FALSE)
[1] 3.35
See ?FDist for the help file of the Fisher (F) distribution.
We reject the null-hypothesis.
Alternatively we reject since the p-value from the output is smaller than 0.05.
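The whole partial F-test can also be computed "by hand" in base R, reproducing anova(cars1, cars2). A sketch (object names fit_red, fit_full, Fobs are just for illustration):

```r
# Partial F-test: compare SSE of the reduced and full models
fit_red  <- lm(mpg ~ cyl + wt, data = mtcars)       # reduced model (cars1)
fit_full <- lm(mpg ~ cyl + wt * am, data = mtcars)  # full model (cars2)
sse_red  <- sum(resid(fit_red)^2)
sse_full <- sum(resid(fit_full)^2)
r        <- df.residual(fit_red) - df.residual(fit_full)  # here r = 2
Fobs     <- ((sse_red - sse_full) / r) / (sse_full / df.residual(fit_full))
pval     <- pf(Fobs, r, df.residual(fit_full), lower.tail = FALSE)
round(c(F = Fobs, p.value = pval), 3)   # F about 5.13, p about 0.013
```

The resulting F and p-value match the anova() output above.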
- Perform a residual analysis of the cars2 model.
par(mfrow = c(1,2))
plot(cars2, which = c(1,2))
The linearity has improved, but maybe there is an increasing variance with increasing fitted value (estimated mileage). The normality looks good.
Ex-4: Influential measurements
- Use the influence.measures() function to compute the Cook’s distances and the leverage (hat) values for all observations according to the cars2 model. Are there any influential observations according to these measures?
summary(influence.measures(cars2))
Potentially influential observations of
lm(formula = mpg ~ cyl + wt * am, data = mtcars) :
dfb.1_ dfb.cyl dfb.wt dfb.am dfb.wt:m dffit cov.r
Lincoln Continental 0.38 0.16 -0.52 -0.28 0.23 -0.59 1.57_*
Fiat 128 0.10 -0.31 0.17 0.25 -0.15 0.91 0.36_*
Ford Pantera L 0.01 -0.02 0.01 0.01 -0.02 -0.04 1.61_*
Maserati Bora -0.03 0.09 -0.05 -0.35 0.49 0.73 1.64_*
cook.d hat
Lincoln Continental 0.07 0.33
Fiat 128 0.13 0.10
Ford Pantera L 0.00 0.25
Maserati Bora 0.11 0.38
Four observations are flagged by R (here due to their covariance ratios, cov.r), but none according to Cook’s distance or leverage (hat).
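The Cook's distances and hat values can also be pulled out directly and compared with common rules of thumb. A sketch, assuming the usual cutoffs \(D_i > 1\) for Cook's distance and roughly hat \(> 3p/n\) as used by R's own flagging (see ?influence.measures for the exact criteria):

```r
# Extract Cook's distances and leverage values for the cars2 model
fit <- lm(mpg ~ cyl + wt * am, data = mtcars)
D <- cooks.distance(fit)
h <- hatvalues(fit)
p <- length(coef(fit))   # number of coefficients, here 5
n <- nrow(mtcars)        # 32 observations
which(D > 1)             # no observation exceeds the Cook's distance cutoff
which(h > 3 * p / n)     # no observation exceeds the 3p/n leverage cutoff
```

Both calls return empty results here, consistent with the summary above.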
Ex-5: Model selection
- Fit a third multiple linear regression model with mpg (Miles/Gallon) as response variable and cyl, disp, hp, drat, wt and qsec as predictor variables. Call the model object “cars3”. Report a summary of the analysis.
cars3 <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
summary(cars3)
Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.968 -1.580 -0.435 1.166 5.527
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.3074 14.6299 1.80 0.0842 .
cyl -0.8186 0.8116 -1.01 0.3228
disp 0.0132 0.0120 1.10 0.2831
hp -0.0179 0.0155 -1.16 0.2585
drat 1.3204 1.4795 0.89 0.3806
wt -4.1908 1.2579 -3.33 0.0027 **
qsec 0.4015 0.5166 0.78 0.4444
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 2.56 on 25 degrees of freedom
Multiple R-squared: 0.855,
Adjusted R-squared: 0.82
F-statistic: 24.5 on 6 and 25 DF, p-value: 0.00000000245
Apparently only wt is significant, but having many variables in a model may inflate the standard errors of the estimates due to correlation between predictors, making it hard to identify the truly significant variables.
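The degree of correlation between predictors can be quantified by variance inflation factors (VIFs). A base-R sketch (not part of the exercise) using the definition \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors:

```r
# Compute VIFs by hand for the six predictors of cars3
X <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")]
vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)  # v ~ all others
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 1)   # values well above roughly 5-10 signal strong collinearity
```

Several of these predictors (e.g. displacement, weight, cylinders) are strongly related, which is exactly why the individual t-tests in cars3 look weak.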
We would like to check various sub-models of this model by combining different variables. Install and load the ‘mixlm’ package. The package contains a function called best.subsets() which can help us find a good model.
Best Subset Output:
best.subsets(cars3)
cyl disp hp drat wt qsec RSS R2 R2adj Cp
1 ( 1 ) * 278 0.753 0.745 14.56
1 ( 2 ) * 308 0.726 0.717 19.15
1 ( 3 ) * 317 0.718 0.709 20.50
1 ( 4 ) * 448 0.602 0.589 40.46
1 ( 5 ) * 604 0.464 0.446 64.30
2 ( 1 ) * * 191 0.830 0.819 3.24
2 ( 2 ) * * 195 0.827 0.815 3.83
2 ( 3 ) * * 195 0.826 0.814 3.89
2 ( 4 ) * * 247 0.781 0.766 11.72
2 ( 5 ) * * 269 0.761 0.744 15.17
3 ( 1 ) * * * 177 0.843 0.826 3.01
3 ( 2 ) * * * 181 0.840 0.822 3.62
3 ( 3 ) * * * 184 0.837 0.820 4.07
3 ( 4 ) * * * 184 0.837 0.819 4.09
3 ( 5 ) * * * 186 0.835 0.817 4.45
4 ( 1 ) * * * * 170 0.849 0.826 4.07
4 ( 2 ) * * * * 174 0.845 0.822 4.63
4 ( 3 ) * * * * 174 0.845 0.822 4.67
4 ( 4 ) * * * * 175 0.844 0.821 4.80
4 ( 5 ) * * * * 176 0.844 0.821 4.86
5 ( 1 ) * * * * * 167 0.851 0.823 5.60
5 ( 2 ) * * * * * 169 0.850 0.821 5.80
5 ( 3 ) * * * * * 170 0.849 0.820 6.02
5 ( 4 ) * * * * * 171 0.848 0.819 6.20
5 ( 5 ) * * * * * 172 0.847 0.818 6.34
6 ( 1 ) * * * * * * 163 0.855 0.820 7.00
The function reports by default the 5 best models for each model size (number of predictors). The model size is given in the first column. Column two is the rank within model size, then comes a column for each variable with a star indicating that a given variable is part of the model. The last columns give the residual sum of squares (RSS or SSE), \(R^2\), \(R^2\)-adjusted and a diagnostic called Mallows's Cp.
- Which sub-model would you say is the best fitting model according to the \(R^2\)-adjusted?
A couple of models are quite similar, but the largest \(R^2\)-adjusted is obtained with a model with predictors cyl, hp and wt. This is also a quite simple model with few predictors. We should always strive for simple models and choose the simpler model in cases where the fit appears to be more or less equal for several models.
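As a quick check, the selected subset model can be refit and its \(R^2\)-adjusted compared with the best.subsets() table:

```r
# Refit the chosen 3-predictor model and confirm its adjusted R^2
fit3 <- lm(mpg ~ cyl + hp + wt, data = mtcars)
round(summary(fit3)$adj.r.squared, 3)   # about 0.826, matching row 3 ( 1 )
```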
Ex-6: Model validation
On Canvas you find a file called “CV.R” containing two functions, CV() and Kfold(). Download this file, open it in RStudio and press the “Source” button at the top right of the script window. This will run the file and create these functions. You can also source the file as,
source('_functions/CV.R')
A fitted model should ideally be validated on a test set of new and un-touched data. We could predict the new samples using our best choice model and evaluate the prediction performance. If the model predicts well, we probably have a good model!
If we don’t have a test set, we may perform Cross-validation. The most common version is the Leave-One-Out Cross-validation, where we successively remove one observation from the data and fit the model to the remaining observations. The fitted model is used to predict the left-out observation. After fitting a model, the left-out observation is put back, and another is left out. In total we then fit \(n\) models, and perform \(n\) predictions.
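The procedure just described can be written as a short base-R loop; the CV() helper from CV.R presumably wraps something like this:

```r
# Minimal sketch of Leave-One-Out CV done "by hand" for the cars2 model
n <- nrow(mtcars)
pred <- numeric(n)
for (i in 1:n) {
  fit_i   <- lm(mpg ~ cyl + wt * am, data = mtcars[-i, ])  # fit without obs i
  pred[i] <- predict(fit_i, newdata = mtcars[i, ])         # predict obs i
}
msep <- mean((mtcars$mpg - pred)^2)  # Mean Square Error of Prediction
round(msep, 2)                       # about 6.09 for this model
```

Because Leave-One-Out CV is deterministic, this loop gives the same predictions every time it is run.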
- Use the CV() function to perform a Leave-One-Out CV using the cars2 model fit by:
res <- CV(cars2)
print(res)
$pred
[1] 22.0 20.1 26.6 19.4 16.4 19.1 16.6 21.3 21.9 19.0 19.1 15.1 15.9 15.9
[15] 13.1 12.8 11.1 26.5 31.0 28.7 24.5 16.6 16.9 15.9 15.4 29.0 27.6 32.0
[29] 16.0 21.1 12.3 23.7
$msep
[1] 6.09
$r.squared.pred
[1] 0.828
The CV() function returns a list with three elements: the predictions, the Mean Square Error of Prediction (MSEP) and the \(R^2\)-predicted.
- Make a plot of the observed mpg versus the cross-validation predictions.
plot(mtcars$mpg, res$pred, xlab = "Observed", ylab = "Predicted")
- Is the best model from the previous exercise, identified by best.subsets(), a better model in terms of prediction error (MSEP)? The MSEP is defined by:
\[\text{MSEP} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^2\]
where \(\hat{y}_{(i)}\) is the prediction of \(y_i\) using a model where observation \(i\) was left out from the model estimation. A small value implies better prediction.
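A tiny numeric illustration of the MSEP formula, using made-up numbers (not from mtcars):

```r
# MSEP is just the average squared difference between observed values
# and their leave-one-out predictions
y    <- c(21.0, 22.8, 19.5)   # hypothetical observed values
yhat <- c(22.0, 21.5, 20.0)   # hypothetical leave-one-out predictions
mean((y - yhat)^2)            # (1.00 + 1.69 + 0.25)/3 = 0.98
```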
cars4 <- lm(mpg ~ cyl + hp + wt, data = mtcars)
CV(cars4)
$pred
[1] 22.97 22.08 26.26 20.91 16.98 20.41 15.65 23.65 23.38 20.03 20.10
[12] 14.97 16.05 16.07 11.02 10.09 8.74 26.32 28.71 27.35 25.83 17.69
[23] 18.09 14.79 15.57 27.70 26.62 27.72 16.58 21.28 12.89 24.61
$msep
[1] 7.19
$r.squared.pred
[1] 0.797
No, this model does not predict better than cars2.
- Leave-One-Out CV is known to have large uncertainty, and using a K-fold CV is an alternative. The data are then divided into K subsets (folds) of approximately equal sizes, and a leave-one-fold-out CV is performed instead. K=10 is often recommended. Here, since n=32, K=8 is better, since this gives subsets of equal sizes. Create K=8 random folds by
myfolds <- Kfold(n = 32, K = 8, random = TRUE)
print(myfolds)
[[1]]
[1] 11 22 14 10
[[2]]
[1] 8 6 4 21
[[3]]
[1] 17 25 9 16
[[4]]
[1] 29 26 5 3
[[5]]
[1] 1 24 19 30
[[6]]
[1] 32 15 28 31
[[7]]
[1] 2 27 20 23
[[8]]
[1] 13 18 7 12
Since the folds are sampled randomly, we get different folds each time we run Kfold().
As you see, myfolds is a list of 8 random subsets of observation numbers. Re-run the validation of cars2 by
CV(cars2, folds = myfolds)
$pred
[1] 22.3 20.1 27.0 19.5 16.2 18.8 16.6 21.7 21.9 19.2 19.2 15.3 16.1 16.1
[15] 13.2 10.8 11.0 26.5 31.2 28.8 23.5 16.8 16.8 16.0 15.3 29.4 26.9 32.4
[29] 16.0 21.3 12.3 23.2
$msep
[1] 5.83
$r.squared.pred
[1] 0.837
Note that the result now will vary if you repeat the creation of random K-folds for the cross-validation. Try to make a new myfolds and run the CV again.
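The run-to-run variation can be seen by repeating the whole K-fold CV several times. A self-contained base-R sketch of the Kfold()/CV() workflow (the function name kfold_msep is just for illustration):

```r
# Repeat a random 8-fold CV of the cars2 model and inspect the MSEP spread
set.seed(123)   # only to make this illustration reproducible
kfold_msep <- function(K = 8) {
  n <- nrow(mtcars)
  folds <- split(sample(n), rep(1:K, length.out = n))  # K random folds of size 4
  pred <- numeric(n)
  for (f in folds) {
    fit <- lm(mpg ~ cyl + wt * am, data = mtcars[-f, ])  # fit without fold f
    pred[f] <- predict(fit, newdata = mtcars[f, ])       # predict fold f
  }
  mean((mtcars$mpg - pred)^2)
}
round(replicate(5, kfold_msep()), 2)   # five different MSEP values
```

Each replicate gives a somewhat different MSEP, which is exactly the variability noted above.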