Track and Field data
Load the trackfieldrecords.rdata file, which contains the objects runMen and runWomen with national records (from a few years ago…) for several track and field events, from the 100m and 200m up to the marathon.
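A minimal way to load the data, assuming the .rdata file sits in the current working directory:
load("trackfieldrecords.rdata")  # creates runMen and runWomen in the workspace
str(runMen)                      # quick sanity check of the loaded variables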
- Explore the data for both men and women using the pairs() plotting function (you must exclude the Nation variable since it is non-numeric). Which events appear to be most correlated with each other? Check by using the cor() function.
For dataset: runMen
pairs(runMen[,-9], cex = 0.7)
cor(runMen[,-9])
M100 M200 M400 M800 M1500 M5000 M10K M42K
M100 1.000 0.918 0.831 0.744 0.686 0.601 0.615 0.504
M200 0.918 1.000 0.840 0.800 0.767 0.682 0.683 0.585
M400 0.831 0.840 1.000 0.866 0.828 0.770 0.779 0.701
M800 0.744 0.800 0.866 1.000 0.914 0.858 0.864 0.803
M1500 0.686 0.767 0.828 0.914 1.000 0.926 0.933 0.865
M5000 0.601 0.682 0.770 0.858 0.926 1.000 0.974 0.931
M10K 0.615 0.683 0.779 0.864 0.933 0.974 1.000 0.943
M42K 0.504 0.585 0.701 0.803 0.865 0.931 0.943 1.000
The men’s 5000m record has the highest correlation (0.974) with the men’s 10000m record.
For dataset: runWomen
pairs(runWomen[,-8], cex = 0.7)
cor(runWomen[,-8])
M100 M200 M400 M800 M1500 M3000 M42K
M100 1.000 0.947 0.808 0.693 0.704 0.724 0.683
M200 0.947 1.000 0.836 0.690 0.671 0.690 0.682
M400 0.808 0.836 1.000 0.887 0.771 0.767 0.710
M800 0.693 0.690 0.887 1.000 0.894 0.856 0.782
M1500 0.704 0.671 0.771 0.894 1.000 0.968 0.879
M3000 0.724 0.690 0.767 0.856 0.968 1.000 0.899
M42K 0.683 0.682 0.710 0.782 0.879 0.899 1.000
The women’s 3000m and 1500m records have the highest correlation (0.968).
- Run the following command and inspect the results:
cor(runWomen[,-8], runMen[,-9])
If you were going to “predict” a nation’s record for women’s M3000, which record among the men would you use (rows in the table are the women’s variables, columns the men’s)?
cor(runWomen[,-8], runMen[,-9])
M100 M200 M400 M800 M1500 M5000 M10K M42K
M100 0.650 0.760 0.777 0.804 0.784 0.718 0.713 0.651
M200 0.716 0.799 0.811 0.817 0.766 0.703 0.703 0.621
M400 0.645 0.715 0.792 0.778 0.770 0.738 0.731 0.691
M800 0.595 0.697 0.740 0.785 0.834 0.817 0.820 0.785
M1500 0.517 0.636 0.676 0.848 0.876 0.858 0.868 0.822
M3000 0.571 0.681 0.684 0.856 0.882 0.865 0.864 0.814
M42K 0.612 0.701 0.647 0.811 0.830 0.803 0.817 0.762
Men’s M1500 shows the highest correlation (0.882) with women’s M3000 and appears to be the best single predictor, but to really check this we should run a regression analysis with cross-validation to see which event predicts best, as sketched below.
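A minimal cross-validation sketch, assuming the column names shown in the tables above: leave-one-out RMSE for predicting women’s M3000 from each single men’s event.
# For each men's event, refit a simple regression with one nation held
# out and record the prediction error for that nation.
cv_rmse <- sapply(names(runMen)[1:8], function(ev) {
  errs <- sapply(seq_len(nrow(runMen)), function(i) {
    fit <- lm(runWomen$M3000[-i] ~ runMen[[ev]][-i])
    pred <- coef(fit)[1] + coef(fit)[2] * runMen[[ev]][i]
    runWomen$M3000[i] - pred
  })
  sqrt(mean(errs^2))
})
sort(cv_rmse)  # smallest RMSE = best single predictor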
- Run a PCA (with scale=FALSE) on the men’s data and print out a summary and the weights (loadings) for the two first PCs. How many components are needed to explain at least 99% of the variation in the data? Try to give an interpretation of PC1, and explain why so few components explain so much of the total variability.
pr <- prcomp(runMen[,1:8], scale = FALSE)
summary(pr)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 9.571 0.64469 0.16890 0.05837 0.02504 0.0117
Proportion of Variance 0.995 0.00451 0.00031 0.00004 0.00001 0.0000
Cumulative Proportion 0.995 0.99964 0.99995 0.99999 1.00000 1.0000
PC7 PC8
Standard deviation 0.00553 0.00191
Proportion of Variance 0.00000 0.00000
Cumulative Proportion 1.00000 1.00000
pr
Standard deviations (1, .., p=8):
[1] 9.57139 0.64469 0.16890 0.05837 0.02504 0.01174 0.00553 0.00191
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4 PC5 PC6 PC7
M100 0.000311 0.00379 -0.000446 0.03426 0.0692868 0.185797 -0.387429
M200 0.000656 0.00660 -0.004506 0.06800 0.0994247 0.274399 -0.850073
M400 0.001759 0.01308 -0.005035 0.13044 0.2647439 0.885091 0.355339
M800 0.005393 0.03175 -0.020469 0.38876 0.8602795 -0.326436 0.027660
M1500 0.014222 0.08536 -0.050540 0.90239 -0.4184781 -0.015186 0.015555
M5000 0.078990 0.37121 -0.921125 -0.08648 0.0031867 0.000642 0.000858
M10K 0.180437 0.90259 0.385347 -0.06522 0.0031326 -0.002815 -0.000183
M42K 0.980290 -0.19749 0.004152 0.00345 -0.0000589 0.000652 -0.000359
PC8
M100 0.8996602
M200 -0.4330211
M400 -0.0551806
M800 -0.0018789
M1500 0.0073064
M5000 0.0012362
M10K -0.0009295
M42K 0.0000788
Only one component is needed to explain more than 99% of the variation. From the loadings we see that the weight for M42K (marathon) completely dominates at 0.98, while all other weights are small. PC1 is therefore more or less identical to the marathon variable.
PC1 points in the direction of largest variability, and because of the large numerical scale of the marathon times, this variable completely dominates the PCA. In such cases it may be better to use standardized variables, which is equivalent to running the eigenvalue decomposition on the correlation matrix instead of the covariance matrix of the variables, as the check below illustrates.
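A quick sketch of that equivalence, assuming the objects defined above: the variances of the scaled components equal the eigenvalues of the correlation matrix.
# prcomp with scale = TRUE amounts to PCA on the correlation matrix:
pr_scaled <- prcomp(runMen[, 1:8], scale = TRUE)
ev <- eigen(cor(runMen[, 1:8]))$values
all.equal(pr_scaled$sdev^2, ev)  # TRUE up to numerical precision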
- Re-run the PCA with the option scale=TRUE in prcomp(). How many components are needed to explain 99% of the variation in this case? How much is explained by the two first components? How would you interpret the loadings of PC1, PC2 and PC3?
pr <- prcomp(runMen[,1:8], scale = TRUE)
summary(pr)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.563 0.953 0.4076 0.3615 0.2859 0.26585 0.21933
Proportion of Variance 0.821 0.114 0.0208 0.0163 0.0102 0.00883 0.00601
Cumulative Proportion 0.821 0.935 0.9557 0.9720 0.9822 0.99106 0.99707
PC8
Standard deviation 0.15305
Proportion of Variance 0.00293
Cumulative Proportion 1.00000
pr
Standard deviations (1, .., p=8):
[1] 2.563 0.953 0.408 0.361 0.286 0.266 0.219 0.153
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
M100 0.315 0.5716 0.322 -0.1783 0.269 -0.5781 0.1425 -0.10781
M200 0.336 0.4638 0.372 0.2485 -0.158 0.6498 -0.1218 0.09974
M400 0.356 0.2439 -0.615 -0.5946 -0.236 0.1619 -0.0119 0.00229
M800 0.369 0.0142 -0.496 0.5214 0.541 -0.0318 -0.2233 0.03623
M1500 0.374 -0.1404 -0.105 0.4103 -0.494 -0.1747 0.6048 -0.14352
M5000 0.365 -0.3129 0.194 -0.0463 -0.237 -0.1448 -0.5974 -0.54333
M10K 0.367 -0.3071 0.180 -0.0985 -0.120 -0.2202 -0.1728 0.79734
M42K 0.343 -0.4319 0.228 -0.3176 0.489 0.3413 0.4028 -0.15986
Now we need 6 components to achieve 99% explained variance. Two components explain about 93.5% of the total variance. The loadings for PC1 are almost identical for all variables, hence PC1 is close to the average national record across all distances for each country. PC2 has weights ranging from strongly negative for the marathon to strongly positive for M100; this PC therefore contrasts short versus long runs. PC3 contrasts the middle distances (M400 and M800, and to a lesser extent M1500) against both the sprints and the long distances, so it extracts information about how the middle distances differ from both ends. The loading plots below support this reading.
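A small visual aid, sketched with base graphics on the pr object above: bar plots of the first three loading vectors.
# One bar plot per component; the sign pattern carries the interpretation.
op <- par(mfrow = c(1, 3))
for (k in 1:3) barplot(pr$rotation[, k], main = paste0("PC", k), las = 2)
par(op)  # restore the plotting layout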
- Make a biplot with the first two components. You may use the argument xlabs in biplot() to indicate that Nation should be used to label the scores from each country. Comment on how the nations cluster relative to the loadings.
biplot(pr, xlabs = runMen$Nation, cex = 0.7)
On the first axis (PC1) we observe that Cook’s Islands and West Samoa have the largest scores, so they have on average poor (slow) national records for all distances. At the other end we find USA and others with on average good national records. Along PC2 (vertically), at the bottom we find those with relatively poor times on long distances but relatively good times on short ones (Dom. Repub, Bermuda, Singapore, Malaysia, Thailand and West Samoa), whereas at the top we find countries with poor records on short distances compared to their long-distance records (Costa Rica, North Korea, Cook’s Islands and others). A quick check of the PC1 extremes follows.
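A hedged check of the score extremes along PC1, assuming pr from the scaled PCA above:
pc1 <- setNames(pr$x[, 1], runMen$Nation)
head(sort(pc1), 3)  # most negative scores: best average records
tail(sort(pc1), 3)  # most positive scores: poorest average records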
- Let’s try to predict the 3000m national records for women using the men’s data. First use a least squares approach with the men’s variables directly as predictors for women’s M3000. To accomplish this, use runWomen$M3000 as the response in lm(). Are there any significant variables? What are the \(R^2\) and the adjusted \(R^2\) values?
lmmod <- lm(runWomen$M3000 ~ M100 + M200 + M400 + M800 + M1500 +
M5000 + M10K + M42K, data = runMen)
(lm_sumry <- summary(lmmod))
Call:
lm(formula = runWomen$M3000 ~ M100 + M200 + M400 + M800 + M1500 +
M5000 + M10K + M42K, data = runMen)
Residuals:
Min 1Q Median 3Q Max
-0.7333 -0.2060 -0.0537 0.2496 0.7184
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.09205 1.98465 -3.07 0.0037 **
M100 -23.17895 23.87260 -0.97 0.3370
M200 18.93566 13.76942 1.38 0.1762
M400 -11.69187 5.19182 -2.25 0.0295 *
M800 6.06528 2.25105 2.69 0.0100 *
M1500 1.70738 1.16573 1.46 0.1503
M5000 0.20068 0.28632 0.70 0.4871
M10K 0.08099 0.14783 0.55 0.5866
M42K -0.00202 0.01715 -0.12 0.9070
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.357 on 43 degrees of freedom
Multiple R-squared: 0.84,
Adjusted R-squared: 0.81
F-statistic: 28.2 on 8 and 43 DF, p-value: 1e-14
lm_sumry[c("r.squared", "adj.r.squared")]
$r.squared
[1] 0.84
$adj.r.squared
[1] 0.81
Two variables are significant at the 5% level in this fitted model, M800 and M400. The \(R^2\) values indicate more than 80% explained variance. There is a slight difference between the adjusted and the non-adjusted \(R^2\), indicating that there may be too many variables included in the model.
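For reference, the adjusted value follows directly from the usual formula; a worked check with \(n = 52\) nations and \(p = 8\) predictors:
\[ R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - 0.16 \cdot \frac{51}{43} \approx 0.81. \]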
- Either perform a manual elimination of insignificant variables, or run backward() from the mixlm package to find a reduced model with only significant effects (5% test level). Which variables do you end up having in your model?
red.mod <- backward(lmmod, alpha = 0.05)
Backward elimination, alpha-to-remove: 0.05
Full model: runWomen$M3000 ~ M100 + M200 + M400 + M800 + M1500 + M5000 +
M10K + M42K
Step RSS AIC R2pred Cp F value Pr(>F)
M42K 1 5.49 -101 0.644 7.01 0.01 0.907
M10K 2 5.53 -102 0.687 5.32 0.32 0.576
M100 3 5.64 -104 0.730 4.15 0.86 0.359
M200 4 5.75 -104 0.746 3.03 0.92 0.343
M5000 5 6.20 -103 0.755 4.55 3.67 0.061 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(red.mod)
Call:
lm(formula = runWomen$M3000 ~ M400 + M800 + M1500, data = runMen)
Residuals:
Min 1Q Median 3Q Max
-0.6139 -0.2798 -0.0541 0.2591 0.7221
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.930 1.687 -4.11 0.00015 ***
M400 -11.241 4.286 -2.62 0.01166 *
M800 6.661 2.211 3.01 0.00412 **
M1500 3.556 0.806 4.41 0.000058 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.359 on 48 degrees of freedom
Multiple R-squared: 0.819,
Adjusted R-squared: 0.808
F-statistic: 72.6 on 3 and 48 DF, p-value: <2e-16
In addition to M400 and M800, I find M1500 to be highly significant after removing the other variables. This effect was “masked” in the full model because multicollinearity among the predictors inflated the coefficient variances, as the VIF sketch below shows.
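A minimal sketch of that inflation, computing variance inflation factors by hand (\(\mathrm{VIF}_j = 1/(1 - R^2_j)\), from regressing each predictor on the remaining ones; no extra packages assumed):
X <- runMen[, 1:8]
vifs <- sapply(seq_along(X), function(j) {
  r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
  1 / (1 - r2)
})
names(vifs) <- names(X)
round(vifs, 1)  # values far above 10 signal severe multicollinearity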
- Fit another model using all principal component scores from the men’s data as predictors for women’s M3000. The scores are stored in the principal component object under the element name x. Which components are significant at a 5% test level? Compare the \(R^2\) values with those from the full model using all original variables as predictors.
#You need to transform the scores into a data.frame first
PCdata <- as.data.frame(pr$x)
pcrmod <- lm(runWomen$M3000 ~ PC1 + PC2 + PC3 + PC4 + PC5 +
PC6 + PC7 + PC8, data = PCdata)
summary(pcrmod)
Call:
lm(formula = runWomen$M3000 ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 +
PC7 + PC8, data = PCdata)
Residuals:
Min 1Q Median 3Q Max
-0.7333 -0.2060 -0.0537 0.2496 0.7184
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.49673 0.04956 191.61 < 2e-16 ***
PC1 0.27603 0.01952 14.14 < 2e-16 ***
PC2 -0.17201 0.05249 -3.28 0.00208 **
PC3 0.03614 0.12279 0.29 0.76990
PC4 0.53593 0.13844 3.87 0.00036 ***
PC5 0.00982 0.17503 0.06 0.95551
PC6 0.04261 0.18825 0.23 0.82201
PC7 -0.09547 0.22818 -0.42 0.67772
PC8 0.04242 0.32698 0.13 0.89737
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.357 on 43 degrees of freedom
Multiple R-squared: 0.84,
Adjusted R-squared: 0.81
F-statistic: 28.2 on 8 and 43 DF, p-value: 1e-14
Components 1, 2 and 4 are highly significant; the others are not. The \(R^2\) values are identical to those of the full model on the original variables since we use all available components: the information content in PC1-PC8 is the same as in the eight original variables, as the check below confirms.
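A one-line sketch of that equivalence, assuming the two full models fitted above: all eight PCs span the same column space as the original predictors, so the fitted values coincide.
all.equal(unname(fitted(pcrmod)), unname(fitted(lmmod)))  # TRUE up to numerical precision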
- Perform a model reduction also for the PCR model by excluding principal components until you have only PCs significant at the 5% test level. Compare the estimated regression coefficients between the full and the reduced PCR models. Why do you think the effects don’t change for the components retained in the model?
pcr.red.mod <- backward(pcrmod, alpha = 0.05)
Backward elimination, alpha-to-remove: 0.05
Full model: runWomen$M3000 ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8
Step RSS AIC R2pred Cp F value Pr(>F)
PC5 1 5.49 -101 0.632 7.003 0.00 0.96
PC8 2 5.50 -103 0.683 5.020 0.02 0.90
PC6 3 5.50 -105 0.733 3.071 0.05 0.82
PC3 4 5.51 -107 0.749 1.158 0.09 0.76
PC7 5 5.54 -108 0.763 -0.667 0.19 0.66
summary(pcr.red.mod)
Call:
lm(formula = runWomen$M3000 ~ PC1 + PC2 + PC4, data = PCdata)
Residuals:
Min 1Q Median 3Q Max
-0.7232 -0.2207 -0.0461 0.2415 0.6907
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4967 0.0471 201.67 < 2e-16 ***
PC1 0.2760 0.0186 14.88 < 2e-16 ***
PC2 -0.1720 0.0499 -3.45 0.00118 **
PC4 0.5359 0.1315 4.07 0.00017 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.34 on 48 degrees of freedom
Multiple R-squared: 0.839,
Adjusted R-squared: 0.829
F-statistic: 83.3 on 3 and 48 DF, p-value: <2e-16
The reduced model has only PC1, PC2 and PC4 as predictors, and the \(R^2\) is almost as high as for the full model. The estimated effects are identical because the PCs are orthogonal to each other and explain independent variability; see the check below.
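A short check of the orthogonality argument, assuming PCdata and pcrmod from above: the score columns are uncorrelated, so each coefficient is unaffected by which other components are included.
round(cor(PCdata), 3)  # off-diagonal entries are all 0
coef(lm(runWomen$M3000 ~ PC1, data = PCdata))["PC1"]  # same estimate as in
coef(pcrmod)["PC1"]                                   # the full PCR model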
- For the specially interested: use the estimated effects (the alphas from the lecture) from the reduced PCR model to compute PCR-estimated regression coefficients (the betas from the lecture) for the original variables by back-rotation. Compare the estimated regression coefficients from the PCR approach with the least squares estimates using the original (BUT SCALED) variables.
# PCR-estimated betas using components 1, 2 and 4. The %*% operator
# performs matrix multiplication in R.
PCRest <- pr$rotation[,c(1,2,4)] %*% (coef(pcr.red.mod)[-1])
#Scaling the original variables and re-running the least squares model:
runMen.scaled <- as.data.frame(scale(runMen[,-9]))
lmmod2 <- lm(runWomen$M3000 ~ M100 + M200 + M400 + M800 + M1500 +
M5000 + M10K + M42K, data = runMen.scaled)
#Estimates without intercept.
Estimates <- cbind(coef(lmmod2)[-1],PCRest)
colnames(Estimates) <- c("Least Squares", "PCR")
Estimates
Least Squares PCR
M100 -0.1355 -0.10698
M200 0.2015 0.14607
M400 -0.2789 -0.26247
M800 0.3879 0.37900
M1500 0.2673 0.34725
M5000 0.1622 0.12968
M10K 0.1477 0.10142
M42K -0.0189 -0.00125
For most of the variables the PCR estimates are closer to zero (a shrinkage effect), which induces a bias, but as discussed in the lecture, the PCR model may have smaller variance for the estimates because it avoids the multicollinearity problem. PCR appears to have down-weighted the short and long distances compared to the least squares approach, which seems reasonable.
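In matrix form, the back-rotation above computes
\[ \hat{\boldsymbol\beta}_{PCR} = \mathbf{V}_{(1,2,4)} \, \hat{\boldsymbol\alpha}_{(1,2,4)}, \]
where \(\mathbf{V}_{(1,2,4)}\) holds the loading vectors of the retained components and \(\hat{\boldsymbol\alpha}_{(1,2,4)}\) their estimated effects, so each original (scaled) variable inherits a weighted combination of the retained components’ effects.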