Track and Field data
Load the trackfieldrecords.rdata file, which contains the objects runMen and runWomen with national records (from a few years ago…) for several track and field events, from the 100m and 200m up to the marathon.
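A minimal way to load the data, assuming the .rdata file sits in the current working directory:
load("trackfieldrecords.rdata")  # creates runMen and runWomen in the workspace
str(runMen)                      # quick sanity check of the loaded variables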
- Explore the data for both men and women using the pairs() plotting function (you must exclude the Nation variable since it is non-numeric). Which events appear to be most correlated with each other? Check by using the cor() function.
For dataset: runMen
pairs(runMen[,-9], cex = 0.7)
cor(runMen[,-9])
M100 M200 M400 M800 M1500 M5000 M10K M42K
M100 1.000 0.918 0.831 0.744 0.686 0.601 0.615 0.504
M200 0.918 1.000 0.840 0.800 0.767 0.682 0.683 0.585
M400 0.831 0.840 1.000 0.866 0.828 0.770 0.779 0.701
M800 0.744 0.800 0.866 1.000 0.914 0.858 0.864 0.803
M1500 0.686 0.767 0.828 0.914 1.000 0.926 0.933 0.865
M5000 0.601 0.682 0.770 0.858 0.926 1.000 0.974 0.931
M10K 0.615 0.683 0.779 0.864 0.933 0.974 1.000 0.943
M42K 0.504 0.585 0.701 0.803 0.865 0.931 0.943 1.000
The men’s 5000m record has the highest correlation (0.974) with the men’s 10000m record.
For dataset: runWomen
pairs(runWomen[,-8], cex = 0.7)
cor(runWomen[,-8])
M100 M200 M400 M800 M1500 M3000 M42K
M100 1.000 0.947 0.808 0.693 0.704 0.724 0.683
M200 0.947 1.000 0.836 0.690 0.671 0.690 0.682
M400 0.808 0.836 1.000 0.887 0.771 0.767 0.710
M800 0.693 0.690 0.887 1.000 0.894 0.856 0.782
M1500 0.704 0.671 0.771 0.894 1.000 0.968 0.879
M3000 0.724 0.690 0.767 0.856 0.968 1.000 0.899
M42K 0.683 0.682 0.710 0.782 0.879 0.899 1.000
The women’s 3000m and 1500m records have the highest correlation (0.968).
- Run the following command and inspect the results:
cor(runWomen[,-8], runMen[,-9])
If you were going to “predict” a nation’s record for women’s M3000, which record among the men would you use (rows in the table are the women’s variables, columns the men’s)?
cor(runWomen[,-8], runMen[,-9])
M100 M200 M400 M800 M1500 M5000 M10K M42K
M100 0.650 0.760 0.777 0.804 0.784 0.718 0.713 0.651
M200 0.716 0.799 0.811 0.817 0.766 0.703 0.703 0.621
M400 0.645 0.715 0.792 0.778 0.770 0.738 0.731 0.691
M800 0.595 0.697 0.740 0.785 0.834 0.817 0.820 0.785
M1500 0.517 0.636 0.676 0.848 0.876 0.858 0.868 0.822
M3000 0.571 0.681 0.684 0.856 0.882 0.865 0.864 0.814
M42K 0.612 0.701 0.647 0.811 0.830 0.803 0.817 0.762
Men’s M1500 shows the highest correlation (0.882) with women’s M3000 and appears to be the best single predictor, but to really check this we should run a regression analysis with cross-validation to see which event predicts best, as sketched below.
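A minimal cross-validation sketch, assuming the column names shown in the tables above: leave-one-out RMSE for predicting women’s M3000 from each single men’s event.
# For each men's event, refit a simple regression with one nation held
# out and record the prediction error for that nation.
cv_rmse <- sapply(names(runMen)[1:8], function(ev) {
  errs <- sapply(seq_len(nrow(runMen)), function(i) {
    fit <- lm(runWomen$M3000[-i] ~ runMen[[ev]][-i])
    pred <- coef(fit)[1] + coef(fit)[2] * runMen[[ev]][i]
    runWomen$M3000[i] - pred
  })
  sqrt(mean(errs^2))
})
sort(cv_rmse)  # smallest RMSE = best single predictor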
- Run a PCA (with scale=FALSE) on the men’s data and print out a summary and the weights (loadings) for the two first PCs. How many components are needed to explain at least 99% of the variation in the data? Try to give an interpretation of PC1, and explain why so few components explain so much of the total variability.
pr <- prcomp(runMen[,1:8], scale = FALSE)
summary(pr)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 9.571 0.64469 0.16890 0.05837 0.02504 0.0117
Proportion of Variance 0.995 0.00451 0.00031 0.00004 0.00001 0.0000
Cumulative Proportion 0.995 0.99964 0.99995 0.99999 1.00000 1.0000
PC7 PC8
Standard deviation 0.00553 0.00191
Proportion of Variance 0.00000 0.00000
Cumulative Proportion 1.00000 1.00000
pr
Standard deviations (1, .., p=8):
[1] 9.57139 0.64469 0.16890 0.05837 0.02504 0.01174 0.00553 0.00191
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4 PC5 PC6 PC7
M100 0.000311 0.00379 -0.000446 0.03426 0.0692868 0.185797 -0.387429
M200 0.000656 0.00660 -0.004506 0.06800 0.0994247 0.274399 -0.850073
M400 0.001759 0.01308 -0.005035 0.13044 0.2647439 0.885091 0.355339
M800 0.005393 0.03175 -0.020469 0.38876 0.8602795 -0.326436 0.027660
M1500 0.014222 0.08536 -0.050540 0.90239 -0.4184781 -0.015186 0.015555
M5000 0.078990 0.37121 -0.921125 -0.08648 0.0031867 0.000642 0.000858
M10K 0.180437 0.90259 0.385347 -0.06522 0.0031326 -0.002815 -0.000183
M42K 0.980290 -0.19749 0.004152 0.00345 -0.0000589 0.000652 -0.000359
PC8
M100 0.8996602
M200 -0.4330211
M400 -0.0551806
M800 -0.0018789
M1500 0.0073064
M5000 0.0012362
M10K -0.0009295
M42K 0.0000788
Only one component is needed to explain more than 99% of the variation. From the loadings we see that the weight for M42K (marathon) completely dominates at 0.98, while all other weights are small. PC1 is therefore more or less identical to the marathon variable.
PC1 points in the direction of largest variability, and because of the large numerical scale of the marathon times, this variable completely dominates the PCA. In such cases it may be better to use standardized variables, which is equivalent to running the eigenvalue decomposition on the correlation matrix instead of the covariance matrix of the variables, as the check below illustrates.
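A quick sketch of that equivalence, assuming the objects defined above: the variances of the scaled components equal the eigenvalues of the correlation matrix.
# prcomp with scale = TRUE amounts to PCA on the correlation matrix:
pr_scaled <- prcomp(runMen[, 1:8], scale = TRUE)
ev <- eigen(cor(runMen[, 1:8]))$values
all.equal(pr_scaled$sdev^2, ev)  # TRUE up to numerical precision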
- Re-run the PCA with the option scale=TRUE in prcomp(). How many components are needed to explain 99% of the variation in this case? How much is explained by the two first components? How would you interpret the loadings of PC1, PC2 and PC3?
pr <- prcomp(runMen[,1:8], scale = TRUE)
summary(pr)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.563 0.953 0.4076 0.3615 0.2859 0.26585 0.21933
Proportion of Variance 0.821 0.114 0.0208 0.0163 0.0102 0.00883 0.00601
Cumulative Proportion 0.821 0.935 0.9557 0.9720 0.9822 0.99106 0.99707
PC8
Standard deviation 0.15305
Proportion of Variance 0.00293
Cumulative Proportion 1.00000
pr
Standard deviations (1, .., p=8):
[1] 2.563 0.953 0.408 0.361 0.286 0.266 0.219 0.153
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
M100 0.315 0.5716 0.322 -0.1783 0.269 -0.5781 0.1425 -0.10781
M200 0.336 0.4638 0.372 0.2485 -0.158 0.6498 -0.1218 0.09974
M400 0.356 0.2439 -0.615 -0.5946 -0.236 0.1619 -0.0119 0.00229
M800 0.369 0.0142 -0.496 0.5214 0.541 -0.0318 -0.2233 0.03623
M1500 0.374 -0.1404 -0.105 0.4103 -0.494 -0.1747 0.6048 -0.14352
M5000 0.365 -0.3129 0.194 -0.0463 -0.237 -0.1448 -0.5974 -0.54333
M10K 0.367 -0.3071 0.180 -0.0985 -0.120 -0.2202 -0.1728 0.79734
M42K 0.343 -0.4319 0.228 -0.3176 0.489 0.3413 0.4028 -0.15986
Now we need 6 components to achieve 99% explained variance. Two components explain about 93.5% of the total variance. The loadings for PC1 are almost identical for all variables, hence PC1 is close to the average national record across all distances for each country. PC2 has weights ranging from strongly negative for the marathon to strongly positive for M100; this PC therefore contrasts short versus long runs. PC3 contrasts the middle distances (M400 and M800, and to a lesser extent M1500) against both the sprints and the long distances, so it extracts information about how the middle distances differ from both ends. The loading plots below support this reading.
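A small visual aid, sketched with base graphics on the pr object above: bar plots of the first three loading vectors.
# One bar plot per component; the sign pattern carries the interpretation.
op <- par(mfrow = c(1, 3))
for (k in 1:3) barplot(pr$rotation[, k], main = paste0("PC", k), las = 2)
par(op)  # restore the plotting layout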
- Make a biplot with the first two components. You may use the argument xlabs in biplot() to indicate that Nation should be used to label the scores from each country. Comment on how the nations cluster relative to the loadings.
biplot(pr, xlabs = runMen$Nation, cex = 0.7)
On the first axis (PC1) we observe that Cook’s Islands and West Samoa have the largest scores, so they have on average poor (slow) national records for all distances. At the other end we find USA and others with on average good national records. Along PC2 (vertically), at the bottom we find those with relatively poor times on long distances but relatively good times on short ones (Dom. Repub, Bermuda, Singapore, Malaysia, Thailand and West Samoa), whereas at the top we find countries with poor records on short distances compared to their long-distance records (Costa Rica, North Korea, Cook’s Islands and others). A quick check of the PC1 extremes follows.
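A hedged check of the score extremes along PC1, assuming pr from the scaled PCA above:
pc1 <- setNames(pr$x[, 1], runMen$Nation)
head(sort(pc1), 3)  # most negative scores: best average records
tail(sort(pc1), 3)  # most positive scores: poorest average records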
- Let’s try to predict the 3000m national records for women using the men’s data. First use a least squares approach with the men’s variables directly as predictors for women’s M3000. To accomplish this, use runWomen$M3000 as the response in lm(). Are there any significant variables? What are the \(R^2\) and the adjusted \(R^2\) values?
lmmod <- lm(runWomen$M3000 ~ M100 + M200 + M400 + M800 + M1500 +
M5000 + M10K + M42K, data = runMen)
(lm_sumry <- summary(lmmod))
Call:
lm(formula = runWomen$M3000 ~ M100 + M200 + M400 + M800 + M1500 +
M5000 + M10K + M42K, data = runMen)
Residuals:
Min 1Q Median 3Q Max
-0.7333 -0.2060 -0.0537 0.2496 0.7184
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.09205 1.98465 -3.07 0.0037 **
M100 -23.17895 23.87260 -0.97 0.3370
M200 18.93566 13.76942 1.38 0.1762
M400 -11.69187 5.19182 -2.25 0.0295 *
M800 6.06528 2.25105 2.69 0.0100 *
M1500 1.70738 1.16573 1.46 0.1503
M5000 0.20068 0.28632 0.70 0.4871
M10K 0.08099 0.14783 0.55 0.5866
M42K -0.00202 0.01715 -0.12 0.9070
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.357 on 43 degrees of freedom
Multiple R-squared: 0.84,
Adjusted R-squared: 0.81
F-statistic: 28.2 on 8 and 43 DF, p-value: 1e-14
lm_sumry[c("r.squared", "adj.r.squared")]
$r.squared
[1] 0.84
$adj.r.squared
[1] 0.81
Two variables are significant at the 5% level in this fitted model, M800 and M400. The \(R^2\) values indicate more than 80% explained variance. There is a slight difference between the adjusted and the non-adjusted \(R^2\), indicating that there may be too many variables included in the model.
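For reference, the adjusted value follows directly from the usual formula; a worked check with \(n = 52\) nations and \(p = 8\) predictors:
\[ R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - 0.16 \cdot \frac{51}{43} \approx 0.81. \]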
- Either perform a manual elimination of insignificant variables, or run backward() from the mixlm package to find a reduced model with only significant effects (5% test level). Which variables do you end up having in your model?
red.mod <- backward(lmmod, alpha = 0.05)
Backward elimination, alpha-to-remove: 0.05
Full model: runWomen$M3000 ~ M100 + M200 + M400 + M800 + M1500 + M5000 +
M10K + M42K
Step RSS AIC R2pred Cp F value Pr(>F)
M42K 1 5.49 -101 0.644 7.01 0.01 0.907
M10K 2 5.53 -102 0.687 5.32 0.32 0.576
M100 3 5.64 -104 0.730 4.15 0.86 0.359
M200 4 5.75 -104 0.746 3.03 0.92 0.343
M5000 5 6.20 -103 0.755 4.55 3.67 0.061 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(red.mod)
Call:
lm(formula = runWomen$M3000 ~ M400 + M800 + M1500, data = runMen)
Residuals:
Min 1Q Median 3Q Max
-0.6139 -0.2798 -0.0541 0.2591 0.7221
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.930 1.687 -4.11 0.00015 ***
M400 -11.241 4.286 -2.62 0.01166 *
M800 6.661 2.211 3.01 0.00412 **
M1500 3.556 0.806 4.41 0.000058 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.359 on 48 degrees of freedom
Multiple R-squared: 0.819,
Adjusted R-squared: 0.808
F-statistic: 72.6 on 3 and 48 DF, p-value: <2e-16
In addition to M400 and M800, I find M1500 to be highly significant after removing the other variables. This effect was “masked” in the full model because multicollinearity among the predictors inflated the coefficient variances, as the VIF sketch below shows.
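A minimal sketch of that inflation, computing variance inflation factors by hand (\(\mathrm{VIF}_j = 1/(1 - R^2_j)\), from regressing each predictor on the remaining ones; no extra packages assumed):
X <- runMen[, 1:8]
vifs <- sapply(seq_along(X), function(j) {
  r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
  1 / (1 - r2)
})
names(vifs) <- names(X)
round(vifs, 1)  # values far above 10 signal severe multicollinearity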
- Fit another model using all principal component scores from the men’s data as predictors for women’s M3000. The scores are stored in the principal component object under the element name x. Which components are significant at a 5% test level? Compare the \(R^2\) values with those from the full model using all original variables as predictors.
#You need to transform the scores into a data.frame first
PCdata <- as.data.frame(pr$x)
pcrmod <- lm(runWomen$M3000 ~ PC1 + PC2 + PC3 + PC4 + PC5 +
PC6 + PC7 + PC8, data = PCdata)
summary(pcrmod)
Call:
lm(formula = runWomen$M3000 ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 +
PC7 + PC8, data = PCdata)
Residuals:
Min 1Q Median 3Q Max
-0.7333 -0.2060 -0.0537 0.2496 0.7184
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.49673 0.04956 191.61 < 2e-16 ***
PC1 0.27603 0.01952 14.14 < 2e-16 ***
PC2 -0.17201 0.05249 -3.28 0.00208 **
PC3 0.03614 0.12279 0.29 0.76990
PC4 0.53593 0.13844 3.87 0.00036 ***
PC5 0.00982 0.17503 0.06 0.95551
PC6 0.04261 0.18825 0.23 0.82201
PC7 -0.09547 0.22818 -0.42 0.67772
PC8 0.04242 0.32698 0.13 0.89737
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.357 on 43 degrees of freedom
Multiple R-squared: 0.84,
Adjusted R-squared: 0.81
F-statistic: 28.2 on 8 and 43 DF, p-value: 1e-14
Components 1, 2 and 4 are highly significant; the others are not. The \(R^2\) values are identical to those of the full model on the original variables since we use all available components: the information content in PC1-PC8 is the same as in the eight original variables, as the check below confirms.
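A one-line sketch of that equivalence, assuming the two full models fitted above: all eight PCs span the same column space as the original predictors, so the fitted values coincide.
all.equal(unname(fitted(pcrmod)), unname(fitted(lmmod)))  # TRUE up to numerical precision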
- Perform a model reduction also for the PCR model by excluding principal components until you have only PCs significant at the 5% test level. Compare the estimated regression coefficients between the full and the reduced PCR models. Why do you think the effects don’t change for the components retained in the model?
pcr.red.mod <- backward(pcrmod, alpha = 0.05)
Backward elimination, alpha-to-remove: 0.05
Full model: runWomen$M3000 ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8
Step RSS AIC R2pred Cp F value Pr(>F)
PC5 1 5.49 -101 0.632 7.003 0.00 0.96
PC8 2 5.50 -103 0.683 5.020 0.02 0.90
PC6 3 5.50 -105 0.733 3.071 0.05 0.82
PC3 4 5.51 -107 0.749 1.158 0.09 0.76
PC7 5 5.54 -108 0.763 -0.667 0.19 0.66
summary(pcr.red.mod)
Call:
lm(formula = runWomen$M3000 ~ PC1 + PC2 + PC4, data = PCdata)
Residuals:
Min 1Q Median 3Q Max
-0.7232 -0.2207 -0.0461 0.2415 0.6907
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4967 0.0471 201.67 < 2e-16 ***
PC1 0.2760 0.0186 14.88 < 2e-16 ***
PC2 -0.1720 0.0499 -3.45 0.00118 **
PC4 0.5359 0.1315 4.07 0.00017 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 0.34 on 48 degrees of freedom
Multiple R-squared: 0.839,
Adjusted R-squared: 0.829
F-statistic: 83.3 on 3 and 48 DF, p-value: <2e-16
The reduced model has only PC1, PC2 and PC4 as predictors, and the \(R^2\) is almost as high as for the full model. The estimated effects are identical because the PCs are orthogonal to each other and explain independent variability; see the check below.
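A short check of the orthogonality argument, assuming PCdata and pcrmod from above: the score columns are uncorrelated, so each coefficient is unaffected by which other components are included.
round(cor(PCdata), 3)  # off-diagonal entries are all 0
coef(lm(runWomen$M3000 ~ PC1, data = PCdata))["PC1"]  # same estimate as in
coef(pcrmod)["PC1"]                                   # the full PCR model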
- For the specially interested: use the estimated effects (the alphas from the lecture) from the reduced PCR model to compute PCR-estimated regression coefficients (the betas from the lecture) for the original variables by back-rotation. Compare the estimated regression coefficients from the PCR approach with the least squares estimates using the original (BUT SCALED) variables.
# PCR-estimated betas using components 1, 2 and 4. The %*% operator
# performs matrix multiplication in R.
PCRest <- pr$rotation[,c(1,2,4)] %*% (coef(pcr.red.mod)[-1])
#Scaling the original variables and re-running the least squares model:
runMen.scaled <- as.data.frame(scale(runMen[,-9]))
lmmod2 <- lm(runWomen$M3000 ~ M100 + M200 + M400 + M800 + M1500 +
M5000 + M10K + M42K, data = runMen.scaled)
#Estimates without intercept.
Estimates <- cbind(coef(lmmod2)[-1],PCRest)
colnames(Estimates) <- c("Least Squares", "PCR")
Estimates
Least Squares PCR
M100 -0.1355 -0.10698
M200 0.2015 0.14607
M400 -0.2789 -0.26247
M800 0.3879 0.37900
M1500 0.2673 0.34725
M5000 0.1622 0.12968
M10K 0.1477 0.10142
M42K -0.0189 -0.00125
For most of the variables the PCR estimates are closer to zero (a shrinkage effect), which induces a bias, but as discussed in the lecture, the PCR model may have smaller variance for the estimates because it avoids the multicollinearity problem. PCR appears to have down-weighted the short and long distances compared to the least squares approach, which seems reasonable.
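In matrix form, the back-rotation above computes
\[ \hat{\boldsymbol\beta}_{PCR} = \mathbf{V}_{(1,2,4)} \, \hat{\boldsymbol\alpha}_{(1,2,4)}, \]
where \(\mathbf{V}_{(1,2,4)}\) holds the loading vectors of the retained components and \(\hat{\boldsymbol\alpha}_{(1,2,4)}\) their estimated effects, so each original (scaled) variable inherits a weighted combination of the retained components’ effects.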