Prediction of cow milk percentage
In the data file “maldidata.rdata” you find four objects:
Y: The percentage of cow-milk for 4 replicates of 45 different milk mixtures.
X: Mass Spectrometry data (MALDI-TOF) for the milk samples. The 6179 variables are quantifications of molecule “size” and “charge” in a sample. For simplicity we may say that the size of the molecules increases from variable 1 to variable 6179. The measurements are then the amounts of molecules of different sizes. The method is used to separate proteins, peptides and other ionizable compounds.
Ytest: Cow-milk percentages for 45 extra test samples.
Xtest: The MALDI-TOF values for the test samples.
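Assuming the file maldidata.rdata sits in your working directory (adjust the path otherwise), the objects can be loaded with:
load("maldidata.rdata")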
You may plot the spectra in X one-by-one, e.g. the first sample:
plot(X[1, ], type = "l")  # rows of X are samples, so X[1, ] is the first spectrum
We can also plot several (or all) spectra together, although this takes more time:
matplot(t(X[1:5, ]), type = "l", lty = 1, col = "black")  # each column of t(X[1:5, ]) is one spectrum
The peaks show molecule sizes that are abundant in the sample.
- First run a PCA on X. How many components are needed to explain 80% and 90% of the variability in X?
pr <- prcomp(X)
summary(pr)$importance[, 1:14]
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 12.502 4.4833 3.0218 2.5373 2.0112 1.7430 1.30637
Proportion of Variance 0.648 0.0833 0.0379 0.0267 0.0168 0.0126 0.00708
Cumulative Proportion 0.648 0.7313 0.7691 0.7958 0.8126 0.8252 0.83228
PC8 PC9 PC10 PC11 PC12 PC13
Standard deviation 1.0875 1.05133 0.96666 0.91363 0.84363 0.83145
Proportion of Variance 0.0049 0.00458 0.00387 0.00346 0.00295 0.00287
Cumulative Proportion 0.8372 0.84176 0.84564 0.84910 0.85205 0.85491
PC14
Standard deviation 0.81638
Proportion of Variance 0.00276
Cumulative Proportion 0.85768
We need 5 components to explain 80% and 36 to explain 90% (check this yourself, e.g. with the snippet below). Hopefully we do not need all of the X-information to predict Y.
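One way to verify these counts is to compute the cumulative proportion of explained variance directly (this snippet is my addition, not part of the original solution):
cumvar <- cumsum(pr$sdev^2) / sum(pr$sdev^2)  # cumulative proportion of variance
which(cumvar >= 0.80)[1]  # smallest number of components reaching 80% (5)
which(cumvar >= 0.90)[1]  # smallest number of components reaching 90% (36)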
- Compute the correlations between Y and the principal components. How can you use this to get an idea of how many components PLS-regression will require to make a good model for cow-milk prediction?
cor(Y, pr$x)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
[1,] 0.528 -0.754 0.0338 -0.225 0.0744 -0.0503 0.015 -0.166 -0.00757
PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18
[1,] -0.0754 -0.137 -0.0432 0.0183 0.0167 -0.0155 -0.0263 0.0222 -0.0187
PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26
[1,] -0.00495 -0.0413 0.00649 -0.0187 -0.0442 -0.0069 -0.0411 0.00412
PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35
[1,] -0.0123 0.017 -0.0147 0.0387 0.000618 0.0185 -0.0213 0.0478 -0.00964
PC36 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44
[1,] -0.0106 -0.00503 0.021 -0.011 -0.00874 -0.0191 -0.014 0.0082 -0.0124
PC45 PC46 PC47 PC48 PC49 PC50 PC51 PC52
[1,] 0.0176 0.00597 0.0141 -0.0193 0.0217 -0.00107 -0.00454 0.00781
PC53 PC54 PC55 PC56 PC57 PC58 PC59 PC60
[1,] 0.00674 -0.0041 0.0027 -0.00404 -0.0269 0.00521 -0.00471 -0.000586
PC61 PC62 PC63 PC64 PC65 PC66 PC67 PC68
[1,] 0.0124 -0.0142 0.0125 -0.0181 -0.00766 0.00276 0.0146 -0.0221
PC69 PC70 PC71 PC72 PC73 PC74 PC75 PC76
[1,] -0.00245 -0.00385 -0.00334 -0.00403 -0.038 -0.0394 0.00702 0.0103
PC77 PC78 PC79 PC80 PC81 PC82 PC83 PC84
[1,] -0.017 -0.0104 -0.0116 -0.00895 0.00664 -0.00716 0.0086 -0.0205
PC85 PC86 PC87 PC88 PC89 PC90 PC91 PC92 PC93
[1,] -0.0217 0.018 0.0284 0.00673 0.0166 0.00445 -0.00696 -0.0109 -0.00113
PC94 PC95 PC96 PC97 PC98 PC99 PC100 PC101
[1,] 0.0244 0.0304 -0.0186 -0.00388 0.0214 -0.00707 -0.00238 0.000283
PC102 PC103 PC104 PC105 PC106 PC107 PC108 PC109 PC110
[1,] 0.0251 -0.00757 -0.0175 -0.0259 0.0244 -0.026 0.00272 -0.0157 0.0142
PC111 PC112 PC113 PC114 PC115 PC116 PC117 PC118 PC119
[1,] -0.0115 -0.009 -0.0115 0.0299 0.00924 0.00246 -0.01 -0.00907 -0.0242
PC120 PC121 PC122 PC123 PC124 PC125 PC126 PC127 PC128
[1,] 0.00516 -0.0066 0.0147 0.00467 -0.0254 -0.0224 -0.0124 0.00816 0.0101
PC129 PC130 PC131 PC132 PC133 PC134 PC135 PC136
[1,] -0.0142 -0.0119 -0.00018 -0.0108 -0.00172 0.00275 -0.00635 0.0103
PC137 PC138 PC139 PC140 PC141 PC142 PC143 PC144
[1,] -0.0087 0.00552 0.00742 -0.00805 -0.00327 0.0159 -0.00109 0.00291
PC145 PC146 PC147 PC148 PC149 PC150 PC151 PC152
[1,] -0.00378 -0.00148 -0.00736 -0.00656 0.00354 -0.00579 0.0034 0.00252
PC153 PC154 PC155 PC156 PC157 PC158 PC159 PC160
[1,] 0.0015 -0.00918 -0.0111 0.0232 0.0108 -0.00734 -0.00148 0.00503
PC161 PC162 PC163 PC164 PC165 PC166 PC167 PC168
[1,] -0.00242 -0.00583 -0.00701 0.00558 -0.00344 0.00436 0.00368 0.00529
PC169 PC170 PC171 PC172 PC173 PC174 PC175 PC176
[1,] -0.015 -0.0092 -0.0134 -0.00933 -0.00255 0.0113 0.00024 -0.000418
PC177 PC178 PC179 PC180
[1,] -0.0057 -0.0103 0.0218 0.0288
Components 1, 2 and 4, and perhaps 8, are moderately to highly correlated with Y. Presumably 3 to 4 components will be needed for PLSR.
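The long correlation printout is easier to scan visually; a quick sketch (my addition) of the absolute correlations for the first 20 components:
r <- abs(cor(Y, pr$x)[1, 1:20])
plot(1:20, r, type = "h", xlab = "Component", ylab = "|cor(Y, PC)|")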
- Fit a PLS-regression model using Y as response and X as predictor (you may simply write Y ~ X as your model call in plsr. Also use ncomp = 10 as an extra argument to fit only 1 to 10 components). Use the scoreplot() function to make a score plot. Check the help file for this function to see how you can choose the component numbers to plot, and how you can put labels on your observations. Plot component 1 against component 2 and use the observation numbers 1:180 as labels. If the noise level of the measurements is low, the replicates should group in clusters of size four (obs 1, 2, 3 and 4 are from the same mixture, and so on). Do you see any tendency towards this?
library(pls)
plsmod <- plsr(Y ~ X, ncomp = 10)
scoreplot(plsmod, comps = c(1, 2), labels = "numbers", cex = 0.7)  # cex (not pch) controls label size
We observe clusters of observations and a tendency for successive numbers to lie close together in the score plot, but the cluster sizes are difficult to determine.
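One optional way (my addition) to make the replicate structure stand out is to colour the labels by mixture, exploiting that observations come in consecutive groups of four:
scoreplot(plsmod, comps = c(1, 2), labels = "numbers", cex = 0.7,
          col = rep(1:45, each = 4))  # obs 1-4 get colour 1, obs 5-8 colour 2, ...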
- Perform hierarchical clustering of the samples using the 3 first PLS-component scores as input variables. Try both the “complete” and “average” agglomeration methods and make a dendrogram. Are the replicate clusters more apparent in the dendrogram than in the score plot? Are there any samples that are very different from all the others?
clust1 <- hclust(dist(plsmod$scores[,1:3]), method = "complete")
plot(clust1, cex = 0.5)
From the dendrogram we observe several clusters of size four, and some of size three, with successive observation numbers. This implies that the replicates are more similar to each other than to samples from different milk mixtures. Sample 159 seems to be an outlier, very different from all the others.
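The exercise also asks for the “average” agglomeration method; the analogous call (its output is not shown in the original solution) is:
clust1b <- hclust(dist(plsmod$scores[, 1:3]), method = "average")
plot(clust1b, cex = 0.5)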
- Use K-means clustering with K=45 on the three first PLS-component scores. How do the replicates cluster?
clust2 <- kmeans(plsmod$scores[,1:3], centers = 45)
clust2$cluster
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
39 39 39 39 13 13 13 44 20 20 20 20 31 31 31 31 38 15
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
38 15 13 1 1 40 25 25 14 17 2 2 2 2 22 32 22 32
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
25 17 25 25 44 23 9 34 3 3 3 3 11 11 19 23 17 24
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
24 24 30 30 14 14 40 23 40 39 9 9 9 9 1 30 1 13
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
12 45 45 45 44 44 33 22 5 13 12 35 34 34 34 23 14 17
91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
17 17 6 6 28 28 42 42 42 42 22 32 43 43 7 7 7 7
109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
38 38 38 38 4 4 4 4 37 37 37 37 41 8 8 41 29 29
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
29 26 25 16 16 16 25 41 25 25 35 5 5 35 42 42 42 42
145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
21 21 21 21 36 36 36 36 18 18 18 18 33 6 27 28 8 8
163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
8 30 14 41 41 16 12 45 45 35 10 30 10 10 15 15 3 15
Many replicates fall into the same clusters, but some clusters contain observations from more than one “true” cluster. E.g. cluster number 42 contains observations (97, 98, 99, 100) and (141, 142, 143, 144), which is a mixture of two true replicate clusters. (Note that kmeans cluster labels depend on the random initialisation, so your numbers may differ; see also the check below.) We conclude that the replicates are somewhat similar, but there is some noise in the MALDI-TOF data which makes similar milk mixtures hard to distinguish.
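As an optional summary (my addition), we can count how many of the 45 replicate groups end up entirely within a single k-means cluster:
mixture <- rep(1:45, each = 4)  # true mixture id for each observation
sum(tapply(clust2$cluster, mixture, function(cl) length(unique(cl)) == 1))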
- We will use cross-validation to estimate the expected prediction error of PLSR and to choose the appropriate number of components. Instead of using Leave-One-Out Cross-validation we will exclude all four replicates in a “Leave-Four-Out” type of Cross-validation. Why is this smart?
Since the replicates are from the same mixture, they are not independent. If we use Leave-One-Out CV the three replicates still contained in the training set will make the model “too good” to predict the given mixture, and we will under-estimate the prediction error of new samples.
- Refit the PLS model from exercise c., but add the arguments validation = "CV", segments = 45 and segment.type = "consecutive" in the plsr model call. This sets up the “Leave-Four-Out” CV to be performed.
plsmod <- plsr(Y ~ X, ncomp = 10, validation = "CV",
segments = 45, segment.type = "consecutive")
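If you want to see how the consecutive segments are formed, the cvsegments() function from the pls package (which, as far as I know, plsr uses internally for this) can be called directly; with 180 observations and 45 segments the first segments should be (1, 2, 3, 4), (5, 6, 7, 8), and so on:
cvsegments(180, k = 45, type = "consecutive")[1:2]  # first two CV segments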
- The sum of squared prediction errors (PRESS) for different numbers of components is given in the validation$PRESS element of the fitted pls model. The null-model PRESS (prediction using the mean response) is given in the validation$PRESS0 element. You find the MSEP values (Mean Squared Error of Prediction) by dividing the PRESS by the number of observations (N = 180). Make a prediction error plot of MSEP with 0 to 10 components. How many components do you think give satisfying prediction of cow-milk content in new mixtures? Remember that simple models are often better than complex ones.
MSEP <- c(plsmod$validation$PRESS0, plsmod$validation$PRESS)/180
plot(0:10, MSEP, type = "b", col = 2, lwd = 2, xlab = "Number of components", ylab = "MSEP")
The prediction error is heavily reduced as we introduce components 1 and 2, and there is a small further gain from adding components 3, 4 and 5. Simple models are usually more robust, so I would not go further than 5 components here.
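As a side note (my addition), newer versions of the pls package provide selectNcomp(), which formalises this parsimony reasoning; with method = "onesigma" it picks the smallest model whose CV error is within one standard error of the minimum:
selectNcomp(plsmod, method = "onesigma", plot = TRUE)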
- Predict the cow-milk content of the test samples using Xtest as newdata in the predict function, and use the number of components you found best in the previous exercise. Save the predictions into an object called pred. (See ?predict.mvr for the help file on predict for pls objects.) The predict function returns an array of dimension [45, 1, 1] of predictions. You can extract all predictions with pred[, 1, 1].
pred <- predict(plsmod, newdata = Xtest, ncomp = 5)
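A quick visual check (my addition) is to plot the predictions against the known test values; good predictions should fall close to the identity line:
plot(Ytest, pred[, 1, 1], xlab = "Observed (Ytest)", ylab = "Predicted")
abline(0, 1, col = "grey")  # identity line: perfect predictions fall here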
Copy the following code into R and execute:
MSEPfunc <- function(y, yhat){
mean((y - yhat) ^ 2)
}
- Use MSEPfunc() to compute the MSEP value for the test predictions, using Ytest and the predicted values as inputs. Did the cross-validation under-estimate the prediction error?
MSEPfunc(Ytest, pred[,1,1])
[1] 0.00692
The MSEP value is slightly larger than the value found for five components by cross-validation, but the order of magnitude is the same.
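To make the comparison explicit (my addition), the cross-validated MSEP for five components can be extracted directly from the fitted model:
plsmod$validation$PRESS[1, 5] / 180  # CV-based MSEP for 5 components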
- EXTRA for those interested: redo exercises g. and h. with Leave-One-Out CV. (See also the lecture notes for Lesson 7.) Compare the MSEP values you find with those from the previous CV routine.
plsmod2 <- plsr(Y ~ X, ncomp = 10, validation = "LOO")
MSEP2 <- c(plsmod2$validation$PRESS0, plsmod2$validation$PRESS)/180
plot(0:10, MSEP, type = "b", col = 2, lwd = 2, ylim = range(MSEP, MSEP2),
     xlab = "Number of components", ylab = "MSEP")  # ylim so both curves fit in the plot
points(0:10, MSEP2, type = "b", col = 4, lwd = 2)
We see that LOO-CV underestimates the prediction error, as commented in exercise f.
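For a numeric side-by-side comparison (my addition), the two MSEP curves can also be printed together:
round(rbind(LeaveFourOut = MSEP, LOO = MSEP2), 5)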