Chlorine levels in cities
Independent measurements of chlorine (ppm parts per million) were taken from 3 large cities:
city
City1 City2 City3
1 0.46 1.21 0.44
2 0.92 10.86 0.71
3 0.86 2.09 2.87
4 0.33 6.76 0.81
5 0.53 3.20 7.34
6 5.13 2.02 2.21
7 1.51 2.16 9.00
8 1.00 0.44 17.20
9 0.70 7.80 2.76
10 1.89 1.20 1.07
11 0.68 1.39 2.07
12 0.71 2.42 1.70
13 0.41 1.62 7.89
14 0.78 10.78 0.79
15 0.11 3.25 0.24
16 2.70 0.41 4.44
17 0.41 0.52 15.40
18 0.75 3.71 4.21
19 0.81 5.20 8.63
20 0.35 13.89 3.90
- Load the
city.rdata
which is available on Canvas.
Before analyzing the data it is best to “stack” the data into two columns, a response column y
and a city factor column city
. By using the stack()
function in R you can restructure the city data by
citydata <- stack(city)
- Use the
colnames()
function to rename the variables asy
andcity
.
colnames(citydata) <- c("y", "city")
- Assume an ANOVA-model with chlorine as response and city as a factor. What assumptions do you make?
The model assumptions are:
\[y_{ij} = \mu_i + \epsilon_{ij}\]
where the error terms are assumed independent and identically distributed \(\epsilon_{ij} \sim N(0, \sigma^2)\). Further \(\mu_i\) is the expected chlorine level in city \(i\).
- Make a box-plot of the data. Describe the variabilities between cities and within cities.
plot(y ~ city, data = citydata)
City 1 seems to have both the lowest average chlorine levels and the smallest variability in the measurements. Cities 2 and 3 are quite similar with both higher means and variability.
- Fit the model and perform a residual analysis. Comment on the model fit.
options(contrasts=c("contr.treatment", "contr.poly"))
citymod1 <- lm(y ~ city, data = citydata)
par(mfrow = c(1, 2))
plot(citymod1, which = c(1, 2))
The residual analysis reveals two problems: 1) Non-constant variability, which was also observed in the boxplot, and 2) A skeewed and non-normal distribution for the residuals with a heavy tail to the right.
- Make a log-transformation of \(y\) to create the variable
logy
, and add the new variable to your data set by:
citydata$logy <- log(citydata$y)
- Fit a new model with
logy
as response and check the model fit again.
citymod2 <- lm(logy~city, data = citydata)
par(mfrow = c(1,2))
plot(citymod2, which = c(1,2))
Both problems from the previous model appear to be corrected. We continue with this model.
- Estimate all model parameters from the latter model. Explain what the parameters measure given the parameterization of your choice (See lecture notes for parameterization).
The ANOVA analysis using contr.treatment parametrization:
summary(citymod2)
Call:
lm(formula = logy ~ city, data = citydata)
Residuals:
Min 1Q Median 3Q Max
-2.3987 -0.5954 -0.0175 0.7122 1.9326
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.297 0.231 -1.29 0.20266
city(City2) 1.226 0.326 3.76 0.00041 ***
city(City3) 1.269 0.326 3.89 0.00027 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s: 1.03 on 57 degrees of freedom
Multiple R-squared: 0.255,
Adjusted R-squared: 0.229
F-statistic: 9.75 on 2 and 57 DF, p-value: 0.000228
Anova(citymod2)
Anova Table (Type II tests)
Response: logy
Sum Sq Df F value Pr(>F)
city 20.8 2 9.75 0.00023 ***
Residuals 60.7 57
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefs <- citymod2$coefficients
We observe that \(R^2 = 0.25\) indicating that there is still a lot of unexplained variability in the response. We get the following estimates for the city means of log(chlorine)-levels using this parametrization (city 1 is reference city):
\(\mu_1 = \mu =\) -0.3
\(\mu_2 = \mu + \alpha_2 =\) 0.93
\(\mu_3 = \mu + \alpha_3 =\) 0.97
The estimate of the error variance is \(\hat{\sigma}^2 = MSE= SSE/(N-3) = 60.724/57 = 1.065\).
As we see from above, the “intercept” \(\mu\) is the expected log-chlorine level for city 1. The \(\alpha_2\) is the difference between city 1 and city 2, and \(\alpha_3\) is the difference between city 1 and city 3. The \(\sigma^2\) is the variance for log-chlorine levels measured in the same city.
- State the hypotheses for testing whether the expected chlorine levels differ between the cities, choose a test level \(\alpha\) and perform the test. What is the conclusion?
The hypotheses are:
\[H_0: \mu_1 = \mu_2 = \mu_3\] versus \[H_1:\mu_i \neq\mu_{i'}\] for at least two different cities \(i\) and \(i'\).
Note:This is by the chosen parametrization equivalent to testing
\[H_0: \alpha_2 = \alpha_3 = 0\] versus \[H_1:\alpha_i\neq0\] for at least one \(i \in \{2,3\}\).
For test-level \(\alpha=0.05\) we reject the null-hypothesis if \(F>F_{\alpha, 2, 57}\) or if the p-value is smaller than 0.05. Here we observe a very small p-value and reject the null-hypothesis. We claim that the expected log-chlorine levels differ between at least two of the cities, and from the observed means we know that cities 1 and 3 are significantly different since they have the largest observed difference in means. There may also be a significant difference between cities 1 and 2, and 2 and 3, but this should be checked by a pair-wise contrast.