Data from the NSR education test

The Norwegian Centre for Science Recruitment (NSR) has an online “education test” where young people can answer a questionnaire to check their so-called cognitive types, their science interest, their preferred learning methods and their interest in various science subjects. The test suggests different areas within STEM (Science, Technology, Engineering and Mathematics) in which the respondent may find suitable work.

We have an excerpt of these data which can be downloaded from Canvas as the nsr.rdata file. The data.frame NSRdata contains two variables, Science and Age:

head(NSRdata,5)

    Science Age
101     4.2  16
102     4.2  16
103     4.2  16
104     3.0  19
105     2.8  16

Science is an average liking score (scale 1-6) for various STEM subjects, and Age is a factor indicating different age groups:

1: 1-12 yrs
13: 13-15 yrs
16: 16-19 yrs
19: 19-29 yrs
30: 30+ yrs
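Because the Age codes look like numbers, it can help to see how they map to the labelled groups. The following is a minimal sketch, assuming the five Age values from the head() output above; the object names age_codes and age are hypothetical and not part of NSRdata.

```r
# Age codes from the first five rows shown by head(NSRdata, 5)
age_codes <- c(16, 16, 16, 19, 16)

# Treat the codes as a factor with the five age groups listed above
age <- factor(age_codes,
              levels = c(1, 13, 16, 19, 30),
              labels = c("1-12 yrs", "13-15 yrs", "16-19 yrs",
                         "19-29 yrs", "30+ yrs"))

nlevels(age)  # number of age groups
table(age)    # counts per group in this small excerpt
```

Groups that do not occur in the excerpt still appear in the table with a count of zero, since all five levels are declared.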

Perform an analysis of the NSR data to check whether Age influences the liking of STEM subjects. State the model, fit the model, check model assumptions, test hypotheses, and give model critique. Write a short summary of the results.

<ANS/>

Answer will come later.

Follow up the previous exercise by performing a Tukey test for all pairwise comparisons with an overall (family-wise) error rate of 5%. Give a summary of the results.

<ANS/>

library(mixlm)  # provides simple.glht() and cld()
library(car)    # provides Anova()
NSRmod <- lm(Science ~ Age, data = NSRdata)
Anova(NSRmod, type = "III")

Anova Table (Type III tests)

Response: Science
            Sum Sq   Df F value Pr(>F)    
(Intercept)   1601    1  1557.6 <2e-16 ***
Age            225    4    54.7 <2e-16 ***
Residuals    10270 9995                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

pt <- simple.glht(NSRmod, "Age", corr = c("Tukey"),level = 0.95)
print(pt)


     Simultaneous Confidence Intervals and Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Science ~ Age, data = NSRdata)

Quantile = 2.73 
Minimum significant difference = 0.157
95% confidence level
 
Linear Hypotheses:
        Lower  Center   Upper Std.Err t value      P(>t)    
1-13   0.0455  0.2021  0.3588  0.0574    3.52     0.0040 ** 
1-16  -0.0183  0.1384  0.2951  0.0574    2.41     0.1126    
1-19   0.1079  0.2646  0.4212  0.0574    4.61 0.00004063 ***
1-30  -0.3466 -0.1899 -0.0333  0.0574   -3.31     0.0084 ** 
13-16 -0.2204 -0.0637  0.0929  0.0574   -1.11     0.8015    
13-19 -0.0942  0.0624  0.2191  0.0574    1.09     0.8132    
13-30 -0.5487 -0.3921 -0.2354  0.0574   -6.83    < 2e-16 ***
16-19 -0.0305  0.1262  0.2828  0.0574    2.20     0.1808    
16-30 -0.4850 -0.3283 -0.1717  0.0574   -5.72 0.00000011 ***
19-30 -0.6112 -0.4545 -0.2978  0.0574   -7.91    < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)


WARNING: Unbalanced data may lead to poor estimates

The ANOVA table indicates a clearly significant effect of age group on the level of interest in science subjects. Hence, at least two age groups have significantly different expected means. We can use the Tukey test to identify pairwise differences.
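As a check on how the F value in the table arises, the sketch below recomputes it from the rounded sums of squares and degrees of freedom printed above. The variable names are hypothetical, and the result only matches the table up to rounding.

```r
# Values copied from the Type III ANOVA table (rounded as printed)
ss_age <- 225;   df_age <- 4
ss_res <- 10270; df_res <- 9995

# Mean squares: sum of squares divided by degrees of freedom
ms_age <- ss_age / df_age
ms_res <- ss_res / df_res

# F statistic and its upper-tail p-value
f_value <- ms_age / ms_res
p_value <- pf(f_value, df_age, df_res, lower.tail = FALSE)

round(f_value, 1)  # compare with the F value 54.7 in the table
p_value < 2e-16    # compare with Pr(>F) in the table
```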

The Tukey output provides the difference in means (column “Center”) for every pair of groups, and the top of the output gives the minimum significant difference as 0.157. That is, according to Tukey, two age groups are significantly different if their averages differ by more than 0.157. From the p-values we observe that 6 out of 10 pairs are significantly different, and the largest difference is found between the age groups 30+ and 19-29. The output also warns that the data are unbalanced, so the estimates should be interpreted with some caution.
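The relation between the quantile, the standard error and the minimum significant difference can be illustrated with a small sketch using only numbers copied from the Tukey output. The variable names are hypothetical. The qtukey() line is an assumption about where the reported quantile comes from (the studentized range quantile divided by the square root of 2), not something stated in the output.

```r
# Values copied from the Tukey output (rounded as printed)
quantile_tukey <- 2.73
std_err <- 0.0574

# Minimum significant difference = quantile * standard error
msd <- quantile_tukey * std_err  # about 0.157

# Assumed origin of the quantile: studentized range for 5 groups, 9995 df
q_check <- qtukey(0.95, nmeans = 5, df = 9995) / sqrt(2)

# Differences in means (column "Center")
center <- c("1-13" = 0.2021, "1-16" = 0.1384, "1-19" = 0.2646,
            "1-30" = -0.1899, "13-16" = -0.0637, "13-19" = 0.0624,
            "13-30" = -0.3921, "16-19" = 0.1262, "16-30" = -0.3283,
            "19-30" = -0.4545)

# Simultaneous confidence limits and the significance rule
lower <- center - msd
upper <- center + msd
significant <- abs(center) > msd

sum(significant)                 # number of significantly different pairs
names(which.max(abs(center)))    # pair with the largest difference
```

A pair is significant exactly when its interval from lower to upper excludes zero, which is the same as the absolute difference exceeding the minimum significant difference.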

cld(pt)

Tukey's HSD
Alpha: 0.05

   Mean G1 G2 G3
30 3.33  A      
1  3.14     B   
16 3.00     B  C
13 2.94        C
19 2.88        C

The compact letter display groups similar levels: there are three groups whose levels are not significantly different from each other. 30+ differs from all other levels, whereas 1-12 and 16-19 are similar, and 16-19, 13-15 and 19-29 are also similar.
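To see how the letters relate to the pairwise tests, the sketch below checks that two levels share a letter exactly when the Tukey test did not find them significantly different. The significance flags and letter groups are copied from the outputs above; the names sig, letter_groups and share_letter are hypothetical helpers, not part of any package.

```r
# TRUE = significantly different at the 5% family-wise level (Tukey table)
sig <- c("1-13" = TRUE, "1-16" = FALSE, "1-19" = TRUE, "1-30" = TRUE,
         "13-16" = FALSE, "13-19" = FALSE, "13-30" = TRUE,
         "16-19" = FALSE, "16-30" = TRUE, "19-30" = TRUE)

# Letter groups read from the compact letter display
letter_groups <- list(A = "30",
                      B = c("1", "16"),
                      C = c("16", "13", "19"))

# Two levels share a letter if both appear in at least one group
share_letter <- function(a, b) {
  any(vapply(letter_groups, function(g) all(c(a, b) %in% g), logical(1)))
}

# Evaluate every pair from the Tukey table
pairs <- strsplit(names(sig), "-")
shared <- vapply(pairs, function(p) share_letter(p[1], p[2]), logical(1))

# Consistent display: sharing a letter is the opposite of being significant
consistent <- all(shared != sig)
consistent
```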