class: split-70 with-border hide-slide-number bg-brand-red background-image: url("images/USydLogo-black.svg") background-size: 200px background-position: 2% 90% .column.white[.content[ <br><br><br> # Quantitative Factors ## .black[STAT3022 Applied Linear Models Lecture 17] <br><br><br> ### .black[2020/02/17] ]] .column.bg-brand-charcoal[.content.white[ ## Today 1. Quantitative factor 1. Polynomial regression and ANOVA 1. Nesting of linear effects 1. Bartlett test to assess homoscedasticity assumption ]] --- class: bg-brand-red white center middle # Factor or numerical variables? -- <br><br> ### Sometimes it is unclear as to whether a particular explanatory variable should be regarded as a .brand-yellow[factor] (categorical explanatory variable) or a .brand-yellow[numerical covariate]. -- <br><br> ### We will see that .brand-yellow[polynomial regression is the key] to understand the problem. --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - Vitamin C and tooth growth ]] .row[.split-50[ .column[.content[ * This example concerns an (old) experiment into the effects of vitamin C on tooth growth. * 30 guinea pigs were divided (at random) into three groups of ten and dosed with vitamin C (administered in orange juice). * Group 1 dose was low (0.5mg vit. C), * Group 2 dose was medium (1mg vit. C) and * Group 3 dose was high (2mg vit. C). * Length of odontoblasts (teeth) measured as response variable. .bottom_abs.width100.gray[ Reference: Bliss (1952) The Statistics of Bioassay. Academic Press. ] ]] .column.bg-brand-gray[.content[ <br> <center> <table> <thead> <tr> <th style="text-align:right;"> Replicate </th> <th style="text-align:right;"> 0.5 </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4.2 </td> <td style="text-align:right;"> 16.5 </td> <td style="text-align:right;"> 23.6 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 11.5 </td> <td style="text-align:right;"> 16.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 7.3 </td> <td style="text-align:right;"> 15.2 </td> <td style="text-align:right;"> 33.9 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5.8 </td> <td style="text-align:right;"> 17.3 </td> <td style="text-align:right;"> 25.5 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6.4 </td> <td style="text-align:right;"> 22.5 </td> <td style="text-align:right;"> 26.4 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 10.0 </td> <td style="text-align:right;"> 17.3 </td> <td style="text-align:right;"> 32.5 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 11.2 </td> <td style="text-align:right;"> 13.6 </td> <td style="text-align:right;"> 26.7 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 11.2 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 21.5 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 5.2 </td> <td style="text-align:right;"> 18.8 </td> <td style="text-align:right;"> 23.3 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 7.0 </td> <td style="text-align:right;"> 15.5 </td> <td style="text-align:right;"> 29.5 </td> </tr> </tbody> </table> </center> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Getting the data in the right form ]] .row[.split-50[ .column[.content[ * You will get the data in all sorts of forms especially when they are grouped in some way. * Strive to get your data in the form ready for analysis. * Remember the tidy data principle: each row is an observation, each column is a variable. * We have 30 observations so the data frame should contain 30 rows. ```r head(datwide) #short format ``` ``` Replicate 0.5 1 2 [1,] 1 4.2 16.5 23.6 [2,] 2 11.5 16.5 18.5 [3,] 3 7.3 15.2 33.9 [4,] 4 5.8 17.3 25.5 [5,] 5 6.4 22.5 26.4 [6,] 6 10.0 17.3 32.5 ``` ]] .column.bg-brand-gray[.content[ ```r dat <- datwide %>% select(-Replicate) %>% gather(key=Dose, value=Length) %>% mutate(Dose=as.numeric(Dose), Dose.fac=case_when( Dose=="0.5" ~ "Low", Dose=="1" ~ "Medium", Dose=="2" ~ "High" )) %>% # relevel first so order is meaningful mutate(Dose.fac=factor(Dose.fac, level=c("Low", "Medium", "High"))) ``` .scroll-box-10[ ```r head(dat,5) #long format ``` ``` Length Dose.fac Dose 1 4.2 Low 0.5 2 11.5 Low 0.5 3 7.3 Low 0.5 4 5.8 Low 0.5 5 6.4 Low 0.5 ``` ] ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - Vitamin C and tooth growth ]] .row[.split-70[ .column[.content[ ```r skimr::skim(dat) ``` <img src="images/L17Skim.png" width="90%" height="90%"> * For the plot it seems highly plausible that mean tooth growth depends on vitamin C levels. * Could the dependence of tooth growth on vitamin C be adequately modelled by a linear regression of `Length` on `Dose`, or is an ANOVA type model with `Length` and `Dose.fac` to be preferred? ]] .column.bg-brand-gray[.content[ ```r ggplot(dat, aes(Dose.fac, Length)) + geom_boxplot() + labs(x="Dose") ``` <img src="lecture17_2020JC_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Simple linear regression or one-way ANOVA ]] .row[.split-50[ .column[.content[ ```r mdat <- dat %>% group_by(Dose) %>% summarise(Length=mean(Length)) ggplot(dat, aes(Dose, Length)) + geom_point() + geom_smooth(method="lm", se=F) + geom_point(data=mdat, aes(Dose, Length), color="red", size=8, alpha=1/3) + theme_bw(base_size=18) ``` ``` `geom_smooth()` using formula 'y ~ x' ``` <img src="lecture17_2020JC_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ ```r *M.SLR <- lm(Length ~ Dose, dat) #simple linear reg anova(M.SLR) ``` ``` Analysis of Variance Table Response: Length Df Sum Sq Mean Sq F value Pr(>F) Dose 1 1601.34 1601.34 117.95 1.509e-11 *** Residuals 28 380.15 13.58 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r *M.ANOVA <- lm(Length ~ Dose.fac, dat) #1-way ANOVA anova(M.ANOVA) ``` ``` Analysis of Variance Table Response: Length Df Sum Sq Mean Sq F value Pr(>F) Dose.fac 2 1649.5 824.74 67.072 3.357e-11 *** Residuals 27 332.0 12.30 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ]] ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Simple linear regression or one-way ANOVA ]] .row[.content[ Take Dose as .brand-blue[continuous] in **simple linear regression**? ```r print(broom::tidy(M.SLR), digits = 6) ``` ``` # A tibble: 2 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) 3.29 1.43 2.31 2.85e- 2 2 Dose 11.7 1.08 10.9 1.51e-11 ``` Take Dose as .brand-blue[categorical] in **one-way ANOVA model**? ```r print(broom::tidy(M.ANOVA), digits = 6) ``` ``` # A tibble: 3 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) 7.98 1.11 7.20 9.71e- 8 2 Dose.facMedium 8.79 1.57 5.61 6.02e- 6 3 Dose.facHigh 18.2 1.57 11.6 5.58e-12 ``` Note that although RSS is lower, it also has 1 df less. Both models are highly significant. ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Recall: simple linear model passes through `\((\bar{x}, \bar{Y}_{\bullet\bullet})\)` ]] .row[.split-50[ .column.bg-brand-gray[.content[ * The blue dot here is the `\((\bar{x}, \bar{Y}_{\bullet\bullet})\)`. * Remember that the one-way ANOVA model can predict the red dots but cannot predict for factor levels not included in the model. <br><br> ``` `geom_smooth()` using formula 'y ~ x' ``` <img src="lecture17_2020JC_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> ]] .column[.content[ ### Can we consider another linear model that goes through the mean of each treatment group? That is a linear model that goes through `\((x_i, \bar{Y}_{i\bullet})\)`? <img src="lecture17_2020JC_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" /> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # What about polynomial regression? ]] .row[.split-50[ .column[.content[ * Suppose the `\(i\)`th treatment corresponds to a measurement `\(x_i\in\Bbb{R}\)`. * We can plot the sample mean response at each `\(x_i\)` value. * If we have `\(t\)` treatments then we have `\(t\)` points on the plot. * By looking at the way the mean values vary with `\(x_i\)` we can put forward a model for `\(E(Y|x)\)`. * Through any `\(t\)` points we can fit a polynomial of degree `\((t-1)\)`. Thus the most general polynomial regression model for this situation is `$$Y_{ij} = \beta_0 + \beta_1 x_{i} + \ldots + \beta_{t-1} x_i^{t-1} + \epsilon_{ij},\;\; \epsilon_{ij}\sim NID(0,\sigma^2).$$` ]] .column.bg-brand-gray[.content[ <img src="lecture17_2020JC_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" /> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Polynomial regression equivalent to ANOVA ]] .row[.split-50[ .column[.content[ * The polynomial model has the same number of freely determined parameters `\((t)\)` as the one-way ANOVA model `$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}.$$` * Both models give exactly the same fit and have the same RSS! * The advantage of the polynomial model is that it suggests a range of models if `$$H_0: \alpha_1 = \ldots = \alpha_t$$` or equivalently `\(\beta_1 = \ldots = \beta_{t-1} = 0\)` does not hold because there are different possibility for the order of the polynomial model. * We can also predict `\(Y_i\)` at any level of `\(x_i\)` not included in the experimental design (0.5,1,2). ]] .column.bg-brand-gray[.content[ .scroll-box-20[ ```r M.poly <- lm(Length ~ poly(Dose,2),data=dat) #ortho data.frame(M.poly$residuals, M.ANOVA$residuals) ``` ``` M.poly.residuals M.ANOVA.residuals 1 -3.78 -3.78 2 3.52 3.52 3 -0.68 -0.68 4 -2.18 -2.18 5 -1.58 -1.58 6 2.02 2.02 7 3.22 3.22 8 3.22 3.22 9 -2.78 -2.78 10 -0.98 -0.98 11 -0.27 -0.27 12 -0.27 -0.27 13 -1.57 -1.57 14 0.53 0.53 15 5.73 5.73 16 0.53 0.53 17 -3.17 -3.17 18 -2.27 -2.27 19 2.03 2.03 20 -1.27 -1.27 21 -2.54 -2.54 22 -7.64 -7.64 23 7.76 7.76 24 -0.64 -0.64 25 0.26 0.26 26 6.36 6.36 27 0.56 0.56 28 -4.64 -4.64 29 -2.84 -2.84 30 3.36 3.36 ``` ] ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Nesting of linear effects ]] .row[.split-60[ .column[.content[ * The estimator for the linear term is `$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^t \sum_{j=1}^{n_i} (x_i - \bar{x})Y_{ij}}{ \sum_{i=1}^t \sum_{j=1}^{n_i} (x_i - \bar{x})^2} = \frac{ \sum_{i=1}^t n_i (x_i - \bar{x})\bar{Y}_{i\bullet}}{ \sum_{i=1}^t n_i (x_i - \bar{x})^2}$$` * Recall: `\(\text{Within SS = Total SS - Treatment SS = TotSS - TSS}\)`. * To test adequacy of fit of the linear regression we use `$$f = \frac{(\text{RSS}_{H_0}-\text{RSS}_{H_1})/(t-2)}{\text{RSS}_{H_1}/(n-t)}$$` * The difference between the two RSS in `\(H_0\)` with linear regression and `\(H_1\)` with one-way ANOVA is `$$\text{RSS}_{H_0}-\text{RSS}_{H_1} = \text{TotSS} - \color{blue}{\hat{\beta}_1^2 S_{xx}} - (\text{TotSS} - \text{TSS})$$` `$$\hspace{5mm} = \text{TSS} - \color{blue}{\text{LinRegSS}}.$$` ]] .column.bg-brand-gray[.content[ ```r anova(M.ANOVA) ``` ``` Analysis of Variance Table Response: Length Df Sum Sq Mean Sq F value Pr(>F) Dose.fac 2 1649.5 824.74 67.072 3.357e-11 *** Residuals 27 332.0 12.30 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r anova(M.poly) #RSS same orthogonal or not ``` ``` Analysis of Variance Table Response: Length Df Sum Sq Mean Sq F value Pr(>F) poly(Dose, 2) 2 1649.5 824.74 67.072 3.357e-11 *** Residuals 27 332.0 12.30 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` The two RSS are the same. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # LinRegSS is a 1 df component of TSS ]] .row[.split-50[ .column[.content[ ```r anova(M.SLR) ``` ``` Analysis of Variance Table Response: Length Df Sum Sq Mean Sq F value Pr(>F) Dose 1 1601.34 1601.34 117.95 1.509e-11 *** Residuals 28 380.15 13.58 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r anova(M.poly) ``` ``` Analysis of Variance Table Response: Length Df Sum Sq Mean Sq F value Pr(>F) poly(Dose, 2) 2 1649.5 824.74 67.072 3.357e-11 *** Residuals 27 332.0 12.30 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ]] .column.bg-brand-gray[.content[ * The simple linear regression model is nested within the order 2 polynomial regression model. * Then there must be a contrast of LinRegSS within TSS in one-way ANOVA model which has the same RSS as the order 2 polynomial regression. * What would the coefficients of the treatment means in the contrast of the one-way ANOVA be? `$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{ \sum_{i=1}^t n_i (x_i - \bar{x})(\bar{Y}_{i\bullet}-\bar Y_{\bullet \bullet})}{ \sum_{i=1}^t n_i (x_i - \bar{x})^2}$$` `$$\hspace{10mm}=\frac{\sum_{i=1}^t \color{blue}{n_i (x_i-\bar{x})}\bar{Y}_{i\bullet}}{\color{blue}{\sum_{i=1}^t n_i (x_i-\bar{x})^2}}.$$` * Then `\(\hat \beta_1=\hat c = \sum_{i=1}^t c_i \bar Y_{i\bullet}\)` where `\(c_i=\color{blue}{\frac{n_i (x_i - \bar{x})}{\sum_{i=1}^t n_i (x_i - \bar{x})^2}}\)`. .font_large.center[ ] .font_large.center[ ] .font_large.center[ ] ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Linear Regression Sum of Squares ]] .row[.split-90[ .column[.content[ * Any contrast coefficients which are proportional to `$$c_i = n_i(x_i-\bar{x})$$` satisfy `\(\sum_{i=1}^t c_i=0\)` as `\(\sum_{i=1}^t c_i=\sum_{i=1}^t n_i (x_i - \bar{x}) = 0\)`. * So a 1 df component corresponding to the TSS is `$$\frac{\left(\sum_{i=1}^t c_i\bar{Y}_{i\bullet}\right)^2}{\sum_{i=1}^t\frac{c_i^2}{n_i}} = \frac{\left[ \sum_{i=1}^t n_i (x_i - \bar{x})\bar{Y}_{i\bullet} \right]^2}{\sum_{i=1}^t \frac{[n_i(x_i-\bar{x})]^2}{n_i}} = \frac{\left[ \sum_{i=1}^t n_i (x_i - \bar{x})(\bar{Y}_{i\bullet}-\bar Y_{\bullet \bullet}) \right]^2}{\sum_{i=1}^t n_i(x_i-\bar{x})^2} = \frac{S_{xy}^2}{S_{xx}}=\hat{\beta}_1^2 S_{xx}=\text{LinRegSS}$$` * You can extend this idea to test adequacy of fit for .brand-blue[higher order polynomials]. ]] .column[.content[ ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Nesting of linear effects ]] .row[.split-50[ .column[.content[ * To test if the sample means vary linearly with `\(x\)` is equivalent to testing * `\(H_0: \beta_2 = \beta_3 = \ldots = \beta_{t-1} = 0\)` * or a test for adequacy of fit of a linear regression. * If `\(H_0\)` is true, `\(Y_{ij} = \beta_0 + \beta_1 x_i + \epsilon_{ij}.\)` * The linear model under `\(H_0\)` is .brand-blue[nested] not only in polynomial regression but also in ANOVA model. * Recall that for a simple linear regression model under `\(H_0\)` has `$$S_{yy} = \text{TotSS} = \text{RSS}_{H_0} + \hat{\beta}^2_1 S_{xx}.$$` * `\(\hat{\beta}^2_1 S_{xx}\)` for `\(\text{LinRegSS}\)` is `\(1601.34\)` in `Length`. ]] .column.bg-brand-gray[.content[ * Thus, we calculate the RSS as `$$\text{RSS}_{H_0} = S_{yy} - \hat{\beta}_1^2 \sum_{i=1}^t \sum_{j=1}^{n_i} (x_i - \bar{x})^2,$$` where `\(\bar{x} = \frac{1}{n} \sum_{i=1}^{t}n_i x_i\)`. ```r anova(M.SLR) ``` ``` Analysis of Variance Table Response: Length Df Sum Sq Mean Sq F value Pr(>F) Dose 1 1601.34 1601.34 117.95 1.509e-11 *** Residuals 28 380.15 13.58 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - Tooth: Linear regression vs ANOVA ]] .row[.split-50[ .column.bg-brand-gray[.content[ ```r anova(M.SLR, M.ANOVA) ``` ``` Analysis of Variance Table Model 1: Length ~ Dose Model 2: Length ~ Dose.fac Res.Df RSS Df Sum of Sq F Pr(>F) 1 28 380.15 2 27 332.00 1 48.146 3.9155 0.05813 . --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` * The corresponding `\(p\)`-value is `\(p=0.058\)`. We would retain `\(H_0\)` at the 5% significance level. * So we should use the simple linear regression. ]] .column.bg-white[.content[ ```r autoplot(M.SLR) ``` <img src="lecture17_2020JC_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Homogeneity of Variances ]] .row[.split-50[ .column.bg-brand-gray[.content[ <br> <img src="lecture17_2020JC_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto;" /> ]] .column[.content[ * The spread of `Length` in the `Medium` group seems to be smaller than other groups. * We have reasons to doubt if the variances are equal across groups. * We can conduct a *statistical test* to test this model assumption. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Bartlett test ]] .row[.split-50[ .column[.content[ * Let `\(s_i^2\)` denote the sample variance in treatment group `\(i=1,\ldots,t\)`. The .brand-blue[pooled variance] is defined as `$${s}_{p}^2 = \sum_{i=1}^t (n_i - 1)\frac{s_i^2}{(n-t)}.$$` * The .brand-blue[Bartlett test] compares `\(H_0: \sigma^2_1 = \ldots = \sigma^2_t\)` vs `\(H_1: \sigma^2_i \neq \sigma^2_j, \exists i\neq j\)`. * The Bartlett test can be calculated with `bartlett.test()` and is defined as `$$B=\frac{(n-t)\log{s}_p^2 - \sum_{i=1}^t (n_i - 1)\log s_i^2}{1 + [\frac{1}{3(t-1)}](\sum_{i=1}^{t}\frac{1}{n_i - 1} - \frac{1}{n-t})} \approx \chi^2_{t-1}$$` * If the Bartlett test is non-significant, we can retain the .brand-blue[homoscedasticity] assumption. ]] .column.bg-brand-gray[.content[ ### Tooth: Bartlett test in R ```r bartlett.test(Length ~ Dose.fac, data=dat) ``` ``` Bartlett test of homogeneity of variances data: Length by Dose.fac Bartlett's K-squared = 4.5119, df = 2, p-value = 0.1048 ``` * Thus, we conclude from the Bartlett test that there is not sufficient evidence in the data against the homoscedasticity assumption in the one-way ANOVA model even though the spread of `Length` in the `medium` group is smaller. * But the spread is not small enough to be significant. ]] ]]