class: split-70 with-border hide-slide-number bg-brand-red background-image: url("images/USydLogo-black.svg") background-size: 200px background-position: 2% 90% .column.white[.content[ <br><br><br> # ANCOVA ## .black[STAT3022 Applied Linear Models Lecture 28] <br><br><br> ### .black[2020/02/20] ]] .column.bg-brand-charcoal[.content.white[ ## Today 1. Analysis of covariance 2. Concomitant variable 3. Common slope model ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Fun fact ]] .row[.split-60[ .column[.content[ Eden and Fisher (1927) *Studies in crop variation: IV*. The experimental determination of the value of top dressings with cereals introduced a method to factor out the effects of conditions that are not part of the experimental design. ]] .column.bg-brand-gray[.content[ <img src="images/Youngronaldfisher2.jpg" width="80%" height="80%"> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: Optimal fish meal for ducklings ]] .row[.split-70[ .column[.content[ * Experiment was to determine the optimum amount of fish meal to be used as a supplement in rations for young growing ducklings. * There were four groups of differing amount of supplementary rations (5%, 10%, 15% and 20%) with five ducks in each group. * The `age` of the ducks (in weeks) and the `weight` of the ducks (in grams) are measured at the end of the experiment. |Weight | Age | Weight | Age | Weight | Age | Weight | Age | |---|---|---|---|---|---|---|---| |46.1 | 1 | 67.8 | 2 | 52.2 | 1 | 100.0 | 2 | |151.6| 5| 142.8| 4 | 190.8| 4| 249.2|4 | |287.1| 8| 365.2| 7 | 367.2| 6| 525.0| 6| |545.9| 11| 637.7| 10| 547.8| 7| 942.8|9 | |634.1| 12| 855.9| 12| 1012.8| 11| 1210.2|12 | ]] .column.bg-brand-gray[.content[ <img src="images/ducklings.jpg" width="90%" height="90%"> .bottom_abs.width100[ Castillo (1938) Determination of the optimum amount of fsh meal rations. Philippine Agriculturist 27 (18). ] ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: Fish meal one-way ANOVA ]] .row[.split-60[ .column[.content[ ```r Meal<-factor(rep(c("5%","10%","15%","20%"),times=5), levels=c("5%", "10%", "15%", "20%")) Age<-c(1,2,1,2,5,4,4,4,8,7,6,6,11,10,7,9,12,12,11,12) Weight<-c( 46.1, 67.8,52.2,100,151.6,142.8,190.8, 249.2,287.1,365.2,367.2,525,545.9,637.7,547.8, 942.8,634.1,855.9,1012.8,1210.2) dat <- data.frame(Age, Weight, Meal) summary(aov(log(Weight) ~ Meal, data=dat)) ``` ``` Df Sum Sq Mean Sq F value Pr(>F) Meal 3 0.989 0.3296 0.288 0.834 Residuals 16 18.331 1.1457 ``` * The meal effect is insignificant. Why? ]] .column.bg-brand-gray[.content[ ```r dat ``` ``` Age Weight Meal 1 1 46.1 5% 2 2 67.8 10% 3 1 52.2 15% 4 2 100.0 20% 5 5 151.6 5% 6 4 142.8 10% 7 4 190.8 15% 8 4 249.2 20% 9 8 287.1 5% 10 7 365.2 10% 11 6 367.2 15% 12 6 525.0 20% 13 11 545.9 5% 14 10 637.7 10% 15 7 547.8 15% 16 9 942.8 20% 17 12 634.1 5% 18 12 855.9 10% 19 11 1012.8 15% 20 12 1210.2 20% ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: Fish meal plots ]] .row[.split-50[ .column[.content[ ```r ggplot(dat, aes(Meal, log(Weight), col=Meal)) + geom_boxplot() + theme_classic() + guides(col=FALSE) + coord_flip() ``` <img src="lecture28_2020JC_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> * No clear difference between 4 boxplots. ]] .column.bg-brand-gray[.content[ ```r ggplot(dat, aes(Age, log(Weight), col=Meal)) + geom_point(size=3) + geom_line() + theme_classic() ``` <img src="lecture28_2020JC_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" /> * Consistent order in Meal across Age. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Analysis of Covariance ]] .row[.split-40[ .column[.content[ * Analysis of covariance (ANCOVA) incorporates regression with (treatment) groups into the analysis of designed experiments to adjust treatment comparisons for concomitant variables}. * .brand-blue[Concomitant variables] ( `\(x_1, x_2, \dots, x_k\)` ) are quantitative extraneous variables that are thought to influence the response that was not taken into account in the design of experiment. * It is assumed that the concomitant variables are .brand-blue[not affected] by the treatments. * That is, the two factors of meal and age are .brand-blue[independent]. ]] .column.bg-brand-gray[.content[ * The model for one treatment factor and one concomitant variable is given as `\(Y_{ij} = \mu + \beta x_{ij} + \alpha_i + \epsilon_{ij},\hspace{3mm} \epsilon\sim NID(0,\sigma^2),\hspace{2mm} i = 1,\ldots,t,\)` `\(j= 1,\dots, n_i.\)` * The ANOVA is effectively carried out .brand-blue[on the residuals after regression] on the true but unknown `\(\beta\)`'s, `$$\underbrace{Y_{ij}-\beta_1 x_{1ij}-\ldots -\beta_k x_{kij}}_{\approx R_{ij} \text{ from regression on conc. var.}}=\mu_{i}+\epsilon_{ij},\hspace{3mm} i=1,\ldots,t.$$` * This is a regression model with .brand-blue[different intercepts] for the treatment groups but .brand-blue[common slope]. `\(Y_{ij}=\underbrace{(\mu+\alpha_i)}_{\beta_{0,i}}+\beta x_{ij} + \epsilon_{ij},\hspace{2mm}\epsilon\sim NID(0,\sigma^2),\hspace{2mm} i=1,\ldots,t,\)` `\(j= 1,\dots,n_i.\)` ]] ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: Fish meal ANCOVA (not orthogonal) ]] .row[.content[ ```r summary(M1 <- aov(log(Weight) ~ Age + Meal, data = dat)) #Meal depends on age; Meal is significant now. ``` ``` Df Sum Sq Mean Sq F value Pr(>F) Age 1 16.458 16.458 258.89 7.18e-11 *** Meal 3 1.909 0.636 10.01 0.000716 *** Residuals 15 0.954 0.064 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r summary(aov(log(Weight) ~ Meal + Age, data = dat)) #Age depends on Mean. Make sense? Which one is correct? ``` ``` Df Sum Sq Mean Sq F value Pr(>F) Meal 3 0.989 0.330 5.184 0.0117 * Age 1 17.378 17.378 273.365 4.88e-11 *** Residuals 15 0.954 0.064 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r round(lm(M1)$coeff,3) #So age comes first. ``` ``` (Intercept) Age Meal10% Meal15% Meal20% 3.566 0.255 0.306 0.622 0.818 ``` ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: Fish meal - common slope model ]] .row[.split-50[ .column[.content[ * Meal becomes significant after allowing for the variability due to age within each meal group. `$$Y_{ij} = \mu + \beta x_{ij} + \alpha_i + \epsilon_{ij},\hspace{3mm} \epsilon\sim NID(0,\sigma^2),$$` `\(i = 1,\ldots,t,\;\; j= 1,\dots, n_i.\)` So for each meal group, $$ \widehat{Y}= `\begin{cases} (3.566 + 0.000) + 0.255x, & \text{if in Meal 5% group} \\ (3.566 + 0.306) + 0.255x, & \text{if in Meal 10% group}\\ (3.566 + 0.622) + 0.255x, & \text{if in Meal 15% group}\\ (3.566 + 0.818) + 0.255x, & \text{if in Meal 20% group}\\ \end{cases}` $$ ### Thinking Question How is this different to ftting a simple linear regression model to each group? Can we have different intercept to each group as well? ]] .column.bg-brand-gray[.content[ ```r gg_color_hue <- function(n) { hues = seq(15,375,length = n + 1) hcl(h = hues, l = 65, c = 100)[1:n]} line.df <- data.frame(intercept=lm(M1)$coeff[1] + c(0, lm(M1)$coeff[3:5]), slope=lm(M1)$coef[2], Meal=c("5%", "10%", "15%", "20%")) ggplot(dat, aes(Age, log(Weight), col=Meal)) + geom_point(size=3) + theme_classic() + geom_abline(slope=line.df$slope, intercept=line.df$intercept, col=gg_color_hue(4)) ``` <img src="lecture28_2020JC_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> ]] ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # ANCOVA: Treatment contrast ]] .row[.content[ * Consider a contrast `\(\sum_{i=1}^t c_i \alpha_i,\hspace{3mm} \sum_{i=1}^t c_i = 0.\)` * In one-way ANOVA, it is estimated by `$$\sum_{i=1}^t c_i\hat{\alpha}_i=\sum_{i=1}^t c_i \Big(\bar{Y}_{i\bullet}-\hat{\beta} \bar{x}_{i\bullet}\Big),$$` where `\(\hat{\beta}\)` is the estimate for the coefficient of `\(x\)`. * Then the variance of this contrast estimate is `$$\text{var}\left(\sum_{i=1}^t c_i\Big(\bar{Y}_{i\bullet}- \hat{\beta}\bar{x}_{i\bullet}\Big)\right) =\sigma^2\bigg( \sum_{i=1}^t\frac{c_i^2}{n_i}+\frac{(\sum_{i=1}^t c_i \bar{x}_{i\bullet})^2}{S_{0xx}}\bigg).$$` * `\(S_{0xx}\)` is the residual SS if a one-way model is fitted to the `\(x\)`-data, ie taking Age as `\(Y\)` variable and Meal as `\(x\)` variable. As `\(x\)` is the outcome variable, this is Total `\(\text{S}_{xx}\)` minus SS for meal. ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # ANCOVA: Treatment contrast ]] .row[.split-50[ .column[.content[ * Find a 95% CI for `\(\alpha_3 - \alpha_2\)` with `\(c_1=0,c_2=-1\)`, `\(c_3=1,c_4=0\)` and `\(n_i\equiv 5\)` and `\(\hat{\sigma}^2 = 0.064\)` (P.7). ```r round(c(tapply(Age, Meal, mean), tapply(log(Weight),Meal,mean)),3) ``` ``` 5% 10% 15% 20% 5% 10% 15% 20% 7.400 7.000 5.800 6.600 5.453 5.658 5.668 6.067 ``` * From one-way ANOVA on the `\(x\)`-values, `\(S_{0xx} = 267.2.\)` ```r anova(lm(Age ~ Meal)) ``` ``` Analysis of Variance Table Response: Age Df Sum Sq Mean Sq F value Pr(>F) Meal 3 7.0 2.3333 0.1397 0.9347 Residuals 16 267.2 16.7000 ``` ]] .column.bg-brand-gray[.content[ ```r sig2=sigma(M1)^2; S0xx=sum(lm(Age ~ Meal)$res^2) round(c(sig2,S0xx),4) ``` ``` [1] 0.0636 267.2000 ``` `\(\sum_{i=1}^t c_i \Big(\bar{Y}_{i\bullet}-\hat{\beta} \bar{x}_{i\bullet}\Big)\)` <br> `\(=(5.668-0.255\times 5.8)-(5.658-0.255\times 7)=0.316\)` * So standard error for `\(\hat\alpha_3 - \hat\alpha_2\)` is `\(\sqrt{ \sigma^2 \bigg[ \sum_{i=1}^t\frac{c_i^2}{n_i} + \frac{(\sum_{i=1}^t c_i \bar{x}_{i\bullet})^2}{S_{0xx}} \bigg] }\)` `\(=\sqrt{0.0636\bigg[ 2(\frac{1}{5})+\frac{(5.8-7)^2}{267.2}\bigg]}=0.161\)` So the 95% CI for `\(\alpha_3 - \alpha_2\)` is `\(0.316 \pm t_{15}(0.975) \times 0.161= (-0.026114, \ 0.658219)\)` and so the difference is insignificant. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: Linear ANCOVA model ]] .row[.split-50[ .column[.content[ * For more complex models it is better to proceed via the linear model theory. * Construct the design matrix `\(\boldsymbol{X}\)` with `\(\mathbf{1}\)` in the first column, the `\(x_{ij}\)` values in the second (if there are `\(k\)` covariates use columns `\(2,\ldots,k+1\)`) and the remaining columns are 0-1 factor codes (treatment constraint incorporated). * `\(\boldsymbol{\theta}=(\mu,\beta,\alpha_2,\dots,\alpha_t)\)` * `\(\hat{\boldsymbol{\theta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1}\boldsymbol{X}^\top \boldsymbol{Y}\)` * Write the contrasts of interest as `\(\mathbf{c}^\top\boldsymbol{\theta}\)`. * The estimator is `\(\mathbf{c}^\top\hat{\boldsymbol{\theta}}\)` and its variance is `\(\sigma^2 \mathbf{c}^\top(\boldsymbol{X}^\top \boldsymbol{X})^{-1}\mathbf{c}\)`. * These results can be used to construct CIs and tests. How? ]] .column.bg-brand-gray[.content[ ### LS estimators ```r x.t <- model.matrix(~Meal - 1) #dummy var for factor X <- cbind(1, Age, x.t[,-1]) b <- solve( t(X) %*% X) %*% t(X) %*% log(Weight) round(b[,1],3) ``` ``` Age Meal10% Meal15% Meal20% 3.566 0.255 0.306 0.622 0.818 ``` ```r round(coef(M1)[1:4],3); round(coef(M1)[5],3) ``` ``` (Intercept) Age Meal10% Meal15% 3.566 0.255 0.306 0.622 ``` ``` Meal20% 0.818 ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: design matrix and LS estimate ]] .row[.split-40[ .column[.content[ ``` Age Meal10% Meal15% Meal20% 1 1 1 0 0 0 2 1 2 1 0 0 3 1 1 0 1 0 4 1 2 0 0 1 5 1 5 0 0 0 6 1 4 1 0 0 7 1 4 0 1 0 8 1 4 0 0 1 9 1 8 0 0 0 10 1 7 1 0 0 11 1 6 0 1 0 12 1 6 0 0 1 13 1 11 0 0 0 14 1 10 1 0 0 15 1 7 0 1 0 16 1 9 0 0 1 17 1 12 0 0 0 18 1 12 1 0 0 19 1 11 0 1 0 20 1 12 0 0 1 ``` ]] .column.bg-brand-gray[.content[ ### Covariance of LS estimators ```r res <- log(Weight) - X %*% b RSS <- t(res) %*% res; RSS <- RSS[1,1] covLS <- RSS/15 * solve(t(X)%*%X) round(covLS, 4) ``` ``` Age Meal10% Meal15% Meal20% 0.0257 -0.0018 -0.0134 -0.0155 -0.0141 Age -0.0018 0.0002 0.0001 0.0004 0.0002 Meal10% -0.0134 0.0001 0.0255 0.0129 0.0128 Meal15% -0.0155 0.0004 0.0129 0.0260 0.0130 Meal20% -0.0141 0.0002 0.0128 0.0130 0.0256 ``` ```r round(vcov(M1),4) ``` ``` (Intercept) Age Meal10% Meal15% Meal20% (Intercept) 0.0257 -0.0018 -0.0134 -0.0155 -0.0141 Age -0.0018 0.0002 0.0001 0.0004 0.0002 Meal10% -0.0134 0.0001 0.0255 0.0129 0.0128 Meal15% -0.0155 0.0004 0.0129 0.0260 0.0130 Meal20% -0.0141 0.0002 0.0128 0.0130 0.0256 ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: Predict weight and pairwise constrast ]] .row[.split-50[ .column[.content[ ### Predict weight at Age=10 and Meal=10% ```r predict.at=data.frame(Age=10,Meal="10%") pred0=predict(M1,newdata=predict.at, interval="prediction",level=0.95) exp(pred0) #take exp for weight ``` ``` fit lwr upr 1 615.7019 338.9529 1118.412 ``` ```r x0=c(1,10,1,0,0); y0=sum(x0*b) #check calculation sig2=sigma(M1)^2 sey0=sqrt(t(x0)%*%covLS%*%x0+sig2) round(exp(c(y0,y0-qt(0.975,15)*sey0, y0+qt(0.975,15)*sey0)),4) ``` ``` [1] 615.7019 338.9529 1118.4115 ``` ]] .column.bg-brand-gray[.content[ ### Pairwise CI Construct a 95% CI to compare the average effect of the 10% and 15% fish meal diets using the transformed data. ```r c <- matrix(c(0,0,-1, 1,0)) est <- t(c)%*%b se <- sqrt(t(c)%*%covLS%*%c) round(c(est,se),4) ``` ``` [1] 0.3161 0.1605 ``` ```r t <- qt(0.975, 15) round(c(est - t * se, est + t * se),6) #agree ``` ``` [1] -0.026114 0.658219 ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: different slopes for each Meal group? ]] .row[.split-50[ .column[.content[ ```r M2 <- lm(log(Weight) ~ Age*Meal, data = dat) round(coef(M2)[1:4],4); round(coef(M2)[5:8],4) ``` ``` (Intercept) Age Meal10% Meal15% 3.7039 0.2364 0.1866 0.2389 ``` ``` Meal20% Age:Meal10% Age:Meal15% Age:Meal20% 0.7464 0.0160 0.0610 0.0085 ``` ```r b=coef(M2) ``` So for each meal group, $$ \widehat{Y}= `\begin{cases} (3.704 + 0.000) + (0.236+0.000)x, & \text{if 5% meal} \\ (3.704 + 0.187) + (0.236+0.016)x, & \text{if 10% meal}\\ (3.704 + 0.239) + (0.236+0.061)x, & \text{if 15% meal}\\ (3.704 + 0.746) + (0.236+0.009)x, & \text{if 20% meal}\\ \end{cases}` $$ ]] .column.bg-brand-gray[.content[ ```r gg_color_hue <- function(n) { hues = seq(15,375,length = n + 1) hcl(h = hues, l = 65, c = 100)[1:n]} line.df <- data.frame(intercept=coef(M2)[1]+ c(0,coef(M2)[3:5]), slope=coef(M2)[2]+c(0,coef(M2)[6:8]), Meal=c("5%", "10%", "15%", "20%")) ggplot(dat, aes(Age, log(Weight), col=Meal)) + geom_point(size=3) + theme_classic() + geom_abline(slope=line.df$slope, intercept=line.df$intercept, col=gg_color_hue(4)) ``` <img src="lecture28_2020JC_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto;" /> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: design matrix with interaction terms ]] .row[.split-60[ .column[.content[ ```r X ``` ``` Age Meal10% Meal15% Meal20% Meal10% Meal15% Meal20% 1 1 1 0 0 0 0 0 0 2 1 2 1 0 0 2 0 0 3 1 1 0 1 0 0 1 0 4 1 2 0 0 1 0 0 2 5 1 5 0 0 0 0 0 0 6 1 4 1 0 0 4 0 0 7 1 4 0 1 0 0 4 0 8 1 4 0 0 1 0 0 4 9 1 8 0 0 0 0 0 0 10 1 7 1 0 0 7 0 0 11 1 6 0 1 0 0 6 0 12 1 6 0 0 1 0 0 6 13 1 11 0 0 0 0 0 0 14 1 10 1 0 0 10 0 0 15 1 7 0 1 0 0 7 0 16 1 9 0 0 1 0 0 9 17 1 12 0 0 0 0 0 0 18 1 12 1 0 0 12 0 0 19 1 11 0 1 0 0 11 0 20 1 12 0 0 1 0 0 12 ``` ]] .column.bg-brand-gray[.content[ ```r round(anova(M2),2) ``` ``` Analysis of Variance Table Response: log(Weight) Df Sum Sq Mean Sq F value Pr(>F) Age 1 16.46 16.46 240.82 <2e-16 *** Meal 3 1.91 0.64 9.31 <2e-16 *** Age:Meal 3 0.13 0.04 0.65 0.6 Residuals 12 0.82 0.07 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` * It is clear that the interaction term is not significant. * There is no significant difference in slope. ]] ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # A note on ANCOVA ]] .row[.content[ * ANCOVA is sensible only if common slope model is a good fit to the data and if the covariate values are unrelated to the groups. <br> <br> <img src="images/ANCOVA.png" width="70%" height="70%"> ]]