class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>
# General F-test & Multicollinearity
## .black[STAT3022 Applied Linear Models Lecture 9]

<br><br><br>
### .black[2020/02/17]
]]
.column.bg-brand-charcoal[.content.white[
## Today

1. The general F-test
2. Choosing between alternative models
3. The common slope model
4. The use of I() and offset() in model formulae
5. Multicollinearity
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Hypothesis testing for regression parameters
]]
.row[.split-40[
.column[.content[
Recall: `\(t\)`-test for `\(H_0: \beta=\beta^0\)` vs. `\(H_1: \beta \ne \beta^0\)`

* We use the `\(t\)`-test statistic
`\begin{equation}
t_{obs} = \frac{\hat \beta - \beta^0}{\hat{SE}(\hat \beta)}\sim t_{n-p}.
\label{eq:b1test}
\end{equation}`
* Most commonly `\(\beta^0=0\)`.
* But this only tests one parameter at a time.

<br>
.blockquote[
What if you want to test a hypothesis that constrains ("fixes the values of") several parameters simultaneously?
]
]]
.column.bg-brand-gray[.content[
### Omnibus test problem

For the multiple regression model, `\(p\geq 2\)`,
`\begin{equation*}
Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i,p-1} + \epsilon_i,\quad \epsilon_i \sim NID(0,\sigma^2)
\end{equation*}`
what if you want to test whether all slope parameters are zero:
`$$H_0: \beta_1 = \ldots = \beta_{p-1} = 0\quad \text{vs}\quad H_1: \exists \beta_j \neq 0,\;\; j=1,\ldots,(p-1).$$`
This test problem is also known as the .brand-blue[omnibus test problem] or .brand-blue[overall test].

* If `\(p=2\)` then this is the same as the single parameter test.
* If `\(p>2\)`, is this the same as testing each constrained parameter separately? E.g. are the hypotheses below equivalent?
$$H_0: \beta_1 = \beta_2 = 0\quad\text{vs.}\quad H_1: \beta_1\neq 0 \vee \beta_2\neq 0 $$
`$$H_0: \beta_1 = 0\text{ vs. }
H_1: \beta_1\neq 0 \text{ and } H_0: \beta_2 = 0\text{ vs. } H_1: \beta_2\neq 0$$`
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Example - Catheter data
]]
.row[.split-30[
.column.bg-newblue.white[.content[
* Three variables were observed around a catheter procedure on `\(n =12\)` children:
  * height (`H`) of patient,
  * weight (`W`) of patient,
  * length (`L`) of catheter.
* What is the effect of height and weight on the chosen catheter length for a child with congenital heart disease?

<center>
<img src="images/catheter.jpg"/>
</center>
.bottom_abs.width100.font_small[
Reference: S. Weisberg (1980). Applied Linear Regression, New York: Wiley, 174-202.
]
]]
.column[.content[

```r
dat <- data.frame(
  H=c(42.8,63.5,37.5,39.5,45.5,38.5,43,22.5,37,23.5,33,58),
  W=c(40,93.5,35.5,30,52,17,38.5,8.5,33,9.5,21,79),
  L=c(37,49.5,34.5,36,43,28,37,20,33.5,30.5,38.5,47))
skimr::skim(dat)
```

```
Skim summary statistics
 n obs: 12 
 n variables: 3 

-- Variable type:numeric ----------------------------------------------
 variable missing  n  mean    sd   p0   p25   p50   p75 p100     hist
        H       0 12 40.36 11.94 22.5 36    39    43.62 63.5 ▃▁▆▇▂▁▂▂
        L       0 12 36.21  8.08 20   32.75 36.5  39.62 49.5 ▂▁▅▅▇▂▂▅
        W       0 12 38.12 26.06  8.5 20    34.25 43    93.5 ▅▂▇▁▂▁▂▂
```

<img src="lecture09_2020JC_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Example - Catheter data
]]
.row[.split-60[
.column.bg-brand-gray[.content[

```r
fit <- lm(L ~ H + W, data=dat)
summary(fit)
```

```
Call:
lm(formula = L ~ H + W, data = dat)

Residuals:
   Min     1Q Median     3Q    Max 
-7.048 -1.258 -0.259  1.899  7.004 

Coefficients:
            Estimate Std.
Error t value Pr(>|t|)
(Intercept)  21.0084   8.7512   2.401   0.0399 *
H             0.1964   0.3606   0.545   0.5993  
W             0.1908   0.1652   1.155   0.2777  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.943 on 9 degrees of freedom
Multiple R-squared:  0.8053,  Adjusted R-squared:  0.7621
F-statistic: 18.62 on 2 and 9 DF,  p-value: 0.0006336
```
]]
.column[.content[
* We fit the model
`$$\texttt{L}_i = \beta_0 + \beta_1\texttt{H}_i + \beta_2\texttt{W}_i + \epsilon_i$$`
where `\(\epsilon_i\sim NID(0, \sigma^2)\)`.
* Notice that the `\(t\)`-tests for `\(H_0:\beta_1=0\text{ vs }H_1:\beta_1\neq 0\)` and `\(H_0:\beta_2=0\text{ vs }H_1:\beta_2\neq 0\)` suggest that `\(\beta_1=0\)` and `\(\beta_2=0\)`, since the `\(p\)`-values of 0.6 and 0.3 for .brand-blue[H] and .brand-blue[W] respectively are large.
* So is neither the .brand-blue[height] nor the .brand-blue[weight] of the child relevant to the .brand-blue[length] of catheter?
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Data Exploration and Model Diagnosis
]]
.row[.split-50[
.column[.content[

```r
GGally::ggpairs(dat)
```

![](lecture09_2020JC_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

The two covariates H and W are highly correlated.
]]
.column[.content[

```r
library(ggfortify)
autoplot(fit)
```

![](lecture09_2020JC_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

There are outliers in the QQ-plot and points with high leverage, `\(h_{ii}>2p/n=2(3)/12=0.5\)`.
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Possible testing strategies
]]
.row[.split-60[
.column[.content[
.brand-blue[Could we combine the outcomes to get a single `\(p\)`-value for the combined claim?]
`$$H_0: \beta_1=\beta_2=0 \quad\text{ vs. }\quad H_1:\beta_1\neq 0 \vee\beta_2\neq0$$`

* Using the information in the residuals?
* Working with likelihood methodology?
* Any other ideas?

.brand-blue[Could we generalise the test?]
`$$H_0: \beta_1 = \ldots = \beta_k = \beta_{\text{common}}\text{ vs } H_1: \exists \beta_j \neq \beta_{\text{common}},\;\; j=1,\ldots,k.$$`

.brand-blue[Could we generalise the test even more?]
`$$H_0: \beta_1 = \beta_1^0, \ldots, \beta_k = \beta_{k}^0\text{ vs } H_1: \exists \beta_j \neq \beta_{j}^0,\;\; j=1,\ldots,k.$$`
]]
.column.bg-brand-gray[.content[

```r
fit0 <- lm(L ~ 1, data=dat)     # null
fit1 <- lm(L ~ H + W, data=dat) # full
```

### Residual sum of squares

```r
deviance(fit0)
```

```
[1] 718.7292
```

```r
deviance(fit1)
```

```
[1] 139.913
```

The longer way

```r
c(sum(fit0$residuals^2), sum(fit1$residuals^2))
```

```
[1] 718.7292 139.9130
```
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# The general F-test
]]
.row[.split-60[
.column[.content[
Develop a statistic based on the .brand-blue[residual sum of squares] (RSS) that enables us to test a number of constraints simultaneously.

* We denote the RSS for the largest model in `\(H_1\)`: `\(\text{RSS}_{H_1}\)` with `\(n-p\)` degrees of freedom. Recall: RSS `\(\sim \sigma^2 \chi^2(\text{df})\)`.
* Construct the linear model that would hold if the null hypothesis were true and calculate its RSS, denoted `\(\text{RSS}_{H_0}\)`, with, say, `\(n-q\)` degrees of freedom, where `\(q<p\)`.
* Then, the test statistic
`\begin{equation*}
f = \frac{(\text{RSS}_{H_0} - \text{RSS}_{H_1})/(p-q)}{\text{RSS}_{H_1}/(n-p)} \stackrel{\text{under } H_0}{\sim} F_{p-q, n-p}.
\end{equation*}`
* The `\(p\)`-value for an observed statistic `\(f^*\)` is given by `\(P(F_{p-q, n-p}\geq f^*)\)`.
]]
.column.bg-brand-gray[.content[
### Remarks on the F-statistic

* The denominator `\(\text{RSS}_{H_1}/(n-p) = \hat{\sigma}^2\)` is the adjustment for (estimator of) scale.
* `\((p-q)\)` is the number of linearly independent constraints imposed by `\(H_0\)`.
* If `\(H_0: \beta_j = 0\)` for one and only one `\(j\in\{0,\ldots,p-1\}\)` then the F-test just reduces to the usual `\(t\)`-test.
* Recall: `\(F_{1,n-p} = t^2_{n-p}\)`.
* The F-statistic provided in `summary(lm(...))` is the statistic for the omnibus test problem.
]]
]]

---
layout: false
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Sum of squares in F-test
]]
.row[.content[
<img src="images/Ftest.png" width="80%" height="50%">
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# General thoughts on choosing between models
]]
.row[.split-70[
.column[.content[
.brand-blue[Aim for selecting models]

In choosing between models, statisticians have two aims:

* to choose a *simple* (i.e. not too complex), parsimonious model;
* to choose a model that *fits* the data *well*.

.brand-blue[General thoughts]

* One measure of the complexity of a linear regression model is the number of regression parameters, `\(p\)`. The greater this value, the more complex the model.
* The goodness of the model's fit to the data can be measured by the RSS, which captures the remaining uncertainty.
* Think of model comparison like shopping - is it worth spending more (parameters) in order to get a better (fitting) model?
]]
.column.bg-newblue.white[.content[
### Example: Catheter

* Three variables were observed around a catheter procedure on `\(n =12\)` children:
  * height (`H`) of patient,
  * weight (`W`) of patient,
  * length (`L`) of catheter.

<center>
<img src="images/catheter.jpg"/>
</center>
.bottom_abs.width100.font_small[
Reference: S. Weisberg (1980). Applied Linear Regression, New York: Wiley, 174-202.
]
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Null model: `\(H_0: \beta_1 = \beta_2 = 0\)`
]]
.row[.split-two[
.column[.content[
### Information for the full model

`$$\texttt{M1} \ (p=3): \hspace{2mm} \texttt{L}_i = \beta_0 + \beta_1\texttt{H}_i + \beta_2\texttt{W}_i + \epsilon_i \hspace{2mm}$$`

`\(\text{RSS}_{H_1} = 139.913\)` with `\(\text{df}=9\)` is the RSS from the full model.
```r
fit1 <- lm(L ~ H + W, data = dat)
(RSS_H1 <- deviance(fit1))
```

```
[1] 139.913
```

```r
fit1$df.residual
```

```
[1] 9
```
]]
.column.bg-brand-gray[.content[
### Information for the null model

* For `\(H_0: \beta_1 = \beta_2 = 0\)`, the model is
`$$\texttt{M0} \ (p=1): \ \texttt{L}_i = \beta_0 + \epsilon_i.$$`
* `\(\text{RSS}_{H_0} = 718.729\)` with `\(\text{df}=11\)` is the RSS from the null model, with

```r
fit0 <- lm(L ~ 1, data = dat)
(RSS_H0 <- deviance(fit0))
```

```
[1] 718.7292
```

```r
fit0$df.residual
```

```
[1] 11
```
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Test for the null model vs full model
]]
.row[.content[
The observed `\(f\)` value is
`$$f^*= \frac{(\text{RSS}_{H_0} - \text{RSS}_{H_1})/(p-q)}{\text{RSS}_{H_1}/(n-p)} = \frac{(718.73-139.91)/(3-1)}{139.91/(12-3)} = 18.616,$$`
which implies `\(p\text{-value} = P(F_{2,9}\geq 18.616) = 0.00063\)`. Thus, we reject `\(H_0\)`.

With R:

```r
fobs <- (RSS_H0 - RSS_H1) / 2 / (RSS_H1 / 9)
fobs
```

```
[1] 18.61637
```

```r
1 - pf(fobs, 2, 9) # omnibus p-value
```

```
[1] 0.0006336043
```

Although we can drop .brand-blue[height] or .brand-blue[weight] individually (since they are highly correlated and carry largely overlapping information), we cannot drop both variables.
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Common slope model: `\(H_0: \beta_1 = \beta_2 = \beta \in \Bbb{R}\)`
]]
.row[.split-two[
.column[.content[
### Information for the common slope model

* A common slope appears justified given the similar slope estimates (0.196 and 0.191).
* We consider the common slope model:
`\begin{eqnarray*}
\texttt{M2} \ (p=2): \hspace{2mm} \texttt{L}_i &=& \beta_0 + \beta_1 \texttt{H}_i + \beta_2 \texttt{W}_i + \epsilon_i \\
&=& \beta_0 + \beta \texttt{H}_i + \beta \texttt{W}_i + \epsilon_i \\
&=& \beta_0 + \beta (\texttt{H}_i + \texttt{W}_i) + \epsilon_i.
\end{eqnarray*}`
* The `\(\text{RSS}_{H_0}\)` from the common slope model is 139.915 with `\(\text{df}=10\)`.
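The RSS arithmetic done by hand on these slides can be cross-checked with `anova()`, which carries out exactly this nested F-test. A minimal self-contained sketch (the model names `m_common` and `m_full` are illustrative, not from the slides):

```r
# Cross-check the common-slope-vs-full comparison with anova()
dat <- data.frame(
  H = c(42.8, 63.5, 37.5, 39.5, 45.5, 38.5, 43, 22.5, 37, 23.5, 33, 58),
  W = c(40, 93.5, 35.5, 30, 52, 17, 38.5, 8.5, 33, 9.5, 21, 79),
  L = c(37, 49.5, 34.5, 36, 43, 28, 37, 20, 33.5, 30.5, 38.5, 47))

m_common <- lm(L ~ I(H + W), data = dat)  # H0: beta1 = beta2 = beta
m_full   <- lm(L ~ H + W,   data = dat)   # H1: separate slopes
anova(m_common, m_full)  # same F (0.0001) and p-value (0.9918) as the manual calculation
```

`anova()` on two nested `lm` fits computes `\((\text{RSS}_{H_0}-\text{RSS}_{H_1})/(p-q)\)` over `\(\text{RSS}_{H_1}/(n-p)\)`, i.e. the general F-statistic defined earlier.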
```r
*fit2 <- lm(L ~ 1 + I(H + W), dat)
RSS_H0 <- deviance(fit2); c(RSS_H0, RSS_H1); fit2$coeff
```

```
[1] 139.9148 139.9130
```

```
(Intercept)    I(H + W) 
 21.0965912   0.1925471 
```
]]
.column.bg-brand-gray[.content[
### Test for the common slope vs full model

The observed `\(f\)` value is
`\begin{eqnarray*}
f^* &=& \frac{(\text{RSS}_{H_0}-\text{RSS}_{H_1})/1}{\text{RSS}_{H_1}/(12-3)} = \frac{(139.915-139.913)/1}{139.913/9}\\
&=& 0.00011.
\end{eqnarray*}`
Since `\(p\text{-value} = P(F_{1,9}\geq f^*) = 0.9918\)`, we don't reject `\(H_0\)`.

```r
(fobs <- (RSS_H0 - RSS_H1) / 1 / (RSS_H1 / 9))
```

```
[1] 0.0001124752
```

```r
c(1 - pf(fobs, 1, 9), fit1$df.residual)
```

```
[1] 0.9917696 9.0000000
```
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Simpler common slope model: `\(H_0: \beta_1 = \beta_2 = 0.2\)`
]]
.row[.split-two[
.column[.content[
* We don't reject the common slope model, and the common slope is estimated to be 0.19255!
* Could we simplify this to a more "attractive" common slope model? That is,
`\begin{eqnarray*}
\texttt{M3} \ (p=1): \hspace{2mm} \texttt{L}_i &=& \beta_0 + 0.2 \texttt{H}_i + 0.2 \texttt{W}_i + \epsilon_i
\end{eqnarray*}`
with the slope fixed at 0.2, so that it is no longer a parameter.
* For `\(\texttt{L}_i = \beta_0 + \beta(\texttt{H}_i + \texttt{W}_i) + \epsilon_i\)`, we test the hypotheses:
`$$H_0: \beta = 0.2\quad\text{vs}\quad H_1: \beta \neq 0.2.$$`

```r
fit2 <- lm(L ~ 1 + I(H + W), dat)
RSS_H1 <- deviance(fit2)
*fit3 <- lm(L ~ offset(H/5) + offset(0.2 * W), dat)
RSS_H0 <- deviance(fit3)
c(RSS_H0, RSS_H1, fit2$df.residual)
```

```
[1] 140.7820 139.9148  10.0000
```
]]
.column.bg-brand-gray[.content[
The observed `\(f\)` value is
`\begin{eqnarray*}
f^* & = & \frac{(\text{RSS}_{H_0}-\text{RSS}_{H_1})/1}{\text{RSS}_{H_1}/(12-2)} = \frac{(140.78- 139.915)/1}{139.915/10} \\
& = & 0.062.
\end{eqnarray*}`
Since `\(p\text{-value}=0.8084 > 0.05\)`, we don't reject the simpler hypothesis.
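An `offset()` term enters the model with its coefficient fixed at 1, so `offset(H/5)` (equivalently `offset(0.2 * H)`) hard-codes the slope 0.2 and leaves only the intercept to be estimated. A minimal self-contained sketch (the name `m_fixed` is illustrative):

```r
dat <- data.frame(
  H = c(42.8, 63.5, 37.5, 39.5, 45.5, 38.5, 43, 22.5, 37, 23.5, 33, 58),
  W = c(40, 93.5, 35.5, 30, 52, 17, 38.5, 8.5, 33, 9.5, 21, 79),
  L = c(37, 49.5, 34.5, 36, 43, 28, 37, 20, 33.5, 30.5, 38.5, 47))

# offset() terms are subtracted from the response before fitting,
# so this model has a single free parameter: the intercept
m_fixed <- lm(L ~ offset(0.2 * H) + offset(0.2 * W), data = dat)
coef(m_fixed)      # one coefficient only: the intercept
deviance(m_fixed)  # RSS matches the 140.782 on this slide
```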
```r
(fobs <- (RSS_H0 - RSS_H1) / 1 / (RSS_H1 / 10))
```

```
[1] 0.06197925
```

```r
c(1 - pf(fobs, 1, 10), fit3$coeff) # p-value & coefficient
```

```
            (Intercept) 
  0.8084332  20.5116667 
```
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Summary of F-tests
]]
.row[.content[
<br>

| Test | `\(H_0\)` (*p*) | `\(H_1\)` (*p*) | RSS `\(H_0\)` | RSS `\(H_1\)` | F (*p-q, n-p*) | p-value | Result |
| ----|:------------|:------------|:---------:|:---------:|:--------------:|:-------:|:---------|
| 1 | **M0 (b0; 1)** | **M1 (b0,b1,b2; 3)** | **718.729** | **139.913** | **18.62 (2,9)** | **0.0006** | **Rej** `\(H_0\)` |
| 2 | **M2 (b0,b1=b2; 2)** | **M1 (b0,b1,b2; 3)** | **139.915** | **139.913** | **0.00011 (1,9)** | **0.9918** | **Acc** `\(H_0\)` |
| 3 | **M3 (b0,b1=b2=0.2; 1)** | **M2 (b0,b1=b2; 2)** | **140.782** | **139.915** | **0.062 (1,10)** | **0.8084** | **Acc** `\(H_0\)` |
| 4 | M3 (b0,b1=b2=0.2; 1) | M1 (b0,b1,b2; 3) | 140.782 | 139.913 | 0.0279 (2,9) | 0.9725 | Acc `\(H_0\)` |
| 5 | M0 (b0; 1) | M2 (b0,b1=b2; 2) | 718.729 | 140.782 | 41.053 (1,10) | 0.00008 | Rej `\(H_0\)` |
| 6 | M0 (b0; 1) | M3 (b0,b1=b2=0.2; 1) | - | - | - | - | not nested |

<br>

Test 4: `\(H_0: \beta_1=\beta_2=0.2\)` vs `\(H_1: \beta_1 \ne 0.2 \vee \beta_2 \ne 0.2\)` (two constraints tested jointly)

Test 5: `\(H_0: \beta_1=\beta_2=0\)` vs `\(H_1: \beta_1=\beta_2=\beta\ne 0\)`

Test 6: `\(H_0: \beta_1=\beta_2=0\)` vs `\(H_1: \beta_1=\beta_2=0.2\)`; not nested; df = `\(p-q=0\)`!

Nested models: .brand-blue[M0 - (M2) - M1]; (M3) - M2 - M1. So M3 should be chosen overall.
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Conclusions for the seen F-tests for the catheter data
]]
.row[.content[
* The F-test is *very* versatile and is the basis for the tests undertaken when analyzing ANOVA models.
* For the regression of catheter length on height and weight we note that in the `lm` summary both `\(\beta_1\)` and `\(\beta_2\)` have associated large `\(p\)`-values (0.599, 0.278 resp.) but we can only drop *one* of these terms from the model.
* Main reason: height and weight are .brand-blue[highly correlated]. Therefore `H` and `W` contain similar information about `L`.
* The effect of adding weight to a model already containing height is minimal, and this is what the `\(p\)`-value of the multiple regression coefficient measures.
* Retention of `\(H_0\)` (i.e. acceptance that a smaller model is adequate) does not mean that the `\(p-q\)` variables constrained by `\(H_0\)` are unrelated to the response.
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Dangers of multicollinearity
]]
.row[.content[
When some of the `\(x\)`-variables are highly correlated, the fitted multiple regression models can have

* `\(\hat{\beta}_j\)` coefficients with counter-intuitive signs, `\(j=1,\ldots,p-1\)`,
* terms with large estimated standard errors ( `\(Var(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\)` ),
* several rather large `\(p\)`-values.

Often, removing one `\(x\)`-variable

* changes all of the above with very little impact on `\(R^2\)` (a goodness-of-fit measure).
* In these situations we say there is .brand-blue[weak multicollinearity].
* This comes from the fact that `\(\hat{\beta}_j\)` reflects the additional information provided by variable `\(x_j\)` given that all the other variables have been fitted.
* .brand-blue[Perfect multicollinearity] occurs when one of the `\(x_j\)`'s is an exact linear combination of the other `\(x\)`'s.
]]
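A standard numeric screen for multicollinearity is the variance inflation factor, `\(\text{VIF}_j = 1/(1-R^2_j)\)`, where `\(R^2_j\)` comes from regressing `\(x_j\)` on the other predictors; with only two predictors the two VIFs coincide. A minimal sketch for the catheter data (the rule-of-thumb cutoff of 10 is a common convention, not from these slides; `car::vif()` packages the same computation):

```r
dat <- data.frame(
  H = c(42.8, 63.5, 37.5, 39.5, 45.5, 38.5, 43, 22.5, 37, 23.5, 33, 58),
  W = c(40, 93.5, 35.5, 30, 52, 17, 38.5, 8.5, 33, 9.5, 21, 79),
  L = c(37, 49.5, 34.5, 36, 43, 28, 37, 20, 33.5, 30.5, 38.5, 47))

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other x's
r2_H  <- summary(lm(H ~ W, data = dat))$r.squared
vif_H <- 1 / (1 - r2_H)
vif_H  # well above the usual cutoff of 10, flagging the H-W collinearity
```

A VIF of 1 would mean `\(x_j\)` is uncorrelated with the other predictors; the large value here quantifies why the individual `H` and `W` `\(t\)`-tests are so weak even though the omnibus F-test is highly significant.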