class: split-70 with-border hide-slide-number bg-brand-red background-image: url("images/USydLogo-black.svg") background-size: 200px background-position: 2% 90% .column.white[.content[ <br><br><br> # Variable Selection: <br>Backward and Forward ## .black[STAT3022 Applied Linear Models Lecture 10] <br><br><br> ### .black[2020-03-16] ]] .column.bg-brand-charcoal[.content.white[ ## Today * Backward variable selection * The drop1(), add1() and update() commands * Forward variable selection ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Variable Selection ]] .row[.split-50[ .column[.content[ ### .brand-blue[Hypothesis testing] Aims to test for redundancy/non-redundancy of * single explanatory variables and * groups of explanatory variables. Reason: we want to test hypotheses about a given group of covariates. <br> ### .brand-blue[Statistical learning] What is the "best" group of explanatory variables for describing and/or predicting the response? ]] .column.bg-brand-gray[.content[ ### Example for two covariates * If there are `\((p - 1) = 2\)` covariates then we could consider the models (without interaction yet): `\begin{eqnarray*} (1) && \quad Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i\\ (2) && \quad Y_i = \beta_0 + \beta_1 x_{1i}+ \epsilon_i\\ (3) && \quad Y_i = \beta_0 + \beta_2 x_{2i}+ \epsilon_i\\ (4) && \quad Y_i = \beta_0 + \epsilon_i\\ \end{eqnarray*}` * Which one do we choose? * We could use hypothesis testing to select the "best" model. * Above we considered `\(2^{(3 - 1)}=4\)` models. * How many models with intercept would we consider if we assume an additive structure for all `\((p - 1)\)` covariates for, say, `\(p=21\)`? ]] ]] --- class: bg-brand-red white center middle # .font_large[A LOT] <br> # `\(2^{20} = 1,048,576\)` models --- class: bg-brand-red white center middle count: false # .font_large[A LOT] <br> # `\(2^{20} = 1,048,576\)` models ![](lecture10_2020JC_files/figure-html/unnamed-chunk-3-1.svg)<!-- --> --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Let's formalise the notation for variable selection ]] .row[.split-50[ .column[.content[ ### .brand-blue[Possible subsets] Let `\(\mathbf{m}\)` denote any subset of `\(p_{\mathbf{m}}\)` distinct elements from `\(\{1,\ldots,p-1\}\)`. Remark: Typically the intercept is forced to be part of the model. * Let `\(\mathcal{M}\)` denote a set of linear regression models for the relationship between `\(\boldsymbol{Y}\)` and `\(\mathbf{X}\)`. Remark: Often `\(\mathcal{M}\)` is reduced by preselection. ]] .column.bg-brand-gray[.content[ ### .brand-blue[Example - Three explanatory variables] * There are `\(2^4 = 16\)` distinct subsets of `\(\{0,1,2,3\}\)`: `$$\mathbf{m}: \hspace{4mm} \emptyset, \{0\}, \{1\}, \{0,1\}, \{2\}, \ldots, \{0,1,2,3\}.$$` * If the intercept is forced to be part of the model, then there are `\(2^{4-1} = 2^3 = 8\)` possible subsets. * For simplicity we use the set `\(\mathbf{m}\)` as an abbreviation for the linear regression model of `\(\boldsymbol{Y}\)` on those columns of `\(\mathbf{X}\)` indexed by `\(\mathbf{m}\)`. 
* The linear regression model `\(\mathbf{m}\)` is given by `\begin{equation*} y_i = \beta_{0} + \sum_{j\in{\mathbf{m}\setminus{\{0\}}}}\beta_{j} x_{ij} + \epsilon_{{\mathbf{m}} i},\;\;\epsilon_{{\mathbf{m}} i} \sim NID(0,\sigma^2_{\mathbf{m}}), \end{equation*}` <!-- Think about changing `\(x_{ij}\)` to `\(x_{i,j-1}\)` --> * `\(\sigma_{{\mathbf{m}}}^2\)` is the (unknown) error variance for model `\(\mathbf{m}\)`, * `\(\boldsymbol{\beta}_{{\mathbf{m}}}\)` is the unknown `\(p_{{\mathbf{m}}}\)`-vector of regression parameters. ]] ]] --- class: split-20 .row.bg-brand-blue.white[.content.vmiddle[ # Automated variable selection algorithms... Bushwalking in `\(\mathcal{M}\)` ]] .row[.split-50[ .column[.content[ Automated variable selection procedures "walk along" the following "path": 1. .brand-blue[Choose] a model to start with, e.g. * the model with no covariates (null model), * or the model with all covariates included (full model). 2. .brand-blue[Test] to see if there is an advantage in adding or removing covariates. 3. .brand-blue[Repeat] adding/removing variables until there is no advantage in changing the model. Such a strategy only requires visiting a number of models of *quadratic order* in the number of covariates, rather than all `\(2^{p-1}\)` subsets! ]] .column.bg-gray[.content[ <img src="lecture10_2020JC_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> <img src="lecture10_2020JC_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Purpose of model selection ]] .row[.split-60[ .column[.content[ * The purpose of model selection is to choose one or more models `\(\mathbf{m}\)` from `\(\mathcal{M}\)` with specified desirable properties. * Let `\(\mathbf{m}_0\)` denote the true model which actually generates the data. * Variable selection procedures are intended to identify *useful models* whether or not a true model exists. * In practice .brand-blue[model selection] is more general than .brand-blue[variable selection] because: * there is often more than one way of estimating parameters (least squares, MLE, robust estimators); * there is often more than one class of models (linear regression, non-linear regression, semiparametric regression, nonparametric regression etc). ]] .column.content-box-yellow[.content[ ### Cheese tasting data * Data on cheddar cheese production from the LaTrobe Valley of Victoria. * Taste of the final product is related to the concentration of several chemicals in the cheese. * `\(n = 30\)` samples of cheese were tasted by experts, and the following four variables recorded: * `taste` - Tasters' ratings * `Acetic` - Acetic acid in cheese * `H2S` - Hydrogen sulphide in cheese * `Lactic` - Lactic acid in the cheese. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Backward variable selection ]] .row[.split-50[ .column[.content[ 1. Start with the model containing *all* possible explanatory variables. 2. For each variable in turn, investigate the effect of removing that variable from the current model. 3. *Remove the least informative variable*, unless this variable is nonetheless supplying significant information about the response. 4. Go to step 2. Stop only if all variables in the current model are important. * Implementation depends on how we assess the importance of variables. * Possibilities are to use p-values (F, `\(t\)` or `\(\chi^2\)` tests), GoF criteria ( `\(R^2\)` or `\(R^2_a\)`) or specially designed selection criteria to measure what it means to be informative. 
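* The whole backward pass can be scripted around the `drop1()` command used on the following slides. Below is a minimal sketch, not a canonical implementation, assuming a fitted full model stored as `full` (a name used only here) and the 5% removal threshold:

```r
# Sketch of p-value-based backward elimination; assumes `full` is a fitted lm() object
p_out   <- 0.05
current <- full
repeat {
  tab   <- drop1(current, test = "F")          # F-test for dropping each variable
  pvals <- tab[["Pr(>F)"]]                     # NA for the <none> row
  if (all(is.na(pvals)) || max(pvals, na.rm = TRUE) <= p_out) break
  worst   <- rownames(tab)[which.max(pvals)]   # least informative variable
  current <- update(current, as.formula(paste(". ~ . -", worst)))
}
```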
]] .column.bg-brand-gray[.content[ * It is of interest to the manufacturers to relate the cheese's taste to the "chemical" variables. * Therefore we construct a multiple linear regression model of `taste` on the other variables. * Variable selection will allow us to produce a .brand-blue[parsimonious] model. * Backward variable selection starts with the **full model** (i.e. with all predictors). * Let us have a look at the data first; we will then run a backward selection based on the F-test, deleting the least significant variable at each step as long as its p-value `\(p_{\text{out}}>5\%\)`. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Cheese data ]] .row[.split-50[ .column.bg-brand-gray[.content[ ```r dat <- read.table("data/cheese.txt", header = TRUE) skimr::skim(dat) ``` ``` Skim summary statistics n obs: 30 n variables: 4 -- Variable type:numeric --------------------------------------------------------------------------------------------------- variable missing complete n mean sd p0 p25 p50 p75 p100 hist Acetic 0 30 30 5.5 0.57 4.48 5.24 5.42 5.88 6.46 ▃▂▁▇▁▃▃▂ H2S 0 30 30 5.94 2.13 3 3.98 5.33 7.57 10.2 ▇▇▇▂▅▃▃▂ Lactic 0 30 30 1.44 0.3 0.86 1.25 1.45 1.67 2.01 ▂▃▇▃▇▅▃▃ taste 0 30 30 24.53 16.26 0.7 13.55 20.95 36.7 57.2 ▆▅▇▅▂▅▁▃ ``` ```r fit <- lm(taste ~ Acetic + H2S + Lactic, data=dat) print(broom::tidy(fit), digits = 4) ``` ``` term estimate std.error statistic p.value 1 (Intercept) -28.8768 19.735 -1.46320 0.155399 2 Acetic 0.3277 4.460 0.07349 0.941980 3 H2S 3.9118 1.248 3.13341 0.004247 4 Lactic 19.6705 8.629 2.27957 0.031079 ``` ]] .column.bg-white[.content[ ```r GGally::ggpairs(dat, progress=FALSE) ``` ![](lecture10_2020JC_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Cheese data: Backward selection ]] .row[.split-50[ .column[.content[ Start with the full model: ```r M1 <- lm(taste ~ Acetic + H2S + Lactic, data=dat) M1 <- lm(taste ~ ., data=dat) #same; Step1 fullmodel *drop1(M1, test="F") ``` ``` Single term deletions Model: taste ~ Acetic + H2S + Lactic Df Sum of Sq RSS AIC F value Pr(>F) <none> 2668.4 142.64 Acetic 1 0.55 2669.0 140.65 0.0054 0.941980 H2S 1 1007.66 3676.1 150.25 9.8182 0.004247 ** Lactic 1 533.32 3201.7 146.11 5.1964 0.031079 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r c(deviance(M1),deviance(lm(taste~Acetic+Lactic,data=dat))) ``` ``` [1] 2668.411 3676.069 ``` * Note: `\(F_{1,n-p,1-\alpha}=t^2_{n-p,1-\alpha/2}\)` ]] .column.bg-brand-gray[.content[ * The R-command `drop1(M1, test = "F")` returns F-tests and information criteria for dropping each variable in `M1`. ```r out <- tidy(M1); print(broom::tidy(M1), digits = 4) ``` ``` term estimate std.error statistic p.value 1 (Intercept) -28.8768 19.735 -1.46320 0.155399 2 Acetic 0.3277 4.460 0.07349 0.941980 3 H2S 3.9118 1.248 3.13341 0.004247 4 Lactic 19.6705 8.629 2.27957 0.031079 ``` ```r out$statistic[-1]^2 # t^2=F ``` ``` [1] 0.005400575 9.818243701 5.196444054 ``` * In particular, this includes the `\(p\)`-values of the F-test for `\(H_0: \beta_j = 0\)` versus `\(H_1: \beta_j \neq 0\)` for all `\(j = 1,\ldots,k\)`. * Note the two sets of p-values from the t- and F-tests are the same. Hence the t-test also tests `\(\beta_j=0\)` with the .brand-blue[full model] as the alternative `\(H_1\)`. 
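* As a quick numerical check of this equivalence, here is a small sketch reusing the objects `out` and `M1` created above:

```r
# Two-sided t-test p-values vs drop1-style F-test p-values for the same hypotheses
tvals <- out$statistic[-1]                     # t statistics, intercept removed
df    <- M1$df.residual                        # n - p = 26
cbind(t_pval = 2 * pt(-abs(tvals), df),
      F_pval = pf(tvals^2, 1, df, lower.tail = FALSE))
```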
]] ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # So why use drop1? ]] .row[.content[ * `drop1` gives other statistics (such as AIC), not only the `\(t\)`-test/F-test result. * `drop1` drops all parameters related to the levels of a categorical variable at once. ### Example Recall the iris data, where the categorical variable .brand-blue[Species] has 3 levels and hence 2 parameters. ```r head(iris$Species,3) ``` ``` [1] setosa setosa setosa Levels: setosa versicolor virginica ``` ```r fit_iris <- lm(Sepal.Length ~ Species + Sepal.Width, data=iris) print(broom::tidy(fit_iris), digits = 4) ``` ``` term estimate std.error statistic p.value 1 (Intercept) 2.2514 0.3698 6.089 9.568e-09 2 Speciesversicolor 1.4587 0.1121 13.012 3.478e-26 3 Speciesvirginica 1.9468 0.1000 19.465 2.094e-42 4 Sepal.Width 0.8036 0.1063 7.557 4.187e-12 ``` ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # So why use drop1? ]] .row[.content[ ```r drop1(fit_iris, test="F") ``` ``` Single term deletions Model: Sepal.Length ~ Species + Sepal.Width Df Sum of Sq RSS AIC F value Pr(>F) <none> 28.004 -243.75 Species 2 72.752 100.756 -55.69 189.651 < 2.2e-16 *** Sepal.Width 1 10.953 38.956 -196.23 57.102 4.187e-12 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` * So the two parameters of .brand-blue[Species] are dropped together in .brand-blue[drop1]. Note the df is 2 for this drop of .brand-blue[Species]. * Clearly, it does not make sense to keep some level parameters of a categorical variable but not the others. ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Second pass through backward selection algorithm ]] .row[.split-50[ .column[.content[ #### Back to the Cheddar data * To efficiently delete a variable from regression model `M1`, say `x1`, the `update` command can be used: ``` M2 <- update(M1, . ~ . - x1, data = dat) ``` * The general syntax for updating models is: ``` update(old.model, new.formula, ...) ``` * Note that full stops in the updated formula stand for "whatever was in the corresponding position in the old formula". * The p-values for the remaining parameters change after removing .brand-blue[Acetic], e.g. from 0.004247 to 0.001743 for .brand-blue[H2S]. ]] .column.bg-brand-gray[.content[ ```r M2 <- update(M1, . ~ . - Acetic, data = dat) drop1(M2, test="F") #Step 2 with Acetic dropped ``` ``` Single term deletions Model: taste ~ H2S + Lactic Df Sum of Sq RSS AIC F value Pr(>F) <none> 2669.0 140.65 H2S 1 1193.52 3862.5 149.74 12.0740 0.001743 ** Lactic 1 617.18 3286.1 144.89 6.2435 0.018850 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` Stop here because now every variable is significant under the `\(t\)`-test / F-test. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Backward selection comments ]] .row[.split-50[ .column[.content[ * First pass through reduction algorithm: * Neither `H2S` nor `Lactic` should be dropped, since either deletion gives a considerably worse fit than the full model (as evidenced by the p-values of `\(0.004\)` and `\(0.031\)`). * But `Acetic` should be deleted as it makes little difference (e.g. at 5% significance level) in terms of model fit (p-value of `\(0.942\)` in M1) and `\(R^2\)`. ```r c(summary(M1)$r.squared,summary(M2)$r.squared) ``` ``` [1] 0.6517747 0.6517024 ``` * If there had been more than one variable with p-value greater than 0.05, then we would have removed the covariate with the largest p-value. 
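* The `Acetic` row of `drop1(M1, test = "F")` is exactly the partial F-test comparing the reduced fit `M2` with the full fit `M1`; a quick check using the objects already created:

```r
# Partial F-test for dropping Acetic: compare reduced model M2 with full model M1
anova(M2, M1)   # reproduces the Acetic row of drop1(M1): F = 0.0054, p = 0.942
```

* Backward selection by p-values is therefore a sequence of such nested-model comparisons.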
]] .column.bg-brand-gray[.content[ * Second pass through reduction algorithm: * Neither of the covariates `H2S` and `Lactic` can be removed from the model without an important loss of fit. * Hence, the "best" model for the data (according to backward selection with significance level 5%) is `$$E[{\tt\color{mycolr} taste}] = -27.59 + 3.95 \cdot {\tt H2S} + 19.89 \cdot {\tt Lactic}.$$` ```r print(broom::tidy(M2), digits = 4) ``` ``` term estimate std.error statistic p.value 1 (Intercept) -27.592 8.982 -3.072 0.004813 2 H2S 3.946 1.136 3.475 0.001743 3 Lactic 19.887 7.959 2.499 0.018850 ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Forward variable selection ]] .row[.split-50[ .column[.content[ 1. Start with the model containing *none* of the possible explanatory variables, i.e. the null model $$ \mathbf{m} = \emptyset. $$ 2. For each variable in turn, investigate the effect of adding that variable to the current model. 3. .brand-blue[Add the most informative/significant variable], unless this variable is not supplying significant information about the response. 4. Go to step 2. Stop only if none of the non-included variables is important. ]] .column.bg-brand-gray[.content[ Comments: * Implementation depends on how we assess the importance of variables. * Possibilities are to use `\(p\)`-values (F, `\(t\)` or `\(\chi^2\)` tests), GoF criteria ( `\(R^2\)` or `\(R^2_a\)`) or specially designed selection criteria to measure what it means to be informative. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # The add1 command ]] .row[.split-50[ .column[.content[ * Start with the null model ``` M1 <- lm(Y ~ 1, data = dat) ``` (or, more generally, a current model with explanatory variables in the set `\(\mathbf{m}_1\)`). * Then, the R-command ```r add1(M1, scope = ~ x1 + x2 + ... + xk, data = dat, test = "F") ``` returns test statistics and information criteria for adding each of the variables specified after the option `scope = ~` to the current model. ]] .column.bg-brand-gray[.content[ Alternatively, list all the variable names by coding up a full model ```r Mf <- lm(Y ~ ., data = dat) ``` and then using ```r add1(M1, scope = Mf, data = dat, test = "F") ``` In particular, this includes the `\(p\)`-values of the F-test for `\(H_0: \beta_j = 0\)` versus `\(H_1: \beta_j \neq 0\)` for all `\(j \not\in \mathbf{m}_1\)`. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Cheese data: Forward selection ]] .row[.split-50[ .column[.content[ First pass through algorithm ```r M1 <- lm(taste ~ 1 , data = dat) #from null model add1(M1, scope = ~ Acetic + H2S + Lactic, data = dat, test = "F") ``` ``` Single term additions Model: taste ~ 1 Df Sum of Sq RSS AIC F value Pr(>F) <none> 7662.9 168.29 Acetic 1 2314.1 5348.7 159.50 12.114 0.001658 ** H2S 1 4376.7 3286.1 144.89 37.293 1.374e-06 *** Lactic 1 3800.4 3862.5 149.74 27.550 1.405e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` * With the smallest p-value, .brand-blue[H2S] is first selected. ]] .column.bg-brand-gray[.content[ Second pass through algorithm ```r M2 <- update(M1, . ~ . + H2S, data = dat) #from model with H2S add1(M2, scope = ~ Acetic + H2S + Lactic, data = dat, test = "F") ``` ``` Single term additions Model: taste ~ H2S Df Sum of Sq RSS AIC F value Pr(>F) <none> 3286.1 144.89 Acetic 1 84.41 3201.7 146.11 0.7118 0.40625 Lactic 1 617.18 2669.0 140.65 6.2435 0.01885 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` * With the smallest p-value, .brand-blue[Lactic] is next selected. 
]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Cheese data: Forward selection ]] .row[.split-50[ .column[.content[ Third pass through algorithm ```r M3 <- update(M2, . ~ . + Lactic, data = dat) add1(M3, scope = ~ Acetic + H2S + Lactic, data = dat, test = "F") ``` ``` Single term additions Model: taste ~ H2S + Lactic Df Sum of Sq RSS AIC F value Pr(>F) <none> 2669.0 140.65 Acetic 1 0.55427 2668.4 142.64 0.0054 0.942 ``` * With a large p-value, .brand-blue[Acetic] is not selected. ]] .column.bg-brand-gray[.content[ "Best" model (according to forward selection with sig. level 5%) ```r summary(M3) ``` ``` Call: lm(formula = taste ~ H2S + Lactic, data = dat) Residuals: Min 1Q Median 3Q Max -17.343 -6.530 -1.164 4.844 25.618 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -27.592 8.982 -3.072 0.00481 ** H2S 3.946 1.136 3.475 0.00174 ** Lactic 19.887 7.959 2.499 0.01885 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 9.942 on 27 degrees of freedom Multiple R-squared: 0.6517, Adjusted R-squared: 0.6259 F-statistic: 25.26 on 2 and 27 DF, p-value: 6.551e-07 ``` ]] ]]
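--- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Cheese data: Scripting the passes ]] .row[.content[ * Here forward selection arrives at the same model as backward selection, `taste ~ H2S + Lactic`, although the two procedures need not agree in general. * The three `add1()` passes above can also be run in a loop. Below is a minimal sketch, not a canonical implementation, writing the 5% entry threshold as `p_in` by analogy with `\(p_{\text{out}}\)`:

```r
# Sketch of p-value-based forward selection for the cheese data
p_in    <- 0.05
current <- lm(taste ~ 1, data = dat)                  # null model
scope   <- ~ Acetic + H2S + Lactic
repeat {
  tab   <- add1(current, scope = scope, test = "F")   # F-test for adding each candidate
  pvals <- tab[["Pr(>F)"]]                            # NA for the <none> row
  if (all(is.na(pvals)) || min(pvals, na.rm = TRUE) > p_in) break
  best    <- rownames(tab)[which.min(pvals)]          # most significant candidate
  current <- update(current, as.formula(paste(". ~ . +", best)))
}
formula(current)   # taste ~ H2S + Lactic, as selected by hand above
```
]]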