class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>

# Outliers and High Leverage Points

## .black[STAT3022 Applied Linear Models Lecture 8]

<br><br><br>

### .black[2020/02/17]
]]

.column.bg-brand-charcoal[.content.white[
## Today

1. Outlying points in `\(\Bbb{R}^p\)`
1. Leverage
1. Studentized residuals
1. Cook's distance
1. Outlier detection
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# What exactly is an outlier?
]]

.row[.split-two[
.column[.content[
* No hard and fast definition.
* Sometimes defined as a data point far away from the range of the observed `\(x\)` values.
* Alternatively, defined as a data point that comes from a different model than the rest of the data.
* Notice that the second definition depends on the model in question.

<br><br>

.blockquote[
Would you consider the blue point in either plot an outlier?
]
]]
.column[.content[
<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Leverage and Influence
]]

.row[.split-40[
.column[.content[
* <span class="brand-blue">Leverage</span> - a point with high leverage has the potential to dramatically impact the regression model.
* <span class="brand-blue">Influence</span> - measures how much impact a point actually has on the regression model.

##### Red dashed line: fit using the black data points only.
##### Black line: fit using all data, including the outlier indicated by a circle.
]]
.column.bg-brand-gray[.content[
<center>
<img src="images/LeverageInfluence2.png" width="80%" height="80%">
</center>
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Outlying points in `\(\Bbb{R}^p\)` - High leverage (case A)
]]

.row[.split-two[
.column.bg-brand-gray[.content[
* As with simple linear regression, the fitted model should not be used to predict `\(Y\)` values for `\(\boldsymbol{x}\)` combinations that are well away from the set of observed `\(\boldsymbol{x}_i\)` values.
* This is not always easy to detect.
* For example, each component of a new point `\(\boldsymbol{x}_0\)` may be well within the observed range of the corresponding values in `$$\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots,\boldsymbol{x}_n$$` while `\(\boldsymbol{x}_0\)` is still a long way from the observed points in `\(p\)`-dimensional space.
]]
.column[.content[
### Example - Outlying points in `\(\Bbb{R}^2\)`

* Here, `\(\boldsymbol{x}_0\)` has `\(x_1\)` and `\(x_2\)` coordinates well within their respective ranges but `\(\boldsymbol{x}_0\)` is not close to the observed sample values in 2-dimensional space.
* In higher dimensions this type of behaviour is even harder to detect, but we need to be on guard against extrapolating to extreme values (see the example below).
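A minimal simulated sketch of this situation (the data here are made up for illustration, not the marketing example): a point whose coordinates lie inside both marginal ranges can still sit far from the point cloud, and its leverage gives this away.

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- 0.9 * x1 + rnorm(50, sd = 0.3)          # x1 and x2 strongly correlated
x1 <- c(x1, 1); x2 <- c(x2, -1)               # x0 = (1, -1): inside both marginal ranges
X <- cbind(1, x1, x2)                         # design matrix with intercept
h <- diag(X %*% solve(t(X) %*% X) %*% t(X))   # leverages h_ii
h[51]                                         # x0 has a much larger leverage than the rest
range(h[-51])
```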
<center> <img src="images/outlierR2.png" width="60%"> </center> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Leverage ]] .row[.split-60[ .column[.content[ * Recall that the fitted values are given as `$$\hat{\boldsymbol{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{ \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top}_{\text{projection to} \ Y}\boldsymbol{Y}$$` * The projection matrix `\(\mathbf{P}_X = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\)` projects the response to the fitted values, i.e. `\(\mathbf{P}_X\boldsymbol{Y}=\hat{\boldsymbol{Y}}\)`, and also `\((\mathbf{I} - \mathbf{P}_X)\boldsymbol{Y}=\hat{\boldsymbol{\epsilon}}=\boldsymbol{R}\)` since `\(\boldsymbol{Y}-\hat{\boldsymbol{Y}}=\boldsymbol{R}\)`. * The matrix `\(\mathbf{P}_X\)` is also referred to as the hat matrix which is commonly denoted as `\(\mathbf{H}\)`. * The `\(i\)`-th diagonal element of `\(\mathbf{H}\)`, `\(h_{ii}\)`, is called the .brand-blue[leverage] of the `\(i\)`th point `\((\boldsymbol{x}_i,Y_i)\)` and `\(h_{ii} = \boldsymbol{x}_i^\top (\mathbf{X}^\top\mathbf{X})^{-1}\boldsymbol{x}_i.\)` * Also `\(h_{ii} = Var(\boldsymbol{x}_i^\top\hat{\boldsymbol{\beta}})/\sigma^2 \geq 0\)` since `$$Var(\boldsymbol{x}_i^\top\color{blue}{\hat{\boldsymbol{\beta}}})/\sigma^2=\boldsymbol{x}_i^\top\color{blue}{(\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})}\boldsymbol{x}_i/\sigma^2=\boldsymbol{x}_i^\top (\mathbf{X}^\top\mathbf{X})^{-1}\boldsymbol{x}_i=h_{ii}.$$` ```r fit<-lm(log(sales)~log(youtube)*facebook,data=marketing[-c(131,156),]) ``` ]] .column.bg-brand-gray[.content[ * Leverages have many properties. * For example they are always between zero and one, `$$0\leq h_{ii} \leq 1.$$` * This is true because `\begin{eqnarray*} Var(R_i) &=& Var((\mathbf{I}-\mathbf{H})\boldsymbol{Y})_{ii} \\ &=& \left((\mathbf{I}-\mathbf{H})Var(\boldsymbol{Y})(\mathbf{I}-\mathbf{H})^\top\right)_{ii}\\ &=& \left(\sigma^2(\mathbf{I}-\mathbf{H})^2\right)_{ii}=\left(\sigma^2(\mathbf{I}-\mathbf{H})\right)_{ii}\\ &=& (1-h_{ii})\sigma^2 \geq 0 \end{eqnarray*}` since `\(\mathbf{H}^2= \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{\mathbf{X}^\top \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}}_{\mathbf{I}}\mathbf{X}^\top\)` `\(= \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top=\mathbf{H}\)` is idempotent, `\((\mathbf{I}-\mathbf{H})^2=\mathbf{I}^2-2\mathbf{H}+\mathbf{H}^2=\mathbf{I}-\mathbf{H}\)` is also idempotent and `\(Var(\boldsymbol{Y})=\sigma^2\mathbf{I}\)`. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # More on leverage values ]] .row[.split-two[ .column.bg-brand-gray[.content[ * `\(\operatorname{trace}\)` of a square matrix `\(\mathbf{H}\)` is defined as the sum of diagonal elements. Hence `\(\sum_{i=1}^n h_{ii} = \operatorname{trace}(\mathbf{H})\)`. * Note `\(\operatorname{trace}(\mathbf{A}\mathbf{B}) = \operatorname{trace}(\mathbf{B}\mathbf{A})\)` for `\(\mathbf{A}\)` and `\(\mathbf{B}\)` of commensurate size. * The average leverage is `\(p/n\)` since `\begin{eqnarray*} \operatorname{trace}(\mathbf{H}) &=& \operatorname{trace}(\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \color{blue}{\mathbf{X}^\top}) \\ &=& \operatorname{trace}( \underbrace{\color{blue}{\mathbf{X}^\top} \mathbf{X}}_{p\times p \text{ matrix}}(\mathbf{X}^\top\mathbf{X})^{-1}) \\ &=& \operatorname{trace}( \mathbf{I} ) = p. 
* We say a point is a .brand-blue[high leverage point] if `$$h_{ii} > \frac{2p}{n}.$$`
* The leverage value `\(h_{ii}\)` does not depend on the response `\(Y_i\)`.
]]
.column[.content[
* Points with high leverage can exert a lot of .brand-blue[influence] on the parameter estimates.
* If `\(h_{ii}\)` is large (close to 1) then `\(Var(R_i)=(1-h_{ii})\sigma^2\)` is close to 0, so the `\(i\)`-th point must have a residual close to its expected value of 0.
* That is, the point pulls the fitted regression towards itself so that its residual is close to 0.
* High leverage points are <span class="brand-blue">extreme</span> points in that their `\(\boldsymbol{x}_i\)` vector lies a long way from the other `\(\boldsymbol{x}_i\)`'s in `\(p\)`-dimensional space.

<p align="right">
<img src="images/preg_reg_resize.png" width="40%" height="40%">
</p>
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Extracting leverage values in R
]]

.row[.split-two[
.column[.content[
.content-box-blue[
<b>Extracting leverage values in `base`</b>:

```r
round(as.vector(lm.influence(fit)$hat), 4)
```

```
[1] 0.0154 0.0207 0.0790 0.0138 0.0115 0.1599 0.0105 0.0054 0.1532 0.0220
[11] 0.0165 0.0077 0.0291 0.0109 0.0100 0.0251 0.0115 0.0212 0.0067 0.0055
[21] 0.0083 0.0228 0.0451 0.0102 0.0112 0.0286 0.0063 0.0109 0.0093 0.0080
[31] 0.0115 0.0058 0.0163 0.0105 0.0165 0.0309 0.0266 0.0243 0.0109 0.0151
[41] 0.0073 0.0091 0.0113 0.0156 0.0187 0.0064 0.0097 0.0206 0.0108 0.0110
[51] 0.0213 0.0095 0.0188 0.0216 0.0104 0.0283 0.0523 0.0057 0.0302 0.0086
[61] 0.0266 0.0243 0.0116 0.0060 0.0142 0.0125 0.0148 0.0073 0.0090 0.0218
[71] 0.0086 0.0069 0.0229 0.0125 0.0077 0.0696 0.0580 0.0057 0.0670 0.0106
[81] 0.0062 0.0247 0.0063 0.0193 0.0203 0.0079 0.0064 0.0120 0.0055 0.0189
[91] 0.0135 0.0560 0.0110 0.0153 0.0070 0.0077 0.0206 0.0069 0.0262 0.0134
[101] 0.0223 0.0179 0.0199 0.0081 0.0126 0.0182 0.0347 0.0181 0.1236 0.0096
[111] 0.0175 0.0163 0.0083 0.0079 0.0202 0.0096 0.0074 0.0198 0.0095 0.0321
[121] 0.0057 0.0255 0.0254 0.0081 0.0108 0.0087 0.0898 0.0199 0.0304 0.0122
[131] 0.0300 0.0466 0.0111 0.0239 0.0314 0.0351 0.0109 0.0107 0.0189 0.0073
[141] 0.0111 0.0109 0.0122 0.0069 0.0167 0.0200 0.0336 0.0260 0.0103 0.0155
[151] 0.0102 0.0071 0.0135 0.0070 0.0149 0.0182 0.0591 0.0058 0.0071 0.0093
[161] 0.0078 0.0107 0.0068 0.0251 0.0441 0.0194 0.0077 0.0195 0.0153 0.0062
[171] 0.0259 0.0138 0.0236 0.0381 0.0105 0.0133 0.0327 0.0111 0.0173 0.0203
[181] 0.0198 0.0273 0.0096 0.0223 0.0165 0.0076 0.0158 0.0426 0.0263 0.0104
[191] 0.0781 0.0154 0.0093 0.0355 0.0132 0.0124 0.0252 0.0176
```
]
]]
.column[.content[
.content-box-blue[
<b>Extracting leverage values using `broom`</b>:

```r
print(augment(fit)$.hat, 2)
```

```
[1] 0.0154 0.0207 0.0790 0.0138 0.0115 0.1599 0.0105 0.0054 0.1532 0.0220
[11] 0.0165 0.0077 0.0291 0.0109 0.0100 0.0251 0.0115 0.0212 0.0067 0.0055
[21] 0.0083 0.0228 0.0451 0.0102 0.0112 0.0286 0.0063 0.0109 0.0093 0.0080
[31] 0.0115 0.0058 0.0163 0.0105 0.0165 0.0309 0.0266 0.0243 0.0109 0.0151
[41] 0.0073 0.0091 0.0113 0.0156 0.0187 0.0064 0.0097 0.0206 0.0108 0.0110
[51] 0.0213 0.0095 0.0188 0.0216 0.0104 0.0283 0.0523 0.0057 0.0302 0.0086
[61] 0.0266 0.0243 0.0116 0.0060 0.0142 0.0125 0.0148 0.0073 0.0090 0.0218
[71] 0.0086 0.0069 0.0229 0.0125 0.0077 0.0696 0.0580 0.0057 0.0670 0.0106
[81] 0.0062 0.0247 0.0063 0.0193 0.0203 0.0079 0.0064 0.0120 0.0055 0.0189
[91] 0.0135 0.0560 0.0110 0.0153 0.0070 0.0077 0.0206 0.0069 0.0262 0.0134
[101] 0.0223 0.0179 0.0199 0.0081 0.0126 0.0182 0.0347 0.0181 0.1236 0.0096
[111] 0.0175 0.0163 0.0083 0.0079 0.0202 0.0096 0.0074 0.0198 0.0095 0.0321
[121] 0.0057 0.0255 0.0254 0.0081 0.0108 0.0087 0.0898 0.0199 0.0304 0.0122
[131] 0.0300 0.0466 0.0111 0.0239 0.0314 0.0351 0.0109 0.0107 0.0189 0.0073
[141] 0.0111 0.0109 0.0122 0.0069 0.0167 0.0200 0.0336 0.0260 0.0103 0.0155
[151] 0.0102 0.0071 0.0135 0.0070 0.0149 0.0182 0.0591 0.0058 0.0071 0.0093
[161] 0.0078 0.0107 0.0068 0.0251 0.0441 0.0194 0.0077 0.0195 0.0153 0.0062
[171] 0.0259 0.0138 0.0236 0.0381 0.0105 0.0133 0.0327 0.0111 0.0173 0.0203
[181] 0.0198 0.0273 0.0096 0.0223 0.0165 0.0076 0.0158 0.0426 0.0263 0.0104
[191] 0.0781 0.0154 0.0093 0.0355 0.0132 0.0124 0.0252 0.0176
```
]
]]
]]
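---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Leverage values from the hat matrix
]]

.row[.content[
A minimal sketch (added here for illustration, not part of the original lecture code) that recomputes the leverages of `fit` directly from its design matrix, checks them against the built-in extractor, and flags points above the `\(2p/n\)` cut-off:

```r
X <- model.matrix(fit)                           # design matrix of the fitted model
H <- X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix H = X (X'X)^{-1} X'
h <- diag(H)                                     # leverages h_ii
all.equal(unname(h), unname(hatvalues(fit)))     # agrees with the built-in extractor
p <- ncol(X); n <- nrow(X)                       # p = 4, n = 198 here
sum(h)                                           # trace(H) = p
which(h > 2 * p / n)                             # flag high leverage points (2p/n rule)
```
]]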
---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Studentized residuals
]]

.row[.split-40[
.column.bg-brand-gray[.content[
To obtain residuals with equal variance, many texts recommend dividing `\(R_i\)` by an estimate of its standard deviation, `\(\hat\sigma\sqrt{1-h_{ii}}\)`, to get the .brand-blue[studentized residuals] `\begin{equation*} R_i^* = \frac{R_i}{\hat{\sigma}\sqrt{1-h_{ii}}} \; , \quad\text{ where } \hat{\sigma}^2 = \frac{\text{RSS}}{n-p}, \end{equation*}` as the basis for Q-Q plots to check (A1)-(A4).

Note `\(Var(\epsilon_i)=\sigma^2\)` is constant, but `\(Var(\hat \epsilon_i)=Var(R_i)=(1-h_{ii})\sigma^2\)` changes from observation to observation.

### Q-Q plot with studentized residuals

```r
out %>%
  ggplot(aes(sample = .std.resid)) +
  stat_qq() +
  stat_qq_line(color = "blue")
```
]]
.column[.content[

```r
out <- fit %>% augment()
out %>% pull(.std.resid) %>% head()
```

```
[1] 1.12436158 -2.80314862 2.21076468 -0.70047968 0.02172577 2.21185287
```

```r
out %>%
  mutate(stdresid = .resid / sigma(fit) / sqrt(1 - .hat)) %>%
  pull(stdresid) %>%
  head()
```

```
[1] 1.12436158 -2.80314862 2.21076468 -0.70047968 0.02172577 2.21185287
```

<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Influence
]]

.row[.split-two[
.column.bg-brand-gray[.content[
* .brand-blue[Cook's distance], `\(D\)`, is a measure of influence: `\begin{eqnarray*} D_i &=& \dfrac{(\hat{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}_{[-i]})^\top \color{blue}{\widehat{Var}(\hat{\boldsymbol{\beta}})^{-1}}(\hat{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}_{[-i]})}{p}\\ &=& \dfrac{(\hat{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}_{[-i]})^\top \color{blue}{(\mathbf{X}^\top\mathbf{X})}(\hat{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}_{[-i]})}{p\color{blue}{\hat \sigma^2}}\\ &=& \dfrac{(\mathbf{X}\hat{\boldsymbol{\beta}}- \mathbf{X}\hat{\boldsymbol{\beta}}_{[-i]})^\top (\mathbf{X}\hat{\boldsymbol{\beta}}- \mathbf{X}\hat{\boldsymbol{\beta}}_{[-i]})}{p \hat \sigma^2}\\ &=&\sum_{j=1}^n \frac{(\hat Y_j - \hat Y_{j[-i]})^2}{p\hat\sigma^2} \\ &=&\frac{R_i^2 h_{ii}}{(1-h_{ii})^2p\hat\sigma^2}, \end{eqnarray*}` where `\(\hat{\boldsymbol{\beta}}_{[-i]}\)` and `\(\hat Y_{j[-i]}\)` are, respectively, the least squares estimate and fitted values obtained by refitting the model without the `\(i\)`-th data point `\((\boldsymbol{x}_i,Y_i)\)`.
]]
.column[.content[
* In this course, a point is identified as an outlier if `$$D_i > F^{-1}_{p,n-p}(0.5),$$` the median of an `\(F_{p,n-p}\)` distribution.
* Note that we always have `\(F^{-1}_{p,n-p}(0.5) > 0.65\)` for all `\(p\geq 2\)`.
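As a quick numerical check (a sketch added here, not from the original slide code), the final expression above can be computed directly from the residuals and leverages and compared with R's built-in `cooks.distance()`:

```r
# Recompute D_i = R_i^2 h_ii / ((1 - h_ii)^2 p sigma_hat^2) for the fitted model
r  <- resid(fit)                          # residuals R_i
h  <- hatvalues(fit)                      # leverages h_ii
p  <- length(coef(fit))                   # number of regression coefficients
s2 <- deviance(fit) / df.residual(fit)    # sigma_hat^2 = RSS / (n - p)
D  <- r^2 * h / ((1 - h)^2 * p * s2)
all.equal(D, cooks.distance(fit))         # matches the built-in function
```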
```r
round(as.vector(cooks.distance(fit))[1:6], 4)
```

```
[1] 0.0049 0.0415 0.1048 0.0017 0.0000 0.2328
```

```r
round(augment(fit)$.cooksd[1:6], 4)
```

```
[1] 0.0049 0.0415 0.1048 0.0017 0.0000 0.2328
```

```r
qf(0.5, 4, 196)  # median of the F(4, 196) distribution
```

```
[1] 0.8420844
```
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Automatic diagnostic plots
]]

.row[.split-two[
.column[.content[
* First, you should be able to construct all diagnostic plots without shortcuts!

```r
par(mfrow = c(2, 2))  # get ready to make a 2 x 2 plot
plot(fit)
```

<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]]
.column.bg-brand-gray[.content[

```r
library(ggfortify)
autoplot(fit)
```

<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Automatic diagnostic plots for full data
]]

.row[.split-two[
.column.bg-brand-gray[.content[

```r
fit1 <- lm(log(sales) ~ log(youtube) * facebook, data = marketing)  # this is the FULL data
autoplot(fit1)
```

<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />
]]
.column[.content[
Looking at the covariates alone:

<center>
<iframe src="lecture08_extra.html" width="600" height="450" frameBorder="0"> </iframe>
</center>

###### The outliers are at the lower end of log(youtube); the log transformation seems to create them, since log(0.0498) `\(\approx\)` -3 and log(0.0183) `\(\approx\)` -4 differ far more than 0.0498 and 0.0183 do.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Any outliers?
]]

.row[.split-60[
.column.bg-brand-gray[.content[

```r
p <- 4    # number of regression coefficients
n <- 200  # number of observations in the full data
as.vector(which(cooks.distance(fit1) > qf(0.5, p, n - p)))
```

```
[1] 131
```

```r
as.vector(which(lm.influence(fit1)$hat > 2 * p / n))
```

```
[1] 3 6 9 57 76 77 79 92 109 127 131 156 159 193
```

* Observation 131 is an outlier under the definition given earlier.
* Observations 3, 6, 9, 57, 76, 77, 79, 92, 109, 127, 131, 156, 159 and 193 are considered high leverage points using the cut-off in this course.
* Of course, the cut-off we introduced is just a rule of thumb and you should consider potential outliers within the context of the data.
]]
.column[.content[

```r
round(as.vector(head(lm.influence(fit)$hat, 5)), 4)
```

```
[1] 0.0154 0.0207 0.0790 0.0138 0.0115
```

```r
round(as.vector(head(lm.influence(fit1)$hat, 5)), 4)
```

```
[1] 0.0147 0.0174 0.0621 0.0138 0.0114
```

* Since the outliers in the full data carry high leverage, the leverage of the other data points is relatively smaller.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Detecting outliers
]]

.row[.split-two[
.column[.content[
Outliers may be detected by:

* Inspection of a scatterplot of the data.
* Inspection of a .brand-blue[fitted line] plot (where the fitted regression line is superimposed on the scatterplot).
* Inspection of a plot of residuals versus fitted values or a covariate.
* Inspection of .brand-blue[high leverage points] and points with .brand-blue[large Cook's distance].

Any data point that lies an unusual distance from the fitted line, or has a large residual in comparison to the other data points, may be considered an outlier.
]]
.column.bg-brand-gray[.content[
### Practice - What to do when you find outliers

* .red[Do not ignore them!]
* Investigate whether data have been mis-recorded (go back to the data source if possible).
* If an outlier is not due to transcription errors, then it may need to be removed before the model is refitted.
* In general, refit the model after removing a single outlier at a time, because removing one outlier can alter the fitted model and hence make other points appear less (or more) inconsistent with it.
* The field of .brand-blue[robust statistics] is concerned with more sophisticated ways of dealing with outliers.
]]
]]
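---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Putting the cut-offs together
]]

.row[.content[
A small sketch (added here for illustration, not part of the original lecture code) that collects the two rules of thumb from this lecture, leverage above `\(2p/n\)` and Cook's distance above the `\(F_{p,n-p}\)` median, for the full-data model `fit1`:

```r
library(broom)   # for augment()
library(dplyr)   # for mutate(), filter() and select()

p <- length(coef(fit1))   # 4 coefficients
n <- nobs(fit1)           # 200 observations

augment(fit1) %>%
  mutate(obs = row_number()) %>%
  filter(.hat > 2 * p / n | .cooksd > qf(0.5, p, n - p)) %>%
  select(obs, .hat, .cooksd)
```

Remember that these cut-offs are only rules of thumb: flagged points should be investigated in the context of the data, not removed automatically.
]]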