class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>

# Outliers and High Leverage Points

## .black[STAT3022 Applied Linear Models Lecture 8]

<br><br><br>

### .black[2020/02/17]
]]

.column.bg-brand-charcoal[.content.white[
## Today

1. Outlying points in `\(\Bbb{R}^p\)`
1. Leverage
1. Studentized residuals
1. Cook's distance
1. Outlier detection
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# What exactly is an outlier?
]]

.row[.split-two[
.column[.content[
* No hard and fast definition.
* Sometimes defined as a data point far away from the range of the observed `\(x\)` values.
* Alternatively, defined as a data point that comes from a different model than the rest of the data.
* Notice that the second definition depends on the model in question.

<br><br>

.blockquote[
Would you consider the blue point in either plot an outlier?
]
]]
.column[.content[
<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Leverage and Influence
]]

.row[.split-40[
.column[.content[
* <span class="brand-blue">Leverage</span> - a point with high leverage has the potential to dramatically impact the regression model.
* <span class="brand-blue">Influence</span> - measures how much impact a point actually has on the regression model.

##### Red dashed line: fit using the black data points only.
##### Black line: fit using all data, including the outlier indicated by a circle.
]]
.column.bg-brand-gray[.content[
<center>
<img src="images/LeverageInfluence2.png" width="80%" height="80%">
</center>
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Outlying points in `\(\Bbb{R}^p\)` - High leverage (case A)
]]

.row[.split-two[
.column.bg-brand-gray[.content[
* As with simple linear regression, the fitted model should not be used to predict `\(Y\)` values for `\(\boldsymbol{x}\)` combinations that are well away from the set of observed `\(\boldsymbol{x}_i\)` values.
* This is not always easy to detect.
* For example, each component of a new point `\(\boldsymbol{x}_0\)` may be well within the observed range of the corresponding values in `$$\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots,\boldsymbol{x}_n$$` while `\(\boldsymbol{x}_0\)` is still a long way from the observed points in `\(p\)`-dimensional space.
]]
.column[.content[
### Example - Outlying points in `\(\Bbb{R}^2\)`

* Here, `\(\boldsymbol{x}_0\)` has `\(x_1\)` and `\(x_2\)` coordinates well within their respective ranges but `\(\boldsymbol{x}_0\)` is not close to the observed sample values in 2-dimensional space.
* In higher dimensions this type of behaviour is even harder to detect, but we need to be on guard against extrapolating to extreme values (see the example below).
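A minimal simulated sketch of this situation (the data here are made up for illustration, not the marketing example): a point whose coordinates lie inside both marginal ranges can still sit far from the point cloud, and its leverage gives this away.

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- 0.9 * x1 + rnorm(50, sd = 0.3)          # x1 and x2 strongly correlated
x1 <- c(x1, 1); x2 <- c(x2, -1)               # x0 = (1, -1): inside both marginal ranges
X <- cbind(1, x1, x2)                         # design matrix with intercept
h <- diag(X %*% solve(t(X) %*% X) %*% t(X))   # leverages h_ii
h[51]                                         # x0 has a much larger leverage than the rest
range(h[-51])
```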
<center> <img src="images/outlierR2.png" width="60%"> </center> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Leverage ]] .row[.split-60[ .column[.content[ * Recall that the fitted values are given as `$$\hat{\boldsymbol{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{ \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top}_{\text{projection to} \ Y}\boldsymbol{Y}$$` * The projection matrix `\(\mathbf{P}_X = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\)` projects the response to the fitted values, i.e. `\(\mathbf{P}_X\boldsymbol{Y}=\hat{\boldsymbol{Y}}\)`, and also `\((\mathbf{I} - \mathbf{P}_X)\boldsymbol{Y}=\hat{\boldsymbol{\epsilon}}=\boldsymbol{R}\)` since `\(\boldsymbol{Y}-\hat{\boldsymbol{Y}}=\boldsymbol{R}\)`. * The matrix `\(\mathbf{P}_X\)` is also referred to as the hat matrix which is commonly denoted as `\(\mathbf{H}\)`. * The `\(i\)`-th diagonal element of `\(\mathbf{H}\)`, `\(h_{ii}\)`, is called the .brand-blue[leverage] of the `\(i\)`th point `\((\boldsymbol{x}_i,Y_i)\)` and `\(h_{ii} = \boldsymbol{x}_i^\top (\mathbf{X}^\top\mathbf{X})^{-1}\boldsymbol{x}_i.\)` * Also `\(h_{ii} = Var(\boldsymbol{x}_i^\top\hat{\boldsymbol{\beta}})/\sigma^2 \geq 0\)` since `$$Var(\boldsymbol{x}_i^\top\color{blue}{\hat{\boldsymbol{\beta}}})/\sigma^2=\boldsymbol{x}_i^\top\color{blue}{(\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})}\boldsymbol{x}_i/\sigma^2=\boldsymbol{x}_i^\top (\mathbf{X}^\top\mathbf{X})^{-1}\boldsymbol{x}_i=h_{ii}.$$` ```r fit<-lm(log(sales)~log(youtube)*facebook,data=marketing[-c(131,156),]) ``` ]] .column.bg-brand-gray[.content[ * Leverages have many properties. * For example they are always between zero and one, `$$0\leq h_{ii} \leq 1.$$` * This is true because `\begin{eqnarray*} Var(R_i) &=& Var((\mathbf{I}-\mathbf{H})\boldsymbol{Y})_{ii} \\ &=& \left((\mathbf{I}-\mathbf{H})Var(\boldsymbol{Y})(\mathbf{I}-\mathbf{H})^\top\right)_{ii}\\ &=& \left(\sigma^2(\mathbf{I}-\mathbf{H})^2\right)_{ii}=\left(\sigma^2(\mathbf{I}-\mathbf{H})\right)_{ii}\\ &=& (1-h_{ii})\sigma^2 \geq 0 \end{eqnarray*}` since `\(\mathbf{H}^2= \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{\mathbf{X}^\top \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}}_{\mathbf{I}}\mathbf{X}^\top\)` `\(= \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top=\mathbf{H}\)` is idempotent, `\((\mathbf{I}-\mathbf{H})^2=\mathbf{I}^2-2\mathbf{H}+\mathbf{H}^2=\mathbf{I}-\mathbf{H}\)` is also idempotent and `\(Var(\boldsymbol{Y})=\sigma^2\mathbf{I}\)`. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # More on leverage values ]] .row[.split-two[ .column.bg-brand-gray[.content[ * `\(\operatorname{trace}\)` of a square matrix `\(\mathbf{H}\)` is defined as the sum of diagonal elements. Hence `\(\sum_{i=1}^n h_{ii} = \operatorname{trace}(\mathbf{H})\)`. * Note `\(\operatorname{trace}(\mathbf{A}\mathbf{B}) = \operatorname{trace}(\mathbf{B}\mathbf{A})\)` for `\(\mathbf{A}\)` and `\(\mathbf{B}\)` of commensurate size. * The average leverage is `\(p/n\)` since `\begin{eqnarray*} \operatorname{trace}(\mathbf{H}) &=& \operatorname{trace}(\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \color{blue}{\mathbf{X}^\top}) \\ &=& \operatorname{trace}( \underbrace{\color{blue}{\mathbf{X}^\top} \mathbf{X}}_{p\times p \text{ matrix}}(\mathbf{X}^\top\mathbf{X})^{-1}) \\ &=& \operatorname{trace}( \mathbf{I} ) = p. 
* We say a point is a .brand-blue[high leverage point] if `$$h_{ii} > \frac{2p}{n}.$$`
* The leverage value `\(h_{ii}\)` does not depend on the response `\(Y_i\)`.
]]
.column[.content[
* Points with high leverage can exert a lot of .brand-blue[influence] on the parameter estimates.
* If `\(h_{ii}\)` is large (close to 1) then `\(Var(R_i)=(1-h_{ii})\sigma^2\)` is close to 0, so the `\(i\)`-th point must have a residual close to its expected value of 0.
* That is, the point pulls the fitted regression towards itself so that its residual is close to 0.
* High leverage points are <span class="brand-blue">extreme</span> points in that their `\(\boldsymbol{x}_i\)` vector lies a long way from the other `\(\boldsymbol{x}_i\)`'s in `\(p\)`-dimensional space.

<p align="right">
<img src="images/preg_reg_resize.png" width="40%" height="40%">
</p>
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Extracting leverage values in R
]]

.row[.split-two[
.column[.content[
.content-box-blue[
<b>Extracting leverage values in `base`</b>:

```r
round(as.vector(lm.influence(fit)$hat), 4)
```

```
[1] 0.0154 0.0207 0.0790 0.0138 0.0115 0.1599 0.0105 0.0054 0.1532 0.0220
[11] 0.0165 0.0077 0.0291 0.0109 0.0100 0.0251 0.0115 0.0212 0.0067 0.0055
[21] 0.0083 0.0228 0.0451 0.0102 0.0112 0.0286 0.0063 0.0109 0.0093 0.0080
[31] 0.0115 0.0058 0.0163 0.0105 0.0165 0.0309 0.0266 0.0243 0.0109 0.0151
[41] 0.0073 0.0091 0.0113 0.0156 0.0187 0.0064 0.0097 0.0206 0.0108 0.0110
[51] 0.0213 0.0095 0.0188 0.0216 0.0104 0.0283 0.0523 0.0057 0.0302 0.0086
[61] 0.0266 0.0243 0.0116 0.0060 0.0142 0.0125 0.0148 0.0073 0.0090 0.0218
[71] 0.0086 0.0069 0.0229 0.0125 0.0077 0.0696 0.0580 0.0057 0.0670 0.0106
[81] 0.0062 0.0247 0.0063 0.0193 0.0203 0.0079 0.0064 0.0120 0.0055 0.0189
[91] 0.0135 0.0560 0.0110 0.0153 0.0070 0.0077 0.0206 0.0069 0.0262 0.0134
[101] 0.0223 0.0179 0.0199 0.0081 0.0126 0.0182 0.0347 0.0181 0.1236 0.0096
[111] 0.0175 0.0163 0.0083 0.0079 0.0202 0.0096 0.0074 0.0198 0.0095 0.0321
[121] 0.0057 0.0255 0.0254 0.0081 0.0108 0.0087 0.0898 0.0199 0.0304 0.0122
[131] 0.0300 0.0466 0.0111 0.0239 0.0314 0.0351 0.0109 0.0107 0.0189 0.0073
[141] 0.0111 0.0109 0.0122 0.0069 0.0167 0.0200 0.0336 0.0260 0.0103 0.0155
[151] 0.0102 0.0071 0.0135 0.0070 0.0149 0.0182 0.0591 0.0058 0.0071 0.0093
[161] 0.0078 0.0107 0.0068 0.0251 0.0441 0.0194 0.0077 0.0195 0.0153 0.0062
[171] 0.0259 0.0138 0.0236 0.0381 0.0105 0.0133 0.0327 0.0111 0.0173 0.0203
[181] 0.0198 0.0273 0.0096 0.0223 0.0165 0.0076 0.0158 0.0426 0.0263 0.0104
[191] 0.0781 0.0154 0.0093 0.0355 0.0132 0.0124 0.0252 0.0176
```
]
]]
.column[.content[
.content-box-blue[
<b>Extracting leverage values using `broom`</b>:

```r
print(augment(fit)$.hat, 2)
```

```
[1] 0.0154 0.0207 0.0790 0.0138 0.0115 0.1599 0.0105 0.0054 0.1532 0.0220
[11] 0.0165 0.0077 0.0291 0.0109 0.0100 0.0251 0.0115 0.0212 0.0067 0.0055
[21] 0.0083 0.0228 0.0451 0.0102 0.0112 0.0286 0.0063 0.0109 0.0093 0.0080
[31] 0.0115 0.0058 0.0163 0.0105 0.0165 0.0309 0.0266 0.0243 0.0109 0.0151
[41] 0.0073 0.0091 0.0113 0.0156 0.0187 0.0064 0.0097 0.0206 0.0108 0.0110
[51] 0.0213 0.0095 0.0188 0.0216 0.0104 0.0283 0.0523 0.0057 0.0302 0.0086
[61] 0.0266 0.0243 0.0116 0.0060 0.0142 0.0125 0.0148 0.0073 0.0090 0.0218
[71] 0.0086 0.0069 0.0229 0.0125 0.0077 0.0696 0.0580 0.0057 0.0670 0.0106
[81] 0.0062 0.0247 0.0063 0.0193 0.0203 0.0079 0.0064 0.0120 0.0055 0.0189
[91] 0.0135 0.0560 0.0110 0.0153 0.0070 0.0077 0.0206 0.0069 0.0262 0.0134
[101] 0.0223 0.0179 0.0199 0.0081 0.0126 0.0182 0.0347 0.0181 0.1236 0.0096
[111] 0.0175 0.0163 0.0083 0.0079 0.0202 0.0096 0.0074 0.0198 0.0095 0.0321
[121] 0.0057 0.0255 0.0254 0.0081 0.0108 0.0087 0.0898 0.0199 0.0304 0.0122
[131] 0.0300 0.0466 0.0111 0.0239 0.0314 0.0351 0.0109 0.0107 0.0189 0.0073
[141] 0.0111 0.0109 0.0122 0.0069 0.0167 0.0200 0.0336 0.0260 0.0103 0.0155
[151] 0.0102 0.0071 0.0135 0.0070 0.0149 0.0182 0.0591 0.0058 0.0071 0.0093
[161] 0.0078 0.0107 0.0068 0.0251 0.0441 0.0194 0.0077 0.0195 0.0153 0.0062
[171] 0.0259 0.0138 0.0236 0.0381 0.0105 0.0133 0.0327 0.0111 0.0173 0.0203
[181] 0.0198 0.0273 0.0096 0.0223 0.0165 0.0076 0.0158 0.0426 0.0263 0.0104
[191] 0.0781 0.0154 0.0093 0.0355 0.0132 0.0124 0.0252 0.0176
```
]
]]
]]
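---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Leverage values from the hat matrix
]]

.row[.content[
A minimal sketch (added here for illustration, not part of the original lecture code) that recomputes the leverages of `fit` directly from its design matrix, checks them against the built-in extractor, and flags points above the `\(2p/n\)` cut-off:

```r
X <- model.matrix(fit)                           # design matrix of the fitted model
H <- X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix H = X (X'X)^{-1} X'
h <- diag(H)                                     # leverages h_ii
all.equal(unname(h), unname(hatvalues(fit)))     # agrees with the built-in extractor
p <- ncol(X); n <- nrow(X)                       # p = 4, n = 198 here
sum(h)                                           # trace(H) = p
which(h > 2 * p / n)                             # flag high leverage points (2p/n rule)
```
]]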
---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Studentized residuals
]]

.row[.split-40[
.column.bg-brand-gray[.content[
To obtain residuals with equal variance, many texts recommend dividing `\(R_i\)` by an estimate of its standard deviation, `\(\hat\sigma\sqrt{1-h_{ii}}\)`, to get the .brand-blue[studentized residuals] `\begin{equation*} R_i^* = \frac{R_i}{\hat{\sigma}\sqrt{1-h_{ii}}} \; , \quad\text{ where } \hat{\sigma}^2 = \frac{\text{RSS}}{n-p}, \end{equation*}` as the basis for Q-Q plots to check (A1)-(A4).

Note `\(Var(\epsilon_i)=\sigma^2\)` is constant, but `\(Var(\hat \epsilon_i)=Var(R_i)=(1-h_{ii})\sigma^2\)` changes from observation to observation.

### Q-Q plot with studentized residuals

```r
out %>%
  ggplot(aes(sample = .std.resid)) +
  stat_qq() +
  stat_qq_line(color = "blue")
```
]]
.column[.content[

```r
out <- fit %>% augment()
out %>% pull(.std.resid) %>% head()
```

```
[1] 1.12436158 -2.80314862 2.21076468 -0.70047968 0.02172577 2.21185287
```

```r
out %>%
  mutate(stdresid = .resid / sigma(fit) / sqrt(1 - .hat)) %>%
  pull(stdresid) %>%
  head()
```

```
[1] 1.12436158 -2.80314862 2.21076468 -0.70047968 0.02172577 2.21185287
```

<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Influence
]]

.row[.split-two[
.column.bg-brand-gray[.content[
* .brand-blue[Cook's distance], `\(D\)`, is a measure of influence: `\begin{eqnarray*} D_i &=& \dfrac{(\hat{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}_{[-i]})^\top \color{blue}{\widehat{Var}(\hat{\boldsymbol{\beta}})^{-1}}(\hat{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}_{[-i]})}{p}\\ &=& \dfrac{(\hat{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}_{[-i]})^\top \color{blue}{(\mathbf{X}^\top\mathbf{X})}(\hat{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}_{[-i]})}{p\color{blue}{\hat \sigma^2}}\\ &=& \dfrac{(\mathbf{X}\hat{\boldsymbol{\beta}}- \mathbf{X}\hat{\boldsymbol{\beta}}_{[-i]})^\top (\mathbf{X}\hat{\boldsymbol{\beta}}- \mathbf{X}\hat{\boldsymbol{\beta}}_{[-i]})}{p \hat \sigma^2}\\ &=&\sum_{j=1}^n \frac{(\hat Y_j - \hat Y_{j[-i]})^2}{p\hat\sigma^2} \\ &=&\frac{R_i^2 h_{ii}}{(1-h_{ii})^2p\hat\sigma^2}, \end{eqnarray*}` where `\(\hat{\boldsymbol{\beta}}_{[-i]}\)` and `\(\hat Y_{j[-i]}\)` are, respectively, the least squares estimate and fitted values obtained by refitting the model without the `\(i\)`-th data point `\((\boldsymbol{x}_i,Y_i)\)`.
]]
.column[.content[
* In this course, a point is identified as an outlier if `$$D_i > F^{-1}_{p,n-p}(0.5),$$` the median of an `\(F_{p,n-p}\)` distribution.
* Note that we always have `\(F^{-1}_{p,n-p}(0.5) > 0.65\)` for all `\(p\geq 2\)`.
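As a quick numerical check (a sketch added here, not from the original slide code), the final expression above can be computed directly from the residuals and leverages and compared with R's built-in `cooks.distance()`:

```r
# Recompute D_i = R_i^2 h_ii / ((1 - h_ii)^2 p sigma_hat^2) for the fitted model
r  <- resid(fit)                          # residuals R_i
h  <- hatvalues(fit)                      # leverages h_ii
p  <- length(coef(fit))                   # number of regression coefficients
s2 <- deviance(fit) / df.residual(fit)    # sigma_hat^2 = RSS / (n - p)
D  <- r^2 * h / ((1 - h)^2 * p * s2)
all.equal(D, cooks.distance(fit))         # matches the built-in function
```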
```r
round(as.vector(cooks.distance(fit))[1:6], 4)
```

```
[1] 0.0049 0.0415 0.1048 0.0017 0.0000 0.2328
```

```r
round(augment(fit)$.cooksd[1:6], 4)
```

```
[1] 0.0049 0.0415 0.1048 0.0017 0.0000 0.2328
```

```r
qf(0.5, 4, 196)  # median of the F(4, 196) distribution
```

```
[1] 0.8420844
```
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Automatic diagnostic plots
]]

.row[.split-two[
.column[.content[
* First, you should be able to construct all diagnostic plots without shortcuts!

```r
par(mfrow = c(2, 2))  # get ready to make a 2 x 2 plot
plot(fit)
```

<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]]
.column.bg-brand-gray[.content[

```r
library(ggfortify)
autoplot(fit)
```

<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Automatic diagnostic plots for full data
]]

.row[.split-two[
.column.bg-brand-gray[.content[

```r
fit1 <- lm(log(sales) ~ log(youtube) * facebook, data = marketing)  # this is the FULL data
autoplot(fit1)
```

<img src="lecture08_2020JC_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />
]]
.column[.content[
Looking at the covariates alone:

<center>
<iframe src="lecture08_extra.html" width="600" height="450" frameBorder="0"> </iframe>
</center>

###### The outliers are at the lower end of log(youtube); the log transformation seems to create them, since log(0.0498) `\(\approx\)` -3 and log(0.0183) `\(\approx\)` -4 differ far more than 0.0498 and 0.0183 do.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Any outliers?
]]

.row[.split-60[
.column.bg-brand-gray[.content[

```r
p <- 4    # number of regression coefficients
n <- 200  # number of observations in the full data
as.vector(which(cooks.distance(fit1) > qf(0.5, p, n - p)))
```

```
[1] 131
```

```r
as.vector(which(lm.influence(fit1)$hat > 2 * p / n))
```

```
[1] 3 6 9 57 76 77 79 92 109 127 131 156 159 193
```

* Observation 131 is an outlier under the definition given earlier.
* Observations 3, 6, 9, 57, 76, 77, 79, 92, 109, 127, 131, 156, 159 and 193 are considered high leverage points using the cut-off in this course.
* Of course, the cut-off we introduced is just a rule of thumb and you should consider potential outliers within the context of the data.
]]
.column[.content[

```r
round(as.vector(head(lm.influence(fit)$hat, 5)), 4)
```

```
[1] 0.0154 0.0207 0.0790 0.0138 0.0115
```

```r
round(as.vector(head(lm.influence(fit1)$hat, 5)), 4)
```

```
[1] 0.0147 0.0174 0.0621 0.0138 0.0114
```

* Since the outliers in the full data carry high leverage, the leverage of the other data points is relatively smaller.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Detecting outliers
]]

.row[.split-two[
.column[.content[
Outliers may be detected by:

* Inspection of a scatterplot of the data.
* Inspection of a .brand-blue[fitted line] plot (where the fitted regression line is superimposed on the scatterplot).
* Inspection of a plot of residuals versus fitted values or a covariate.
* Inspection of .brand-blue[high leverage points] and points with .brand-blue[large Cook's distance].

Any data point that lies an unusual distance from the fitted line, or has a large residual in comparison to the other data points, may be considered an outlier.
]]
.column.bg-brand-gray[.content[
### Practice - What to do when you find outliers

* .red[Do not ignore them!]
* Investigate whether data have been mis-recorded (go back to the data source if possible).
* If an outlier is not due to transcription errors, then it may need to be removed before the model is refitted.
* In general, refit the model after removing a single outlier at a time, because removing one outlier can alter the fitted model and hence make other points appear less (or more) inconsistent with it.
* The field of .brand-blue[robust statistics] is concerned with more sophisticated ways of dealing with outliers.
]]
]]
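---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Putting the cut-offs together
]]

.row[.content[
A small sketch (added here for illustration, not part of the original lecture code) that collects the two rules of thumb from this lecture, leverage above `\(2p/n\)` and Cook's distance above the `\(F_{p,n-p}\)` median, for the full-data model `fit1`:

```r
library(broom)   # for augment()
library(dplyr)   # for mutate(), filter() and select()

p <- length(coef(fit1))   # 4 coefficients
n <- nobs(fit1)           # 200 observations

augment(fit1) %>%
  mutate(obs = row_number()) %>%
  filter(.hat > 2 * p / n | .cooksd > qf(0.5, p, n - p)) %>%
  select(obs, .hat, .cooksd)
```

Remember that these cut-offs are only rules of thumb: flagged points should be investigated in the context of the data, not removed automatically.
]]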