Completely Randomised Designs

class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[

# Completely Randomised Designs
## .black[STAT3022 Applied Linear Models Lecture 23]

### .black[2020/02/20]

]]

.column.bg-brand-charcoal[.content.white[
## Today

1. Simple designs for comparing one factor

2. Complete randomized design

3. Block design, RCBD

4. Systematic design

5. Design generation

]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Design of experiments
]]
.row[
.split-two[
.column[.content[

* Suppose we wish to compare `$t$` number of .brand-blue[**treatments**], and there are `$n$` .brand-blue[**experimental units**] (assuming OU=EU).
* There is some element of variability inherent in the experimental material.
* Initial planning can improve the comparisons by .brand-blue[**reducing the variance in the estimators of the mean responses**].

## Example: `$t=2$` treatments

* Suppose that we use:
  * `$n_1$`  EUs for treatment 1 and
  * `$n_2=n-n_1$` EUs for treatment 2.

]]
.column.bg-brand-gray[.content[
* Assume the model for the analysis is

`$$Y_{ij}=\mu+\alpha_i+\epsilon_{ij}, \ \epsilon_{ij} \sim NID(0,\sigma^2)$$`
`\begin{eqnarray*}
\text{we have } \text{Var}(\alpha_1-\alpha_2) & = & \text{Var}(\bar Y_{1\bullet}-\bar Y_{2\bullet}) \\
& = & \frac{\sigma^2}{n_1}+\frac{\sigma^2}{n-n_1} \\
\frac{\partial}{\partial n_1} \left(\frac{\sigma^2}{n_1}+\frac{\sigma^2}{n-n_1}\right) & = & \sigma^2 \left[-\frac{1}{n_1^2}+\frac{1}{(n-n_1)^2} \right] \\
& = & \sigma^2 \left[-\frac{n(2n_1-n)}{n_1^2(n-n_1)^2} \right]
\end{eqnarray*}`

* Therefore the variance is minimised when `$n_1=\frac{n}{2}$`.
* Thus if there is common variance given by the model, `$n_2=\frac{n}{2}$` implying .brand-blue[balanced] design is better!
* Otherwide, sample size should be proportioinal to variance.

]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Completely randomised design (CRD)
]]
.row[.split-two[
.column[.content[

Assume information is collected from an experiment on,

* `$t$` treatments with `$n_i$` EUs on treatment `$i$`.
* `$i=1,\dots, t$` are allocated to `$n=\sum\limits_{i=1}^t n_i$` plots.
* If we have a .brand-blue[completely randomised design], there are
`$${n \choose n_1}\prod_{i=2}^t{n-\sum_{k=1}^{t-1}n_k \choose n_i}=\frac{n!}{n_1!n_2!\dots n_t!}$$`
possible designs that are equally likely.
* In a given case we choose one of these designs at random.

]]
.column.bg-brand-gray[.content[

### Example: `$t=2$`, `$n_1=2$` and `$n_2=3$`
* The total sample size is `$n=\sum_{i=1}^2 n_i=5$`.
* Allocating the two treatment labels 1 and 2 to the five experimental units is possible in 
$$ {5 \choose 2} = \frac{5!}{2! \times 3!}=10 \ \text{ways.}$$
* Thus a completely randomised design is choosing any of the following with a probability of 10%,
`$$(1,1,2,2,2) \hspace{2mm}  (1,2,1,2,2) \hspace{2mm} (1,2,2,1,2) \hspace{2mm} (1,2,2,2,1)$$`
`$$(2,1,1,2,2) \hspace{2mm} (2,1,2,1,2) \hspace{2mm}  (2,1,2,2,1) \hspace{2mm} (2,2,1,1,2)$$`
`$$(2,2,1,2,1) \hspace{2mm} (2,2,2,1,1)$$`

]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Randomised complete block design (RCBD)
]]
.row[.split-40[
.column[.content[
* In a RCBD there is besides the treatment factor .brand-blue[ an additional] block factor, which groups the EUs such that in each block, .brand-blue[all treatment levels are observed] (i.e. a .brand-blue[**complete**] design).
* Thus there are
  * `$t$` treatments
  * `$b$` blocks of `$t$` plots each,
  * `$t!$` arrangements in each block and
  * `$(t!)^b$` possible designs that are all equally likely.
* Again, randomisation is preferable to a systematic design.

]]
.column.bg-brand-gray[.content[
* For `$b=3$` and `$t=4$`, there are `$(4!)^3=24^3=13,824$` designs.

```r
set.seed(7) #Use R to generate a RCBD
t <- c("A","B","C","D")
replicate(3,sample(t)) #default replace = FALSE
```

```
       [,1] [,2] [,3]
  [1,] "B"  "C"  "B" 
  [2,] "C"  "B"  "C" 
  [3,] "A"  "A"  "D" 
  [4,] "D"  "D"  "A"
```

* Or just .brand-blue[systematically permuting labels], eg

|**Block 1**|**Block 2**|**Block 3**|
|-----|-----|------|
|A|D|C|
|B|A|D|
|C|B|A|
|D|C|B|

]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Problems of systematic designs
]]
.row[.split-50[
.column[.content[
* Systematic designs offer protection against very adverse designs or arrangements  that is possible under randomisation, eg A is consistently positioned in the first few rows; a possible row effect.
* However it does not proect against a potential AB interaction effect.

### Example - Trees

* E.g. if A represents a species of plant that is taller than species B then A being always to the north of B may affect the amount of sunlight B receives.
* This could result in confounding with B response that had not been considered before the experiement was run.

]]
.column.bg-brand-gray[.content[
### Why randomised designs?

* .brand-blue[Randomisation protects] against (unknown) extraneous factors affecting the results systematically.

Main reason: All treatments have the same chance of being affected and the  probability of any design can be calculated.

* Thus .brand-blue[randomisation] ensures .brand-blue[valid estimation of the error variance].

* This is not the case for systematic design in which certain design not having the systematic pattern will have zero chance of being selected.

* If we have .brand-blue[replication], the effect of adverse design is .brand-blue[averaged out] over replicates.

]]
]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Simple designs for comparing one factor
]]
.row[.split-60[
.column[.content[
### Completely randomised designs
* Randomly allocate EUs to treatmentsin the one-way ANOVA:
`$$Y_{ij}=\mu+\alpha_i+\epsilon_{ij}, \ \epsilon_{ij} \sim NID(0,\sigma^2)$$`
where `$Y_{ij}$` denotes the `$j$`-th observarion on treatment `$i$`.
* Randomisation can be applied to each combination of factors in the two-way ANOVA model. Treating all combinations as groups, the two-way ANOVA is essentially one-way ANOVA.  
* Note we are assuming the treatment only affects the mean response and not the variance.

**Remark**: Of course `$\epsilon_{ij} \sim NID(0,\sigma^2)$` does not always hold, in which case alternative methods to perform one-way ANOVA F-tests must be used. For example the nonparametric Kruskal-Willis test (assumed knowledge from STAT2012/DATA2002).

]]
.column.bg-brand-gray[.content[
### Assumptions

1. Independent observations

2. Normal population

3. Common variance (i.e. `$\sigma^2$` not affected by the treatment).

This has to be checked via the well-known tools such as

* residual plot,
* boxplots,
* Q-Q-plots,
* normality tests, and
* test of homogeneity of variance (Barlett's test).

]]
]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Example - Binding percentages of antibiotics
]]
.row[.split-60[
.column[.content[
Data from Ziv and Sulman (1972):

* A certain fraction of antibiotics that is injected into the bloodstream are bound to serum proteins.

* As the extent of the binding increases the systematic uptake of the drug decreases, this affects the effectiveness of the medication.

* Binding percentages characteristic of five widely used antibiotics were determined.

* Normal lactation Israli-Freisian cows and awassi ewes, which were kept under normal feeding and housing conditions, were used.

* Antibiotics were administered by single or multiple intravenous or intra muscular injections.

* Blood sample were obtained before and at intervals after treatments.

]]
.column.bg-brand-gray[.content[

* Treatment: Antibiotics
* Experimental unit: cows or ewes
* Observational unit: blood from EUs

<img src="images/anitbiotics.jpg" width="100%" height="100%">
]]
]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Example - Binding percentages of antibiotics 
]]
.row[.split-70[
.column[.content[
Data, group means and group variances

```r
meani=unname(apply(dat,2,mean)); vari=unname(apply(dat,2,var)); 
mm=rbind(dat,meani,vari); rownames(mm)=c("1","2","3","4","mean","var"); mm
```

```
       Penicillin.G Tetracycline Streptomycin Erythromycin Chloroamphenicol
  1        29.60000     32.60000       5.8000      21.6000            29.20
  2        24.30000     30.80000       6.2000      17.4000            32.80
  3        28.50000     34.80000      11.0000      18.3000            25.00
  4        32.00000     27.30000       8.3000      19.0000            24.20
  mean     28.60000     31.37500       7.8250      19.0750            27.80
  var      10.35333     10.05583       5.6825       3.2625            15.92
```
Organise the data into a tidy form:

```r
dat2 <- dat %>%
 gather("antibiotic","response")
dat2
```

]]
.column.bg-brand-gray[.content[

```
           antibiotic response
  1      Penicillin.G     29.6
  2      Penicillin.G     24.3
  3      Penicillin.G     28.5
  4      Penicillin.G     32.0
  5      Tetracycline     32.6
  6      Tetracycline     30.8
  7      Tetracycline     34.8
  8      Tetracycline     27.3
  9      Streptomycin      5.8
  10     Streptomycin      6.2
  11     Streptomycin     11.0
  12     Streptomycin      8.3
  13     Erythromycin     21.6
  14     Erythromycin     17.4
  15     Erythromycin     18.3
  16     Erythromycin     19.0
  17 Chloroamphenicol     29.2
  18 Chloroamphenicol     32.8
  19 Chloroamphenicol     25.0
  20 Chloroamphenicol     24.2
```

]]
]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Example - Distribution 
]]
.row[.content[

```r
dat2 %>%
  ggplot(aes(antibiotic,response)) +
  geom_violin() + geom_point()
```

]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Example - Thinking about the assumptions
]]
.row[.split-40[
.column[.content[
### Modelling
* The responses *Y* are binding percentage.

* One-way ANOVA is used to investigate the drug differences.

* It is clearly `$Y$` is not exactly normal as `$0 \le Y \le 100$` is bounded.

* `$Y$` may not be iid as blood sample were obtained repeatedly at intervals before and after treatments.
   
* The sample size of 4 is too small to check if normal is a reasonable approximation. This becomes a purely model assumption as we do not have enough information to verify.

]]
.column.bg-brand-gray[.content[

### Example - Overall/omnibus test

* Test 
$$ H_0: \alpha_1=\alpha_2 =\alpha_3=\alpha_4=\alpha_5 $$
i.e. all drugs have the same average binding percentage.

* Use the F-test
`$$f=\frac{\sum_{i=1}^t n_i(\bar{Y}_{i\bullet}-Y_{\bullet \bullet})^2/(t-1)}{\sum_{i=1}^t \sum_{j=1}^{n_i}(Y_{ij}-\bar{Y}_{i\bullet})^2/(n-t)}\sim F_{t-1,n-t}$$`

* Recall that 
`$$\text{Within SS}=\text{RSS}=S_0=\sum_{i=1}^t (n_i-1) s_i^2,$$`
where `$s_i^2$` is the sample variance from the `$i$`-th sample or treatment group.
]]
]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Example - Overall/omnibus test
]]
.row[.split-50[
.column[.content[
* Since `$p=\Pr(F_{4,15}\ge 40.88)\le 10^{-6}$`, there is strong evidence of a difference in binding percentages.
* Calculations by hand:
 * `$S_{yy}= \text{Total SS}$` 
 `$=(29.6^2+24.3^2+\dots+24.2^2)-\frac{458.7^2}{20}=1616.645$`
 * `$SS_A= \text{TSS}$` 
 `$=(\frac{114.4^2}{4}+\dots+\frac{111.2^2}{4})-\frac{458.7^2}{20}=1480.823$`
 * `$\text{RSS}=1616.645 - 1480.823 = 135.822$`.

```r
mi=tapply(dat2$response,dat2$antibiotic,sum); mi[1:3]; mi[4:5]
```

```
  Chloroamphenicol     Erythromycin     Penicillin.G 
             111.2             76.3            114.4
```

```
  Streptomycin Tetracycline 
          31.3        125.5
```
  
]]
.column.bg-brand-gray[.content[

```r
M1 <- lm(response ~ antibiotic, dat2)
anova(M1)
```

```
  Analysis of Variance Table
  
  Response: response
             Df  Sum Sq Mean Sq F value   Pr(>F)    
  antibiotic  4 1480.82  370.21  40.885 6.74e-08 ***
  Residuals  15  135.82    9.05                     
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
shapiro.test(M1$residuals) #fail to disprove normal
```

```
  
  	Shapiro-Wilk normality test
  
  data:  M1$residuals
  W = 0.97171, p-value = 0.7904
```

```r
sum(dat2$response)
```

```
  [1] 458.7
```

```r
RSS=anova(M1)[2,2]; Rdf=anova(M1)[2,1]; RMS=anova(M1)[2,3]
```

]]
]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Example - CIs for estimates and pairwise differences
]]
.row[.split-50[
.column[.content[
* The next step is to investigate the possible differences via condifence intervals.
* A table of sample means is a useful summary of the mean effects.
* In general the CI's of interest should be specified according to research objectives .brand-blue[before] data collection, instead of simply looking at the observed means.
* Below are the CIs for parameter estimates.

```r
confint(M1)
```

```
                               2.5 %    97.5 %
  (Intercept)             24.5931009  31.00690
  antibioticErythromycin -13.2602402  -4.18976
  antibioticPenicillin.G  -3.7352402   5.33524
  antibioticStreptomycin -24.5102402 -15.43976
  antibioticTetracycline  -0.9602402   8.11024
```

]]
.column.bg-brand-gray[.content[
* Assume that one of the research objectives is to compare `Chloroamphenicol` and `Erythromycin`.
* Of course the largest difference is between drugs `Tetracycline` and `Streptomycin`.
* A 95% CI for the difference in average binding percentage for `$\alpha_5-\alpha_4$` is

`\begin{eqnarray*}
(\bar{Y}_{5\bullet}-\bar{Y}_{4\bullet})\pm t_{15}(0.975) \sqrt{\frac{\hat{\sigma}^2}{4}+\frac{\hat{\sigma}^2}{4}}&=&-8.73\pm 4.53\\ &=&(-3.26,-4.2)
\end{eqnarray*}`

where 
`$\bar{Y}_{5\bullet}-\bar{Y}_{4\bullet} = 27.8-19.075=8.725$`, 
`$\displaystyle \hat{\sigma}^2=\frac{\text{RSS}}{n-t}=\frac{135.8225}{15}=9.0548$` and the CI width is 
`$t_{15}(0.975)\sqrt{\hat{\sigma}^2/2}=2.131 \times \sqrt{9.0548/2}=4.535$`.

]]
]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Example - CIs of pairwise differences (unadjusted)
]]
.row[.content[

```r
library(emmeans)
pp <- pairs(emmeans(M1,"antibiotic"),adjust="none")
confint(pp)
```

```
   contrast                        estimate   SE df lower.CL upper.CL
   Chloroamphenicol - Erythromycin     8.72 2.13 15     4.19    13.26
   Chloroamphenicol - Penicillin.G    -0.80 2.13 15    -5.34     3.74
   Chloroamphenicol - Streptomycin    19.98 2.13 15    15.44    24.51
   Chloroamphenicol - Tetracycline    -3.58 2.13 15    -8.11     0.96
   Erythromycin - Penicillin.G        -9.53 2.13 15   -14.06    -4.99
   Erythromycin - Streptomycin        11.25 2.13 15     6.71    15.79
   Erythromycin - Tetracycline       -12.30 2.13 15   -16.84    -7.76
   Penicillin.G - Streptomycin        20.77 2.13 15    16.24    25.31
   Penicillin.G - Tetracycline        -2.77 2.13 15    -7.31     1.76
   Streptomycin - Tetracycline       -23.55 2.13 15   -28.09   -19.01
  
  Confidence level used: 0.95
```

]]

---

.row.bg-brand-blue.white[.content.vmiddle[
# Example - CIs of pairwise differences (unadjusted) 
]]
.row[.content[

```r
plot(pp) + geom_vline(xintercept=0, size=2, color="red")
```

]]