class: split-70 with-border hide-slide-number bg-brand-red background-image: url("images/USydLogo-black.svg") background-size: 200px background-position: 2% 90% .column.white[.content[ <br><br><br> # Completely Randomised Designs ## .black[STAT3022 Applied Linear Models Lecture 23] <br><br><br> ### .black[2020/02/20] ]] .column.bg-brand-charcoal[.content.white[ ## Today 1. Simple designs for comparing one factor 2. Complete randomized design 3. Block design, RCBD 4. Systematic design 5. Design generation ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Design of experiments ]] .row[ .split-two[ .column[.content[ * Suppose we wish to compare `\(t\)` number of .brand-blue[**treatments**], and there are `\(n\)` .brand-blue[**experimental units**] (assuming OU=EU). * There is some element of variability inherent in the experimental material. * Initial planning can improve the comparisons by .brand-blue[**reducing the variance in the estimators of the mean responses**]. ## Example: `\(t=2\)` treatments * Suppose that we use: * `\(n_1\)` EUs for treatment 1 and * `\(n_2=n-n_1\)` EUs for treatment 2. ]] .column.bg-brand-gray[.content[ * Assume the model for the analysis is `$$Y_{ij}=\mu+\alpha_i+\epsilon_{ij}, \ \epsilon_{ij} \sim NID(0,\sigma^2)$$` `\begin{eqnarray*} \text{we have } \text{Var}(\alpha_1-\alpha_2) & = & \text{Var}(\bar Y_{1\bullet}-\bar Y_{2\bullet}) \\ & = & \frac{\sigma^2}{n_1}+\frac{\sigma^2}{n-n_1} \\ \frac{\partial}{\partial n_1} \left(\frac{\sigma^2}{n_1}+\frac{\sigma^2}{n-n_1}\right) & = & \sigma^2 \left[-\frac{1}{n_1^2}+\frac{1}{(n-n_1)^2} \right] \\ & = & \sigma^2 \left[-\frac{n(2n_1-n)}{n_1^2(n-n_1)^2} \right] \end{eqnarray*}` * Therefore the variance is minimised when `\(n_1=\frac{n}{2}\)`. * Thus if there is common variance given by the model, `\(n_2=\frac{n}{2}\)` implying .brand-blue[balanced] design is better! * Otherwide, sample size should be proportioinal to variance. ]] ]] --- class: split-10 count: false .row.bg-brand-blue.white[.content.vmiddle[ # Completely randomised design (CRD) ]] .row[.split-two[ .column[.content[ Assume information is collected from an experiment on, * `\(t\)` treatments with `\(n_i\)` EUs on treatment `\(i\)`. * `\(i=1,\dots, t\)` are allocated to `\(n=\sum\limits_{i=1}^t n_i\)` plots. * If we have a .brand-blue[completely randomised design], there are `$${n \choose n_1}\prod_{i=2}^t{n-\sum_{k=1}^{t-1}n_k \choose n_i}=\frac{n!}{n_1!n_2!\dots n_t!}$$` possible designs that are equally likely. * In a given case we choose one of these designs at random. ]] .column.bg-brand-gray[.content[ ### Example: `\(t=2\)`, `\(n_1=2\)` and `\(n_2=3\)` * The total sample size is `\(n=\sum_{i=1}^2 n_i=5\)`. * Allocating the two treatment labels 1 and 2 to the five experimental units is possible in $$ {5 \choose 2} = \frac{5!}{2! \times 3!}=10 \ \text{ways.}$$ * Thus a completely randomised design is choosing any of the following with a probability of 10%, `$$(1,1,2,2,2) \hspace{2mm} (1,2,1,2,2) \hspace{2mm} (1,2,2,1,2) \hspace{2mm} (1,2,2,2,1)$$` `$$(2,1,1,2,2) \hspace{2mm} (2,1,2,1,2) \hspace{2mm} (2,1,2,2,1) \hspace{2mm} (2,2,1,1,2)$$` `$$(2,2,1,2,1) \hspace{2mm} (2,2,2,1,1)$$` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Randomised complete block design (RCBD) ]] .row[.split-40[ .column[.content[ * In a RCBD there is besides the treatment factor .brand-blue[ an additional] block factor, which groups the EUs such that in each block, .brand-blue[all treatment levels are observed] (i.e. a .brand-blue[**complete**] design). * Thus there are * `\(t\)` treatments * `\(b\)` blocks of `\(t\)` plots each, * `\(t!\)` arrangements in each block and * `\((t!)^b\)` possible designs that are all equally likely. * Again, randomisation is preferable to a systematic design. ]] .column.bg-brand-gray[.content[ * For `\(b=3\)` and `\(t=4\)`, there are `\((4!)^3=24^3=13,824\)` designs. ```r set.seed(7) #Use R to generate a RCBD t <- c("A","B","C","D") replicate(3,sample(t)) #default replace = FALSE ``` ``` [,1] [,2] [,3] [1,] "B" "C" "B" [2,] "C" "B" "C" [3,] "A" "A" "D" [4,] "D" "D" "A" ``` * Or just .brand-blue[systematically permuting labels], eg |**Block 1**|**Block 2**|**Block 3**| |-----|-----|------| |A|D|C| |B|A|D| |C|B|A| |D|C|B| ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Problems of systematic designs ]] .row[.split-50[ .column[.content[ * Systematic designs offer protection against very adverse designs or arrangements that is possible under randomisation, eg A is consistently positioned in the first few rows; a possible row effect. * However it does not proect against a potential AB interaction effect. ### Example - Trees * E.g. if A represents a species of plant that is taller than species B then A being always to the north of B may affect the amount of sunlight B receives. * This could result in confounding with B response that had not been considered before the experiement was run. ]] .column.bg-brand-gray[.content[ ### Why randomised designs? * .brand-blue[Randomisation protects] against (unknown) extraneous factors affecting the results systematically. Main reason: All treatments have the same chance of being affected and the probability of any design can be calculated. * Thus .brand-blue[randomisation] ensures .brand-blue[valid estimation of the error variance]. * This is not the case for systematic design in which certain design not having the systematic pattern will have zero chance of being selected. * If we have .brand-blue[replication], the effect of adverse design is .brand-blue[averaged out] over replicates. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Simple designs for comparing one factor ]] .row[.split-60[ .column[.content[ ### Completely randomised designs * Randomly allocate EUs to treatmentsin the one-way ANOVA: `$$Y_{ij}=\mu+\alpha_i+\epsilon_{ij}, \ \epsilon_{ij} \sim NID(0,\sigma^2)$$` where `\(Y_{ij}\)` denotes the `\(j\)`-th observarion on treatment `\(i\)`. * Randomisation can be applied to each combination of factors in the two-way ANOVA model. Treating all combinations as groups, the two-way ANOVA is essentially one-way ANOVA. * Note we are assuming the treatment only affects the mean response and not the variance. **Remark**: Of course `\(\epsilon_{ij} \sim NID(0,\sigma^2)\)` does not always hold, in which case alternative methods to perform one-way ANOVA F-tests must be used. For example the nonparametric Kruskal-Willis test (assumed knowledge from STAT2012/DATA2002). ]] .column.bg-brand-gray[.content[ ### Assumptions 1. Independent observations 2. Normal population 3. Common variance (i.e. `\(\sigma^2\)` not affected by the treatment). This has to be checked via the well-known tools such as * residual plot, * boxplots, * Q-Q-plots, * normality tests, and * test of homogeneity of variance (Barlett's test). ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - Binding percentages of antibiotics ]] .row[.split-60[ .column[.content[ Data from Ziv and Sulman (1972): * A certain fraction of antibiotics that is injected into the bloodstream are bound to serum proteins. * As the extent of the binding increases the systematic uptake of the drug decreases, this affects the effectiveness of the medication. * Binding percentages characteristic of five widely used antibiotics were determined. * Normal lactation Israli-Freisian cows and awassi ewes, which were kept under normal feeding and housing conditions, were used. * Antibiotics were administered by single or multiple intravenous or intra muscular injections. * Blood sample were obtained before and at intervals after treatments. ]] .column.bg-brand-gray[.content[ * Treatment: Antibiotics * Experimental unit: cows or ewes * Observational unit: blood from EUs <img src="images/anitbiotics.jpg" width="100%" height="100%"> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - Binding percentages of antibiotics ]] .row[.split-70[ .column[.content[ Data, group means and group variances ```r meani=unname(apply(dat,2,mean)); vari=unname(apply(dat,2,var)); mm=rbind(dat,meani,vari); rownames(mm)=c("1","2","3","4","mean","var"); mm ``` ``` Penicillin.G Tetracycline Streptomycin Erythromycin Chloroamphenicol 1 29.60000 32.60000 5.8000 21.6000 29.20 2 24.30000 30.80000 6.2000 17.4000 32.80 3 28.50000 34.80000 11.0000 18.3000 25.00 4 32.00000 27.30000 8.3000 19.0000 24.20 mean 28.60000 31.37500 7.8250 19.0750 27.80 var 10.35333 10.05583 5.6825 3.2625 15.92 ``` Organise the data into a tidy form: ```r dat2 <- dat %>% gather("antibiotic","response") dat2 ``` ]] .column.bg-brand-gray[.content[ ``` antibiotic response 1 Penicillin.G 29.6 2 Penicillin.G 24.3 3 Penicillin.G 28.5 4 Penicillin.G 32.0 5 Tetracycline 32.6 6 Tetracycline 30.8 7 Tetracycline 34.8 8 Tetracycline 27.3 9 Streptomycin 5.8 10 Streptomycin 6.2 11 Streptomycin 11.0 12 Streptomycin 8.3 13 Erythromycin 21.6 14 Erythromycin 17.4 15 Erythromycin 18.3 16 Erythromycin 19.0 17 Chloroamphenicol 29.2 18 Chloroamphenicol 32.8 19 Chloroamphenicol 25.0 20 Chloroamphenicol 24.2 ``` ]] ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - Distribution ]] .row[.content[ ```r dat2 %>% ggplot(aes(antibiotic,response)) + geom_violin() + geom_point() ``` <img src="lecture23_2020JC_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - Thinking about the assumptions ]] .row[.split-40[ .column[.content[ ### Modelling * The responses *Y* are binding percentage. * One-way ANOVA is used to investigate the drug differences. * It is clearly `\(Y\)` is not exactly normal as `\(0 \le Y \le 100\)` is bounded. * `\(Y\)` may not be iid as blood sample were obtained repeatedly at intervals before and after treatments. * The sample size of 4 is too small to check if normal is a reasonable approximation. This becomes a purely model assumption as we do not have enough information to verify. ]] .column.bg-brand-gray[.content[ ### Example - Overall/omnibus test * Test $$ H_0: \alpha_1=\alpha_2 =\alpha_3=\alpha_4=\alpha_5 $$ i.e. all drugs have the same average binding percentage. * Use the F-test `$$f=\frac{\sum_{i=1}^t n_i(\bar{Y}_{i\bullet}-Y_{\bullet \bullet})^2/(t-1)}{\sum_{i=1}^t \sum_{j=1}^{n_i}(Y_{ij}-\bar{Y}_{i\bullet})^2/(n-t)}\sim F_{t-1,n-t}$$` * Recall that `$$\text{Within SS}=\text{RSS}=S_0=\sum_{i=1}^t (n_i-1) s_i^2,$$` where `\(s_i^2\)` is the sample variance from the `\(i\)`-th sample or treatment group. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - Overall/omnibus test ]] .row[.split-50[ .column[.content[ * Since `\(p=\Pr(F_{4,15}\ge 40.88)\le 10^{-6}\)`, there is strong evidence of a difference in binding percentages. * Calculations by hand: * `\(S_{yy}= \text{Total SS}\)` <br> `\(=(29.6^2+24.3^2+\dots+24.2^2)-\frac{458.7^2}{20}=1616.645\)` * `\(SS_A= \text{TSS}\)` <br> `\(=(\frac{114.4^2}{4}+\dots+\frac{111.2^2}{4})-\frac{458.7^2}{20}=1480.823\)` * `\(\text{RSS}=1616.645 - 1480.823 = 135.822\)`. ```r mi=tapply(dat2$response,dat2$antibiotic,sum); mi[1:3]; mi[4:5] ``` ``` Chloroamphenicol Erythromycin Penicillin.G 111.2 76.3 114.4 ``` ``` Streptomycin Tetracycline 31.3 125.5 ``` ]] .column.bg-brand-gray[.content[ ```r M1 <- lm(response ~ antibiotic, dat2) anova(M1) ``` ``` Analysis of Variance Table Response: response Df Sum Sq Mean Sq F value Pr(>F) antibiotic 4 1480.82 370.21 40.885 6.74e-08 *** Residuals 15 135.82 9.05 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r shapiro.test(M1$residuals) #fail to disprove normal ``` ``` Shapiro-Wilk normality test data: M1$residuals W = 0.97171, p-value = 0.7904 ``` ```r sum(dat2$response) ``` ``` [1] 458.7 ``` ```r RSS=anova(M1)[2,2]; Rdf=anova(M1)[2,1]; RMS=anova(M1)[2,3] ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - CIs for estimates and pairwise differences ]] .row[.split-50[ .column[.content[ * The next step is to investigate the possible differences via condifence intervals. * A table of sample means is a useful summary of the mean effects. * In general the CI's of interest should be specified according to research objectives .brand-blue[before] data collection, instead of simply looking at the observed means. * Below are the CIs for parameter estimates. ```r confint(M1) ``` ``` 2.5 % 97.5 % (Intercept) 24.5931009 31.00690 antibioticErythromycin -13.2602402 -4.18976 antibioticPenicillin.G -3.7352402 5.33524 antibioticStreptomycin -24.5102402 -15.43976 antibioticTetracycline -0.9602402 8.11024 ``` ]] .column.bg-brand-gray[.content[ * Assume that one of the research objectives is to compare `Chloroamphenicol` and `Erythromycin`. * Of course the largest difference is between drugs `Tetracycline` and `Streptomycin`. * A 95% CI for the difference in average binding percentage for `\(\alpha_5-\alpha_4\)` is `\begin{eqnarray*} (\bar{Y}_{5\bullet}-\bar{Y}_{4\bullet})\pm t_{15}(0.975) \sqrt{\frac{\hat{\sigma}^2}{4}+\frac{\hat{\sigma}^2}{4}}&=&-8.73\pm 4.53\\ &=&(-3.26,-4.2) \end{eqnarray*}` where <br> `\(\bar{Y}_{5\bullet}-\bar{Y}_{4\bullet} = 27.8-19.075=8.725\)`, <br> `\(\displaystyle \hat{\sigma}^2=\frac{\text{RSS}}{n-t}=\frac{135.8225}{15}=9.0548\)` and the CI width is <br> `\(t_{15}(0.975)\sqrt{\hat{\sigma}^2/2}=2.131 \times \sqrt{9.0548/2}=4.535\)`. ]] ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - CIs of pairwise differences (unadjusted) ]] .row[.content[ ```r library(emmeans) pp <- pairs(emmeans(M1,"antibiotic"),adjust="none") confint(pp) ``` ``` contrast estimate SE df lower.CL upper.CL Chloroamphenicol - Erythromycin 8.72 2.13 15 4.19 13.26 Chloroamphenicol - Penicillin.G -0.80 2.13 15 -5.34 3.74 Chloroamphenicol - Streptomycin 19.98 2.13 15 15.44 24.51 Chloroamphenicol - Tetracycline -3.58 2.13 15 -8.11 0.96 Erythromycin - Penicillin.G -9.53 2.13 15 -14.06 -4.99 Erythromycin - Streptomycin 11.25 2.13 15 6.71 15.79 Erythromycin - Tetracycline -12.30 2.13 15 -16.84 -7.76 Penicillin.G - Streptomycin 20.77 2.13 15 16.24 25.31 Penicillin.G - Tetracycline -2.77 2.13 15 -7.31 1.76 Streptomycin - Tetracycline -23.55 2.13 15 -28.09 -19.01 Confidence level used: 0.95 ``` ]] --- layout: false class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example - CIs of pairwise differences (unadjusted) ]] .row[.content[ ```r plot(pp) + geom_vline(xintercept=0, size=2, color="red") ``` <img src="lecture23_2020JC_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> ]]