Campbell and Mahon (1974) recorded data on 200 crab Leptograpsus variegatus specimens found on the shores of Western Australia. The crabs have two colour forms blue (B) and orange (O). Fifty crabs of each form of each sex were collected and 5 physical measurements recorded. The data are in the 4 arrays BM, BF, OM and OF, where BM denotes the data for blue, male crabs etc. The variables in column order are measurements in cm on the size of frontal lobe (FL), rear width (RW), carapace (shell) length (CL), carapace width (CW) and body depth (BD). The data is part of the R package MASS. The following R-code should help you get started:
library(MASS)
data(crabs)
crabsdat = split(crabs[, 4:8], crabs[, 1:2])
crabsdat = lapply(crabsdat, as.matrix)
names(crabsdat)
## [1] "B.F" "O.F" "B.M" "O.M"
Our aim is to test the null hypothesis that the 4 types of crabs have the same mean vectors. I.e. \( H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4 \). To do this we will use Multivariate Analysis of Variance (MANOVA) which assumes that we have independent samples, drawn from a normal distribution with common covariance structure, i.e. \( \Sigma_1 = \Sigma_2 = \Sigma_3 = \Sigma_4 = \Sigma \). That is, we assume
\[ \mathbf{X}_{ij} \sim \mathcal{N}_{p}( \mu_i, \Sigma),\quad\text{ for }j=1,\ldots,g; i = 1,\ldots,p. \]
In the present case we have \( g=4 \) types of crab and \( p=5 \) variables measured on each crab.
We define \( \mathbf{W} \) as the within samples sum of squares and products matrix:
\[ \mathbf{W} = (N-g)\mathbf{S}_p \]
where \( \mathbf{S}_p \) is the pooled covariance matrix:
\[ \mathbf{S}_p = \frac{\sum_{j=1}^{g}(n_i-1)\mathbf{S}_i}{N-g} \]
where \( N = n_1 + n_2 + \ldots + n_g \).
To test \( H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4 \), we reduce this to a one dimensional problem and use a hypothesis of the form
\[ H_0\colon \mathbf{a}' \mu_1 = \mathbf{a}' \mu_2 = \mathbf{a}' \mu_3 = \mathbf{a}' \mu_4 \]
for some vector \( \mathbf{a} \) and use a 1-way ANOVA \( F \) statistic. In this case, the within sum of squares is \( \mathbf{a}'\mathbf{W}\mathbf{a} \) and the between groups sum of squares is \( \mathbf{a}'\mathbf{B}\mathbf{a} \) where
\[ \mathbf{B} = \sum_{i=1}^{g}n_i (\bar{\mathbf{X}}_{i\cdot} - \bar{\mathbf{X}}_{\cdot\cdot})(\bar{\mathbf{X}}_{i\cdot} - \bar{\mathbf{X}}_{\cdot\cdot})' \]
crabsv = lapply(crabsdat, var)
n = 50
N = 4 * 50
W = (n - 1) * Reduce("+", crabsv)
W
## FL RW CL CW BD
## FL 1880 1330 4045 4556 1843
## RW 1330 1016 2904 3280 1321
## CL 4045 2904 8843 9949 4020
## CW 4556 3280 9949 11232 4526
## BD 1843 1321 4020 4526 1854
Note that \( \mathbf{W} \) is the weighted sum of the individual sample covariance matrices. In this case \( n_g = 50 \) for \( g = 1,2,3,4 \) so we can calculate it as above. However if the sample sizes are different, then we would need to use:
\[ \mathbf{W} = (n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2 + \ldots + (n_g - 1)\mathbf{S}_g \]
There's a really easy way to calculate \( \mathbf{B} \):
Tot = (N - 1) * var(crabs[, -(1:3)])
B = Tot - W
B
## FL RW CL CW BD
## FL 551.6 293.3 801.6 727.6 509.8
## RW 293.3 301.4 350.7 350.1 238.4
## CL 801.6 350.7 1242.7 1147.5 749.9
## CW 727.6 350.1 1147.5 1099.7 666.2
## BD 509.8 238.4 749.9 666.2 480.3
Or alternatively, using
\[ \mathbf{B} = \sum_{i=1}^{g}n_i (\bar{\mathbf{X}}_{i\cdot} - \bar{\mathbf{X}}_{\cdot\cdot})(\bar{\mathbf{X}}_{i\cdot} - \bar{\mathbf{X}}_{\cdot\cdot})' \]
means = t(sapply(crabsdat, apply, 2, mean))
m = apply(Reduce("+", crabsdat), 2, sum)/(50 * 4)
M = matrix(m, nrow = dim(means)[1], ncol = dim(means)[2], byrow = T)
B = 50 * t(means - M) %*% (means - M)
A = solve(W) %*% B
val = eigen(A)$values
round(val, 2)
## [1] 7.52 3.28 0.16 0.00 0.00
The MANOVA statistic given in class is is
\[ V = (N-g)\sum_{i=1}^p\lambda_i = (200-4)\sum_{i=1}^{5}\lambda_{i} = 196\times 10.95538 = 2147.255. \]
If the null hypothesis is true, the exact distribution of the statistic is \( \chi^{2}_{p(g-1) = 15} \).
teststat = (200 - 4) * sum(val)
p.val = 1 - pchisq(teststat, df = 15)
c(teststat, p.val)
## [1] 2147 0
Hence we reject the null hypothesis that the group means can be collapsed into a single point.
v12 = eigen(A)$vectors[, 1:2]
v12
## [,1] [,2]
## [1,] -0.58854 -0.09426
## [2,] -0.23656 -0.74348
## [3,] -0.07101 0.52900
## [4,] 0.57388 -0.31078
## [5,] -0.51311 0.24887
sc = lapply(crabsdat, function(x) as.matrix(x) %*% v12)
cm = t(sapply(sc, apply, 2, mean))
cm
## [,1] [,2]
## B.F -0.01753 -2.6073
## O.F -1.94164 -2.6169
## B.M 0.49399 -1.2930
## O.M -1.59948 -0.6062
plot(do.call(rbind, sc), type = "n", xlab = "First cannonical variable", ylab = "Second cannnical variable")
points(cm, col = c("blue", "orange", "blue", "orange"), cex = 6, pch = c(17,
17, 16, 16))
points(sc[[1]], col = "blue", pch = 17)
points(sc[[2]], col = "orange", pch = 17)
points(sc[[3]], col = "blue")
points(sc[[4]], col = "orange")
text(cm, rownames(cm), col = c("white"))
The first cannonical variable separates the blue crabs from the orange crabs, while the second canonical variable distinguishes between males and females.