MANOVA with arrays in R

Campbell and Mahon (1974) recorded data on 200 crab Leptograpsus variegatus specimens found on the shores of Western Australia. The crabs have two colour forms blue (B) and orange (O). Fifty crabs of each form of each sex were collected and 5 physical measurements recorded. The data are in the 4 arrays BM, BF, OM and OF, where BM denotes the data for blue, male crabs etc. The variables in column order are measurements in cm on the size of frontal lobe (FL), rear width (RW), carapace (shell) length (CL), carapace width (CW) and body depth (BD). The data is part of the R package MASS. The following R-code should help you get started:

library(MASS)
data(crabs)
crabsdat = split(crabs[, 4:8], crabs[, 1:2])
crabsdat = lapply(crabsdat, as.matrix)
names(crabsdat)

## [1] "B.F" "O.F" "B.M" "O.M"

Our aim is to test the null hypothesis that the 4 types of crabs have the same mean vectors. I.e. \( H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4 \). To do this we will use Multivariate Analysis of Variance (MANOVA) which assumes that we have independent samples, drawn from a normal distribution with common covariance structure, i.e. \( \Sigma_1 = \Sigma_2 = \Sigma_3 = \Sigma_4 = \Sigma \). That is, we assume
\[ \mathbf{X}_{ij} \sim \mathcal{N}_{p}( \mu_i, \Sigma),\quad\text{ for }j=1,\ldots,g; i = 1,\ldots,p. \]
In the present case we have \( g=4 \) types of crab and \( p=5 \) variables measured on each crab.

Construct the \( \mathbf{B} \) and \( \mathbf{W} \) matrices for these data.

We define \( \mathbf{W} \) as the within samples sum of squares and products matrix:
\[ \mathbf{W} = (N-g)\mathbf{S}_p \]
where \( \mathbf{S}_p \) is the pooled covariance matrix:
\[ \mathbf{S}_p = \frac{\sum_{j=1}^{g}(n_i-1)\mathbf{S}_i}{N-g} \]
where \( N = n_1 + n_2 + \ldots + n_g \).

To test \( H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4 \), we reduce this to a one dimensional problem and use a hypothesis of the form
\[ H_0\colon \mathbf{a}' \mu_1 = \mathbf{a}' \mu_2 = \mathbf{a}' \mu_3 = \mathbf{a}' \mu_4 \]
for some vector \( \mathbf{a} \) and use a 1-way ANOVA \( F \) statistic. In this case, the within sum of squares is \( \mathbf{a}'\mathbf{W}\mathbf{a} \) and the between groups sum of squares is \( \mathbf{a}'\mathbf{B}\mathbf{a} \) where
\[ \mathbf{B} = \sum_{i=1}^{g}n_i (\bar{\mathbf{X}}_{i\cdot} - \bar{\mathbf{X}}_{\cdot\cdot})(\bar{\mathbf{X}}_{i\cdot} - \bar{\mathbf{X}}_{\cdot\cdot})' \]

crabsv = lapply(crabsdat, var)
n = 50
N = 4 * 50
W = (n - 1) * Reduce("+", crabsv)
W

##      FL   RW   CL    CW   BD
## FL 1880 1330 4045  4556 1843
## RW 1330 1016 2904  3280 1321
## CL 4045 2904 8843  9949 4020
## CW 4556 3280 9949 11232 4526
## BD 1843 1321 4020  4526 1854

Note that \( \mathbf{W} \) is the weighted sum of the individual sample covariance matrices. In this case \( n_g = 50 \) for \( g = 1,2,3,4 \) so we can calculate it as above. However if the sample sizes are different, then we would need to use:
\[ \mathbf{W} = (n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2 + \ldots + (n_g - 1)\mathbf{S}_g \]

There's a really easy way to calculate \( \mathbf{B} \):

Tot = (N - 1) * var(crabs[, -(1:3)])
B = Tot - W
B

##       FL    RW     CL     CW    BD
## FL 551.6 293.3  801.6  727.6 509.8
## RW 293.3 301.4  350.7  350.1 238.4
## CL 801.6 350.7 1242.7 1147.5 749.9
## CW 727.6 350.1 1147.5 1099.7 666.2
## BD 509.8 238.4  749.9  666.2 480.3

Or alternatively, using
\[ \mathbf{B} = \sum_{i=1}^{g}n_i (\bar{\mathbf{X}}_{i\cdot} - \bar{\mathbf{X}}_{\cdot\cdot})(\bar{\mathbf{X}}_{i\cdot} - \bar{\mathbf{X}}_{\cdot\cdot})' \]

means = t(sapply(crabsdat, apply, 2, mean))
m = apply(Reduce("+", crabsdat), 2, sum)/(50 * 4)
M = matrix(m, nrow = dim(means)[1], ncol = dim(means)[2], byrow = T)
B = 50 * t(means - M) %*% (means - M)

Find the eigenvalues of \( W^{-1}B \).

A = solve(W) %*% B
val = eigen(A)$values
round(val, 2)

## [1] 7.52 3.28 0.16 0.00 0.00

Test the hypothesis that the 4 types of crabs have the same mean vectors.

The MANOVA statistic given in class is is
\[ V = (N-g)\sum_{i=1}^p\lambda_i = (200-4)\sum_{i=1}^{5}\lambda_{i} = 196\times 10.95538 = 2147.255. \]
If the null hypothesis is true, the exact distribution of the statistic is \( \chi^{2}_{p(g-1) = 15} \).

teststat = (200 - 4) * sum(val)
p.val = 1 - pchisq(teststat, df = 15)
c(teststat, p.val)

## [1] 2147    0

Hence we reject the null hypothesis that the group means can be collapsed into a single point.

Calculate the first two canonical scores for the 4 types of crabs and hence find the means of the first two canonical scores for each of the groups.

v12 = eigen(A)$vectors[, 1:2]
v12

##          [,1]     [,2]
## [1,] -0.58854 -0.09426
## [2,] -0.23656 -0.74348
## [3,] -0.07101  0.52900
## [4,]  0.57388 -0.31078
## [5,] -0.51311  0.24887

sc = lapply(crabsdat, function(x) as.matrix(x) %*% v12)
cm = t(sapply(sc, apply, 2, mean))
cm

##         [,1]    [,2]
## B.F -0.01753 -2.6073
## O.F -1.94164 -2.6169
## B.M  0.49399 -1.2930
## O.M -1.59948 -0.6062

Plot the means of the first two canonical scores.

plot(do.call(rbind, sc), type = "n", xlab = "First cannonical variable", ylab = "Second cannnical variable")
points(cm, col = c("blue", "orange", "blue", "orange"), cex = 6, pch = c(17, 
    17, 16, 16))
points(sc[[1]], col = "blue", pch = 17)
points(sc[[2]], col = "orange", pch = 17)
points(sc[[3]], col = "blue")
points(sc[[4]], col = "orange")
text(cm, rownames(cm), col = c("white"))

plot of chunk unnamed-chunk-7

The first cannonical variable separates the blue crabs from the orange crabs, while the second canonical variable distinguishes between males and females.