Rapid ANOVA p-value Calculation

There are times when we need to perform ANOVA calculations on multiple sets of features in order to generate a list of p-values. A traditional approach is to use the base function aov and extract the p-value from it. This can be slow when analyzing thousands of features. An alternative is to use the multtest library and calculate the p-values from the F-values.

An example is shown below using a matrix of 20 samples (5 groups of 4) and 10000 features. Calculating p-values within an apply loop takes about 18 seconds. Calculating p-values from the F-values using the multtest library takes about 15 milliseconds. Both were determined on a Z640.

## create an (g x s) x f matrix where g is the number of groups, s is the number of samples in each group and f is the number of features
g <- 5
s <- 4
f <- 10000
df.raw <- replicate(g * s, sample(100:100000, f, rep = TRUE))  ## matrix of data
groups <- factor(rep(seq(g), each = s))  ## column ids of groups
## using an apply loop
multAnova1 <- function(df.raw, groups) {
 pVector <- apply(df.raw, 1, function(x) {
   submodel <- lm(x~groups)
   subanova <- aov(submodel)
   summary(subanova)[[1]][['Pr(>F)']][[1]]
 })
 p.adjust(pVector, method='BH')
}

library(multtest)
## using multtest
multAnova2 <- function(df.raw, groups) {
 F_new <- mt.teststat(df.raw, groups, test='f')
 P_new <- pf(F_new, length(levels(groups))-1, length(groups)-length(levels(groups)), lower.tail = F)
 p.adjust(P_new, method='BH')
}

## benchmarking
library(microbenchmark)
library(ggplot2)
mb <- microbenchmark(
 multAnova1(df.raw, groups),
 multAnova2(df.raw, groups),
 times = 10
)
print(mb)
autoplot(mb)

Timing (milliseconds)

expr min lq mean median uq max nval
multAnova1(df.raw, groups) 17467.11 17609.23 18085.73 17882.93 18203.27 19628.12 10
multAnova2(df.raw, groups) 13.89 14.25 15.66 15.68 16.31 18.72 10


Harvey Lieberman

Written by

Updated