ddply vs. tapply for a large number of categories

I was running a group-wise, column-wise summary on a large data set with a large number of groups and found it took longer than expected with ddply from the plyr package (my preferred summary tool). So I wrote a quick tapply-based alternative that does the same thing. This blog seems as good a place as any to store the code for my own future use, while maybe helping someone else with the same problem. Here’s an example, with computing times.

First, make a large data set (n = 300,000 rows) with several columns (VAR1 to VAR5) and many groups (n = 100,000). The goal is to create an output data frame with one row per group and the mean of each of VAR1 to VAR5.

# Make a large data set with many groups
dat <- data.frame(GROUP = rep(1:100000, 3),
                  VAR1 = rnorm(300000, 1),
                  VAR2 = rnorm(300000, 2),
                  VAR3 = rnorm(300000, 3),
                  VAR4 = rnorm(300000, 4),
                  VAR5 = rnorm(300000, 5))

Here’s the way to do it in plyr:

# Load the plyr package
library(plyr)
system.time({
  # Column-wise summaries by groups
  dat.means1 <- ddply(dat, .(GROUP), numcolwise(mean))
})
 user system elapsed 
44.48   0.02   44.56

head(dat.means1)
  GROUP      VAR1      VAR2     VAR3     VAR4     VAR5
1     1  0.8380018 1.873645 3.471506 4.374789 4.831297
2     2  0.1880526 1.737269 1.559789 3.965047 5.714575
3     3  1.0606414 1.632986 3.094070 4.368635 5.406597
4     4  1.5058617 1.381823 2.212632 3.304811 5.323576
5     5  0.9958184 2.199508 2.910383 3.741604 4.901024
6     6 -0.2971577 1.525801 2.263825 4.742539 4.360256

And with tapply within a for loop:

system.time({
  # Loop over all columns except GROUP
  for(i in names(dat)[-1]){
    # Make the output data frame on the first iteration
    # (tapply returns groups in sorted order, so sort the group labels to match)
    if(i == names(dat)[2]){
      dat.means2 <- data.frame(GROUP = sort(unique(dat$GROUP)),
                               VAR1 = tapply(dat$VAR1, dat$GROUP, mean))
    } else {
      # Summarise each group with tapply and append as a new column
      dat.means2[ncol(dat.means2) + 1] <- tapply(dat[, i], dat$GROUP, mean)
    }
  }
  # Rename the columns back to the original variable names
  names(dat.means2) <- names(dat)
})
user system elapsed 
5.45   0.04    5.50 

head(dat.means2)
  GROUP       VAR1     VAR2     VAR3     VAR4     VAR5
1     1  0.8380018 1.873645 3.471506 4.374789 4.831297
2     2  0.1880526 1.737269 1.559789 3.965047 5.714575
3     3  1.0606414 1.632986 3.094070 4.368635 5.406597
4     4  1.5058617 1.381823 2.212632 3.304811 5.323576
5     5  0.9958184 2.199508 2.910383 3.741604 4.901024
6     6 -0.2971577 1.525801 2.263825 4.742539 4.360256

It surprised me that tapply is more than 8x faster than ddply in this case. The data I was actually processing had millions of rows and dozens of columns, so this is a significant time saving. I would also be interested in how to do this with aggregate (and whether it’s any faster); a sketch is below. I don’t like the complicated for loop coding required by the tapply solution, but the time saving more than compensates.

If anyone has an explanation for the time difference (I don’t know the internals of either function well enough) or a suggestion to streamline plyr (for other applications), I would be happy to hear about it.
