# Transforming subsets of data in R with by, ddply and data.table

Transforming data sets with R is usually the starting point of my data analysis work. Here is a scenario which comes up from time to time: transform subsets of a data frame, based on context given in one or a combination of columns. As an example I use a data set which shows sales figures by product for a number of years:

```
df <- data.frame(Product=gl(3, 10, labels=c("A", "B", "C")),
                 Year=factor(rep(2002:2011, 3)),
                 Sales=1:30)
head(df)
##   Product Year Sales
## 1       A 2002     1
## 2       A 2003     2
## 3       A 2004     3
## 4       A 2005     4
## 5       A 2006     5
## 6       A 2007     6
```

I am interested in absolute and relative sales developments by product over time. Hence, I would like to add a column to my data frame that shows the sales figures divided by the total sum of sales in each year, so I can create a chart like the one shown below. There are lots of ways of doing this transformation in R. Here are three approaches using:

- base R with `by`,
- `ddply` from the `plyr` package,
- `data.table` from the package with the same name.

### by

The idea here is to use `by` to split the data for each year and to apply the `transform` function to each subset, calculating the share of sales for each product with the following function: `fn <- function(x) x/sum(x)`. Having defined `fn`, I can apply it in a `by` statement; as its output will be a list, I wrap it into a `do.call` command to row-bind (`rbind`) the list elements:

```
## fn as defined above
fn <- function(x) x/sum(x)
R1 <- do.call("rbind", as.list(
  by(df, df["Year"], transform, Share=fn(Sales))
))
head(R1)
##         Product Year Sales      Share
## 2002.1        A 2002     1 0.03030303
## 2002.11      B 2002    11 0.33333333
## 2002.21      C 2002    21 0.63636364
## 2003.2        A 2003     2 0.05555556
## 2003.12      B 2003    12 0.33333333
## 2003.22      C 2003    22 0.61111111
```

### ddply

Hadley's `plyr` package provides an elegant wrapper for this job with the `ddply` function. Again I use the `transform` function with my self-defined `fn` function:

```
library(plyr)
R2 <- ddply(df, "Year", transform, Share=fn(Sales))
head(R2)
##   Product Year Sales      Share
## 1       A 2002     1 0.03030303
## 2       B 2002    11 0.33333333
## 3       C 2002    21 0.63636364
## 4       A 2003     2 0.05555556
## 5       B 2003    12 0.33333333
## 6       C 2003    22 0.61111111
```

### data.table

With data.table I have to do a little bit more legwork; in particular, I have to think about the indices I need to use. Yet, it is still straightforward:

```
library(data.table)
## Convert df into a data.table
dt <- data.table(df)
## Set Year as a key
setkey(dt, "Year")
## Calculate the sum of sales per year (= key(dt))
X <- dt[, list(SUM=sum(Sales)), by=key(dt)]
## Join X and dt, both have the same key, and
## add the share of sales as an additional column
R3 <- dt[X, list(Sales, Product, Share=Sales/SUM)]
head(R3)
##      Year Sales Product      Share
## [1,] 2002     1       A 0.03030303
## [2,] 2002    11       B 0.33333333
## [3,] 2002    21       C 0.63636364
## [4,] 2003     2       A 0.05555556
## [5,] 2003    12       B 0.33333333
## [6,] 2003    22       C 0.61111111
```

Although `data.table` may look cumbersome compared to `ddply` and `by`, I will show below that it is actually a lot faster than the two other approaches.

### Plotting the results

With any of the three outputs I can now create the chart with `latticeExtra`:

```
library(latticeExtra)
asTheEconomist(
  xyplot(Sales + Share ~ Year, groups=Product,
         data=R3, t="b",
         scales=list(relation="free", x=list(rot=45)),
         auto.key=list(space="top", column=3),
         main="Product information")
)
```

## Comparing performance of by, ddply and data.table

Let me move on to a more real-life example with 100 companies, each with 20 products and a 10-year history:

```
set.seed(1)
df <- data.frame(Company=rep(paste("Company", 1:100), 200),
                 Product=gl(20, 100, labels=LETTERS[1:20]),
                 Year=sort(rep(2002:2011, 2000)),
                 Sales=rnorm(20000, 100, 10))
```

I use the same three approaches to calculate the share of sales by product for each year and company, but this time I will measure the execution time on my old iBook G4, running R-2.15.0:

```
r1 <- system.time(
  R1 <- do.call("rbind", as.list(
    by(df, df[c("Year", "Company")],
       transform, Share=fn(Sales))
  ))
)
r2 <- system.time(
  R2 <- ddply(df, c("Company", "Year"),
              transform, Share=fn(Sales))
)
r3 <- system.time({
  dt <- data.table(df)
  setkey(dt, "Year", "Company")
  X <- dt[, list(SUM=sum(Sales)), by=key(dt)]
  R3 <- dt[X, list(Company, Sales, Product, Share=Sales/SUM)]
})
```

And here are the results:

```
r1 # by
##    user  system elapsed
##  13.690   4.178  42.118
r2 # ddply
##    user  system elapsed
##  18.215   6.873  53.061
r3 # data.table
##    user  system elapsed
##   0.171   0.036   0.442
```

It is quite astonishing to see the speed of `data.table` in comparison to `by` and `ddply`, but maybe it shouldn't be a surprise that the elegance of `ddply` comes with a price as well.

**Addition (13 June 2012):** See also Matt's comments below. I completely missed `ave` from base R, which is rather simple and quick as well. Additionally, his link to a Stack Overflow discussion provides further examples and benchmarks.
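For completeness, here is a minimal sketch of the `ave` approach, essentially the code from the comments below applied to the benchmark data:

```
## ave() applies FUN within each Year/Company group and returns the
## result in the original row order, so it can be bound on directly.
R4 <- cbind(df, Share=ave(df[["Sales"]], df[["Year"]], df[["Company"]],
                          FUN=function(x) x/sum(x)))
head(R4)
```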

Finally, my session info:

```
> sessionInfo() # iBook G4 800 MHZ, 640 MB RAM
R version 2.15.0 Patched (2012-06-03 r59505)
Platform: powerpc-apple-darwin8.11.0 (32-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] latticeExtra_0.6-19 lattice_0.20-6 RColorBrewer_1.0-5
[4] data.table_1.8.0 plyr_1.7.1
loaded via a namespace (and not attached):
[1] grid_2.15.0
```


## Comments

In your `data.table` example, why not simply do:

`R3 <- dt[, list(Company, Sales, Product, Share=Sales/sum(Sales)), by=key(dt)]`

It should shave off some time vs. doing the merge.

First thought is that there are faster ways in base R and plyr, so this is being a little too nice to data.table. Using `ave()` you can get the sum repeated through each group, cbind that on, then divide Sales by sum(Sales) as a vectorised operation rather than by group. It's `transform()` which really kills performance, by copying all those groups separately, copying again to add the column on, and rbind'ing them all afterwards.
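For illustration, a minimal sketch of this two-step idea (the names `GroupSum` and `df2` are my own, not the commenter's):

```
## Repeat each Year/Company group's total alongside every row, then
## divide once across the whole vector -- no per-group transform() copies.
GroupSum <- ave(df$Sales, df$Year, df$Company, FUN=sum)
df2 <- cbind(df, Share=df$Sales/GroupSum)
```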

For this use case, the real advantage of data.table comes in v1.8.1 (on R-Forge, not yet on CRAN). There you can add a column by reference, by group. The syntax is one line:

`DT[, Share := Sales/sum(Sales), by=list(Year,Company)]`

That adds a new column `Share` to `DT` and populates it within each group. The aim with this syntax is that it reads more like English, whilst being significantly faster than `ave()`, which is the next fastest solution I know.

Many thanks for your comments. I had tried something like this with v1.8.0. This will be a great addition to the functionality of the package. I am looking forward to your data.table talk at next week's LondonR meeting.

Here are the results with `ave`, which looks indeed quite impressive:

`system.time( R4 <- cbind(df, Share=ave(df[["Sales"]], df[["Year"]], df[["Company"]], FUN=function(x) x/sum(x))) ) # user 0.118, system 0.023, elapsed 0.280`

How could I have missed this function for all those years?

That's more like it, right scale now.

The r3 time for data.table appears slower (0.442s) because that timing includes the call to `data.table()`, the `setkey()` and the join back. None of those are actually needed. Cutting straight to v1.8.1, you should find the direct `:=` by group is much faster than the 0.280s for `ave()` + `cbind()`.
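For illustration, a minimal sketch of timing only the grouped update by reference (assuming data.table v1.8.1 or later and a `dt` that already exists):

```
## Time only the := by group: adds the Share column to dt in place,
## with no data.table() conversion, setkey() or join.
system.time(
  dt[, Share := Sales/sum(Sales), by=list(Year, Company)]
)
```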

However, 2e4 rows is a very tiny dataset, and time differences of under a second hardly matter. Try scaling it up to 1e6, 1e7 or 1e8 rows and you should see data.table significantly faster on significant times (e.g. hours down to minutes have been reported on datatable-help).

Here's a related benchmark (but it needs N increasing for data.table to shine, see the comments):

http://stackoverflow.com/questions/10748253/idiomatic-r-code-for-partitioning-a-vector-by-an-index-and-performing-an-operati

"...but maybe it shouldn't be surprise that the elegance of ddply comes with a price as well"

ReplyDeletewhy should elegance come with a price in terms of performance?