ave and the "[" function in R

The ave function in R is one of those little helper functions I feel I should be using more. Investigating its source code showed me another twist about R and the "[" function. But first let's look at ave.

The top of ave's help page reads:

Group Averages Over Level Combinations of Factors

Subsets of x[] are averaged, where each subset consist of those observations with the same factor levels.

As an example I look at revenue data by product and shop.
revenue <- c(30,20, 23, 17)
product <- factor(c("bread", "cake", "bread", "cake"))
shop <- gl(2,2, labels=c("shop_1", "shop_2"))
To answer the question "Which shop sells proportionally more bread?" I need to divide the revenue vector by the sum of revenue per shop, which can be calculated easily by ave:
(shop_revenue <- ave(revenue, shop, FUN=sum))
# [1] 50 50 40 40
(revenue_split_in_shop <- revenue/shop_revenue)
# [1] 0.600 0.400 0.575 0.425 # Shop 1 has the higher bread share (60% vs. 57.5%)
In other words, ave has to split the revenue vector by shop and apply the sum function to it. Well that's exactly what it does. Here is the source code of ave:
#  Copyright (C) 1995-2012 The R Core Team
ave <- function (x, ..., FUN = mean)
{
    if(missing(...))
        x[] <- FUN(x)
    else {
        g <- interaction(...)
        split(x,g) <- lapply(split(x, g), FUN)
    }
    x
}
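To make the grouped branch concrete, here is a minimal sketch that repeats its three steps by hand on the revenue example. The names pieces, sums and by_hand are only illustrative and do not appear in the ave source.

```r
revenue <- c(30, 20, 23, 17)
shop <- gl(2, 2, labels = c("shop_1", "shop_2"))

# Step 1: split the vector into one piece per group
pieces <- split(revenue, shop)

# Step 2: apply the function to each piece
sums <- lapply(pieces, sum)

# Step 3: write the results back into the original positions;
# each single sum is recycled to the length of its group
by_hand <- revenue
split(by_hand, shop) <- sums

by_hand
# [1] 50 50 40 40
identical(by_hand, ave(revenue, shop, FUN = sum))
# [1] TRUE
```

The last step is what distinguishes ave from a plain split and lapply: the result has the same length and order as the input.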
However, and this is what intrigued me, if I don't provide a grouping variable (missing(...)) it will apply the function FUN to x itself and write its output to x[]. That's actually what the help file for ave says in its description. So what does it do? Here is an example again:
ave(revenue, FUN=sum)
# [1] 90 90 90 90
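I get the sum of revenue repeated as many times as the vector has elements, not just once, as with sum(revenue). The trick is that the output of FUN(x) is assigned into x[], which is itself a function call: the replacement function "[<-" applied to x. With an empty index it overwrites every element of x, recycling the value as needed, rather than replacing x as a whole.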
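The difference between assigning to x and assigning to x[] can be seen in a small sketch. The variable names x, y and w are just placeholders for this illustration.

```r
revenue <- c(30, 20, 23, 17)

# Plain assignment: x is replaced by a single number
x <- revenue
x <- sum(x)
length(x)
# [1] 1

# Empty-index assignment: every element of y is overwritten,
# so the single value is recycled to the length of y
y <- revenue
y[] <- sum(y)
y
# [1] 90 90 90 90

# A shorter value is recycled in the same way
w <- revenue
w[] <- c(1, 2)
w
# [1] 1 2 1 2
```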

I think it is the following sentence in the help file of "[" (see ?"["), which explains it: "Subsetting (except by an empty index) will drop all attributes except names, dim and dimnames."
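A quick way to see what that sentence means is to attach attributes to the vector and compare an empty index with an explicit one. This is only a sketch: the names a to d and the "unit" attribute are made up for the illustration.

```r
named <- c(a = 30, b = 20, c = 23, d = 17)
attr(named, "unit") <- "GBP"   # hypothetical custom attribute

# Empty index: names and the custom attribute both survive
names(attributes(named[]))
# [1] "names" "unit"

# Explicit index: only names survive
names(attributes(named[1:4]))
# [1] "names"

# ave() assigns through the empty index, so its result keeps
# the names and the custom attribute of its input
res <- ave(named, FUN = sum)
names(res)
# [1] "a" "b" "c" "d"
attr(res, "unit")
# [1] "GBP"
```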

So there we are. I feel less inclined to use ave more, as it is just short for the usual split, lapply routine, but I learned something new about the subtleties of R.

2 comments :

  1. It is different from tapply and split because it recycles the grouped result to the length of the factor (x[ ] <- ...), so for a FUN that returns a single value it is the short form of tapply(x, group, FUN)[group]. Haven't got a clue, however, when this is useful...


    Cheers,
    Andrej

  2. I like to use plyr for this kind of summarizing and transforming.

    The example above would be something like this:

    library(plyr)
    revenue <- c(30,20, 23, 17)
    product <- factor(c("bread", "cake", "bread", "cake"))
    shop <- gl(2,2, labels=c("shop_1", "shop_2"))

    df <- data.frame(revenue = revenue, product = product, shop = shop)
    ddply(df, .(shop), transform, pct=revenue/sum(revenue))
