Say it in R with "by", "apply" and friends

[Image: Iris versicolor, photo by Danielle Langlois, licence: CC-BY-SA]
R is a language, as Luis Apiolaza pointed out in his recent post. This is absolutely true, and learning a programming language is not much different from learning a foreign language. It takes time and a lot of practice to become proficient. I started using R when I moved to the UK, and I sometimes wonder whether by now I have a better understanding of English or of R.

Languages are full of surprises, in particular for non-native speakers. The other day I learned that there is courtesy and there is curtsey. The two words sound very similar to me, and mixing them up in an email of course created some laughter.

With languages you can get into the habit of using certain words and phrases, but sometimes you see or hear something that shakes you up again. The following two lines of R did just that to me:
f <- function(x) x^2
sapply(1:10, f)
[1]   1   4   9  16  25  36  49  64  81 100
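For comparison, here is a minimal sketch of the explicit for loop that the sapply call above replaces:

```r
f <- function(x) x^2

res <- numeric(10)
for (i in 1:10) res[i] <- f(i)   # the imperative loop, element by element

identical(res, sapply(1:10, f))  # TRUE: the same result in one statement
```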
It reminded me of the phrase that everything is a list in R. It showed me again how easily a for loop can be turned into a statement using the apply family of functions, and how little I know about all the subtleties of R. I remember how happy I felt when I finally understood the by function in R. I started to use it all the time, turning a blind eye to aggregate and the apply family of functions. Here is an example, using by, in which I calculate the means of the various measurements by species for the famous iris data set.

do.call("rbind", as.list(
  by(iris, list(Species=iris$Species), function(x){
    y <- subset(x, select= -Species)
    apply(y, 2, mean)
  })
))

           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
Now let's find alternative ways of expressing ourselves, using other words/functions of the R language, such as aggregate, apply, sapply, tapply, data.table, ddply, sqldf, and summaryBy.


aggregate

The aggregate function splits the data into subsets and computes summary statistics for each of them. The output of aggregate is a data.frame, including a column for the species.
iris.x <- subset(iris, select= -Species)
iris.s <- subset(iris, select= Species)
aggregate(iris.x, iris.s, mean)

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
Addition: As John Christie points out in the comments, aggregate also has a formula interface, which simplifies the call to:
aggregate( . ~ Species, iris, mean)

apply and tapply

The combination of tapply and apply achieves a similar result, but this time the output is a matrix, and hence I lose the column with the species. The species are now the row names.
apply(iris.x, 2, function(x) tapply(x, iris.s, mean))

           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

split and apply

Here I split the data first into subsets for each of the species and then calculate the mean for each column of the subset. The output is a matrix again, but transposed.
sapply(split(iris.x, iris.s), function(x) apply(x, 2, mean))

             setosa versicolor virginica
Sepal.Length  5.006      5.936     6.588
Sepal.Width   3.428      2.770     2.974
Petal.Length  1.462      4.260     5.552
Petal.Width   0.246      1.326     2.026
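Since this output is transposed relative to the earlier results, wrapping the call in t() recovers the species-by-measurement layout:

```r
iris.x <- subset(iris, select = -Species)
iris.s <- subset(iris, select = Species)

## transpose the sapply result back to species as rows
t(sapply(split(iris.x, iris.s), colMeans))
```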


ddply

Hadley Wickham's plyr package provides tools for splitting, applying and combining data. The function ddply is similar to the by function, but it returns a data.frame instead of a by list and maintains the column for the species.
ddply(iris, "Species", function(x){
    y <- subset(x, select= -Species)
    apply(y, 2, mean)
})

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
Addition: Sean mentions an alternative in the comments, using the colMeans function, while Andrew reminds us of the reshape package with its melt and cast functions.
ddply(iris, "Species", function(x) colMeans(subset(x, select= -Species)))
## or
ddply(iris, "Species", colwise(mean)) 
## same output as above
cast(melt(iris, id.vars = "Species"), formula = Species ~ variable, mean)
## same output as above


summaryBy

The summaryBy function of the doBy package by Søren Højsgaard and Ulrich Halekoh has a very intuitive interface, using formulas.
summaryBy(Sepal.Length + Sepal.Width + Petal.Length + Petal.Width ~ Species, data=iris, FUN=mean)

     Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean
1     setosa             5.006            3.428             1.462            0.246
2 versicolor             5.936            2.770             4.260            1.326
3  virginica             6.588            2.974             5.552            2.026


sqldf

If you are fluent in SQL, then the sqldf package by Gabor Grothendieck might be the one for you. Note that sqldf turns the dots in the iris column names into underscores.
sqldf("select Species, avg(Sepal_Length), avg(Sepal_Width), 
    avg(Petal_Length), avg(Petal_Width) from iris 
    group by Species")

     Species avg(Sepal_Length) avg(Sepal_Width) avg(Petal_Length) avg(Petal_Width)
1     setosa             5.006            3.428             1.462            0.246
2 versicolor             5.936            2.770             4.260            1.326
3  virginica             6.588            2.974             5.552            2.026
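A small variation on the call above: SQL column aliases keep the default avg(...) headers out of the result, so the output matches the earlier tables more closely.

```r
library(sqldf)

sqldf("select Species,
         avg(Sepal_Length) as Sepal_Length,
         avg(Sepal_Width)  as Sepal_Width,
         avg(Petal_Length) as Petal_Length,
         avg(Petal_Width)  as Petal_Width
       from iris group by Species")
```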


data.table

The data.table package by M Dowle, T Short and S Lianoglou is the real rock star to me. It provides an elegant and fast way to complete the task. The statement reads in plain English from right to left: take the sub-data (.SD), calculate the mean of each of its columns, split by the factor in the column "Species".
iris.dt <- data.table(iris)
iris.dt[, lapply(.SD, mean), by = "Species"]

        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]     setosa        5.006       3.428        1.462       0.246
[2,] versicolor        5.936       2.770        4.260       1.326
[3,]  virginica        6.588       2.974        5.552       2.026


iris3

I should mention that R also provides the iris data set in array form. The third dimension of the iris3 array holds the species information. Therefore I can use the apply function again, going down the third and then the second dimension to calculate the means.
apply(iris3, c(3,2), mean)

           Sepal L. Sepal W. Petal L. Petal W.
Setosa        5.006    3.428    1.462    0.246
Versicolor    5.936    2.770    4.260    1.326
Virginica     6.588    2.974    5.552    2.026
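The layout of iris3 makes the indexing in the apply call clear:

```r
dim(iris3)            # 50 observations x 4 measurements x 3 species
dimnames(iris3)[[2]]  # the measurement names
dimnames(iris3)[[3]]  # the species names
```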


Many roads lead to Rome, and there are endless ways of explaining how to get there. I only showed a few I know of, and I am curious to hear yours. As a matter of courtesy I should mention the unknownR package by Matthew Dowle. It helps you discover what you don't know that you don't know in R, and thus can help to build your R vocabulary. Of course there is a key difference between R and English: R tells me right away when I make a mistake. Human readers are far more forgiving, but please do point out my mistakes. I am still hopeful that I can improve, but I need your help.

R code

The R code of the examples is available on github. For more examples on the apply family see also Neil Saunders' post.



Credit rating by country

The financial crisis has put a lot of pressure on countries' long-term foreign currency credit ratings, with France recently being downgraded by S&P. Wikipedia provides a list of countries by credit rating, as reported by the US rating agencies S&P, Fitch and Moody's, and by Dagong, a Chinese rating agency.

So, what does the world look like today through the eyes of those rating agencies?

I use the R packages XML and googleVis to read and display the data from Wikipedia with just a few lines.
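As a sketch of the idea, assuming the Wikipedia page exposes the ratings as an ordinary HTML table (the table index and the column names here are assumptions, and the letter ratings would need to be mapped to numbers before plotting):

```r
library(XML)
library(googleVis)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_credit_rating"
tables <- readHTMLTable(url)   # read all HTML tables on the page
ratings <- tables[[1]]         # pick the table holding the ratings

## Assuming the letter grades have been converted into a numeric
## column "Rating", a world map is one further call:
G <- gvisGeoMap(ratings, locationvar = "Country", numvar = "Rating",
                options = list(dataMode = "regions"))
plot(G)
```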



Managing change


Why the old and the new need to share time together

It takes time to appreciate the new. Even if the new is much better than the old. It is easy to forget when you yourself created the exciting new.

At the end of August 2011 Google announced a new Blogger interface. The new interface offered about the same functionality, but had a different look and feel. At first I was reluctant to use it. I just wanted to get my job done. I knew the old interface, so why disturb me with something else that does just the same?

Old Blogger interface

I was relieved to see that there was a button which allowed me to switch back to the old interface. So I continued to use the old interface, although occasionally Google would offer me the new interface after login. Eventually I started to play around with the new interface, e.g. when I was bored and looking for a good excuse to waste my time. After a while I started to like the new interface and used it more often, and then I stopped switching back to the old one when Google offered me the new one on login. It appeared cleaner, more contemporary and better organised than the old one. Now I have switched completely to the new interface, after a transition period of four months!

New Blogger interface

There is something to be learned from this experience

Suppose you write analysis reports for others, e.g. colleagues, clients, friends, students, etc., and you have a new idea that will make your report so much better. Maybe a better structure, a new R package, a new plot, something really sophisticated. You are excited about this, you implement it and send your report out to your readers.

Then: No feedback.

Appreciating change takes time.

Here is a suggestion: implement your next great idea to improve your analysis report, just because it is so much fun and you are really excited about the new. However, continue to produce the old-style report and send it out with the new one. Hopefully your excitement will rub off on the readers. But acknowledge that not all of them will have time to embrace the new immediately. They read the report for a specific purpose, e.g. table X on page Y. Hence give your readers the space and time to review the new report without pressure; eventually they will listen and hopefully switch. But do listen to the readers as well and value their feedback. Easier said than done.

There is another aspect to it. Sending out the new with the old allows you to fail. You have a safety net. Some of your ideas might actually not be that great after all.

PS: Producing the old and the new at the same time is of course only possible if you have a slick, automated reporting system. You can find plenty of ideas on R-Bloggers.


Feedback from vignette survey


Many thanks to all who participated in the survey about writing R package vignettes.

Following my post last Thursday, the responses came in quickly that evening and all day on Friday. Since Saturday the response rate has been steadily decreasing, and I think it is time for a summary based on the 56 responses received.

Summary - How to write a good vignette

  • Length: Trust yourself, but aim for about 20 pages.
  • Language: Don't use language which assumes that the reader is an R and/or subject expert.
  • Structure: Include at least the following sections:
    • Examples
    • Introduction
    • Case studies
    • References
    It would be nice to include also sections on:
    • Support
    • Motivation
    • Road map
  • Examples: Use lots of examples and don't repeat just the examples from the help pages.
  • Get inspiration from: Rcpp, reshape, plyr, vegan, and see below for more.
  • Secrets of good vignettes:
    • Provide an introduction with a clear purpose of the package.
    • Work with case studies, walk the reader through a task from start to finish.
    • Demonstrate the non-default arguments of the package functions, highlight why and when you want to change them.
    • Write briefly and concisely, but provide reference/footnotes to relevant literature and further help.
    • Provide dummy data to play with.
    • Discuss limitations.
  • What else: Potentially split the vignette into several documents, see Rcpp for an example.


Survey: Writing package vignette

I am currently co-writing the vignette for the ChainLadder package and wonder what I should focus on. I co-wrote the vignette of the googleVis package in the past and based it purely on what I thought would work. So, this is an experiment to find out whether user feedback will help me to write a better vignette. Let's see how it develops. I will make the data available once I have at least 10 submissions.

