R is the easiest language to speak badly


I am amazed by the number of comments I received on my recent blog entry about "by", "apply" and friends. I started that post by pointing out that R is a language. Indeed, I have come to the conclusion that it is a language with lots of irregular expressions and dialects. It feels a bit like German or French, where you have to learn and memorise the different articles. German has three singular definite articles: der (masculine), die (feminine) and das (neuter); French has two: le (masculine) and la (feminine). Of course there is no mapping between them, and how do you explain that a girl in German is neuter (das Mädchen), while manhood is feminine (die Männlichkeit)?

Back to R. As I found out, there are lots of different ways to calculate means on subsets of data. I am beginning to wonder why so many different interfaces and functions have been developed over the years, and also why I didn't use the aggregate function more often in the past.
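
To make that concrete, here is a minimal sketch of three of those routes to the same group means. It uses base R only and the built-in iris data; the variable names are purely illustrative.

```r
# Mean sepal width per species, computed three ways on the built-in iris data.
m_tapply <- tapply(iris$Sepal.Width, iris$Species, mean)               # named 1-d array
m_sapply <- sapply(split(iris$Sepal.Width, iris$Species), mean)        # named vector
m_agg    <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)  # data frame

# The numbers agree; only the container each function returns differs.
all.equal(as.vector(m_tapply), as.vector(m_sapply))
all.equal(as.vector(m_tapply), m_agg$Sepal.Width)
```

Which one is "right" mostly depends on what you want back: a named vector is handy for quick look-ups, while the data frame from aggregate slots straight into further data-frame work.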

Can we blame internet search engines? Why should I learn a programming language properly when I can find approximate answers to my problem online? I may not end up with the best answer, but with something that works after all: "Don't know why, but it works."

And sometimes the help files are harder to understand than the code in the examples. Hence, I end up playing around with the example code until it works, and only then do I try to figure out how it works. That was my experience with reshape.

Maybe this is a bit harsh. It is always up to the individual to improve their language skills, but you can also get drunk in a pub knowing only how to order a beer. I think it was George Bernard Shaw who said: "R is the easiest language to speak badly." No, actually he said: "English is the easiest language to speak badly." Maybe that explains the success of English and R?

Reading helps. More and more books on R have been published in recent years, and not only in English. But which should you pick? Xi'an's review of The Art of R Programming suggests that it might be a good start.

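Back to aggregate. Has anyone noticed that the formula interface of aggregate differs from that of summaryBy? aggregate wants multiple response variables wrapped in cbind(), while summaryBy joins them with +, and the two name their output columns differently.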

aggregate(cbind(Sepal.Width, Petal.Width) ~ Species, data=iris, FUN=mean)
     Species Sepal.Width Petal.Width
1     setosa       3.428       0.246
2 versicolor       2.770       1.326
3  virginica       2.974       2.026

versus

library(doBy)
summaryBy(Sepal.Width + Petal.Width ~ Species, data=iris, FUN=mean)
     Species Sepal.Width.mean Petal.Width.mean
1     setosa            3.428            0.246
2 versicolor            2.770            1.326
3  virginica            2.974            2.026

And another slightly more complex example:
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, FUN=sum)
summaryBy(ncases + ncontrols ~ alcgp + tobgp, data = esoph, FUN=sum)
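
If the only thing you miss from summaryBy is its column naming, a small sketch like the following gets close with base R alone. The ".sum" suffix is my assumption, modelled on the ".mean" suffix in the summaryBy output shown earlier, and the helper variable names are illustrative.

```r
# Aggregate with base R, then rename the value columns to mimic the
# "<variable>.<function>" naming seen in the summaryBy output.
res <- aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, FUN = sum)

value_cols <- c("ncases", "ncontrols")            # the response variables
idx <- match(value_cols, names(res))              # their positions in the result
names(res)[idx] <- paste0(value_cols, ".sum")     # assumed suffix convention

head(res)
```

The grouping columns (alcgp and tobgp) keep their names; only the summed columns are relabelled.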


9 comments:

  1. It would be good if you pointed out that the summaryBy function is in the doBy package and not part of the normal stats library...

  2. this is why plyr is so good - you can argue whether or not it's better, but it certainly gives you a consistent interface.

  3. I don't know about R being the same as a spoken language - have you tried asking directions or ordering a meal in R? Regarding the explanation for what seem to be inconsistencies in the German language - 'Mädchen' is a diminutive form of 'Maid', which has a feminine article as expected. All diminutive forms ('Häuschen', 'Hündchen') use the neutral article. You will also notice that all nouns ending in '-keit' use a feminine article, irrespective of what adjective has been coupled with it. Otherwise, nice post. Thanks.

  4. G. Grothendieck, 1 February 2012 12:38

    The history of this is that aggregate in the out-of-the-box R did not have a formula method.   During that time the doBy add-on package (not part of the out-of-the-box R) and its summaryBy command were created.  Later a formula method was added to aggregate in the out-of-the-box R.

  5. Keep in mind that there are thousands of contributed R packages submitted by different authors. Since there's no team of editors poring over submissions (think of Apple's iPad app approval process), these packages will not have perfectly consistent I/O formats. This really isn't different from, say, the MATLAB online contributors' directory.

  6. Good point. I have added the library statement to the post.

  7. Indeed, and sometimes it is both a blessing and a curse. 

  8. Hello All.
    Software Developer's Journal has published a new issue fully dedicated to the R language. You can read the teaser now: http://sdjournal.org/data-development-gems-software-developers-journal-teaser/

  9. Andrej-Nikolai Spiess, 8 March 2014 15:32

    You're speaking out of my heart, to use a direct German translation ;-)
    Same with me: I deliberately used tapply(data, factor, mean) until somebody mentioned the 'ave' function which also seems to be living in oblivion...

    Cheers,
    Andrej
