The other day I got stuck working with a huge data set using data.table in R. It took me a little while to realise that I had to produce a minimal reproducible example to actually understand why I got stuck in the first place. I know, this is the mantra I should follow before I reach out to R-help, Stack Overflow or indeed the package authors. Of course, more often than not, following this advice makes the problem clear and the solution obvious.
Last week Arthur Charpentier sketched out a Markov spatial process to generate hurricane trajectories. Here, I would like to take another look at the data Arthur used, but focus on its time component.
According to the Insurance Information Institute, a normal season, based on averages from 1980 to 2010, has 12 named storms, six hurricanes and three major hurricanes. The usual peak months of August and September passed without any major catastrophes this year, but the Atlantic hurricane season is not over yet.
The example I present here is a little silly, yet it illustrates how to join tables with data.table in R.
Mapping old data to new data
Categories are never fixed; they always change at some point, and that is when the trouble with the data starts. For example, not that long ago we didn’t distinguish between smartphones and dumbphones, or between video on demand and video rental shops.
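As a minimal sketch of what such a mapping can look like with a data.table join: the tables, column names and figures below are made up for illustration and are not taken from the original post. The idea is to keep the old records as they are and look up the new category from a separate mapping table.

```r
library(data.table)

# Hypothetical sales records, stored under the old, coarser categories
sales <- data.table(
  product  = c("phone A", "phone B", "phone C", "rental D"),
  category = c("phone", "phone", "phone", "video rental"),
  units    = c(10, 20, 30, 40)
)

# Hypothetical mapping table from product to the new categories
mapping <- data.table(
  product      = c("phone A", "phone B", "phone C", "rental D"),
  new_category = c("smartphone", "dumbphone", "smartphone", "video rental")
)

# Update join: match rows on 'product' and add the new category by reference;
# the i. prefix refers to the column coming from 'mapping'
sales[mapping, on = "product", new_category := i.new_category]

# Aggregate the old data under the new categories
units_by_new <- sales[, .(units = sum(units)), by = new_category]
units_by_new
```

Keeping the mapping in its own table means that the next change of categories only requires a new mapping, not a rewrite of the historical data.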
Last week’s Cologne R user group meeting was the best attended so far. Well, we had a great line-up indeed. Matt Dowle came over from London to give an introduction to the data.table package. He was joined by his collaborator Arun Srinivasan, who is based in Cologne. Their talk was followed by Thomas Rahlf on Datendesign mit R (Data design with R).
data.table
Matt’s goal with the data.
I really should make it a habit of using data.table. The speed and simplicity of this R package are astonishing.
Here is a simple example: I have a data frame showing incremental claims development by line of business and origin year. Now I would like to add a column with the cumulative claims position for each line of business and each origin year along the development years.
It’s one line with data.
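A sketch of how such a grouped cumulative sum might look with data.table follows. The claims figures and the column names (lob, origin, dev, inc, cum) are invented for illustration; only the idea of cumulating incremental claims by line of business and origin year comes from the text.

```r
library(data.table)

# Hypothetical incremental claims by line of business (lob),
# origin year and development year (dev)
claims <- data.table(
  lob    = rep(c("Motor", "Property"), each = 6),
  origin = rep(rep(c(2010, 2011), each = 3), times = 2),
  dev    = rep(1:3, times = 4),
  inc    = c(100, 50, 25, 110, 55, 30, 200, 80, 40, 210, 90, 45)
)

# Make sure development years are in ascending order within each group
setorder(claims, lob, origin, dev)

# The one-liner: cumulative claims within each lob and origin year,
# added as a new column by reference
claims[, cum := cumsum(inc), by = .(lob, origin)]
claims
```

Because `:=` modifies the table in place, no copy of the data is made, which is part of what makes this approach attractive for large data sets.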
Transforming data sets with R is usually the starting point of my data analysis work. Here is a scenario which comes up from time to time: transform subsets of a data frame, based on context given in one or a combination of columns.
As an example I use a data set which shows sales figures by product for a number of years:
df <- data.frame(Product=gl(3,10,labels=c("A","B", "C")),
                 Year=factor(rep(2002:2011,3)),
                 Sales=1:30)
head(df)
##   Product Year Sales
## 1       A 2002     1
## 2       A 2003     2
## 3       A 2004     3
## 4       A 2005     4
## 5       A 2006     5
## 6       A 2007     6
I am interested in absolute and relative sales developments by product over time.
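One possible reading of "absolute and relative developments" is the year-on-year change within each product. The sketch below uses data.table for this; the choice of package and the column names AbsChange and RelChange are assumptions made for illustration, not necessarily what the original post does.

```r
library(data.table)

# Same example data as above
df <- data.frame(Product = gl(3, 10, labels = c("A", "B", "C")),
                 Year = factor(rep(2002:2011, 3)),
                 Sales = 1:30)
dt <- as.data.table(df)

# Rows are already ordered by Year within Product, so shift() gives the
# previous year's sales; the first year of each product has no predecessor
# and therefore gets NA
dt[, `:=`(
  AbsChange = Sales - shift(Sales),     # absolute change on the previous year
  RelChange = Sales / shift(Sales) - 1  # relative change on the previous year
), by = Product]

head(dt)
```

Grouping with `by = Product` keeps the calculation from running across product boundaries, so the first year of product B is not compared with the last year of product A.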
R is a language, as Luis Apiolaza pointed out in his recent post. This is absolutely true, and learning a programming language is not much different from learning a foreign language. It takes time and a lot of practice to be proficient in it. I started using R when I moved to the UK, and I wonder whether I have a better understanding of English or of R by now.