mages' blog

Notes from the Kölner R meeting, 26 June 2015

Last Friday the Cologne R user group came together for the 14th time. For the first time we met at Startplatz, a start-up incubator venue. The venue was excellent: not only did they provide us with a much larger room, but also with table football and drinks. Many thanks to Kirill for organising all of this!

Photo: Günter Faes
We had two excellent advanced talks. Both were very informative and well presented.

Data Science at the Command Line

Kirill Pomogajko showed us how he uses various command line tools to pre-process log-files for further analysis with R.
Photo: Günter Faes
Imagine you have several servers that generate large data sets with no standard delimiters, like the example below.

At first glance the columns appear to be separated by a blank, but the second column (Military) contains strings such as Air Force that themselves include a blank. Furthermore, another column (Month) has missing data and yet another (Car) uses quotation marks. Thus, the data is messy and difficult to read into R.

To solve the problem Kirill developed a Makefile that uses tools such as scp, sed and awk to download and clean the server files.
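
The same kind of clean-up can also be sketched in base R. This is not Kirill's Makefile; it is only a rough illustration of the parsing problem described above. The three log lines and their layout (id, military branch, optional month, quoted car) are invented for this example, and the column names are assumptions.

```r
# Hypothetical log lines, invented for illustration only:
# id, military branch (may contain a blank), optional month, quoted car
lines <- c(
  '1 Air Force Jan "Ford Focus"',
  '2 Navy "VW Golf"',
  '3 Army Mar "Opel Corsa"'
)

# Step 1: split off the leading id and the trailing quoted field
parts <- regmatches(lines, regexec('^([0-9]+) (.*) "([^"]*)"$', lines))
m <- do.call(rbind, parts)  # columns: full match, id, middle part, car
middle <- m[, 3]

# Step 2: the middle part ends with a month abbreviation only if one was recorded
last_word <- sub("^.* ", "", middle)
has_month <- last_word %in% month.abb
month     <- ifelse(has_month, last_word, NA_character_)
military  <- ifelse(has_month, sub(" [^ ]+$", "", middle), middle)

clean <- data.frame(id = as.integer(m[, 2]),
                    military = military,
                    month = month,
                    car = m[, 4],
                    stringsAsFactors = FALSE)
print(clean)
```

Doing this step with command line tools, as Kirill does, has the advantage that R only ever sees the cleaned files.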

Kirill's tutorial files are available via GitHub.

An Introduction to RStan and the Stan Modelling Language


Paul Viefers gave a great introduction to Stan and RStan, with a focus on explaining the differences to other MCMC packages such as JAGS.

Photo: Günter Faes

Stan is a probabilistic programming language for Bayesian inference. One of the major challenges in Bayesian analysis is that the posterior distribution often has no analytical solution. Hence, the posterior distribution is approximated via simulation, for example with Gibbs sampling in JAGS. Stan, on the other hand, uses Hamiltonian Monte Carlo (HMC), an algorithm that proposes jumps more carefully: it exploits more of the structure of the posterior, namely its gradient, by translating the sampling problem into the framework of Hamiltonian mechanics.
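
To give a flavour of the idea, here is a minimal HMC sketch in base R for a standard normal target. It is a toy illustration and not Stan's implementation (Stan uses an adaptive variant of HMC); the function names, step size `eps` and number of leapfrog steps `L` are my own choices for this example.

```r
# Toy target: standard normal, so U(q) = q^2 / 2 is the negative log density
U      <- function(q) 0.5 * q^2
grad_U <- function(q) q

hmc_step <- function(q, eps = 0.1, L = 20) {
  p <- rnorm(1)                                # draw a fresh momentum
  q_new <- q
  p_new <- p - 0.5 * eps * grad_U(q_new)       # half step for the momentum
  for (i in seq_len(L)) {
    q_new <- q_new + eps * p_new               # full step for the position
    if (i < L) p_new <- p_new - eps * grad_U(q_new)  # full momentum step
  }
  p_new <- p_new - 0.5 * eps * grad_U(q_new)   # final half step for the momentum
  # Metropolis accept/reject based on the change in total energy
  H_old <- U(q) + 0.5 * p^2
  H_new <- U(q_new) + 0.5 * p_new^2
  if (log(runif(1)) < H_old - H_new) q_new else q
}

set.seed(1)
n <- 5000
draws <- numeric(n)
q <- 0
for (i in seq_len(n)) {
  q <- hmc_step(q)
  draws[i] <- q
}
c(mean = mean(draws), sd = sd(draws))  # should be roughly 0 and 1
```

The gradient steers each proposal along a trajectory of roughly constant energy, which is why distant proposals can still be accepted with high probability.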

Paul ended his talk by walking us through the various building blocks of a Stan script, using a hierarchical logistic regression example.

You can access Paul's slides on RPubs.

Drinks and Networking

No Cologne R user group meeting is complete without Kölsch and networking. Afterwards some of us ended up in a fancy burger place.

Next Kölner R meeting

The next meeting will be scheduled in September. Details will be published on our Meetup site. Thanks again to Revolution Analytics for their sponsorship.

Next Kölner R User Meeting: Friday, 26 June 2015

The next Cologne R user group meeting is scheduled for this Friday, 26 June 2015, and we have an exciting agenda with two talks followed by networking drinks.

  • Data Science at the Commandline (Kirill Pomogajko)
  • An Introduction to RStan and the Stan Modelling Language (Paul Viefers)
Please note: Our venue has changed! We have outgrown the seminar room at the Institute of Sociology and are moving to Startplatz, a start-up incubator venue: Im Mediapark 5, 50670 Köln

Drinks and Networking

The event will be followed by drinks (Kölsch!) and networking opportunities.

For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.

How to place titles in lattice plots

I like the Economist theme in the latticeExtra package. It produces nice looking charts that mimic the design of the weekly newspaper, such as in this example:


For some time I wondered how I could put the title of my lattice plots into the top left corner as well (by default titles are centred). Reviewing the code of the theEconomist.theme function by Felix Andrews reveals the trick. It is the setting of par.main.text:

library(lattice)
my.settings <- list(
  par.main.text = list(font = 2, # make it bold
                       just = "left", 
                       x = grid::unit(5, "mm")))

xyplot(sin(1:100) ~ cos(1:100), 
       par.settings=my.settings,
       main="Hello World", 
       type="l")


Furthermore, I can use the same approach to place a sub-title in the bottom left corner of my chart, e.g. to describe the source of my data:

my.settings <- list(
  par.main.text = list(font = 2, # make it bold
                       just = "left", 
                       x = grid::unit(5, "mm")),
  par.sub.text = list(font = 1, 
                      just = "left", 
                      x = grid::unit(5, "mm"))
  )

xyplot(sin(1:100) ~ cos(1:100), 
       par.settings=my.settings,
       main="Hello World", 
       sub="Source: Nobody knows",
       type="l")


For more information see also the lattice help pages or the lattice book by Deepayan Sarkar: Lattice: Multivariate Data Visualization with R.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] lattice_0.20-31

loaded via a namespace (and not attached):
[1] tools_3.2.0 grid_3.2.0

Using system and web fonts in R plots

The forthcoming R Journal has an interesting article on the showtext package by Yixuan Qiu. The package allows me to use system and web fonts directly in R plots, reminding me a little of the approach taken by XeLaTeX. But "unlike other methods to embed fonts into graphics, showtext converts text into raster images or polygons, and then adds them to the plot canvas. This method produces platform-independent image files that do not rely on the fonts that create them." [1]

Here is an example with fonts from my local system:


library(showtext)
png("System-Fonts.png", width=550, height=350)
par(mfrow=c(2,2))
plot(1 ~ 1, main="Lucida Bright", family = "Lucida Bright")
plot(1 ~ 1, main="Courier", family = "Courier")
plot(1 ~ 1, main="Helvetica Neue Light", family = "Helvetica Neue Light") 
plot(1 ~ 1, main="Lucida Handwriting Italic", family = "Lucida Handwriting Italic")
dev.off()

Additionally, showtext allows me to use fonts hosted online, e.g. Google web fonts:

font.add.google("Alegreya Sans", "aleg")
font.add.google("Permanent Marker", "marker")
font.add.google("Gruppo", "gruppo")
font.add.google("Lobster", "lobster")
png("Google-Fonts.png", width=550, height=350)
showtext.begin()
par(mfrow=c(2,2))
plot(1 ~ 1, main="Alegreya Sans", family = "aleg")
plot(1 ~ 1, main="Permanent Marker", family = "marker")
plot(1 ~ 1, main="Gruppo", family = "gruppo") 
plot(1 ~ 1, main="Lobster", family = "lobster") 
showtext.end()
dev.off()

For more information read the article and/or visit the project site.

References

[1] Yixuan Qiu. showtext: Using System Fonts in R Graphics. The R Journal, 7(1), 2015.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] showtext_0.4-2 sysfonts_0.5  

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.6 showtextdb_1.0 jsonlite_0.9.16 bitops_1.0-6

Back from R/Finance in Chicago

I had a great time at the R/Finance conference in Chicago last Friday/Saturday. Some brief takeaways for me were:

From Emanuel Derman's talk: It is important to distinguish between theories and models. Theories live in an abstract world and for a given set of axioms they can be proven right. However, models live in the real world, are built on simplifying assumptions and are only useful until experiments/data prove them wrong.

'Pornography is hard to define, but I know it when I see it.' Matt Dowle from h2o had the laughs on his side when he started his talk with this Justice Potter Stewart quote to illustrate the value of his data.table package to its users.

Bryan W. Lewis showed why inverting a matrix is tricky, particularly when it contains entries close to zero and what you can do about it.

Marius Hofert gave a stimulating talk on simsalapar, a package for parallel simulations, which I need to study in more detail.

Following a brief conversation with Dirk on drat, I finally got the punch line of the package, but not so much the joke on drat as a fairly mild expression of anger or annoyance. I had never heard the expression in the UK. Perhaps drat is better explained as Dirk's R Archive Template?

The audience seemed to appreciate my talk on Communicating Risk. My Whale Chart for visualising profitability appeared to resonate with a few.

Furthermore, I learned that the weather in Chicago is even more unstable than in London. After an amazing conference dinner at the Trump Tower, spending most of the time outside and admiring the sunset, we experienced a very cold and rainy Saturday. But then again, there is always time for a Jazz club and a drink. Talking about drinks, thanks to Q Ethan McCallum I had a true American breakfast experience, including bottomless coffee.

Yet, the last word should go to ShabbyChef, who took a photo of a slide during Louis Marascio's keynote and tweeted:


Amen.

Communicating Risk at the Bay Area R User Group

I will be speaking at the Bay Area R User Group meeting tonight about Communicating Risk. Anthony Goldbloom from Kaggle and Karim Chine from ElasticR will be there as well. The meeting will be at Microsoft in Mountain View.

Later this week I will give a similar presentation at the R in Finance conference in Chicago. Please get in touch if you are around and would like to share a coffee with me.

Posterior predictive output with Stan

I continue my Stan experiments with another insurance example. Here I am particularly interested in the posterior predictive distribution from only three data points. Or, to put it differently, I have had a customer for three years and I'd like to predict the expected claims cost for the next year to set or adjust the premium.

The example is taken from section 16.17 in Loss Models: From Data to Decisions [1]. Some time ago I used the same example to get my head around a Bayesian credibility model.

Suppose the claims likelihood distribution is believed to follow an exponential distribution for a given parameter \(\Theta\). The prior parameter distribution on \(\Theta\) is assumed to be a gamma distribution with parameters \(\alpha=4, \beta=1000\):
\[\begin{aligned}\Theta & \sim \mbox{Gamma}(\alpha, \beta)\\
\ell_i & \sim \mbox{Exp}(\Theta) , \; \forall i \in N
\end{aligned}\]

In this case the prior predictive distribution is a Pareto II distribution with density \(f(x) = \frac{\alpha \beta^\alpha}{(x+\beta)^{\alpha+1}}\) and a mean of \(\frac{\beta}{\alpha-1} = 333.33\).

I have three independent observations, namely losses of $100, $950 and $450. The posterior predictive expected loss is $416.67 and can be derived analytically, as shown in my earlier post. Now let me reproduce the answer with Stan as well.

Implementing the model in Stan is straightforward and I follow the same steps as in my simple example of last week. However, here I am also interested in the posterior predictive distribution, hence I add a generated quantities code block.



The output shows a simulated predictive mean of $416.86, close to the analytical answer. I can also read off that the 75th percentile of the posterior predictive distribution is a loss of $542, compared with $414 from the prior predictive. That means I shouldn't be surprised to observe a loss in excess of $500 once every four years. Further, I note that 90% of losses are expected to be less than $950; in other words, the $950 observation in my data may reflect the outcome of an event with a 1 in 10 return period.

Comparing the sampling output from Stan with the analytical output gives me some confidence that I am doing the 'right thing'.
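
The analytical side of that comparison can be sketched in a few lines of base R. This is not the Stan model, just the conjugate gamma-exponential update, assuming \(\beta\) is the rate parameter of the gamma prior; the helper names `pareto2_mean` and `pareto2_quantile` are my own. A plain Monte Carlo draw from the posterior predictive is added as a sanity check.

```r
alpha  <- 4
beta   <- 1000
losses <- c(100, 950, 450)

# Conjugate update: Gamma(alpha, beta) prior with exponential likelihood
alpha_post <- alpha + length(losses)  # 4 + 3 = 7
beta_post  <- beta + sum(losses)      # 1000 + 1500 = 2500

# Pareto II predictive distribution: mean and quantile function
pareto2_mean     <- function(a, b) b / (a - 1)
pareto2_quantile <- function(p, a, b) b * ((1 - p)^(-1 / a) - 1)

prior_mean <- pareto2_mean(alpha, beta)            # 333.33
post_mean  <- pareto2_mean(alpha_post, beta_post)  # 416.67
prior_q75  <- pareto2_quantile(0.75, alpha, beta)  # about 414.2
post_q75   <- pareto2_quantile(0.75, alpha_post, beta_post)

# Monte Carlo check: draw Theta from the posterior, then a loss given Theta
set.seed(42)
theta <- rgamma(1e5, shape = alpha_post, rate = beta_post)
sim   <- rexp(1e5, rate = theta)

round(c(prior_mean = prior_mean, post_mean = post_mean,
        prior_q75 = prior_q75, post_q75 = post_q75,
        sim_mean = mean(sim)), 2)
```

The simulated mean will differ slightly from the analytical value from run to run, just as the Stan output does.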

References

[1] Klugman, S. A., Panjer, H. H. & Willmot, G. E. (2004), Loss Models: From Data to Decisions, Wiley Series in Probability and Statistics.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] lattice_0.20-31 actuar_1.1-8 rstan_2.6.0 inline_0.3.14  
[5] Rcpp_0.11.6    

loaded via a namespace (and not attached):
[1] tools_3.2.0  codetools_0.2-11 grid_3.2.0 stats4_3.2.0