mages' blog

Next Kölner R User Meeting: Friday, 26 June 2015

The next Cologne R user group meeting is scheduled for this Friday, 26 June 2015, and we have an exciting agenda with two talks followed by networking drinks.

  • Data Science at the Commandline (Kirill Pomogajko)
  • An Introduction to RStan and the Stan Modelling Language (Paul Viefers)

Please note: Our venue has changed! We have outgrown the seminar room at the Institute of Sociology and are moving to Startplatz, a start-up incubator venue: Im Mediapark 5, 50670 Köln

Drinks and Networking

The event will be followed by drinks (Kölsch!) and networking opportunities.

For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.

How to place titles in lattice plots

I like the Economist theme in the latticeExtra package. It produces nice looking charts that mimic the design of the weekly newspaper, such as in this example:


For some time I wondered how I could put the title of my lattice plots into the top left corner as well (by default titles are centred). Reviewing the code of the theEconomist.theme function by Felix Andrews reveals the trick. It is the setting of par.main.text:

library(lattice)
my.settings <- list(
  par.main.text = list(font = 2, # make it bold
                       just = "left", 
                       x = grid::unit(5, "mm")))

xyplot(sin(1:100) ~ cos(1:100), 
       par.settings=my.settings,
       main="Hello World", 
       type="l")


Furthermore, I can use the same approach to place a sub-title in the bottom left corner of my chart, e.g. to describe the source of my data:

my.settings <- list(
  par.main.text = list(font = 2, # make it bold
                       just = "left", 
                       x = grid::unit(5, "mm")),
  par.sub.text = list(font = 1, 
                      just = "left", 
                      x = grid::unit(5, "mm"))
  )

xyplot(sin(1:100) ~ cos(1:100), 
       par.settings=my.settings,
       main="Hello World", 
       sub="Source: Nobody knows",
       type="l")
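
The two settings lists above differ only in the text elements they cover, so the pattern can be wrapped in a small helper. Here is a minimal sketch of that idea; the function name left_aligned_titles and its offset_mm argument are my own illustrative choices and not part of lattice:

```r
library(lattice)

# Hypothetical helper (not part of lattice): builds par.settings that
# left-align the main title and the sub-title at a given offset from the left.
left_aligned_titles <- function(offset_mm = 5) {
  left <- list(just = "left", x = grid::unit(offset_mm, "mm"))
  list(
    par.main.text = c(list(font = 2), left),  # bold title
    par.sub.text  = c(list(font = 1), left)   # plain sub-title
  )
}

# Build the plot object; print(p) draws it on the active device
p <- xyplot(sin(1:100) ~ cos(1:100),
            par.settings = left_aligned_titles(10),
            main = "Hello World",
            sub = "Source: Nobody knows",
            type = "l")
```

Keeping the offset as an argument makes it easy to try different margins without retyping both lists.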


For more information see also the lattice help pages or the lattice book by Deepayan Sarkar: Lattice: Multivariate Data Visualization with R.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] lattice_0.20-31

loaded via a namespace (and not attached):
[1] tools_3.2.0 grid_3.2.0

Using system and web fonts in R plots

The forthcoming R Journal has an interesting article on the showtext package by Yixuan Qiu. The package allows me to use system and web fonts directly in R plots, reminding me a little of the approach taken by XeLaTeX. But "unlike other methods to embed fonts into graphics, showtext converts text into raster images or polygons, and then adds them to the plot canvas. This method produces platform-independent image files that do not rely on the fonts that create them." [1]

Here is an example with fonts from my local system:


library(showtext)
png("System-Fonts.png", width=550, height=350)
par(mfrow=c(2,2))
plot(1 ~ 1, main="Lucida Bright", family = "Lucida Bright")
plot(1 ~ 1, main="Courier", family = "Courier")
plot(1 ~ 1, main="Helvetica Neue Light", family = "Helvetica Neue Light") 
plot(1 ~ 1, main="Lucida Handwriting Italic", family = "Lucida Handwriting Italic")
dev.off()

Additionally, showtext allows me to use fonts hosted online, e.g. Google web fonts:

font.add.google("Alegreya Sans", "aleg")
font.add.google("Permanent Marker", "marker")
font.add.google("Gruppo", "gruppo")
font.add.google("Lobster", "lobster")
png("Google-Fonts.png", width=550, height=350)
showtext.begin()
par(mfrow=c(2,2))
plot(1 ~ 1, main="Alegreya Sans", family = "aleg")
plot(1 ~ 1, main="Permanent Marker", family = "marker")
plot(1 ~ 1, main="Gruppo", family = "gruppo") 
plot(1 ~ 1, main="Lobster", family = "lobster") 
showtext.end()
dev.off()

For more information read the article and/or visit the project site.

References

[1] Yixuan Qiu. showtext: Using System Fonts in R Graphics. The R Journal, 7(1), 2015.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] showtext_0.4-2 sysfonts_0.5  

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.6 showtextdb_1.0 jsonlite_0.9.16 bitops_1.0-6

Back from R/Finance in Chicago

I had a great time at the R/Finance conference in Chicago last Friday/Saturday. Some brief takeaways for me were:

From Emanuel Derman's talk: It is important to distinguish between theories and models. Theories live in an abstract world and for a given set of axioms they can be proven right. However, models live in the real world, are built on simplifying assumptions and are only useful until experiments/data prove them wrong.

'Pornography is hard to define, but I know it when I see it.' Matt Dowle from h2o had the laughs on his side when he started his talk with this Justice Potter Stewart quote to illustrate the value of his data.table package to its users.

Bryan W. Lewis showed why inverting a matrix is tricky, particularly when it contains entries close to zero, and what you can do about it.

Marius Hofert gave a stimulating talk on simsalapar, a package for parallel simulations, which I need to study in more detail.

Following a brief conversation with Dirk on drat I finally got the punch line of the package, but not so much the joke on drat as a fairly mild expression of anger or annoyance. I had never heard the expression in the UK. Perhaps drat is better explained as Dirk's R Archive Template?

The audience seemed to appreciate my talk on Communicating Risk. My approach of visualising profitability with a Whale Chart appeared to resonate with a few.

Furthermore, I learned that the weather in Chicago is even more unstable than in London. After an amazing conference dinner at the Trump Tower, spending most of the time outside and admiring the sunset, we experienced a very cold and rainy Saturday. But then again, there is always time for a Jazz club and a drink. Talking about drinks, thanks to Q Ethan McCallum I had a true American breakfast experience, including bottomless coffee.

Yet, the last word should go to ShabbyChef, who took a photo of a slide during Louis Marascio's keynote and tweeted:


Amen.

Communicating Risk at the Bay Area R User Group

I will be speaking at the Bay Area R User Group meeting tonight about Communicating Risk. Anthony Goldbloom from Kaggle and Karim Chine from ElasticR will be there as well. The meeting will be at Microsoft in Mountain View.

Later this week I will give a similar presentation at the R in Finance conference in Chicago. Please get in touch if you are around and would like to share a coffee with me.

Posterior predictive output with Stan

I continue my Stan experiments with another insurance example. Here I am particularly interested in the posterior predictive distribution from only three data points. Or, to put it differently, I have had a customer for three years and I'd like to predict the expected claims cost for the next year to set or adjust the premium.

The example is taken from section 16.17 in Loss Models: From Data to Decisions [1]. Some time ago I used the same example to get my head around a Bayesian credibility model.

Suppose the claims likelihood distribution is believed to follow an exponential distribution for a given parameter \(\Theta\). The prior parameter distribution on \(\Theta\) is assumed to be a gamma distribution with parameters \(\alpha=4, \beta=1000\):
\[\begin{aligned}\Theta & \sim \mbox{Gamma}(\alpha, \beta)\\
\ell_i & \sim \mbox{Exp}(\Theta) , \; \forall i \in N
\end{aligned}\]
In this case the predictive distribution is a Pareto II distribution with density \(f(x) = \frac{\alpha \beta^\alpha}{(x+\beta)^{\alpha+1}}\) and a mean of \(\frac{\beta}{\alpha-1}=\,\)333.33.

I have three independent observations, namely losses of $100, $950 and $450. The posterior predictive expected loss is $416.67 and can be derived analytically, as shown in my earlier post. Now let me reproduce the answer with Stan as well.
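
For reference, the analytical answer can be sketched in a few lines. I read \(\beta\) as the rate parameter of the gamma prior, which is what the Pareto II density above implies; the posterior hyper-parameters \(\alpha'\) and \(\beta'\) are notation introduced here for illustration:
\[\begin{aligned}
\pi(\Theta \mid \ell_1, \ell_2, \ell_3) & \propto \Theta^{\alpha-1} e^{-\beta\Theta} \prod_{i=1}^{3} \Theta\, e^{-\Theta \ell_i} = \Theta^{\alpha+3-1} e^{-(\beta + \sum_{i} \ell_i)\Theta}\\
\alpha' & = \alpha + 3 = 7, \quad \beta' = \beta + 100 + 950 + 450 = 2500\\
\mbox{E}[\ell_4 \mid \ell_1, \ell_2, \ell_3] & = \frac{\beta'}{\alpha'-1} = \frac{2500}{6} \approx 416.67
\end{aligned}\]
So the posterior is again a gamma distribution and the posterior predictive distribution is a Pareto II distribution with the updated parameters.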

Implementing the model in Stan is straightforward and I follow the same steps as in my simple example of last week. However, here I am also interested in the posterior predictive distribution, hence I add a generated quantities code block.



The output shows a simulated predictive mean of $416.86, close to the analytical answer. I can also read off that the 75th percentile of the posterior predictive distribution is a loss of $542 vs. $414 from the prior predictive. That means every four years I shouldn't be surprised to observe a loss in excess of $500. Further, I note that 90% of losses are expected to be less than $950, or in other words the observation in my data may reflect the outcome of an event with a 1 in 10 return period.
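
As a cross-check that does not need Stan, the same quantities can be approximated in base R. This is only a sketch under the conjugate set-up described above: the helper names pareto2_mean and pareto2_quantile are my own, and the simulation is a plain Monte Carlo stand-in for the generated quantities block, not Stan's sampler, so its figures will differ slightly from the Stan output.

```r
# Prior hyper-parameters and observed losses, as given in the text
alpha <- 4
beta <- 1000
losses <- c(100, 950, 450)

# Conjugate update: gamma prior on the rate of an exponential likelihood
alpha_post <- alpha + length(losses)
beta_post <- beta + sum(losses)

# Hypothetical helpers for the Pareto II predictive distribution,
# using F(x) = 1 - (scale / (x + scale))^shape
pareto2_mean <- function(shape, scale) scale / (shape - 1)
pareto2_quantile <- function(p, shape, scale) {
  scale * ((1 - p)^(-1 / shape) - 1)
}

prior_mean <- pareto2_mean(alpha, beta)            # 333.33
post_mean <- pareto2_mean(alpha_post, beta_post)   # 416.67
prior_q75 <- pareto2_quantile(0.75, alpha, beta)
post_q75 <- pareto2_quantile(0.75, alpha_post, beta_post)

# Monte Carlo version: draw a rate from the posterior, then a loss given
# that rate, which mirrors what a generated quantities block does
set.seed(1)
n_sim <- 1e5
theta_sim <- rgamma(n_sim, shape = alpha_post, rate = beta_post)
loss_sim <- rexp(n_sim, rate = theta_sim)
sim_mean <- mean(loss_sim)
sim_q75 <- quantile(loss_sim, 0.75)
```

Comparing sim_mean with post_mean, and sim_q75 with post_q75, gives a feel for how much simulation noise to expect in figures like the ones above.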

Comparing the sampling output from Stan with the analytical output gives me some confidence that I am doing the 'right thing'.

References

[1] Klugman, S. A., Panjer, H. H. & Willmot, G. E. (2004), Loss Models: From Data to Decisions, Wiley Series in Probability and Statistics.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] lattice_0.20-31 actuar_1.1-8 rstan_2.6.0 inline_0.3.14  
[5] Rcpp_0.11.6    

loaded via a namespace (and not attached):
[1] tools_3.2.0  codetools_0.2-11 grid_3.2.0 stats4_3.2.0

Hello Stan!

In my previous post I discussed how Longley-Cook, an actuary at an insurance company in the 1950s, used Bayesian reasoning to estimate the probability of a mid-air collision of two planes.

Here I will use the same model to get started with Stan/RStan, a probabilistic programming language for Bayesian inference.

Last week my prior was given as a Beta distribution with parameters \(\alpha=1, \beta=1\) and the likelihood was assumed to be a Bernoulli distribution with parameter \(\theta\):
\[\begin{aligned}
\theta & \sim \mbox{Beta}(1, 1)\\
x_i & \sim \mbox{Bernoulli}(\theta), \;\forall i \in N
\end{aligned}\]
For the previous five years no mid-air collisions were observed, \(x=\{0, 0, 0, 0, 0\}\). That's my data.

In this case the posterior distribution can be derived analytically. The posterior hyper-parameters are \(\alpha'=\alpha + \sum_{i=1}^n x_i,\, \beta'=\beta + n - \sum_{i=1}^n x_i\) and with that I get the parameter of the posterior predictive distribution, which is a Bernoulli distribution again: \(\theta' = \alpha'/(\alpha'+\beta')=1/7\approx14.3\%\).
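
The update above is easy to check in base R. The sketch below first applies the conjugate formulas and then, as a simple stand-in for simulation, approximates the posterior mean on a grid of \(\theta\) values; the grid approach and all variable names are my own illustration and not what Stan does internally.

```r
# Data and prior hyper-parameters from the text
x <- c(0, 0, 0, 0, 0)
alpha <- 1
beta <- 1

# Conjugate Beta-Bernoulli update
alpha_post <- alpha + sum(x)
beta_post <- beta + length(x) - sum(x)
theta_pred <- alpha_post / (alpha_post + beta_post)  # 1/7

# Grid approximation: prior density times likelihood, evaluated at the
# midpoints of 1000 equally wide intervals on (0, 1) and then normalised
theta_grid <- seq(0.0005, 0.9995, by = 0.001)
unnormalised <- dbeta(theta_grid, alpha, beta) *
  theta_grid^sum(x) * (1 - theta_grid)^(length(x) - sum(x))
posterior_weight <- unnormalised / sum(unnormalised)
theta_grid_mean <- sum(theta_grid * posterior_weight)
```

Both theta_pred and theta_grid_mean should agree closely, which is a cheap sanity check before moving on to MCMC.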

Still, I can use Stan and MCMC simulations to come to the same answers (of course I am using a sledgehammer here to crack a nut).

In the first code block the model is written in Stan's modelling language. The next section calls stan and finally the results can be analysed. The answers are very much the same as the analytical approach in my previous post.



Interested in the application of R in insurance? Join us at the 3rd R in Insurance conference in Amsterdam, 29 June 2015.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.6.0   inline_0.3.14 Rcpp_0.11.6  

loaded via a namespace (and not attached):
[1] tools_3.2.0      codetools_0.2-11 stats4_3.2.0