mages' blog

Notes from the Kölner R meeting, 14 October 2016

Last Friday the Cologne R user group came together for two talks and a quiz at Eye/o, the company behind Adblock Plus, in Köln-Ehrenfeld. Eye/o were a great host, offering nibbles and drinks to warm up the event and pizza at the end.

Cologne R user meeting at Eye/o
The first talk was given by Jiddu Alexander, a physicist turned freelance data scientist. Jiddu gave an introduction into the tidyverse. He presented the concept of tidy data, and how the tidyverse bundle can be used to manage multiple models. Furthermore, he explained the concept of learning curves for model selection. Jiddu's slides are available from his web site.

Jiddu Alexander explaining learning curves

Next up was Nils Glück to share his experience on performance profiling. R code often grows from a small idea for a specific task to a longer and longer script as more and more ideas and use cases are added. Occasionally, we end up with a long and poorly documented script that 'does the job' but has become slow. Finding the bottlenecks and addressing them is good short term remedy. Nils showed us how the Rprof function of the utils package can be used to understand the performance profile of R code. Furthermore, the microbenchmark package with a function of the same name can then be used to test new approaches for a code block.

Nils Glück quoting others who are not bothered about performance

To bridge the time for the pizzas to arrive our host Kirill had prepared a little R quiz: Could we guess the output of simple R statements? Well, it is more difficult than you might think. Kirill had a great selection of quirky one-liners, which he had collected over time and borrowed from the fabulous R Inferno book by Pat Burns.

Next Kölner R meeting

The next meeting will be scheduled in about three months time. Details will be published on our Meetup site. Thanks again to Eye/o for their support.

Please get in touch, if you would like to present at the next meeting.

Next Kölner R User Meeting: Friday 14 October

Koeln R
The 19th Cologne R user group meeting is scheduled for this Friday, 14 October 2016. We have three talks, followed by networking drinks.

  • Introduction to the tidyverse tools - Jiddu Alexander
  • Performance profiling and improvement in R - Nils Glück
  • Batch processing of R-Scripts with Excel - Klaus Jacobi
Venue: Eyeo GmbH, Lichtstraße 25, 50825 Köln

For further details visit our KölnRUG Meetup site.

Notes from past meetings are available here.

Notes from 4th Bayesian Mixer Meetup

Last Tuesday we got together for the 4th Bayesian Mixer Meetup. Product Madness kindly hosted us at their offices in Euston Square. About 50 Bayesians came along; the biggest turn up thus far, including developers of PyMC3 (Peadar Coyle) and Stan (Michael Betancourt).

The agenda had two feature talks by Dominic Steinitz and Volodymyr Kazantsev and a lightning talk by Jon Sedar.

Dominic Steinitz: Hamiltonian and Sequential MC samplers to model ecosystems
Dominic shared with us his experience of using Hamiltonian and Sequential Monte Carlo samplers to model ecosystems.

Volodymyr Kazantsev: Bayesian Model Averaging
Finding the 'best' model was Volodymyr's challenge. He tried various R packages (BMA, BMS and BAS) for Bayesian model averaging, with various degrees of success.

Jon Sedar: Easier Plate Notation in Python using Daft
Finally, Jon gave a brief overview on Daft, a nifty Python package for creating graphs, or plate notation.

Next meeting

The next Bayesian Mixer Meetup meeting is already scheduled for 21 October. We will be back at Cass Business School, with two talks:

  • Darren Wilkinson: Hierarchical Bayesian Modelling of Growth Curves inc Stochastic Processes
  • Peadar Coyle: Advanced PyMC3

Fitting a distribution in Stan from scratch

Last week the French National Institute of Health and Medical Research (Inserm) organised with the Stan Group a training programme on Bayesian Inference with Stan for Pharmacometrics in Paris.

Daniel Lee and Michael Betancourt, who run the course over three days, are not only members of Stan's development team, but also excellent teachers. Both were supported by Eric Novik, who gave an Introduction to Stan at the Paris Dataiku User Group last week as well.

Eric Kramer (Dataiku), Daniel Lee, Eric Novik & Michael Betancourt (Stan Group)

I have been playing around with Stan on and off for some time, but as Eric pointed out to me, Stan is not that kind of girl(boy?). Indeed, having spent three days working with Stan has revitalised my relationship. Getting down to the basics has been really helpful and I shall remember, Stan is not drawing samples from a distribution. Instead, it is calculating the joint distribution function (in log space), and evaluating the probability distribution function (in log space).

Thus, here is a little example of fitting a set of random numbers in R to a Normal distribution with Stan. Yet, instead of using the built-in functions for the Normal distribution, I define the log probability function by hand, which I will use in the model block as well, and even generate a random sample, starting with a uniform distribution. However, I do use pre-defined distributions for the priors.

Why do I want to do this? This will be a template for the day when I have to use a distribution, which is not predefined in Stan, e.g. the actuar package has some interesting candidates.


I start off by generating fake data, a sample of 100 random numbers drawn from a Normal distribution with a mean of 4 and a standard deviation of 2. Note, the sample mean of the 100 figures is 4.2 and not 4.
Histogram of 100 random numbers drawn from N(4,2).
I then use the Stan script to fit the data, i.e. to find the the parameters \(\mu\) and \(\sigma\), assuming that the data was generated by a Gaussian process.

Traceplot of 4 chains, including warm-up phase
Histograms of posterior parameter and predictive samples
Comparison of the emperical distributions
The posterior parameter distributions include both \(\mu\) and \(\sigma\) in the 95% credible interval. The distribution of posterior predictive check (y_ppc) is wider, taking into account the uncertainty of the parameters. The interquartile range and mean of my initial fake data and the sample of the posterior predictive distribution look very similar. That's good, my model generates data, which looks like the original data.

Bayesian Mixer Meetup

Btw, tonight we have the 4th Bayesian Mixer Meetup in London.

Session Info

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12 (Sierra)

[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] MASS_7.3-45 rstan_2.12.1 StanHeaders_2.12.0 ggplot2_2.1.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7      codetools_0.2-14 digest_0.6.10    grid_3.3.1      
 [5] plyr_1.8.4       gtable_0.2.0     stats4_3.3.1     scales_0.4.0    
 [9] labeling_0.3     tools_3.3.1      munsell_0.4.3    inline_0.3.14   
[13] colorspace_1.2-6 gridExtra_2.2.1

googleVis 0.6.1 on CRAN

We released googleVis version 0.6.1 on CRAN last week. The update fixes issues with setting certain options, following the switch from RJSONIO to jsonlite.

Screen shot of some of the Google Charts
New to googleVis? The package provides an interface between R and the Google Charts Tools, allowing you to create interactive web charts from R without uploading your data to Google. The charts are displayed by default via the R internal help browser.

To lean more see the examples of googleVis charts on CRAN and read the introduction vignette.

Notes from the 4th R in Insurance Conference

The 4th R in Insurance conference took place at Cass Business School London on 11 July 2016. This one-day conference focused once more on the wide range of applications of R in insurance, actuarial science and beyond. The conference programme covered topics including reserving, pricing, loss modelling, the use of R in a production environment and much more.

The audience of the conference included both practitioners (c.80%) and academics (c.20%) who are active or interested in the applications of R in Insurance. It was a truly international event with speakers and delegates from Europe, Asia and the Americas. The coffee breaks and conference dinner offered great networking opportunities.

Mario Wüthrich, ETH Zürich

In the first plenary session Mario Wüthrich (RiskLab ETH Zurich) spoke about the (new) challenges in actuarial science. While fundamentals of analysing data have not changed over the years, the data and technology available has, and with that new challenges emerged. Yet, as Mario pointed out, insurance is still often concerned with analysing 'little' data, as losses occur rarely. Furthermore, the bigger data sets, often generated by sensors, require careful calibration, monitoring and cleansing. Those new challenges provide opportunities for new research (if data is being made available) and the industry. The R community can provide links between the two. Mario would like to see more and better documentation of R packages, more insurance examples and better handling of big data.

Thereafter, the programme consisted of a combination of contributed presentations and lightning talks, as well as a panel discussion on how analytics is transforming the insurance business. Adrian Cuc (Verisk), Simon Brickman (Beazley), Roland Schmid (Mirai Solutions) and Markus Gesmann (Vario Partners) discussed the efforts made in bridging between data vendors, consultants and insurers, as well as the challenges of developing collaborative business models that respond to market needs.

Dan Murphy, Trinostics

In the closing plenary, Dan Murphy (Trinostics, San Francisco) gave an insight into his experience as an actuary on how to provide persuasive advice for senior management. He uses the three-C's: context, confidence and clarity. Context is about articulating the problem in a language senior management can understand it. Why does the management need to worry about the problem? If you have a solution, then you have to deliver it with conviction, because, most importantly is has to be actionable. Clarity, of your actionable insight, ensures that those actions can be delegated to the relevant team/employee by the management without you in the room.

The slides of the conference are available on request.

Scientific committee and sponsors

The members of the scientific committee were: Katrien Antonio (KU Leuven, UvA), Christophe Dutang (Université du Maine), Markus Gesmann (Vario Partners), Giorgio Spedicato (UnipolSai ) and Andreas Tsanakas (Cass Business School).

Finally, we are grateful to our sponsors Verisk, Mirai Solutions, Applied AI, RStudio, CYBAEA and Oasis, without whom the event wouldn't be possible.

R in Insurance 2017

We are delighted to announce next year’s event already. The conference will travel across the Channel to ENSAE, Paris, 8 June 2017. Further details will be published on

Notes from the Kölner R meeting, 9 July 2016

Last Thursday the Cologne R user group came together again. This time, our two speakers arrived from Bavaria, to talk about Spark and R Server.

Introduction to Apache Spark

Download slides
Dubravko Dulic gave an introduction to Apache Spark and why Spark might be of interest to data scientists using R. Spark is designed for cluster computing, i.e. to distribute jobs across several computers. Not all tasks in R can be split easily across several nodes in a cluster, but if you use functions like by in R, then it is most likely doable. The by function in R splits a data set into several subsets and applies a specific function to each subgroup and collects the results in the end. In the world of Hadoop, this is called MapReduce. Spark has an advanced DAG (directed acyclic graph) execution engine that supports cyclic data flow and in-memory computing. Additionally, Spark has a direct API for R, which makes it relatively ease to write applications with Spark.

Microsoft R Server

Download slides
Since the acquisition of Revolution Analytics in 2015, Microsoft has been busy integrating R into its product offerings. Stefan Cronjaeger gave an overview of how R can be integrated into a production environment. Microsoft R server aims to solve the problem of doing 'big data' analytics with R, which allows to carrying out in-memory and disk-based data analysis. Additional new tools are called ScaleR for big data and parallelized analytics, ConnectR to connect to various other data sources, DistributedR for grid computing. Finally, Stefan showed us how Visual Studio can be used as an R development environment, similar to RStudio.

Next Kölner R meeting

The next meeting will be scheduled in about three months time. Details will be published on our Meetup site. Thanks again to Microsoft for their support.

Please get in touch, if you would like to present at the next meeting.