mages' blog

Fitting a distribution in Stan from scratch

Last week the French National Institute of Health and Medical Research (Inserm) organised with the Stan Group a training programme on Bayesian Inference with Stan for Pharmacometrics in Paris.

Daniel Lee and Michael Betancourt, who run the course over three days, are not only members of Stan's development team, but also excellent teachers. Both were supported by Eric Novik, who gave an Introduction to Stan at the Paris Dataiku User Group last week as well.

Eric Kramer (Dataiku), Daniel Lee, Eric Novik & Michael Betancourt (Stan Group)

I have been playing around with Stan on and off for some time, but as Eric pointed out to me, Stan is not that kind of girl(boy?). Indeed, having spent three days working with Stan has revitalised my relationship. Getting down to the basics has been really helpful and I shall remember, Stan is not drawing samples from a distribution. Instead, it is calculating the joint distribution function (in log space), and evaluating the probability distribution function (in log space).

Thus, here is a little example of fitting a set of random numbers in R to a Normal distribution with Stan. Yet, instead of using the built-in functions for the Normal distribution, I define the log probability function by hand, which I will use in the model block as well, and even generate a random sample, starting with a uniform distribution. However, I do use pre-defined distributions for the priors.

Why do I want to do this? This will be a template for the day when I have to use a distribution, which is not predefined in Stan, e.g. the actuar package has some interesting candidates.

Testing

I start off by generating fake data, a sample of 100 random numbers drawn from a Normal distribution with a mean of 4 and a standard deviation of 2. Note, the sample mean of the 100 figures is 4.2 and not 4.
Histogram of 100 random numbers drawn from N(4,2).
I then use the Stan script to fit the data, i.e. to find the the parameters \(\mu\) and \(\sigma\), assuming that the data was generated by a Gaussian process.

Traceplot of 4 chains, including warm-up phase
Histograms of posterior parameter and predictive samples
Comparison of the emperical distributions
The posterior parameter distributions include both \(\mu\) and \(\sigma\) in the 95% credible interval. The distribution of posterior predictive check (y_ppc) is wider, taking into account the uncertainty of the parameters. The interquartile range and mean of my initial fake data and the sample of the posterior predictive distribution look very similar. That's good, my model generates data, which looks like the original data.

Bayesian Mixer Meetup

Btw, tonight we have the 4th Bayesian Mixer Meetup in London.

Session Info

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12 (Sierra)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] MASS_7.3-45 rstan_2.12.1 StanHeaders_2.12.0 ggplot2_2.1.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7      codetools_0.2-14 digest_0.6.10    grid_3.3.1      
 [5] plyr_1.8.4       gtable_0.2.0     stats4_3.3.1     scales_0.4.0    
 [9] labeling_0.3     tools_3.3.1      munsell_0.4.3    inline_0.3.14   
[13] colorspace_1.2-6 gridExtra_2.2.1

googleVis 0.6.1 on CRAN

We released googleVis version 0.6.1 on CRAN last week. The update fixes issues with setting certain options, following the switch from RJSONIO to jsonlite.

Screen shot of some of the Google Charts
New to googleVis? The package provides an interface between R and the Google Charts Tools, allowing you to create interactive web charts from R without uploading your data to Google. The charts are displayed by default via the R internal help browser.

To lean more see the examples of googleVis charts on CRAN and read the introduction vignette.

Notes from the 4th R in Insurance Conference

The 4th R in Insurance conference took place at Cass Business School London on 11 July 2016. This one-day conference focused once more on the wide range of applications of R in insurance, actuarial science and beyond. The conference programme covered topics including reserving, pricing, loss modelling, the use of R in a production environment and much more.

The audience of the conference included both practitioners (c.80%) and academics (c.20%) who are active or interested in the applications of R in Insurance. It was a truly international event with speakers and delegates from Europe, Asia and the Americas. The coffee breaks and conference dinner offered great networking opportunities.

Mario Wüthrich, ETH Zürich

In the first plenary session Mario Wüthrich (RiskLab ETH Zurich) spoke about the (new) challenges in actuarial science. While fundamentals of analysing data have not changed over the years, the data and technology available has, and with that new challenges emerged. Yet, as Mario pointed out, insurance is still often concerned with analysing 'little' data, as losses occur rarely. Furthermore, the bigger data sets, often generated by sensors, require careful calibration, monitoring and cleansing. Those new challenges provide opportunities for new research (if data is being made available) and the industry. The R community can provide links between the two. Mario would like to see more and better documentation of R packages, more insurance examples and better handling of big data.

Thereafter, the programme consisted of a combination of contributed presentations and lightning talks, as well as a panel discussion on how analytics is transforming the insurance business. Adrian Cuc (Verisk), Simon Brickman (Beazley), Roland Schmid (Mirai Solutions) and Markus Gesmann (Vario Partners) discussed the efforts made in bridging between data vendors, consultants and insurers, as well as the challenges of developing collaborative business models that respond to market needs.

Dan Murphy, Trinostics

In the closing plenary, Dan Murphy (Trinostics, San Francisco) gave an insight into his experience as an actuary on how to provide persuasive advice for senior management. He uses the three-C's: context, confidence and clarity. Context is about articulating the problem in a language senior management can understand it. Why does the management need to worry about the problem? If you have a solution, then you have to deliver it with conviction, because, most importantly is has to be actionable. Clarity, of your actionable insight, ensures that those actions can be delegated to the relevant team/employee by the management without you in the room.

The slides of the conference are available on request.

Scientific committee and sponsors

The members of the scientific committee were: Katrien Antonio (KU Leuven, UvA), Christophe Dutang (Université du Maine), Markus Gesmann (Vario Partners), Giorgio Spedicato (UnipolSai ) and Andreas Tsanakas (Cass Business School).

Finally, we are grateful to our sponsors Verisk, Mirai Solutions, Applied AI, RStudio, CYBAEA and Oasis, without whom the event wouldn't be possible.

R in Insurance 2017

We are delighted to announce next year’s event already. The conference will travel across the Channel to ENSAE, Paris, 8 June 2017. Further details will be published on www.rininsurance.com.

Notes from the Kölner R meeting, 9 July 2016

Last Thursday the Cologne R user group came together again. This time, our two speakers arrived from Bavaria, to talk about Spark and R Server.

Introduction to Apache Spark

Download slides
Dubravko Dulic gave an introduction to Apache Spark and why Spark might be of interest to data scientists using R. Spark is designed for cluster computing, i.e. to distribute jobs across several computers. Not all tasks in R can be split easily across several nodes in a cluster, but if you use functions like by in R, then it is most likely doable. The by function in R splits a data set into several subsets and applies a specific function to each subgroup and collects the results in the end. In the world of Hadoop, this is called MapReduce. Spark has an advanced DAG (directed acyclic graph) execution engine that supports cyclic data flow and in-memory computing. Additionally, Spark has a direct API for R, which makes it relatively ease to write applications with Spark.

Microsoft R Server

Download slides
Since the acquisition of Revolution Analytics in 2015, Microsoft has been busy integrating R into its product offerings. Stefan Cronjaeger gave an overview of how R can be integrated into a production environment. Microsoft R server aims to solve the problem of doing 'big data' analytics with R, which allows to carrying out in-memory and disk-based data analysis. Additional new tools are called ScaleR for big data and parallelized analytics, ConnectR to connect to various other data sources, DistributedR for grid computing. Finally, Stefan showed us how Visual Studio can be used as an R development environment, similar to RStudio.

Next Kölner R meeting

The next meeting will be scheduled in about three months time. Details will be published on our Meetup site. Thanks again to Microsoft for their support.

Please get in touch, if you would like to present at the next meeting.

Notes from 3rd and 3.5th Bayesian Mixer Meetup

Two Bayesian Mixer meet-ups in a row. Can it get any better?

Our third 'regular' meeting took place at Cass Business School on 24 June. Big thanks to Pietro and Andreas, who supported us from Cass. The next day, Jon Sedar of Applied AI, managed to arrange a special summer PyMC3 event.

3rd Bayesian Mixer meet-up

First up was Luis Usier, who talked about cross validation. Luis is a former student of Andrew Gelman, so, of course, his talk touched on Stan and the 'loo' (leave one out) package in R. Luis started with a simple artificial example that aimed to predict the probability of goalkeepers to save a shot on target. Adding a hierarchical structure to the model and treating the variance as a random variable, resulted in a pathological posterior distribution, which makes sampling next to impossible. Instead, fitting different models, with different fixed parameters, allows the user then to compare the models via cross-validation using the 'loo' function. Clever! I need to learn more about this. Luis' slides are available here and the underlying source code on GitHub.

Luis Usier talking about cross-validation in R and Stan

We were lucky to have Robert Cowell talking to us, in what was his final week at Cass. Robert has been very much at the forefront of Bayesian development over the last 30 years. He is one of the co-authors of Probabilistic Networks and Expert Systems. Robert gave an insightful talk on probabilistic models for analysing mixed DNA traces. For illustration purpose, he used a crime case, where a man was killed in a pub, and where blood traces were used to support identifying the murder - turning statistics into a thriller.

Following those two stimulating talks, we had a few networking drinks at the Artillery Arms. But not too many, as the next day continued with another Bayesian event.

3.5th Meetup: PyMC3 summer special

We had a rare opportunity to gather together a few of the core contributors of the PyMC3 package for a talks & hack session. PyMC3 is a leading framework for probabilistic programming entirely based in Python with a 'theano' backend, with support for the NUTS sampler, Variational Inference and lots of useful functionality - an alternative to Stan.

We had two core contributors with us: Chris Fonnesbeck (usually in Nashville, USA) and Thomas Wiecki (online from Düsseldorf, Germany), plus other package contributors.

Chris Fonnesbeck talking about PyMC3

On Saturday morning Chris gave an overview of PyMC3, followed by a detailed talk of Thomas on Bayesian Deep Learning. The afternoon was spent hacking together away on different problems. I was new to PyMC3, so I went through the tutorial on Probabilistic Programming using PyMC3, which Chris had given at a workshop in Oslo.

Many thanks to all who helped to make these events such a success and especially to Chris, Thomas, Luis, Robert, Andreas, Pietro and Jon.

If you have ideas for a future event, then please get in touch and visit our Meetup page.

Early bird registration for R in Insurance closes 30 May


Hurry! The early bird registration offer for the 4th R in Insurance conference, 11 July 2016, at Cass Business School closes 30 May.

This one-day conference will focus once more on applications in insurance and actuarial science that use R, the lingua franca for statistical computation. Topics covered include reserving, pricing, loss modelling, the use of R in a production environment, and more.

We have a fantastic programme with international speakers and conference dinner at Ironmongers Hall. Keynotes will be given by Mario Wüthrich and Dan Murphy.

The organisers gratefully acknowledge the sponsorship of Verisk, Mirai Solutions, Applied AI, Studio, CYBAEA and Oasis, without whom the event wouldn't be possible.

R in Insurance 2016 Programme


We are delighted to announce that the programme for the 4th R in Insurance conference at Cass Business School in London, 11 July 2016, have been finalised.

Register by the end of May to get the early bird booking fee.

The organisers gratefully acknowledge the sponsorship of Verisk, Mirai Solutions, Applied AI, Studio, CYBAEA and Oasis, without whom the event wouldn't be possible.