mages' blog

First Bayesian Mixer Meeting in London

There is a nice pub between Bunhill Fields and the Royal Statistical Society in London: The Artillery Arms. Clearly, the perfect place to bring people together to talk about Bayesian Statistics. Well, that’s what Jon Sedar (@jonsedar, applied.ai) and I thought.

Source: http://www.artillery-arms.co.uk/
Hence, we’d like to organise a Bayesian Mixer Meetup on Friday, 12 February, 19:00. We booked the upstairs function room at the Artillery Arms and if you look outside the window, you can see Thomas Bayes’ grave.

We intend the group to be small (announcing only on the stan user group, pymc-devs gitter, and here for now) and geared to open discussion of Bayesian inference, tools, techniques and theory. Neither of us is a great expert, we're really just users of the tools, but we'd love to welcome academic discussion as well as real world examples etc.

Jon is more the Python/PyMC guy, while I come from the R/Rstan corner. We will prepare two talks to kick this off. Jon will talk about GLM Robust Regression with Outlier Detection using PyMC3, while I will talk about Experience vs Data with some stories from insurance and actuarial science, sprinkled with RStan examples.

If you would like to join us, please get in touch via the form below, so that we can keep tabs on numbers, and if this goes all well we shall set up a Meetup site.

Flowing triangles

I have admired the work of the artist Bridget Riley for a long time. She is now in her eighties, but as it seems still very creative and productive. Some of her recent work combines simple triangles in fascinating compositions. The longer I look at them, the more patterns I recognise.

Yet, the actual painting can be explained easily, in a sense of a specification document to reproduce the pattern precisely. However, seeing the real print, as I had the chance at the London Art Fair last week, and a reproduction on the screen is incommensurable.

Having said that, I could not resist programming a figure that resembles the artwork labelled Bagatelle 2. Well, at least I can say that I learned more about grid [1], grid.path [2] and gridSVG [3] in R.

Inspired by Bridget Riley Bagatelle 2

R Code


References

[1] P. Murrell. R Graphics, Second Edition. CRC Press. 2011
[2] P. Murrell. What's in a Name? . The R Journal, 4(2):5–12, dec 2012.
[3] P. Murrell and S. Potter. gridSVG: Export grid graphics as SVG. R package 1.5-0. 2015

Session Info

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils  datasets 
[7] methods   base     

other attached packages:
[1] gridSVG_1.5-0    data.table_1.9.6

loaded via a namespace (and not attached):
[1] tools_3.2.3   RJSONIO_1.3-0 chron_2.3-47  XML_3.98-1.3

Formatting table output in R

Formatting data for output in a table can be a bit of a pain in R. The package formattable by Kun Ren and Kenton Russell provides some intuitive functions to create good looking tables for the R console or HTML quickly. The package home page demonstrates the functions with illustrative examples nicely.

There are a few points I really like:
  • the functions accounting, currency, percent transform numbers into better human readable output
  • cells can be highlighted by adding color information
  • contextual icons can be added, e.g. from Glyphicons
  • output can be displayed in RStudio's viewer pane

The CRAN Task View: Reproducible Research lists other packages as well that help to create tables for web output, such as compareGroups, DT, htmlTable, HTMLUtils, hwriter, Kmisc, knitr, lazyWeave, SortableHTMLTables, texreg and ztable. Yet, if I am not mistaken, most of these packages focus more on generating complex tables with multi-columns rows, footnotes, math notation, etc, than the points I mentioned above.

Finally, here is a little formattable example from my side:



Session Info

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] formattable_0.1.5

loaded via a namespace (and not attached):
 [1] shiny_0.12.2.9006 htmlwidgets_0.5.1 R6_2.1.1         
 [4] rsconnect_0.3.79  markdown_0.7.7    htmltools_0.3    
 [7] tools_3.2.3       yaml_2.1.13       Rcpp_0.12.2      
[10] highr_0.5.1       knitr_1.12        jsonlite_0.9.19  
[13] digest_0.6.9      xtable_1.8-0      httpuv_1.3.3     
[16] mime_0.4  

R in Insurance: Registration and abstract submission opened

Following the successful 3rd R in Insurance conference in Amsterdam last year, we return to London this year.

The registration for the 4th conference on R in Insurance on Monday 11 July 2016 at Cass Business School has opened.

This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation.

The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.

Invited talks will be given by:
Details about the registration and abstract submission are given on the dedicated R in Insurance page at Cass Business School, London.

The submission deadline for abstracts is 28 March 2016. Please email your abstract of no more than 300 words to: rinsuranceconference@gmail.com.

Attendance of the whole conference is the equivalent of 6.5 hours of CPD for members of the Actuarial Profession.

For more information about the past events visit www.rininsurance.com.

Sponsors

The organisers gratefully acknowledge the sponsorship of Verisk/ISO, Mirai Solutions, RStudio, Applied AI, and CYBAEA.

Gold Sponsors

ISO2
Mirai Logo

Silver Sponsors

RStudio
AAI
Cybaea

Next Kölner R User Meeting: Friday, 4 December 2015

Koeln R
The 16th Cologne R user group meeting is scheduled for this Friday, 4 December 2015 and we have great line up with three talks followed by networking drinks.

Venue: Startplatz, Im Mediapark, 550670 Köln

Drinks and Networking

The event will be followed by drinks (Kölsch!) and networking opportunities.

For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Kirill Pomogajko and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.

Notes from Warsaw R meetup

I had the great pleasure time to attend the Warsaw R meetup last Thursday. The organisers Olga Mierzwa and Przemyslaw Biecek had put together an event with a focus on R in Insurance (btw, there is a conference with the same name), discussing examples of pricing and reserving in general and life insurance.

Experience vs. Data

I kicked off with some observations of the challenges in insurance pricing. Accidents are thankfully rare events, that's why we buy insurance. Hence, there is often not a lot of claims data available for pricing. Combining the information from historical data and experts with domain knowledge can provide a rich basis for the assessment of risk. I presented some examples using Bayesian analysis to understand the probability of an event occurring. Regular readers of my blog will recognise the examples from earlier posts. You find my slides on GitHub.
Download slides

Non-life insurance in R

Emilia Kalarus from Triple A shared some of her experience of using R in non-life insurance companies. She focused on the challenges in working across teams, with different systems, data sets and mentalities.

As an example Emilia talked about the claims reserving process, which in her view should be embedded in the full life cycle of insurance, namely product development, claims, risk and performance management. Following this thought she presented an idea for claims reserving that models the live of a claim from not incurred and not reported (NINR), to incurred but not reported (IBNR), reported but not settled (RBNS) and finally paid.

Stochastic mortality modelling

The final talk was given by Adam Wróbel from the life insurer Nationale Nederlanden, discussing stochastic mortality modelling. Adam's talk on analysing mortality made me realise that life and non-life insurance companies may be much closer to each other than I thought.

Although life and non-life companies are usually separated for regulatory reasons, they both share the fundamental challenge of predicting future cash-flows. An example where the two industries meet is product liability.

Over the last century technology has changed our environment fundamentally, more so then ever before. Yet, we still don't know which long term impact some of the new technologies and products will have on our life expectancy. Some will prolong our lives, others may make us ill.

A classic example is asbestos, initial regraded as a miracle mineral, as it was impossible to set on fire, abundant, cheap to mine, and easy to manufacture. Not surprisingly it was widely used until it was linked to cause cancer. Over the last 35 years the non-life insurance industry has paid well in excess of hundred billion dollars in compensations.

Slides and Code

The slides and R code of the presentations are hosted on the Warsaw R GitHub page.

Hierarchical Loss Reserving with Stan

I continue with the growth curve model for loss reserving from last week's post. Today, following the ideas of James Guszcza [2] I will add an hierarchical component to the model, by treating the ultimate loss cost of an accident year as a random effect. Initially, I will use the nlme R package, just as James did in his paper, and then move on to Stan/RStan [6], which will allow me to estimate the full distribution of future claims payments.

Last week's model assumed that cumulative claims payment could be described by a growth curve. I used the Weibull curve and will do so here again, but others should be considered as well, e.g. the log-logistic cumulative distribution function for long tail business, see [1].
The growth curve describes the proportion of claims paid up to a given development period compared to the ultimate claims cost at the end of time, hence often called development pattern. Cumulative distribution functions are often considered, as they increase monotonously from 0 to 100%. Multiplying the development pattern with the expected ultimate loss cost gives me then the expected cumulative paid to date value.

However, what I'd like to do is the opposite, I know the cumulative claims position to date and wish to estimate the ultimate claims cost instead. If the claims process is fairly stable over the years and say, once a claim has been notified the payment process is quite similar from year to year and claim to claim, then a growth curve model is not unreasonable. Yet, the number and the size of the yearly claims will be random, e.g. if a windstorm, fire, etc occurs or not. Hence, a random effect for the ultimate loss cost across accident years sounds very convincing to me.

Here is James' model as described in [2]:
\[
\begin{align}
CL_{AY, dev} & \sim \mathcal{N}(\mu_{AY, dev}, \sigma^2_{dev}) \\
\mu_{AY,dev} & = Ult_{AY} \cdot G(dev|\omega, \theta)\\
\sigma_{dev} & = \sigma \sqrt{\mu_{dev}}\\
Ult_{AY} & \sim \mathcal{N}(\mu_{ult}, \sigma^2_{ult})\\
G(dev|\omega, \theta) & = 1 - \exp\left(-\left(\frac{dev}{\theta}\right)^\omega\right)
\end{align}
\]The cumulative losses \(CL_{AY, dev}\) for a given accident year \(AY\) and development period \(dev\) follow a Normal distribution with parameters \(\mu_{AY, dev}\) and \(\sigma_{dev}\).

The mean itself is modelled as the product of an accident year specific ultimate loss cost \(Ult_{AY}\) and a development period specific parametric growth curve \(G(dev | \omega, \theta)\). The variance is believed to increase in proportion with the mean. Finally the ultimate loss cost is modelled with a Normal distribution as well.

Assuming a Gaussian distribution of losses doesn't sound quite intuitive to me, as loss are often skewed to the right, but I shall continue with this assumption here to make a comparison with [2] possible.

Using the example data set given in the paper I can reproduce the result in R with nlme:

The fit looks pretty good, with only 5 parameters. See James' paper for a more detailed discussion.

Let's move this model into Stan. Here is my attempt, which builds on last week's pooled model. With the generated quantities code block I go beyond the scope of the original paper, as I try to estimate the full posterior predictive distribution as well.
The 'trick' is the line mu[i] <- ult[origin[i]] * weibull_cdf(dev[i], omega, theta); where I have an accident year (here labelled origin) specific ultimate loss.

The notation ult[origin[i]] illustrates the hierarchical nature in Stan's language nicely.

Let's run the model:
The estimated parameters look very similar to the nlme output above.

Let's take a look at the parameter traceplot and the densities of the estimated ultimate loss costs by origin year.
This looks all not too bad. The trace plots don't show any particular patterns, apart from \(\sigma_{ult}\), which shows a little skewness.

The generated quantities code block in Stan allows me to get also the predictive distribution beyond the current data range. Here I forecast claims up to development year 12 and plot the predictions, including the 90% credibility interval of the posterior predictive distribution with the observations.

The model seems to work rather well, even with the Gaussian distribution assumptions. Yet, it has still only 5 parameters. Note, this model doesn't need an additional artificial tail factor either.

Conclusions

The Bayesian approach sounds to me a lot more natural than many classical techniques around the chain-ladder methods. Thanks to Stan, I can get the full posterior distributions on both, the parameters and predictive distribution. I find communicating credibility intervals much easier than talking about the parameter, process and mean squared error.

James Guszcza contributed to a follow up paper with Y. Zhank and V. Dukic [3] that extends the model described in [2]. It deals with skewness in loss data sets and the autoregressive nature of the errors in a cumulative time series.

Fank Schmid offers a more complex Bayesian analysis of claims reserving in [4], while Jake Morris highlights the similarities between a compartmental model used in drug research and loss reserving [5].

Finally, Glenn Meyers published a monograph on Stochastic Loss Reserving Using Bayesian MCMC Models earlier this year [7] that is worth taking a look at.

References

[1] David R. Clark. LDF Curve-Fitting and Stochastic Reserving: A Maximum Likelihood Approach. Casualty Actuarial Society, 2003. CAS Fall Forum.

[2] James Guszcza. Hierarchical Growth Curve Models for Loss Reserving, 2008, CAS Fall Forum, pp. 146–173.

[3] Y. Zhang, V. Dukic, and James Guszcza. A Bayesian non-linear model for forecasting insurance loss payments. 2012. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175: 637–656. doi: 10.1111/j.1467-985X.2011.01002.x

[4] Frank A. Schmid. Robust Loss Development Using MCMC. Available at SSRN. See also http://lossdev.r-forge.r-project.org/

[5] Jake Morris. Compartmental reserving in R. 2015. R in Insurance Conference.

[6] Stan Development Team. Stan: A C++ Library for Probability and Sampling, Version 2.8.0. 2015. http://mc-stan.org/.

[7] Glenn Meyers. Stochastic Loss Reserving Using Bayesian MCMC Models. Issue 1 of CAS Monograph Series. 2015.

Session Info

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats  graphics  grDevices utils datasets  methods   base     

other attached packages:
[1] ChainLadder_0.2.3 rstan_2.8.0   ggplot2_1.0.1  Rcpp_0.12.1      
[5] lattice_0.20-33  

loaded via a namespace (and not attached):
 [1] nloptr_1.0.4       plyr_1.8.3         tools_3.2.2       
 [4] digest_0.6.8       lme4_1.1-10        statmod_1.4.21    
 [7] gtable_0.1.2       nlme_3.1-122       mgcv_1.8-8        
[10] Matrix_1.2-2       parallel_3.2.2     biglm_0.9-1       
[13] SparseM_1.7        proto_0.3-10       coda_0.18-1       
[16] gridExtra_2.0.0    stringr_1.0.0      MatrixModels_0.4-1
[19] lmtest_0.9-34      stats4_3.2.2       grid_3.2.2        
[22] nnet_7.3-11        tweedie_2.2.1      inline_0.3.14     
[25] cplm_0.7-4         minqa_1.2.4        actuar_1.1-10     
[28] reshape2_1.4.1     car_2.1-0          magrittr_1.5      
[31] scales_0.3.0       codetools_0.2-14   MASS_7.3-44       
[34] splines_3.2.2      rsconnect_0.3.79   systemfit_1.1-18  
[37] pbkrtest_0.4-2     colorspace_1.2-6   quantreg_5.19     
[40] labeling_0.3       sandwich_2.3-4     stringi_1.0-1     
[43] munsell_0.4.2      zoo_1.7-12