R in Insurance 2014: Conference Programme & Abstracts
I am delighted to announce that the programme and abstracts for the second R in Insurance conference at Cass Business School in London, 14 July 2014, have been finalised.
Register by the end of May to get the early bird booking fee.
9:00 - 10:00 Opening keynote:
Montserrat Guillen and Leo Guelman: New trends in predictive modelling - the uplift models success story
10:00 - 11:00 Session 1: Reserving (20 min. each)
- Munir Hiabu: RBNS preserving double chain ladder
- Els Godecharle: Reserving by conditioning on markers of individual claims: a case study using historical simulation
- Brian Fannin: Multivariate regression models for reserving
11:00 - 11:30 Coffee
11:30 - 12:30 Lightning talks (10 min. each)
- Roel Verbelen: Loss modelling with mixtures of Erlang distributions
- Nicolas Baradel: R Program GUI Manager
- Karl-Kuno Kunze: Time Series Data Quality Analysis
- Ed Tredger: Integrating R with existing software in the London Market
- Wim Konings: Estimating the duration of non-maturing liabilities in a user friendly Shiny web application
- Markus Gesmann: End User Computing with R under Solvency II
12:30 - 13:30 Lunch
13:30 - 14:30 Session 2: Pricing (20 min. each)
- Giorgio Spedicato: R data mining for insurance retention modelling
- Bernhard Kübler: Assessing exploration risk for geothermal wells
- Xavier Marechal: Geographical ratemaking with R
14:30 - 15:00 Panel discussion: R at the Interface of Practitioner/Academic Communication
15:00 - 15:30 Coffee
15:30 - 16:30 Session 3: Life / Using R in a production environment (20 min. each)
- Ana Debon: Modelling trends and inequality of longevity in European Union countries
- Kate Hanley: Dynamic Report Generation in R: LaTeX vs. Markdown
- Matthew Dowle: Introduction to data.table
16:30 - 17:30 Closing keynote:
Arthur Charpentier: Going Bayesian with R - a non-Bayesian perspective
18:00 - XY:00 The conference will be followed by a drinks and networking reception at Cass and the conference dinner (venue tbc).
Montserrat Guillen, Dept. Econometrics, Riskcenter, University of Barcelona and Leo Guelman, Royal Bank of Canada
The general setting and some classical contributions on customer
loyalty and lifetime value in insurance are presented. Methods to
identify attributes to increase customer loyalty are reviewed,
including the logistic regression model to predict the probability of
policy lapse and survival analysis techniques to predict customer
duration. Next, the concept of uplift modelling is introduced. This
method aims to measure the impact of a proactive intervention, such as
a marketing action, on a given response at the individual subject
level. Applications of conventional and uplift modelling techniques in
the context of insurance cross-selling, client retention and pricing
are described. Uplift models give insurers good guidance on business
risk management, and these ideas can be
generalised to financial services involving risk transfer. Features of
the recently created R-package
uplift are shown. Our conclusion is
that more interaction between retention strategies and pricing should
be encouraged and that integrated predictive modelling is a promising
Frees, E.W, Derrig, R., Meyer, G. (Eds) (2014) Predictive Modeling Handbook for Actuarial Science. Volume I. Regression with Categorical Dependent Variables. Chapter 3 by Guillén, M. Cambridge University Press. In press.
Guelman, L. (2014). uplift: Uplift Modeling. R package version 0.3.5.
Guelman, L. and Guillén, M. (2014) “A causal inference approach to measure price elasticity in Automobile Insurance” Expert Systems with Applications, 41(2), 387-396.
Guelman, L., Guillén, M. and Pérez-Marín, A.M. (2014) “Uplift Random Forests” Cybernetics & Systems, Special issue on “Intelligent Systems in Business and Economics”, accepted.
Guelman, L., Guillén, M. and Pérez-Marín, A.M. (2012) “Random forests for uplift modeling: An insurance customer retention case” Lecture Notes in Business Information Processing, 115 LNBIP, 123-133.
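To make the uplift idea from the keynote abstract above concrete, here is a minimal sketch in Python. It is not the API of the uplift R package; it simply estimates uplift per customer segment as the difference in response rates between treated and control customers. The function name, the segment labels and the data are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment as the treated response rate minus
    the control response rate.

    records: iterable of (segment, treated, responded) tuples.
    """
    # counts[segment][treated] = [number of customers, number of responders]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, responded in records:
        cell = counts[segment][bool(treated)]
        cell[0] += 1
        cell[1] += int(responded)

    uplift = {}
    for segment, groups in counts.items():
        n_treated, r_treated = groups[True]
        n_control, r_control = groups[False]
        if n_treated == 0 or n_control == 0:
            continue  # uplift cannot be estimated without both groups
        uplift[segment] = r_treated / n_treated - r_control / n_control
    return uplift


# Hypothetical retention campaign: (segment, contacted, renewed)
records = [
    ("young", True, 1), ("young", True, 1), ("young", True, 0), ("young", True, 1),
    ("young", False, 1), ("young", False, 0), ("young", False, 0), ("young", False, 1),
    ("senior", True, 1), ("senior", True, 0), ("senior", True, 1), ("senior", True, 1),
    ("senior", False, 1), ("senior", False, 1), ("senior", False, 1), ("senior", False, 0),
]
result = uplift_by_segment(records)
```

In this made-up data the intervention only changes behaviour in one segment, which is exactly the kind of difference a conventional response model (one that predicts renewal without comparing treated and control groups) would not reveal.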
Munir Hiabu, Cass Business School
The single most important number in the accounts of a non-life company is often the so-called reserve. The reserve is estimated from a statistical analysis that combines past paid claims with so-called RBNS claims estimates. RBNS means “reported but not settled”; an RBNS estimate is set for any incurred claim in an insurance company. In practice, insurance companies often carry out this analysis by applying the classical chain ladder to so-called incurred data.
This approach involves statistical modelling and forecasting of the RBNS estimates, which implies that expert opinion from the claims department is changed in the reserving department. Our motivation for developing the new RBNS preserving double chain ladder approach is to ensure that expert opinion on individual claims reserving is not changed in the modelling process. We take advantage of the flexibility of the double chain ladder method and develop a reserving technique that does not change RBNS estimates via modelling and forecasting. Full stochastic cash flows of the RBNS reserve, the IBNR reserve and the reserving tail are provided. Programmes are available via the double chain ladder (DCL) R-package.
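For readers unfamiliar with the classical chain ladder that this abstract takes as its starting point, a minimal Python sketch follows. It shows only the textbook volume-weighted chain ladder on a cumulative triangle, not the double chain ladder method and not the interface of the DCL package. The triangle values and function names are invented for illustration.

```python
def development_factors(triangle):
    """Volume-weighted chain ladder factors from a cumulative triangle.

    triangle: list of rows, one per origin year, each holding the cumulative
    amounts observed so far. Later origin years have shorter rows.
    """
    n_dev = len(triangle[0])
    factors = []
    for j in range(n_dev - 1):
        # Use only origin years where development periods j and j+1 are both observed
        rows = [row for row in triangle if len(row) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors


def project_ultimates(triangle):
    """Roll each origin year's latest cumulative amount forward to ultimate."""
    factors = development_factors(triangle)
    ultimates = []
    for row in triangle:
        value = row[-1]
        # Apply the factors for the development periods not yet observed
        for f in factors[len(row) - 1:]:
            value *= f
        ultimates.append(value)
    return ultimates


# Hypothetical cumulative triangle (3 origin years, 3 development periods)
triangle = [
    [100.0, 150.0, 165.0],
    [110.0, 165.0],
    [120.0],
]
factors = development_factors(triangle)
ultimates = project_ultimates(triangle)
# Reserve = projected ultimate minus the latest observed cumulative amount
reserves = [u - row[-1] for u, row in zip(ultimates, triangle)]
```

When such a triangle holds incurred amounts (paid plus RBNS estimates), the development factors implicitly revise the case estimates, which is the effect the RBNS preserving approach is designed to avoid.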
Els Godecharle, KU Leuven
Our research explores the use of claim specific characteristics, so-called claim markers, for loss reserving with individual claims. Starting from the approach of Rosenlund (2012) we develop a stochastic Reserve by Detailed Conditioning (‘RDC’) method which is applicable to a micro-level data set with detailed information on individual claims.
We use historical simulation to construct the predictive distribution of the outstanding loss reserve by simulating payments of a claim, given its claim markers. We explore how to incorporate different types of claim specific information when simulating outstanding loss reserves, and evaluate the impact of the set of markers and their specification on the predictive distribution of the outstanding reserve.
The stochastic RDC method is implemented in R and the code is made available online. We demonstrate the performance of the method on a portfolio of general liability insurance policies for private individuals from a European insurance company.
Brian Fannin, Redwoods Group
MRMR - Multivariate Regression Models for Reserving is an R package
for loss reserving. The emphasis is on the treatment of loss reserving
as a multilevel linear regression problem. MRMR supports S4 classes
for storage of reserving data and reserving models. A Triangle
object is composed of smaller objects which represent
OriginPeriod, StaticMeasure and
StochasticMeasure. The stochastic feature of the
StochasticMeasure object refers to the fact that the measure may
change over time. In effect, a
StochasticMeasure is a traditional
loss triangle. However, the data may be augmented by use of a
StaticMeasure object to store data which is fixed in time, such as
written premium or other exposure elements. The OriginPeriod and
measure classes support implementation of basic functionality such as
c, comparison, assignment and extraction in a natural way
consistent with common R objects. This enables one to easily update
information as part of routine studies, join to other data sources or
examine particular subsets of reserving data.
The Triangle object features basic visualisation for exploratory
analysis to aid the actuary in selecting model variables. The
TriangleModel object stores information regarding a fit model. One may
have more than one model for the same Triangle object. When
constructing a model, the development lag generally acts as a
categorical parameter, though others may also be introduced. The
structure of the Triangle object permits multilevel linear models so
that, for example, one may view results for a particular line
segmented by territory or other dimensional variables such as customer
size, customer industry or urban vs. rural risks. This is effectively
a wrapper around calls to the
lme4 package. Finally, a
TriangleProjection object stores projection of a TriangleModel to any
arbitrary future time period. The Triangle, TriangleModel and
TriangleProjection classes are roughly analogous to a data frame,
the result from an
lm function and the result from a
Roel Verbelen, KU Leuven
Modelling data on claim sizes is crucial when pricing insurance products. Such loss models require, on the one hand, the flexibility of nonparametric density estimation techniques to describe the insurance losses and, on the other hand, the tractability needed to quantify the risk analytically. Mixtures of Erlang distributions with a common scale are very versatile as they are dense in the space of positive continuous distributions (Tijms (1994, p. 163)). At the same time, it is possible to work analytically with distributions of this kind. Closed-form expressions of quantities of interest, such as the Value-at-Risk (VaR) and the Tail-Value-at-Risk (TVaR), can be derived, as well as appealing closure properties (Lee and Lin (2010), Willmot and Lin (2011) and Klugman et al. (2012)). In particular, using these distributions in aggregate loss models leads to an analytical form of the corresponding aggregate loss distribution, which avoids the need for simulations to evaluate the model.
In actuarial science, claim severity data is often censored and/or truncated due to policy modifications such as deductibles and policy limits. Lee and Lin (2010) formulate a calibration technique based on the EM algorithm for fitting mixtures of Erlangs with a common scale parameter to complete data. Here, we construct an adjusted EM algorithm which is able to deal with censored and truncated data, inspired by McLachlan and Peel (2001) and Lee and Scott (2012). Using the developed R program, we demonstrate the approximation strength of mixtures of Erlangs and model e.g. the left truncated Secura Re data from Beirlant et al. (2004), and use the mixtures of Erlangs approach to price an excess-of-loss reinsurance contract.
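As a small illustration of why mixtures of Erlangs with a common scale are convenient to work with, the Python sketch below evaluates the mixture CDF in closed form and derives the mean and a VaR from it. It is not the authors' R program and does not include the EM fitting step. The weights, shapes and scale are made-up values, and the VaR is obtained here by simple bisection on the CDF, which is one possible numerical choice.

```python
import math


def erlang_cdf(x, shape, scale):
    """CDF of an Erlang distribution with integer shape and given scale:
    F(x) = 1 - exp(-x/scale) * sum_{n=0}^{shape-1} (x/scale)^n / n!
    """
    if x <= 0:
        return 0.0
    z = x / scale
    term, total = 1.0, 1.0  # n = 0 term of the sum
    for n in range(1, shape):
        term *= z / n
        total += term
    return 1.0 - math.exp(-z) * total


def mixture_cdf(x, weights, shapes, scale):
    """CDF of a mixture of Erlangs sharing one scale parameter."""
    return sum(w * erlang_cdf(x, k, scale) for w, k in zip(weights, shapes))


def mixture_mean(weights, shapes, scale):
    """Mean of the mixture: scale times the weighted average shape."""
    return scale * sum(w * k for w, k in zip(weights, shapes))


def mixture_var(p, weights, shapes, scale, tol=1e-10):
    """Value-at-Risk at level p, found by bisection on the mixture CDF."""
    lo, hi = 0.0, scale
    while mixture_cdf(hi, weights, shapes, scale) < p:
        hi *= 2.0  # expand the bracket until it contains the quantile
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, shapes, scale) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical two-component mixture
weights, shapes, scale = [0.6, 0.4], [1, 3], 1.5
mean_loss = mixture_mean(weights, shapes, scale)
var_99 = mixture_var(0.99, weights, shapes, scale)
```

With a single component of shape 1 the mixture reduces to an exponential distribution, which gives an easy sanity check against the known quantile formula.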
Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J., De Waal, D., and Ferro, C. (2004). Statistics of Extremes: Theory and Applications. Wiley Series in Probability and Statistics. Wiley.
Klugman, S. A., Panjer, H. H., and Willmot, G. E. (2012). Loss models: from data to decisions, volume 715. Wiley.
Lee, G. and Scott, C. (2012). EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis, 56(9):2816 - 2829.
Lee, S. C. and Lin, X. S. (2010). Modeling and evaluating insurance losses via mixtures of Erlang distributions. North American Actuarial Journal, 14(1):107.
McLachlan, G. and Peel, D. (2001). Finite mixture models. Wiley.
Tijms, H. C. (1994). Stochastic models: an algorithmic approach. Wiley.
Willmot, G. E. and Lin, X. S. (2011). Risk modelling with the mixed Erlang distribution. Applied Stochastic Models in Business and Industry, 27(1):2-16.
Nicolas Baradel and William Jouot, PGM Solutions
R is known to be a great programming language. However, we believe that it lacks two important features: creating graphical interfaces and generating reports. RPGM, which stands for “R Program GUI Manager”, is brand new software. It enables any programmer to build R programs easily, with user-friendly interfaces and powerful report generation that can export results as PDF files or as Excel spreadsheets.
RPGM contains a powerful IDE, the Editor, with an R script editor offering syntax highlighting, function auto-completion, and quick and easy access to the R help pages by pressing F1 when the cursor is on a function. As with Tcl/Tk, users can create graphical interfaces, but without writing a single line of code and without having to learn a new language. All common widget types can be inserted, such as text/number inputs, file/folder choosers, check boxes, dropdown lists and images.
Furthermore, RPGM contains a “Client” application, aimed at end users, which executes programs made with the Editor. Most of the time, the end user does not know, and does not need to know, about R or how a given feature is implemented. R is completely hidden, as the user communicates with R through the generated user-friendly client interface, yet the R console can be displayed for debugging purposes.
RPGM can also generate PDF reports using the powerful LaTeX language. Thanks to the Report Editor, there is no need to know how LaTeX works, although raw LaTeX code can also be inserted. Excel spreadsheets can also be generated, based on an existing Excel file with formulas and graphics: RPGM adds results from R to the spreadsheet at a specified cell name. For more information and screenshots, visit www.pgm-solutions.com.
Karl-Kuno Kunze, RStudio and Fractional View
Usually, time series analysis in finance, insurance, and other fields of interest starts from the premise that data quality is checked: all data is in place and in order. However, all too often data is either missing or wrong. Reports exist that time series have even been switched by market data providers at times. The package ‘Time Series Data Quality’ (TSDQ) provides an easily accessible tool to check time series data prior to further processing. Two main applications are in focus: analysis of historical series where the full history is checked for plausibility, as well as a ‘golden copy’ mode where newly incoming data is analysed and checked, for example in overnight risk calculations. In the second case, the checked data is the newest available information for the particular time series.
We present some features of the package, which will be available for download from www.fractionalview.com. In addition, we will walk the audience through an online application provided by www.fractionalview.com, powered by the TSDQ package and the Shiny framework developed and provided by RStudio. The application consists of three major parts:
- Missing Data Analysis
- Outlier Analysis
- Corrected Data Imputation
The steps are accompanied by Conspicuous & Corrected Data reports.
Although the focus is on high flexibility of methods, a standard mode with methods that are both robust and easy to use is aimed at a large non-specialist community as far as outlier detection and the like are concerned. The online tool is intended to integrate seamlessly into existing analyses and to help avoid pitfalls that may result from mere data quality issues.
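The abstract does not say which methods TSDQ uses, so the following Python sketch is only a generic illustration of the first two steps (missing data analysis and outlier analysis). It assumes a simple robust rule, the modified z-score based on the median and the median absolute deviation, with the commonly quoted threshold of 3.5. All names and numbers are hypothetical.

```python
import math
import statistics


def find_missing(series):
    """Return the positions of missing observations (None or NaN)."""
    return [i for i, v in enumerate(series) if v is None or math.isnan(v)]


def find_outliers(series, threshold=3.5):
    """Flag observations whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. Missing values are skipped.
    """
    observed = [(i, v) for i, v in enumerate(series)
                if v is not None and not math.isnan(v)]
    values = [v for _, v in observed]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # the rule is undefined when more than half the values coincide
    return [i for i, v in observed if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical daily series with two gaps and one implausible spike
series = [100.0, 101.0, None, 99.5, 100.5, 250.0, 100.2, float("nan"), 99.8]
missing = find_missing(series)
outliers = find_outliers(series)
```

A rule of this kind is applied to levels here for simplicity; for financial series one would more often apply it to returns or to residuals from a fitted model.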
Ed Tredger and Dan Thompson, UMACS
Whilst the vast majority of us know that spreadsheets are incredibly useful, there are things that they simply cannot cope with. Our talk will focus on what R enables actuaries to do beyond what existing software can handle, using examples from the London Insurance Market. The talk is based on practical experience of using R in real-world settings and draws on the presenters’ personal experiences.
There are three distinct sections to the talk:
- The limitations of spreadsheets, now and in the future
- Why is R the solution?
- Practical examples of R integrating with Excel
The first section of the talk discusses why increasing volumes of data, growing demand for MI and reporting, and the increasingly complex nature of reserving, pricing and capital modelling will push current software beyond its limits.
The second part of our talk discusses why we believe R is particularly well placed to go beyond the limitations of spreadsheets and to manage the integration of other pieces of software.
The third part will discuss different implementations of R that we have found to add significant end-user value. This part of the talk will use examples of how R has been successfully used in pricing, reporting and capital modelling.
Wim Konings, Reacfin
One of the challenges in the risk management of non-maturing liabilities (e.g. saving accounts) is the estimation of the interest rate sensitivity (or duration) of these instruments. The aim of our talk is twofold:
First, a replicating portfolio approach will be presented for this problem. This approach consists of testing a large number of random investment strategies over a historical time horizon in order to select the strategy that best replicates the interest rate evolution of the saving account. In practice, the replicating portfolio is chosen to be the strategy that minimises the variance of the margin between the saving account rate and the replicating portfolio rate. By testing a large number of portfolios we are also able to construct a risk-return plane that can be used for optimising the investment strategy.
Second, we will present how such a model can be turned into a user friendly web application using the Shiny package.
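The selection rule described above (pick the strategy whose margin against the saving account rate has the smallest variance) can be sketched as follows in Python. This is a deliberately simplified illustration, not the presenters' model: each strategy is assumed to be a fixed set of weights on a few maturity buckets, the portfolio rate is taken as the weighted average of the market rates of the same period, and the weighted average maturity is used as a rough duration proxy. All rates, maturities and function names are invented.

```python
import random
import statistics


def portfolio_rate(weights, rates_at_t):
    """Weighted average market rate for one period."""
    return sum(w * r for w, r in zip(weights, rates_at_t))


def margin_variance(weights, market_rates, savings_rates):
    """Variance over time of (portfolio rate - saving account rate)."""
    margins = [portfolio_rate(weights, rates) - s
               for rates, s in zip(market_rates, savings_rates)]
    return statistics.pvariance(margins)


def random_weights(rng, n):
    """Draw non-negative weights that sum to one."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def best_replicating_portfolio(market_rates, savings_rates, n_trials=2000, seed=1):
    """Random search for the strategy with the smallest margin variance."""
    rng = random.Random(seed)
    n = len(market_rates[0])
    # Start with the single-maturity strategies, then add random mixes
    candidates = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    candidates += [random_weights(rng, n) for _ in range(n_trials)]
    return min(candidates,
               key=lambda w: margin_variance(w, market_rates, savings_rates))


# Hypothetical history: market rates for 1, 5 and 10 year maturities per period
maturities = [1, 5, 10]
market_rates = [
    [0.020, 0.030, 0.040],
    [0.025, 0.032, 0.041],
    [0.015, 0.029, 0.039],
    [0.010, 0.026, 0.038],
    [0.012, 0.025, 0.036],
    [0.018, 0.027, 0.037],
]
savings_rates = [0.015, 0.017, 0.013, 0.010, 0.010, 0.012]

best = best_replicating_portfolio(market_rates, savings_rates)
best_var = margin_variance(best, market_rates, savings_rates)
duration_proxy = sum(w * m for w, m in zip(best, maturities))
```

Keeping the mean margin alongside the variance for every candidate would give the points of the risk-return plane mentioned in the abstract.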
Markus Gesmann, Lloyd’s
Under the Solvency II regime, insurance companies have to demonstrate that applications built outside an IT-controlled environment have an appropriate control framework in place.
Open source communities have already had to overcome these challenges: How to organise work across multiple teams? How to define interfaces? How to deal with security, incident management, documentation, testing, roll out, etc.?
The R documentation on writing R extensions answers those questions and offers a blueprint and framework for end user computing. Over 5000 packages on CRAN demonstrate the success of this approach.
This talk will illustrate how the R package development standards map to the Solvency II requirements for end user computing.
Giorgio Spedicato, ACAS Data Scientist at UnipolSai
Retention analysis is a very important task that insurers carry out when performing tariff revisions. Retention and conversion are the two building blocks of price optimisation, which is currently among the most sophisticated pricing analyses carried out by actuaries working on personal lines.
However, the academic literature has paid little attention to this topic. Predictive modelling techniques such as classification trees, random forests, SVM, neural networks and kNN, as well as time-to-event modelling, have been deeply explored and actively used by data scientists and practitioners in other industries, such as marketing, healthcare and banking.
Logistic regression is almost the only model used in actuarial practice, and the only model implemented in standard pricing software. The aim of this paper is twofold:
First, some of the most widely used data mining techniques will be applied to a real data set and their predictive performance will be compared with standard logistic regression results.
Second, the optimal renewal premium decision will be determined under each methodology, and the results will be compared in terms of the underlying probabilities.
The analysis will be performed using the R statistical software (R
Core Team, 2013) and specialised predictive modelling packages, in
particular the caret package (Kuhn, 2008).
Bernhard Kübler, Fraunhofer Institute for Industrial Mathematics
In the course of Germany’s ‘energy transition’, alternative power sources are being extensively investigated. One form of renewable energy is geothermics, whose resources in Germany are estimated at 1200 EJ (exajoules). As the power of a geothermal installation is proportional to the product of water temperature T and flow rate Q, a geothermal well is said to be successful if the crucial parameters T and Q exceed some critical thresholds fixed by the investor. The hazard that the drilling yields lower values than those required by the investor is referred to as exploration risk.
Typically, investors transfer exploration risk to a reinsurance company. Therefore, not only investors but also insurance and reinsurance companies seek an accurate assessment of exploration risk. Also, low interest rates and regulatory standards (Solvency II) prompt insurers to consider new asset classes such as alternative investments. Because they generate stable cash flows, investments in renewable energies and infrastructure are seen as an eligible strategy for insurers. Within this context, risk management and reporting purposes call for a thorough assessment of the associated risk.
Besides the classical Kriging approach, we particularly employ Support Vector Machine Regression (SVR) to measure exploration risk. To our knowledge, up to now SVR has not been used in the context of geothermics. The major advantage of a Machine Learning based approach is its ability to model even quite complex nonlinear relationships appropriately. We also address estimation risk by deriving statements concerning the reliability of (point) estimates gained by SVR.
Our first results based on real data indicate that the Machine Learning tools yield more precise forecasts than Kriging. This should enable investors and insurers to enhance their assessment of exploration risk, with corresponding potential for cost savings.
Xavier Marechal, Reacfin
In motor insurance, most companies have adopted a risk classification according to the geographical zone where the policyholder lives (urban / non urban for instance, or a more accurate splitting of the country according to Zip codes). The aim of this talk is to introduce in the tariff a new explanatory variable based on the policyholder’s district taking into account the other explanatory variables already present in the technical tariff.
It is usually better to take into account both the frequency and the mean cost, as their behaviour can be quite different, but many companies concentrate on the frequency when establishing their geographical ratemaking.
We will briefly present different solutions to define a categorical
geographical variable but we will focus on the use of GAM for the case
study in R. The
readShapeSpatial function will be introduced to plot
Ana Debon, Universitat Politècnica de València and Steve Haberman, Cass Business School
Comparisons of differential survival by country are useful in many domains. In the area of public policy, they help policymakers and analysts assess how much various groups benefit from public programs, such as Social Security and health care. In financial markets and especially for actuaries, they are important for designing annuities and life insurance.
In this study, we present a method for clustering information about differential survival by country and review mortality indicators to study inequalities and trends in mortality. Then we use this approach to group mortality surfaces for European Union countries. Additionally, the indicators allow us to characterise each group. All these statistical analyses were performed using the R environment for statistical computing.
Using R in a production environment
Kate Hanley, Mango Solutions
Dynamic document generation allows for the seamless integration of R code and document templates when generating reports. Changes in the R code are filtered through to the document, reducing the possibility of errors and discrepancies between the R code and final report, and minimising the burden of quality control procedures.
The knitr package provides excellent tools to facilitate this
process, and allows documents to be generated directly from
R. However, there is often uncertainty concerning which mark-up
language to use to create the report templates, and the choice often
depends on a variety of factors.
This presentation focuses on two widely-used mark-up languages, LaTeX and markdown. Code examples are used to illustrate the differences between the two languages, allowing for a discussion of the pros and cons of each in terms of learning curve, flexibility and ease of use.
Matthew Dowle, data.table-project
The data.table package provides an enhanced version of data.frame,
including fast aggregation of large datasets, fast ordered joins, fast
add/modify/delete of columns by group using no copies at all, list
columns and a fast file reader: fread(). Its goal is to reduce both
programming time (fewer function calls, less variable name repetition)
and compute time on large data (64bit with 8GB+ RAM). It was first
released in 2008.
The presentation will cover the essential syntax (creating a
data.table, fast and friendly file reading with fread,
setkey), update by reference (:= and set*) and ordered joins
(forwards, backwards, limited and nearest), with a focus on quality
assurance (over 1,000 tests and release procedures) and support (a
review of 1,200 Q&A on
Arthur Charpentier, Université du Québec à Montréal, Professor
Bayesian philosophy has a long history in actuarial science. Liu et al. (1996) claim that “Statistical methods with a Bayesian flavour […] have long been used in the insurance industry.” While actuarial students typically discover Bayesian statistics through credibility theory, the Bayesian philosophy can be extremely powerful well beyond it: not only to quantify uncertainty when only a few observations are available (and asymptotic results cannot be invoked), but also to make computations possible when a lot of observations are available (and classical computations would be too complex). With the perspective of a muggle, I will try to explain the power of Bayesian techniques (usually seen as a magical black box by non-Bayesians).