Reserving based on logincremental payments in R, part I
1/08/2013 07:58:00 am
Actuarial
,
Barnett Zehnwirth
,
ChainLadder
,
IBNR
,
Insurance
,
linear model
,
logincremental
,
R
,
reserving
,
Risk
,
Tutorials
No comments
A recent post on the PirateGrunt blog on claims reserving inspired me to look into the paper Regression models based on logincremental payments by Stavros Christofides [1], published as part of the Claims Reserving Manual (Version 2) of the Institute of Actuaries.The paper is available together with a spread sheet model, illustrating the calculations. It is very much based on ideas by Barnett and Zehnwirth, see [2] for a reference. However, doing statistical analysis in a spread sheet programme is often cumbersome. I will go through the first 15 pages of Christofides' paper today and illustrate how the model can be implemented in R.
Let's start with the example data of an incremental claims triangle:
## Page D5.4
tri < t(matrix(
c(11073, 6427, 1839, 766,
14799, 9357, 2344, NA,
15636, 10523, NA, NA,
16913, NA, NA, NA),
nc=4, dimnames=list(origin=0:3, dev=0:3)))
The above triangle shows incremental claims payments for four origin (accident) years over time (development years). It is the aim to predict the bottom right triangle of future claims payments, assuming no further claims after four development years.Christofides model assumes the following structure for the incremental paid claims \(P_{ij}\):
\begin{align}
\ln(P_{ij}) & = Y_{ij} = a_i + b_j + \epsilon_{ij}
\end{align}where i and j go from 0 to 3, \(b_0=0\) and \(\epsilon_{ij} \sim N(0, \sigma^2)\). Unlike the basic chainladder method, this is a stochastic model that allows me to test my assumptions and calculate various statistics, e.g. standards errors of my predictions.
For the purpose of my analysis I will work with the data in form of a data frame:
m < dim(tri)[1]; n < dim(tri)[2]
dat < data.frame(
origin=rep(0:(m1), n),
dev=rep(0:(n1), each=m),
value=as.vector(tri))
rownames(dat) < with(dat, paste(origin, dev, sep=""))
dat < dat[order(dat$origin),]
dat
# origin dev value
# 00 0 0 11073
# 10 1 0 14799
# 20 2 0 15636
# 30 3 0 16913
# 01 0 1 6427
# 11 1 1 9357
# 21 2 1 10523
# 31 3 1 NA
# 02 0 2 1839
# 12 1 2 2344
# 22 2 2 NA
# 32 3 2 NA
# 03 0 3 766
# 13 1 3 NA
# 23 2 3 NA
# 33 3 3 NA
I add a few columns to my data, in particular factor variables of the origin and development years, a calendar year dimension and the log value of the paid claims.## Add dimensions as factors
dat < with(dat, data.frame(origin, dev, cal=origin+dev,
value, logvalue=log(value),
originf=factor(origin),
devf=as.factor(dev),
calf=as.factor(origin+dev)))
rownames(dat) < with(dat, paste(origin, dev, sep=""))
dat < dat[order(dat$origin),]
dat ## Page D5.7
# origin dev cal value logvalue originf devf calf
# 00 0 0 0 11073 9.312265 0 0 0
# 01 0 1 1 6427 8.768263 0 1 1
# 02 0 2 2 1839 7.516977 0 2 2
# 03 0 3 3 766 6.641182 0 3 3
# 10 1 0 1 14799 9.602315 1 0 1
# 11 1 1 2 9357 9.143880 1 1 2
# 12 1 2 3 2344 7.759614 1 2 3
# 13 1 3 4 NA NA 1 3 4
# 20 2 0 2 15636 9.657331 2 0 2
# 21 2 1 3 10523 9.261319 2 1 3
# 22 2 2 4 NA NA 2 2 4
# 23 2 3 5 NA NA 2 3 5
# 30 3 0 3 16913 9.735838 3 0 3
# 31 3 1 4 NA NA 3 1 4
# 32 3 2 5 NA NA 3 2 5
# 33 3 3 6 NA NA 3 3 6
I have done all the preparation and can carry out the linear regression with lm
:Fit < lm(logvalue ~ originf + devf + 0, data=dat)
summary(Fit) # Page D5.7/8
#
# Call:
# lm(formula = logvalue ~ originf + devf + 0, data = dat)
#
# Residuals:
# 00 01 02 03 10
# 2.389e02 5.396e02 3.007e02 6.939e18 1.118e02
# 11 12 20 21 30
# 1.889e02 3.007e02 3.507e02 3.507e02 0.000e+00
#
# Coefficients:
# Estimate Std. Error t value Pr(>t)
# originf0 9.28837 0.04001 232.17 1.76e07 ***
# originf1 9.59114 0.04001 239.73 1.60e07 ***
# originf2 9.69240 0.04277 226.62 1.89e07 ***
# originf3 9.73584 0.05238 185.86 3.43e07 ***
# devf1 0.46615 0.04277 10.90 0.00165 **
# devf2 1.80146 0.05015 35.92 4.75e05 ***
# devf3 2.64719 0.06591 40.16 3.40e05 ***
# 
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.05238 on 3 degrees of freedom
# Multiple Rsquared: 1, Adjusted Rsquared: 1
# Fstatistic: 4.03e+04 on 7 and 3 DF, pvalue: 1.884e07
The above output shows the same results as the paper. The estimators for the origin periods are clearly different from zero, yet the tests for originf1
to originf3
don't make much sense. It would be more appropriate to understand if I need different parameters for each origin period and hence a model such as lm(logvalue ~ originf + devf, data=dat)
can be more helpful. It would test the coefficients of the later origin periods against the base of the first one. Lets look at the residuals:
# Resdiual plots
op < par(mfrow=c(2,2))
attach(model.frame(Fit))
plot.default(rstandard(Fit) ~ originf,
main="Residuals vs. origin years")
abline(h=0, lty=2)
plot.default(rstandard(Fit) ~ devf,
main="Residuals vs. dev. years")
abline(h=0, lty=2)
with(na.omit(dat),
plot.default(rstandard(Fit) ~ calf,
main="Residuals vs. payments years"))
abline(h=0, lty=2)
plot.default(rstandard(Fit) ~ logvalue,
main="Residuals vs. fitted")
abline(h=0, lty=2)
detach(model.frame(Fit))
par(op)
The residual plots don't give any reason to dismiss the model, so I continue.In my next step I extract the design matrix from the model and build the future design matrix from the data. I will need both to calculate the prediction and standard errors on the original scale to follow the paper:
# Model design matrix, page D5.7
dm < model.matrix(formula(Fit), dat=model.frame(Fit))
dm
# originf0 originf1 originf2 originf3 devf1 devf2 devf3
# 00 1 0 0 0 0 0 0
# 10 0 1 0 0 0 0 0
# 20 0 0 1 0 0 0 0
# 30 0 0 0 1 0 0 0
# 01 1 0 0 0 1 0 0
# 11 0 1 0 0 1 0 0
# 21 0 0 1 0 1 0 0
# 02 1 0 0 0 0 1 0
# 12 0 1 0 0 0 1 0
# 03 1 0 0 0 0 0 1
## Future design matrix, page D5.11
fdm < model.matrix( ~ originf + devf + 0, data=dat[is.na(dat$value),])
fdm
# originf0 originf1 originf2 originf3 devf1 devf2 devf3
# 31 0 0 0 1 1 0 0
# 22 0 0 1 0 0 1 0
# 32 0 0 0 1 0 1 0
# 13 0 1 0 0 0 0 1
# 23 0 0 1 0 0 0 1
# 33 0 0 0 1 0 0 1
Following the paper I can calculate the variancecovariance matrix:# Page D5.12
# fdm %*% solve(t(dm)%*%dm) %*% t(fdm) *sigma^2, or shorter
varcovar < fdm %*% vcov(Fit) %*% t(fdm)
round(varcovar,4)
# 13 22 23 31 32 33
# 13 0.0046 0.0000 0.0037 0.0000 0.0000 0.0037
# 22 0.0000 0.0034 0.0021 0.0000 0.0021 0.0007
# 23 0.0037 0.0021 0.0053 0.0000 0.0007 0.0039
# 31 0.0000 0.0000 0.0000 0.0046 0.0037 0.0037
# 32 0.0000 0.0021 0.0007 0.0037 0.0053 0.0039
# 33 0.0037 0.0007 0.0039 0.0037 0.0039 0.0071
With the above matrix I can derive the variance of my future values, which are the diagonal elements of the above matrix plus the model variance \(\sigma^2\). Now I have everything to calculate the future claims amounts and standard errors on the original scale. Recall that for a lognormal distribution I have \(E(X) = \exp(\mu + 1/2 \sigma^2)\) and \(Var(X) = \exp(2\mu + \sigma^2)(\exp(\sigma^2)  1)\).# Page D5.12
sigma < summary(Fit)$sigma
sigma
# 0.05238207
Var < varcovar + sigma^2
VarY < Var[row(Var)==col(Var)]
Y < fdm %*% coef(Fit)
P < exp(Y + VarY/2)
VarP < exp(2*Y + VarY)*(exp(VarY)1)
seP < sqrt(VarP)
i < fdm %*% c((1:m)1, rep(0, (n1)))
j < fdm %*% c(rep(0, (m1)), (1:n)1)
Results < data.frame(i,j, Y, VarY, P, VarP, seP)
Results # Page D5.13
# i j Y VarY P VarP seP
# 13 1 3 6.943950 0.007317016 1040.658 7953.165 89.18052
# 22 2 2 7.890940 0.006173732 2681.219 44519.839 210.99725
# 23 2 3 7.045210 0.008002987 1151.950 10662.490 103.25933
# 31 3 1 9.269688 0.007317016 10650.334 833010.240 912.69395
# 32 3 2 7.934378 0.008002987 2802.814 63121.859 251.24064
# 33 3 3 7.088648 0.009832241 1204.192 14327.851 119.69900
Well, it is actually easier in R, as I could have used the function predict
, which does most of the above behind the scene:
newData < dat[is.na(dat$logvalue), c("originf", "devf")]
Pred < predict(Fit, newdata=newData, se.fit=TRUE)
Y < Pred$fit
VarY < Pred$se.fit^2 + Pred$residual.scale^2
fdm < model.matrix(~ originf + devf + 0, data=newData)
Nevermind, let's complete the triangle:lower.tri < xtabs(P ~ i+j, dat=Results)
lower.tri
# j
# i 1 2 3
# 1 0.000 0.000 1040.658
# 2 0.000 2681.219 1151.950
# 3 10650.334 2802.814 1204.192
Full.Incr.Triangle < tri
Full.Incr.Triangle[row(tri) > (nrow(tri) + 1  col(tri))] <
lower.tri[row(lower.tri) > (nrow(lower.tri)  col(lower.tri))]
Full.Cum.Triangle < apply(Full.Incr.Triangle, 1, cumsum)
Full.Cum.Triangle
# dev
# 0 1 2 3
# 0 11073 17500.00 19339.00 20105.00
# 1 14799 24156.00 26500.00 27540.66
# 2 15636 26159.00 28840.22 29992.17
# 3 16913 27563.33 30366.15 31570.34
Finally I want to estimate the overall error for my total reserve, the sum of all future incremental claims payments. I have all the values already and using the sweep
statement in R it is easy to calculate the overall variance for the total reserve, but I have to account for the covariances first:# Page D5.14
# Calculate the covariance between the predictions
CoVar < sweep(sweep((exp(varcovar)1), 1, P, "*"), 2, P, "*")
# I set the values on the diagonal to zero as I have to use
# the variances I calculated earlier (VarP),
# which includes the model variance sigma^2 as well.
CoVar[col(CoVar)==row(CoVar)] < 0
round(CoVar,0)
# 13 22 23 31 32 33
# 13 0 0 4394 0 0 4593
# 22 0 0 6363 0 15481 2216
# 23 4394 6363 0 0 2216 5403
# 31 0 0 0 0 109410 47006
# 32 0 15481 2216 109410 0 13145
# 33 4593 2216 5403 47006 13145 0
# Add the variances together to estimate the overall variance
OverallVar < sum(CoVar) + sum(VarP)
Total.SE < sqrt(OverallVar)
Total.SE
# [1] 1180.698
Overall.Reserve < sum(lower.tri)
Overall.Reserve
# [1] 19531.17
Total.SE / Overall.Reserve
# [1] 0.06045198
That's it, the projected reserve is $19,531 with an estimated overall standard error of $1,181 or just 6% of the overall reserve estimate.For comparison here is the output of a Mack chainladder model [3] run on the same triangle using the
ChainLadder
package [4]:library(ChainLadder)
M < MackChainLadder(incr2cum(tri), est.sigma="Mack")
M$FullTriangle
# dev
# origin 1 2 3 4
# 0 11073 17500.00 19339.00 20105.00
# 1 14799 24156.00 26500.00 27549.64
# 2 15636 26159.00 28785.83 29926.01
# 3 16913 27632.15 30406.90 31611.29
M
# MackChainLadder(Triangle = incr2cum(tri), est.sigma = "Mack")
#
# Latest Dev.To.Date Ultimate IBNR Mack.S.E CV(IBNR)
# 0 20,105 1.000 20,105 0 0.0 NaN
# 1 26,500 0.962 27,550 1,050 31.3 0.0298
# 2 26,159 0.874 29,926 3,767 177.1 0.0470
# 3 16,913 0.535 31,611 14,698 948.7 0.0645
#
# Totals
# Latest: 89,677.00
# Dev: 0.82
# Ultimate: 109,191.94
# IBNR: 19,514.94
# Mack S.E.: 980.34
# CV(IBNR): 0.05
The Mack chainladder results are actually quite similar to the loglinear model, but provide only point estimators without a distribution.Conclusions
It is actually very straightforward to implement the regression model based on logincremental payments in R. In particular I can run my code for any other triangle size without having to adjust any of my formulas, something which can sometimes be quite timeintensive and error prone to do in spread sheets.Here is the reduced model of the second example in Christofides' paper, which I cover in more detail in a second post.:
dat < data.frame( # Page D5.17
origin=rep(0:6, each=7),
dev=rep(0:6, 7),
value= c(3511, 3215, 2266, 1712, 1059, 587, 340,
4001, 3702, 2278, 1180, 956, 629, NA,
4355, 3932, 1946, 1522, 1238, NA, NA,
4295, 3455, 2023, 1320, NA, NA, NA,
4150, 3747, 2320, NA, NA, NA, NA,
5102, 4548, NA, NA, NA, NA, NA,
6283, NA, NA, NA, NA, NA, NA))
dat < with(dat,
data.frame(origin, dev, cal=origin+dev,
value, logvalue=log(value),
a6 = ifelse(origin == 6, 1, 0),
a5 = ifelse(origin == 5, 1, 0),
d = ifelse(dev < 1, 1, 0),
s = ifelse(dev < 1, 0, dat$dev)))
summary(Fit < lm(logvalue ~ a5 + a6 + d + s, data=na.omit(dat)))
# Call:
# lm(formula = logvalue ~ a5 + a6 + d + s, data = na.omit(dat))
#
# Residuals:
# Min 1Q Median 3Q Max
# 0.21567 0.04910 0.00654 0.05137 0.27198
#
# Coefficients:
# Estimate Std. Error t value Pr(>t)
# (Intercept) 8.60795 0.05150 167.142 < 2e16 ***
# a5 0.24353 0.08517 2.859 0.008870 **
# a6 0.44111 0.12170 3.625 0.001421 **
# d 0.30345 0.06779 4.476 0.000172 ***
# s 0.43967 0.01666 26.390 < 2e16 ***
# 
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.1119 on 23 degrees of freedom
# Multiple Rsquared: 0.9804, Adjusted Rsquared: 0.977
# Fstatistic: 287.7 on 4 and 23 DF, pvalue: < 2.2e16
I am sure there are better ways to fit such models in R, in particular in dealing with the design matrix and changing them, e.g. to reduce the model and take out nonsignificant factor levels, although this seems to be a bit of a nono.
Jim Guszcza has a few more tips on reserving in his webinar on Actuarial Analytics in R and the ChainLadder package has further stochastic reserving methods already implemented.
You find the code of this post also as a gist on Github. Feedback, comments, tips and hints will be much appreciated.
References
[1] Stavros Christofides. Regression models based on logincremental payments. Claims Reserving Manual. Volume 2 D5. September 1997[2] Glen Barnett and Ben Zehnwirth. Best estimates for reserves. Proceedings of the CAS, LXXXVII(167), November 2000.
[3] Thomas Mack. The standard error of chain ladder reserve estimates: Recursive calculation and inclusion of a tail factor. Astin Bulletin, Vol. 29(2):361 – 266, 1999.
[4] Markus Gesmann, Dan Murphy, and Wayne Zhang. ChainLadder: Mack, Bootstrap and Munichchainladder methods for insurance claims reserving, 2012. R package version 0.1.54.
Session Info
R version 2.15.2 Patched (20130101 r61512)
Platform: x86_64appledarwin9.8.0/x86_64 (64bit)
locale:
[1] en_GB.UTF8/en_GB.UTF8/en_GB.UTF8/C/en_GB.UTF8/en_GB.UTF8
attached base packages:
[1] grid splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] ChainLadder_0.1.54 tweedie_2.1.5 statmod_1.4.16 cplm_0.64
lme4_0.9999990 ggplot2_0.9.3 coda_0.161 biglm_0.8 DBI_0.25
reshape2_1.2.2 actuar_1.15 RUnit_0.4.26 systemfit_1.114 lmtest_0.930
zoo_1.79 car_2.015 nnet_7.35 MASS_7.322 Matrix_1.010 lattice_0.2010
Hmisc_3.101 survival_2.372
loaded via a namespace (and not attached):
[1] cluster_1.14.3 colorspace_1.20 dichromat_1.24 digest_0.6.0
gtable_0.1.2 labeling_0.1 minqa_1.2.1 munsell_0.4 nlme_3.1106
plyr_1.8 proto_0.310 RColorBrewer_1.05 sandwich_2.29 scales_0.2.3
stats4_2.15.2 stringr_0.6.2
2013/01/08
Subscribe to:
Post Comments
(
Atom
)
Popular Posts

The guys at RStudio have done a fantastic job with shiny . It is really easy to build web apps with R using shiny. With the help of Joe Ch...

I really should make it a habit of using data.table . The speed and simplicity of this R package are astonishing. Here is a simple exampl...

Iris versicolor By Danielle Langlois License: CCBYSA R is a language , as Luis Apiolaza pointed out in his recent post . This is ab...

Today I feel very lucky, as I have been invited to the Royal Statistical Society conference to give a tutorial on interactive web graphs wi...

A friend of mine asked me the other day how she could use the function optim in R to fit data. Of course there are functions for fitting da...

Last Saturday I met the guys from RStudio at the R in Finance conference in Chicago. I was curious to find out what RStudio could offer. I...

Fitting distribution with R is something I have to do once in a while, but where do I start? A good starting point to learn more about dis...

Tonight I will give a talk at the Cambridge R user group about googleVis . Following my good experience with knitr and RStudio to create...

Transforming data sets with R is usually the starting point of my data analysis work. Here is a scenario which comes up from time to time: t...

Lattice plots are a great way of displaying multivariate data in R. Deepayan Sarkar, the author of lattice, has written a fantastic book ab...
No comments :
Post a Comment