## Hierarchical Loss Reserving with Stan

I continue with the growth curve model for loss reserving from last week's post. Today, following the ideas of James Guszcza [2] I will add an hierarchical component to the model, by treating the ultimate loss cost of an accident year as a random effect. Initially, I will use the

Last week's model assumed that cumulative claims payment could be described by a growth curve. I used the Weibull curve and will do so here again, but others should be considered as well, e.g. the log-logistic cumulative distribution function for long tail business, see [1].

The growth curve describes the proportion of claims paid up to a given development period compared to the ultimate claims cost at the end of time, hence often called development pattern. Cumulative distribution functions are often considered, as they increase monotonously from 0 to 100%. Multiplying the development pattern with the expected ultimate loss cost gives me then the expected cumulative paid to date value.

However, what I'd like to do is the opposite, I know the cumulative claims position to date and wish to estimate the ultimate claims cost instead. If the claims process is fairly stable over the years and say, once a claim has been notified the payment process is quite similar from year to year and claim to claim, then a growth curve model is not unreasonable. Yet, the number and the size of the yearly claims will be random, e.g. if a windstorm, fire, etc occurs or not. Hence, a random effect for the ultimate loss cost across accident years sounds very convincing to me.

Here is James' model as described in [2]:

\[

\begin{align}

CL_{AY, dev} & \sim \mathcal{N}(\mu_{AY, dev}, \sigma^2_{dev}) \\

\mu_{dev} & = Ult_{AY} \cdot G(dev|\omega, \theta)\\

\sigma_{dev} & = \sigma \sqrt{\mu_{dev}}\\

Ult_{AY} & \sim \mathcal{N}(\mu_{ult}, \sigma^2_{ult})\\

G(dev|\omega, \theta) & = 1 - \exp\left(-\left(\frac{dev}{\theta}\right)^\omega\right)

\end{align}

\]The cumulative losses \(CL_{AY, dev}\) for a given accident year \(AY\) and development period \(dev\) follow a Normal distribution with parameters \(\mu_{AY, dev}\) and \(\sigma_{dev}\).

The mean itself is modelled as the product of an accident year specific ultimate loss cost \(Ult_{AY}\) and a development period specific parametric growth curve \(G(dev | \omega, \theta)\). The variance is believed to increase in proportion with the mean. Finally the ultimate loss cost is modelled with a Normal distribution as well.

Assuming a Gaussian distribution of losses doesn't sound quite intuitive to me, as loss are often skewed to the right, but I shall continue with this assumption here to make a comparison with [2] possible.

Using the example data set given in the paper I can reproduce the result in R with

The fit looks pretty good, with only 5 parameters. See James' paper for a more detailed discussion.

Let's move this model into Stan. Here is my attempt, which builds on last week's pooled model. With the generated quantities code block I go beyond the scope of the original paper, as I try to estimate the full posterior predictive distribution as well.

The 'trick' is the line

The notation

Let's run the model:

The estimated parameters look very similar to the

Let's take a look at the parameter traceplot and the densities of the estimated ultimate loss costs by origin year.

This looks all not too bad. The trace plots don't show any particular patterns, apart from \(\sigma_{ult}\), which shows a little skewness.

The

The model seems to work rather well, even with the Gaussian distribution assumptions. Yet, it has still only 5 parameters. Note, this model doesn't need an additional artificial tail factor either.

James Guszcza contributed to a follow up paper with Y. Zhank and V. Dukic [3] that extends the model described in [2]. It deals with skewness in loss data sets and the autoregressive nature of the errors in a cumulative time series.

Fank Schmid offers a more complex Bayesian analysis of claims reserving in [4], while Jake Morris highlights the similarities between a compartmental model used in drug research and loss reserving [5].

Finally, Glenn Meyers published a monograph on

[2] James Guszcza.

[3] Y. Zhang, V. Dukic, and James Guszcza.

[4] Frank A. Schmid.

[5] Jake Morris.

[6] Stan Development Team.

[7] Glenn Meyers.

`nlme`

R package, just as James did in his paper, and then move on to Stan/RStan [6], which will allow me to estimate the full distribution of future claims payments.Last week's model assumed that cumulative claims payment could be described by a growth curve. I used the Weibull curve and will do so here again, but others should be considered as well, e.g. the log-logistic cumulative distribution function for long tail business, see [1].

The growth curve describes the proportion of claims paid up to a given development period compared to the ultimate claims cost at the end of time, hence often called development pattern. Cumulative distribution functions are often considered, as they increase monotonously from 0 to 100%. Multiplying the development pattern with the expected ultimate loss cost gives me then the expected cumulative paid to date value.

However, what I'd like to do is the opposite, I know the cumulative claims position to date and wish to estimate the ultimate claims cost instead. If the claims process is fairly stable over the years and say, once a claim has been notified the payment process is quite similar from year to year and claim to claim, then a growth curve model is not unreasonable. Yet, the number and the size of the yearly claims will be random, e.g. if a windstorm, fire, etc occurs or not. Hence, a random effect for the ultimate loss cost across accident years sounds very convincing to me.

Here is James' model as described in [2]:

\[

\begin{align}

CL_{AY, dev} & \sim \mathcal{N}(\mu_{AY, dev}, \sigma^2_{dev}) \\

\mu_{dev} & = Ult_{AY} \cdot G(dev|\omega, \theta)\\

\sigma_{dev} & = \sigma \sqrt{\mu_{dev}}\\

Ult_{AY} & \sim \mathcal{N}(\mu_{ult}, \sigma^2_{ult})\\

G(dev|\omega, \theta) & = 1 - \exp\left(-\left(\frac{dev}{\theta}\right)^\omega\right)

\end{align}

\]The cumulative losses \(CL_{AY, dev}\) for a given accident year \(AY\) and development period \(dev\) follow a Normal distribution with parameters \(\mu_{AY, dev}\) and \(\sigma_{dev}\).

The mean itself is modelled as the product of an accident year specific ultimate loss cost \(Ult_{AY}\) and a development period specific parametric growth curve \(G(dev | \omega, \theta)\). The variance is believed to increase in proportion with the mean. Finally the ultimate loss cost is modelled with a Normal distribution as well.

Assuming a Gaussian distribution of losses doesn't sound quite intuitive to me, as loss are often skewed to the right, but I shall continue with this assumption here to make a comparison with [2] possible.

Using the example data set given in the paper I can reproduce the result in R with

`nlme`

:The fit looks pretty good, with only 5 parameters. See James' paper for a more detailed discussion.

Let's move this model into Stan. Here is my attempt, which builds on last week's pooled model. With the generated quantities code block I go beyond the scope of the original paper, as I try to estimate the full posterior predictive distribution as well.

The 'trick' is the line

` mu[i] <- ult[origin[i]] * weibull_cdf(dev[i], omega, theta);`

where I have an accident year (here labelled origin) specific ultimate loss. The notation

`ult[origin[i]]`

illustrates the hierarchical nature in Stan's language nicely.Let's run the model:

The estimated parameters look very similar to the

`nlme`

output above.Let's take a look at the parameter traceplot and the densities of the estimated ultimate loss costs by origin year.

This looks all not too bad. The trace plots don't show any particular patterns, apart from \(\sigma_{ult}\), which shows a little skewness.

The

`generated quantities`

code block in Stan allows me to get also the predictive distribution beyond the current data range. Here I forecast claims up to development year 12 and plot the predictions, including the 90% credibility interval of the posterior predictive distribution with the observations. The model seems to work rather well, even with the Gaussian distribution assumptions. Yet, it has still only 5 parameters. Note, this model doesn't need an additional artificial tail factor either.

### Conclusions

The Bayesian approach sounds to me a lot more natural than many classical techniques around the chain-ladder methods. Thanks to Stan, I can get the full posterior distributions on both, the parameters and predictive distribution. I find communicating credibility intervals much easier than talking about the parameter, process and mean squared error.James Guszcza contributed to a follow up paper with Y. Zhank and V. Dukic [3] that extends the model described in [2]. It deals with skewness in loss data sets and the autoregressive nature of the errors in a cumulative time series.

Fank Schmid offers a more complex Bayesian analysis of claims reserving in [4], while Jake Morris highlights the similarities between a compartmental model used in drug research and loss reserving [5].

Finally, Glenn Meyers published a monograph on

*Stochastic Loss Reserving Using Bayesian MCMC Models*earlier this year [7] that is worth taking a look at.### References

[1] David R. Clark.*LDF Curve-Fitting and Stochastic Reserving: A Maximum Likelihood Approach.*Casualty Actuarial Society, 2003. CAS Fall Forum.[2] James Guszcza.

*Hierarchical Growth Curve Models for Loss Reserving*, 2008, CAS Fall Forum, pp. 146–173.[3] Y. Zhang, V. Dukic, and James Guszcza.

*A Bayesian non-linear model for forecasting insurance loss payments.*2012. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175: 637–656. doi: 10.1111/j.1467-985X.2011.01002.x[4] Frank A. Schmid.

*Robust Loss Development Using MCMC*. Available at SSRN. See also http://lossdev.r-forge.r-project.org/[5] Jake Morris.

*Compartmental reserving in R*. 2015. R in Insurance Conference.[6] Stan Development Team.

*Stan: A C++ Library for Probability and Sampling, Version 2.8.0*. 2015. http://mc-stan.org/.[7] Glenn Meyers.

*Stochastic Loss Reserving Using Bayesian MCMC Models.*Issue 1 of CAS Monograph Series. 2015.### Session Info

```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ChainLadder_0.2.3 rstan_2.8.0 ggplot2_1.0.1 Rcpp_0.12.1
[5] lattice_0.20-33
loaded via a namespace (and not attached):
[1] nloptr_1.0.4 plyr_1.8.3 tools_3.2.2
[4] digest_0.6.8 lme4_1.1-10 statmod_1.4.21
[7] gtable_0.1.2 nlme_3.1-122 mgcv_1.8-8
[10] Matrix_1.2-2 parallel_3.2.2 biglm_0.9-1
[13] SparseM_1.7 proto_0.3-10 coda_0.18-1
[16] gridExtra_2.0.0 stringr_1.0.0 MatrixModels_0.4-1
[19] lmtest_0.9-34 stats4_3.2.2 grid_3.2.2
[22] nnet_7.3-11 tweedie_2.2.1 inline_0.3.14
[25] cplm_0.7-4 minqa_1.2.4 actuar_1.1-10
[28] reshape2_1.4.1 car_2.1-0 magrittr_1.5
[31] scales_0.3.0 codetools_0.2-14 MASS_7.3-44
[34] splines_3.2.2 rsconnect_0.3.79 systemfit_1.1-18
[37] pbkrtest_0.4-2 colorspace_1.2-6 quantreg_5.19
[40] labeling_0.3 sandwich_2.3-4 stringi_1.0-1
[43] munsell_0.4.2 zoo_1.7-12
```

10 Nov 2015
06:36
Insurance
,
R
,
reserving
,
Stan
,
stochastic reserving

## Loss Developments via Growth Curves and Stan

Last week I posted a biological example of fitting a non-linear growth curve with Stan/RStan. Today, I want to apply a similar approach to insurance data using ideas by David Clark [1] and James Guszcza [2].

Instead of predicting the growth of dugongs (sea cows), I would like to predict the growth of cumulative insurance loss payments over time, originated from different origin years. Loss payments of younger accident years are just like a new generation of dugongs, they will be small in size initially, grow as they get older, until the losses are fully settled.

Here is my example data set:

Following the articles cited above I will assume that the growth can be explained by a Weibull curve, with two parameters \(\theta\) (scale) and \(\omega\) (shape):

\[

G(dev|\omega, \theta) = 1 - \exp\left(-\left(\frac{dev}{\theta}\right)^\omega\right)

\]Inspired by the classical over-dispersed Poisson GLM in actuarial science, Guszcza [2] assumes a power variance structure for the process error as well. For the purpose of this post I will assume that claims cost follow a Normal distribution, to make a comparison with a the classical least square regression easier. With a prior estimate of the ultimate claims cost the cumulative loss can be modelled as:

\[

\begin{align}

CL_{AY, dev} & \sim \mathcal{N}(\mu_{dev}, \sigma^2_{dev}) \\

\mu_{dev} & = Ult \cdot G(dev|\omega, \theta)\\

\sigma_{dev} & = \sigma \sqrt{\mu_{dev}}

\end{align}

\]Perhaps, the above formula suggests a hierarchical modelling approach, with different ultimate loss costs by accident year. Indeed, this is the focus of [2] and I will endeavour to reproduce the results with Stan in my next post, but today I will stick to a pooled model that assumes a constant ultimate loss across all accident years, i.e. \(Ult_{AY} = Ult \;\forall\; AY\).

To prepare my analysis I read in the data as a long data frame instead of the matrix structure above. Additionally, I compose another data frame that I will use later to predict payments two years beyond the first 10 years. Furthermore, to align the output with [2] I relabelled the development periods from years to months, so that year 1 becomes month 6, year 2 becomes month 18, etc. The accident years run from 1991 to 2000, while the variable

To get a reference point for Stan I start with a non-linear least square regression:

The output above doesn't look unreasonable, apart from the accident years 1991, 1992 and 1998. The output of

The Stan code below is very similar to last week. Again, I am interested here in the posterior distributions, hence I add a block to generate quantities from those. Note, Stan comes with a build-in function for the cumulative Weibull distribution

I store the Stan code in a separate text file, which I read into R to compile the model. The compilation takes a little time. The sampling itself is done in a few seconds.

Let's take a look at the output:

The parameters are a little different to the output of

Finally, I can plot the 90% credible intervals of the posterior distributions.

The 90% prediction credible interval captures most of the data and although this model might not be suitable for reserving individual accident years, it could provide an initial starting point for further investigations. Additionally, thanks to the Bayesian model I have access to the full distribution, not just point estimators and standard errors.

My next post will continue with this data set and the ideas of James Guszcza by allowing the ultimate loss cost to vary by accident year, treating it as a random effect. Here is a teaser of what the output will look like:

[2] Guszcza, J.

pp. 146–173.

Instead of predicting the growth of dugongs (sea cows), I would like to predict the growth of cumulative insurance loss payments over time, originated from different origin years. Loss payments of younger accident years are just like a new generation of dugongs, they will be small in size initially, grow as they get older, until the losses are fully settled.

Here is my example data set:

Following the articles cited above I will assume that the growth can be explained by a Weibull curve, with two parameters \(\theta\) (scale) and \(\omega\) (shape):

\[

G(dev|\omega, \theta) = 1 - \exp\left(-\left(\frac{dev}{\theta}\right)^\omega\right)

\]Inspired by the classical over-dispersed Poisson GLM in actuarial science, Guszcza [2] assumes a power variance structure for the process error as well. For the purpose of this post I will assume that claims cost follow a Normal distribution, to make a comparison with a the classical least square regression easier. With a prior estimate of the ultimate claims cost the cumulative loss can be modelled as:

\[

\begin{align}

CL_{AY, dev} & \sim \mathcal{N}(\mu_{dev}, \sigma^2_{dev}) \\

\mu_{dev} & = Ult \cdot G(dev|\omega, \theta)\\

\sigma_{dev} & = \sigma \sqrt{\mu_{dev}}

\end{align}

\]Perhaps, the above formula suggests a hierarchical modelling approach, with different ultimate loss costs by accident year. Indeed, this is the focus of [2] and I will endeavour to reproduce the results with Stan in my next post, but today I will stick to a pooled model that assumes a constant ultimate loss across all accident years, i.e. \(Ult_{AY} = Ult \;\forall\; AY\).

To prepare my analysis I read in the data as a long data frame instead of the matrix structure above. Additionally, I compose another data frame that I will use later to predict payments two years beyond the first 10 years. Furthermore, to align the output with [2] I relabelled the development periods from years to months, so that year 1 becomes month 6, year 2 becomes month 18, etc. The accident years run from 1991 to 2000, while the variable

`origin`

maps those years from 1 to 10.To get a reference point for Stan I start with a non-linear least square regression:

The output above doesn't look unreasonable, apart from the accident years 1991, 1992 and 1998. The output of

`gnls`

gives me an opportunity to provide my prior distributions with good starting points. I will use an inverse Gamma as a prior for \(\sigma\), constrained Normals for the parameters \(\theta, \omega\) and \(Ult\) as well. If you have better ideas, then please get in touch.The Stan code below is very similar to last week. Again, I am interested here in the posterior distributions, hence I add a block to generate quantities from those. Note, Stan comes with a build-in function for the cumulative Weibull distribution

`weibull_cdf`

.I store the Stan code in a separate text file, which I read into R to compile the model. The compilation takes a little time. The sampling itself is done in a few seconds.

Let's take a look at the output:

The parameters are a little different to the output of

`gnls`

, but well within the standard error. From the plots I notice that the samples for the ultimate loss as well as for \(\theta\) are a little skewed to the right. Well, assuming cumulative losses to follow a Normal distribution was a bit a of stretch to start with. Still, the samples seem to converge.Finally, I can plot the 90% credible intervals of the posterior distributions.

The 90% prediction credible interval captures most of the data and although this model might not be suitable for reserving individual accident years, it could provide an initial starting point for further investigations. Additionally, thanks to the Bayesian model I have access to the full distribution, not just point estimators and standard errors.

My next post will continue with this data set and the ideas of James Guszcza by allowing the ultimate loss cost to vary by accident year, treating it as a random effect. Here is a teaser of what the output will look like:

### References

[1] David R. Clark.*LDF Curve-Fitting and Stochastic Reserving: A Maximum Likelihood Approach.*Casualty Actuarial Society, 2003. CAS Fall Forum.[2] Guszcza, J.

*Hierarchical Growth Curve Models for Loss Reserving*, 2008, CAS Fall Forum,pp. 146–173.

### Session Info

```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] rstan_2.8.0 ggplot2_1.0.1 Rcpp_0.12.1
[4] nlme_3.1-122 lattice_0.20-33 ChainLadder_0.2.3
loaded via a namespace (and not attached):
[1] nloptr_1.0.4 plyr_1.8.3 tools_3.2.2
[4] digest_0.6.8 lme4_1.1-10 statmod_1.4.21
[7] gtable_0.1.2 mgcv_1.8-8 Matrix_1.2-2
[10] parallel_3.2.2 biglm_0.9-1 SparseM_1.7
[13] proto_0.3-10 gridExtra_2.0.0 coda_0.18-1
[16] stringr_1.0.0 MatrixModels_0.4-1 stats4_3.2.2
[19] lmtest_0.9-34 grid_3.2.2 nnet_7.3-11
[22] inline_0.3.14 tweedie_2.2.1 cplm_0.7-4
[25] minqa_1.2.4 reshape2_1.4.1 car_2.1-0
[28] actuar_1.1-10 magrittr_1.5 codetools_0.2-14
[31] scales_0.3.0 MASS_7.3-44 splines_3.2.2
[34] systemfit_1.1-18 rsconnect_0.3.79 pbkrtest_0.4-2
[37] colorspace_1.2-6 labeling_0.3 quantreg_5.19
[40] sandwich_2.3-4 stringi_1.0-1 munsell_0.4.2
[43] zoo_1.7-12
```

## Non-linear growth curves with Stan

I suppose the go to tool for fitting non-linear models in R is

The original example itself is taken from OpenBUGS. The data describes the length and age measurements for 27 captured dugongs (sea cows). Carlin and Gelfand (1991) model the data using a nonlinear growth curve with no inflection point and an asymptote as \(x_i\) tends to infinity:

\[

Y_i \sim \mathcal{N}(\mu_i, \sigma^2),\; i = 1,\dots,27\\

\mu_i = \alpha - \beta \lambda^{x_i},\; \alpha,\, \beta > 0;\, 0 < \lambda < 1 \] Fitting the curve with

Writing the model in Stan requires a few more lines, but gives me also the opportunity to generate output from the posterior distributions.

The predicted parameters and errors are very much the same as in the least square output of

`nls`

of the `stats`

package. In this post I will show an alternative approach with Stan/RStan, as illustrated in the example, *Dugongs: "nonlinear growth curve"*, that is part of Stan's documentation.The original example itself is taken from OpenBUGS. The data describes the length and age measurements for 27 captured dugongs (sea cows). Carlin and Gelfand (1991) model the data using a nonlinear growth curve with no inflection point and an asymptote as \(x_i\) tends to infinity:

\[

Y_i \sim \mathcal{N}(\mu_i, \sigma^2),\; i = 1,\dots,27\\

\mu_i = \alpha - \beta \lambda^{x_i},\; \alpha,\, \beta > 0;\, 0 < \lambda < 1 \] Fitting the curve with

`nls`

gives the following results:Writing the model in Stan requires a few more lines, but gives me also the opportunity to generate output from the posterior distributions.

The predicted parameters and errors are very much the same as in the least square output of

`nls`

, but with the Stan output I can also review the 90% credible intervals.### Session Info

```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] rstan_2.8.0 ggplot2_1.0.1.9003 Rcpp_0.12.1
loaded via a namespace (and not attached):
[1] colorspace_1.2-6 scales_0.3.0 plyr_1.8.3
[4] parallel_3.2.2 tools_3.2.2 inline_0.3.14
[7] gtable_0.1.2 gridExtra_2.0.0 codetools_0.2-14
[10] grid_3.2.2 stats4_3.2.2 munsell_0.4.2
```

27 Oct 2015
07:46
growth model
,
non-linear
,
R
,
Stan

## R in Insurance 2016

Following the successful 3rd R in Insurance conference in Amsterdam this year, we will return to London next year. We will be back at Cass Business School, 11 July 2016.

The event will focus again on the use of R in insurance, bringing together experts from industry and academia with a diverse background of disciplines, such as actuarial science, catastrophe modelling, finance, statistics and computer science.

We are delighted to announce or keynote speakers already: Dan Murphy and Mario V. Wüthrich.

More details about the event will be published in due course here. For now, save the date in your diary.

PS: The picture above is a cluster analysis of the Lloyd's building in R.

6 Oct 2015
07:32
Cass Business School
,
Conference
,
Insurance
,
R
,
R in Insurance

## ChainLadder 0.2.2 is out with improved glmReserve function

We released version 0.2.2 of

Ok, what does this all mean? I will run through a couple of examples and look behind the scene of

Like most other functions in ChainLadder,

The data at hand is often presented in form of a claims triangle, such as the following example data set from the ChainLadder package:

Suppose all claims will be reported within 7 years, then I'd like to know how much money should be set aside for the origin years 2008 to 2013 for claims that have incurred but not been reported (IBNR) yet. Or, to put it differently, I have to predict the

First, let's reformat the data as it would be stored in a database, that is in a long format of incremental claims over the years (I add the years also as factors, which I will use later):

One of the oldest methods to predict future claims development is called chain-ladder, which can be regarded as a weighted linear regression through the origin of cumulative claims over the development periods. Multiplying those development factors to the latest available cumulative position allows me to predict future claims in an iterative fashion.

It is well know in actuarial science that a Poisson GLM produces the same forecasts as the chain-ladder model.

Let's check:

There is another aspect to highlight with the Poisson model; its variance is equal to the mean. Yet, in real data I often observe that the variance increases in proportion to the mean. Well, this can be remedied by using an over-dispersed quasi-Poisson model.

I think a more natural approach would be to assume a compound distribution that models the frequency and severity of claims separately, e.g. Poisson frequency and Gamma severity.

Here the Tweedie distributions comes into play.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two parameter distributions from the linear exponential family that also have a dispersion parameter \(\Phi\).

Furthermore, the variance is a power function of the mean, i.e. \(\mbox{Var}(X)=\Phi \, E[X]^p\).

The canonical link function for a Tweedie distribution in a GLM is the power link \(\mu^q\) with \(q=1-p\). Note, \(q=0\) is interpreted as \(\log(\mu)\).

Thus, let \(\mu_i = E(y_i)\) be the expectation of the ith response. Then I have the following model.

\[

y \sim \mbox{Tweedie}(q, p)\\

E(y) = \mu^q = Xb \\

\mbox{Var}(y) = \Phi \mu^p

\]The variance power \(p\) characterises the distribution of the responses \(y\). The following are some special cases:

Finally, I get back to the

With

In my first example I set use the parameters \(p=1, q=0\), which should return the results of the Poisson model.

Setting the argument

Not surprisingly the estimated reserves are similar to the Poisson model, but with a smaller predicted standard error.

Intuitively the modelling approach makes a lot more sense, but I end up with one parameter for each origin and development period, hence there is a danger of over-parametrisation.

Looking at the plots above again I note that many origin periods have a very similar development. Perhaps a hierarchical model would be more appropriate?

For more details on

`ChainLadder`

a few weeks ago. This version adds back the functionality to estimate the index parameter for the compound Poisson model in `glmReserve`

using the `cplm`

package by Wayne Zhang. Ok, what does this all mean? I will run through a couple of examples and look behind the scene of

`glmReserve`

. However, the clue is in the title, `glmReserve`

is a function that uses a generalised linear model to estimate future claims, assuming claims follow a Tweedie distribution. I should actually talk about a family of distributions that is known as Tweedie, named by Bent Jørgensen after Maurice Tweedie. Joe Rickert published a nice post about the Tweedie distribution last year.Like most other functions in ChainLadder,

`glmReserve`

purpose is to predict future insurance claims based on historical data.The data at hand is often presented in form of a claims triangle, such as the following example data set from the ChainLadder package:

```
library(ChainLadder)
cum2incr(UKMotor)
## dev
## origin 1 2 3 4 5 6 7
## 2007 3511 3215 2266 1712 1059 587 340
## 2008 4001 3702 2278 1180 956 629 NA
## 2009 4355 3932 1946 1522 1238 NA NA
## 2010 4295 3455 2023 1320 NA NA NA
## 2011 4150 3747 2320 NA NA NA NA
## 2012 5102 4548 NA NA NA NA NA
## 2013 6283 NA NA NA NA NA NA
```

The rows present different origin periods in which accidents occurred and the columns along each row show the incremental reported claims over the years (the data itself is stored in a cumulative form). Suppose all claims will be reported within 7 years, then I'd like to know how much money should be set aside for the origin years 2008 to 2013 for claims that have incurred but not been reported (IBNR) yet. Or, to put it differently, I have to predict the

`NA`

fields in the bottom right hand triangle.First, let's reformat the data as it would be stored in a database, that is in a long format of incremental claims over the years (I add the years also as factors, which I will use later):

```
claims <- as.data.frame(cum2incr(UKMotor)) # convert into long format
library(data.table)
claims <- data.table(claims)
claims <- claims[ , ':='(cal=origin+dev-1, # calendar period
originf=factor(origin),
devf=factor(dev))]
claims <- claims[order(dev), cum.value:=cumsum(value), by=origin]
setnames(claims, "value", "inc.value")
head(claims)
## origin dev inc.value cal originf devf cum.value
## 1: 2007 1 3511 2007 2007 1 3511
## 2: 2008 1 4001 2008 2008 1 4001
## 3: 2009 1 4355 2009 2009 1 4355
## 4: 2010 1 4295 2010 2010 1 4295
## 5: 2011 1 4150 2011 2011 1 4150
## 6: 2012 1 5102 2012 2012 1 5102
```

Let's visualise the data:```
library(lattice)
xyplot(cum.value/1000 + log(inc.value) ~ dev , groups=origin, data=claims,
t="b", par.settings = simpleTheme(pch = 16),
auto.key = list(space="right",
title="Origin\nperiod", cex.title=1,
points=FALSE, lines=TRUE, type="b"),
xlab="Development period", ylab="Amount",
main="Claims development by origin period",
scales="free")
```

The left plot of the chart above shows the cumulative claims payment over time, while the right plot shows the log-transformed incremental claims development for each origin/accident year.One of the oldest methods to predict future claims development is called chain-ladder, which can be regarded as a weighted linear regression through the origin of cumulative claims over the development periods. Multiplying those development factors to the latest available cumulative position allows me to predict future claims in an iterative fashion.

It is well know in actuarial science that a Poisson GLM produces the same forecasts as the chain-ladder model.

Let's check:

```
# Poisson model
mdl.pois <- glm(inc.value ~ originf + devf, data=na.omit(claims),
family=poisson(link = "log"))
# predict claims
claims <- claims[, ':='(
pred.inc.value=predict(mdl.pois,
.SD[, list(originf, devf, inc.value)],
type="response")), by=list(originf, devf)]
# sum of future payments
claims[cal>max(origin)][, sum(pred.inc.value)]
## [1] 28655.77
# Chain-ladder forecast
summary(MackChainLadder(UKMotor, est.sigma = "Mack"))$Totals[4,]
## [1] 28655.77
```

Ok, this worked. However, both of these models make actually fairly strong assumptions. The Poisson model by its very nature will only produce whole numbers, and although payments could be regarded as whole numbers, say in pence or cents, it does feel a little odd to me. Similarly, modelling the year on year developments via a weighted linear regression through the origin, as in the case of the chain-ladder model, sounds not intuitive either.There is another aspect to highlight with the Poisson model; its variance is equal to the mean. Yet, in real data I often observe that the variance increases in proportion to the mean. Well, this can be remedied by using an over-dispersed quasi-Poisson model.

I think a more natural approach would be to assume a compound distribution that models the frequency and severity of claims separately, e.g. Poisson frequency and Gamma severity.

Here the Tweedie distributions comes into play.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two parameter distributions from the linear exponential family that also have a dispersion parameter \(\Phi\).

Furthermore, the variance is a power function of the mean, i.e. \(\mbox{Var}(X)=\Phi \, E[X]^p\).

The canonical link function for a Tweedie distribution in a GLM is the power link \(\mu^q\) with \(q=1-p\). Note, \(q=0\) is interpreted as \(\log(\mu)\).

Thus, let \(\mu_i = E(y_i)\) be the expectation of the ith response. Then I have the following model.

\[

y \sim \mbox{Tweedie}(q, p)\\

E(y) = \mu^q = Xb \\

\mbox{Var}(y) = \Phi \mu^p

\]The variance power \(p\) characterises the distribution of the responses \(y\). The following are some special cases:

- Normal distribution, p = 0
- Poisson distribution, p = 1
- Compound Poisson-Gamma distribution, 1 < p < 2
- Gamma distribution, p = 2
- Inverse-Gaussian, p = 3
- Stable, with support on the positive real numbers, p > 2

Finally, I get back to the

`glmReserve`

function, which Wayne Zhang, the other author of the `cplm`

package, contributed to the `ChainLadder`

package.With

`glmReserve`

I can model a claims triangle using the Tweedie distribution family.In my first example I set use the parameters \(p=1, q=0\), which should return the results of the Poisson model.

```
(m1 <- glmReserve(UKMotor, var.power = 1, link.power = 0))
## Latest Dev.To.Date Ultimate IBNR S.E CV
## 2008 12746 0.9732000 13097 351 125.8106 0.35843464
## 2009 12993 0.9260210 14031 1038 205.0826 0.19757473
## 2010 11093 0.8443446 13138 2045 278.8519 0.13635790
## 2011 10217 0.7360951 13880 3663 386.7919 0.10559429
## 2012 9650 0.5739948 16812 7162 605.2741 0.08451188
## 2013 6283 0.3038201 20680 14397 1158.1250 0.08044210
## total 62982 0.6872913 91638 28656 1708.1963 0.05961042
```

Perfect, I get the same results, plus further information about the model.Setting the argument

`var.power=NULL`

will estimate \(p\) in the interval \((1,2)\) using the `cplm`

package.```
(m2 <- glmReserve(UKMotor, var.power=NULL))
## Latest Dev.To.Date Ultimate IBNR S.E CV
## 2008 12746 0.9732000 13097 351 110.0539 0.31354394
## 2009 12993 0.9260870 14030 1037 176.9361 0.17062307
## 2010 11093 0.8444089 13137 2044 238.5318 0.11669851
## 2011 10217 0.7360951 13880 3663 335.6824 0.09164138
## 2012 9650 0.5739948 16812 7162 543.6472 0.07590718
## 2013 6283 0.3038201 20680 14397 1098.7988 0.07632138
## total 62982 0.6873063 91636 28654 1622.4616 0.05662252
m2$model
##
## Call:
## cpglm(formula = value ~ factor(origin) + factor(dev), link = link.power,
## data = ldaFit, offset = offset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.7901 -1.6969 0.0346 1.6087 8.4465
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.25763 0.04954 166.680 < 2e-16 ***
## factor(origin)2008 0.03098 0.05874 0.527 0.605588
## factor(origin)2009 0.09999 0.05886 1.699 0.110018
## factor(origin)2010 0.03413 0.06172 0.553 0.588369
## factor(origin)2011 0.08933 0.06365 1.403 0.180876
## factor(origin)2012 0.28091 0.06564 4.279 0.000659 ***
## factor(origin)2013 0.48797 0.07702 6.336 1.34e-05 ***
## factor(dev)2 -0.11740 0.04264 -2.753 0.014790 *
## factor(dev)3 -0.62829 0.05446 -11.538 7.38e-09 ***
## factor(dev)4 -1.03168 0.06957 -14.830 2.28e-10 ***
## factor(dev)5 -1.31346 0.08857 -14.829 2.28e-10 ***
## factor(dev)6 -1.86307 0.13826 -13.475 8.73e-10 ***
## factor(dev)7 -2.42868 0.25468 -9.536 9.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated dispersion parameter: 10.788
## Estimated index parameter: 1.01
##
## Residual deviance: 299.27 on 15 degrees of freedom
## AIC: 389.18
##
## Number of Fisher Scoring iterations: 4
```

From the model I note that the dispersion parameter \(\phi\) was estimated as 10.788 and the index parameter \(p\) as 1.01.Not surprisingly the estimated reserves are similar to the Poisson model, but with a smaller predicted standard error.

Intuitively the modelling approach makes a lot more sense, but I end up with one parameter for each origin and development period, hence there is a danger of over-parametrisation.

Looking at the plots above again I note that many origin periods have a very similar development. Perhaps a hierarchical model would be more appropriate?

For more details on

`glmReserve`

see the help file and package vignette.### Session Info

```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 ChainLadder_0.2.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 nloptr_1.0.4 plyr_1.8.3
[4] tools_3.2.2 digest_0.6.8 lme4_1.1-9
[7] statmod_1.4.21 gtable_0.1.2 nlme_3.1-121
[10] lattice_0.20-33 mgcv_1.8-7 Matrix_1.2-2
[13] parallel_3.2.2 biglm_0.9-1 SparseM_1.7
[16] proto_0.3-10 coda_0.17-1 stringr_1.0.0
[19] MatrixModels_0.4-1 stats4_3.2.2 lmtest_0.9-34
[22] grid_3.2.2 nnet_7.3-10 tweedie_2.2.1
[25] cplm_0.7-4 minqa_1.2.4 ggplot2_1.0.1
[28] reshape2_1.4.1 car_2.1-0 actuar_1.1-10
[31] magrittr_1.5 scales_0.3.0 MASS_7.3-44
[34] splines_3.2.2 systemfit_1.1-18 pbkrtest_0.4-2
[37] colorspace_1.2-6 quantreg_5.19 sandwich_2.3-4
[40] stringi_0.5-5 munsell_0.4.2 chron_2.3-47
[43] zoo_1.7-12
```

## Notes from the Kölner R meeting, 18 September 2015

Last Friday the Cologne R user group came together for the 15th time. Since its inception over three years ago the group evolved from a small gathering in a pub into an active data science community, covering wider topics than just R. Still, R is the link and clue between the different interests. Last Friday's agenda was a good example of this, with three talks touching on workflow management, web development and risk analysis.

Yuki Katoh had travelled all the way from Berlin to present on how to embed R with

Kicking off the luigi script starts the workflow, and

Shiny is a very popular R package that allows users to develop interactive browser applications. Paul Viefers introduced us to the extension

Paul showed us an example of a shinyapp that depending on the user plotted a different graph. Behind the scene his script would either hide or shows those plots, conditioned on the user. With only a few lines in R it allowed him to develop a user specific application. To achieve this he created a login screen that checks for user name and password. In his example he had hard coded the login credentials, instead of using a secure connection via a professional shiny server instance. However this was sufficient for his purpose, where he tests how students react to different economic scenarios in a lab environment at university.

The last talk of the meeting had a more statistical focus with examples from insurance. I repeated my talk from the LondonR user group meeting in June. One of the challenge in insurance is that despite of having many customers , insurance companies will have little claims data per customer to assess risks.

I presented some Bayesian ideas to analyse risks with little data. I used the wonderful "Hit and run accident" example from Daniel Kahneman's book

Please get in touch, if you would like to present at the next meeting.

### R in a big data pipeline

Download slides |

`luigi`

into a heterogeneous workflow of different applications. This is especially useful when R needs to be integrated with hadoop/hdfs based technologies, such as Spark and Hive. Luigi is not unlike Make, which Kirill presented at our last meeting in June. In a configuration file Yuki specified the various workflow steps and dependencies between the jobs.Kicking off the luigi script starts the workflow, and

`luigid`

server allows Yuki to monitor the various parts of the dependency graph visually. Thus, he can see the progress of his workflow in real time and identify quickly, when and where a sub process fails. As Yuki pointed out, this becomes critical in production systems, where failures need to be known and fixed quickly, unlike when ones carries out an explorative analysis in a development/research environment. See also Yuki's blog post for further details.### Shiny + Shinyjs

Download presentation files |

`shinyjs`

, a package written by Dean Attali. The name suggests already that the package provides additional JavaScript functionality. Indeed, it does, but without the need to learn JavaScript, as those functions are wrapped into R. Paul showed us an example of a shinyapp that depending on the user plotted a different graph. Behind the scene his script would either hide or shows those plots, conditioned on the user. With only a few lines in R it allowed him to develop a user specific application. To achieve this he created a login screen that checks for user name and password. In his example he had hard coded the login credentials, instead of using a secure connection via a professional shiny server instance. However this was sufficient for his purpose, where he tests how students react to different economic scenarios in a lab environment at university.

### Experience vs. Data

Download slides |

I presented some Bayesian ideas to analyse risks with little data. I used the wonderful "Hit and run accident" example from Daniel Kahneman's book

*Thinking, fast and slow*to explain Bayes' formula, introduced Bayesian belief networks for a claims analysis and discussed the challenge of predicting events when they haven't happened yet (also in Stan). Along the way I mentioned a few ideas on communicating risk, which I learned from David Spiegelhalter earlier this year.### Next Kölner R meeting

The next meeting will be scheduled in December. Details will be published on our Meetup site. Thanks again to Revolution Analytics/Microsoft for their sponsorship.Please get in touch, if you would like to present at the next meeting.

22 Sep 2015
07:14
Insurance
,
Koelner R User
,
Kölner R Users
,
luigi
,
R
,
shinyjs

## Next Kölner R User Meeting: Friday, 18 September 2015

The 15th Cologne R user group meeting is scheduled for this Friday, 18 September 2015 and we have a full agenda with three talks followed by networking drinks.

**R in big data pipeline with**(Yuki Katoh)*luigi*

*R in big data pipeline: Put your awesome R codes into production. Learn how to build solid big data pipeline around it.*

**shinyjs**(Paul Viefers)

*Using JavaScript in shiny, without knowing JavaScript*

**Experience vs Data**(Markus Gesmann)

*How to asses risks with small data sets. Bayesian ideas, belief networks and a small simulation example in Stan/rstan.*

### Drinks and Networking

The event will be followed by drinks (Kölsch!) and networking opportunities.For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Kirill Pomogajko and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.

15 Sep 2015
07:32
Kölner R Users
,
KölnR
,
News
,
R

Subscribe to:
Posts
(
Atom
)