## R in Insurance 2016

Following the successful 3rd R in Insurance conference in Amsterdam this year, we will return to London next year. We will be back at Cass Business School, 11 July 2016.

The event will focus again on the use of R in insurance, bringing together experts from industry and academia with a diverse background of disciplines, such as actuarial science, catastrophe modelling, finance, statistics and computer science.

We are delighted to announce or keynote speakers already: Dan Murphy and Mario V. Wüthrich.

More details about the event will be published in due course here. For now, save the date in your diary.

PS: The picture above is a cluster analysis of the Lloyd's building in R.

6 Oct 2015
07:32
Cass Business School
,
Conference
,
Insurance
,
R
,
R in Insurance

## ChainLadder 0.2.2 is out with improved glmReserve function

We released version 0.2.2 of

Ok, what does this all mean? I will run through a couple of examples and look behind the scene of

Like most other functions in ChainLadder,

The data at hand is often presented in form of a claims triangle, such as the following example data set from the ChainLadder package:

Suppose all claims will be reported within 7 years, then I'd like to know how much money should be set aside for the origin years 2008 to 2013 for claims that have incurred but not been reported (IBNR) yet. Or, to put it differently, I have to predict the

First, let's reformat the data as it would be stored in a database, that is in a long format of incremental claims over the years (I add the years also as factors, which I will use later):

One of the oldest methods to predict future claims development is called chain-ladder, which can be regarded as a weighted linear regression through the origin of cumulative claims over the development periods. Multiplying those development factors to the latest available cumulative position allows me to predict future claims in an iterative fashion.

It is well know in actuarial science that a Poisson GLM produces the same forecasts as the chain-ladder model.

Let's check:

There is another aspect to highlight with the Poisson model; its variance is equal to the mean. Yet, in real data I often observe that the variance increases in proportion to the mean. Well, this can be remedied by using an over-dispersed quasi-Poisson model.

I think a more natural approach would be to assume a compound distribution that models the frequency and severity of claims separately, e.g. Poisson frequency and Gamma severity.

Here the Tweedie distributions comes into play.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two parameter distributions from the linear exponential family that also have a dispersion parameter \(\Phi\).

Furthermore, the variance is a power function of the mean, i.e. \(\mbox{Var}(X)=\Phi \, E[X]^p\).

The canonical link function for a Tweedie distribution in a GLM is the power link \(\mu^q\) with \(q=1-p\). Note, \(q=0\) is interpreted as \(\log(\mu)\).

Thus, let \(\mu_i = E(y_i)\) be the expectation of the ith response. Then I have the following model.

\[

y \sim \mbox{Tweedie}(q, p)\\

E(y) = \mu^q = Xb \\

\mbox{Var}(y) = \Phi \mu^p

\]The variance power \(p\) characterises the distribution of the responses \(y\). The following are some special cases:

Finally, I get back to the

With

In my first example I set use the parameters \(p=1, q=0\), which should return the results of the Poisson model.

Setting the argument

Not surprisingly the estimated reserves are similar to the Poisson model, but with a smaller predicted standard error.

Intuitively the modelling approach makes a lot more sense, but I end up with one parameter for each origin and development period, hence there is a danger of over-parametrisation.

Looking at the plots above again I note that many origin periods have a very similar development. Perhaps a hierarchical model would be more appropriate?

For more details on

`ChainLadder`

a few weeks ago. This version adds back the functionality to estimate the index parameter for the compound Poisson model in `glmReserve`

using the `cplm`

package by Wayne Zhang. Ok, what does this all mean? I will run through a couple of examples and look behind the scene of

`glmReserve`

. However, the clue is in the title, `glmReserve`

is a function that uses a generalised linear model to estimate future claims, assuming claims follow a Tweedie distribution. I should actually talk about a family of distributions that is known as Tweedie, named by Bent Jørgensen after Maurice Tweedie. Joe Rickert published a nice post about the Tweedie distribution last year.Like most other functions in ChainLadder,

`glmReserve`

purpose is to predict future insurance claims based on historical data.The data at hand is often presented in form of a claims triangle, such as the following example data set from the ChainLadder package:

```
library(ChainLadder)
cum2incr(UKMotor)
## dev
## origin 1 2 3 4 5 6 7
## 2007 3511 3215 2266 1712 1059 587 340
## 2008 4001 3702 2278 1180 956 629 NA
## 2009 4355 3932 1946 1522 1238 NA NA
## 2010 4295 3455 2023 1320 NA NA NA
## 2011 4150 3747 2320 NA NA NA NA
## 2012 5102 4548 NA NA NA NA NA
## 2013 6283 NA NA NA NA NA NA
```

The rows present different origin periods in which accidents occurred and the columns along each row show the incremental reported claims over the years (the data itself is stored in a cumulative form). Suppose all claims will be reported within 7 years, then I'd like to know how much money should be set aside for the origin years 2008 to 2013 for claims that have incurred but not been reported (IBNR) yet. Or, to put it differently, I have to predict the

`NA`

fields in the bottom right hand triangle.First, let's reformat the data as it would be stored in a database, that is in a long format of incremental claims over the years (I add the years also as factors, which I will use later):

```
claims <- as.data.frame(cum2incr(UKMotor)) # convert into long format
library(data.table)
claims <- data.table(claims)
claims <- claims[ , ':='(cal=origin+dev-1, # calendar period
originf=factor(origin),
devf=factor(dev))]
claims <- claims[order(dev), cum.value:=cumsum(value), by=origin]
setnames(claims, "value", "inc.value")
head(claims)
## origin dev inc.value cal originf devf cum.value
## 1: 2007 1 3511 2007 2007 1 3511
## 2: 2008 1 4001 2008 2008 1 4001
## 3: 2009 1 4355 2009 2009 1 4355
## 4: 2010 1 4295 2010 2010 1 4295
## 5: 2011 1 4150 2011 2011 1 4150
## 6: 2012 1 5102 2012 2012 1 5102
```

Let's visualise the data:```
library(lattice)
xyplot(cum.value/1000 + log(inc.value) ~ dev , groups=origin, data=claims,
t="b", par.settings = simpleTheme(pch = 16),
auto.key = list(space="right",
title="Origin\nperiod", cex.title=1,
points=FALSE, lines=TRUE, type="b"),
xlab="Development period", ylab="Amount",
main="Claims development by origin period",
scales="free")
```

The left plot of the chart above shows the cumulative claims payment over time, while the right plot shows the log-transformed incremental claims development for each origin/accident year.One of the oldest methods to predict future claims development is called chain-ladder, which can be regarded as a weighted linear regression through the origin of cumulative claims over the development periods. Multiplying those development factors to the latest available cumulative position allows me to predict future claims in an iterative fashion.

It is well know in actuarial science that a Poisson GLM produces the same forecasts as the chain-ladder model.

Let's check:

```
# Poisson model
mdl.pois <- glm(inc.value ~ originf + devf, data=na.omit(claims),
family=poisson(link = "log"))
# predict claims
claims <- claims[, ':='(
pred.inc.value=predict(mdl.pois,
.SD[, list(originf, devf, inc.value)],
type="response")), by=list(originf, devf)]
# sum of future payments
claims[cal>max(origin)][, sum(pred.inc.value)]
## [1] 28655.77
# Chain-ladder forecast
summary(MackChainLadder(UKMotor, est.sigma = "Mack"))$Totals[4,]
## [1] 28655.77
```

Ok, this worked. However, both of these models make actually fairly strong assumptions. The Poisson model by its very nature will only produce whole numbers, and although payments could be regarded as whole numbers, say in pence or cents, it does feel a little odd to me. Similarly, modelling the year on year developments via a weighted linear regression through the origin, as in the case of the chain-ladder model, sounds not intuitive either.There is another aspect to highlight with the Poisson model; its variance is equal to the mean. Yet, in real data I often observe that the variance increases in proportion to the mean. Well, this can be remedied by using an over-dispersed quasi-Poisson model.

I think a more natural approach would be to assume a compound distribution that models the frequency and severity of claims separately, e.g. Poisson frequency and Gamma severity.

Here the Tweedie distributions comes into play.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two parameter distributions from the linear exponential family that also have a dispersion parameter \(\Phi\).

Furthermore, the variance is a power function of the mean, i.e. \(\mbox{Var}(X)=\Phi \, E[X]^p\).

The canonical link function for a Tweedie distribution in a GLM is the power link \(\mu^q\) with \(q=1-p\). Note, \(q=0\) is interpreted as \(\log(\mu)\).

Thus, let \(\mu_i = E(y_i)\) be the expectation of the ith response. Then I have the following model.

\[

y \sim \mbox{Tweedie}(q, p)\\

E(y) = \mu^q = Xb \\

\mbox{Var}(y) = \Phi \mu^p

\]The variance power \(p\) characterises the distribution of the responses \(y\). The following are some special cases:

- Normal distribution, p = 0
- Poisson distribution, p = 1
- Compound Poisson-Gamma distribution, 1 < p < 2
- Gamma distribution, p = 2
- Inverse-Gaussian, p = 3
- Stable, with support on the positive real numbers, p > 2

Finally, I get back to the

`glmReserve`

function, which Wayne Zhang, the other author of the `cplm`

package, contributed to the `ChainLadder`

package.With

`glmReserve`

I can model a claims triangle using the Tweedie distribution family.In my first example I set use the parameters \(p=1, q=0\), which should return the results of the Poisson model.

```
(m1 <- glmReserve(UKMotor, var.power = 1, link.power = 0))
## Latest Dev.To.Date Ultimate IBNR S.E CV
## 2008 12746 0.9732000 13097 351 125.8106 0.35843464
## 2009 12993 0.9260210 14031 1038 205.0826 0.19757473
## 2010 11093 0.8443446 13138 2045 278.8519 0.13635790
## 2011 10217 0.7360951 13880 3663 386.7919 0.10559429
## 2012 9650 0.5739948 16812 7162 605.2741 0.08451188
## 2013 6283 0.3038201 20680 14397 1158.1250 0.08044210
## total 62982 0.6872913 91638 28656 1708.1963 0.05961042
```

Perfect, I get the same results, plus further information about the model.Setting the argument

`var.power=NULL`

will estimate \(p\) in the interval \((1,2)\) using the `cplm`

package.```
(m2 <- glmReserve(UKMotor, var.power=NULL))
## Latest Dev.To.Date Ultimate IBNR S.E CV
## 2008 12746 0.9732000 13097 351 110.0539 0.31354394
## 2009 12993 0.9260870 14030 1037 176.9361 0.17062307
## 2010 11093 0.8444089 13137 2044 238.5318 0.11669851
## 2011 10217 0.7360951 13880 3663 335.6824 0.09164138
## 2012 9650 0.5739948 16812 7162 543.6472 0.07590718
## 2013 6283 0.3038201 20680 14397 1098.7988 0.07632138
## total 62982 0.6873063 91636 28654 1622.4616 0.05662252
m2$model
##
## Call:
## cpglm(formula = value ~ factor(origin) + factor(dev), link = link.power,
## data = ldaFit, offset = offset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.7901 -1.6969 0.0346 1.6087 8.4465
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.25763 0.04954 166.680 < 2e-16 ***
## factor(origin)2008 0.03098 0.05874 0.527 0.605588
## factor(origin)2009 0.09999 0.05886 1.699 0.110018
## factor(origin)2010 0.03413 0.06172 0.553 0.588369
## factor(origin)2011 0.08933 0.06365 1.403 0.180876
## factor(origin)2012 0.28091 0.06564 4.279 0.000659 ***
## factor(origin)2013 0.48797 0.07702 6.336 1.34e-05 ***
## factor(dev)2 -0.11740 0.04264 -2.753 0.014790 *
## factor(dev)3 -0.62829 0.05446 -11.538 7.38e-09 ***
## factor(dev)4 -1.03168 0.06957 -14.830 2.28e-10 ***
## factor(dev)5 -1.31346 0.08857 -14.829 2.28e-10 ***
## factor(dev)6 -1.86307 0.13826 -13.475 8.73e-10 ***
## factor(dev)7 -2.42868 0.25468 -9.536 9.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated dispersion parameter: 10.788
## Estimated index parameter: 1.01
##
## Residual deviance: 299.27 on 15 degrees of freedom
## AIC: 389.18
##
## Number of Fisher Scoring iterations: 4
```

From the model I note that the dispersion parameter \(\phi\) was estimated as 10.788 and the index parameter \(p\) as 1.01.Not surprisingly the estimated reserves are similar to the Poisson model, but with a smaller predicted standard error.

Intuitively the modelling approach makes a lot more sense, but I end up with one parameter for each origin and development period, hence there is a danger of over-parametrisation.

Looking at the plots above again I note that many origin periods have a very similar development. Perhaps a hierarchical model would be more appropriate?

For more details on

`glmReserve`

see the help file and package vignette.### Session Info

```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 ChainLadder_0.2.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 nloptr_1.0.4 plyr_1.8.3
[4] tools_3.2.2 digest_0.6.8 lme4_1.1-9
[7] statmod_1.4.21 gtable_0.1.2 nlme_3.1-121
[10] lattice_0.20-33 mgcv_1.8-7 Matrix_1.2-2
[13] parallel_3.2.2 biglm_0.9-1 SparseM_1.7
[16] proto_0.3-10 coda_0.17-1 stringr_1.0.0
[19] MatrixModels_0.4-1 stats4_3.2.2 lmtest_0.9-34
[22] grid_3.2.2 nnet_7.3-10 tweedie_2.2.1
[25] cplm_0.7-4 minqa_1.2.4 ggplot2_1.0.1
[28] reshape2_1.4.1 car_2.1-0 actuar_1.1-10
[31] magrittr_1.5 scales_0.3.0 MASS_7.3-44
[34] splines_3.2.2 systemfit_1.1-18 pbkrtest_0.4-2
[37] colorspace_1.2-6 quantreg_5.19 sandwich_2.3-4
[40] stringi_0.5-5 munsell_0.4.2 chron_2.3-47
[43] zoo_1.7-12
```

## Notes from the Kölner R meeting, 18 September 2015

Last Friday the Cologne R user group came together for the 15th time. Since its inception over three years ago the group evolved from a small gathering in a pub into an active data science community, covering wider topics than just R. Still, R is the link and clue between the different interests. Last Friday's agenda was a good example of this, with three talks touching on workflow management, web development and risk analysis.

Yuki Katoh had travelled all the way from Berlin to present on how to embed R with

Kicking off the luigi script starts the workflow, and

Shiny is a very popular R package that allows users to develop interactive browser applications. Paul Viefers introduced us to the extension

Paul showed us an example of a shinyapp that depending on the user plotted a different graph. Behind the scene his script would either hide or shows those plots, conditioned on the user. With only a few lines in R it allowed him to develop a user specific application. To achieve this he created a login screen that checks for user name and password. In his example he had hard coded the login credentials, instead of using a secure connection via a professional shiny server instance. However this was sufficient for his purpose, where he tests how students react to different economic scenarios in a lab environment at university.

The last talk of the meeting had a more statistical focus with examples from insurance. I repeated my talk from the LondonR user group meeting in June. One of the challenge in insurance is that despite of having many customers , insurance companies will have little claims data per customer to assess risks.

I presented some Bayesian ideas to analyse risks with little data. I used the wonderful "Hit and run accident" example from Daniel Kahneman's book

Please get in touch, if you would like to present at the next meeting.

### R in a big data pipeline

Download slides |

`luigi`

into a heterogeneous workflow of different applications. This is especially useful when R needs to be integrated with hadoop/hdfs based technologies, such as Spark and Hive. Luigi is not unlike Make, which Kirill presented at our last meeting in June. In a configuration file Yuki specified the various workflow steps and dependencies between the jobs.Kicking off the luigi script starts the workflow, and

`luigid`

server allows Yuki to monitor the various parts of the dependency graph visually. Thus, he can see the progress of his workflow in real time and identify quickly, when and where a sub process fails. As Yuki pointed out, this becomes critical in production systems, where failures need to be known and fixed quickly, unlike when ones carries out an explorative analysis in a development/research environment. See also Yuki's blog post for further details.### Shiny + Shinyjs

Download presentation files |

`shinyjs`

, a package written by Dean Attali. The name suggests already that the package provides additional JavaScript functionality. Indeed, it does, but without the need to learn JavaScript, as those functions are wrapped into R. Paul showed us an example of a shinyapp that depending on the user plotted a different graph. Behind the scene his script would either hide or shows those plots, conditioned on the user. With only a few lines in R it allowed him to develop a user specific application. To achieve this he created a login screen that checks for user name and password. In his example he had hard coded the login credentials, instead of using a secure connection via a professional shiny server instance. However this was sufficient for his purpose, where he tests how students react to different economic scenarios in a lab environment at university.

### Experience vs. Data

Download slides |

I presented some Bayesian ideas to analyse risks with little data. I used the wonderful "Hit and run accident" example from Daniel Kahneman's book

*Thinking, fast and slow*to explain Bayes' formula, introduced Bayesian belief networks for a claims analysis and discussed the challenge of predicting events when they haven't happened yet (also in Stan). Along the way I mentioned a few ideas on communicating risk, which I learned from David Spiegelhalter earlier this year.### Next Kölner R meeting

The next meeting will be scheduled in December. Details will be published on our Meetup site. Thanks again to Revolution Analytics/Microsoft for their sponsorship.Please get in touch, if you would like to present at the next meeting.

22 Sep 2015
07:14
Insurance
,
Koelner R User
,
Kölner R Users
,
luigi
,
R
,
shinyjs

## Next Kölner R User Meeting: Friday, 18 September 2015

The 15th Cologne R user group meeting is scheduled for this Friday, 18 September 2015 and we have a full agenda with three talks followed by networking drinks.

**R in big data pipeline with**(Yuki Katoh)*luigi*

*R in big data pipeline: Put your awesome R codes into production. Learn how to build solid big data pipeline around it.*

**shinyjs**(Paul Viefers)

*Using JavaScript in shiny, without knowing JavaScript*

**Experience vs Data**(Markus Gesmann)

*How to asses risks with small data sets. Bayesian ideas, belief networks and a small simulation example in Stan/rstan.*

### Drinks and Networking

The event will be followed by drinks (Kölsch!) and networking opportunities.For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Kirill Pomogajko and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.

15 Sep 2015
07:32
Kölner R Users
,
KölnR
,
News
,
R

## Bayesian regression models using Stan in R

It seems the summer is coming to end in London, so I shall take a final look at my ice cream data that I have been playing around with to predict sales statistics based on temperature for the last couple of weeks [1], [2], [3].

Here I will use the new

Thanks to

Let's start. Here is the data again:

My models are written down in very much the same way as with

Last week I wrote the Stan model for the log-transformed linear model myself. Here is the output of

I access the underlying Stan model via

To review the model I start by plotting the trace and density plots for the MCMC samples.

Combining the outputs of all four models into one data frame gives me then the opportunity to compare the prediction credible intervals of the four models in one chart.

There are no big news in respect of the four models, but for the fact that here I can look at the posterior prediction credible intervals, rather then the theoretical distributions two weeks ago. The over-prediction of the log-transformed linear model is apparent again.

Well, if I set the probability of selling out at 2.5%, then I will have enough ice cream to sell with 97.5% certainty. To estimate those statistics I have to calculate the 97.5% percentile of the posterior predictive samples.

Ok, I have four models and four answers ranging from 761 to 2494. The highest number is more than 3 times the lowest number!

I had set the market size at 800 in my binomial model, so I am not surprised by its answer of 761. Also, I noted earlier that the log-normal distribution is skewed to the right, so that explains the high prediction of 2494. The Poisson model, like the log-transform linear model, has the implicit exponential growth assumption. Its mean forecast is well over 1000 as I pointed out earlier and hence the 97.5% prediction of 1510 is to be expected.

Another aspect to consider would be the cost of holding unsold stock. Ice cream can usually be stored for some time, but the situation would be quite different for freshly baked bread or bunches of flowers that have to be sold quickly.

Here I will use the new

`brms`

(GitHub, CRAN) package by Paul-Christian Bürkner to derive the 95% prediction credible interval for the four models I introduced in my first post about generalised linear models. Additionally, I am interested to predict how much ice cream I should hold in stock for a hot day at 35ºC, such that I only run out of ice cream with a probability of 2.5%.### Stan models with brms

Like in my previous post about the log-transformed linear model with Stan, I will use Bayesian regression models to estimate the 95% prediction credible interval from the posterior predictive distribution.Thanks to

`brms`

this will take less than a minute of coding, because `brm`

allows me to specify my models in the usual formula syntax and I can leave it to the package functions to create and execute the Stan files.Let's start. Here is the data again:

My models are written down in very much the same way as with

`glm`

. Only the binomial model requires a slightly different syntax. Here I use the default priors and link functions:Last week I wrote the Stan model for the log-transformed linear model myself. Here is the output of

`brm`

. The estimated parameters are quite similar, apart from \(\sigma\):I access the underlying Stan model via

`log.lin.mod$model`

and note that the prior of \(\sigma\) is modelled via a Cauchy distribution, unlike the inverse Gamma I used last week. I believe that explains my small difference in \(\sigma\).To review the model I start by plotting the trace and density plots for the MCMC samples.

### Prediction credible interval

The`predict`

function gives me access to the posterior predictive statistics, including the 95% prediction credible interval. Combining the outputs of all four models into one data frame gives me then the opportunity to compare the prediction credible intervals of the four models in one chart.

There are no big news in respect of the four models, but for the fact that here I can look at the posterior prediction credible intervals, rather then the theoretical distributions two weeks ago. The over-prediction of the log-transformed linear model is apparent again.

### How much stock should I hold on a hot day?

Running out of stock on a hot summer's day would be unfortunate, because those are days when sells will be highest. But how much stock should I hold?Well, if I set the probability of selling out at 2.5%, then I will have enough ice cream to sell with 97.5% certainty. To estimate those statistics I have to calculate the 97.5% percentile of the posterior predictive samples.

Ok, I have four models and four answers ranging from 761 to 2494. The highest number is more than 3 times the lowest number!

I had set the market size at 800 in my binomial model, so I am not surprised by its answer of 761. Also, I noted earlier that the log-normal distribution is skewed to the right, so that explains the high prediction of 2494. The Poisson model, like the log-transform linear model, has the implicit exponential growth assumption. Its mean forecast is well over 1000 as I pointed out earlier and hence the 97.5% prediction of 1510 is to be expected.

### Conclusions

How much ice cream should I hold in stock? Well, if I believe strongly in my assumption of a market size of 800, then I should stick with the output of the binomial model, or perhaps hold 800 just in case.Another aspect to consider would be the cost of holding unsold stock. Ice cream can usually be stored for some time, but the situation would be quite different for freshly baked bread or bunches of flowers that have to be sold quickly.

### Session Info

Note, I used the developer version of`brms`

from GitHub.```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] lattice_0.20-33 brms_0.4.1.9000 ggplot2_1.0.1
[4] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0
loaded via a namespace (and not attached):
[1] codetools_0.2-14 digest_0.6.8 MASS_7.3-43
[4] grid_3.2.2 plyr_1.8.3 gtable_0.1.2
[7] stats4_3.2.2 magrittr_1.5 scales_0.2.5
[10] stringi_0.5-5 reshape2_1.4.1 proto_0.3-10
[13] tools_3.2.2 stringr_1.0.0 munsell_0.4.2
[16] parallel_3.2.2 colorspace_1.2-6
```

## Visualising the predictive distribution of a log-transformed linear model

Last week I presented visualisations of theoretical distributions that predict ice cream sales statistics based on linear and generalised linear models, which I introduced in an earlier post.

Today I will take a closer look at the log-transformed linear model and use Stan/rstan, not only to model the sales statistics, but also to generate samples from the posterior predictive distribution.

The posterior predictive distribution is what I am most interested in. From the simulations I can get the 95% prediction interval, which will be slightly wider than the theoretical 95% interval, as it takes into account the parameter uncertainty as well.

Ok, first I take my log-transformed linear model of my earlier post and turn it into a Stan model, including a section to generate output from the posterior predictive distribution.

After I have complied and run the model, I can extract the simulations and calculate various summary statistics. Furthermore, I use my parameters also to predict the median and mean, so that I can compare them against the sample statistics. Note again, that for the mean calculation of the log-normal distribution I have to take into account the variance as well.

Ok, that looks pretty reasonable, and also quite similar to my earlier output with

Just as expected, I note a slightly wider 95% interval range in the posterior predictive distributions compared to the theoretical distributions at the top.

Theoretical distributions |

The posterior predictive distribution is what I am most interested in. From the simulations I can get the 95% prediction interval, which will be slightly wider than the theoretical 95% interval, as it takes into account the parameter uncertainty as well.

Ok, first I take my log-transformed linear model of my earlier post and turn it into a Stan model, including a section to generate output from the posterior predictive distribution.

After I have complied and run the model, I can extract the simulations and calculate various summary statistics. Furthermore, I use my parameters also to predict the median and mean, so that I can compare them against the sample statistics. Note again, that for the mean calculation of the log-normal distribution I have to take into account the variance as well.

Ok, that looks pretty reasonable, and also quite similar to my earlier output with

`glm`

. Using my plotting function of last week I can also create a nice 3D plot again.Posterior predictive distributions |

Just as expected, I note a slightly wider 95% interval range in the posterior predictive distributions compared to the theoretical distributions at the top.

### Session Info

```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0
loaded via a namespace (and not attached):
[1] tools_3.2.2 codetools_0.2-14 stats4_3.2.2
```

## Visualising theoretical distributions of GLMs

Two weeks ago I discussed various linear and generalised linear models in R using ice cream sales statistics. The data showed not surprisingly that more ice cream was sold at higher temperatures.

I came to the conclusion that I preferred the binomial model for two reasons: the model simulates only whole numbers, just like the observational data and it has natural boundaries defined by zero sales and a market saturation level. Hence, it will neither predict negative sales, nor infinite sales.

However, I didn't review any residual plots or anything else but the mean predictions. So, let's do this in this post.

Arthur Charpentier reminded me via Twitter of an earlier post of his, where he looked at GLMs as well. Arthur used a nice 3D plot to visualise the models. I took his code and wrapped it into a function (it is not pretty, but it will do for now):

The 3D plot shows the observations as circles on the xy-plane, together with the mean predictions as a solid line. In yellow I show the theoretical residual interval between the 5th and 95th quantile of the Normal distributions parametrised by the model output. Additionally I present the density distributions of the model at each observation.

This is conceptually similar to a Q-Q, which compares the standardised residuals with the theoretical quantiles. In the case of a linear model the prediction interval would shrink to the theoretical interval as the number of data points increases.

The 3D plot looks pretty, but I think it is helpful as well. Let me put this in context to the traditional residual plots, see below.

The residual plots identify observations 2, 5 and 10 as extreme points. Indeed, those points are either on the edge or outside (point 10) of the yellow area in the plot above.

Again, the residual plots highlight points 2, 5 and 10 as extremes. I note also from the first residual plot that the log-transformed model over-predicts for colder temperatures.

The same points 2, 5 and 10 are highlighted again as extreme, but now they are well outside the yellow area.

The theoretical range between the 5th and 95th quantile has narrowed further. Data item 10 looks now really suspicious. Observations 2 and 5 are once again highlighted. Now I wonder, is someone eating my ice cream?

```
icecream <- data.frame(
temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1,
18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
units=c(185L, 215L, 332L, 325L, 408L, 421L,
406L, 412L, 522L, 445L, 544L, 614L)
)
```

I used a linear model, a log-transformed linear model, a Poisson and Binomial generalised linear model to predict sales within and outside the range of data available.I came to the conclusion that I preferred the binomial model for two reasons: the model simulates only whole numbers, just like the observational data and it has natural boundaries defined by zero sales and a market saturation level. Hence, it will neither predict negative sales, nor infinite sales.

However, I didn't review any residual plots or anything else but the mean predictions. So, let's do this in this post.

Arthur Charpentier reminded me via Twitter of an earlier post of his, where he looked at GLMs as well. Arthur used a nice 3D plot to visualise the models. I took his code and wrapped it into a function (it is not pretty, but it will do for now):

```
glmModelPlot <- function(x, y, xlim,ylim, meanPred, LwPred, UpPred,
plotData, main=NULL){
## Based on code by Arthur Charpentier:
## http://freakonometrics.hypotheses.org/9593
par(mfrow=c(1,1))
n <- 2
N <- length(meanPred)
zMax <- max(unlist(sapply(plotData, "[[", "z")))*1.5
mat <- persp(xlim, ylim, matrix(0, n, n), main=main,
zlim=c(0, zMax), theta=-30,
ticktype="detailed",box=FALSE)
C <- trans3d(x, UpPred, rep(0, N),mat)
lines(C, lty=2)
C <- trans3d(x, LwPred, rep(0, N), mat)
lines(C, lty=2)
C <- trans3d(c(x, rev(x)), c(UpPred, rev(LwPred)),
rep(0, 2*N), mat)
polygon(C, border=NA, col=adjustcolor("yellow", alpha.f = 0.5))
C <- trans3d(x, meanPred, rep(0, N), mat)
lines(C, lwd=2, col="grey")
C <- trans3d(x, y, rep(0,N), mat)
points(C, lwd=2, col="#00526D")
for(j in N:1){
xp <- plotData[[j]]$x
yp <- plotData[[j]]$y
z0 <- plotData[[j]]$z0
zp <- plotData[[j]]$z
C <- trans3d(c(xp, xp), c(yp, rev(yp)), c(zp, z0), mat)
polygon(C, border=NA, col="light blue", density=40)
C <- trans3d(xp, yp, z0, mat)
lines(C, lty=2)
C <- trans3d(xp, yp, zp, mat)
lines(C, col=adjustcolor("blue", alpha.f = 0.5))
}
}
```

### Linear model

I start by plotting the output of the linear model, a GLM assuming a Normal distribution with an identity link function.The 3D plot shows the observations as circles on the xy-plane, together with the mean predictions as a solid line. In yellow I show the theoretical residual interval between the 5th and 95th quantile of the Normal distributions parametrised by the model output. Additionally I present the density distributions of the model at each observation.

This is conceptually similar to a Q-Q, which compares the standardised residuals with the theoretical quantiles. In the case of a linear model the prediction interval would shrink to the theoretical interval as the number of data points increases.

The 3D plot looks pretty, but I think it is helpful as well. Let me put this in context to the traditional residual plots, see below.

The residual plots identify observations 2, 5 and 10 as extreme points. Indeed, those points are either on the edge or outside (point 10) of the yellow area in the plot above.

```
xlim <- c(min(icecream$temp)*0.95, max(icecream$temp)*1.05)
ylim <- c(floor(min(icecream$units)*0.95),
ceiling(max(icecream$units)*1.05))
lin.mod <- glm(units ~ temp, data=icecream,
family=gaussian(link="identity"))
par(mfrow=c(2,2))
plot(lin.mod)
title(outer=TRUE, line = -1,
main = list("Linear regression",
cex=1.25,col="black", font=2))
meanPred <- predict(lin.mod, type="response")
sdgig <- sqrt(summary(lin.mod)$dispersion)
UpPred <- qnorm(.95, meanPred, sdgig)
LwPred <- qnorm(.05, meanPred, sdgig)
plotData <- lapply(
seq(along=icecream$temp),
function(i){
stp <- 251
x = rep(icecream$temp[i], stp)
y = seq(ylim[1], ylim[2], length=stp)
z0 = rep(0, stp)
z = dnorm(y, meanPred[i], sdgig)
return(list(x=x, y=y, z0=z0, z=z))
}
)
glmModelPlot(x = icecream$temp, y=icecream$units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotData,
main = "Linear regression")
```

### Log-transformed linear model

The log-transformed model shows an increasing variance around the mean as the temperatures increase. Yet, this shouldn't come as a surprise, as this a property of the log-normal distribution. The variance is given as \(\text{Var}(X) = \exp(2\mu + \sigma^2)(\exp(\sigma^2) - 1) = \text{E}[X]^2 (\exp(\sigma^2) - 1)\) .Again, the residual plots highlight points 2, 5 and 10 as extremes. I note also from the first residual plot that the log-transformed model over-predicts for colder temperatures.

```
log.lin.mod <- glm(log(units) ~ temp, data=icecream,
family=gaussian(link="identity"))
par(mfrow=c(2,2))
plot(log.lin.mod)
title(outer=TRUE, line = -1,
main = list("Log-transformed LM",
cex=1.25,col="black", font=2))
meanLogPred <- predict(log.lin.mod, type="response")
sdgig <- sqrt(summary(log.lin.mod)$dispersion)
meanPred <- exp(meanLogPred + 0.5 * sdgig^2)
UpPred <- qlnorm(.95, meanlog = meanLogPred, sdlog = sdgig)
LwPred <- qlnorm(.05, meanlog = meanLogPred, sdlog = sdgig)
plotData <- lapply(
seq(along=icecream$temp),
function(i){
stp <- 251
x = rep(icecream$temp[i], stp)
y = seq(ylim[1], ylim[2], length=stp)
z0 = rep(0, stp)
z = dlnorm(y, meanlog = meanLogPred[i], sdlog = sdgig)
return(list(x=x, y=y, z0=z0, z=z))
}
)
glmModelPlot(x = icecream$temp, y=icecream$units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotData,
main = "Log-transformed LM")
```

### Poisson (log) GLM

The Poisson model shows a narrower range between the 5th and 95th quantile then the previous models. My Poisson model didn't assume over-dispersion, and hence the yellow range increases in line with the mean.The same points 2, 5 and 10 are highlighted again as extreme, but now they are well outside the yellow area.

```
pois.mod <- glm(units ~ temp, data=icecream,
family=poisson(link="log"))
par(mfrow=c(2,2))
plot(pois.mod)
title(outer=TRUE, line = -1,
main = list("Poisson (log) GLM",
cex=1.25,col="black", font=2))
meanPred <- predict(pois.mod, type="response")
UpPred <- qpois(.95, meanPred)
LwPred <- qpois(.05, meanPred)
plotData <- lapply(
seq(along=icecream$temp),
function(i){
y = seq(ylim[1], ylim[2])
x = rep(icecream$temp[i], length(y))
z0 = rep(0, length(y))
z = dpois(y, meanPred[i])
return(list(x=x, y=y, z0=z0, z=z))
}
)
glmModelPlot(x = icecream$temp, y=icecream$units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotData,
main = "Poisson (log) GLM")
```

### Binomial (logit) GLM

Here is my favourite model from my previous post again.The theoretical range between the 5th and 95th quantile has narrowed further. Data item 10 looks now really suspicious. Observations 2 and 5 are once again highlighted. Now I wonder, is someone eating my ice cream?

```
market.size <- 800
icecream$opportunity <- market.size - icecream$units
bin.glm <- glm(cbind(units, opportunity) ~ temp, data=icecream,
family=binomial(link = "logit"))
par(mfrow=c(2,2))
plot(bin.glm)
title(outer=TRUE, line = -1,
main = list("Binomial (logit) GLM",
cex=1.25,col="black", font=2))
meanProb <- predict(bin.glm, type="response")
meanPred <- meanProb*market.size
UpPred <- qbinom(.95, market.size, meanProb)
LwPred <- qbinom(.05, market.size, meanProb)
plotData <- lapply(
seq(along=icecream$temp),
function(i){
y = ylim[1]:ylim[2]
x = rep(icecream$temp[i], length(y))
z0 = rep(0, length(y))
z = dbinom(y, market.size, meanProb[i])
return(list(x=x, y=y, z0=z0, z=z))
}
)
glmModelPlot(x = icecream$temp, y=icecream$units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotData,
main = "Binomial (logit) GLM")
```

### Conclusions

I think Arthur's 3D plots help to visualise what GLMs are conceptually about. They illustrate the theoretical distribution around the predictions. Let me say this again, the yellow areas in the charts above show not the 90% prediction interval, but the theoretical residual distribution if the models where true.### Session Info

```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.2.2 Rcpp_0.12.0
```

18 Aug 2015
07:06
glm
,
graphics
,
linear model
,
plot
,
R
,
regression

Subscribe to:
Posts
(
Atom
)