## Generalised Linear Models in R

Linear models are the bread and butter of statistics, but there is a lot more to it than taking a ruler and drawing a line through a couple of points.

Some time ago Rasmus Bååth published an insightful blog article about how such models could be described from a distribution centric point of view, instead of the classic error terms convention.

I think the distribution centric view makes generalised linear models (GLM) much easier to understand as well. That’s the purpose of this post.

Using data on ice cream sales statistics I will set out to illustrate different models, starting with traditional linear least square regression, moving on to a linear model, a log-transformed linear model and then on to generalised linear models, namely a Poisson (log) GLM and Binomial (logistic) GLM. Additionally, I will run a simulation with each model.

Along the way I aim to reveal the intuition behind a GLM using Ramus’ distribution centric description. I hope to clarify why one would want to use different distributions and link functions.

I am particularly interested in how my models will behave in the more extreme cases when it is freezing outside, say when the temperature drops to 0ºC and also what it would say for a very hot summer's day at 35ºC.

Let's start with a probability distribution centric description of the data.

I believe the observation \(y_i\) was drawn from a Normal (aka Gaussian) distribution with a mean \(\mu_i\), depending on the temperature \(x_i\) and a constant variance \(\sigma^2\) across all temperatures.

On another day with the same temperature I might have sold a different quantity of ice cream, but over many days with the same temperature the number of ice creams sold would start to fall into a range defined by \(\mu_i\pm\sigma\).

Thus, using a distribution centric notation my model looks like this:

\[

y_i \sim \mathcal{N}(\mu_i, \sigma^2), \\

\mathbb{E}[y_i]=\mu_i = \alpha + \beta x_i \; \text{for all}\; i

\]Note, going forward I will drop the \(\text{for all}\; i\) statement for brevity.

Or, alternatively I might argue that the residuals, the difference between the observations and predicted values, follow a Gaussian distribution with a mean of 0 and variance \(\sigma^2\):

\[

\varepsilon_i = y_i - \mu_i \sim \mathcal{N}(0, \sigma^2)

\]Furthermore, the equation

\[

\mathbb{E}[y_i]=\mu_i = \alpha + \beta x_i

\]states my belief that the expected value of \(y_i\) is identical to the parameter \(\mu_i\) of the underlying distribution, while the variance is constant.

The same model written in the classic error terms convention would be:

\[

y_i = \alpha + \beta x_i + \varepsilon_i, \text{ with }

\varepsilon_i \sim \mathcal{N}(0, \sigma^2)

\]I think the probability distribution centric convention makes it clearer that my observation is just one realisation from a distribution. It also emphasises that the parameter of the distribution is modelled linearly.

To model this in R explicitly I use the

That's what its says on the GLM tin and that's all there is to it!

(Actually, there is also the bit, which estimates the parameters from the data via maximum likelihood, but I will skip this here.)

\[

y_i \sim \mathcal{N}(\mu_i, \sigma^2) \text{ with }

\mu_i = -159.5 + 30.1 \; x_i \text{ and }

\sigma = 38.1

\]Although the linear model looks fine in the range of temperatures observed, it doesn't make much sense at 0ºC. The intercept is at -159, which would mean that customers return on average 159 units of ice cream on a freezing day. Well, I don't think so.

So, let's model the ice cream sales on a logarithmic scale. Thus my model changes to:

\[

\log(y_i) \sim \mathcal{N}(\mu_i, \sigma^2)\\

\mathbb{E}[\log(y_i)]=\mu_i = \alpha + \beta x_i

\]This model implies that I believe the sales follow a log-normal distribution:\[y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)\]Note, the log-normal distribution is skewed to the right, which means that I regard higher sales figures more likely than lower sales figures.

Although the model is still linear on a log-scale, I have to remember to transform the predictions back to the original scale because \(\mathbb{E}[\log(y_i)] \neq \log(\mathbb{E}[y_i])\). This is shown below. For a discussion of the transformation of the lognormal distribution see for example the help page of the R function

\[

y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)\\

\mathbb{E}[y_i] = \exp(\mu_i + \sigma^2/2)

\]

This plot looks a little better than the previous linear model and it predicts that I would sell, on average, 82 ice creams when the temperature is 0ºC:

The Poisson distribution has only one parameter, here \(\mu_i\), which is also its expected value. The canonical link function for \(\mu_i\) is the logarithm, i.e. the logarithm of the expected value is regarded a linear function of the predictors. This means I have to apply the exponential function to the linear model to get back to the original scale. This is distinctively different to the log-transformed linear model above, where the original data was transformed, not the expected value of the data. The log function here is the link function, as it links the transformed expected value to the linear model.

Here is my model:

\[

y_i \sim \text{Poisson}(\mu_i)\\

\mathbb{E}[y_i]=\mu_i=\exp(\alpha + \beta x_i) = \exp(\alpha) \exp(\beta x_i)\\

\log(\mu_i) = \alpha + \beta x_i\]Let me say this again, although the expected value of my observation is a real number, the Poisson distribution will generate only whole numbers, in line with the actual sales.

This looks pretty good. The interpretation of the coefficients should be clear now. I have to use the

From the coefficients I can read off that a 0ºC I expect to sell \(\exp(4.45)=94\) ice creams and that with each one degree increase in temperature the sales are predicted to increase by \(\exp(0.076) - 1 = 7.9\%\).

Note, the exponential function turns the additive scale into a multiplicative one.

So far, so good. My model is in line with my observations. Additionally, it will not predict negative sales and if I would simulate from a Poisson distribution with a mean given by the above model I will always only get whole numbers back.

However, my model will also predict that I should expect to sell over 1000 ice creams if the temperature reaches 32ºC:

This suggests a binomial distribution for the number of successful sales out of 800. The key parameter for the binomial distribution is the probability of success, the probability that someone will buy ice cream as a function of temperature.

Dividing my sales statistics by 800 would give me a first proxy for the probability of selling all ice cream.

Therefore I need an S-shape curve that maps the sales statistics into probabilities between 0 and 100%.

A canonical choice is the logistic function:

\[

\text{logit}(u) = \frac{e^u}{e^u + 1} = \frac{1}{1 + e^{-u}}

\]

With that my model can be described as:

\[

y_i \sim \text{Binom}(n, \mu_i)\\

\mathbb{E}[y_i]=\mu_i=\text{logit}^{-1}(\alpha + \beta x_i)\\

\text{logit}(\mu_i) = \alpha + \beta x_i\]

The chart doesn't look too bad at all!

As the temperature increases higher and higher this model will predict that sales will reach market saturation, while all the other models so far would predict higher and higher sales.

I can predict sales at 0ºC and 35ºC using the inverse of the logistic function, which is given in R as

The chart shows the predictions of my four models over a temperature range from 0 to 35ºC. Although the linear model looks OK between 10 and perhaps 30ºC, it shows clearly its limitations. The log-transformed linear and Poisson models appear to give similar predictions, but will predict an ever accelerating increase in sales as temperature increases. I don't believe this makes sense as even the most ice cream loving person can only eat so much ice cream on a really hot day. And that's why I would go with the Binomial model to predict my ice cream sales.

In all my models the linear structure was

\[

g(\mu_i) = \alpha + \beta x_i\]or in matrix notation

\[

g(\mu) = A v\] with \(A_{i,\cdot} = [1, x_i]\) and \(v=[\alpha, \beta]\), whereby \(A\) is the model matrix, \(v\) the vector of coefficients and \(g\) the link function.

With that being said, let's simulate data from each distribution for the temperatures measured in the original data and compare against the actual sales units.

The chart displays one simulation per observation from each model, but I believe it shows already some interesting aspects. Not only do I see that the Poisson and Binomial model generate whole numbers, while the Normal and log-transformed Normal predict real numbers, I notice the skewness of the log-normal distribution in the red point at 19.4ºC.

Furthermore, the linear model predicts equally likely figures above and below the mean and at 16.4ºC the prediction appears a little low - perhaps as a result.

Additionally the high sales prediction at 25.1ºC of the log-transformed Normal and Poisson model shouldn't come as a surprise either.

Again, the simulation of the binomial model seems the most realistic to me.

There are many aspects of GLMs which I haven't touched on here, such as:

Some time ago Rasmus Bååth published an insightful blog article about how such models could be described from a distribution centric point of view, instead of the classic error terms convention.

I think the distribution centric view makes generalised linear models (GLM) much easier to understand as well. That’s the purpose of this post.

Using data on ice cream sales statistics I will set out to illustrate different models, starting with traditional linear least square regression, moving on to a linear model, a log-transformed linear model and then on to generalised linear models, namely a Poisson (log) GLM and Binomial (logistic) GLM. Additionally, I will run a simulation with each model.

Along the way I aim to reveal the intuition behind a GLM using Ramus’ distribution centric description. I hope to clarify why one would want to use different distributions and link functions.

### The data

Here is the example data set I will be using. It shows the number of ice creams sold at different temperatures.```
icecream <- data.frame(
temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1,
18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
units=c(185L, 215L, 332L, 325L, 408L, 421L,
406L, 412L, 522L, 445L, 544L, 614L)
)
```

I will be comparing the results from this model with others. Hence, I write a helper function to provide consistent graphical outputs:```
basicPlot <- function(...){
plot(units ~ temp, data=icecream, bty="n", lwd=2,
main="Number of ice creams sold", col="#00526D",
xlab="Temperature (Celsius)",
ylab="Units sold", ...)
axis(side = 1, col="grey")
axis(side = 2, col="grey")
}
basicPlot()
```

As expected more ice cream is sold at higher temperature.### The challenge

I would like to create a model that predicts the number of ice creams sold for different temperatures, even outside the range of the available data.I am particularly interested in how my models will behave in the more extreme cases when it is freezing outside, say when the temperature drops to 0ºC and also what it would say for a very hot summer's day at 35ºC.

### Linear least square

My first approach is to take a ruler and draw a straight line through the points, such that it minimises the average distance between the points and the line. That's basically a linear least square regression line:```
lsq.mod <- lsfit(icecream$temp, icecream$units)
basicPlot()
abline(lsq.mod, col="orange", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "linear least square"),
col=c("#00526D","orange"), pch=c(1,NA))
```

That's easy and looks not unreasonable. ### Linear regression

For the least square method I didn't think about distributions at all. It is just common sense, or is it?Let's start with a probability distribution centric description of the data.

I believe the observation \(y_i\) was drawn from a Normal (aka Gaussian) distribution with a mean \(\mu_i\), depending on the temperature \(x_i\) and a constant variance \(\sigma^2\) across all temperatures.

On another day with the same temperature I might have sold a different quantity of ice cream, but over many days with the same temperature the number of ice creams sold would start to fall into a range defined by \(\mu_i\pm\sigma\).

Thus, using a distribution centric notation my model looks like this:

\[

y_i \sim \mathcal{N}(\mu_i, \sigma^2), \\

\mathbb{E}[y_i]=\mu_i = \alpha + \beta x_i \; \text{for all}\; i

\]Note, going forward I will drop the \(\text{for all}\; i\) statement for brevity.

Or, alternatively I might argue that the residuals, the difference between the observations and predicted values, follow a Gaussian distribution with a mean of 0 and variance \(\sigma^2\):

\[

\varepsilon_i = y_i - \mu_i \sim \mathcal{N}(0, \sigma^2)

\]Furthermore, the equation

\[

\mathbb{E}[y_i]=\mu_i = \alpha + \beta x_i

\]states my belief that the expected value of \(y_i\) is identical to the parameter \(\mu_i\) of the underlying distribution, while the variance is constant.

The same model written in the classic error terms convention would be:

\[

y_i = \alpha + \beta x_i + \varepsilon_i, \text{ with }

\varepsilon_i \sim \mathcal{N}(0, \sigma^2)

\]I think the probability distribution centric convention makes it clearer that my observation is just one realisation from a distribution. It also emphasises that the parameter of the distribution is modelled linearly.

To model this in R explicitly I use the

`glm`

function, in which I specify the "response distribution" (namely the number of ice creams) as Gaussian and the link function from the expected value of the distribution to its parameter (i.e. temperature) as identity. Indeed, this really is the trick with a GLM, it describes how the distribution of the observations and the expected value, often after a smooth transformation, relate to the actual measurements (predictors) in a linear way. The 'link' is the inverse function of the original transformation of the data. That's what its says on the GLM tin and that's all there is to it!

(Actually, there is also the bit, which estimates the parameters from the data via maximum likelihood, but I will skip this here.)

```
lin.mod <- glm(units ~ temp, data=icecream,
family=gaussian(link="identity"))
library(arm) # for 'display' function only
display(lin.mod)
## glm(formula = units ~ temp, family = gaussian(link = "identity"),
## data = icecream)
## coef.est coef.se
## (Intercept) -159.47 54.64
## temp 30.09 2.87
## ---
## n = 12, k = 2
## residual deviance = 14536.3, null deviance = 174754.9 (difference = 160218.6)
## overdispersion parameter = 1453.6
## residual sd is sqrt(overdispersion) = 38.13
```

Thus, to mimic my data I could generate random numbers from the following Normal distribution:\[

y_i \sim \mathcal{N}(\mu_i, \sigma^2) \text{ with }

\mu_i = -159.5 + 30.1 \; x_i \text{ and }

\sigma = 38.1

\]Although the linear model looks fine in the range of temperatures observed, it doesn't make much sense at 0ºC. The intercept is at -159, which would mean that customers return on average 159 units of ice cream on a freezing day. Well, I don't think so.

### Log-transformed linear regression

Ok, perhaps I can transform the data first. Ideally I would like ensure that the transformed data has only positive values. The first transformation that comes to my mind in those cases is the logarithm.So, let's model the ice cream sales on a logarithmic scale. Thus my model changes to:

\[

\log(y_i) \sim \mathcal{N}(\mu_i, \sigma^2)\\

\mathbb{E}[\log(y_i)]=\mu_i = \alpha + \beta x_i

\]This model implies that I believe the sales follow a log-normal distribution:\[y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)\]Note, the log-normal distribution is skewed to the right, which means that I regard higher sales figures more likely than lower sales figures.

Although the model is still linear on a log-scale, I have to remember to transform the predictions back to the original scale because \(\mathbb{E}[\log(y_i)] \neq \log(\mathbb{E}[y_i])\). This is shown below. For a discussion of the transformation of the lognormal distribution see for example the help page of the R function

`rlnorm`

.\[

y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)\\

\mathbb{E}[y_i] = \exp(\mu_i + \sigma^2/2)

\]

```
log.lin.mod <- glm(log(units) ~ temp, data=icecream,
family=gaussian(link="identity"))
display(log.lin.mod)
## glm(formula = log(units) ~ temp, family = gaussian(link = "identity"),
## data = icecream)
## coef.est coef.se
## (Intercept) 4.40 0.20
## temp 0.08 0.01
## ---
## n = 12, k = 2
## residual deviance = 0.2, null deviance = 1.4 (difference = 1.2)
## overdispersion parameter = 0.0
## residual sd is sqrt(overdispersion) = 0.14
log.lin.sig <- summary(log.lin.mod)$dispersion
log.lin.pred <- exp(predict(log.lin.mod) + log.lin.sig)
basicPlot()
lines(icecream$temp, log.lin.pred, col="red", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "log-transformed LM"),
col=c("#00526D","red"), pch=c(1,NA))
```

This plot looks a little better than the previous linear model and it predicts that I would sell, on average, 82 ice creams when the temperature is 0ºC:

```
exp(coef(log.lin.mod)[1])
## (Intercept)
## 81.62131
```

Although this model makes a little more sense, it appears that is predicts too many sales at the low and high-end of the observed temperature range. Furthermore, there is another problem with this model and the previous linear model as well. The assumed model distributions generate real numbers, but ice cream sales only occur in whole numbers. As a result, any draw from the model distribution should be a whole number.### Poisson regression

The classic approach for count data is the Poisson distribution.The Poisson distribution has only one parameter, here \(\mu_i\), which is also its expected value. The canonical link function for \(\mu_i\) is the logarithm, i.e. the logarithm of the expected value is regarded a linear function of the predictors. This means I have to apply the exponential function to the linear model to get back to the original scale. This is distinctively different to the log-transformed linear model above, where the original data was transformed, not the expected value of the data. The log function here is the link function, as it links the transformed expected value to the linear model.

Here is my model:

\[

y_i \sim \text{Poisson}(\mu_i)\\

\mathbb{E}[y_i]=\mu_i=\exp(\alpha + \beta x_i) = \exp(\alpha) \exp(\beta x_i)\\

\log(\mu_i) = \alpha + \beta x_i\]Let me say this again, although the expected value of my observation is a real number, the Poisson distribution will generate only whole numbers, in line with the actual sales.

```
pois.mod <- glm(units ~ temp, data=icecream,
family=poisson(link="log"))
display(pois.mod)
## glm(formula = units ~ temp, family = poisson(link = "log"), data = icecream)
## coef.est coef.se
## (Intercept) 4.54 0.08
## temp 0.08 0.00
## ---
## n = 12, k = 2
## residual deviance = 60.0, null deviance = 460.1 (difference = 400.1)
pois.pred <- predict(pois.mod, type="response")
basicPlot()
lines(icecream$temp, pois.pred, col="blue", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "Poisson (log) GLM"),
col=c("#00526D","blue"), pch=c(1,NA))
```

This looks pretty good. The interpretation of the coefficients should be clear now. I have to use the

`exp`

function to make a prediction of sales for a given temperature. However, R will do this for me automatically, if I set in the `predict`

statement above `type="response"`

.From the coefficients I can read off that a 0ºC I expect to sell \(\exp(4.45)=94\) ice creams and that with each one degree increase in temperature the sales are predicted to increase by \(\exp(0.076) - 1 = 7.9\%\).

Note, the exponential function turns the additive scale into a multiplicative one.

So far, so good. My model is in line with my observations. Additionally, it will not predict negative sales and if I would simulate from a Poisson distribution with a mean given by the above model I will always only get whole numbers back.

However, my model will also predict that I should expect to sell over 1000 ice creams if the temperature reaches 32ºC:

```
predict(pois.mod, newdata=data.frame(temp=32), type="response")
## 1
## 1056.651
```

Perhaps the exponential growth in my model looks a little too good to be true. Indeed, I am pretty sure that my market will be saturated at around 800. **this is a modelling assumption from my side!***Warning:*### Binomial regression

Ok, let's me think about the problem this way: If I have 800 potential sales then I'd like to understand the proportion sold at a given temperature.This suggests a binomial distribution for the number of successful sales out of 800. The key parameter for the binomial distribution is the probability of success, the probability that someone will buy ice cream as a function of temperature.

Dividing my sales statistics by 800 would give me a first proxy for the probability of selling all ice cream.

Therefore I need an S-shape curve that maps the sales statistics into probabilities between 0 and 100%.

A canonical choice is the logistic function:

\[

\text{logit}(u) = \frac{e^u}{e^u + 1} = \frac{1}{1 + e^{-u}}

\]

With that my model can be described as:

\[

y_i \sim \text{Binom}(n, \mu_i)\\

\mathbb{E}[y_i]=\mu_i=\text{logit}^{-1}(\alpha + \beta x_i)\\

\text{logit}(\mu_i) = \alpha + \beta x_i\]

```
market.size <- 800
icecream$opportunity <- market.size - icecream$units
bin.glm <- glm(cbind(units, opportunity) ~ temp, data=icecream,
family=binomial(link = "logit"))
display(bin.glm)
## glm(formula = cbind(units, opportunity) ~ temp, family = binomial(link = "logit"),
## data = icecream)
## coef.est coef.se
## (Intercept) -2.97 0.11
## temp 0.16 0.01
## ---
## n = 12, k = 2
## residual deviance = 84.4, null deviance = 909.4 (difference = 825.0)
bin.pred <- predict(bin.glm, type="response")*market.size
basicPlot()
lines(icecream$temp, bin.pred, col="purple", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "Binomial (logit) GLM"),
col=c("#00526D","purple"), pch=c(1,NA))
```

The chart doesn't look too bad at all!

As the temperature increases higher and higher this model will predict that sales will reach market saturation, while all the other models so far would predict higher and higher sales.

I can predict sales at 0ºC and 35ºC using the inverse of the logistic function, which is given in R as

`plogis`

:```
# Sales at 0 Celsius
plogis(coef(bin.glm)[1])*market.size
## (Intercept)
## 39.09618
# Sales at 35 Celsius
plogis(coef(bin.glm)[1] + coef(bin.glm)[2]*35)*market.size
## (Intercept)
## 745.7449
```

So, that is 39 ice creams at 0ºC and 746 ice creams at 35ºC. Yet, these results will of course change if I change my assumptions on the market size. A market size of 1,000 would suggest that I can sell 55 units at 0ºC and 846 at 35ºC. ### Summary

Let's bring all the models together into one graph, with temperatures ranging from 0 to 35ºC.```
temp <- 0:35
p.lm <- predict(lin.mod, data.frame(temp=temp), type="response")
p.log.lm <- exp(predict(log.lin.mod, data.frame(temp=0:35), type="response") +
0.5 * summary(log.lin.mod)$dispersion)
p.pois <- predict(pois.mod, data.frame(temp=temp), type="response")
p.bin <- predict(bin.glm, data.frame(temp=temp), type="response")*market.size
basicPlot(xlim=range(temp), ylim=c(-20,market.size))
lines(temp, p.lm, type="l", col="orange", lwd=2)
lines(temp, p.log.lm, type="l", col="red", lwd=2)
lines(temp, p.pois, type="l", col="blue", lwd=2)
lines(temp, p.bin, type="l", col="purple", lwd=2)
legend(x="topleft",
legend=c("observation",
"linear model",
"log-transformed LM",
"Poisson (log) GLM",
"Binomial (logit) GLM"),
col=c("#00526D","orange", "red",
"blue", "purple"),
bty="n", lwd=rep(2,5),
lty=c(NA,rep(1,4)),
pch=c(1,rep(NA,4)))
```

The chart shows the predictions of my four models over a temperature range from 0 to 35ºC. Although the linear model looks OK between 10 and perhaps 30ºC, it shows clearly its limitations. The log-transformed linear and Poisson models appear to give similar predictions, but will predict an ever accelerating increase in sales as temperature increases. I don't believe this makes sense as even the most ice cream loving person can only eat so much ice cream on a really hot day. And that's why I would go with the Binomial model to predict my ice cream sales.

### Simulations

Having used the distribution centric view to describe my models leads naturally to simulations. If the models are good, then I shouldn't be able to identify the simulation from the real data.In all my models the linear structure was

\[

g(\mu_i) = \alpha + \beta x_i\]or in matrix notation

\[

g(\mu) = A v\] with \(A_{i,\cdot} = [1, x_i]\) and \(v=[\alpha, \beta]\), whereby \(A\) is the model matrix, \(v\) the vector of coefficients and \(g\) the link function.

With that being said, let's simulate data from each distribution for the temperatures measured in the original data and compare against the actual sales units.

```
n <- nrow(icecream)
A <- model.matrix(units ~ temp, data=icecream)
set.seed(1234)
(rand.normal <- rnorm(n,
mean=A %*% coef(lin.mod),
sd=sqrt(summary(lin.mod)$dispersion)))
## [1] 152.5502 278.3509 339.2073 244.5335 374.3981 404.4103 375.2385
## [8] 403.3892 483.9470 486.5775 526.3881 557.6662
(rand.logtrans <- rlnorm(n,
meanlog=A %*% coef(log.lin.mod),
sdlog=sqrt(summary(log.lin.mod)$dispersion)))
## [1] 196.0646 266.1002 326.8102 311.5621 315.0249 321.1920 335.3733
## [8] 564.6961 515.9348 493.4822 530.8356 691.2354
(rand.pois <- rpois(n,
lambda=exp(A %*% coef(pois.mod))))
## [1] 220 266 279 337 364 434 349 432 517 520 516 663
(rand.bin <- rbinom(n,
size=market.size,
prob=plogis(A %*% coef(bin.glm))))
## [1] 197 249 302 313 337 386 387 414 516 538 541 594
basicPlot(ylim=c(100,700))
cols <- adjustcolor(c("orange", "red", "blue", "purple"),
alpha.f = 0.75)
points(icecream$temp, rand.normal, pch=19, col=cols[1])
points(icecream$temp, rand.logtrans, pch=19, col=cols[2])
points(icecream$temp, rand.pois, pch=19, col=cols[3])
points(icecream$temp, rand.bin, pch=19, col=cols[4])
legend(x="topleft",
legend=c("observation",
"linear model",
"log-transformed LM",
"Poisson (log) GLM",
"Binomial (logit) GLM"),
col=c("#00526D",cols), lty=NA,
bty="n", lwd=rep(2,5),
pch=c(1,rep(19,4)))
```

The chart displays one simulation per observation from each model, but I believe it shows already some interesting aspects. Not only do I see that the Poisson and Binomial model generate whole numbers, while the Normal and log-transformed Normal predict real numbers, I notice the skewness of the log-normal distribution in the red point at 19.4ºC.

Furthermore, the linear model predicts equally likely figures above and below the mean and at 16.4ºC the prediction appears a little low - perhaps as a result.

Additionally the high sales prediction at 25.1ºC of the log-transformed Normal and Poisson model shouldn't come as a surprise either.

Again, the simulation of the binomial model seems the most realistic to me.

### Conclusions

I hope this little article illustrated the intuition behind generalised linear models. Fitting a model to data requires more than just applying an algorithm. In particular it is worth to think about:- the range of expected values: are they bounded or range from \(-\infty\) to \(\infty\)?
- the type of observations: do I expect real numbers, whole numbers or proportions?
- how to link the distribution parameters to the observations

There are many aspects of GLMs which I haven't touched on here, such as:

- all the above models incorporate a fixed level of volatility. However, in practice, the variability of making a sale at low temperatures might be significantly different than at high temperatures. Check the residual plots and consider an over-dispersed model.
- I used so called canonical link functions in my models, which have nice theoretical properties, but other choices are possible as well.
- the distribution has to be chosen from the exponential family, e.g. Normal, Gamma, Poisson, binomial, Tweedie, etc.
- GLMs use maximum likelihood as the criteria for fitting the models. I sketched the difference between least square and maximum likelihood in an earlier post.

### R code

You find the code of this post also on GitHub.### Session Info

```
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arm_1.8-6 lme4_1.1-8 Matrix_1.2-1 MASS_7.3-42
loaded via a namespace (and not attached):
[1] minqa_1.2.4 tools_3.2.1 coda_0.17-1 abind_1.4-3
[5] Rcpp_0.11.6 splines_3.2.1 nlme_3.1-121 grid_3.2.1
[9] nloptr_1.0.4 lattice_0.20-31
```

4 Aug 2015
07:44
glm
,
linear model
,
R
,
regression

## Notes from the 3rd R in Insurance Conference

Photo: Arthur Charpentier |

Milliman, RStudio, CYBAEA, Deloitte, a.s.r., Triple A Risk Finance, AEGON, Delta Lloyd Amsterdam, QBE Re and APPLIED AI

This one-day conference focused once more on applications in insurance and actuarial science that use R. Topics covered included reserving, pricing, loss modelling, the use of R in a production environment and more.

Next summer we are back in London at Cass Business School.

The slides are now available from the links in the agenda below.

28 Jul 2015
07:52
Conference
,
Insurance
,
R
,
R in Insurance

## MacBook Air battery replacement

After four years of daily use our MacBook Air informed us that it needed a battery replacement. That's kind of nice to know, in particular as it still feels speedy and otherwise just works.

A new battery isn't that expensive and according to iFixit it appeared to be quite easy to replace it. I needn't to worry, it was actually super simple, given appropriate tools:

Here are a few pictures of the

A new battery isn't that expensive and according to iFixit it appeared to be quite easy to replace it. I needn't to worry, it was actually super simple, given appropriate tools:

- Remove 10 screws from bottom case
- Open case
- Disconnect battery
- Remove 5 screws from battery
- Swap battery
- Reassemble everything back together
- Job done.

Here are a few pictures of the

*surgery*:Lower case with screws removed |

Old battery pack |

New battery pack |

21 Jul 2015
07:55
Battery
,
MacBook Air
,
Soapbox

## ChainLadder 0.2.1 released

Over the weekend we released version 0.2.1 of the ChainLadder package for claims reserving on CRAN.

Binary versions of the package will appear on the various CRAN mirrors over the next couple of days. Alternatively you can install ChainLadder directly from GitHub using the following R commands:

Completely new to ChainLadder? Start with the package vignette.

### New Features

- New function
`PaidIncurredChain`

by Fabio Concina, based on the 2010 Merz & Wüthrich paper Paid-incurred chain claims reserving method - Functions
`plot.MackChainLadder`

and`plot.BootChainLadder`

gained new argument`which`

, allowing users to specify which sub-plot to display. Thanks to Christophe Dutang for this suggestion.

Output of `plot(MackChainLadder(MW2014, est.sigma="Mack"), which=3:6)` |

### Changes

- Updated
`NAMESPACE`

file to comply with new R CMD checks in R-3.3.0 - Removed package dependencies on
`grDevices`

and`Hmisc`

- Expanded package vignette with new paragraph on importing spreadsheet data, a new section "Paid-Incurred Chain Model" and an added example for a full claims development picture in the "One Year Claims Development Result" section, see also [1] .

Binary versions of the package will appear on the various CRAN mirrors over the next couple of days. Alternatively you can install ChainLadder directly from GitHub using the following R commands:

```
install.packages(c(“systemfit”, “actuar", "statmod", "tweedie", "devtools"))
library(devtools)
install_github("mages/ChainLadder")
library(ChainLadder)
```

Completely new to ChainLadder? Start with the package vignette.

### References

[1] Claims run-off uncertainty: the full picture. (with M. Merz) SSRN Manuscript, ID 2524352, 2014.
14 Jul 2015
07:05
ChainLadder
,
Insurance
,
R
,
reserving

## Adding mathematical notations to R plots

I have to admit that I find the

Apparently I am not the only one, but Stefano Meschiari did actually something about it. A few days ago his package

The package provides the wonderful function

Below is the first example from the

You find more information about

`plotmath`

expressions in R a little fiddly to annotate plots with mathematical notation. Apparently I am not the only one, but Stefano Meschiari did actually something about it. A few days ago his package

`latex2exp`

appeared on CRAN. The package provides the wonderful function

`latex2exp`

that translates LaTeX code into `plotmath`

expressions. Brillant! All I have to remember is to escape the `"\"`

character, that is write `"\\"`

instead of `"\"`

.Below is the first example from the

`plotmath`

help file and again using `latex2exp`

. I think this is much easier to read and write.You find more information about

`latex2exp`

on Stefano's web site and his GitHub repository.### Session Info

```
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] latex2exp_0.3.1
loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.2.1 Rcpp_0.11.6 stringi_0.5-5 stringr_1.0.0
```

## Notes from the Kölner R meeting, 26 June 2015

Last Friday the Cologne R user group came together for the 14th time. For the first time we met at Startplatz, a start-up incubator venue. The venue was excellent, not only did they provide us with a much larger room, but also with table-football and drinks. Many thanks to Kirill for organising all of this!

We had two excellent advanced talks. Both were very informative and well presented.

Imagine you have several servers that generate large data sets with no standard delimiters, like the example below.

The columns appear to be separated by a blank at first glance, but the second column (Military) has strings such as Air Force that include a blank itself. Furthermore, other columns have missing data (Month) and another uses speech-marks (Car). Thus, it's messy and difficult to read into R.

To solve the problem Kirill developed a Makefile that uses tools such as

Kirill's tutorial files are available via GitHub.

Paul Viefers gave a great introduction to Stan and RStan, with a focus on explaining the differences to other MCMC packages such as JAGS.

Stan is a probabilistic programming language for Bayesian inference. One of the major challenges in Bayesian analysis is that often there is no analytical solution for the posterior distribution. Hence, the posterior distribution is approximated via simulations, such as Gibbs sampling in JAGS. Stan, on the other hand, uses Hamiltonian Monte Carlo (HMC), an algorithm that is more subtle in proposing jumps, using more structure by translation into Hamiltonian mechanics framework.

Paul ended his talk by walking us through the various building blocks of a Stan script, using a hierarchical logistic regression example.

You can access Paul's slides on RPubs.

Photo: Günter Faes |

### Data Science at the Command Line

Kirill Pomogajko showed us how he uses various command line tools to pre-process log-files for further analysis with R.Photo: Günter Faes |

The columns appear to be separated by a blank at first glance, but the second column (Military) has strings such as Air Force that include a blank itself. Furthermore, other columns have missing data (Month) and another uses speech-marks (Car). Thus, it's messy and difficult to read into R.

To solve the problem Kirill developed a Makefile that uses tools such as

`scp`

, `sed`

and `awk`

to download and clean the server files. Kirill's tutorial files are available via GitHub.

### An Introduction to RStan and the Stan Modelling Language

Paul Viefers gave a great introduction to Stan and RStan, with a focus on explaining the differences to other MCMC packages such as JAGS.

Photo: Günter Faes |

Stan is a probabilistic programming language for Bayesian inference. One of the major challenges in Bayesian analysis is that often there is no analytical solution for the posterior distribution. Hence, the posterior distribution is approximated via simulations, such as Gibbs sampling in JAGS. Stan, on the other hand, uses Hamiltonian Monte Carlo (HMC), an algorithm that is more subtle in proposing jumps, using more structure by translation into Hamiltonian mechanics framework.

Paul ended his talk by walking us through the various building blocks of a Stan script, using a hierarchical logistic regression example.

You can access Paul's slides on RPubs.

### Drinks and Networking

No Cologne R user group meeting is complete without Kölsch and networking. In the end some of us ended up in a fancy burger place.### Next Kölner R meeting

The next meeting will be scheduled in September. Details will be published on our Meetup site. Thanks again to Revolution Analytics for their sponsorship.
30 Jun 2015
20:41
Koelner R User
,
Kölner R Users
,
make
,
R
,
rstan

## Next Kölner R User Meeting: Friday, 26 June 2015

The next Cologne R user group meeting is scheduled for this Friday, 6 June 2015 and we have an exciting agenda with two talks followed by networking drinks.

- Data Science at the Commandline (Kirill Pomogajko)
- An Introduction to RStan and the Stan Modelling Language (Paul Viefers)

### Drinks and Networking

The event will be followed by drinks (Kölsch!) and networking opportunities.For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.

23 Jun 2015
06:47
Kölner R Users
,
KölnR
,
News
,
R

Subscribe to:
Posts
(
Atom
)