# mages' blog

## Notes from the Kölner R meeting, 12 December 2014

Last week's Cologne R user group meeting was the best attended so far, and it was a remarkable event - I believe not a single line of R code was shown. Still, it was an R user group meeting with two excellent talks, and you will understand shortly why not much R code needed to be displayed.

### Introduction to Julia for R Users

Hans Werner Borchers joined us from Mannheim to give an introduction to Julia for R users. Julia is a high-level, high-performance dynamic programming language for technical computing. The language has gained some considerable traction over the last two years and it was great to get an overview from a familiar perspective.

Interestingly, as Hans Werner pointed out, Julia is far from the only new language on the block. Indeed, over the last decade nearly every year saw the announcement of a new language. Big tech companies such as Microsoft, Google, Mozilla and Apple are also pushing their own programming languages: F# (2005), Go (2009), Rust (2010) and Swift (2014) respectively.

In more recent years we have seen a movement towards the use of LLVM (Low Level Virtual Machine), on which Julia is based as well and which makes it fast. The just-in-time compilation demands a little mind shift if you come from R, where the mantra for speed is: vectorise, remove all for-loops. The opposite is true for Julia, because your code will be compiled: for-loops are much easier for the underlying compiler to optimise. Hans Werner's slides provide some good examples to get you started and pointers to further resources.
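To illustrate that mantra, here is a small R sketch (my own toy example) comparing a for-loop with its vectorised equivalent:

```r
x <- runif(1e6)

# For-loop version: idiomatic in Julia, but slow in R
loop_sum <- function(x) {
  s <- 0
  for (xi in x) s <- s + xi
  s
}

system.time(s1 <- loop_sum(x))  # noticeably slower in R
system.time(s2 <- sum(x))       # vectorised, runs in compiled C code
all.equal(s1, s2)
```

In Julia the loop version would compile down to fast machine code, so no such rewriting is needed.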

### Dynamic Linear Models and Kalman Filtering

Forecasting time series was the topic of Holger Zien's talk. Holger gained his first experience with time series during his PhD, when he worked with experimental sensor data. That meant he had lots of data, which moreover could often be regarded as stationary.

Nowadays, his challenges can be very different, sometimes only a few data points from a non-stationary process are available, and yet he is still expected to predict the future.

Dynamic linear models (dlm) can provide a remedy in those situations. In its simplest version a dlm links system and observational equations in the following way:
$y_t = F_t \theta_t + \nu_t\quad\mbox{observation eq.}\\ \theta_t = G_t \theta_{t-1} + \omega_t\quad\mbox{system eq.}$
with $$\nu_t, \omega_t$$ mutually independent random variables. A special case of dynamic linear models is the well-known Kalman filter. In the more general case $$y_t$$ and $$\theta_t$$ are vectors and $$F_t, G_t$$ are time-variant matrices.

Holger explained that a dlm can in principle be used for three purposes:
• Filtering: Estimate of the current value of the state/system variable.
• Smoothing: Estimate of past values of the state/system variable, i.e., estimating at time $$t$$ given measurements up to time $$t' > t$$.
• Forecasting: Forecasting future observations or values of the state/system variable.
With this introduction in place, Holger showed us how various ARMA and linear regression models can be expressed as dlms. His talk concluded with some remarks about his personal experience and references to various R packages, articles and books.
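As a small taster in R, the simplest dlm of all, the local level model (the Kalman filter special case above with scalar $$F=G=1$$), can be set up with the dlm package; the parameter values below are my own assumptions:

```r
library(dlm)

# Local level model: F = G = 1, observation variance dV,
# system variance dW
mod <- dlmModPoly(order = 1, dV = 0.5, dW = 0.01)

set.seed(1)
y <- rnorm(50, mean = 20, sd = sqrt(0.5))  # toy measurements

filt <- dlmFilter(y, mod)  # Kalman filtering
head(filt$m)               # filtered state estimates (first entry is the prior)
```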

### Next Kölner R meeting

The next meeting is scheduled for 6 March 2015.

Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.

Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.

## Next Kölner R User Meeting: Friday, 12 December 2014

The next Cologne R user group meeting is scheduled for this Friday, 12 December 2014.

We have an exciting agenda with two talks on Julia and Dynamic Linear Models:

### Introduction to Julia for R Users

Hans Werner Borchers

Julia is a high-performance dynamic programming language for scientific computing, with a syntax that is familiar to users of other technical computing environments (Matlab, Python, R, etc.). It provides a sophisticated compiler, high performance with numerical accuracy, and extensive mathematical function libraries.

Some of the highlights of Julia are an implementation of automatic differentiation, an optimisation modelling language (also integrating some of the optimisation solvers available from the COIN-OR project), and user-contributed packages for time series, statistics, machine learning and operations research.

### Dynamic Linear Models and Kalman Filtering

Holger Zien

One of the problems most commonly presented to everybody working in statistics is forecasting time series. The textbook answer is to fit an ARMA model to the data and to use the model for prediction. This approach works well for long and stationary time series. However, often the actual time series one is given to analyse are more complicated. They are ridiculously short and clearly non-stationary. Dynamic Linear Models (DLM) and Kalman Filtering go one step beyond ARMA models and may be applied to these more complicated data. I will give a brief introduction into their mathematical background and will talk about my experience using them in practice.

### Drinks and Networking

The event will be followed by drinks and schnitzel at the Lux.

The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their vector programme.


## Measuring temperature with my Arduino

It is really getting colder in London - it is now about 5°C outside. The heating is on and I have got better at measuring the temperature at home as well. Or, so I believe.

Last week's approach of me guessing/feeling the temperature combined with an old thermometer was perhaps too simplistic and too unreliable. This week's attempt to measure the temperature with my Arduino might be a little OTT (over the top), but at least I am using the micro-controller again.

Yet, I have to overcome challenges similar to last week's. The Arduino with the temperature sensor (TMP36) should be a better device for measuring the temperature than my thermometer, but it isn't perfect either and the readings will have some uncertainty as well. But it will be easy to make frequent measurements, and with a little trick I should be able to filter out the noise and get a much better view of the 'true' temperature.

The trick is Bayesian filtering, or as it turns out, a simplified Kalman filter (univariate, no control actions, no transition matrices).

I can't observe the temperature directly, I can only try to measure it. I think of this scenario as a series of Bayesian networks (or Hidden Markov Model). The hidden or latent variable is the 'true' temperature and the observable variable is the reading of my Arduino sensor.

I start with a prior guess of the temperature at time $$k=0$$, just like last week (note, here I will use $$x_k$$ instead of $$\mu_k$$). My initial prior is $$N(x_0, \sigma_0^2)$$. Now I predict the state (temperature) $$\hat{x}_1$$ at the next time step. I expect the temperature to be largely unchanged and therefore my prediction is just $$\hat{x}_1 = x_0 + \varepsilon_P$$ with $$\varepsilon_P \sim N(0, \sigma_P^2)$$ describing the additional process variance going from one time step to the next. The process variance is independent of the uncertainty of $$x_0$$, hence I can add them up and thus my prior for $$x_1$$ is $$N(x_0, \sigma_0^2+\sigma_P^2)$$.

The measurement of the temperature should reflect the 'true' temperature with some measurement uncertainty $$\sigma_M$$. My prediction of the measurement is $$\hat{z}_1 = \hat{x}_1 + \varepsilon_M$$ with $$\varepsilon_M \sim N(0, \sigma_M^2)$$. The best estimator for $$\hat{z}_1$$ is the actual measurement $$z_1$$ and therefore the likelihood distribution of $$z_1|x_1$$ is $$N(z_1, \sigma_M^2)$$. Applying the formulas for Bayesian conjugates again gives the hyper-parameters for the posterior normal distribution $$x_1|\,z_1, x_0$$, the temperature given the measurement and my prediction from the previous time step:

$x_1=\left.\left(\frac{x_0}{\sigma_0^2+\sigma_P^2} + \frac{z_1}{\sigma_M^2}\right)\right/\left(\frac{1}{\sigma_0^2+\sigma_P^2} + \frac{1}{\sigma_M^2}\right) \\ \sigma_1^2=\left(\frac{1}{\sigma_0^2+\sigma_P^2} + \frac{1}{\sigma_M^2}\right)^{-1}$
With $$K_1 := \frac{\sigma_0^2+\sigma_P^2}{\sigma_M^2+\sigma_0^2+\sigma_P^2}$$ this simplifies to:
$x_1 = K_1\, z_1 + (1 - K_1)\, x_0\\ \sigma_1^2 = (1 - K_1) (\sigma_0^2 + \sigma_P^2)$
This gives the simple iterative updating procedure (or Kalman filter):
$K_{k+1} = \frac{\sigma_k^2+\sigma_P^2}{\sigma_M^2+\sigma_k^2+\sigma_P^2} \\ \sigma_{k+1}^2 = (1 - K_{k+1}) (\sigma_k^2 + \sigma_P^2)\\ x_{k+1} = K_{k+1}\, z_{k+1} + (1 - K_{k+1})\, x_k$
In my code below I set the process variance to $$\sigma_P^2=0.01$$ and the measurement variance to $$\sigma_M^2=0.5$$, which led to a fairly stable reading of 21.5°C.
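The updating procedure above takes only a few lines of R. Here is a minimal sketch with simulated readings (the Processing and Arduino code is in the Code section below; the numbers here are made up):

```r
# One step of the univariate Kalman filter derived above
kalman_update <- function(x, sigma2, z, sigmaP2 = 0.01, sigmaM2 = 0.5) {
  K <- (sigma2 + sigmaP2) / (sigmaM2 + sigma2 + sigmaP2)  # Kalman gain
  list(x = K * z + (1 - K) * x,                # posterior mean
       sigma2 = (1 - K) * (sigma2 + sigmaP2))  # posterior variance
}

# Simulated noisy readings around a 'true' temperature of 21.5°C
set.seed(1)
z <- rnorm(100, mean = 21.5, sd = sqrt(0.5))

est <- list(x = 20, sigma2 = 9)  # prior: N(20, 3^2)
for (zi in z) est <- kalman_update(est$x, est$sigma2, zi)
est$x  # settles close to 21.5
```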

Changing the parameters $$\sigma_P$$ and $$\sigma_M$$ influences how quickly the filter reacts to changes. Here is an example where I briefly touch the temperature sensor, using the above parameters. The time axis in those screen shots is about 6 seconds.

I have to do a bit more reading on the Kalman filter. It appears to be an immensely powerful tool to extract the signal from the noise. Well, it helped to put a man on the moon.

### Code

This is the Processing and Arduino code I used in this post. You may have to change the port number in line 28 to your own settings.

## How cold is it? A Bayesian attempt to measure temperature

It is getting colder in London, yet it is still quite mild considering that it is late November. Well, indoors it still feels like 20°C (68°F) to me, but I was told last week that I should switch on the heating.

Luckily I found an old thermometer to check. The thermometer showed 18°C. Is it really below 20°C?

The thermometer is quite old and I'm not sure that it works properly anymore. So, what shall I do now? Perhaps I should consider that both measurements are uncertain and try to combine them.

I believe that I can sense the temperature within ±3°C, while I think that the thermometer still works within ±2°C. Assuming that both measurements follow a Gaussian (Normal) distribution, with the uncertainties given as standard deviations, I can use Bayes' theorem to combine my hypothesis with the data. The posterior distribution will be Gaussian again with conjugate hyper-parameters:
$\mu=\left.\left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^n x_i}{s^2}\right)\right/\left(\frac{1}{\sigma_0^2} + \frac{n}{s^2}\right) \\ \sigma^2=\left(\frac{1}{\sigma_0^2} + \frac{n}{s^2}\right)^{-1}$
With $$K := \frac{n\sigma_0^2}{s^2+n\sigma_0^2}$$ this simplifies to:
$\mu = K\, \bar{x} + (1 - K)\, \mu_0 \mbox{, with } \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\\ \sigma = s \,\sqrt{K/n}$
In my case I have: $$n=1,\; x_1=18^{\circ}C,\; s=2^{\circ}C,\; \mu_0=20^{\circ}C,\; \sigma_0=3^{\circ}C$$.

Hence, the posterior distribution has parameters $$\mu=18.6^{\circ}C$$ and $$\sigma=1.7^{\circ}C$$. Thus, my best guess would be that it is actually a little colder than I thought. One could argue that the probability that it is below 20°C is 80%.

Over the last five days my perception of the temperature didn't change, neither did the weather forecast, but the measurements showed: 18°C, 19°C, 17.5°C, 18°C, 18.5°C.

With that information the parameters update to $$\mu=18.3^{\circ}C$$ and $$\sigma=0.9^{\circ}C$$. I can't deny it any longer: it has got colder. The probability that it is below 20°C is now at 97% and the heating is on.
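The conjugate updating formulas are easy to check in R; here is a small sketch reproducing the numbers above:

```r
# Posterior hyper-parameters of the conjugate Normal model
posterior <- function(x, s, mu0, sigma0) {
  n <- length(x)
  K <- n * sigma0^2 / (s^2 + n * sigma0^2)
  list(mu = K * mean(x) + (1 - K) * mu0,
       sigma = s * sqrt(K / n))
}

# One measurement of 18°C
p1 <- posterior(x = 18, s = 2, mu0 = 20, sigma0 = 3)
round(c(p1$mu, p1$sigma), 1)  # 18.6 and 1.7
pnorm(20, p1$mu, p1$sigma)    # roughly 80%

# Five days of measurements against the same prior
p5 <- posterior(x = c(18, 19, 17.5, 18, 18.5), s = 2, mu0 = 20, sigma0 = 3)
round(c(p5$mu, p5$sigma), 1)  # 18.3 and 0.9
pnorm(20, p5$mu, p5$sigma)    # roughly 97%
```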

Without any prior knowledge I may have used a t-test to check the measurements. But here I believe that I have information about the thermometer and my own temperature sensing abilities which I don't want to ignore.

### Session Info

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] BayesianFirstAid_0.1 rjags_3-14           coda_0.16-1
[4] lattice_0.20-29

loaded via a namespace (and not attached):
[1] grid_3.1.2    MASS_7.3-35   mnormt_1.5-1  stringr_0.6.2

## First steps with ChainLadder: Import triangle from Excel into R

Taking the first step is often the hardest: getting data from Excel into R.

Suppose you would like to use the ChainLadder package to forecast future claims payments for a run-off triangle that you have stored in Excel.

How do you get the triangle into R and execute a reserving function, such as MackChainLadder?

Well, there are many ways to do this and the ChainLadder package vignette, as well as the R manual on Data Import/Export has all of the details, but here is a quick and dirty solution using a CSV-file.

Open a new Excel workbook and copy your triangle into cell A1, with the first column being the accident or origin period and the first row describing the development period or age. You find an example CSV-file on GitHub.

Make sure that your triangle has no formatting, such as commas to separate thousands, as Excel will save those cells as characters. Now open R and go through the following commands:

# The first command will open a window to choose your CSV-file
myCSVfile <- file.choose()
dat <- read.csv(myCSVfile)
# use read.csv2 instead if semicolons are used as a separator,
# likely to be the case if you are in continental Europe

dat # to see your data

AY   X1    X2    X3    X4    X5    X6    X7    X8    X9   X10
1  1981 5012  8269 10907 11805 13539 16181 18009 18608 18662 18834
2  1982  106  4285  5396 10666 13782 15599 15496 16169 16704    NA
3  1983 3410  8992 13873 16141 18735 22214 22863 23466    NA    NA
4  1984 5655 11555 15766 21266 23425 26083 27067    NA    NA    NA
5  1985 1092  9565 15836 22169 25955 26180    NA    NA    NA    NA
6  1986 1513  6445 11702 12935 15852    NA    NA    NA    NA    NA
7  1987  557  4020 10946 12314    NA    NA    NA    NA    NA    NA
8  1988 1351  6947 13112    NA    NA    NA    NA    NA    NA    NA
9  1989 3133  5395    NA    NA    NA    NA    NA    NA    NA    NA
10 1990 2063    NA    NA    NA    NA    NA    NA    NA    NA    NA

Ok, the data is in R, but now you have to convert it into a triangle. A triangle is basically a matrix with extra attributes. To do this execute the following steps.

# Load the ChainLadder package
library(ChainLadder)
# Ignore the first column, which holds the accident year information
tri <- dat[, -1]
# Convert to matrix
tri <- as.matrix(tri)
# Set the dimension names
dimnames(tri) <- list(origin=dat[,1], dev=1:ncol(tri))
# Convert into a triangle class
tri <- as.triangle(tri)
tri
dev
origin    1     2     3     4     5     6     7     8     9    10
1981 5012  8269 10907 11805 13539 16181 18009 18608 18662 18834
1982  106  4285  5396 10666 13782 15599 15496 16169 16704    NA
1983 3410  8992 13873 16141 18735 22214 22863 23466    NA    NA
1984 5655 11555 15766 21266 23425 26083 27067    NA    NA    NA
1985 1092  9565 15836 22169 25955 26180    NA    NA    NA    NA
1986 1513  6445 11702 12935 15852    NA    NA    NA    NA    NA
1987  557  4020 10946 12314    NA    NA    NA    NA    NA    NA
1988 1351  6947 13112    NA    NA    NA    NA    NA    NA    NA
1989 3133  5395    NA    NA    NA    NA    NA    NA    NA    NA
1990 2063    NA    NA    NA    NA    NA    NA    NA    NA    NA

With those preparations done you can execute the MackChainLadder function:

M <- MackChainLadder(tri, est.sigma = "Mack")
M
Latest Dev.To.Date Ultimate   IBNR Mack.S.E CV(IBNR)
1981 18,834       1.000   18,834      0        0      NaN
1982 16,704       0.991   16,858    154      206    1.339
1983 23,466       0.974   24,083    617      623    1.010
1984 27,067       0.943   28,703  1,636      747    0.457
1985 26,180       0.905   28,927  2,747    1,469    0.535
1986 15,852       0.813   19,501  3,649    2,002    0.549
1987 12,314       0.694   17,749  5,435    2,209    0.406
1988 13,112       0.546   24,019 10,907    5,358    0.491
1989  5,395       0.336   16,045 10,650    6,333    0.595
1990  2,063       0.112   18,402 16,339   24,566    1.503

Totals
Latest:    160,987.00
Dev:             0.76
Ultimate:  213,122.23
IBNR:       52,135.23
Mack S.E.:  26,909.01
CV(IBNR):        0.52


To copy the full triangle back into Excel you can use the clipboard:
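The original snippet is not reproduced here; a minimal sketch (assuming the MackChainLadder result M from above) might look as follows:

```r
# The clipboard connection is platform dependent; a small helper
# covering Windows and macOS (an assumption, not the post's code):
copy_to_clipboard <- function(x) {
  if (.Platform$OS.type == "windows") {
    # On Windows the clipboard is available as a file connection
    write.table(x, "clipboard", sep = "\t", na = "")
  } else {
    # On macOS pipe the output to pbcopy instead
    write.table(x, file = pipe("pbcopy"), sep = "\t", na = "")
  }
}
# copy_to_clipboard(M$FullTriangle)  # then paste straight into Excel
```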

### Session Info

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets
[7] methods   base

other attached packages:
[1] XML_3.98-1.1     data.table_1.9.2 ggplot2_1.0.0
[4] maps_2.3-7

loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.4     gtable_0.1.2
[4] labeling_0.3     MASS_7.3-33      munsell_0.4.2
[7] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2
[10] reshape2_1.4     scales_0.2.4     stringr_0.6.2
[13] tools_3.1.1 

## Running RStudio via Docker in the Cloud

Deploying applications via Docker container is the current talk of town. I have heard about Docker and played around with it a little, but when Dirk Eddelbuettel posted his R and Docker talk last Friday I got really excited and had to have a go myself.

My aim was to rent some resources in the cloud, pull an RStudio Server container and run RStudio in a browser. It was actually surprisingly simple to get started.

I chose Digital Ocean as my cloud provider. They have many Linux systems to choose from and also a pre-built Docker system.

About a minute after I had kicked off the Docker droplet, I could log into the system in a browser window and start pulling a Docker image, e.g. Dirk's container.

Once the downloads had finished I could start RStudio Server using the docker run command and log into an RStudio session. To my surprise even my googleVis package worked out of the box. The plot command simply opened another browser window to display the chart; here the output of the WorldBank demo.

All of this was done within minutes in a browser window. I didn't even use a terminal window. So, that's how you run R on an iPad. Considering that the cost for the server was \$0.015 per hour, I wonder why I should buy my own server, or indeed buy a new computer.

## Managing R package dependencies

One of my takeaways from last week's EARL conference was that R is growing more and more out of its academic roots into the enterprise. And with that come some challenges, e.g. how do I ensure consistent and systematic access to a set of R packages in an organisation, in particular when one team is providing packages to others?

Two packages can help here: roxyPackage and miniCRAN.

I wrote about roxyPackage earlier on this blog. It allows me to create a local repository to distribute my package, while at the same time execute and control the build process from within R. But what about my package's dependencies? Here miniCRAN helps. miniCRAN is a new package by Andrie de Vries that enables me to find and download all package dependencies and store them in a local repository, e.g. the one used by roxyPackage.

For more details about roxyPackage and miniCRAN read the respective package vignettes.

### Example

To create a local sub-CRAN repository for the two packages I maintain on CRAN and with all their dependencies I use:
library("miniCRAN")
# The two packages I maintain on CRAN
my.pkgs <- c("ChainLadder", "googleVis")
pkgs <- pkgDep(my.pkgs, suggests = TRUE, enhances = FALSE)
makeRepo(pkgs = pkgs, path = "/Users/Shared/myCRANRepos")
And to visualise the dependencies:
dg <- makeDepGraph(my.pkgs, includeBasePkgs = FALSE,
                   suggests = TRUE, enhances = TRUE)
set.seed(1)
plot(dg, legendPosEdge = c(-1, 1),
     legendPosVertex = c(1, 1), vertex.size = 20)

What a surprise! In total I end up with 42 packages from CRAN, and I hadn't expected any connection between the ChainLadder and googleVis packages.

### Bonus tip

Don't miss out on Pat Burns's insightful talk about effective risk management from EARL. His thoughts reminded me of the great Karl Popper: Good tests kill flawed theories; we remain alive to guess again.

### Session Info

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics  grDevices utils  datasets  methods
[7] base

other attached packages:
[1] miniCRAN_0.1-0

loaded via a namespace (and not attached):
[1] httr_0.5 igraph_0.7.1  stringr_0.6.2 tools_3.1.1
[5] XML_3.98-1.1

## Notes from the Kölner R meeting, 12 September 2014

Last Friday we had guests from Belgium and the Netherlands joining us in Cologne. Maarten-Jan Kallen from BeDataDriven came from The Hague to introduce us to Renjin, and the guys from DataCamp in Leuven, namely Jonathan, Martijn and Dieter, gave an overview of their new online interactive training platform.

## Next Kölner R User Meeting: Friday, 12 September 2014

The next Cologne R user group meeting is scheduled for this Friday, 12 September 2014.

We have a great agenda with international speakers:
• Maarten-Jan Kallen: Introduction to Renjin, the R interpreter for the JVM
• Jonathan Cornelissen, Martijn Theuwissen: DataCamp - An online interactive learning platform for R
The event will be followed by drinks and schnitzel at the Lux.

The Google Charts API is quite powerful and via googleVis you can access it from R. Here is an example that demonstrates how you can zoom into your chart.
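A minimal sketch of such a chart (the data and variable names are my own; the explorer option string is passed straight through to the Google Charts API):

```r
library(googleVis)

df <- data.frame(x = 1:50, y = cumsum(rnorm(50)))

# The 'explorer' option enables drag-to-zoom and right-click-to-reset
zoomChart <- gvisLineChart(df, xvar = "x", yvar = "y",
                           options = list(
                             explorer = "{actions: ['dragToZoom', 'rightClickToReset']}",
                             width = 600, height = 400))
# plot(zoomChart)  # opens the interactive chart in a browser
```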

Over the weekend we released version 0.1.8 of the ChainLadder package for claims reserving on CRAN.

### What is claims reserving?

The insurance industry, unlike other industries, does not sell products as such but promises. An insurance policy is a promise by the insurer to the policyholder to pay for future claims for an upfront received premium.

As a result insurers don't know the cost of their service upfront, but rely on historical data analysis and judgement to predict a sustainable price for their offering. In General Insurance (or Non-Life Insurance, e.g. motor, property and casualty insurance) most policies run for a period of 12 months. However, the claims payment process can take years or even decades. Therefore insurers often do not even know the delivery date of their product. The money set aside for those future claims payments is called reserves.


Earlier this week we released googleVis 0.5.5 on CRAN. The package provides an interface between R and Google Charts, allowing you to create interactive web charts from R. This is mainly a maintenance release, updating documentation and minor issues.

 Screen shot of some of the Google Charts

New to googleVis? Review the examples of all googleVis charts on CRAN.

Perhaps the best known example of the Google Chart API is the motion chart, popularised by Hans Rosling in his 2006 TED talk.

## GrapheR: A GUI for base graphics in R

How did I miss the GrapheR package?

The author, Maxime Hervé, published an article about the package [1] in the same issue of the R Journal as we did on googleVis. Yet, it took me a package update notification on CRANberries to look into GrapheR in more detail - 3 years later! And what a wonderful gem GrapheR is.

The package provides a graphical user interface for creating base charts in R. It is ideal for beginners in R, as the user interface is very clear and the code is written alongside into a text file, allowing users to recreate the charts directly in the console.

Adding and changing legends? Messing around with the plotting window settings? It is much easier/quicker with this GUI than reading the help file and trying to understand the various parameters.

Here is a little example using the iris data set.
library(GrapheR)
data(iris)
run.GrapheR()

This will bring up a window that helps me to create the chart and tweak the various parameters.

## Thanks to R Markdown: Perhaps Word is an option after all?

In many cases Word is still the preferred file format for collaboration in the office. Yet, it is often a challenge to work with it, not so much because of the software, but because of how it is used and abused. Thanks to Markdown it is no longer painful to include mathematical notation and R output in Word.

I have been using R Markdown for a while now and have grown very fond of it. Although I am quite happy with PDF and HTML output for basic reports and to switch to Sweave/LaTeX for more complex documents, I was pleasantly surprised to learn that the new version of RStudio can produce MS Word files directly from R Markdown as well; thanks to the power of pandoc. Perhaps Word is an option after all?

## Hit and run. Think Bayes!

At the R in Insurance conference Arthur Charpentier gave a great keynote talk on Bayesian modelling in R. Bayes' theorem on conditional probabilities is strikingly simple, yet incredibly thought provoking. Here is an example from Daniel Kahneman to test your intuition. But first I have to start with Bayes' theorem.

### Bayes' theorem

Bayes' theorem states that given two events $$D$$ and $$H$$, the probability of $$D$$ and $$H$$ happening at the same time is the same as the probability of $$D$$ occurring, given $$H$$, weighted by the probability that $$H$$ occurs; or the other way round. As a formula it can be written as:
$P(H \cap D) = P(H|D) \, P(D) = P(D|H) \, P(H)$
Or if I rearrange it:
$P(H|D) = \dfrac{P(D|H) \, P(H)}{P(D)}$
Imagine $$H$$ is short for hypothesis and $$D$$ is short for data, or evidence. Then Bayes' theorem states that the probability of a hypothesis given data is the same as the likelihood that we observe the data given the hypothesis, weighted by the prior belief of the hypothesis, normalised by the probability that we observe the data regardless of the hypothesis.

The tricky bit in real life is often to figure out what the hypothesis and data are.

### Hit and run accident

This example is taken from Daniel Kahneman's book Thinking, fast and slow [1].
A cab was involved in a hit and run accident at night. Two cab companies, the Green and the Blue, operate in the city. 85% of the cabs in the city are Green and 15% are Blue. A witness identified the cab as Blue. The court tested the reliability of the witness under the same circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colours 80% of the time and failed 20% of the time.

What is the probability that the cab involved in the accident was Blue rather than Green knowing that this witness identified it as Blue?

What is here the data and what is here the hypothesis? Intuitively you may think that the proportion of Blue and Green cabs is the data at hand and the witness accusation that a Blue cab was involved in the accident is the hypothesis. However, after some thought I found the following assignment much more helpful, as then $$P(H|D)$$ matches the above question:

$$H =$$ Accident caused by Blue cab. $$D =$$ Witness said the cab was Blue.

With this it is straightforward to get the probabilities of $$P(H)=15\%$$ and $$P(D|H)=80\%$$. But what is $$P(D)$$? Well, when would the witness say that the cab was Blue? Either, when the cab was Blue and so the witness is right, or when the cab was actually Green and the witness is incorrect. Thus, following the law of total probability:
\begin{align} P(D) & = P(D|H) P(H) + P(D | \bar{H}) P(\bar{H})\\ & = 0.8 \cdot 0.15 + 0.2 \cdot 0.85 = 0.29 \end{align}
Therefore I get $$P(H|D)=41\%$$. Thus, even if the witness states that the cab involved in the accident was Blue, the probability of this being true is only $$41\%$$.
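The calculation above can be checked in a few lines of R:

```r
p_H      <- 0.15  # prior: the cab is Blue
p_D_H    <- 0.80  # witness says Blue, given the cab is Blue
p_D_notH <- 0.20  # witness says Blue, given the cab is Green

# Law of total probability, then Bayes' theorem
p_D   <- p_D_H * p_H + p_D_notH * (1 - p_H)
p_H_D <- p_D_H * p_H / p_D

c(p_D, p_H_D)  # 0.29 and roughly 0.41
```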

An alternative way to think about this problem is via a Bayesian Network. The colour of the cab will influence the statement of the witness. In R I can specify such a network using the gRain package [2], which I discussed in an earlier post. Here I provide the distribution of the cabs and the conditional probabilities of the witness as an input. After I compile the network, I can again read off the probabilities that a Blue cab was involved, when the witness said so.

## Notes from the 2nd R in Insurance Conference

The 2nd R in Insurance conference took place last Monday, 14 July, at Cass Business School London.

This one-day conference focused once more on applications in insurance and actuarial science that use R. Topics covered included reserving, pricing, loss modelling, the use of R in a production environment and more.

In the first plenary session, Montserrat Guillen (Riskcenter, University of Barcelona) and Leo Guelman (Royal Bank of Canada, RBC Insurance) spoke about the rise of uplift models. These predictive models are used for improved targeting of policyholders by marketing campaigns, through the use of experimental data. The presenters illustrated the use of their uplift package (available on CRAN), which they have developed for such applications.

## Simple user interface in R to get login details

Occasionally I have to connect to services from R that ask for login details, such as databases. I don't like to store my login details in the R source code file; instead I would prefer to enter my login details when I execute the code.

Fortunately, I found some old code in a post by Barry Rowlingson that does just that. It uses the tcltk package in R to create a little window in which the user can enter her details, without showing the password. The tcltk package is part of base R, which means the code will run on any operating system. Nice!

Recently we released googleVis 0.5.3 on CRAN. The package provides an interface between R and Google Charts, allowing you to create interactive web charts from R.

 Screen shot of some of the Google Charts

Although this is mainly a maintenance release, I'd like to point out two changes:
• Default chart width is set to 'automatic' instead of 500 pixels.
• Intervals for column roles have to end with the suffix ".i", with "i" being an integer. Several interval columns are allowed, see the roles demo and vignette for more details.
Those changes were required to fix the following issues:
• The order of y-variables in core charts wasn't maintained. Thanks to John Taveras for reporting this bug.
• Width and height of googleVis charts were only accepted in pixels, although the Google Charts API uses standard HTML units (for example, '100px', '80em', '60', 'automatic'). If no units are specified the number is assumed to be pixels. This has been fixed. Thanks to Paul Murrell for reporting this issue.
New to googleVis? Review the demo on CRAN.

## Last chance to register for the R in Insurance conference

The registration for the 2nd R in Insurance conference at Cass Business School London will close this Friday, 4 July.

The programme includes talks from international practitioners and leading academics, see below. For more details and registration visit: http://www.rininsurance.com.

## Generating and visualising multivariate random numbers in R

This post will present the wonderful pairs.panels function of the psych package [1] that I discovered recently to visualise multivariate random numbers.

Here is a little example with a Gaussian copula and normal and log-normal marginal distributions. I use pairs.panels to illustrate the steps along the way.

library(psych)
library(MASS)
Sig <- matrix(c( 1.0, -0.7, -0.5,
                -0.7,  1.0,  0.6,
                -0.5,  0.6,  1.0),
              nrow = 3)
X <- mvrnorm(1000, mu = rep(0, 3), Sigma = Sig,
             empirical = TRUE)
pairs.panels(X)
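The block above only draws the multivariate Gaussian sample. To get the Gaussian copula with normal and log-normal marginals mentioned in the text, the sample can be pushed through its marginal distribution functions; a sketch (the parameter choices are my own):

```r
library(MASS)

set.seed(42)
Sig <- matrix(c( 1.0, -0.7, -0.5,
                -0.7,  1.0,  0.6,
                -0.5,  0.6,  1.0),
              nrow = 3)
X <- mvrnorm(1000, mu = rep(0, 3), Sigma = Sig, empirical = TRUE)

U <- pnorm(X)  # uniform marginals; the correlation structure survives
Y <- cbind(qnorm(U[, 1]),                       # normal marginal
           qlnorm(U[, 2]),                      # log-normal marginal
           qlnorm(U[, 3], meanlog = 0, sdlog = 0.5))
# pairs.panels(Y)  # visualise the result as before
```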

## Who will win the World Cup and which prediction model?

The World Cup finally kicked off last Thursday and I have seen some fantastic games already. The Netherlands appear to be perhaps the strongest side so far, following their 5-1 victory over Spain.

To me the question is not only which country will win the World Cup, but also which prediction model will come closest to the actual results. Here I present three teams: FiveThirtyEight, a polling aggregation website; Groll & Schauberger, two academics from Munich; and finally Lloyd's of London, the insurance market.

The team around Nate Silver at FiveThirtyEight has used ESPN's Soccer Power Index to predict the outcomes of games and continues to update its model. Brazil is their clear favourite.

Andreas Groll & Gunther Schauberger from the LMU Munich developed a model which, like the FiveThirtyEight approach, aims to estimate each team's probability of winning the World Cup. Unlike FiveThirtyEight, however, they see Germany taking the trophy home.

Lloyd's chose a different approach to predicting the World Cup final. The insurance market looked at the risk aspect of the teams and ranked them by their insured value. Arguably, the better a team, the higher its insured value. As a result, Lloyd's predicts Germany to win the World Cup.

Quick reminder: what's the difference between insurance and gambling? Gambling introduces risk where none exists; insurance mitigates risk where it exists.

## The joy of joining data.tables

The example I present here is a little silly, yet it illustrates how to join tables with data.table in R.

### Mapping old data to new data

Categories in general are never fixed; they always change at some point, and then the trouble starts with the data. For example, not that long ago we didn't distinguish between smartphones and dumbphones, or between video on demand and video rental shops.

I would like to back-track the price-change data for smartphones and online movie rentals, assuming that their earlier development can be tied to the categories they were formerly part of, namely mobile phones and video rental shops, in order to create indices.

Here is my toy data:

I'd like to create price indices for all products and, where data for the new product categories is missing, use the price changes of the old product categories.

The data.table package helps here. I start with my original data, convert it into a data.table and create mapping tables. That allows me to add the old product with its price change to the new product. The trick is to set certain columns as keys: two data tables with the same key can easily be joined. Once I have the price changes per product and year, the price index can be added. The plot illustrates the result; the R code is below.
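As the original toy data is not reproduced here, the following is a minimal sketch of the keyed-join idea with made-up numbers (the product names, years and price changes are my assumptions, not the original data):

```r
library(data.table)

## Hypothetical annual price changes in % by product;
## the smartphone series only starts in 2011
dat <- data.table(product = c("mobile", "mobile", "smartphone"),
                  year    = c(2010, 2011, 2011),
                  change  = c(-2, -3, -5))

## Mapping table: old category and the new product it feeds
map <- data.table(product = "mobile", newprod = "smartphone")

## Keyed join: data tables sharing a key are joined with dat[map]
setkey(dat, product)
setkey(map, product)
old <- dat[map][year < 2011, .(product = newprod, year, change)]

## Back-filled smartphone history plus its own observations
smart <- rbind(old, dat[product == "smartphone"])

## Turn the yearly changes into a price index (base 100)
smart[order(year), index := 100 * cumprod(1 + change / 100)]
```

With these toy numbers, smart holds an index of 98 for 2010 (inherited from the mobile category) and 93.1 for 2011.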

## Early bird registration for R in Insurance closes tomorrow

The early bird registration offer for the 2nd R in Insurance conference, 14 July 2014, at Cass Business School closes tomorrow.

This one-day conference will focus once more on applications in insurance and actuarial science that use R, the lingua franca for statistical computation. Topics covered include reserving, pricing, loss modelling, the use of R in a production environment, and more. All topics are to be discussed within the context of using R as a primary tool for insurance risk management, analysis and modelling.

The conference programme consists of invited talks and contributed presentations discussing the wide range of fields in which R is used in insurance:

Attendance of the whole conference is the equivalent of 6.5 hours of CPD for members of the Actuarial Profession.

You can register directly on the conference web site www.rininsurance.com.

The organisers gratefully acknowledge the sponsorship of Mango Solutions, CYBAEA, RStudio and PwC without whom the event wouldn't be possible.

## Notes from the Kölner R meeting, 23 May 2014

The 10th Kölner R user meeting took place last Friday at the Institute of Sociology, and to celebrate the anniversary we invited Andrie de Vries to join us from Revolution Analytics. Andrie is well known in the R community; he is the co-author of the R for Dummies book and an active contributor on Stack Overflow.

### Taking R to the Enterprise

 Andrie de Vries: Taking R to the Enterprise. Photo: Günter Faes

Andrie talked about how R is finding its way into the enterprise. He argued that R has become the go-to tool for data and statistical analysis in many companies. He observed that over the years the commercial user base evolved from a small expert community, with a love for the command line and cutting-edge technology, into a much wider audience. Although R is still used by technical experts, who demand more and more power, out-of-memory computations on bigger data, etc., the number of casual R users is growing rapidly. Often the casual R user relies on code written by others as part of a bigger workflow and hence prefers a more standardised user interface.

Andrie illustrated how Revolution Analytics aims to satisfy both user groups. He started with the expert R users by demonstrating Revolution Analytics' RevoScaleR package, which allows them to carry out analysis on bigger data. His example followed closely the airline analysis of Joseph Rickert's white paper. Andrie then showed how more complex R code and functions can be integrated (and to some extent hidden) into Alteryx and Tableau via Rserve.

 Slides and code available on GitHub

I gave a brief introduction to googleVis and presented recent developments. The googleVis package provides an interface between R and the Google Chart Tools API. It allows users to create web pages with interactive charts based on R data frames and to display them either via the local R HTTP help server or within their own sites, without uploading the data to Google. The best known example is perhaps the motion chart of fertility and life expectancy data from the World Bank as first presented by Hans Rosling. Another popular example is the chart that illustrates the performance of the famous Lloyd's insurance market.

Since version 0.5.0 many new chart types have been added to googleVis, including annotation, sankey, calendar and timeline charts. Additionally, new markdown vignettes have been included to present the examples on CRAN as well. The slides and rmarkdown code are available via GitHub.

### Kölsch & Schnitzel

No Kölner R meeting would be complete without a few Kölsch and Schnitzel at the Lux. We were lucky with the weather so that we could sit outside to enjoy drinks, food and networking opportunities.

 Kölsch and Schnitzel at the Lux. Photo: Günter Faes

### Next Kölner R meeting

The next meeting is scheduled for 12 September 2014.

Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.

Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.

## Next Kölner R User Meeting: Friday, 23 May 2014

The next Cologne R user group meeting is scheduled for this Friday, 23 May 2014.

To celebrate our 10th meeting we welcome:
Followed by drinks and schnitzel at the Lux.

Further details available on our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their vector programme.


## The Wiener takes it all? A review of the 2014 Eurovision results

Saturday's Eurovision Song Contest (ESC) from Copenhagen was hilarious as usual, with acts from all over Europe and some more or less sensible gimmicks: a circular piano, a giant hamster wheel, a see-saw, or indeed a beard and fancy dress.

The results of the ESC were only a little different from what the bookmakers in the UK had predicted before the event started. Sweden was seen as the favourite, followed by Austria, the Netherlands, Armenia and the UK.

In the end Conchita Wurst from Austria won with 290 points, ahead of the acts from the Netherlands with 238, Sweden with 218 and Armenia with 174 points. The UK finished 17th with only 40 points. Perhaps the English bookies' fifth place for the UK reflected the bias of their clients towards their home country.

The points were given as a combination of a public televote and a jury in each country. But how closely did the juries' votes resemble the public's? Is there much variance between countries?

The detailed voting data are available from the Eurovision site. Of the 37 countries that participated in the voting, two, San Marino and Albania, had no televoting and one, Georgia, had no jury. In the remaining 34 countries, Austria and the Netherlands made it into the top 2 of both the jury and the public voting results when treated independently. Sweden made it into the top 5 of both as well, but the other countries differed.

So, how consistent was the voting of the jury and public for the top 5 in each country?

Well, in only 11 out of 34 countries did the jury agree with the public on more than two of the top 5 songs. In 25 countries the jury and public agreed on at least two candidates.

In some cases the differences between public and jury were so wide that, although an act was voted into the top 5 of the televote rankings, it wouldn't receive any points. The public's top choice in Belgium (Armenia), Ireland (Poland), Montenegro (Russia) and the UK (Poland) didn't get any points at all.

Still, getting the top favourites right seems much easier than picking the runners-up. Or in other words, it is really difficult to produce a hit, a song/act on which many can agree. But when you hear one, it is much easier to identify it as such: something you really like and believe others would like as well.

## Customising lines and points with googleVis

At the end of March Google released a new version of the Chart Tools API with new options for point shapes and line dash styles. The arguments are called pointShape and lineDashStyle and can be set directly via googleVis.

We published googleVis 0.5.2 on CRAN yesterday, with added examples for these new options in gvisLineChart and gvisScatterChart. Note that these options can be used with most other chart types as well, also in combination. Visit the Google documentation for more details.
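A minimal sketch of the two options with gvisLineChart; the data frame and the particular shape and dash pattern are my own illustrative choices:

```r
library(googleVis)

## Toy data for a single line
df <- data.frame(x = 1:10, y = (1:10)^2)

## pointShape and lineDashStyle are passed through the options list;
## the dash pattern is a JSON-style array of on/off lengths
line <- gvisLineChart(df, xvar = "x", yvar = "y",
                      options = list(pointShape    = "triangle",
                                     pointSize     = 5,
                                     lineDashStyle = "[4, 4]"))
# plot(line)  # opens the chart via the local R HTTP help server
```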

### Session Info

R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.1.0 

## R in Insurance 2014: Conference Programme & Abstracts

I am delighted to announce that the programme and abstracts for the second R in Insurance conference at Cass Business School in London, 14 July 2014, have been finalised.

Register by the end of May to get the early bird booking fee.

The organisers gratefully acknowledge the sponsorship of Mango Solutions, CYBAEA, RStudio and PwC without whom the event wouldn't be possible.

## Notes from the Tokyo R User Group meeting, 17 April 2014

Last Thursday I had the pleasure to attend the Tokyo R user group meeting. And what a fun meeting it was! Over 40 R users had come together in central Tokyo. Yohei Sato, who organises the meetings, allowed me to talk a little about the recent developments of the googleVis package.

Thankfully all talks were given in English:

Following the meeting the user group had booked a pub around the corner for a few drinks and some food. Brilliant!

 Delicious chicken steaks and rice porridge

The next morning, as I woke up on the 23rd floor of my hotel in Shinjuku, I felt my bed moving. I am sure it was an earthquake, but what a weird feeling it was, with a little hangover.

## googleVis 0.5.1 released on CRAN

googleVis 0.5.1 was released on CRAN yesterday.

### New Features

• New functions gvisSankey, gvisAnnotationChart, gvisHistogram, gvisCalendar and gvisTimeline to support the new Google charts of the same names (without 'gvis').
• New demo Trendlines showing how trend-lines can be added to Scatter-, Bar-, Column-, and Line Charts.
• New demo Roles showing how different column roles can be used in core charts to highlight data.
• New vignettes written in R Markdown showcasing googleVis examples and how the package works with knitr.
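For example, a Sankey chart can now be created along these lines; the flow data below is made up for illustration:

```r
library(googleVis)

## Made-up flows: source node, target node and flow weight
flows <- data.frame(From   = c("A", "A", "B"),
                    To     = c("X", "Y", "Y"),
                    Weight = c(5, 7, 3))

sk <- gvisSankey(flows, from = "From", to = "To", weight = "Weight")
# plot(sk)  # view the interactive chart in the local browser
```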

### Changes

• The help files of gvis charts no longer show all their options, instead a link to the online Google API documentation is given.