mages' blog

Review: Kölner R Meeting 13 December 2013

Last week's Cologne R user group meeting was the best attended so far. Well, we had a great line-up indeed. Matt Dowle came over from London to give an introduction to the data.table package. He was joined by his collaborator Arun Srinivasan, who is based in Cologne. Their talk was followed by Thomas Rahlf on Datendesign mit R (Data design with R).

data.table

Matt's goal with the data.table package is to reduce two times: the time it takes to write code and the time it takes to execute it. His talk illustrated how the syntax of data.table, not unlike SQL, can produce shorter and more readable code that at the same time provides an efficient and fast way to analyse large in-memory data sets with R. Arun presented the new developments in data.table 1.8.11, which not only fixes bugs but also adds many new features, such as melt/cast and further speed gains.

I said earlier that data.table rocks. For more details see the data.table home page.

Data design with R

 Thomas Rahlf: Datendesign mit R

Thomas Rahlf talked about his forthcoming book Datendesign mit R (Data design with R). He shared with us his motivations and aims for the book. In his opinion there are many books that present beautiful charts and concepts (e.g. Tufte's books), but then don't show how they can be reproduced, as these are often created with software such as Adobe Illustrator. Other books explain the graphical functions of a piece of software, yet fail to demonstrate how to create beautiful charts with them. Thus, Thomas' book will contain 100 examples demonstrating that desktop-publishing-quality charts can be produced with R, in some cases with the help of LaTeX. Indeed, each example takes about 40 lines of code and uses only the base R graphics system, not grid or any add-ons such as lattice or ggplot2.

The book's accompanying web site gives you a taster already. The book itself will be published by Open Source Press next month.

The Schnitzel

Of course the evening ended with Schnitzel and Kölsch at the Lux.

 The Luxus Schnitzel. Photo by Günter Faes

Next Kölner R meeting

The next meeting is scheduled for 26 February 2014 (Wednesday before Altweiber), with two talks by Diego de Castillo (Connecting R with databases) and Kim Kuen Tang (R and kdb+).

Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.

Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.

Next Kölner R User Meeting: 13 December 2013

Quick reminder: The next Cologne R user group meeting is scheduled for this Friday, 13 December 2013. We are delighted to welcome:
• Matt Dowle and Arun Srinivasan (data.table)
• Thomas Rahlf (Datendesign mit R)
Further details and the agenda are available on our KölnRUG Meetup site.

Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their vector programme.


R in Insurance Conference, London, 14 July 2014

Following the very positive feedback that Andreas and I have received from delegates of the first R in Insurance conference in July of this year, we are planning to repeat the event next year. We have already reserved a bigger auditorium.

The second conference on R in Insurance will be held on Monday 14 July 2014 at Cass Business School in London, UK.

This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation. Topics covered may include actuarial statistics, capital modelling, pricing, reserving, reinsurance and extreme events, portfolio allocation, advanced risk tools, high-performance computing, econometrics and more. All topics will be discussed within the context of using R as a primary tool for insurance risk management, analysis and modelling.

The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.

Invited talks will be given by:
• Arthur Charpentier, Département de mathématiques Université du Québec à Montréal
• Montserrat Guillen, Dept. Econometrics University of Barcelona together with Leo Guelman, Royal Bank of Canada (RBC Insurance division)
The members of the scientific committee are: Katrien Antonio (University of Amsterdam and KU Leuven), Christophe Dutang (Université du Maine, France), Jens Nielsen (Cass), Andreas Tsanakas (Cass) and Markus Gesmann (ChainLadder project).

Details about the registration and abstract submission process will be published soon on www.RinInsurance.com.

The organisers, Andreas Tsanakas and Markus Gesmann, gratefully acknowledge the sponsorship of Mango Solutions, RStudio, Cybaea and PwC.

Not only verbs but also beliefs can be conjugated

Following on from last week, where I presented a simple example of a Bayesian network with discrete probabilities to predict the number of claims for a motor insurance customer, I will look at continuous probability distributions today. Here I follow example 16.17 in Loss Models: From Data to Decisions [1].

Suppose there is a class of risks that incurs random losses following an exponential distribution (density $$f(x) = \Theta {e}^{- \Theta x}$$) with mean $$1/\Theta$$. Further, I believe that $$\Theta$$ varies according to a gamma distribution (density $$f(x)= \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha \,-\, 1} e^{- \beta x }$$) with shape $$\alpha=4$$ and rate $$\beta=1000$$.

In the same way as I had good and bad drivers in my previous post, here I have clients with different characteristics, reflected by the gamma distribution.

The textbook tells me that the unconditional mixed distribution of an exponential distribution with parameter $$\Theta$$, whereby $$\Theta$$ has a gamma distribution, is a Pareto II distribution (density $$f(x) = \frac{\alpha \beta^\alpha}{(x+\beta)^{\alpha+1}}$$) with parameters $$\alpha,\, \beta$$. Its k-th moment is given in the general case by
$$E[X^k] = \frac{\beta^k\Gamma(k+1)\Gamma(\alpha - k)}{\Gamma(\alpha)},\; -1 < k < \alpha.$$ Thus, I can calculate the prior expected loss ($$k=1$$) as $$\frac{\beta}{\alpha-1}=\,$$333.33.

Now suppose I have three independent observations, namely losses of $100, $950 and $450 over the last 3 years. The mean loss is $500, which is higher than the $333.33 of my model.

Question: How should I update my belief about the client's risk profile to predict the expected loss cost for year 4 given those 3 observations?

Visually I can regard this scenario as a graph, with evidence set for years 1 to 3 that I want to propagate through to year 4.
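Because the gamma distribution is the conjugate prior of the exponential, the textbook update can be checked in a few lines of base R. This is a sketch of the maths only; the graph-based propagation is a separate matter:

```r
alpha  <- 4; beta <- 1000     # prior gamma parameters for Theta
losses <- c(100, 950, 450)    # observed losses in years 1 to 3

# Conjugacy: the posterior of Theta is gamma again, with updated parameters
alpha_post <- alpha + length(losses)  # shape: 4 + 3 = 7
beta_post  <- beta + sum(losses)      # rate: 1000 + 1500 = 2500

# The expected loss of the Pareto II mixture is beta / (alpha - 1)
beta / (alpha - 1)            # prior:     333.33
beta_post / (alpha_post - 1)  # posterior: 416.67
```

The three observations move the expected loss for year 4 from 333.33 towards the observed mean of 500, settling at 416.67.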

Predicting claims with a Bayesian network

Here is a little Bayesian Network to predict the claims for two different types of drivers over the next year, see also example 16.15 in [1].

Let's assume there are good and bad drivers. The probabilities that a good driver will have 0, 1 or 2 claims in any given year are set to 70%, 20% and 10%, while for bad drivers the probabilities are 50%, 30% and 20% respectively.

Further I assume that 75% of all drivers are good drivers and only 25% would be classified as bad drivers. Therefore the average number of claims per policyholder across the whole customer base would be:
0.75*(0*0.7 + 1*0.2 + 2*0.1) + 0.25*(0*0.5 + 1*0.3 + 2*0.2) = 0.475
Now a customer of two years asks for his renewal. Suppose he had no claims in the first year and one claim last year. How many claims should I predict for next year? Or in other words, how much credibility should I give him?

To answer the above question I present the data here as a Bayesian network using the gRain package [2]. I start with the probability table for the driver type and the conditional probability tables for 0, 1 and 2 claims in years 1 and 2. As I assume independence between the years, I set the same probabilities for both years. I can now review my model as a mosaic plot (above) and as a graph (below) as well.

Next, I set the client's evidence (0 claims in year one and 1 claim in year two) and propagate it back through my network to estimate the probabilities that the customer is either a good (73.68%) or a bad (26.32%) driver. Knowing that a good driver has on average 0.4 claims a year and a bad driver 0.7 claims, I predict the number of claims for my customer with the given claims history as 0.4789.

Alternatively I could have added a third node for year 3 and queried the network for the probabilities of 0, 1 or 2 claims given that the customer had zero claims in year 1 and one claim in year 2. The sum product of the number of claims and probabilities gives me again an expected claims number of 0.4789.
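The propagated numbers can be verified with Bayes' theorem in plain base R (a sketch only; the post itself builds the network with gRain):

```r
prior    <- c(good = 0.75, bad = 0.25)
p_claims <- rbind(good = c(0.7, 0.2, 0.1),  # P(0, 1, 2 claims | driver type)
                  bad  = c(0.5, 0.3, 0.2))

# Likelihood of the claims history: 0 claims in year 1, 1 claim in year 2
lik  <- p_claims[, 1] * p_claims[, 2]
post <- prior * lik / sum(prior * lik)
round(post, 4)  # good: 0.7368, bad: 0.2632

# Posterior-weighted expected number of claims for the next year
expected <- p_claims %*% (0:2)  # 0.4 claims p.a. for good, 0.7 for bad
sum(post * expected)            # 0.4789
```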

References

[1] Klugman, S. A., Panjer, H. H. & Willmot, G. E. (2004), Loss Models: From Data to Decisions, Wiley Series in Probability and Statistics.

[2] Søren Højsgaard (2012). Graphical Independence Networks with the gRain Package for R. Journal of Statistical Software, 46(10), 1-26. URL http://www.jstatsoft.org/v46/i10/

Session Info

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Rgraphviz_2.6.0 gRain_1.2-2     gRbase_1.6-12   graph_1.40.0

loaded via a namespace (and not attached):
[1] BiocGenerics_0.8.0 igraph_0.6.6       lattice_0.20-24    Matrix_1.1-0
[5] parallel_3.0.2     RBGL_1.38.0        stats4_3.0.2       tools_3.0.2

googleVis 0.4.7 with RStudio integration on CRAN

In my previous post, I presented a preview version of googleVis that provided an integration with RStudio's Viewer pane (introduced with version 0.98.441).

Over 80% in my little survey favoured the new default output mechanism of googleVis within RStudio. Hence, I uploaded googleVis 0.4.7 to CRAN over the weekend.

However, there were also some thoughtful comments suggesting that the RStudio Viewer pane is not always the best option. Indeed, Flash charts and gvisMerge output will still be displayed in your default browser. Also, if you work on larger charts or with a smaller screen, the browser might be the better option compared to the Viewer pane; of course, you can launch the browser from the Viewer pane as well.

Hence, googleVis gained a new option 'googleVis.viewer' that controls the default output of the googleVis plot method. On package load it is set to getOption("viewer"); if you use RStudio, its Viewer pane will then display non-Flash and un-merged charts. You can set options("googleVis.viewer" = NULL) and the googleVis plot function will open all output in the default browser again. Thanks to J.J. from RStudio for the tip.
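In code, the new option can be toggled like this (a minimal sketch, using only the option name described above):

```r
# Default on package load: use the RStudio Viewer pane, if R runs in RStudio
options("googleVis.viewer" = getOption("viewer"))

# Revert to the system browser for all googleVis output
options("googleVis.viewer" = NULL)
```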

The screen shot below shows a geo chart in the RStudio Viewer pane, displaying the track of typhoon Haiyan, which devastated Southeast Asia last week.

Session Info

RStudio v0.98.456 and R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.0.2


The preview version 0.98.441 of RStudio introduced a new viewer pane to render local web content and with that it allows me to display googleVis charts within RStudio rather than in a separate browser window.

I think this is a rather nice feature and hence I have updated the plot method in googleVis to use the RStudio Viewer pane as the default output. If you use another editor, or if the plot uses one of the Flash-based charts, then the browser is still the default display.

The behaviour can also be controlled via the option viewer. Set options("viewer"=NULL) and googleVis will plot all output in the browser again.

Of course shiny apps can also run in the viewer pane. Here is the example of the renderGvis help page of googleVis. For more information about the new viewer pane see the online RStudio documentation.

For the time being you can get the next version 0.4.6 of googleVis from our project site only. Please get in touch if you find any issues or bugs with this version, or add them to our issues list.

Is this a step in the right direction? Please use the voting buttons below.

Session Info

R Under development (unstable) (2013-10-25 r64109)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.1.0 

High resolution graphics with R

For most purposes PDF or other vector graphics formats such as Windows metafile and SVG work just fine. However, if I plot lots of points, say 100k, then those files can get quite large, and bitmap formats like PNG can be the better option. I just have to be mindful of the resolution.

As an example I create the following plot:
x <- rnorm(100000)

Saving the plot as a PDF creates a 5.2 MB file on my computer, while the PNG output is only 62 KB. Of course, the PNG doesn't look as crisp as the PDF file.
png("100kPoints72dpi.png", units="px", width=400, height=400)
plot(x, main="100,000 points", col=adjustcolor("black", alpha=0.2))
dev.off()

Hence, I increase the resolution to 150 dots per inch (dpi).
png("100kHighRes150dpi.png", units="px", width=400, height=400, res=150)
plot(x, main="100,000 points", col=adjustcolor("black", alpha=0.2))
dev.off()

This looks a bit odd. The file size is only 29 KB, but the annotations look too big. Well, the file still has only 400 x 400 pixels, while text and symbols are sized in physical units, so at 150 dpi they take up more of those pixels. Thus, I have to provide more pixels, or in other words increase the plot size. Doubling the width and height as I double the resolution makes sense.
png("100kHighRes150dpi2.png", units="px", width=800, height=800, res=150)
plot(x, main="100,000 points", col=adjustcolor("black", alpha=0.2))
dev.off()

Next I increase the resolution further to 300 dpi and the graphic size to 1600 x 1600 pixels. The chart is still very crisp. Of course the file size increased; now it is 654 KB, yet still only about 1/8 of the PDF, and I can embed it in LaTeX as well.
png("100kHighRes300dpi.png", units="px", width=1600, height=1600, res=300)
plot(x, main="100,000 points", col=adjustcolor("black", alpha=0.2))
dev.off()

Note, you can click on the charts to access the original files of this post.

Review: Kölner R Meeting 18 October 2013

The Cologne R user group met last Friday for two talks, on split-apply-combine in R and on XLConnect, given by Bernd Weiß and Günter Faes respectively, before the usual Schnitzel and Kölsch at the Lux.

Split apply combine in R

The apply family of functions in R is incredibly powerful, yet for newcomers often somewhat mysterious. Thus, Bernd gave an overview of the different apply functions and their cousins. The various functions differ in their input objects, e.g. vectors, arrays, data frames or lists, and in their outputs. Other related functions are by, aggregate and ave. While functions like aggregate reduce the output size, others like ave return as many rows as the input object and repeat the results where necessary.
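A small made-up example illustrates the reduce-versus-repeat distinction:

```r
df <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 5))

# aggregate reduces the output: one row per group
aggregate(x ~ g, data = df, FUN = mean)
#   g x
# 1 a 2
# 2 b 5

# ave repeats the group result: one value per input row
ave(df$x, df$g, FUN = mean)
# [1] 2 2 5
```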

As an alternative to the base R functions, Bernd also touched on the **ply functions of the plyr package. The function names are certainly easier to remember, but their syntax can be a little awkward (.()). Bernd's slides, in German, are already available from our Meetup site.

XLConnect

When dealing with data stored in spreadsheets, most members of the group rely on read.csv and write.csv in R. However, if you have a spreadsheet with multiple tabs and formatted numbers, read.csv becomes clumsy, as you would have to save each tab without any formatting in a separate file.

Günter presented the XLConnect package as an alternative to read.csv, or indeed RODBC, for reading spreadsheet data. It uses the Apache POI API as the underlying interface. XLConnect requires a Java runtime environment on your computer, but no installation of Excel. That makes it a truly platform-independent solution for exchanging data between spreadsheets and R. Not only can you read defined rows and columns, or indeed named ranges, from Excel into R, but in the same way data can be written back to Excel files and, to top it all, so can graphics output from R.

Next Kölner R meeting

The next meeting is scheduled for 13 December 2013. A discussion of the data.table package is already on the agenda.

Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.

Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.

Next Kölner R User Meeting: 18 October 2013

Quick reminder: The next Cologne R user group meeting is scheduled for this Friday, 18 October 2013. We will discuss and hear about the apply family of functions and the XLConnect package. Further details and the agenda are available on our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

Thanks to Revolution Analytics, who sponsors the Cologne R user group as part of their vector programme.

Why models need a certain culture to flourish

About half a year ago Ian Branagan, Chief Risk Officer of Renaissance Re, a Bermudian reinsurance company with a focus on property catastrophe insurance, gave a talk about the use of models in risk management and how they have evolved over the last twenty years. Ian's presentation, titled after the famous quote by George E. P. Box, "All models are wrong, but some are useful", was part of the lunchtime lecture series of talks at Lloyd's, organised by the Insurance Institute of London.

I re-discovered the talk online over the weekend and found it most enlightening again.

So, what makes models useful? And here I mean models that estimate extreme outcomes / percentiles. Three factors are critical, according to Ian, to embed models successfully in risk management and decision making processes.
1. Need - A clear defined need for the model.
2. Capabilities - The skills and resources to build and maintain the model.
3. Culture - An organisational culture that embraces, understands and challenges the model.
The need, if not driven internally, is often imposed by external requirements such as regulation; in many countries, for instance, banks and insurers have to use models to estimate the risk of insolvency. Building capabilities can largely be achieved by investing in people, technology and data. However, the last factor, culture, is, according to Ian, often the most challenging one. Changing business processes, particularly decision making at senior level, requires people to change.

Where in the past senior management may have relied on advisors' expert judgement to guide their decision making, they now have to use models in a similar way. I suppose that, just as it takes time and effort to build effective relationships with people, the same is true for models. And equally, decisions should never rely purely on other people's opinions or indeed on model output alone. As Ian put it: outsourcing all modelling and thinking, and with that the decision making, to vendors of models, such as catastrophe modelling companies or rating agencies, who both aim to provide probabilities for extreme events (catastrophes and company failures), may be sufficient to tick a risk management box, but can ultimately put the company at risk if model assumptions and limitations are not well understood.

Perhaps we are at the dawn of another enlightenment? Recall Kant's first sentence of his essay What is enlightenment?: "Enlightenment is man's emergence from his self-incurred immaturity." Indeed, it doesn't matter if we use experts' opinions or the output of models, relying blindly on them is dangerous and foolish. Don't stop thinking for yourself. Be critical! Remember, all models are wrong, but some are useful.

Creating a matrix from a long data.frame

There can never be too many examples for transforming data with R. So, here is another example of reshaping a data.frame into a matrix.

Here I have a data frame that shows incremental claim payments over time for different loss occurrence (origin) years.

The format of the data frame above is how this kind of data is usually stored in a data base. However, I would like to see the payments of the different origin years in rows of a matrix.

The first idea might be to use the reshape function, but that would return a data.frame. Yet, it is actually much easier with the matrix function itself. Most of the code below is about formatting the dimension names of the matrix. Note that I use the with function to save me a bit of typing.

An elegant alternative to matrix is provided by the acast function of the reshape2 package. It has a nice formula argument and allows me not only to specify the aggregation function, but also to add the margin totals.
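As the original data is not reproduced here, the following sketch uses a small made-up set of incremental payments to illustrate both routes:

```r
# Made-up incremental claims payments in long format
claims <- data.frame(
  originf  = factor(rep(2010:2012, each = 3)),  # origin year
  dev      = rep(1:3, 3),                       # development period
  inc.paid = c(100, 50, 20, 110, 55, 22, 120, 60, 24))

# Base R: matrix does the reshaping; most of the code deals with dimnames
(m <- with(claims,
           matrix(inc.paid, nrow = nlevels(originf), byrow = TRUE,
                  dimnames = list(origin = levels(originf), dev = 1:3))))

# reshape2 alternative: a formula, an aggregation function and margins
# acast(claims, originf ~ dev, sum, value.var = "inc.paid", margins = TRUE)
```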

Changing the width of bars and columns in googleVis

Changing the plotting width in bar-, column- and combo-charts of googleVis works identically and is controlled by the bar.groupWidth argument. The dot in the argument name means that in R it has to be split into bar="{groupWidth:'10%'}".

Example

library(googleVis)
## Using the Population demo data set that ships with googleVis
cc <- gvisComboChart(Population[1:10, ],
                     xvar="Country", yvar="Population",
                     options=list(seriesType="bars", legend="top",
                                  bar="{groupWidth:'10%'}",
                                  width=500, height=450),
                     chartid="thincolumns")
plot(cc)

Session Info

R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.0.1 

Using panel.groups in lattice

Last Tuesday I attended the LondonR user group meeting, where Rich and Andy from Mango argued about the better package for multivariate graphics with R: lattice vs. ggplot2.

As part of their talk they had a little competition in visualising London Underground performance data; see their slides. Both made heavy use of the respective panelling/faceting capabilities. Additionally, Rich used the panel.groups argument of xyplot to control the content of each panel in fine detail. Brilliant! I had never used this argument before. So, here is a silly example with the iris data set to remind myself of panel.groups in the future.
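The example below is my own sketch, not Rich's code: panel.groups is called once per group within each panel, so every group can be drawn differently.

```r
library(lattice)

# Points for every group, plus a regression line with a group-specific
# line type, drawn separately for each group in each panel
p <- xyplot(Sepal.Width ~ Sepal.Length | Species, data = iris,
            groups = Petal.Width > 1.5,
            panel = panel.superpose,
            panel.groups = function(x, y, group.number, ...) {
              panel.xyplot(x, y, ...)
              if (length(x) > 1) panel.lmline(x, y, lty = group.number)
            })
print(p)
```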

ave and the "[" function in R

The ave function in R is one of those little helper functions I feel I should be using more often. Investigating its source code showed me another twist of R and the "[" function. But first let's look at ave.

The top of ave's help page reads:

Group Averages Over Level Combinations of Factors

Subsets of x[] are averaged, where each subset consist of those observations with the same factor levels.

As an example I look at revenue data by product and shop.
revenue <- c(30, 20, 23, 17)
product <- gl(2, 1, 4, labels=c("bread", "cake"))
shop <- gl(2, 2, labels=c("shop_1", "shop_2"))

To answer the question "Which shop sells proportionally more bread?" I need to divide the revenue vector by the sum of revenue per shop, which can be calculated easily by ave:
(shop_revenue <- ave(revenue, shop, FUN=sum))
# [1] 50 50 40 40
(revenue_split_in_shop <- revenue/shop_revenue)
# [1] 0.600 0.400 0.575 0.425  # shop 1 sells proportionally more bread

In other words, ave has to split the revenue vector by shop and apply the sum function to it. Well that's exactly what it does. Here is the source code of ave:
#  Copyright (C) 1995-2012 The R Core Team
ave <- function (x, ..., FUN = mean)
{
    if (missing(...))
        x[] <- FUN(x)
    else {
        g <- interaction(...)
        split(x, g) <- lapply(split(x, g), FUN)
    }
    x
}
However, and this is what intrigued me, if I don't provide a grouping variable (missing(...)), it applies the function FUN to x itself and writes the output to x[]. That's actually what the help file of ave mentions in its description. So what does it do? Here is an example again:
ave(revenue, FUN=sum)
# [1] 90 90 90 90
I get the sum of revenue repeated as many times as the vector has elements, not just once as with sum(revenue). The trick is that the output of FUN(x) is written into x[], which is of course itself the output of a function call: "["(x).

I think it is the following sentence in the help file of "[" (see ?"["), which explains it: Subsetting (except by an empty index) will drop all attributes except names, dim and dimnames.

So there we are. I feel less inclined to use ave more often, as it is just shorthand for the usual split/lapply routine, but I learned something new about the subtleties of R.

Doughnut chart in R with googleVis

The guys at Google continue to update and enhance the Chart Tools API. One recent addition is a pie chart with a hole, or as some call it: a doughnut chart.

Thankfully, the new functionality is achieved through new options for the existing pie chart, which means that these new features are available in R via googleVis as well, without the need to write new code.

Doughnut chart example

With the German election coming up soon, here is the composition of the current parliament.
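A sketch of how this could look with googleVis; pieHole is the relevant new Google Charts option, and the seat numbers below are the approximate composition of the outgoing 17th Bundestag:

```r
library(googleVis)

parliament <- data.frame(
  party = c("CDU/CSU", "SPD", "FDP", "Linke", "Gruene"),
  seats = c(239, 146, 93, 76, 68))

# pieHole between 0 and 1 sets the relative size of the hole
doughnut <- gvisPieChart(parliament, labelvar = "party", numvar = "seats",
                         options = list(pieHole = 0.4,
                                        width = 500, height = 400))
# plot(doughnut)  # opens the chart in the browser / Viewer pane
```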

Session Info

R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.0.1 

googleVis 0.4.4 released with new formatting options for tables

Over the weekend googleVis 0.4.4 found its way to CRAN. The function gvisTable gained a new argument formats that allows users to define the number formats of the columns displayed in a table. Thanks to J. Buros, who contributed the code.
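A minimal sketch with made-up data; the format strings follow Google's number format patterns:

```r
library(googleVis)

dat <- data.frame(Year    = c(2010, 2011, 2012),
                  Revenue = c(1234567.8, 2345678.9, 3456789.0))

# formats maps column names to format patterns
tbl <- gvisTable(dat, formats = list(Year    = "####",
                                     Revenue = "#,###.00"))
# plot(tbl)  # opens the table in the browser / Viewer pane
```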

Session Info

R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.0.1 

Version 0.1.6 of the ChainLadder package has been released and is already available from CRAN.

The new version adds the function CLFMdelta. CLFMdelta finds consistent weighting parameters delta for a vector of selected age-to-age chain-ladder factors for a given run-off triangle.

The added functionality was implemented by Dan Murphy, who is the co-author of the paper A Family of Chain-Ladder Factor Models for Selected Link Ratios by Bardis, Majidi, Murphy. You can find a more detailed explanation with R code examples on Dan's blog; see also his slides from the CAS spring meeting.

 Slides by Dan Murphy

Installing a SSD drive into a mid-2007 iMac

I have a mid-2007 iMac with a 2.4 GHz Core2Duo processor and despite the fact that it is already six years old, it still does a good job. However, compared to a friend's recent MacBook Air with a solid state disk (SSD) it felt sluggish when opening programmes and loading larger documents.

So, I thought it would be worthwhile to replace the old spinning hard disk drive with an SSD instead of buying a new computer; I still like the display of the iMac. Hence, I got myself a Samsung 840 SATA drive, which came with a USB cable and a hard drive bracket.

It actually wasn't that difficult to replace the hard drive in my iMac. Of course I can give no guarantee that this works for you as well. Here are the steps I took:
So far I am really happy with the new SSD. Applications are opening much faster and overall the computer feels much snappier. It looks like it can serve me quite a little longer.

Here are a few pictures of the surgery:

 iMac with glass panel and bezel removed
 Display and hard drive removed
 New SSD and old HDD

 Screen shot of the new system preferences

I posted earlier about the various googleVis axis options for base charts, such as line, bar and area charts, but I somehow forgot to mention how to set the axis limits.

Unfortunately, there are no arguments such as ylim and xlim. Instead, the Google Charts axis options are set via hAxes and vAxes, with h and v indicating the horizontal and vertical axis. More precisely, I have to set viewWindowMode : 'explicit' and the viewWindow to the desired min and max values. Additionally, I have to wrap all of this in [{}] brackets, as those settings are sub-options of hAxes/vAxes. There are also the options minValue and maxValue, but they only allow you to extend the axis ranges.

Here is a minimal example, setting the y-axis limits from 0 to 10:
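The original chart is not reproduced here, so this is a sketch with made-up data along the lines described above:

```r
library(googleVis)

df <- data.frame(x = 1:10, y = sin(1:10) * 4 + 5)

line <- gvisLineChart(df, xvar = "x", yvar = "y",
                      options = list(
                        vAxes = "[{viewWindowMode:'explicit',
                                   viewWindow:{min:0, max:10}}]"))
# plot(line)  # opens the chart in the browser / Viewer pane
```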

With more than one variable to plot I can use the series argument to decide which variables I want on the left and right axes and set the viewWindow accordingly. Again, here is a minimal example:
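Again a sketch with made-up data, assigning one variable to each axis via targetAxisIndex:

```r
library(googleVis)

df2 <- data.frame(x = 1:10, small = 1:10, large = (1:10) * 100)

line2 <- gvisLineChart(df2, xvar = "x", yvar = c("small", "large"),
                       options = list(
                         series = "[{targetAxisIndex:0}, {targetAxisIndex:1}]",
                         vAxes = "[{viewWindowMode:'explicit',
                                    viewWindow:{min:0, max:10}},
                                   {viewWindowMode:'explicit',
                                    viewWindow:{min:0, max:1000}}]"))
# plot(line2)  # opens the chart in the browser / Viewer pane
```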

Session Info

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3

R in Insurance: Presentations are online

The programme and the presentation files of the first R in Insurance conference have been published on GitHub.

 Front slides of the conference presentations

Additionally to the slides many presenters have made their R code available as well:
• Alexander McNeil shared the examples of the CreditRisk+ model he presented.
• Lola Miranda made a Windows version of the double chain-ladder package DCL available via the Cass knowledge web site.
• Alessandro Carrato's 1-year re-reserving code is hosted on the ChainLadder project web site.
• Giorgio Spedicato's life contingencies package is on CRAN already.
• Simon Brickman and Adam Rich's HTML presentation and underlying R code for Automated Reporting is included in the GitHub repository.
• Stefan Eppert pointed out that KatRisk published an illustrative catastrophe model in R.
• Hugh Shanahan's code to integrate R with Azure for high-throughput analysis is on GitHub.
Thanks again to all the presenters who have helped to make the event a success.

Hopefully, we see you again next year!

Review: Kölner R Meeting 19 July 2013

Despite the hot weather and the beginning of the school holiday season in North Rhine-Westphalia, the Cologne R user group met yet again for two fascinating talks, and beer and Schnitzel afterwards.

Analysing Twitter data to evaluate the US Dollar / Euro exchange rates

Dietmar Janetzko presented ideas to forecast US Dollar / Euro exchange rate movements for the following day.

To forecast exchange rate movements, Dietmar distinguishes two schools of thought. The first one is based on fundamental analysis, e.g. figures for GDP, debt, unemployment, etc., while the other one is based on news, e.g. announcements from central bankers such as Ben Bernanke and from other industry experts.

While the data for fundamental analysis is usually updated slowly, e.g. annually or quarterly, news can arrive at higher frequency and less regularly. As a result, the forecasting horizon in a very liquid market such as the forex market can vary from one minute or less to a year or a decade.

Dietmar's aim was to forecast the exchange rate for the following day and to outperform the forecast of a random walk. For his experiment he used daily exchange rates from Quandl, which has a nice R interface, and Twitter data from topsy, which gives him access to Twitter's 'firehose'.

For his analysis Dietmar focused on the number of tweets of the terms Euro + Crisis + <concept>, whereby he used a dictionary of nearly 600 different concept words.

His training algorithms used functions from the forecast, caret and car packages, looking for predictors that produce a smaller forecast error than a random walk.
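As a rough illustration of the general approach (with simulated data standing in for real tweet counts and exchange rate returns; all variable names and numbers here are made up), one can compare a simple regression forecast against the random walk benchmark in base R:

```r
set.seed(42)
n <- 250                                            # pretend 250 trading days
d <- data.frame(tweets = rpois(n, lambda = 20))     # hypothetical tweet counts
d$ret <- 0.01 * scale(d$tweets)[, 1] + rnorm(n, sd = 0.005)  # toy FX returns
train <- 1:200; test <- 201:n

fit  <- lm(ret ~ tweets, data = d[train, ])         # tweet counts as predictor
pred <- predict(fit, newdata = d[test, ])

rmse <- function(e) sqrt(mean(e^2))
rmse.model <- rmse(d$ret[test] - pred)
rmse.rw    <- rmse(d$ret[test])  # a random walk predicts a zero daily return
rmse.model < rmse.rw             # does the model beat the benchmark?
```

In practice one would of course use real tweet and exchange rate data and more sophisticated model selection, as Dietmar did.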

It goes without saying that Dietmar hasn't made millions from his algorithms yet, but the discussion at the end of his presentation will hopefully have given him a few pointers to do just that.

Graphs in R

Afshin Sadeghi, who has a background in Steiner tree methods for Protein-Protein interaction networks, gave an overview of the various graph packages in R. He started his talk with a little overview of the graph terminology of nodes, edges, trees, directed and undirected graphs.

Afshin then gave a brief overview of the various graph packages in R and the different visualisation options. The most popular package seems to be igraph, maintained by Gabor Csardi. Although different packages sometimes use different graph objects, there are often conversion tools available, e.g. igraph.to.graphNEL, allowing users to pick the best algorithms from all packages.
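A minimal sketch, assuming the igraph package is installed (the toy graph is made up for illustration):

```r
library(igraph)
# a small undirected graph: a triangle A-B-C with a pendant vertex D
g <- graph_from_literal(A - B, B - C, C - A, C - D)
degree(g)  # C has the highest degree (3)

# To use algorithms from packages built on the Bioconductor 'graph' class,
# convert the igraph object to a graphNEL object; in current igraph versions
# the conversion function is called as_graphnel() (formerly igraph.to.graphNEL):
# gnel <- as_graphnel(g)
```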

You can access Afshin's slides via our Meetup site.

Next Kölner R meeting

The next meeting has been scheduled for 18 October 2013.

Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.

Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.

Quick review: R in Insurance Conference

Yesterday the first R in Insurance conference took place at Cass Business School in London.

I think the event went really well, but as a member of the organising committee my view is probably skewed. Still, we had a variety of talks, a full house, a great conference dinner and to top it all, the Tower Bridge opened while we had our drinks at the end of the evening.

I will post a more complete review in the future, with links to the presentation files and R code, once we have had a chance to collate all the information.

Many thanks again to all who helped to make this event a success, particularly Andreas Tsanakas at Cass and to our sponsors Mango Solutions and CYBAEA.

 Lecture room

 Breakout area

 Conference dinner at Cantina del Ponte

 Tower Bridge

Today Diego and I will give our googleVis tutorial at useR!2013 in Albacete, Spain.

We will cover:
• Introduction and motivation
• Case studies

There is definitely R in July

The useR!2013 conference in Albacete, Spain, will commence next Wednesday, 10 July, and on the day before Diego and I will give a googleVis tutorial.

The following Monday, 15 July, the first R in Insurance event will take place at Cass Business School and I am absolutely delighted with the programme and the fact that we are sold out.

On Tuesday, 16 July, the LondonR user group meets in the City, awaiting presentations by Andrie de Vries (Revolution Analytics), Rich Pugh (Mango Solutions) and Hadley Wickham (RStudio).

Finally on Friday, 19 July, the next Cologne R user group meeting is scheduled with two talks: Predicting the Euro/Dollar exchange rates with Twitter (Dietmar Janetzko) and Networks in R using igraph (Afshin Sadeghi).

Talking data: Building interactive relationships with data and colleagues

Last week I had the honour to give the opening keynote talk at the Talking Data South West
conference, organised by the Exeter Initiative for Statistics and its Applications. The event was chaired by Steve Brooks and brought together over 100 people to discuss all aspects of data: from collection and analysis through to visualisation and communication.

 Building interactive relationships with data and colleagues

The programme was very good with a variety of talks such as How data collection from smart phones can improve agronomic decision making in potato crops by Robert Allen or Spatial data and analysis in the improvement of aquatic ecosystem health and drinking water quality by Nick Palling. I also liked Richard Everson's presentation on Visualising and understanding multi-criterion league tables, which showed new ideas to create rankings.

However, my highlight was Alan Smith's talk on Information for the Masses: Using Visualisation to Engage the Public. Alan heads up the Data Visualisation Unit at the Office for National Statistics and they created some fantastic online visualisation tools. He presented some interactive examples of the Census 2011 data set. Alan mentioned the hilarious story of a young lady, who had moved from Leeds to Elmbridge and used the Census data to find out why her new home was so dull compared to her old.

 Screen shot of the 2011 Census comparator

The screen shot shows the age distribution of Leeds (left) and Elmbridge (right). Focus on the 20-24 year old age group and you'll get the joke.

googleVis 0.4.3 released with improved Geocharts

The Google Charts Tools provide two kinds of heat map charts for geographical data, the Flash based Geomap and the HTML5/SVG based Geochart.

I prefer the Geochart as it doesn't require Flash, but so far there have been two shortcomings with it: I couldn't add additional tooltip information and the default Mercator projection shows Greenland the size of Africa. Both of those issues seem to have been resolved by Google. Although the features aren't officially documented or released yet, Mitchell Foley from the Google Chart Tools team already presented the new developments at the Google I/O 2013 conference in May.

With version 0.4.3 of googleVis, and thanks to John Muschelli, gvisGeoChart gained a new argument hovervar allowing users to add further information to the tooltip. Additionally, following the examples in Mitchell's presentation I can change the projection as well. The official release from Google shouldn't be too far away.

So, here are again the heat maps of countries' credit ratings from three American and one Chinese rating agency, sourced from Wikipedia. However, this time I use gvisGeoChart, setting the projection to Kavrayskiy VII and the tooltip to the actual rating letter(s), see the R code below.
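A sketch along those lines, assuming the googleVis package is installed; the mini data set here is made up for illustration (the original charts used ratings sourced from Wikipedia):

```r
library(googleVis)
# hypothetical ratings: a numeric scale drives the colour,
# the rating letters appear in the tooltip via hovervar
ratings <- data.frame(Country   = c("Germany", "Brazil", "India"),
                      RatingNum = c(1, 3, 4),
                      Rating    = c("AAA", "BBB", "BBB-"))
G <- gvisGeoChart(ratings, locationvar = "Country",
                  colorvar = "RatingNum", hovervar = "Rating",
                  options = list(projection = "kavrayskiy-vii",
                                 width = 600, height = 400))
# plot(G)  # opens the chart in a browser
```

Note that the projection option is one of the not-yet-documented features mentioned above, so its behaviour may change.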

R package development

Building R packages is not particularly hard, but it can be a bit of a daunting endeavour at the beginning, particularly if you are more of a statistician than a computer scientist or programmer.

Some concepts may appear foreign or like red tape, yet many of them evolved over time for a reason. They help you to stay organised, collaborate more effectively with others and write better code.

So, here are my slides of the R package development workshop at Lancaster University.

 R package development

For a detailed and authoritative reference on R package development see the Writing R Extensions manual on CRAN.

Interactive slides with googleVis on shiny

Following on from last week's post, here are my slides on using googleVis on shiny from the Advanced R workshop at Lancaster University, 21 May 2013.

Again, I wrote my slides in RMarkdown and I used slidify to create the HTML5 presentation. Unfortunately you may have to reload the slides that use googleVis on shiny as the JavaScript code in the background is potentially not ideal. Any pointers, which could help to improve the performance will be much appreciated.

Many of the examples in my slides are taken from my post First steps of using googleVis on shiny, however the presentation also demonstrates that it is possible to inject JavaScript code into a googleVis chart to trigger a shiny event, see also the example below.

Interactive presentation with slidify and googleVis

Last week I was invited to give an introduction to googleVis at Lancaster University. This time I decided to use the R package slidify for my talk. Slidify, like knitr, is built on Markdown and makes it very easy to create beautiful HTML5 presentations.

Separating content from layout is always a good idea. Markup languages such as TeX/LaTeX or HTML are built on this principle. Ramnath Vaidyanathan has done a fantastic job with slidify, as it is very straightforward to create presentations with R. There are a couple of advantages compared to traditional presentation software packages:
• RMarkdown helps me to focus on the content
• Integration of R code is built in
• HTML5 allows me to embed interactive content, such as
• Videos
• googleVis and other interactive charts
• shiny apps (more on this next week)
In the past I have used knitr in combination with pandoc to generate a slidy presentation. However, with slidify I can do all this directly in R. Better still, Ramnath provides a choice of different layout frameworks and syntax highlighting options. To top it all, publishing the slides on GitHub took only one more R statement: publish('mages', 'Introduction_to_googleVis').

I will give a half-day tutorial on googleVis with Diego de Castillo at useR!2013 in Albacete on 9 July 2013. I hope to see some of you there.

Don't be misguided by the beauty of mathematics, if the data tells you otherwise

I was trained as a mathematician and it was only last year, when I attended the Royal Statistical Society conference and met many statisticians that I understood how different the two groups are.

In mathematics you often start with some axioms, things you assume to be true, and these axioms are then the basis from which new theory is derived. In statistics, or more generally in science, you start with a theory, or better a hypothesis, and try to disprove it. And if you can't disprove it, you accept it until you have other evidence. Or to phrase it like Karl R. Popper: you can only be proven wrong.

Now, why do I mention this? I have met many mathematicians who talk about the beauty of mathematics and I agree, a mathematical concept, theorem or proof can indeed be beautiful. However, when you work in applied mathematics, and particularly when you use mathematics to build models, there is a danger that you stick to the beautiful idea and ignore reality. Remember the financial crisis?

For example, it might be handy to assume that your data follow a normal distribution, e.g. to make the calculations easier. However, if the data tells you otherwise, then be bold and ruthless and change your model. As strange as it might sound, it has to be your aim to prove that a model doesn't work in order to use it successfully.

Remember Pythagoras? He believed in beautiful integers and the realisation that the square root of two was not a fraction of two integers caused a big crisis.

I would argue that we need mathematics to do statistics and statistics to do science. The developments over the last 350 years really demonstrate the success of the scientific method. Of course some ideas had to go: the earth can no longer be regarded as the centre of our solar system - instead it appears more like a little pale blue dot.

Diggle and Chetwynd, from Lancaster University, published a nice little book that gives a good introduction to statistics and the scientific method. Two quotes from the book stuck in my mind (pages 1-2):

A scientific theory cannot be proved in the rigorous sense of a mathematical theorem. But it can be falsified, meaning that we can conceive of an experimental or observational study that would show the theory to be false.
...
The American physicist Richard Feynman memorably said that 'theory' was just a fancy name for a guess. If observation is inconsistent with theory then the theory, however elegant, has to go. Nature cannot be fooled.

Claims Inflation - a known unknown

Over the last year I worked with two colleagues of mine on the subject of inflation and claims inflation in particular. I didn't expect it to be such a challenging topic, but we ended up with more questions than answers. The key question and biggest challenge is to define what inflation, or indeed claims inflation actually is and how to measure it. We published a summary of our thoughts and findings in this month's issue of The Actuary.

Last year's discussion about the differences between the retail price index (RPI) and consumer price index (CPI) in the UK only exemplified the challenge. The economist Tim Harford illustrated the differences between the RPI and CPI with a simple example of price changes for a shirt and blouse in his Radio 4 programme More or Less. The radio podcast is still available from the BBC. Start listening after about 18 minutes into the show.

R in Insurance: Programme and Abstracts published

I am delighted to announce that the programme and abstracts for the first R in Insurance conference at Cass Business School in London, 15 July 2013, have been published.

The conference committee received strong abstracts from academia and the industry, covering:
• Pricing
• Reserving
• Data mining
• Capital modelling
• Automated reporting
• Catastrophe modelling
• High-performance computing
• Software development management
Register by the end of May to get the early bird booking fee.

We gratefully acknowledge the sponsorship of Mango Solutions and CYBAEA, without whom the event wouldn't be possible.

Programme and Abstracts

How to change the alpha value of colours in R

Often I like to reduce the alpha value (level of transparency) of colours to identify patterns of over-plotting when displaying lots of data points with R. So, here is a tiny function that allows me to add an alpha value to a given vector of colours, e.g. an RColorBrewer palette, using col2rgb and rgb, which has an alpha argument, in combination with the wonderful apply and sapply functions.

The example below illustrates how this function can be used with colours provided in different formats, thanks to the col2rgb function.
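A minimal sketch of such a function in base R (the function name add.alpha is my choice here):

```r
add.alpha <- function(col, alpha = 1) {
  ## convert each colour to its red/green/blue components in [0, 1],
  ## then rebuild it via rgb(), which accepts an alpha argument
  apply(sapply(col, col2rgb) / 255, 2,
        function(x) rgb(x[1], x[2], x[3], alpha = alpha))
}

# works with named colours, hex strings and palette output alike
add.alpha(c("red", "#0000FF", "darkgreen"), alpha = 0.2)
```

Plotting many points with such semi-transparent colours makes regions of over-plotting appear darker.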

Review: Kölner R Meeting 12 April 2013

Our 5th Cologne R user group meeting was the best attended meeting so far, with 20 members finding their way to the Institute of Sociology for two talks by Diego de Castillo on shiny and Stephan Holtmeier on cluster analysis, followed by beer and schnitzel at the Lux, a gastropub nearby.

Shiny

Diego gave an overview of the design principles behind shiny, which provides a powerful API to build web apps in pure R. His explanation of the reactive programming model was particularly helpful to understand how shiny works under the hood and why it is so responsive. His live demonstrations of shiny even included shiny server, which he had running in a virtual machine. Diego's slides are available via our Meetup site.

 Diego de Castillo: Introduction to shiny

You can hear more from Diego and me at the useR!2013 conference in Albacete, where we will give a googleVis tutorial. We will touch on googleVis on shiny as well. A dedicated shiny tutorial will be given in the afternoon by Josh and Winston from RStudio.

Cluster analysis

Stephan Holtmeier, who is a psychologist by background, presented an introduction to cluster analysis with R, motivated by his work in analysing survey data. As a toy example he used a 360° feedback survey of a group of managers within a big company. In his example he wanted to understand the profile of those managers better. Stephan illustrated how a cluster analysis can help to identify groups of managers with similar strengths, e.g. for communication, leadership and/or performance. Depending on how he measured the distance between managers he could look for people who have similar levels of competency or a similar profile (correlation). Stephan also touched on the differences between hierarchical and centroid based cluster analysis, such as k-means. You can find Stephan's slides (in German) also on our Meetup site.

 Stephan Holtmeier: Cluster Analysis with R

For more information on cluster analysis functions in R see also the cluster task view on CRAN. If you would like to get an overview of how psychologists look at data, then check out William Revelle's vignette of the psych package. Finally, if you are interested in how a k-means cluster analysis can be used for image manipulation, see an earlier post of mine.

Next Kölner R meeting, 19 July 2013

The next meeting has been scheduled for 19 July. Günter Faes will present his experiences using the XLConnect package as an interface between R and Excel. Dietmar Janetzko agreed to present how he used R and Twitter to predict exchange rate movements. Of course, the evening will close with a few Kölsch in a nearby beer-garden.

Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.

Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.

Test Driven Analysis?

At the last LondonR meeting Francine Bennett from Mastodon C shared some of her experience and findings from an analysis of a large prescriptions data set of the UK's national health service (NHS). However, it was her last slide, which I found the most thought provoking. It asked for the definition of the following term:
Test-driven analysis?
Francine explained that test driven development (TDD) is a concept often used in software development for quality assurance, and she wondered if a similar approach could also be used for data analysis. Unfortunately the audience couldn't provide her with the answer, but many expressed that they face similar challenges. So do I.

Indeed, how do I go about test driven analysis? How do I know that I haven't made a mistake, when I start an analysis of a new data set? Well, I don't. But I try to mitigate risks. Similar to TDD, I consider which outputs I should expect from my analysis. Those outputs form the test scenarios of my analysis. Basically I try to write down everything I know, before I start working with the data, e.g.
• any other data sets or reports I can use for cross referencing,
• any back-of-the-envelope analysis I can carry out to provide ballpark answers,
• any relativities and ratios which should hold true,
• any known boundaries and thresholds,
• test scenarios for my code with small well known data, for which I know the outcome,
• names of experts, who could sense check and peer review my output.
But most importantly: I try to think long and hard which questions I want to answer, following the advice of John Tukey: Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
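The checklist above can be turned into executable assertions that run before any modelling starts. A minimal base R sketch, using the built-in iris data as a stand-in for a new data set (the specific checks are made up for illustration):

```r
## sanity checks, written down before the analysis begins
d <- iris
stopifnot(
  nrow(d) > 0,                          # data actually loaded
  !any(is.na(d)),                       # no missing values expected here
  all(d$Sepal.Length > 0),              # known boundary: lengths are positive
  all(d$Sepal.Length > d$Sepal.Width),  # a relativity that should hold true
  nlevels(d$Species) == 3               # cross reference: three species
)
```

If any assumption is violated, stopifnot stops the script immediately, much like a failing unit test.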

How to set axis options in googleVis

Setting axis options in googleVis charts can be a bit tricky. Here I present two examples where I set several options to customise the layout of a line and combo chart with two axes.

The parameters have to be set in line with the Google Chart Tools API, which uses a JavaScript syntax. In googleVis chart options are set via a list in the options argument. Some of the list items can be a bit more complex, often wrapped in {} brackets, e.g. for various formatting options or in [] brackets, if there are multiple series to consider. Within those brackets sub-options are set via argument : value, using the : character for assignments.

There are many other options as part of the Google Chart Tools API, which are not supported by googleVis yet, such as columns roles, controls and dashboards, etc. Please get in touch if you have ideas in this regard and/or would like to collaborate.

In my first example I display two series of dummy data in a line chart with two axes. The left hand scale is in percentages and the right hand scale in amounts. Note in the code below how I set the various parameters and the placement of the different kinds of brackets.
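A sketch of such a two-axis line chart, assuming the googleVis package is installed (the dummy data and option values here are made up for illustration):

```r
library(googleVis)
df <- data.frame(year   = 2010:2013,
                 rate   = c(0.02, 0.025, 0.03, 0.028),  # left axis, percent
                 amount = c(100, 120, 150, 140))        # right axis, amounts
lc <- gvisLineChart(df, xvar = "year", yvar = c("rate", "amount"),
        options = list(
          # [] brackets: one {} item per series, assigning each to an axis
          series = "[{targetAxisIndex: 0},
                     {targetAxisIndex: 1}]",
          # one {} item per axis, with sub-options set via option : value
          vAxes  = "[{format: '#,###%', title: 'Rate'},
                     {format: '#,###',  title: 'Amount'}]",
          width = 600, height = 400))
# plot(lc)  # opens the chart in a browser
```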