# mages' blog

## How to change the alpha value of colours in R

Often I like to reduce the alpha value (level of transparency) of colours to identify patterns of over-plotting when displaying lots of data points with R. So here is a tiny function that adds an alpha value to a given vector of colours, e.g. an RColorBrewer palette. It uses col2rgb together with rgb, which has an argument for alpha, in combination with the wonderful apply and sapply functions.

The example below illustrates how this function can be used with colours provided in different formats, thanks to the col2rgb function.
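The original code did not survive in this copy of the post, but a minimal sketch of such a helper might look as follows; the function name `add.alpha` and its argument names are my own choice for illustration:

```r
## Add an alpha value to a vector of colours.
## col2rgb accepts colour names, hex strings and palette indices;
## rgb has an alpha argument, so combining the two does the trick.
add.alpha <- function(col, alpha = 1) {
  if (missing(col))
    stop("Please provide a vector of colours.")
  apply(sapply(col, col2rgb) / 255, 2,
        function(x) rgb(x[1], x[2], x[3], alpha = alpha))
}

## Works with colour names and hex strings alike;
## returns hex strings with the alpha channel appended
add.alpha(c("red", "#0000FF"), alpha = 0.5)
```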

## Review: Kölner R Meeting 12 April 2013

Our 5th Cologne R user group meeting was the best-attended meeting so far, with 20 members finding their way to the Institute of Sociology for two talks by Diego de Castillo on shiny and Stephan Holtmeier on cluster analysis, followed by beer and schnitzel at the Lux, a gastropub nearby.

### Shiny

Diego gave an overview of the design principles behind shiny, which provides a powerful API to build web apps in pure R. His explanation of the reactive programming model was particularly helpful to understand how shiny works under the hood and why it is so responsive. His live demonstrations of shiny even included shiny server, which he had running in a virtual machine. Diego's slides are available via our Meetup site.

 Diego de Castillo: Introduction to shiny

You can hear more from Diego and me at the useR! 2013 conference in Albacete, where we will give a googleVis tutorial and also touch on using googleVis with shiny. A dedicated shiny tutorial will be given in the afternoon by Josh and Winston from RStudio.

### Cluster analysis

Stephan Holtmeier, who is a psychologist by background, presented an introduction to cluster analysis with R, motivated by his work in analysing survey data. As a toy example he used a 360° feedback survey of a group of managers within a big company, whose profiles he wanted to understand better. Stephan illustrated how a cluster analysis can help to identify groups of managers with similar strengths, e.g. in communication, leadership and/or performance. Depending on how he measured the distance between managers, he could look for people with similar levels of competency (Euclidean distance) or with a similar profile (correlation-based distance). Stephan also touched on the differences between hierarchical and centroid-based cluster analysis, such as k-means. You can find Stephan's slides (in German) also on our Meetup site.
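To make the distinction between the two distance views concrete, here is a small sketch in base R; the ratings are invented for illustration and have nothing to do with Stephan's actual data:

```r
## Invented 360-degree feedback scores for six managers
set.seed(42)
ratings <- data.frame(
  communication = runif(6, 1, 5),
  leadership    = runif(6, 1, 5),
  performance   = runif(6, 1, 5),
  row.names     = paste0("manager", 1:6)
)

## Euclidean distance: managers with similar competency levels end up close
hc.level <- hclust(dist(ratings))

## Correlation-based distance: managers with a similar profile end up close,
## even if their overall levels differ
hc.profile <- hclust(as.dist(1 - cor(t(ratings))))

## Centroid-based alternative: k-means with two clusters
km <- kmeans(ratings, centers = 2)
km$cluster
```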

 Stephan Holtmeier: Cluster Analysis with R

For more information on cluster analysis functions in R see also the cluster task view on CRAN. If you would like to get an overview of how psychologists look at data, then check out William Revelle's vignette of the psych package. Finally, if you are interested in how a k-means cluster analysis can be used for image manipulation, see an earlier post of mine.

### Next Kölner R meeting, 19 July 2013

The next meeting has been scheduled for 19 July. Günter Faes will present his experiences using the XLConnect package as an interface between R and Excel. Dietmar Janetzko agreed to present how he used R and Twitter to predict exchange rate movements. Of course, the evening will close with a few Kölsch in a nearby beer-garden.

Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.

Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.

## Test Driven Analysis?

At the last LondonR meeting Francine Bennett from Mastodon C shared some of her experience and findings from an analysis of a large prescriptions data set of the UK's National Health Service (NHS). However, it was her last slide that I found the most thought-provoking. It asked for the definition of the following term:

Test-driven analysis?

Francine explained that test-driven development (TDD) is a concept often used in software development for quality assurance, and she wondered whether a similar approach could also be used for data analysis. Unfortunately the audience couldn't provide her with the answer, but many expressed that they face similar challenges. So do I.

Indeed, how do I go about test-driven analysis? How do I know that I haven't made a mistake when I start an analysis of a new data set? Well, I don't. But I try to mitigate risks. Similar to TDD, I consider which outputs I should expect from my analysis; those outputs form its test scenarios. Basically, I try to write down everything I know before I start working with the data, e.g.:
• any other data sets or reports I can use for cross-referencing,
• any back-of-the-envelope analysis I can carry out to provide ballpark answers,
• any relativities and ratios which should hold true,
• any known boundaries and thresholds,
• test scenarios for my code with small, well-known data, for which I know the outcome,
• names of experts who could sense-check and peer-review my output.
But most importantly: I try to think long and hard about which questions I want to answer, following the advice of John Tukey: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."
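In R, such expectations can be written down as executable checks before the real analysis starts, for example with stopifnot; the toy data set and the reference total below are invented for illustration:

```r
## A toy claims data set standing in for real input data
claims <- data.frame(paid     = c(100, 250, 80),
                     incurred = c(120, 250, 90))

## Expectations written down before looking at any results:
stopifnot(
  nrow(claims) > 0,                      # the data actually loaded
  all(claims$paid >= 0),                 # no negative payments
  all(claims$paid <= claims$incurred),   # paid should never exceed incurred
  abs(sum(claims$paid) - 430) < 1e-9     # cross-reference: a known report total
)
```

If any of these checks fails, the script stops loudly instead of producing plausible-looking but wrong results further down.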

## How to set axis options in googleVis

Setting axis options in googleVis charts can be a bit tricky. Here I present two examples where I set several options to customise the layout of a line and combo chart with two axes.

The parameters have to be set in line with the Google Chart Tools API, which uses a JavaScript syntax. In googleVis, chart options are set via a list in the `options` argument. Some of the list items can be a bit more complex, often wrapped in `{}` brackets, e.g. for various formatting options, or in `[]` brackets if there are multiple series to consider. Within those brackets, sub-options are set as `argument: value` pairs, using the `:` character for assignments.

There are many other options in the Google Chart Tools API which are not supported by googleVis yet, such as column roles, controls and dashboards. Please get in touch if you have ideas in this regard and/or would like to collaborate.

In my first example I display two series of dummy data in a line chart with two axes. The left hand scale is in percentages and the right hand scale in amounts. Note in the code below how I set the various parameters and the placements of the different kinds of brackets.
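The chart code did not survive in this copy of the post, so here is a hedged reconstruction with dummy data; the option strings follow the Google Chart Tools API syntax described above, and the exact figures and title are my own invention:

```r
## Dummy data: one series in percentages, one in amounts
df <- data.frame(year    = 2010:2013,
                 Percent = c(0.25, 0.40, 0.35, 0.30),
                 Amount  = c(100, 120, 140, 130))

## JavaScript-style option strings: [] brackets for the two series,
## {} brackets for the sub-options of each series and axis
myOptions <- list(
  title  = "Example chart with two axes",
  series = "[{targetAxisIndex: 0},
             {targetAxisIndex: 1}]",
  vAxes  = "[{title: 'Percent', format: '#,###%'},
             {title: 'Amount'}]"
)

if (requireNamespace("googleVis", quietly = TRUE)) {
  lc <- googleVis::gvisLineChart(df, xvar = "year",
                                 yvar = c("Percent", "Amount"),
                                 options = myOptions)
  # plot(lc)  ## opens the chart in a browser
}
```

Note how `targetAxisIndex` assigns each series to the left (0) or right (1) axis, and how the `vAxes` entries match those indices.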

## Next Kölner R User Meeting: 12 April 2013

Quick reminder: The next Cologne R user group meeting is scheduled for this Friday, 12 April 2013. We will discuss cluster analysis and shiny. Further details and the agenda are available on our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from the last Cologne R user group meeting are available here.

Thanks also to Revolution Analytics, who sponsors the Cologne R user group as part of their vector programme.

## Top 10 tips to get started with R

• Be motivated. R has a steep learning curve. Find a problem you can't solve otherwise, e.g. plotting multivariate data or a statistical analysis for which an R function already exists.
• Download and install R. Get to know the R console. Learn how to install additional packages, how to access the history, how to use auto completion and open the help system. Review the R Installation and Administration manual and check out the free books section on CRAN.
• Get familiar with the R help files. They can appear cryptic at first, but there is a structure to them. Read and re-read a couple of help files. Look out for the input and output sections, execute the examples, run the demos, e.g. demo(graphics). Subscribe to R-help and read questions and answers, check out Stack Overflow, follow blogs. Search with Rseek.org.
• Learn how to get your data into R. The easiest way is usually via a CSV file (comma-separated values), using read.csv. Look into XLConnect if you have to deal with spreadsheet files. Move on to writing queries against databases, e.g. using RODBC. Skim through the R Data Import/Export manual.
• Try to understand the different data types in R and how to modify them. What are the differences between a matrix and a data frame? What is a factor? What is a list? Think about the different use cases. Review the Introduction to R manual.
• Do charts! Lots of charts. They are rewarding and keep you motivated. Be inspired by the R Graph Gallery. Check out the following packages: lattice, plotrix, ggplot2, deducer, googleVis.
• Learn how you can modify and reshape data in R and apply functions on subsets using by, apply, lapply, ave, reshape, sweep, with, within, etc. Set aside a weekend to think about these functions.
• Write your R code into files instead of typing it all into the R console. Use an integrated development environment (IDE), e.g. ESS Emacs, RStudio, StatET Eclipse.
• Understand the concept of functions. Write a function which returns "Hello World". Modify it so that it has an input argument NAME and prints "Hello NAME". Review the code of existing R functions. Copy from existing code.
• Document your code! Start your code by explaining what you want to achieve, code only that much, then write down the next step in plain English and code again. How will you know that your code does what you want it to do? Testing can help. Think about your code style and how you will version your files.
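As a starting point for the function exercise and the apply-type functions mentioned in the tips above, here is a minimal sketch; the function name `hello` is my own choice:

```r
## The "Hello NAME" exercise: a function with a default argument
hello <- function(name = "World") {
  paste0("Hello ", name)
}
hello()      # returns "Hello World"
hello("R")   # returns "Hello R"

## A first taste of the apply-type functions: average miles per gallon
## by number of cylinders, using the built-in mtcars data set
by(mtcars$mpg, mtcars$cyl, mean)
```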