mages' blog

It is the small data that matters the most

Everyone is talking about Big Data [1], but it is the small data that holds everything together. The small, slowly changing reference tables are the linchpins. Unfortunately, politics too often gets in the way: those small tables, maintained by humans, don't get the attention they deserve. Their owners, if they exist at all - many of these little tables are orphans - make changes without understanding the potential consequences for downstream systems.

Here is a little story as it happened over the weekend.

A friend of mine flew on Saturday night from Vienna to Yerevan. The plane was scheduled to leave Austria at 10:15 PM and to arrive in Armenia at 4:35 AM. However, last weekend was also the weekend when summer time, or daylight saving time (DST), ended in many European countries, including Austria, but not in Armenia.

Armenia used DST in 1981-1995 and 1997-2011. This may look a little odd, but Armenia is by far not the only region that has changed its policy over the years. Russia decided in 2011 not to switch back to winter time and to keep DST all year round, only to switch to permanent winter time again last weekend.

Back to my friend: she arrived early Sunday morning at the airport in Yerevan, but the friend who had promised to pick her up was not there. After a little while she took a cab to his house, arriving just as he was about to leave for the airport. What had happened? Well, her friend relied on his mobile phone for his wake-up call. However, his phone wasn't aware that Armenia no longer observes DST and set his clock back by one hour, giving him an extra hour of sleep. The phone might be a little old, or it might get its signal from Russia; I don't know. Yet this little story illustrates nicely how reliant we are nowadays on our interconnected devices having access to the correct reference data.

Imagine being in an intensive care unit, relying on medical devices that have to work together correctly on time signals, when they are sourced from different companies and countries and run on different systems. Try to stay healthy, particularly in late March and late October, would be my advice.

So, how do systems/computers actually know which timezone to use?

Most computers use the tz reference code and database, maintained by a group of volunteers until 2011 and by IANA since then. Still, how do the maintainers get informed that a region or state has decided to change the rules? By keeping their ears to the ground!

The following extract from the Asia file of the tz database 2014b release is quite telling:


# Armenia
# From Paul Eggert (2006-03-22):
# Shanks & Pottenger have Yerevan switching to 3:00 (with Russian DST)
# in spring 1991, then to 4:00 with no DST in fall 1995, then
# readopting Russian DST in 1997.  Go with Shanks & Pottenger, even
# when they disagree with others.  Edgar Der-Danieliantz
# reported (1996-05-04) that Yerevan probably wouldn't use DST
# in 1996, though it did use DST in 1995.  IATA SSIM (1991/1998) reports 
# that Armenia switched from 3:00 to 4:00 in 1998 and observed DST 
# after 1991, but started switching at 3:00s in 1998.

# From Arthur David Olson (2011-06-15):
# While Russia abandoned DST in 2011, Armenia may choose to
# follow Russia's "old" rules.

# From Alexander Krivenyshev (2012-02-10):
# According to News Armenia, on Feb 9, 2012,
# http://newsarmenia.ru/society/20120209/42609695.html
#
# The Armenia National Assembly adopted final reading of Amendments to the
# Law "On procedure of calculation time on the territory of the Republic of
# Armenia" according to which Armenia [is] abolishing Daylight Saving Time.
# or
# (brief)
# http://www.worldtimezone.com/dst_news/dst_news_armenia03.html
# Zone NAME  GMTOFF RULES FORMAT [UNTIL]
Zone Asia/Yerevan 2:58:00 - LMT 1924 May  2
   3:00 - YERT 1957 Mar    # Yerevan Time
   4:00 RussiaAsia YER%sT 1991 Mar 31 2:00s
   3:00 1:00 YERST 1991 Sep 23 # independence
   3:00 RussiaAsia AM%sT 1995 Sep 24 2:00s
   4:00 - AMT 1997
   4:00 RussiaAsia AM%sT 2012 Mar 25 2:00s
   4:00 - AMT
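
In R, the tz database is consulted whenever a time stamp is formatted for a particular time zone. Here is a minimal sketch, assuming the tz database shipped with your system is reasonably up to date:
x <- as.POSIXct("2014-10-26 00:35:00", tz="UTC")
# The same instant in two time zones: Vienna switched from CEST
# back to CET during that night, Yerevan stayed on UTC+4 all year.
format(x, tz="Europe/Vienna", usetz=TRUE)
format(x, tz="Asia/Yerevan", usetz=TRUE)
# List the zone names known to your tz database
head(OlsonNames())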

Does this all sound familiar to you and the challenges in your own organisation? Well, I recently gave a talk on Small & Big Data with a colleague of mine, should you be interested in finding out more about this topic.

[1] or Tiny Data in Rasmus' case

Approximating the impact of inflation

The other day someone mentioned to me a rule of thumb he was using to estimate the number of years \(n\) it would take for inflation to destroy half of the purchasing power of today's money:
\[ n = \frac{70}{p}\]
Here \(p\) is the inflation rate in percent; e.g. if the inflation rate is \(2\%\), then today's money will buy only half of today's goods and services in 35 years. You can also think of a savings account with an interest rate of \(2\%\), which would double your money in 35 years.

It is not difficult to derive this formula, and I will do so below. I just wonder whether the craft of approximating answers to questions is slowly eroding as we have ever more powerful computers and access to more and more data at our fingertips. Well, I had better write down my derivation before I forget it again.

The starting point is:
\[
2K = K (1 + \frac{p}{100})^n
\]
This is equivalent to:
\[
2 = (1 + \frac{p}{100})^n
\]
Taking the log gives:
\[
\log(2) = n \log(1 + \frac{p}{100})
\]
The first term of the Taylor series approximation of \(\log(1+x)\) for small \(x\) is \(x\). Hence for small \(p\) I can set:
\[
\log(2) \doteq n \, \frac{p}{100}
\]
Next I have to estimate the value for \(\log(2)\). Writing it as an integral leads to:
\[
\log(2) = \int_1^2 \frac{1}{x} \,dx
\]
Using Simpson's rule I can approximate the integral with:
\[
\int_1^2 \frac{1}{x} \,dx \doteq \frac{2-1}{6} (1+4\frac{2}{1+2}+\frac{1}{2} )
= \frac{25}{36} \doteq 0.7
\]
Thus,
\[
n \doteq \frac{70}{p}
\]
Plotting the two formulas against each other reveals that the approximation works pretty well, even for inflation rates up to 10%.

R Code

Here is the R code to reproduce the plot.
# Rule-of-thumb approximation: n = 70/p
curve(70/x, from=1, to=10, 
      xlab="Inflation rate p%", 
      ylab="Number of years for purchasing power to halve", 
      main="Impact of inflation on purchasing power",
      col="blue", 
      type="p", pch=16, cex=0.5)
# Exact formula: n = log(2)/log(1 + p/100)
curve(log(2)/(log(1+x/100)), 
      from=1, to=10, add=TRUE, 
      col="red")
legend("topright", 
       legend=c("70/p", "log(2)/log(1+p/100)"), 
       bty="n",
       col=c("blue", "red"), 
       pch=c(16, NA), lty=c(NA, 1))

googleVis 0.5.6 released on CRAN

Version 0.5.6 of googleVis was released on CRAN over the weekend. This version fixes a bug in gvisMotionChart. Its arguments xvar, yvar, sizevar and colorvar were not always picked up correctly.

Thanks to Juuso Parkkinen for reporting this issue.

Example: Love, or to love

A few years ago Martin Hilpert posted an interesting case study for motion charts. Martin is a linguist and he researched how the usage of words in American English changed over time, e.g. some words were more often used as nouns in the past and then became more popular as a verb. Do you talk about love, or do you tell someone that you love her/him? Visit his motion chart web page for more information and details!


R code
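
A rough illustration of how such a motion chart could be set up with gvisMotionChart follows below; the data and column names are made up for this sketch:
library(googleVis)
# Made-up example data: relative usage of two words as noun vs. verb
dat <- data.frame(word=rep(c("love", "hope"), each=3),
                  year=rep(c(1950, 1980, 2010), 2),
                  noun=c(65, 55, 45, 70, 60, 50),
                  verb=c(35, 45, 55, 30, 40, 50))
M <- gvisMotionChart(dat, idvar="word", timevar="year",
                     xvar="noun", yvar="verb")
plot(M)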


Session Info

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] googleVis_0.5.6

loaded via a namespace (and not attached):
[1] RJSONIO_1.3-0 tools_3.1.1

Visualising the seasonality of Atlantic windstorms

Last week Arthur Charpentier sketched out a Markov spatial process to generate hurricane trajectories. Here, I would like to take another look at the data Arthur used, but focus on its time component.

According to the Insurance Information Institute, a normal season, based on averages from 1980 to 2010, has 12 named storms, six hurricanes and three major hurricanes. The usual peak months of August and September passed without any major catastrophes this year, but the Atlantic hurricane season is not over yet.

So, let's take a look at the data again, and here I will use code from Gaston Sanchez, who looked at windstorm data earlier.

I believe I am using the same data as Arthur, but from a different source, which allows me to download it in one file (13.9MB). Using Gaston's code, my chart of all named windstorms over the period 1989 to 2013 looks like this.


Plotting the data by month illustrates the seasonality of windstorms during the year. Note that there are no named windstorms in February and March, and although the season runs until the end of November, the peak months are clearly August and September.


Perhaps there are other time components that affect the seasonality across years, such as La Niña? Plotting the years 1989 to 2013 certainly shows that some years have more windstorm activity than others. Critical to the impact and cost of windstorms is whether or not they make landfall. Hence, I add information from Wikipedia on deaths and economic damage for major hurricanes over that period.


Although 1992 doesn't show much windstorm activity, Hurricane Andrew was the most expensive hurricane to that date and changed the insurance industry significantly. Following Hurricane Andrew, insurance companies started to embrace catastrophe models.

The economic loss is a very poor proxy for the loss of lives. In 1998 Hurricane Mitch claimed over 19,000 lives, most of them in Honduras. Its economic cost of $6.2bn looks small compared to the cost of Andrew six years earlier ($26.5bn).


Although the impact of windstorm activity has been fairly benign over the last few years, the next big event could have a much more severe impact than what we have seen historically, as the population (and economy) on the Gulf of Mexico has been growing rapidly - by 150% from 1960 to 2008 alone. A hurricane season causing damage of over $200bn doesn't look unreasonable anymore.

R Code
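
As a rough sketch, a monthly facet plot of the storm tracks could be produced with ggplot2 and maps along the following lines; the data frame tracks and its columns are assumptions, holding one row per recorded storm position:
library(ggplot2)
library(maps)
# 'tracks' is assumed to hold one row per storm position,
# with columns: id (storm identifier), month, lat, lon
wm <- map_data("world")
ggplot(tracks, aes(x=lon, y=lat, group=id)) +
  geom_polygon(data=wm, aes(x=long, y=lat, group=group),
               fill="grey90", colour="grey70") +
  geom_path(alpha=0.3, colour="red") +
  coord_cartesian(xlim=c(-110, 10), ylim=c(0, 60)) +
  facet_wrap(~ month) +
  theme_minimal()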


Session Info

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
[1] XML_3.98-1.1     data.table_1.9.2 ggplot2_1.0.0   
[4] maps_2.3-7      

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 digest_0.6.4     gtable_0.1.2    
 [4] labeling_0.3     MASS_7.3-33      munsell_0.4.2   
 [7] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2     
[10] reshape2_1.4     scales_0.2.4     stringr_0.6.2   
[13] tools_3.1.1 

Running RStudio via Docker in the Cloud

Deploying applications via Docker containers is the current talk of the town. I had heard about Docker and played around with it a little, but when Dirk Eddelbuettel posted his R and Docker talk last Friday I got really excited and had to have a go myself.

My aim was to rent some resources in the cloud, pull an RStudio Server container and run RStudio in a browser. It was actually surprisingly simple to get started.

I chose Digital Ocean as my cloud provider. They have many Linux systems to choose from and also a pre-built Docker system.


About a minute after I had kicked off the Docker droplet, I could log into the system in a browser window and start pulling the Docker image, e.g. Dirk's container.


Once the downloads had finished I could start RStudio Server using the docker run command and log in to an RStudio session. To my surprise, even my googleVis package worked out of the box. The plot command simply opened another browser window to display the chart; here the output of the WorldBank demo.


All of this was done within minutes in a browser window. I didn't even use a terminal window. So, that's how you run R on an iPad. Considering that the cost for the server was $0.015 per hour, I wonder why I should buy my own server, or indeed buy a new computer.

Managing R package dependencies

One of my takeaways from last week's EARL conference was that R is increasingly growing out of its academic roots and into the enterprise. And with that come some challenges, e.g. how do I ensure consistent and systematic access to a set of R packages in an organisation, in particular when one team provides packages to others?

Two packages can help here: roxyPackage and miniCRAN.

I wrote about roxyPackage earlier on this blog. It allows me to create a local repository to distribute my package, while at the same time executing and controlling the build process from within R. But what about my package's dependencies? Here miniCRAN helps. miniCRAN is a new package by Andrie de Vries that enables me to find and download all package dependencies and to store them in a local repository, e.g. the one used by roxyPackage.

For more details about roxyPackage and miniCRAN read the respective package vignettes.

Example

To create a local sub-CRAN repository for the two packages I maintain on CRAN, together with all their dependencies, I use:
library("miniCRAN")
my.pkgs <- c("googleVis", "ChainLadder")
pkgs <- pkgDep(my.pkgs, suggests = TRUE, enhances=FALSE)
makeRepo(pkgs = pkgs, path="/Users/Shared/myCRANRepos")
And to visualise the dependencies:
dg <- makeDepGraph(my.pkgs, includeBasePkgs=FALSE, 
                   suggests=TRUE, enhances=TRUE)
set.seed(1)
plot(dg, legendPosEdge = c(-1, 1), 
     legendPosVertex = c(1, 1), vertex.size=20)

What a surprise! In total I end up with 42 packages from CRAN, and I hadn't expected any connection between the ChainLadder and googleVis packages.

Bonus tip

Don't miss out on Pat Burns's insightful talk about effective risk management from EARL. His thoughts reminded me of the great Karl Popper: Good tests kill flawed theories; we remain alive to guess again.

Session Info

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] miniCRAN_0.1-0

loaded via a namespace (and not attached):
[1] httr_0.5 igraph_0.7.1  stringr_0.6.2 tools_3.1.1  
[5] XML_3.98-1.1

Notes from the Kölner R meeting, 12 September 2014

Last Friday we had guests from Belgium and the Netherlands joining us in Cologne. Maarten-Jan Kallen from BeDataDriven came from The Hague to introduce us to Renjin, and the guys from DataCamp in Leuven, namely Jonathan, Martijn and Dieter, gave an overview of their new online interactive training platform.