## Minimal examples help

The other day I got stuck working with a huge data set using `data.table` in R. It took me a little while to realise that I had to produce a minimal reproducible example to actually understand why I got stuck in the first place. I know, this is the mantra I should follow before I reach out to R-help, Stack Overflow or indeed the package authors. Of course, more often than not, by following this advice the problem becomes clear and with that the solution obvious. Ok, here is the problem. Well, it is easy to write down now, after I understood it.

Suppose I have some data that describes my sales targets by product and quarter:

```
library(data.table)
Plan <- data.table(
  Product = c(rep("Apple", 3), rep("Kiwi", 3), rep("Coconut", 3)),
  Quarter = rep(c(1, 2, 3), 3),
  Target = 1:9)
Plan
## Product Quarter Target
## 1: Apple 1 1
## 2: Apple 2 2
## 3: Apple 3 3
## 4: Kiwi 1 4
## 5: Kiwi 2 5
## 6: Kiwi 3 6
## 7: Coconut 1 7
## 8: Coconut 2 8
## 9: Coconut 3 9
```

Further, I have some actual sales data, which is additionally broken down by region, but has no data for Coconut:

```
Actual <- data.table(
  Region = rep(c("North", "South"), each = 4),
  Product = rep(c("Apple", "Kiwi"), times = 4),
  Quarter = rep(c(1, 1, 2, 2), 2),
  Sales = 1:8)
Actual
## Region Product Quarter Sales
## 1: North Apple 1 1
## 2: North Kiwi 1 2
## 3: North Apple 2 3
## 4: North Kiwi 2 4
## 5: South Apple 1 5
## 6: South Kiwi 1 6
## 7: South Apple 2 7
## 8: South Kiwi 2 8
```

What I would like to do is join both data sets, so that I can compare my sales figures with my targets. In particular, I would also like to see my targets for future quarters. However, I would like to filter out the target data for products that are not available in a region, Coconut in my example.

First I have to set the keys on which I would like to join my data sets:

```
setkey(Actual, Product, Quarter)
setkey(Plan, Product, Quarter)
```

Because I also want to see future targets, I am not using `Plan[Actual]`. Instead I join the Plan data for each region; but then I also get the target data for Coconut:

```
Actual[, .SD[Plan], by = list(Region)]
## Region Product Quarter Sales Target
## 1: North Apple 1 1 1
## 2: North Apple 2 3 2
## 3: North Apple 3 NA 3
## 4: North Coconut 1 NA 7
## 5: North Coconut 2 NA 8
## 6: North Coconut 3 NA 9
## 7: North Kiwi 1 2 4
## 8: North Kiwi 2 4 5
## 9: North Kiwi 3 NA 6
## 10: South Apple 1 5 1
## 11: South Apple 2 7 2
## 12: South Apple 3 NA 3
## 13: South Coconut 1 NA 7
## 14: South Coconut 2 NA 8
## 15: South Coconut 3 NA 9
## 16: South Kiwi 1 6 4
## 17: South Kiwi 2 8 5
## 18: South Kiwi 3 NA 6
```

Ok, that means I have to filter the planning data for the products present in my actual data:

```
Actual[, .SD[
  Plan[Product %in% unique(.SD[, Product])]
], by = list(Region)]
## Region Product Quarter Sales Target
## 1: North Apple 1 1 1
## 2: North Apple 2 3 2
## 3: North Apple 3 NA 3
## 4: North Kiwi 1 2 4
## 5: North Kiwi 2 4 5
## 6: North Kiwi 3 NA 6
## 7: South Apple 1 5 1
## 8: South Apple 2 7 2
## 9: South Apple 3 NA 3
## 10: South Kiwi 1 6 4
## 11: South Kiwi 2 8 5
## 12: South Kiwi 3 NA 6
```

That's it. Now I can get back to my original huge and complex data set and move on.

Please let me know if there is a better way of achieving the above.
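One alternative I have since considered uses `merge()` rather than a grouped `.SD` join: expand the plan to every (Region, Product) combination that actually has sales, then left-join the sales figures on. This is only a sketch; the intermediate names (`Combos`, `PlanByRegion`, `Result`) are my own.

```r
library(data.table)

Plan <- data.table(
  Product = c(rep("Apple", 3), rep("Kiwi", 3), rep("Coconut", 3)),
  Quarter = rep(c(1, 2, 3), 3),
  Target = 1:9)
Actual <- data.table(
  Region = rep(c("North", "South"), each = 4),
  Product = rep(c("Apple", "Kiwi"), times = 4),
  Quarter = rep(c(1, 1, 2, 2), 2),
  Sales = 1:8)

## products actually sold in each region
Combos <- unique(Actual[, .(Region, Product)])
## all planned quarters for those region/product combinations
PlanByRegion <- merge(Combos, Plan, by = "Product", allow.cartesian = TRUE)
## left-join the sales on; future quarters get NA Sales
Result <- merge(PlanByRegion, Actual,
                by = c("Region", "Product", "Quarter"), all.x = TRUE)
```

This drops Coconut automatically, because it never appears in `Combos`.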

### Session Info

```
R version 3.1.2 Patched (2015-01-20 r67564)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.2 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.4
loaded via a namespace (and not attached):
[1] chron_2.3-45 plyr_1.8.1 Rcpp_0.11.4 reshape2_1.4.1 stringr_0.6.2
```

24 Feb 2015, 07:24 | Tags: data.table, join, R

## Reading Arduino data directly into R

I have experimented with reading an Arduino signal into R in the past, using Rserve and Processing. Actually, it is much easier: I can read the output of my Arduino directly into R with the `scan` function. Here is my temperature sensor example again:

And all it needs to read the signal into the R console with my computer is:

```
> f <- file("/dev/cu.usbmodem3a21", open="r")
> scan(f, n=1)
## Read 1 item
## [1] 20.8
> close(f)
```

Super simple: open the file connection, scan *n* lines of data, close the file connection. Job done.

Note: This worked for me on my Mac and I am sure it will work in a very similar way on a Linux box as well, but I am not so sure about Windows. Crucially, I had to learn the difference between the *tty* and *cu* devices. I found the following statement in Mike's PBX Cookbook particularly insightful:

> You might notice that each serial device shows up twice in /dev, once as a tty.* and once as a cu.*. So, what's the difference? Well, TTY devices are for calling into UNIX systems, whereas CU (Call-Up) devices are for calling out from them (eg, modems). We want to call-out from our Mac, so /dev/cu.* is the correct device to use.

You find the file address of your Arduino by opening the Arduino software and looking it up under the menu *Tools > Port*.

With a little more R code I can create a 'live' data stream plot of my Arduino.

*(Animated GIF of the live Arduino data stream plot.)*

### R code
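The post's code was embedded separately; a minimal sketch of such a live plot could look like the following. The function `stream_plot` and its `read1` argument are my own names, wrapping the `scan()` call shown above, so that the plotting logic can also be exercised without a serial device attached.

```r
## Read one value at a time via `read1` and redraw a rolling line plot.
## `read1` is any function returning a single numeric reading.
stream_plot <- function(read1, n = 100, plot_it = interactive()) {
  y <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    y[i] <- read1()
    if (plot_it)
      plot(y, type = "l", xlab = "Reading", ylab = "Temperature [C]")
  }
  invisible(y)
}

## With a real Arduino (device path as used above):
## con <- file("/dev/cu.usbmodem3a21", open = "r")
## stream_plot(function() scan(con, n = 1, quiet = TRUE))
## close(con)
```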

Here is the original Arduino sketch as well:

### Session Info

```
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_3.1.
```

17 Feb 2015, 07:27 | Tags: Arduino, R, read serial, scan

## What have a physicist, an entrepreneur and an actor in common?

They all try to do something new and take the risk of being seen as fools.


Over the last few days I stumbled over three videos by a physicist, an entrepreneur and an actor, which at first have little in common, but they do. They all need to know when they are wrong in order to progress. If you are not wrong, then you are likely to be right, but that is often difficult to prove - often not at all.

- The physicist has an idea for a new law. How does he/she know if it is *wrong*?
- The entrepreneur has an idea for a new business. How does he/she know if it *won't make money*?
- The actor is rehearsing a new scene. How does he/she know if the acting is *not believable*?

Here I have Richard Feynman, Rob Fitzpatrick and Michael Caine.

### The physicist

Start with a guess for a new law. Predict the consequences and compare the prediction with the results of experiments. If the experiments disagree with your prediction, then your idea is wrong.

### The entrepreneur

Ask your mum questions about the assumptions of your new business idea, without telling her anything about it. Do the same with friends, without them knowing that you are talking about a new business idea. This will require great care in the way you phrase your questions. Don't fish for compliments. If the answers are different from your expectations, then your assumptions are wrong and perhaps your business idea as well.

### The actor

Rehearse your dialogue and observe how other people react to it. If they say something like "I am sorry, I see you are rehearsing, but I need to talk to you", then you are not doing it well. If on the other hand they join the conversation, so that you have to say: "I am sorry, but we are rehearsing" then you are getting there.

Wanting to know when you are wrong is one of the hardest things to accept, and yet the best way to progress quickly.

10 Feb 2015, 07:04 | Tags: acting, entrepreneur, physics, science

## R in Insurance 2015: Registration Opened

The registration for the third conference on R in Insurance on Monday **29 June 2015** at the University of Amsterdam has opened.

This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation.

The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.

Invited talks will be given by:

- Prof. Richard Gill, Leiden University
- Dr James Guszcza, FCAS, Chief Data Scientist, Deloitte - US

Richard is known for his talk on *Statistical Errors in Court*, while Jim has given talks on *Predictive Modelling and Behaviour Insight* and *Actuarial Analytics in R*. We are thrilled that they accepted our invitation.

We invite you to submit a one-page abstract for consideration. Both academic and practitioner proposals related to R are encouraged. The submission deadline for abstracts is 28 March 2015.

Please email your abstract of no more than 300 words (in text or pdf format) to r-in-insurance@uva.nl.

Details about the registration and abstract submission are given on the dedicated R in Insurance page at the University of Amsterdam.

For more information about the past events visit www.rininsurance.com.

3 Feb 2015, 07:22 | Tags: Conference, Insurance, News, R, R in Insurance

## googleVis version 0.5.8 released

We released googleVis version 0.5.8 on CRAN last week. The update is a maintenance release for the forthcoming release of R 3.2.0.

New to googleVis? The package provides an interface between R and the Google Charts Tools, allowing you to create interactive web charts from R without uploading your data to Google. The charts are displayed by default via the R internal help browser.
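As a quick reminder of how little code a googleVis chart needs, here is a minimal sketch; the data frame is made up for illustration:

```r
library(googleVis)

## a made-up data set for illustration
df <- data.frame(year = 2010:2014, sales = c(3, 5, 4, 7, 8))

## build the chart object; the HTML/JavaScript is generated locally,
## no data is uploaded to Google
chart <- gvisLineChart(df, xvar = "year", yvar = "sales")
## plot(chart)  # would open the interactive chart in the default browser
```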

Visit the examples of all googleVis charts on CRAN and review the vignettes.

*Screenshot of some of the Google Charts*


## Communicating Risk and Uncertainty

David Spiegelhalter gave a fascinating talk on *Communicating Risk and Uncertainty to the Public & Policymakers* at the Grantham Institute of Imperial College in London last Tuesday.

In a very engaging way David gave many examples and anecdotes from his career in academia and advisory. I believe his talk will be published on the Grantham Institute's YouTube channel, so I will only share a few highlights and thoughts that stuck in my mind here.

### Framing: Opportunity vs. Risk

Do you prefer survival ratios or mortality rates? The way you present probabilities/frequencies will make a difference in the way the audience perceives them and therefore how they will make decisions. David suggested one should aim to give a balanced view to mitigate the framing effect, e.g. show the opportunity and the risk, the survival ratio and the mortality rate. Interestingly, that is exactly how Wolfram Alpha displays life expectancies.

*Life expectancy: UK, male, 62 years old (Source: Wolfram Alpha)*

I think this is a wonderful example of trying to give a balanced view of the underlying risk, communicated in a very clear and understandable way.

### Absolute and relative risk

What did you have for breakfast this morning? Fried up bacon? Perhaps you have fried up bacon every day? It could increase your chances of developing pancreatic cancer by almost a fifth, as an article by the Daily Express on 16th January 2012 suggested.

*Daily Express, 16th January 2012*

Well, this might be true, but fortunately pancreatic cancer is quite rare. About 5 people out of 400 will develop this cancer even without eating fried up bacon every day (absolute risk). Eating fry ups every morning will increase your chances by 20% (relative risk) to 6 out of 400 - still pretty low. Less of a scary headline.
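The arithmetic behind the headline can be spelled out in a couple of lines of R:

```r
## pancreatic cancer example from the talk
baseline <- 5 / 400               # absolute risk without the daily fry up
relative_increase <- 0.20         # "almost a fifth", quoted as relative risk
new_risk <- baseline * (1 + relative_increase)
new_risk * 400                    # 6 people out of 400
```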

Small increases in absolute risks can result in large increases in relative risk. Depending on how you choose to frame the story you can generate very different reactions. David mentioned also the work of Gerd Gigerenzer, who is promoting the use of

*natural frequencies* in the context of medical conversations, e.g. say '1 out of 200 people will experience a certain effect' instead of 'the probability for this effect is 0.5%'.

### Conditional probabilities

One of the big challenges in communicating risk is conditional probabilities. I believe conditional probabilities are not intuitive for most of us. As an example David talked about breast cancer screening. The test itself is not 100% reliable. No test is. There are always results which are *false positives* (the test predicts cancer when there is none) and *false negatives* (the test predicts no cancer when there is one). Tests are usually designed to have a higher false alarm rate than no alarm at all. Hence, even when the test result is positive the likelihood of cancer can be low, because cancer is rare and the test is not that good. David presented a great diagram from the *NHS breast screening - Helping you decide* leaflet to illustrate the conditional probabilities for breast screening.

*NHS breast screening - Helping you decide*
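The effect of a rare condition combined with an imperfect test can be illustrated with Bayes' theorem. The numbers below are purely illustrative and not taken from the NHS leaflet:

```r
## illustrative numbers, NOT the figures from the NHS leaflet
prev <- 0.01    # prevalence of the condition
sens <- 0.90    # P(test positive | condition)
spec <- 0.95    # P(test negative | no condition)

p_pos <- sens * prev + (1 - spec) * (1 - prev)   # P(test positive)
p_cond_given_pos <- sens * prev / p_pos          # Bayes' theorem
round(p_cond_given_pos, 2)                       # about 0.15
```

Even with a positive result, the chance of actually having the condition is only about 15% here, precisely because the condition is rare.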

### Translating probabilities into words

The use of *natural frequencies* is not always appropriate and sometimes probabilities are more useful. However, instead of saying that something has a 50% chance of occurring, you might also say it is as likely to happen as not. David mentioned the work of the Intergovernmental Panel on Climate Change (IPCC), which aims to use consistent language in all its reports. To achieve that, they publish guidance on the consistent treatment of uncertainty. This is their translation of probabilities into words:

*Source: IPCC Uncertainty guidance note*

I think this table will come in handy in the future when I have to elicit expert opinions.

### Speaking with confidence

The IPCC uncertainty guidance also talks about *confidence*, and interestingly they correlate the level of agreement with the level of evidence. Note that even with a lot of evidence, without agreement the confidence is relatively low.

*Source: IPCC Uncertainty guidance note*

I wonder what low agreement with robust evidence could mean. Perhaps, even if there is strong evidence that I should change my diet, e.g. drink less coffee, I might not agree with it and as a result wouldn't take any action.

### Fan charts

The last example from David's talk I will mention here is the Bank of England's fan chart of GDP projection. The Bank of England has been using this chart now for a number of years and tweaked it only a little over that time. Until the financial crisis the Bank of England showed bands of confidence within the 5-95% interval. The financial crisis happened outside that interval, which of course doesn't mean that it couldn't happen, it was just very unlikely. Since then they illustrate the 0-100% confidence interval as a grey background. The other interesting aspect is that they don't show the mean projection line initially, to overcome the anchoring effect. Note also that the term projection is used and not prediction.

*November 2014 GDP fan chart (Source: Bank of England)*

Finally, David announced that he has a new book in the pipeline. I am sure it will sell well, as it has the catchy title *Sex by Numbers*.
20 Jan 2015, 07:25 | Tags: Grantham, Imperial College, Risk, Spiegelhalter, Uncertainties

## Extended Kalman filter example in R

Last week's post about the Kalman filter focused on the derivation of the algorithm. Today I will continue with the extended Kalman filter (EKF) that can also deal with nonlinearities. According to Wikipedia the EKF has been considered the de facto standard in the theory of nonlinear state estimation, navigation systems and GPS.


### Kalman filter

I had the following dynamic linear model for the Kalman filter last week:

\[\begin{align}

x_{t+1} & = A x_t + w_t,\quad w_t \sim N(0,Q)\\

y_t &=G x_t + \nu_t, \quad \nu_t \sim N(0,R)\\

x_1 & \sim N(\hat{x}_0, \Sigma_0)

\end{align}

\]

Here \(x_t\) describes the state space evolution and \(y_t\) the observations; \(A, Q, G, R, \Sigma_0\) are matrices of appropriate dimensions, \(w_t\) is the evolution error and \(\nu_t\) the observation error. The Kalman filter provides recursive estimators for \(x_t\) via:

\[\begin{align}

K_{t} &= A \Sigma_t G' (G \Sigma_t G' + R)^{-1}\\

\hat{x}_{t+1} &= A \hat{x_t} + K_{t} (y_t - G \hat{x}_t) \\

\Sigma_{t+1} &= A \Sigma_t A' - K_{t} G \Sigma_t A' + Q

\end{align}\]

In the case of nonlinearities on the right hand side of either the state (\(x_t\)) or observation (\(y_t\)) equation, the extended Kalman filter uses a simple and elegant trick: a first-order Taylor series, or in other words, I simply linearise the right hand side. The matrices \(A\) and \(G\) will be the Jacobian matrices of the respective vector functions.

### Logistic growth

As an example I will use a logistic growth model, inspired by the Haskell example given by Dominic Steinitz. The logistic growth model can be written as a time-invariant dynamical system with growth rate \(r\) and carrying capacity \(k\):

\[

\begin{aligned}

\dot{p} & = r p\Big(1 - \frac{p}{k}\Big)

\end{aligned}

\]

The above ordinary differential equation has the well-known analytical solution:

\[

p = \frac{kp_0\exp(r\,t)}{k + p_0(\exp(r\,t) - 1)}

\]

Suppose I observe data of a population for which I know the carrying capacity \(k\), but where the growth rate \(r\) is noisy.

### Extended Kalman filter

The state space and observation model can then be written as:

\[

\begin{aligned}

r_i &= r_{i-1} \\

p_i &= \frac{kp_{i-1}\exp(r_{i-1}\Delta T)}{k + p_{i-1}(\exp(r_{i-1}\Delta T) - 1)} \\

y_i &= \begin{bmatrix}0 & 1\end{bmatrix} \begin{bmatrix}r_i \\ p_i\end{bmatrix} + \nu_i

\end{aligned}

\]

Or, with \(x_i := \begin{bmatrix}r_i & p_i\end{bmatrix}'\), as:

\[

\begin{aligned}

x_i &= a(x_{i-1})\\

y_i &= G x_i + \nu_i, \quad \nu_i \sim N(0,R)

\end{aligned}

\]

In my example the state space model is purely deterministic, so there isn't any evolution noise and hence \(Q=0\).

With an initial prior guess for \(x_0\) and \(\Sigma_0\) I am ready to go. The R code below shows my implementation of the algorithm above. Note that I use the `jacobian` function of the `numDeriv` package.

After a few time steps the extended Kalman filter does a fantastic job in reducing the noise. Perhaps this shouldn't be too surprising, as a local linearisation of the logistic growth function will give a good fit. The situation might be different for highly nonlinear functions. The Wikipedia article about the Kalman filter suggests the unscented version in those cases.

### R code
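The code was originally embedded separately; the following is my own sketch of the algorithm described above. The simulation parameters, priors and noise levels are assumptions chosen for illustration, not necessarily those of the original post.

```r
## Sketch of the extended Kalman filter for the logistic growth model
library(numDeriv)  # for the jacobian() function

set.seed(1234)
k <- 100                 # known carrying capacity
r <- 0.2                 # true (unobserved) growth rate
p0 <- 10                 # initial population
deltaT <- 0.1
n <- 100
R <- 25                  # observation noise variance
Q <- diag(0, 2)          # purely deterministic state evolution

## state transition a(x) with x = (r, p)'
a <- function(x) {
  r <- x[1]; p <- x[2]
  c(r, k * p * exp(r * deltaT) / (k + p * (exp(r * deltaT) - 1)))
}
G <- matrix(c(0, 1), nrow = 1)   # we observe the population only

## simulate noisy observations of the analytical solution
tt <- seq(deltaT, by = deltaT, length.out = n)
p <- k * p0 * exp(r * tt) / (k + p0 * (exp(r * tt) - 1))
y <- p + rnorm(n, 0, sqrt(R))

## EKF recursion, linearising a() at the current estimate via its Jacobian
xhat <- c(0.5, 1)                # prior guess for (r, p)
Sigma <- diag(c(1, 10))
est <- matrix(NA_real_, n, 2, dimnames = list(NULL, c("r", "p")))
for (i in seq_len(n)) {
  A <- jacobian(a, xhat)
  K <- A %*% Sigma %*% t(G) %*% solve(G %*% Sigma %*% t(G) + R)
  xhat <- as.vector(a(xhat) + K %*% (y[i] - G %*% xhat))
  Sigma <- A %*% Sigma %*% t(A) - K %*% G %*% Sigma %*% t(A) + Q
  est[i, ] <- xhat
}
```

Plotting `y` against `est[, "p"]` shows how quickly the filtered population estimate settles onto the underlying logistic curve.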

### Session Info

```
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] numDeriv_2012.9-1
loaded via a namespace (and not attached):
[1] tools_3.1.2
```
