mages' blog

Hit and run. Think Bayes!

At the R in Insurance conference Arthur Charpentier gave a great keynote talk on Bayesian modelling in R. Bayes' theorem on conditional probabilities is strikingly simple, yet incredibly thought-provoking. Here is an example from Daniel Kahneman to test your intuition. But first I have to start with Bayes' theorem.

Bayes' theorem

Bayes' theorem states that given two events $$D$$ and $$H$$, the probability of $$D$$ and $$H$$ happening at the same time is the same as the probability of $$D$$ occurring, given $$H$$, weighted by the probability that $$H$$ occurs; or the other way round. As a formula it can be written as:
$P(H \cap D) = P(H|D) \, P(D) = P(D|H) \, P(H)$
Or if I rearrange it:
$P(H|D) = \dfrac{P(D|H) \, P(H)}{P(D)}$
Imagine $$H$$ is short for hypothesis and $$D$$ is short for data, or evidence. Then Bayes' theorem states that the probability of a hypothesis given data is the same as the likelihood that we observe the data given the hypothesis, weighted by the prior belief of the hypothesis, normalised by the probability that we observe the data regardless of the hypothesis.

The tricky bit in real life is often to figure out what the hypothesis and data are.

Hit and run accident

This example is taken from Daniel Kahneman's book Thinking, fast and slow [1].
A cab was involved in a hit and run accident at night. Two cab companies, the Green and the Blue, operate in the city. 85% of the cabs in the city are Green and 15% are Blue. A witness identified the cab as Blue. The court tested the reliability of the witness under the same circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colours 80% of the time and failed 20% of the time.

What is the probability that the cab involved in the accident was Blue rather than Green knowing that this witness identified it as Blue?

What is the data here and what is the hypothesis? Intuitively you may think that the proportion of Blue and Green cabs is the data at hand and the witness's statement that a Blue cab was involved in the accident is the hypothesis. However, after some thought I found the following assignment much more helpful, as then $$P(H|D)$$ matches the above question:

$$H =$$ Accident caused by Blue cab. $$D =$$ Witness said the cab was Blue.

With this it is straightforward to get the probabilities $$P(H)=15\%$$ and $$P(D|H)=80\%$$. But what is $$P(D)$$? Well, when would the witness say that the cab was Blue? Either when the cab was Blue and so the witness is right, or when the cab was actually Green and the witness is incorrect. Thus, following the law of total probability:
\begin{align} P(D) & = P(D|H) P(H) + P(D | \bar{H}) P(\bar{H})\\ & = 0.8 \cdot 0.15 + 0.2 \cdot 0.85 = 0.29 \end{align}
Therefore I get $$P(H|D) = 0.8 \cdot 0.15 / 0.29 \approx 41\%$$. Thus, even though the witness states that the cab involved in the accident was Blue, the probability of this being true is only $$41\%$$.
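The same arithmetic can be checked in a few lines of base R. This is only a sketch of the calculation above; the variable names are illustrative and not taken from any package.

```r
# Prior: share of Blue cabs in the city, P(H)
p_blue <- 0.15

# Witness reliability:
# P(D | H): witness says Blue when the cab is Blue
p_say_blue_given_blue <- 0.80
# P(D | not H): witness says Blue when the cab is Green
p_say_blue_given_green <- 0.20

# Law of total probability: P(D)
p_say_blue <- p_say_blue_given_blue * p_blue +
  p_say_blue_given_green * (1 - p_blue)

# Bayes' theorem: P(H | D)
posterior_blue <- p_say_blue_given_blue * p_blue / p_say_blue

print(p_say_blue)                 # 0.29
print(round(posterior_blue, 4))   # 0.4138
```

Changing `p_blue` shows how strongly the answer depends on the base rate: with equal numbers of Green and Blue cabs the posterior would simply equal the witness's reliability of 80%.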

An alternative way to think about this problem is via a Bayesian Network. The colour of the cab will influence the statement of the witness. In R I can specify such a network using the gRain package [2], which I discussed in an earlier post. Here I provide the distribution of the cabs and the conditional probabilities of the witness as an input. After I compile the network, I can again read off the probabilities that a Blue cab was involved, when the witness said so.
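A minimal sketch of such a network could look as follows. It assumes the gRain package is installed from CRAN; the node names `cab` and `witness` and the object names are illustrative choices, not necessarily those used in the earlier post.

```r
# Assumption: gRain is installed, e.g. via install.packages("gRain")
library(gRain)

# Distribution of cabs in the city
cab <- cptable(~cab, values = c(0.85, 0.15),
               levels = c("Green", "Blue"))

# Witness statement given the true colour of the cab.
# Values are ordered with witness varying fastest:
# (Green|Green, Blue|Green, Green|Blue, Blue|Blue)
witness <- cptable(~witness | cab,
                   values = c(0.80, 0.20, 0.20, 0.80),
                   levels = c("Green", "Blue"))

# Compile the conditional probability tables into a network
net <- grain(compileCPT(list(cab, witness)))

# Enter the evidence: the witness said the cab was Blue
net_ev <- setEvidence(net, nodes = "witness", states = "Blue")

# Posterior distribution of the cab colour given the evidence
posterior <- querygrain(net_ev, nodes = "cab")$cab
print(round(posterior, 4))
```

The entry for Blue should agree with the 41% derived by hand above, since the network encodes exactly the same prior and conditional probabilities.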

Notes from the 2nd R in Insurance Conference

The 2nd R in Insurance conference took place last Monday, 14 July, at Cass Business School London.

This one-day conference focused once more on applications in insurance and actuarial science that use R. Topics covered included reserving, pricing, loss modelling, the use of R in a production environment and more.

In the first plenary session, Montserrat Guillen (Riskcenter, University of Barcelona) and Leo Guelman (Royal Bank of Canada, RBC Insurance) spoke about the rise of uplift models. These predictive models are used for improved targeting of policyholders by marketing campaigns, through the use of experimental data. The presenters illustrated the use of their uplift package (available on CRAN), which they have developed for such applications.

Simple user interface in R to get login details

Occasionally I have to connect to services from R that ask for login details, such as databases. I don't like to store my login details in the R source code file; instead I would prefer to enter my login details when I execute the code.

Fortunately, I found some old code in a post by Barry Rowlingson that does just that. It uses the tcltk package in R to create a little window in which the user can enter her details, without showing the password. The tcltk package ships with the standard R distribution, so the code should run on any operating system where R has Tcl/Tk support. Nice!
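The idea can be sketched roughly as follows. This is not Barry Rowlingson's original code but a minimal illustration under my own assumptions; the function name `get_login` and the layout are hypothetical. It needs a graphical display, so the dialog is only opened in an interactive session.

```r
library(tcltk)

# Hypothetical helper: opens a small window asking for user name and
# password and returns them as a list, or NULL if the window was closed.
get_login <- function(title = "Login") {
  tt <- tktoplevel()
  tkwm.title(tt, title)

  user <- tclVar("")
  pass <- tclVar("")
  done <- tclVar(0)  # 0 = waiting, 1 = OK pressed, 2 = window closed

  user_entry <- tkentry(tt, textvariable = user)
  # show = "*" masks the typed password
  pass_entry <- tkentry(tt, textvariable = pass, show = "*")
  ok_button <- tkbutton(tt, text = "OK",
                        command = function() tclvalue(done) <- 1)

  tkgrid(tklabel(tt, text = "Username"), user_entry)
  tkgrid(tklabel(tt, text = "Password"), pass_entry)
  tkgrid(ok_button, columnspan = 2)

  # Stop waiting if the user closes the window
  tkbind(tt, "<Destroy>", function() tclvalue(done) <- 2)
  tkfocus(user_entry)
  tkwait.variable(done)

  if (tclvalue(done) != "1") return(NULL)
  credentials <- list(user = tclvalue(user), password = tclvalue(pass))
  tkdestroy(tt)
  credentials
}

# Only prompt when running interactively
if (interactive()) {
  credentials <- get_login()
}
```

The returned list can then be passed on to whatever connection function the service requires, so the password never has to appear in the source file.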

Recently we released googleVis 0.5.3 on CRAN. The package provides an interface between R and Google Charts, allowing you to create interactive web charts from R.

Figure: Screenshot of some of the Google Charts

Although this is mainly a maintenance release, I'd like to point out two changes:
• Default chart width is set to 'automatic' instead of 500 pixels.
• Intervals for column roles have to end with the suffix ".i", with "i" being an integer. Several interval columns are allowed, see the roles demo and vignette for more details.
Those changes were required to fix the following issues:
• The order of y-variables in core charts wasn't maintained. Thanks to John Taveras for reporting this bug.
• Width and height of googleVis charts were only accepted in pixels, although the Google Charts API uses standard HTML units (for example, '100px', '80em', '60', 'automatic'). If no units are specified the number is assumed to be pixels. This has been fixed. Thanks to Paul Murrell for reporting this issue.
New to googleVis? Review the demo on CRAN.