Statistics | mages' blog

Prediction for the 100m final at the Tokyo Olympics

On Sunday the men's 100m sprint final of the Tokyo Olympics will take place. Francesc Montané reminded me in his analysis that 9 years ago I used a simple regression model to predict the winning time for the men's 100m sprint final of the 2012 Olympics in London. My model predicted a winning time of 9.68s, yet Usain Bolt finished in 9.63s. For this Sunday my prediction is 9.72s, with a 50% credible interval of [9.

Use domain knowledge to review prior distributions

At the Insurance Data Science conference, both Eric Novik and Paul-Christian Bürkner emphasised in their talks the value of thinking about the data generating process when building Bayesian statistical models. It is also a key step in Michael Betancourt’s Principled Bayesian Workflow. In this post, I will discuss in more detail how to set priors and how to review not only the prior and posterior parameter distributions, but also the prior predictive distributions with brms (Bürkner (2017)).

Models are about what changes, and what doesn't

How do you build a model from first principles? Here is a step-by-step guide. Following on from last week’s post on Principled Bayesian Workflow I want to reflect on how to motivate a model. The purpose of most models is to understand change, and yet, considering what doesn’t change and should be kept constant can be equally important. I will go through a couple of models in this post to illustrate this idea.

Visualising the predictive distribution of a log-transformed linear model

Last week I presented visualisations of theoretical distributions that predict ice cream sales statistics based on linear and generalised linear models, which I introduced in an earlier post. Today I will take a closer look at the log-transformed linear model and use Stan/rstan, not only to model the sales statistics, but also to generate samples from the posterior predictive distribution. The posterior predictive distribution is what I am most interested in.

Visualising theoretical distributions of GLMs

Two weeks ago I discussed various linear and generalised linear models in R using ice cream sales statistics. The data showed, not surprisingly, that more ice cream was sold at higher temperatures.

    icecream <- data.frame(
      temp = c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
      units = c(185L, 215L, 332L, 325L, 408L, 421L, 406L, 412L, 522L, 445L, 544L, 614L)
    )

I used a linear model, a log-transformed linear model, and Poisson and binomial generalised linear models to predict sales within and outside the range of data available.
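As a rough illustration of the models named in this excerpt, the sketch below fits the linear, log-transformed linear and Poisson models to the ice cream data and predicts outside the observed temperature range. The prediction temperatures (0 and 35 degrees) and all object names are my own assumptions. The binomial model is left out because it needs an assumed market size, which the excerpt does not state.

```r
# Ice cream data as given in the excerpt
icecream <- data.frame(
  temp = c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
  units = c(185L, 215L, 332L, 325L, 408L, 421L, 406L, 412L, 522L, 445L, 544L, 614L)
)

# Linear model: units ~ Normal(a + b * temp, sigma)
lin_mod <- lm(units ~ temp, data = icecream)

# Log-transformed linear model: log(units) ~ Normal(a + b * temp, sigma)
log_mod <- lm(log(units) ~ temp, data = icecream)

# Poisson GLM with log link: units ~ Poisson(exp(a + b * temp))
pois_mod <- glm(units ~ temp, data = icecream, family = poisson(link = "log"))

# Predict at temperatures outside the observed range (assumed values)
new_temps <- data.frame(temp = c(0, 35))
pred <- data.frame(
  temp = new_temps$temp,
  linear = predict(lin_mod, newdata = new_temps),
  # exp() of the log-scale prediction gives the median, not the mean
  log_linear = exp(predict(log_mod, newdata = new_temps)),
  poisson = predict(pois_mod, newdata = new_temps, type = "response")
)
print(pred)
```

The linear model can return negative sales at low temperatures, whereas the two models on the log scale always predict positive values. That is one reason to look beyond the straight line.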

Generalised Linear Models in R

Linear models are the bread and butter of statistics, but there is a lot more to it than taking a ruler and drawing a line through a couple of points. Some time ago Rasmus Bååth published an insightful blog article about how such models could be described from a distribution centric point of view, instead of the classic error terms convention. I think the distribution centric view makes generalised linear models (GLM) much easier to understand as well.

Extended Kalman filter example in R

Last week’s post about the Kalman filter focused on the derivation of the algorithm. Today I will continue with the extended Kalman filter (EKF), which can also deal with nonlinearities. According to Wikipedia the EKF has been considered the de facto standard in the theory of nonlinear state estimation, navigation systems and GPS. I had the following dynamic linear model for the Kalman filter last week: \[ \begin{aligned} x_{t+1} & = A x_t + w_t,\quad w_t \sim N(0,Q)\\ y_t &=G x_t + \nu_t, \quad \nu_t \sim N(0,R)\\ x_1 & \sim N(\hat{x}_0, \Sigma_0) \end{aligned} \]
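To make the linear model quoted in this excerpt concrete, here is a minimal sketch for the scalar case. It simulates states and observations and then runs the standard Kalman filter recursion. The parameter values (A, G, Q, R, the prior mean and variance) and the variable names are hypothetical choices for illustration, not taken from the post. The sketch covers the linear filter only, not the EKF.

```r
set.seed(1)
n <- 100

# Hypothetical scalar parameters for the model in the excerpt
A <- 1; G <- 1      # state transition and observation coefficients
Q <- 0.1; R <- 1    # state and observation noise variances
x0_hat <- 0; Sigma0 <- 1

# Simulate: x_1 ~ N(x0_hat, Sigma0), y_t = G x_t + nu_t, x_{t+1} = A x_t + w_t
x <- numeric(n); y <- numeric(n)
x[1] <- rnorm(1, x0_hat, sqrt(Sigma0))
for (t in seq_len(n)) {
  y[t] <- G * x[t] + rnorm(1, 0, sqrt(R))
  if (t < n) x[t + 1] <- A * x[t] + rnorm(1, 0, sqrt(Q))
}

# Kalman filter: m and P hold the predicted mean and variance of x_t
m <- x0_hat; P <- Sigma0
x_filt <- numeric(n); P_filt <- numeric(n)
for (t in seq_len(n)) {
  K <- P * G / (G^2 * P + R)            # Kalman gain
  x_filt[t] <- m + K * (y[t] - G * m)   # update with observation y_t
  P_filt[t] <- (1 - K * G) * P
  m <- A * x_filt[t]                    # predict the next state
  P <- A^2 * P_filt[t] + Q
}

# Compare raw observations and filtered estimates against the true state
c(mse_obs = mean((y - x)^2), mse_filter = mean((x_filt - x)^2))
```

With these settings the filtered estimates should track the simulated state more closely than the raw observations do, because the filter combines each observation with what it has learned from the earlier ones.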

Kalman filter example visualised with R

At the last Cologne R user meeting Holger Zien gave a great introduction to dynamic linear models (dlm). One special case of a dlm is the Kalman filter, which I will discuss in this post in more detail. I kind of used it earlier when I measured the temperature with my Arduino at home. Over the last week I came across the wonderful quantitative economic modelling site quant-econ.net, designed and written by Thomas J.

Binomial testing with buttered toast

Rasmus’ post of last week on binomial testing made me think about p-values and testing again. In my head I was tossing coins, thinking about gender diversity and toast. Tossing a piece of buttered toast was the most helpful thought experiment, as I didn’t have a fixed opinion on the probability of the toast landing on either side. I have yet to carry out some real experiments.

How to use optim in R

A friend of mine asked me the other day how she could use the function optim in R to fit data. Of course, there are built-in functions for fitting data in R and I wrote about this earlier. However, she wanted to understand how to do this from scratch using optim. The function optim provides algorithms for general-purpose optimisations and the documentation is perfectly reasonable, but I remember that it took me a little while to get my head around how to pass data and parameters to optim.
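The point about passing data and parameters can be sketched as follows. The first argument of the objective function is the parameter vector that optim varies, while further named arguments given to optim are forwarded to the objective unchanged. The simulated data, the function name `rss` and the starting values are assumptions made for this illustration.

```r
set.seed(42)

# Simulated data for illustration: y = 2 + 0.5 * x + noise
dat <- data.frame(x = 1:20)
dat$y <- 2 + 0.5 * dat$x + rnorm(20, sd = 0.3)

# Objective function: residual sum of squares of a straight line.
# 'par' is the parameter vector (intercept, slope); 'data' is passed through by optim.
rss <- function(par, data) {
  sum((data$y - (par[1] + par[2] * data$x))^2)
}

# Minimise the RSS from the starting values c(0, 1);
# 'data = dat' is forwarded to rss() via optim's '...' argument
fit <- optim(par = c(0, 1), fn = rss, data = dat, method = "BFGS")

# Compare with the built-in least-squares fit
fit$par
coef(lm(y ~ x, data = dat))
```

Both approaches minimise the same criterion here, so the estimates should agree closely. optim earns its keep when the objective has no ready-made fitting function.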

Now I see it! K-means cluster analysis in R

Of course, a picture on a computer monitor is a coloured plot of x and y coordinates or pixels. Still, I was smitten by David Sparks’ posts on is.r(), where he shows how easy it is to read images into R to analyse them. In two posts [1], [2] he replicates functionality of image manipulation programmes like GIMP. I can’t resist writing about this here as well. David’s first post is about k-means cluster analysis.