This week in SEB113 we’ve started on regression with some simple linear models with one explanatory variable. As not everyone has a particularly strong statistics background (high school Maths B), there are definitely some challenges.

The big one seems to be moving from the Normal distribution, which everyone seems to get, for estimating the confidence interval of the mean, towards the t distribution for calculating confidence intervals for regression parameters. Putting the t distribution in its historical context, estimating quality across small batches at the Guinness brewery, helps a little with the question “Where did this even come from?” but doesn’t address the mathematics of it. Plotting a few different t distributions with varying degrees of freedom helps make the point that the t approaches the Normal as the degrees of freedom go to infinity, but does nothing to explain what the degrees of freedom actually are.
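That convergence is easy to check numerically as well as by plotting. A minimal Python sketch (the unit itself uses R; this is just for illustration) measuring the largest gap between the t density and the Normal density as the degrees of freedom grow:

```python
import numpy as np
from scipy.stats import norm, t

# Largest pointwise gap between the t density and the Normal density
# over a grid; it shrinks towards zero as the degrees of freedom grow
x = np.linspace(-4, 4, 201)
for df in [1, 5, 30, 1000]:
    gap = np.max(np.abs(t.pdf(x, df) - norm.pdf(x)))
    print(f"df = {df:>4}: max |t - Normal| = {gap:.4f}")
```

The gap shrinks steadily, which is the plotted story in one number per curve.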

I’ve found that explaining the data as a resource for fitting the regression model can be handy. For a data set with *n* points you have a maximum of *n* degrees of freedom. Each time you add a parameter to your regression model you consume a degree of freedom, because you’re imposing a constraint such as “there is a straight line”. If we had one data point in our data set and wanted to know the mean of the data, we would know it exactly; there would be no uncertainty left in our estimate (and therefore zero degrees of freedom). If we had two data points and wanted the mean, there would be some amount of uncertainty left because there’s now some variation in our data (we would have one degree of freedom left). If we had two points and wanted a line of best fit, we would be back to zero degrees of freedom because we have completely characterised the trend in the data set by joining the two points.
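The bookkeeping here is just subtraction, which is worth making explicit. A toy Python sketch of the three cases above (the function name is my own invention, not course code):

```python
def residual_df(n_points, n_parameters):
    """Degrees of freedom remaining after estimating n_parameters from n_points."""
    return n_points - n_parameters

# One point, estimating a mean: the mean is known exactly, nothing left over
print(residual_df(1, 1))  # 0
# Two points, estimating a mean: one degree of freedom remains
print(residual_df(2, 1))  # 1
# Two points, fitting a line (intercept and slope): back to zero
print(residual_df(2, 2))  # 0
```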

If we fit a regression with a total of *k* parameters on the right hand side of *Y*_{i} = *β*_{0} + *β*_{1} *X*_{1i} + *β*_{2} *X*_{2i} + … + *β*_{k-1} *X*_{k-1,i} + ϵ_{i} (a mean and some effects of explanatory variables), we would have *n* – *k* degrees of freedom. The fewer degrees of freedom you have left, the more mass your t distribution puts further from the mean. This means that you’re more uncertain about the value of the parameter, because you’re using your data to estimate other parameters.
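You can see that extra tail mass directly in the critical values used for 95% confidence intervals. A quick check with Python’s scipy (again a sketch of convenience, since the unit works in R):

```python
from scipy.stats import norm, t

# Half-width multiplier for a 95% confidence interval at each df:
# the fewer degrees of freedom remain, the wider the interval
for df in [2, 5, 10, 30, 100]:
    print(f"df = {df:>3}: t critical value = {t.ppf(0.975, df):.3f}")
print(f"Normal limit:        {norm.ppf(0.975):.3f}")  # 1.960
```

At 2 degrees of freedom the multiplier is over twice the Normal value, which is exactly the extra uncertainty the paragraph above describes.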

Coming from a Bayesian perspective and looking at this sort of mathematics, it’s easy to conclude that the model states that the parameters have a t distribution. This is, of course, completely incorrect because the parameters are fixed constants with unknown values; it is the t statistic built from the estimator, not the parameter itself, that has a t distribution under repeated sampling. This idea totally throws me as I’ve been working with Bayesian analysis for the last few years (with the exception of perhaps one paper) and I’m used to thinking of all parameters as being random variables.

This is related to the hypothesis testing and confidence interval issues that I have with the way first year statistics is taught. Confidence intervals, as I’ve mentioned previously, are counter-intuitive: it is the interval, not the parameter, that is random in our estimates of the true values of parameters. I like the approach that we’re taking, where we look at whether the 95% confidence interval covers zero in order to make statements about whether or not the parameter plays a role in explaining the variation that we are modelling. I don’t like that we then calculate p values for testing the hypothesis that the parameter is equal to zero. These tests are statements about the probability of seeing a test statistic at least as extreme as the one observed, given the model that we’re working with. It’s all backwards and leads inexperienced students to make statements such as “We accept the null hypothesis” and “The variable is statistically insignificant”, both of which are nails on a chalkboard to my ears.

I know that we can’t teach statistics the way I would like to teach it, as these are science students who will be entering a field where ANOVA and t tests are still commonly used not as exploratory data analysis but as the basis for inference. I am very thankful for the fact that we are moving away from testing and towards modelling, and I’ve been trying to make the point in lectures that modelling allows us to do prediction, whereas testing only allows us to talk about what we’ve seen. If we can make sure the students can fit a model in R and use it to predict and/or make inferences, I think we’ll have done our jobs, because that is far more than I was able to do when I finished MAB101 ten years ago, when everything required us to look through page after page of statistical tables, hunting for the right p value.
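As a sketch of what “fit and predict” looks like in code (the students would do this with lm and predict in R; here is a Python version on toy data I’ve made up):

```python
import numpy as np

# Toy data with a roughly linear trend (values invented for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line, then predict at x values we never observed
slope, intercept = np.polyfit(x, y, 1)
new_x = np.array([6.0, 7.0])
predictions = intercept + slope * new_x
print(predictions)  # roughly [12.0, 14.0]
```

That last step is the point: the fitted model says something about data we haven’t seen, which a test on the observed data alone never does.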

Edit: 150 posts!