This week we moved on to regression with a categorical covariate, which we’ve used in the context of estimating the mean dissolved oxygen across three spatial locations or the mean grain size of different types of rock.

I think one of the biggest stumbling blocks to our learning with R is the way that lm() and associated functions deal with factor terms. Personally, I am not a fan of the way it chooses the first alphabetic level as the baseline. I know you can relevel the factor levels to choose another one but I’d be much happier with a sum to zero constraint so that the effects are all departures from the mean. This is the way that R-INLA does it and, to me, it makes a lot more sense. Factor terms (categorical variables) are essentially a random effects mean, especially in a Bayesian setting where everything can be treated as a random effect. That lm() makes us choose a baseline and treats the other effects as differences to that baseline means we end up with a coefficients table which is more difficult to interpret. One option is to omit the intercept term from the model, with a function call like lm(data=my.data, y ∼ factor(x) – 1) but that still doesn’t give you a sense of the overall mean.

In any case, the mixture of new regression techniques and hypothesis testing for whether or not some parameter is equal to zero is proving difficult. The difference between the t and standard Normal distributions seems to not be particularly well understood and while I’ve tried to make the link between a 95% confidence interval and hypothesis testing at a 5% level of significance quite explicit, the fact remains that these are both new concepts which are being taught by a relatively inexperienced lecturer to students whose mathematical literacy is generally not at the level of those I’ve tutored in subjects where Maths B was explicitly a pre-requisite rather than assumed knowledge.

I managed to explain the heavy tails issue in class with a little bit of pantomime, showing how one might paint the tails of the t and Normal distributions and run out of paint at different values of t (or x) based on how much paint you had to use at values far away from the mean. I think about a quarter of the class was struggling with the diagram of overlapping triangles which was meant to be a “zoomed in” version of the point where the density functions of the Normal and t cross over as they get further away from the mean. The lectorials are recorded so I’ll be interested to see how it translates to a radio play setting.

The computer labs are apparently quite dense at the moment, with a lot of fairly new ideas being reinforced in a 50 minute block. We’re quite fortunate to have all the labs before all the workshops this semester, so the lab is basically “here’s the code to do what was shown in the lectorials” and then the workshops are designed to implement the code for some problem and generate a bit of discussion. I think this week was probably one of the hardest, conceptually, because it brings together regression for categorical explanatory variables (a straight line is easier to understand than mutually exclusive sets of points), hypothesis testing, the t distribution and confidence intervals. I have uploaded some of last semester’s slides on the central limit theorem for those who may need them, but I think it’s more a familiarity and practice thing than the material being inherently inaccessible.

Next week we’ll be moving on to different families for Generalised Linear Models and the use of the nls() function to fit non-linear models such as asymptotic, compartment and bi-exponential. I’m not such a fan of nls (or even nlme) but we can hardly teach them how to use something like WinBUGS to define their own custom mean functions (because if you struggle with the t distribution you’re going to have kittens trying to deal with using the Beta distribution as a conjugate prior for the Binomial model) or even throw them into using gam() from mgcv. I wouldn’t be averse to teaching Generalised Additive Models in an advanced follow-up unit for this subject. If we did that, we could remove the GLMs from SEB113 (which we’ve only introduced this semester) and spend some time on random effects models. I think such a subject would require a much stronger background in mathematics, so students may need to take MAB120/125 and MAB121/126 before attempting such a unit. Still, food for thought as QUT continues to develop the new Bachelor of Science course.