I’ve been spending a bit of time over the last few days making an R tutorial for the members of my air quality research group. Rather than being a very general introduction to the use of R, e.g. file input/output, loops, making objects, I’ve decided to show a very applied workflow that involves the actual data analysis and explaining ideas as we go along. Part of this philosophy is that I’m not going to write a statistics tutorial, opting instead to point readers to textbooks that deal with first year topics such as regression models and hypothesis tests.
It’s been a very interesting experience, and it’s meant having to deal with challenges along the way such as PDF graphs that take up so much file space for how (un-)important they are to the overall guide and, thinking about how to structure the tutorial so that I can assume zero experience with R but some experience with self-directed learning. The current version can be seen here.
One of the ideas that Sama Low Choy had for SEB113 when she was unit coordinator and lecturer and I was just a tutor, was to write a textbook for the unit because there wasn’t anything that really covered our approach. Since seeing computational stats classes in the USA being hosted as repositories on GitHub I think it might be possible to use R Markdown or GitBook to write an R Markdown project that could be compiled either as a textbook with exercises or as a set of slides.
I used to not be a very confident public speaker. I remember getting up at a community meeting in 2007 and stammering some words out to a group of residents; it was a disaster. Motivated for the desire for some money to augment my Youth Allowance payments I applied to be a tutor with the School of Mathematics (QUT) during my final years of undergrad and found that I became a bit better at talking to people. My Honours seminar was still a nervous affair but it was much less disastrous than the community meeting.
After Honours I had a job teaching mathematics to a group of video game programmers, developing the curriculum to suit their needs and interests and it’s here that I became far more comfortable with speaking. I was coming up with my own material and delivering it to people who I knew were interested in it. That’s a world away from teaching university students, where many may not see the point in learning what I’m teaching. This is especially the case in service mathematics and statistics units.
During my PhD studies I got interested in improvised theatre as a creative alternative to the mathematics, statistics and science that was my day. My reputation as someone not afraid to get up in front of 100 people and perform lead to my being asked by one of my PhD supervisors if I’d like to be a tutor in the brand new SEB113 course. Teaching students how to use R for their data analysis? Of course I’m interested! After the end of a very enjoyable, if somewhat disjointed, semester I was asked if I’d consider lecturing the smaller second semester re-run. I jumped at the chance. Continue reading →
My students are working on their 25% assessment pieces, the Quantitative Workbook. These are group assignments that require students do a quantitative analysis from start to finish on some ecology data we’ve given them. A few students are struggling with the p value concept, particularly what it means in the R summary.lm() output. I responded to the student with the following statement. It’s a bit more verbose than I might have liked but I think it’s important to try to step it through from start to finish. It took me ages to get this as an undergrad.
The hypothesis test that R does and gives you in the regression summary asks:
What is the probability of seeing a test statistic (third column in the output) at least as extreme as what we have if the true value of the parameter were actually zero (this is our null hypothesis)?
Our best estimates of the parameters given the data we are using with our model (first column in the output) are found by minimising the sum of squares of the errors between the observed values and the fitted values (see the Normal equations slides from the linear algebra week). Our uncertainty about those estimates is given to us with the standard error of the estimate (second column in the output) which is related to the size of the standard deviation of the residuals. More uncertainty in our fitted values reflects uncertainty in our parameter estimates. If the standard error is comparable in size to the estimate, then perhaps our uncertainty may mean we can’t reject the idea that the true value of the parameter is zero (i.e. we may not be able to detect that this variable has an effect).
The test statistic (third column) is assumed to come from a t distribution whose degrees of freedom is the number of data points we started with minus the number of parameters we’ve observed. The idea of the test statistic coming from a t distribution reflects the notion that our data is a finite sample of all the data that could have been collected if the experiment were repeated an infinite number of times under the same conditions. If the test statistic is really far away from zero, then it’s very improbable that we would observe sampled data like this if the true value of this parameter were zero (i.e. the relevant variable plays no role in explaining the variation in the response variable).
It’s traditional in science to use a cutoff for the p value of 0.05, corresponding to whether a 95% confidence interval covers zero. This is saying “we accept that in 1 out of every 20 identically conducted experiments we may see no observable effect, the rest of the time we see it”. If your p value, the probability of seeing a test statistic at least as extreme as this if the true value of the parameter is zero, is less than 0.05 then you’ve got evidence to reject the null hypothesis. Sometimes we want to be really confident and we choose a cutoff of 0.01, corresponding to whether a 99% CI covers zero. If the p value is less than 0.01 (where only at most 1 in 100 experiments show us a zero effect) then we have evidence to reject the null hypothesis at our 0.01 level. Sometimes we will accept a less confident cutoff of 0.1 (1 in 10 experiments). Whatever level we choose must be stated up front.
So in summary the hypothesis we are testing is “The true value of the parameter is zero”, the p value is a probabilistic statement that says “If I assume the true value is zero, what’s the probability of seeing a test statistic (that represents how uncertain I am about my estimate) at least as big as this?”
The statistics group that I’m part of is publishing a book that details how we do Bayesian statistics, what it’s used for and how people can use it. I wrote a chapter on Bayesian splines which is basically a few recipes for splines with some illustrative examples with small data sets. The work in the book chapter is currently being extended into something more useful as part of my PhD but the chapter itself does a decent job of introducing a few simple spline types and shows how you can generate the basis and use them in an adaptive Metropolis-Hastings framework to fit a univariate regression model.
The director of my stats group is one of the editors of the book and has asked me to give a small lecture to her Bayesian Data Analysis class, an honours level unit. I’ll be walking the students through the chapter and hoping that they’re familiar enough with the idea of priors to understand that a prior doesn’t just have to mean “I think this parameter has this value”. The idea of a prior being used to inform the values of the difference between spline coefficients  is a really nice Bayesian analogue of the frequentist approach (and does away with GCV, instead maximising the posterior density)  and I think makes a lot more sense than attempting to do a grid-based GCV approach, incorporating the amount of smoothing into the MCMC sampler (or INLA fitting).
It’ll be good fun.
 Lang, S. and Brezger, A. (2004): Bayesian P-Splines. Journal of Computational and Graphical Statistics, 13, 183-212. Download
 Eilers, P. H. C. and Marx, B. D. (1996): Flexible smoothing with B-splines and penalties. Statistical Science, 11 (2), 89-121. Download