I had a lot of fun this morning talking to a room full of career counsellors and others in similar roles about what it means to be a modern statistician/scientist working on large, multidisciplinary projects. I talked a little about my experiences as a student, how it took me a while to settle into the field that I did, and showed a few of the cool topics I get to work on. Everyone seemed to want to keep listening, and I even got some feedback later that it was the first time they’d heard a mathematician speak about maths and found it interesting.
If there’s one key point I hope people took away, it’s that maths isn’t just about doing maths for its own sake but about solving problems. I mentioned my two favourite quotes to emphasise that studying maths, particularly in science, isn’t just about doing calculations by hand:
“Essentially, all models are wrong, but some are useful” – George Box (1987)
“Machines can do the work so humans have time to think” – IBM – The Paperwork Explosion (1967)
One of the issues with working on a number of multidisciplinary projects at the same time is that each one ends up taking longer than expected, which interrupts progress on all the others. That said, the report for the Great Barrier Reef project I’ve been working on has been finalised and accepted, and the paper on modelling jaguar presences and abundances has been finished and is published.
Since I’ve been working on these larger projects I’ve started putting together a site that is an alternative to a CV, a sort of research portfolio that lists the projects I’ve worked on and the papers that have come out of them. I figured that I can’t list all the papers and a description of them in my CV as it’ll blow out to a huge number of pages and be more like a biography. It’s all done in R Markdown knitted to a Tufte-inspired HTML template with a little CSS thrown in to modify the fonts and table of contents. It wasn’t actually that difficult to do, and I learned a bit more about Markdown in the process. The next thing I’d like to be able to do is write a CSL file for styling the bibliography in such a way that some part of the reference itself is the URL, rather than it being tacked on the end, and abbreviate authors’ first names. That way the end half of the page isn’t so cluttered.
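For anyone curious about the setup, the YAML front matter for such a page looks roughly like this (a minimal sketch, not my exact configuration — the file names and options here are assumptions):

```yaml
---
title: "Research portfolio"
output:
  tufte::tufte_html:
    toc: true
    css: custom.css   # hypothetical stylesheet tweaking fonts and the ToC
bibliography: portfolio.bib
---
```

Knitting an R Markdown document with this header produces the Tufte-styled HTML page, with the custom CSS layered over the template’s defaults.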
I’ve been working with the Teaching and Learning team at QUT’s Science and Engineering Faculty, and with the physics and chemistry academics, on improving the maths in the Bachelor of Science degree. Nothing’s finalised yet in terms of long-term planning, but we’ve been gradually solving problems over the last few years regarding students’ background maths skills coming into the unit, and recommending strategies that will help them get through their degrees. Feedback from the PULSE survey mid-semester indicates that we’re still doing a good job but probably need to rebalance a few topics and give a gentler introduction to R.
Since Nick Tierney came on board in SEB113, redid the lab worksheets in R Markdown and created videos showing how to work through the exercises, I’ve been gradually introducing more and more R Markdown into the teaching workflow. The pie-in-the-sky idea at the moment is to distribute lecture, lab and workshop material to students as a bookdown document that they can either clone or fork from a GitHub repository and work on. Any changes made to the book can be fetched so that students always have the most up-to-date version of the notes. The course could even be forked from one semester to another, or the book treated as releases. A number of the tutors in SEB113 are sold on R Markdown and the ability to include R analysis and LaTeX formatting in a set of slides, a report or a webpage, so there’d definitely be the staff to do it. There are certainly more pressing issues to solve around content and programming in general before we try to push first-year science students into using code-sharing platforms to download a textbook.
It’s been about a year, and a lot’s happened since then. The Diagnostic Quiz has gone from a tool for helping me understand my students better to a tool to help students choose the right pathway through their Science degrees. Now, if a student does poorly on certain sections of the diagnostic, particularly calculus and algebra, we recommend they hold off on SEB113 until second semester and take MZB101 – Introductory Modelling with Calculus – in its place. While I’ve not had a look yet at all the enrolment data, anecdotally a number of students have contacted me about switching out, and they’ve appreciated the feedback that they’ll need to cover a bit more mathematics to understand what their degree requires.
Unfortunately, when a student unenrols from my unit I lose all of their assessment items, which means I don’t have a record of the results for the students who move into MZB101. Perhaps a system that doesn’t link storage to enrolment as tightly as Blackboard does (MZB125 – Introductory Engineering Mathematics – uses WeBWorK for its diagnostic) would be a useful way to approach this. I’d love to do some analysis at the end of the year of the end of semester marks for those students who transferred out compared to the marks of those who remained in SEB113 but with low scores on the diagnostic.
With a cohort with better general mathematics skills than before, we’ll be able to spend less time catching up on simple algebra and calculus and more time extending what is covered in high school. I’ve found some nice physics examples for linear algebra (circuits) and differential equations (Torricelli’s law) and will be trying to grab a few more examples that we haven’t used before, particularly for assessment.
There’s a little more movement in our tutorials and workshops towards using packages from the tidyverse for our data munging and analysis. When we started four years ago we were using base graphics, reshape and then reshape2, tapply(), and writing loops with par(mfrow=c(2,2)) style stuff to do small multiples. Since introducing ggplot2 a semester or two later, we’ve been working on making the analysis as coherent as possible so students aren’t having to move between different conceptual models of what data are, how they’re stored and how we operate on them. The use of the %>% pipe is left as a bonus for those who feel comfortable programming, but the rest of the class will still be learning about gather, spread, group_by, summarise, summarise_each, and mutate.
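As a rough sketch of the kind of translation involved (the data and variable names below are invented for illustration, not our actual worksheet material), the tidyverse version of “reshape, then summarise by group” looks like this:

```r
# Invented wide-format data: one column of readings per measurement site.
# Requires the tidyr and dplyr packages (part of the tidyverse).
library(tidyr)
library(dplyr)

wide <- data.frame(
  day   = 1:3,
  siteA = c(10, 12, 11),
  siteB = c(20, 19, 21)
)

# gather() turns the site columns into key-value pairs (long format),
# replacing the reshape/reshape2 melt() step
long <- gather(wide, key = "site", value = "reading", siteA, siteB)

# group_by() + summarise() replaces tapply()-style code
site_means <- long %>%
  group_by(site) %>%
  summarise(mean_reading = mean(reading))
```

The long format is also exactly what ggplot2 wants for faceting, which is what lets us retire the par(mfrow=c(2,2)) loops.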
Oh, and I’m giving two two-hour lectures this semester, repeating for different groups within the cohort. It’s weird.
I’ve just signed an acceptance of an offer of employment which will take me fully back into maths at QUT: 50% teaching in the Mathematical Sciences School and 50% researching with Kerrie Mengersen under her ARC Laureate Fellowship. Over the last few years I’ve been supported variously by Professor Lidia Morawska in the International Laboratory for Air Quality and Health, the NHMRC Centre of Research Excellence for Air quality and health Research and evaluation, QUT’s Institute for Future Environments, and the Mathematical Sciences School, to all of whom I’m very grateful.
The second piece of big news is that SEB113 has been recognised with a Vice-Chancellor’s Performance Award for innovation in teaching, shared with Ruth Luscombe and Nick Tierney. We’ve put a lot of work into the unit this year, along with Iwona Czaplinski, Brett Fyfield, Jocelyne Bouzaid and Amy Stringer, and with the guidance of Ian Turner and Steve Stern. Ruth, Iwona, Brett and I have a paper accepted at an education conference next year, and it’s a nice confirmation of all that we’ve done over the last three years (from Sama Low Choy’s first delivery, when I was just a tutor) to take the unit from a grab bag of topics that students didn’t feel was particularly well connected to a coherent series of lecture-lab-workshop sequences. These introduce and reinforce six weeks each of mathematics and statistics topics that students tell us have helped them come to understand the role of quantitative analysis in science.
One of the standard population dynamics models that I learned in my undergrad mathematical modelling units was the Lotka-Volterra equations. These represent a very simple set of assumptions about populations, and while they don’t necessarily give physically realistic population trajectories, they’re an interesting introduction to the idea that systems of differential equations don’t necessarily have an explicit solution.
The assumptions are essentially: prey grow exponentially in the absence of predators; predation happens at a rate proportional to the product of the predator and prey populations; the birth of predators depends on the product of the predator and prey populations; and predators die off exponentially in the absence of prey. In SEB113 we cover non-linear regressions, the mathematical models that lead to them, and then show that mathematical models don’t always yield a nice closed-form function. We look at the equilibrium solutions and show that trajectories orbit around the non-trivial equilibrium rather than tending towards (or away from) it. We also look at what happens to the trajectories as we change the relative sizes of the rate parameters.
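Written out, those assumptions give the familiar pair of coupled ODEs, with $x$ the prey population, $y$ the predator population, and $\alpha, \beta, \delta, \gamma$ positive rate parameters:

$$\frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y$$

The non-trivial equilibrium sits at $(x^*, y^*) = (\gamma/\delta, \alpha/\beta)$, which is the point the closed orbits cycle around.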
Last time we did the topic, I posted about using the logistic growth model for our Problem Solving Task and it was pointed out to me that the model has a closed form solution, so we don’t explicitly need to use a numerical solution method. This time around I’ve been playing with using Euler’s method inside JAGS to fit the Lotka-Volterra system to some simulated data from sinusoidal functions (with the same period). I’ve put a bit more effort into the predictive side of the model, though. After obtaining posterior distributions for the parameters (and initial values) I generate simulations with lsode in R, where the parameter values are sampled from the posteriors. The figure below shows the median and 95% CI for the posterior predictive populations as well as points showing the simulated data.
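The Euler discretisation the JAGS model is built around is simple enough to sketch in a few lines of base R (the parameter values below are invented for illustration, not my posterior estimates):

```r
# Forward-Euler solution of the Lotka-Volterra system in base R.
# In the JAGS model the same update equations define the deterministic
# mean at each time step; here we just iterate them directly.
lv_euler <- function(x0, y0, alpha, beta, delta, gamma, dt, n_steps) {
  x <- numeric(n_steps + 1)
  y <- numeric(n_steps + 1)
  x[1] <- x0
  y[1] <- y0
  for (i in seq_len(n_steps)) {
    # Prey: exponential growth minus predation
    x[i + 1] <- x[i] + dt * (alpha * x[i] - beta * x[i] * y[i])
    # Predators: births from predation minus exponential die-off
    y[i + 1] <- y[i] + dt * (delta * x[i] * y[i] - gamma * y[i])
  }
  data.frame(t = (0:n_steps) * dt, prey = x, predator = y)
}

# Invented parameter values, purely for illustration
sim <- lv_euler(x0 = 10, y0 = 5, alpha = 1.1, beta = 0.4,
                delta = 0.1, gamma = 0.4, dt = 0.01, n_steps = 2000)
```

For the posterior predictive simulations proper I hand the sampled parameter values to lsode rather than relying on Euler steps, since forward Euler slowly spirals outwards on closed Lotka-Volterra orbits.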
The predictions get more variable as time goes on, as the uncertainty in the parameter values changes the period of the cycles that the Lotka-Volterra system exhibits. This reminds me of a chat I was having with a statistics PhD student earlier this week about the sensitivity of models to data. The student’s context is clustering of data using overfitted mixtures, but I ended up digressing and talking about Edward Lorenz’s discovery of chaos theory through a meteorological model that was very sensitive to small changes in parameter values. The variability in the parameter values in the posterior gives rise to the same behaviour, as both Lorenz’s work and my little example in JAGS involve variation in input values for deterministic modelling. Mine was deliberate, though, so it isn’t as exciting or groundbreaking a discovery as Lorenz’s, but we both come to the same conclusion: forecasting is of limited use when your model is sensitive to small variations in parameters. As time goes on, my credible intervals will likely end up being centred on the equilibrium solution, and the uncertainty in the period of the solution (due to changing coefficient ratios) will result in very wide credible intervals.
It’s been a fun little experiment again, and I’m getting more and more interested in combining statistics and differential equations, as it’s a blend of pretty much all of my prior study. The next step would be to use something like MATLAB with a custom Gibbs/Metropolis-Hastings scheme to bring in more of the computational mathematics I took. It’d be interesting to see if there’s space for this sort of modelling in the Mathematical Sciences School’s teaching programs as it combines some topics that aren’t typically taught together. I’ve heard murmurings of further computational statistics classes but haven’t been involved with any planning.
My students are working on their 25% assessment pieces, the Quantitative Workbook. These are group assignments that require students to do a quantitative analysis from start to finish on some ecology data we’ve given them. A few students are struggling with the p value concept, particularly what it means in the R summary.lm() output, so I responded with the following statement. It’s a bit more verbose than I might have liked, but I think it’s important to step through it from start to finish. It took me ages to get this as an undergrad.
The hypothesis test that R does and gives you in the regression summary asks:
What is the probability of seeing a test statistic (third column in the output) at least as extreme as what we have if the true value of the parameter were actually zero (this is our null hypothesis)?
Our best estimates of the parameters given the data we are using with our model (first column in the output) are found by minimising the sum of squared errors between the observed values and the fitted values (see the Normal equations slides from the linear algebra week). Our uncertainty about those estimates is given by the standard error of each estimate (second column in the output), which is related to the size of the standard deviation of the residuals. More uncertainty in our fitted values means more uncertainty in our parameter estimates. If the standard error is comparable in size to the estimate itself, then our uncertainty may mean we can’t reject the idea that the true value of the parameter is zero (i.e. we may not be able to detect that this variable has an effect).
The test statistic (third column) is assumed to come from a t distribution whose degrees of freedom is the number of data points we started with minus the number of parameters we’ve estimated. The idea of the test statistic coming from a t distribution reflects the notion that our data are a finite sample of all the data that could have been collected if the experiment were repeated an infinite number of times under the same conditions. If the test statistic is really far away from zero, then it’s very improbable that we would observe sampled data like this if the true value of the parameter were zero (i.e. if the relevant variable played no role in explaining the variation in the response variable).
It’s traditional in science to use a cutoff for the p value of 0.05, corresponding to whether a 95% confidence interval covers zero. This is saying “if the true value of the parameter were zero, about 1 in every 20 identically conducted experiments would still produce a test statistic this extreme purely by chance, and we accept that risk of a false positive”. If your p value, the probability of seeing a test statistic at least as extreme as this if the true value of the parameter is zero, is less than 0.05, then you’ve got evidence to reject the null hypothesis. Sometimes we want to be more confident and choose a cutoff of 0.01, corresponding to whether a 99% CI covers zero: if the p value is less than 0.01 (at most 1 in 100 such experiments would produce a test statistic this extreme by chance), then we have evidence to reject the null hypothesis at the 0.01 level. Sometimes we will accept a less stringent cutoff of 0.1 (1 in 10 experiments). Whatever level we choose must be stated up front.
So, in summary, the hypothesis we are testing is “the true value of the parameter is zero”, and the p value is a probabilistic statement that says “if I assume the true value is zero, what’s the probability of seeing a test statistic (which measures how far my estimate is from zero relative to my uncertainty about it) at least as extreme as this?”
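To connect the columns back to R, here’s a small simulated example (the data are invented) showing that the third and fourth columns of the summary.lm() coefficient table really are just the estimate divided by its standard error, and the corresponding two-tailed t probability:

```r
# Simulated data with a known linear relationship plus noise
set.seed(42)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 1)

fit <- lm(y ~ x)
tab <- summary(fit)$coefficients  # Estimate | Std. Error | t value | Pr(>|t|)

# Third column: test statistic = estimate / standard error
t_stat <- tab[, "Estimate"] / tab[, "Std. Error"]

# Fourth column: two-tailed p value from a t distribution with
# n - (number of parameters estimated) degrees of freedom
df <- length(x) - 2
p_val <- 2 * pt(-abs(t_stat), df)
```

Comparing t_stat and p_val against the third and fourth columns of tab shows they match, which is a nice way for students to convince themselves there’s nothing mysterious in the output.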
GitHub for Windows can be such a pain sometimes. I guess it’s partially my fault for attempting to use version control on the compiled PDF of a LaTeX document, but I spent a fair amount of time today attempting to fix up a colleague’s local repository. I’m now a bit more familiar with cherry-pick and rebase but it would have been nice to have it just work. For some reason, GitHub for Windows on this colleague’s computer simply will not sync properly; my colleague has had to become a bit more familiar with the common commands (push, pull, fetch, commit, merge). It works fine on their Mac, though. I run GitHub for OS X at home and it’s an absolute dream. At work (Windows XP) I have had no end of trouble with various programs like TortoiseGit. I think when I start my post-doc I’ll organise to have my computer converted to a Linux system.
After all that, though, we did make some pretty good progress on the modelling in this paper. I’m not quite sure which journal we’ll be sending it to but it’s a really nice piece of work with some personal monitoring data, simple but informative analysis and some very creative use of the base graphics system in R.