That feeling when former students contact you

Last year I had a student in SEB113 who came in to the subject with a distaste for mathematics and statistics; they struggled with both the statistical concepts and the use of R throughout the semester and looked as though they would rather be anywhere else during the collaborative workshops. This student made it to every lecture and workshop though and came to enjoy the work of using R for statistical analysis of data; and earned a 7 in the unit.

I just got an email from them asking for a reference for their VRES (Vacation Research Experience Scheme) project application. Not only am I proud of this student for working their butt off to get a 7 in a subject they disliked but came to find interesting, but I am over the moon to hear that they are interested in undertaking scientific field research. This student mentions how my “passion for teaching completely transformed my (their) view of statistics”, and their passion for the research topic is reflected in the email.

This sort of stuff is probably the most rewarding aspect of lecturing.

Lotka-Volterra and Bayesian statistics and teaching

One of the standard population dynamics models that I learned in my undergrad mathematical modelling units was the Lotka-Volterra equations. These represent a very simple set of assumptions about populations, and while they don’t necessarily give physically realistic population trajectories they’re an interesting introduction to the idea that differential equations systems don’t necessarily have an explicit solution.

The assumptions are essentially: prey grow exponentially in the absence of predators, predation happens at a rate proportional to the product of the predator and prey populations, birth of predators is dependent on the product of predator and prey populations, predators die off exponentially in the absence of prey. In SEB113 we cover non-linear regressions, the mathematical models that lead to them, and then show that mathematical models don’t always yield a nice function. We look at equilibrium solutions and then show that we orbit around it rather than tending towards (or away from) it. We also look at what happens to the trajectories as we change the relative size of the rate parameters.

Last time we did the topic, I posted about using the logistic growth model for our Problem Solving Task and it was pointed out to me that the model has a closed form solution, so we don’t explicitly need to use a numerical solution method. This time around I’ve been playing with using Euler’s method inside JAGS to fit the Lotka-Volterra system to some simulated data from sinusoidal functions (with the same period). I’ve put a bit more effort into the predictive side of the model, though. After obtaining posterior distributions for the parameters (and initial values) I generate simulations with lsode in R, where the parameter values are sampled from the posteriors. The figure below shows the median and 95% CI for the posterior predictive populations as well as points showing the simulated data.

lvThe predictions get more variable as time goes on, as the uncertainty in the parameter values changes the period of the cycles that the Lotka-Volterra system exhibits. This reminds me of a chat I was having with a statistics PhD student earlier this week about sensitivity of models to data. The student’s context is clustering of data using overfitted mixtures, but I ended up digressing and talking about Edward Lorenz’s discovery of chaos theory through a meteorological model that was very sensitive to small changes in parameter values. The variability in the parameter values in the posterior give rise to the same behaviour, as both Lorenz’s work and my little example in JAGS involve variation in input values for deterministic modelling. Mine was deliberate, though, so isn’t as exciting or groundbreaking a discovery as Lorenz but we both come to the same conclusion: forecasting is of limited use when your model is sensitive to small variations in parameters. As time goes on, my credible intervals will likely end up being centred on the equilibrium solution and the uncertainty in the period of the solution (due to changing coefficient ratios) will result in very wide credible intervals.

It’s been a fun little experiment again, and I’m getting more and more interested in combining statistics and differential equations, as it’s a blend of pretty much all of my prior study. The next step would be to use something like MATLAB with a custom Gibbs/Metropolis-Hastings scheme to bring in more of the computational mathematics I took. It’d be interesting to see if there’s space for this sort of modelling in the Mathematical Sciences School’s teaching programs as it combines some topics that aren’t typically taught together. I’ve heard murmurings of further computational statistics classes but haven’t been involved with any planning.

Paper helicopters

There is no textbook for SEB113 – Quantitative Methods in Science.

It’s not that we haven’t bothered to prescribe one, it’s that no one seemed to be taking the same approach we decided upon two years ago when the planning for the unit started. There are books on statistics for chemistry, statistics for ecology, statistics for physics, statistics for mathematics, etc. but trying to find a general “statistics for science” book that focuses on modelling rather than testing has been difficult.

That said, there are some amazing resources out there if you know where to look, not just for learning statistics but for teaching statistics. One of the most useful that we’ve come across is “Teaching Statistics“, by Andrew Gelman and Deborah Nolan. The book itself is full of advice for things like groupwork, topic order, structure of learning activities, etc. but my favourite thing so far is the paper helicopter experiment. Continue reading

Running Bayesian models

I came across a post via r/Bayes about different ways to run Bayesian hierarchical linear models in R, a topic I talked about recently at a two day workshop on using R for epidemiology. Rasmus Bååth‘s post details the use of JAGS with rjags, STAN with rstan and LaplacesDemon.

JAGS (well, rjags) has been the staple for most of my hierarchical linear modelling needs over the last few years. It runs within R easily, is written in C++ (so is relatively fast), spits out something that the coda package can work with quite easily, and, above all, makes it very easy to specify models and priors. Using JAGS means never having to derive a Gibbs sampler or write out a Metropolis-Hastings algorithm that requires to you to really think about jumping rules. It’s Bayesian statistics for those who don’t have the time/inclination to do it “properly”. It has a few drawbacks, though, such as not being able to specify improper priors (but this could be seen as a feature rather than a bug) with distributions like dflat() and defining a Conditional Autoregressive prior requires specifying it as a multivariate Gaussian. That said, it’s far quicker than using OpenBUGS and JAGS installs fine on any platform. Continue reading

Posterior Samples

Interested in collaborative use of R, MATLAB, etc. for analysis and visualisation within a webpage? Combining plotly and iPython can help you with that.

Cosmopolitan (yes, that Cosmopolitan) has a great article interviewing Emily Graslie, Chief Curiosity Officer at the Field Museum in Chicago. She discusses being an artist and making the transition into science, science education and YouTube stardom.

A few of the PhD students in my lab have asked if I could run an introduction to R session. I’d mentioned the CAR workshop that I’d be doing but the cost had put them off. Luckily, there are alternatives like Datacamp, Coursera and Lynda. Coursera’s next round of “Data Science”, delivered by Johns Hopkins University, starts next Monday (Course 1 – R Programming). So get in there and learn some R! I’m considering recommending some of these Coursera courses to my current SEB113 students who want to go a bit further with R, but the approach that they take in these online modules is quite different to what we do in SEB113 and I don’t want them to confuse themselves.


Learning different programming languages

One of the biggest changes I noticed moving from doing statistics in Minitab (in first year data analysis) to doing statistics in R (in third year statistical inference) was that R encourages you to write functions. Normally this is done by writing functions in R’s own language (that call other functions, also written in R, which eventually call functions written in C) but it’s also possible to make use of other languages to do the heavy lifting. This isn’t unique to R, of course; MATLAB encourages the use of MEX files to improve run-times when you need to call the same custom function over and over again.

I’ve really only used high level languages to do my statistics, making use of other peoples’ optimised code that do the things that I want. I’ve seen the development of pyMCMC by reseachers at QUT and someone from my NPBayes reading group made quite heavy use of RCpp in his thesis. Python and C++ are probably the two languages that would be the most useful to learn given their ubiquity and reputation. I have been putting off learning these for years as I know that there’s a large time investment required to become proficient in programming and no external pressure to learn (unlike learning R as part of my PhD thesis work).

There’s no doubt that writing optimised code is a desirable thing to do, and that knowing more than one programming language (and how to use them together) gives you a much richer toolbox for numerically solving problems. I’m now at a point, though, where it looks like I may need to bite the bullet and pick up C++. JAGS, which I use through rjags in R, is a stable, fast platform for MCMC-based inference. It’s written in C++ and notifies you every time you load it in R that it has loaded the basemod and bugs modules. There are additional modules available (check in \JAGS\JAGS-3.4.0\x64\modules\) and it’s possible to write your own, as long as you know C++.

I’m at a point with the work I’ve been doing on estimating personal dose of ultrafine particles that I’d like to make the modelling more Bayesian, which includes figuring out a way to include the deposition model in the MCMC scheme (as I’d like to put a prior on the shape parameter of the reconstructed size distribution). My options seem to be either writing a JAGS module that will allow me to call a C++ified version of the function or to abandon JAGS and write a Gibbs sampler (or Metropolis-Hastings, but Gibbs will likely be quicker given the simplicity of the model I’m interested in). Either solution will stretch me as a programmer and probably give me a better understanding of the problem. Eubank and Kupresanin’s “Statistical Computing in C++ and R” is staring at me from the shelf above my desk.

Posterior samples

ARC Discovery Projects have been returned to their authors, and we are putting our responses together for the rejoinders. Interesting to see that we got a comment suggesting that we use the less restrictive CC-by instead of CC-by-nc-sa as we’d suggested. We weren’t successful in our Linkage Project applications, which is disappointing as they were interesting projects (well, we thought so). Continuing to bring research funding in is an ongoing struggle for all research groups and I feel it’s only going to get harder as the new federal government’s research priorities appear to be more aligned to medical science that delivers treatments than to our group’s traditional strengths.

SEB113 is pretty much completely over for the semester, with marks having been entered for almost every student. Overall I think the students did fairly well. We had some issues with the timetable this semester. Ideally, we’d like the Lecture, then all of the computer labs, then all of the workshops, so that we can introduce a statistical idea, show the code and then apply the idea and code in a group setting. Next semester, we have the lecture followed immediately by the workshops with the computer labs dotted throughout the remainder of the week. This has provided us with an opportunity to try some semi-flipped classroom ideas, where students are able/expected to do the computer lab at home at their own pace rather than watch a tutor explain it one line at a time at the front of a computer lab.

I’m teaching part of a two day course on the use of R in air pollution epidemiology. My part will introduce Bayesian statistics with a brief overview, a discussion about prior distributions as a means of encoding a priori beliefs about model parameters, and discuss the use of Bayesian hierarchical modelling (as opposed to more traditional ANOVA techniques) as a way of making the most of the data that’s been collected. The other two presenters are Dr Peter Baker and Dr Yuming Guo. The course is being run by the CAR-CRE, who partially fund my postdoctoral fellowship.

I had meant to post this back when they were doing the rounds, but there’s a bunch of plots that attempt to show that correlation isn’t causation and that spurious correlations exist in large data sets. Tom Christie has responded to this by going over the fact that correlation in time series isn’t as simple as in the case of independent, identically distributed data. One should be careful that one’s criticism of bad statistics is itself founded on good statistics.