Learning different programming languages

One of the biggest changes I noticed moving from doing statistics in Minitab (in first year data analysis) to doing statistics in R (in third year statistical inference) was that R encourages you to write functions. Normally this is done by writing functions in R’s own language (that call other functions, also written in R, which eventually call functions written in C) but it’s also possible to make use of other languages to do the heavy lifting. This isn’t unique to R, of course; MATLAB encourages the use of MEX files to improve run-times when you need to call the same custom function over and over again.

I’ve really only used high level languages to do my statistics, making use of other people’s optimised code that does the things I want. I’ve seen the development of pyMCMC by researchers at QUT, and someone from my NPBayes reading group made quite heavy use of Rcpp in his thesis. Python and C++ are probably the two languages that would be the most useful to learn, given their ubiquity and reputation. I have been putting off learning them for years, as I know there’s a large time investment required to become proficient in programming and, unlike learning R as part of my PhD thesis work, there’s no external pressure to learn.
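For a taste of what the Rcpp route looks like, here’s a minimal sketch (my own toy example, nothing from that thesis) of compiling a small C++ function from inside an R session; the function name and body are made up for illustration.

```r
# Minimal Rcpp sketch: compile a small C++ function from within R.
# Requires the Rcpp package and a working C++ toolchain.
library(Rcpp)

cppFunction('
  double sumSquares(NumericVector x) {
    double total = 0.0;
    for (int i = 0; i < x.size(); ++i) {
      total += x[i] * x[i];
    }
    return total;
  }
')

sumSquares(c(1, 2, 3))  # 14, the same as sum(x^2) in R
```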

There’s no doubt that writing optimised code is a desirable thing to do, and that knowing more than one programming language (and how to use them together) gives you a much richer toolbox for numerically solving problems. I’m now at a point, though, where it looks like I may need to bite the bullet and pick up C++. JAGS, which I use through rjags in R, is a stable, fast platform for MCMC-based inference. It’s written in C++ and notifies you every time you load it in R that it has loaded the basemod and bugs modules. There are additional modules available (check in \JAGS\JAGS-3.4.0\x64\modules\) and it’s possible to write your own, as long as you know C++.
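For what it’s worth, you can inspect and extend the loaded modules from R; a quick sketch (the glm module used here ships with JAGS, so no C++ is required for this bit):

```r
# A quick look at JAGS modules from R via rjags
library(rjags)        # reports that the basemod and bugs modules have loaded
list.modules()        # modules currently loaded
load.module("glm")    # pull in an extra module distributed with JAGS
list.modules()
```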

I’m at a point with the work I’ve been doing on estimating personal dose of ultrafine particles where I’d like to make the modelling more Bayesian, which includes figuring out a way to include the deposition model in the MCMC scheme (as I’d like to put a prior on the shape parameter of the reconstructed size distribution). My options seem to be either writing a JAGS module that will allow me to call a C++ified version of the function, or abandoning JAGS and writing a Gibbs sampler (or Metropolis-Hastings, but Gibbs will likely be quicker given the simplicity of the model I’m interested in). Either solution will stretch me as a programmer and probably give me a better understanding of the problem. Eubank and Kupresanin’s “Statistical Computing in C++ and R” is staring at me from the shelf above my desk.
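To give a sense of what the hand-rolled option involves, here’s a minimal Gibbs sampler sketch for a toy normal model with conjugate priors. It has nothing to do with the deposition model itself; the data, priors and names are all made up for illustration.

```r
# Toy Gibbs sampler: normal data with unknown mean (mu) and precision (tau),
# conjugate priors mu ~ N(mu0, 1/tau0) and tau ~ Gamma(a0, b0)
set.seed(1)
y    <- rnorm(50, mean = 3, sd = 2)
n    <- length(y); ybar <- mean(y)
mu0  <- 0; tau0 <- 0.01; a0 <- 0.1; b0 <- 0.1   # vague-ish priors

n.iter <- 5000
mu  <- numeric(n.iter); tau <- numeric(n.iter)
mu[1] <- ybar; tau[1] <- 1 / var(y)

for (i in 2:n.iter) {
  # Full conditional for mu given tau
  prec  <- tau0 + n * tau[i - 1]
  mu[i] <- rnorm(1, (tau0 * mu0 + n * tau[i - 1] * ybar) / prec, sqrt(1 / prec))
  # Full conditional for tau given mu
  tau[i] <- rgamma(1, a0 + n / 2, b0 + 0.5 * sum((y - mu[i])^2))
}

quantile(mu[-(1:1000)], c(0.025, 0.5, 0.975))  # posterior summary after burn-in
```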

Posterior samples

ARC Discovery Project proposals have been returned to their authors, and we are putting our responses together for the rejoinders. It was interesting to see a comment suggesting that we use the less restrictive CC BY licence instead of the CC BY-NC-SA we’d proposed. We weren’t successful in our Linkage Project applications, which is disappointing as they were interesting projects (well, we thought so). Bringing in research funding is an ongoing struggle for all research groups, and I feel it’s only going to get harder as the new federal government’s research priorities appear to be more aligned with medical science that delivers treatments than with our group’s traditional strengths.

SEB113 is pretty much over for the semester, with marks entered for almost every student. Overall I think the students did fairly well. We had some issues with the timetable this semester: ideally we’d like the lecture, then all of the computer labs, then all of the workshops, so that we can introduce a statistical idea, show the code, and then apply the idea and code in a group setting. Next semester we have the lecture followed immediately by the workshops, with the computer labs dotted throughout the remainder of the week. This gives us an opportunity to try some semi-flipped classroom ideas, where students are able (and expected) to do the computer lab at home at their own pace rather than watch a tutor explain it one line at a time at the front of the room.

I’m teaching part of a two-day course on the use of R in air pollution epidemiology. My part will give a brief overview of Bayesian statistics, discuss prior distributions as a means of encoding a priori beliefs about model parameters, and cover the use of Bayesian hierarchical modelling (as opposed to more traditional ANOVA techniques) as a way of making the most of the data that’s been collected. The other two presenters are Dr Peter Baker and Dr Yuming Guo. The course is being run by the CAR-CRE, which partially funds my postdoctoral fellowship.

I had meant to post this back when they were doing the rounds, but there’s a bunch of plots that attempt to show that correlation isn’t causation and that spurious correlations exist in large data sets. Tom Christie has responded to this by going over the fact that correlation in time series isn’t as simple as in the case of independent, identically distributed data. One should be careful that one’s criticism of bad statistics is itself founded on good statistics.
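As a quick illustration of the time series point (my own toy example, not one of Christie’s): two completely independent random walks will routinely show a large sample correlation, which vanishes once you look at their increments.

```r
# Two independent random walks often look strongly "correlated",
# even though neither has anything to do with the other
set.seed(42)
x <- cumsum(rnorm(200))
y <- cumsum(rnorm(200))
cor(x, y)                 # frequently far from zero
cor(diff(x), diff(y))     # correlation of the increments is near zero
```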

Posterior samples

A rough guide to spotting bad science.

Why big data is in trouble: they forgot about applied statistics. Big data analytics are all well and good but you have to keep in mind that there are statistical properties that govern which inferences are valid.

While I’m comfortable giving a lecture, I really struggled to get through them as an undergraduate. It turns out they may not be the most effective way to get information to students.

My supervisor, Professor Lidia Morawska, is giving a public talk (free to register) at QUT soon: “Air Quality Reports On Our Mobiles – Do We Care?”, 6 June 2014.

The ongoing crusade against Excel-based analysis

One of the things I catch myself saying quite often in SEB113 is “This is new. It’s hard. But remember, you weren’t born knowing how to walk. You learned it”, as my way of saying that it’s okay to not understand this straight away, it takes time, practice and determination. I often say this in response to students complaining about learning R to do their data analysis. It’s actually got to the point where the unit co-ordinator suggested I get a t-shirt printed with “You weren’t born knowing how to walk” on the front and “So learn R” on the back.

One of the reasons I’m so keen to push new students into learning R is that while Excel can do some of the simpler calculations required in the first year of a science degree it is often completely inadequate for doing data analysis as a professional scientist, or even in an advanced level university course. I actually saw a senior researcher in a 3 day Bayesian statistics course try to avoid using R to code a Gibbs sampler by getting it up and running in Excel. They managed it, but it took minutes to run what the rest of us could compute in a second (and it was for a trivially simple problem).

There are problems with Excel itself, such as its inability to deal with the standard deviation of a group of very large numbers because of the numerically unstable formula it uses. Apparently the secret to sane use of Excel is to only use it for data storage. That guiding principle has meant I no longer manipulate my data in Excel; even with time stamp information, I’ll fire up the lubridate package to convert from one format to another. I’m slowly exploring the Hadleyverse, and that sort of approach is filtering through into SEB113, where we’re teaching the use of ggplot2 and reshape2 within RStudio. These are all powerful tools that simplify data analysis and avoid the hackish feel of much Excel-based analysis, where pivot tables are a thing and graphs are made by clicking and dragging a selection tool down the data (which can lead to some nasty errors).
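As a small sketch of that workflow (the file name, column names and timestamp format here are made up), the sort of thing I mean looks like:

```r
# Keep the raw data untouched in its file; do the munging and plotting in R
library(lubridate)
library(reshape2)
library(ggplot2)

dat <- read.csv("monitor_readings.csv", stringsAsFactors = FALSE)  # hypothetical file
dat$Timestamp <- dmy_hms(dat$Timestamp)    # parse "31/12/2013 23:59:59" style stamps

# Wide to long, so each row is one (time, variable, value) observation
long <- melt(dat, id.vars = "Timestamp")

ggplot(long, aes(x = Timestamp, y = value)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free_y")
```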

The fact that these powerful tools that make data analysis simple are free is another reason to choose R over Excel. I’m not on the “Open Source Software and provision of all code is mandatory” bandwagon as others seem to be when it comes to analysis being replicable. I agree it’s a worthwhile goal but it’s not a priority for me. That said, though, I definitely support encouraging the use of free software (in both senses) in education on the grounds of equity of access.

I had a chat with some students in SEB113 yesterday about why we’re teaching everything in R, given that the SEB114 staff use a combination of Excel, MATLAB and maybe other packages I don’t know about. If we were to teach analysis the way the SEB114 lecturers do it themselves, we’d have to teach multiple packages to multiple disciplines. Even setting aside the fact that everything we teach is implemented in R, the fact that R is free (unlike Excel and MATLAB), cross-platform (Excel on Linux? Try OpenOffice/LibreOffice) and extensible (MATLAB has toolboxes, Excel has add-ins, R has a nice package manager) was a big plus for students, who said that being able to work on assignments at home was valuable and that paying for software would make study difficult.

Convincing students to use R can be difficult, especially if they have no programming background, but ultimately they seem to accept that R is powerful, can do more than Excel and that writing reusable code makes future analysis easier. Convincing SEB114 academics that teaching their students to use R is a good idea is probably a harder sell, given that they’ve got years of experience with other tools. It’s still only semester 3 of the new Bachelor of Science course so we’ll have to see how this plays out over the years to come.

Posterior samples

I’m teaching science students how to do statistics. It would be great if we could turn them into Bayesians, especially seeing as we’ve just covered the Agresti-Coull correction for estimating proportions from small experiments. Andrew Gelman has an interesting paper on teaching Bayesian statistics to non-statisticians that focuses on the delivery of skills rather than concepts. I would definitely agree with his approach, especially when you consider how he stresses that discussing the model is probably the most important part.
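Since the Agresti-Coull correction came up, here’s a quick sketch of it in R, written out by hand in the usual add-pseudo-observations form (the function name is mine):

```r
# Agresti-Coull interval for a binomial proportion:
# add z^2/2 "successes" and z^2/2 "failures", then use the Wald formula
agresti_coull <- function(x, n, conf.level = 0.95) {
  z       <- qnorm(1 - (1 - conf.level) / 2)
  n.tilde <- n + z^2
  p.tilde <- (x + z^2 / 2) / n.tilde
  half    <- z * sqrt(p.tilde * (1 - p.tilde) / n.tilde)
  c(lower = max(0, p.tilde - half), estimate = p.tilde, upper = min(1, p.tilde + half))
}

agresti_coull(x = 2, n = 10)   # 2 successes out of 10 trials
```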

NASA have done some work simulating global aerosols and it’s been compiled into a neat video (via It’s Okay To Be Smart’s Joe Hanson). CSIRO have been doing some interesting stuff looking at the production of organic aerosols as well, so this is something I’m paying a bit more attention to at the moment.

Datacamp is a set of online labs for learning to use R, covering the basics of R, data analysis and statistical inference, and computational finance and econometrics.

Learn to be a better coder by improving your communication skills. The most practical (in terms of coding, at least) aspects of this are using meaningful names and writing comments that describe what the code does when it isn’t clear from the code itself.
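A toy before-and-after in R of what I take that advice to mean:

```r
# Hard to follow: terse names, no hint of intent
f <- function(x, y) sum((x - y)^2) / length(x)

# Easier to follow: the name and comment say what it is and why
# Mean squared error between observations and model predictions
mean_squared_error <- function(observed, predicted) {
  sum((observed - predicted)^2) / length(observed)
}
```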

Combining differential equations and regression

Last week I gave my first lecture of the semester to the SEB113 students. While they tend not to have a particularly strong mathematics background, I got some very positive feedback on how much they enjoyed learning about mathematical modelling. We revised differentiation and what derivatives are, then jumped into formulating differential equations from verbal statements of the assumptions that a model makes.

The bulk of that week’s lecture is spent showing where the non-linear regression models we used in the previous week (first-order compartment, asymptotic, biexponential) come from. To do this we have a chat about exponential growth and decay models as some of the easiest differential equation models to deal with. I show them how we solve the exponential model exactly and then make it clear that I don’t expect them to solve these equations in this subject. We show the solutions to the DE systems and make it very clear that the non-linear regression models are the solutions of differential equations that represent different assumptions.
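The exact solution I walk through is just the standard separable calculation, sketched here in LaTeX rather than reproduced from the slides:

```latex
\frac{\mathrm{d}N}{\mathrm{d}t} = kN
\;\Longrightarrow\;
\int \frac{\mathrm{d}N}{N} = \int k \,\mathrm{d}t
\;\Longrightarrow\;
\ln N = kt + c
\;\Longrightarrow\;
N(t) = N_0 e^{kt},
```

where N_0 is the value at t = 0, k > 0 gives exponential growth and k < 0 gives exponential decay.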

We then move to a section on how we can’t always get a “pen and paper” solution to differential equations, so sometimes we either simplify the system to one we can solve (alluding to perturbation methods) or hand it to a numerical solver (alluding to computational mathematics). Because it’s how I learned about numerical solutions to DEs, I show the students the Lotka-Volterra model and discuss why we can’t solve for X(t) and Y(t) analytically and so have to use numerical methods. For different parameter values we get variations on the same behaviour: cyclic patterns, with prey population growth followed by predator population growth, followed by overconsumption of the prey leading to fewer predators being born to replace the dying. Many students seemed to enjoy investigating this model in the workshops, as it’s quite different to everything we’ve learned so far. The solution is obtained via the deSolve package in R, but we also introduce the students to Euler’s method and discuss numerical instability and the accumulation of numerical error.
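A minimal version of the deSolve approach looks like the sketch below; the parameter values are just ones that give the familiar cycles, not anything we calibrate in class.

```r
# Lotka-Volterra predator-prey model solved numerically with deSolve
library(deSolve)

lotka_volterra <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dX <- alpha * X - beta * X * Y    # prey: growth minus predation
    dY <- delta * X * Y - gamma * Y   # predators: growth from predation minus death
    list(c(dX, dY))
  })
}

parms <- c(alpha = 1.0, beta = 0.1, delta = 0.075, gamma = 1.5)
state <- c(X = 10, Y = 5)
times <- seq(0, 50, by = 0.1)

out <- ode(y = state, times = times, func = lotka_volterra, parms = parms)
matplot(out[, "time"], out[, c("X", "Y")], type = "l", lty = 1,
        col = c("blue", "red"), xlab = "Time", ylab = "Population")
```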

I finish off the lecture with a chat about how regression tends to make assumptions about the form of the mean relationship between variables so that we can do parameter estimation, and how differential equations give us a system we can solve to obtain that mean relationship. I note that while it’s possible to solve the DE numerically while simultaneously estimating its parameters, doing so is well outside the scope of the course.

I had a bit of time this morning to spend on next week’s lecture material (linear algebra), so I decided to have a go at numerical estimation for the logistic growth model with some data based on the Orange tree circumference data set in R, using JAGS/rjags. It’s the first time I’ve tried combining regression and numerical solutions to DEs in the same code, so I’ve only used Euler’s method. That said, I was very happy with the solution, and the code is provided below the cut.

Continue reading
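The post’s own code sits behind the cut; for readers who just want the flavour, here is a rough sketch of my own (not the code from the post) of an Euler-discretised logistic growth model written as a JAGS model string and fitted to made-up data. The data, priors, time step and variable names are all assumptions for illustration.

```r
# Sketch: logistic growth dN/dt = r*N*(1 - N/K), Euler-discretised inside JAGS,
# with noisy observations y[t] of the latent trajectory mu[t]
library(rjags)

# Fake, equally spaced data loosely inspired by tree growth (illustration only)
set.seed(7)
true_mu <- 200 / (1 + 19 * exp(-0.05 * (0:60)))   # logistic curve: K = 200, N0 = 10, r = 0.05
y <- rnorm(length(true_mu), true_mu, 5)

model_string <- "
model {
  mu[1] <- N0
  y[1] ~ dnorm(mu[1], tau)
  for (t in 2:T) {
    mu[t] <- mu[t-1] + delta * r * mu[t-1] * (1 - mu[t-1] / K)   # Euler step
    y[t] ~ dnorm(mu[t], tau)
  }
  r   ~ dunif(0, 2)
  K   ~ dunif(0, 500)
  N0  ~ dunif(0, 100)
  tau ~ dgamma(0.01, 0.01)
}
"

jags_data <- list(y = y, T = length(y), delta = 1)
jm <- jags.model(textConnection(model_string), data = jags_data, n.chains = 3)
samples <- coda.samples(jm, variable.names = c("r", "K", "N0"), n.iter = 10000)
summary(samples)
```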

A few R things

“R: The Good Parts” is an attempt to showcase the best way to do things in R. I’m not yet at the stage of dealing with absolutely massive data sets, but things will be heading that way for me if aerosol samplers continue to measure at higher frequencies. Left out of the article is a discussion of dplyr; I’m still using functions from the apply family! Maybe I should also get used to using data.table. (Update: I’m now using data.table and its syntax to apply functions across grouping levels that I’ve set as keys. This is amazing.)
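For anyone curious what that data.table pattern looks like, a small sketch with made-up column names:

```r
# Grouped summaries with data.table: set the grouping columns as keys
# (which also sorts and indexes the table), then compute within groups via `by`
library(data.table)

dt <- data.table(site = rep(c("A", "B"), each = 50),
                 hour = rep(1:10, 10),
                 conc = rexp(100))
setkey(dt, site, hour)

# Mean and maximum concentration for each site/hour combination
dt[, .(mean_conc = mean(conc), max_conc = max(conc)), by = .(site, hour)]
```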

While we’ve been incorporating a few of the mathematical needs of SEB114 into SEB113, it looks like we may need to go a bit further with incorporating the R needs. I hadn’t really thought about plotting a specific function (other than a line y = ax + b) in the workshops, but it looks like a few earth sciences students need to plot the function πx/(1+x)². So we’ll have to take stock over the next six months of what the experimental science lecturers want to put in their units and how we can help support that (and how we can get the science lecturers to help reinforce statistical modelling over statistical testing).
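Plotting that sort of function is short work once you’ve seen the idiom; a sketch with ggplot2 (or curve() in base R), with the range chosen arbitrarily:

```r
# Plot f(x) = pi * x / (1 + x)^2 over a chosen range
library(ggplot2)

f <- function(x) pi * x / (1 + x)^2

ggplot(data.frame(x = c(0, 10)), aes(x)) +
  stat_function(fun = f) +
  xlab("x") +
  ylab(expression(pi * x / (1 + x)^2))

# or, in base R:
curve(pi * x / (1 + x)^2, from = 0, to = 10)
```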