Paper helicopters

There is no textbook for SEB113 – Quantitative Methods in Science.

It’s not that we haven’t bothered to prescribe one; it’s that no book seemed to take the approach we decided upon two years ago when planning for the unit started. There are books on statistics for chemistry, statistics for ecology, statistics for physics, statistics for mathematics, etc., but trying to find a general “statistics for science” book that focuses on modelling rather than testing has been difficult.

That said, there are some amazing resources out there if you know where to look, not just for learning statistics but for teaching it. One of the most useful we’ve come across is “Teaching Statistics” by Andrew Gelman and Deborah Nolan. The book is full of advice on things like groupwork, topic order and the structure of learning activities, but my favourite thing so far is the paper helicopter experiment.

Running Bayesian models

I came across a post via r/Bayes about different ways to run Bayesian hierarchical linear models in R, a topic I talked about recently at a two-day workshop on using R for epidemiology. Rasmus Bååth’s post details the use of JAGS with rjags, Stan with rstan, and LaplacesDemon.

JAGS (well, rjags) has been the staple for most of my hierarchical linear modelling needs over the last few years. It runs within R easily, is written in C++ (so is relatively fast), spits out something that the coda package can work with quite easily and, above all, makes it very easy to specify models and priors. Using JAGS means never having to derive a Gibbs sampler or write out a Metropolis-Hastings algorithm that requires you to really think about jumping rules. It’s Bayesian statistics for those who don’t have the time/inclination to do it “properly”. It has a few drawbacks, though: you can’t specify improper priors with distributions like dflat() (though this could be seen as a feature rather than a bug), and defining a Conditional Autoregressive prior requires specifying it as a multivariate Gaussian. That said, it’s far quicker than using OpenBUGS, and JAGS installs fine on any platform.
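To make the “easy to specify models and priors” point concrete, here’s a minimal sketch of fitting a hierarchical linear model with rjags. The model, data and variable names are all invented for illustration; note the vague-but-proper priors, since JAGS won’t accept improper ones like dflat().

```r
library(rjags)

# Hierarchical linear model: group-specific intercepts, common slope
model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(alpha[group[i]] + beta * x[i], tau)
  }
  for (j in 1:J) {
    alpha[j] ~ dnorm(mu_alpha, tau_alpha)  # group-level intercepts
  }
  beta      ~ dnorm(0, 1e-3)    # vague but proper priors --
  mu_alpha  ~ dnorm(0, 1e-3)    # no dflat() in JAGS
  tau       ~ dgamma(1e-3, 1e-3)
  tau_alpha ~ dgamma(1e-3, 1e-3)
}
"

# Fake data, purely so the example runs
set.seed(1)
N <- 100; J <- 5
group <- sample(1:J, N, replace = TRUE)
x <- rnorm(N)
y <- 2 + rnorm(J)[group] + 0.5 * x + rnorm(N)

jm <- jags.model(textConnection(model_string),
                 data = list(y = y, x = x, group = group, N = N, J = J),
                 n.chains = 3)
update(jm, 1000)  # burn-in
post <- coda.samples(jm, c("beta", "mu_alpha"), n.iter = 5000)
summary(post)     # a coda object, so coda's tools just work
```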

Posterior samples

Interested in collaborative use of R, MATLAB, etc. for analysis and visualisation within a webpage? Combining plotly and IPython can help you with that.

Cosmopolitan (yes, that Cosmopolitan) has a great article interviewing Emily Graslie, Chief Curiosity Officer at the Field Museum in Chicago. She discusses being an artist and making the transition into science, science education and YouTube stardom.

A few of the PhD students in my lab have asked if I could run an introduction to R session. I’d mentioned the CAR workshop that I’d be doing, but the cost had put them off. Luckily, there are alternatives like DataCamp, Coursera and Lynda. Coursera’s next round of “Data Science”, delivered by Johns Hopkins University, starts next Monday (Course 1 – R Programming). So get in there and learn some R! I’m considering recommending some of these Coursera courses to my current SEB113 students who want to go a bit further with R, but the approach taken in these online modules is quite different from what we do in SEB113 and I don’t want students to confuse themselves.


Learning different programming languages

One of the biggest changes I noticed moving from doing statistics in Minitab (in first-year data analysis) to doing statistics in R (in third-year statistical inference) was that R encourages you to write functions. Normally this is done by writing functions in R’s own language (which call other functions, also written in R, that eventually call functions written in C), but it’s also possible to make use of other languages to do the heavy lifting. This isn’t unique to R, of course; MATLAB encourages the use of MEX files to improve run-times when you need to call the same custom function over and over again.

I’ve really only used high-level languages to do my statistics, making use of other people’s optimised code that does the things I want. I’ve seen the development of pyMCMC by researchers at QUT, and someone from my NPBayes reading group made quite heavy use of Rcpp in his thesis. Python and C++ are probably the two languages that would be the most useful to learn, given their ubiquity and reputation. I have been putting off learning them for years, as I know there’s a large time investment required to become proficient in programming and no external pressure to learn (unlike learning R as part of my PhD thesis work).
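For the curious, here’s a minimal sketch of what the Rcpp pattern looks like. The function name and body are invented for illustration, but this is the general idea of handing a hot loop over to compiled C++:

```r
library(Rcpp)

# Compile a small C++ function and expose it to R in one call
cppFunction('
double sumsqC(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];   // the loop runs in compiled C++, not in R
  }
  return total;
}
')

sumsqC(c(1, 2, 3))  # 14, same result as sum(x^2) but computed in C++
```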

There’s no doubt that writing optimised code is a desirable thing to do, and that knowing more than one programming language (and how to use them together) gives you a much richer toolbox for numerically solving problems. I’m now at a point, though, where it looks like I may need to bite the bullet and pick up C++. JAGS, which I use through rjags in R, is a stable, fast platform for MCMC-based inference. It’s written in C++ and notifies you every time you load it in R that it has loaded the basemod and bugs modules. There are additional modules available (check in \JAGS\JAGS-3.4.0\x64\modules\) and it’s possible to write your own, as long as you know C++.
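As a small aside, those extra modules can be loaded from within R via rjags’ own functions; a quick sketch:

```r
library(rjags)      # reports: Loaded modules: basemod,bugs
list.modules()      # what's loaded right now
load.module("glm")  # one of the optional modules shipped with JAGS
list.modules()      # "glm" now appears in the list
```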

I’m at a point with the work I’ve been doing on estimating personal dose of ultrafine particles where I’d like to make the modelling more Bayesian, which includes figuring out a way to include the deposition model in the MCMC scheme (as I’d like to put a prior on the shape parameter of the reconstructed size distribution). My options seem to be either writing a JAGS module that will allow me to call a C++ified version of the function, or abandoning JAGS and writing a Gibbs sampler (or Metropolis-Hastings, but Gibbs will likely be quicker given the simplicity of the model I’m interested in). Either solution will stretch me as a programmer and probably give me a better understanding of the problem. Eubank and Kupresanin’s “Statistical Computing in C++ and R” is staring at me from the shelf above my desk.
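Hand-rolling a Gibbs sampler is less scary than it sounds when the full conditionals are conjugate. Here’s a minimal sketch in R for a toy normal model (not the deposition model, whose details I haven’t worked through here):

```r
# Toy model: y_i ~ N(mu, 1/tau), with mu ~ N(m0, 1/t0) and tau ~ Gamma(a, b)
set.seed(42)
y <- rnorm(50, mean = 3, sd = 2)
n <- length(y)
m0 <- 0; t0 <- 0.01; a <- 0.01; b <- 0.01  # vague hyperparameters

n_iter <- 5000
mu  <- numeric(n_iter)
tau <- numeric(n_iter)
mu[1]  <- mean(y)
tau[1] <- 1 / var(y)

for (k in 2:n_iter) {
  # Full conditional for mu is normal
  prec  <- t0 + n * tau[k - 1]
  mu[k] <- rnorm(1, (t0 * m0 + tau[k - 1] * sum(y)) / prec, sqrt(1 / prec))
  # Full conditional for tau is gamma
  tau[k] <- rgamma(1, a + n / 2, b + 0.5 * sum((y - mu[k])^2))
}

mean(mu[-(1:1000)])  # posterior mean of mu after discarding burn-in
```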

Posterior samples

ARC Discovery Project proposals have been returned to their authors, and we are putting our responses together for the rejoinders. Interesting to see that we got a comment suggesting we use the less restrictive CC BY instead of the CC BY-NC-SA we’d proposed. We weren’t successful in our Linkage Project applications, which is disappointing as they were interesting projects (well, we thought so). Continuing to bring research funding in is an ongoing struggle for all research groups, and I feel it’s only going to get harder, as the new federal government’s research priorities appear to be more aligned with medical science that delivers treatments than with our group’s traditional strengths.

SEB113 is pretty much over for the semester, with marks entered for almost every student. Overall I think the students did fairly well. We had some issues with the timetable this semester: ideally, we’d like the lecture, then all of the computer labs, then all of the workshops, so that we can introduce a statistical idea, show the code, and then apply the idea and code in a group setting. Next semester we have the lecture followed immediately by the workshops, with the computer labs dotted throughout the remainder of the week. This has provided us with an opportunity to try some semi-flipped classroom ideas, where students are able (and expected) to do the computer lab at home at their own pace rather than watch a tutor explain it one line at a time at the front of a computer lab.

I’m teaching part of a two-day course on the use of R in air pollution epidemiology. My part will introduce Bayesian statistics: a brief overview, a discussion of prior distributions as a means of encoding a priori beliefs about model parameters, and the use of Bayesian hierarchical modelling (as opposed to more traditional ANOVA techniques) as a way of making the most of the data that’s been collected. The other two presenters are Dr Peter Baker and Dr Yuming Guo. The course is being run by the CAR-CRE, which partially funds my postdoctoral fellowship.

I had meant to post this back when they were doing the rounds, but there’s a bunch of plots that attempt to show that correlation isn’t causation and that spurious correlations abound in large data sets. Tom Christie has responded by pointing out that correlation in time series isn’t as simple as in the case of independent, identically distributed data. One should be careful that one’s criticism of bad statistics is itself founded on good statistics.
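Christie’s point is easy to see in a couple of lines of R: two completely independent random walks will routinely show correlations that iid noise almost never produces.

```r
set.seed(123)
n <- 200
iid_cor  <- replicate(1000, cor(rnorm(n), rnorm(n)))                 # iid noise
walk_cor <- replicate(1000, cor(cumsum(rnorm(n)), cumsum(rnorm(n)))) # random walks

summary(abs(iid_cor))   # hugs zero, as intuition suggests
summary(abs(walk_cor))  # often large, despite true independence
```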

Posterior samples

A rough guide to spotting bad science.

Why big data is in trouble: they forgot about applied statistics. Big data analytics are all well and good, but you have to keep in mind that there are statistical properties that govern which inferences are valid.

While I’m comfortable giving a lecture, I really struggled to sit through them as an undergraduate. It turns out lectures may not be the most effective way to get information to students.

My supervisor, Professor Lidia Morawska, is giving a public talk (free to register) at QUT soon: “Air Quality Reports On Our Mobiles – Do We Care?”, 6 June 2014.

The ongoing crusade against Excel-based analysis

One of the things I catch myself saying quite often in SEB113 is “This is new. It’s hard. But remember, you weren’t born knowing how to walk. You learned it”, as my way of saying that it’s okay to not understand this straight away, it takes time, practice and determination. I often say this in response to students complaining about learning R to do their data analysis. It’s actually got to the point where the unit co-ordinator suggested I get a t-shirt printed with “You weren’t born knowing how to walk” on the front and “So learn R” on the back.

One of the reasons I’m so keen to push new students into learning R is that while Excel can do some of the simpler calculations required in the first year of a science degree, it is often completely inadequate for doing data analysis as a professional scientist, or even in an advanced-level university course. I actually saw a senior researcher in a three-day Bayesian statistics course try to avoid using R to code a Gibbs sampler by getting it up and running in Excel. They managed it, but it took minutes to run what the rest of us could compute in a second (and it was for a trivially simple problem).

There are problems with Excel itself, such as its historical inability to deal with the standard deviation of a group of very large numbers, because older versions used the numerically unstable one-pass formula, which suffers catastrophic cancellation. Apparently the secret to sane use of Excel is to only use it for data storage. This guiding principle has meant that I no longer manipulate my data in Excel. Even with time stamp information I’ll fire up the lubridate package to convert from one format to another. I’m slowly exploring the Hadleyverse, and that sort of approach is filtering through into SEB113, where we’re teaching the use of ggplot2 and reshape2 within RStudio. These are all powerful tools that simplify data analysis and avoid the hackish feel that much Excel-based analysis has, where pivot tables are a thing and graphs are made by clicking and dragging a selection tool down the data (which can lead to some nasty errors).
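The cancellation problem is easy to demonstrate in R by computing the one-pass (“calculator”) formula by hand and comparing it with R’s built-in, numerically stable sd():

```r
x <- 1e9 + c(1, 2, 3)  # three very large numbers, true sd = 1

# One-pass "calculator" formula: sum of squares minus n * mean^2
naive_sd <- sqrt((sum(x^2) - length(x) * mean(x)^2) / (length(x) - 1))
naive_sd  # garbage (possibly NaN): the subtraction cancels catastrophically

sd(x)     # 1, computed with a stable two-pass algorithm
```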

The fact that these powerful tools are free is another reason to choose R over Excel. I’m not on the “open source software and provision of all code is mandatory” bandwagon that others seem to ride when it comes to making analysis replicable. I agree it’s a worthwhile goal, but it’s not a priority for me. That said, I definitely support encouraging the use of free software (in both senses) in education, on the grounds of equity of access.

I had a chat with some students in SEB113 yesterday about why we’re teaching everything in R, given that the SEB114 staff use a combination of Excel and MATLAB (and maybe other packages I don’t know about). If we were to teach analysis the way the SEB114 lecturers do it themselves, we’d have to teach multiple packages to multiple disciplines. Even setting aside the fact that everything we teach is implemented in R, the fact that R is free (unlike Excel and MATLAB), cross-platform (Excel on Linux? Try OpenOffice/LibreOffice) and extensible (MATLAB has toolboxes, Excel has add-ins, R has a proper package manager) was a big plus for students, who said that being able to work on assignments at home was valuable, and that having to pay for software would make study difficult.

Convincing students to use R can be difficult, especially if they have no programming background, but ultimately they seem to accept that R is powerful, can do more than Excel, and that writing reusable code makes future analysis easier. Convincing SEB114 academics that teaching their students to use R is a good idea is probably a harder sell, given that they’ve got years of experience with other tools. It’s still only semester 3 of the new Bachelor of Science course, so we’ll have to see how this plays out over the years to come.