I am the only statistician in my aerosol science research group; everyone else is a physicist, chemist, environmental scientist, engineer or some other kind of physical scientist. Most of the people in the group do their data analysis and/or plotting in Excel or in specialised software like IGOR or Origin. One or two people in the group have used MATLAB in the past, but there’s a switching cost at work: it’s easier to use the Data Analysis plug-in in Excel than to load up MATLAB and remember how to fit a GLM or perform ANOVA.

My background in computational mathematics, with a bit of statistics, means I’ve been exposed to other mathematics and statistics software, and through my stats research group I’ve grown accustomed to seeing novel statistical methods and data analysis done in appropriate software (R, MATLAB, WinBUGS, and a Python package called pyMCMC that some of my colleagues are working on). As such, it saddens me to see my fellow PhD students doing their stats and plotting in Excel (which has certainly come along in leaps and bounds but is still not a good piece of statistical software).

Over the last year or so I’ve really been pushing R as a way of doing better data analysis and making better-looking graphs (which can easily be exported as PDF or EPS files). I have been heartened to see that a) it’s recognised as a good tool and b) some people are using the scripts I have written for them to do things like randomly sample from a list of strings. One of the other PhD students has found an R package, openair, that will make our lives much easier, and has even started investigating how to use R to do repeated calculations of summary statistics from SMPS data rather than writing the formula in a spreadsheet cell in Excel and choosing “fill down”. The risk here, of course, is that we will be asked to write all the statistics code for everyone else.

In addition to this, one of the PhD students in the stats research group decided to run a short course on using R (with RStudio, which I’m now attempting to use instead of Notepad++ and NppToR) to manipulate data and do some simple plotting in ggplot2, and most of the (eight) students in my room went along. They each got different things out of it, but all agreed that using R is the way to go. Some still find data analysis more advanced than t-tests, ANOVA and linear regression a bit daunting but will soldier on with the short courses that are being organised. There’s a bit of a feeling among the aerosol science PhD students that I should teach them data analysis in R, as I’m familiar both with what they’re working on (and thus have some idea of what they need to learn) and with them as people (so I have an idea of the level of statistics and mathematics they’ve got).

Now’s probably not the time for me to start teaching (I have a PhD to finish), but it looks like I might be around for a little while, so if I can help build the capacity of the group then I’d be more than happy to show everyone how to do their favourite data analysis in R and perhaps extend them a little further as scientists. While everyone brings their own strengths to the group in terms of skill sets and bodies of knowledge, I believe that every scientist should be able to do their own data analysis, and that they should be able to do it in statistical software rather than in something which was developed for generating accounting reports and later had scripting and statistical routines shoehorned onto it (and which, while widely available, is neither free of charge nor “free” in the open-source sense).