An article in The Conversation made the rounds in our office yesterday and prompted a discussion with one of my colleagues about the role of statistics in scientific papers. The article itself talks about Australians who received an award from Thomson Reuters on the basis of publishing frequently cited work, and discusses the different cultures in science and the humanities when it comes to publication and citation.

The basic argument is that the awards are skewed towards the sciences, where publication is traditionally by journal article and citations in other journal articles are picked up more readily, and away from the humanities, where monographs and books are the traditional means of disseminating information. One of the academics interviewed makes the point that science is much more collaborative, with researchers working in teams to achieve the goals of a project.

One of the email discussants in our group offered the following:

Graham Farquhar from the Australian National University is Australia’s citation laureate. Graham’s belief is this: start with a really good hypothesis driven question that no-one has answered and answer it. Deliver the answer with robust science and, voila, you have highly cited work.

Coming up with the questions is hard though. Thinking seven impossible questions before breakfast is probably a daily norm in Graham’s existence. Doing this, and maintaining a normal personality is difficult though, but he manages quite well.

A colleague talked to me about Farquhar’s idea in relation to one of his own papers, which he thought did well at answering an unanswered question. The aim of the paper was to offer an explanation for a physical phenomenon that is known to occur in aerosols but whose cause is not well understood.

We then got into an argument about the robustness of the science. The disagreement was not about the science that had been done (the paper was a synthesis of previous work by other authors) but about the fact that the evidence hadn’t all been brought together in a single statistical model. Scientists love *t* tests. They are simple tests that tell you whether there’s a difference between the means of two groups; they are appropriate in some, but not all, instances and are a good first step for exploratory data analysis.
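To make the *t* test concrete, here is a minimal sketch of Welch’s version of the test, which does not assume the two groups have equal variances. It is written in Python using only the standard library, purely so it runs without any packages. The function name `welch_t` and the measurements are made up for illustration and have nothing to do with my colleague’s paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the difference
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical measurements for two groups (illustrative values only)
group_a = [4.1, 3.9, 4.4, 4.0, 4.3]
group_b = [3.6, 3.8, 3.5, 3.9, 3.7]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f} on about {df:.1f} degrees of freedom")
```

The statistic only says whether two group means differ by more than their sampling noise would suggest. It says nothing about why, which is where a model comes in.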

I was making the point that a Bayesian meta-analysis would have been far more appropriate, as it’s a technique specifically designed to draw multiple sources of information together in a single model to provide a better estimate of an effect size. The Bayesian approach would also have helped here because a review of the literature could be used to determine priors, so that even if no new data were collected for this paper, inferences could still be drawn from the current state of knowledge.
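As a rough sketch of the idea, the simplest Bayesian meta-analysis I can think of treats every study as estimating one common effect with a known standard error and puts a normal prior on that effect. The posterior is then also normal, and its mean is a precision-weighted average of the prior and the study estimates. The function name `pooled_posterior`, the prior and the study values below are all hypothetical. A real analysis would usually use a random-effects model that allows for between-study variation, fitted with proper software rather than by hand.

```python
from math import sqrt


def pooled_posterior(estimates, std_errors, prior_mean=0.0, prior_sd=10.0):
    """Posterior mean and sd of a common effect under a normal prior.

    Assumes each study reports an estimate with a known standard error
    and that all studies measure the same underlying effect.
    """
    precision = 1.0 / prior_sd ** 2          # start from the prior precision
    weighted_sum = prior_mean * precision
    for y, se in zip(estimates, std_errors):
        w = 1.0 / se ** 2                    # each study is weighted by its precision
        precision += w
        weighted_sum += w * y
    return weighted_sum / precision, sqrt(1.0 / precision)


# Hypothetical effect estimates and standard errors from three studies
estimates = [0.42, 0.55, 0.31]
std_errors = [0.10, 0.20, 0.15]

post_mean, post_sd = pooled_posterior(estimates, std_errors,
                                      prior_mean=0.0, prior_sd=1.0)
print(f"posterior mean = {post_mean:.3f}, posterior sd = {post_sd:.3f}")
```

With no studies at all the function simply returns the prior, which is the point about priors from the literature: the current state of knowledge still supports an inference, and each new result tightens it.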

I think every research group should hire a statistician and have them retrain the researchers in how to use statistics, including GLMs/GAMs, spatial analysis, time series methods and Bayesian inference, in order to build capacity within the group. The statistician can then work with the scientists to ensure that new papers include the best possible analysis of the results, and can also review old papers to see if there’s any low-hanging fruit in the form of interesting experiments or observations that could be re-analysed with something more than ANOVA or linear regression.

Statistics isn’t just for statisticians; we need to get away from the idea that doing more than the bare minimum for a paper is going too far. Apparently QUT is reviewing the way it teaches statistics to science students. This is long overdue: my recollection is that MAB101 is an awful unit full of statistical techniques relevant to agricultural trials. I’m hopeful that we can teach students statistical techniques that are relevant to them, and do so in an exciting way. I hope we don’t swing too far and give them only the exact tools they need and nothing else, because then we’re limiting graduates.

If you’ve got the brain power to deal with atmospheric chemistry and/or physics, you should be able to handle statistics. Not everyone goes through to do advanced level statistical units but it shouldn’t be too much of a jump from collecting experimental data to analysing it in R with a package appropriate to your field.

Edited to add: The point I’m trying to make here is that all researchers should have a basic understanding of exploratory/descriptive data analysis, simple GLMs (even if it’s just using glm() in R) and the ability to communicate the physical results in terms of the statistical modelling. I don’t expect that all scientists will go and become proficient in the use of Bayesian non-parametrics, but scientists should be able to start with some scatter plots, box plots and ANOVA to look for differences and then use regression to explain those differences.
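For what that basic workflow looks like in numbers, here is a small sketch in Python (standard library only; in R you would more likely reach for `aov()`, `lm()` or `glm()`). It computes a one-way ANOVA F statistic to look for differences between groups and then fits a straight line by least squares, which is what a Gaussian GLM with an identity link amounts to. The function names and all the data are invented for illustration.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA across a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical responses measured under three conditions
groups = [[2.1, 2.4, 2.2], [2.9, 3.1, 3.0], [3.8, 4.1, 3.9]]
print(f"F = {one_way_anova_f(groups):.1f}")

# Hypothetical covariate that might explain the group differences
x = [10, 10, 10, 20, 20, 20, 30, 30, 30]
y = [v for g in groups for v in g]
intercept, slope = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.3f} * x")
```

The ANOVA tells you the groups differ; the regression gives you a slope you can interpret physically, which is the part that belongs in the paper.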