This morning I gave a speech as part of the plenary session “Standing on the shoulders of giants”. I was joined on stage by Mengyan Gong from Tsinghua University and Joana Madureira from the University of Porto, and the three of us were to talk about our position as students within the field of air quality and how we see ourselves contributing. After the three of us gave our ten-minute talks, Professor Charles Weschler gave a talk about what we currently know about biochemistry and microbiology *vis-à-vis* the role of chemicals and microbes in indoor air and health. I think it was a great session and I’d like to thank Lidia Morawska for coming up with the idea and Martin Betts and William Nazaroff for chairing the session.

I spoke about the need for better statistics in science and how it was up to supervisors to encourage their students to look at novel approaches for data analysis and for students to be curious, creative and willing to learn in order to ensure that the field does not grow stagnant. I have uploaded the slides (which are quite spare) and a transcript of what I planned to say is available below (perhaps I should try to get this published in Indoor Air as a letter). I did vary from the script in a few parts and I will post the video online when it’s available. I will try to include links in the transcript to references so people can find out about these techniques that I get so excited about.

**“Quantitative questions, quality answers”**

A talk for the “Standing on the shoulders of giants” session

If I, as a student, am to further the field of indoor air and built environmental health it will be by standing on the shoulders of giants. The scientists, engineers and others who have gone before me in this field have developed a rich body of ideas and questions that have gone unanswered as a result of there being other work that they must do. But as one moves up the academic ladder, writing grant applications and supervising postgraduate students, the chance to investigate some of these questions in depth arises. A team of PhD students with a diverse skill set can be assembled to assist the other academics in the team and to help drive the research output of the group. This has happened with my group’s UPTECH project, which is the topic of a session later today.

According to Graham Farquhar of the Australian National University, Canberra, one of Australia’s most cited academics, the key to producing high-quality science is to start with a really good hypothesis-driven question that no one has answered and answer it. That seems fairly straightforward? In the case of UPTECH, the unanswered question is “What is the effect of exposure to airborne nano and ultrafine particles emitted from motor vehicles on the health of children in schools?”. Data collected in this project include two weeks of indoor and outdoor aerosol and meteorology measurements at each of 25 primary schools. Indoor microbiology measurements are taken in one or two classrooms at each school and health diagnostic tests are performed on the students in these classes, accompanied by a take-home survey which includes questions about family health history, demographics and housing characteristics. This project generates a huge amount of data that has the potential to reveal some very interesting relationships. But to do so requires statistics beyond ANOVA and linear regression. The same can really be said of any modern scientific project.

Don’t get me wrong, ANOVA is a great tool for exploratory data analysis [1] and for testing whether a term in a regression model is zero or not. But to stop the analysis at descriptive statistics and testing for equal means across groups is to cheat yourself out of the opportunity to examine why these differences arise and really get to know what your data is telling you. ANOVA can be replaced with a Generalised Linear Model (GLM) with factor terms. Any measured covariates can then be included in this GLM rather than just calculating the correlation between the covariate and the response. If the effect is suspected to be non-linear, there is a range of regression models which are commonly used in the statistical and computer sciences but have not found their way into the natural sciences, and I will talk about these soon. The development of new statistical techniques and the ubiquity of computers in the workplace mean that there is really no excuse for using statistical techniques that were developed for agricultural field trials and are limited in their ability to explain variation in a data set.
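To make the point about factor terms concrete, here is a minimal sketch (not part of the talk itself) of a Gaussian linear model, the simplest member of the GLM family, fitted by ordinary least squares with one factor and one covariate. The data, the variable names and the helper functions `solve` and `fit_linear_model` are all made up for illustration; in practice you would reach for a proper statistics package rather than hand-rolled linear algebra.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def fit_linear_model(groups, covariate, response, baseline):
    """Least-squares fit of response ~ factor(groups) + covariate.

    The factor is dummy coded against `baseline`, so each group coefficient
    is that group's offset from the baseline, adjusted for the covariate.
    """
    levels = sorted(set(groups) - {baseline})
    design = [
        [1.0] + [1.0 if g == lvl else 0.0 for lvl in levels] + [float(x)]
        for g, x in zip(groups, covariate)
    ]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, response)) for i in range(p)]
    names = ["intercept"] + [f"group[{lvl}]" for lvl in levels] + ["covariate"]
    return dict(zip(names, solve(xtx, xty)))


# Hypothetical measurements: location is the factor, humidity the covariate.
location = ["indoor", "indoor", "indoor", "outdoor", "outdoor", "outdoor"]
humidity = [40, 50, 60, 45, 55, 65]
concentration = [18.0, 20.0, 22.0, 24.0, 26.0, 28.0]

coefficients = fit_linear_model(location, humidity, concentration, baseline="indoor")
for name, value in coefficients.items():
    print(f"{name}: {value:.3f}")
```

A one-way ANOVA on these made-up numbers would only say that the two group means differ. The model separates the group offset from the humidity slope in a single fit, which is the extra information the paragraph above is arguing for.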

Expecting senior academics and industry practitioners to maintain statistical education throughout their careers is a bit of a tall ask in some cases. There’s so much work to be done keeping up with the science that the statistics often falls by the wayside. In my mind, the role of the supervisor is to present a problem and then direct the creativity and curiosity of the student. The role of the student is to answer the research question in a paper which weaves together the experience of the supervisor with high quality research and statistical modelling appropriate to the data and hypothesis.

And there’s much more to data analysis than doing ANOVA. If you suspect that there might be homogeneous groups within a set of observations, why not try a clustering algorithm like k-means [2] or a finite mixture model [3]? Don’t know how many groups there are? Try an infinite mixture model [4]. Think that certain covariates might have a different effect within those groups? A Dirichlet Process Mixture of GLMs is an option [5]. The Indian Buffet Process will help you identify common patterns across a bunch of correlated covariates and reduce the dimension of your data [6]. If you suspect the effect of humidity on particle number concentration is non-linear but aren’t sure about what it’s going to look like, you could try a spline model [7,8]. For a smooth spatial relationship across a network of monitors you can use a Gaussian predictive process with a Matérn class covariance function [9] or go all out and use a Gaussian process with non-parametric covariance [10]. Want to combine the effect sizes from some previous studies of the same thing to estimate an overall effect? Use Bayesian meta-analysis rather than a weighted average [11]. These are all common approaches in the statistical community and I have seen them applied to scientific problems, many related to air quality or health.
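As a taste of the simplest technique on that list, here is a rough sketch of k-means on one-dimensional data, using the usual Lloyd-style iteration of assigning each point to its nearest centre and then moving each centre to the mean of its points. The function name `k_means_1d`, the readings and the starting centres are all invented for illustration, and the number of groups is assumed to be known in advance, which is exactly the assumption the mixture models above let you relax.

```python
def k_means_1d(values, centres, max_iter=100):
    """Cluster 1-D values by alternating assignment and centre updates.

    `centres` holds the starting guesses; the number of clusters is fixed
    by its length. Returns the final centres and a label for each value.
    """
    centres = list(centres)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster with the nearest centre.
        labels = [
            min(range(len(centres)), key=lambda k: (v - centres[k]) ** 2)
            for v in values
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for k in range(len(centres)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            # An empty cluster keeps its previous centre.
            new_centres.append(sum(members) / len(members) if members else centres[k])
        if new_centres == centres:
            break  # converged: a further pass would change nothing
        centres = new_centres
    return centres, labels


# Hypothetical readings that appear to fall into a low and a high group.
readings = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centres, labels = k_means_1d(readings, centres=[0.0, 6.0])
print("centres:", [round(c, 3) for c in centres])
print("labels:", labels)
```

The result depends on the starting centres and on the choice of two clusters, so in real work it is worth trying several starts and comparing against a model-based alternative.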

These techniques are much newer than ANOVA and their development in statistics and computer science means that professional scientists may not ever be exposed to them. So it’s up to students to be aware of new techniques which are applicable to their research.

But is it too much to expect that all students are well versed in such a range of statistical techniques? Probably. Especially when you consider that a lot of these things aren’t taught in undergraduate science degrees. Science graduates are the obvious choice when recruiting science PhD students or industry practitioners. But science and engineering are strengthened by solid statistics. And the statistics are solid when a group’s capacity for statistics is solid, whether by employing a statistician directly, choosing candidates with a strong statistical education, building links with a statistics research group at a university or providing for the ongoing statistical training of early career researchers and/or students. Our science must not just demonstrate that something is happening but attempt to understand why that effect occurs. We must quantify it and how and why it varies. Highly influential work, such as the work presented by our keynote speakers, arises when appropriate statistics raises high-quality experimental science to where it belongs. The reader has before them a clear picture of what is happening and why.

I see my role in my group, and through it, my role in our broad field, as fostering statistical creativity and curiosity. I am there to help provide tools which the people around me can use to solve these unanswered problems. Encouraging people to step outside the “ANOVA and linear regression in Excel” frame of mind has motivated them to ask questions about how best to fit some data which shows a non-linear effect, how to write code to process output from our instruments that will calculate summary statistics and generate plots, how to look at trends in time series data, and so on. In return, I’ve been given the opportunity to work on some really interesting air quality problems with some people who really know their science. So I pick up some more knowledge about aerosols and health, they get exposed to new ways of analysing data, and we get to present our interesting results to the world with robust and novel analysis.

[1] Andrew Gelman. Analysis of variance – why it is more important than ever. Annals of Statistics, 33:1–53, 2005.

[2] Lloyd, S. P. Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2): 129–137, 1982.

[3] Darren Wraith, Clair Alston, Kerrie Mengersen, Tareq Hussein. Bayesian mixture model estimation of aerosol particle size distributions. Environmetrics 22 (1)

[4] Brian Kulis, Michael I. Jordan. Revisiting k-means: New Algorithms via Bayesian Nonparametrics. ICML 2012.

[5] Lauren A. Hannah, David M. Blei, Warren B. Powell. Dirichlet Process Mixtures of Generalized Linear Models. Journal of Machine Learning Research 1: 1-33, 2011.

[6] Thomas L. Griffiths, Zoubin Ghahramani. Infinite Latent Feature Models and the Indian Buffet Process. Advances in Neural Information Processing Systems 18, 2005.

[7] S. Clifford, S. Low Choy, T. Hussein, K. Mengersen, L. Morawska, Using the Generalised Additive Model to model the particle number count of ultrafine particles, Atmospheric Environment 45 (32): 5934-5945, 2011.

[8] S. Clifford, B. Mølgaard, S. Low Choy, J. Corander, K. Hämeri, K. Mengersen, L. Morawska, Bayesian semi-parametric forecasting of particle number concentration: penalised splines and autoregressive errors, in prep, 2012. arXiv

[9] Banerjee, S.; Gelfand, A. E.; Finley, A. & Sang, H. Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society B 70: 825-848, 2008.

[10] Emily Fox and David Dunson. Bayesian Nonparametric Covariance Regression, 2011.

[11] Blangiardo, M.; Hansell, A. & Richardson, S. A Bayesian model of time activity data to investigate health effect of air pollution in time series studies. Atmospheric Environment, 45: 379 – 386, 2011.