Where to start if you’re going to revise statistics

One of the academics in my lab has decided it’s time to refresh their statistical knowledge. I’d like to think my plenary speech spurred this but, given that they didn’t see it, I doubt it. Either way, I think this is great, because they got their PhD a long time ago and have probably been using the same statistical methods for at least the last ten years.

The book they’ve decided to use is Schaum’s Outline of Statistics. I’ve used books from this range before to revise linear algebra, differential equations, etc., and have even taught small courses (privately) based on the topics they cover, in the order they cover them. A flick through the book confirmed that it is full of frequentist testing and similar methods, the kind that I spent my talk describing as a good start, but not the end, of the analysis when writing a scientific paper.

The book’s online summary says it covers the use of MINITAB. I’m glad to see that it discusses software other than Excel for performing analysis, but I recommend people use books in Springer’s “Use R!” range because R is free (in terms of both speech and beer) and is much more flexible than MINITAB, both in how it can be programmed and in the types of analysis it can run. MINITAB is based on a point-and-click GUI, which makes it great for a first-year statistics class where students may not be familiar with programming; R is driven by the command line and is more easily scripted. Learning to use R means giving yourself the opportunity to use the many packages that extend its functionality.

The books I’ve used in the Use R! range include A Beginner’s Guide to R and Bayesian Computation with R. While I’d definitely recommend these Use R! books, as they’re aimed at people wanting to use R to do better analysis, there are a few others that I’ve found incredibly useful. It’s important for me to point out that my background differs from my colleague’s so they may not find the books as relevant or accessible.

Gelman et al. – Bayesian Data Analysis. Around my stats group, this book is called “The Bible”. It’s probably the best textbook I’ve come across. It’s full of information, tutorials, and detailed descriptions of the theory and methods and how they can be applied. This is certainly a graduate level statistics textbook, though, and it assumes calculus at what I’d say is probably a second year mathematics degree level. You may find this book difficult if you don’t feel comfortable with multiplying probability densities together and integrating the result (which is really what Bayesian analysis is). I sent my colleague a link to Gelman’s Annals of Applied Statistics article on ANOVA. Might as well plug Gelman’s blog while I’m talking about him.

Woodworth – Biostatistics: a Bayesian introduction. I found this very useful in giving a more applied approach to introductory Bayesian statistics. There’s a good review of the book here and I agree with the reviewers about the importance of the preface, which talks about the philosophy of statistics in science and discusses the differences between frequentist and Bayesian statistics. The book walks the reader through a lot of the topics that frequentist statistics deals with, but in a Bayesian setting. I find this sort of comparison very useful (and appreciate when Mike Jordan says that a lot of machine learning techniques are just Bayesian statistics with a different name) as most people who have taken a statistics class will have seen linear modelling, ANOVA and a little about statistical design. The book also introduces the use of WinBUGS as a tool for Bayesian modelling.

As an aside, I attended an introductory course run by my supervisor, Kerrie Mengersen, where she was teaching us how to use R to write a Gibbs sampler for a very simple problem and how to do it in WinBUGS as well. One of the other attendees, the leader of a medical science research group, had it in their head that they would use Excel to write the Gibbs sampler because it provides nice reports (summary stats, plots, etc.) through a plugin they had. Comparing the time it took WinBUGS and R to run the code against Excel’s run time was probably what convinced me that Excel was one of the worst pieces of software that one could use for statistics. Great for spreadsheets, awful for statistics.

A non-technical book which does a good job extending the philosophical discussion to the history of Bayesian statistics and its use in solving some very complex problems is Sharon Bertsch McGrayne’s The Theory That Would Not Die (which I like to think of as the “A Brief History of Time” of Bayesian statistics). It’s very readable and really drives home the importance of Bayesian statistics and the profundity of Bayes and Laplace in developing this approach.

I really don’t think there’s much use revising the basics of frequentism, as I disagree with its interpretation of probability and find the idea of confidence intervals problematic. Hypothesis testing is another problem I have with frequentism, and I think we’re going to see a lot of scientific papers in the near future converting the p values for their ANOVA into a “sigma” level as a result of CERN’s announcement of the 5 sigma certainty of their search for a new boson. Tony O’Hagan has a good post about the “sigma” issue on the ISBA forums.
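To make that p value to “sigma” conversion concrete, here is a minimal sketch. It’s in Python rather than R so that it needs nothing beyond the standard library. The function names are my own, and I’ve assumed the one-sided tail convention by default (which, as far as I understand, is the one particle physicists use); pass two_sided=True for the other convention.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_sigma(p, two_sided=False):
    """Re-express a p value as a number of standard normal deviations.

    Assumption: by default the p value is a one-sided tail probability.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    tail = p / 2 if two_sided else p
    # The quantile of a lower-tail probability is negative, so flip the sign.
    return -STD_NORMAL.inv_cdf(tail)


def sigma_to_p(sigma, two_sided=False):
    """Tail probability beyond `sigma` standard deviations."""
    tail = STD_NORMAL.cdf(-sigma)
    return 2 * tail if two_sided else tail


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(f"p = {p}: {p_to_sigma(p):.2f} sigma (one-sided)")
    print(f"5 sigma: p = {sigma_to_p(5.0):.3g} (one-sided)")
```

Under these assumptions the usual p = 0.05 comes out at roughly 1.64 sigma, and 5 sigma corresponds to a one-sided tail probability of about 2.9e-7. The conversion is a one-to-one relabelling of the same tail probability, so quoting a sigma level fixes none of the problems with the p value it came from.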

Edited to add: TL;DR? Got a maths degree and some familiarity with stats? Read Gelman. Don’t have a maths degree? Read Woodworth. Want to understand what Bayesian stats is but don’t want to read a textbook? Read McGrayne. Want to know how to use R? Read a Use R! book. Once you’ve got a decent understanding of what statistics is, read papers for specific topics because there is almost never a book about what you want.


2 thoughts on “Where to start if you’re going to revise statistics”

  1. Mahdi

    Thanks for your very informative post. I have an interest in Bayesian inference from a computational point of view. I know that the main challenge for Bayesian inference is the computational barrier; that is, many decision problems relevant to Bayesian computation are NP-hard (no efficient algorithms are known). I am looking for a reference (ideally introductory) that addresses this challenge from a statistical point of view.

    Many Thanks

    1. Sam Clifford Post author

      Hi Mahdi. I haven’t heard of Bayesian computation (in general) being NP-hard, but a quick search reveals some papers on Bayesian network inference being NP-hard. In terms of computationally efficient Bayesian statistics, you might be interested in Approximate Bayesian Computation (ABC). ABC does away with computing likelihoods in favour of generating data from the model and retaining a parameter sample only if the simulated data are “close enough” to the observed data. Hamiltonian Monte Carlo (particularly the No-U-Turn Sampler of Hoffman and Gelman) is another attempt to improve computational efficiency that looks very promising.
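
The ABC idea in the reply above can be sketched as a plain rejection sampler. This is only a toy illustration, in Python with the standard library, and everything specific in it is my assumption rather than anything from a particular reference: a normal model with known standard deviation 1, a uniform prior on the mean, the sample mean as the summary statistic, and a fixed tolerance. The function name abc_rejection is made up for this example.

```python
import random
import statistics


def abc_rejection(observed, n_draws=20000, tolerance=0.1, seed=1):
    """Toy ABC rejection sampler for the mean of a normal model.

    Assumptions (illustrative only): data ~ Normal(mu, 1),
    prior mu ~ Uniform(-10, 10), summary statistic = sample mean.
    """
    rng = random.Random(seed)
    obs_summary = statistics.fmean(observed)
    n = len(observed)
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-10.0, 10.0)  # draw a parameter from the prior
        # Simulate a data set from the model instead of evaluating a likelihood.
        simulated = [rng.gauss(mu, 1.0) for _ in range(n)]
        # Keep the draw only if the simulated summary is "close enough".
        if abs(statistics.fmean(simulated) - obs_summary) <= tolerance:
            accepted.append(mu)
    return accepted


if __name__ == "__main__":
    data_rng = random.Random(0)
    observed = [data_rng.gauss(3.0, 1.0) for _ in range(30)]
    sample = abc_rejection(observed)
    print(len(sample), "draws kept; approximate posterior mean:",
          round(statistics.fmean(sample), 2))
```

The retained draws approximate the posterior for the mean. Shrinking the tolerance improves the approximation but throws away more simulations, which is the basic trade-off in ABC.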

