A coworker sent me this article about alternatives to the default 0.05 p value in hypothesis testing as a way to improve the corpus of published articles so that we can actually expect reproducability and have a bit more faith that these results are meaningful. The article is based on a paper published in the Proceedings of the National Academy of Sciences which talks about mapping Bayes Factors to p values for hypothesis tests so that there’s a way to think about the strength of the evidence.

The more I do and teach statistics the more I detest frequentist hypothesis testing (including whether a regression coefficient is zero) as a means of describing whether or not something plays a “significant” role in explaining some physical phenomenon. In fact, the entire idea of statistical significance sits ill with me because the way we tend to view it is that 0.051 is not significant and 0.049 is significant, even though there’s only a very small difference between the two. I guess if you’re dealing with cutoffs you’ve got to put the cutoff somewhere, but turning something which by its very nature deals with uncertainty into a set of rigid rules about what’s significant and what’s not seems pretty stupid.

My distaste for frequentist methods means that even for simple linear regressions I’ll fire up JAGS in R and fit a Bayesian model because I fundamentally disagree with the idea of an unknown but fixed true parameter. Further to this, the nuances of p values being distributed uniformly under the Null hypothesis means that we can very quickly make incorrect statements.

I agree with the author of the article that shifting hypothesis testing p value goal posts won’t achieve what we want and I’ll have a bit closer a read of the paper. For the time being, I’ll continue to just mull this over and grumble when people say “statistically significant” without any reference to a significance level.

NB: this post has been in an unfinished state since last November, when the paper started getting media coverage.