Posterior samples

GPStuff 4.1 has been released recently. I’d like to work with Aki Vehtari’s group. I should really learn more about different inference approaches like EM, EP, VB.

Some advice for my SEB113 students who may struggle with the workload of first semester university comes from the sagest of equines, @horse_ebooks.

Hugh Possingham’s talking on Monday about the mathematics and economics of conservation as part of the BrisScience seminar series. I’ve been meaning to make it to a Possingham talk for a while.

Git used to write high school maths textbook

As a mathematician who uses git and LaTeX to collaboratively write papers with Finnish researchers, the following story is really neat. A group of Finnish mathematicians, students and teachers got together and hacked out a draft upper secondary school level mathematics textbook. There’s still some tidying up to be done but a snapshot’s available here.

The collaborators used git to weave together the various parts of the book, allowing them to work on separate sections of the book and feed them back to GitHub. One of the benefits of something like git is that when two people work on the same section and both try to commit their changes it flags the conflict and forces you to resolve it. Using Google Docs to collaborate like this leads to people getting in each other’s way and overwriting each other. Using Dropbox results in a forked version of the conflicted file with no sensible way to resolve the conflict.

If you read the comments in their blog post you’ll see stories of other researchers doing similar work to create books. That the Finns are releasing their book under a CC-BY license means that others can take their work, fork the git repository and derive and compile their own versions of the book. If you speak Finnish, their Facebook page might also be of interest.

Unfinnished business

I’ve just heard from Kerrie Mengersen that the Finnish paper got rejected by BA for not being novel enough to publish there. So now I’m in a situation where I’ve got a paper which is too methodological for an applied stats journal (and far too methodological for an atmospheric science journal) and not a novel enough methodology for a journal as theoretical as BA. If BA don’t think it’s novel enough and it’s not the first time this data’s been published we’ll struggle to get it into something like JRSS B (IF: 3.645), JRS: Interface (IF: 4.402) or PLoS One (IF: 4.092). Our options, then, seem to be trying an Elsevier journal like EMS (IF: 3.114) or CSDA (IF: 1.028), which I’m not keen to do, somewhere like JSS (IF: 2.647) or a more applied journal like JRSS C (IF: 0.828, which is quite low).

I’d like to put this in a statistics journal because I want to have a career as a statistician rather than just someone who can only work in aerosols. That’s one of the reasons I’m not keen to publish this somewhere like Atmospheric Environment (where both my article and Bjarke’s, the bases of this work, were published).

I’m really kicking myself now for not submitting an abstract for this as a contributed talk at ISBA. I would’ve got a BA article out of it.

In happier news, I gave a presentation with two other PhD students to BRAG this morning where we talked about INLA. It went well and I think we’ve convinced a few of the others that INLA is pretty cool and worth using.

Edit: and I’m getting a lot of mileage out of the Finnish/finish pun.

Resubmission of Finnish paper

I’ve spent a lot of my time since Healthy Buildings finished revising the semi-parametric forecasting paper. We had submitted to Annals of Applied Statistics but it was rejected. We got some very useful comments back from the reviewer, though, and I think it’s a much stronger paper now. The reviewer encouraged us to rewrite the paper with a focus on the methodology rather than the application and submit it to a more theoretical journal. I have just submitted it to Bayesian Analysis (IF: 1.650), the official journal of ISBA, and uploaded the preprint to arXiv (where it will replace the current version in the next 24 hours).

As an Open Access journal with a well-written LaTeX document class, Bayesian Analysis is a journal I can get behind. Some very good papers have appeared there and as Bayesian statistics continues to grow as a field (and ISBA as a society) I think we can expect to see BA really take off as a journal. So much of modern statistics is algorithms rather than proofs and making these available to people, particularly people who aren’t academics, with freely available, peer reviewed papers will help improve the statistical capabilities of modern science.

My spoon is too big

Rejected.

I had submitted the Finnish paper to an applied statistics journal and received, within a working day or two, a response from the reviewer. They said the paper doesn’t focus as much on the application and is in fact more methodological. They go on to make a few suggestions as to how we could improve our method (mainly the forecasting and posterior summary stuff) and that we should submit it to a methods journal (I’m thinking Bayesian Analysis).

Having not studied much statistics in undergrad and only learning Bayesian statistics to any degree after starting my PhD, I have felt like the work I’ve been doing was just applying methods that others had developed and that I wasn’t doing much statistics research. My first paper was more or less just that, fitting a GAM to some air quality data. It’s a nice paper, I’m proud of it, and it was a very valuable piece of work in terms of me understanding GAMs and semi-parametric regression; it took a lot of work to get there. At the same time, it felt a bit like I was using an R package to do some magic.

So while the Finnish paper has been rejected by a journal, I am buoyed by the reviewer’s comments about it being a well written paper that outlines a nice method with some solid statistics behind it. We have some changes to make (and I agree with the comments they make about our posterior summaries) but the thought of publishing a methods paper is very exciting.

ISBA 2012 – A few thoughts

Christian Robert asked for some guest bloggers for ISBA 2012 and today his ‘og features my thoughts as of this morning’s coffee break. There have been some really amazing talks in the sessions I’ve gone to, mostly in the NP Bayes talks.

My poster went well, I had a good discussion with Daniel Williamson about some of the shortfalls of P-spline models when smoothing temporal data. Hopefully I convinced him that my use of AR residuals means I’m not modelling noise with a highly oscillatory spline. I don’t think I can convince him of the validity of using an informative Gamma(1,b) prior for the smoothing parameter as he’s quite firmly in the subjective priors camp. Perhaps he and Sama should have a meeting.

I still haven’t been able to find Jukka Corander, he didn’t seem to be at the poster session where three of his students were presenting. Perhaps I just haven’t spotted him because we’ve only met once before and that was a year ago.

LaTeX and git

At the request of ihrhove I’ve decided to talk a little bit about using git and LaTeX together. I currently have two private git repositories; one for the Finnish paper and the other for all of my thesis work. I’ve talked previously about the Finnish paper so I’ll give a brief overview of how I use it with my thesis. Keep in mind that I don’t have the thesis repository shared with anyone, because my supervisors don’t use git, nor do they edit the documents I work on directly: two print out draft papers and write on them, while the third (who has used CVS/SVN in the past) uses Foxit to annotate PDFs directly and sends them back to me.

To start (and possibly end, if you’re easily convinced) with, LaTeX is just code. So to me there’s no reason why you can’t use any service you’d normally use for code for LaTeX. Everything that is directly being used in a paper comes under my version control with git.

Each paper in my thesis repository has its own folder. Within that folder there is a LaTeX subfolder, where I keep everything needed for the writing of the paper, and an R or MATLAB folder depending on what program I’m using to do the modelling (and all the code goes into the repository). Within the LaTeX folder I have a whole bunch of .tex files and a folder where I store the images to be included in the paper.

One of my favourite commands in LaTeX is \input. Every section in a paper has its own LaTeX source file. I find that this helps me navigate my work when I’m writing, especially when making corrections. Each file gets worked on separately and I save frequently. If I’m finished dealing with a section or I’m heading off for a break I will save everything and commit the current changes with a note about which section I’ve been focussing on. I picked this \input based writing up in my Honours degree when I got sick of having screen after screen of text. If I want to omit a section in a draft I can just comment out the \input line. Reorganising sections and maybe even subsections, becomes an issue of swapping two or three lines of LaTeX rather than copying and pasting giant blocks of text.
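
As a rough sketch of what this looks like (the section file names, the title and the image folder name here are made up for illustration, not taken from my actual papers), a top-level file can be little more than a preamble and a list of \input lines. Each \input{name} reads name.tex from the same folder, so those files need to exist before this will compile.

```latex
% main.tex: hypothetical top-level file for one paper.
% Each \input{name} pulls in name.tex from the same folder.
\documentclass{article}
\usepackage{graphicx}
\graphicspath{{images/}}   % hypothetical folder holding the PDF figures

\title{Working title}
\author{A. N. Author}

\begin{document}
\maketitle

\input{introduction}
\input{methods}
%\input{simulation}   % commented out: section omitted from this draft
\input{results}
\input{discussion}

\end{document}
```

Swapping the results and discussion lines reorders the paper, and because each section lives in its own file, a commit that touches only methods.tex shows up in the history as exactly that.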

I’m a sucker for vector graphics so I will use PDF graphs and pdflatex wherever I can. Occasionally I succumb to using PGF/TikZ for a while but usually have to generate so many different styles of plots that I don’t bother. So anyway, PDF graphics. These are really quite small and can be stored in git no trouble at all. I know git’s more or less useless for version control and revision of binary files (but PDF and EPS files are quite different) but I find it useful to be able to overwrite my graphs and still have the older versions available through reverting to a previous commit rather than making endless folders called “oldgraphics”.

The root of my thesis repository has a folder called “Bibliography” which is where a monolithic bibtex file called “allpapers.bib” is stored. Because I will cite the same references across multiple papers I find the idea of having separate bibliography databases a bit silly. I use JabRef to edit this, by the way. All my \bibliography commands point to ../../Bibliography/allpapers.bib. I’ve even got a template for papers with that line in it so that I don’t even have to think about how I do my referencing.
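
A minimal sketch of such a template is below, assuming the main .tex file sits two levels below the repository root (paper folder, then LaTeX folder). The citation key and the choice of the plain style are placeholders for illustration only.

```latex
% Hypothetical paper template; lives in <paper folder>/LaTeX/,
% two levels below the repository root.
\documentclass{article}

\begin{document}

Penalised splines are handy \cite{hypotheticalkey2011}.  % placeholder key

\bibliographystyle{plain}   % assumption: any BibTeX style would do here
% The path is conventionally given without the .bib extension.
\bibliography{../../Bibliography/allpapers}

\end{document}
```

Because the path is relative, the same line works unchanged in every paper folder that sits at the same depth in the repository.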

With regards to the Finnish paper, this compartmentalisation reduces, even further, the risk of conflicts. Committing changes to one section at a time means the commit messages are often quite descriptive without having to be quite long. The mixture of a few lines of changes and a brief summary means it’s easy to see what’s happened in the changelog.

I also use git to keep track of side projects that have popped up during my thesis. Coworkers will often come to me with a question about some data analysis or if I can write a script to make a certain repetitive task as automatic as possible. Each coworker gets a subfolder within a /Side Projects/ folder and within those there are folders for each little project. If I worked in a group where use of git was widespread I would consider making a separate project for each person and inviting them as a collaborator.

I kind of wish that QUT had a git server (the school of IT had a subversion server but I really dislike SVN after discovering git) and that scientists were encouraged to use R/MATLAB/SAS for their statistics and modelling instead of Excel. I think it’d be a great way to foster collaboration and have people be able to work on a project and make changes, share their code with their coworkers, etc. without sending code and draft papers around via email. Actually a private git server without the account level limitations that GitHub imposes would be an invaluable tool, especially if you could just open up your repositories to the QUT community to show what you’re doing and provide colleagues with usable code for statistical analysis, image manipulation tools, etc. And if someone within the university came across your work and liked it, you would potentially have another paper to work on within the uni.

Working on this Finnish paper

I figured I might as well describe how git made it possible to write the code and paper for the work I’ve been doing with Bjarke, Tareq, Kaarle and Jukka. Without git, we’d probably have been emailing code back and forth to each other or using something like Dropbox which would freak out over all the little changes we make, making it impossible to both be working on the same file at once.

Git is a distributed version control system that allows you to track revisions to your code and invite multiple collaborators to the project. I’ve talked about it previously but basically it’s this great system where you can work on a project with multiple people, making your changes, committing them on your local machine to save them. Once you are happy with the changes you’ve made and they don’t break anything, you can push the changes to the shared repository where all the other members of the project have access to them. If there’s a conflict, git lets you know and you can fix it up then re-commit and push. There are tools for reverting changes, making new branches, merging branches, etc.

June 13 2011. It’s still three weeks before I’m due to arrive in Finland. I upload the code from the book chapter on Bayesian Splines that I’ve been writing for BRAG. Bjarke and I spend a bit of time emailing back and forth about how splines work, as he hasn’t used them in a regression framework before. Bjarke has sent me a copy of the draft of his paper on a GLM with autoregressive residuals. I’ve still got the 8BNP workshop to attend before arriving in Finland.

July 5 2011. I arrive in Finland and meet Tareq and Bjarke for a meeting. We take a copious amount of notes during a long discussion where we set out what we want to achieve long term and what we want to have finished by the time I leave. The aim is to at least have some working code that combines my splines with Bjarke’s code that does autoregressive residuals.

July 6 2011. Bjarke’s code is added to the git repository and we get to work understanding what the other person has written. We’re both still getting to grips with how git works and end up accidentally making new branches. I spend most of my time annotating code so that I know where to look when things inevitably go wrong. Time is spent ensuring we have ways of visualising our results so we know if things are going totally wrong.

July 7-8 2011. We spend the next few days attempting to stitch the code together. Bjarke doesn’t use Google Chat or Facebook so there’s a little email correspondence at this time but it’s mostly office conversations.

July 9-10 2011. No work happens here as Bjarke and I are holidaying with his in-laws for the weekend at a summer cottage near Lappeenranta (near the Russian border).

July 11-16 2011. This is the most creative and chaotic period of working on the paper. Notes are made on A4 paper, transcribed as notes in a text file on git when they are worth following up and abandoned when they don’t lead anywhere. We start really getting to grips with multivariate splines, Metropolis-within-Gibbs, testing out new ideas, making new branches, merging them when they work, deleting them when they don’t, scribbling maths out on pieces of paper and running up and down the corridors whenever there’s a breakthrough.

July 19-31 2011. I return to Australia and we spend some time writing about what we managed to get done while I was overseas. We’re back to one branch and are largely discussing the methodology and making sure plotting works.

August, September 2011. I continue making changes to the way autoregressive residuals are handled, Bjarke codes up some diagnostics and begins examining a wide range of model specifications for the air quality data we’re working with in order to come up with a way of illustrating how what we’ve done is so cool.

October, November 2011. Some changes are made to the way the penalties are handled, the code becomes more functional and most of the focus is on plotting, diagnostics and model choice. Plots are saved as PDF files using export_fig.m within our script and are brought under the control of git so that we can replace one set of results with another in a single commit.

December 2011. Some radical changes are made to the way the autoregressive error structure is passed to the model, making it more flexible. These changes are contained in a separate branch so that Bjarke can continue working on his model comparison knowing that his code will continue to run. He checks it out and offers feedback.

January 2012. A lot of work is done on making sure the paper explains what’s going on. A few more features are introduced and the code is commented heavily.

February-April 2012. Bjarke spends a lot of time making sure the scripts to call the model fitting, forecasting and diagnostics work properly.

May 2012. A draft paper is sent around for feedback, some changes to the description of the method are recommended, as are a few different model specifications. Development on the code itself has stopped but the diagnostics, plotting and inference continues. Much of the work is now happening on QUT’s supercomputer as competing models are tested. Writing about the autoregressive errors is filled out a bit to ensure that the forecasting is highlighted.

June 2012. The paper is almost finished. We’re waiting on feedback from a co-author who has been quite sick. There have been some large rewrites based on Kerrie’s feedback, mostly to change the order so that it’s a punchier article which highlights the novelty of the method rather than me just talking about how cool splines are. Support is being canvassed among the authors for uploading the draft to arXiv and releasing the code once the paper is published.

And that’s where we stand at the moment. Hopefully I can make the git repository public and you can have a look at what’s happened and where we’ve come from with this. It might need a bit of pruning first to make sure that no data that shouldn’t be publicly available is accidentally made public. There’s a minimal working example in the code where we simulate some data, so hopefully that’s enough to demonstrate what we’ve done. There are some really neat ways of visualising the work done on GitHub, including a network diagram of the committed changes and branches, contributions of each person over time, when commits occur most frequently, what (programming) languages the project uses and how frequently additions and deletions occur (and therefore the growth rate of the project).

I hope this sheds some light on the process that’s been used. GitHub was basically a way for the QUT and Helsinki groups to collaborate, with Bjarke and me acting as the conduits for reviews and comments. Git allowed us to write a whole bunch of code together, following up all sorts of crazy ideas without getting in each other’s way. The paper was written as we went and is subject to the same version control (after all, LaTeX is code too). I have found it a really great way of working. I’d like to see how it goes with a few more people programming and whether I can work with a few other people to try to make the changes to the paper directly via git rather than me making the changes based on notes scribbled on a printed copy.

P.S. Wow, I can’t believe it’s been nearly a year since we started working on this. Well, I can, as we had a few delays where it turned out we needed to rewrite large chunks of code and the paper.

P.P.S. I just managed to merge the development branch with the modified way of dealing with the residuals back into the master branch without there being any conflicts. I didn’t expect conflicts but it’s nice to know that everything’s back in the master branch. Below is an image of the commit history. It doesn’t show the number of changes in each commit, but given that commits occur when an idea has been tested or a section written, it’s a good indication of a parcel of work being done.

Committed changes for the Finnish paper

For interest’s sake here’s a map of my time in Finland. I haven’t got the exact location of the summer cottage but it’s near Taipalsaari. Here’s my collection of photos from my time in Finland. I had originally uploaded them to Facebook and given detailed captions but the move to Google Plus ended up removing the captions. Leave a comment on them asking a question if you want to know more.

Bioaerosols

One of our researchers asked me to come along to a meeting next week to discuss my potential involvement on a paper as “the statistician”. It’s part of the UPTECH project but it’s not the spatial distribution of outdoor aerosols, which is what I’ve been working on for my PhD. Heidi Salonen, a visiting researcher from Finland, is working on UPTECH looking at the concentration of microbiological agents in indoor air. Luckily I don’t have to be an expert on spores, moulds and fungus but it’ll be interesting to pick up a bit more knowledge from the quite broad field of aerosols.

I’ve been asked a few times to be involved with some ILAQH projects but at the moment it’s been limited to helping write the statistical methodology for grant applications or discussing an approach for data from another part of the UPTECH project. As far as I recall, this is the first time that someone from ILAQH has come and said “Hey, I want you to be a co-author on this paper”.

This sort of thing is what I’ve been looking forward to as I finish off my PhD: the chance to get out of my specific topic and start looking at other people’s work. I’ll probably try to recommend a non-parametric regression technique but I have a feeling it’ll be a classically designed experiment and that all we’ll need is some t-tests.

A mostly Fin(n)ish[ed] paper

The paper I started with some collaborators in Finland (Bjarke Mølgaard, Jukka Corander, Kaarle Hämeri, Tareq Hussein) almost a year ago is nearly done. It’s been nearly done a few times, but now all that remains is to do a little bit of model choice regarding the separability of the effects of meteorology on ultrafine particle number concentration. We’ve been using git to send the paper and code back and forth (well, Bjarke and I have) and I’ve found that to be a really simple way of collaboratively writing code and a paper. To see the changes made, one need only look at the commit details. Much nicer than using tracked changes in Word and emailing a bunch of versions of the same file back and forth and trying to do complicated merges of changes.

I am really looking forward to submitting this paper, as it’s probably the most methodological work I’ll get out of my PhD (the other papers are largely applications of some novel techniques to the UPTECH project’s data). It’s quite a nice blending of the work done by the Finnish authors previously [1] as part of Bjarke’s PhD and some of the ideas in my first paper [2].

While I don’t know that it will totally revolutionise atmospheric modelling (in the way that I’m sure we all hope it will), it’s quite a nice technique that increases the flexibility of the Generalised Additive Model and will hopefully encourage anyone interested in doing Bayesian modelling with the GAM to stop using Matt Wand’s WinBUGS approach [3, 4]. To be clear, I find GAMs in WinBUGS particularly cumbersome to code given that WinBUGS doesn’t deal with matrix operations very well and the use of P-splines requires a lot of matrix operations. Having said that, though, Wand’s code is a nice intro to Bayesian splines where you don’t have to write your own MCMC sampler. I just think it has some limitations that are not easily overcome.

I’d like to present this to a statistics conference but it wasn’t anywhere near ready enough to demonstrate at ISBA 2012 when I was submitting an abstract.

[1] B. Mølgaard, T. Hussein, J. Corander, K. Hämeri, Forecasting size-fractionated particle number concentrations in the urban atmosphere, Atmospheric Environment, Volume 46, January 2012, Pages 155-163, ISSN 1352-2310, 10.1016/j.atmosenv.2011.10.004.

[2] S. Clifford, S. Low Choy, T. Hussein, K. Mengersen, L. Morawska, Using the Generalised Additive Model to model the particle number count of ultrafine particles, Atmospheric Environment, Volume 45, Issue 32, October 2011, Pages 5934-5945, ISSN 1352-2310, 10.1016/j.atmosenv.2011.05.004.

[3] C. M. Crainiceanu, D. Ruppert, M. P. Wand, Bayesian Analysis for Penalized Spline Regression Using WinBUGS, Journal of Statistical Software, Volume 14, Issue 14, September 2005.

[4] J. K. Marley, M. P. Wand, Non-Standard Semiparametric Regression via BRugs, Journal of Statistical Software, Volume 37, Issue 5, November 2010.

P.S. I apologise for the awful pun, but Shaun Micallef has been on my mind recently.