I figured I might as well describe how git made it possible to write the code and paper for the work I’ve been doing with Bjarke, Tareq, Kaarle and Jukka. Without git, we’d probably have been emailing code back and forth to each other or using something like Dropbox which would freak out over all the little changes we make, making it impossible to both be working on the same file at once.
Git is a distributed version control system that allows you to track revisions to your code and invite multiple collaborators to the project. I’ve talked about it previously but basically it’s this great system where you can work on a project with multiple people, making your changes and committing them on your local machine to save them. Once you are happy with the changes you’ve made and they don’t break anything, you can push them to the shared repository where all the other members of the project have access to them. If there’s a conflict, git lets you know so you can fix it up, then re-commit and push. There are tools for reverting changes, making new branches, merging branches, etc.
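For anyone who hasn’t used git, here is a rough sketch of that commit, push, branch and merge cycle. It is not our actual repository: the temporary directory, the collaborator "Alice", the file `notes.txt` and the branch name `ar-residuals` are all made up for illustration, and a local bare repository stands in for the shared one on GitHub.

```shell
#!/bin/sh
set -e

# Hypothetical setup: a bare repository stands in for the shared remote.
work=$(mktemp -d)
git init --quiet --bare "$work/shared.git"

# A collaborator clones the shared repository (empty at this point).
git clone --quiet "$work/shared.git" "$work/alice" 2>/dev/null
cd "$work/alice"
git config user.name "Alice"
git config user.email "alice@example.com"

# Make a change and commit it locally; nothing is shared yet.
echo "% spline notes" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes on splines"

# Happy with it? Push the current branch to the shared repository.
git push --quiet origin HEAD

# Try out an idea on its own branch so the main line keeps working.
git checkout --quiet -b ar-residuals
echo "% AR residual idea" >> notes.txt
git commit --quiet -am "Sketch autoregressive residuals"

# The idea worked: switch back to the previous branch, merge, and push.
git checkout --quiet -
git merge --quiet ar-residuals
git push --quiet origin HEAD
```

If the idea hadn’t worked, deleting the branch instead of merging it would have left the main line untouched, which is what makes it cheap to chase ideas that might go nowhere.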
June 13 2011. It’s still three weeks before I’m due to arrive in Finland. I upload the code from the book chapter on Bayesian Splines that I’ve been writing for BRAG. Bjarke and I spend a bit of time emailing back and forth about how splines work, as he hasn’t used them in a regression framework before. Bjarke has sent me a copy of the draft of his paper on a GLM with autoregressive residuals. I’ve still got the 8BNP workshop to attend before arriving in Finland.
July 5 2011. I arrive in Finland and meet Tareq and Bjarke for a meeting. We take a copious amount of notes during a long discussion where we set out what we want to achieve long term and what we want to have finished by the time I leave. The aim is to at least have some working code that combines my splines with Bjarke’s code that does autoregressive residuals.
July 6 2011. Bjarke’s code is added to the git repository and we get to work understanding what the other person has written. We’re both still getting to grips with how git works and end up accidentally making new branches. I spend most of my time annotating code so that I know where to look when things inevitably go wrong. Time is spent ensuring we have ways of visualising our results so we know if things are going totally wrong.
July 7-8 2011. We spend the next few days attempting to stitch the code together. Bjarke doesn’t use Google Chat or Facebook so there’s a little email correspondence at this time but it’s mostly office conversations.
July 9-10 2011. No work happens here as Bjarke and I are holidaying with his in-laws for the weekend at a summer cottage near Lappeenranta (near the Russian border).
July 11-16 2011. This is the most creative and chaotic period of working on the paper. Notes are made on A4 paper, transcribed as notes in a text file on git when they are worth following up and abandoned when they don’t lead anywhere. We start really getting to grips with multivariate splines, Metropolis-within-Gibbs, testing out new ideas, making new branches, merging them when they work, deleting them when they don’t, scribbling maths out on pieces of paper and running up and down the corridors whenever there’s a breakthrough.
July 19-31 2011. I return to Australia and we spend some time writing about what we managed to get done while I was overseas. We’re back to one branch and are largely discussing the methodology and making sure plotting works.
August, September 2011. I continue making changes to the way autoregressive residuals are handled. Bjarke codes up some diagnostics and begins examining a wide range of model specifications for the air quality data we’re working with, looking for a way to illustrate why what we’ve done is so cool.
October, November 2011. Some changes are made to the way the penalties are handled, the code becomes more functional and most of the focus is on plotting, diagnostics and model choice. Plots are saved as PDF files using export_fig.m within our script and are brought under the control of git so that we can replace one set of results with another in a single commit.
December 2011. Some radical changes are made to the way the autoregressive error structure is passed to the model, making it more flexible. These changes are contained in a separate branch so that Bjarke can continue working on his model comparison knowing that his code will continue to run. He checks it out and offers feedback.
January 2012. A lot of work is done on making sure the paper explains what’s going on. A few more features are introduced and the code is commented heavily.
February-April 2012. Bjarke spends a lot of time making sure the scripts to call the model fitting, forecasting and diagnostics work properly.
May 2012. A draft paper is sent around for feedback. Some changes to the description of the method are recommended, as are a few different model specifications. Development on the code itself has stopped but the diagnostics, plotting and inference continue. Much of the work is now happening on QUT’s supercomputer as competing models are tested. Writing about the autoregressive errors is filled out a bit to ensure that the forecasting is highlighted.
June 2012. The paper is almost finished. We’re waiting on feedback from a co-author who has been quite sick. There have been some large rewrites based on Kerrie’s feedback, mostly to change the order so that it’s a punchier article which highlights the novelty of the method rather than me just talking about how cool splines are. Support is being canvassed among the authors for uploading the draft to arXiv and releasing the code once the paper is published.
And that’s where we stand at the moment. Hopefully I can make the git repository public and you can have a look at what’s happened and where we’ve come from with this. It might need a bit of pruning first to make sure that no data that shouldn’t be publicly available is accidentally made public. There’s a minimal working example in the code where we simulate some data, so hopefully that’s enough to demonstrate what we’ve done. There are some really neat ways of visualising the work done on GitHub, including a network diagram of the committed changes and branches, contributions of each person over time, when commits occur most frequently, what (programming) languages the project uses and how frequently additions and deletions occur (and therefore the growth rate of the project).
I hope this sheds some light on the process that’s been used. GitHub was basically a way for the QUT and Helsinki groups to collaborate, with Bjarke and me acting as the conduits for reviews and comments. Git allowed us to write a whole bunch of code together, following up all sorts of crazy ideas without getting in each other’s way. The paper was written as we went and is subject to the same version control (after all, LaTeX is code too). I have found it a really great way of working. I’d like to see how it goes with a few more people programming and whether I can work with a few other people to try to make the changes to the paper directly via git rather than me making the changes based on notes scribbled on a printed copy.
P.S. Wow, I can’t believe it’s been nearly a year since we started working on this. Well, I can, as we had a few delays where it turned out we needed to rewrite large chunks of code and the paper.
P.P.S. I just managed to merge the development branch with the modified way of dealing with the residuals back into the master branch without there being any conflicts. I didn’t expect conflicts but it’s nice to know that everything’s back in the master branch. Below is an image of the commit history. It doesn’t show the number of changes in each commit, but given that commits occur when an idea has been tested or a section written, it’s a good indication of a parcel of work being done.
For interest’s sake here’s a map of my time in Finland. I haven’t got the exact location of the summer cottage but it’s near Taipalsaari. Here’s my collection of photos from my time in Finland. I had originally uploaded them to Facebook and given detailed captions but the move to Google Plus ended up removing the captions. Leave a comment on them asking a question if you want to know more.