Some people make their visualisations in Excel; I make mine in R; others still use tools like Processing or InDesign. Bret Victor shows us how ideas from each approach can be combined to make dynamic visualisations.
I’ve picked up a hobby over the last few months that is paying delicious dividends: homebrewing. It’s something I’d been wanting to try since about this time last year and I finally dropped the money (a cooking store voucher) on a cider homebrewing kit in February. My first batch was an apple cider that came with the kit and it’s been improving with age since the first bottle was opened in late February/early March. The second batch was a pear cider that a friend asked me to make for her; it was divided into two batches after primary fermentation so that I could try something different with the “excess”. The resulting pear and berry cider will make its debut quite soon, as it’s been patiently settling and aging over the last three weeks or so.
While I haven’t been keeping time series of the specific gravity, temperature and colour of the cider as it brews, there are certainly grounds to do so. Brewing and statistics have a shared history that goes back at least as far as William Sealy Gosset, who developed the t-distribution (and test) under the name “Student” while working at the Guinness brewery in 1908. Brewing involves balancing complex ecosystems of a whole lot of different things (depending on what you’re making) and is essentially a giant biochemical experiment. To get properly into brewing requires an understanding of botany, chemistry, microbiology, physics and statistics as you attempt to turn your basic ingredients into something which is tasty, non-toxic and perhaps even effervescent. I would like to start brewing beer at home soon, which will no doubt lead to me reading a lot more about hops, malt, wort, grains and yeasts and taking more fastidious notes.
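As a nod to Student, here’s a toy sketch in R of the kind of note-taking that would pay off: comparing final specific gravity readings between two batches with a t-test. The numbers below are invented for illustration, not measurements from my actual brews.

```r
# Invented final specific gravity readings from two hypothetical batches
apple <- c(1.002, 1.004, 1.003, 1.005, 1.003)
pear  <- c(1.006, 1.008, 1.007, 1.009, 1.007)

# Welch's t-test (R's default, which doesn't assume equal variances)
t.test(apple, pear)

# The classic Student's t-test assumes equal variances
t.test(apple, pear, var.equal = TRUE)
```

With real brewing logs you’d have repeated readings over time rather than two independent samples, but the same base functions get you a surprisingly long way.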
So my exposure to microbiology has been twofold over the last year: working with a Finnish colleague on papers dealing with fungus and endotoxin counts in the UPTECH project, and brewing my own alcoholic cider at home. The main fungus paper has been submitted and we’re checking the modelling on the endotoxin paper so that it can be submitted before this colleague leaves in the next few days. I can’t think of a more fitting thing to bring to her farewell party than a drinkable microbiology experiment.
Meta-analysis with a covariate feels really weird. I want to compare the relationship between the distributions of the mean concentration of endotoxin in the air and in dust samples across 50 locations. I wasn’t sure I did it the right way, but the posterior estimates are consistent with my naïve approach of regressing the means of the air samples on the means of the dust samples. It’s important to account for the variability when doing this sort of post hoc analysis because a point estimate of the mean doesn’t reflect anywhere near the full set of knowledge you have about your parameters of interest.
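For what it’s worth, one way to sketch this kind of model in R is as a meta-regression with the metafor package, which weights each location’s mean by its sampling variance rather than treating it as a known point. The data frame and column names below are invented for illustration; this is not the actual UPTECH analysis.

```r
# Hypothetical sketch: meta-regression of location-level air endotoxin
# means on dust means, propagating each location's sampling variance.
# 'locations' and its columns (air_mean, air_var, dust_mean) are invented.
library(metafor)

# yi: estimated mean air endotoxin per location
# vi: sampling variance of that mean
# mods: dust mean enters as a covariate (moderator)
fit <- rma(yi = air_mean, vi = air_var, mods = ~ dust_mean, data = locations)
summary(fit)

# The naive approach for comparison: regress the point estimates
# directly, ignoring the within-location variability.
naive <- lm(air_mean ~ dust_mean, data = locations)
```

The two fits can give similar slopes when the within-location variances are small or roughly equal, which would be consistent with what I saw in the posterior estimates.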
On an unrelated note, another UPTECH paper has been published. This one looks at spatial variation of particle number concentration in the school environment. Congratulations to Farhad Salimi, the lead author, on his first publication. Farhad is one of the PhD students on the UPTECH project and is due to finish his thesis later this year. I’ve worked with him on two of his papers (this one and another which has been submitted) and he’s really thrown himself into learning how to use R. This has not only made it easier for me to collaborate with him but has also made his analysis possible.
In Australia, at least, the impact factor of the journals you publish in plays a large role in your advancement through academia. Universities are always under pressure to publish their research in more prestigious journals, conflating the impact factor of the journal with the impact of the research published in it. There are many ways journals can game their impact factor, many ways researchers can game the indices that describe the impact of their work, and so on. That said, it’s always good to aim to produce research that will be accepted in a high quality journal.
I’ve been excited about the PLoS journals since their launch and I believe QUT is a subscribing member, which means our publication fees are covered. It’s one of the best Open Access journal groups around and doesn’t appear to be a cash grab like some other publishers who are attempting to use Open Access as a business model to increase profits rather than because they believe in the free dissemination of research.
UPTECH collected fungi and endotoxin data at the 25 schools, and we’re about to submit the fungi paper (which means work must continue on the endotoxin paper). I was considering whether we should submit to PLoS One (IF 2011: 4.092) and then had a look at what other journals they have which might be an appropriate home. I really think once we get the clinical data from our Southern collaborators we should aim to do the best statistical modelling we can. I’m heartened by the fact that the head of the clinical group we’re working with has a strong background in stats and a desire to learn more Bayesian statistics. I don’t know if we can pull it off, but the prospect of having something investigating the role of fungi and endotoxin in child health published in PLoS Pathogens (IF 2011: 9.172) is exhilarating.
There are things I’ve heard of and never followed up, like Expectation Maximisation (and Variational Bayes, for that matter), Expectation Propagation and Hamiltonian Monte Carlo. There are things I once learned about and forgot because I didn’t have the background at the time, such as Importance Sampling, Rejection Sampling and Slice Sampling. Then there are things at the cutting edge of statistical research that aren’t necessarily statistical methods but means of implementing them, and which are transforming the way we do statistics, such as CUDA.
I’ve managed to pick up a few little statistical novelties along the way, such as nonparametric Bayes, hierarchical linear models and some of the theory behind Gaussian Markov random fields (GMRFs) and Gaussian Processes, but I feel like I’m lagging behind where I want to be. This could be a consequence of being based in a group with no other statisticians. Were I doing my postdoc in a statistics group, I’d be more deeply immersed in a culture of doing statistics research, rather than doing scientific research which requires statistics I already know.
UPTECH is probably the biggest and most collaborative project my research group has ever embarked upon. Bringing together aerosol researchers at ILAQH, clinical medical scientists at the Woolcock Institute for Medical Research and international collaborators, it is likely to generate about fifty papers in the coming years. We currently have no system for keeping track of who is working on what and how developed each paper is. It’s probably too big for one or two investigators to keep in their head and I don’t think a Word document or Excel spreadsheet is going to cut it.
I spent some time this afternoon mucking around in Microsoft Access, trying to figure out how to put together a relational database that’s built well enough to capture information about authorship, where the authors come from, which journal or conference each paper is being submitted to, and so on. It also needs to be simple enough that the people in my group, who are not database experts (and I’m certainly not one), can operate it fairly painlessly, such that they’ll actually use it.
I had a play with the “Projects” template but it was a bit too much, even though it did have some nice features such as the ability to import contact details from Outlook. I ended up making a table for papers and a table for people, and set up a form for data entry. The “people” table feeds a drop-down selection for lead author and a drop-down multiple selection field for co-authors, which was quite simple to set up and is going to make things much easier. It works for the time being, and I’ve got a few things I’d like to add, such as queries to return which papers a person is working on, how many lead author papers each person has, etc. It’s going to be a much more interesting way to learn about Access and databases than maintaining the utterly massive UPTECH database that was designed by someone else and then passed to me.
Although if there are purpose-built systems for this sort of thing, I am more than willing to listen to what others use.
Quora – What do statisticians do at Google? Some of the answer is deliberately vague, but the answer appears to be “Figure out whether the engineers have made the search algorithms better” and “Try to quantify whether changes to the way ads work have been positive”.
I work in an aerosol science group which has studied tobacco smoke. Something I can’t find in our publication history is a study on marijuana smoke. NORML have a page dedicated to disputing some of the myths surrounding marijuana that come from both the pro- and anti-marijuana camps. The review was done in 1994 so I’d be interested to see how the epidemiological evidence has changed as our equipment has improved. I don’t know that the Queensland government (or QUT) would let us purchase any marijuana to do the study.
One of my favourite video bloggers, Laci Green, talking about why women aren’t going into STEM fields. I’m quite fortunate to work with two research groups headed by women. There’s no sexism coming from the top of these groups and there are roughly even numbers of men and women in each group. Still, I know that not every university, nor every group within my university, has this sort of balance. Whether overt or covert, there are still institutional barriers to equality of opportunity for women.
It feels like it’s been a very long time since I posted a substantial update here.
Last week was incredibly stressful and had me working on my thesis as well as on three other papers that I’m writing with colleagues here at ILAQH. My primary supervisor is back as of this week and it’s going to be a huge push to get this thesis finished, which means not much time for other projects. Since importing papers and expanding my literature review and introduction these past few days, my thesis document has grown from about 50 to about 160 pages. Of course, the page and word counts are unimportant in the end; what matters is that I write good work that is viewed favourably by the panel.
I realised that I hadn’t announced it here, but the Centre for Air quality & health Research and evaluation (CAR) has granted me a post-doctoral fellowship worth 75% of a full-time equivalent load. This is about twice what I get as an APA(I) scholarship holder, lasts for a year, and will keep me fed and clothed while I work on a variety of papers regarding statistical modelling of air quality.
GitHub for Windows can be such a pain sometimes. I guess it’s partially my fault for attempting to use version control on the compiled PDF of a LaTeX document, but I spent a fair amount of time today attempting to fix up a colleague’s local repository. I’m now a bit more familiar with cherry-pick and rebase, but it would have been nice to have it just work. For some reason, GitHub for Windows on this colleague’s computer simply will not sync properly, so my colleague has had to become a bit more familiar with the common commands (push, pull, fetch, commit, merge). It works fine on their Mac, though. I run GitHub for OS X at home and it’s an absolute dream. At work (Windows XP) I have had no end of trouble with various programs like TortoiseGit. I think when I start my post-doc I’ll organise to have my computer converted to a Linux system.
After all that, though, we did make some pretty good progress on the modelling in this paper. I’m not quite sure which journal we’ll be sending it to but it’s a really nice piece of work with some personal monitoring data, simple but informative analysis and some very creative use of the base graphics system in R.