Using ggplot2 in SEB113

One of the big pieces of feedback we got during last semester’s SEB113 class was that the programming was difficult to understand and reproduce. While the subject is not a programming subject, we do use R quite heavily for all of the data analysis. Maths B isn’t a pre-requisite for SEB113 and I’d wager that even fewer of the students entering the ST01 Bachelor of Science program have taken senior IPT/ICT subjects at their high schools than have taken Maths B.

This semester we introduced R in the very first lecture and gave it some context. This means that students are aware from the get-go that they will be learning statistics through data analysis on a computer. The lectorials introduce the concepts and provide code for the resulting plots and analysis; the computer labs show how to do that particular form of analysis in R; and the collaborative workshops reinforce the labs by getting groups to work through the analysis of some problem using the statistical concepts and code that they’ve learned that week.

One of the biggest stumbling blocks last semester was the inconsistency in the way visualisation was done in R. We used a combination of base graphics, trellis graphics in the lattice package, heatmaps and dendrograms from other packages and had to turn to yet another package to get colorbars for the heatmaps. Part of the fine-tuning this semester has been employing someone (who also does the labs) to rewrite the graphics in the labs in terms of Hadley Wickham’s ggplot2 library. This brings consistency to the graphical aspect of the unit and the plot geometries are named explicitly so that it’s clear what style of plot you’ll be generating.
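To give a rough idea of what “named geometries” means in practice, here is a minimal sketch. It uses the mtcars data set that ships with R rather than anything from the unit, and the object names are my own choices for illustration.

```r
library(ggplot2)

# mtcars ships with R; it stands in here for the unit's own data sets.
# Treat the number of cylinders as a categorical variable.
cars <- transform(mtcars, cyl = factor(cyl))

# The data and the aesthetic mapping are written once...
base_plot <- ggplot(cars, aes(x = cyl, y = mpg))

# ...and the style of plot is chosen by naming a geometry.
p_points  <- base_plot + geom_point()    # raw observations
p_boxplot <- base_plot + geom_boxplot()  # five-number summaries per group
p_violin  <- base_plot + geom_violin()   # density shape per group

print(p_boxplot)
```

Swapping one geom_ call for another changes the plot style without touching the rest of the code, which is the consistency the mixture of base, lattice and other packages didn't give us.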

I was quite sceptical of ggplot2 when I first saw it, as the only exposure I had to it was the default options for a scatterplot with points. Sure, that’s pretty boring, but the fact that you can make a faceted grid (or wrap it using facet_wrap instead of facet_grid) means that investigating the use of small multiples is so much easier. Small multiples is a visualisation technique popularised by Edward Tufte that allows the reader to see how the relationship between two variables changes as you also vary one or two other (categorical) covariates. Doing this in lattice required specifying a formula, similar to the way you specify a model in lm, but lattice is so different from the base graphics that you lose consistency.
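A small sketch of the two faceting functions, with a lattice version alongside for comparison. Again this uses the built-in mtcars data, and the choice of variables is an assumption made purely for illustration.

```r
library(ggplot2)
library(lattice)

# A plain scatterplot of fuel economy against weight.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# facet_grid lays panels out in rows and columns, one categorical
# variable each: transmission type (am) by number of cylinders (cyl).
p_grid <- p + facet_grid(am ~ cyl)

# facet_wrap takes a single categorical variable and wraps the panels
# into as many rows as needed.
p_wrap <- p + facet_wrap(~ cyl, ncol = 2)

# The lattice way of conditioning on cyl uses a model-style formula.
l_plot <- xyplot(mpg ~ wt | factor(cyl), data = mtcars)

print(p_grid)
```

The faceted versions are just the original plot with one extra term added, so students who can already make the scatterplot don't need to learn a new interface to get small multiples.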

I’m touching up this week’s workshop at the moment and I’m really noticing where the graphics code has been greatly simplified by having a grammar of graphics behind a powerful set of plotting routines. The GGally package provides things like ggpairs, which does what pairs does in the base graphics but by default gives you the correlations above the diagonal and the scatterplots below it. This makes for more informative graphs with the beauty of the ggplot style.
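For comparison, here is roughly what the two calls look like side by side. This is only a sketch on the built-in mtcars data, with a handful of continuous columns picked arbitrarily, and it assumes the GGally package is installed.

```r
library(GGally)  # also attaches ggplot2

# A few continuous variables from mtcars, chosen only for illustration.
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Base graphics: scatterplots in every off-diagonal panel.
pairs(vars)

# GGally: with the default settings for continuous variables, correlations
# go above the diagonal, scatterplots below, and densities on the diagonal.
pm <- ggpairs(vars)

print(pm)
```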

As far as I can tell we’re hearing fewer complaints about the programming, and the visualisation is happening much quicker in the workshops this semester. ggplot2’s documentation is amazing, and it’s often a choice of geometry (and changing one or two options) rather than a choice of package (and changing the entire approach to the coding).

The use of ggplot2 has made teaching visualisation much simpler and we’re now getting through the workshops quite quickly because the visualisation is no longer a huge stumbling block.

