I was hanging out with some friends the other night and the conversation turned to an infographic that one of the guys was working on for a friend of his: the number of times the word “smeg” (or a variant) appears in each episode of Red Dwarf (made in Processing). We got chatting about data visualisation and ended up talking about Edward Tufte and his approach to using data to show information. Apparently my friend’s brother gave him a copy of Tufte’s book last year, and he was kind enough to loan it to me.

One of my supervisors is a huge fan of Tufte’s approach and it’s rubbed off on me. Throughout my thesis I’ve moved away from giant scatter plots of the data to summary plots that don’t use more “ink” than is necessary to show the information. An example he gives in the book is the boxplot, which typically contains a lot of redundant information. The image below shows Tufte’s stripped-down boxplot and the default R boxplot for the same data. In the more traditional boxplot, the maximum of the data (within 1.5 IQR of the upper quartile) is represented by the end of the whisker as well as a horizontal cap. The cap isn’t necessary and neither is the box that marks the 25th and 75th percentiles, as the inner ends of the whiskers already mark these. With no box, there’s no need for a horizontal stripe for the median, so it can be represented as a dot.
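As a rough illustration of which numbers both styles of boxplot are built from, here is a minimal sketch using base R's `boxplot.stats()`. The data and the labels attached to the five values are made up for the example.

```r
set.seed(42)
x <- rnorm(100, mean = 5, sd = 3)  # illustrative data only

# boxplot.stats() uses coef = 1.5 by default, i.e. whiskers reach the most
# extreme data points within 1.5 IQR of the hinges
s <- boxplot.stats(x)

# Label the five values for readability (the labels are my own)
names(s$stats) <- c("lower whisker", "lower hinge", "median",
                    "upper hinge", "upper whisker")

s$stats  # the five numbers both plot styles are built from
s$out    # points beyond the whiskers, drawn individually by boxplot()
```

The stripped-down version needs only these five numbers: a dot at the median and a line segment for each whisker, running from the whisker end to the nearest hinge.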

There are some plots in my thesis papers of which I’m quite proud and I will upload some of them here once the papers are finalised. My Tufte-mad supervisor has even commented on how my plots have become quite minimalist, something she attributes to the 2000s/10s despite Tufte’s work dating back to the 1970s. I’ve noticed that the R packages I’ve tended to use (R-INLA, mgcv) have quite simple graphics. While there are some fantastic plotting packages such as ggplot2 that make it quite easy to produce very pretty graphics, I feel far more familiar mucking around with the base graphics system to add points, lines and polygons to a blank plot. If you take the approach that Tufte does, that the ink on the page should represent data, and that there should be no extraneous elements to your plot (such as cross-hatching a barplot or colour where it doesn’t convey information) then it’s not hard to shy away from packages that do a lot of very nice, but ultimately data-poor, plotting.

But graphs on the printed page are not the only way to represent data or mathematical or statistical concepts. The Museum of Mathematics in New York looks to have a lot of really cool displays of a wide range of mathematical concepts, for example. My Advanced Calculus lecturer, Dr Jack Wrigley, had a background in education and often used props in class to illustrate ideas, such as holding pieces of paper against a balloon to give us a visual representation of tangent planes and normals to a surface. I don’t often have props when doing improvised theatre but academic presentation is, at the end of the day, just another type of performance.

Part of the work I’ve been doing on modelling temporal trends from split panel design data involves modelling penalised random walks where the random walk is on a torus that represents a joint term for the hour of the day and the day of the week. This “hour of the week” term has 168 unique values, but we want to smooth in both the day-to-day and hour-to-hour directions, rather than just looking at the circle formed by gluing Saturday 11pm to Sunday midnight and then Sunday 1am. Some people might be very good at visualising Markov random fields through their precision matrix, but there will be many people in my PhD final seminar audience who are not postgraduate-level statisticians. For the purpose of explaining one of the key ideas in my thesis, I am considering bringing an inflatable pool ring and a marker in order to draw the smoothing directions on the torus that represents the product of two circular spaces. If this doesn’t make up for the conceded pass I got in Jack Wrigley’s class I have no idea what will.
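For the readers who do like precision matrices, here is a minimal sketch of what smoothing in two directions on a torus can look like. It assumes first-order cyclic random walks in both directions, combined as a Kronecker sum; that is one common construction, not necessarily the exact model in the thesis. The helper `cyclic_rw1()` and the hour-within-day ordering of the 168 values are made up for this example.

```r
# Structure matrix of a first-order random walk on a circle with n nodes
# (hypothetical helper; assumes n >= 3)
cyclic_rw1 <- function(n) {
  Q <- diag(2, n)
  for (i in seq_len(n)) {
    j <- if (i == n) 1 else i + 1  # wrap the last node back to the first
    Q[i, j] <- Q[i, j] - 1
    Q[j, i] <- Q[j, i] - 1
  }
  Q
}

Q_hour <- cyclic_rw1(24)  # 11pm is a neighbour of midnight
Q_day  <- cyclic_rw1(7)   # Saturday is a neighbour of Sunday

# Assumed ordering: index = (day - 1) * 24 + hour, with hour in 1..24.
# First term links adjacent hours within a day, second term links the same
# hour on adjacent days.
Q_torus <- kronecker(diag(7), Q_hour) + kronecker(Q_day, diag(24))

dim(Q_torus)              # 168 x 168
table(Q_torus[1, ] != 0)  # each node: itself plus four neighbours
```

Each row of `Q_torus` has a 4 on the diagonal and a -1 for each of its four neighbours, which is exactly the pair of smoothing directions one would draw on the pool ring. The rows sum to zero, so the matrix is rank deficient and a model using it would typically need a sum-to-zero constraint.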

Sam Clifford (post author): Code for the plot:

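```r
# Simulate 1000 observations spread across N categories
N <- 10
means <- rnorm(N, 5, 3)   # category means
sds <- rgamma(N, 3, 2)    # category standard deviations (shape 3, rate 2)
cats <- ceiling(runif(n = 1000, min = 0, max = N))  # category labels in 1..N
dat <- rnorm(n = 1000, means[cats], sds[cats])

# Two stacked panels: stripped-down boxplot on top, default boxplot below
par(mfrow = c(2, 1), mar = c(4, 4, 0, 0) + 0.1)

# Empty plot to draw on
plot(0, 0, type = "n", xlab = "Category", ylab = "Response",
     xlim = c(1, N), ylim = range(dat), axes = FALSE)

for (i in 1:N) {
  temp.dat <- dat[cats == i]
  box.bit <- boxplot.stats(temp.dat)
  points(i, box.bit$stats[3], pch = 19)         # median as a dot
  lines(rep(i, 2), y = box.bit$stats[c(1, 2)])  # lower whisker
  lines(rep(i, 2), y = box.bit$stats[c(4, 5)])  # upper whisker
  # maybe you want to show outliers?
  #outliers <- box.bit$out
  #points(x=jitter(rep(i,length(outliers))),y=outliers,pch=19,cex=0.1)
}

axis(2)
axis(1, at = 1:N)

# Default R boxplot of the same data for comparison
boxplot(dat ~ cats, xlab = "Category", ylab = "Response")
```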

Matt: I like the stripped-down boxplot, and thanks for posting the code – that would’ve been my first question :-)

What do you think of violin plots, which do convey meaningful information apart from the median and IQR? Or do you think that’s an instance where Andrew Gelman would recommend that multiple plots would do a better job, rather than overloading a single plot?

Sam Clifford (post author): What’s a violin plot? Is it like a bean plot? Density and some other meaningful summaries?

Sam Clifford (post author): Ah okay, yeah. I’ve been playing around with violin plots but I think the boxplot’s IQR-based summaries are a bit rubbish. I think you only need half a violin plot (the symmetry is unnecessary) with the mean and median marked.
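A rough base-graphics sketch of that half-violin idea, in case it's useful. The data, the 0.4 scaling of the width and the choice of plotting symbols are all arbitrary assumptions for illustration.

```r
set.seed(1)
x <- rgamma(200, shape = 3, rate = 2)  # made-up skewed data

d <- density(x)                        # kernel density estimate
half_width <- 0.4 * d$y / max(d$y)     # scale the density to a plot width

plot(0, 0, type = "n", xlim = c(1, 1.5), ylim = range(d$x),
     xlab = "", ylab = "Response", axes = FALSE)

# Draw only one side of the violin, with the flat edge at x = 1
polygon(x = 1 + c(0, half_width, 0),
        y = c(min(d$x), d$x, max(d$x)),
        col = "grey80", border = NA)

points(1, median(x), pch = 19)  # median as a filled dot
points(1, mean(x), pch = 1)     # mean as an open circle
axis(2)
```

For skewed data like this, the gap between the filled and open dots shows the skew directly, alongside the shape of the density.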