Now that I’ve handed a draft of my final thesis paper to my supervisors/co-authors, I’ve got a little headspace to work on another paper that’s been sitting on my to-do list for a while. One of the challenges with this paper is coming up with a way to represent data relating to a total of over 100 students at 24 schools. Summarising at the school level ignores a lot of the within-school variation, but attempting to use standard plotting approaches can lead to some very complex and visually busy graphs. Add to this that we can’t really use colour and it’s getting a bit tricky.
I’ve had a few more looks at the Gelman, Pasarica and Dodhia paper that I’ve previously talked about. While it doesn’t have an example of what I actually want to plot, it does make me think more about what kind of data I’ve got: whether they’re continuous, categorical or count, what their ranges are and what sort of variation occurs. With 24 schools it’s possible to do a 4 x 6 or 6 x 4 grid of subplots, and within those subplots we can generally get across what sort of variation there is at the school level. Not everyone likes such a layout, though, so I’ve been looking into grouped/stacked barplots, changing the ordering of the grouping (variable by school vs school by variable) and combining time series and barplots in the same graph (which is actually quite a good way of visualising the data we have, but can’t be done for 100 students).
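To make the 4 x 6 versus 6 x 4 choice concrete, here is a minimal sketch of how the 24 school panels could be assigned to grid cells. The helper names (`grid_position`, `layout`) are hypothetical and not from the paper or any plotting library; the only figure taken from the post is the number of schools.

```python
# Hypothetical helpers for laying out one small panel per school.
N_SCHOOLS = 24  # number of schools mentioned in the post


def grid_position(index, n_cols):
    """Return (row, col) for a panel index, filling the grid row by row."""
    return divmod(index, n_cols)


def layout(n_panels, n_rows, n_cols):
    """Return the (row, col) cell for every panel, in panel order."""
    if n_rows * n_cols < n_panels:
        raise ValueError("grid too small for the number of panels")
    return [grid_position(i, n_cols) for i in range(n_panels)]


wide = layout(N_SCHOOLS, 4, 6)  # 4 rows x 6 columns
tall = layout(N_SCHOOLS, 6, 4)  # 6 rows x 4 columns

print(wide[6])  # (1, 0): the seventh school starts the second row
print(tall[6])  # (1, 2): the same school sits mid-row in the tall layout
```

The resulting (row, col) pairs are the kind of index one would use to pick a panel in a subplot grid; which orientation reads better probably depends on the page shape and on how wide each panel needs to be.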
In the end, it’s going to come down to being creative enough to come up with a few alternatives and asking my co-authors which version they think sells the message best. I pretty much refuse to resort to pie graphs (because the scaling in area can be misleading) and feel really uncomfortable about using box plots to summarise school-level variation. I have nothing against box plots, but for the amount of data we have at each school, summarising with a minimum, maximum and the 25th, 50th and 75th percentiles is going to be very difficult without shifting to a Bayesian ANOVA.
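A small sketch shows why the five-number summary is awkward here. It assumes roughly four students per school (about 100 students spread evenly over 24 schools, which is an assumption, not something the data guarantee), and the scores are made up purely for illustration. The function name `five_number_summary` is hypothetical; `statistics.quantiles` is from the Python standard library.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Made-up scores for one hypothetical school with four students.
school_scores = [1, 2, 3, 4]

summary = five_number_summary(school_scores)
print(summary)  # (1, 1.75, 2.5, 3.25, 4)
```

With four observations the box plot reports five numbers, three of which are interpolated between neighbouring data points, so it adds apparent precision rather than summarising anything.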
Still, it’s a really interesting piece of science with some quite unique challenges in terms of the analysis and representation of that analysis (and the raw data).
Edit: and it also makes me appreciate what designers (including my two housemates) have to go through, with people saying “No, I don’t like it” but not always offering constructive advice that’s actually possible to put into practice.
Edit 2: Relevant new Gelman blog post.