Tag Archives: programming

R Markdown

I’ve been spending a bit of time over the last few days making an R tutorial for the members of my air quality research group. Rather than being a very general introduction to the use of R, e.g. file input/output, loops, making objects, I’ve decided to show a very applied workflow that involves the actual data analysis and explaining ideas as we go along. Part of this philosophy is that I’m not going to write a statistics tutorial, opting instead to point readers to textbooks that deal with first year topics such as regression models and hypothesis tests.

It’s been a very interesting experience, and it’s meant having to deal with challenges along the way such as PDF graphs that take up so much file space for how (un-)important they are to the overall guide and, thinking about how to structure the tutorial so that I can assume zero experience with R but some experience with self-directed learning. The current version can be seen here.

One of the ideas that Sama Low Choy had for SEB113 when she was unit coordinator and lecturer and I was just a tutor, was to write a textbook for the unit because there wasn’t anything that really covered our approach. Since seeing computational stats classes in the USA being hosted as repositories on GitHub I think it might be possible to use R Markdown or GitBook to write an R Markdown project that could be compiled either as a textbook with exercises or as a set of slides.

Advertisements

ALP wants to teach kids how to program, and I agree

I checked in on one of my workshop classes this morning to see how everyone was going in the final week, to remind them of the remaining help sessions and to check that they’re on track to complete their group assignments.

There weren’t many students in the class, what with it being week 13, but of one of the students was very proud of the fact that she’d lifted her marks on the problem solving tasks from 1/10 to 8/10 over the course of the semester. She told me that going back over the last few workshops helped reinforce the coding that she needed to be able to do in order to complete the assessment.

She plans on transferring into medicine, which is typically not a career that requires programming. At the end of the semester, with only one piece of assessment remaining and the decision made that she will change out of science, she is still putting a lot of effort into understanding the statistics and learning how to program is reinforcing this and allowing her to engage deeper than if we were restricted to the stats education I had in first year ten years ago where we spent a lot of time looking up the tails of distributions in a book of tables.

Maths and statistics education (for students not studying maths/stats as a major) is no longer just about teaching students how to do long division in high school and calculus and point and click statistics methods at university. While some degree such as Electrical Engineering, Computer Science and IT have traditionally been associated with some amount of programming, it’s becoming more and more common for maths and stats service units to include MATLAB or R as a means of engaging deeper with the mathematical content and understanding solutions to linear systems and differential equations or performing data analysis and visualisation. Learning to program leads to better understanding of what you’re actually doing with the code.

Computers are everywhere in our students’ lives and in their educational experiences. Due to their ubiquity, the relationship students have with computers is very different to what it was 10 years ago. Computers are great at enabling access to knowledge through library databases, Wikipedia and a bunch of other online repositories. But it’s not enough to be able to look up the answer, one also has to be able to calculate an answer when it hasn’t been determined by someone else. There is not yet a mathematics or statistics package that does all of the data analysis and all of the mathematical analysis that we might want to do in a classroom with a point and click, drag and drop interface.

To this end, I teach my students how to use R to solve a problem. Computers can do nearly anything, but we have to be able to tell the computer how to do it. Learning simple coding skills in school prepares students to tackle more advanced coding in quantitative units in their university studies but it also teaches an understanding of how processes work based on inputs and outputs, and not just computational processes, it’s all about a literacy of processes and functions (inputs and outputs). Learning to code isn’t just about writing code as a profession no more than teaching students to read is done to prepare them in their profession of priest or newsreader. Coding provides another set of skills that are relevant to the future of learning and participation in society and the workforce, just as learning mathematics allows people to understand things like bank loans.

Tony Abbott does not sound like he’s on board with the idea of giving kids the skills to get along in a world in which computers are part of our classroom the way books were when he was going through school. While reading, writing and basic mathematics skills will continue to be important skills, literacy is more than just reading comprehension. Information literacy, being able to handle data, and being able to reason out a process are even more important thanks to the changing technologies we are experiencing. Not every student is going to be a professional programmer, an app developer or big data analyst, but coding will be a skill which becomes more and more necessary as computers become more and more a part of our workplace not just as fancy typewriters or an instantaneous postal system but as a problem solving tool.

Marrying differential equations and regression

Professor Fabrizio Ruggeri (Milan) visited the Institute for Future Environments for a little while in late 2013. He has been appointed as Adjunct Professor to the Institute and gave a public talk with a brief overview of a few of his research interests. Stochastic modelling of physical systems is something I was exposed to in undergrad when a good friend of mine, Matt Begun (who it turns out is doing a PhD under Professor Guy Marks, with whom ILAQH collaborates), suggested we do a joint Honours project where we each tackled the same problem but from different points of view, me as a mathematical modeller, him as a Bayesian statistician. It didn’t eventuate but it had stuck in my mind as an interesting topic.

In SEB113 we go through some non-linear regression models and the mathematical models that give rise to them. Regression typically features a fixed equation and variable parameters and the mathematical modelling I’ve been exposed to features fixed parameters (elicited from lab experiments, previous studies, etc.) and numerical simulation of a differential equation to solve the system (as analytic methods aren’t always easy to employ). I found myself thinking “I wonder if there’s a way of doing both at once” and then shelved the thought because there was no way I would have the time to go and thoroughly research it.

Having spent a bit of time thinking about it, I’ve had a crack at solving an ODE within a Bayesian regression model (Euler’s method in JAGS) for logistic growth and the Lotka-Volterra equations. I’ve started having some discussions with other mathematicians about how we marry these two ideas and it looks like I’ll be able to start redeveloping my mathematical modelling knowledge.

This is somewhere I think applied statistics has a huge role to play in applied mathematical modelling. Mathematicians shouldn’t be constraining themselves to iterating over a grid of point estimates of parameters, then choosing the one which minimises some Lp-norm (at least not without something like Approximate Bayesian Computation).

I mean, why explore regions of the parameter space that are unlikely to yield simulations that match up with the data? If you’re going to simulate a bunch of simulations, it should be done with the aim of not just finding the most probable values but characterising uncertainty in the parameters. A grid of values representing a very structured form of non-random prior won’t give you that. Finding the maximum with some sort of gradient-based method will give you the most probable values but, again, doesn’t characterise uncertainty. Sometimes we don’t care about that uncertainty, but when we do we’re far better off using statistics and using it properly.

Posterior samples

SEB113 students really seemed to enjoy looking at mathematical modelling last week. The Lotka-Volterra equations continue to be a good teaching tool. A student pointed out that when reviewing the limit idea for derivatives it’d be useful to show it with approximating the circumference of a circle using a polygon. So I knocked this up:

approximations

Are you interested in big data and/or air quality? Consider doing a PhD with me.

This week I showed in the workshop how Markov chains are a neat application of linear algebra for dealing with probability. We used this interactive visualisation to investigate what happens as the transition probabilities change.

Zoubin Ghahramani has written a really nice review paper of Bayesian non-parametrics that I really recommend checking out if you’re interested in the new modelling techniques that have been coming out in the last few years for complex data sets.

Exercism.io is a new service for learning how to master programming by getting feedback on exercises.

Running Bayesian models

I came across a post via r/Bayes about different ways to run Bayesian hierarchical linear models in R, a topic I talked about recently at a two day workshop on using R for epidemiology. Rasmus Bååth‘s post details the use of JAGS with rjags, STAN with rstan and LaplacesDemon.

JAGS (well, rjags) has been the staple for most of my hierarchical linear modelling needs over the last few years. It runs within R easily, is written in C++ (so is relatively fast), spits out something that the coda package can work with quite easily, and, above all, makes it very easy to specify models and priors. Using JAGS means never having to derive a Gibbs sampler or write out a Metropolis-Hastings algorithm that requires to you to really think about jumping rules. It’s Bayesian statistics for those who don’t have the time/inclination to do it “properly”. It has a few drawbacks, though, such as not being able to specify improper priors (but this could be seen as a feature rather than a bug) with distributions like dflat() and defining a Conditional Autoregressive prior requires specifying it as a multivariate Gaussian. That said, it’s far quicker than using OpenBUGS and JAGS installs fine on any platform. Continue reading

Posterior Samples

Interested in collaborative use of R, MATLAB, etc. for analysis and visualisation within a webpage? Combining plotly and iPython can help you with that.

Cosmopolitan (yes, that Cosmopolitan) has a great article interviewing Emily Graslie, Chief Curiosity Officer at the Field Museum in Chicago. She discusses being an artist and making the transition into science, science education and YouTube stardom.

A few of the PhD students in my lab have asked if I could run an introduction to R session. I’d mentioned the CAR workshop that I’d be doing but the cost had put them off. Luckily, there are alternatives like Datacamp, Coursera and Lynda. Coursera’s next round of “Data Science”, delivered by Johns Hopkins University, starts next Monday (Course 1 – R Programming). So get in there and learn some R! I’m considering recommending some of these Coursera courses to my current SEB113 students who want to go a bit further with R, but the approach that they take in these online modules is quite different to what we do in SEB113 and I don’t want them to confuse themselves.

 

Learning different programming languages

One of the biggest changes I noticed moving from doing statistics in Minitab (in first year data analysis) to doing statistics in R (in third year statistical inference) was that R encourages you to write functions. Normally this is done by writing functions in R’s own language (that call other functions, also written in R, which eventually call functions written in C) but it’s also possible to make use of other languages to do the heavy lifting. This isn’t unique to R, of course; MATLAB encourages the use of MEX files to improve run-times when you need to call the same custom function over and over again.

I’ve really only used high level languages to do my statistics, making use of other peoples’ optimised code that do the things that I want. I’ve seen the development of pyMCMC by reseachers at QUT and someone from my NPBayes reading group made quite heavy use of RCpp in his thesis. Python and C++ are probably the two languages that would be the most useful to learn given their ubiquity and reputation. I have been putting off learning these for years as I know that there’s a large time investment required to become proficient in programming and no external pressure to learn (unlike learning R as part of my PhD thesis work).

There’s no doubt that writing optimised code is a desirable thing to do, and that knowing more than one programming language (and how to use them together) gives you a much richer toolbox for numerically solving problems. I’m now at a point, though, where it looks like I may need to bite the bullet and pick up C++. JAGS, which I use through rjags in R, is a stable, fast platform for MCMC-based inference. It’s written in C++ and notifies you every time you load it in R that it has loaded the basemod and bugs modules. There are additional modules available (check in \JAGS\JAGS-3.4.0\x64\modules\) and it’s possible to write your own, as long as you know C++.

I’m at a point with the work I’ve been doing on estimating personal dose of ultrafine particles that I’d like to make the modelling more Bayesian, which includes figuring out a way to include the deposition model in the MCMC scheme (as I’d like to put a prior on the shape parameter of the reconstructed size distribution). My options seem to be either writing a JAGS module that will allow me to call a C++ified version of the function or to abandon JAGS and write a Gibbs sampler (or Metropolis-Hastings, but Gibbs will likely be quicker given the simplicity of the model I’m interested in). Either solution will stretch me as a programmer and probably give me a better understanding of the problem. Eubank and Kupresanin’s “Statistical Computing in C++ and R” is staring at me from the shelf above my desk.