Podcast on open data and reproducing analysis

Open access to published research is becoming more and more common, but what about the data behind the research? Why is it so important to be able to access researchers’ data and duplicate results, and what can be done to make this easier and more common practice?

I've transcribed a podcast from the National Centre for Research Methods called "Reproducing social science research: give up your code", in which Professor Vernon Gayle of the University of Edinburgh is interviewed on this topic. It's a bit of a departure for us as it's more relevant for quantitative rather than qualitative research, but I thought it was interesting, so I decided to transcribe it and share!


Prof Vernon Gayle:

The digital revolution has meant that there's an unparalleled amount of data available to social science researchers. Computers have become quicker and more powerful, there have been advances in software capabilities, and storage is relatively cheap. So really, there's unprecedented access to data – especially from the huge advances that have been made with the National Archives.

At the same time, there’s a sort of quiet revolution, if you like, in open access, which has meant that it’s easier and easier to gain access to published research. Certainly in the UK, every university that intends to make a submission to the forthcoming REF has set up some form of repository providing open access to publications.

So the central argument of my talk earlier today was that although there’s more access to data and to publications, there’s a sort of black box in between. And this is because there’s a lack of access to the research code that’s produced the research output. I’m arguing for researchers to routinely share their research code.

Interviewer:

Can you just explain to us what you mean when you talk about research code?

Prof Vernon Gayle:

I mean providing sufficient information to enable other researchers to understand, evaluate and build upon work. There should be enough information for a third party to reproduce – that is, duplicate – the results without needing to get additional information from the authors. In practice, this will involve researchers sharing their command files – whether these are Stata .do files, R scripts, SPSS syntax files, or, in my case, Jupyter notebooks.
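[Transcriber's note: to make the idea of a "command file" concrete, here is a hypothetical, minimal sketch in Python (one of the formats Prof Gayle mentions, via Jupyter). The data values and variable names are invented for illustration – the point is that every step from raw data to reported figure is scripted, so a third party can rerun it end to end.]

```python
# A hypothetical, minimal "command file": every step from raw data to the
# reported result is scripted, so a third party can rerun it without
# asking the authors for anything.
import statistics

# Raw data as it might arrive (illustrative values only).
incomes = [21000, 24500, 19800, 30200, 27600]

# Data preparation: convert to thousands of pounds.
incomes_k = [x / 1000 for x in incomes]

# Analysis: a simple descriptive statistic.
mean_income_k = statistics.mean(incomes_k)

# Reporting: the figure that would appear in the publication.
print(f"Mean income: {mean_income_k:.2f}k")
```

A real command file would of course be longer, but the shape is the same whether it is a .do file, an R script or a notebook: preparation, analysis and reporting, all recorded in one runnable script.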

Interviewer:

You talk about the final published papers being a palimpsest of the research that’s undertaken. Tell us what a palimpsest is, first of all, and then what it is you mean by that?

Prof Vernon Gayle:

A palimpsest is a parchment or other writing surface on which the original text has been effaced or partially erased and then overwritten by another. There are some really interesting religious and legal palimpsests on the web that are worth taking a look at. So I sort of invert this idea – lots of work goes into the research, and there’s usually a lot of code required to get the data ready for analysis, to undertake exploratory analyses, to estimate models and to present results.

Some of this work, or the traces of it, are visible in the final publication. But the final publication really is sort of superimposed upon it. So for me, the final published work is a sort of palimpsest of all the work that’s gone on before, if you like.

Interviewer:

Let’s dig a little deeper then into this whole concept of reproducing or replicating results. What’s the thinking behind this and why do you argue that it’s so important?

Prof Vernon Gayle:

Well, I’ll take a step back and I’ll first explain what I think are two important concepts associated with reproducible research. Following Nicole Janz’s terminology, I would say that work can be duplicated if sufficient information is made available, which ensures consistent results can be produced using the same data and the same analytical techniques.

So duplication is stage 1, whereas stage 2 is replication. And a replication study can duplicate the original work, but can also further test the robustness of the original work by employing new data or additional data or additional measures and alternative techniques of data analysis.

So duplication is stage 1, and replication, which builds on duplication, is stage 2.

I'd say that reproducible research is important for a number of reasons. It improves transparency and allows the accuracy of the work to be checked. But it also allows researchers to better understand, evaluate and build upon the research. The mission of my university, the University of Edinburgh, is the creation, dissemination and curation of knowledge, and the idea of reproducible research chimes squarely with this mission.

Interviewer:

So what’s the problem, then, as you see it? What’s getting in the way of achieving reproducible social science?

Prof Vernon Gayle:

Well, the Yale Law School roundtable on data and code sharing back in 2010 reported that today it's impossible to verify most of the computational results that scientists present at conferences and in papers. So this is really quite serious. Most of us have a fond – or maybe even terrifying – memory of being in primary school and being told, “Show your working out!” Teachers were always standing behind me and saying, “Where’s your working out?”

But somewhere between primary school and post-graduate study, this requirement has evaporated in some way. My acquaintance at UC Berkeley, Philip Stark, says we must move from a “trust me” to a “show me” culture. And if we think of something like the Royal Society’s motto, I think it’s nullius in verba – take nobody’s word for it. Why should we take anyone’s word for it? Let’s see their working out.

Interviewer:  

Do you have any thoughts on what can be done to resolve the issue? And can you give us an example, say, of something you’ve done yourself?

Prof Vernon Gayle:

So, for example, I would say historically it wasn't possible to share research code within the confines of a paper journal – journals set a limit of 6,000 or 8,000 words, so it simply wasn't possible. But we're members of the internet generation, and there are now a number of ways in which we could share research code.

At the current time, if we do nothing else, we could make research code available via our university repositories. There are already important changes taking place – providing the research code that produced a publication is now required by a number of high-profile journals, and a large number of journals have signed up to the Transparency and Openness Promotion Guidelines. The climate is changing.

Many researchers have very clear and organised workflows which render their work systematic and allow them to duplicate their analysis and later replicate work. So a small step from having private reproducibility could lead to a giant step for public reproducibility, if we routinely make our workflows public.

So I would say first of all, depositing annotated scripts that a third party can use to completely duplicate all of the results included in the published work is necessary. And checking, genuinely checking that a third party can duplicate the work is important.

Interviewer:  

And you practise what you preach?

Prof Vernon Gayle:

Yes, I have practised what I'm preaching here today. For the talk I gave earlier today, the files needed to undertake the analysis – for example, my Stata .do file and the relevant data files in several formats – are available on my website. But I've also checked that a third party – in this case, two early career researchers who were unconnected with the work – could duplicate the analysis and understand my workflow fully.

So the information provided, I would say, should clearly state things like the data that we used, the source of the data, and its release, so that someone is absolutely sure they’re using the same data. It should clearly state the software that’s been used, including versions and libraries and dependencies and certainly the R users will know about the importance of that.

Even something as detailed as what seed was set. Imagine something like the bootstrap – we need to know exactly the seed that was set at the start for someone to be able to duplicate the work. We should also include all of the scripts needed for preparing the data – what the Californians call 'data wrangling'. All the effort that goes into getting the data ready, all that information should be provided.

Finally, things like well annotated codebooks detailing variables should also be made public. So those are the things I think are fundamental for moving this forward in the immediate sense.
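[Transcriber's note: the checklist above – record the software environment, set and report the seed, script the analysis – can be sketched in a short Python preamble. This is my illustration, not code from the talk; the seed value and the toy bootstrap data are invented.]

```python
# A hypothetical reproducibility preamble: record the software environment
# and the seed alongside the analysis, so results can be duplicated exactly.
import platform
import random
import sys

SEED = 20100401  # the exact seed must be reported for bootstrap results
random.seed(SEED)

# Record the environment that produced the results.
environment = {
    "python": platform.python_version(),
    "platform": sys.platform,
    "seed": SEED,
}

# A toy bootstrap: resample a small dataset 1000 times and average the means.
data = [2.0, 4.0, 6.0, 8.0]
boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]
    boot_means.append(sum(resample) / len(resample))

boot_estimate = sum(boot_means) / len(boot_means)
print(environment, round(boot_estimate, 2))
```

Because the seed is fixed and reported, anyone rerunning the script with the same software version gets the identical resamples and the identical estimate – which is exactly the "show me" standard being argued for.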

Interviewer:  

So you’re clearly walking the walk as well as talking the talk. What are the things that are getting in the way of other researchers doing this?

Prof Vernon Gayle:

I'm not exactly sure. I think many people could move to this quite rapidly, but you have to be committed to moving from a sort of “trust me” culture to an “I have shown you” culture. There's now much more open data – data is much more openly available from the National Archives as well – and publications are increasingly open. What I'm saying is: let's open up the black box in the middle, let's share our research code, and let's be much more transparent about the social science that we're undertaking.

Interviewer:

If you could just summarise the benefits, both to researchers and to wider users of research, of adopting this approach, what would you say?

Prof Vernon Gayle:

I would say that the obvious benefit to both academics and the wider research and policy community is greater transparency. So the ability to check the accuracy of results and allowing others to better understand, evaluate and ultimately to build incrementally on the research that we undertake. 
