A visit from the Ghost of Research Past

A request for an old data set recently gave me the opportunity, much like Ebenezer Scrooge, to revisit my Past-Self from when I was a brand-new post-graduate student, and to let Past-Self and Future-Self help me critique how my lab curates its data and materials in the present day. Both Past-Self and Future-Self are compelling agitators for a proactive approach to opening data, especially implementing a Data Partner scheme.

Openness about our work is consistent with believing that the work is important and excellent. Being asked for access to your work is an acknowledgment that it is valuable, and sharing it is an expression of your confidence in its value. I’ve found openness to be rewarding, leading to additional citations, gracious acknowledgments, and sometimes new collaboration opportunities.

However, requests for data or materials tend to arrive out of the blue, and they fluster us. It always seems necessary to perform fresh checks. Is the code understandable and functional? Do the data need to be explained and possibly tidied: what do the column headings mean again? Could there be identifying details in any of the responses? I might spend hours performing these checks before complying with a request.

Waiting until the request arrives to open up data and materials can be seen as a tacit judgment on the expected impact of the data. Why, if I believe the work I do is worthwhile, am I not preparing it for public consumption before I publish it? When did I start imagining that no one was likely to be interested in re-analyzing my data or using my experimental code?  

Recently I was asked for data from the first paper I ever published, part of my master’s research project. The data were collected in autumn 2002 and published in 2004. Possibly, sharing data that have been untouched for more than 10 years is asking too much. It wouldn’t have been strange if I had lost them in institutional moves and computer crashes, or if they had proved impossible to document adequately. But if I could find them, going through these data would give me an opportunity to pay a visit to my Past-Self, recall what it was like to begin a research project for the first time, and maybe learn something from her.

One thing that struck me as I examined Past-Self’s data was that she had organized them expecting that other people would be looking at them. Past-Self inserted comments explaining what numeric codes meant. Past-Self wrote summaries of the purpose of experiments, and Past-Self organized files into hierarchical directories with sub-folders for data files, analyses, and experimental stimuli. I think it would have surprised Past-Self that no one would ask to look at this information until 2015. Past-Self thought this work was important and documented it accordingly.

Though Past-Self began as a data-sharing idealist, she had minimal skills for curating data and materials. Some elements of her organization improved drastically in the later experiments in her project. Past-Self learned it is better to make category codes self-explanatory (e.g., why replace “male” or “female” with arbitrary numeric codes instead of just entering the words?). Past-Self developed sensible conventions for naming files. Past-Self reduced redundancies in data recording.

But though some practices improved, it also became clear that Past-Self abandoned the expectation that anyone apart from her and her supervisor would ever see these raw data and materials. As the project wore on, the helpful comments disappeared, and the summaries for subsequent experiments were unchanged from the earliest ones. The whole directory was organized around an 8-experiment master’s project, which eventually resulted in the publication of three experiments in two separate papers. Past-Self never re-organized these materials to make it immediately obvious which ones pertained to each paper.

Altogether I interrogated Past-Self for about 5 hours: we located the data sets requested, established through re-analysis that they did in fact include the same data that were published, saved them in an accessible non-proprietary format, documented what the data sets contained and how the variables were coded, and published the data and guidance on the Open Science Framework. On the one hand, that isn’t terrible. My Future-Self, who checked in throughout this process, insists that 5 hours of work accomplished now is a sound investment. It enables a colleague on the other side of the planet to do a meaningful new analysis, from which we might all learn something novel. Furthermore, those data are now available to anyone else who might have other ideas for how our data can be useful. Future-Self insists that this will lead to glory.

On the other hand, these 5 hours of work entirely duplicated work that Past-Self did more than 10 years ago in her haphazard manner. If Past-Self had carried on carefully documenting her data, if she had considered that materials should be available in commonly accessible formats, and if she had updated her personal repository to reflect the published record, then these materials would have been ready for sharing upon request in minutes, not hours. Future-Self is anxious to know how I am going to prevent this waste of time. Past-Self wonders whether I can do more to help my trainees learn good habits.

What, if any, are the constraints on proactively curating lab work? Proactive curation is obviously desirable for Future-Self: it saves her time and effort, and it increases the impact and utility of the work. It is arguably good for trainees and PIs alike. Because I work with many short-term trainees, I have handled most data curation myself, but this is a valuable skill that Past-Self needed to learn better, and that Future-Self wants delegated. The Data Partner scheme is ideal for this: my trainees can be paired with trainees from a colleague’s lab, and each pair of students will help each other curate data by checking whether their partner’s work is clear, self-explanatory, and reproducible. They do this independently of me. By the time the data are shown to me, they have already been vetted by one other person, providing an additional chance to catch mistakes. My trainees get the practice that Past-Self lacked, and Future-Self will never wonder whether data and materials are ready to be shared.

Are you at Psychonomics 2015? Come to our talk, Open Science: Practical Guidance for Psychological Scientists, Friday at 10:40 am, in the Statistics and Methodology II session.

Update: Check out Lorne Campbell’s thoughts on this too.