Enhancing transparency when working with existing data: Examining reading comprehension difficulties in a large-scale birth cohort
Dr Emma James, Lecturer, Department of Psychology
ORCID: 0000-0002-5214-0035 / @emljames
Email: emma.james@york.ac.uk
Summary
Within open research practices in psychology, secondary data is less widely accepted than primary data. One challenge with preregistration is remaining naïve to the data when access fees mean that all data must be requested ahead of time; the researchers worked around this by viewing only the variables needed for each analysis once it had been preregistered. A second challenge is that secondary data can be sub-optimal for the statistical analyses planned at preregistration. The researchers reflect on how preregistration allows them to report what has not gone to plan, rather than restricting which analyses can be completed. A final challenge is that the secondary data is not open and the analysis requires software that is not freely available. The researchers addressed this by using a specialist R package for the analysis, which allowed them to share an annotated version of the modelling process and output on the OSF for future researchers using the same secondary dataset.
Case Study
This work examines children’s reading comprehension difficulties using data from the Avon Longitudinal Study of Parents and Children (ALSPAC), also known to participants as “Children of the 90s”, a birth cohort study charting the development of ~14,000 individuals born in the 1990s. Studies using existing datasets lag behind other areas of psychology in open research practices: the secondary data preregistration template was only added to the Open Science Framework in 2021, and only 57% of journals accepting Registered Reports will do so for secondary data (at time of writing in May 2023). This case study reflects on the barriers to transparency in this context, and how the researchers are addressing some of these challenges.
Transparency (or not) over prior access
One challenge in preregistering secondary analyses is remaining naïve to the data. ALSPAC charge a data access fee, meaning that the researchers had to request all variables at the start of their grant before they could plan each sub-project in detail. The researchers therefore had to act as their own gatekeepers, extracting only subsets of variables from the dataset once an analysis had been preregistered. The researchers were pleasantly surprised to learn this approach was sufficient for Registered Reports at two journals, despite their guidelines requesting “evidence” that the data had not been accessed.
Even so, there remains a conflict between transparent reporting and the journal publication process. It is important to report potential biases from the authors’ prior experience with the dataset, yet many developmental psychology journals adopt an anonymised review process that prevents authors from referencing their prior work. The researchers tried stating in cover letters that they were happy to forgo their anonymity, but to their knowledge the information was not shared.
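The self-gatekeeping step described in this section could be sketched roughly as follows. This is a minimal illustration, not the researchers' actual workflow: the variable names, the toy data, and the helper `extract_preregistered` are all hypothetical, and real ALSPAC variable codes and file formats will differ.

```python
import csv
import io

# Hypothetical variable names listed in a preregistration (assumption:
# real ALSPAC variable codes look different).
PREREGISTERED_VARIABLES = ["child_id", "reading_comprehension", "word_reading"]


def extract_preregistered(source, allowed):
    """Read a CSV file object and keep only the columns named in `allowed`.

    Raises KeyError if a preregistered variable is absent, so that a typo
    in the variable list is caught rather than silently ignored.
    """
    reader = csv.DictReader(source)
    available = reader.fieldnames or []
    missing = [name for name in allowed if name not in available]
    if missing:
        raise KeyError(f"Variables not found in dataset: {missing}")
    return [{name: row[name] for name in allowed} for row in reader]


# Toy stand-in for the full data release (invented values for illustration).
FULL_RELEASE = io.StringIO(
    "child_id,reading_comprehension,word_reading,exam_score\n"
    "1,34,41,7\n"
    "2,28,45,5\n"
)

# Only the preregistered columns reach the analyst; exam_score stays unseen.
subset = extract_preregistered(FULL_RELEASE, PREREGISTERED_VARIABLES)
print(subset)
```

Keeping the allowed-variable list in a dated, version-controlled file alongside the preregistration is one way such a script could also serve as a record of what was accessed and when.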
A plan not a prison! Dealing with difficult data
Another challenge they encountered was that the data did not always conform to theoretical and statistical expectations for the planned analyses. Birth cohort studies prioritise having a broad range of measures, which are then often sub-optimal in capturing the variability researchers would aim for if designing the study themselves. Thus, preregistered statistical models are sometimes a poor fit and require several amendments. Amendments are entirely normal in this kind of analysis, but the distinction between theoretically and statistically motivated decisions remains poorly reported in published papers. Although the researchers previously thought of preregistration as a commitment to analyses made before collecting the data, they came to appreciate its value in communicating what had not gone to plan.
Reporting open analyses in a closed context
Reporting these complex issues is even more important given that the data itself cannot be made openly available, and the researchers’ analyses used statistical software which was not open source. To facilitate transparency, they used a specialist R package, "MplusAutomation", as an interface between open-source R and the advanced tools of Mplus software. Using this package allowed them to share an annotated version of the modelling process and output on the OSF, so that others could reproduce the steps in future should they access the data from ALSPAC.
Links
Sub-project 1: Heterogeneity in reading comprehension difficulties: A latent class approach: https://osf.io/zvjw4/
Sub-project 2: Educational outcomes for children with comprehension difficulties - Registered Report (Stage 1 in principle acceptance): https://osf.io/yhu9b
Licensing information
Except where otherwise noted, copyright in this work belongs to the author(s), licensed under a Creative Commons Attribution-NonCommercial 4.0 International Licence.
Case study poster
Please consider sharing this in your department or school! A3 printed copies are available upon request from the Open Research team (lib-open-research@york.ac.uk)