Open data for science education

Published on PLOS Sci-Ed

Open data is the idea that scientific data should be freely available to all, without restrictions, in searchable online repositories. The open data movement is gaining momentum in the scientific community because of its promise to enable more frequent replication of studies and to accelerate the pace of research. But the advantages for science education are just as compelling.

Science students can benefit greatly from educational materials that expose them to real-world phenomena and data. Unlike learning from broad generalizations and pre-fabricated “cookbook” labs, examining and working with real data can increase interest and better prepare students for careers in science. As states begin to adopt the Next Generation Science Standards, which emphasize practices such as analyzing and interpreting data, and mathematical and computational thinking, developers of K-12 science curriculum materials are increasingly looking for ways to incorporate scientific data into their lessons and assessments.

However, barriers exist that prevent educators from effectively using much of the data that scientists produce. As a reader of PLOS blogs, you are likely familiar with the open access movement in scholarly publishing. But even access to journal articles, though valuable, is often not sufficient for educators’ purposes. Data in journal articles are usually in the form of a few graphs. These graphs are typically frozen in PDFs as part of a paper that conveys the authors’ interpretation of the results in the context of their particular study. And the data presentation choices were made with one audience in mind: experts in the field.

Open Data. Image by Colleen Simon for opensource.com on Flickr (CC BY-SA).

Using data as it is presented in papers is almost never pedagogically sound at the middle or high school level; much must be changed about the presentation. Jargon and acronyms might have to be removed from axis titles, individual data sets might need to be separated if they are layered into a single figure, or perhaps a section of the graph that describes phenomena outside the scope of the lesson would have to be removed. Making these kinds of educationally necessary modifications while maintaining scientific accuracy often requires access to the full original datasets.

Unfortunately, most scientific data is not archived and readily available online. Educators have to contact the study authors and see if they are willing and able to pass it along. Just as with journal articles, this “write to the author” stop-gap is wildly inefficient. Study authors often can’t or won’t respond to requests for original data for a variety of reasons. Sometimes they are simply out of town and not checking email. Sometimes they want to publish more papers and are afraid of getting scooped. And sometimes, especially with older studies, they actually can’t find their data.

In a 2002 survey of geneticists, of those who admitted to denying at least one request from a colleague for published data, the most commonly given reason was the “effort required to actually produce the information” (80 percent of respondents). As Todd Vision, a biologist at UNC and contributor to the Data Dryad open data repository, explained in BioScience:

Unarchived data files are often misplaced, corrupted, or the software in which they were produced becomes obsolete. Memories fade.

Science education materials developers need full access to the data in order to determine its pedagogical strengths and weaknesses. This process often involves investigating many different data sets until settling on the ones that will best address the learning goals for their particular project. Following up on hundreds of individual papers, with a dismal rate of return, isn’t feasible for a small education nonprofit or a lone teacher trying to innovate at a struggling school. This leaves vast amounts of potentially educationally useful data untapped.

I talked to Sandra Porter, whom I met at the last Science Online conference, about her experience obtaining data for curriculum materials development. Sandra is the president of Digital World Biology, and one of her collaborative projects, Bio-ITEST, involved the development of bioinformatics curriculum materials for secondary students. In genetics and bioinformatics, which are inherently data-focused, data archiving requirements are more common, and Sandra and her colleagues were able to take advantage of open data resources such as the National Center for Biotechnology Information (NCBI) and the Barcode of Life Data (BOLD) Systems. Yet even in these fields, raw data (the kind that practicing scientists would encounter in their careers) can be tricky to obtain. Sandra commented:

The raw data was useful for us because we needed to know what raw data looks like so we could work out analysis problems in advance. These types of data files are not likely to be available from many places since these raw data are usually processed and analyzed through many pipeline steps before they get submitted to a database.

There are many worthy reasons to support the open science movement, but the argument for science education holds its own among them. It has never been easier to bring real scientific data into classrooms, and the benefits to young scientists-in-training are clear. It would be a shame for all of that educational potential to languish on old hard drives.