The National Science Board (NSB) held an Expert Panel Discussion on Data Policies at the National Science Foundation yesterday and today, exploring the opportunities and challenges of a future rooted in data-intensive science and engineering. Organized by the NSB’s Task Force on Data Policies, the meeting included leading figures in the scientific enterprise from the U.S., the U.K., and Germany. A key goal was to identify guiding principles for establishing policies on data and artifacts (such as codes).
The experts assembled by the NSB described the wealth of opportunities, including entirely new types of science, that data-intensive science and engineering (S&E) stands to enable by opening up vast new sources of data and artifacts and by combining, linking, and merging them. However, they also relayed concerns, notably about data sharing and quality control, particularly in the context of grant proposals and journal publications. Participants agreed that data-intensive S&E promises to change how NSF does business, beginning with the way in which research proposals are evaluated. Indeed, we are already witnessing changes at the NSF: in late 2010, the Foundation instituted a policy requiring every proposal to include a Data Management Plan detailing how the proposal will conform to the NSF’s policy on the dissemination and sharing of research results (including data, software codes, etc.).
Fran Berman (RPI) helped set the stage for the meeting with a talk yesterday morning outlining the data-intensive movement. She noted that, while data-intensive science is about “big data,” it also creates multiple exciting environments for data-intensive applications throughout the data-compute-distribution spectrum. However, taking a page from the recent PCAST review of the Federal NITRD program (“Every agency will need a ‘Big Data’ strategy”), Berman argued we shouldn’t necessarily equate “big data” with more FLOPS. “More scientists will depend on exabyte data than on exaFLOP machines,” she said.
Some themes that emerged during the first half of the meeting:
- The government has become a significant player in generating data for science, and NSF’s PI-based data must be interoperable with the government’s data. Consequently, the NSF now carries added responsibility for data collection, management, and infrastructure.
- The value added by data-intensive S&E is not necessarily clear in the academy. For example, generally speaking, no discipline truly acknowledges data curation. Establishing a “peer group” through funding opportunities/mechanisms is not likely to be sufficient to trigger a shift in this view. On the other hand, in some places like the U.K., we are seeing an emergence of “hybrid professionals” spanning four key roles: computer scientists, other domain scientists and engineers, information technologists, and library scientists.
- There is a host of data federation problems in certain communities, such as neuroscience, where one sees a wide range of scales, data sets, and data types.
- A key challenge is trust, both in terms of trusting others with one’s data sets (are there any intellectual property concerns?) and trusting the data that one comes across in a journal article (how accurate or believable are the data and results?).
- Journals are very concerned about data policies, and specifically how they should approach the data/artifacts behind the science that they publish. In addition, with respect to peer review of journal articles (or even grant proposals), it is important to remember the subtle but meaningful distinction between reproducibility and replicability.
And a sampling of suggestions that participants voiced to the NSB and NSF leadership in attendance, also during the first half of the meeting:
- Consider pursuing three integrative directions concurrently, namely data-enabled applications (e.g., “what are the ‘Grand Challenges’ here?”); data-focused research (e.g., data mining, machine learning, predictive modeling, visualization techniques, etc.); and sustainable and reliable data infrastructures.
- Consider pilot projects, e.g., to see what it takes for projects to be reproducible in different domains (as there are likely to be varying characteristics across disparate scientific fields).
(Contributed by Erwin Gianchandani, CCC Director)