Activists Rush to Save Government Science Data — If They Can Find It

“Data rescuers” at New York University last month. Credit: Sam Hodgson for The New York Times

As the presidential inauguration drew near in January, something bordering on panic was taking hold among some scientists who rely on the vast oceans of data housed on government servers, which encompass information on everything from social demographics to satellite photographs of polar ice.

In a Trump administration that has made clear its disdain for the copious evidence that human activity is warming the planet, researchers feared a broad crusade against the scientific information provided to the public. Reports last week that the administration is proposing deep budget cuts for government agencies including the National Oceanic and Atmospheric Administration and the Environmental Protection Agency have fueled new fears of databases being axed, if only as a cost-saving measure.

“We’ll probably be saying goodbye to much of the invaluable data housed at the NCEI,” Anne Jefferson, a hydrology professor at Kent State University, wrote on Twitter on Saturday, referring to the National Centers for Environmental Information. “Hope it gets rescued in time.”

It is illegal to destroy government data, but agencies can make it more difficult to find by revising websites and creating other barriers to the underlying information.

Already there have been a handful of changes to the websites of federal science agencies, according to the Environmental Data and Governance Initiative, a new organization with researchers monitoring the content. On the E.P.A.’s website, for instance, the science and technology office had described as its mission the development of “scientific and technological foundations to achieve clean water.” Now the office says the goal is to develop “economically and technologically achievable performance standards.”

Pie charts on a Department of Energy website illustrating the link between coal and greenhouse gas emissions also have disappeared. So has the description on an Interior Department page of the potential environmental effects of hydraulic fracturing on federal land.

Changes like these appear only to reflect the publicly stated priorities of the new administration, and there have been few signs as yet that federal databases are being systematically manipulated or restricted.

Image: Thousands of academics, librarians, coders and science-minded citizens have gathered at “data rescue” events like this one at New York University in February. Credit: Sam Hodgson for The New York Times

But concern about the vulnerability of scientific information has also focused attention on a nonpartisan problem of digital-age government: Much of the scientific information so painstakingly collected over the decades, at a cost of hundreds of billions of dollars, remains held only by the government, scattered on thousands of servers in hundreds of departments where it may not be backed up and could be impossible to find.

As thousands of academics, librarians, coders and science-minded citizens have gathered at what are called “data rescue” events in recent weeks — there were at least six this past weekend alone — the enormousness of the task of extracting even the easily found government data has become apparent, as has the difficulty of tracking down the rest.

Some open-data activists refer to it as “dark data” — and they are not talking about classified information or data the government might release only if compelled by a Freedom of Information Act request.

“It’s like dark matter; we know it must be there but we don’t know where to find it to verify,” said Maxwell Ogden, the director of Code for Science and Society, a nonprofit that began a government-data archiving project in collaboration with the research libraries in the University of California system.

“If they’re going to delete something, how will we even know it’s deleted if we didn’t know it was there?” he asked.

The obstacles have spurred debate among open-data activists over how to build an archiving system for the government’s science data that ensures that the public does not lose access to it, regardless of who is in power.

“No one would advocate for a system where the government stores all scientific data and we just trust them to give it to us,” said Laurie Allen, a digital librarian at the University of Pennsylvania who helped found Data Refuge. “We didn’t used to have that system, yet that is the system we have landed with.”

At the moment, the closest thing to a central repository is Data.gov, which, under a 2013 Obama administration directive, is supposed to link to all of the public databases within the government. But it relies on agencies to self-report, and the total size of all the data linked to by the directory, Mr. Ogden recently found, comes to just 40 terabytes — about as much as would fit on $1,000 worth of hard drives.
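That self-reporting gap can be probed from the outside: Data.gov’s catalog runs on CKAN, whose search API returns JSON records in which agencies may declare the size of each file. Tallying those declared sizes is the kind of rough accounting Mr. Ogden describes. A minimal sketch, using hypothetical sample records in place of a live API response (the `size` field is self-reported and frequently missing):

```python
# Rough tally of declared dataset sizes from CKAN-style catalog records.
# The sample below mimics the shape of Data.gov catalog entries; real
# records come from the catalog's package_search endpoint.

def total_declared_bytes(records):
    """Sum the resource sizes agencies chose to report; missing sizes count as 0."""
    total = 0
    for record in records:
        for resource in record.get("resources", []):
            size = resource.get("size")
            if isinstance(size, (int, float)):
                total += int(size)
    return total

# Hypothetical sample: one dataset with a declared size, one without.
sample = [
    {"title": "Buoy temperature readings",
     "resources": [{"format": "CSV", "size": 2_500_000_000}]},
    {"title": "Discharge monitoring reports",
     "resources": [{"format": "HTML", "size": None}]},  # size unreported
]

print(total_declared_bytes(sample))  # 2500000000
```

Because unreported sizes contribute nothing to the total, any figure computed this way — like the 40-terabyte sum — is only a lower bound on what the directory actually points to.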

Image: The transition to digital distribution that made government documents more accessible, librarians say, has also left them more at risk. Credit: Sam Hodgson for The New York Times

NASA alone provides access to more than 17.5 petabytes of archived data across dozens of different data portal systems, according to its website. (A petabyte is 1,000 terabytes.)

And one-third of the links on Data.gov, Mr. Ogden found, take users to a website rather than the actual data, which makes it hard to devise software that can automatically copy it.
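One way to triage such links is by what the server says it is returning: a URL whose response is `text/html` is almost certainly a landing page for humans, not the data itself. A minimal sketch of that heuristic (the list of data content types is illustrative; a real check would issue an HTTP HEAD request for each link):

```python
# Heuristic: decide whether a catalog link points at downloadable data
# or at an HTML landing page that a human must navigate.

DATA_TYPES = {
    "text/csv", "application/json", "application/zip",
    "application/x-netcdf", "application/vnd.ms-excel",
}

def looks_like_data(content_type):
    """True if the Content-Type header suggests a direct data download."""
    # Strip parameters like "; charset=utf-8" before comparing.
    base = content_type.split(";")[0].strip().lower()
    return base in DATA_TYPES

print(looks_like_data("text/csv; charset=utf-8"))  # True
print(looks_like_data("text/html"))                # False
```

Links that fail this test are the ones that defeat automated copying: the data sits somewhere behind the page, reachable only through forms or scripts a generic crawler cannot operate.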

Even databases that are listed on Data.gov — and there are more than two million, according to Mr. Ogden’s published logs — often sit behind an interface designed for ease of use but built with proprietary code almost impossible to reproduce.

The need to write custom code to extract data from, say, the E.P.A.’s discharge monitoring reports is one reason that, despite having hosted more than two dozen “data rescue” events since January, the activist group Data Refuge lists only 158 data sets in its public directory.

Andrew Bergman, a graduate student in applied physics at Harvard, along with two physics department colleagues, suspended his studies to help found the Environmental Data and Governance Initiative, which has also helped to organize the events.

“We have things that are considered really important from NASA, E.P.A., NOAA,” Mr. Bergman said. “But in terms of finalized, completed data sets that are actually useful, it’s a very small number compared to the total.”

The transition to digital distribution that made government documents more accessible, librarians say, has also left them more at risk. Without physical copies in libraries, the internet’s promise of making government information more widely available has made it far more centralized.

Except when certain data is the subject of a lawsuit or multiple F.O.I.A. requests, it remains unclear what compels an agency to keep it online.

Image: Volunteers at New York University culling data from government websites in February. Because of the difficulty of extracting data, the activist group Data Refuge has listed only 158 data sets in its public directory since it began hosting events in January. Credit: Sam Hodgson for The New York Times

“Destroying federal records is a crime,” said Patrice McDermott, who heads a public advocacy organization called Open the Government. “Taking them off of the internet does not have the same penalty.”

In a recent letter to the federal Office of Management and Budget, Ms. McDermott’s group cited a clause in the 1995 Paperwork Reduction Act that requires agencies to “provide adequate notice when initiating, substantially modifying, or terminating significant information dissemination products.”

But what that means for the age of big data has not been defined.

To make secure copies of government research that researchers can trust is no easy task, librarians say. But many of those who have been trying for years to find funding and a system to do it reliably hope to harness the current wave of interest.

“At the moment, more people than ever are aware of the risk of relying solely on the government to preserve its own information,” two government document librarians, James A. Jacobs, of the University of California, San Diego, and James R. Jacobs of Stanford University, wrote in an essay circulated online last week. “This was not true even six months ago.”

At the archiving events, participants are typically divided into groups. One uses a web browser extension to flag government web addresses for the Internet Archive, an existing service that operates an automated “web crawler” that can make copies of federal websites but typically not the databases that store information in more exotic formats.
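The flagging step the first group performs boils down to constructing “Save Page Now” requests for the Internet Archive, which accepts a page address appended to its save endpoint. A minimal sketch, assuming an illustrative list of flagged URLs (sending each request, for example with `urllib`, asks the Archive’s crawler to snapshot that page; databases behind query forms generally cannot be captured this way):

```python
# Build Internet Archive "Save Page Now" requests for a batch of
# government URLs flagged at a data-rescue event.

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_requests(urls):
    """Return the Save Page Now request URL for each flagged address."""
    return [SAVE_ENDPOINT + u for u in urls]

flagged = ["https://www.epa.gov/water-research"]  # illustrative URL
for request_url in save_requests(flagged):
    print(request_url)
```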

Another group is tasked with scrutinizing data sets that researchers have identified as particularly useful or vulnerable. Those are “tagged” with a description of where they came from and what they are.
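The “tagging” amounts to attaching provenance metadata so a mirrored copy can still be traced to its source and verified later. A minimal sketch of such a record, with a checksum standing in for the verification step (the field names are illustrative, not Data Refuge’s actual schema, and the source URL and payload are hypothetical):

```python
# Bundle provenance metadata with a checksum of a downloaded dataset,
# so later users can confirm a mirrored copy matches what was captured.
import hashlib
import json
from datetime import date

def tag_dataset(source_url, agency, description, payload: bytes):
    """Build a provenance record for one rescued data set."""
    return {
        "source_url": source_url,
        "agency": agency,
        "description": description,
        "retrieved": date.today().isoformat(),
        # SHA-256 of the raw bytes lets anyone verify the copy later.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }

record = tag_dataset(
    "https://www.ndbc.noaa.gov/",  # illustrative source
    "NOAA",
    "Buoy temperature and salinity readings",
    b"station,temp_c\n41001,22.4\n",
)
print(json.dumps(record, indent=2))
```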

At one of last month’s events, at New York University, many marveled at the breadth and depth of the research they were sorting through, even as they worried about its future.

“Look, you can get temperature and salinity readings from any one of these buoys,” said Barbara Thiers, the vice president for science at the New York Botanical Garden, one of the participants. “This is the raw data for tracking ocean warming.”

Follow Amy Harmon on Twitter at @amy_harmon


A version of this article appears in print in Section D, Page 1 of the New York edition with the headline: “The Rush to Uncover, and Save, ‘Dark Data’.”
