Data Management Policy
Updated: August 2024
EMSL’s Public Data Release policy applies to all non-proprietary user projects and is subject to change without notice under the Terms and Conditions for EMSL use.
The EMSL Data Management Policy details the resources available to EMSL users and facility staff for managing data associated with research using EMSL resources. This policy is provided to help users and staff understand the data resources available at EMSL, including storage and retrieval capabilities, data definitions, data release requirements, and the data management responsibilities of EMSL’s instrument and computational scientists. EMSL’s policy is based on the guiding principles summarized below from the Department of Energy (DOE), DOE Office of Science (DOE-SC) and the Office of Biological and Environmental Research (BER). It also provides information necessary to assist researchers in developing a data management plan to meet funding agency requirements.
To promote the efficient delivery of scientific discoveries and effective use of government resources, DOE and DOE-SC have mandated that data management planning be an integral element of research planning [1, 2]. Data preservation and sharing facilitate validation and reproducibility of scientific results and broaden the applicability of data products beyond the scope of individual research projects. Therefore, it is the intent of DOE-SC that scientific data generated at scientific user facilities such as EMSL be made available to the scientific community, industry, and the public to the greatest extent possible. DOE-SC policy requires that “all research data displayed in publications resulting from [SC-sponsored] research be open, machine-readable, and digitally accessible to the public at the time of publication.” Additional information on data and publication sharing requirements and guidelines is provided in DOE’s Public Access Plan [3].
BER, EMSL’s sponsor, provides further guidance regarding digital data management for some of its research programs [4]. A compilation of data management policies at all SC user facilities is also available [5].
Data Management Resources
EMSL currently provides the ability to store all data generated at EMSL (including numerical simulation outputs) in a hierarchical storage archive, which combines short-term disk storage of recently used data with long-term archival of infrequently used data on lower-cost tape resources. The EMSL data archive system, known as Aurora, currently has the capacity to store tens of petabytes (PB) of data and is readily expandable. This archive serves as the foundation for the metadata-based data repository that is currently accessible to EMSL staff members and is connected to all major EMSL instruments. Authorized EMSL users and facility staff can electronically access their nonproprietary data in the repository through the Get Data tab of the applicable project in the User Portal. To facilitate resource planning, researchers expecting to generate 250 terabytes of data or more on a single user project should include this in their proposal package request for resources. If a project generates more data than originally anticipated in the proposal process, researchers should discuss this with their project manager, who will request additional resources.
EMSL Data Access Policy for User Project Team Members
Under this policy, data are released to EMSL user project team members as follows:
- For projects producing data at EMSL and one or more non-DOE user facilities, immediate access and release of data generated on an approved user project are granted to all team members (aka participants) listed on the project by the PI. All team members on the project will have full access to their data both during the project period and in perpetuity after the project ends. EMSL staff are granted access to the data but are not authorized to release the data in any form.
- For collaborative projects utilizing EMSL and one or more user facilities, the data generated at EMSL will be released to the team members as described above. Data generated at other user facilities will be released by those facilities in accordance with their respective data management policies.
Project team members who are unable to find and download project data through the User Portal should contact their project manager or UPS, who will work with the team member to create a data package from the repository and transmit the data by appropriate means.
EMSL Open Access Data Release Policy
EMSL’s Open Access Data Release Policy applies only to non-proprietary data collected under any of the proposal types in the user program as described in Chapters 5, 7, 8, and 9 of EMSL's Operations Manual. The purpose of this policy is to balance the need to make data openly accessible to the scientific community and the public as soon as possible with the reasonable expectation that project teams are afforded time to analyze the data, evaluate the results, and prepare publications on their conclusions, while easing fear of preemption. Data, for purposes of this policy, refer to the sample metadata, raw instrument data, associated experiment metadata, and processed data. These data will be released to the public on EMSL’s open access data portal.
To support making data openly accessible, tracking data use for the Office of Science, and encouraging proper citation of the researchers who generated the data, EMSL’s data portal requires creation of a user account and provides Digital Object Identifier (DOI) minting services. DOIs may be minted in association with specific data packages, including 1) data packages associated with scientific publications; 2) unique data packages developed by EMSL users and/or staff; and 3) data packages requested through the data portal. An Award DOI is also generated by EMSL for every user project, for inclusion in publication acknowledgements. These data and award DOIs provide an avenue for data reuse with appropriate citation and attribution of EMSL and the generating PI or team members. Individuals who download data from EMSL’s data portal are counted as data users, according to the user definitions in Chapter 4 of EMSL's Operations Manual.
Under this policy, the data will be released to the public as follows:
- Specific data will be released immediately upon upload to EMSL’s repository:
  - Data generated under EMSL’s Strategic Science Objectives, including but not limited to the 1000 Soils Research Pilot and the Molecular Observation Network (MONet).
  - Field sensor data.
- All other resource data will be released as follows:
  - All non-proprietary data uploaded to the repository on an approved user project will become openly accessible at the time a data DOI is minted, at the time of publication of the associated scientific results, or within one year after data generation and upload to the EMSL repository, whichever comes first.
  - Prior to the open access release date, data can be released only by the user project PI or team member to other entities (people, publishers as supplementary materials in a manuscript submission, institutions, etc.).
  - For collaborative projects utilizing EMSL and additional user facilities, the data generated at EMSL will be released as described above. Data generated at other user facilities will be released by those facilities in accordance with their respective data management policies.
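The "whichever comes first" release rule above can be sketched as a small date calculation. This is only an illustration, not EMSL's implementation: the function and parameter names are hypothetical, and it assumes the one-year clock starts at the repository upload date and that data cannot become public before they are uploaded.

```python
from datetime import date
from typing import Optional


def add_one_year(d: date) -> date:
    """Return the same calendar day one year later (Feb 29 maps to Feb 28)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def open_access_date(
    upload: date,
    doi_minted: Optional[date] = None,
    published: Optional[date] = None,
    immediate_release: bool = False,
) -> date:
    """Estimate when a dataset becomes openly accessible (hypothetical helper).

    immediate_release marks data released on upload, such as Strategic
    Science Objectives data or field sensor data.
    """
    if immediate_release:
        return upload
    # Assumption: the one-year limit is counted from the upload date.
    candidates = [add_one_year(upload)]
    candidates += [d for d in (doi_minted, published) if d is not None]
    # Assumption: release cannot precede the upload itself.
    return max(upload, min(candidates))


if __name__ == "__main__":
    print(open_access_date(date(2024, 3, 1)))
    print(open_access_date(date(2024, 3, 1), published=date(2024, 9, 15)))
```

With no DOI and no publication, the one-year limit decides the date; once either event is recorded, the earliest of the three dates applies.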
Repository Management
Nonproprietary project data generated using EMSL resources must be uploaded to the repository; data stored outside EMSL’s repository do not meet the requirements of this policy. All data uploaded to the repository will be stored permanently to ensure long-term accessibility. Legacy data (data collected prior to the availability of the repository and stored elsewhere) are being evaluated by EMSL staff, and all legacy data that meet the requirements below and for which required metadata can be established will be uploaded to the repository and become available per the policies below.
Data uploads are regularly monitored using reporting tools that are linked to instrument usage records in EMSL’s management system to evaluate compliance with this policy. For purposes of this policy, EMSL’s instrument custodians and computational scientists are expected to use the following guidance to determine which data will be uploaded to the data repository.
Data Included in the Repository
Essentially all nonproprietary data should be uploaded, except data that fall under Section 12.4.2 (Data Not Required to be Uploaded to the Repository). Data that fall into the exception category can be uploaded in some cases, but uploading them is not required. EMSL instrument custodians and computational scientists should direct any questions they have regarding the type of data collected to the EMSL IRPL responsible for the instrument or computing system being used. All sample metadata, raw instrument data, simulation outputs, processed data, and associated experiment metadata collected from experiments or computations that are expected to be delivered to EMSL users as part of an approved project must be uploaded to the repository. EMSL instrument custodians and computational scientists should upload the data as soon as practicable, but no later than the end of each month for raw data and associated metadata and no later than the end of the quarter for processed data. A command line uploader has been provided for computational resources and should be used to store computational outputs in the same manner as for experimental data.
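As a rough illustration of the upload deadlines described above, the following sketch computes the latest acceptable upload date for a dataset. The function name is hypothetical, and two assumptions are made: the deadline is the end of the month or quarter in which the data were generated, and quarters end in March, June, September, and December.

```python
import calendar
from datetime import date


def upload_deadline(generated: date, processed: bool = False) -> date:
    """Latest upload date: month end for raw data, quarter end for processed data."""
    month = generated.month
    if processed:
        # Assumption: quarters end in March, June, September, and December.
        month = ((generated.month - 1) // 3 + 1) * 3
    last_day = calendar.monthrange(generated.year, month)[1]
    return date(generated.year, month, last_day)


if __name__ == "__main__":
    print(upload_deadline(date(2024, 2, 10)))                  # raw data
    print(upload_deadline(date(2024, 2, 10), processed=True))  # processed data
```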
Data definitions should conform to relevant community standards for data and metadata when they exist. All data uploaded to the repository conform to the Dublin Core bibliographic metadata standard (bibliographic metadata are automatically extracted from project text stored in EMSL’s management system), which facilitates linkage to the DOE Office of Scientific and Technical Information (OSTI) where all EMSL publications are archived. Other standards are domain-specific, such as the Human Proteome Organization (HUPO) Proteomics Standards Initiative that guides metadata collected and stored by both EMSL and the PNNL proteomics data management systems. In cases where there are no clear community standards, data in a form that allows unbiased interpretation by the relevant scientific community should be uploaded. Note that a single experiment or simulation run could require more than one dataset to be uploaded; the original data may be uploaded initially and processed data subsequently. The time stamps for upload of each dataset determine its date of release to the public.
Data Not Required to be Uploaded to the Repository
As an exception to the requirements in the section above (Data Included in the Repository), some data are not required to be uploaded to the repository. These are data that will not form the basis of publishable research findings or are not associated with an EMSL project. They include data from experiments known to be faulty in some regard (e.g., through mishap or a flawed experimental design), data from preliminary experiments that are not intended to be delivered to EMSL users, calibration runs for which results are not needed to interpret legitimate project data, and data generated to verify successful operation of the instrument or demonstrate capability to prospective users.
Blocking Released Data
In the rare case where users or staff have identified faulty data that have been released to the project team or made openly accessible to the scientific community, the instrument custodian should contact the appropriate IRPL and CDO in writing, providing sufficient detail and justification for requesting that the released data be blocked. If the request is approved, the CDO will forward the email to UPS, which will flag the appropriate dataset(s) and document both the request and the approval in the applicable project records.
Data Repository Management
The archive size is maintained to ensure at least 48 months of headspace at any time (based on extrapolation of recent data upload rates). Disk storage comprises approximately 15% of the archive, with the remainder being tape storage. As data are uploaded, two permanent copies of the data are stored to tape (for data redundancy and integrity). The data are also maintained on disk to facilitate rapid access, but as data age they may be removed from disk storage. Selected data may also be “locked” to disk by system operations staff. The disk archive is actively managed by automated processes that purge data once the disk usage reaches 90% of its capacity; files are purged in order of longest time since last access until 80% capacity is achieved. Thereby the disk storage is continually maintained at 80% to 90% usage of its capacity, with only the most recently accessed files retained on disk. At any point, a file can be retrieved from tape to the disk if requested.
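The disk purge behavior described above resembles a high/low watermark scheme with least-recently-accessed eviction. The sketch below illustrates that idea under stated assumptions; it is not the archive's actual software. The `DiskFile` class and `purge_disk` function are hypothetical names, and the sketch assumes that files "locked" to disk are exempt from purging. Purging only drops the disk copy, since the two tape copies remain.

```python
from dataclasses import dataclass


@dataclass
class DiskFile:
    name: str
    size: int            # any consistent unit, e.g. terabytes
    last_access: float   # timestamp; smaller means accessed longer ago
    locked: bool = False  # assumption: locked files are never purged


def purge_disk(files, capacity, high_pct=90, low_pct=80):
    """Return (kept, purged) after applying the watermark purge.

    Nothing is purged until usage reaches high_pct of capacity. Once it
    does, unlocked files are purged in order of longest time since last
    access until usage is at or below low_pct of capacity.
    """
    used = sum(f.size for f in files)
    # Integer arithmetic avoids floating-point surprises at the thresholds.
    if used * 100 < high_pct * capacity:
        return list(files), []
    kept, purged = [], []
    for f in sorted(files, key=lambda f: f.last_access):
        if used * 100 > low_pct * capacity and not f.locked:
            purged.append(f)
            used -= f.size
        else:
            kept.append(f)
    return kept, purged


if __name__ == "__main__":
    files = [
        DiskFile("a", 30, last_access=1),
        DiskFile("b", 30, last_access=2, locked=True),
        DiskFile("c", 20, last_access=3),
        DiskFile("d", 15, last_access=4),
    ]
    kept, purged = purge_disk(files, capacity=100)
    print([f.name for f in purged], [f.name for f in kept])
```

In this example the disk is 95% full, so the oldest unlocked file is purged, which brings usage to 65% and ends the pass. A purged file can still be restored from tape on request.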
1. https://www.energy.gov/datamanagement/doe-policy-digital-research-data-management
2. https://science.osti.gov/Funding-Opportunities/Digital-Data-Management
3. https://www.energy.gov/downloads/doe-public-access-plan
4. https://science.osti.gov/ber/Funding-Opportunities/Digital-Data-Management
5. https://www.energy.gov/datamanagement/doe-policy-digital-research-data-management-resources