Updated: August 2022
EMSL’s Public Data Release policy applies to all non-proprietary user projects and is subject to change without notice under the Terms and Conditions for EMSL use.
The EMSL Data Management Policy details the resources available to EMSL users and facility staff for managing data associated with research using EMSL resources. This policy is provided to help users and staff understand the data resources available at EMSL, including storage and retrieval capabilities, data definitions, data release requirements, and the data management responsibilities of EMSL’s instrument and computational scientists. EMSL’s policy is based on the guiding principles summarized below from the Department of Energy (DOE), DOE Office of Science (DOE-SC) and the Office of Biological and Environmental Research (BER). It also provides information necessary to assist researchers in developing a data management plan to meet funding agency requirements.
To promote the efficient delivery of scientific discoveries and effective use of government resources, DOE and DOE-SC have mandated that data management planning be an integral element of research planning.1,2 Data preservation and sharing facilitate validation and reproducibility of scientific results and broaden the applicability of data products beyond the scope of individual research projects. Therefore, it is the intent of DOE-SC that scientific data generated at scientific user facilities such as EMSL be made available to the scientific community, industry, and the public to the greatest extent possible. DOE-SC policy requires that “all research data displayed in publications resulting from [SC-sponsored] research be open, machine-readable, and digitally accessible to the public at the time of publication.” Additional information on data and publication sharing requirements and guidelines is provided in DOE’s Public Access Plan3.
BER, EMSL’s sponsor, provides further guidance regarding digital data management for some of its research programs4. A compilation of data management policies at all SC user facilities is also available5.
Data Management Resources
EMSL currently provides the ability to store all data generated at EMSL (including numerical simulation outputs) in a hierarchical storage archive, which provides short-term disk storage of recently used data combined with long-term archival of infrequently used data on lower-cost tape resources. The EMSL data archive system, known as Aurora, currently has the capacity to store tens of petabytes (PB) of data and is readily expandable. This archive serves as the foundation for the metadata-based data repository that is currently accessible to EMSL staff members and is connected to all major EMSL instruments. Authorized EMSL users and facility staff can electronically access their non-proprietary data in the repository through the Get Data tab of the applicable project in the User Portal.
To facilitate resource planning, researchers expecting to generate 250 terabytes of data or more on a single user project should state this in the resource request of their proposal package. If a project generates more data than anticipated in the proposal process, researchers should discuss this with their project manager, who will request additional resources.
EMSL Data Access Policy for User Project Team Members
Under this policy, data are released to EMSL user project team members as follows:
Immediate access and release of data generated on an approved user project is granted to all team members (aka participants) listed on the project by the principal investigator (PI). All team members on the project will have full access to their data both during the project period and in perpetuity after the project ends. EMSL staff are granted access to the data but are not authorized to release the data in any form.
For collaborative projects utilizing EMSL and additional user facilities, the data generated at EMSL will be released to the team members as described above. Data generated at other user facilities will be released by those facilities in accordance with their respective data management policies.
Project team members who are unable to find and download project data through the User Portal should contact their project manager or User Services. They will work with the team member to create a data package from the repository and transmit the data by appropriate means.
EMSL Open Access Data Release Policy
EMSL’s Open Access Data Release Policy applies only to non-proprietary data collected under any of the proposal types in the user program as described in Chapters 5, 7, 8, and 9 of EMSL's Operations Manual. The purpose of this policy is to balance the need to make data openly accessible to the scientific community and the public as soon as possible with the reasonable expectation that project teams are afforded time to analyze the data, evaluate the results and prepare publications on their conclusions, while easing fear of preemption. Data, for purposes of this policy, refers to the sample metadata, raw instrument data, associated experiment metadata, and processed data, and will be released to the public on EMSL’s open access data portal.
To support making data openly accessible, tracking data use for the Office of Science, and encouraging proper citation of the researchers who generated the data, EMSL’s data portal requires creation of a user account and provides Digital Object Identifier (DOI) minting services. DOIs may be minted in association with specific data packages, including 1) data packages associated with scientific publications; 2) unique data packages developed by EMSL users and/or staff; and 3) data packages requested through the data portal. An Award DOI is also generated by EMSL for every user project, for inclusion in publication acknowledgements. These data and award DOIs provide an avenue for data reuse with appropriate citation and attribution of EMSL and the generating PI or team members. Individuals who download data from EMSL’s data portal are counted as data users, according to the user definitions in Chapter 4 of EMSL's Operations Manual.
Under this policy, the data will be released to the public as follows:
Specific data will be released immediately upon upload to EMSL’s repository:
- All other resource data will be released as follows:
All non-proprietary data uploaded to the repository on an approved user project will become openly accessible at the time a data DOI is minted, at the time of publication of the associated scientific results, or within one year after data generation and upload to the EMSL repository, whichever comes first.
Prior to the open access release date, data can be released only by the user project PI or team member to other entities (people, publishers as supplementary materials in a manuscript submission, institutions, etc.).
For collaborative projects utilizing EMSL and additional user facilities, the data generated at EMSL will be released as described above. Data generated at other user facilities will be released by those facilities in accordance with their respective data management policies.
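The release timing rule above (the earliest of DOI minting, publication, or one year after upload) can be sketched in Python as follows. This is only an illustration and not EMSL's implementation: the function and parameter names are hypothetical, and mapping a February 29 upload to February 28 of the following year is an assumption the policy does not address.

```python
from datetime import date
from typing import Optional


def add_one_year(d: date) -> date:
    """Return the same calendar day one year later."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Assumption: a Feb 29 upload maps to Feb 28 of the next year.
        return d.replace(year=d.year + 1, day=28)


def open_access_date(
    uploaded: date,
    doi_minted: Optional[date] = None,
    published: Optional[date] = None,
) -> date:
    """Earliest of: DOI minting, publication, or one year after upload."""
    candidates = [add_one_year(uploaded)]
    # Events that have not happened (None) do not affect the release date.
    candidates += [d for d in (doi_minted, published) if d is not None]
    return min(candidates)


if __name__ == "__main__":
    # No DOI and no publication: the one-year limit applies.
    print(open_access_date(date(2022, 8, 1)))
    # A DOI minted four months after upload triggers release on that date.
    print(open_access_date(date(2022, 8, 1), doi_minted=date(2022, 12, 1)))
```

Because the one-year period is tied to the upload of each data set, a sketch like this would be evaluated per data set rather than per project.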
Non-proprietary project data that are from EMSL resources must be uploaded to the repository; data stored outside of EMSL’s repository do not meet the requirements of this policy. All data uploaded to the repository will be stored permanently to ensure long-term accessibility. Legacy data (data collected prior to availability of the repository and stored elsewhere) are being evaluated by EMSL staff, and all legacy data that meet the requirements below and for which required metadata can be established will be uploaded to the repository and become available per the policies below.
Data uploads are regularly monitored using reporting tools that are linked to instrument usage records in EMSL’s management system to evaluate compliance with this policy. For purposes of this policy, EMSL’s instrument and computational scientists are expected to use the following guidance to determine which data will be uploaded to the data repository.
Data Included in the Repository
Essentially, all non-proprietary data should be uploaded, except data that fall under the section below (Data Not Required to be Uploaded to the Repository). Data that fall into the exception category can be uploaded in some cases, but upload is not required. EMSL instrument and computational scientists should direct any questions they have regarding the type of data collected to the EMSL IRP leader responsible for the instrument or computing system being used. All sample metadata, raw instrument data, simulation outputs, processed data and associated experiment metadata collected from experiments or computations that are expected to be delivered to EMSL users as part of an approved project must be uploaded to the repository. EMSL instrument and computational scientists should upload the data as soon as practicable, but no later than the end of each month for raw data and associated metadata and no later than the end of the quarter for processed data. A command line uploader has been provided for computational resources and should be used to store computational outputs in the same manner as for experimental data.
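As an illustration of the upload deadlines described above, the following Python sketch computes the latest upload date for a data set. It assumes that "month" and "quarter" mean the calendar month and calendar quarter in which the data were generated; the policy does not say whether fiscal quarters apply. The function names and the "raw"/"processed" labels are hypothetical.

```python
import calendar
from datetime import date


def end_of_month(d: date) -> date:
    """Last day of the calendar month containing d."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=last_day)


def end_of_quarter(d: date) -> date:
    """Last day of the calendar quarter containing d (assumed, not fiscal)."""
    last_month = ((d.month - 1) // 3 + 1) * 3  # 3, 6, 9 or 12
    last_day = calendar.monthrange(d.year, last_month)[1]
    return date(d.year, last_month, last_day)


def upload_deadline(generated: date, kind: str) -> date:
    """Latest upload date for data generated on the given day."""
    if kind == "raw":  # raw data and associated metadata
        return end_of_month(generated)
    if kind == "processed":
        return end_of_quarter(generated)
    raise ValueError(f"unknown data kind: {kind!r}")


if __name__ == "__main__":
    print(upload_deadline(date(2022, 8, 10), "raw"))
    print(upload_deadline(date(2022, 8, 10), "processed"))
```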
Data definitions should conform to relevant community standards for data and metadata when they exist. All data uploaded to the repository conform to the Dublin Core bibliographic metadata standard (bibliographic metadata are automatically extracted from project text stored in EMSL’s management system), which facilitates linkage to the DOE Office of Scientific and Technical Information (OSTI) where all EMSL publications are archived. Other standards are domain-specific, such as the HUPO Proteomics Standards Initiative that guides metadata collected and stored by both EMSL and the PNNL proteomics data management systems. In cases where there are no clear community standards, data in a form that allows unbiased interpretation by the relevant scientific community should be uploaded. Note that a single experiment or simulation run could require more than one data set to be uploaded; the original data may be uploaded initially and processed data subsequently. The time stamps associated with upload of each data set to the EMSL repository determine the one-year period after which those data are released to the public.
Data Not Required to be Uploaded to the Repository
As an exception to the requirements in the section above (Data Included in the Repository), some data are not required to be uploaded to the repository. These are data that will not form the basis of publishable research findings or that are not associated with an EMSL project. They include data from experiments known to be faulty in some regard (e.g., through mishap or a flawed experimental design), data from preliminary experiments that are not intended to be delivered to EMSL users, calibration runs for which results are not needed to interpret legitimate project data, and data generated to verify successful operation of the instrument or demonstrate capability to prospective users.
Blocking Released Data
In the rare case where users or staff have identified faulty data that have been released to the project team or made openly accessible to the scientific community, the instrument scientist should contact the appropriate IRP Lead and Chief Data and Analytics Officer (CDAO) in writing, providing sufficient detail and justification for requesting that the released data be blocked. If the request is approved, the CDAO will forward it to User Services, who will flag the appropriate data set(s) and document both the request and approval in the applicable project records.
Data Repository Management
The archive size is maintained to ensure at least 48 months of headspace at any time (based on extrapolation of recent data upload rates). Disk storage comprises approximately 15% of the archive, with the remainder being tape storage. As data are uploaded, two permanent copies of the data are stored to tape (for data redundancy and integrity). The data are also maintained on disk to facilitate rapid access, but as data age they may be removed from disk storage. The disk archive is actively managed by automated processes that purge data once disk usage reaches 90% of capacity; files are purged in order of longest time since last access until usage falls to 80% of capacity. As a result, disk usage is continually maintained between 80% and 90% of capacity, with only the most recently accessed files retained on disk. Tape storage retains archived data in perpetuity, and a file can be retrieved from tape to disk at any time on request.
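The disk purge behavior described above amounts to a least-recently-accessed eviction with a high and a low watermark. The Python sketch below shows that logic under stated assumptions; it is not the archive's actual software. The DiskFile type, the select_purge function, and the choice to stop as soon as usage is at or below 80% are all hypothetical. Only the disk copy is selected for removal, since the tape copies are permanent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiskFile:
    name: str
    size: int           # bytes held on disk
    last_access: float  # timestamp of most recent access


def select_purge(
    files: List[DiskFile],
    capacity: int,
    high_pct: int = 90,
    low_pct: int = 80,
) -> List[DiskFile]:
    """Pick disk copies to purge, least recently accessed first."""
    used = sum(f.size for f in files)
    # Integer arithmetic avoids floating-point rounding at the thresholds.
    if used * 100 < high_pct * capacity:
        return []  # below the high watermark: nothing to do
    purged = []
    for f in sorted(files, key=lambda f: f.last_access):
        if used * 100 <= low_pct * capacity:
            break  # low watermark reached
        purged.append(f)
        used -= f.size
    return purged


if __name__ == "__main__":
    # Nine 10-byte files on a 100-byte disk: usage is exactly 90%.
    files = [DiskFile(f"f{i}", 10, float(i)) for i in range(9)]
    print([f.name for f in select_purge(files, capacity=100)])
```

In this example only the oldest file is selected, because removing it brings usage down to exactly 80%.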