Computing, Analytics, and Modeling
Data Transformations
Many scientific breakthroughs begin with the transformation of thousands of data points into usable sets of information that showcase a process, structure, or system. Accessing and processing those data points, however, can be challenging and complex. Through the Environmental Molecular Sciences Laboratory’s (EMSL’s) Data Transformations Integrated Research Platform (IRP), researchers work with leading experts in the biological and environmental data sciences to streamline experimental setup in the computational space, identify ideal computational workflows, and transform data into usable models and visualization tools. Computational scientists in the Data Transformations IRP closely collaborate with biological and environmental scientists in other IRPs to create dynamic synergies that facilitate scientific discoveries while advancing basic science in various fields. This collaborative approach enhances the efficiency and effectiveness of scientific investigations, pushing the boundaries of knowledge and innovation that characterize U.S. scientific leadership.
Computational expertise within the Data Transformations IRP includes:
- computational biology
- statistical design of experiments and analysis
- workflow development
- visualization
- implementation of new tools.
The IRP team also specializes in automating computational workflows and ensuring meticulous data curation. The end goal is to allow for the seamless creation of sophisticated models and visualization tools that communicate the wonders of American scientific advancements in a meaningful and relatable way.
The science
The Data Transformations IRP performs three key functions:
- Streamline and standardize data analysis workflows, including identification, exploratory data analysis, statistical analysis, and more. This is accomplished through
  - provenance: tracing the origin of data from generation to storage
  - reproducibility: the extent to which consistent results are obtained when an experiment is repeated
  - automation: leveraging computational methods so that subject matter experts can focus on their domain expertise.
- Increase user collaboration and awareness of EMSL capabilities. This is accomplished through
  - identifying areas that overlap in terms of processes/workflow steps to leverage existing capabilities into new areas
  - gaining efficiencies through improved workflows and relationships that allow for the reduction of time and costs
  - setting priorities for the development of new methods or tools for specific processing steps based on holistic views of existing workflows.
- Develop and improve the accessibility of tools for working with data produced at EMSL and generating data products. This is accomplished through
  - providing open-source software that anyone can view, modify, and distribute
  - hosting user interfaces where coding is not required
  - producing static and interactive visualizations.
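As a rough illustration of how provenance and reproducibility can be handled in an automated workflow step, the following minimal Python sketch records checksums of a step's input and output and then reruns the step to confirm the result is identical. It is not drawn from any EMSL tool; the names `run_with_provenance` and `normalize` and the sample data are hypothetical and exist only for this example.

```python
import hashlib
import json
from datetime import datetime, timezone


def run_with_provenance(data: bytes, step_name: str, transform) -> dict:
    """Apply `transform` to `data` and return the result with a provenance record."""
    result = transform(data)
    record = {
        "step": step_name,
        # Checksums tie the record to the exact bytes that went in and came out
        "input_sha256": hashlib.sha256(data).hexdigest(),
        "output_sha256": hashlib.sha256(result).hexdigest(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }
    return {"result": result, "provenance": record}


def normalize(raw: bytes) -> bytes:
    # Hypothetical processing step: drop blank lines and sort the rest
    lines = sorted(line.strip() for line in raw.splitlines() if line.strip())
    return b"\n".join(lines)


sample = b"b\na\n\nc\n"  # made-up input, for illustration only
first = run_with_provenance(sample, "normalize", normalize)
second = run_with_provenance(sample, "normalize", normalize)

# Reproducibility check: the same input and step should give the same output checksum
reproducible = (
    first["provenance"]["output_sha256"] == second["provenance"]["output_sha256"]
)

print(json.dumps(first["provenance"], indent=2))
print("reproducible:", reproducible)
```

Storing a record like this alongside each data product is one simple way to trace a result back to its inputs; production workflows typically capture more, such as software versions and parameters.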
How we do the science
EMSL users have access to a range of existing and emerging computational capabilities, supported by world-class expertise, to grow American scientific leadership by addressing research questions in the biological and environmental sciences.
Users work directly with EMSL staff members to set up experiments, establish best practices and workflows for their project, and pull the data they need to showcase experimental results. The Data Transformations IRP works closely with the Systems Modeling IRP to synergize the development of novel platforms and modeling approaches.
Some of the tools used by the Data Transformations IRP include:
- Open OnDemand: a portal that provides access to the Tahoma scientific computer so users can perform a range of computational tasks and workflows
- Multiomics Analysis Portal (MAP): a one-stop shop of applications to meet a variety of multiomics research needs
- CoreMS: a comprehensive mass spectrometry framework for software development and data analysis of small molecules
- PeakDecoder: a machine-learning-based algorithm that can identify a vast number of metabolites in each sample with a high level of confidence and can be combined with modern instrumentation, including mass spectrometry, ion mobility spectrometry, and liquid chromatography
- NMR analysis: a semiautomated capability for identifying and quantifying metabolites.
The Data Transformations IRP supports the Digital Phenome (DigiPhen) and Molecular Observation Network (MONet) strategic science objectives, including the adoption of new data transformation methods from the larger research community.
Research in action
Screening compound libraries for coronavirus therapeutics
When SARS-CoV-2 first emerged, there was an urgent need to speed up the development of new antiviral drugs to combat the virus. Researchers from the University of Washington School of Medicine, Pacific Northwest National Laboratory, and EMSL screened more than 13,000 compounds from existing drug libraries for the ability to inhibit a SARS-CoV-2 nonstructural protein called nsp15. This protein is commonly found in coronaviruses and has no counterpart in host cells, making it an interesting target for drug development. Through their screen, the scientists identified three hits against the nsp15 protein and used EMSL’s mass spectrometry capabilities to confirm that one of them bound to the protein. That hit, a compound called Exebryl-1, did not show sufficient antiviral activity in cell-based assays, but it can be optimized using artificial intelligence and in silico molecular docking calculations.
Improving proteomics data analysis
Histone proteins can be modified to enhance or suppress gene expression. A single histone can carry multiple chemical modifications at the same time, and these modifications can be recognized by other proteins in the cell. Interpreting this “histone code” can be a challenge, so researchers at EMSL developed two tools to help. IsoForma is robust, automated software that helps analyze modified proteins such as histones. PSpecteR is a proteomics-focused visualization application that helps scientists understand protein fragmentation patterns.