Depending on the experiment, scientific researchers may have to sort through thousands of data points to pull the information they need to communicate a project’s outcome. Turning that information into usable datasets and integrating it with models can prove difficult. Finding the data later and reproducing the analysis can also pose a challenge.
At the Environmental Molecular Sciences Laboratory (EMSL), researchers study some of life’s tiniest components—chemical compounds, pathways, and processes down to the molecular and sub-molecular levels. Their experiments can contain hundreds of thousands of data points. To help both users of EMSL and staff scientists access and turn their data into usable streams, EMSL is launching a new integrated research platform (IRP) called Data Transformations.
Through the new IRP, a range of innovative and standardized workflows are under development to help researchers transform their scientific data into more manageable sets of information, make the data more accessible and analyses more reproducible, and facilitate the creation of models and visualization tools that help tell a larger story from the data.
In addition to rigorous statistical methods, the IRP is applying machine learning, artificial intelligence, and a broad array of techniques to streamline computational processes for data transformation and make them more accessible.
“We are creating the Data Transformations IRP as a way to accelerate delivering the scientific value of data we gather here,” said Jay Bardhan, leader of EMSL’s Computing, Analytics, and Modeling science area. “The goal is to help everyone approach and access data pertaining to their experiments more easily.”
A new computational leader at EMSL
Leading the new IRP is Kelly Stratton, an accomplished biostatistician and data scientist with more than 10 years of experience applying statistical methods to biological and environmental questions. Stratton said she joined the IRP because she was excited to help make data access and transformations more approachable for scientists at EMSL.
“Prior to this role, I had been getting more involved in [EMSL] user projects, helping with statistical analysis,” she said. “It has been fun to see the science that is going on for EMSL users. I was really excited about having the opportunity to form what Data Transformations looks like for EMSL. We plan to make it easier for people to do appropriate analyses with their data and reduce the friction that can exist in working with data.”
Stratton said that while numerous algorithms and processes exist to help researchers process their data and turn it into usable sets, completing those processes can be daunting. This is especially true for advanced data analysis techniques that have many parameter values to set or involve a series of choices. An algorithm or workflow created for one experiment or dataset may not be applicable to others, she said.
“When I do a statistical analysis, I sometimes spend the bulk of my time reformatting data,” she said. “To make it more complicated, every research group does it differently because of differences in software used for molecular identifications and annotations, or just personal preference. If there is a way we can standardize and help streamline things at certain steps in the overall workflow going forward, then that is a big win.”
Goals of Data Transformations IRP
Stratton said the new Data Transformations IRP has three overarching goals. The first is to create processes that help streamline and standardize data analysis workflows. The second is to increase collaboration between EMSL staff and users, along with awareness of EMSL’s capabilities in the computational and data processing space. The third is to develop tools for working with data generated at EMSL and for generating data products, and to make those tools more accessible.
She said combining the three goals will help users and EMSL staff researchers better access, organize, and pull usable datasets and create models now and into the future. EMSL users will also be more in tune with the computational capabilities and expertise available to them for various research projects.
“EMSL generates a lot of interesting and important data,” Stratton said. “The goal is to create a cohesive framework and help scientists and users work more efficiently with their data and get the most out of it that they can. We are not trying to change the science that someone is doing. We can make it easier for them to do the science by streamlining data handling and other processes.”
Streamlining data access and analysis
At EMSL, Stratton said they already have a variety of projects and programs that use new and improved automated workflows to analyze and process data.
One such program is CoreMS, a comprehensive mass spectrometry framework, developed at EMSL, for software development and data analysis of small molecules. Data handling and software development for modern mass spectrometry often require an advanced background in computational science and a deep understanding of mass spectrometry, which includes elements of physics and other areas of science. CoreMS reduces the technical know-how required by providing a fundamental, high-level basis for working with all mass spectrometry data types. It supports direct access to almost all vendor data formats, allowing for the centralization and automation of data processing workflows from an instrument’s raw signal to data annotation and curation.
Another example is the Multiomics Analysis Portal, or MAP—also developed at EMSL. The portal provides a one-stop shop of applications to meet a variety of multiomics research needs.
“It’s like an app store for omics data,” Stratton said.
Researchers can use MAP to determine the best application for their desired data outcome. Users filter by the data types they wish to analyze and by their analysis goal, and MAP tells them which applications fit best.
Stratton said she is excited for what the future holds for Data Transformations at EMSL. The team has many plans to make frameworks and processes more capable but also simpler and easier to use. As the IRP develops future capabilities and workflows for data transformations and visualization, one of its top focus areas is providing open-source capabilities and interfaces that require no coding.
“I think there is a lot we can do to piece together all of these different components, including the use of AI and machine learning where appropriate,” she said. “We are trying to help map out the framework so people know what they can access, as well as what is needed for their projects.”
Data Transformations joins existing Systems Modeling IRP
The new Data Transformations IRP aligns with EMSL’s existing Systems Modeling IRP, which falls under the same science area, Computing, Analytics, and Modeling, Bardhan said. Previously, modeling and data sciences were combined under the single Systems Modeling IRP.
“We separated them into two IRPs because we recognize there is a lot to be done in both areas,” he said.
The Systems Modeling IRP, led by Satish Karra, will now focus primarily on building models, simulation approaches, and visualization tools to inform data predictability and reproducibility.
“At the end of the day, we are understanding environmental and biological processes through experiments that will inform models so we can make decisions in the future,” Karra said. “Those models can be based on our understanding of physics, laws of mechanics, our understanding of chemistry, biology, and so on.”
Bardhan said the convergence of the “mountains” of data at EMSL with its sophisticated modeling capabilities and advanced computing resources, such as EMSL’s Tahoma scientific computing resource, will open new ways for users and the EMSL staff who work with them to design experiments with more scientific understanding and fewer resources spent on collecting and analyzing samples.
“Due to the variety of instruments and the complexity of the biological and environmental systems, steps to follow experiment parameters can be incredibly complicated and there are a lot of systems-specific details that have to be done exactly right,” he said. “Users should be able to focus on their science, not be frustrated by the tools and other processes involved that take up unnecessary time.”
Bardhan said he and his team see the new Data Transformations IRP, in combination with established EMSL capabilities and expertise, as a major resource for EMSL users.