
Investing from Within: Software that Drives Scientific Advancements | Part 3 of a 3-Part Series

Experts in EMSL’s Computing, Analytics, and Modeling science area develop software that eases the analysis of experimental data.

Corydon Ireland
A detail shot from a gas chromatography mass spectrometer at EMSL, which is being set up to analyze samples for metabolomics research. The technique is complementary to liquid-chromatography mass spectrometry. Both are subject to computer-aided analytics.

Vials of river water, jars of farm soil, and milliliters of solution afloat with plant cells.

Samples of many kinds stream into the Environmental Molecular Sciences Laboratory (EMSL). There, the U.S. Department of Energy (DOE) user facility performs its signature brand of deep analysis with its vast suite of multi-modal analytical platforms, including mass spectrometry. A medically and scientifically critical technique, mass spectrometry is used to quantify known compounds, identify new ones, and investigate the chemistry and structural properties of molecules.

EMSL’s state-of-the-art mass spectrometers generate so much data that analyzing it properly is a challenge for any researcher. To overcome that challenge, scientists turn to software: careful sets of computer instructions that execute tasks too burdensome, or simply impossible, to accomplish by human labor alone.

At EMSL, experts within the Computing, Analytics, and Modeling (CAM) science area specialize in such technical software. Sometimes their work is funded by internal investments, which provide support for periods from six months up to three years.

 

The Three Ways of Software

EMSL Chief Information Officer Lee Ann McCue poses in front of Tahoma, the laboratory’s new supercomputer, which went online in October 2020. Photo is by Andrea Starr, PNNL.

Computational scientist Lee Ann McCue, Chief Information Officer at EMSL, guides funding for internal investments in the CAM science area. (Separate paths of internal funding mark EMSL’s other main science areas: Environmental Transformations and Interactions and Functional and Systems Biology.)

“We’re asking people to test out an idea,” she said of internal-investment proposals, especially through Tier 1 internal funding for projects of six months or less. “You just want to see if it works or maybe you need the final finishing touches on something for the user community.”

McCue described “three flavors” of internal investing on EMSL’s computational side of the shop. Let’s call them reimagined, inward-facing, and outward-facing.

Reimagined internal investments begin with existing software and apply it in a way that has not been done before “to demonstrate a new type of science,” said McCue.

Inward-facing investments “are internally focused,” she said. “We develop software to more robustly analyze that data we generate. We make an EMSL process more robust and reliable.”

Outward-facing internal investments “are focused on the (EMSL) user community,” said McCue. “They’re about making the interpretation of experimental data easier for them.”

 

Game-Changers Welcome

There’s an advantage to internal funding, said McCue. Risk is contained, but the rewards come back to EMSL staff, improve processes, and make user experiences smoother.

“We encourage somebody to think about what might be a high-risk, high-reward project,” she said. “It might not work, but if it does it could be a game-changer.”

Looking ahead to future themes for internal investing, McCue would like to see proposals related to machine learning, a branch of artificial intelligence in which algorithms learn patterns from data, automating some of the most complex parts of an analysis.

EMSL’s new supercomputer, dubbed Tahoma and online since October 2020, “is perfect for projects like that,” she said. “It’s a heterogeneous system that provides plenty of GPUs (graphics processing units) that are really good at things like machine learning.”

Of course, added McCue, any EMSL investments, internal or otherwise, are ultimately intended to benefit users. In the end, every advance in software and every improved process at EMSL faces outward.

 

Reimagined: Problems in Aerosol Chemistry

EMSL computational scientist Edoardo Aprà

To give a sense of what she called the three flavors of internally funded projects at EMSL, McCue offered an example of each; all are “making good progress and getting interesting results.”

EMSL computational scientist Edoardo “Edo” Aprà is leading a project that reimagines the open-source computational chemistry tools available in NWChem and applies them to the needs of scientists who investigate atmospheric aerosols. NWChem is high-performance computational chemistry software developed and maintained by an EMSL-based consortium on the campus of Pacific Northwest National Laboratory (PNNL). It provides scalable tools to study the kinetics and dynamics of chemical transformations. Aprà is a member of the NWChem development team.

Atmospheric aerosols are critically influential ultrafine liquid and solid particles suspended in the air. They help determine cloud formation, precipitation, and surface energy budgets.

The focus of the Aprà project has a kind of EMSL flair. It’s an effort to characterize properties of secondary organic aerosol (SOA) particles at the molecular level. SOAs, among the most abundant aerosols in the lower atmosphere, begin life in a gas phase and partition into a liquid or solid condensed phase.

The project specifically targets the reactivity of molecules that control the formation of SOA particles. The main computational tools are electronic structure methods already available in NWChem, so minimal software development is required.

Aprà sees the project as a way to build a bridge between quantum chemistry modelers like himself and atmospheric aerosol scientists.

 

A Photochemistry Focus

During the aerosol summer school event at EMSL for young scientists in July 2019, atmospheric scientist John Shilling, in the grey shirt, describes the aerosol chamber where researchers can simulate the formation of organic aerosol particles. Photo is by Andrea Starr, PNNL.

To that end, Aprà’s co-investigators are both atmospheric chemists: aerosol researcher John Shilling of PNNL and mass spectrometry analyst Maria Zawadowicz of Brookhaven National Laboratory in New York.

In 2020, they agreed to focus first on the complicated photochemistry of SOA molecules, with the ultimate aim of improving the predictive power of atmospheric models. In part, the Aprà project is a theoretical exercise in understanding the organic molecules in SOA and how they interact with sunlight in the process of SOA formation.

There is very little detailed experimental data about the photochemistry of organic molecules in the atmosphere. This is especially true of the organic compounds that make up SOAs, which may decompose when exposed to sunlight. (This process of decomposition is known as photolysis.)

Zawadowicz and Shilling led a 2020 paper on one aspect of the data gap: “loss pathways” for SOAs in the atmosphere. A lot is known about SOA production but less is known about the loss of SOA mass through chemical processes like photolysis.

They found that decomposition and loss rates were highly dependent on the “recipe” used to make the SOA particles. They hypothesized that rates of SOA loss to photolysis depend in part on SOA molecular composition.

However, researchers need help identifying the specific types of molecules susceptible to photolysis. Using NWChem chemistry calculations, Aprà and his team were able to investigate the types of molecules that were likely to absorb light and subsequently decompose. Ultimately, these calculations will help improve representations of SOA loss processes, including photolysis, in Earth system models.

EMSL routinely analyzes samples provided by another DOE entity, the Atmospheric Radiation Measurement (ARM) user facility. EMSL staffers also have access to PNNL’s Atmospheric Measurements Laboratory, including its aerosol chambers. (Shilling led the project to design and build them.) And in July 2019, EMSL and ARM teamed up to host an aerosol summer school for students and early-career scientists.

 

Inward-facing: Modernizing Software Interfaces

EMSL computational mass spectrometrist Yuri E. Corilo

EMSL computational scientist Yuri E. Corilo arrived at the user facility in June 2019 after a decade at the National High Magnetic Field Laboratory at Florida State University. Just a few months later, in October 2019, he was leading a year-long capability development project. The aim: develop a mass spectrometry (MS) software framework called CoreMS.

Focused at first on EMSL staff scientists, CoreMS laid out a basis for working across data types and for keeping pace with the fast-changing landscape of MS technology. Scientists using the software today can customize data-processing workflows for signal processing, curation, and annotation of mass spectra. CoreMS also makes access to processed data and calculations easier than before.
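To make the idea concrete, here is a minimal sketch, in Python, of the kind of configurable pipeline such a framework enables. The step functions, settings, and data layout are illustrative assumptions for this article, not CoreMS’s actual API.

```python
# A minimal, hypothetical sketch of a configurable mass spectrometry
# processing pipeline; function names and settings are illustrative,
# not CoreMS's actual API.

def baseline_correct(spectrum, settings):
    """Subtract a constant noise floor (a stand-in for real signal processing)."""
    floor = settings.get("noise_floor", 0.0)
    return [(mz, max(0.0, inten - floor)) for mz, inten in spectrum]

def pick_peaks(spectrum, settings):
    """Keep only signals above a relative-intensity threshold."""
    if not spectrum:
        return []
    top = max(inten for _, inten in spectrum)
    cutoff = settings.get("min_relative_intensity", 0.01) * top
    return [(mz, inten) for mz, inten in spectrum if inten >= cutoff]

def run_pipeline(spectrum, steps, settings):
    """Apply each configured processing step in order."""
    for step in steps:
        spectrum = step(spectrum, settings)
    return spectrum

# Each spectrum is a list of (m/z, intensity) pairs.
processed = run_pipeline(
    [(180.0634, 1200.0), (181.0668, 60.0), (250.5000, 3.0)],
    steps=[baseline_correct, pick_peaks],
    settings={"noise_floor": 5.0, "min_relative_intensity": 0.05},
)
print(processed)  # only the peaks that survive both steps
```

The point of such a design is that a researcher swaps steps or tunes settings per experiment rather than rewriting a one-off script for each instrument and data type.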

At EMSL, Corilo and his collaborators, particularly analytical chemist Will Kew, took on a data-processing bottleneck caused by spiking demand for, and growing diversity in, organic matter analysis. As at other research facilities, MS data volume has increased, along with the quality and sensitivity of data drawn from analyzing complex mixtures. Before CoreMS, the EMSL data-processing workflow was often labor-intensive and required using multiple software programs.

In this July 2020 image, metabolomics chemist Will Kew, a Yuri Corilo collaborator, sets up a direct infusion cart for a batch analysis of samples on EMSL’s 15 Tesla FTICR mass spectrometer. Kew and Corilo are working on new data-assessment software, beginning with FTICR machines. Photo is by Andrea Starr, PNNL.

“CoreMS is a modernization of (EMSL’s) software infrastructure for mass spectrometry in general,” said Corilo, an organic chemist by training and MS specialist by passionate inclination. “It is easily connected by cloud services and web applications. CoreMS streamlines the whole pipeline by amalgamating data processing.”

To make CoreMS flexible, stable, and sustainable, the Corilo team developed a formula-assignment algorithm for molecules, accompanied by what they call a “confidence metric.”
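The general idea behind formula assignment can be sketched briefly: match each measured mass against the exact masses of candidate formulas within a tight tolerance, then score how close the match is. The Python below is a simplified illustration under those assumptions; CoreMS’s actual algorithm and confidence metric are more sophisticated than this simple mass-error score.

```python
# Minimal sketch of formula assignment with a crude confidence score.
# The candidate list, tolerance, and scoring are illustrative only.

# Monoisotopic masses of a few elements (unified atomic mass units)
MASS = {"C": 12.0, "H": 1.0078250319, "O": 15.9949146221, "N": 14.0030740052}

def exact_mass(formula):
    """Compute the monoisotopic mass of a formula given as {'C': 6, 'H': 12, 'O': 6}."""
    return sum(MASS[element] * count for element, count in formula.items())

def assign_formula(measured_mass, candidates, tol_ppm=1.0):
    """Return the candidate whose exact mass best matches the measured mass,
    plus a simple confidence score based on the mass error."""
    best = None
    for formula in candidates:
        error_ppm = (measured_mass - exact_mass(formula)) / exact_mass(formula) * 1e6
        if abs(error_ppm) <= tol_ppm and (best is None or abs(error_ppm) < abs(best[1])):
            best = (formula, error_ppm)
    if best is None:
        return None
    formula, error_ppm = best
    confidence = 1.0 - abs(error_ppm) / tol_ppm  # 1.0 = perfect match, 0.0 = at tolerance edge
    return {"formula": formula, "error_ppm": error_ppm, "confidence": confidence}

# Example: a neutral mass near glucose (C6H12O6, 180.0634 u)
candidates = [{"C": 6, "H": 12, "O": 6}, {"C": 7, "H": 16, "O": 5}]
print(assign_formula(180.0634, candidates))
```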

Corilo praised Kew and other members of the EMSL project team: data scientist Allison Thompson and chemist Rosalie Chu.

 

First, Organic Matter

Corilo’s 2019-2020 internally funded project focused on organic matter and (to a lesser extent) metabolomics. At EMSL, “organic matter” typically refers to either dissolved organic matter or soil organic matter.

Today, Corilo is expanding the CoreMS framework applications to fully support metabolomics and other areas. He’s focusing on standardizing workflows and overcoming analysis bottlenecks created by a rush of new data generated by ever-improving instruments.

In a parallel effort, Corilo stayed in the internal-funding space at EMSL by taking on a two-year project that started in October 2020.

His team’s new “acquisition-time quality control workflow assessment” tool will automate data quality-control analysis in near-real-time, ensuring that data maintain the level of quality needed for pipelines that automate data analysis. A beta version, in the form of what Corilo calls a “dashboard” of statistical performance data, may be in place as early as fall 2021.

The initial focus is on tracking complex-mixture mass spectra from EMSL’s Fourier transform ion cyclotron resonance (FTICR) mass spectrometers. They include the Bruker 12 and 15 Tesla FTICR MS devices and the 21 Tesla FTICR MS.

Currently, data acquired in automation often require painstaking reanalysis, inspection, and reprocessing because of variable instrument performance or outright instrument failures. To counter that, Corilo and his project teammates are leveraging CoreMS to save operator and instrument time by doing what they call on-the-fly quality-control assessments.
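In spirit, an on-the-fly assessment amounts to summarizing each incoming spectrum with a few quality indicators and flagging runs that drift from the recent baseline. The metrics, thresholds, and function names in the sketch below are hypothetical, not the project’s actual criteria.

```python
# Illustrative sketch of a near-real-time QC check on incoming spectra.
# Metric names and thresholds are hypothetical, not the project's actual criteria.
from statistics import mean, stdev

def spectrum_metrics(mz_values, intensities):
    """Summarize one spectrum with a few simple quality indicators."""
    return {
        "peak_count": len(mz_values),
        "total_ion_current": sum(intensities),
        "max_intensity": max(intensities),
    }

def flag_outliers(history, latest, n_sigma=3.0):
    """Compare the latest spectrum's metrics against the running history;
    return the names of metrics that drift beyond n_sigma of the baseline."""
    flags = []
    for name, value in latest.items():
        baseline = [h[name] for h in history]
        if len(baseline) < 5:  # not enough history yet to judge
            continue
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(value - mu) > n_sigma * sigma:
            flags.append(name)
    return flags

# Example: summarize a new spectrum and compare it with prior acquisitions.
history = [spectrum_metrics([100.0, 200.0], [50.0 + i, 40.0]) for i in range(10)]
latest = spectrum_metrics([100.0, 200.0], [500.0, 40.0])  # suspiciously intense
print(flag_outliers(history, latest))  # flags the drifting metrics
```

In a dashboard of the kind Corilo describes, summaries like these would feed live plots, and flagged runs could be queued for inspection or reacquisition before more instrument time is spent.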

The heart of the new software is a standalone uploader, an automated analysis pipeline, and a modern application deployed using Kubernetes on the EMSL-RZR cluster. (Kubernetes is an open-source container orchestration platform.)

As with the 2019-2020 effort, Chu, Kew, and Thompson are repeat collaborators on the Corilo project. Joining them are EMSL’s Kurt Maier and Kenneth Auberry.

Best of all, Corilo added, the on-the-fly quality-control software under development could be applied in the future to other workflows and other types of data and instruments. One example: metabolomics data from liquid chromatography-mass spectrometry (LC-MS) systems.

By 2022, Corilo envisions a web portal that will give researchers and instrument operators real-time quality control assessments of how well analytical instruments are running.

“We’re really excited,” he said.

Outward-facing: Making Data Interpretation Easier

PNNL computational statistician Lisa Bramer visits collaborators in an EMSL lab in 2019. Photo is by Andrea Starr, PNNL.

Computational statistician Lisa Bramer, now team lead for Data Science and Biostatistics at PNNL, started a research partnership with McCue on an internally funded EMSL project in the fall of 2018.

The aim was to develop an informatics tool to speed and ease the analysis of data streaming from Fourier transform mass spectrometers (FT-MS) at EMSL and elsewhere―like the FTICR-MS devices Corilo is working with.

Bramer and McCue succeeded, turning out an open-source software product called ftmsRanalysis, which is in wide use today among both EMSL staff and EMSL users. It’s an R package, a kind of software composed of functions, data sets, and other add-ons that increase the power of R, a programming language used for statistical computing.

In the case of ftmsRanalysis, the software provides a suite of ways to format, process, filter, sample, and rapidly compare datasets. A March 2020 paper, with Bramer as the lead author, laid out the software’s structure and capabilities. The authors, McCue included, used a soil microbiology dataset to demonstrate some ftmsRanalysis results and visualizations.
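One example of the kind of calculation such tools commonly automate is deriving van Krevelen coordinates, the oxygen-to-carbon and hydrogen-to-carbon ratios of assigned formulas, which let researchers compare complex organic-matter samples at a glance. ftmsRanalysis itself is an R package; the short Python sketch below is only an illustration of the concept, with illustrative names.

```python
# Illustrative Python sketch (ftmsRanalysis itself is an R package):
# compute van Krevelen coordinates (O/C vs. H/C) from assigned formulas.

def van_krevelen(peaks):
    """Given peaks as dicts of element counts, return (O/C, H/C) pairs."""
    coords = []
    for peak in peaks:
        carbon = peak.get("C", 0)
        if carbon == 0:  # skip peaks without an assigned carbon count
            continue
        coords.append((peak.get("O", 0) / carbon, peak.get("H", 0) / carbon))
    return coords

sample = [
    {"C": 6, "H": 12, "O": 6},   # carbohydrate-like composition
    {"C": 18, "H": 34, "O": 2},  # lipid-like composition
]
print(van_krevelen(sample))  # [(1.0, 2.0), (0.111..., 1.888...)]
```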

 

FT-MS, in all its variations, is more popular than ever as a technique for investigating the chemical composition of soil, plant, aquatic, and other samples containing complex mixtures of proteins, lipids, and other compounds. In turn, software tools that analyze FT-MS data are in high demand.

Said McCue of ftmsRanalysis, “It’s gotten the most traction with our users.”

Among those users is ecosystem scientist Robert Danczak, who handles FTICR-MS data for a PNNL-based river-corridor research consortium called the Worldwide Hydrobiogeochemical Observation Network for Dynamic River Systems (WHONDRS). He said ftmsRanalysis and sometimes its newer online application, FREDA, “have significantly improved our workflow and allowed us to analyze samples in a much higher throughput. It provides us with a standardized tool that enables us to share our analyses (and results) with no ambiguity―that is, we can explain each step to our collaborators.”

Bramer called FREDA “a front end for users who don’t do statistical programming but who want to visualize and explore their data.”

Bramer and others are developing software tools to help researchers process, analyze, and visualize data derived from aquatic and other samples.

A Second and a Third

Bramer did not stop with ftmsRanalysis but used the idea to expand the platform to proteomics, the second in a series of three successive and related projects and “just one example of the type of software we’re creating,” she said.

Work on the proteomics platform project began in 2018. Bramer teamed up with a PNNL Biological Sciences Division colleague, Bobbie-Jo M. Webb-Robertson, chief scientist of computational biology, who had the conceptual vision. Along with a team of researchers, the two created an open-source software product called PMART, which is their abbreviation of “panomics marketplace.”

The idea, outlined in a Bramer-led 2019 paper, was to create an interactive web-based software environment for analyzing proteomics, lipidomics, and metabolomics data by way of reproducible techniques.

To date, she said, the ability to analyze omics data in a reproducible way has been limited. But with PMART, scientists can do analyses of complex datasets without having a background in statistics. They can also do analyses module by module, create interactive visualizations, and document the process in a way that makes it easy for analyses to be reproduced.

Even PMART was not the end. It went on to inspire a third project, this time funded for two years beginning in October 2020.

“We started out trying to develop something that would automate what scientists were doing by hand more or less,” said Bramer, describing the inspiration for ftmsRanalysis.

Then came PMART, which was derived from a Webb-Robertson project that was specific to making proteomics cancer data accessible to users. From there, developing statistical methods for other problems “is not much of a leap,” said Bramer.

The third project, barely a few months old, is part of what Bramer describes as “a long-term grand vision,” a kind of app store or software portal for data analysts. A researcher would enter with data either on their own computer or in the EMSL NEXUS data archive. They would then search among the available apps―FREDA would be one possibility, for instance, in the case of FTICR-MS data.

“The idea is that we would have this very connected place for reproducible research and data processing and visualization (that delivers) multiple views of the data at different levels,” said Bramer. “It’s really about employing these processes and manipulating the data to get a more in-depth story of what’s happening.”

Joining Bramer and Webb-Robertson in the 2020-2022 portal project are researchers Kristin Burnum-Johnson, Daniel Claborne, David Degnan, Hugh Mitchell, Rachel Richardson, Kelly Stratton, and Amanda White.