
Scaling up for large metagenomic computations with ScalaBLAST


EMSL Project ID
20905

Abstract

The recent emphasis on understanding genomes at the level of whole organisms from the U.S. Department of Energy, National Institutes of Health, and other major agencies has driven a worldwide effort to increase the rate at which new organisms are sequenced and to improve the quality of new sequences coming online. Understanding the molecular composition and interactions of proteins in microbes is a key advance that holds great promise for producing cleaner transportation fuels, processing legacy waste products associated with weapons production, developing counter-bioterrorism strategies such as early detection of biological agents or field-deployable intervention technologies, and providing a vehicle for processing the spent fuel expected to result from the renewed nuclear reactor program. But extracting meaningful information from the vast collection of sequence data requires comparing many entire genomes against one another using approaches such as comparative genomics.

Comparative genomics is a powerful approach to understanding how single organisms evolve and function. The burgeoning number of sequenced genomes presents computational challenges that are only just being met with existing methods and infrastructure. However, the recent application of high-throughput sequencing to environmental samples (metagenomics) represents a grand challenge in biology because it has the potential to reveal how entire ecosystems function and interact. The approach was initially demonstrated on a simple acid mine drainage biofilm community comprising fewer than 10 species, resulting in significant recovery of the genomes of the dominant populations (Tyson, 2004). Key insights into the biofilm community's function and interactions could be inferred from a metabolic reconstruction of the genomic data, including the partitioning of community-essential functions such as nitrogen fixation (Tyson, 2004).
Other ecosystems currently under investigation include termite hindgut communities for bioenergy production (a key DOE mission), activated sludges, soils, and terephthalate-degrading communities.
The computational task of analyzing this massive volume of complex data is quickly outpacing the capacity of existing software and hardware. Current standalone sequence analysis implementations generally take a single dedicated machine days or weeks to complete the analysis of a single microbial genome against the nonredundant protein database (a growing curated collection of proteins from all sequenced organisms). The grand challenge in this "genomes-to-life" sequence analysis effort is much larger, putting it beyond the capacity of most biologists to complete in a reasonable time.

ScalaBLAST is a high-performance extension to the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST). ScalaBLAST has been shown to accelerate the throughput of BLAST sequence analysis in proportion to the number of processors available on machines like the EMSL Molecular Science Computing Facility supercomputer, MPP2. In a previous pilot project, we demonstrated nearly perfect scaling to machine capacity for sequence analysis tasks of the size we propose in this computational grand challenge. ScalaBLAST will be the main sequence analysis driver for the proposed project, enabling a body of sequence analysis that will provide a critical information resource to the general science community for addressing the driving science problems identified above.
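
To make the scaling claim concrete, the sketch below shows how "throughput in proportion to the number of processors" is usually quantified: speedup is the serial run time divided by the parallel run time, and efficiency is speedup divided by processor count, so perfect scaling gives an efficiency of 1.0. It also shows one simple way to divide a list of query sequences among workers. This is an illustration only, not ScalaBLAST's implementation; the function names, the round-robin partitioning, and the timings in the usage note are all hypothetical and do not come from the project.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def speedup(t_serial: float, t_parallel: float) -> float:
    """Ratio of single-processor run time to parallel run time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """Speedup per processor; 1.0 corresponds to perfect linear scaling."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(t_serial, t_parallel) / n_procs


def split_queries(queries: Sequence[T], n_workers: int) -> List[List[T]]:
    """Deal queries round-robin into n_workers batches (an assumed scheme,
    chosen for simplicity; real tools may balance work differently)."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [list(queries[i::n_workers]) for i in range(n_workers)]
```

As a usage example with made-up numbers: if a hypothetical search takes 1000 hours on one processor and 12.5 hours on 100 processors, `speedup(1000.0, 12.5)` is 80 and `efficiency(1000.0, 12.5, 100)` is 0.8, meaning 80 percent of ideal scaling.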

Project Details

Project type
Capability Research
Start Date
2006-10-01
End Date
2009-09-30
Status
Closed

Team

Principal Investigator

Christopher Oehmen
Institution
Pacific Northwest National Laboratory

Team Members

Arzu Gosney
Institution
Pacific Northwest National Laboratory

Anuj Shah
Institution
Pacific Northwest National Laboratory

Abigail Corrigan
Institution
Pacific Northwest National Laboratory

Peter Zuber
Institution
Oregon Health & Science University

Nikos Kyrpides
Institution
Joint Genome Institute

Victor Markowitz
Institution
Lawrence Berkeley National Laboratory

Ernest Szeto
Institution
Lawrence Berkeley National Laboratory

Heidi Sofia
Institution
Whitman College

Philip Hugenholtz
Institution
Joint Genome Institute

Jarek Nieplocha
Institution
Pacific Northwest National Laboratory

Douglas Baxter
Institution
Environmental Molecular Sciences Laboratory

T. Straatsma
Institution
Oak Ridge National Laboratory

Related Publications

Markowitz VM, E Szeto, K Palaniappan, Y Grechkin, K Chu, IA Chen, I Dubchak, I Anderson, A Lykidis, K Mavromatis, NN Ivanova, and NC Kyrpides. 2008. "The integrated Microbial Genomes (IMG) System in 2007: Data Content and Analysis Tool Extensions." Nucleic Acids Research 36(1):D528-D533. doi:10.1093/nar/gkm846
Markowitz VM, NN Ivanova, E Szeto, K Palaniappan, K Chu, D Dalevi, IA Chen, Y Grechkin, I Dubchak, I Anderson, A Lykidis, K Mavromatis, P Hugenholtz, and NC Kyrpides. 2008. "IMG/M: A Data Management and Analysis System for Metagenomes." Nucleic Acids Research 36(1):D534-D538. doi:10.1093/nar/gkm869
Shah AR, CS Oehmen, and BM Webb-Robertson. 2008. "SVM-Hustle - An Iterative Semi-Supervised Machine Learning Approach for Pairwise Protein Remote Homology Detection." Bioinformatics (accepted).
Shah A, VM Markowitz, and CS Oehmen. 2007. "High-Throughput Computation of Pairwise Sequence Similarities for Multiple Genome Comparisons Using ScalaBLAST." In Proceedings of IEEE-NIH Life Science Systems and Applications (LISSA 2007).
Sigdel TK, A Kaushal, M Gritsenko, AD Norbeck, WJ Qian, W Xiao, DG Camp, RD Smith, and MM Sarwal. 2010. "Shotgun Proteomics Identifies Proteins Specific for Acute Renal Transplant Rejection." Proteomics Clin Appl 4(1):32-47.