ScalaBLAST to reduce protein sequence redundancy, determine similarity among species, and enable proteomics on unsequenced microbial communities
EMSL Project ID
29396
Abstract
The central challenge for high-throughput proteomics centers such as the one at PNNL is correctly identifying peptides (protein fragments) in biological samples from their measured mass spectra. We currently approach this with scoring methods that compare the measured spectra against hypothetical spectra generated from protein sequence data for the organism being studied. In contrast to conventional studies on isolates, proteomics for metagenomics and multicellular organisms presents a unique challenge for this approach, because hypothetical spectra must be generated from multiple proteomes containing many overlapping or redundant peptides. This redundancy reduces (dramatically, in some cases) the rate at which peptides from these samples are correctly identified. We have devised an approach to the redundancy problem that compacts multi-organism proteomes by exploiting sequence similarity. The solution requires sequence analysis at the level of multiple whole genomes, a task that has been shown to scale to machine capacity on the MSCF supercomputer using the software ScalaBLAST. We propose to use information provided by ScalaBLAST to create reduced representations of multiple whole-genome datasets for several real-world applications. Removing this redundancy would provide a path toward widespread use of proteomics on metagenome datasets. Perhaps even more important, the same solution may also lead to more reliable proteomics for multicellular organisms, which is currently not possible because of system complexity.
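The idea of compacting multi-organism proteomes by sequence similarity can be illustrated with a minimal sketch. It is not the project's method and does not use ScalaBLAST: the similarity score below comes from Python's `difflib.SequenceMatcher` as a stand-in for BLAST alignment scores, and the function names, the 0.9 threshold, the greedy strategy, and the toy sequences are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; a real pipeline would use BLAST alignments."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def compact_proteomes(proteomes, threshold=0.9):
    """Greedily collapse similar proteins across organisms (hypothetical helper).

    proteomes maps organism name -> {protein id -> amino acid sequence}.
    A protein becomes a new representative only if no existing representative
    reaches the similarity threshold; otherwise it is recorded as a member of
    the first representative that does.
    """
    representatives = []  # list of (organism, protein_id, sequence)
    members = {}          # (organism, protein_id) of representative -> collapsed entries
    for organism, proteins in proteomes.items():
        for protein_id, seq in proteins.items():
            for rep_org, rep_id, rep_seq in representatives:
                if similarity(seq, rep_seq) >= threshold:
                    members[(rep_org, rep_id)].append((organism, protein_id))
                    break
            else:
                # No representative was similar enough: keep this sequence.
                representatives.append((organism, protein_id, seq))
                members[(organism, protein_id)] = [(organism, protein_id)]
    return representatives, members


# Toy data (invented): b1 duplicates a1, b2 differs from a1 by one residue.
proteomes = {
    "organism_a": {"a1": "MKTAYIAKQRQISFVK", "a2": "GGPPWWCCGGPPWWCC"},
    "organism_b": {
        "b1": "MKTAYIAKQRQISFVK",
        "b2": "MKTAYIAKQRQISFVR",
        "b3": "DDDDEEEENNNNHHHH",
    },
}

representatives, members = compact_proteomes(proteomes)
total = sum(len(p) for p in proteomes.values())
print(f"{total} proteins reduced to {len(representatives)} representatives")
for key, group in members.items():
    print(key, "->", group)
```

The greedy pass compares each sequence against every representative kept so far, so its cost grows quickly with dataset size; that all-against-all comparison is the part that, at the scale of multiple whole genomes, calls for a parallel tool such as ScalaBLAST.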
Project Details
Project type
Limited Scope
Start Date
2008-02-12
End Date
2008-04-14
Status
Closed
Released Data Link
Team
Principal Investigator
Team Members