Optimization of peptide identification from tandem mass spectral data
EMSL Project ID
3550
Abstract
Determining the correct sequence of amino acids for a peptide starting with MS/MS spectral data can be stated as an optimization problem where the objective is to match an experimental spectrum with the amino acid sequence most likely to produce it.In general, two approaches have been proposed for the solution of this problem. In the first, the MS/MS spectrum of an unknown peptide is compared to idealized spectra derived from genomic databases (Eng, McCormack et al. 1994; Eriksson, Chait et al. 2000). The best match, or matches, are reported as answers. This method will fail to identify a correct peptide if the peptide sequence under investigation is unavailable in the search database. This can happen for a number of reasons, including differences in the genomes of the organism studied in the field and the one which was sequenced, frameshifts that occur during translation, alternative splicing, and post-translational modifications.
The second approach attempts to find an amino acid sequence that would produce the spectrum at hand without recurring to an archive of previously available peptide sequences. This de novo methodology uses only the peaks in the spectrum to deduce the sequence of amino acids that gave rise to it and is usually stated in a graph-theoretical framework (Bartels 1990; Hines, Falick et al. 1992; Taylor and Johnson 1997; Dancik, Addona et al. 1999; Chen, Kao et al. 2001). The objective in this problem is to create a sequence of amino acids that helps explain the most important spectral features observed.
Consider a peptide formed by the amino acid sequence LFSQVGK (Kinter and Sherman 2000). A complete and perfect fragmentation of this peptide into singly charged b- and y- ions would produce peaks at the positions shown in Figure 1. For simplification purposes, we assume that all ions are detected and all have the same relative abundance. The information contained in Figure 1 can be used to reconstruct the original peptide because the difference (in mass/charge, or m/z, units) between adjacent peaks of a given ion type corresponds to the mass of an amino acid residue in the original sequence. If a fragmentation occurs at every amino acid and every resulting fragment is detected as a singly-charged ion, the problem of reconstructing the peptide using spectral information is greatly simplified and can be solved efficiently using dynamic programming methods (Dancik, Addona et al. 1999; Chen, Kao et al. 2001).
Figure 1 here.
Unfortunately, experimental results are seldom this perfect and the researcher is confronted with spectra that contain missing or unclearly defined peaks. Real spectra may also show peaks from a variety of other peptide fragments as well as considerable background noise. Departures from perfect behavior make the computationally efficient dynamic programming algorithms lose their edge when dealing with real spectral data.
Even if perfect information is available, the graph-theoretical approaches require unambiguous identification of all spectral features (all peaks must be assigned to a certain type of ion) to produce the correct answer. This assignment is clearly not an easy task. In the absence of clear identification, there is no guarantee that the graph-theoretical methods will produce the correct answer.
Our interest is to propose a solution technique that uses a Genetic Algorithm (GA) to solve the de novo sequencing problem. Genetic Algorithms have become an increasingly popular methodology to solve difficult combinatorial optimization problems in many different areas of science and engineering. Pioneered in the 1970s by Holland (Holland 1992), the term Genetic Algorithm (or GA) is used whenever a small group of potential solutions is evolved until some criterion of convergence has been reached. The main idea behind GA is that, by combining small blocks of relevant information, good solutions can be created. The solutions generated in a run have, potentially, the ability to explore any portion of the entire problem space. Since GA only require the assignment of a goodness value to any given solution, they are not deterred by discontinuities in the search space, noisy objective functions or non-linearly constrained spaces. Presently, the genetic algorithm takes hours to match a sequence with a spectrum. The goal of this work will be to implement the GA in parallel, resulting in an analysis that takes a few seconds to a minute to complete.
Project Details
Project type
Exploratory Research
Start Date
2003-08-27
End Date
2004-09-15
Status
Closed
Released Data Link
Team
Principal Investigator
Team Members
Related Publications
Heredia-Langner A, WR Cannon, KD Jarman, and KH Jarman. 2004. "Sequence optimization as an alternative to de novo analysis of tandem mass spectrometry data." Bioinformatics 20(14):2296-2304.