Proteomic Characterization of Alternative Splicing and Coding Polymorphism using Tandem Mass Spectrometry

Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. Traditional search engines, which match peptide sequences to tandem mass spectra to identify the proteins in a sample, rely on protein sequence databases to suggest candidate peptides. While the acquisition of tandem mass spectra is not biased towards well-understood protein isoforms, this computational strategy is: it fails to identify peptides from alternative splicing and coding SNP protein isoforms even when good-quality tandem mass spectra have been acquired.
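
The database bias described above can be illustrated with a minimal sketch of candidate generation. It is not the implementation of any particular search engine: the protein sequences, the function names (`tryptic_peptides`, `peptide_mass`, `candidates`) and the 0.02 Da tolerance are all assumptions made for illustration, and real engines also handle missed cleavages, modifications and fragment-ion scoring. The sketch digests each database protein with trypsin rules and keeps the peptides whose mass matches the observed precursor mass.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the summed residue masses


def tryptic_peptides(protein):
    """Cleave after K or R unless followed by P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(database, precursor_mass, tol=0.02):
    """Return (protein name, peptide) pairs within tol Da of the precursor."""
    hits = []
    for name, protein in database.items():
        for pep in tryptic_peptides(protein):
            if abs(peptide_mass(pep) - precursor_mass) <= tol:
                hits.append((name, pep))
    return hits


# Hypothetical sequences: the variant differs by one coding SNP (I -> V).
reference_db = {"ref": "MKTAYIAKQRLGSPEVK"}
extended_db = dict(reference_db, variant="MKTAYVAKQRLGSPEVK")

# Precursor mass of the variant peptide, as if it had been observed.
observed = peptide_mass("TAYVAK")

print(candidates(reference_db, observed))  # reference-only database
print(candidates(extended_db, observed))   # database that includes the variant
```

With only the reference sequence in the database, the variant peptide is never proposed as a candidate, however good its spectrum; once a sequence carrying the variant is added, it becomes identifiable.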

We propose, instead, that expressed sequence tags (ESTs) be searched. Ordinarily, such a strategy would be computationally infeasible due to the size of EST sequence databases; however, we show that a sophisticated sequence database compression strategy, applied to human ESTs, reduces the sequence database size approximately thirty-five-fold. Once compressed, our human EST sequence database is comparable in size to commonly used protein sequence databases, making routine EST searching feasible.
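
To give a rough sense of why EST collections are compressible at all, the toy sketch below counts how many fixed-length windows in a set of overlapping reads are actually distinct. This is not the compression strategy referred to above, whose details are not given here; it only illustrates the redundancy that any such strategy can exploit. The transcript, the reads, the window length `k` and the function name `window_counts` are all hypothetical.

```python
def window_counts(sequences, k):
    """Return (total, distinct) counts of length-k windows across sequences."""
    seen = set()
    total = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            total += 1
            seen.add(seq[i:i + k])
    return total, len(seen)


# Hypothetical transcript and four overlapping EST-like reads drawn from it.
transcript = "ACGTTGCAAGCTTACGGATCCGTA"
reads = [transcript[0:16], transcript[4:20], transcript[8:24], transcript]

total, distinct = window_counts(reads, k=8)
print(f"{total} windows in total, {distinct} distinct")
```

Because overlapping reads repeat the same windows, the distinct count is much smaller than the total, and the gap widens as more reads cover the same transcript.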

We demonstrate that our EST sequence database enables the discovery of novel peptides in a variety of public proteomics datasets, representing a tantalizing untapped source of potential disease biomarkers. Furthermore, proteomic alternative splicing evidence helps distinguish the alternative splicing events.