Protein Identification from Tandem Mass Spectra with Probabilistic Language Modeling
Course URL: http://videolectures.net/ecmlpkdd09_yang_pitmsplm/
Lecturer: Yiming Yang
Institution: Carnegie Mellon University
Lecture date: 2009-10-20
Language: English
Course description: This paper presents an interdisciplinary investigation of statistical information retrieval (IR) techniques for protein identification from tandem mass spectra, a challenging problem in proteomic data analysis. We formulate the task as an IR problem by constructing a “query vector” whose elements are system-predicted peptides with confidence scores derived from spectrum analysis of the input sample, and by defining a vector space of “documents” made up of protein profiles, each constructed from the theoretical spectrum of a protein. This formulation establishes a new connection from the protein identification problem to probabilistic language modeling and to the vector space models of IR, and lets us compare fundamental differences between the IR models and common approaches to protein identification. Our experiments on benchmark spectrometry query sets and large protein databases show that the IR models significantly outperform well-established methods in protein identification, particularly by improving precision in high-recall regions, where conventional approaches are weak.
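Keywords: computational linguistics; statistical information retrieval; probabilistic language modeling
Source: VideoLectures.NET (视频讲座网)
Last reviewed: 2020-05-22, 王淑红 (volunteer course editor)
Views: 113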
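
The IR formulation in the course description can be illustrated with a minimal sketch. It is not the lecture's implementation: the peptide names, confidence scores, protein profiles, the choice of Jelinek-Mercer smoothing, and the value of `lam` are all assumptions made for illustration. The query is a sparse vector of predicted peptides weighted by confidence, each protein is a "document" of theoretical peptides, and proteins are ranked either by cosine similarity (vector space model) or by smoothed query log-likelihood (language model).

```python
import math

def cosine_score(query, profile):
    """Cosine similarity between two sparse vectors (dict: peptide -> weight)."""
    dot = sum(w * profile.get(p, 0.0) for p, w in query.items())
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    p_norm = math.sqrt(sum(w * w for w in profile.values()))
    return dot / (q_norm * p_norm) if q_norm and p_norm else 0.0

def background_model(profiles):
    """Peptide distribution pooled over all protein profiles."""
    counts = {}
    for profile in profiles.values():
        for peptide, c in profile.items():
            counts[peptide] = counts.get(peptide, 0.0) + c
    total = sum(counts.values())
    return {peptide: c / total for peptide, c in counts.items()}

def query_likelihood(query, profile, background, lam=0.5):
    """Confidence-weighted log-likelihood of the query under a protein's
    peptide distribution, mixed with the background (Jelinek-Mercer
    smoothing; an assumed choice, not taken from the lecture)."""
    total = sum(profile.values())
    score = 0.0
    for peptide, conf in query.items():
        p_doc = profile.get(peptide, 0.0) / total if total else 0.0
        prob = (1 - lam) * p_doc + lam * background.get(peptide, 0.0)
        # Peptides unknown to every profile are skipped; they would
        # contribute the same term to each protein and not affect ranking.
        if prob > 0:
            score += conf * math.log(prob)
    return score

def rank_proteins(query, profiles, scorer):
    """Protein names sorted from best to worst score."""
    return sorted(profiles, key=lambda name: scorer(query, profiles[name]),
                  reverse=True)

# Hypothetical toy data: predicted peptides with confidence scores,
# and theoretical peptide profiles for three proteins.
query = {"PEPTIDEA": 0.9, "PEPTIDEB": 0.6, "PEPTIDEC": 0.1}
profiles = {
    "protein_1": {"PEPTIDEA": 1.0, "PEPTIDEB": 1.0},
    "protein_2": {"PEPTIDEC": 1.0, "PEPTIDED": 1.0},
    "protein_3": {"PEPTIDEB": 1.0, "PEPTIDED": 1.0, "PEPTIDEE": 1.0},
}
background = background_model(profiles)

vsm_ranking = rank_proteins(query, profiles, cosine_score)
lm_ranking = rank_proteins(
    query, profiles, lambda q, d: query_likelihood(q, d, background))
print(vsm_ranking)
print(lm_ranking)
```

On this toy data both scorers place `protein_1` first, since it contains the two highest-confidence peptides. On real data the two models can disagree, and that kind of difference is what the lecture sets out to compare.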