Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection
Course URL: https://videolectures.net/videos/licsb08_damoulas_pmm
Lecturer: Theodoros Damoulas
Institution: Not available.
Date: 2007-11-20
Language: English
Abstract: The problems of protein fold recognition and remote homology detection have recently attracted a great deal of interest as they represent challenging multi-feature multi-class problems for which modern pattern recognition methods achieve only modest levels of performance. As with many pattern recognition problems, there are multiple feature spaces or groups of attributes available, such as global characteristics like the amino-acid composition (C), predicted secondary structure (S), hydrophobicity (H), van der Waals volume (V), polarity (P), polarizability (Z), as well as attributes derived from local sequence alignment such as the Smith-Waterman scores. This raises the need for a classification method that is able to assess the contribution of these potentially heterogeneous object descriptors while utilizing such information to improve predictive performance. To that end, we offer a single multi-class kernel machine that informatively combines the available feature groups and, as is demonstrated in this paper, is able to provide the state-of-the-art in performance accuracy on the fold recognition problem. Furthermore, the proposed approach provides some insight by assessing the significance of recently introduced protein features and string kernels. The proposed method is well-founded within a Bayesian hierarchical framework and a variational Bayes approximation is derived which allows for efficient CPU processing times. Results: The best performance which we report on the SCOP PDB-40D benchmark data-set is a 70% accuracy by combining all the available feature groups from global protein characteristics but also including sequence-alignment features. We offer an 8% improvement on the best reported performance that combines binary SVM classifiers while at the same time reducing computational costs and assessing the predictive power of the various available features. Furthermore, we examine the performance of our methodology on the SCOP 1.53 benchmark data-set that simulates remote homology detection and examine the combination of various state-of-the-art string kernels that have recently been proposed.
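The idea of combining feature groups through a single kernel machine can be illustrated with a minimal sketch. It is not the authors' variational Bayes method: it only shows how one base kernel per feature group can be merged into a composite kernel by a convex combination. The RBF kernel choice, the random feature matrices, the group dimensions, the fixed weights, and the function names `rbf_kernel` and `combine_kernels` are all assumptions made for illustration; in the approach described above the combination weights would be inferred from data rather than set by hand.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Gram matrix of the RBF kernel exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def combine_kernels(kernels, weights):
    """Convex combination of base Gram matrices (illustrative helper)."""
    weights = np.asarray(weights, dtype=float)
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per base kernel")
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return sum(w * K for w, K in zip(weights, kernels))


# Hypothetical data: 6 proteins described by three feature groups.
# The dimensions below are placeholders, not the benchmark's real ones.
rng = np.random.default_rng(0)
n_proteins = 6
feature_groups = {
    "composition": rng.normal(size=(n_proteins, 20)),
    "secondary_structure": rng.normal(size=(n_proteins, 21)),
    "hydrophobicity": rng.normal(size=(n_proteins, 21)),
}

# One base kernel per feature group, with gamma scaled by the dimension.
base_kernels = [
    rbf_kernel(X, gamma=1.0 / X.shape[1]) for X in feature_groups.values()
]

# Hand-picked weights (an assumption); larger weight = larger contribution.
K_combined = combine_kernels(base_kernels, weights=[0.5, 0.3, 0.2])
print(K_combined.shape)
```

Because each base kernel is positive semidefinite and the weights are non-negative, the composite matrix is also a valid kernel and could be passed to any kernel classifier that accepts a precomputed Gram matrix.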
Keywords: protein fold recognition; remote homology detection; amino-acid composition
Source: videolectures
Data collected: 2025-02-24, yuhongrui
Last reviewed: 2025-02-24, yuhongrui