Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection
Course URL: http://videolectures.net/licsb08_damoulas_pmm/
Lecturer: Theodoros Damoulas
Institution: University of Glasgow
Date: 2008-04-17
Language: English
Course description: The problems of protein fold recognition and remote homology detection have recently attracted a great deal of interest because they are challenging multi-feature, multi-class problems on which modern pattern recognition methods achieve only modest performance. As with many pattern recognition problems, multiple feature spaces or groups of attributes are available: global characteristics such as amino-acid composition (C), predicted secondary structure (S), hydrophobicity (H), van der Waals volume (V), polarity (P), and polarizability (Z), as well as attributes derived from local sequence alignment, such as Smith-Waterman scores. This calls for a classification method that can assess the contribution of these potentially heterogeneous object descriptors while using that information to improve predictive performance. To that end, we offer a single multi-class kernel machine that informatively combines the available feature groups and, as is demonstrated in this paper, achieves state-of-the-art accuracy on the fold recognition problem. Furthermore, the proposed approach provides some insight by assessing the significance of recently introduced protein features and string kernels. The proposed method is well-founded within a Bayesian hierarchical framework, and a variational Bayes approximation is derived that allows for efficient CPU processing times. Results: The best performance we report on the SCOP PDB-40D benchmark data-set is 70% accuracy, obtained by combining all the available feature groups: the global protein characteristics together with the sequence-alignment features. We offer an 8% improvement on the best reported performance that combines binary SVM classifiers, while at the same time reducing computational costs and assessing the predictive power of the various available features. Furthermore, we examine the performance of our methodology on the SCOP 1.53 benchmark data-set, which simulates remote homology detection, and examine the combination of various recently proposed state-of-the-art string kernels.
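The idea of combining feature groups through kernels can be illustrated with a minimal sketch. It is not the lecture's method: the weights here are fixed by hand rather than inferred with the variational Bayes approximation, the RBF base kernel is an assumed choice, and the feature matrices, their dimensions, and the function names `rbf_kernel` and `combine_kernels` are hypothetical. It only shows how one composite kernel can be formed as a convex combination of per-feature-group kernels.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel between rows of X and rows of Y (assumed base kernel)."""
    # Squared Euclidean distance between every row of X and every row of Y.
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))

def combine_kernels(kernels, weights):
    """Convex combination of base kernel matrices, one per feature group."""
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1 or len(w) != len(kernels):
        raise ValueError("need exactly one weight per kernel")
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()  # normalise so the weights sum to one
    return sum(wi * K for wi, K in zip(w, kernels))

# Hypothetical data: 6 proteins described by two feature groups.
rng = np.random.default_rng(0)
composition = rng.random((6, 20))      # stand-in for amino-acid composition (C)
hydrophobicity = rng.random((6, 21))   # stand-in for hydrophobicity (H)

K_c = rbf_kernel(composition, composition, gamma=0.5)
K_h = rbf_kernel(hydrophobicity, hydrophobicity, gamma=0.5)

# Hand-picked weights; a larger weight means that group contributes more.
K = combine_kernels([K_c, K_h], weights=[0.7, 0.3])
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, `K` can be passed to any kernel classifier. In a method that learns the weights, their fitted values indicate how much each feature group contributes to the prediction.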
Keywords: protein folding; homology detection; multi-class kernel machine
Source: VideoLectures.NET
Last reviewed: 2019-05-12:lxf
Views: 68