0


使用字符串内核进行大规模学习

Large Scale Learning with String Kernels
课程网址: http://videolectures.net/eml07_sonnenburg_lsl/  
主讲教师: Sören Sonnenburg
开课单位: 弗劳恩霍夫智能分析与信息系统研究所
开课时间: 2007-12-29
课程语种: 英语
中文简介:
在生物信息学和文本处理的应用中,例如拼接站点识别和垃圾邮件检测,可获得并且需要大量的训练序列以在分类或回归任务上实现足够高的预测性能。尽管基于内核的方法(例如SVM)经常实现最先进的结果,但是训练和评估时间可能非常大。当单核计算时间已经是线性的(例如输入序列)时,似乎很难实现进一步的加速。在这项工作中,我们描述了一种使用稀疏数据结构计算字符串内核的线性组合的有效技术,例如显式映射,排序数组和后缀尝试,树或数组[5]。由于计算内核的线性组合构成了SVM训练和评估的主要部分,因此加快计算速度至关重要。考虑到最近提出并成功使用的线性时间字符串内核,如Spectrum内核[2]和加权频谱内核[3],我们发现可以分别以7倍和60倍的速度加速SVM训练,同时需要相当少的内存。我们的方法允许我们在大达1000万个序列的集合上训练字符串内核SVM [4]。此外,使用这些技术,对新序列的评估通常要快几千倍,这使我们能够将分类器应用于具有70亿个测试实例的基因组大小的数据集[6]。所提出的算法在我们的机器学习工具箱SHOGUN中实现。 \\参考文献:[1] D. Gusfield。字符串,树和序列的算法。剑桥大学出版社,1997。[2] C. Leslie,E。Eskin和W. S. Noble。谱内核:SVM蛋白分类的字符串内核。在RB Altman,AK Dunker,L。Hunter,K。Lauderdale和TE Klein,编辑,太平洋生物计算太平洋研讨会论文集,第564-575页,Kaua'i,夏威夷,2002年。[3]G.R¨atsch和S. Sonnenburg。准确的剪接位点预测秀丽隐杆线虫,第277-298页。麻省理工学院关于计算分子生物学的新闻系列。麻省理工学院出版社,2004年。[4] S. Sonnenburg,P。Philips,G。Schweikert和G.R¨atsch。使用支持向量机准确拼接点预测。 BMC Bioinformatics,8,2007。[5] S. Sonnenburg,G.R¨atsch和K. Rieck。使用字符串内核进行大规模学习。在L. Bottou,O。Chapelle,D.DeCoste和J.Weston,编辑,大规模核机器,神经信息处理系列,第73-104页。 MIT Press,Cambridge,MA,2007。[6] S. Sonnenburg,A。Zien和G.R¨atsch。 ARTS:准确识别转录起始于人类。生物信息学,22(14):e472-480,2006。
课程简介: In applications of bioinformatics and text processing, such as splice site recognition and spam detection, large amounts of training sequences are available and needed to achieve sufficiently high prediction performance on classification or regression tasks. Although kernel-based methods such as SVMs often achieve state-of-the-art results, training and evaluation times may be prohibitively large. When single kernel computation time is already linear (w.r.t. the input sequences) it seems difficult to achieve further speed ups. In this work we describe an efficient technique for computing linear combinations of string kernels using sparse data structures such as explicit maps, sorted arrays and suffix tries, trees or arrays [5]. As computing linear combinations of kernels make up the dominant part of SVM training and evaluation, speeding up their computation is essential. Considering the recently proposed and successfully used linear time string kernels, like the Spectrum kernel [2] and the Weighted Spectrum kernel [3] we show that one can accelerate SVM training by factors of 7 and 60 times, respectively, while requiring considerably less memory. Our method allows us to train string kernel SVMs on sets as large as 10 million sequences [4]. Moreover, using these techniques the evaluation on new sequences is often several thousand times faster, allowing us to apply the classifiers on genome-sized data sets with seven billion test examples [6]. The presented algorithms are implemented in our Machine Learning toolbox SHOGUN fo References: [1] D. Gusfield. Algorithms on strings, trees, and sequences. Cambridge University Press, 1997. [2] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, pages 564–575, Kaua’i, Hawaii, 2002. [3] G. R¨atsch and S. Sonnenburg. Accurate Splice Site Prediction for Caenorhabditis Elegans, pages 277–298. MIT Press series on Computational Molecular Biology. MIT Press, 2004. [4] S. Sonnenburg, P. Philips, G. Schweikert, and G. R¨atsch. Accurate splice site prediction using support vector machines. BMC Bioinformatics, 8, 2007. [5] S. Sonnenburg, G. R¨atsch, and K. Rieck. Large-scale learning with string kernels. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, Neural Information Processing Series, pages 73–104. MIT Press, Cambridge, MA, 2007. [6] S. Sonnenburg, A. Zien, and G. R¨atsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
关 键 词: 生物信息学; 文本处理; 稀疏数据结构
课程来源: 视频讲座网
最后编审: 2021-12-22:liyy
阅读次数: 48