SimComp: A Hybrid Soft Clustering of Metagenome Reads
Course URL: http://videolectures.net/prib2010_prabhakara_simc/
Lecturer: Shruthi Prabhakara
Institution: Pennsylvania State University
Date: 2010-10-14
Language: English
Abstract: A major challenge facing metagenomics is the development of tools that characterize the functional and taxonomic content of vast amounts of short metagenome reads. In this paper, we present SimComp, a two-pass semi-supervised algorithm for soft clustering of short metagenome reads that is a hybrid of comparative and composition-based methods. In the first pass, a comparative analysis of the metagenome reads using BLASTx extracts reference sequences from within the metagenome to form an initial set of seeded clusters. Reads that have a significant match to the database are clustered by their phylogenetic provenance. In the second pass, the remaining fraction of reads is characterized by species-specific, composition-based characteristics. SimComp groups the reads into overlapping clusters, each with its own read leader. We make no assumptions about the taxonomic distribution of the dataset. The overlap between clusters accommodates the challenges posed by the nature of metagenomic data. The resulting cluster leaders can be used as an accurate estimate of the phylogenetic composition of the metagenomic dataset. Our method enriches the dataset into a small number of clusters, while accurately assigning fragments as small as 100 base pairs.
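The two passes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which composition features, similarity measure, or overlap rule SimComp uses, so the k-mer frequency profile, the cosine similarity, the `threshold` parameter, and the choice of the first matched read as cluster leader are all assumptions made for illustration. The `blast_hits` dictionary is a hypothetical stand-in for the result of the BLASTx comparison in the first pass.

```python
from collections import Counter
from itertools import product
import math

K = 2  # assumed k-mer length, chosen small for readability (not from the abstract)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Return the normalized k-mer frequency vector of a read."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[m] / total for m in KMERS]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def soft_assign(read, leaders, threshold):
    """Return every cluster whose leader is similar enough to the read.

    The result may hold several clusters (overlap) or none.
    """
    profile = kmer_profile(read)
    return sorted(
        name for name, leader in leaders.items()
        if cosine(profile, kmer_profile(leader)) >= threshold
    )


def two_pass(reads, blast_hits, threshold=0.9):
    """Seed clusters from database matches, then place the rest by composition.

    reads:      dict mapping read id -> nucleotide sequence
    blast_hits: dict mapping read id -> taxon label (hypothetical pass-1 output)
    """
    clusters, leaders = {}, {}
    # Pass 1: reads with a significant database match seed the clusters.
    for rid, taxon in blast_hits.items():
        clusters.setdefault(taxon, []).append(rid)
        leaders.setdefault(taxon, reads[rid])  # assumed: first match is the leader
    # Pass 2: remaining reads join every cluster whose leader they resemble.
    unassigned = []
    for rid, seq in reads.items():
        if rid in blast_hits:
            continue
        hits = soft_assign(seq, leaders, threshold)
        if hits:
            for taxon in hits:
                clusters[taxon].append(rid)
        else:
            unassigned.append(rid)
    return clusters, unassigned


# Toy data, invented for this example only.
reads = {
    "r1": "ATATATATATAT",
    "r2": "GCGCGCGCGCGC",
    "r3": "ATATATATAT",
    "r4": "ATATATGCGCGC",
    "r5": "AAAAAAAAAA",
}
blast_hits = {"r1": "AT_rich", "r2": "GC_rich"}
clusters, unassigned = two_pass(reads, blast_hits, threshold=0.6)
print(clusters)
print(unassigned)
```

In this toy run the mixed read `r4` resembles both leaders and therefore lands in both clusters, which is the sense in which the clustering is soft; a read that resembles no leader stays unassigned rather than being forced into a cluster.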
Keywords: metagenomics; semi-supervised algorithm; soft clustering
Source: VideoLectures.NET
Last reviewed: 2019-09-14:lxf