

Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis
Course URL: http://videolectures.net/kdd07_janssens_dhc/
Lecturer: Frizo Janssens
Institution: University of Leuven
Date: 2007-09-14
Language: English
Abstract: To unravel the concept structure and dynamics of the bioinformatics field, we analyze a set of 7401 publications from the Web of Science and MEDLINE databases, publication years 1981–2004. For delineating this complex, interdisciplinary field, a novel bibliometric retrieval strategy is used. Given that the performance of unsupervised clustering and classification of scientific publications is significantly improved by deeply merging textual contents with the structure of the citation graph, we proceed with a hybrid clustering method based on Fisher’s inverse chi-square. The optimal number of clusters is determined by a compound semiautomatic strategy comprising a combination of distance-based and stability-based methods. We also investigate the relationship between the number of Latent Semantic Indexing factors, the number of clusters, and clustering performance. The HITS and PageRank algorithms are used to determine representative publications in each cluster. Next, we develop a methodology for dynamic hybrid clustering of evolving bibliographic data sets. The same clustering methodology is applied to consecutive periods defined by time windows on the set, and in a subsequent phase chains are formed by matching and tracking clusters through time. Term networks for the eleven resulting cluster chains present the cognitive structure of the field. Finally, we provide a view on how much attention the bioinformatics community has devoted to the different subfields through time.
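Keywords: optimal number of clusters; latent semantic indexing; hybrid clustering method
Source: VideoLectures.NET
Last reviewed: 2020-05-31 by 吴雨秋 (Wu Yuqiu, volunteer course editor)
Views: 46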