

Mining Peculiar Compositions of Frequent Substrings from Sparse Text Data Using Background Texts
Course URL: http://videolectures.net/ecmlpkdd09_ikeda_mpcfsstdubt/
Lecturer: Daisuke Ikeda
Institution: University of Tokyo
Date: 2009-10-20
Language: Japanese
Course description: We consider mining unusual patterns from a text T. Unlike existing methods, which assume probabilistic models and use simple estimation methods, we employ a set B of background texts in addition to T, and we use compositions w = xy of strings x and y as patterns. A string w is peculiar if there exist x and y such that w = xy, each of x and y is more frequent in B than in T, and, conversely, w = xy is more frequent in T than in B. The frequency of xy in T is very small because x and y are infrequent in T, but xy is relatively abundant in T compared to xy in B. Despite these complex conditions for peculiar compositions, we develop a fast algorithm that finds them using the suffix tree. Experiments on DNA sequences show the scalability of our algorithm, which is due to our pruning techniques, and the superiority of the concept of the peculiar composition.
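The definition of a peculiar composition can be illustrated with a small brute-force sketch. This is not the suffix-tree algorithm from the lecture. It only restates the three conditions (x and y each more frequent in B than in T, and w = xy more frequent in T than in B) in code. Several points are assumptions made for illustration: "more frequent" is read as a higher relative frequency (occurrences divided by the number of possible start positions), the background set B is given as one string, and the names `substring_counts`, `rel_freq`, `peculiar_compositions`, and `max_len` are hypothetical.

```python
from collections import Counter


def substring_counts(text, max_len):
    """Count every substring of length 1..max_len, overlaps included."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            counts[text[i:j]] += 1
    return counts


def rel_freq(counts, s, text_len):
    """Occurrences of s divided by the number of positions where s could start.

    Assumption: this normalization stands in for the frequency measure,
    which the abstract does not spell out.
    """
    positions = text_len - len(s) + 1
    return counts[s] / positions if positions > 0 else 0.0


def peculiar_compositions(target, background, max_len=6):
    """Return all (x, y) such that w = xy satisfies the three conditions.

    Brute force over every substring w of the target and every split point;
    a stand-in for the suffix-tree search, usable only on short texts.
    """
    t_counts = substring_counts(target, max_len)
    b_counts = substring_counts(background, max_len)
    t_len, b_len = len(target), len(background)
    found = set()
    for w in t_counts:
        if len(w) < 2:
            continue
        # Condition on w: more frequent in T than in B.
        if rel_freq(t_counts, w, t_len) <= rel_freq(b_counts, w, b_len):
            continue
        for k in range(1, len(w)):
            x, y = w[:k], w[k:]
            # Conditions on x and y: each more frequent in B than in T.
            if (rel_freq(b_counts, x, b_len) > rel_freq(t_counts, x, t_len)
                    and rel_freq(b_counts, y, b_len) > rel_freq(t_counts, y, t_len)):
                found.add((x, y))
    return found


# "a" and "b" are each rarer in the target than in the background,
# yet "ab" occurs only in the target.
print(peculiar_compositions("cccabccc", "acbcacbc", max_len=2))  # {('a', 'b')}
```

The sketch enumerates a number of substrings proportional to the text length times `max_len`, so it is meant for reading the definition, not for DNA-scale inputs; the scalability reported in the abstract comes from the suffix tree and pruning.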
Keywords: assumed probabilistic models; simple estimation; background texts
Source: VideoLectures.NET
Last reviewed: 2019-03-27 (lxf)
Views: 37