SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
Course URL: http://videolectures.net/kdd2014_weiler_signi_trend/
Lecturer: Michael Weiler
Institution: Ludwig Maximilian University of Munich
Course date: 2014-10-07
Language: English
Abstract: Social media such as Twitter and weblogs are a popular source of live textual data. Much of this popularity is due to the speed at which the data arrives, and there have been a number of global events, such as the Arab Spring, in which Twitter is reported to have had a major influence. However, existing methods for emerging topic detection can often detect only events of global magnitude, such as natural disasters or celebrity deaths, and can only monitor user-selected keywords or operate on a curated set of hashtags. Interesting emerging topics may, however, be of much smaller magnitude and may involve a combination of two or more words that are not themselves unusually hot at the time. Our contributions to the detection of emerging trends are threefold. First, drawing on experience from outlier detection, we propose a significance measure that can be used to detect emerging topics early, long before they become "hot tags". Second, by using hash tables in a heavy-hitters-type algorithm for establishing a noise baseline, we show how to track even all keyword pairs using only a fixed amount of memory. Finally, we aggregate the detected co-trends into larger topics using clustering approaches, since a single event will often cause multiple word combinations to trend at the same time.
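The first two contributions in the abstract (a significance score against a noise baseline, kept in hash tables of fixed size) can be illustrated with a small sketch. This is not the authors' implementation. It assumes an exponentially weighted moving mean and variance as the baseline, a z-score-like significance, and a max-per-bucket update rule; the class name `HashedSignificance` and the parameters `num_buckets`, `num_hashes`, `alpha`, and `beta` are hypothetical choices made for illustration. The clustering step is omitted.

```python
import hashlib
import math


class HashedSignificance:
    """Illustrative fixed-memory tracker; all parameters are assumptions."""

    def __init__(self, num_buckets=1024, num_hashes=3, alpha=0.1, beta=0.001):
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.alpha = alpha  # smoothing weight of the moving statistics (assumed)
        self.beta = beta    # bias term that damps noise on rare keys (assumed)
        # Memory use is fixed by num_buckets, not by the number of distinct keys.
        self.mean = [0.0] * num_buckets
        self.var = [0.0] * num_buckets

    def _buckets(self, key):
        # Derive num_hashes bucket indices from salted SHA-1 digests of the key.
        indices = []
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            indices.append(int.from_bytes(digest[:8], "big") % self.num_buckets)
        return indices

    def score(self, key, freq):
        # Buckets store the maximum over colliding keys, so they can only
        # overestimate; use the bucket with the smallest moving mean.
        b = min(self._buckets(key), key=lambda i: self.mean[i])
        baseline = max(self.mean[b], self.beta)
        return (freq - baseline) / (math.sqrt(self.var[b]) + self.beta)

    def process_epoch(self, freqs):
        """Score one epoch of {key: relative frequency}, then update baselines."""
        scores = {key: self.score(key, f) for key, f in freqs.items()}
        observed = [0.0] * self.num_buckets
        for key, f in freqs.items():
            for b in self._buckets(key):
                observed[b] = max(observed[b], f)
        # Untouched buckets are updated with 0 so stale baselines decay.
        for b, x in enumerate(observed):
            delta = x - self.mean[b]
            self.mean[b] += self.alpha * delta
            self.var[b] = (1.0 - self.alpha) * (self.var[b] + self.alpha * delta * delta)
        return scores


if __name__ == "__main__":
    tracker = HashedSignificance()
    pair = ("earthquake", "tokyo")  # a keyword pair is tracked like any other key
    for _ in range(50):
        tracker.process_epoch({pair: 0.01})   # steady background frequency
    print(tracker.process_epoch({pair: 0.2}))  # sudden jump yields a large score
```

Because keys are hashed into a fixed number of buckets, single words and word pairs share the same tables, and memory does not grow with the vocabulary. A key would be flagged as trending when its score exceeds a chosen threshold, which is left to the caller here.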
Keywords: trend detection; significance thresholds
Source: VideoLectures.NET
Data collected: 2020-12-03 (zyk)
Last reviewed: 2020-12-03 (zyk)
Views: 37