Efficient Clustering of Short Messages into General Domains |
|
Course URL: | https://videolectures.net/videos/icwsm2013_tsur_general_domains |
Lecturer: | Oren Tsur |
Offering institution: | Not available. |
Course date: | 2014-04-03 |
Course language: | English |
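Summary (translated from Chinese): | Growing activity on social networks shows up mainly as an ever-increasing volume of status updates, or microblog posts. This massive stream of updates underscores the need to cluster short messages accurately and efficiently at scale. Because short messages are sparse, traditional clustering techniques are both inaccurate and inefficient on them. The paper proposes an accurate and efficient algorithm for clustering Twitter tweets. The clustering task is split into two distinct tasks/stages: (1) batch clustering of user-annotated data, and (2) online clustering of a tweet stream. The first stage relies on the "tagging" habit common in social media streams (e.g. hashtags), so the algorithm can bootstrap on hashtags to cluster a large pool of hashtagged tweets. The stable clusters obtained in the first stage are suited to online clustering of a stream of (mostly) untagged messages. The results are evaluated against a gold-standard classification and validated with several clustering evaluation measures (information-theoretic, paired, F, and greedy). The algorithm is compared with a number of other clustering algorithms and various types of feature sets. The results show that the proposed algorithm is both accurate and efficient and can readily be used for large-scale clustering of sparse messages, because the heavy lifting is done on a sublinear number of documents. |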
Course description: | The ever-increasing activity in social networks is mainly manifested by a growing stream of status updates, or microblogging. The massive stream of updates emphasizes the need for accurate and efficient clustering of short messages on a large scale. Applying traditional clustering techniques is both inaccurate and inefficient due to sparseness. This paper presents an accurate and efficient algorithm for clustering Twitter tweets. We break the clustering task into two distinct tasks/stages: (1) batch clustering of user-annotated data, and (2) online clustering of a stream of tweets. In the first stage we rely on the habit of 'tagging', common in social media streams (e.g. hashtags); thus the algorithm can bootstrap on the tags for clustering a large pool of hashtagged tweets. The stable clusters achieved in the first stage lend themselves to online clustering of a stream of (mostly) tagless messages. We evaluate our results against a gold-standard classification and validate the results by employing multiple clustering evaluation measures (information theoretic, paired, F and greedy). We compare our algorithm to a number of other clustering algorithms and various types of feature sets. Results show that the algorithm presented is both accurate and efficient and can be easily used for large-scale clustering of sparse messages, as the heavy lifting is achieved on a sublinear number of documents. |
Keywords: | Twitter tweet clustering algorithm; social networks; sublinear number |
Course source: | VideoLectures.NET |
Data collected: | 2025-05-08:yuhongrui |
Last reviewed: | 2025-05-08:yuhongrui |
Views: | 4 |
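
The two-stage idea in the course description can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the record does not specify the features or the clustering method, so this sketch assumes bag-of-words vectors, cosine similarity, and a greedy single-pass clustering with a fixed threshold as stand-ins. All names (`hashtag_documents`, `batch_cluster`, `assign`, `threshold`) are hypothetical. Stage 1 pools the words of tweets that share a hashtag and clusters those pooled documents; stage 2 assigns an incoming, possibly untagged tweet to the most similar cluster.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"#?\w+")


def tokens(text):
    """Lowercase the text and split it into words and #hashtags."""
    return TOKEN.findall(text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hashtag_documents(tweets):
    """Stage 1a: pool the words of all tweets sharing a hashtag."""
    docs = defaultdict(Counter)
    for tweet in tweets:
        toks = tokens(tweet)
        words = [t for t in toks if not t.startswith("#")]
        for tag in (t for t in toks if t.startswith("#")):
            docs[tag].update(words)
    return docs


def batch_cluster(docs, threshold=0.3):
    """Stage 1b: greedy single-pass clustering of hashtag documents.

    Assumed stand-in for the batch clustering step: each hashtag joins
    the most similar existing cluster if the similarity reaches the
    threshold, otherwise it starts a new cluster.
    """
    clusters = []
    for tag in sorted(docs):
        vec = docs[tag]
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "tags": [tag]})
        else:
            best["centroid"].update(vec)  # centroid is a running term-count sum
            best["tags"].append(tag)
    return clusters


def assign(tweet, clusters):
    """Stage 2: index of the most similar cluster, or None if no overlap."""
    vec = Counter(t for t in tokens(tweet) if not t.startswith("#"))
    sims = [cosine(vec, c["centroid"]) for c in clusters]
    if not sims or max(sims) == 0.0:
        return None
    return sims.index(max(sims))
```

In this sketch the expensive step runs over one pooled document per hashtag rather than over every tweet, which is one way to read the description's remark that the heavy lifting is done on a sublinear number of documents; each streamed tweet then costs only one similarity computation per cluster.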