On Community Outliers and their Efficient Detection in Information Networks
Course URL: http://videolectures.net/kdd2010_gao_ocoe/
Instructor: Jing Gao
Institution: University at Buffalo
Date: 2010-10-01
Language: English
Course description: Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called "information networks"), closely related objects that share the same properties or interests form a community. For example, a community in the blogosphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person who is friends with many rich people, even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (or, in a more positive sense, interesting points or rising stars), and then shows that well-known baseline approaches that ignore links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model to both synthetic data and DBLP data sets, and the results demonstrate the importance of this concept, as well as the effectiveness and efficiency of the proposed approach.
Keywords: networked data; community outliers; probabilistic model
Source: VideoLectures.NET
Last reviewed: 2019-05-11:lxf
Views: 111
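
The course description mentions a mixture of normal communities plus an outlier component, with links and attributes modeled jointly. The following is a minimal Python sketch of that general idea and is not the authors' implementation. Everything in it is an assumption made for illustration: the function names `farthest_first_seeds` and `detect_community_outliers`, the unit-variance Gaussian community model, the link weight `lam`, the fixed threshold `outlier_cost`, and the simple coordinate-wise label updates used in place of the inference procedure from the talk.

```python
import numpy as np


def farthest_first_seeds(X, K):
    """Pick K well-separated rows of X as initial community centers (assumed heuristic)."""
    seeds = [0]
    dist = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(K - 1):
        nxt = int(np.argmax(dist))
        seeds.append(nxt)
        dist = np.minimum(dist, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[seeds].astype(float)


def detect_community_outliers(X, W, K, lam=1.0, outlier_cost=6.0, n_iter=20):
    """Label each node with a community in 1..K, or 0 for "outlier".

    X: (n, d) node attributes.
    W: (n, n) symmetric, non-negative link weights with a zero diagonal.
    lam and outlier_cost are hypothetical tuning parameters.
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    means = farthest_first_seeds(X, K)

    # Start from a plain nearest-center assignment with no outliers.
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    z = sq.argmin(axis=1) + 1

    for _ in range(n_iter):
        # Re-estimate each community mean from its current (non-outlier) members.
        for k in range(1, K + 1):
            members = X[z == k]
            if len(members):
                means[k - 1] = members.mean(axis=0)

        changed = False
        for i in range(len(X)):
            # Attribute term: negative log-likelihood under a unit-variance
            # Gaussian, up to an additive constant.
            data_cost = 0.5 * ((X[i] - means) ** 2).sum(axis=1)
            # Link term: total weight of neighbors currently in each community.
            link_reward = np.array(
                [W[i, z == k].sum() for k in range(1, K + 1)]
            )
            cost = data_cost - lam * link_reward
            best = int(np.argmin(cost))
            # If even the best community is too costly, call the node an outlier.
            new_label = best + 1 if cost[best] < outlier_cost else 0
            if new_label != z[i]:
                z[i] = new_label
                changed = True
        if not changed:
            break
    return z
```

In this sketch a node is flagged only when no community explains its attributes well enough, even after its links are taken into account. A node whose attributes sit between two communities but whose friends all belong to one of them is the kind of case it is meant to catch; the threshold `outlier_cost` controls how readily nodes are flagged and would need tuning on real data.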