Scalable Text and Link Analysis with Mixed-Topic Link Models |
|
Course URL: | http://videolectures.net/kdd2013_zhu_link_models/ |
Lecturer: | Yaojia Zhu |
Institution: | University of New Mexico |
Date: | 2013-09-27 |
Language: | English |
Abstract: | Many data sets contain rich information about objects, as well as pairwise relations between them. For instance, in networks of websites, scientific papers, and other documents, each node has content consisting of a collection of words, as well as hyperlinks or citations to other nodes. In order to perform inference on such data sets, and make predictions and recommendations, it is useful to have models that are able to capture the processes that generate the text at each node and the links between them. In this paper, we combine classic ideas in topic modeling with a variant of the mixed-membership block model recently developed in the statistical physics community. The resulting model has the advantage that its parameters, including the mixture of topics of each document and the resulting overlapping communities, can be inferred with a simple and scalable expectation-maximization algorithm. We test our model on three data sets, performing unsupervised topic classification and link prediction. For both tasks, our model outperforms several existing state-of-the-art methods, achieving higher accuracy with significantly less computation, analyzing a data set with 1.3 million words and 44 thousand links in a few minutes. |
Keywords: | data sets; statistical physics |
Source: | VideoLectures.NET |
Data collected: | 2021-01-06:zyk |
Last reviewed: | 2021-01-06:zyk |
Views: | 49 |