Named Entity Mining from Click-Through Data Using Weakly Supervised Latent Dirichlet Allocation
Course URL: http://videolectures.net/kdd09_yang_nemctduwslda/
Lecturers: Shuang-Hong Yang; Gu Xu
Institution: Twitter, Inc.
Date: 2009-09-14
Language: English
Course description: This paper addresses Named Entity Mining (NEM), in which knowledge about named entities such as movies, games, and books is mined from a huge amount of data. NEM is potentially useful in many applications, including web search, online advertising, and recommender systems. The task poses three challenges: finding a suitable data source, coping with the ambiguities of named entity classes, and incorporating the necessary human supervision into the mining process. This paper proposes conducting NEM by using click-through data collected at a web search engine, employing a topic model that generates the click-through data, and learning the topic model with weak supervision from humans. Specifically, it characterizes each named entity by its associated queries and URLs in the click-through data. It uses the topic model to resolve ambiguities of named entity classes by representing the classes as topics. It employs a method, referred to as Weakly Supervised Latent Dirichlet Allocation (WS-LDA), to accurately learn the topic model from partially labeled named entities. Experiments on large-scale click-through data containing over 1.5 billion query-URL pairs show that the proposed approach can conduct very accurate NEM and significantly outperforms the baseline.
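The description says each named entity is characterized by its associated queries and URLs in the click-through data. A minimal sketch of that representation step follows; it is an illustration, not the authors' implementation. The record format `(query, url, clicks)`, the function name `build_entity_documents`, the seed entity list, the `#` placeholder used to mask the entity in a query, and the example URLs are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_entity_documents(click_log, entities):
    """Group click-through records into one bag of features per entity.

    click_log: iterable of (query, url, clicks) tuples (hypothetical format).
    entities:  iterable of known entity strings (hypothetical seed list).
    """
    docs = defaultdict(Counter)
    for query, url, clicks in click_log:
        q = query.lower()
        for entity in entities:
            e = entity.lower()
            if e in q:  # naive substring match, good enough for a sketch
                # Query context: the query with the entity masked out.
                context = " ".join(q.replace(e, "#").split())
                docs[entity]["ctx:" + context] += clicks
                docs[entity]["url:" + url] += clicks
    return docs


# Toy click-through log (made-up data for illustration only).
click_log = [
    ("harry potter trailer", "movies.example.com", 3),
    ("harry potter book review", "books.example.com", 2),
    ("halo 3 cheats", "games.example.com", 5),
]
docs = build_entity_documents(click_log, ["Harry Potter", "Halo 3"])
print(docs["Harry Potter"])
```

Each resulting bag of contexts and URLs could then play the role of a "document" for a topic model in which topics stand for entity classes; the weakly supervised learning step itself (WS-LDA) is not shown here.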
Keywords: named entity mining; data source; Dirichlet allocation
Source: VideoLectures.NET
Last reviewed: 2020-06-20:zyk