TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data |
|
Course URL: | http://videolectures.net/kdd2018_ma_texttruth_data/ |
Lecturer: | Fenglong Ma |
Institution: | Department of Computer Science and Engineering, University at Buffalo |
Date: | 2018-11-23 |
Language: | English |
Course description: | Truth discovery has attracted increasing attention due to its ability to distill trustworthy information from noisy multi-sourced data without any supervision. However, most existing truth discovery methods are designed for structured data and cannot meet the strong need to extract trustworthy information from raw text data, because text data has its own unique characteristics. The major challenges of inferring true information from text data stem from the multifactorial property of text answers (i.e., an answer may contain multiple key factors) and the diversity of word usage (i.e., different words may have the same semantic meaning). To tackle these challenges, in this paper, we propose a novel truth discovery method, named “TextTruth”, which jointly groups the keywords extracted from the answers to a specific question into multiple interpretable factors and infers the trustworthiness of both answer factors and answer providers. After that, the answers to each question can be ranked based on the estimated trustworthiness of factors. The proposed method works in an unsupervised manner and thus can be applied to various application scenarios that involve text data. Experiments on three real-world datasets show that the proposed TextTruth model can accurately select trustworthy answers, even when these answers are formed by multiple factors. |
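Keywords: | extraction from multi-sourced data; structured data; diversity of word usage; real-world datasets |
Source: | VideoLectures.NET |
Data collected: | 2023-01-27:cyh |
Last reviewed: | 2023-01-27:cyh |
Views: | 32 |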
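
The pipeline outlined in the course description (group keywords into factors, estimate the trustworthiness of factors and of answer providers, then rank the answers) can be illustrated with a toy sketch. This is not the TextTruth model itself: the record does not give the model's details, so the sketch swaps in a simple iterative weighted-voting scheme. The keyword-to-factor map `FACTOR_OF`, the helper names `to_factors` and `rank_answers`, the 0.5 scoring threshold, and the sample answers are all hypothetical and exist only for illustration.

```python
# Toy illustration only: NOT the TextTruth model.
# FACTOR_OF is a hypothetical, hand-written stand-in for the learned grouping
# of keywords into factors (different words, same semantic meaning).
FACTOR_OF = {
    "fever": "fever", "pyrexia": "fever",
    "cough": "cough", "coughing": "cough",
    "rash": "rash",
}


def to_factors(text):
    """Map an answer's known keywords to factor labels; other words are ignored."""
    return {FACTOR_OF[w] for w in text.lower().split() if w in FACTOR_OF}


def rank_answers(answers, iterations=20):
    """Rank providers by a simple weighted-voting estimate of factor trust.

    answers: dict mapping provider id -> free-text answer to one question.
    Returns provider ids ordered from most to least trustworthy answer.
    """
    factor_sets = {p: to_factors(t) for p, t in answers.items()}
    factors = sorted(set().union(*factor_sets.values()))
    reliability = {p: 1.0 for p in factor_sets}
    trust = {f: 0.0 for f in factors}

    for _ in range(iterations):
        # Factor trust: reliability-weighted share of providers citing the factor.
        total = sum(reliability.values()) or 1.0
        trust = {
            f: sum(reliability[p] for p, fs in factor_sets.items() if f in fs) / total
            for f in factors
        }
        # Provider reliability: mean trust of the factors the provider cited.
        reliability = {
            p: (sum(trust[f] for f in sorted(fs)) / len(fs) if fs else 0.0)
            for p, fs in factor_sets.items()
        }

    # Reward factors with trust above 0.5 and penalize those below it, so an
    # answer built from several trusted factors can outrank a shorter one.
    scores = {
        p: sum(trust[f] - 0.5 for f in sorted(fs)) for p, fs in factor_sets.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    sample = {
        "u1": "fever cough",
        "u2": "pyrexia coughing",
        "u3": "fever",
        "u4": "rash",
    }
    print(rank_answers(sample))
```

In this sketch, answers that use different words for the same factor ("pyrexia" and "fever") support each other, and an answer covering several well-supported factors scores higher than one covering a single factor, which loosely mirrors the multifactor and word-diversity challenges named in the description.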