Unsupervised Learning of Data Linking Configuration
Course URL: http://videolectures.net/eswc2012_nikolov_data_linking/
Lecturer: Andriy Nikolov
Institution: The Open University (UK)
Date: 2012-07-04
Language: English
Abstract: As commonly accepted identifiers for data instances in semantic datasets (such as ISBN codes or DOI identifiers) are often not available, discovering links between overlapping datasets on the Web is generally realised through the use of fuzzy similarity measures. Configuring such measures, i.e. deciding which similarity function to apply to which data properties with which parameters, is often a non-trivial task that depends on the domain, ontological schemas, and formatting conventions in data. Existing solutions either rely on the user's knowledge of the data and the domain or on the use of machine learning to discover these parameters based on training data. In this paper, we present a novel approach to tackle the issue of data linking which relies on the unsupervised discovery of the required similarity parameters. Instead of using labeled training data, the method takes into account several desired properties which the distribution of output similarity values should satisfy. The method includes these features into a fitness criterion used in a genetic algorithm to establish similarity parameters that maximise the quality of the resulting linkset according to the considered properties. We show in experiments using benchmarks as well as real-world datasets that such an unsupervised method can reach the same levels of performance as manually engineered methods, and how the different parameters of the genetic algorithm and the fitness criterion affect the results for different datasets.
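The idea in the abstract, a genetic algorithm that searches for similarity parameters using a fitness criterion that needs no labeled links, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the toy records, the function names (`string_sim`, `find_links`, `pseudo_f`, `evolve`), the configuration encoding (one weight per attribute plus a threshold), and the fitness itself, which rewards link sets that are close to one-to-one and that cover the smaller dataset. The actual criterion and operators used by the author may differ.

```python
import random
from difflib import SequenceMatcher

# Hypothetical toy records; the attribute names are illustrative only.
SOURCE = [
    {"title": "Introduction to Graph Theory", "year": "1998"},
    {"title": "Principles of Database Systems", "year": "2003"},
    {"title": "Foundations of Logic Programming", "year": "1987"},
]
TARGET = [
    {"title": "Intro. to Graph Theory", "year": "1998"},
    {"title": "Principles of Data-base Systems", "year": "2003"},
    {"title": "Foundations of Logic Programming (2nd ed.)", "year": "1987"},
    {"title": "A Field Guide to Mushrooms", "year": "2010"},
]
ATTRS = ["title", "year"]


def string_sim(a, b):
    """Fuzzy string similarity in [0, 1] (difflib ratio, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_links(config, source, target, attrs):
    """config = [weight per attribute..., threshold]; returns (i, j) pairs."""
    weights, threshold = config[:-1], config[-1]
    total = sum(weights)
    result = []
    for i, s in enumerate(source):
        for j, t in enumerate(target):
            sims = [string_sim(s[a], t[a]) for a in attrs]
            score = sum(w * x for w, x in zip(weights, sims)) / total if total else 0.0
            if score >= threshold:
                result.append((i, j))
    return result


def pseudo_f(link_list, n_source, n_target):
    """Label-free fitness (an assumed heuristic): harmonic mean of
    'pseudo-precision' (links are one-to-one) and 'pseudo-recall'
    (the smaller dataset is covered)."""
    if not link_list:
        return 0.0
    precision = len({i for i, _ in link_list}) / len(link_list)
    recall = min(1.0, len(link_list) / min(n_source, n_target))
    return 2 * precision * recall / (precision + recall)


def evolve(source, target, attrs, generations=20, pop_size=20, seed=0):
    """Elitist genetic search over [weights..., threshold], all in [0, 1]."""
    rng = random.Random(seed)

    def fitness(config):
        links = find_links(config, source, target, attrs)
        return pseudo_f(links, len(source), len(target))

    population = [[rng.random() for _ in range(len(attrs) + 1)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]          # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            k = rng.randrange(len(child))                     # mutate one gene
            child[k] = min(1.0, max(0.0, child[k] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve(SOURCE, TARGET, ATTRS)
    print("config:", best)
    print("links:", find_links(best, SOURCE, TARGET, ATTRS))
```

Note that no ground-truth links appear anywhere in the fitness computation; the search is guided only by the shape of the resulting link set, which is the sense in which such a procedure is unsupervised.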
Keywords: semantic datasets; similarity functions; machine learning
Source: VideoLectures.NET
Last reviewed: 2020-09-18, chenxin