DistLODStats: Distributed Computation of RDF Dataset Statistics
Course URL: http://videolectures.net/iswc2018_sejdiu_distlodstats_distributed...
Lecturer: Gezim Sejdiu
Institution: Institute of Computer Science, University of Bonn
Date: 2018-11-22
Language: English
Abstract: Over the last few years, the Semantic Web has been growing steadily. Today, we count more than 10,000 datasets made available online following Semantic Web standards. Nevertheless, many applications, such as data integration, search, and interlinking, may not take full advantage of the data without a priori statistical information about its internal structure and coverage. There are already a number of tools that offer such statistics, providing basic information about RDF datasets and vocabularies. However, these tools usually show severe performance deficiencies once the dataset size grows beyond the capabilities of a single machine. In this paper, we introduce a software library for statistical calculations on large RDF datasets, which scales out to clusters of machines. More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. The preliminary results show that our distributed approach improves upon the previous centralized approach we compare against and provides approximately linear horizontal scale-up. The approach is extensible beyond the 32 default criteria, is integrated into the larger SANSA framework, and is employed in at least four major usage scenarios beyond the SANSA community.
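Keywords: Semantic Web standards; large RDF datasets; distributed in-memory approach for different statistical criteria; SANSA framework
Source: VideoLectures.NET
Data collected: 2022-12-23:cyh
Last reviewed: 2023-05-15:cyh
Views: 9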