Object Identification by Statistical Methods
Course URL: http://videolectures.net/solomon_lenz_oism/
Lecturer: Hans-Joachim Lenz
Institution: Free University
Date: 2007-02-25
Language: English
Course description: Numerical data fusion, or the merging of overlapping data files, becomes a hard problem if no globally unique identifying keys exist in the corresponding data sets. Typical examples are the linkage of address files supplied from different sources for commercial purposes (a money-making area), the merging of special offers in various media (cf. duplicate detection), and an administrative record census (ARC) as planned in Germany, where several autonomous, heterogeneous registers are to be merged. We present a three-step procedure: conversion of attributes, comparison of the values of a pair of objects, and classification of pairs (the "matching problem") as either "same" ("matched") or "not same" ("not matched"). We pay special attention to the quality and the efficiency of the methodology. We briefly discuss questions such as correctness and completeness, as well as pre-selection techniques such as "blocking" that reduce the computational complexity of pairwise comparisons. The approach is illustrated on data from carefully composed benchmark data sets. We assume some basic knowledge of computer science and classification (supervised learning).
Keywords: unique identifiers; overlapping data; heterogeneous registers
Source: VideoLectures.NET
Last reviewed: 2019-09-22 (cwx)
Views: 51
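
The three-step procedure named in the course description (attribute conversion, pairwise value comparison, classification of pairs) together with blocking can be sketched roughly as follows. This is only an illustrative sketch, not the lecturer's method: the field names (`name`, `street`, `zip`), the sample records, the use of `difflib.SequenceMatcher` as the similarity measure, and the fixed threshold standing in for a trained classifier are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def convert(record):
    """Step 1: attribute conversion (lowercase, collapse whitespace)."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def block_key(record):
    """Blocking (assumed key): only records with the same zip are compared."""
    return record["zip"]


def compare(a, b, fields=("name", "street")):
    """Step 2: one similarity score in [0, 1] per compared attribute."""
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]


def classify(scores, threshold=0.85):
    """Step 3: label a pair. A fixed threshold on the mean score is a
    stand-in here; a supervised classifier could take its place."""
    mean = sum(scores) / len(scores)
    return "matched" if mean >= threshold else "not matched"


def link(records, threshold=0.85):
    """Return index pairs (i, j) of records classified as the same object."""
    converted = [convert(r) for r in records]

    # Group record indices by blocking key to avoid comparing all pairs.
    blocks = defaultdict(list)
    for idx, rec in enumerate(converted):
        blocks[block_key(rec)].append(idx)

    matches = []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            scores = compare(converted[i], converted[j])
            if classify(scores, threshold) == "matched":
                matches.append((i, j))
    return matches


# Made-up address records, for illustration only.
records = [
    {"name": "Hans  Meier", "street": "Hauptstr. 5", "zip": "10115"},
    {"name": "hans meier", "street": "Hauptstr 5", "zip": "10115"},
    {"name": "Anna Schmidt", "street": "Gartenweg 12", "zip": "10115"},
    {"name": "Hans Meier", "street": "Hauptstr. 5", "zip": "80331"},
]

print(link(records))  # [(0, 1)]
```

Note the trade-off that blocking introduces in this sketch: the fourth record is never compared with the first because their blocking keys differ, so fewer pairs are evaluated, but a true match with a wrong or changed zip code would be missed.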