Applying Syntactic Similarity Algorithms for Enterprise Information Management
Course URL: http://videolectures.net/kdd09_cherkasova_assaeim/
Lecturer: Ludmila Cherkasova
Institution: Hewlett-Packard (HP)
Date: 2009-09-14
Language: English
Course description: To implement content management solutions and enable new applications associated with data retention, regulatory compliance, and litigation issues, enterprises need to develop advanced analytics that uncover relationships among documents, e.g., content similarity, provenance, and clustering. In this paper, we evaluate the performance of four syntactic similarity algorithms. Three algorithms are based on Broder's "shingling" technique, while the fourth employs a more recent approach, "content-based chunking". For our experiments, we use a specially designed corpus of documents that includes a set of "similar" documents with a controlled number of modifications. Our performance study reveals that the similarity metric of all four algorithms is highly sensitive to the settings of two algorithm parameters: sliding window size and fingerprint sampling frequency. We identify a useful range of these parameters for achieving good practical results, and compare the performance of the four algorithms in a controlled environment. We validate our results by applying these algorithms to find near-duplicates in two large collections of HP technical support documents.
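As a rough illustration of the two parameters named in the description, the sketch below computes a shingle-based resemblance score between two texts. It is not the implementation evaluated in the lecture: the function names, the word-level shingles, the SHA-1 fingerprint, and the modulo-based sampling rule (`sample_mod`) are all assumptions chosen for brevity.

```python
import hashlib


def shingle_fingerprints(text, window=4, sample_mod=1):
    """Return sampled fingerprints of all `window`-word shingles in `text`.

    Assumption: a fingerprint is kept when (hash % sample_mod == 0), so
    sample_mod=1 keeps every shingle and larger values sample more sparsely.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < window:
        shingles = [" ".join(words)]
    else:
        shingles = [
            " ".join(words[i:i + window])
            for i in range(len(words) - window + 1)
        ]
    fingerprints = set()
    for shingle in shingles:
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # 64-bit fingerprint
        if value % sample_mod == 0:
            fingerprints.add(value)
    return fingerprints


def resemblance(text_a, text_b, window=4, sample_mod=1):
    """Jaccard overlap of the two sampled fingerprint sets."""
    fa = shingle_fingerprints(text_a, window, sample_mod)
    fb = shingle_fingerprints(text_b, window, sample_mod)
    union = fa | fb
    if not union:
        # No fingerprints survived sampling; report 0.0 by convention here.
        return 0.0
    return len(fa & fb) / len(union)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown cat jumps over the lazy dog"
score_w2 = resemblance(doc_a, doc_b, window=2)
score_w4 = resemblance(doc_a, doc_b, window=4)
```

On this toy pair, which differs by a single word, the score falls from 0.6 with a window of 2 words to 0.2 with a window of 4, which gives a small-scale feel for the parameter sensitivity the description reports.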
Keywords: Computer Science; Data Mining; Business and Finance
Source: VideoLectures.NET
Last reviewed: 2020-06-18 (liush)
Views: 41