
Extracting Relevant Named Entities for Automated Expense Reimbursement
Course URL: http://videolectures.net/kdd07_zhu_erne/  
Lecturer: Guangyu Zhu
Institution: University of Maryland
Date: 2007-08-14
Language: English
Course description: Expense reimbursement is a time-consuming and labor-intensive process across organizations. In this talk, we present an automated expense reimbursement system developed at IBM Almaden Research Center. Our complete solution involves (1) an electronic document management infrastructure that provides multi-channel image capture, transport, and storage of paper documents, such as receipts; (2) an unconstrained data mining approach to extracting relevant named entities from unstructured document images; and (3) automation of manual auditing procedures using the extracted metadata. The main focus of this presentation is our approach to automatically extracting important metadata once documents have been aggregated through such a scalable infrastructure. Robustly extracting relevant named entities from document images with unconstrained layouts and diverse formatting is a fundamental technical challenge for image-based data mining, question answering, and other information retrieval tasks. In many applications that require this capability, applying traditional language modeling techniques to the stream of OCR text does not give satisfactory results, because linguistic context such as language constructs and punctuation is absent. We present a novel approach for extracting relevant named entities from document images: a discriminative conditional random fields (CRFs) framework learns the statistical dependencies between page layout and language features jointly from the sequence of geometrically decomposed regions on a document. We integrate this named entity extraction engine into our expense reimbursement solution and evaluate system performance on large collections of real-world receipt images provided by the IBM World Wide Reimbursement Center.
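As a rough illustration of the idea in the description, the sketch below labels a sequence of receipt regions by combining layout cues (vertical position) with text cues (date and amount patterns) and decoding the best label sequence with the Viterbi algorithm, as a linear-chain CRF does at inference time. Everything specific here is an assumption made for illustration: the label set, the feature names, the region format, and the hand-set weights, which stand in for parameters a real CRF would learn from annotated receipts. It is not the feature set or model from the talk.

```python
import re

# Hypothetical label set; the talk does not list its entity types.
LABELS = ["OTHER", "MERCHANT", "DATE", "TOTAL"]

# Hand-set weights standing in for learned CRF parameters (assumption).
STATE_W = {
    "OTHER": {"bias": 2.0},
    "MERCHANT": {"near_top": 2.0, "mostly_alpha": 1.5},
    "DATE": {"has_date": 4.0},
    "TOTAL": {"has_amount": 1.5, "has_total_word": 3.0, "near_bottom": 1.0},
}
# Transition weights: discourage the same entity appearing twice in a row.
TRANS_W = {(lab, lab): -1.0 for lab in ("MERCHANT", "DATE", "TOTAL")}


def region_features(region, page_height):
    """Mix layout and text features for one OCR region."""
    text = region["text"]
    y_rel = region["y"] / page_height  # relative vertical position
    alpha = sum(c.isalpha() for c in text)
    return {
        "bias": 1.0,
        "near_top": 1.0 if y_rel < 0.2 else 0.0,
        "near_bottom": 1.0 if y_rel > 0.6 else 0.0,
        "has_date": 1.0 if re.search(r"\d{1,2}/\d{1,2}/\d{2,4}", text) else 0.0,
        "has_amount": 1.0 if re.search(r"\d+\.\d{2}", text) else 0.0,
        "has_total_word": 1.0 if "total" in text.lower() else 0.0,
        "mostly_alpha": 1.0 if alpha > 0.7 * max(len(text), 1) else 0.0,
    }


def state_score(label, feats):
    """Weighted sum of the features that are active for this label."""
    return sum(w * feats.get(name, 0.0) for name, w in STATE_W[label].items())


def decode_regions(regions, page_height):
    """Viterbi decoding: highest-scoring label sequence over the regions."""
    if not regions:
        return []
    feats = [region_features(r, page_height) for r in regions]
    score = [{lab: state_score(lab, feats[0]) for lab in LABELS}]
    back = []  # back[i][lab] = best previous label for lab at position i + 1
    for f in feats[1:]:
        row, ptr = {}, {}
        for lab in LABELS:
            prev, best = max(
                ((p, score[-1][p] + TRANS_W.get((p, lab), 0.0)) for p in LABELS),
                key=lambda t: t[1],
            )
            row[lab] = best + state_score(lab, f)
            ptr[lab] = prev
        score.append(row)
        back.append(ptr)
    path = [max(LABELS, key=lambda lab: score[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy receipt: regions in reading order, "y" is the top edge in pixels.
RECEIPT = [
    {"text": "ACME COFFEE HOUSE", "y": 40},
    {"text": "08/14/2007 10:32", "y": 120},
    {"text": "Latte 3.50", "y": 400},
    {"text": "Muffin 2.25", "y": 440},
    {"text": "TOTAL 5.75", "y": 700},
]

if __name__ == "__main__":
    for region, label in zip(RECEIPT, decode_regions(RECEIPT, page_height=1000)):
        print(f"{label:9s} {region['text']}")
```

The point of the sketch is that neither cue alone is enough: the line items and the total both contain amounts, and only position plus the keyword separates them. A trained CRF would estimate such weights from data rather than have them set by hand.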
Keywords: expense reimbursement; image capture; document transport; metadata extraction
Source: VideoLectures.NET
Last reviewed: 2021-02-16:nkq
Views: 30