Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale
Course URL: http://videolectures.net/kdd2018_staar_corpus_conversion/
Lecturer: Peter Staar
Institution: IBM Zurich Research Laboratory
Course date: 2018-11-23
Course language: English
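Chinese summary:
Generative adversarial networks (GANs) have achieved great success in generating realistic synthetic data such as images, labels, and sentences. We explore using GANs to generate bid keywords directly from queries in sponsored search ad selection, especially for rare queries. Specifically, in the query expansion (query-keyword matching) scenario in search advertising, we train a sequence-to-sequence model as the generator to produce keywords conditioned on the user query, and use a recurrent neural network model as the discriminator to play an adversarial game against the generator. By applying the trained generator, we can generate keywords directly from a given query, which can greatly improve the effectiveness and efficiency of ad selection based on query-keyword matching in search advertising. We trained the proposed model on a dataset of clicked query-keyword pairs from a commercial search advertising system. Evaluation results show that, compared with the baseline models, the generated keywords are more relevant to the given queries and have great potential to bring additional revenue improvement.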
Course description: Over the past few decades, the number of scientific articles and the volume of technical literature have increased exponentially. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) and the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline which allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms and ultimately convert any type of PDF or bitmap document to a structured content representation format. We will show that each of the modules is scalable due to an asynchronous microservice architecture and can therefore handle massive amounts of documents. Furthermore, we will show that our capability to gather ground-truth is accelerated by machine-learning algorithms by at least one order of magnitude. This allows us both to gather large amounts of ground-truth in very little time and to obtain very good precision/recall metrics in the range of 99% with regard to content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure and serves more than 250 active users for knowledge-engineering project engagements.
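Keywords: scientific articles and technical literature; exponential growth in volume; PDF format or bitmap images; asynchronous microservice architecture
Course source: VideoLectures.NET
Data collected: 2023-01-28:cyh
Last reviewed: 2023-01-28:cyh
Views: 26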