How to Invest my Time: Lessons from Human-in-the-Loop Entity Extraction
Course URL: http://videolectures.net/kdd2019_zhang_he_dragut/  
Lecturer: Eduard Dragut
Institution: Temple University
Course date: 2020-03-02
Course language: English
Course description: Recognizing entities that follow or closely resemble a regular expression (regex) pattern is an important task in information extraction. Common approaches to extracting such entities require humans either to write a regex that recognizes the entity or to manually label entity mentions in a document corpus. While human effort is critical to building an entity recognition model, surprisingly little is known about how best to invest that effort given a limited time budget. To find an answer, we consider an iterative human-in-the-loop (HIL) framework that allows users to write a regex or manually label entity mentions, and then trains and refines a classifier based on the provided information. We demonstrate on 5 entity recognition tasks that classification accuracy improves over time with either approach. When a user is allowed to choose between regex construction and manual labeling, we find that (1) if the time budget is low, spending all of the time on regex construction is often advantageous; (2) if the time budget is high, spending all of the time on manual labeling seems to be superior; and (3) between those two extremes, writing regexes followed by manual labeling is typically the best approach.
Keywords: how to invest my time; human-in-the-loop entity extraction; lessons from extraction
Course source: VideoLectures.NET
Data collected: 2022-09-16:cyh
Last reviewed: 2022-09-19:cyh
Views: 27