TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications |
|
Course URL: | http://videolectures.net/iswc2018_lofi_tse_ner_iterative/ |
Lecturer: | Christoph Lofi |
Institution: | Delft University of Technology (TU Delft) |
Date: | 2018-11-22 |
Language: | English |
Course description: | Named Entity Recognition and Typing (NER/NET) is a challenging task, especially for long-tail entities such as those found in scientific publications. These entities (e.g. "WebKB", "StatSnowball") are rare and often relevant only in specific knowledge domains, yet they are important for retrieval and exploration purposes. State-of-the-art NER approaches employ supervised machine learning models trained on expensive type-labeled data laboriously produced by human annotators. A common workaround is to generate labeled training data from knowledge bases (KBs); this approach is not suitable for long-tail entity types, which are, by definition, scarcely represented in KBs. This paper presents an iterative approach for training NER and NET classifiers for long-tail entity types in scientific publications that relies on minimal human input, namely a small seed set of instances of the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result entity filtering. We evaluate our approach on scientific publications, focusing on the long-tail entity types Datasets and Methods in computer science publications, and Proteins in biomedical publications. |
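Keywords: | named entity recognition and typing; state-of-the-art NER approaches; labeled training data generated from knowledge bases; semantic expansion and result entity filtering |
Source: | VideoLectures.NET |
Data collected: | 2023-01-07:cyh |
Last reviewed: | 2023-01-07:cyh |
Views: | 38 |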