Exploiting document structure and feature hierarchy for semi-supervised domain adaptation
Course URL: http://videolectures.net/cmulls08_arnold_eds/
Lecturer: Andrew Arnold
Institution: Carnegie Mellon University
Date: 2008-10-21
Language: English
Course description: In this work we try to bridge the gap often encountered by researchers who find themselves with few or no labeled examples from their desired target domain, yet still have access to large amounts of labeled data from other related, but distinct source domains, and seemingly no way to transfer knowledge from one to the other. Experimentally, we focus on the problem of extracting protein mentions from academic publications in the field of biology, where the source domain data are abstracts labeled with protein mentions, and the target domain data are wholly unlabeled captions. We mine the large number of such full text articles freely available on the Internet in order to supplement the limited amount of annotated data available. By exploiting the explicit and implicit common structure of the different subsections of these documents, including the unlabeled full text, we are able to generate robust features that are insensitive to changes in marginal and conditional distributions of classes and data across domains. We supplement these domain-insensitive features with automatically obtained high-confidence positive and negative predictions on the target domain to learn extractors that generalize well from one section of a document to another. Similarly, we develop a novel hierarchical prior structure over the features motivated by the common structure of feature spaces for this task across natural language data sets. Finally, lacking labeled target testing data, we employ comparative user preference studies to evaluate the relative performance of the proposed methods with respect to existing baselines.
Keywords: feature space; user preference study
Source: VideoLectures.NET
Last reviewed: 2020-06-22:chenxin