
Mining Complex Entities from Heterogeneous Information Networks
Lecturers: Fabio Ciravegna; Andrea Varga
Institution: University of Sheffield
Course language: English
Course description: Most research on information mining has focused on classic Information Extraction (IE) tasks over structured and unstructured documents, such as newspaper articles and web pages. In recent years, however, the staggering growth of social media as a platform for sharing content has shifted the focus towards a different type of extraction target. Social media pose a number of challenges to information extraction: contributions to social media sites such as blogs, forums and Twitter are conversational in nature and thus tend to be brief and informal, containing imprecise, subjective and ambiguous information. The expanded context (who the author is, the social and geographical context, the author's social links, etc.) becomes relevant for disambiguating and interlinking information. The aim of this tutorial is to introduce and discuss issues, methodologies and technologies for extracting information from documents, with a particular focus on mining heterogeneous information networks (e.g. social websites) in order to mine complex entities.
The tutorial covers:
* Introduction to information extraction from documents in general (20 minutes) and from information networks in particular (10 minutes)
* Introduction to machine learning based methods for information extraction (75 minutes)
# representing documents and feature sets
# entity and terminology recognition
# learning gazetteers
# event and relation extraction
# extraction from multimedia documents
* Annotation for training (15 minutes)
# feature selection
# annotation and error
# porting across domains
* Information extraction from information networks (45 minutes)
# using the Twitter and Facebook APIs
# entity recognition and resolution
# term association
# entity disambiguation at large scale
* Conclusion and future work (15 minutes)
The focus is on Machine Learning based methods. We will cover, among others, methods using Rule Induction, SVM, CRF, HMM, Transfer Learning and Active Learning. We will also discuss real-world cases from the field of information and knowledge management.
Keywords: data mining; social content; network analysis; social networks; Web mining; social computing; computer science; social media
