Looking for a Needle in a Haystack: Semi-automatic Creation of a Latvian Multi-word Dictionary from Small Monolingual Corpora |
|
Course URL: | http://videolectures.net/euralex2018_skadina_monolingual_corpora/ |
Lecturer: | Inguna Skadiņa |
Institution: | University of Latvia |
Date: | 2018-07-27 |
Language: | English |
Abstract: | Multiword expressions (MWEs) are an indispensable part of almost any dictionary. However, identifying missing MWEs that have recently appeared in a language is not a simple task. In this paper we describe automated methods for MWE identification in rather small Latvian text corpora. We propose starting with the application of statistical measures to identify a wide range of MWEs and then applying linguistically motivated filters to clean the list of initially extracted MWE candidates. We show that for morphologically rich languages such as Latvian, when only a small amount of language data is available, better results can be achieved with lemmatized data. We also demonstrate that with a small general-domain (balanced) corpus, automatic methods can be used to find good MWE candidates: terminological units, named entities and some lexicalized phrases. However, finding idiomatic expressions in small, general-domain corpora is like looking for a needle in a haystack: only a larger or more expressive corpus can help in the identification process. |
Keywords: | multiword expressions; Latvian text corpus; idiomatic expressions |
Source: | VideoLectures.NET |
Data collected: | 2021-02-13:cjy |
Last reviewed: | 2021-02-13:cjy |
Views: | 52 |