Annotation of the Corpus of the Saeima with Multilingual Standards |
|
Course URL: | http://videolectures.net/parlaCLARIN2018_dargis_multilingual_stan... |
Lecturer: | Roberts Darģis |
Institution: | University of Latvia |
Date: | 2018-05-30 |
Language: | English |
Abstract: | This paper describes a release of the corpus of the Saeima (the parliament of Latvia) as an open data resource for multidisciplinary research. The corpus consists of transcriptions of Latvian parliamentary debates from 1993 to 2017, containing 38 million tokens from 468 speakers. Simply providing unannotated corpora does not sufficiently facilitate comparative research on parliamentary debates, and results mostly in monolingual research by local researchers. We propose that augmenting such corpora with extra annotation layers that follow commonly used multilingual standards would make it easier to compare and contrast multiple corpora in different languages. In this regard, we believe that the key elements to add are identifiers of the entities mentioned in each utterance and morphosyntactic information for linguistic analysis. For these reasons, the released corpus is augmented with named entity linking to the Wikidata knowledge base (provided as linked data), automated translations into English, and morphological and syntactic annotations in Universal Dependencies format. |
Keywords: | Saeima corpus; multilingual standards; syntactic annotation |
Source: | VideoLectures.NET |
Data collected: | 2020-11-26:cjy |
Last reviewed: | 2020-11-26:cjy |
Views: | 37 |