On the Detection of Neologism Candidates as a Basis for Language Observation and Lexicographic Endeavors: the STyrLogism Project |
|
Course URL: | http://videolectures.net/euralex2018_abel_stemle_endeavors/ |
Lecturer: | Egon W. Stemle |
Institution: | Charles University, Prague |
Date: | 2018-07-27 |
Language: | English |
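Chinese abstract (translated): | The STyrLogisms project aims to semi-automatically extract candidate neologisms (new lexemes) for the standard variety of German used in South Tyrol. Starting from a manually vetted list of URLs for South Tyrolean news, magazine and blog websites, we crawl the data at regular intervals, then clean and process it. We compare the new data with reference corpora, additional regional word lists, and all previously crawled data sets. Our reference corpora are DECOW14, with around 60 million word forms, and the South Tyrolean Web Corpus, with around 2.4 million word forms; the additional word lists comprise named entities, regional terminology, and terms specific to the standard variety of German used in South Tyrol (around 53,000 word forms in total). Here we report on the method employed, the first round of candidate extraction together with an approach to a classification schema for the selected candidates, and some remarks on the second extraction round. |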
Course description: | The goal of the project STyrLogisms is to semi-automatically extract candidate neologisms (new lexemes) for the German standard variety used in South Tyrol. We use a list of manually vetted URLs from news, magazine and blog websites in South Tyrol, regularly crawl their data, and clean and process it. We compare this new data to reference corpora, additional regional word lists and all previously crawled data sets. Our reference corpora are DECOW14, with around 60 million word forms, and the South Tyrolean Web Corpus, with around 2.4 million word forms; the additional word lists consist of named entities, terminology from the region and specific terms of the German standard variety used in South Tyrol (altogether around 53,000 word forms). Here, we will report on the method employed, the first round of candidate extraction with an approach for a classification schema for the selected candidates, and some remarks on the second extraction round. |
Keywords: | semi-automatic extraction; reference corpora; classification schema |
Source: | VideoLectures.NET |
Data collected: | 2022-12-16:chenjy |
Last reviewed: | 2023-05-11:chenjy |
Views: | 26 |
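
The comparison step described in the course description (new crawl data checked against reference corpora, regional word lists and earlier crawls) can be pictured as a set difference over word forms. The following is only a minimal sketch under stated assumptions, not the project's actual pipeline: the function names `tokenize` and `extract_candidates`, the regular-expression tokenizer, the case-insensitive matching and the toy word lists are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, optionally joined by hyphens.
# A real pipeline would likely use a proper German tokenizer instead.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Return the casefolded word forms found in text."""
    return [match.group(0).casefold() for match in TOKEN_RE.finditer(text)]


def extract_candidates(new_text, known_word_forms, min_count=1):
    """Return {word form: frequency} for forms absent from all known lists.

    known_word_forms stands in for the union of the reference corpora,
    the regional word lists and all previously crawled data.
    """
    known = {form.casefold() for form in known_word_forms}
    counts = Counter(tokenize(new_text))
    return {
        form: count
        for form, count in counts.items()
        if form not in known and count >= min_count
    }


if __name__ == "__main__":
    # Toy stand-ins for the three kinds of known vocabulary (assumed data).
    reference = {"der", "die", "das", "ist", "neu", "in", "bozen"}
    regional = {"Hydrauliker"}
    previous_crawls = {"eröffnet"}

    text = "Der Selfie-Stick ist neu in Bozen. Der Hydrauliker eröffnet."
    print(extract_candidates(text, reference | regional | previous_crawls))
```

Anything returned by such a filter is only a candidate; the abstract describes the extraction as semi-automatic, so the selected candidates would still need to be reviewed and classified by hand.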