Constructing a Recipe Web from Historical Newspapers |
|
Course URL: | http://videolectures.net/iswc2018_van_erp_constructing_recipe/ |
Lecturer: | Marieke van Erp |
Institution: | Faculty of Science, Vrije Universiteit Amsterdam (VU) |
Date: | 2018-11-22 |
Language: | English |
Abstract: | Historical newspapers provide a lens on the customs and habits of the past. For example, recipes published in newspapers highlight what and how we ate and how we thought about food. The challenge is that newspaper data is often unstructured and highly varied; digitised historical newspapers add a further challenge, namely fluctuations in OCR quality. It is therefore difficult to locate and extract recipes from them. We present our approach, based on distant supervision and automatically extracted lexicons, to identify recipes in digitised historical newspapers, to generate recipe tags, and to extract ingredient information. We provide OCR quality indicators and report their impact on the extraction process. We enrich the recipes with links to information on the ingredients. Our research shows how natural language processing, machine learning, and the semantic web can be combined to construct a rich dataset from heterogeneous newspapers for the historical analysis of food culture. |
Keywords: | unstructured data; natural language processing; machine learning and semantic web; rich datasets |
Source: | VideoLectures.NET |
Data collected: | 2023-01-06:cyh |
Last reviewed: | 2023-01-07:cyh |
Views: | 26 |