Topic Summarization for Large Text Data Sets
Course URL: http://videolectures.net/cidu2011_el_ghaoui_text_corpora/
Lecturer: Laurent El Ghaoui
Institution: University of California, Berkeley
Date: 2012-06-27
Language: English
Course description: Sparse machine learning has recently emerged as a powerful tool for obtaining models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers of aviation safety incidents.
Keywords: computer science; machine learning; large text data sets
Source: VideoLectures.NET
Last reviewed: 2020-07-06:heyf
Views: 39