Upping the Baseline for High-Precision Text Classifiers
Course URL: http://videolectures.net/kdd07_kolcz_utbfhp/
Lecturer: Aleksander Kołcz
Institution: Twitter
Date: 2007-08-13
Language: English
Course description: Many important application areas of text classifiers demand high precision, and it is common to compare prospective solutions against the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that an appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection, we propose monotonic constraints on document term weighting, which are shown to be an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.
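The connection between Naive Bayes and logarithmic opinion pooling mentioned in the description can be illustrated with a small sketch. It is not the lecture's implementation: it assumes a two-class multinomial Naive Bayes with Laplace smoothing, treats each token as an "expert" contributing a log-odds vote, and pools the votes with equal weights that sum to one, which amounts to dividing the summed log-odds by the document length. The function names `train_nb` and `score`, the smoothing constant `alpha`, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate per-token log-odds log P(w|1) - log P(w|0) with Laplace smoothing.

    docs is a list of token lists; labels is a list of 0/1 class ids
    (1 = spam in the toy example). Both classes must be present.
    """
    counts = {0: Counter(), 1: Counter()}
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    totals = {y: sum(counts[y].values()) + alpha * len(vocab) for y in (0, 1)}
    log_odds = {
        w: math.log((counts[1][w] + alpha) / totals[1])
        - math.log((counts[0][w] + alpha) / totals[0])
        for w in vocab
    }
    n_pos = sum(labels)
    prior = math.log(n_pos / (len(labels) - n_pos))
    return log_odds, prior

def score(tokens, log_odds, prior, normalize=True):
    """Return the class-1 log-odds score of a tokenized document.

    With normalize=True the token votes are averaged (equal-weight pooling),
    so the score no longer grows with document length.
    Tokens outside the training vocabulary are ignored.
    """
    known = [t for t in tokens if t in log_odds]
    if not known:
        return prior
    total = sum(log_odds[t] for t in known)
    if normalize:
        total /= len(known)
    return prior + total
```

Under these assumptions, repeating the same token ten times leaves the normalized score unchanged, while the unnormalized Naive Bayes score scales with the repetition count. That length sensitivity is what the normalization is meant to remove.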
Keywords: baseline; text classifiers; document length normalization; large email corpora
Source: VideoLectures.NET
Last reviewed: 2019-05-09 (cjy)