

Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification
Course URL: http://videolectures.net/wapa2011_baena_garcia_gnusmail/
Lecturer: Manuel Baena-Garcia
Institution: University of Málaga
Date: 2011-11-11
Language: English
Abstract: Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Spam identification, where only two labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification task in which multiple folders exist, which is an additional source of complexity: the class can take a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, such as the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using recently proposed mechanisms such as fading factors. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different data stream mining methods using state-of-the-art online evaluation metrics. Besides describing the framework, which is characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.
Keywords: real-time email classification; data stream mining; online evaluation metrics
Source: VideoLectures.NET
Last reviewed: 2019-11-05:lxf
Views: 61
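
The abstract refers to the prequential error and to fading factors as a way to soften its problems. As a rough illustration only, the sketch below shows one common formulation of that idea: each example is first used to test the model and then to train it, and the running error is a ratio of two exponentially faded sums. This is not GNUsmail's code or API. The names `FadedPrequentialError`, `MajorityClassLearner`, and `prequential_run`, the toy stream, and the default fading factor of 0.995 are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class FadedPrequentialError:
    """Running 0/1 error with a fading factor (hypothetical helper)."""
    alpha: float = 0.995   # assumed value; alpha = 1.0 gives the plain prequential error
    loss_sum: float = 0.0  # faded sum of losses
    weight: float = 0.0    # faded count of examples seen

    def update(self, y_true, y_pred) -> float:
        """Fold one test result into the estimate and return the current error."""
        loss = 0.0 if y_true == y_pred else 1.0
        self.loss_sum = loss + self.alpha * self.loss_sum
        self.weight = 1.0 + self.alpha * self.weight
        return self.loss_sum / self.weight


class MajorityClassLearner:
    """Toy incremental classifier: always predicts the most frequent folder so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        # Before any training there is nothing to predict; None counts as an error.
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_run(stream, learner, alpha=0.995):
    """Test-then-train loop over (features, folder) pairs; returns the error curve."""
    metric = FadedPrequentialError(alpha=alpha)
    errors = []
    for x, y in stream:
        y_pred = learner.predict(x)              # test on the new email first
        errors.append(metric.update(y, y_pred))  # update the faded error
        learner.learn(x, y)                      # only then train on it
    return errors


if __name__ == "__main__":
    # Toy stream of (subject, folder) pairs, invented for illustration.
    toy_stream = [("a", "inbox"), ("b", "inbox"), ("c", "inbox"), ("d", "work")]
    print(prequential_run(toy_stream, MajorityClassLearner(), alpha=0.5))
```

With a fading factor below 1, old mistakes (for instance those made before the model had seen any data) lose weight over time, so the curve tracks recent performance more closely. That matters when the stream is subject to concept drift. Several learners can be compared by running the same loop on the same stream and comparing the resulting curves.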