Web Spam Challenge 2007 Track II - Secure Computing Corporation Research
Course URL: http://videolectures.net/ecml07_krasser_wsc/
Lecturer: Sven Krasser
Institution: Secure Computing Corporation
Date: 2008-01-28
Language: English
Course description: To discriminate spam Web hosts/pages from normal ones, text-based and link-based data are provided for Web Spam Challenge Track II. Given a small set of labeled nodes (about 10%) in a Web link graph, the challenge is to predict whether each of the remaining nodes is spam or normal. We extract features from the link-based data and then combine them with the text-based features. After feature scaling, Support Vector Machines (SVM) and Random Forests (RF) are trained in an extremely high-dimensional space with about 5 million features. Stratified 3-fold cross-validation for SVM and out-of-bag estimation for RF are used to tune the model parameters and estimate the generalization capability. On the small corpus for Web host classification, the best F-Measure value is 75.46% and the best AUC value is 95.11%. On the large corpus for Web page classification, the best F-Measure value is 90.20% and the best AUC value is 98.92%.
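The evaluation terms in the description (stratified 3-fold cross-validation, F-Measure, AUC) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `stratified_folds`, `f_measure`, and `auc` are hypothetical, the F-Measure is assumed to be the balanced F1 score, and spam is assumed to be encoded as label 1.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=3, seed=0):
    """Split indices into k folds that keep the class ratio roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def f_measure(y_true, y_pred, positive=1):
    """Assumed F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

Each fold returned by `stratified_folds` would serve once as the validation set while the other two folds train the model; `f_measure` takes hard spam/normal predictions, whereas `auc` takes real-valued spam scores. The pairwise AUC loop is quadratic in the number of examples, so it suits only small illustrations, not corpora of the size described above.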
Keywords: Web spam; Track II; random forests
Source: VideoLectures.NET
Last reviewed: 2019-03-23 (lxf)
Views: 50