Web Spam Challenge 2007 Track II - Secure Computing Corporation Research |
|
Course URL: | http://videolectures.net/ecml07_krasser_wsc/ |
Lecturer: | Sven Krasser |
Institution: | Secure Computing Corporation |
Date: | 2008-01-28 |
Language: | English |
Abstract: | To discriminate spam Web hosts/pages from normal ones, text-based and link-based data are provided for Web Spam Challenge Track II. Given a small fraction of labeled nodes (about 10%) in a Web link graph, the challenge is to predict whether each remaining node is spam or normal. We extract features from the link-based data and combine them with the text-based features. After feature scaling, Support Vector Machines (SVM) and Random Forests (RF) are trained in an extremely high-dimensional space of about 5 million features. Stratified 3-fold cross-validation for SVM and out-of-bag estimation for RF are used to tune the model parameters and estimate generalization capability. On the small corpus for Web host classification, the best F-Measure is 75.46% and the best AUC is 95.11%. On the large corpus for Web page classification, the best F-Measure is 90.20% and the best AUC is 98.92%. |
Keywords: | Web spam; Track II; Random Forests |
Source: | VideoLectures.NET |
Last reviewed: | 2019-03-23:lxf |
Views: | 50 |
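
The abstract mentions stratified 3-fold cross-validation and reports F-Measure and AUC. The following is a minimal, standard-library sketch of those three ideas only; it is not the authors' code. The function names (`stratified_folds`, `f_measure`, `auc`), the label encoding (1 = spam, 0 = normal), and the toy labels are all assumptions made for illustration, and the SVM and Random Forest models themselves are not reproduced here.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=3, seed=0):
    """Split sample indices into k folds that preserve the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def f_measure(y_true, y_pred):
    """F-Measure (F1) for the positive class, assumed here to be spam = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """AUC as the chance a spam item outscores a normal one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels (made up): 3 spam and 6 normal nodes.
labels = [1, 0, 0, 1, 0, 0, 1, 0, 0]
folds = stratified_folds(labels, k=3)
for i, fold in enumerate(folds):
    spam_count = sum(labels[j] for j in fold)
    print(f"fold {i}: size={len(fold)}, spam={spam_count}")
```

In a cross-validation loop, each fold would serve once as the held-out set while a model is fit on the other two, and `f_measure` and `auc` would be computed on the held-out predictions and scores. The pairwise AUC shown here is quadratic in the number of examples, which is acceptable for a sketch but would likely be replaced by a rank-based computation on corpora of the size described above.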