Web Spam Challenge 2007 Track II - Secure Computing Corporation Research


Course URL:	http://videolectures.net/ecml07_krasser_wsc/
Lecturer:	Sven Krasser
Institution:	Secure Computing Corporation
Date:	2008-01-28
Language:	English
Abstract:	To discriminate spam Web hosts/pages from normal ones, text-based and link-based data are provided for Web Spam Challenge Track II. Given a small fraction of labeled nodes (about 10%) in a Web link graph, the challenge is to predict whether each remaining node is spam or normal. We extract features from the link-based data and combine them with the text-based features. After feature scaling, Support Vector Machine (SVM) and Random Forest (RF) models are trained in an extremely high-dimensional space of about 5 million features. Stratified 3-fold cross-validation for SVM and out-of-bag estimation for RF are used to tune the model parameters and estimate generalization capability. On the small corpus for Web host classification, the best F-measure is 75.46% and the best AUC is 95.11%. On the large corpus for Web page classification, the best F-measure is 90.20% and the best AUC is 98.92%.
Keywords:	web spam; Track II; random forests
Source:	VideoLectures.NET
Last reviewed:	2019-03-23 (lxf)
Views:	159
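
The abstract above names its evaluation protocol (stratified 3-fold cross-validation, F-measure, AUC) but gives no implementation. The following is a minimal pure-Python sketch of those three pieces, not the authors' code: the function names `stratified_folds`, `f_measure`, and `auc` are hypothetical, labels are assumed to be 1 for spam and 0 for normal, and the pairwise AUC computation is only practical for small illustrative inputs.

```python
from collections import defaultdict


def stratified_folds(labels, k=3):
    """Assign each index to one of k folds, keeping class ratios roughly equal."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Runs in O(n_pos * n_neg) time.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With these definitions, `auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])` evaluates to 0.75, because three of the four spam/normal score pairs are ranked correctly. In a cross-validation loop, each fold returned by `stratified_folds` would serve once as the held-out set while a model is trained on the other two.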