Detecting Anomalous Records in Categorical Datasets (MOOC overseas open course)


Course URL:	http://videolectures.net/kdd07_das_dar/
Lecturer:	Kaustav Das
Institution:	Carnegie Mellon University
Date:	2007-09-14
Language:	English
Course description:	We consider the problem of detecting anomalies in high-arity categorical datasets. In most applications, anomalies are defined as data points that are "abnormal". Quite often we have access to data that consists mostly of normal records, along with a small percentage of unlabelled anomalous records. We are interested in the problem of unsupervised anomaly detection, where we use the unlabelled data for training and detect records that do not follow the definition of normality. A standard approach is to create a model of normal data and compare test records against it. A probabilistic approach builds a likelihood model from the training data. Records are tested for anomalousness based on the complete record likelihood given the probability model. For categorical attributes, Bayes nets give a standard representation of the likelihood. While this approach is good at finding outliers in the dataset, it tends to flag records whose attribute values are merely rare. Sometimes detecting rare values of an attribute is not desired, and such outliers are not considered anomalies in that context. We present an alternative definition of anomalies and propose an approach that compares against the marginal distributions of attribute subsets. We show that this is a more meaningful way of detecting anomalies and that it performs better on both semi-synthetic and real-world datasets.
Keywords:	categorical data; definition of normality; probability model
Source:	VideoLectures.NET
Last reviewed:	2019-05-08 (lxf)
Views:	110
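
The description contrasts scoring a record by its full likelihood with comparing against marginal distributions of attribute subsets. The sketch below is one possible reading of that idea, not the lecture's actual algorithm: for every pair of attributes it computes a smoothed ratio P(a, b) / (P(a) P(b)) and uses the smallest ratio as the record's score, so a low score marks a combination that is unusual given how common each value is on its own. The function names `fit_counts` and `pair_ratio_score`, the restriction to pairs, the add-`alpha` smoothing, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def fit_counts(records):
    """Count single attribute values and attribute-value pairs."""
    single = Counter()
    pair = Counter()
    for rec in records:
        for i, v in enumerate(rec):
            single[(i, v)] += 1
        for (i, a), (j, b) in combinations(enumerate(rec), 2):
            pair[(i, a, j, b)] += 1
    return single, pair, len(records)


def pair_ratio_score(rec, single, pair, n, alpha=1.0):
    """Smallest smoothed ratio P(a, b) / (P(a) * P(b)) over attribute pairs.

    The add-alpha smoothing is a crude assumption that keeps unseen
    values from producing a division by zero.
    """
    best = math.inf
    for (i, a), (j, b) in combinations(enumerate(rec), 2):
        p_ab = (pair[(i, a, j, b)] + alpha) / (n + alpha)
        p_a = (single[(i, a)] + alpha) / (n + alpha)
        p_b = (single[(j, b)] + alpha) / (n + alpha)
        best = min(best, p_ab / (p_a * p_b))
    return best


# Toy training data (hypothetical): two common patterns and one rare but
# internally consistent pattern.
train = (
    [("web", "http")] * 50
    + [("mail", "smtp")] * 50
    + [("ftp", "ftp-data")]
)
single, pair, n = fit_counts(train)

scores = {
    rec: pair_ratio_score(rec, single, pair, n)
    for rec in [("web", "http"), ("web", "smtp"), ("ftp", "ftp-data")]
}
for rec, score in scores.items():
    print(rec, round(score, 3))
```

In this toy setup the record `("web", "smtp")` gets a ratio well below 1 because both values are common but never occur together, while `("ftp", "ftp-data")` scores above 1 even though its values are rare. That mirrors the description's point that rare attribute values alone need not count as anomalies.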
