Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection
Course URL: http://videolectures.net/kdd2018_pang_representations_distance-ba...
Lecturer: Guansong Pang
Institution: University of Technology Sydney
Date: 2018-11-23
Language: English
Abstract: Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands/millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learn the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance in detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach - the random distance-based approach. This customized learning yields more optimal and stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage a small amount of labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO as an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh-dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and two orders of magnitude speedup; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.
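Keywords: data with millions of features; learning for outlier detection methods; real-world ultrahigh-dimensional data sets
Source: VideoLectures.NET
Data collected: 2023-02-01:cyh
Last reviewed: 2023-02-01:cyh
Views: 19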
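
The abstract refers to a random distance-based outlier detection approach without spelling it out. As a rough illustration only, the sketch below shows one common form of such a detector: each point is scored by its nearest-neighbor distance to a small random subsample, averaged over several subsamples. This is not the REPEN implementation, and the representation learning step is not shown. The function name `random_distance_scores`, its parameters, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def random_distance_scores(X, subsample_size=8, n_ensembles=50, seed=0):
    """Hypothetical helper: mean nearest-neighbor distance to random subsamples.

    Larger scores suggest a point is more likely to be an outlier.
    Assumes subsample_size >= 2 and at least 2 rows in X.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    s = min(subsample_size, n)
    scores = np.zeros(n)
    for _ in range(n_ensembles):
        # Draw a small random subsample of row indices without replacement.
        idx = rng.choice(n, size=s, replace=False)
        # Euclidean distances from every point to every subsample member: (n, s).
        diff = X[:, None, :] - X[idx][None, :, :]
        dists = np.sqrt((diff ** 2).sum(axis=2))
        # A point drawn into the subsample has distance 0 to itself; mask that entry.
        dists[idx, np.arange(s)] = np.inf
        scores += dists.min(axis=1)
    return scores / n_ensembles

# Synthetic data (an assumption for illustration): 200 inliers plus one far-away point.
demo_rng = np.random.default_rng(42)
inliers = demo_rng.normal(size=(200, 5))
outlier = np.full((1, 5), 10.0)
X_demo = np.vstack([inliers, outlier])
demo_scores = random_distance_scores(X_demo)
print("index with highest score:", int(np.argmax(demo_scores)))
```

Because each round compares points against only a handful of sampled rows, the cost grows with the subsample size rather than with the full data set size. Under this reading, a learned low-dimensional representation would replace `X` before scoring.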