Estimating Rates of Rare Events with Multiple Hierarchies through Scalable Log-linear Models
Course URL: http://videolectures.net/kdd2010_agarwal_erre/
Lecturer: Deepak Agarwal
Institution: LinkedIn
Date: 2010-07-26
Language: English
Abstract: We consider the problem of estimating rates of rare events for high-dimensional, multivariate categorical data where several dimensions are hierarchical. Such problems are routine in several data mining applications, including computational advertising, the main focus of this paper. We propose a novel log-linear modeling method that scales to massive data applications with billions of training records and several million potential predictors in a map-reduce framework. Our method exploits correlations in aggregates observed at multiple resolutions when working with multiple hierarchies; stable estimates at coarser resolutions provide informative priors that improve estimates at finer resolutions. Beyond prediction accuracy and scalability, our method has a built-in variable screening procedure based on a "spike and slab prior" that provides parsimony by removing non-informative predictors without hurting predictive accuracy. We perform large-scale experiments on data from real computational advertising applications and illustrate our approach on datasets with several billion records and hundreds of millions of predictors. Extensive comparisons with other benchmark methods show significant improvements in prediction accuracy.
Keywords: computer science; algorithmic information theory; data
Source: VideoLectures.NET
Last reviewed: 2019-11-18:cwx