Estimating Rates of Rare Events at Multiple Resolutions
Course URL: http://videolectures.net/kdd07_chakrabarti_erore/
Lecturer: Deepayan Chakrabarti
Institution: Carnegie Mellon University
Date: 2007-08-15
Language: English
Course description: We consider the problem of estimating occurrence rates of rare events for extremely sparse data, using pre-existing hierarchies to perform inference at multiple resolutions. In particular, we focus on estimating click rates for (webpage, advertisement) pairs, called impressions, where both the pages and the ads are classified into hierarchies that capture broad contextual information at different levels of granularity. Typically the click rates are low and the coverage of the hierarchies is sparse. To overcome these difficulties we devise a sampling method in which we analyze a specially chosen sample of pages in the training set, and then estimate click rates using a two-stage model. The first stage imputes the number of (webpage, ad) pairs at all resolutions of the hierarchy to adjust for the sampling bias. The second stage estimates click rates at all resolutions after incorporating correlations among sibling nodes through a tree-structured Markov model. Both models are scalable and suited to large-scale data mining applications. On a real-world dataset consisting of half a billion impressions, we demonstrate that even with 95% negative (non-clicked) events in the training set, our method can effectively discriminate extremely rare events in terms of their click propensity.
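The idea of estimating rates at several resolutions of a hierarchy can be illustrated with a much simpler stand-in than the talk's method. The sketch below is not the two-stage model or the tree-structured Markov model described above; it only shows how a sparse leaf's click rate can borrow strength from its parent by shrinking toward the parent's rate. The `Node` class, the `prior_strength` parameter, and the toy counts are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of a (hypothetical) page or ad hierarchy."""
    name: str
    clicks: float = 0.0
    impressions: float = 0.0
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node):
    """Roll leaf counts up the tree in place. Call once on the root."""
    for child in node.children:
        c, n = aggregate(child)
        node.clicks += c
        node.impressions += n
    return node.clicks, node.impressions


def smoothed_rates(node: Node,
                   parent_rate: Optional[float] = None,
                   prior_strength: float = 100.0,
                   out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Estimate a click rate per node, shrinking each child toward its parent.

    prior_strength acts like a number of pseudo-impressions carrying the
    parent's rate (an illustrative assumption, not a value from the talk).
    """
    if out is None:
        out = {}
    if parent_rate is None:
        # Root: plain empirical rate.
        rate = node.clicks / node.impressions if node.impressions else 0.0
    else:
        rate = ((node.clicks + prior_strength * parent_rate)
                / (node.impressions + prior_strength))
    out[node.name] = rate
    for child in node.children:
        smoothed_rates(child, rate, prior_strength, out)
    return out


# Toy data: "golf" has very few impressions, "tennis" has many.
root = Node("sports", children=[
    Node("golf", clicks=2, impressions=10),
    Node("tennis", clicks=30, impressions=10_000),
])
aggregate(root)
rates = smoothed_rates(root)
for name, rate in rates.items():
    print(f"{name}: {rate:.5f}")
```

With these toy counts the sparsely observed "golf" node is pulled strongly toward the parent rate, while the well-observed "tennis" node stays close to its raw rate of 0.003.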
Keywords: hierarchies; extremely sparse data; Markov model
Source: VideoLectures.NET
Last reviewed: 2019-05-08:lxf
Views: 28