Large-Scale Behavioral Targeting
Course URL: http://videolectures.net/kdd09_chen_lsbt/
Lecturer: Ye Chen
Institution: Microsoft
Date: 2009-09-14
Language: English
Course description: Best Application Paper Award winner. Behavioral targeting (BT) leverages historical user behavior to select the ads most relevant to each user. State-of-the-art BT derives a linear Poisson regression model from fine-grained user behavioral data and predicts click-through rate (CTR) from user history. We designed and implemented a highly scalable and efficient BT solution using the Hadoop MapReduce framework. With our parallel algorithm and the resulting system, we can build more than 450 BT-category models from Yahoo's entire user base within one day, a scale that was unimaginable with prior systems. Moreover, our approach has yielded a 20% CTR lift over the existing production system by leveraging a well-grounded probabilistic model fitted from a much larger training dataset. Specifically, our major contributions include: (1) a MapReduce statistical learning algorithm and implementation that achieve optimal data parallelism, task parallelism, and load balance despite the typically skewed distribution of domain data; (2) an in-place feature vector generation algorithm with linear time complexity O(n) regardless of the granularity of the sliding target window; (3) an in-memory caching scheme that significantly reduces the number of disk I/Os, making large-scale learning practical; (4) highly efficient data structures and sparse representations of models and data that enable fast model updates. We believe that our work makes significant contributions to solving large-scale machine learning problems of industrial relevance in general. Finally, we report comprehensive experimental results obtained with a proprietary industrial codebase and datasets.
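Contribution (2) claims feature generation in O(n) time regardless of window granularity. The sketch below is not the authors' algorithm; it only illustrates, under stated assumptions, how running sums can keep a sliding feature window and the target window that follows it in linear time. The function name `sliding_window_sums`, the per-day count input, and the two window parameters are all hypothetical.

```python
from typing import List, Tuple


def sliding_window_sums(
    daily_counts: List[int], feature_days: int, target_days: int
) -> List[Tuple[int, int]]:
    """Return (feature_sum, target_sum) for every window position.

    Assumption: a feature window of `feature_days` consecutive days is
    immediately followed by a target window of `target_days` days.
    Each slide adds the day entering a window and subtracts the day
    leaving it, so total work is O(n) whatever the window sizes are.
    """
    n = len(daily_counts)
    span = feature_days + target_days
    if feature_days <= 0 or target_days <= 0 or n < span:
        return []

    # Initial sums for the first window position.
    feature_sum = sum(daily_counts[:feature_days])
    target_sum = sum(daily_counts[feature_days:span])
    results = [(feature_sum, target_sum)]

    for start in range(1, n - span + 1):
        # Day that moves from the target window into the feature window.
        boundary = start + feature_days - 1
        feature_sum += daily_counts[boundary] - daily_counts[start - 1]
        target_sum += daily_counts[start + span - 1] - daily_counts[boundary]
        results.append((feature_sum, target_sum))
    return results
```

For example, calling `sliding_window_sums([1, 2, 3, 4, 5], 2, 1)` slides a two-day feature window and a one-day target window across five days of counts.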
Keywords: linear Poisson regression; user base; probabilistic model
Source: VideoLectures.NET
Data collected: 2022-12-06:chenjy
Last reviewed: 2022-12-06:chenjy
Views: 24