Scalable Training of Mixture Models via Coresets
Course URL: http://videolectures.net/nips2011_faulkner_coresets/
Lecturer: Matthew Faulkner
Institution: California Institute of Technology
Date: 2012-01-25
Language: English
Course summary: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of $O(dk^3/\epsilon^2)$ data points suffices for computing a $(1+\epsilon)$-approximation for the optimal model on the original $n$ data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.
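The abstract's central object, a weighted subset whose weighted fit transfers back to the full data, can be illustrated with a small importance-sampling sketch. This is a hypothetical toy, not the paper's $O(dk^3/\epsilon^2)$ construction: the function name, the uniform-plus-distance sampling mixture, and all parameters are illustrative assumptions, and the sketch carries none of the paper's formal guarantees.

```python
import numpy as np

def importance_coreset(X, k, m, seed=0):
    """Sample a weighted subset of X that stands in for the full data set.

    Rough sketch of the sensitivity-sampling idea behind coresets: points
    far from a cheap k-point summary are more "important", so they are
    sampled with higher probability and reweighted by the inverse
    probability, keeping weighted sums unbiased estimates of full sums.
    (Hypothetical helper; illustrative only.)
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    seeds = X[rng.choice(n, size=k, replace=False)]       # crude centers
    d2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(-1).min(axis=1)
    p = 0.5 / n + 0.5 * d2 / max(d2.sum(), 1e-12)         # mix uniform + distance mass
    p = p / p.sum()                                       # normalize exactly
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])                                # inverse-probability weights
    return X[idx], w

# Toy data: two well-separated 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (1000, 2)), rng.normal(5, 1, (1000, 2))])
C, w = importance_coreset(X, k=2, m=400)
```

Weighted statistics computed on `(C, w)` then approximate those of the full data, e.g. `np.average(C, weights=w, axis=0)` is close to `X.mean(axis=0)`, which is the sense in which a model fit to the weighted subset can stand in for a fit to all $n$ points.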
Keywords: massive data sets; Gaussian mixtures; natural generalizations
Source: VideoLectures.NET
Last reviewed: 2019-07-26 (cwx)
Views: 117