首页数学
   首页自然科学
0


流形假设检验的样本复杂性

Sample Complexity of Testing the Manifold Hypothesis
课程网址: http://videolectures.net/nips2010_narayanan_sct/  
主讲教师: Hariharan Narayanan
开课单位: 麻省理工学院
开课时间: 2011-03-25
课程语种: 英语
中文简介:

高维数据倾向于位于低维流形附近的假设是称为流形学习的一系列方法的基础。在本文中,我们研究了拟合具有近乎最优最小平方误差的流形问题的统计方面。给定维度、体积和曲率的上限,我们表明经验风险最小化可以使用大量随机样本产生近乎最优的流形,它与数据所在空间的环境维度无关。我们获得了所需样本数量的上限,该上限取决于曲率多项式、固有维度呈指数关系、固有体积呈线性关系。对于恒定误差,我们证明了样本复杂性的匹配极小极大下界,这表明这种对固有维度、体积和曲率的依赖是不可避免的。应用于任意维度单位球中数据的 k 均值经验风险最小化的样本复杂性的已知下限是否严格,自 1997 年以来一直是一个悬而未决的问题。 这里 eps 是期望的误差界限,de 是一个失败概率的约束。我们改进了目前已知的最佳上限。基于这些结果,我们设计了一个简单的 k 均值算法和另一个使用一系列凸程序将指定长度的分段线性曲线拟合到高维数据的算法,其中样本复杂度与环境维度无关。

课程简介: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is it independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound for the sample complexity of Empirical Risk minimization on k-means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997. Here eps is the desired bound on the error and de is a bound on the probability of failure. We improve the best currently known upper bound. Based on these results, we devise a simple algorithm for k-means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension.
关 键 词: 高维数据; 流形学习; k 均值算法
课程来源: 视频讲座网
数据采集: 2021-07-19:nkq
最后编审: 2021-07-19:nkq
阅读次数: 67