0


检验流形假设的样本复杂性

Sample Complexity of Testing the Manifold Hypothesis
课程网址: http://videolectures.net/nips2010_narayanan_sct/  
主讲教师: Hariharan Narayanan
开课单位: 麻省理工学院
开课时间: 2011-03-25
课程语种: 英语
中文简介:
高维数据倾向于位于低维流形附近的假设是称为流形学习的方法集合的基础。在本文中,我们研究了拟合具有近似最小平方误差的流形的问题的统计方面。给定维度,体积和曲率的上限,我们表明经验风险最小化可以使用多个随机样本产生近似最优的流形,该随机样本独立于数据所在空间的环境维度。我们获得所需数量的样本的上界,其在曲率上多项式地,在内在维度上指数地,并且在内在体积上线性地。对于恒定误差,我们证明了样本复杂度的匹配极小极大下界,表明这种对内在维度,体积和曲率的依赖是不可避免的。 K经济风险最小化的样本复杂度的已知下界是否适用于任意维度的单位球中的数据是紧的,自1997年以来一直是一个悬而未决的问题。这里eps是误差的期望界限,de是受到失败概率的约束。我们改善了目前已知的最佳上限。基于这些结果,我们设计了一个简单的k均值算法,另一个使用一系列凸程序将指定长度的分段线性曲线拟合到高维数据,其中样本复杂度与环境维度无关。
课程简介: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is it independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound for the sample complexity of Empirical Risk minimization on k-means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997. Here eps is the desired bound on the error and de is a bound on the probability of failure. We improve the best currently known upper bound. Based on these results, we devise a simple algorithm for k-means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension.
关 键 词: 高维数据; 流形假设; 样本复杂性
课程来源: 视频讲座网
最后编审: 2019-07-25:cwx
阅读次数: 53