

Multiclass-multilabel classification with more labels than examples
Course URL: http://videolectures.net/aistats2010_dekel_mmcw/
Lecturer: Ofer Dekel
Institution: Microsoft
Date: 2010-01-03
Language: English
Course description: We discuss multiclass-multilabel classification problems in which the set of possible labels is extremely large. Most existing multiclass-multilabel learning algorithms expect to observe a reasonably large sample from each class, and fail if they receive only a handful of examples with a given label. We propose and analyze the following two-stage approach: first use an arbitrary (perhaps heuristic) classification algorithm to construct an initial classifier, then apply a simple but principled method to augment this classifier by removing harmful labels from its output. A careful theoretical analysis allows us to justify our approach under some reasonable conditions (such as label sparsity and power-law distribution of label frequencies), even when the training set does not provide a statistically accurate representation of most classes. Surprisingly, our theoretical analysis continues to hold even when the number of classes exceeds the sample size. We demonstrate the merits of our approach on the ambitious task of categorizing the entire web using the 1.5 million categories defined on Wikipedia.
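The two-stage idea in the description can be illustrated with a minimal Python sketch. It assumes the initial classifier's predicted label sets are already available for a labeled sample, and it uses a simple hypothetical pruning criterion: a label is treated as harmful when it is predicted wrongly more often than correctly. The function names and the toy data are illustrative assumptions, and the criterion is not necessarily the rule analyzed in the lecture.

```python
from collections import Counter


def find_harmful_labels(predicted, actual):
    """Return labels whose wrong predictions outnumber their correct ones.

    predicted, actual: lists of label sets, one pair per example.
    Hypothetical criterion, chosen only for illustration.
    """
    correct, wrong = Counter(), Counter()
    for pred, true in zip(predicted, actual):
        for label in pred:
            if label in true:
                correct[label] += 1
            else:
                wrong[label] += 1
    return {label for label in wrong if wrong[label] > correct[label]}


def prune(pred, harmful):
    """Stage two: drop harmful labels from the initial classifier's output."""
    return pred - harmful


# Toy output of an arbitrary stage-one classifier, with the true label sets.
predicted = [{"sports", "news"}, {"sports", "spam"}, {"music", "spam"}]
actual = [{"sports"}, {"sports"}, {"music"}]

harmful = find_harmful_labels(predicted, actual)
pruned = [prune(p, harmful) for p in predicted]
```

In this toy sample, "news" and "spam" are never predicted correctly, so both are removed from every output, while "sports" and "music" are kept.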
Keywords: multiclass; fine-grained; classification labels
Source: VideoLectures.NET
Last reviewed: 2020-05-07 by chenxin
Views: 45