Multi-Label Learning with Millions of Categories |
|
Course URL: | http://videolectures.net/onlinelearning2012_varma_multi_label_lea... |
Lecturer: | Manik Varma |
Institution: | Microsoft |
Course date: | 2013-05-28 |
Language: | English |
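Summary: | The talk presents a random-forest-based algorithm for assigning a set of labels to a data point when the output space contains millions of categories. The method learns from positively labeled data alone, trains on real-valued belief vectors to compensate for missing and noisy labels, and uses a modified cost function that avoids overfitting and yields short, balanced trees. It scales to more than a hundred million data points and over ten million categories, and is reported to outperform state-of-the-art NLP-based techniques by more than 10% when suggesting bid phrases to online search advertisers. |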
Course description: | Our objective is to build an algorithm for classifying a data point into a set of labels when the output space contains millions of categories. This is a relatively novel setting in supervised learning and brings forth interesting challenges such as efficient training and prediction, learning from only positively labeled data with missing and incorrect labels, and handling label correlations. We propose a random-forest-based solution for jointly tackling these issues. We develop a novel extension of random forests for multi-label classification which can learn from positive data alone and can scale to large data sets. We generate real-valued beliefs indicating the state of labels and adapt our classifier to train on these belief vectors so as to compensate for missing and noisy labels. In addition, we modify the random forest cost function to avoid overfitting in high-dimensional feature spaces and to learn short, balanced trees. Finally, we write highly efficient training routines which let us train on problems with more than a hundred million data points, sparse feature vectors with over a million dimensions, and over ten million categories. Extensive experiments reveal that our proposed solution is not only significantly better than other multi-label classification algorithms but also more than 10% better than the state-of-the-art NLP-based techniques for suggesting bid phrases for online search advertisers. |
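Keywords: | random forests; multi-label classification; cost function; classification algorithms |
Source: | VideoLectures.NET |
Last reviewed: | 2020-06-08: 王勇彬 (volunteer course editor) |
Views: | 120 |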
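
The course description mentions a random forest whose cost function is modified to favor short, balanced trees while learning from positive labels only. The sketch below is one possible reading of that idea and not the speaker's actual formulation. It scores a candidate split by the Gini impurity of the pooled positive labels plus a penalty on unequal child sizes. The function names, the `balance_weight` parameter, and the toy data are all hypothetical.

```python
from collections import Counter


def label_gini(label_sets):
    """Gini impurity of the positive labels pooled over a node's points.

    Assumption: labels absent from a point are treated as unknown,
    not as negatives, so only observed (positive) labels are counted.
    """
    counts = Counter(label for labels in label_sets for label in labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def split_score(points, feature, balance_weight=0.5):
    """Score a binary split on the presence of a sparse feature (lower is better).

    points: list of (active_features, positive_labels) pairs, both sets.
    balance_weight: hypothetical knob trading label purity against balance.
    """
    left = [labels for feats, labels in points if feature in feats]
    right = [labels for feats, labels in points if feature not in feats]
    if not left or not right:
        return float("inf")  # degenerate split, never chosen
    n = len(points)
    impurity = (len(left) * label_gini(left) + len(right) * label_gini(right)) / n
    imbalance = abs(len(left) - len(right)) / n  # 0 when the children are equal in size
    return impurity + balance_weight * imbalance


def best_split(points, balance_weight=0.5):
    """Return the feature with the lowest split score (ties broken alphabetically)."""
    features = sorted(set().union(*(feats for feats, _ in points)))
    return min(features, key=lambda f: split_score(points, f, balance_weight))


# Toy data (made up): page features mapped to advertiser bid phrases.
points = [
    ({"shoes", "running"}, {"buy running shoes", "nike"}),
    ({"shoes", "leather"}, {"leather shoes"}),
    ({"laptop", "gaming"}, {"gaming laptop"}),
    ({"laptop", "cheap"}, {"cheap laptop", "gaming laptop"}),
]
chosen = best_split(points)
```

On this toy data, features that divide the points evenly (`laptop`, `shoes`) score better than features that isolate a single point (`cheap`, `gaming`). That is the behavior the balance term is meant to encourage.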