Pseudo-Label Generation for Multi-Label Text Classification
Course URL: http://videolectures.net/cidu2011_khan_classification/  
Instructor: Latifur Khan
Institution: University of Texas
Date: 2012-06-27
Language: English
Course description: With the advent and expansion of social networking, the amount of generated text data has increased sharply. Handling such a large volume of text requires new and improved text mining techniques. One characteristic of text data that makes text mining difficult is multi-labelity: a single document can belong to several classes at once. To build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. The property is not unique to text data, as it can also be found in non-text (e.g., numeric) data, but it is most prevalent in text. It also places the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what circumstances. The high and sparse dimensionality of text data is also considered during classification. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during classification, in order to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
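The description defines pseudo labels only as combinations of existing class labels. As a rough illustration of that idea, and not the pseudo-LSC algorithm itself, the sketch below counts label pairs that co-occur in a toy training set and treats the frequent ones as pseudo labels. The function names, the `min_count` threshold, the restriction to small combinations, and the sample documents are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def generate_pseudo_labels(label_sets, min_count=2, max_size=2):
    """Return label combinations that co-occur at least `min_count` times.

    Hypothetical helper: the frequency threshold and size cap are
    illustrative assumptions, not taken from the lecture.
    """
    counts = Counter()
    for labels in label_sets:
        unique = sorted(set(labels))
        for size in range(2, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_count}


def augment(labels, pseudo_labels):
    """Append every pseudo label whose member labels are all present."""
    present = set(labels)
    extra = ["+".join(c) for c in sorted(pseudo_labels) if present.issuperset(c)]
    return sorted(present) + extra


# Toy multi-label training set (invented for illustration).
docs = [
    ["sports", "politics"],
    ["sports", "politics", "health"],
    ["health"],
    ["politics", "health"],
]

pseudo = generate_pseudo_labels(docs)
augmented = augment(["sports", "politics"], pseudo)
print(pseudo)
print(augmented)
```

Each retained combination acts as one extra class, so a single-label learner trained on the augmented label set can pick up some correlation between the original labels.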
Keywords: computer science; text mining; labels
Source: VideoLectures.NET
Last reviewed: 2020-07-06:heyf
Views: 116