Using linguistic information as features for text categorization
Course URL: http://videolectures.net/mmdss07_raez_uli/
Lecturer: Arturo Montejo Ráez
Institution: University of Jaén
Date: 2007-11-26
Language: English
Course description: We report on some experiences using linguistic information as additional features in a classical Vector Space Model [10]. Information extracted for every word, such as its part of speech, stem, and lexical root, has been combined in different ways to test for a possible improvement in classification performance with several algorithms, such as SVM [3], BBR [] and PLAUM [6]. Automatic Text Classification, also known as Automatic Text Categorization, tries to relate documents to a predefined set of classes. Extensive research has been carried out on this subject [11], and a wide range of techniques is applicable to this task: feature extraction [5], feature weighting, dimensionality reduction [4], machine learning algorithms, and more. The classification task can be binary (one of two possible classes), multi-class (one of a set of possible classes), or multi-label (a set of classes drawn from a larger set of candidates). In most cases, the latter two can be reduced to binary decisions [1], as the algorithm used in our experiments does [8]. To verify the contribution of the new features, we combined them into the vector space model by preprocessing the Reuters-21578 collection, a data set well known to the research community working on text categorization problems [2].
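The description does not say exactly how the word-level annotations were combined, so the following is only a minimal sketch of one plausible scheme: each token's word form, stem, and part-of-speech tag are joined into a single feature string and counted as a term-frequency vector. It also shows a multi-label target being reduced to one binary decision per class. The function names, the `/` separator, the sample tokens, and the class labels are all assumptions made for illustration, not details taken from the lecture.

```python
from collections import Counter

# Hypothetical pre-annotated tokens: (word, stem, part-of-speech tag).
# In practice these would come from a tagger and a stemmer; the values
# below are hand-written for illustration only.
DOC = [
    ("prices", "price", "NOUN"),
    ("rose", "rise", "VERB"),
    ("as", "as", "ADP"),
    ("oil", "oil", "NOUN"),
    ("prices", "price", "NOUN"),
    ("rose", "rose", "NOUN"),
]


def combined_features(tokens, use_word=True, use_stem=False, use_pos=False):
    """Return term frequencies over the selected combination of annotations."""
    feats = Counter()
    for word, stem, pos in tokens:
        parts = []
        if use_word:
            parts.append(word)
        if use_stem:
            parts.append(stem)
        if use_pos:
            parts.append(pos)
        if parts:
            # Joining the parts makes "rose/VERB" and "rose/NOUN" distinct
            # dimensions of the vector space.
            feats["/".join(parts)] += 1
    return feats


def one_vs_rest(label_sets, classes):
    """Reduce multi-label targets to one binary target list per class."""
    return {c: [1 if c in labels else 0 for labels in label_sets] for c in classes}


word_only = combined_features(DOC)
word_pos = combined_features(DOC, use_pos=True)
# Two documents with illustrative label sets, three candidate classes.
targets = one_vs_rest([{"crude", "trade"}, {"grain"}], ["crude", "grain", "trade"])
```

With plain words the two senses of "rose" share one dimension, whereas the word/POS combination separates them. Each binary target list from `one_vs_rest` could then be used to train a separate binary classifier.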
Keywords: linguistic information; vector space; classification performance
Source: VideoLectures.NET
Last reviewed: 2020-06-15:wuyq
Views: 30