Sparsity analysis of term weighting schemes and application to text classification
Course URL: http://videolectures.net/slsfs05_brank_satws/
Lecturer: Janez Brank
Institution: Jožef Stefan Institute
Date: 2007-02-25
Language: English
Course description: We revisit the common practice of feature selection for dimensionality and noise reduction. This typically involves scoring and ranking features according to some weighting scheme and selecting the top-ranked features for further processing. Experiments show that the performance of text classification methods is sensitive to the characteristics of the feature sets used. For example, the sizes of the feature sets that yield the same performance level for a given classification method can differ greatly, depending on the feature scoring method. We extend this exploration by considering the representations of individual document vectors that result from a particular feature set. In particular, we observe the average number of features per document vector, i.e., the vector sparsity or density, and introduce sparsity curves to illustrate how vector density increases as the feature set grows under different weighting schemes. We show that selecting features by specifying a vector density parameter, instead of a feature set size, yields results comparable to the commonly adopted practice. It has the added benefit of exposing the effect of feature selection on the document vector representation and on system parameters, such as the memory consumption of classification operations. Furthermore, the corresponding classification performance curves link the sparsity and performance measures and provide further insight into how feature specificity, or the distribution of a feature across documents in the corpus, is accounted for by the classification method.
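The idea of a sparsity curve, and of selecting features by target density rather than by feature count, can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the lecture: features are ranked by plain document frequency (only one of many possible weighting schemes), documents are treated as binary bags of words, and the function names `rank_by_document_frequency`, `sparsity_curve`, and `select_by_density` as well as the toy corpus are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    return Counter(term for doc in docs for term in set(doc))


def rank_by_document_frequency(docs):
    """Rank terms by document frequency, highest first (assumed scoring scheme)."""
    df = document_frequency(docs)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(df, key=lambda term: (-df[term], term))


def sparsity_curve(docs, ranking):
    """Average number of retained terms per document for each prefix of `ranking`."""
    df = document_frequency(docs)
    curve, nonzeros = [], 0
    for term in ranking:
        # Every document containing the term gains one nonzero vector entry.
        nonzeros += df[term]
        curve.append(nonzeros / len(docs))
    return curve


def select_by_density(docs, ranking, target_density):
    """Return the shortest prefix of `ranking` whose density reaches the target."""
    for k, density in enumerate(sparsity_curve(docs, ranking), start=1):
        if density >= target_density:
            return ranking[:k]
    return list(ranking)  # target not reachable: keep every ranked feature


# Toy corpus (illustrative only).
docs = [
    ["cat", "sat", "mat"],
    ["cat", "dog"],
    ["dog", "barked", "cat"],
    ["stock", "market"],
]
ranking = rank_by_document_frequency(docs)
curve = sparsity_curve(docs, ranking)
selected = select_by_density(docs, ranking, target_density=1.0)
print(ranking[:2], curve[:2], selected)
```

Swapping in a different ranking function changes how quickly the curve rises: a scheme that favors rare terms needs many more features to reach the same density than one that favors frequent terms.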
Keywords: computer science; text mining; sparsity curves
Source: VideoLectures.NET
Last reviewed: 2020-07-28 (yumf)