
Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI)
Lecturer: Jasminka Dobsa
Institution: University of Zagreb
Date: 2007-02-25
Language: English
Course description: Information retrieval in the vector space model is based on literal matching of terms in documents and queries. The model is implemented by building a term-document matrix from the frequencies of terms in documents. Literal matching of terms does not necessarily retrieve all relevant documents. Synonymy (multiple words having the same meaning) and polysemy (words having multiple meanings) are two major obstacles to efficient information retrieval. Latent semantic indexing (LSI) and concept indexing (CI) are information retrieval techniques embedded in the vector space model that address the problems of synonymy and polysemy. LSI uses a low-rank approximation of the term-document matrix obtained from its singular value decomposition (SVD). Although LSI has been empirically successful, its low-rank approximation lacks an interpretation and, consequently, offers no controls for accomplishing specific tasks in information retrieval. CI lowers the rank of the term-document matrix using centroids of clusters, the so-called concept decomposition (CD). Here we compare SVD/LSI and CD/CI in terms of matrix approximation and retrieval precision.
Keywords: vector space; literal matching; term-document matrix
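
The two rank-reduction approaches named in the description can be sketched on a toy term-document matrix. This is a minimal illustration, not the lecturer's implementation: the five-term vocabulary, the four documents, the helper names (`lsi_approximation`, `spherical_kmeans`, `concept_approximation`, `cosine_scores`), and the choice of spherical k-means with a least-squares projection for the concept decomposition are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are documents.
# terms: car, automobile, engine, flower, petal
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)

def lsi_approximation(A, k):
    """Rank-k approximation of A built from its truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def spherical_kmeans(A, k, n_iter=20, seed=0):
    """Cluster unit-norm document columns by cosine similarity.

    Returns a (terms x k) matrix whose columns are normalized cluster
    centroids (the "concept vectors"). Assumes A has no all-zero column.
    """
    rng = np.random.default_rng(seed)
    X = A / np.linalg.norm(A, axis=0, keepdims=True)
    C = X[:, rng.choice(X.shape[1], size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(C.T @ X, axis=0)    # nearest concept per document
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] > 0:           # keep old centroid if cluster is empty
                c = members.sum(axis=1)
                C[:, j] = c / np.linalg.norm(c)
    return C

def concept_approximation(A, C):
    """Least-squares projection of A onto the span of the concept vectors."""
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C @ Z

def cosine_scores(query, D):
    """Cosine similarity between a query vector and each column of D."""
    return (query @ D) / (np.linalg.norm(query) * np.linalg.norm(D, axis=0))

k = 2
A_lsi = lsi_approximation(A, k)
A_cd = concept_approximation(A, spherical_kmeans(A, k))

# Frobenius-norm approximation errors of the two rank-k representations.
print("LSI error:", np.linalg.norm(A - A_lsi))
print("CD error: ", np.linalg.norm(A - A_cd))

# Query "car": literal matching misses the "automobile engine" document,
# while the rank-2 LSI representation gives it a nonzero score.
query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
print("literal scores:", cosine_scores(query, A))
print("LSI scores:    ", cosine_scores(query, A_lsi))
```

For a given rank, the truncated SVD gives the smallest possible Frobenius-norm error, so the concept decomposition can at best match it on this measure. In exchange, each concept vector is the centroid of an actual group of documents, which is what makes it interpretable.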
