# How to Compute the Similarity of Two Documents (Part 1)

1) TF-IDF, cosine similarity, and the vector space model
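
As a rough illustration of how these three pieces fit together, the sketch below builds TF-IDF vectors for a few tokenized documents and compares them with cosine similarity. The function names `tfidf_vectors` and `cosine`, the sample sentences, and the weighting variant (raw term count times `log(N / df)`) are assumptions made for this example; libraries such as gensim apply their own weighting and normalization.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Raw term count times log inverse document frequency (one common variant).
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample documents, already tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing several terms
sim_02 = cosine(vecs[0], vecs[2])  # documents sharing no terms
print(sim_01, sim_02)
```

In this vector space model, documents that share no terms always score zero, which is one motivation for the latent-space methods that follow.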

2) SVD and LSI

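In essence, LSI identifies words that exhibit second-order co-occurrence at the document level and groups them into the same subspace. Therefore:
1) Words that fall into the same subspace are not necessarily synonyms, nor even necessarily words that appear in the same context. This is especially true for long documents.
2) LSI has no way of handling words with multiple meanings (polysemous words), and polysemy degrades LSI's results.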
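
To make the first point concrete, here is a minimal sketch of rank-1 LSI on a toy term-document matrix. The three terms and the matrix are hypothetical, and the leading singular vector is approximated with power iteration instead of a full SVD, purely to keep the example dependency-free; it is not how gensim or any other library implements LSI.

```python
import math

# Hypothetical toy term-document matrix (rows: terms, columns: documents).
# "car" and "auto" never share a document; both co-occur with "engine".
terms = ["car", "auto", "engine"]
M = [
    [1.0, 0.0],  # car    appears only in document 0
    [0.0, 1.0],  # auto   appears only in document 1
    [1.0, 1.0],  # engine appears in both documents
]

def top_right_singular_vector(matrix, iterations=100):
    """Approximate the leading right singular vector by power iteration on M^T M."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = M v, then w = M^T u, so w = (M^T M) v.
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

v = top_right_singular_vector(M)
# Rank-1 LSI coordinate of each term: its row projected onto v.
coords = {t: sum(row[j] * v[j] for j in range(len(v))) for t, row in zip(terms, M)}
print(coords)
```

In this one-dimensional latent space "car" and "auto" receive the same coordinate even though they never appear together. The projection only reflects that both co-occur with "engine"; it says nothing about whether they are actually synonyms.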

A persistent myth in search marketing circles is that LSI grants contextuality; i.e., that it groups terms occurring in the same context. This is not always the case. Consider two documents X and Y and three terms A, B, and C, wherein:

A and B do not co-occur.
X mentions terms A and C
Y mentions terms B and C.

∴ A---C---B

The common denominator is C, so we define this relation as an in-transit co-occurrence since both A and B occur while in transit with C. This is called second-order co-occurrence and is a special case of high-order co-occurrence.
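
The A---C---B relation above can be checked mechanically. The sketch below is only an illustration: the helper names `co_occur` and `second_order` are made up for this example, and "co-occurrence" is taken to mean "appears in the same document", as in the passage.

```python
from itertools import combinations

# Toy corpus from the example above: X mentions A and C, Y mentions B and C.
docs = {"X": {"A", "C"}, "Y": {"B", "C"}}

# First-order co-occurrence: two terms appear together in at least one document.
first_order = set()
for doc_terms in docs.values():
    for a, b in combinations(sorted(doc_terms), 2):
        first_order.add((a, b))

def co_occur(a, b):
    """True if a and b appear together in at least one document."""
    return tuple(sorted((a, b))) in first_order

def second_order(a, b, vocabulary):
    """True if a and b never co-occur directly but both co-occur with a third term."""
    if co_occur(a, b):
        return False
    return any(co_occur(a, c) and co_occur(c, b)
               for c in vocabulary if c not in (a, b))

vocabulary = set().union(*docs.values())
print(co_occur("A", "B"))                  # False: A and B never share a document
print(second_order("A", "B", vocabulary))  # True: linked only through C
```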

3) LDA

## 16 comments on "How to Compute the Similarity of Two Documents (Part 1)"

1. zl

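The "SVD and LSI Tutorial Series" really is good. I went to the original site and saw that the content was taken down because it now requires payment (Fees: \$20 dollars (US currency) per copy, unless stated otherwise.)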

2. cathy

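Hello, I have a question about the LDA model in gensim. As I recall, you can simply call it directly, but I don't know which algorithm it uses: Gibbs sampling, or Blei's EM approach?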

52nlp replied:

Looking at the code should be enough:

gensim-0.8.6/gensim/models/ldamodel.py

"""
This module encapsulates functionality for the Latent Dirichlet Allocation algorithm.

It allows both model estimation from a training corpus and inference of topic
distribution on new, unseen documents.

The core estimation code is directly adapted from the `onlineldavb.py` script
by M. Hoffman [1]_, see
**Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.**

The algorithm:

* is **streamed**: training documents come in sequentially, no random access,
* runs in **constant memory** w.r.t. the number of documents: size of the
training corpus does not affect memory footprint, and
* is **distributed**: makes use of a cluster of machines, if available, to
speed up model estimation.

"""

cathy replied:

OK, I read the 2010 paper. It uses variational Bayes, which is said to work better than Gibbs sampling.

3. 杨先生

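I work in human resources management consulting. Because of a recent project I have started learning NLP, and the information on this blog has helped our team a great deal. May I ask: with current technology, is it possible to match two Chinese texts well, for example a job posting and a résumé, so that their wording and keywords are consistent? What techniques could achieve this? Thank you.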
