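How to Compute the Similarity of Two Documents (Part 3)

By 52nlp

June 7, 2013 #Deep Learning, #Deep Learning open course, #gensim, #nltk, #NLTK for Chinese information processing, #NLTK applications, #python, #Natural Language Processing with Python, #topic models, #Brown Corpus, #text similarity, #document similarity, #machine learning, #machine learning open course, #probabilistic graphical models, #probabilistic graphical models open course, #neural networks open course, #natural language processing, #课程图谱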

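In the previous part we walked through gensim with a simple example. In this part we validate and refine the approach on real data from 课程图谱, and use NLTK to preprocess the English course text.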

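III. Experiments on 课程图谱 data

1. Data preparation
To make it easy for everyone to reproduce the experiments, I prepared a set of Coursera course data, which you can download here: coursera_corpus (Baidu Netdisk link: http://t.cn/RhjgPkv, password: oppc). It contains 379 courses, one per line, and each line has three tab-separated fields: course name\tcourse summary\tcourse details. HTML tags have already been removed. The sample below shows only the course names: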

Writing II: Rhetorical Composing
Genetics and Society: A Course for Educators
General Game Playing
Genes and the Human Condition (From Behavior to Biotechnology)
A Brief History of Humankind
New Models of Business in Society
Analyse Numérique pour Ingénieurs
Evolution: A Course for Educators
Coding the Matrix: Linear Algebra through Computer Science Applications
The Dynamic Earth: A Course for Educators
...

First, let's open Python and load the data (the session in this article was run under Python 2.7):

>>> courses = [line.strip() for line in file('coursera_corpus')]
>>> courses_name = [course.split('\t')[0] for course in courses]
>>> print courses_name[0:10]
['Writing II: Rhetorical Composing', 'Genetics and Society: A Course for Educators', 'General Game Playing', 'Genes and the Human Condition (From Behavior to Biotechnology)', 'A Brief History of Humankind', 'New Models of Business in Society', 'Analyse Num\xc3\xa9rique pour Ing\xc3\xa9nieurs', 'Evolution: A Course for Educators', 'Coding the Matrix: Linear Algebra through Computer Science Applications', 'The Dynamic Earth: A Course for Educators']
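
The session above relies on Python 2 (file(...) and byte strings). Below is a minimal sketch of the same loading step that should also run on Python 3, assuming the corpus file is UTF-8 encoded, as the decode('utf-8') call later in the article suggests. The helper names load_courses and course_names are illustrative and not part of the original session.

```python
import io


def load_courses(path, encoding="utf-8"):
    """Read one course per line and return the stripped lines as text."""
    # io.open behaves the same on Python 2 and 3 and decodes to unicode.
    with io.open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle]


def course_names(courses):
    """Return the first tab-separated field (the course name) of each line."""
    return [course.split("\t")[0] for course in courses]
```

With the downloaded file in the working directory, course_names(load_courses('coursera_corpus'))[0:10] should correspond to the list printed above, with accented characters shown as text rather than byte escapes.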

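2. Introducing NLTK
NLTK is a well-known Python toolkit for natural language processing. It mainly targets English, but since the course data that 课程图谱 currently handles is mostly English, that is sufficient. NLTK comes with documentation, corpora, and a book, and volunteers in China have even translated the book, Natural Language Processing with Python (用Python进行自然语言处理). Sometimes I can't help thinking that people doing English NLP are really lucky.

As before, start by installing NLTK. The NLTK home page explains in detail how to install it on Mac, Linux, and Windows: http://nltk.org/install.html . The main thing is to install the dependencies NumPy and PyYAML first; the rest is straightforward. Once NLTK is installed, run import nltk to test it. If that works, there is one more important step: downloading the corpora that NLTK provides: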

>>> import nltk
>>> nltk.download()

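A graphical window pops up offering two collections to download, all-corpora and book. It is best to select both. The download takes a while, and NLTK is only really usable on your machine once the corpora are in place. You can then test the Brown Corpus: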

>>> from nltk.corpus import brown
>>> brown.readme()
'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
1161192

Now let's process the course data. If we only lowercase the text and split it on whitespace, as we did before, we get the following:

>>> texts_lower = [[word for word in document.lower().split()] for document in courses]
>>> print texts_lower[0]
['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...

Notice that many punctuation marks are still attached to words. So we bring in NLTK's word_tokenize function and process the data with it:

>>> from nltk.tokenize import word_tokenize
>>> texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]
>>> print texts_tokenized[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading', ',', 'research', ',', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic', ',', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers', '...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...
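
In the output above, some tokens such as 'texts.' and 'citizens.' still carry a trailing period. As a rough illustration of what tokenization has to do, here is a regex-based sketch that separates runs of word characters from runs of punctuation. It is not NLTK's algorithm, and simple_tokenize is a hypothetical helper name.

```python
import re

# A token is either a run of word characters or a run of non-space punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+", re.UNICODE)


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]
```

For example, simple_tokenize("writers... and texts.") yields 'writers', '...', 'and', 'texts', '.' as separate tokens. A sketch like this splits contractions and hyphenated words more crudely than a real tokenizer would.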

After tokenizing the English course text, we need to remove stopwords. Fortunately NLTK provides an English stopword list:

>>> from nltk.corpus import stopwords
>>> english_stopwords = stopwords.words('english')
>>> print english_stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
>>> len(english_stopwords)
127

That is 127 stopwords in total. First we filter the stopwords out of the course corpus:
>>> texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]
>>> print texts_filtered_stopwords[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', ',', 'research', ',', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', ',', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', ',', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', ',', 'visual', ',', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', ',', 'demonstrations', ',', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', ',', 'rhetoric', 'course', 'design', ',', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', ',', 'students', ',', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', ',', 'writers', 'exchange', ',', 'place', 'exchange', 'work', 'feedback']

The stopwords are gone, but the punctuation is still there. That is easy to fix: first define a list of punctuation marks:
>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']

Then filter out these punctuation marks:
>>> texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]
>>> print texts_filtered[0]
['writing', 'ii', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', 'research', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', 'ideas', 'talents', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', 'visual', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', 'demonstrations', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', 'rhetoric', 'course', 'design', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', 'students', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', 'writers', 'exchange', 'place', 'exchange', 'work', 'feedback']
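
The two filtering passes can also be done in one pass with set lookups. The sketch below treats any token made up entirely of ASCII punctuation as punctuation, so a token like '...', which is not in the hand-written list and survives in the output above, is dropped as well. The stopword set here is a tiny illustrative subset, not NLTK's list, and filter_tokens is a hypothetical helper name.

```python
import string

# Tiny illustrative subset; the article uses NLTK's 127-word English list.
STOPWORDS = frozenset(["a", "and", "in", "of", "to", "you", "is", "the"])
PUNCTUATION = frozenset(string.punctuation)


def is_punctuation(token):
    """True if every character of the token is ASCII punctuation."""
    return all(ch in PUNCTUATION for ch in token)


def filter_tokens(tokens, stopwords=STOPWORDS):
    """Drop stopwords and pure-punctuation tokens, keeping token order."""
    return [
        token for token in tokens
        if token not in stopwords and not is_punctuation(token)
    ]
```

Tokens that mix letters and punctuation, such as 'texts.', still pass through this filter; removing them would need better tokenization upstream.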

Next we stem the English words. NLTK offers several stemmer interfaces (see http://nltk.org/api/nltk.stem.html ), including well-known English stemmers such as the Lancaster Stemmer and the Porter Stemmer. Here we use LancasterStemmer:

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('stemmed')
'stem'
>>> st.stem('stemming')
'stem'
>>> st.stem('stemmer')
'stem'
>>> st.stem('running')
'run'
>>> st.stem('maximum')
'maxim'
>>> st.stem('presumably')
'presum'

Let's call this interface on the course data above:
>>> texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]
>>> print texts_stemmed[0]
['writ', 'ii', 'rhet', 'compos', 'rhet', 'compos', 'eng', 'sery', 'interact', 'read', 'research', 'compos', 'act', 'along', 'assign', 'design', 'help', 'becom', 'effect', 'consum', 'produc', 'alphabet', 'vis', 'multimod', 'texts.', 'join', 'us', 'becom', 'effect', 'writ', '...', 'bet', 'citizens.', 'rhet', 'compos', 'cours', 'writ', 'exchang', 'word', 'idea', 'tal', 'support.', 'introduc', 'vary', 'rhet', 'concepts\xe2\x80\x94that', 'idea', 'techn', 'inform', 'persuad', 'audiences\xe2\x80\x94that', 'help', 'becom', 'effect', 'consum', 'produc', 'writ', 'vis', 'multimod', 'texts.', 'class', 'includ', 'short', 'video', 'demonst', 'activities.', 'envid', 'rhet', 'compos', 'learn', 'commun', 'includ', 'enrol', 'cours', 'instructors.', 'bring', 'expert', 'writ', 'rhet', 'cours', 'design', 'design', 'assign', 'cours', 'infrastruct', 'help', 'shar', 'expery', 'writ', 'stud', 'profess', 'us.', 'collab', 'facilit', 'wex', 'writ', 'exchang', 'plac', 'exchang', 'work', 'feedback']

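Before bringing in gensim there is one more thing to do: remove the low-frequency stems that occur only once in the whole corpus. In my tests, leaving them in hurt the results somewhat: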

>>> all_stems = sum(texts_stemmed, [])
>>> stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
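
The sum(texts_stemmed, []) and all_stems.count(stem) calls rescan the token list repeatedly, which gets slow on larger corpora. Here is a sketch of an equivalent filter that counts every stem once with collections.Counter; drop_singletons is an illustrative name, not something from the original session.

```python
from collections import Counter


def drop_singletons(texts):
    """Remove tokens that occur exactly once across the whole corpus."""
    # Count every token over all documents in a single pass.
    counts = Counter(token for text in texts for token in text)
    return [[token for token in text if counts[token] > 1] for text in texts]
```

Calling texts = drop_singletons(texts_stemmed) should give the same result as the three lines above.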

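3. Introducing gensim
With the preprocessing done, we can bring in gensim and quickly run the course-similarity experiment. The steps below go by quickly; see the previous part for a detailed explanation.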

>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> dictionary = corpora.Dictionary(texts)
2013-06-07 21:37:07,120 : INFO : adding document #0 to Dictionary(0 unique tokens)
2013-06-07 21:37:07,263 : INFO : built Dictionary(3341 unique tokens) from 379 documents (total 46417 corpus positions)

>>> corpus = [dictionary.doc2bow(text) for text in texts]
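
Conceptually, the dictionary assigns an integer id to each distinct token, and doc2bow turns a token list into sparse (id, count) pairs. The pure-Python sketch below shows that idea. Ids here are assigned in order of first appearance, which may differ from the ids gensim assigns, and build_token_ids and to_bow are hypothetical names.

```python
from collections import Counter


def build_token_ids(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    token_ids = {}
    for text in texts:
        for token in text:
            token_ids.setdefault(token, len(token_ids))
    return token_ids


def to_bow(tokens, token_ids):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_ids[t] for t in tokens if t in token_ids)
    return sorted(counts.items())
```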

>>> tfidf = models.TfidfModel(corpus)
2013-06-07 21:58:30,490 : INFO : collecting document frequencies
2013-06-07 21:58:30,490 : INFO : PROGRESS: processing document #0
2013-06-07 21:58:30,504 : INFO : calculating IDF weights for 379 documents and 3341 features (29166 matrix non-zeros)

>>> corpus_tfidf = tfidf[corpus]
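
TF-IDF reweights each bag-of-words vector so that terms appearing in many documents count for less. As a rough illustration, the sketch below multiplies raw term frequency by log2(N / df) and then L2-normalizes each document vector. That is my understanding of gensim's default weighting, so treat the exact formula as an assumption. tfidf_vectors is a hypothetical helper, not a gensim function.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(texts)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for text in texts for token in set(text))
    vectors = []
    for text in texts:
        weights = {}
        for token, tf in Counter(text).items():
            idf = math.log(n_docs / float(doc_freq[token]), 2)
            if idf > 0.0:  # tokens present in every document carry no weight
                weights[token] = tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors
```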

Here we arbitrarily decide to train an LSI model with 10 topics:
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

>>> index = similarities.MatrixSimilarity(lsi[corpus])
2013-06-07 22:04:55,443 : INFO : scanning corpus to determine the number of features
2013-06-07 22:04:55,510 : INFO : creating matrix for 379 documents and 10 features

The LSI-based course index is now built. As an example we take Professor Andrew Ng's Machine Learning course, which is on line 211 of our coursera_corpus file, that is:

>>> print courses_name[210]
Machine Learning

Now we can use the LSI model to map this course into the 10-topic space and compute its similarity to the other courses:
>>> ml_course = texts[210]
>>> ml_bow = dictionary.doc2bow(ml_course)
>>> ml_lsi = lsi[ml_bow]
>>> print ml_lsi
[(0, 8.3270084238788673), (1, 0.91295652151975082), (2, -0.28296075112669405), (3, 0.0011599008827843801), (4, -4.1820134980024255), (5, -0.37889856481054851), (6, 2.0446999575052125), (7, 2.3297944485200031), (8, -0.32875594265388536), (9, -0.30389668455507612)]
>>> sims = index[ml_lsi]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

Take the top 10 courses by similarity:
>>> print sort_sims[0:10]
[(210, 1.0), (174, 0.97812241), (238, 0.96428639), (203, 0.96283489), (63, 0.9605484), (189, 0.95390636), (141, 0.94975704), (184, 0.94269753), (111, 0.93654782), (236, 0.93601125)]
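
The index scores the query against every indexed vector with cosine similarity, and the sorted(...) call ranks the results. Here is a small self-contained sketch of the same computation on dense Python lists, to show what the numbers above mean. The names cosine and rank_by_similarity are illustrative; gensim does this with matrix operations internally.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors):
    """Return (index, similarity) pairs sorted from most to least similar."""
    sims = [cosine(query, vector) for vector in vectors]
    return sorted(enumerate(sims), key=lambda item: -item[1])
```

A document compared with itself always scores 1.0, which is why the query course appears first in the ranking.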

The first course is the query itself:
>>> print courses_name[210]
Machine Learning

The second is the Machine Learning course by Pedro Domingos, another leading researcher teaching on Coursera:
>>> print courses_name[174]
Machine Learning

The third is Probabilistic Graphical Models by Professor Daphne Koller, the other co-founder of Coursera and likewise a leading researcher:
>>> print courses_name[238]
Probabilistic Graphical Models

The fourth is Neural Networks for Machine Learning by Geoffrey Hinton, another giant of the field; some students call it a must-take course for Deep Learning.
>>> print courses_name[203]
Neural Networks for Machine Learning

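The results look pretty good. If you find this interesting, try it yourself.

That wraps up this series. I had planned to write about an experiment on the full English Wikipedia dump, but since 课程图谱 does not need it for now, I will stop here. Interested readers can go straight to the gensim documentation, which is very detailed. Going forward I will probably focus more on applying NLTK to Chinese text processing; stay tuned.

Note: this is an original article. If you repost it, please credit the source, “我爱自然语言处理”: www.52nlp.cn

Permalink: https://www.52nlp.cn/如何计算两个文档的相似度三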

113 comments on “How to Compute the Similarity of Two Documents (Part 3)”

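52nlp said:

January 26, 2015, 12:43

Shouldn't the first step be to convert the PDF files to text before processing? pdf2text

[Reply]
cathy replied:
January 26, 2015 at 14:39
Yes. Python has pdfminer, but I couldn't get it installed, so I converted the files with the pdfminer web page instead.

[Reply]
cathy replied:
January 26, 2015 at 16:43
By the way, I also wanted to ask about jieba word segmentation: is there any rule for the word frequency you give in a custom dictionary? Can I really put in any number I like? Is there a range?

[Reply]
52nlp replied:
January 26, 2015 at 21:59
I believe it is based on corpus statistics. If you think a word is important, try assigning it a fairly large frequency.

[Reply]
cathy replied:
January 27, 2015 at 19:46
Yes, that's right: the built-in table comes from statistics, but if you add other words on top of it you apparently have to supply the frequency yourself, and the choice seems rather arbitrary.

[Reply]
cathy replied:
January 29, 2015 at 20:30
I'd also like to ask about the coefficients in the LDA output. Why are mine all 0.001 or 0?
That result is hard to interpret.

[Reply]
52nlp replied:
January 31, 2015 at 15:56
That probably depends on your training corpus. I'm not sure about the details, sorry.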
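[Reply]
何剑 replied:
September 2, 2016 at 10:57
Here we arbitrarily decide to train an LSI model with 10 topics:
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

>>> index = similarities.MatrixSimilarity(lsi[corpus])

Hello, I have the following question about this:
Why do we train the LSI model on corpus_tfidf but build the index from lsi[corpus], which is based on raw term counts? Logically the index should be built from the tf-idf representation too, shouldn't it?

[Reply]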
KwangKa said:

March 14, 2015, 23:54

Adding a decode should fix it

texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]

[Reply]
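江湖 said:

June 8, 2015, 11:08

A question:
Whenever new data arrives, do I have to put it together with the old data, rebuild the LSI model, and then map out the topics?

[Reply]
52nlp replied:
June 9, 2015 at 18:05
You can generate topics for new data (documents) with the existing model.

[Reply]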
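江湖 said:

June 11, 2015, 21:35

Hello,
I trained LSI on 2,000 text records, also arbitrarily setting 100 topics, but in the end I only see 5. Why can't I see the other 95?
What is the cause?
Thanks!

[Reply]
52nlp replied:
June 13, 2015 at 18:52
Sorry, I'm not sure. The cause may have to do with your training text.

[Reply]
江湖 said:

June 11, 2015, 21:49

Hello, my training set has 2,000 records and I set topics to 100, but only 5 topics are displayed in the end. What is the reason?

[Reply]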
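jianghu said:

June 18, 2015, 09:09

I used your data and followed your method step by step, but I end up with only 5 topics even though I specified 10. What is the reason?
Everything is exactly the same.

[Reply]
jianghu replied:
June 18, 2015 at 09:18
If I specify fewer than 5 topics, the number of topics generated is correct;
as soon as it exceeds 5, the number of topics generated is 5.
5 seems to be a maximum.
Is there something else I need to watch out for?

[Reply]
52nlp replied:
June 18, 2015 at 16:16
I haven't touched this in a long time. That is a bit strange and shouldn't happen. Perhaps it is related to some small changes in gensim; I'll try it when I have time.

[Reply]
江湖 replied:
June 27, 2015 at 10:36
Problem solved: the log shows only 5 topics by default. To see them all, call print_topics(num).

Later I ran into another question: how do I store an LSI model? How do I get at the LSI data type, and can it be stored in a database?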
jianghu said:

June 27, 2015, 16:21

How do I store an LSI model? What structure does it have?

[Reply]
52nlp replied:
June 27, 2015 at 19:03
Read through the official gensim documentation: https://radimrehurek.com/gensim/tut2.html#available-transformations
Model persistency is achieved with the save() and load() functions:

>>> lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
>>> lsi = models.LsiModel.load('/tmp/model.lsi')

[Reply]
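Yisha said:

July 15, 2015, 07:59

For the similarity step at the end, is it possible to output the probability of a document belonging to each topic? Also, the topic models produced by LDA / LSI are unsupervised, so how should already-labeled data be handled? I have looked for material on labeled LDA, but none of it explains things well...

[Reply]
52nlp replied:
July 18, 2015 at 08:31
For the first question, you could try modifying the code yourself. As for labeled data, why not treat it as a classification problem directly? With a classifier it is easy to get the probability that a document belongs to a given class. I'm not very familiar with labeled LDA.

[Reply]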
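在线工具 said:

July 15, 2015, 12:03

I'm studying similarity-related topics right now, and this blog is very good.

Could we exchange links? Mine is a tools website, though.

[Reply]
52nlp replied:
July 18, 2015 at 08:28
I haven't exchanged links in a long time, and for now I only do so with people I know well. Sorry.

[Reply]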
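江湖 said:

July 16, 2015, 07:12

2013-06-07 21:58:30,504 : INFO : calculating IDF weights for 379 documents and 3341 features (29166 matrix non-zeros)

2013-06-07 22:04:55,510 : INFO : creating matrix for 379 documents and 10 features

Both of these log lines mention features. What does the term mean in each case?

[Reply]
52nlp replied:
July 18, 2015 at 08:32
In the tf-idf model, a feature corresponds to a word token.

[Reply]
jianghu replied:
July 18, 2015 at 14:16
lsi.print_topics();
How do I establish the relation between topics and people?

A recommender system should look like this: people - topics - documents.

Topics are now related to documents, but what about topics and people?

[Reply]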
jianghu said:

July 18, 2015, 19:41

300-500 is said to be the golden range

[Reply]
jianghu said:

July 19, 2015, 08:10

Now we can use the LSI model to map this course into the 10-topic space and compute its similarity to the other courses:
>>> ml_course = texts[210]
>>> ml_bow = dicionary.doc2bow(ml_course)
>>> ml_lsi = lsi[ml_bow]
>>> print ml_lsi
[(0, 8.3270084238788673), (1, 0.91295652151975082), (2, -0.28296075112669405), (3, 0.0011599008827843801), (4, -4.1820134980024255), (5, -0.37889856481054851), (6, 2.0446999575052125), (7, 2.3297944485200031), (8, -0.32875594265388536), (9, -0.30389668455507612)]

0-9 correspond to the topics, but what do the values after them mean?

2013-05-27 19:15:26,467 : INFO : topic #0(1.137): 0.438*"gold" + 0.438*"shipment" + 0.366*"truck" + 0.366*"arrived" + 0.345*"damaged" + 0.345*"fire" + 0.297*"silver" + 0.149*"delivery" + 0.000*"in" + 0.000*"a"
2013-05-27 19:15:26,468 : INFO : topic #1(1.000): 0.728*"silver" + 0.364*"delivery" + -0.364*"fire" + -0.364*"damaged" + 0.134*"truck" + 0.134*"arrived" + -0.134*"shipment" + -0.134*"gold" + -0.000*"a" + -0.000*"in"
Here, in topic #0(1.137): and topic #1(1.000):, what do 1.137 and 1.000 mean?

[Reply]
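jianghu said:

July 30, 2015, 20:57

Does the similarity value between text A and text B change as the number of training samples grows?

[Reply]
52nlp replied:
August 4, 2015 at 15:28
Yes, it does.

[Reply]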
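jianghu said:

August 8, 2015, 19:53

I trained a 500-topic model on a training set of 71059 documents, 117522 features, 3427463 non-zero entries.

I then fed in, one after the other, two articles that a human would judge to be very similar, took the top three topics for each, and found that they share only one topic.

How should similarity be judged in this case?

[Reply]
52nlp replied:
August 14, 2015 at 15:15
Why look only at the top 3 topics? Why not compute the similarity of the two documents directly?

[Reply]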
吉某某 said:

October 22, 2015, 15:45

Hello teacher, I'm a beginner, following your articles while learning Python.
When I load the "coursera_corpus" file in Python I keep getting the error below, even though I have put the file in the Python installation folder. Is there a way to fix this? Thanks~
>>> courses=[line.strip() for line in file('coursera_corpus')]
Traceback (most recent call last):
File "", line 1, in
courses=[line.strip() for line in file('coursera_corpus')]
NameError: name 'file' is not defined
>>>

[Reply]
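吉某某 said:

October 22, 2015, 15:48

Hello teacher, I'm a beginner learning Python from your articles.
In this article, when I load the 'coursera_corpus' file in Python I always get the error below. I have already put the 'coursera_corpus' file in the Python folder. What causes this error? Any help appreciated, thanks~
>>> courses=[line.strip() for line in file('coursera_corpus')]
Traceback (most recent call last):
File "", line 1, in
courses=[line.strip() for line in file('coursera_corpus')]
NameError: name 'file' is not defined

[Reply]
52nlp replied:
October 23, 2015 at 10:17
Which Python version are you using, 2.x or 3.x?

[Reply]
吉某某 replied:
October 23, 2015 at 10:23
Version 3.4.3. Teacher, I have another question: when I used >>> from nltk.corpus import brown
>>> brown.readme()
to download the corpus, it took a very long time, almost half a day and still not finished. Is that normal? I really want to keep following your article, and the slow download is making me anxious~~~(┬＿┬)

[Reply]
52nlp replied:
October 23, 2015 at 20:01
Python 3.x and Python 2.x differ in some ways; I am using Python 2.7 here.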
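Cis_steven said:

March 14, 2016, 11:33

Hello, and first of all thank you for this great article. I ran an experiment following your code and found that Professor Daphne Koller's Probabilistic Graphical Models course and Geoffrey Hinton's Neural Networks course only rank between 5th and 10th by relevance, whereas I think these two courses should be in the top five. I have thought about it for a long time without figuring out where the problem lies. I hope you can reply when you have time. Thanks again!

[Reply]
52nlp replied:
March 23, 2016 at 11:01
Many factors are involved, such as the amount of training text and how well it was cleaned. Also, your intuition may simply differ from the actual results.

[Reply]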
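wut0n9 said:

April 16, 2016, 02:39

Great article, thank you.
Following the approach in your code, I ran it on WeChat Moments data. Each record contains only three parts, the reason for sharing an article, the article title, and the plain text posted to Moments, joined into one line, for 1,100 records in total. I used jieba for Chinese word segmentation and removed stopwords and punctuation. The similar items the program returns do not look similar on the surface, perhaps because each Moments post contains so little text :-(
I think an app that recommends similar articles for WeChat official accounts would be a lot of fun.

[Reply]
52nlp replied:
April 19, 2016 at 23:02
Moments posts are probably rather scattered in topic; try it with more data.

[Reply]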
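PyWilhelm said:

May 28, 2016, 16:45

I'd like to ask for advice.
I am working on user-independent news recommendation, which can to some extent be reduced to a semantic-similarity or retrieval-ranking problem. For semantic similarity I use unsupervised methods such as tf-idf and LSI, and for retrieval ranking I use an SVM, but precision@2 is stuck at 50% and won't improve. The main problem is that my database comes from a real news website where each article carries one or two recommended links, which together form a graph of news relations. Those links were created manually, so they are correct but incomplete. Many of my predictions fall between relevant and irrelevant under manual review, for example Bayern Munich's Bundesliga matches versus its Champions League matches, or Obama's personal finances versus his fiscal policy. Meanwhile some linked articles in the database are not very similar to each other, for example when News A is the background or cause of News B, and these cannot be predicted correctly. I am stuck and don't know what to do next. Any ideas would be appreciated.

[Reply]
52nlp replied:
May 30, 2016 at 15:29
Does the passage beginning "The main problem is that my database comes from a real news website" and ending "correct but incomplete" describe where your test set comes from?

[Reply]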
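何剑 said:

September 2, 2016, 11:01

Here we arbitrarily decide to train an LSI model with 10 topics:
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)
>>> index = similarities.MatrixSimilarity(lsi[corpus])

Hello, I have a question about this:
Why do we train the LSI model on corpus_tfidf but build the index from lsi[corpus], which is based on raw term counts? Logically the index should be built from the tf-idf representation too, shouldn't it?

[Reply]
52nlp replied:
September 2, 2016 at 18:24
Have a read through the official gensim documentation.

[Reply]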
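zixiaozhang said:

September 4, 2016, 21:53

Very good! I typed out the whole thing and got it all working. It would be great if you could write about other applications, and more about using gensim.

[Reply]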
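戴宇 said:

September 11, 2016, 11:50

Loading the data in Python in the first step gives SyntaxError: invalid character in identifier. How do I fix it?

[Reply]
52nlp replied:
September 11, 2016 at 23:35
It seems related to the string you are processing. There is too little information to say more; see http://stackoverflow.com/questions/14844687/invalid-character-in-identifier

[Reply]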
道如那 said:

March 22, 2017, 11:03

Teacher, hello! If the documents themselves are the queries, what should the code look like? I read docsim but didn't really understand it.
Attached: There is also a special syntax for when you need similarity of documents in the index
to the index itself (i.e. queries=indexed documents themselves). This special syntax
uses the faster, batch queries internally and **is ideal for all-vs-all pairwise similarities**:

>>> for similarities in index: # yield similarities of the 1st indexed document, then 2nd...
>>> ...

[Reply]
