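How to Compute the Similarity Between Two Documents (Part 3)

By 52nlp

June 7, 2013 #Deep Learning, #Deep Learning open course, #gensim, #nltk, #NLTK Chinese information processing, #NLTK applications, #python, #Natural Language Processing with Python, #topic models, #Brown Corpus, #text similarity, #document similarity, #machine learning, #machine learning open course, #probabilistic graphical models, #probabilistic graphical models open course, #neural networks open course, #natural language processing, #Coursegraph

In the previous part we walked through gensim with a simple example. In this part we use real data from Coursegraph (课程图谱) to validate and improve on it, and we use NLTK to preprocess the English course data.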

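III. Experiments on Coursegraph Data

1. Preparing the data
So that everyone can follow along, I have prepared a Coursera course dataset, which you can download here: coursera_corpus (Baidu Netdisk link: http://t.cn/RhjgPkv, password: oppc). It contains 379 courses, one per line, and each line has three tab-separated parts: course name\tcourse summary\tcourse details. HTML tags have already been removed. The examples below show only the course names: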

Writing II: Rhetorical Composing
Genetics and Society: A Course for Educators
General Game Playing
Genes and the Human Condition (From Behavior to Biotechnology)
A Brief History of Humankind
New Models of Business in Society
Analyse Numérique pour Ingénieurs
Evolution: A Course for Educators
Coding the Matrix: Linear Algebra through Computer Science Applications
The Dynamic Earth: A Course for Educators
...

OK, first let's open Python and load the data:

>>> courses = [line.strip() for line in open('coursera_corpus')]
>>> courses_name = [course.split('\t')[0] for course in courses]
>>> print courses_name[0:10]
['Writing II: Rhetorical Composing', 'Genetics and Society: A Course for Educators', 'General Game Playing', 'Genes and the Human Condition (From Behavior to Biotechnology)', 'A Brief History of Humankind', 'New Models of Business in Society', 'Analyse Num\xc3\xa9rique pour Ing\xc3\xa9nieurs', 'Evolution: A Course for Educators', 'Coding the Matrix: Linear Algebra through Computer Science Applications', 'The Dynamic Earth: A Course for Educators']
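
The session above is Python 2. For readers on Python 3, here is a minimal sketch of the same loading step that keeps all three tab-separated fields. The function name load_courses and the two sample lines are made up for illustration; they are not part of the original data or code.

```python
import io


def load_courses(stream):
    """Return (name, summary, detail) tuples from tab-separated lines."""
    courses = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        # Split at most twice so that any tab inside the detail field survives.
        parts = line.split("\t", 2)
        # Pad short lines so every record has exactly three fields.
        parts += [""] * (3 - len(parts))
        courses.append(tuple(parts))
    return courses


# Hypothetical two-line sample standing in for the real coursera_corpus file.
sample = io.StringIO(
    "General Game Playing\tshort intro\tlonger details\n"
    "Analyse Numérique pour Ingénieurs\tintro\tdetails\n"
)
courses = load_courses(sample)
courses_name = [name for name, _, _ in courses]
print(courses_name)
```

With the real file, the stream could be open('coursera_corpus', encoding='utf-8'), assuming the file is UTF-8 encoded as the byte strings in the output above suggest.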

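2. Bringing in NLTK
NLTK is a well-known Python toolkit for natural language processing. It mainly targets English, but since the course data Coursegraph currently handles is mostly English, that is sufficient. NLTK comes with documentation, corpora, and a book, and someone in China has even generously translated the book into Chinese as 用Python进行自然语言处理 (Natural Language Processing with Python). Sometimes I can't help thinking that people doing English NLP are really fortunate.

First, install NLTK. The NLTK home page explains in detail how to install it on Mac, Linux, and Windows: http://nltk.org/install.html . The main thing is to install the dependencies NumPy and PyYAML first; the rest is straightforward. Once NLTK is installed, test it with import nltk. If that works, there is one more very important step: downloading the corpora that NLTK provides: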

>>> import nltk
>>> nltk.download()

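A graphical window pops up at this point and offers two data collections for download, all-corpora and book. It is best to select and download both. This takes a while. Only after the corpora are downloaded is NLTK really usable on your machine. You can test it with the Brown Corpus: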

>>> from nltk.corpus import brown
>>> brown.readme()
'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
1161192

Now let's process the course data from earlier. If we only lowercase the words in each document, as we did before, we get the following:

>>> texts_lower = [[word for word in document.lower().split()] for document in courses]
>>> print texts_lower[0]
['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...

Notice that much of the punctuation is still attached to words, so we bring in NLTK's word_tokenize function and process the data with it:

>>> from nltk.tokenize import word_tokenize
>>> texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]
>>> print texts_tokenized[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading', ',', 'research', ',', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic', ',', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers', '...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...
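
Some tokens in the output above, such as 'texts.' and 'citizens.', still carry a trailing period. As a rough illustration of how a stricter split behaves, here is a small regex-based tokenizer using only the standard library. It is a sketch and not NLTK's implementation; simple_tokenize is a name invented for this example.

```python
import re

# A token is either a run of word characters or a run of
# non-word, non-space characters (punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Lowercase the text and split words from punctuation."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


print(simple_tokenize("multimodal texts. Join us"))
```

The trade-off is that this also splits contractions and abbreviations at every punctuation mark, which word_tokenize tries to avoid.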

After tokenizing the English course data, we need to remove stop words. Fortunately NLTK provides an English stop word list:

>>> from nltk.corpus import stopwords
>>> english_stopwords = stopwords.words('english')
>>> print english_stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
>>> len(english_stopwords)
127

That is 127 stop words in total. We first filter the stop words out of the course corpus:
>>> texts_filtered_stopwords = [[word for word in document if word not in english_stopwords] for document in texts_tokenized]
>>> print texts_filtered_stopwords[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', ',', 'research', ',', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', ',', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', ',', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', ',', 'visual', ',', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', ',', 'demonstrations', ',', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', ',', 'rhetoric', 'course', 'design', ',', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', ',', 'students', ',', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', ',', 'writers', 'exchange', ',', 'place', 'exchange', 'work', 'feedback']

The stop words are gone, but the punctuation is still there. That is easy to handle. We first define a list of punctuation marks:
>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']

Then we filter out these punctuation marks:
>>> texts_filtered = [[word for word in document if word not in english_punctuations] for document in texts_filtered_stopwords]
>>> print texts_filtered[0]
['writing', 'ii', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', 'research', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', 'ideas', 'talents', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', 'visual', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', 'demonstrations', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', 'rhetoric', 'course', 'design', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', 'students', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', 'writers', 'exchange', 'place', 'exchange', 'work', 'feedback']
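
The two filtering passes above can also be done in one pass with set lookups, which avoids scanning a list for every token. The sketch below assumes a tiny hand-picked stop word set purely for illustration (the real list comes from NLTK) and uses string.punctuation instead of the hand-written punctuation list, so it also drops tokens such as '...'.

```python
import string

# Tiny illustrative subset; the real list is stopwords.words('english').
STOPWORDS = {"a", "and", "in", "of", "to", "you"}
PUNCTUATION = set(string.punctuation)


def filter_tokens(tokens):
    """Drop stop words and tokens that consist only of punctuation."""
    kept = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        # Tokens like ',' or '...' contain nothing but punctuation characters.
        if all(ch in PUNCTUATION for ch in tok):
            continue
        kept.append(tok)
    return kept


tokens = ['writing', 'ii', ':', 'you', 'in', 'a', 'series', '...', 'texts.']
print(filter_tokens(tokens))
```

Note that a token like 'texts.' still passes through, because it is not made up of punctuation alone; separating it is the tokenizer's job.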

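Going a step further, we stem the English words. NLTK offers several stemmer interfaces to choose from; see this page: http://nltk.org/api/nltk.stem.html . The options include well-known English stemmers such as the Lancaster Stemmer and the Porter Stemmer. Here we use LancasterStemmer: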

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('stemmed')
'stem'
>>> st.stem('stemming')
'stem'
>>> st.stem('stemmer')
'stem'
>>> st.stem('running')
'run'
>>> st.stem('maximum')
'maxim'
>>> st.stem('presumably')
'presum'

Let's call this interface on the course data above:
>>> texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]
>>> print texts_stemmed[0]
['writ', 'ii', 'rhet', 'compos', 'rhet', 'compos', 'eng', 'sery', 'interact', 'read', 'research', 'compos', 'act', 'along', 'assign', 'design', 'help', 'becom', 'effect', 'consum', 'produc', 'alphabet', 'vis', 'multimod', 'texts.', 'join', 'us', 'becom', 'effect', 'writ', '...', 'bet', 'citizens.', 'rhet', 'compos', 'cours', 'writ', 'exchang', 'word', 'idea', 'tal', 'support.', 'introduc', 'vary', 'rhet', 'concepts\xe2\x80\x94that', 'idea', 'techn', 'inform', 'persuad', 'audiences\xe2\x80\x94that', 'help', 'becom', 'effect', 'consum', 'produc', 'writ', 'vis', 'multimod', 'texts.', 'class', 'includ', 'short', 'video', 'demonst', 'activities.', 'envid', 'rhet', 'compos', 'learn', 'commun', 'includ', 'enrol', 'cours', 'instructors.', 'bring', 'expert', 'writ', 'rhet', 'cours', 'design', 'design', 'assign', 'cours', 'infrastruct', 'help', 'shar', 'expery', 'writ', 'stud', 'profess', 'us.', 'collab', 'facilit', 'wex', 'writ', 'exchang', 'plac', 'exchang', 'work', 'feedback']

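Before bringing in gensim, there is one more thing to do: remove low-frequency words that occur only once in the whole corpus. In my tests, leaving them in hurt the results somewhat: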

>>> all_stems = sum(texts_stemmed, [])
>>> stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
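
Calling all_stems.count(stem) once per distinct stem rescans the whole token list each time, which gets slow on larger corpora. A possible alternative, sketched here with collections.Counter, counts everything in a single pass. The helper name drop_singletons and the three toy documents are invented for this example.

```python
from collections import Counter


def drop_singletons(texts_stemmed):
    """Remove stems that occur exactly once across the whole corpus."""
    # One pass over all tokens builds the corpus-wide frequency table.
    counts = Counter(stem for text in texts_stemmed for stem in text)
    return [[stem for stem in text if counts[stem] > 1] for text in texts_stemmed]


# Toy corpus: 'writ' and 'wex' each appear only once.
docs = [['writ', 'compos', 'rhet'], ['compos', 'cours'], ['rhet', 'cours', 'wex']]
print(drop_singletons(docs))
```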

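3. Bringing in gensim
With the preprocessing above done, we can bring in gensim and quickly run the course similarity experiment. What follows is a quick pass through the workflow; see the previous part for the detailed description.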

>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> dictionary = corpora.Dictionary(texts)
2013-06-07 21:37:07,120 : INFO : adding document #0 to Dictionary(0 unique tokens)
2013-06-07 21:37:07,263 : INFO : built Dictionary(3341 unique tokens) from 379 documents (total 46417 corpus positions)

>>> corpus = [dictionary.doc2bow(text) for text in texts]

>>> tfidf = models.TfidfModel(corpus)
2013-06-07 21:58:30,490 : INFO : collecting document frequencies
2013-06-07 21:58:30,490 : INFO : PROGRESS: processing document #0
2013-06-07 21:58:30,504 : INFO : calculating IDF weights for 379 documents and 3341 features (29166 matrix non-zeros)

>>> corpus_tfidf = tfidf[corpus]
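
To make the tf-idf step less of a black box, here is a pure-Python sketch of the weighting. It assumes what I understand to be gensim's default scheme (raw term frequency times log2(N/df), followed by normalizing each document vector to unit length); treat that as an assumption and check the TfidfModel documentation for the exact behavior. The function name tfidf_vectors and the toy documents are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(texts)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for text in texts for term in set(text))
    vectors = []
    for text in texts:
        tf = Counter(text)
        weights = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            # Terms found in every document get idf 0 and are dropped.
            weights = {t: w / norm for t, w in weights.items() if w != 0.0}
        else:
            weights = {}
        vectors.append(weights)
    return vectors


docs = [['machin', 'learn', 'cours'],
        ['neur', 'network', 'cours'],
        ['learn', 'network', 'cours']]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In this toy corpus 'cours' appears in every document, so it carries no weight, while the rarer 'machin' outweighs 'learn'.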

Here we arbitrarily decide to train an LSI model with 10 topics:
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

>>> index = similarities.MatrixSimilarity(lsi[corpus])
2013-06-07 22:04:55,443 : INFO : scanning corpus to determine the number of features
2013-06-07 22:04:55,510 : INFO : creating matrix for 379 documents and 10 features

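The course index based on the LSI model is now built. We take Professor Andrew Ng's Machine Learning open course as an example. This course is on line 211 of our coursera_corpus file, that is: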

>>> print courses_name[210]
Machine Learning

Now we can use the lsi model to map this course into the 10-topic space and then compute its similarity with the other courses:
>>> ml_course = texts[210]
>>> ml_bow = dictionary.doc2bow(ml_course)
>>> ml_lsi = lsi[ml_bow]
>>> print ml_lsi
[(0, 8.3270084238788673), (1, 0.91295652151975082), (2, -0.28296075112669405), (3, 0.0011599008827843801), (4, -4.1820134980024255), (5, -0.37889856481054851), (6, 2.0446999575052125), (7, 2.3297944485200031), (8, -0.32875594265388536), (9, -0.30389668455507612)]
>>> sims = index[ml_lsi]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

Take the top 10 courses sorted by similarity:
>>> print sort_sims[0:10]
[(210, 1.0), (174, 0.97812241), (238, 0.96428639), (203, 0.96283489), (63, 0.9605484), (189, 0.95390636), (141, 0.94975704), (184, 0.94269753), (111, 0.93654782), (236, 0.93601125)]

The first course is the course itself:
>>> print courses_name[210]
Machine Learning

The second is the Machine Learning open course by Pedro Domingos, another leading figure on Coursera:
>>> print courses_name[174]
Machine Learning

The third is the Probabilistic Graphical Models open course by Professor Daphne Koller, Coursera's other co-founder and likewise a leading figure:
>>> print courses_name[238]
Probabilistic Graphical Models

The fourth is the Neural Networks open course by Geoffrey Hinton, another giant of the field; some students call it required study for Deep Learning.
>>> print courses_name[203]
Neural Networks for Machine Learning
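
The ranking step used above boils down to cosine similarity between the query vector and every indexed vector, followed by a sort. Here is a minimal standard-library sketch of that idea on dense lists; it is not gensim's implementation, and the names cosine and rank_by_similarity as well as the three toy vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, top_n=10):
    """Return (index, similarity) pairs, most similar first."""
    sims = [cosine(query, v) for v in vectors]
    return sorted(enumerate(sims), key=lambda item: -item[1])[:top_n]


# Toy 2-dimensional "topic space" with three documents.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ranked = rank_by_similarity(vectors[0], vectors, top_n=2)
print(ranked)
```

As in the real experiment, the query document ranks first against itself, and its nearest neighbor in topic space comes second.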

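The results look quite good. If you find this interesting, try it yourself.

That wraps up this series. I had planned to write about an experiment on the full English Wikipedia dump, but since Coursegraph doesn't need it for now, I'll stop here. If you're interested, read the relevant gensim documentation directly; it is very detailed. Going forward I'll probably focus more on applying NLTK to Chinese information processing. Stay tuned.

Note: this is an original article. If you repost it, please credit the source "我爱自然语言处理": www.52nlp.cn

Permalink: https://www.52nlp.cn/如何计算两个文档的相似度三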


113 comments on "How to Compute the Similarity Between Two Documents (Part 3)"

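飘过 said:

June 14, 2013 09:15

Learned a lot, thank you. By the way, could you cover similarity for Chinese documents? I assume the English resources are fairly complete; is word segmentation the tricky part?

52nlp replied:
June 14, 2013 at 15:25
Chinese text needs to be preprocessed with a word segmenter; everything else is similar. I'll share more when I get the chance.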

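duskwaitor said:

June 28, 2013 17:48

Before sorting the courses you compute sims = index[ml_lsi]. I wanted to write the contents of sims to a txt file to inspect them, but the file is full of garbage characters. I tried many approaches found via Google and Baidu and none of them recovers readable text, so I'm asking you: how do I store the contents of sims in a text file?

52nlp replied:
July 1, 2013 at 11:09
Have a look at the previous part and try printing it like this:
>>> sims = index[query_lsi]
>>> print list(enumerate(sims))
[(0, 0.40757114), (1, 0.93163693), (2, 0.83416492)]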

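tusoutu said:

October 25, 2013 16:58

Why is my result:
[(26, 0.99999994), (175, 0.99999994), (210, 0.99999994), (230, 0.99999994), (273, 0.99586576), (139, 0.96113241), (158, 0.96113241), (301, 0.96113241), (109, 0.92345488), (238, 0.92345488)]
The first one is
Live!: A History of Art for Artists, Animators and Gamers
The second one is
Exploring Quantum Physics
These are about art and quantum physics. Would it be possible to set course categories and then compute similarity?

52nlp replied:
December 14, 2013 at 11:47
If you followed this tutorial step by step, this shouldn't happen; probably one of the steps went wrong. Also, the similarity is based on topics, so categories don't need to be considered.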

refeng said:

April 15, 2014 22:58

Have you noticed that NLTK's word_tokenize function fails to separate periods from words?

52nlp replied:
April 16, 2014 at 11:35
It seems to work in some cases and not in others:

>>> word_tokenize("this's a test.")
['this', "'s", 'a', 'test', '.']

To strip punctuation completely, you can try:

>>> from nltk.tokenize import WordPunctTokenizer
>>> word_punct_tokenizer = WordPunctTokenizer()
>>> word_punct_tokenizer.tokenize("This's a test.")
['This', "'", 's', 'a', 'test', '.']

陈亮 said:

July 17, 2014 18:47

list is a vectorized corpus of length 13000, and the corpus dictionary contains 14000 words. Converting list to a NumPy ndarray with gensim's matutils.corpus2dense(list,len(dictionary)) raises the following error:
Traceback (most recent call last):
File "", line 1, in
numpy_matrix = matutils.corpus2dense(corpus_tfidf,len(dictionary))
File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\matutils.py", line 190, in corpus2dense
return numpy.column_stack(sparse2full(doc, num_terms) for doc in corpus)
File "C:\Python27\lib\site-packages\numpy-1.7.1-py2.7-win32.egg\numpy\lib\shape_base.py", line 296, in column_stack
return _nx.concatenate(arrays,1)
MemoryError
Do you know the cause? Is it running out of memory?

52nlp replied:
July 20, 2014 at 11:55
Sorry, I'm not sure. Judging from the "MemoryError" in the traceback, it is probably memory-related. You could try a small sample to see whether that goes through.

陈亮 replied:
September 19, 2014 at 15:42
It was insufficient memory. I later ran it on a server with 32 GB of RAM and it went through without problems.

lqs said:

July 17, 2014 21:06

The Coursera course data can't be downloaded. Could you share it?

52nlp replied:
July 20, 2014 at 12:00
I've uploaded a copy to Baidu Netdisk; help yourself:
Link: http://pan.baidu.com/s/1gdsvS1X Password: oppc

lqs said:

July 24, 2014 14:36

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
courses = [line.strip() for line in file('coursera_corpus')]
courses_name = [course.split('\t')[0] for course in courses]
import nltk
from nltk.corpus import brown
brown.readme()
texts_lower = [[word for word in document.lower().split()] for document in courses]
print texts_lower[0]
from nltk.tokenize import word_tokenize
texts_tokenized = [[word.lower() for word in word_tokenize(document)] for document in courses]

Executing the last statement raises an error:
'utf8' codec can't decode byte 0xc2 in position 11
Without the earlier sys.setdefaultencoding("utf-8") it is instead:
'ascii' codec can't decode byte 0xc2 in position 11

How can I fix this?

52nlp replied:
July 25, 2014 at 13:36
This is related to courses written in Chinese or other non-Latin scripts. The simple fix is to manually delete the lines that trigger the error; that shouldn't affect the experiment.

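Nan said:

July 25, 2014 15:30

I processed English web pages this way and hit the same error as the commenter above. Is there a fix other than deleting the offending lines?

52nlp replied:
July 25, 2014 at 20:55
Encoding problems like this are very annoying and come in all shapes. You could try encoding the English web pages as utf8 first; beyond that, you'll have to google it.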

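Jimmy said:

August 30, 2014 13:26

Hi, I read 52NLP's articles regularly and learn a lot from them. I have a small question here:
texts_stemmed = [[st.stem(word) for word in docment] for docment in texts_filtered]
>>> print texts_stemmed[0]
['writ', 'ii', 'rhet', 'compos', 'rhet', 'compos', 'eng', ...]
Here 'write' became 'writ' and 'compose' became 'compos'. That is fine for computing similarity, but when I want to set prior distributions for these words, how can I get the stemmer to return 'write' instead of 'writ'? Thanks~

52nlp replied:
August 31, 2014 at 18:33
I haven't studied the details of stemming. Here I simply called one of the stemmers. There are a few other English stemming tools, but they all have this problem to some degree, because it is hard to cover every case and they can only normalize consistently. If you want to keep the original English word form, another option is a lemmatizer.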

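Jimmy replied:
September 2, 2014 at 22:53
Thanks. I tried WordNetLemmatizer and it does turn write back into its base form, though it has some other problems, so it looks like each case needs its own analysis. By the way, where can the full English Wikipedia corpus (over 4 million articles, a bit over 9 GB compressed) be downloaded? Could you send it to me? Many thanks!

52nlp replied:
September 3, 2014 at 09:02
The Wikipedia data comes from the official source; you can download it yourself: http://dumps.wikimedia.org/

Jimmy replied:
September 3, 2014 at 09:55
OK, found it, but there are many downloads. Which one is the corpus you mentioned, with over 4 million articles and a bit over 9 GB compressed? Thanks a lot.

Jimmy replied:
September 3, 2014 at 11:08
I looked on Wikimedia Downloads for a long time and couldn't seem to find the corpus you mentioned, with over 4 million articles and a bit over 9 GB compressed 🙂

52nlp replied:
September 3, 2014 at 11:12
That was the full set of wiki articles as of this time last year. The latest should be in this directory: http://dumps.wikimedia.org/enwiki/20140811/

The full dump is about 11 GB: enwiki-20140811-pages-articles-multistream.xml.bz2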

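Jimmy replied:
September 3, 2014 at 16:00
Thanks. I've downloaded the 11G xml file. How can it be applied to Coursegraph to improve course similarity? Is the first step to define an evaluation metric for similarity quality? And how exactly did you use the wiki articles to improve course similarity? :) (I'd quite like to implement this.) Many thanks.

52nlp replied:
September 3, 2014 at 19:25
You can run through this gensim experiment, but check your machine's performance first; it took me a long time to run.

http://radimrehurek.com/gensim/wiki.html

Jimmy replied:
September 3, 2014 at 23:12
Should be fine, I have a workstation; I'll try it first. I also saw you mention "in part three we'll implement it together", haha. Could we perhaps discuss how to improve Coursegraph's similarity then? 🙂 Also, would it be convenient to share your QQ, or email me or something? :) It would make it easier to ask questions later. Thanks~

JIM replied:
September 4, 2014 at 11:49
Haha, got it. Now I'm even more interested in doing it. Thanks!

飞行棋 replied:
September 8, 2014 at 23:48
jim, hi, could you tell me how you used Wikipedia to help with the similarity computation? Thanks.

52nlp replied:
September 4, 2014 at 11:10
Sorry, I'm very busy and it's not convenient to share. Thanks.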


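飞行棋 said:

September 8, 2014 23:46

Hello, I found your article very helpful; it is detailed and easy to follow step by step. I have one question: I also tried the 11G enwiki-20140811-pages-articles-multistream.xml.bz2 from the URL you gave. Can it help with the similarity computation? How exactly would I use it? Thanks.

52nlp replied:
September 9, 2014 at 08:04
The method is the same; only the amount of data differs. In principle the wiki corpus should help improve article similarity, but it also depends on the domain of your articles. You can follow the gensim guide to learn how to use the wiki corpus: http://radimrehurek.com/gensim/wiki.html . Computing article similarity works the same way as in the blog post; there is nothing new.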

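lgg said:

September 9, 2014 16:46

When there are many documents to compute over, it gets quite slow. Any idea how to deal with that?

52nlp replied:
September 9, 2014 at 20:47
gensim's support for large corpora is actually very good, and it even has a distributed mode. Slowness mostly depends on your machine and, for that matter, on what you consider slow. I ran the Wikipedia corpus and it took about a day, which I found acceptable.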

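lgg said:

September 9, 2014 16:56

I tested it on Chinese documents and the relevance results were not good. Is that really how it is?

52nlp replied:
September 9, 2014 at 20:45
The method should be language-independent. The difference lies in what your training set and test set are; it has to be analyzed case by case.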

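lgg said:

September 9, 2014 17:01

A question for the author:
lsi = models.LsiModel(corpus_tfidf, id2word=dic, num_topics=15)
How should the num_topics value be chosen to get the best result?

52nlp replied:
September 9, 2014 at 20:44
It is an empirical value and depends on the size of your training data.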

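飞行棋 said:

September 10, 2014 19:39

How did you crawl the Coursera course data? I wrote a crawler today but couldn't get the data.

52nlp replied:
September 13, 2014 at 18:18
You can refer to this:

http://www.quora.com/Does-Coursera-have-any-plans-to-offer-a-REST-API

We used a similar approach. At first a scrapy-style crawler didn't work for us; later we detected its JSON files and read the data from them. That reference didn't exist yet at the time, so it took a lot of figuring out. You can try it yourself.

飞行棋 replied:
September 15, 2014 at 16:38
Thanks. I tried the Coursera API for two days and debugged for a long time but still couldn't get it working. Do you happen to have a complete Coursera corpus? (Coursera seems to have over 770 English courses now.) Your earlier corpus seems to have only about 400 courses, including some in French.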

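vincent said:

September 19, 2014 15:47

Hello. In an LSA implementation you can weight each term with TF-IDF, or use plain boolean values, or the term's frequency in the document. But does TF-IDF weighting actually work better, and why? I worked through the formulas for a long time and couldn't prove anything. I'd appreciate the author's guidance.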

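cathy said:

January 21, 2015 15:25

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

>>> index = similarities.MatrixSimilarity(lsi[corpus])
Hello, I'd like to ask why the second line doesn't use lsi[corpus_tfidf].

52nlp replied:
January 21, 2015 at 21:16
>>> index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it

Here corpus denotes the document-vector matrix.

cathy replied:
January 21, 2015 at 21:40
But the lsi in the first line was built from the tfidf version of corpus. If it were lsi=models.LsiModel(corpus, id2word=dictionary, num_topics=10)
then I think the second line could be index = similarities.MatrixSimilarity(lsi[corpus])
as in the gensim tutorial. My feeling is that if corpus_tfidf is used, then when computing similarity for a query, the query should also be converted to tfidf first, i.e. index[lsi[tfidf[query]]], and the initial index should also be similarities.MatrixSimilarity(lsi[corpus_tfidf]).

cathy replied:
January 21, 2015 at 21:42
If it's convenient, could we connect on WeChat? I've been working on the Chinese side of this lately and would especially like to ask about handling synonyms. Do you build a dictionary? And if so, given how many synonyms there are, is there an existing synonym dictionary?

52nlp replied:
January 22, 2015 at 09:24
Sorry, I'm very busy and WeChat isn't convenient. For synonyms, I suggest looking at word2vec, which computes word similarity from a corpus and captures relations between words statistically.

cathy replied:
January 22, 2015 at 14:12
Oh, OK, thanks a lot.

cathy said:

January 25, 2015 19:33

I'd like to ask how to handle articles in PDF format. Does open().read() work? Why do I get garbage when I try it? And for forty-odd articles of about 50 pages each, how long would a run take? I tried it and never got a result, and I have no idea when it will finish.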
