Word2Vec Experiments on Chinese and English Wikipedia Corpora

AINLP

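Note: you can now try the Tencent word vectors directly in the AINLP WeChat official account. In the account's chat, type 相似词 ("similar words") followed by the term you want to look up.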

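I recently tried out Word2Vec and GloVe together with their Python counterparts, gensim word2vec and python-glove, and wanted to test them on a larger corpus; Wikipedia was the natural choice. Wikipedia officially provides a good data source, https://dumps.wikimedia.org, from which dumps can be downloaded in many languages and formats. I had previously used gensim on the English Wikipedia corpus to train LSI and LDA models for computing the similarity of two documents, so I wanted to see whether gensim also offers a simple way to process Wikipedia data and train a word2vec model for computing semantic similarity between words. Thanks to Google, I found a very long thread in the gensim Google group, "training word2vec on full Wikipedia", which essentially settles how to train a word2vec model on the Wikipedia corpus with gensim. Dr. Radim Řehůřek, the author of gensim, took part in the discussion and even added a small fix to a new gensim release, so all that was left for me to do was verify it. There is also a wiki2vec project on GitHub that does the same thing, but I prefer solving the problem with Python and gensim.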

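There is plenty of reference material on word2vec in both Chinese and English. In English, you can read the papers recommended on the official site, or the articles written by gensim's author, Dr. Radim Řehůřek. In Chinese, I recommend @licstar's 《Deep Learning in NLP (一)词向量和语言模型》, the Youdao tech salon's 《Deep Learning实战之word2vec》, @飞林沙's 《word2vec的学习思路》, and falao_beiliu's 《深度学习word2vec笔记之基础篇》 and 《深度学习word2vec笔记之算法篇》.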

I. Word2Vec Test on the English Wikipedia

I first tested the English Wikipedia data. I downloaded the latest compressed XML dump (download date: March 1, 2015), about 11 GB, from:

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

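Processing takes two stages. First, the XML wiki data is converted to plain text with the following script (process_wiki.py):
Note: many readers reported problems running it under Python 3.x, so this version has been modified to be compatible with both Python 2.x and Python 3.x; tested on Ubuntu 16.04 (2017.5.1).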

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017
 
from __future__ import print_function
 
import logging
import os.path
import six
import sys
 
from gensim.corpora import WikiCorpus
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0
 
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(b' '.join(text).decode('utf-8') + '\n')
        #   ###another method###
        #    output.write(
        #            space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
 
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

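This uses WikiCorpus, gensim's Wikipedia processing class. Its get_texts method converts each Wikipedia article into one line of text and strips punctuation and similar content. Note that in "wiki = WikiCorpus(inp, lemmatize=False, dictionary={})", lemmatize is set to False mainly so that the pattern module is not used to lemmatize English words, whether or not pattern is installed on your machine, because pattern slows the processing down severely.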
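The Python 2/3 branch in the script exists because the tokens may arrive as bytes. As a minimal sketch, assuming that get_texts() can yield either bytes or str tokens depending on the gensim and Python versions (an assumption, not something verified here), a small helper can join one article into one line either way. The name tokens_to_line is hypothetical and not part of gensim.

```python
def tokens_to_line(tokens, encoding="utf-8"):
    """Join one article's tokens into a single space-separated text line."""
    # Decode any bytes tokens; leave str tokens untouched.
    words = [t.decode(encoding) if isinstance(t, bytes) else t for t in tokens]
    return " ".join(words) + "\n"


# Toy tokens standing in for one article returned by get_texts().
print(tokens_to_line([b"anarchism", b"is", b"collection"]), end="")
print(tokens_to_line(["anarchism", "is", "collection"]), end="")
```

Both calls print the same line, so the write loop would not need a version check.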

Running "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text":

2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2015-03-07 15:11:12,860: INFO: Saved 10000 articles
2015-03-07 15:13:25,369: INFO: Saved 20000 articles
2015-03-07 15:15:19,771: INFO: Saved 30000 articles
2015-03-07 15:16:58,424: INFO: Saved 40000 articles
2015-03-07 15:18:12,374: INFO: Saved 50000 articles
2015-03-07 15:19:03,213: INFO: Saved 60000 articles
2015-03-07 15:19:47,656: INFO: Saved 70000 articles
2015-03-07 15:20:29,135: INFO: Saved 80000 articles
2015-03-07 15:22:02,365: INFO: Saved 90000 articles
2015-03-07 15:23:40,141: INFO: Saved 100000 articles
.....
2015-03-07 19:33:16,549: INFO: Saved 3700000 articles
2015-03-07 19:33:49,493: INFO: Saved 3710000 articles
2015-03-07 19:34:23,442: INFO: Saved 3720000 articles
2015-03-07 19:34:57,984: INFO: Saved 3730000 articles
2015-03-07 19:35:31,976: INFO: Saved 3740000 articles
2015-03-07 19:36:05,790: INFO: Saved 3750000 articles
2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)
2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles

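On my Mac Pro (4 cores, 16 GB) this ran for about four and a half hours and processed about 3.75 million articles, producing wiki.en.text, a 12 GB plain-text version of the English Wikipedia that looks like this: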

anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix...

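With this data, either the original word2vec binary or the Python word2vec in gensim can be used to train a model. We tried the former and found it slow, so we used the gensim word2vec training script from the Google group thread, with a small modification that also keeps the output in vector text format to make debugging easier. The script, train_word2vec_model.py, is as follows: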

#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
import logging
import os
import sys
import multiprocessing
 
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
    # check and process input arguments
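    if len(sys.argv) < 4:
        # The module has no docstring, so print an explicit usage line
        print("Usage: python train_word2vec_model.py wiki.xx.text wiki.xx.text.model wiki.xx.text.vector")
        sys.exit(1)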
    inp, outp1, outp2 = sys.argv[1:4]
 
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())
 
    # trim unneeded model memory = use(much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)

Running "python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector":

2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
2015-03-09 22:48:29,593: INFO: collecting all words and their counts
2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5
2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words
2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29
2015-03-09 23:14:09,790: INFO: resetting layer weights
2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s
2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s
2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s
2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s
2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s
2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s
2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s
2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s
2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s
2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s
2015-03-09 23:15:48,374: INFO: PROGRESS: at 0.18% words, alpha 0.02497, 83370 words/s
2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s
2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s
2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s
2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s
2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s
2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s
2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s
2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s
2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s
2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s
2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s
2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s
2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s
2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s
2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s
.......
2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s
2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs
2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s
2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None
2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm
2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy
2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy
2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector
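The log above reports 8069236 word types collected and 1969354 left "after removing those with count<5", which corresponds to min_count=5 in the training script. The following toy sketch illustrates that kind of frequency pruning with the standard library only; prune_vocab is a hypothetical helper for illustration, not gensim's implementation.

```python
from collections import Counter


def prune_vocab(sentences, min_count=5):
    """Count words over all sentences and keep those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy corpus: each inner list is one tokenized line of text.
toy = [["the", "frog"], ["the", "toad"], ["the", "frog"]]
print(prune_vocab(toy, min_count=2))
```

On a Wikipedia-sized corpus most word types are rare, which is why pruning shrinks the vocabulary so sharply.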

After about 7 hours, we had a word2vec model in gensim's default format and a model in the vector format of the original C word2vec, wiki.en.text.vector, which looks like this:

1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817...
...
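The vector text format shown above is simple: a header line with the vocabulary size and the dimensionality, then one line per word holding the word and its values. As a rough sketch of that layout (read_word2vec_text is a hypothetical helper, and the three-dimensional sample is toy data, not the real 400-dimensional file), it can be parsed like this:

```python
def read_word2vec_text(lines):
    """Yield (word, vector) pairs from lines in word2vec text format."""
    lines = iter(lines)
    # Header: "<vocabulary size> <vector dimensionality>"
    vocab_size, dim = (int(x) for x in next(lines).split())
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        word = parts[0]
        values = [float(v) for v in parts[1:] if v]
        if len(values) != dim:
            raise ValueError("expected %d values for %r, got %d"
                             % (dim, word, len(values)))
        yield word, values


# Toy sample in the same layout as wiki.en.text.vector.
sample = [
    "2 3\n",
    "the 0.129255 0.015725 0.049174\n",
    "of -0.016438 -0.018912 0.032752\n",
]
vectors = dict(read_word2vec_text(sample))
print(sorted(vectors))
```

Reading the file this way is handy for debugging a few rows; for real use, loading through gensim is far more efficient.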

In ipython, we load and test the model with gensim. The model is about 7 GB, so loading takes a while:

In [2]: import gensim
 
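# Note: because of gensim version changes, if the load below fails, try the newer interfaces: gensim.models.KeyedVectors.load_word2vec_format(...) for a vector file like this one, or model = gensim.models.word2vec.Word2Vec.load(MODEL_PATH) for a model saved in gensim's own format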
In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)
 
In [4]: model.most_similar("queen")
Out[4]: 
[(u'princess', 0.5760838389396667),
 (u'hyoui', 0.5671186447143555),
 (u'janggyung', 0.5598698854446411),
 (u'king', 0.5556215047836304),
 (u'dollallolla', 0.5540223121643066),
 (u'loranella', 0.5522741079330444),
 (u'ramphaiphanni', 0.5310937166213989),
 (u'jeheon', 0.5298476219177246),
 (u'soheon', 0.5243583917617798),
 (u'coronation', 0.5217245221138)]
 
In [5]: model.most_similar("man")
Out[5]: 
[(u'woman', 0.7120707035064697),
 (u'girl', 0.58659827709198),
 (u'handsome', 0.5637181997299194),
 (u'boy', 0.5425317287445068),
 (u'villager', 0.5084836483001709),
 (u'mustachioed', 0.49287813901901245),
 (u'mcgucket', 0.48355430364608765),
 (u'spider', 0.4804879426956177),
 (u'policeman', 0.4780033826828003),
 (u'stranger', 0.4750771224498749)]
 
In [6]: model.most_similar("woman")
Out[6]: 
[(u'man', 0.7120705842971802),
 (u'girl', 0.6736541986465454),
 (u'prostitute', 0.5765659809112549),
 (u'divorcee', 0.5429972410202026),
 (u'person', 0.5276163816452026),
 (u'schoolgirl', 0.5102938413619995),
 (u'housewife', 0.48748138546943665),
 (u'lover', 0.4858251214027405),
 (u'handsome', 0.4773051142692566),
 (u'boy', 0.47445783019065857)]
 
In [8]: model.similarity("woman", "man")
Out[8]: 0.71207063453821218
 
In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'
 
In [11]: model.similarity("woman", "girl")
Out[11]: 0.67365416785207421
 
In [13]: model.most_similar("frog")
Out[13]: 
[(u'toad', 0.6868536472320557),
 (u'barycragus', 0.6607867479324341),
 (u'grylio', 0.626731276512146),
 (u'heckscheri', 0.6208407878875732),
 (u'clamitans', 0.6150864362716675),
 (u'coplandi', 0.612680196762085),
 (u'pseudacris', 0.6108512878417969),
 (u'litoria', 0.6084023714065552),
 (u'raniformis', 0.6044802665710449),
 (u'watjulumensis', 0.6043726205825806)]

Everything was fine, but loading the model in gensim's default numpy-based format ran into a problem:

In [1]: import gensim 
 
In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")
 
In [3]: model.most_similar("man")
... RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
 
Out[3]: 
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]

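This is why I modified the script above. It works without any problem on smaller data, for example the first 100,000 lines of text, in both the original format and the gensim format. After training on the full English Wikipedia, however, this problem appears. I have tried several fixes without success; if you have suggestions or a solution, please share them.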
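The warning comes from the normalization line in the traceback, which divides each vector by its length. One thing that may be worth checking, as an assumption rather than a confirmed cause, is whether the loaded matrix contains all-zero or non-finite rows, since such rows cannot be scaled to unit length. A toy sketch of that check (find_unnormalizable is a hypothetical helper working on a plain dict of word to vector):

```python
import math


def find_unnormalizable(vectors):
    """Return the words whose vectors cannot be scaled to unit length."""
    bad = []
    for word, vec in vectors.items():
        norm = math.sqrt(sum(x * x for x in vec))
        # A zero, NaN or infinite norm makes vec / norm meaningless.
        if norm == 0.0 or not math.isfinite(norm):
            bad.append(word)
    return bad


# Toy data: one healthy row, one all-zero row, one row containing NaN.
toy = {"ok": [0.3, 0.4], "zero": [0.0, 0.0], "broken": [float("nan"), 1.0]}
print(find_unnormalizable(toy))
```

If such rows turn up only after loading the gensim-format files and not in the text vector file, that would point at the save or load step rather than at training.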

II. Word2Vec Test on the Chinese Wikipedia

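After the English Wikipedia, I naturally wanted to try the Chinese Wikipedia data. The process is similar to the English one and also has two steps, but the Chinese data needs some special handling: Traditional-to-Simplified conversion, Chinese word segmentation, and removal of non-UTF-8 characters. The Chinese data can be downloaded from: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2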

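The Chinese Wikipedia data is fairly small: the whole compressed XML file is only about 1 GB, much smaller than the English data. First process the compressed XML file with process_wiki.py by running: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text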

2015-03-11 17:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2015-03-11 17:40:08,329: INFO: Saved 10000 articles
2015-03-11 17:40:45,501: INFO: Saved 20000 articles
2015-03-11 17:41:23,659: INFO: Saved 30000 articles
2015-03-11 17:42:01,748: INFO: Saved 40000 articles
2015-03-11 17:42:33,779: INFO: Saved 50000 articles
......
2015-03-11 17:55:23,094: INFO: Saved 200000 articles
2015-03-11 17:56:14,692: INFO: Saved 210000 articles
2015-03-11 17:57:04,614: INFO: Saved 220000 articles
2015-03-11 17:57:57,979: INFO: Saved 230000 articles
2015-03-11 17:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
2015-03-11 17:58:16,622: INFO: Finished Saved 232894 articles

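This yields wiki.zh.text, a plain-text corpus of a little over 230,000 Chinese articles, roughly 750 MB. On inspection, apart from some English words mixed in, it also contains many Traditional Chinese characters. Following the method in @licstar's 《维基百科简体中文语料的获取》, I installed opencc and converted the Traditional characters in wiki.zh.text to Simplified: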

opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini

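Next comes word segmentation. This time I used a Chinese word segmentation system trained on MeCab. It is not yet production-ready, but its speed and segmentation quality are basically adequate for this experiment: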

mecab -d ../data/ -O wakati wiki.zh.text.jian -o wiki.zh.text.jian.seg -b 10000000

Note that the data directory here holds the segmentation model, dictionary files and so on trained for mecab; for details see 《用MeCab打造一套实用的中文分词系统》.

With the segmented Chinese Wikipedia data in hand, I thought I could go straight to training the word2vec model:

python train_word2vec_model.py wiki.zh.text.jian.seg wiki.zh.text.model wiki.zh.text.vector

But I ran into another problem. The error was:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5394-5395: invalid continuation byte

After some googling, it turned out that the file probably contained non-UTF-8 characters, so I handled that with iconv:

iconv -c -t UTF-8 < wiki.zh.text.jian.seg > wiki.zh.text.jian.seg.utf-8
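For readers without iconv at hand, roughly the same cleanup can be sketched in Python with the standard library. This is an approximation of the intent of "iconv -c" (silently dropping bytes that do not form valid UTF-8), not a byte-for-byte equivalent; clean_utf8_stream is a hypothetical helper name. An incremental decoder is used so that a multi-byte character split across two chunks is not mistaken for garbage.

```python
import codecs
import io


def clean_utf8_stream(src, dst, chunk_size=1 << 20):
    """Copy bytes from src to dst, dropping sequences that are not valid UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Flush whatever the decoder is still holding at end of input.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


# Toy input: "中", one invalid byte (0xff), "文", newline.
src = io.BytesIO(b"\xe4\xb8\xad\xff\xe6\x96\x87\n")
dst = io.BytesIO()
clean_utf8_stream(src, dst, chunk_size=2)  # tiny chunks to exercise split characters
print(dst.getvalue().decode("utf-8"))
```

For real files, src and dst would be opened in binary mode, for example open("wiki.zh.text.jian.seg", "rb") and open("wiki.zh.text.jian.seg.utf-8", "wb").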

After that it basically works. Run:

python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector

2015-03-11 18:50:02,586: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2015-03-11 18:50:02,592: INFO: collecting all words and their counts
2015-03-11 18:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-11 18:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
2015-03-11 18:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
2015-03-11 18:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
...
2015-03-11 18:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
2015-03-11 18:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
2015-03-11 18:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
2015-03-11 18:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
2015-03-11 18:52:13,672: INFO: total 278291 word types after removing those with count<5
2015-03-11 18:52:13,673: INFO: constructing a huffman tree from 278291 words
2015-03-11 18:52:29,323: INFO: built huffman tree with maximum node depth 25
2015-03-11 18:52:29,683: INFO: resetting layer weights
2015-03-11 18:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-11 18:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
2015-03-11 18:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
2015-03-11 18:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
2015-03-11 18:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
2015-03-11 18:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
2015-03-11 18:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
2015-03-11 18:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
......
2015-03-11 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
2015-03-11 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
2015-03-11 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
2015-03-11 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
2015-03-11 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
2015-03-11 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
2015-03-11 19:22:13,884: INFO: not storing attribute syn0norm
2015-03-11 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
2015-03-11 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
2015-03-11 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector

Let us look at how the trained Chinese Wikipedia word2vec model "wiki.zh.text.vector" performs:

In [1]: import gensim
 
In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")
 
In [3]: model.most_similar(u"足球")
Out[3]: 
[(u'\u8054\u8d5b', 0.6553816199302673),
 (u'\u7532\u7ea7', 0.6530429720878601),
 (u'\u7bee\u7403', 0.5967546701431274),
 (u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
 (u'\u4e59\u7ea7', 0.5840631723403931),
 (u'\u8db3\u7403\u961f', 0.5560152530670166),
 (u'\u4e9a\u8db3\u8054', 0.5308005809783936),
 (u'allsvenskan', 0.5249762535095215),
 (u'\u4ee3\u8868\u961f', 0.5214947462081909),
 (u'\u7532\u7ec4', 0.5177896022796631)]
 
In [4]: result = model.most_similar(u"足球")
 
In [5]: for e in result:
    print e[0], e[1]
   ....:     
联赛 0.65538161993
甲级 0.653042972088
篮球 0.596754670143
俱乐部 0.587228953838
乙级 0.58406317234
足球队 0.556015253067
亚足联 0.530800580978
allsvenskan 0.52497625351
代表队 0.521494746208
甲组 0.51778960228
 
In [6]: result = model.most_similar(u"男人")
 
In [7]: for e in result:
    print e[0], e[1]
   ....:     
女人 0.77537125349
家伙 0.617369174957
妈妈 0.567102909088
漂亮 0.560832381248
잘했어 0.540875017643
谎言 0.538448691368
爸爸 0.53660941124
傻瓜 0.535608053207
예쁘다 0.535151124001
mc刘 0.529670000076
 
In [8]: result = model.most_similar(u"女人")
 
In [9]: for e in result:
    print e[0], e[1]
   ....:     
男人 0.77537125349
我的某 0.589010596275
妈妈 0.576344847679
잘했어 0.562340974808
美丽 0.555426716805
爸爸 0.543958246708
新娘 0.543640494347
谎言 0.540272831917
妞儿 0.531066179276
老婆 0.528521537781
 
In [10]: result = model.most_similar(u"青蛙")
 
In [11]: for e in result:
    print e[0], e[1]
   ....:     
老鼠 0.559612870216
乌龟 0.489831030369
蜥蜴 0.478990525007
[word lost in extraction] 0.46728849411
鳄鱼 0.461885392666
蟾蜍 0.448014199734
猴子 0.436584025621
白雪公主 0.434905380011
蚯蚓 0.433413207531
螃蟹 0.4314712286
 
In [12]: result = model.most_similar(u"姨夫")
 
In [13]: for e in result:
    print e[0], e[1]
   ....:     
堂伯 0.583935439587
祖父 0.574735701084
妃所生 0.569327116013
内弟 0.562012672424
早卒 0.558042645454
[word lost in extraction] 0.553856015205
胤祯 0.553288519382
陈潜 0.550716996193
愔之 0.550510883331
叔父 0.550032019615
 
In [14]: result = model.most_similar(u"衣服")
 
In [15]: for e in result:
    print e[0], e[1]
   ....:     
鞋子 0.686688780785
穿着 0.672499775887
衣物 0.67173999548
大衣 0.667605519295
裤子 0.662670075893
内裤 0.662210345268
裙子 0.659705817699
西装 0.648508131504
洋装 0.647238850594
围裙 0.642895817757
 
In [16]: result = model.most_similar(u"公安局")
 
In [17]: for e in result:
    print e[0], e[1]
   ....:     
司法局 0.730189085007
公安厅 0.634275555611
公安 0.612798035145
房管局 0.597343325615
商业局 0.597183346748
军管会 0.59476184845
体育局 0.59283208847
财政局 0.588721752167
戒毒所 0.575558543205
新闻办 0.573395550251
 
In [18]: result = model.most_similar(u"铁道部")
 
In [19]: for e in result:
    print e[0], e[1]
   ....:     
盛光祖 0.565509021282
交通部 0.548688530922
批复 0.546967327595
刘志军 0.541010737419
立项 0.517836689949
报送 0.510296344757
计委 0.508456230164
水利部 0.503531932831
国务院 0.503227233887
经贸委 0.50156635046
 
In [20]: result = model.most_similar(u"清华大学")
 
In [21]: for e in result:
    print e[0], e[1]
   ....:     
北京大学 0.763922810555
化学系 0.724210739136
物理系 0.694550514221
数学系 0.684280991554
中山大学 0.677202701569
复旦 0.657914161682
师范大学 0.656435549259
哲学系 0.654701948166
生物系 0.654403865337
中文系 0.653147578239
 
In [22]: result = model.most_similar(u"卫视")
 
In [23]: for e in result:
    print e[0], e[1]
   ....:     
湖南 0.676812887192
中文台 0.626506924629
収蔵 0.621356606483
黄金档 0.582251906395
cctv 0.536769032478
安徽 0.536752820015
非同凡响 0.534517168999
唱响 0.533438682556
最强音 0.532605051994
金鹰 0.531676828861
 
In [24]: result = model.most_similar(u"习近平")
 
In [25]: for e in result:
    print e[0], e[1]
   ....:     
胡锦涛 0.809472680092
江泽民 0.754633367062
李克强 0.739740967751
贾庆林 0.737033963203
曾庆红 0.732847094536
吴邦国 0.726941585541
总书记 0.719057679176
李瑞环 0.716384887695
温家宝 0.711952567101
王岐山 0.703570842743
 
In [26]: result = model.most_similar(u"林丹")
 
In [27]: for e in result:
    print e[0], e[1]
   ....:     
黄综翰 0.538035452366
蒋燕皎 0.52646958828
刘鑫 0.522252976894
韩晶娜 0.516120731831
王晓理 0.512289524078
王适 0.508560419083
杨影 0.508159279823
陈跃 0.507353425026
龚智超 0.503159761429
李敬元 0.50262516737
 
In [28]: result = model.most_similar(u"语言学")
 
In [29]: for e in result:
    print e[0], e[1]
   ....:     
社会学 0.632598280907
人类学 0.623406708241
历史学 0.618442356586
比较文学 0.604823827744
心理学 0.600066184998
人文科学 0.577783346176
社会心理学 0.575571238995
政治学 0.574541330338
地理学 0.573896467686
哲学 0.573873817921
 
In [30]: result = model.most_similar(u"计算机")
 
In [31]: for e in result:
    print e[0], e[1]
   ....:     
自动化 0.674171924591
应用 0.614087462425
自动化系 0.611132860184
材料科学 0.607891201973
集成电路 0.600370049477
技术 0.597518980503
电子学 0.591316461563
建模 0.577238917351
工程学 0.572855889797
微电子 0.570086717606
 
In [32]: model.similarity(u"计算机", u"自动化")
Out[32]: 0.67417196002404789
 
In [33]: model.similarity(u"女人", u"男人")
Out[33]: 0.77537125129824813
 
In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
Out[34]: u'\u4e2d\u5fc3'
 
In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
中心

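There are good cases and bad ones, and possibly more bad than good; this depends on the size of the corpus, on the quality of the word segmenter, and so on. I will stop this experiment here for now. As for what word2vec is good for: beyond computing word similarity, what the industry cares about more is how word2vec performs in concrete application tasks. That is the more interesting part, and everyone is welcome to join the discussion.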

Note: this is an original article; when reposting, please credit the source, "我爱自然语言处理": www.52nlp.cn

Permalink: http://www.52nlp.cn/中英文维基百科语料上的word2vec实验

362 comments on "Word2Vec Experiments on Chinese and English Wikipedia Corpora"

  1. 赵强

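    Hello, and thank you for sharing; this is very helpful.
    However, I cannot find the file process_wiki.py online. Could you send me a copy?
    My email is zqiang825@163.com. Thank you.

    52nlp replied:

    The code is pasted in the article; process_wiki.py is just the name I gave it.

    赵强 replied:

    A beginner's mistake on my part. I have now got it working by following your method, thank you.

    王焕宇 replied:

    Hello, I am using Python 3.4. Following your steps, processing fails with "TypeError: sequence item 0: expected str instance, bytes found". What is the problem?

    52nlp replied:

    A version issue? I am still using Python 2.7.

    王焕宇 replied:

    I am not sure. The error is raised on the output.write statement in process_wiki.py:
    for text in wiki.get_texts():
        output.write(space.join(text) + '\n')

    52nlp replied:

    Are you processing Chinese or English? It is probably related to how Python 3 handles strings. Try this solution:

    Perform the join on a byte string using b''.join():

    >>> b''.join([b'line 1\n', b'line 2\n'])
    b'line 1\nline 2\n'

    http://stackoverflow.com/questions/17068100/joining-byte-list-with-python

    王焕宇 replied:

    It is the English wiki data. I reinstalled Python 2.7.x and it works now. I am not very familiar with Python, so I guess it was an encoding issue in the newer version.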

  2. Victor

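    When I tested the Chinese wiki, I split the wiki into parts for training. Training was fine, but model.most_similar() fails with a memory_error at line 827 of word2vec.py, at self.syn0norm. Have you run into this?

    52nlp replied:

    I did not run into this with the Chinese data.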

  3. zzningxp

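    Hello, I have a question. In Word2Vec implementations, the model matrix is usually stored in a file on disk, and to use it again the program must load the data from that file before querying. But loading is slow, and if the matrix file is larger than memory, queries may cause frequent swapping between disk and memory, which hurts efficiency. Do you know of any open-source solution for this, such as storing the matrix in memcached, mongodb or MySQL?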

  4. KevinJ

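    Hello! Your article has given us many ideas. Our current project needs a corpus to support text mining. What hardware did you use to train on the Chinese Wikipedia, and roughly how long did it take?

    52nlp replied:

    The Chinese corpus is not large; as I recall it was under 1 GB after processing, so it is quick. My machine's configuration is already given in the article:

    On my Mac Pro (4 cores, 16 GB) this ran for about four and a half hours and processed about 3.75 million articles, producing wiki.en.text, a 12 GB plain-text version of the English Wikipedia

    The English data took some time; the Chinese data is fast.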

  5. Token

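    Hello, I have a favor to ask. I need word vectors to build an RAE, but my machine cannot handle such a large corpus. Could you share your trained Chinese word vector file, or point me to a download for other pre-trained Chinese word vectors? Thank you!

    52nlp replied:

    The Chinese corpus is actually very small. I suggest you try it yourself; it really is not large.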

  6. HenrySky

    Hello, if I need to tokenize the wiki corpus into one sentence per line, is there a ready-made tool I can use?

  7. HenrySky

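    Hello, if I want to split the wiki corpus into a one-sentence-per-line format, is there a ready-made tool?

    52nlp replied:

    For English, nltk provides a sentence splitter; for Chinese you may have to write a simple one yourself.

    HenrySky replied:

    So then WikiCorpus cannot be used? With the punctuation removed, there is no way to split sentences, right?

    52nlp replied:

    Right. In that case you may have to modify the code yourself.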

  8. Jarvis

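    Does using model.init_sims(replace=True) have any side effects? Why is it commented out? After calling it, the model shrinks to half its size.

    52nlp replied:

    I do not quite remember. As I recall, I needed it that way for debugging at the time; init_sims seems to discard some information.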

  9. Small

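    Hello, I got a lot out of this article. Does Wikipedia offer corpora in other languages, such as French or German, for download?

    52nlp replied:

    Wikipedia should have dumps in all of its languages. For details, please read the official site carefully: https://dumps.wikimedia.org/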

  10. Small

    Hello, may I ask where I can get corpora in other languages, such as French, German or Spanish?

  11. 鸿志

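    Hello, is Chinese word segmentation very slow? Which segmenter did you use?

    52nlp replied:

    The article already explains this clearly:

    Next comes word segmentation. This time I used a Chinese word segmentation system trained on MeCab. It is not yet production-ready, but its speed and segmentation quality are basically adequate for this experiment:

    mecab -d ../data/ -O wakati wiki.zh.text.jian -o wiki.zh.text.jian.seg -b 10000000

    Note that the data directory here holds the segmentation model, dictionary files and so on trained for mecab; for details see 《用MeCab打造一套实用的中文分词系统》.

    鸿志 replied:

    Thank you very much! I got it working by following your instructions.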

  12. 刘佳军

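    Hello, why do I get an error when I run the English corpus following your steps?
    The error is raised at "for text in wiki.get_texts():"
    and traces all the way back to cElementTree.ParseError: no element found: line 36 column 0

    52nlp replied:

    Sorry, I am not sure about this one. Are all the conditions the same? Is the corpus you are processing the same too?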

  13. liujiajun

    I followed your method.
    Why does it fail at "for text in wiki.get_texts()"?
    The final error is cElementTree.ParseError: no element found: line 36 column 0

  14. 日尧

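    Hello, I ran into this problem when converting Traditional to Simplified with opencc:
    riyaodeMacBook-Pro:Wiki_word2vec riyao$ opencc -i trytry.text -o trytry.text.jian -c zht2zhs.ini
    zht2zhs.ini not found or not accessible.
    Have you run into this?

    52nlp replied:

    No, I have not. Which system are you running on?

    日尧 replied:

    Also on a Mac (riyaodeMacBook-Pro). Switching to UTF-8 encoding did not help either; in the end I had no choice but to do it on Ubuntu.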

  15. Pingback: 【GoGo闯】【折腾】通过word2vec结合SEO做关键词分类_SEO好文章_【方法SEO顾问】

  16. nlper 老张

    Hello, I want to write all of the word2vec results, ordered by similarity, to a separate file, but it keeps failing. I have tried wrapping the variable "line" in single quotes, adding encode(), decode(), and various other approaches, but most_similar() never recognizes it correctly. Could you take a look?
    position_list=codecs.open('position_test.txt', 'r', 'utf-8')
    m=1
    for line in position_list:
        line=line.strip('\n')
        print isinstance(line, unicode)  # check whether it is unicode
        print(type(line))
        print isinstance(line, unicode)  # check whether it is unicode
        print line
        result = new_model.most_similar(line.encode('utf8'))
        for e in result:
            print e[0].decode('utf-8'), e[1]
            fm = codecs.open('line.txt', 'w')
            fm.write(e[0])
            #fm.write(' ')
            #fm.write(str(e[1]))
            fm.write('\n')
        m+=1
        fm.close()
    The error is:
    Traceback (most recent call last):
    File "D:\test\gensim_word2vec5.py", line 116, in <module>
    result = new_model.most_similar(line.encode('utf8'))
    File "C:\Python27\lib\site-packages\gensim-0.12.1-py2.7-win32.egg\gensim\models\word2vec.py", line 1088, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
    KeyError: "word '\xef\xbb\xbf\xe5\x89\x8d\xe5\x8f\xb0\r' not in vocabulary"

    52nlp replied:

    KeyError: "word '\xef\xbb\xbf\xe5\x89\x8d\xe5\x8f\xb0\r' not in vocabulary"

    It looks like the vocabulary is stored as unicode. In this line of yours:

    result = new_model.most_similar(line.encode('utf8'))

    how about not converting to utf8 yet, and looking the word up in the vocabulary while it is still unicode?
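The key in the KeyError above begins with the bytes \xef\xbb\xbf (the UTF-8 byte-order mark) and ends with \r (a carriage return), so the looked-up string is not the bare word. A minimal sketch of reading a word list so that each query is a clean unicode string is given below; the helper name read_query_words is hypothetical, and it assumes the word list is a UTF-8 file with one word per line.

```python
# -*- coding: utf-8 -*-
import io

BOM = u"\ufeff"  # the character that UTF-8 encodes as \xef\xbb\xbf


def read_query_words(path, encoding="utf-8-sig"):
    """Return the non-empty lines of a word list as clean unicode strings.

    The "utf-8-sig" codec drops a leading byte-order mark if one is present,
    and strip() removes the trailing "\\r" and "\\n" left by Windows line ends.
    """
    words = []
    with io.open(path, "r", encoding=encoding) as f:
        for line in f:
            # lstrip(BOM) is a safeguard in case a caller passes plain "utf-8".
            word = line.strip().lstrip(BOM)
            if word:
                words.append(word)
    return words
```

Each returned word can then be passed to most_similar() as unicode, in line with the reply's suggestion to skip the encode('utf8') step.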

  17. 小杨

    Hello, I ran the command you gave in cmd:
    python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
    and got the following error:
    Traceback (most recent call last):
    File "process_wiki.py", line 8, in <module>
    from gensim.corpora import WikiCorpus
    ImportError: No module named gensim.corpora
    What is the problem? Any help would be appreciated.

    52nlp replied:

    Have you installed gensim?

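  18. 薛方

    Hello, I get an error when I run the code. What is '__doc__' in print globals()['__doc__'] % locals()?
    I crawled some online forum posts and built a Chinese corpus of 175M from them, segmented it with jieba, and want to do word-vector analysis on it.
    Following your post I downloaded and installed gensim, and import gensim gives no error. I read your code and tried the English one first (I don't understand most of the code). Could you add me on QQ? My number is 2530170123.

    52nlp replied:

    Sorry, for your question I suggest a quick Google search, which will give you the answer. I'm usually quite busy, so adding people on QQ isn't convenient.

    薛方 replied:

    Sorry, my question was too basic.
    Your write-up does give me some guidance.
    However, I am not yet able to run the programs you provided successfully, so I asked you to add me on QQ only to exchange ideas. It seems my Python level is not there yet.

    I have never used Google.
    Thanks anyway~

    52nlp replied:

    No offense meant. I really am busy and many people ask questions, so I can't add everyone on QQ one by one. Sorry. If you haven't used Google, try Baidu.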

  19. 薛方

    First of all, thank you. Earlier I wasn't running the program in cmd but was stepping through it in Python, which is a bit embarrassing. The first step now works, and I have wiki.en.text.model and wiki.en.text.vector. I don't know what kind of program can open wiki.en.text.model. wiki.en.text.vector, once opened, looks similar to what your post shows.
    Now, in ipython, loading and testing this model through gensim fails. I can't tell what the error means, but I suspect something is wrong with gensim. The output is below. I get the same result after putting wiki.en.text.model and wiki.en.text.vector into C:\Python27\lib\site-packages\gensim-0.12.1-py2.7-win32.egg\gensim\. I know you are very busy, but could you take a look? Thanks.
    model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)
    Traceback (most recent call last):

    File "", line 1, in <module>
    model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)

    File "C:\Python27\lib\site-packages\gensim-0.12.1-py2.7-win32.egg\gensim\models\word2vec.py", line 954, in load_word2vec_format
    with utils.smart_open(fname) as fin:

    File "C:\Python27\lib\site-packages\gensim-0.12.1-py2.7-win32.egg\gensim\utils.py", line 74, in smart_open
    return open(fname, mode)

    IOError: [Errno 2] No such file or directory: 'wiki.en.text.vector'

    52nlp replied:

    "IOError: [Errno 2] No such file or directory: 'wiki.en.text.vector'" means the file was not found. Just put wiki.en.text.model and wiki.en.text.vector in the directory you run ipython from; they do not need to go inside gensim.
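Because the load call above uses a relative file name, it is resolved against the interpreter's current working directory, which is not necessarily the folder the files were copied to. A minimal sketch for checking this from the same ipython session follows; the helper name describe_path is hypothetical.

```python
import os


def describe_path(filename):
    """Return (absolute_path, exists) for a name resolved against the current directory."""
    absolute = os.path.abspath(filename)
    return absolute, os.path.exists(absolute)


# Where relative names are looked up, and whether the vector file is there.
print(os.getcwd())
print(describe_path("wiki.en.text.vector"))
```

If the second value printed is False, either change directory with os.chdir() or pass the full path of the file to the load call.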

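    薛方 replied:

    Hello, the error occurred precisely with the files in the current directory. It was only after seeing
    File "C:\Python27\lib\site-packages\gensim-0.12.1-py2.7-win32.egg\gensim\models\word2vec.py", line 954, in load_word2vec_format
    with utils.smart_open(fname) as fin:

    File "C:\Python27\lib\site-packages\gensim-0.12.1-py2.7-win32.egg\gensim\utils.py", line 74, in smart_open
    return open(fname, mode)
    that I tried putting them in that directory. The error was the same.
    When I installed gensim earlier, I followed the instructions at https://pypi.python.org/pypi/gensim. Running python setup.py test from that directory in cmd showed several errors at the end, so I killed it and ran python setup.py install from the same directory instead. That is why I suspect something is wrong with gensim.
    I'm really sorry for asking such beginner questions and for the trouble.

    52nlp replied:

    The error only says the file was not found; nothing else is wrong. Try import gensim in ipython: if it raises no error, gensim is installed properly.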

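    薛方 replied:

    Thanks again for replying.
    To summarize:
    First, the file-not-found problem was a bit strange.
    1. I tried again this morning. At first the file still could not be found, so I tried f=open("××","r") on every txt file in the directory, and none could be found (other files used to open fine). Finally I tried f=open("××","r") bady=f.read(), ran it twice, and it worked. After that all the other files could be found too. Strange, isn't it? (I'm not making this up.)
    2. The file was also not found under cmd. I moved the files to another directory, started over, and it worked (very strange).
    Second, the Chinese data produced no encoding errors.
    Probably because I segmented with jieba, there was no UnicodeDecodeError, so I did not need iconv.
    Finally, thank you for your patient answers. From now on I will also write my comments carefully and engage seriously.
    A few new questions:
    1. result = model.most_similar(u"某某")
    shows only 10 words at a time. How do I make it show a custom number of words?
    2. After segmentation, do I need to remove duplicates first? Should I remove punctuation and spaces?
    3. After segmentation, should words be separated by line breaks, or is a space between words enough? (A silly question, perhaps; in theory either should work?)
    Thanks.

    52nlp replied:

    For question 1, look at the definition of most_similar:
    most_similar(positive=[], negative=[], topn=10)

    result = model.most_similar(u"某某", topn=<the number you want>)

    2. After segmentation you can consider removing punctuation and stop words; there is no need to remove duplicates.

    3. Spaces are fine.

    When asking a question, describing the problem clearly matters a lot. Don't just say "there is an error", because then nobody knows what the error message was. Even just pasting the error message is fine. Your later questions and descriptions were very clear, so they can be answered whenever there is time, which saves everyone's time.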
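Answers 2 and 3 above can be sketched as a small cleaning step applied to each segmented line before training. This is a minimal sketch under stated assumptions, not the post's actual pipeline: the stop-word set is a hypothetical three-word example, and "punctuation" is taken to mean tokens made up entirely of characters in a Unicode punctuation category.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {u"的", u"了", u"是"}


def is_punctuation(token):
    """True if every character of the token is in a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def clean_segmented_line(line, stop_words=STOP_WORDS):
    """Drop punctuation-only tokens and stop words from one space-separated line.

    Repeated words are kept on purpose, and the result is space-separated again.
    """
    tokens = line.split()
    kept = [t for t in tokens if not is_punctuation(t) and t not in stop_words]
    return u" ".join(kept)
```

Repeated words are left in place because the reply says duplicates do not need to be removed.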

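    薛方 replied:

    Hello,
    I tried it earlier on a 6M Chinese corpus and it ran successfully.
    The Chinese corpus I actually want to use, ciku.txt, is 321M. But running python train_word2vec_model.py ciku.txt ciku.model ciku.vector in cmd raises MemoryError.
    My machine has 4GB of RAM, is 64-bit, and runs at 3.4GHz.
    How should I solve this?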
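The thread does not identify the cause of this MemoryError. One rough check, offered here only as a back-of-the-envelope sketch, is to estimate how much memory the model's weight matrices would need: a commonly quoted rule of thumb for gensim's word2vec is about three matrices of shape (vocabulary size × vector size) stored as 4-byte floats. The function name, the default values, and the factor of three are assumptions for illustration, not values taken from the post.

```python
from collections import Counter


def estimate_word2vec_memory(lines, vector_size=100, min_count=5,
                             bytes_per_float=4, n_matrices=3):
    """Return (vocab_size, estimated_bytes) for a space-separated corpus.

    `lines` is any iterable of text lines, so a file object can be streamed
    without loading the whole corpus into memory. Words seen fewer than
    `min_count` times are excluded, as they would be during training.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    vocab_size = sum(1 for c in counts.values() if c >= min_count)
    estimated_bytes = vocab_size * vector_size * bytes_per_float * n_matrices
    return vocab_size, estimated_bytes
```

Passing an open file handle for ciku.txt, together with the vector size and min_count used in training, gives an estimate to compare against available memory. If the estimate is too large, a higher min_count or a smaller vector size reduces it. Separately, the earlier tracebacks show a win32 build of gensim under C:\Python27, which may mean a 32-bit Python; a 32-bit process cannot use all of the 4GB regardless of the operating system.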
