Tencent Word Embeddings in Practice: Indexing and Fast Queries with Annoy

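After last week's post "Playing with Tencent Word Embeddings: Word Similarity Computation and Online Queries" went out, a reader mentioned annoy. I had not actually used annoy for that post, but it looked interesting, so I decided to try it on the Tencent AI Lab word embeddings.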

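The most direct way to learn a tool is to start from its official documentation: https://github.com/spotify/annoy . Annoy is an open-source C++/Python tool from Spotify for approximate nearest neighbor search. It is optimized for memory usage, and its indexes can be saved to and loaded from disk. In the project's own words: "Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk".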

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Following the official documentation, I ran a simple test on my own machine (Ubuntu 16.04, 48 GB RAM, Python 2.7, gensim 3.6.0, annoy 1.15.2). What follows is a first look at Annoy.

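Installing annoy is simple: inside a virtualenv, just run pip install annoy. After that you can try the simplest case, roughly following the official documentation (the session is Python 2.7, hence xrange):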

In [1]: import random
 
In [2]: from annoy import AnnoyIndex
 
# f is the vector dimension
In [3]: f = 20
 
In [4]: t = AnnoyIndex(f)
 
In [5]: for i in xrange(100):
   ...:     v = [random.gauss(0, 1) for z in xrange(f)]
   ...:     t.add_item(i, v)
   ...:     
 
In [6]: t.build(10)
Out[6]: True
 
In [7]: t.save('test.ann.index')
Out[7]: True
 
In [8]: print(t.get_nns_by_item(0, 10))
[0, 45, 16, 17, 61, 24, 48, 20, 29, 84]
 
# Test loading the index back from disk
In [10]: u = AnnoyIndex(f)
 
In [11]: u.load('test.ann.index')
Out[11]: True
 
In [12]: print(u.get_nns_by_item(0, 10))
[0, 45, 16, 17, 61, 24, 48, 20, 29, 84]
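
For intuition about what get_nns_by_item approximates, here is a minimal brute-force sketch in plain Python 3 that needs no Annoy at all. It ranks every item by cosine similarity to the query item, which gives the same order as Annoy's angular distance, assuming the default angular metric is in use. The helper names cosine and exact_nns_by_item are invented for this illustration and are not part of Annoy's API.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def exact_nns_by_item(vectors, i, n):
    """Ids of the n items most similar to item i, found by scanning all items."""
    ranked = sorted(range(len(vectors)),
                    key=lambda j: cosine(vectors[i], vectors[j]),
                    reverse=True)
    return ranked[:n]


random.seed(0)
f = 20  # vector dimension, as in the session above
vectors = [[random.gauss(0, 1) for _ in range(f)] for _ in range(100)]

# The query item itself comes first, just as in the Annoy output above.
neighbors = exact_nns_by_item(vectors, 0, 10)
print(neighbors)
```

Every query here touches all items, which is fine for 100 vectors but is exactly the cost that Annoy's tree-based index is meant to avoid on millions of vectors.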

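That looks fairly convenient, but is Annoy actually useful? Very much so, especially for online services. There are now many kinds of Object2Vector, whether the object is a Word, Document, User, Item, or anything else. Once such objects are mapped into a vector space, being able to find their nearest neighbors quickly and in real time becomes very valuable. Annoy was born during a Spotify Hack Week and was later used in Spotify's music recommendation system. This is its background: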

There are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries, (see below), but there is actually another feature that really sets Annoy apart: it has the ability to use static files as indexes. In particular, this means you can share index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing of Annoy is that it tries to minimize memory footprint so the indexes are quite small.

Why is this useful? If you want to find nearest neighbors and you have many CPU's, you only need to build the index once. You can also pass around and distribute static files to use in production environment, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and will be able to do lookups immediately.

We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.

Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.

Annoy has many other strengths (Summary of features):

Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
Small memory usage
Lets you share memory between multiple processes
Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
Native Python support, tested with 2.7, 3.6, and 3.7.
Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)
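
The second feature above, that cosine distance equals the Euclidean distance of normalized vectors, follows from expanding the squared distance of two unit vectors a and b: |a - b|^2 = |a|^2 + |b|^2 - 2(a·b) = 2 - 2*cos(u, v). A small numeric check in plain Python 3 is sketched below; the two sample vectors and the helper names are arbitrary choices for illustration.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [1.0, 2.0, 3.0]
v = [-2.0, 0.5, 4.0]

lhs = euclidean(normalize(u), normalize(v))  # distance between unit vectors
rhs = math.sqrt(2 - 2 * cosine(u, v))        # formula from the feature list
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding, which is why ranking by this distance and ranking by cosine similarity produce the same neighbor order.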

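Now back to the Tencent word embeddings and the question of how to index and query them with Annoy. Before trying it, I googled for related material. The article "Usage Summary of annoy, a Hyperplane-Based Multidimensional Approximate Vector Search Tool" points out a pitfall that deserves special attention: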

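But I still wanted to understand what was going on, so I asked the author on the project site. The author replied with a single sentence: you need to map items to integers (and they should be nonnegative integers). Damn!!! The official documentation actually states it perfectly clearly: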

a.add_item(i, v) adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.

In other words, my txt file needs to look like

1 vec
2 vec
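
Because add_item allocates memory for max(i)+1 items, the ids should be dense integers starting at 0, not just any nonnegative numbers. A minimal Python 3 sketch of assigning such ids while reading a word2vec-format text file follows. It assumes the usual layout of that format (a "count dim" header line, then one "word v1 ... vdim" line per entry, with no spaces inside words); the function name read_word2vec_text and the three-word sample are made up for illustration.

```python
import io


def read_word2vec_text(fp):
    """Yield (item_id, word, vector) from a word2vec-format text stream.

    The first line is the "<vocab_size> <dim>" header. Ids are assigned
    0, 1, 2, ... in file order, so they are dense and nonnegative.
    """
    header = fp.readline().split()
    dim = int(header[1])
    for item_id, line in enumerate(fp):
        parts = line.rstrip("\n").split(" ")
        word = parts[0]
        vector = [float(x) for x in parts[1:1 + dim]]
        yield item_id, word, vector


# Tiny made-up sample standing in for the real embedding file.
sample = io.StringIO(
    "3 4\n"
    "apple 0.1 0.2 0.3 0.4\n"
    "banana 0.5 0.6 0.7 0.8\n"
    "cherry 0.9 1.0 1.1 1.2\n"
)

word_index = {}  # word -> integer id, the value to pass to add_item
vectors = []     # vectors[item_id] is the vector for that id
for item_id, word, vector in read_word2vec_text(sample):
    word_index[word] = item_id
    vectors.append(vector)

print(word_index)
```

Keeping the word-to-id map next to the index is what later makes it possible to translate Annoy's integer results back into words.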

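So I planned to avoid this pitfall from the start, and gensim's interfaces happen to support this well. The official gensim documentation also has a page on Annoy that wraps the Annoy interface, which I had not noticed when using gensim before: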
similarities.index – Fast Approximate Nearest Neighbor Similarity with Annoy package

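This time, though, I used annoy's own interface directly, because gensim's word2vec interface already makes the data easy to work with. Below is a brief record of the steps, with short comments on the key ones, for reference only: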

In [15]: from gensim.models import KeyedVectors
 
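# Loading takes a while; about 12 GB of memory was in use once loading finished, and usage kept growing afterwards. If you want to try this, use a machine with plenty of memory.
In [16]: tc_wv_model = KeyedVectors.load_word2vec_format('./Tencent_AILab_ChineseEmbedding.txt', binary=False)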
 
# Build a word-to-ID mapping and save a copy offline as JSON (handy later when loading the annoy index on its own)
In [17]: import json
 
In [18]: from collections import OrderedDict
 
In [19]: word_index = OrderedDict()
 
In [21]: for counter, key in enumerate(tc_wv_model.vocab.keys()):
    ...:     word_index[key] = counter
    ...:     
 
In [22]: with open('tc_word_index.json', 'w') as fp:
    ...:     json.dump(word_index, fp)
    ...: 
 
# Start building the Annoy index from the Tencent word embeddings, which contain roughly 8.82 million entries
In [23]: from annoy import AnnoyIndex
 
# The Tencent word embeddings have 200 dimensions
In [24]: tc_index = AnnoyIndex(200)
 
In [25]: i = 0
 
In [26]: for key in tc_wv_model.vocab.keys():
    ...:     v = tc_wv_model[key]
    ...:     tc_index.add_item(i, v)
    ...:     i += 1
    ...: 
 
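# Building also takes a fairly long time. The n_trees parameter matters a lot; the official documentation says:
# n_trees is provided during build time and affects the build time and the index size.
# A larger value will give more accurate results, but larger indexes.
# Having no experience on this first run, I used 10 as in the documentation. Up to this point the whole process was using about 30 GB of memory.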
In [29]: tc_index.build(10)
 
Out[29]: True
 
# The index can be saved to disk; when it is loaded again on its own, memory usage including the vocabulary is about 2 GB
In [30]: tc_index.save('tc_index_build10.index')
Out[30]: True
 
# Prepare a reverse id ==> word mapping
In [32]: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
 
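# Now test Annoy: the results for 自然语言处理 (natural language processing) are basically the same as those from the AINLP WeChat official account backend
# If you are interested, follow the AINLP official account and send the query: 相似词 自然语言处理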
In [33]: for item in tc_index.get_nns_by_item(word_index[u'自然语言处理'], 11):
    ...:     print(reverse_word_index[item])
    ...:     
自然语言处理
自然语言理解
计算机视觉
深度学习
机器学习
图像识别
语义理解
自然语言识别
知识图谱
自然语言
自然语音处理
 
# The results for English words seem somewhat different, though
In [34]: for item in tc_index.get_nns_by_item(word_index[u'nlp'], 11):
    ...:     print(reverse_word_index[item])
    ...: 
 
nlp
神经语言
机器学习理论
时间线疗法
神经科学
统计学习
统计机器学习
nlp应用
知识表示
强化学习
机器学习研究
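
The JSON mapping saved earlier is what makes an offline query step possible without reloading the full gensim model. A minimal Python 3 sketch of that step is given below. The helpers load_vocab and similar_words are hypothetical names, and index can be any object exposing get_nns_by_item(i, n) the way AnnoyIndex does. So that the sketch runs without the real files, it uses a toy vocabulary and a stand-in index class; neither comes from the session above.

```python
import json
import os
import tempfile


def load_vocab(path):
    """Load the word -> id map from JSON and rebuild the id -> word map."""
    with open(path, encoding="utf-8") as fp:
        word_index = json.load(fp)
    reverse_word_index = {i: w for w, i in word_index.items()}
    return word_index, reverse_word_index


def similar_words(index, word_index, reverse_word_index, word, n=10):
    """Return up to n neighbor words, or [] if the word is not in the vocabulary."""
    if word not in word_index:
        return []
    ids = index.get_nns_by_item(word_index[word], n)
    return [reverse_word_index[i] for i in ids]


class FakeIndex:
    """Stand-in for a loaded index: item i's neighbors are i and the next id."""

    def get_nns_by_item(self, i, n):
        return [i, (i + 1) % 3][:n]


# With the real data this would be (assumption: annoy installed, files present):
#   from annoy import AnnoyIndex
#   index = AnnoyIndex(200, 'angular')
#   index.load('tc_index_build10.index')
#   word_index, reverse_word_index = load_vocab('tc_word_index.json')

path = os.path.join(tempfile.mkdtemp(), "toy_word_index.json")
with open(path, "w", encoding="utf-8") as fp:
    json.dump({"nlp": 0, "ml": 1, "cv": 2}, fp)

word_index, reverse_word_index = load_vocab(path)
result = similar_words(FakeIndex(), word_index, reverse_word_index, "nlp", n=2)
print(result)
```

Returning an empty list for unknown words is a design choice of this sketch; the session above would raise a KeyError in that case.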


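That completes a first pass through using Annoy on the Tencent word embeddings. I did not carefully compare query speed; if you are interested, see this blog post:

Top-k Similarity Performance Comparison (kd-tree, kd-ball, faiss, annoy, linear search)

It contains a very detailed comparison. I was short on time for this post and will keep testing later; anyone interested is welcome to join the discussion.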

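Also, after the previous post went out, some readers asked in the backend where the Tencent word embeddings come from, so here again are the official documentation and download page for the Tencent AI Lab word embeddings: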
Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
https://ai.tencent.com/ailab/nlp/embedding.html

References:
Annoy: https://github.com/spotify/annoy
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

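Usage Summary of annoy, a Hyperplane-Based Multidimensional Approximate Vector Search Tool; original title: 超平面多维近似向量查找工具annoy使用总结
https://zhuanlan.zhihu.com/p/50604120
Top-k Similarity Performance Comparison (kd-tree, kd-ball, faiss, annoy, linear search); original title: topk相似度性能比较（kd-tree、kd-ball、faiss、annoy、线性搜索）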
https://www.twblogs.net/a/5bf2c5fabd9eee0405185f34/zh-cn

Similarity Queries using Annoy Tutorial
https://markroxor.github.io/gensim/static/notebooks/annoytutorial.html

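Note: this is an original article. When reposting, please credit the source and keep the link to "我爱自然语言处理" (52nlp): https://www.52nlp.cn

Permalink: Tencent Word Embeddings in Practice: Indexing and Fast Queries with Annoy https://www.52nlp.cn/?p=11587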

Author: 52nlp


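7 comments on "Tencent Word Embeddings in Practice: Indexing and Fast Queries with Annoy"

Yemu said:

April 19, 2019 09:42

Great post. In our earlier tests, fully loading the vectors took roughly 8 to 11 GB of memory.
We will try this and hope to get it down to around 2 GB.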

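kyle said:

June 28, 2019 14:48

Hello, your article is excellent and has helped us a lot. Could the Python file for the Annoy indexing and fast querying in this post be made public for reference? Many thanks!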

52nlp replied:
June 28, 2019 at 15:41
It is already public: the steps are all in the ipython session, and with a little consolidation they become a standalone Python script.

weehaa said:

July 5, 2019 17:46

Does this mean you can only look things up by Chinese word, and cannot take a generated vector and find the meanings closest to it?

52nlp replied:
July 5, 2019 at 23:05
No.

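zafdaf said:

October 25, 2019 10:21

I'd like to ask how these word embeddings can help with word segmentation. Sometimes existing segmentation tools split text incorrectly. Could this be turned into a custom dictionary? Would it eat memory and make segmentation very slow? Any advice appreciated...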

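52nlp replied:
October 25, 2019 at 11:02
You can "try" extracting the "words" from these embeddings as a custom dictionary, but the words in them need filtering. In my earlier tests I found that the boundaries of some entries were not handled well, so using them as-is will certainly cause problems. And if you add them as a custom dictionary on top of an existing segmentation tool without optimizing the dictionary structure, it will probably be very slow.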
