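After last week's post "Playing with Tencent Word Vectors: Word Similarity Computation and Online Query" went out, a reader mentioned annoy. I had not actually used annoy, but I was curious about it, so I decided to try it out on the Tencent AI Lab word vectors.
The most direct way to learn something is to start from the official documentation: https://github.com/spotify/annoy . Annoy is an open-source C++/Python tool from Spotify for approximate nearest neighbor search. It is optimized for memory usage, and its indexes can be saved to and loaded from disk: Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk.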
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
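To make the "mmapped into memory" point a little more concrete, here is a minimal sketch using only Python's standard mmap module. The flat file layout, the names DIM, RECORD, write_vectors and read_vector, and the file name vectors.bin are all made up for this illustration. This is not Annoy's actual index format, only the general idea of a read-only file that any number of processes can map and read without each loading a private copy.

```python
import mmap
import os
import struct
import tempfile

DIM = 4  # hypothetical vector dimension, chosen only for this sketch
RECORD = struct.Struct("<%df" % DIM)  # one vector = DIM little-endian float32 values


def write_vectors(path, vectors):
    """Write the vectors back to back into a flat binary file."""
    with open(path, "wb") as fh:
        for vec in vectors:
            fh.write(RECORD.pack(*vec))


def read_vector(mm, index):
    """Decode vector number `index` directly from the mapped file."""
    return list(RECORD.unpack_from(mm, index * RECORD.size))


vectors = [[0.0, 1.0, 2.0, 3.0], [0.5, 1.5, 2.5, 3.5]]
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(path, vectors)

with open(path, "rb") as fh:
    # ACCESS_READ gives a read-only mapping; the OS page cache backs it,
    # so several processes mapping the same file share the same pages.
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    second = read_vector(mm, 1)
    mm.close()
```

Because the mapping is read-only, a loaded index of this kind cannot be modified in place, which matches the "read-only file-based data structures" wording above.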
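Following the official documentation, I ran a simple test on my own machine (Ubuntu 16.04, 48 GB RAM, Python 2.7, gensim 3.6.0, annoy 1.15.2). What follows is a first look at Annoy.
Installing annoy is simple. Inside a virtualenv environment, just run pip install annoy, and then you can try the simplest case by following the official documentation: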
In [1]: import random
In [2]: from annoy import AnnoyIndex
# f is the vector dimension
In [3]: f = 20
In [4]: t = AnnoyIndex(f)
In [5]: for i in xrange(100):
   ...:     v = [random.gauss(0, 1) for z in xrange(f)]
   ...:     t.add_item(i, v)
   ...:
In [6]: t.build(10)
Out[6]: True
In [7]: t.save('test.ann.index')
Out[7]: True
In [8]: print(t.get_nns_by_item(0, 10))
[0, 45, 16, 17, 61, 24, 48, 20, 29, 84]
# Test loading the index back from disk
In [10]: u = AnnoyIndex(f)
In [11]: u.load('test.ann.index')
Out[11]: True
In [12]: print(u.get_nns_by_item(0, 10))
[0, 45, 16, 17, 61, 24, 48, 20, 29, 84]
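For comparison, here is a small brute-force sketch of the exact answer that get_nns_by_item approximates. It assumes the index uses the angular metric, which the Annoy README defines as sqrt(2(1-cos(u,v))). The helper names angular_distance and exact_nns_by_item are my own and are not part of annoy; the random vectors are also not the same ones as in the session above, so the ids will differ.

```python
import math
import random


def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))): Euclidean distance between the normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def exact_nns_by_item(vectors, i, n):
    """Ids of the n vectors closest to vectors[i], item i itself included."""
    order = sorted(range(len(vectors)),
                   key=lambda j: angular_distance(vectors[i], vectors[j]))
    return order[:n]


random.seed(0)
f = 20  # vector dimension, as in the session above
vectors = [[random.gauss(0, 1) for _ in range(f)] for _ in range(100)]
neighbors = exact_nns_by_item(vectors, 0, 10)
```

This scan touches every vector on every query, which is fine for 100 items but becomes slow with millions of word vectors. Trading a little accuracy for a much faster lookup is the reason to build an approximate index.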
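It looks quite convenient, but is Annoy actually useful? Very much so, especially for online services. There are many Object2Vector models these days, where the Object can be a Word, Document, User, Item, or anything else. Once these objects are mapped into a vector space, being able to look up their nearest neighbors quickly and in real time becomes very valuable. Annoy was born at a Spotify Hack Week and was later used in Spotify's music recommendation system, which is the background it came from.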