Beautiful Data - Applications of Statistical Language Models III: Word Segmentation 3

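In this section we move on to the coding stage of word segmentation in "Beautiful Data". The complete program and data are in ngrams.zip, which you can download from "Natural Language Corpus Data: Beautiful Data"; here I mainly offer some commentary. The program is written in Python and passes its tests on both Linux and Windows, as long as a suitable Python version is installed. I am using Python 2.6; note that the code runs into some problems on Python 3.0.

First, create a new file named segment.py. Following the approach from the previous section, we define a segment function: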
　　
    def segment( text ):
        """Return a list of words that is the best segmentation of text."""
        if not text : return []
        candidates = ( [first] + segment( rem ) for first, rem in splits( text ) )
        return max( candidates, key = Pwords )

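The goal of segment is: "among all candidate segmentations, choose the one with the highest product P(first) × P(remaining) as the best segmentation." It calls itself recursively and finally returns the best segmentation as a list of words. We will set the recursion aside for now and look at the two other functions it calls, splits and Pwords. Add the following code to segment.py: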

    def splits( text, L = 20 ):
        """Return a list of all possible ( first, rem ) pairs, len( first ) <=L"""
        return [ ( text[:i+1], text[i+1:] ) for i in range( min(len(text), L ) ) ]

    def Pwords( words ):
        pass

splits returns every possible way of cutting the string into a first word and a remaining string, with the first word at most L characters long. We will come back to Pwords later and leave it as "pass" for now. Let us test splits in the Python interpreter:

    nlp@52nlp:~/python/beautiful$ python
    Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import segment
    >>> split = segment.splits( "12345" )
    >>> print split
    [('1', '2345'), ('12', '345'), ('123', '45'), ('1234', '5'), ('12345', '')]
    >>> split = segment.splits( "wheninthecourse" )
    >>> print split
    [('w', 'heninthecourse'), ('wh', 'eninthecourse'), ('whe', 'ninthecourse'), ('when', 'inthecourse'), ('wheni', 'nthecourse'), ('whenin', 'thecourse'), ('whenint', 'hecourse'), ('wheninth', 'ecourse'), ('wheninthe', 'course'), ('wheninthec', 'ourse'), ('whenintheco', 'urse'), ('wheninthecou', 'rse'), ('wheninthecour', 'se'), ('wheninthecours', 'e'), ('wheninthecourse', '')]

Of course, you can also try:

    >>> split = segment.splits( "wheninthecourseofhumaneventsitbecomesnecessary" )

The printed result is fairly long.
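The interpreter sessions above use the Python 2 print statement. As a minimal sketch for readers on Python 3, the same splits function can be exercised as shown here; it also makes the effect of the L cap visible on the longer string. The variable names text and pairs are mine, not from the book.

```python
def splits(text, L=20):
    """Return a list of all possible (first, rem) pairs, len(first) <= L."""
    return [(text[:i+1], text[i+1:]) for i in range(min(len(text), L))]

text = "wheninthecourseofhumaneventsitbecomesnecessary"
pairs = splits(text)

# The string has 46 characters, but first is capped at L = 20 characters,
# so only 20 pairs come back instead of 46.
print(len(text), len(pairs))
print(pairs[0])    # the pair with the shortest first word
print(pairs[-1])   # the pair with the longest first word allowed by L
```

The cap keeps the number of candidates per position bounded, on the assumption that no real word is longer than L characters.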

Now for the Pwords function. In segment.py, change it to:

    def Pwords( words ):
        """The Naive Bayes probability of a sequence of words."""
        return product( Pw(w) for w in words )

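The explanation given in "Beautiful Data" is "The Naive Bayes probability of a sequence of words". The core of Naive Bayes (NB) is the assumption that all components of a vector are independent of one another. Here the vector is the word sequence, so the assumption is that all the words are independent, which is also the premise behind using a unigram language model. Recall the 'wheninrome' segmentation example: it has many candidate segmentations, such as "when in rome", and with a unigram language model we only need to compute P(when) × P(in) × P(rome).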
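To make the independence assumption concrete, here is a small Python 3 sketch that scores a few candidate segmentations of 'wheninrome' as a plain product of per-word probabilities. The values in toy_probs and the tiny default for unseen words are made up for illustration and are not counts from the book's data; score is a hypothetical helper name.

```python
# Hypothetical unigram probabilities, invented for illustration only.
toy_probs = {
    "when": 0.002,
    "in": 0.02,
    "rome": 0.00001,
    "whe": 0.0000001,
    "nin": 0.0000001,
}
UNSEEN = 1e-12  # assumed tiny probability for words missing from the table

def score(words):
    """Unigram score: multiply the independent per-word probabilities."""
    result = 1.0
    for w in words:
        result *= toy_probs.get(w, UNSEEN)
    return result

candidates = [
    ["when", "in", "rome"],
    ["whe", "nin", "rome"],
    ["wheninrome"],
]
for words in candidates:
    print(words, score(words))

# The candidate with the highest product wins.
best = max(candidates, key=score)
print(best)
```

With these invented numbers, "when in rome" gets the highest score because each of its words is individually plausible; no information about word order or context is used.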
Pwords in turn calls two helper functions, product and Pw. Add the following code to segment.py:

    def product( nums ):
        """Return the product of a sequence of numbers."""
        return reduce( operator.mul, nums, 1 )

    def Pw( word ):
        pass

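Also add the following at the top of segment.py:

    import operator

This is needed because product uses mul, the binary multiplication function from the operator module:

    operator.mul(a, b)
    operator.__mul__(a, b)
        Return a * b, for a and b numbers.

Strictly speaking, operator.mul is called by the reduce function inside product. reduce is a built-in function in Python 2.6 (note that in Python 3.0 it no longer is, and you need "from functools import reduce"):

reduce(func, seq[, init]): func is a binary function. reduce applies func cumulatively to the elements of seq. Each step takes a pair (the result so far and the next element of the sequence) and produces a new result that feeds into the next step, until the sequence has been reduced to a single return value. If the initial value init is given, the first step combines init with the first element of the sequence rather than combining the first two elements.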
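As a sketch of what that description means in practice, the loop below mirrors reduce by hand and compares it with the real function. It is written for Python 3, where reduce lives in functools. reduce_by_hand is a hypothetical name used only for this illustration, and it assumes an initial value is always supplied.

```python
import operator
from functools import reduce  # built in on Python 2, in functools on Python 3

def reduce_by_hand(func, seq, init):
    """Fold seq into one value, starting from init (illustrative re-implementation)."""
    result = init
    for item in seq:
        # Combine the result so far with the next element.
        result = func(result, item)
    return result

nums = [1, 2, 3, 4, 5]
# Steps with init = 1: 1*1=1, 1*2=2, 2*3=6, 6*4=24, 24*5=120
print(reduce_by_hand(operator.mul, nums, 1))
print(reduce(operator.mul, nums, 1))

# With init = 1, an empty sequence yields 1 instead of raising an error.
print(reduce(operator.mul, [], 1))
```

Passing 1 as the initial value is what lets product accept an empty sequence, since 1 is the identity for multiplication.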
Here func is operator.mul, so what reduce hands back to product is the product of the numbers in the sequence. You can verify this in the Python interpreter:

    >>> reload( segment )
    <module 'segment' from 'segment.py'>
    >>> segment.product( [1, 2, 3, 4, 5] )
    120
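Putting the pieces so far together, the following is a rough end-to-end sketch for Python 3. Pw is still a stub in segment.py at this point, so the version here is purely a stand-in of my own: TOY_COUNTS, TOTAL, and the length-based penalty for unknown words are all invented for illustration and are not the book's model. The functools.lru_cache decorator is likewise my addition, used so the recursion does not recompute the same suffix repeatedly.

```python
import operator
from functools import lru_cache, reduce

# Hypothetical word counts, invented for illustration only.
TOY_COUNTS = {"when": 20, "in": 200, "rome": 1, "the": 500, "course": 5}
TOTAL = 1000.0

def Pw(word):
    """Stand-in unigram probability (assumption, not the book's Pw)."""
    if word in TOY_COUNTS:
        return TOY_COUNTS[word] / TOTAL
    # Unknown words get a tiny probability that shrinks with length.
    return 1e-4 / (10 ** len(word))

def product(nums):
    """Return the product of a sequence of numbers."""
    return reduce(operator.mul, nums, 1)

def Pwords(words):
    """The Naive Bayes probability of a sequence of words."""
    return product(Pw(w) for w in words)

def splits(text, L=20):
    """Return a list of all possible (first, rem) pairs, len(first) <= L."""
    return [(text[:i+1], text[i+1:]) for i in range(min(len(text), L))]

@lru_cache(maxsize=None)  # cache results per suffix; callers should not mutate them
def segment(text):
    """Return a list of words that is the best segmentation of text."""
    if not text:
        return []
    candidates = ([first] + segment(rem) for first, rem in splits(text))
    return max(candidates, key=Pwords)

print(segment("wheninrome"))  # ['when', 'in', 'rome'] with these toy counts
```

The result depends entirely on the invented counts; with a real unigram table the same structure applies, but the scores and the handling of unknown words would differ.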

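To be continued: Word Segmentation 4

Note: this is an original article. If you repost it, please credit the source, "我爱自然语言处理" (I Love Natural Language Processing): www.52nlp.cn

Permalink: https://www.52nlp.cn/beautiful-data-统计语言模型的应用三分词3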

Author: 52nlp
