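Beautiful Data: Applications of Statistical Language Models, Part 3: Word Segmentation (6)

By 52nlp

March 31, 2010 #Beautiful Data, #Google, #Peter Norvig, #python, #yield, #__call__, #Chinese word segmentation, #word segmentation, #statistical language model, #language model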

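Having covered Python's __call__ method, we can go back to fleshing out segment.py. First, delete the placeholder Pw function:
    def Pw( word ):
        pass

Next, add Pw to the program as an instance of the Pdist class:
    Pw = Pdist( datafile( 'count_1w.txt' ), N, avoid_long_words )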

The Pdist constructor is given the result of datafile, the function avoid_long_words and the variable N, so all three must be defined before that line:
[cc lang="python"]
def datafile( name, sep = '\t' ):
    """Read key, value pairs from file."""
    for line in file( name ):  # file() is a Python 2 built-in; Python 3 uses open()
        yield line.split( sep )

def avoid_long_words( word, N ):
    """Estimate the probability of an unknown word."""
    return 10./( N * 10**len( word ) )

N = 1024908267229 ## Number of tokens in corpus
[/cc]
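The listing above is Python 2 code; the file() built-in it relies on no longer exists in Python 3. As a minimal sketch of a Python 3 port (the explicit utf-8 encoding, the with block and the throwaway sample file are assumptions of this sketch, not part of the original listing), the same generator could be written as:

```python
import os
import tempfile

def datafile(name, sep='\t'):
    """Read key, value pairs from file (Python 3 port of the listing above)."""
    # open() replaces the Python 2 built-in file(); utf-8 is an assumed encoding.
    with open(name, encoding='utf-8') as f:
        for line in f:
            yield line.split(sep)

# Demonstration on a throwaway tab-separated file (hypothetical sample data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('A\t1\nB\t2\n')
    sample_path = tmp.name

pairs = datafile(sample_path)
first = next(pairs)   # Python 3 spelling of pairs.next()
rest = list(pairs)    # drains the remaining lines and closes the file
os.remove(sample_path)
```

Each yielded item is a two-element list whose second field still carries the trailing newline, because split only cuts at the separator.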
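datafile is written with yield, which makes it a generator function. There are plenty of explanations of yield online, and reading them left me a little dizzy, but the following sentence struck me as the most vivid, or at least the most relevant to our example:

yield is just like stuffing items into a list, only written in an odd way.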
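To make the quoted analogy concrete, here is a small sketch in Python 3 syntax. The function names pairs_as_list and pairs_as_generator are hypothetical and not part of segment.py. Both produce the same items; the difference is that the generator hands them out one at a time, on demand, instead of building the whole list first.

```python
def pairs_as_list(lines, sep='\t'):
    """Collect every split line into a list, then return the list."""
    result = []
    for line in lines:
        result.append(line.split(sep))   # "stuff it into a list"
    return result

def pairs_as_generator(lines, sep='\t'):
    """Yield each split line only when the caller asks for it."""
    for line in lines:
        yield line.split(sep)            # same item, produced lazily

lines = ['A\t1', 'B\t2', 'C\t3']
eager = pairs_as_list(lines)
lazy = pairs_as_generator(lines)

same_items = (list(lazy) == eager)   # a generator can be drained into a list
drained_again = list(lazy)           # a second pass finds nothing left
```

The last line shows the main practical difference from a list: a generator can be consumed only once.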

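Remember our test example?
    >>> data_pair = [("A", 1), ("B", 2), ("C", 3), ("D", 4)]
data_pair is a list, and datafile, put plainly, uses yield to produce a similar sequence of pairs, read from a file. Rather than talk about it further, let's look at an example. In the directory that contains segment.py, create a file named test_data with the following tab-separated content:

A	1
B	2
C	3
D	4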

In the Python interpreter, reload the module:
    >>> reload( segment )
Read the test_data file with datafile:
    >>> test = segment.datafile( "test_data" )
Inspect test; it is a generator:
    >>> test
    <generator object datafile at 0xb76b7eb4>
Step through test with next():
    >>> test.next()
    ['A', '1\n']
    >>> test.next()
    ['B', '2\n']
    >>> test.next()
    ['C', '3\n']
    >>> test.next()
    ['D', '4\n']
    >>> test.next()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration
Let's try again:
    >>> test = segment.datafile( "test_data" )
this time with a for loop:
    >>> for data in test:
    ...     print data,
    ...
    ['A', '1\n'] ['B', '2\n'] ['C', '3\n'] ['D', '4\n']
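By now the purpose of datafile should be fairly clear. Moving on:
    >>> Pw = segment.Pdist( segment.datafile( 'test_data') )
    >>> Pw.N
    10.0
    >>> Pw("A")
    0.10000000000000001
    >>> Pw("D")
    0.40000000000000002
    >>> Pw("E")
    0.10000000000000001
The results are exactly the same as with the data_pair list we defined earlier; the only difference is that this time the data lives in a file. E does not occur in the file, so Pw falls back to the unknown-word estimate, which here is 1/N = 0.1.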
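For readers following along in Python 3, the session above can be approximated with the sketch below. The Pdist class here is a reconstruction along the lines of the class built earlier in this series (a dict of counts whose __call__ returns count/N, with a 1/N fallback for unseen keys); treat its details as assumptions rather than the exact original code. The helper split_lines is a hypothetical stand-in for datafile that reads from an in-memory list, which keeps the example self-contained.

```python
class Pdist(dict):
    """A probability distribution estimated from (key, count) pairs."""
    def __init__(self, data=(), N=None, missingfn=None):
        for key, count in data:
            # int() tolerates the trailing '\n' left on each count field
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.values()))
        # Assumed default: every unseen key gets probability 1/N.
        self.missingfn = missingfn or (lambda key, N: 1. / N)

    def __call__(self, key):
        if key in self:
            return self[key] / self.N
        return self.missingfn(key, self.N)

def split_lines(lines, sep='\t'):
    """Stand-in for datafile: yield [key, value] from lines already in memory."""
    for line in lines:
        yield line.split(sep)

test_lines = ['A\t1\n', 'B\t2\n', 'C\t3\n', 'D\t4\n']
Pw = Pdist(split_lines(test_lines))
```

With these definitions Pw.N is 10.0, and Pw("A"), Pw("D") and Pw("E") evaluate to 0.1, 0.4 and 0.1. Python 3 prints the shorter 0.1 where older Python 2 releases printed 0.10000000000000001; the underlying floating-point value is the same.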
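As for avoid_long_words, it is the function we pass in to replace the default handling of unknown (out-of-vocabulary) words. To keep overly long strings from receiving too high a probability, we start from a probability of 10/N and divide by 10 for every letter in the candidate word:
    10. / ( N * 10**len( word ) )

You can test it in the Python interpreter (Pw.N is still 10.0 from the test_data example):
    >>> segment.avoid_long_words("e", Pw.N)
    0.10000000000000001
    >>> segment.avoid_long_words("we", Pw.N)
    0.01
    >>> segment.avoid_long_words("the", Pw.N)
    0.001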
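To see why the length penalty matters at realistic scale, the sketch below (Python 3 syntax) compares a flat 1/N estimate with avoid_long_words, using the corpus size N given earlier. The helper flat_estimate is a hypothetical name for the 1/N baseline, introduced only for this comparison.

```python
N = 1024908267229  # number of tokens in the corpus, as defined earlier

def avoid_long_words(word, N):
    """Estimate the probability of an unknown word."""
    return 10. / (N * 10 ** len(word))

def flat_estimate(word, N):
    """Baseline: every unknown word gets 1/N, regardless of its length."""
    return 1. / N

for word in ['e', 'we', 'the', 'wheninthecourse']:
    flat = flat_estimate(word, N)
    penalized = avoid_long_words(word, N)
    # Under the flat baseline a long unsegmented string looks as likely as a
    # one-letter word; the penalty shrinks it by a factor of 10 per letter.
    print('%-16s flat=%.3e  penalized=%.3e' % (word, flat, penalized))
```

For a one-letter word the two estimates coincide, since 10/(10N) = 1/N; each additional letter costs another factor of 10.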

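To be continued: Word Segmentation (7)

Note: this is an original article. If you repost it, please credit the source, "我爱自然语言处理" (I Love Natural Language Processing): www.52nlp.cn

Permalink: https://www.52nlp.cn/beautiful-data-统计语言模型的应用三分词6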


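6 comments on "Beautiful Data: Applications of Statistical Language Models, Part 3: Word Segmentation (6)"

我是一头驴子 said:

April 5, 2010, 15:38

if not text: return []
candidates = ([first]+segment(rem) for first,rem in splits(text))
return max(candidates, key=Pwords)
Could the author explain max(candidates, key=Pwords)?
candidates should be a sequence, and in the Beautiful Data source code Pwords takes an argument, so how can it be used like this?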

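navygong replied:
April 5, 2010 at 21:40
candidates is actually a generator. The two lines you mention compute the probability of each candidate segmentation and pick the one with the highest probability. For example, the candidate segmentations of "wheninthecourse" include
['w', 'henin', 'the', 'course']
['wh', 'en', 'in', 'the', 'course']
['whe', 'n', 'in', 'the', 'course']
...
['wheninthecour', 'se']
['wheninthecours', 'e']
['wheninthecourse'].
Take ['wh', 'en', 'in', 'the', 'course'] as an example: applying Pwords to this list gives the product of the probabilities of the individual words. max then picks the candidate segmentation with the largest product.
PS: If you are familiar with Python, this should be easy to follow.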
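As an illustration of the reply above, here is a small Python 3 sketch of max with a key function applied to a generator. The probabilities in toy_probs are made-up numbers, and Pw and Pwords are simplified stand-ins for the functions in segment.py. Note that key=Pwords passes the function itself without calling it; max then calls it once on each candidate, which is why no argument is written at that point.

```python
# Made-up unigram probabilities, for illustration only.
toy_probs = {'i': 0.01, 'in': 0.02, 'the': 0.05, 'course': 0.001}

def Pw(word):
    """Toy word probability; unseen words get a tiny constant."""
    return toy_probs.get(word, 1e-12)

def Pwords(words):
    """Product of the probabilities of the individual words."""
    result = 1.0
    for w in words:
        result *= Pw(w)
    return result

# Four hand-written candidate segmentations of "inthecourse".
all_candidates = [
    ['i', 'nthecourse'],
    ['in', 'the', 'course'],
    ['inthe', 'course'],
    ['inthecourse'],
]
# A generator expression, like candidates in the quoted code.
candidates = (c for c in all_candidates)

best = max(candidates, key=Pwords)   # max calls Pwords(c) for each candidate c
```

With these toy numbers, best is ['in', 'the', 'course'], since every other candidate contains a word that is missing from toy_probs.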

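52nlp replied:
April 5, 2010 at 22:39
This will be covered right after this section!
Thanks to navygong for the helpful answer; there is only so much one person can do.
PS: I just got back from a Qingming holiday trip, sorry for the delay.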

黎啊黎 said:

August 26, 2021, 15:49

def datafile( name, sep = '\t' ):
    """Read key, value pairs from file."""
    for line in file( name ):
        yield line.split( sep )
When I run it I get
in datafile for line in file(name):
E NameError: name 'file' is not defined
collected 0 items / 1 errors
So the problem is file in file(name).

How can I fix this? Thanks.

52nlp replied:
August 26, 2021 at 16:51
You are probably using Python 3; try changing file to open. This article was written more than 10 years ago and uses Python 2.
