Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: 52nlp (www.52nlp.cn, January 18, 2009)

3. Evaluating a Language Model
a) Evaluating a language model
 i. We have n test strings:
     S_1, S_2, \ldots, S_n
 ii. Consider the probability of these strings under our model:
     \prod_{i=1}^{n} P(S_i)
or the log probability:
  \log \prod_{i=1}^{n} P(S_i) = \sum_{i=1}^{n} \log P(S_i)
 iii. Perplexity:
     Perplexity = 2^{-x}
  where x = \frac{1}{W} \sum_{i=1}^{n} \log_2 P(S_i)
  and W is the total number of words in the test data.
 iv. Perplexity is a measure of the effective "branching factor"
  1. Suppose we have a vocabulary v of size N, and the model predicts
  P(w) = 1/N for all words in v.
 v. What is the perplexity then?
      Perplexity = 2^{-x}
   where x = \log_2(1/N)
   so Perplexity = N (see the numerical sketch after this list)
 vi. Estimate of human performance (Shannon, 1951)
  1. Shannon game: humans guess the next letter in a text.
  2. PP = 142 (1.3 bits/letter), uncased, open vocabulary
 vii. Estimate of a trigram language model (Brown et al., 1992)
  PP = 790 (1.75 bits/letter), cased, open vocabulary
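
The following is a minimal Python sketch of the calculation above; the function name, vocabulary size, and sentence lengths are made up for illustration. It computes perplexity from per-sentence log2 probabilities and confirms that a uniform model over N words has perplexity N.

import math

def perplexity(log2_probs, num_words):
    """Perplexity = 2^{-x}, where x = (1/W) * sum of per-sentence log2 probabilities."""
    x = sum(log2_probs) / num_words
    return 2 ** (-x)

# Uniform model: P(w) = 1/N for every word in a vocabulary of size N.
N = 10_000
sentence_lengths = [12, 7, 23]      # hypothetical test sentences
W = sum(sentence_lengths)           # total number of words in the test data
log2_probs = [length * math.log2(1.0 / N) for length in sentence_lengths]

print(perplexity(log2_probs, W))    # ~10000.0: perplexity equals the vocabulary size N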

To be continued: Part 4

Appendix: The course page and lecture slides (PDF) can be downloaded from MIT:
   http://people.csail.mit.edu/regina/6881/

Note: This translation is published in accordance with the MIT OpenCourseWare Creative Commons license. When reposting, please credit the source "52nlp": www.52nlp.cn

Permalink:
https://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-third-part/

By 52nlp

4 comments on "MIT NLP Lecture 3: Probabilistic Language Modeling (Part 3)"
  1. I've recently been looking at the MALLET project and don't fully understand it; there are
    many experts here, so I hope someone who has used it can give me a brief introduction.
    Also, I need P(w|t), the probability of a word under a topic, but MALLET only provides a
    method for computing the probability of a topic within a document. I found the following
    approach online:
    > On Tue, May 3, 2011 at 10:48 AM, Steven Bethard wrote:
    >> TopicInferencer.getSampledDistribution gives you a double[] representing the topic
    >> distribution for the entire instance (document). Is there a way to get the per-word
    >> topic distributions?

    On May 9, 2011, at 6:26 PM, David Mimno wrote:
    > It doesn't look like there's an easy way without digging into the
    > sampling code. You'd need to add an additional data structure to store
    > token-topic distributions, and update it from the "topics" array after
    > every sampling round. Once you're done, you'll need a way to pass it
    > back -- keeping the token-topic distributions as a state variable and
    > adding a callback function to pick up the distribution after every
    > document might be the best option.

    Thanks for the response. I ended up using the Stanford Topic Modeling Toolbox instead,
    which supports per-word topic distributions out of the box, but the approach above
    sounds plausible if I ever end up going back to the Mallet code.
    The URL is:
    http://article.gmane.org/gmane.comp.ai.mallet.devel/1482/match=getting+topic+distribution
    I'd be very grateful if anyone who has made a similar modification could share their
    experience.


    52nlp replied:

    I'm not sure about this one; hopefully another reader can answer.
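
    The general idea in Mimno's reply can be sketched independently of MALLET: accumulate
    token-topic assignments after every sampling round, then normalize each topic's counts to
    estimate P(w|t). The sketch below is only a hypothetical, library-agnostic illustration;
    the `sampling_rounds` input and all names are made up, and it does not use MALLET's
    actual classes.

    from collections import defaultdict

    def estimate_word_given_topic(sampling_rounds, beta=0.01, vocab_size=None):
        """Estimate P(w|t) from token-topic assignments collected over sampling rounds.

        `sampling_rounds` is a hypothetical stand-in for the Gibbs sampler's output:
        an iterable where each element is a list of (word, topic) pairs, one pair per
        token, recorded after one sampling sweep. This is NOT a MALLET API.
        """
        counts = defaultdict(lambda: defaultdict(float))   # counts[topic][word]
        for assignments in sampling_rounds:
            for word, topic in assignments:
                counts[topic][word] += 1.0

        # Normalize each topic's counts (with symmetric smoothing beta) to get P(w|t).
        p_w_given_t = {}
        for topic, word_counts in counts.items():
            V = vocab_size if vocab_size is not None else len(word_counts)
            total = sum(word_counts.values()) + beta * V
            p_w_given_t[topic] = {w: (c + beta) / total for w, c in word_counts.items()}
        return p_w_given_t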

