# MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 3)

Natural Language Processing: Probabilistic Language Modeling

a) Evaluating a Language Model

i. We have $n$ test strings:
$S_1, S_2, \ldots, S_n$

ii. Consider the probability of these strings under our model:
$\prod_{i=1}^{n} P(S_i)$

$\log \prod_{i=1}^{n} P(S_i) = \sum_{i=1}^{n} \log P(S_i)$

iii. Perplexity:
$\text{Perplexity} = 2^{-x}$, where $x = \frac{1}{W} \sum_{i=1}^{n} \log_2 P(S_i)$
and $W$ is the total number of words in the test data.

iv. Perplexity is a measure of the effective "branching factor":

1. Suppose we have a vocabulary $V$ of size $N$, and the model predicts
$P(w) = \frac{1}{N}$ for all words $w$ in $V$. Then
$\text{Perplexity} = 2^{-x}$, where $x = \log_2 \frac{1}{N}$,
so $\text{Perplexity} = N$.

v. Estimate of human performance (Shannon, 1951):

1. Shannon game — humans guess the next letter in a text.
2. PP = 142 (1.3 bits/letter), uncased, open vocabulary.

vi. Estimate of a trigram language model (Brown et al., 1992):
PP = 790 (1.75 bits/letter), cased, open vocabulary.
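As a sanity check on the definitions above, the following sketch computes perplexity from per-sentence probabilities and confirms that a uniform model over a vocabulary of size $N$ yields perplexity $N$ (illustrative Python; the sentence lengths are made up):

```python
import math

def perplexity(sentence_probs, total_words):
    # x = (1/W) * sum_i log2 P(S_i); perplexity = 2^(-x)
    x = sum(math.log2(p) for p in sentence_probs) / total_words
    return 2 ** -x

# Uniform model over a vocabulary of size N: P(w) = 1/N for every word,
# so a sentence of k words has probability (1/N)^k.
N = 1000
sentence_lengths = [3, 5, 7]
probs = [(1 / N) ** k for k in sentence_lengths]
W = sum(sentence_lengths)

print(perplexity(probs, W))  # ≈ N = 1000
```

Note that the per-sentence log-probabilities from the model are all that is needed; the exponential base just has to match the base of the logarithm.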

http://people.csail.mit.edu/regina/6881/

## 4 comments on "MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 3)"

1. zhangjun0806

I've recently been looking at MIT's MALLET project... but I don't understand it very well. There are many experts here, so I hope someone who has used it can give me a brief introduction.
Also, I need to obtain P(w|t), the probability of a word appearing under a topic, but MALLET only provides a method for computing the probability of a topic appearing in a document. I found the following approach online:
> On Tue, May 3, 2011 at 10:48 AM, Steven Bethard wrote:
>> TopicInferencer.getSampledDistribution gives you a double[] representing
>> the topic distribution for the entire instance (document). Is there a way
>> to get the per-word topic distributions?

On May 9, 2011, at 6:26 PM, David Mimno wrote:
> It doesn't look like there's an easy way without digging into the
> sampling code. You'd need to add an additional data structure to store
> token-topic distributions, and update it from the "topics" array after
> every sampling round. Once you're done, you'll need a way to pass it
> back -- keeping the token-topic distributions as a state variable and
> adding a callback function to pick up the distribution after every
> document might be the best option.

Thanks for the response. I ended up using the Stanford Topic Modeling Toolbox, which provides per-word topic distributions out of the box, but the approach above sounds plausible if I ever end up going back to the Mallet code.
The URL is:
http://article.gmane.org/gmane.comp.ai.mallet.devel/1482/match=getting+topic
+distribution
I would be very grateful if anyone who has made similar modifications could share their experience.
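The approach David Mimno outlines — keep a token-topic count structure and update it from the sampled topic assignments after every sampling round, then normalize at the end — can be sketched generically as follows (illustrative Python, not MALLET's actual API; the sampler here is a random stand-in for the real Gibbs sweep):

```python
import random
from collections import defaultdict

# Toy setup: a "document" of tokens and a stand-in sampler that reassigns
# a topic to each token every round. In MALLET you would instead read these
# assignments from the sampler's "topics" array after each sampling round.
tokens = ["model", "topic", "word", "model", "topic"]
num_topics = 3
random.seed(0)

def sample_topics(tokens, num_topics):
    # Stand-in for one Gibbs sweep: one topic id per token position.
    return [random.randrange(num_topics) for _ in tokens]

# Accumulate token-topic counts across sampling rounds.
token_topic_counts = defaultdict(lambda: [0] * num_topics)
num_rounds = 200
for _ in range(num_rounds):
    assignments = sample_topics(tokens, num_topics)
    for token, topic in zip(tokens, assignments):
        token_topic_counts[token][topic] += 1

# Normalize each token's counts into a per-word topic distribution P(t | w).
per_word_dist = {
    token: [c / sum(counts) for c in counts]
    for token, counts in token_topic_counts.items()
}
for token, dist in per_word_dist.items():
    print(token, [round(p, 2) for p in dist])
```

The same accumulated counts can also be renormalized column-wise (per topic, across all tokens) to estimate P(w|t), which is the quantity asked about above.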

[Reply]

52nlp replied:

I'm not sure about this one; I hope other readers can chime in with an answer.

[Reply]