MIT自然语言处理第四讲：标注（第四部分）

自然语言处理：标注
Natural Language Processing: Tagging
作者：Regina Barzilay（MIT,EECS Department, November 15, 2004)
译者：我爱自然语言处理（www.52nlp.cn ，2009年3月26日）

三、马尔科夫模型（Markov Model）
g) 有效标注（Efficient Tagging）
　i. 对于一个单词序列，如何寻找最可能的标记序列（How to find the most likely a sequence of tags for a sequence of words）?
　　1. 盲目搜索的方法是可怕的（The brute force search is dreadful）— 对于N个标记和W个单词计算代价是N^W.for N tags and W words, the cost is NW
　　2. 主意（Idea）: 使用备忘录（Viterbi算法）（use memoization (the Viterbi Algorithm)）
　　——结束于相同标记的序列可以压缩在一起，因为下一个标记仅依赖于此序列的当前标记（Sequences that end in the same tag can be collapsed together since the next tag depends only on the current tag of the sequence）
　　图示如下：
test
h) Viterbi 算法（The Viterbi Algorithm）
　i. 初始情况（Base case）:
　　　pi delim{[}{0, START}{]} = log 1 = 0
　　　pi delim{[}{0, t_{-1}}{]} = log 0 = infty
　　对所有其他的t_{-1}(for all other t_{-1})
　ii. 递归情况（Recursive case）:
　　1. 对于i = 1...S.length及对于所有的t_{-1} in T:
pi delim{[}{i, t_{-1}}{]} = {max}under{t in T union START}{ pi delim{[}{i-1, t}{]} + log P(t_{-1}delim{|}{t}{}) + log P(S_i delim{|}{t_{-1}}{})}
　　2. 回朔指针允许我们找出最大概率序列（Backpointers allow us to recover the max probability sequence）:
BP delim{[}{i, t_{-1}}{]} = {argmax}under{t in T union START}{ pi delim{[}{i-1, t}{]} + log P(t_{-1}delim{|}{t}{}) + log P(S_i delim{|}{t_{-1}}{})}
i) 性能（Performance）
　i. HMM标注器对于训练非常简单（HMM taggers are very simple to train）
　ii. 表现相对很好（Perform relatively well） (over 90% performance on named entities)
　iii. 最大的困难是对p(单词|标记)建模（Main difficulty is modeling of p(word|tag)）

四、结论（Conclusions）
a) 标注是一个相对比较简单的任务，至少在一个监督框架下对于英语来说（Tagging is relatively easy task (at least, in a supervised framework, and for English)）
b) 影响标注器性能的因素包括（Factors that impact tagger performance include）:
　i. 训练集数量（The amount of training data available）
　ii. 标记集（The tag set）
　iii. 训练集和测试集的词汇差异（The difference in vocabulary between the training and the testing）
　iv. 未登录词（Unknown words）
c) TBL和HMM框架可用于其他自然语言处理任务（TBL and HMM framework can be used for other tasks）

第四讲结束！
第五讲：最大熵和对数线性模型

附：课程及课件pdf下载MIT英文网页地址：
　　　http://people.csail.mit.edu/regina/6881/

注：本文遵照麻省理工学院开放式课程创作共享规范翻译发布，转载请注明出处“我爱自然语言处理”：www.52nlp.cn

本文链接地址：
https://www.52nlp.cn/mit-nlp-fourth-lesson-tagging-fourth-part/

MIT自然语言处理第四讲：标注（第四部分）

作者52nlp

作者 52nlp

相关文章

Qwen3来了，全尺寸开源，性能拉满！附最新一手实测！

DeepSeek-V3解析及技术报告英中报告对照版

如何构建和优化推理型大型语言模型？DeepSeek R1的启示

在 “MIT自然语言处理第四讲：标注（第四部分）” 有 1 条评论

发表回复

You missed

Qwen3-VL技术报告英中对照版.pdf

DeepSeek-V3.2-Exp：用稀疏注意力实现更高效的长上下文推理

LongCat-Flash：美团发布的高效MoE大模型，支持智能体任务，推理速度达100 token/秒

GLM-4.5：三体合一的开源智能体大模型，重新定义AI推理边界