# Beginner's Report (3): Understanding the CRF Decoding Process in Chinese Word Segmentation

The CRF formula (the j's are feature indices, Z(x) is the normalization term):

P(y|x, λ) = exp( Σ_j λ_j F_j(y, x) ) / Z(x)

For a seven-character input, each character can carry one of four tags, giving a lattice:

B  B  B  B  B  B  B

O  O  O  O  O  O  O

M  M  M  M  M  M  M

E  E  E  E  E  E  E

Viterbi decoding searches this tag lattice for the single optimal path.

1. Compute the scores for the first column. For the character 民, we compute a score for each of B, O, M, E. Since this is the first column, PreScore and TransW are both 0 and can be skipped; only each tag's own unigram feature weights need to be summed.

2. For the second column, first compute each tag's unigram weight: W2B, W2O, W2M, W2E; then add the best previous-column score plus the transition weight (PreScore + TransW) to obtain S2B, S2O, S2M, S2E.

3. Continue column by column to the last character, 值, obtaining S7B, S7O, S7M, S7E. The largest of these four values is the score of the optimal path; starting from that tag, backtrack to recover the optimal path (during the computation we record, for each tag, the best previous tag, which is what makes the backtracking possible).
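The three steps above can be sketched as a small Viterbi decoder. This is a minimal illustration, not any toolkit's actual code: the unigram and transition scores are toy values standing in for the weighted feature sums (the W and TransW quantities above).

```python
# Minimal Viterbi decoding over a B/O/M/E tag lattice (toy scores).
def viterbi(unigram_scores, trans_scores, tags):
    """unigram_scores: one {tag: score} dict per character.
    trans_scores: {(prev_tag, tag): score}; missing pairs score 0.
    Returns (best_path_score, best_path)."""
    n = len(unigram_scores)
    # Step 1: the first column has no PreScore/TransW, only unigram weights.
    score = {t: unigram_scores[0][t] for t in tags}
    back = []  # back[i][t] = best previous tag for tag t at column i+1
    # Step 2: every later column adds PreScore + TransW + unigram weight.
    for i in range(1, n):
        new_score, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: score[p] + trans_scores.get((p, t), 0.0))
            new_score[t] = (score[prev] + trans_scores.get((prev, t), 0.0)
                            + unigram_scores[i][t])
            ptr[t] = prev
        back.append(ptr)
        score = new_score
    # Step 3: take the max over the last column, then backtrack.
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[best], path
```

With toy scores favoring B at column 1, M at column 2, and E at column 3, the decoder returns the path B, M, E, i.e. one three-character word.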

# A professional old friend lambastes 立委's "Myths" series for muddying the NLP waters; 立委 stands his ground

G, a senior colleague, has been a professional old friend for many years, often exchanging views with 立委 both inside and outside the field. Both are veterans, so heated debates and flying sparks are routine.

2011/12/28 G

The third one is more to the point -- strictly speaking, it is not a myth but rather universally applicable "superfluous words."
Frankly, the first two read like clickbait (标题党) to me. Most of the "supporting evidence" is wrong.
Well, I think I know what you were trying to say. But to most people, I believe, you are misleading.
No, I was not misleading; this is deliberate overcorrection (矫枉过正).
At least I think you should explain a bit more, and pick your examples carefully.
Take one example. Tokenizing People's Republic of China is routinely done by regular expressions (rule-based), keyed on capitalization, the apostrophe, and the preposition (symbolic evidence), but NOT by dictionary lookup.
That is not the point. Yes, maybe I should have chosen a non-name example ("interest rate" 利率 is a better example for both Chinese and English), but the point is that closed compounds can (and should) be looked up in lexicons rather than handled by rules.
What you are referring to, I guess, is named entity recognition. Even there, Chinese and English can differ significantly.
No, I was not talking about NE; that is a special topic in itself. I consider it a low-level, solved problem and do not plan to reinvent the wheel. I will just pick an off-the-shelf API for NE and tolerate its imperfections.
I wouldn't be surprised if you skip tokenization, as you may well fold it into overall parsing. But for applications like Baidu search, tokenization is the end of text processing and a must-have.
Chunking words into phrases (syntax) is by nature no different from chunking morphemes (characters) into words (morphology). Parsing without a separate "word segmentation" step is thus possible.

In existing apps like search engines, no big players are using parsing and deep NLP yet (they will; it is only a matter of time), so lexical features from large lexicons may not be necessary. As a result, they may prefer a lightweight tokenization without lexicons. That is a different case from what I am addressing here. The NLP discussed in my post series assumes the development of a parser as its core.
Your attack on tagging is also misleading. You basically say that if a word has two categories, just tag it with both without further processing. That is tagging already.
That is not (POS) tagging in the traditional sense: traditional tagging is deterministic and relies on context. Lexical feature assignment from lexicon lookup is not tagging in the traditional sense. If you want to change the definition, then that is off topic.
What others do is merely one step further, saying tag-a is correct 90% of the time while tag-b has a 10% chance. I built a rule-based parser before and found that really helpful (at least for speed): I try the high-probability tag first; if it makes sense, I take it; if not, I come back and try the other. Let me know if you don't do something like that.
Parsing can go a long way without context-based POS tagging. But note that at the end I proposed a "one-and-a-half-step" (一步半) approach, i.e. I can do limited, simple context-based tagging for convenience's sake. The later development is adaptive and in principle does not rely on tagging.
Note that here I am not talking about 兼语词, which is essentially another distinct tag with its own properties. I know this is not 100% accurate, but I see it in Chinese as something like the gerund (动名词) in English.
In fact, I do not see it as 兼语词, but I used that term for the sake of explaining the phenomenon (logically equivalent; elaborating on that properly would take too much space). In my actual system, 学习 is a verb, and only a verb (or a logical verb).
Then this touches on grammar theory. While we may not really need a new theory, we do need a working theory with consistency. You may have a good one in mind, but for most people that is not the case. For example, I see you are deeply influenced by headedness (中心词) and dependency grammar. But not everyone is even aware of these, let alone agrees with them. So far there is no serious competition, as there is really no large-scale success story yet. We will have to wait and see which school of thought (学派) eventually casts the bigger shadow.
Good to be criticized. But I had a point to make there.

# Welcome to try the digital knowledge service platform for the information sciences

A few notes on using the platform: (1) on first use, if your browser does not have the Silverlight plug-in installed, please follow the prompt to download and install it; (2) if you run into minor problems, consult the site's help files; (3) the platform is in fact a database retrieval system, so after typing a query term you need to wait for the drop-down menu to show the matching term, select it, and then click the search button, as shown in the figure below.

Figure 1: Illustration of the search procedure

The platform aims to mine, analyze, and present the academic development of China's automation field (including some overlapping areas of computer science and communications) since 1960. We strive to present a full panorama of domestic academic activity in automation, with comprehensive relational analysis of the field's literature, scholars, institutions, research directions, methods, theories, and tools. To present the knowledge well, we built the site with technologies such as Silverlight and Ajax on top of a carefully designed page layout; to make the presented knowledge more accurate, our data processing employs a range of data-mining techniques, including named-entity recognition and disambiguation as well as text clustering.

email: y.liu@ia.ac.cn

# Myth 3: Real progress in Chinese processing awaits a theoretical breakthrough in Chinese grammar

(1) 对于这件事，我的看法，我们应该听其自然。(As for this matter, my view is, we should let nature take its course.)
(2) 这件事我的看法应该听其自然。(This matter, my view, should let nature take its course.)

# Myths of Chinese processing, part 2: POS tagging is a prerequisite for parsing

The intent of a POS module is word-category disambiguation: assigning, based on context, exactly one grammatical category, e.g. tagging the same word 学习 as a noun in one context and a verb in another. As noted earlier, this offers engineering convenience: if the tags are accurate, the subsequent parsing rules can be simplified, with verbs going down the verb rules and nouns down the noun rules. But that is only one side of the story.
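To make the contrast concrete, here is a toy sketch (all names hypothetical, not the author's actual system) of the alternative the series argues for: instead of forcing one tag per context, keep the full category set from the lexicon and let each grammar rule test membership.

```python
# Toy contrast: lexical feature lookup vs. deterministic one-tag tagging.
LEXICON = {"学习": {"N", "V"}, "利率": {"N"}}  # hypothetical mini-lexicon

def lexical_features(word):
    """Non-deterministic: return every category the lexicon lists."""
    return LEXICON.get(word, set())

def verb_rule_applies(word):
    """A parsing rule fires whenever V is among the word's features;
    final disambiguation is left to the parse itself, not to a tagger."""
    return "V" in lexical_features(word)
```

Here 学习 keeps both N and V, so a verb rule and a noun rule may each try it, and whichever produces a coherent parse wins.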

# Myths of Chinese processing, part 1: the doctrine that word segmentation is uniquely Chinese

1. Although multi-character words dominate the Chinese lexicon, there are also single-character words, especially common function words (conjunctions, prepositions, interjections, etc.). For single-character words, written Chinese clearly has an explicit marker: the natural boundary between characters (if we take the Chinese character, linguistically the morpheme, as the minimal unit of analysis, tokenization is trivial: every two bytes is one character); no space is needed.
2. Modern Chinese multi-character words (e.g. 中华人民共和国) are compounds, in essence no different from Western compounds (e.g. People's Republic of China); spaces do not solve compound boundary detection. In both Chinese and Western languages, compounds are resolved mainly by dictionary lookup, not by natural delimiters such as spaces (German noun compounds are something of a Western exception: closed-class compounds need only spaces, while open-class compounds require further splitting, called decompounding). If a compound's left or right boundary is ambiguous (e.g. both boundaries of 天下 can be ambiguous, as in 今天 下雨; the right boundary of the English compound adverb "in particular" can be ambiguous, e.g. "in particular cases"), that ambiguity, in Chinese or in English, requires context to resolve. In terms of technique, there is nothing special about segmenting Chinese multi-character words: the tokenization methods English uses to identify People's Republic of China and "in particular" apply equally to Chinese word segmentation.
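As an illustration of "resolved by dictionary lookup," here is a common segmentation baseline, forward maximum matching, with a toy lexicon; it is a sketch of the lookup idea, not the author's system.

```python
# Forward maximum matching: greedily take the longest dictionary word
# starting at each position; unmatched single characters fall through
# as one-character tokens. Toy lexicon for illustration only.
LEXICON = {"中华人民共和国", "人民", "共和国", "天下", "下雨"}
MAX_LEN = max(len(w) for w in LEXICON)

def forward_max_match(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown single character
            i += 1
    return tokens
```

With this toy lexicon, 今天下雨 comes out as 今 / 天下 / 雨 (since 今天 is not listed, 天下 matches greedily), reproducing exactly the 天下 boundary ambiguity mentioned above, which only context can fix.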

# A sudden sense of urgency: start Chinese NLP now, or miss the moment


Quote
A good piece of 经验之谈 (lessons from experience). Somehow it reminds me of this --

You assumed a hidden premise -- a given type of application in a given domain.

A recommendation: expand your blog a bit as a series, heading to a book.

My friend 吴军 did that quite successfully, of course with a statistics background, so he approached NLP from a math perspective: the 数学之美 (The Beauty of Mathematics) series.

You have very good thoughts and raw material. You just need to put in a bit more time to make your writing more approachable -- I am commenting on reader reactions like "学习不了。" ("I can't learn from this") and "读起来鸭梨很大" ("reading it is a lot of pressure").

I know you said: "Sometimes I think it shouldn't be made too readable either; this is many years of experience, and if the younger generation wants to learn it, they should taste some hardship. :=)"

But as you already put in the efforts, why not make it more approachable?

The issue is, even if I am willing to taste some hardship (吃点苦头), I still don't know where to start, IF I have never built a real-life NLP system.

For example, lexicalism (词汇主义) by itself is enough for an article. You need to mention its opponents and its history to put it into context, and then give some examples.

So far I have been fairly straightforward in what I write about. If there is a readability issue, it is mainly due to my lack of time. Young people should be able to benefit from my writings, especially once they start getting their hands dirty building a system.

Your discussion is fun. You can see and appreciate things hidden behind my work more than other readers. After all, you have published in THE CL, and you have all but closed off segmentation as a scientific area. Seriously, it is my view that there is not much left to do there after your work on tokenization, in both theory and practice.

I feel some urgency now about doing Chinese NLP asap. Not many people have been through as much as I (luckily) have, so I am in a position to build a much more powerful system and make an impact on Chinese NLP, and hopefully on the IT landscape as well. But time passes fast. That is why my focus is on Chinese processing now, day and night. I am also keeping my hands dirty with a couple of European languages, but they are less challenging and exciting.

# More on Siri (死日), the Apple iPhone's (爱疯) personal assistant

Quote

"Chat systems do flaky work by their very nature" -- exactly right; that is a remnant of so-called old AI. Even so, among the three sources I see in Apple's Siri (1. natural language technology, speech and text; 2. Ask Jeeves-style template technology; 3. so-called AI chat systems), I also see its shadow. It does have practical value: the value of manufacturing the illusion of "artificial intelligence" without understanding.

Siri: “Aluminosilicate glass and stainless steel. Nice, huh?”

Siri replied coyly, in a thin voice:

I am your humble assistant.

《立委's Notes: Deeply impressed by Apple's ability to turn technology into products...》
《From the new iPhone launch: the vast gap between Apple's and Microsoft's ability to productize technology》

# Adhere to the Four Cardinal Principles: developing robust NLP systems

...the establishment of development protocols such as regression testing, supported by data-quality QA. Ideally, data-driven development should also bring in machine learning methods to sift out statistically significant linguistic phenomena and feed them back to the linguists. Looking a bit further ahead: automatically classifying user data feedback, achieving some degree of coarse-grained self-learning, and building a semi-automatic, interactive development environment. This is a meaningful way for manual development and machine learning to complement each other's strengths.
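A minimal sketch of the regression-testing protocol this paragraph calls for (parse and the gold data are placeholders for whatever the actual grammar system provides): run a fixed test suite, diff against saved gold parses, and report regressions for the linguists to inspect.

```python
def run_regression(test_sentences, parse, gold):
    """parse: sentence -> parse output; gold: {sentence: expected output}.
    Returns a list of (sentence, expected, got) for every regression."""
    regressions = []
    for sent in test_sentences:
        got = parse(sent)
        if gold.get(sent) != got:
            regressions.append((sent, gold.get(sent), got))
    return regressions
```

Run after every rule change; an empty result means the grammar still reproduces all gold parses.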

# Word segmentation research should be banned by law :=)

RE: Segmentation is of course the first hurdle. If you don't get that right, nothing else matters.