Monthly Archive: January 2012

A Feast of Shared NLP Resources

"Technological innovation begins with methodology." In response to the Ministry of Science and Technology's Twelfth Five-Year call to strengthen the sharing of scientific resources, the project "Research on Innovative Ideas and Scientific Methods in the Automation Discipline" (project No. 2009IM020300) at the Institute of Automation, Chinese Academy of Sciences, has entered into full cooperation with Datatang (数据堂), a professional domestic research-data sharing platform, to make part of the back-end data of the Automation Discipline Digital Knowledge Service Network Platform, together with other data resources from the project, freely available for research use by colleagues in natural language processing and related fields. The data section is at http://www.datatang.com/member/5878. If you use data from this section in a paper or project, please note that the data come from the project "Research on Innovative Ideas and Scientific Methods in the Automation Discipline", No. 2009IM020300, and cite the Datatang address http://www.datatang.com/member/5878.

The section mainly includes the following resources:

1. Chinese DBLP resources for research on academic communities within computer science

2. Chinese DBLP resources for research on personal-name disambiguation

3. Chinese DBLP resources of ten thousand randomly sampled papers

4. Chinese DBLP resources centered on Chinese journal papers in natural language processing

5. Chinese and English news classification corpora for text-classification research

6. A text-classification program (with open-source code)

7. A corpus of 100,000 Chinese personal names for research on the formation of Chinese names

8. Multi-dimensional English VSM models in ARFF format, generated with feature-selection methods such as IG and chi-square

9. Multi-dimensional Chinese VSM models in ARFF format, generated with feature-selection methods such as IG and chi-square

You are welcome to visit the Automation Discipline Digital Knowledge Service Network Platform: http://autoinnovation.ia.ac.cn

Please continue to follow the innovative-methods project in the automation discipline; you can reach us at

http://weibo.com/autoinnovation

You are also welcome to follow Datatang: http://weibo.com/datatang

Happy Chinese New Year, and best wishes for the Year of the Dragon!

The young are to be reckoned with: an early-stage researcher seems to straddle the fence on the "Myths" debate, but in fact knows the ropes

But "early stage researchers" should not let my praise go to their heads, either. Know-how (门道) has both a door (门) and a path (道): knowing the door is not the same as knowing the path. Getting the door clear takes only intelligence and insight; getting the path clear takes patience, experience, time, the tempering of fighting and losing and losing and fighting again, and a measure of luck besides. Ice three feet thick does not freeze in a single day.
On Thu, Dec 29, 2011 G wrote:

>> As you titled yourself an early-stage researcher, I'd recommend to you a recent dialog on something related -
http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=523458.
>> He has a point as an experienced practitioner.

>> I quote him here as overall he is negative to what you are going to work on [note: this refers to research on word segmentation]. And I agree with him that it's time to shift focus to parsing.
2011/12/29 G
Continuation of the dialog, but with an "early stage researcher". FYI, as I actually recommended your blogs to him in place of my PhD thesis 🙂

On Dec 29, 2011, M wrote:
Hi Dr. G,

I just read Liwei's posts and your comments. I partly agree with Liwei's arguments. I think it's just a different perspective on one of the core problems in NLP, disambiguation.

Usually, beginners take the pipeline architecture for granted, i.e. segmentation --> POS tagging --> chunking --> parsing, etc. However, given that the ultimate goal is to predict the overall syntactic structure of sentences, the early stages of disambiguation can be seen as pruning the exponentially many possible parse trees. In this sense, Liwei is correct. Since ambiguity is the enemy, it is the system designer's choice which architecture to use and/or when to resolve it.

I guess many other people in NLP have also recently realized (and may even widely agree on) the disadvantages of pipeline architectures, which explains why there have been so many "joint learning of X and Y" papers in the past 5 years. In Chinese word segmentation, there are also attempts at doing word segmentation and parsing in one go, which seems promising to me.
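As a purely illustrative sketch (my own, with made-up toy stages rather than any real system), the pipeline architecture discussed above can be caricatured in a few lines of Python. The point to notice is that each stage commits to a single best output, so ambiguity resolved early is pruned away before later stages ever see it:

```python
# Toy sketch of the classic NLP pipeline (hypothetical stages, not a real system).
# Each stage commits to ONE analysis and discards alternatives -- this early
# disambiguation acts as pruning of the space of possible parses.

def segment(text):
    # naive whitespace "segmentation"; a real Chinese segmenter would go here
    return text.split()

def pos_tag(tokens):
    # trivial lookup tagger; unknown words default to noun (NN)
    lexicon = {"dogs": "NNS", "bark": "VBP", "loudly": "RB"}
    return [(t, lexicon.get(t, "NN")) for t in tokens]

def chunk(tagged):
    # group consecutive nominal tags into a single NP chunk
    chunks, current = [], []
    for tok, tag in tagged:
        if tag.startswith("NN"):
            current.append(tok)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [tok]))
    if current:
        chunks.append(("NP", current))
    return chunks

def parse_pipeline(text):
    # the pipeline: segmentation -> POS tagging -> chunking;
    # each stage sees only its predecessor's single best output
    return chunk(pos_tag(segment(text)))

print(parse_pipeline("dogs bark loudly"))
```

A joint model would instead carry multiple segmentation/tagging hypotheses forward (e.g. as a lattice) and let later evidence decide among them; the trade-off is exactly the designer's choice of when to resolve ambiguity.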

On the other hand, I think your comments are quite to the point. Current applications mostly use very shallow NLP information, so accurate tokenizers, POS taggers, and chunkers have value in their own right.

As for the interaction between linguistic theory and computational linguistics, I think it is quite similar to the relationship between other pairs of science and engineering. Basically, science sets the upper bound of engineering, but at any given level of scientific achievement, engineering still has a huge space of possibilities of its own. Moreover, in the specific case of interest here, CL may itself serve as a tool to advance linguistic theory, as the corpus-based study of language seems to be an inevitable trend.

From: Wei Li
Date: Fri, Dec 30, 2011

He is indeed a very promising young researcher who is willing to think and air his own opinions.

I did not realize that my series could leave the impression that I am against the pipeline architecture. In fact I am all for it, as it is the proven, solid architecture for modular engineering development. Of course, having read only my three recent posts, it is not surprising that he got that impression. There is something deeper here: a balance between the pipeline structure and the principle of keeping ambiguity untouched. Making that relationship clear is not easy, but there is a way to do it, based on the experience of "adaptive development" (another important principle).

【Related post】
An old professional friend harshly criticizes Wei Li's "Myths" series for disrupting the order of NLP; Wei Li sticks to his guns

The biggest media-driven myth in NLP history: idioms stump the computer

Machine translation was the earliest practice of NLP. Under the mystique surrounding computers, machine translation, seen as simulating or challenging human intelligence, naturally became a hot topic for the media. Among the reports is one widely circulated machine-translation joke, the most misleading of them all:

The story goes that a reporter, testing a machine translation system, chose this idiom from the Bible:

The spirit is willing, but the flesh is weak

Translated into Russian and then back into English, it came out as:

The whiskey is alright, but the meat is rotten

This is probably the most widely circulated joke in the media. For many years it has been retold with ever more embellishment, becoming the standard laughing stock of NLP.

Yet in natural language technology there is no problem simpler than idioms. The notion that idioms are an NLP difficulty is pure lay speculation, which two factors have made credible to the credulous. First, when an NLP system's idiom dictionary is incomplete, "jokes" like the one above arise and seem to expose the machine's stupidity. In fact, such "errors" are the easiest of all to debug: just complete the dictionary. Because idioms are by definition listable, the dictionary can be completed by hand or learned automatically from corpora; either way, the task is tractable. Linguistics tells us that idioms are characterized by having no (or little) semantic compositionality and must be memorized (stored) as wholes, which is precisely what makes them a closed, enumerable class. Second, there is a misunderstanding of machine "understanding" (really a form of "artificial intelligence"): the assumption that whatever is hard for humans to understand must also be hard for machines, when in fact the two kinds of "understanding" are not the same thing at all. Many idioms have historical stories behind them, and truly understanding their meaning requires historical knowledge, which machines lack; hence the conclusion that idioms are an NLP bottleneck.

The fact is that for NLP, to recognize is to understand, and recognizing enumerable expressions is merely memory, ultimately a matter of storage. Yet some people are naive enough to believe that a "computer" built of cold inorganic materials truly possesses the human brain's capacity for autonomous understanding.

Quote:
The essence of idioms is memory, and at memorizing, the computer is a master while the human brain is tofu.

Of course a large lexicon is needed, but however it is built, it can be done once one sets out to do it; so it is not a problem.

So-called natural language "understanding" (NLU) means decomposing open expressions into relational combinations of dictionary units (including idioms); the technical term is semantic compositionality. Once things reach the dictionary level, understanding ends. However the semantic representation is arranged, that is system-internal and irrelevant to the essence of understanding.
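To make "recognition is just memory" concrete, here is a minimal sketch of my own (the lexicon entries and tags are invented for illustration): once a listable idiom is stored whole in the lexicon, recognizing it reduces to a greedy longest-match lookup, and the whiskey-and-meat mistranslation cannot arise for any idiom the dictionary covers:

```python
# Minimal sketch: idioms stored as whole lexicon entries (hypothetical data);
# recognition = greedy longest-match lookup over the token stream.

IDIOMS = {
    ("the", "spirit", "is", "willing", ",", "but", "the", "flesh", "is", "weak"):
        "IDIOM<spirit-willing-flesh-weak>",
    ("kick", "the", "bucket"): "IDIOM<die>",
}
MAX_LEN = max(len(key) for key in IDIOMS)

def recognize(tokens):
    """Prefer the longest stored idiom at each position; pass other words through."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in IDIOMS:
                out.append(IDIOMS[key])   # the whole idiom is ONE dictionary unit
                i += n
                break
        else:
            out.append(tokens[i])         # ordinary word: handled compositionally
            i += 1
    return out

print(recognize("The spirit is willing , but the flesh is weak".split()))
print(recognize("He will kick the bucket".split()))
```

Once the idiom is matched as a single unit, downstream translation is a table lookup on that unit; the task scales with storage, not with any deep "understanding" of the idiom's historical background.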



【Postscript】While writing this short piece, I searched online for the original source of this widely circulated joke and discovered that Prof. Feng Zhiwei has a dedicated article on the story's origin and evolution; according to his research, the joke was fabricated (see Feng Zhiwei, "一个关于机器翻译的史料错误" ["A Historical Error Concerning Machine Translation"]). The purpose of this post is to clear up the misunderstanding itself. Whether the joke was fabricated does not really matter; what matters is that its entertainment value, and the media's and the public's appetite for entertainment, have kept a plausible-sounding misconception alive and deeply rooted for decades.