Monthly Archive: March 2012

abcNLP: AB-Natural Chinese Language Processing

abcNLP Readme

Introduction

abcNLP

AB-Natural Chinese Language Processing: moving the C around is what makes it ab-natural.

Because of internet censorship, people often have trouble expressing themselves freely. Usually the SYSTEM filters content by (keyword) pattern matching. If we can make the language hard for the SYSTEM to understand, but still easy for humans, we make the expression "Immune 2 Censorship" (at least, to some extent)!
Continue reading

ACL 2012 Accepted Long Papers (Poster Papers)

The following information about the ACL 2012 poster papers is reposted from the NLP board of the Shuimu (newsmth) BBS:

Poster Papers

No. Paper Title Authors
1 A BROAD-COVERAGE NORMALIZATION SYSTEM FOR SOCIAL MEDIA LANGUAGE Fei Liu, Fuliang Weng and Xiao Jiang
2 A COST SENSITIVE PART-OF-SPEECH TAGGING: DIFFERENTIATING SERIOUS ERRORS FROM MINOR ERRORS Hyun-Je Song, Jeong-Woo Son and Seong-Bae Park
3 A RANKING-BASED APPROACH TO WORD REORDERING FOR STATISTICAL MACHINE TRANSLATION Nan Yang, Mu Li and Dongdong Zhang
4 AUTOMATIC EVENT EXTRACTION WITH STRUCTURED PREFERENCE MODELING Wei Lu and Dan Roth
5 BIG DATA VERSUS THE CROWD: WHERE TO LOOK FOR A RELATIONSHIP (TO EXTRACT) Ce Zhang, Feng Niu, Christopher Ré and Jude Shavlik
6 CHINESE COMMA DISAMBIGUATION FOR DISCOURSE ANALYSIS Yaqin Yang and Nianwen Xue
7 CLASSIFYING FRENCH VERBS USING FRENCH AND ENGLISH LEXICAL RESOURCES Ingrid Falk, Claire Gardent and Jean-Charles Lamirel
8 COLLECTIVE CLASSIFICATION FOR FINE-GRAINED INFORMATION STATUS Katja Markert, Yufang Hou and Michael Strube
9 COMBINING COHERENCE MODELS AND MACHINE TRANSLATION EVALUATION METRICS FOR SUMMARIZATION EVALUATION Ziheng Lin, Chang Liu, Hwee Tou Ng and Min-Yen Kan
10 DISCRIMINATIVE LEARNING FOR JOINT TEMPLATE FILLING Einat Minkov and Luke Zettlemoyer
11 ECOLOGICAL EVALUATION OF PERSUASIVE MESSAGES USING GOOGLE ADWORDS Marco Guerini, Carlo Strapparava and Oliviero Stock
12 ENHANCED CHARACTER-LEVEL EVALUATION FOR LANGUAGES WITH HIGHLY COMPOSITIONAL VOCABULARY Chang Liu and Hwee Tou Ng
13 EXPLOITING SOCIAL INFORMATION IN GROUNDED LANGUAGE LEARNING VIA GRAMMATICAL REDUCTION Mark Johnson, Katherine Demuth and Michael Frank
14 EXPLORING DETERMINISTIC CONSTRAINTS: FROM A CONSTRAINED ENGLISH POS TAGGER TO AN EFFICIENT ILP SOLUTION TO CHINESE WORD SEGMENTATION Qiuye Zhao and Mitch Marcus
15 HIERARCHICAL CHUNK-TO-STRING TRANSLATION Yang Feng, Dongdong Zhang, Mu Li and Qun Liu
16 IMPROVE SMT QUALITY WITH AUTOMATICALLY EXTRACTED PARAPHRASE RULES Wei He, Hua Wu, Haifeng Wang and Ting Liu
17 IMPROVING WORD REPRESENTATIONS VIA GLOBAL CONTEXT AND MULTIPLE WORD PROTOTYPES Eric Huang, Richard Socher, Christopher Manning and Andrew Ng
18 INCREMENTAL JOINT APPROACH TO WORD SEGMENTATION, POS TAGGING, AND DEPENDENCY PARSING IN CHINESE Jun Hatori, Takuya Matsuzaki, Yusuke Miyao and Jun'ichi Tsujii
19 LARGE-SCALE TREELET LANGUAGE MODELING Adam Pauls and Dan Klein
20 MIXING MULTIPLE TRANSLATION MODELS IN STATISTICAL MACHINE TRANSLATION Majid Razmara, George Foster and Anoop Sarkar
21 MODELING SENTENCE SIMILARITY IN THE LATENT SPACE Weiwei Guo and Mona Diab
22 MODELING THE TRANSLATION OF PREDICATE-ARGUMENT STRUCTURE FOR SMT Deyi Xiong, Min Zhang and Haizhou Li
23 NAMED ENTITY DISAMBIGUATION IN STREAMING DATA Alexandre Davis, Adriano Veloso, Altigran Soares, Alberto Laender and Wagner Meira Jr.
24 POLARITY CONSISTENCY CHECKING FOR SENTIMENT DICTIONARIES Eduard Dragut, Hong Wang, Clement Yu, Prasad Sistla and Weiyi Meng
25 PORT: A PRECISION-ORDER-RECALL MT EVALUATION METRIC FOR TUNING Boxing Chen, Roland Kuhn and Samuel Larkin
26 SENTENCE SIMPLIFICATION BY MONOLINGUAL MACHINE TRANSLATION Sander Wubben, Antal van den Bosch and Emiel Krahmer
27 STRUCTURING E-COMMERCE INVENTORY Khash Rohanimanesh, Karin Mauge and Jean-David Ruvini
28 TEXT SEGMENTATION BY LANGUAGE USING MINIMUM DESCRIPTION LENGTH Hiroshi Yamaguchi and Kumiko Tanaka-Ishii
29 YOU HAD ME AT HELLO: HOW PHRASING AFFECTS MEMORABILITY Cristian Danescu-Niculescu-Mizil, Justin Cheng, Jon Kleinberg and Lillian Lee

For the official ACL 2012 announcement, see: http://www.acl2012.org/board/news_view.asp?intId=25&intCurrPage=&strChoi=&strKeyw=

ACL 2012 Accepted Long Papers (ORAL)

No. Paper Title Authors
1 A CLASS-BASED AGREEMENT MODEL FOR GENERATING ACCURATELY INFLECTED TRANSLATIONS Spence Green and John DeNero
2 A COMPUTATIONAL APPROACH TO AUTOMATIZE CREATIVE NAMING Gozde Ozbal and Carlo Strapparava
3 A DISCRIMINATIVE HIERARCHICAL MODEL FOR FAST COREFERENCE AT LARGE SCALE Michael Wick, Sameer Singh and Andrew McCallum
4 A JOINT MODEL FOR DISCOVERY OF ASPECTS IN UTTERANCES Asli Celikyilmaz and Dilek Hakkani-Tur
5 A NONPARAMETRIC BAYESIAN APPROACH TO ACOUSTIC MODEL DISCOVERY Chia-ying Lee and James Glass
6 A PROBABILISTIC MODEL FOR CANONICALIZING NAMED ENTITY MENTIONS Dani Yogatama, Yanchuan Sim and Noah A. Smith
7 A STATISTICAL MODEL FOR UNSUPERVISED AND SEMI-SUPERVISED TRANSLITERATION MINING Hassan Sajjad, Alexander Fraser and Helmut Schmid
8 A TOPIC SIMILARITY MODEL FOR HIERARCHICAL PHRASE-BASED TRANSLATION Xinyan Xiao, Deyi Xiong, Min Zhang, Qun Liu and Shouxun Lin
9 ASPECT EXTRACTION THROUGH SEMI-SUPERVISED MODELING Arjun Mukherjee and Bing Liu
10 ATTACKING PARSING BOTTLENECKS WITH UNLABELED DATA AND RELEVANT FACTORIZATIONS Emily Pitler
11 AUTOMATED ESSAY SCORING TOWARDS ASR TRANSCRIPTION OF ORAL ENGLISH SPEECH BASED ON FINITE STATE TRANSDUCER Xingyuan Peng, Dengfeng Ke and Bo Xu
12 BAYESIAN SYMBOL-REFINED TREE SUBSTITUTION GRAMMARS FOR SYNTACTIC PARSING Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino and Masaaki Nagata
13 BOOTSTRAPPING A UNIFIED MODEL OF LEXICAL AND PHONETIC ACQUISITION Micha Elsner, Sharon Goldwater and Jacob Eisenstein
14 BOOTSTRAPPING VIA GRAPH PROPAGATION Max Whitney and Anoop Sarkar
15 CAPTURING PARADIGMATIC AND SYNTAGMATIC LEXICAL RELATIONS: TOWARDS ACCURATE CHINESE PART-OF-SPEECH TAGGING Weiwei Sun and Hans Uszkoreit
16 COLLECTIVE GENERATION OF NATURAL IMAGE DESCRIPTIONS Polina Kuznetsova, Vicente Ordonez, Alexander Berg, Tamara Berg and Yejin Choi
17 COMMUNITY ANSWER SUMMARIZATION FOR MULTI-SENTENCE QUESTION WITH GROUP L1 REGULARIZATION Wen Chan, Xiangdong Zhou, Wei Wang and Tat-Seng Chua
18 COMPUTATIONAL APPROACHES TO SENTENCE COMPLETION Geoffrey Zweig, John C. Platt, Christopher Meek, Christopher J.C. Burges, Ainur Yessenalina and Qiang Liu
19 CONCEPT-TO-TEXT GENERATION VIA DISCRIMINATIVE RERANKING Ioannis Konstas and Mirella Lapata
20 COREFERENCE SEMANTICS FROM WEB FEATURES Mohit Bansal and Dan Klein
21 CROSS-DOMAIN CO-EXTRACTION OF SENTIMENT AND TOPIC LEXICONS Fangtao Li, Sinno Jialin Pan, Ou Jin, Qiang Yang and Xiaoyan Zhu
22 CROSSLINGUAL INDUCTION OF SEMANTIC ROLES Ivan Titov and Alexandre Klementiev
23 CROSS-LINGUAL MIXTURE MODEL FOR SENTIMENT CLASSIFICATION Xinfan Meng, Furu Wei, Xiaohua Liu, Ming Zhou and Houfeng Wang
24 DECIPHERING FOREIGN LANGUAGE BY COMBINING N-GRAM LANGUAGE MODELS AND CONTEXT VECTORS Malte Nuhn, Arne Mauser and Hermann Ney
25 DISCRIMINATIVE PRONUNCIATION MODELING: A LARGE-MARGIN, FEATURE-RICH APPROACH Hao Tang, Joseph Keshet and Karen Livescu
26 DISCRIMINATIVE STRATEGIES TO INTEGRATE MULTIWORD EXPRESSION RECOGNITION AND PARSING Matthieu Constant, Anthony Sigogne and Patrick Watrin
27 DISTRIBUTIONAL SEMANTICS IN TECHNICOLOR Elia Bruni, Gemma Boleda, Marco Baroni and Nam Khanh Tran
28 EFFICIENT SEARCH FOR TRANSFORMATION-BASED INFERENCE Asher Stern, Roni Stern, Ido Dagan and Ariel Felner
29 EFFICIENT TREE-BASED APPROXIMATION FOR ENTAILMENT GRAPH LEARNING Jonathan Berant, Ido Dagan, Meni Adler and Jacob Goldberger
30 ERROR MINING ON DEPENDENCY TREES Claire Gardent and Shashi Narayan
31 EXPLOITING MULTIPLE TREEBANKS FOR PARSING WITH QUASI-SYNCHRONOUS GRAMMARS Zhenghua Li, Wanxiang Che and Ting Liu
32 EXTRACTING NARRATIVE TIMELINES AS TEMPORAL DEPENDENCY STRUCTURES Oleksandr Kolomiyets, Steven Bethard and Marie-Francine Moens
33 FAST ONLINE LEXICON LEARNING FOR GROUNDED LANGUAGE ACQUISITION David Chen
34 FAST SYNTACTIC ANALYSIS FOR STATISTICAL LANGUAGE MODELING VIA SUBSTRUCTURE SHARING AND UPTRAINING Ariya Rastrow, Mark Dredze and Sanjeev Khudanpur
35 FINDING BURSTY TOPICS FROM MICROBLOGS Qiming DIAO, Jing JIANG, Feida ZHU and Ee Peng LIM
36 FINDING SALIENT DATES FOR BUILDING THEMATIC TIMELINES Rémy Kessler, Xavier Tannier, Caroline Hagège, Véronique Moriceau and André Bittar
37 HEAD-DRIVEN TRANSITION-BASED PARSING WITH TOP-DOWN PREDICTION Katsuhiko Hayashi, Taro Watanabe, Masayuki Asahara and Yuji Matsumoto
38 HISTORICAL ANALYSIS OF LEGAL OPINIONS WITH A SPARSE MIXED-EFFECTS LATENT VARIABLE MODEL William Yang Wang, Elijah Mayfield, Suresh Naidu and Jeremiah Dittmar
39 ITERATIVE VITERBI A* ALGORITHM FOR K-BEST SEQUENTIAL DECODING Zhiheng Huang, Yi Chang, Bo Long, Jean-Francois Crespo, Anlei Dong, Sathiya Keerthi and Su-Lin Wu
40 JOINT CHINESE WORD SEGMENTATION AND NEW WORD DETECTION VIA FAST CONFLICTION-BASED LEARNING Xu Sun
41 JOINT FEATURE SELECTION IN DISTRIBUTED STOCHASTIC LEARNING FOR LARGE-SCALE DISCRIMINATIVE TRAINING IN SMT Patrick Simianer, Stefan Riezler and Chris Dyer
42 JOINT INFERENCE OF NAMED ENTITY RECOGNITION AND NORMALIZATION FOR TWEETS Xiaohua Liu, Ming Zhou, Xiangyang Zhou, Zhongyang Fu and Furu Wei
43 LABELING DOCUMENTS WITH TIMESTAMPS: LEARNING FROM THEIR TIME EXPRESSIONS Nathanael Chambers
44 LEARNING HIGH-LEVEL PLANNING FROM TEXT S.R.K. Branavan, Nate Kushman, Tao Lei and Regina Barzilay
45 LEARNING SYNTACTIC VERB FRAMES USING GRAPHICAL MODELS Thomas Lippincott, Anna Korhonen and Diarmuid Ó Séaghdha
46 LEARNING TO "READ BETWEEN THE LINES" USING BAYESIAN LOGIC PROGRAMS Sindhu Raghavan, Raymond Mooney and Hyeonseo Ku
47 LEARNING TO TRANSLATE WITH MULTIPLE OBJECTIVES Kevin Duh, Katsuhito Sudoh, Xianchao Wu, Hajime Tsukada and Masaaki Nagata
48 LEARNING TRANSLATION CONSENSUS WITH STRUCTURED LABEL PROPAGATION Shujie Liu, Chi-Ho Li, Mu Li and Ming Zhou
49 MACHINE TRANSLATION WITHOUT WORDS THROUGH SUBSTRING ALIGNMENT Graham Neubig, Taro Watanabe, Shinsuke Mori and Tatsuya Kawahara
50 MAXIMUM EXPECTED BLEU TRAINING OF PHRASE AND LEXICON TRANSLATION MODELS Xiaodong He and Li Deng
51 MINING ENTITY TYPES FROM QUERY LOGS VIA USER INTENT MODELING Patrick Pantel, Thomas Lin and Michael Gamon
52 MIX IS NOT A TREE-ADJOINING LANGUAGE Makoto Kanazawa and Sylvain Salvati
53 MODELING REVIEW COMMENTS Arjun Mukherjee and Bing Liu
54 MODELING TOPIC DEPENDENCIES IN HIERARCHICAL TEXT CATEGORIZATION Alessandro Moschitti and Qi Ju
55 MODELING TRANSLATION DISAGREEMENT THROUGH RICH GRAPHS FOR TRANSLATION QUALITY ESTIMATION Daniele Pighin, Meritxell Gonzalez and Lluís Màrquez
56 MODIFIED DISTORTION MATRICES FOR PHRASE-BASED STATISTICAL MACHINE TRANSLATION Arianna Bisazza and Marcello Federico
57 MULTILINGUAL NAMED ENTITY RECOGNITION USING PARALLEL DATA AND METADATA FROM WIKIPEDIA Sungchul Kim and Kristina Toutanova
58 NBEST CCG PARSING AND RERANKING Dominick Ng and James R. Curran
59 PDTB-STYLE DISCOURSE ANNOTATION OF CHINESE TEXT Yuping Zhou and Nianwen Xue
60 PREDICTION OF LEARNING CURVES IN MACHINE TRANSLATION Prasanth Kolachina, Nicola Cancedda, Marc Dymetman and Sriram Venkatapathy
61 PROBABILISTIC INTEGRATION OF PARTIAL LEXICAL INFORMATION FOR NOISE ROBUST HAPTIC VOICE RECOGNITION Khe Chai Sim
62 REDUCING APPROXIMATION AND ESTIMATION ERRORS FOR CHINESE LEXICAL PROCESSING WITH HETEROGENEOUS ANNOTATIONS Weiwei Sun
63 REDUCING WRONG LABELS IN DISTANT SUPERVISION FOR RELATION EXTRACTION Shingo Takamatsu, Issei Sato and Hiroshi Nakagawa
64 SELECTIVE SHARING FOR MULTILINGUAL DEPENDENCY PARSING Tahira Naseem, Regina Barzilay and Amir Globerson
65 SEMANTIC PARSING WITH BAYESIAN TREE TRANSDUCERS Bevan Jones, Mark Johnson and Sharon Goldwater
66 SEMI-SUPERVISED DEPENDENCY PARSING USING LEXICAL AFFINITIES Seyed Abolghasem Mirroshandel, Alexis Nasr and Joseph Le Roux
67 SENTENCE DEPENDENCY TAGGING IN ONLINE QUESTION ANSWERING FORUMS Zhonghua Qu and Yang Liu
68 SITS: A HIERARCHICAL NONPARAMETRIC MODEL USING SPEAKER IDENTITY FOR TOPIC SEGMENTATION IN MULTIPARTY CONVERSATIONS Viet-An Nguyen, Jordan Boyd-Graber and Philip Resnik
69 SMALLER ALIGNMENT MODELS FOR BETTER TRANSLATIONS: UNSUPERVISED WORD ALIGNMENT WITH THE L0-NORM Ashish Vaswani, Liang Huang and David Chiang
70 SPECTRAL LEARNING OF LATENT-VARIABLE PCFGS Shay B. Cohen, Karl Stratos, Michael Collins, Dean Foster and Lyle Ungar
71 SPICE IT UP? MINING REFINEMENTS TO ONLINE INSTRUCTIONS FROM USER GENERATED CONTENT Gregory Druck and Bo Pang
72 STRING RE-WRITING KERNEL Fan Bu, Hang Li and Xiaoyan Zhu
73 STRONG LEXICALIZATION OF TREE ADJOINING GRAMMARS Andreas Maletti and Joost Engelfriet
74 SUBGROUP DETECTION IN IDEOLOGICAL DISCUSSIONS Amjad Abu-Jbara, Pradeep Dasigi, Mona Diab and Dragomir Radev
75 TEMPORALLY ANCHORED RELATION EXTRACTION Guillermo Garrido, Anselmo Peñas, Bernardo Cabaleiro and Álvaro Rodrigo
76 TEXT-LEVEL DISCOURSE PARSING WITH RICH LINGUISTIC FEATURES Vanessa Wei Feng and Graeme Hirst
77 THE CREATION OF A CORPUS OF ENGLISH METALANGUAGE Shomir Wilson
78 TRANSLATION MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION WITH MONOLINGUAL TOPIC INFORMATION Jinsong Su, Hua Wu, Haifeng Wang, Yidong Chen, Xiaodong Shi, Huailin Dong and Qun Liu
79 TWEET RECOMMENDATION: A UNIFIED TWEET-TWITTERER CO-RANKING FRAMEWORK FUSING POPULARITY, PERSONALIZATION AND DIVERSITY Rui Yan, Mirella Lapata and Xiaoming Li
80 UNSUPERVISED RELATION DISCOVERY WITH SENSE DISAMBIGUATION Limin Yao, Sebastian Riedel and Andrew McCallum
81 UTILIZING DEPENDENCY LANGUAGE MODELS FOR GRAPH-BASED DEPENDENCY PARSING MODELS Wenliang Chen, Min Zhang and Haizhou Li
82 VERB CLASSIFICATION USING DISTRIBUTIONAL SIMILARITY IN SYNTACTIC AND SEMANTIC STRUCTURES Danilo Croce, Alessandro Moschitti, Roberto Basili and Martha Palmer
83 WORD SENSE DISAMBIGUATION IMPROVES INFORMATION RETRIEVAL Zhi Zhong and Hwee Tou Ng

Some Recent Changes in Moses

  I took a look at Moses recently and noticed a number of changes. In particular, the whole open-source project moved from SourceForge to GitHub a few months ago, which says something about how popular GitHub has become. The build system has also changed substantially: Moses used to be built with Make but now uses bjam, and the Boost library, which used to be optional, is now required; without Boost, Moses basically cannot be built.

  In practice, on Ubuntu you can install Boost quickly with "sudo apt-get install libboost-all-dev" and then check out the source code:
git clone git://github.com/moses-smt/mosesdecoder.git

  After checking out the Moses code, if you only want to try Moses itself rather than set up the whole statistical machine translation pipeline, you can build it directly with bjam:
cd ~/mosesdecoder
./bjam -j2
The number after -j sets how many parallel jobs to use for the build on a multi-core machine.

If everything goes well (a few unimportant errors can be tolerated), the build creates a bin and a lib directory under dist: the former holds the executable binaries, such as moses and moses_chart, and the latter holds the related libraries, such as libmoses.a.

A step-by-step build guide is available in the official Moses documentation:
http://www.statmt.org/moses_steps.html
One problem with that page is that it does not mention installing Boost. Without Boost, the bjam build fails with many errors about missing Boost libraries, and neither the Moses binaries nor the libraries are produced.

Another important piece of news is that Moses development is now supported by MosesCore, an EU project. Looking it up, it seems to have been launched only this year; as the name suggests it is closely tied to Moses and is dedicated to promoting open-source statistical machine translation in both academia and industry:

MosesCore is an EU funded Coordination Action, which aims to encourage the development and usage of open source machine translation.

MosesCore draws together academic and commercial partners sharing a common interest in open source machine translation, and will:

Provide coordination and stewardship of the development of open source software for machine translation, notably the Moses statistical MT toolkit. This will result in at least three major releases of Moses, one in each year of the project.

Outreach to the research community through academic workshops, evaluation campaigns and the machine translation marathons.

Outreach to current and potential users of MT by providing a well maintained web presence, an active newsletter, and three annual outreach events for knowledge sharing and tutorial.

Improve interaction between academic and industrial MT stakeholders through both the outreach events and tutorials, and the marathons.

Itenyh Edition - Chinese Word Segmentation with HMM, Part 4: A Pure-HMM Segmenter

First, the resources. The segmentation corpus comes from icwb2-data.rar of the SIGHAN Bakeoff 2005, which is described in detail in the post 《中文分词入门之资源》 (resources for getting started with Chinese word segmentation):

/icwb2-data.rar/training/msr_training.utf8    used to train the HMM; it contains roughly 2,000,000 segmented words

/icwb2-data.rar/testing/pku_test.utf8           the test set

/icwb2-data.rar/scripts/score                      a Perl script for scoring the segmentation output

         We use the classic character-tagging model. The first thing to decide is the tag set. In the earlier posts we used the two-tag set {B, E}, but research shows that a character tagger with four tags clearly outperforms one with two, because two tags are too coarse and lose information. The four-tag set is {B, E, M, S}, with the following meanings:

B: the beginning of a word

E: the end of a word

M: the middle of a word

S: a single character that is a word by itself

Example: 你S现B在E应B该E去S幼B儿M园E了S
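As a concrete illustration, the following small sketch (a hypothetical helper, not my actual code) maps one segmented word to its character tags; applying it word by word is how the training file gets labelled. It assumes each character fits in a single Java char, which holds for ordinary Chinese characters.

public class Bmes {
    // Tag one already-segmented word with {B, M, E, S}.
    public static String[] tagsFor(String word) {
        int n = word.length();
        String[] tags = new String[n];
        if (n == 0) return tags;
        if (n == 1) { tags[0] = "S"; return tags; }      // single-character word
        tags[0] = "B";                                   // word beginning
        for (int i = 1; i < n - 1; i++) tags[i] = "M";   // word middle
        tags[n - 1] = "E";                               // word end
        return tags;
    }
    // e.g. tagsFor("幼儿园") -> {"B","M","E"}, tagsFor("你") -> {"S"}
}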

Once msr_training.utf8 has been labelled with these four tags, we can build an HMM purely by counting. We will build a 2-gram (bigram) model, i.e. a first-order HMM, in which the tag of each character depends only on the tag of the character before it. What we need to estimate are the HMM's state transition matrix A and its emission (confusion) matrix B, where:

                                     Aij = P(Cj|Ci)  =  P(Ci,Cj) / P(Ci) = Count(Ci,Cj) / Count(Ci)

                                     Bij = P(Oj|Ci)  =  P(Oj,Ci) / P(Ci) = Count(Oj,Ci) / Count(Ci)

Here C ∈ {B, E, M, S}, O is the set of characters, and Count denotes a frequency count. When computing Bij, data sparsity means that many characters never occur in the training set, which leaves zero probabilities in B. To patch this up we apply add-one smoothing:

                                     Bij = P(Oj|Ci)  =  (Count(Oj,Ci) + 1)/ Count(Ci)

         This is not the best smoothing technique, because it can under- or over-estimate the true probabilities; a more principled choice is the somewhat more involved Good-Turing method, whose original version was invented by Turing and his assistant Good while breaking the German cipher machines. The estimated matrix A is:

        B                     M                     E                     S
B   0.0                   0.19922840916814916   0.8007715908318509    0.0
M   0.0                   0.47583202978061256   0.5241679702193874    0.0
E   0.6309567616935934    0.0                   0.0                   0.36904323830640656
S   0.6343402140354506    0.0                   0.0                   0.36565844303914763

The zero entries in the matrix tell us that the eight transitions B-B, B-S, M-B, M-S, E-M, E-E, S-M and S-E can never occur, which is exactly as it should be. Part of the estimated matrix B looks like this:

        o1                 o2                  o3                  o4
B   7.8127868094E-7   1.460991186336E-4   0.007293548529516   6.047041844505E-4  ……
M   ……
E   ……
S   ……

We set the initial vector Pi = {0.5, 0.0, 0.0, 0.5}, since M and E cannot occur at the start of a sentence. With that, the HMM is fully specified. Building an HMM by counting is really not complicated; the troublesome part is writing the program, in particular encoding the tag set and the character set, which is an extremely tedious business.
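For reference, here is a minimal standalone sketch of the counting step (illustrative code, not my actual program; the class and parameter names are assumptions). Tags are encoded 0=B, 1=M, 2=E, 3=S, each training sentence arrives as a tag array plus a character-code array of the same length, and Bij uses the add-one formula given above.

import java.util.List;

public class HmmCounts {
    static final int N = 4;  // states: B, M, E, S

    // Fills pi, a (N x N) and b (N x vocabSize) in place from the tagged corpus.
    public static void estimate(List<int[]> tagSeqs, List<int[]> charSeqs,
                                int vocabSize, double[] pi, double[][] a, double[][] b) {
        double[] count = new double[N];                       // Count(Ci)
        for (int s = 0; s < tagSeqs.size(); s++) {
            int[] tags = tagSeqs.get(s);
            int[] chars = charSeqs.get(s);
            pi[tags[0]] += 1.0;                               // which tag starts the sentence
            for (int t = 0; t < tags.length; t++) {
                count[tags[t]] += 1.0;
                b[tags[t]][chars[t]] += 1.0;                  // Count(Oj, Ci)
                if (t > 0) a[tags[t - 1]][tags[t]] += 1.0;    // Count(Ci, Cj)
            }
        }
        for (int i = 0; i < N; i++) {
            pi[i] /= tagSeqs.size();                          // or simply fix Pi = {0.5, 0, 0, 0.5}
            for (int j = 0; j < N; j++) a[i][j] /= count[i];  // Aij = Count(Ci,Cj) / Count(Ci)
            for (int k = 0; k < vocabSize; k++)
                b[i][k] = (b[i][k] + 1.0) / count[i];         // Bij with add-one smoothing, as above
        }
    }
}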

I write my programs in Java, so I used the open-source Jahmm project for the computations. Given the HMM, we can now feed in an observation sequence and use the Viterbi algorithm to obtain a hidden sequence, i.e. the segmentation. The procedure is as follows (the example values are made up):

1. Split the test set on punctuation, spaces and line breaks into single character-only sentences:

……

中共中央总书记

国家主席

……

2. Convert each sentence into its character codes, i.e. the observation sequence:

……

101 121 101 47 1010 32 1992

332 3241 893 2111

……

3. Feed the observation sequence to the Viterbi algorithm and output the hidden sequence (0: B, 1: M, 2: E, 3: S); a minimal code sketch of this step follows the walkthrough:

……

0112333

0202

……

4. Output the final segmentation result:

……

中共中央/总/书/记

国家/主席

……
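Step 3 is where the Viterbi algorithm does the work. Jahmm provides it out of the box, but for reference here is a minimal standalone sketch (illustrative only, in log space so that small probabilities do not underflow); pi, a and b are the parameters estimated above and obs is one encoded sentence.

public class Viterbi {
    public static int[] decode(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length, T = obs.length;
        double[][] delta = new double[T][n];   // best log-probability ending in state j at time t
        int[][] psi = new int[T][n];           // back-pointers
        for (int i = 0; i < n; i++)
            delta[0][i] = Math.log(pi[i]) + Math.log(b[i][obs[0]]);
        for (int t = 1; t < T; t++) {
            for (int j = 0; j < n; j++) {
                double best = Double.NEGATIVE_INFINITY; int arg = 0;
                for (int i = 0; i < n; i++) {
                    double score = delta[t - 1][i] + Math.log(a[i][j]);
                    if (score > best) { best = score; arg = i; }
                }
                delta[t][j] = best + Math.log(b[j][obs[t]]);
                psi[t][j] = arg;
            }
        }
        int[] path = new int[T];
        double best = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < n; i++)
            if (delta[T - 1][i] > best) { best = delta[T - 1][i]; path[T - 1] = i; }
        for (int t = T - 1; t > 0; t--) path[t - 1] = psi[t][path[t]];   // follow back-pointers
        return path;   // e.g. {0,1,1,2,3} meaning B M M E S
    }
}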

Now we segment the test set pku_test.utf8 and evaluate the result with the Perl score script. The summary is:

=== SUMMARY:
=== TOTAL INSERTIONS:    5627
=== TOTAL DELETIONS:    10639
=== TOTAL SUBSTITUTIONS:    18194
=== TOTAL NCHANGE:    34460
=== TOTAL TRUE WORD COUNT:    104372
=== TOTAL TEST WORD COUNT:    99360

=== TOTAL TRUE WORDS RECALL:  0.724   (correctly segmented words / total words in the gold standard)

=== TOTAL TEST WORDS PRECISION: 0.760  (correctly segmented words / total words produced by the segmenter)

=== F MEASURE:      0.742   F1 = 2 * Precision*Recall / (Precision + Recall)

=== OOV Recall Rate:    0.250      (recall on out-of-vocabulary words)

Precision is 76% and recall is 72.4%, not a very satisfying result. Let's look at what the segmenter actually did:

                                                        改判/被/告人/死/刑立/即/执行

This is a fairly typical example: the output contains "words" such as 告人 and 刑立 that exist in no dictionary, a consequence of the HMM segmenter having too much freedom. Of course, that freedom also has its advantages, for example:

                                                        检察院/鲍绍坤/检察/长

         Look! The HMM successfully picked out a person's name, something a purely dictionary-based segmenter cannot do. In fact, the ICTCLAS system (2004) uses an HMM for exactly this kind of person-name recognition. The next post will introduce a hybrid HMM segmenter that also uses a dictionary.

Free Online Stanford Course on NLP

Registration closes 18 March 2012.

https://www.coursera.org/nlp/auth/welcome

Stanford is offering this course on Natural Language Processing free and online to students worldwide, starting on March 12, 2012, continuing Stanford's exciting forays into large scale online instruction. Students have access to screencast lecture videos, are given quiz questions, assignments and exams, receive regular feedback on progress, and can participate in a discussion forum. Those who successfully complete the course will receive a statement of accomplishment. Taught by Professors Jurafsky and Manning, the curriculum draws from Stanford's courses in Natural Language Processing.

Tencent Search Advertising Platform Department is Hiring Engineers

Tencent has large-scale, high-quality content traffic on the internet, but its ability to monetize that traffic still has room to improve. The Long-Tail Advertising Center of Tencent's Search Advertising Platform Department is working to build a high-quality platform for monetizing content traffic. We are currently working in parallel on the following areas:

0. Building the full engineering pipeline for matching ads to content, both frontend and backend

1. Text summarization & keyword extraction

2. Text classification, including web page classification and ad classification

3. Ad click-through rate prediction

4. Development and optimization of parallel machine learning tools, including parallel logistic regression and parallel LDA

5. User behavior analysis and modeling, mainly for behavioral targeting in advertising

Candidates with a relevant background are welcome to join us.

Requirements:

1. Proficiency in C++

2. Knowledge of Python/Java/Perl/AWK is a plus; familiarity with Linux development is a plus

3. Familiarity with the Google coding style is a plus; code review experience is a plus

4. Experience with MapReduce is a plus

5. A background in NLP, IR, ML or computational advertising is a plus

The position is based in Beijing. To apply for a full-time role, please send your resume to rickjin At tencent.com
Continue reading

Itenyh Edition - Chinese Word Segmentation with HMM, Part 3: The Cost of the Forward and Viterbi Algorithms

The previous posts never mentioned the forward algorithm or the Viterbi algorithm, mainly because I wanted to stress that a mathematical solution is not the same as an algorithmic one: an algorithmic solution takes the cost of computation into account, because algorithms are designed for computers.

First, look at the time cost of the "evaluation" problem. Recall the formula:

P(O) = Σ_C P(O|C)P(C) = Σ_C P(C1)P(O1|C1)P(C2|C1)P(O2|C2)…P(Ci|Ci-1)P(Oi|Ci)    (the sum runs over all possible hidden sequences C)

If we solve this by brute-force enumeration, then with N possible states and a sequence of length T there are N^T (N to the power T) possible hidden sequences, and each of them takes about 2T multiplications, so the total work is on the order of 2T*N^T operations (precisely, (2T-1)*N^T multiplications and N^T-1 additions). For N=5 and T=100 that is about 2*100*5^100, roughly 10^72 operations, so we clearly need a better algorithm.

         Consider the simple example in the figure above, which shows two hidden sequences, abb and cbb. To compute P(O) by enumeration we would evaluate:

Pia * P(O1|a)*Pab*P(O2|b)*Pbb*P(O3|b) + Pic * P(O1|c)*Pcb*P(O2|b)*Pbb*P(O3|b)

where Pi denotes the initial probabilities. With the forward algorithm we instead compute:

(Pia* P(O1|a) *Pab + Pic * P(O1|c)*Pcb) *P(O2|b)*Pbb* P(O3|b)

The expression inside the parentheses in the forward-algorithm version is the partial probability a2(b). Enumeration used 10 multiplications and 1 addition, while the forward algorithm brings the multiplications down to 7, simply because it factors out the common terms using the familiar distributive law. That may not look like much of a gain, but when the computation is large the saving becomes very significant. In fact, the forward algorithm reduces the work to the order of N^2*T operations, so for N=5 and T=100 only about 3,000 operations are needed.
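The same factoring, written as code (an illustrative sketch): alpha holds the partial probabilities, each time step costs about N*N multiplications, and so the whole sequence costs on the order of N^2*T instead of 2T*N^T.

public class Forward {
    // Returns P(O) for one observation sequence, reusing partial probabilities.
    public static double evaluate(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length, T = obs.length;
        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) alpha[i] = pi[i] * b[i][obs[0]];    // alpha1(i)
        for (int t = 1; t < T; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) sum += alpha[i] * a[i][j];  // the factored common part
                next[j] = sum * b[j][obs[t]];
            }
            alpha = next;
        }
        double p = 0.0;
        for (int i = 0; i < n; i++) p += alpha[i];                      // P(O)
        return p;
    }
}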

         Now consider the Viterbi algorithm on the same example. With enumeration we would have to compute:

Pia * P(O1|a)*Pab*P(O2|b)*Pbb*P(O3|b)  and  Pic * P(O1|c)*Pcb*P(O2|b)*Pbb* P(O3|b)

and compare them to find the larger one. With the Viterbi algorithm, we first compare:

Pia * P(O1|a) * Pab with Pic * P(O1|c) * Pcb, pick the larger of the two, and only then multiply it by P(O2|b) * Pbb * P(O3|b).

That cuts the work from 10 multiplications and one comparison down to 7 multiplications and one comparison.

         To sum up, the forward algorithm and the Viterbi algorithm show two different ways of cutting computational cost. One is factoring out common terms to reduce the number of operations: computing (1+2+3)*5 instead of 1*5+2*5+3*5 is, mathematically, just a different way of writing the same thing, but for whoever does the computing (human or machine) fewer operations mean less time. The other is dropping computations that cannot matter: to find the most probable hidden sequence, if a path cannot even win on a prefix, is there any point in evaluating it over the full sequence?

Itenyh Edition - Chinese Word Segmentation with HMM, Part 2: Preparing the Model

Essentially, segmentation can be seen as classifying every character in the text. Say we define two classes: E for a character that ends a word and B for one that does not. Then the segmentation 你/现在/应该/去/幼儿园/了 can be written as 你E现B在E应B该E去E幼B儿B园E了E; once the classification is done we only need to "read off" the labels to recover the segmentation.

So how do we find this tag sequence EBEBEEBBEE? We could compute the probability of every possible tag sequence and pick the most probable one; mathematically:

                                                 argmax_C P(C1,C2,C3,…,Ci)

where C denotes a class label; here C ∈ {E, B}. Expanding further:

P(C1,C2,C3,…,Ci) = P(C1)P(C2,C3,…,Ci|C1) = P(C1)P(C2|C1)P(C3,…,Ci|C1,C2) = … = P(C1)P(C2|C1)P(C3|C2,C1)P(C4|C3,C2,C1)…P(Ci|Ci-1,…,C1)

Here we assume that the i-th tag depends only on the (i-1)-th tag, which gives us a Markov chain over the tags:

                                 P(C1,C2,C3….Ci) =P(C1)P(C2|C1)P(C3|C2)…P(Ci|Ci-1)

We call P(Cj|Ci) the state transition probability from i to j, and it can be estimated by counting over a large corpus. Suppose we used this tag-only model to find the sequence EBEBEEBBEE for the example above. It looks like a reasonable solution, but the problem is easy to spot: for any text of 10 characters we would get exactly the same sequence, and therefore the same segmentation, which would be terrible! The trouble is that we have only looked at the tags and ignored what is actually being segmented, namely the individual characters. We call these characters the observations, denoted O. Bringing them in, what we really want is:

                                argmax_C P(C1,C2,C3,…,Ci | O1,O2,O3,…,Oi), abbreviated argmax_C P(C|O)    (1)

         that is, the most probable sequence C given O, which is exactly the HMM "decoding" problem. By Bayes' rule:

                                               P(C|O) = P(O|C)P(C) / P(O)

        Since P(O) is the same for every candidate sequence C, it can be dropped, so (1) is equivalent to:

                                                  argmax_C P(O|C)P(C)    (2)

We have already seen how to obtain P(C). For P(O|C), the HMM makes two assumptions: (a) given the sequence of classes, the observations are statistically independent; (b) the probability density of one class does not depend on the other classes. In other words, dependence exists only in the sequence that generates the classes; within a class, the observations follow that class's own distribution. So for:

                              P(O|C) = P(O1, O2, O3……Oi | C1,C2,C3….Ci )

by assumption (a) the observations are statistically independent, so:

                   P(O1, O2, O3……Oi | C1,C2,C3….Ci )=P(O1|C) * P(O2|C)……P(Oi|C)

and by assumption (b) the class Ci of observation Oi has no dependence on any other class Cj, so:

                      P(O1|C) * P(O2|C)……P(Oi|C) = P(O1|C1) * P(O2|C2)……P(Oi|Ci)

P(Oi|Ci), likewise, can be estimated by counting over a large corpus. We have now attached observations to a Markov chain and built an HMM, and this model already has some ability to segment!
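Written out directly, scoring one candidate tag sequence under model (2) is just the product below (an illustrative sketch; in practice the Viterbi algorithm finds the argmax without enumerating every sequence). Here pi, a and b are the initial, transition and emission probabilities, and tags/obs are the encoded tag and character sequences.

public class SequenceScore {
    // P(O|C)P(C) = P(C1)P(O1|C1) * P(C2|C1)P(O2|C2) * ... * P(Ci|Ci-1)P(Oi|Ci)
    public static double score(double[] pi, double[][] a, double[][] b, int[] tags, int[] obs) {
        double p = pi[tags[0]] * b[tags[0]][obs[0]];
        for (int t = 1; t < tags.length; t++)
            p *= a[tags[t - 1]][tags[t]] * b[tags[t]][obs[t]];
        return p;
    }
}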

Although it is not used here, let us briefly recall the first HMM problem, "evaluation": computing the probability of a given observation sequence, P(O). Expanding with the law of total probability, P(O) = Σ_C P(O|C)P(C), where the sum runs over all possible hidden sequences C. This also requires P(O|C), so computationally it has much in common with "decoding".

PS: Regarding model (1), the HMM Tutorial paper writes it as argmax_q P(q, O | λ).

There O is the observation sequence, q denotes the class (state) sequence, and λ is the HMM model, so the tutorial maximizes the joint probability of q and O, whereas (1) uses the conditional probability with O as the condition. The joint formulation is equivalent to model (2), so the sequence it finds is the same as the one found by model (1). Personally I feel that, since O is given, conditioning on O is perhaps the more natural way to think about it.

Itenyh Edition - Chinese Word Segmentation with HMM, Part 1: Preface

A while ago I read 52nlp's introduction to hidden Markov models (HMMs) carefully. It explains things from the ground up and is genuinely excellent material; once again, thanks to 52nlp for the hard work. Here I would like to write up my own experience of using an HMM for word segmentation, with two goals: to summarize and organize my earlier work, and to exchange ideas with everyone.

If you are not familiar with HMMs, I recommend first reading 52nlp's series 《HMM学习最佳范例》 (the best examples for learning HMMs). This series assumes the reader already knows something about HMMs and will use the relevant concepts directly. Understanding the forward algorithm and the Viterbi algorithm is the key; the Baum-Welch algorithm for unsupervised HMM training is not used in these posts, so it is enough to know what it is for.

As everyone knows, Chinese has no delimiters between words (in English, words are separated by spaces, a natural segmentation mark), and Chinese words lack clear morphological markers, so a problem specific to Chinese information processing is how to split a string of Chinese characters into a reasonable sequence of words. For example, in the English sentence "you should go to kindergarten now" the spaces have already separated the words, and at most we drop the preposition "to"; the Chinese sentence 你现在应该去幼儿园了, which expresses the same meaning, has no such delimiters, and the goal of Chinese word segmentation is to obtain 你/现在/应该/去/幼儿园/了. So how do we segment? There are three mainstream approaches. The first is rule-based methods built on linguistic knowledge, such as the various forms of maximum matching and minimum-segmentation methods. The second is machine learning methods based on large-scale corpora, currently the most widely used and best-performing solution, using statistical models such as N-gram language models, the noisy-channel model, expectation maximization and HMMs. The third, which is what practical segmentation systems actually use, combines rules and statistics.

In this series I plan to do the following: first, add some material on the HMM itself on top of 52nlp's work; then build a pure HMM segmenter; then analyze its weaknesses; and finally build a hybrid segmenter. If anything here falls short, criticism is very welcome!