# The Normal Distribution: Its Past and Present (8)

(8) The greatest truth is the simplest; the greatest beauty is nature's own making

To see a world in a grain of sand
And a heaven in a wild flower,
Hold infinity in the palm of your hand
And eternity in an hour.

-- William Blake, Auguries of Innocence

$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$

(9) Recommended Reading

-- C.R. Rao

• 陈希孺, 数理统计学简史 (A Brief History of Mathematical Statistics)
• 蔡聰明, 誤差論與最小平方法 (Error Theory and the Method of Least Squares), 数学传播 (Mathmedia)
• 吴江霞, 正态分布进入统计学的历史演化 (The Historical Evolution of the Normal Distribution's Entry into Statistics)
• E.T. Jaynes, Probability Theory: The Logic of Science (Chinese translation: 概率论沉思录)
• Saul Stahl, The Evolution of the Normal Distribution
• Kiseon Kim, Why Gaussianity?
• Stephen M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900
• L. Le Cam, The Central Limit Theorem Around 1935
• Hans Fischer, A History of the Central Limit Theorem: From Classical to Modern Probability Theory

# The Normal Distribution: Its Past and Present (7)

(7) The Phantom of the Normal Distribution

Everyone believes in it: experimentalists believing that it is a
mathematical theorem, mathematicians believing that it is an empirical fact.
---- Henri Poincaré

$\displaystyle f(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
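A quick numerical sanity check on this density is easy with the Python standard library; the sketch below transcribes the formula directly and compares it against `statistics.NormalDist` (the choice of parameters and test points is arbitrary):

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Direct transcription of the density f(x) above."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# The hand-written formula agrees with the library implementation.
for x in (-1.0, 0.0, 2.5):
    assert abs(normal_pdf(x, 1.0, 2.0) - NormalDist(1.0, 2.0).pdf(x)) < 1e-12

print(normal_pdf(0.0, 0.0, 1.0))  # peak of the standard normal density
```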

In Probability Theory: The Logic of Science, E.T. Jaynes raises two questions:

1. Why is the normal distribution so widely used?
2. Why is the normal distribution so successful in practice?

E.T. Jaynes points out that the normal distribution's successful and widespread use in practice owes mainly to its many kinds of mathematical stability. These properties include:

• The product of two normal density functions is, up to normalization, again a normal density.
• The convolution of two normal densities is again a normal density; that is, the sum of two independent normal random variables is again normal.
• The Fourier transform of a normal density is again a normal density (in shape).
• The central limit theorem guarantees that the cumulative effect of summing many independent random variables leads to the normal distribution.
• Among all probability distributions with the same mean and variance, the normal distribution has the maximum entropy.

• The binomial distribution $B(n,p)$ approaches the normal $N(np, np(1-p))$ when $n$ is large.
• The Poisson distribution $Poisson(\lambda)$ approaches the normal $N(\lambda,\lambda)$ when $\lambda$ is large.
• The $\chi^2_{(n)}$ distribution approaches the normal $N(n,2n)$ when $n$ is large.
• The $t$ distribution approaches the standard normal $N(0,1)$ when $n$ is large.
• The conjugate prior for the mean of a normal distribution (with known variance) is again normal.
• Almost all maximum likelihood estimators are asymptotically normal as the sample size $n$ grows.
• Cramér's decomposition theorem (introduced earlier): if $X, Y$ are independent random variables and $S = X + Y$ is normally distributed, then $X$ and $Y$ are both normally distributed.
• If $X, Y$ are independent and both distributed as $N(\mu, \sigma^2)$, then $X+Y$ and $X-Y$ are independent (Bernstein's theorem), and the normal is the only distribution with this property.
• If $X, Y$ are jointly normal, then uncorrelatedness implies independence, and the normal is the only distribution with this property.
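Several of these limiting properties are easy to verify numerically. The following standard-library Python sketch compares the exact binomial CDF of $B(1000, 0.3)$ with its normal approximation $N(np, np(1-p))$; the particular $n$, $p$, and evaluation points are arbitrary choices for illustration:

```python
import math
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact CDF of Binomial(n, p): P(X <= k), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p = 1000, 0.3
approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))

# With a continuity correction, the two CDFs agree to a few decimal places.
for k in (280, 300, 320):
    exact = binom_cdf(k, n, p)
    estimate = approx.cdf(k + 0.5)
    print(f"k={k}: exact={exact:.4f}, normal approximation={estimate:.4f}")
```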

# The Normal Distribution: Its Past and Present (6)

(6) Expanding the Territory: Further Developments of the Normal Distribution

2. Marching into Modern Statistics

The name Quetelet is perhaps not as resounding as those of other mathematicians, and many people are probably unfamiliar with him, so an introduction is in order. Quetelet was a Belgian, held a doctorate in mathematics, and in his youth studied probability theory under Laplace. A man of broad learning and wide-ranging interests, he wore many crowns: statistician, mathematician, astronomer, sociologist, father of the international statistical congress, father of modern statistics, and founder of the school of mathematical statistics. Quetelet's greatest contribution was to bring France's classical probability theory into statistics and to study social phenomena with purely mathematical methods.

In 1831, Quetelet took part in leading the establishment of Belgium's new central statistical bureau and began statistical research on questions of population. In the course of this research he discovered that social phenomena, which people had regarded as chaotic and governed by chance, in fact exhibit regularities, just as natural phenomena do. Quetelet collected a large amount of anthropometric data, such as weight, height, and chest circumference, and analyzed it with probabilistic and statistical methods. At the time, however, such statistical analysis was challenged by sociologists, whose main objection was this: social problems differ from scientific experiments in that their data generally come from observation, cannot be controlled, and often involve heterogeneous factors that are not well understood, so the homogeneity of the data, and with it the validity of the analysis, becomes questionable. Social statisticians thus faced the problem of how to judge the homogeneity of their data. Quetelet boldly proposed a way out:

Quetelet proposed a method of fitting data with the normal curve, and he went on to fit all kinds of data with the normal distribution. In doing so, Quetelet opened a vast stage for the application of the normal distribution. The normal distribution became a dragon-slaying sword: under Quetelet's lead, scholars wielded it to cut a path through one field after another, conquering social domains such as population, territory, politics, agriculture, industry, commerce, and morals, and then pressing on into astronomy, mathematics, physics, biology, social statistics, meteorology, and other natural sciences.

3. The Three Musketeers of Mathematical Statistics

With $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ independent standard normal random variables, the three classical sampling distributions are:

• $\displaystyle \chi_n^2 = X_1^2 + \cdots + X_n^2$
• $\displaystyle t = \frac{Y_1}{\sqrt{\frac{X_1^2 + \cdots + X_n^2}{n}}}$
• $\displaystyle F = \frac{\frac{X_1^2 + \cdots + X_n^2}{n}}{\frac{Y_1^2 + \cdots + Y_m^2}{m}}$
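These three statistics can be sampled directly from their definitions. The Python sketch below (standard library only; the seed, sample sizes, and degrees of freedom are arbitrary choices) draws each statistic from standard normals and checks the sample means against the textbook values $E[\chi^2_n] = n$, $E[t] = 0$, and $E[F_{n,m}] = m/(m-2)$:

```python
import math
import random

random.seed(42)

def chi2_sample(n):
    """One draw of chi-square with n degrees of freedom: sum of n squared N(0,1)'s."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(n))

def t_sample(n):
    """One draw of Student's t with n degrees of freedom."""
    return random.gauss(0, 1) / math.sqrt(chi2_sample(n) / n)

def f_sample(n, m):
    """One draw of F with (n, m) degrees of freedom."""
    return (chi2_sample(n) / n) / (chi2_sample(m) / m)

N = 20000
chi2_mean = sum(chi2_sample(10) for _ in range(N)) / N  # theory: 10
t_mean = sum(t_sample(10) for _ in range(N)) / N        # theory: 0
f_mean = sum(f_sample(5, 10) for _ in range(N)) / N     # theory: 10/8 = 1.25
print(round(chi2_mean, 2), round(t_mean, 3), round(f_mean, 3))
```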

In the early twentieth century, these three musketeers of statistics became the founders of modern mathematical statistics. With Gosset as the pioneer and Fisher as the general, the small-sample revolution was launched, and it effectively raised the status of the normal distribution within statistics. In mathematical statistics, the small-sample theory built on the normal distribution won an unprecedented victory, while no comparable success was achieved on the basis of any other distribution; this alone compels one to look at the normal distribution with new respect. In the developments that followed, correlation and regression analysis, multivariate analysis, analysis of variance, factor analysis, Brownian motion, Gaussian processes, and many other statistical methods took the historical stage in turn, and these methods, all closely tied to the normal distribution, became a powerful engine driving the rapid development of modern statistics.

# 【立委科普：NLP White Paper】

Quote:
NLP is not magic, but the results you can get sometimes seem almost magical.

【立委按】 As an old hand, I am often asked to give industry overview talks (NLP overviews), sometimes as part of new-hire orientation, sometimes at a friend's invitation elsewhere, in the form of cross-industry exchange. NLP is something I have done, and pondered, for a whole career; in Marshal Lin Biao's words, it has "dissolved into my blood and is carried out in my actions." So I have always treated these talks as casual chat, rambling wherever interest leads, sometimes throwing in an anecdote or two, said and then gone like smoke. At one talk this year, however, an attentive listener recorded it all (caught on tape) and turned it into a document. This particular talk was on the dry side (an off-site discussion last year was far livelier, with a warm atmosphere and constant laughter), but you take what you get, and I share it below in the hope that it may be of some use to newcomers. With sensitive content removed, this English "popularization" amounts to a white paper for the system whose development I lead. As a preview: I am also writing the companion piece 【立委科普：NLP 联络图】, which aims to give a layered, bird's-eye introduction to NLP and its related fields; stay tuned.

Overview of Natural Language Processing (NLP)

【This document provides a text version of Dr. Wei Li's overview of NLP, presented on August 8, 2012.】

At a high level, our NLP core engine reads sentences and extracts insights to support our products. The link between the products and the core engine is the storage system. Today’s topic is on the workings of the NLP core engine.

System Overview

Our NLP core engine is a two-component system.

The first component is a parser, with the dependency tree structure as output, representing the system’s understanding of each sentence. This component outputs a system-internal, linguistic representation, much like diagramming taught in grammar school. This part of the system takes a sentence and “draws a tree of it.” The system parses language in a number of passes (modules), starting from a shallow level and moving on to a deep level.

The second component is an extractor, which sits on top of the parser and outputs a table (or frame) that directly meets the needs of products. This is where extraction rules, based on sub-tree matching, do their work, including our sentiment extraction component for social media customer insights.

Dependency Tree Structure and Frames

An insight extractor of our system is defined by frames. A frame is a table or template that defines the name of each column (often called event roles) for the target information (or insights). The purpose of the extraction component is to fill in the blanks of the frame and use such extracted information to support a product.

Each product is supported by different insight types, which are defined in frames. To build a frame, Product Management determines what customers need and what output they want from processing sentences, and uses that information to formulate the frame definitions. The NLP team takes the product-side requirements, does a feasibility study, and starts system development, including rules (in a formalism equivalent to an extended version of a cascaded finite state mechanism), lexicons, and procedures (including machine learning for classification/clustering), based on a development corpus, to move the project forward. The frames for objective events define things like who did what, when, and where, with a specific domain or use scenario in mind. The frames for sentiments, or subjective evaluations, contain information used first to determine whether a comment is positive or negative (or neutral), in a process called sentiment classification. They also define additional, more detailed columns: who made the comment, on what, to what degree (passion intensity), in which aspects (details), and why. The system distinguishes objective insights (for example, “cost-effective” or “expensive”) from subjective insights (for example, “terrific,” “ugly,” or “awful”).
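To make the idea concrete, here is a minimal Python sketch of what a sentiment frame might look like as a data structure. The field names simply mirror the columns mentioned above; they are illustrative and are not the actual schema of the system:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentimentFrame:
    """One extracted sentiment insight; fields mirror the columns described above."""
    polarity: str                     # "positive" / "negative" / "neutral"
    holder: Optional[str] = None      # who made the comment
    target: Optional[str] = None      # what the comment is about
    intensity: Optional[str] = None   # passion intensity (degree)
    aspect: Optional[str] = None      # which aspect (details)
    reason: Optional[str] = None      # why

frame = SentimentFrame(polarity="negative", target="iPhone 4s",
                       aspect="flash support", reason="lack of flash")
print(frame.polarity, frame.aspect)
```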

Insight extraction builds on the first component, linguistic parsing. More specifically, it is realized by sub-tree matching rules in the extraction grammars. Consider this example:

Apple launched iPhone 4s last month

The parser first decodes the linguistic tree structure, determining that the logical subject (actor) is “Apple,” the action is “launch,” the logical object (undergoer) is “iPhone 4s,” and “last month” is an adverbial. The system extracts these types of phrases to fill in the linguistic tree structure as follows.

Based on the above linguistic analysis, the second component extracts a product launch event as shown below:
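The two-step flow (parse, then fill a frame by matching the tree) can be sketched in a few lines of Python. Everything here is a toy illustration: the role names, the trigger-word list, and the dictionary representation of the tree are invented for exposition and are not the system's actual formalism:

```python
# A parsed sentence represented as a flat dependency record (toy form).
parse = {
    "verb": "launch",
    "subject": "Apple",         # logical subject (actor)
    "object": "iPhone 4s",      # logical object (undergoer)
    "adverbial": "last month",  # time adverbial
}

LAUNCH_TRIGGERS = {"launch", "release", "unveil"}  # hypothetical trigger words

def extract_product_launch(tree):
    """Sub-tree matching in miniature: if the verb is a launch trigger,
    map the tree's roles into the slots of a ProductLaunch frame."""
    if tree.get("verb") not in LAUNCH_TRIGGERS:
        return None
    return {
        "event": "ProductLaunch",
        "company": tree.get("subject"),
        "product": tree.get("object"),
        "time": tree.get("adverbial"),
    }

print(extract_product_launch(parse))
```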

How Systems Answer Questions

We can also look at our system from the perspective of how it addresses users’ information needs, in particular, how it answers the questions in our minds. There are two major kinds of systems for satisfying users’ information needs.

Traditional systems, like search engines. A user enters a query into a search engine and gets documents or URLs related to query keywords. This system satisfies some needs, but there is too much information and what you want to know might be buried deep in the data.

NLP-based systems, which can answer users’ questions. All our products can be regarded as special types of “question-answering systems.” The system reads everything, sentence by sentence. If it has a target hit, it can pull out answers from the index to the specified types of questions.

Technology for answering factoid questions, such as when (time), where (location), or who (person), is fairly mature. The when-question, for example, is easy to answer because time is almost always expressed in standard formats. The most challenging questions to answer are “how” and “why”; there is consensus on this in the question-answering community. To answer “how” questions, you might need a recipe, a procedure, or a long list of drug names. To answer “why,” the system needs to find the motivation behind a sentiment or the motive behind a behavior.

Our products are high-end systems designed to answer “how” and “why” questions in addition to sentiments. For example, if you enter “heart attack” into our system, you get a full solution package organized into sections, including a list of procedures, a list of drugs, a list of operations, the names of doctors and professionals, etc. Our consumer insight product classifies sentiments, otherwise known as “thumbs-up” and “thumbs-down” classification, just as our competitors do. But we go much more fine-grained and much deeper, and still scale up. Not only can it tell you what percentage, what ratio, how intensely people like or dislike a product; it also provides answers for why people like or dislike a product or a product feature. This is important: knowing how popular a brand is only gives a global view of customer sentiments, and such coarse-grained sentiments by themselves are not insightful; actionable insights in the sentiment world need to answer why-questions. Why do customers like or dislike a product feature? Systems that can answer such questions provide invaluable actionable insights to businesses. For example, it is much more insightful to know that consumers love the online speed of the iPhone 4s but are very annoyed by its lack of Flash support. This is an actionable insight, one that a company could use to redirect resources to address issues or drive a product’s development. Extraction of such insights is enabled by our deep NLP, a competitive advantage over the traditional classification and clustering algorithms practiced by almost all the competitors who claim to do sentiment.

Q&A

Q: How do you handle sarcasm?

A: Sarcasm is tough. It is a challenge to all systems, ours included. We have made some tangible progress and implemented some patterns of sarcasm in our system. But overall, it is a really difficult phenomenon of natural language. So far in the community, there is only limited research in the lab, far from practical. People might say “no” when they mean “yes,” using a “zig-zag” way to express their emotions. It is difficult enough for humans to understand these things, and much more difficult for a machine.

The good news is that sarcasm is not that common overall, assuming we are considering a large amount of real-life data. There are benchmarks in the literature on what percentage of real-life language corpora is sarcastic. Fortunately, only a small fraction of the data is likely to involve sarcasm, often too little to make a statistical impact on data quality whether or not it is captured.

Not all types of sarcasm are intractable: our products can capture common patterns of sarcasm fairly well. Our first target is sarcasm with fairly clear linguistic patterns, such as when people combine “thank you” (a positive emotion) with a negative behavior: “Thank you for hurting my feelings.” Our system recognizes and captures this contradictory pattern as sarcasm. “Thank you,” in this context, would not be presented as a positive insight.
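The contradictory pattern described here ("thank you" plus a negative behavior) can be illustrated with a tiny sketch. The marker and word lists below are invented for illustration; the production rules are of course richer:

```python
POSITIVE_MARKERS = ("thank you", "thanks a lot")   # hypothetical positive openers
NEGATIVE_BEHAVIORS = {"hurting", "ruining", "breaking", "ignoring"}

def looks_sarcastic(sentence):
    """Flag the 'positive marker + negative behavior' contradictory pattern."""
    lowered = sentence.lower()
    opens_positive = any(lowered.startswith(m) for m in POSITIVE_MARKERS)
    has_negative = any(w in NEGATIVE_BEHAVIORS for w in lowered.split())
    return opens_positive and has_negative

print(looks_sarcastic("Thank you for hurting my feelings."))
print(looks_sarcastic("Thank you for the lovely gift."))
```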

Q: Do you take things only in context (within a sentence, phrase, or word) or consider a larger context?

A: Do we do anything beyond the sentence boundary to make our insights more coherent to users? Yes, to some extent, and more work is in progress. The index contains all local insights, broken down into “local” pieces. If we did not put data into the index piece by piece, users could not “drill down.” Drill-down is a necessary product feature: it lets users verify the sources of an insight (where exactly it was extracted from) and choose to dive into a particular source.

After our application retrieves data from the index, it performs a “massaging” phase between retrieving the data from storage and displaying it. This massaging phase introduces context beyond sentence and document boundaries. For example, “acronym association” identifies the numerous names used to refer to an entity (such as “IBM” versus “International Business Machine Corp”). This acronym association capability is used as an anchoring point for merging related insights. We have also developed a co-reference capability to associate, for example, the pronoun “it” with the entity (such as iPhone) it refers to.
This phase also includes merging of phrases from local insights. For example, “cost-ineffective” is a synonym of “expensive.” The app merges these local insights before presenting them to the users.
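The merging step can be sketched as a normalization pass before counting. The alias and synonym tables below are illustrative stand-ins for the acronym association and phrase-merging capabilities described above:

```python
from collections import Counter

ENTITY_ALIASES = {"International Business Machine Corp": "IBM"}   # acronym association
PHRASE_SYNONYMS = {"cost-ineffective": "expensive"}               # phrase merging

def canonical(entity, phrase):
    """Map variant surface forms onto canonical keys before merging."""
    return (ENTITY_ALIASES.get(entity, entity),
            PHRASE_SYNONYMS.get(phrase, phrase))

local_insights = [
    ("IBM", "expensive"),
    ("International Business Machine Corp", "cost-ineffective"),
]
merged = Counter(canonical(e, p) for e, p in local_insights)
print(merged)  # both local insights collapse onto the ("IBM", "expensive") key
```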

Concluding Remarks on Language Technology and its Applications

NLP was confined to labs for decades, from the beginning of machine translation research in the early 1950s until roughly a decade ago. Until only a few years ago, NLP had experienced only limited success in applications. While it is moving very fast, NLP has not yet reached its prime time in the industry.

However, this technology is maturing and starting to show clear signs of serving as an enabling technology that can revolutionize how humans access information. We are already beyond the point of having to prove its value, the proof-of-concept stage. It just works, and we want to make it work better and more effectively. In the IT sector, more and more applications using NLP are expected to go live, ranging from social media and big data processing to intelligent assistants (e.g., Siri-like features) on mobile platforms. We are in an exciting race toward making language technology work in large-scale, real-life systems.

【Related posts】
【立委科普：从产业角度说说NLP这个行当】 (On the NLP trade from an industry perspective)
【立委科普：NLP 联络图】 (coming soon)

# NLP Is Heavy Lifting: Once More, Idioms Are Not the Problem

NLP is heavy lifting; 100% agree.


# 【NLP Myth No. 4: Word Sense Disambiguation (WSD) Is the Bottleneck of NLP Applications】

(1) purchase: Microsoft bought Powerset for \$100 million
(2) believe: I am not going to buy his argument

==>
Company Acquisition event frame:

@MyGod9: What if machine translation is the goal?

WSD can be enshrined as a bodhisattva in the temple of NLP, for scholar-type researchers to burn incense to; developers of practical systems would do well to revere it from a respectful distance. :=)

# On the Great Leap Forward in Research Project Planning

“NLP Myth No. 4: Word Sense Disambiguation (WSD) Is the Bottleneck of NLP Applications”

DARPA: Defense Advanced Research Projects Agency

What, then, is a DARPA project?

"跨越式......"的本质就是跃进，可是有不少阶段或过程是跨不过去的。

"举个国内的例子呗？"--居心不良，不想让博主在中国混了？

# A Woman Fears Marrying the Wrong Man, a Man Fears Entering the Wrong Trade

【A woman fears marrying the wrong man, a man fears entering the wrong trade, and in choosing a specialty, one fears picking the wrong direction】

“NLP Myth No. 4: Word Sense Disambiguation (WSD) Is the Bottleneck of NLP Applications”

# The Two Keywords of the Information Industry in 2011: Social Media and Cloud Computing

Big data intelligence

# 立委's Statistics Find That Humans Are Almost Incurably Emotional Animals

Just some initial ballpark statistics from our experimentation with default rules across languages: although the subjective-quality default rule is only triggered by emotional words like good/love/happy (or bad/hate/annoyed), a smaller set than the objective-quality trigger words (cheap/expensive, high/low resolution, long/short battery life, etc.), the subjective default rule captures twice as many sentences as the objective default rule. This suggests, perhaps, that human beings are very emotional creatures: so emotional that they pass judgment twice as often as they provide simple objective evidence to justify their judgments.