NLP is not magic, but the results you can get sometimes seem almost magical.
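【Liwei's note】As a veteran, I am often asked to give industry overview talks (NLP Overview), sometimes as part of new-hire training (orientation), and sometimes at a friend's invitation elsewhere, as an informal cross-industry exchange. NLP is what I have worked on and pondered all my life; in the words of Marshal Lin Biao, it has been “melted into the blood and carried out in action.” So I have always treated these talks as casual conversation, touching on topics in passing. It is shop talk without the wine: as the mood takes me, I sometimes weave in anecdotes, and once said they are gone, vanished like smoke. One talk this year, however, was carefully caught on tape by a thoughtful listener and written up as a document. This talk was on the dry side (an outside talk last year was far livelier, with a warm atmosphere and constant laughter), but I take what comes, and I share it below in the hope that it is of some use to newcomers. With the sensitive content removed, this English “popular science” piece is roughly equivalent to a white paper for the system whose R&D I lead. As a preview, I am now writing a companion piece, 【Liwei's Popular Science: NLP Relationship Map】, which aims to give a layered, bird's-eye introduction to NLP and its related fields. Stay tuned.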
Overview of Natural Language Processing (NLP)
【This document provides a text version of Dr. Wei Li’s overview of NLP, presented on August 8, 2012.】
At a high level, our NLP core engine reads sentences and extracts insights to support our products. The link between the products and the core engine is the storage system. Today’s topic is the workings of the NLP core engine.
Our NLP core engine is a two-component system.
The first component is a parser, with the dependency tree structure as output, representing the system’s understanding of each sentence. This component outputs a system-internal, linguistic representation, much like diagramming taught in grammar school. This part of the system takes a sentence and “draws a tree of it.” The system parses language in a number of passes (modules), starting from a shallow level and moving on to a deep level.
The second component is an extractor, which sits on top of the parser and outputs a table (or frame) that directly meets the needs of products. This is where the extraction rules, based on sub-tree matching, do their work, including our sentiment extraction component for social media customer insights.
Dependency Tree Structure and Frames
An insight extractor of our system is defined by frames. A frame is a table or template that defines the name of each column (often called event roles) for the target information (or insights). The purpose of the extraction component is to fill in the blanks of the frame and use such extracted information to support a product.
Each product is supported by different insight types, which are defined in frames. To build a frame, Product Management determines what customers need and what output they want from processing sentences, and uses that information to formulate frame definitions. The NLP team takes the product-side requirements, does a feasibility study, and starts system development based on a development corpus. This development covers rules (in a formalism equivalent to an extended version of a cascaded finite-state mechanism), lexicons, and procedures (including machine learning for classification and clustering). The frames for objective events define things like who did what, when and where, with a specific domain or use scenario in mind. The frames for sentiments or subjective evaluations first contain information to determine whether a comment is positive, negative or neutral (a process called sentiment classification). They also define additional, more detailed columns: who made the comment, on what, to what degree (passion intensity), in which aspects (details), and why. They further distinguish an objective insight (for example, “cost-effective” or “expensive”) from a subjective one (for example, “terrific”, “ugly” or “awful”).
This type of insight extraction builds on the first component, linguistic processing (parsing). More specifically, insight extraction is realized by sub-tree matching rules in extraction grammars. Consider this example:
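To make the idea of a frame concrete, here is a minimal Python sketch of what frame definitions might look like as data structures. The class and field names (EventFrame, SentimentFrame, holder, target and so on) are hypothetical, chosen only to mirror the columns described above; they are not the actual schema of the system.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class EventFrame:
    """Hypothetical frame for an objective event: who did what, when and where."""
    event_type: str
    who: Optional[str] = None
    what: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None


@dataclass
class SentimentFrame:
    """Hypothetical frame for a sentiment or subjective evaluation."""
    polarity: str                    # "positive", "negative" or "neutral"
    holder: Optional[str] = None     # who made the comment
    target: Optional[str] = None     # on what
    intensity: Optional[int] = None  # to what degree (passion intensity)
    aspect: Optional[str] = None     # in which aspect (detail)
    reason: Optional[str] = None     # why
    objective: bool = False          # True for "expensive", False for "awful"


def unfilled_roles(frame) -> list:
    """Return the names of the columns the extractor has not filled yet."""
    return [name for name, value in asdict(frame).items() if value is None]


frame = SentimentFrame(polarity="negative", target="iPhone 4s",
                       aspect="Flash support")
print(unfilled_roles(frame))  # ['holder', 'intensity', 'reason']
```

In this sketch, the job of the extraction component amounts to driving the list of unfilled roles toward empty, as far as the sentence allows.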
Apple launched iPhone 4s last month
The parser first decodes the linguistic tree structure, determining that the logical subject (actor) is “Apple,” the action is “launch,” the logical object (undergoer) is “iPhone 4s,” and “last month” is an adverbial. The system extracts these types of phrases to fill in the linguistic tree structure as follows.
Based on the above linguistic analysis, the second component extracts a product launch event as shown below:
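The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triple representation of the tree, the relation labels, the rule format, the extra trigger verbs and the frame name ProductLaunch are all hypothetical, not the system's actual formalism.

```python
# Hypothetical dependency tree for "Apple launched iPhone 4s last month",
# written as (head, relation, dependent) triples. The relation labels are
# illustrative, not the system's actual label set.
TREE = [
    ("launch", "subject", "Apple"),
    ("launch", "object", "iPhone 4s"),
    ("launch", "adverbial", "last month"),
]

# A hypothetical extraction rule: a sub-tree pattern (a trigger verb with
# certain dependents) plus a mapping from relations to frame roles.
LAUNCH_RULE = {
    "frame": "ProductLaunch",
    "heads": {"launch", "release", "unveil"},
    "roles": {"subject": "company", "object": "product", "adverbial": "time"},
    "required": {"company", "product"},
}


def extract(tree, rule):
    """Fill the rule's frame if the tree contains the required sub-tree."""
    filled = {}
    for head, relation, dependent in tree:
        if head in rule["heads"] and relation in rule["roles"]:
            filled[rule["roles"][relation]] = dependent
    if not rule["required"].issubset(filled):
        return None  # a required role is missing, so the rule does not fire
    return {"frame": rule["frame"], **filled}


print(extract(TREE, LAUNCH_RULE))
# {'frame': 'ProductLaunch', 'company': 'Apple', 'product': 'iPhone 4s', 'time': 'last month'}
```

Because the rule matches on the logical subject and object rather than on word order, the same rule would also fire on a passive variant of the sentence, provided the parser decodes it into the same tree.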
How Systems Answer Questions
We can also look at our system from the perspective of how it addresses users’ information needs, in particular, how it answers the questions users have in mind. There are two major types of systems for satisfying users’ information needs.
Traditional systems, like search engines. A user enters a query into a search engine and gets documents or URLs related to query keywords. This system satisfies some needs, but there is too much information and what you want to know might be buried deep in the data.
NLP-based systems, which can answer users’ questions. All our products can be regarded as special types of “question-answering systems.” The system reads everything, sentence by sentence. If it has a target hit, it can pull out answers from the index to the specified types of questions.
Technology for answering factoid questions, such as when (time), where (location), and who (person), is fairly mature. The when-question, for example, is easy to answer because time is almost always expressed in standard formats. The most challenging questions to answer are “how” and “why.” There is consensus in the question answering community on this. To answer “how” questions, you might need a recipe, a procedure, or a long list of drug names. To answer “why,” the system needs to find the motivation behind a sentiment or the motive behind a behavior.
Our products are high-end systems that are actually designed to answer “how” and “why” questions in addition to sentiments. For example, if you enter “heart attack” into our system, you get a full solution package organized into sections that include a list of procedures, a list of drugs, a list of operations, the names of doctors and professionals, etc. Our consumer insight product classifies sentiments, otherwise known as “thumbs-up” and “thumbs-down” classification, just like what our competitors do. But we go much finer-grained and much deeper, while still scaling up. Not only can it tell you what percentage, what ratio, and how intensively people like or dislike a product, it also provides answers for why people like or dislike a product or a feature of a product. This is important: knowing how popular a brand is only gives a global view of customer sentiments, and such coarse-grained sentiments by themselves are not insightful. The actionable insights in the sentiment world need to answer why-questions. Why do customers like or dislike a product feature? Systems that can answer such questions provide invaluable actionable insights to businesses. For example, it is much more insightful to know that consumers love the online speed of iPhone 4s but are very annoyed by the lack of support for Flash. This is an actionable insight, one that a company could use to redirect resources to address issues or drive a product’s development. Extraction of such insights is enabled by our deep NLP, which gives us a competitive advantage over the traditional classification and clustering algorithms practiced by almost all the competitors who claim to do sentiment analysis.
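The difference between a coarse-grained view and an actionable one can be illustrated with a small Python sketch. The records below are toy data invented purely for illustration, and the record layout (polarity plus reason) is a hypothetical simplification of the extracted frames.

```python
from collections import Counter

# Toy records, invented for illustration only; each stands for one
# extracted sentiment about a product.
RECORDS = [
    {"polarity": "positive", "reason": "online speed"},
    {"polarity": "positive", "reason": "online speed"},
    {"polarity": "negative", "reason": "no Flash support"},
    {"polarity": "positive", "reason": None},  # liked, but no reason given
]


def summarize(records):
    """Return the coarse-grained polarity shares and the reasons behind them."""
    polarity_counts = Counter(r["polarity"] for r in records)
    total = sum(polarity_counts.values())
    shares = {p: n / total for p, n in polarity_counts.items()}

    reasons = {}
    for r in records:
        if r["reason"] is not None:
            reasons.setdefault(r["polarity"], Counter())[r["reason"]] += 1
    return shares, reasons


shares, reasons = summarize(RECORDS)
print(shares)                              # {'positive': 0.75, 'negative': 0.25}
print(reasons["negative"].most_common(1))  # [('no Flash support', 1)]
```

The shares alone correspond to thumbs-up and thumbs-down classification; the per-polarity reason counts are what answer the why-question.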
Q: How do you handle sarcasm?
A: Sarcasm is tough. It is a challenge to all the systems, us included. We have made some tangible progress and implemented some patterns of sarcasm in our system. But overall, it is a really difficult phenomenon of natural language. So far in the community, there is only limited research in the lab, far from being practical. People might say “no” when they mean “yes,” using a “zig-zag” way to express their emotions. It’s difficult enough for humans to understand these things and much more difficult for a machine.
The good news is that sarcasm is not that common overall, assuming that we are considering a large amount of real-life data. There are benchmarks in the literature on what percentage of real-life language corpora is sarcastic. Fortunately, only a small fraction of the data might be related to sarcasm, so it often makes no statistical impact on data quality, whether or not it is captured.
Not all types of sarcasm are intractable. Our products can capture common patterns of sarcasm fairly well. Our first target is sarcasm with fairly clear linguistic patterns, such as when people combine “thank you” (a positive emotion) with a negative behavior: “Thank you for hurting my feelings.” Our system recognizes and captures this contradictory pattern as sarcasm. “Thank you,” in this context, would not be presented as a positive insight.
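A toy version of this “thank you plus negative behavior” pattern can be written as a regular expression in Python. This is a deliberately simplified sketch: the mini-lexicon of negative behaviors is hypothetical, and a surface regular expression merely stands in for rules that, in the system described here, operate on parse trees.

```python
import re

# Hypothetical mini-lexicon of negative behaviors, for illustration only.
NEGATIVE_BEHAVIORS = ("hurting", "ruining", "ignoring", "wasting")

# "thank you" / "thanks" + "for" + (optional word) + a negative behavior.
SARCASM_PATTERN = re.compile(
    r"\bthank(?:s| you)\s+for\s+(?:\w+\s+)?(?:"
    + "|".join(NEGATIVE_BEHAVIORS)
    + r")\b",
    re.IGNORECASE,
)
THANKS_PATTERN = re.compile(r"\bthank(?:s| you)\b", re.IGNORECASE)


def classify(sentence):
    """Label a sentence as 'sarcasm', 'positive' or 'unknown'."""
    if SARCASM_PATTERN.search(sentence):
        return "sarcasm"  # the contradiction overrides the positive cue
    if THANKS_PATTERN.search(sentence):
        return "positive"
    return "unknown"


print(classify("Thank you for hurting my feelings."))  # sarcasm
print(classify("Thank you for the quick reply."))      # positive
```

The key design point is the ordering: the contradictory pattern is checked before the plain positive cue, so “thank you” is never reported as a positive insight when the sarcasm rule fires.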
Q: Do you take things only in context (within a sentence, phrase, or word) or consider a larger context?
A: Do we do anything beyond the sentence boundary to make our insights more coherent to users? Yes, to some extent, and more work is in progress. The index contains all local insights, broken down into “local” pieces. If we don’t put data into the index piece by piece, users can’t “drill down.” Drill-down is a necessary feature in products so the users can verify the insight sources (where exactly the insight is extracted from) and may choose to dive into a particular source.
After our application retrieves data from the index, and before it displays the results, it performs a “massaging” phase. This massaging phase introduces context beyond sentence and document boundaries. For example, “acronym association” identifies all of the numerous names used to refer to an entity (such as “IBM” versus “International Business Machines Corp”). This context-based acronym association capability is used as an anchoring point for merging the related insights. We have also developed co-reference capability to associate, for example, the pronoun “it” with the entity (such as iPhone) it refers to.
This phase also includes merging of phrases from local insights. For example, “cost-ineffective” is a synonym of “expensive.” The app merges these local insights before presenting them to the users.
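The merging step can be sketched in Python as a normalization pass over local insights. The lookup tables, the tuple layout and the source identifiers below are hypothetical; in particular, a static alias table is only a stand-in for the context-based acronym association described above.

```python
from collections import defaultdict

# Hypothetical lookup tables, for illustration only.
ENTITY_ALIASES = {
    "ibm": "IBM",
    "international business machines corp": "IBM",
}
PHRASE_SYNONYMS = {
    "cost-ineffective": "expensive",
}


def merge_insights(local_insights):
    """Group (entity, phrase, source) insights under canonical keys.

    The sources are kept with each merged insight so that users can
    still drill down to where an insight was extracted from.
    """
    merged = defaultdict(list)
    for entity, phrase, source in local_insights:
        canonical_entity = ENTITY_ALIASES.get(entity.lower(), entity)
        canonical_phrase = PHRASE_SYNONYMS.get(phrase.lower(), phrase.lower())
        merged[(canonical_entity, canonical_phrase)].append(source)
    return dict(merged)


local = [
    ("IBM", "expensive", "doc1#s3"),
    ("International Business Machines Corp", "cost-ineffective", "doc2#s1"),
]
print(merge_insights(local))
# {('IBM', 'expensive'): ['doc1#s3', 'doc2#s1']}
```

Keeping the index local and doing the merge at display time, as in this sketch, preserves both views: one consolidated insight for the user and a list of individual sources behind it.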
Concluding Remarks on Language Technology and its Applications
NLP was confined to labs for decades, from the beginning of machine translation research in the early 1950s up until the last decade. Until only a few years ago, NLP in applications had experienced only limited success. While it is moving very fast, NLP has not yet reached its prime time in the industry.
However, this technology is maturing and starting to show clear signs of serving as an enabling technology that can revolutionize how humans access information. We are already beyond the proof-of-concept stage, the point of having to prove its value. It just works, and we want to make it work better and more effectively. In the IT sector, more and more applications using NLP are expected to go live, ranging from social media and big data processing to intelligent assistants (e.g., Siri-like features) on mobile platforms. We are in an exciting race towards making language technology work in large-scale, real-life systems.
【Liwei's Popular Science: NLP Relationship Map】 (coming soon)