分类目录归档:自然语言处理

Coursera上Python课程(公开课)汇总推荐:从Python入门到应用Python

Python是深度学习时代的语言,Coursera上有很多Python课程,从Python入门到精通,从Python基础语法到应用Python,满足各个层次的需求,以下是Coursera上的Python课程整理,仅供参考,这里也会持续更新。

1. 密歇根大学的“Python for Everybody Specialization(人人都可以学习的Python专项课程)”

这个系列对于学习者的编程背景和数学要求几乎为零,非常适合Python入门学习。这个系列也是Coursera上最受欢迎的Python学习系列课程,强烈推荐。这个Python系列的目标是“通过Python学习编程并分析数据,开发用于采集,清洗,分析和可视化数据的程序(Learn to Program and Analyze Data with Python - Develop programs to gather, clean, analyze, and visualize data.)”,以下是关于这个系列的简介:

This Specialization builds on the success of the Python for Everybody course and will introduce fundamental programming concepts including data structures, networked application program interfaces, and databases, using the Python programming language. In the Capstone Project, you’ll use the technologies learned throughout the Specialization to design and create your own applications for data retrieval, processing, and visualization.

这个系列包含4门子课程和1门毕业项目课程,包括Python入门基础,Python数据结构, 使用Python获取网络数据(Python爬虫),在Python中使用数据库以及Python数据可视化等。以下是具体子课程的介绍:

1.1 Programming for Everybody (Getting Started with Python)

Python入门级课程,这门课程暂且翻译为“人人都可以学编程-从Python开始”,如果没有任何编程基础,就从这门课程开始吧:

This course aims to teach everyone the basics of programming computers using Python. We cover the basics of how one constructs a program from a series of simple instructions in Python. The course has no pre-requisites and avoids all but the simplest mathematics. Anyone with moderate computer experience should be able to master the materials in this course. This course will cover Chapters 1-5 of the textbook “Python for Everybody”. Once a student completes this course, they will be ready to take more advanced programming courses. This course covers Python 3.

1.2 Python Data Structures(Python数据结构)

Python基础课程,这门课程的目标是介绍Python语言的核心数据结构(This course will introduce the core data structures of the Python programming language.),关于这门课程:

This course will introduce the core data structures of the Python programming language. We will move past the basics of procedural programming and explore how we can use the Python built-in data structures such as lists, dictionaries, and tuples to perform increasingly complex data analysis. This course will cover Chapters 6-10 of the textbook “Python for Everybody”. This course covers Python 3.

1.3 Using Python to Access Web Data(使用Python获取网页数据--Python爬虫)

Python应用课程,只有使用Python才能学以致用,这门课程的目标是展示如何通过爬取和分析网页数据将互联网作为数据的源泉(This course will show how one can treat the Internet as a source of data):

This course will show how one can treat the Internet as a source of data. We will scrape, parse, and read web data as well as access data using web APIs. We will work with HTML, XML, and JSON data formats in Python. This course will cover Chapters 11-13 of the textbook “Python for Everybody”. To succeed in this course, you should be familiar with the material covered in Chapters 1-10 of the textbook and the first two courses in this specialization. These topics include variables and expressions, conditional execution (loops, branching, and try/except), functions, Python data structures (strings, lists, dictionaries, and tuples), and manipulating files. This course covers Python 3.

1.4 Using Databases with Python(Python数据库)

Python应用课程,在Python中使用数据库。这门课程的目标是在Python中学习SQL,使用SQLite3作为抓取数据的存储数据库:

This course will introduce students to the basics of the Structured Query Language (SQL) as well as basic database design for storing data as part of a multi-step data gathering, analysis, and processing effort. The course will use SQLite3 as its database. We will also build web crawlers and multi-step data gathering and visualization processes. We will use the D3.js library to do basic data visualization. This course will cover Chapters 14-15 of the book “Python for Everybody”. To succeed in this course, you should be familiar with the material covered in Chapters 1-13 of the textbook and the first three courses in this specialization. This course covers Python 3.

1.5 Capstone: Retrieving, Processing, and Visualizing Data with Python(毕业项目课程:使用Python获取,处理和可视化数据)

Python应用实践课程,这是这个系列的毕业项目课程,目的是通过开发一系列Python应用项目让学生熟悉Python抓取,处理和可视化数据的流程。

In the capstone, students will build a series of applications to retrieve, process and visualize data using Python. The projects will involve all the elements of the specialization. In the first part of the capstone, students will do some visualizations to become familiar with the technologies in use and then will pursue their own project to visualize some other data that they have or can find. Chapters 15 and 16 from the book “Python for Everybody” will serve as the backbone for the capstone. This course covers Python 3.

2. 多伦多大学的编程入门课程"Learn to Program: The Fundamentals(学习编程:基础) "

Python入门级课程。这门课程以Python语言传授编程入门知识,实为零基础的Python入门课程。感兴趣的同学可以参考课程图谱上的老课程评论 :http://coursegraph.com/coursera_programming1 ,之前一个同学的评价是 “两个老师语速都偏慢,讲解细致,又有可视化工具Python Visualizer用于详细了解程序具体执行步骤,可以说是零基础学习python编程的最佳选择。”

Behind every mouse click and touch-screen tap, there is a computer program that makes things happen. This course introduces the fundamental building blocks of programming and teaches you how to write fun and useful programs using the Python language.

3. 莱斯大学的Python专项课程系列:Introduction to Scripting in Python Specialization

入门级Python学习系列课程,涵盖Python基础, Python数据表示, Python数据分析, Python数据可视化等子课程,比较适合Python入门。这个系列的目标是让学生在处理实际问题时能够使用Python解决问题:Launch Your Career in Python Programming - Master the core concepts of scripting in Python to enable you to solve practical problems.

This specialization is intended for beginners who would like to master essential programming skills. Through four courses, you will cover key programming concepts in Python 3 which will prepare you to use Python to perform common scripting tasks. This knowledge will provide a solid foundation towards a career in data science, software engineering, or other disciplines involving programming.

这个系列包含4门子课程,以下是具体子课程的介绍:

3.1 Python Programming Essentials(Python编程基础)

Python入门基础课程,这门课程将讲授Python编程基础知识,包括表达式,变量,函数等,目标是让用户熟练使用Python:

This course will introduce you to the wonderful world of Python programming! We'll learn about the essential elements of programming and how to construct basic Python programs. We will cover expressions, variables, functions, logic, and conditionals, which are foundational concepts in computer programming. We will also teach you how to use Python modules, which enable you to benefit from the vast array of functionality that is already a part of the Python language. These concepts and skills will help you to begin to think like a computer programmer and to understand how to go about writing Python programs. By the end of the course, you will be able to write short Python programs that are able to accomplish real, practical tasks. This course is the foundation for building expertise in Python programming. As the first course in a specialization, it provides the necessary building blocks for you to succeed at learning to write more complex Python programs. This course uses Python 3. While many Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This first course will use a Python 3 version of the CodeSkulptor development environment, which is specifically designed to help beginning programmers learn quickly. CodeSkulptor runs within any modern web browser and does not require you to install any software, allowing you to start writing and running small programs immediately. In the later courses in this specialization, we will help you to move to more sophisticated desktop development environments.

3.2 Python Data Representations(Python数据表示)

Python入门基础课程,这门课程依然关注Python的基础知识,包括Python字符串,列表等,以及Python文件操作:

This course will continue the introduction to Python programming that started with Python Programming Essentials. We'll learn about different data representations, including strings, lists, and tuples, that form the core of all Python programs. We will also teach you how to access files, which will allow you to store and retrieve data within your programs. These concepts and skills will help you to manipulate data and write more complex Python programs. By the end of the course, you will be able to write Python programs that can manipulate data stored in files. This will extend your Python programming expertise, enabling you to write a wide range of scripts using Python This course uses Python 3. While most Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This course introduces basic desktop Python development environments, allowing you to run Python programs directly on your computer. This choice enables a smooth transition from online development environments.

3.3 Python Data Analysis(Python数据分析)

Python基础课程,这门课程将讲授通过Python读取和分析表格数据等结构化数据,例如CSV文件:

This course will continue the introduction to Python programming that started with Python Programming Essentials and Python Data Representations. We'll learn about reading, storing, and processing tabular data, which are common tasks. We will also teach you about CSV files and Python's support for reading and writing them. CSV files are a generic, plain text file format that allows you to exchange tabular data between different programs. These concepts and skills will help you to further extend your Python programming knowledge and allow you to process more complex data. By the end of the course, you will be comfortable working with tabular data in Python. This will extend your Python programming expertise, enabling you to write a wider range of scripts using Python. This course uses Python 3. While most Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This course uses basic desktop Python development environments, allowing you to run Python programs directly on your computer.

3.4 Python Data Visualization(Python数据可视化)

Python应用课程,这门课程将基于前3门课程学习的Python知识,抓取网络数据,然后清洗,处理和分析数据,并最终可视化呈现数据:

This is the final course in the specialization which builds upon the knowledge learned in Python Programming Essentials, Python Data Representations, and Python Data Analysis. We will learn how to install external packages for use within Python, acquire data from sources on the Web, and then we will clean, process, analyze, and visualize that data. This course will combine the skills learned throughout the specialization to enable you to write interesting, practical, and useful programs. By the end of the course, you will be comfortable installing Python packages, analyzing existing data, and generating visualizations of that data. This course will complete your education as a scripter, enabling you to locate, install, and use Python packages written by others. You will be able to effectively utilize tools and packages that are widely available to amplify your effectiveness and write useful programs.

4. 莱斯大学的计算(机)基础专项课程系列:Fundamentals of Computing Specialization

入门级Python编程学习课程系列,这个系列覆盖了大部分莱斯大学一年级计算机科学新生的学习材料,学生通过Python学习现代编程语言技巧,并将这些技巧应用到20个左右的有趣的编程项目中。

This Specialization covers much of the material that first-year Computer Science students take at Rice University. Students learn sophisticated programming skills in Python from the ground up and apply these skills in building more than 20 fun projects. The Specialization concludes with a Capstone exam that allows the students to demonstrate the range of knowledge that they have acquired in the Specialization.

这个系列包括Python交互式编程设计,计算原理,算法思维等6门课程和1门毕业项目课程,目标是让学生像计算机科学家一样编程和思考(Learn how to program and think like a Computer Scientist),以下是子课程的相关介绍:

4.1 An Introduction to Interactive Programming in Python (Part 1)(Python交互式编程导论上)

Python入门级课程,这门课程将讲授Python编程基础知识,例如表达式、条件语句和函数,并用这些知识构建一个简单的交互式应用。

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard and the mouse. In part 1 of this course, we will introduce the basic elements of programming (such as expressions, conditionals, and functions) and then use these elements to create simple interactive applications such as a digital stopwatch. Part 1 of this class will culminate in building a version of the classic arcade game "Pong".

4.2 An Introduction to Interactive Programming in Python (Part 2)(Python交互式编程导论下)

Python入门级课程,这门课程将继续讲授Python基础知识,例如列表,词典和循环,并将使用这些知识构建一个简单的游戏例如Blackjack:

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard and the mouse. In part 2 of this course, we will introduce more elements of programming (such as list, dictionaries, and loops) and then use these elements to create games such as Blackjack. Part 2 of this class will culminate in building a version of the classic arcade game "Asteroids". Upon completing this course, you will be able to write small, but interesting Python programs. The next course in the specialization will begin to introduce a more principled approach to writing programs and solving computational problems that will allow you to write larger and more complex programs.

4.3 Principles of Computing (Part 1)(计算原理上)

编程基础课程,这门课程聚焦在了编程的基础上,包括编码标准和测试,数学基础包括概率和组合等。

This two-part course builds upon the programming skills that you learned in our Introduction to Interactive Programming in Python course. We will augment those skills with both important programming practices and critical mathematical problem solving skills. These skills underlie larger scale computational problem solving and programming. The main focus of the class will be programming weekly mini-projects in Python that build upon the mathematical and programming principles that are taught in the class. To keep the class fun and engaging, many of the projects will involve working with strategy-based games. In part 1 of this course, the programming aspect of the class will focus on coding standards and testing. The mathematical portion of the class will focus on probability, combinatorics, and counting with an eye towards practical applications of these concepts in Computer Science. Recommended Background - Students should be comfortable writing small (100+ line) programs in Python using constructs such as lists, dictionaries and classes and also have a high-school math background that includes algebra and pre-calculus.

4.4 Principles of Computing (Part 2)(计算原理下)

编程基础课程,这门课程聚焦在搜索、排序、递归等主题上:

This two-part course introduces the basic mathematical and programming principles that underlie much of Computer Science. Understanding these principles is crucial to the process of creating efficient and well-structured solutions for computational problems. To get hands-on experience working with these concepts, we will use the Python programming language. The main focus of the class will be weekly mini-projects that build upon the mathematical and programming principles that are taught in the class. To keep the class fun and engaging, many of the projects will involve working with strategy-based games. In part 2 of this course, the programming portion of the class will focus on concepts such as recursion, assertions, and invariants. The mathematical portion of the class will focus on searching, sorting, and recursive data structures. Upon completing this course, you will have a solid foundation in the principles of computation and programming. This will prepare you for the next course in the specialization, which will begin to introduce a structured approach to developing and analyzing algorithms. Developing such algorithmic thinking skills will be critical to writing large scale software and solving real world computational problems.

4.5 Algorithmic Thinking (Part 1)(算法思维上)

编程基础课程,这门课程聚焦在算法思维的培养上,讲授图算法的相关概念并用Python实现:

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part course builds on the principles that you learned in our Principles of Computing course and is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to real-world computational problems. In part 1 of this course, we will study the notion of algorithmic efficiency and consider its application to several problems from graph theory. As the central part of the course, students will implement several important graph algorithms in Python and then use these algorithms to analyze two large real-world data sets. The main focus of these tasks is to understand interaction between the algorithms and the structure of the data sets being analyzed by these algorithms. Recommended Background - Students should be comfortable writing intermediate size (300+ line) programs in Python and have a basic understanding of searching, sorting, and recursion. Students should also have a solid math background that includes algebra, pre-calculus and a familiarity with the math concepts covered in "Principles of Computing".

4.6 Algorithmic Thinking (Part 2)(算法思维下)

编程基础课程,这门课程聚焦在培养学生的算法思维,并了解一些高级算法主题,例如分治法,动态规划等:

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part class is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to computational problems. In part 2 of this course, we will study advanced algorithmic techniques such as divide-and-conquer and dynamic programming. As the central part of the course, students will implement several algorithms in Python that incorporate these techniques and then use these algorithms to analyze two large real-world data sets. The main focus of these tasks is to understand interaction between the algorithms and the structure of the data sets being analyzed by these algorithms. Once students have completed this class, they will have both the mathematical and programming skills to analyze, design, and program solutions to a wide range of computational problems. While this class will use Python as its vehicle of choice to practice Algorithmic Thinking, the concepts that you will learn in this class transcend any particular programming language.

4.7 The Fundamentals of Computing Capstone Exam(计算基础毕业项目课程)

Python应用实践课程。学生在前6门子课程中已经完成了20多个Python项目,这门毕业考核课程通过定期更新的考试来检验每个学生对整个专项课程内容的掌握程度:

While most specializations on Coursera conclude with a project-based course, students in the "Fundamentals of Computing" specialization have completed more than 20+ projects during the first six courses of the specialization. Given that much of the material in these courses is reused from session to session, our goal in this capstone class is to provide a conclusion to the specialization that allows each student an opportunity to demonstrate their individual mastery of the material in the specialization. With this objective in mind, the focus in this Capstone class will be an exam whose questions are updated periodically. This approach is designed to help insure that each student is solving the exam problems on his/her own without outside help. For students that have done their own work, we do not anticipate that the exam will be particularly hard. However, those students who have relied too heavily on outside help in previous classes may have a difficult time. We believe that this approach will increase the value of the Certificate for this specialization.

5. 密歇根大学的 Applied Data Science with Python(Python数据科学应用专项课程系列)

Python应用系列课程,这个系列的目标主要是通过Python编程语言介绍数据科学的相关领域,包括应用统计学,机器学习,信息可视化,文本分析和社交网络分析等知识,并结合一些流行的Python工具包,例如pandas, matplotlib, scikit-learn, nltk以及networkx等Python工具。

The 5 courses in this University of Michigan specialization introduce learners to data science through the python programming language. This skills-based specialization is intended for learners who have basic a python or programming background, and want to apply statistical, machine learning, information visualization, text analysis, and social network analysis techniques through popular python toolkits such as pandas, matplotlib, scikit-learn, nltk, and networkx to gain insight into their data. Introduction to Data Science in Python (course 1), Applied Plotting, Charting & Data Representation in Python (course 2), and Applied Machine Learning in Python (course 3) should be taken in order and prior to any other course in the specialization. After completing those, courses 4 and 5 can be taken in any order. All 5 are required to earn a certificate.

这个系列课程有5门课程,包括Python数据科学导论课程(Introduction to Data Science in Python),Python数据可视化(Applied Plotting, Charting & Data Representation in Python),Python机器学习(Applied Machine Learning in Python) ,Python文本挖掘(Applied Text Mining in Python) , Python社交网络分析(Applied Social Network Analysis in Python),以下是具体子课程的介绍:

5.1 Introduction to Data Science in Python(Python数据科学导论)

Python基础和应用课程,这门课程从Python基础讲起,然后通过pandas数据科学库介绍DataFrame等数据分析中的核心数据结构概念,让学生学会操作和分析表格数据并学会运行基础的统计分析工具。

This course will introduce the learner to the basics of the python programming environment, including how to download and install python, expected fundamental python programming techniques, and how to find help with python programming questions. The course will also introduce data manipulation and cleaning techniques using the popular python pandas data science library and introduce the abstraction of the DataFrame as the central data structure for data analysis. The course will end with a statistics primer, showing how various statistical measures can be applied to pandas DataFrames. By the end of the course, students will be able to take tabular data, clean it, manipulate it, and run basic inferential statistical analyses. This course should be taken before any of the other Applied Data Science with Python courses: Applied Plotting, Charting & Data Representation in Python, Applied Machine Learning in Python, Applied Text Mining in Python, Applied Social Network Analysis in Python.

5.2 Applied Plotting, Charting & Data Representation in Python(Python数据可视化)

Python应用课程,这门课程聚焦在通过使用matplotlib库进行数据图表的绘制和可视化呈现:

This course will introduce the learner to information visualization basics, with a focus on reporting and charting using the matplotlib library. The course will start with a design and information literacy perspective, touching on what makes a good and bad visualization, and what statistical measures translate into in terms of visualizations. The second week will focus on the technology used to make visualizations in python, matplotlib, and introduce users to best practices when creating basic charts and how to realize design decisions in the framework. The third week will describe the gamut of functionality available in matplotlib, and demonstrate a variety of basic statistical charts helping learners to identify when a particular method is good for a particular problem. The course will end with a discussion of other forms of structuring and visualizing data. This course should be taken after Introduction to Data Science in Python and before the remainder of the Applied Data Science with Python courses: Applied Machine Learning in Python, Applied Text Mining in Python, and Applied Social Network Analysis in Python.

5.3 Applied Machine Learning in Python(Python机器学习)

Python应用课程,这门课程主要聚焦在通过Python应用机器学习,包括机器学习和统计学的区别,机器学习工具包scikit-learn的介绍,有监督学习和无监督学习,数据泛化问题(例如交叉验证和过拟合)等。

This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different than descriptive statistics, and introduce the scikit learn toolkit. The issue of dimensionality of data will be discussed, and the task of clustering data, as well as evaluating those clusters, will be tackled. Supervised approaches for creating predictive models will be described, and learners will be able to apply the scikit learn predictive modelling methods while understanding process issues related to data generalizability (e.g. cross validation, overfitting). The course will end with a look at more advanced techniques, such as building ensembles, and practical limitations of predictive models. By the end of this course, students will be able to identify the difference between a supervised (classification) and unsupervised (clustering) technique, identify which technique they need to apply for a particular dataset and need, engineer features to meet that need, and write python code to carry out an analysis. This course should be taken after Introduction to Data Science in Python and Applied Plotting, Charting & Data Representation in Python and before Applied Text Mining in Python and Applied Social Analysis in Python.

5.4 Applied Text Mining in Python(Python文本挖掘)

Python应用课程,这门课程主要聚焦在文本挖掘和文本分析基础,包括正则表达式,文本清洗,文本预处理等,并结合NLTK讲授自然语言处理的相关知识,例如文本分类,主题模型等。

This course will introduce the learner to text mining and text manipulation basics. The course begins with an understanding of how text is handled by python, the structure of text both to the machine and to humans, and an overview of the nltk framework for manipulating text. The second week focuses on common manipulation needs, including regular expressions (searching for text), cleaning text, and preparing text for use by machine learning processes. The third week will apply basic natural language processing methods to text, and demonstrate how text classification is accomplished. The final week will explore more advanced methods for detecting the topics in documents and grouping them by similarity (topic modelling). This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.

5.5 Applied Social Network Analysis in Python(Python社交网络分析)

Python应用课程,这门课程通过Python工具包 NetworkX 介绍社交网络分析的相关知识。

This course will introduce the learner to network analysis through the NetworkX library. The course begins with an understanding of what network analysis is and motivations for why we might model phenomena as networks. The second week introduces the concept of connectivity and network robustness.. The third week will explore ways of measuring the importance or centrality of a node in a network. The final week will explore the evolution of networks over time and cover models of network generation and the link prediction problem. This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.

您可以继续在课程图谱上挖掘Coursera上新的Python课程,也欢迎推荐到这里。

注:本文首发“课程图谱博客”:http://blog.coursegraph.com ,同步发布到这里,原文链接地址:http://blog.coursegraph.com/coursera%E4%B8%8Apython%E8%AF%BE%E7%A8%8B%EF%BC%88%E5%85%AC%E5%BC%80%E8%AF%BE%EF%BC%89%E6%B1%87%E6%80%BB%E6%8E%A8%E8%8D%90%EF%BC%9A%E4%BB%8Epython%E5%85%A5%E9%97%A8%E5%88%B0%E5%BA%94%E7%94%A8python

Synonyms: 中文近义词工具包

Synonyms

Chinese Synonyms for Natural Language Processing and Understanding.
最好的中文近义词库。

最近需要做一个基于知识图谱的检索,但是因为知识图谱中存储的都是标准关键词,所以需要对用户的输入进行标准关键词的匹配。目前很缺乏质量好的中文近义词库,于是便考虑使用word2vec训练一个高质量的同义词库将"非标准表述" 映射到 "标准表述",这就是Synonyms的起源。

在经典的信息检索系统中,相似度的计算是基于匹配的,而且是Query经过分词后与文档库的严格的匹配,这种就缺少了利用词汇之间的“关系”。而word2vec使用大量数据,利用上下文信息进行训练,将词汇映射到低维空间,产生了这种“关系”,这种“关系”是基于距离的。有了这种“关系”,就可以进一步利用词汇之间的距离进行检索。所以,在算法层面上,检索更是基于了“距离”而非“匹配”,基于“语义”而非“形式”。
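
在实际检索中,可以基于这种词向量近义词做一层“非标准表述”到“标准表述”的映射。下面是一个简化的示意代码(假设已安装下文介绍的 synonyms 包,standard_keywords 为假设的标准关键词表,仅用于说明思路,并非 Synonyms 自带功能):

import synonyms

# 假设的标准关键词表,实际应来自知识图谱
standard_keywords = set([u"人脸", u"识别"])

def normalize(word):
    # 输入已是标准关键词则直接返回
    if word in standard_keywords:
        return word
    # 否则在近义词候选中查找第一个落在标准词表中的词
    candidates, scores = synonyms.nearby(word)
    for w in candidates:
        if w in standard_keywords:
            return w
    return None

print(normalize(u"脸部"))  # 期望映射到 "人脸"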

下面我们来仔细聊聊Synonyms(https://github.com/huyingxi/Synonyms)。

首先需要语料,我们采用了开放的大规模中文语料——维基百科中文语料。

(1)下载维基百科中文语料。
(2)繁简转换。
(3)分词。

具体操作访问wikidata-corpus

使用gensim自带的word2vec包进行词向量的训练。
(1)下载gensim。
(2)输入分词之后的维基语料进行词向量训练。
(3)测试训练好的词的近义词。

具体操作访问
wikidata-corpus
gensim.word2vec官方文档
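
下面给出一个基于 gensim 的最小训练示意(假设分词后的维基语料保存为 wiki.zh.seg.txt,每行一篇文章;参数取值仅为示例,并非 Synonyms 的实际训练配置):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# 逐行读取分好词的维基语料(内存友好)
sentences = LineSentence('wiki.zh.seg.txt')
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.save('wiki.zh.word2vec.model')

# 测试训练好的词的近义词
print(model.most_similar(u'识别', topn=10))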

安装

pip install -U synonyms

API接口

synonyms.nearby

获取近义词列表及对应的分数

import synonyms
print("人脸: %s" % (synonyms.nearby("人脸"))) # 获取近义词
print("识别: %s" % (synonyms.nearby("识别")))
print("NOT_EXIST: %s" % (synonyms.nearby("NOT_EXIST")))

synonyms.nearby(WORD)返回一个list,list中包含两项:[[nearby_words], [nearby_words_score]],nearby_words是WORD的近义词们,也以list的方式存储,并且按照距离的长度由近及远排列,nearby_words_score是nearby_words中对应位置的词的距离的分数,分数在(0-1)区间内,越接近于1,代表越相近。

人脸: [['图片', '图像', '通过观察', '数字图像', '几何图形', '脸部', '图象', '放大镜', '面孔', 'Mii'], [0.597284, 0.580373, 0.568486, 0.535674, 0.531835, 0.530095, 0.525344, 0.524009, 0.523101, 0.516046]]
识别: [['辨识', '辨别', '辨认', '标识', '鉴别', '标记', '识别系统', '分辨', '检测', '区分'], [0.872249, 0.764099, 0.725761, 0.702918, 0.68861, 0.678132, 0.663829, 0.661863, 0.639442, 0.611004]]

synonyms.compare

获得两个句子的相似度

import synonyms

sen1 = "旗帜引领方向"
sen2 = "道路决定命运"
assert synonyms.compare(sen1, sen2) == 0.0, "the similarity should be zero"
sen1 = "发生历史性变革"
sen2 = "取得历史性成就"
assert synonyms.compare(sen1, sen2) > 0, "the similarity should be bigger than zero"

返回值:[0-1],并且越接近于1代表两个句子越相似。

详细的文档

场景

  • 推荐。将用户输入进行近义词分析,可以推荐给用户相关的关键词。
  • 搜索。将用户非标准化输入转换为标准化输入,进而进行数据库/知识库检索。
  • 相似度计算。解决在自然语言处理任务中常见的词语语义相似度计算问题。

Why Synonyms

1、准确率高。
从上面的示例可以看到synonyms作为开放领域的同义词库,已经有较优的表现。
2、快速使用。
即安即用,方便开发者直接调用。
3、方便搭建。

作者

Hu Ying Xi

hain

欢迎给 Synonyms 点赞。

机器学习保险行业问答开放数据集: 2. 使用案例

上一篇文章中介绍了数据集的设计。该语料可以用于研究和学习,从规模和质量上看,是目前中文问答语料中保险行业垂直领域最优秀的语料,其制作过程可以通过语料主页了解。本篇的主要内容是使用该语料实现一个简单的问答模型,并给出准确度和损失函数作为数据集的Baseline。

DeepQA-1

为了展示如何使用该语料训练模型和评测算法,我做了一个示例项目 - DeepQA-1,本文接下来会介绍DeepQA-1,假设读者了解深度学习基本概念和Python语言。

Data Loader

数据加载包含两部分:加载语料和预处理。 加载数据使用 insuranceqa_data  载入训练,测试和验证集的数据。
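
一个简化的加载示意如下(insuranceqa_data 的接口细节见“语料介绍”一文):

import insuranceqa_data as insuranceqa

# 加载"问答对"语料:训练、测试、验证集以及词汇表
train_data = insuranceqa.load_pairs_train()
test_data = insuranceqa.load_pairs_test()
valid_data = insuranceqa.load_pairs_valid()
vocab_data = insuranceqa.load_pairs_vocab()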

预处理是按照模型的超参数处理问题和答案,将它们组合成输入需要的格式,在本文介绍的baseline model中,预处理包含下面工作:

  1. 在词汇表(vocab)中添加辅助Token: <PAD>, <GO>。假设x是问题序列,u是回复序列,输入序列可以表示为 x + <GO> + u,再用 <PAD> 补齐到固定长度。

超参数question_max_length代表模型中问题的最大长度。 超参数utterance_max_length代表模型中回复的最大长度,回复可能是正例,也可能是负例。

其中,Token <GO> 用来分隔问题和回复,Token <PAD> 用来补齐问题或回复。

训练数据包含了141,779条,正例:负例 = 1:10,根据超参数生成输入序列。

其中 x 就是输入序列,y_代表标注数据:正例还是负例,正例标为[1,0],负例标为[0,1],这样做的好处是方便计算损失函数和准确度。测试数据和验证数据也用同样的方式进行处理,唯一不同的是它们不需要做成mini-batch。需要强调的是,处理词汇表和构建输入序列的方式可以尝试用不同的方法,上述方案仅作为表达baseline结果而采用;一些有助于增强模型能力的做法,比如使用word2vec训练词向量,都值得尝试。
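
下面是一段构造输入序列和标注的简化示意(question_max_length、utterance_max_length 的取值仅为假设,并非 DeepQA-1 的原始实现):

GO, PAD = '<GO>', '<PAD>'
question_max_length, utterance_max_length = 20, 99  # 假设的超参数取值

def build_input(question_tokens, utterance_tokens, is_positive):
    # 截断或用 <PAD> 补齐问题和回复到固定长度
    q = question_tokens[:question_max_length]
    q = q + [PAD] * (question_max_length - len(q))
    u = utterance_tokens[:utterance_max_length]
    u = u + [PAD] * (utterance_max_length - len(u))
    x = q + [GO] + u                          # <GO> 用来分隔问题和回复
    y_ = [1, 0] if is_positive else [0, 1]    # 正例 [1,0],负例 [0,1]
    return x, y_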

Network

baseline model使用了最简单的神经网络,输入序列从左侧进入,输出层输出一个包含2个数值的vector,然后使用损失函数计算误差。

超参数,Hyper params

损失函数

神经网络的激活函数使用函数,损失函数使用最大似然的思想。

 

迭代训练

使用mini-batch加载数据,迭代训练的大部分工作在back_propagation中完成,它计算出每次迭代的损失和b,W 的误差率,然后使用学习率和误差率更新每个b,W 。
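
参数更新部分可以用下面的简化示意来理解(非原始代码,learning_rate 为假设值):

learning_rate = 0.01  # 假设的学习率

def sgd_update(weights, biases, grad_w, grad_b):
    # 用学习率和反向传播得到的误差更新每一层的 W 和 b
    for i in range(len(weights)):
        weights[i] = weights[i] - learning_rate * grad_w[i]
        biases[i] = biases[i] - learning_rate * grad_b[i]
    return weights, biases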

 

执行训练脚本

python3 deep_qa_1/network.py

Visual

在训练过程中,观察损失函数和准确度的变化可以帮助优化超参数的设计。

loss

python3 visual/loss.py

accuracy

python3 visual/accuracy.py

损失和准确度在迭代了25,000步后就基本维持在一个固定值,学习停止了。

Baseline

最终获得的Baseline数据为:

Epoch 25, total step 36400, accuracy 0.9031, cost 1.056221.

总结

Baseline model设计得非常简单,它展示了如何使用insuranceqa-corpus-zh训练FAQ问答模型,项目的源码参考这里。在过去两周中,为了让这个数据集满足使用需求、体现其价值,我花了很多时间来建设,仓促之中仍然会包含一些不足。比如数据集中每个问题是唯一的,不包含相似问题,这是这个数据集目前最大的缺陷;另一方面,因为该数据集的回复包含一个正例和多个负例,既可以用于训练分类器,也可以用于训练ranking model。如果在使用的过程中遇到任何问题,可以通过数据集的地址反馈。

机器学习保险行业问答开放数据集: 1. 语料介绍

目前机器学习,尤其是深度学习掀起的一波小高潮,让大家对使用深度学习处理文本任务兴趣浓厚。数据是特征提取的天花板,特征提取是深度学习的天花板。在缺少语料的情况下,评价算法和研究都很难着手。在调研了众多语料之后,深知高质量的开放语料十分稀少,比如百度开放的Web QA 1.0 语料,包含的问题也就四万余条,如果再细分到不同的垂直领域,就根本不能用于FAQ模型的训练,这就是我做这个语料的原因 - 支持常见问题集模型的算法评测和研究。我将通过两篇文章来分享这个语料:(1) 语料介绍,介绍语料的组成;(2) 使用案例,介绍一个简单使用该语料进行深度学习训练的案例,可以作为 baseline。

该语料库包含从网站Insurance Library 收集的问题和答案。

据我们所知,这是保险领域首个开放的QA语料库:

该语料库的内容由现实世界的用户提出,高质量的答案由具有深度领域知识的专业人士提供。 所以这是一个具有真正价值的语料,而不是玩具。

在原始论文(见文末声明2)中,语料库用于答案选择(answer selection)任务。另一方面,这个语料库也可以有其他用法,例如通过阅读理解答案、观察学习等自主学习方式,使系统最终能够对未见过的问题给出自己的答案。

数据集分为两个部分:“问答语料”和“问答对语料”。问答语料是从原始英文数据翻译过来、未经其他处理的。问答对语料是基于问答语料,又做了分词、去标点、去停用词,并添加了label。所以,"问答对语料"可以直接对接机器学习任务。如果对数据格式或分词效果不满意,可以直接对"问答语料"使用其他方法进行处理,获得可以用于训练模型的数据。

欢迎任何进一步增加此数据集的想法。

快速开始

语料地址

https://github.com/Samurais/insuranceqa-corpus-zh

在Python环境中,可以使用pip安装

兼容py2, py3

pip install --upgrade insuranceqa_data

问答语料

        问题      答案      词汇(英语)
训练    12,889    21,325    107,889
验证     2,000     3,354     16,931
测试     2,000     3,308     16,815

每条数据包括问题的中文、英文,答案的正例,答案的负例。答案的正例至少1项,基本上在1-5条,都是正确答案。答案的负例有200条,负例根据问题使用检索的方式建立,所以和问题是相关的,但却不是正确答案。

{
  "INDEX": {
    "zh": "中文",
    "en": "英文",
    "domain": "保险种类",
    "answers": [""],    # 答案正例列表
    "negatives": [""]   # 答案负例列表
  },
  more ...
}

训练:corpus/pool/train.json.gz

验证:corpus/pool/valid.json.gz

测试:corpus/pool/test.json.gz

答案:corpus/pool/answers.json 一共有 27,413 个回答,数据格式为 json:

{
"INDEX": {
"zh": "中文",
"en": "英文"
},
more ...
}

中英文对照文件

问答对

文件: corpus/pool/train.txt.gz, corpus/pool/valid.txt.gz, corpus/pool/test.txt.gz.

格式: INDEX ++$++ 保险种类 ++$++ 中文 ++$++ 英文

答案

文件: corpus/pool/answers.txt.gz

格式: INDEX ++$++ 中文 ++$++ 英文

语料库使用gzip进行压缩以减小体积,可以使用zmore, zless, zcat, zgrep等命令访问数据。

zmore pool/test.txt.gz

加载数据

import insuranceqa_data as insuranceqa
train_data = insuranceqa.load_pool_train()
test_data = insuranceqa.load_pool_test()
valid_data = insuranceqa.load_pool_valid()

# valid_data, test_data and train_data share the same properties

for x in train_data:
    print('index %s value: %s ++$++ %s ++$++ %s ++$++ %s' % \
        (x, train_data[x]['zh'], train_data[x]['en'],
         train_data[x]['answers'], train_data[x]['negatives']))

answers_data = insuranceqa.load_pool_answers()

for x in answers_data:
    print('index %s: %s ++$++ %s' % (x, answers_data[x]['zh'], answers_data[x]['en']))

问答对语料

使用"问答语料",还需要做很多工作才能进入机器学习的模型,比如分词,去停用词,去标点符号,添加label标记。所以,在"问答语料"的基础上,还可以继续处理,但是在分词等任务中,可以借助不同分词工具,这点对于模型训练而言是有影响的。为了使数据能快速可用,insuranceqa-corpus-zh提供了一个使用HanLP分词和去标,去停,添加label的数据集,这个数据集完全是基于"问答语料"。

import insuranceqa_data as insuranceqa
train_data = insuranceqa.load_pairs_train()
test_data = insuranceqa.load_pairs_test()
valid_data = insuranceqa.load_pairs_valid()

# valid_data, test_data and train_data share the same properties

for x in test_data:
    print('index %s value: %s ++$++ %s ++$++ %s' % \
        (x['qid'], x['question'], x['utterance'], x['label']))

vocab_data = insuranceqa.load_pairs_vocab()
vocab_data['word2id']['UNKNOWN']
vocab_data['id2word'][0]
vocab_data['tf']
vocab_data['total']

vocab_data包含word2id(dict, 从word到id), id2word(dict, 从id到word),tf(dict, 词频统计)和total(单词总数)。 其中,未登录词的标识为UNKNOWN,未登录词的id为0。

train_data、test_data 和 valid_data 的数据格式一样。qid 是问题Id,question 是问题,utterance 是回复,label 如果是 [1,0] 代表回复是正确答案,[0,1] 代表回复不是正确答案,所以 utterance 包含了正例和负例的数据。每个问题含有10个负例和1个正例。
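
下面是一个把分词后的 question/utterance 映射为 id 序列的简化示意(假设这两个字段为以空格分隔的分词结果,实际格式以语料文档为准):

word2id = vocab_data['word2id']

def to_ids(text):
    # 未登录词统一映射为 0,即 UNKNOWN
    return [word2id.get(w, 0) for w in text.split()]

for x in test_data:
    q_ids = to_ids(x['question'])
    u_ids = to_ids(x['utterance'])
    break  # 仅演示第一条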

train_data 含有问题 12,889 条,数据 141,779 条,正例:负例 = 1:10
test_data 含有问题 2,000 条,数据 22,000 条,正例:负例 = 1:10
valid_data 含有问题 2,000 条,数据 22,000 条,正例:负例 = 1:10

句子长度:

max len of valid question : 31, average: 5(max)
max len of valid utterance: 878(max), average: 165(max)
max len of test question : 33, average: 5
max len of test utterance: 878, average: 161
max len of train question : 42(max), average: 5
max len of train utterance: 878, average: 162
vocab size: 24997

可将本语料库和以下开源码配合使用

DeepQA2: https://github.com/Samurais/DeepQA2

InsuranceQA TensorFlow: https://github.com/l11x0m7/InsuranceQA

Chatbot Retrieval: https://github.com/dennybritz/chatbot-retrieval

声明

声明1 : insuranceqa-corpus-zh

本数据集由翻译 insuranceQA 而生成,代码以 GPL 3.0 许可证发布。数据仅限于研究用途,如果在任何媒体、期刊、杂志或博客等发布相关内容,必须注明引用和地址。

InsuranceQA Corpus, Hai Liang Wang, https://github.com/Samurais/insuranceqa-corpus-zh, 07 27, 2017

任何基于insuranceqa-corpus衍生的数据也需要开放,并需要作出与“声明1”和“声明2”一致的声明。

声明2 : insuranceQA

此数据集仅作为研究目的提供。如果您使用这些数据发表任何内容,请引用我们的论文:

Applying Deep Learning to Answer Selection: A Study and An Open Task. Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015

 

如何学习自然语言处理:一本书和一门课

关于“如何学习自然语言处理”,有很多同学通过不同的途径留过言,这方面虽然很早之前写过几篇小文章:《如何学习自然语言处理》和《几本自然语言处理入门书》,但是更推崇知乎上这个问答:自然语言处理怎么最快入门,里面有微软亚洲研究院周明老师的系统回答和清华大学刘知远老师的倾情奉献:初学者如何查阅自然语言处理(NLP)领域学术资料,当然还包括其他同学的无私分享。

不过,对于希望入门NLP的同学来说,推荐你们先看一下这本书: Speech and Language Processing,第一版中文名译为《自然语言处理综论》,作者都是NLP领域的大大牛:斯坦福大学 Dan Jurafsky 教授和科罗拉多大学的 James H. Martin 教授。这也是我当年的入门书,我读过这本书的中文版(翻译自第一版英文版)和英文版第二版,该书第三版正在撰写中,作者已经完成了不少章节的撰写,所完成的章节均可下载:Speech and Language Processing (3rd ed. draft)。从章节来看,第三版增加了不少和NLP相关的深度学习的章节,内容和篇幅相对于之前有了更多的更新:

Chapter | Slides | Relation to 2nd ed.
1: Introduction | - | [Ch. 1 in 2nd ed.]
2: Regular Expressions, Text Normalization, and Edit Distance | Text [pptx] [pdf]; Edit Distance [pptx] [pdf] | [Ch. 2 and parts of Ch. 3 in 2nd ed.]
3: Finite State Transducers | - | -
4: Language Modeling with N-Grams | LM [pptx] [pdf] | [Ch. 4 in 2nd ed.]
5: Spelling Correction and the Noisy Channel | Spelling [pptx] [pdf] | [expanded from pieces in Ch. 5 in 2nd ed.]
6: Naive Bayes Classification and Sentiment | NB [pptx] [pdf]; Sentiment [pptx] [pdf] | [new in this edition]
7: Logistic Regression | - | -
8: Neural Nets and Neural Language Models | - | -
9: Hidden Markov Models | - | [Ch. 6 in 2nd ed.]
10: Part-of-Speech Tagging | - | [Ch. 5 in 2nd ed.]
11: Formal Grammars of English | - | [Ch. 12 in 2nd ed.]
12: Syntactic Parsing | - | [Ch. 13 in 2nd ed.]
13: Statistical Parsing | - | -
14: Dependency Parsing | - | [new in this edition]
15: Vector Semantics | Vector [pptx] [pdf] | [expanded from parts of Ch. 19 and 20 in 2nd ed.]
16: Semantics with Dense Vectors | Dense Vector [pptx] [pdf] | [new in this edition]
17: Computing with Word Senses: WSD and WordNet | Intro, Sim [pptx] [pdf]; WSD [pptx] [pdf] | [expanded from parts of Ch. 19 and 20 in 2nd ed.]
18: Lexicons for Sentiment and Affect Extraction | SentLex [pptx] [pdf] | [new in this edition]
19: The Representation of Sentence Meaning | - | -
20: Computational Semantics | - | -
21: Information Extraction | - | [Ch. 22 in 2nd ed.]
22: Semantic Role Labeling and Argument Structure | SRL [pptx] [pdf]; Select [pptx] [pdf] | [expanded from parts of Ch. 19 and 20 in 2nd ed.]
23: Neural Models of Sentence Meaning (RNN, LSTM, CNN, etc.) | - | -
24: Coreference Resolution and Entity Linking | - | -
25: Discourse Coherence | - | -
26: Seq2seq Models and Summarization | - | -
27: Machine Translation | - | -
28: Question Answering | - | -
29: Conversational Agents | - | -
30: Speech Recognition | - | -
31: Speech Synthesis | - | -

另外该书作者之一斯坦福大学 Dan Jurafsky 教授曾经在Coursera上开设过一门自然语言处理课程:Natural Language Processing,该课程目前貌似在Coursera新课程平台上已经查询不到,不过我们在百度网盘上做了一个备份,包括该课程视频和该书的第二版英文,两个一起看,效果更佳:

链接: https://pan.baidu.com/s/1kUCrV8r 密码: jghn 。

对于一直寻找如何入门自然语言处理的同学来说,先把这本书和这套课程拿下来才是一个必要条件,万事先有个基础。

同时欢迎大家关注我们的公众号:NLPJob,回复"slp"获取该书和课程最新资源。

维基百科语料中的词语相似度探索

之前写过《中英文维基百科语料上的Word2Vec实验》,近期有不少同学在这篇文章下留言提问,加上最近一些工作也与Word2Vec相关,于是又做了一些功课,包括重新过了一遍Word2Vec的相关资料,试了一下gensim的相关更新接口,google了一下"wikipedia word2vec" or "维基百科 word2vec" 相关的英中文资料,发现多数还是走的这篇文章的老路,即通过gensim提供的维基百科预处理脚本"gensim.corpora.WikiCorpus"提取维基语料,每篇文章一行文本存放,然后基于gensim的Word2Vec模块训练词向量模型。这里再提供另一个方法来处理维基百科的语料,训练词向量模型,计算词语相似度(Word Similarity)。关于Word2Vec,如果英文不错,推荐从这篇文章入手读相关的资料: Getting started with Word2Vec

这次我们仅以英文维基百科语料为例,首先依然是下载维基百科的最新XML打包压缩数据,在这个英文最新更新的数据列表下:https://dumps.wikimedia.org/enwiki/latest/ ,找到 "enwiki-latest-pages-articles.xml.bz2" 下载,这份英文维基百科全量压缩数据的打包时间大概是2017年4月4号,大约13G,我通过家里的电脑wget下载大概花了3个小时,电信100M宽带,速度还不错。

接下来就是处理这份压缩的XML英文维基百科语料了,这次我们使用WikiExtractor:

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.

WikiExtractor是一个Python 脚本,专门用于提取和清洗Wikipedia的dump数据,支持Python 2.7 或者 Python 3.3+,无额外依赖,安装和使用都非常方便:

安装:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install

使用:
WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2

......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)

这个过程总计花了2个多小时,提取了大概537万多篇文章。关于我的机器配置,可参考:《深度学习主机攒机小记》。

提取后的文件按一定顺序切分存储在多个子目录下:

每个子目录下又存放若干个以 wiki_num 命名的文件,每个大小在1M左右,这个大小可以通过参数 -b 控制:

-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)

我们看一下wiki_00里的具体内容:


Anarchism

Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.



Autism

Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...

...

每个wiki_num文件里又存放若干个doc,每个doc都有相关的tag标记,包括id, url, title等,很好区分。

这里我们按照Gensim作者提供的word2vec tutorial里"memory-friendly iterator"方式来处理英文维基百科的数据。代码如下,也同步放到了github里:train_word2vec_with_gensim.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen
 
import gensim
import logging
import multiprocessing
import os
import re
import sys
 
from pattern.en import tokenize
from time import time
 
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
 
 
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext
 
 
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = root + '/' + filename
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    rline = cleanhtml(sline)
                    tokenized_line = ' '.join(tokenize(rline))
                    is_alpha_word_line = [word for word in
                                          tokenized_line.lower().split()
                                          if word.isalpha()]
                    yield is_alpha_word_line
 
 
if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Please use python train_with_gensim.py data_path"
        exit()
    data_path = sys.argv[1]
    begin = time()
 
    sentences = MySentences(data_path)
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("data/model/word2vec_gensim")
    model.wv.save_word2vec_format("data/model/word2vec_org",
                                  "data/model/vocabulary",
                                  binary=False)
 
    end = time()
    print "Total procesing time: %d seconds" % (end - begin)

注意其中的word tokenize使用了pattern里的英文tokenize模块,当然,也可以使用nltk里的word_tokenize模块,做一点修改即可,不过nltk对于句尾的一些词的word tokenize处理得不太好。另外我们设定词向量维度为200,窗口长度为10,最小出现次数为10,通过字符串的 isalpha() 方法过滤掉标点和非英文词。现在可以用这个脚本来训练英文维基百科的Word2Vec模型了:
python train_word2vec_with_gensim.py enwiki

2017-04-22 14:31:04,703 : INFO : collecting all words and their counts
2017-04-22 14:31:04,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-22 14:31:06,442 : INFO : PROGRESS: at sentence #10000, processed 480546 words, keeping 33925 word types
2017-04-22 14:31:08,104 : INFO : PROGRESS: at sentence #20000, processed 983240 words, keeping 51765 word types
2017-04-22 14:31:09,685 : INFO : PROGRESS: at sentence #30000, processed 1455218 words, keeping 64982 word types
2017-04-22 14:31:11,349 : INFO : PROGRESS: at sentence #40000, processed 1957479 words, keeping 76112 word types
......
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-23 02:50:59,854 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-23 02:50:59,854 : INFO : training on 8903084745 raw words (6742578791 effective words) took 37805.2s, 178351 effective words/s
2017-04-23 02:50:59,855 : INFO : saving Word2Vec object under data/model/word2vec_gensim, separately None                                                       
2017-04-23 02:50:59,855 : INFO : not storing attribute syn0norm                 
2017-04-23 02:50:59,855 : INFO : storing np array 'syn0' to data/model/word2vec_gensim.wv.syn0.npy                                                              
2017-04-23 02:51:00,241 : INFO : storing np array 'syn1neg' to data/model/word2vec_gensim.syn1neg.npy                                                           
2017-04-23 02:51:00,574 : INFO : not storing attribute cum_table                
2017-04-23 02:51:13,886 : INFO : saved data/model/word2vec_gensim               
2017-04-23 02:51:13,886 : INFO : storing vocabulary in data/model/vocabulary    
2017-04-23 02:51:17,480 : INFO : storing 868777x200 projection weights into data/model/word2vec_org                                                             
Total procesing time: 44476 seconds

这个训练过程大概花了12个多小时,训练后的文件存放在data/model下:

我们来测试一下这个英文维基百科的Word2Vec模型:

textminer@textminer:/opt/wiki/data$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from gensim.models import Word2Vec
 
In [2]: en_wiki_word2vec_model = Word2Vec.load('data/model/word2vec_gensim')

首先来测试几个单词的相似单词(Word Similariy):

word:

In [3]: en_wiki_word2vec_model.most_similar('word')
Out[3]: 
[('phrase', 0.8129693269729614),
 ('meaning', 0.7311851978302002),
 ('words', 0.7010501623153687),
 ('adjective', 0.6805518865585327),
 ('noun', 0.6461974382400513),
 ('suffix', 0.6440576314926147),
 ('verb', 0.6319557428359985),
 ('loanword', 0.6262609958648682),
 ('proverb', 0.6240501403808594),
 ('pronunciation', 0.6105246543884277)]

similarity:

In [4]: en_wiki_word2vec_model.most_similar('similarity')
Out[4]: 
[('similarities', 0.8517599701881409),
 ('resemblance', 0.786037266254425),
 ('resemblances', 0.7496883869171143),
 ('affinities', 0.6571112275123596),
 ('differences', 0.6465682983398438),
 ('dissimilarities', 0.6212711930274963),
 ('correlation', 0.6071442365646362),
 ('dissimilarity', 0.6062943935394287),
 ('variation', 0.5970577001571655),
 ('difference', 0.5928016901016235)]

nlp:

In [5]: en_wiki_word2vec_model.most_similar('nlp')
Out[5]: 
[('neurolinguistic', 0.6698148250579834),
 ('psycholinguistic', 0.6388964056968689),
 ('connectionism', 0.6027182936668396),
 ('semantics', 0.5866401195526123),
 ('connectionist', 0.5865628719329834),
 ('bandler', 0.5837364196777344),
 ('phonics', 0.5733655691146851),
 ('psycholinguistics', 0.5613113641738892),
 ('bootstrapping', 0.559638261795044),
 ('psychometrics', 0.5555593967437744)]

learn:

In [6]: en_wiki_word2vec_model.most_similar('learn')
Out[6]: 
[('teach', 0.7533557415008545),
 ('understand', 0.71148681640625),
 ('discover', 0.6749690771102905),
 ('learned', 0.6599283218383789),
 ('realize', 0.6390970349311829),
 ('find', 0.6308424472808838),
 ('know', 0.6171890497207642),
 ('tell', 0.6146825551986694),
 ('inform', 0.6008728742599487),
 ('instruct', 0.5998791456222534)]

man:

In [7]: en_wiki_word2vec_model.most_similar('man')
Out[7]: 
[('woman', 0.7243080735206604),
 ('boy', 0.7029494047164917),
 ('girl', 0.6441491842269897),
 ('stranger', 0.63275545835495),
 ('drunkard', 0.6136815547943115),
 ('gentleman', 0.6122575998306274),
 ('lover', 0.6108279228210449),
 ('thief', 0.609005331993103),
 ('beggar', 0.6083744764328003),
 ('person', 0.597919225692749)]

再来看看其他几个相关接口:

In [8]: en_wiki_word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
Out[8]: [('queen', 0.7752252817153931)]
 
In [9]: en_wiki_word2vec_model.similarity('woman', 'man')
Out[9]: 0.72430799548282099
 
In [10]: en_wiki_word2vec_model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'

我把这篇文章的相关代码还有另一篇“中英文维基百科语料上的Word2Vec实验”的相关代码整理了一下,在github上建立了一个 Wikipedia_Word2vec 的项目,感兴趣的同学可以参考。

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:http://www.52nlp.cn

本文链接地址:维基百科语料中的词语相似度探索 http://www.52nlp.cn/?p=9454

自然语言处理工具包spaCy介绍

spaCy 是一个Python自然语言处理工具包,诞生于2014年年中,号称“Industrial-Strength Natural Language Processing in Python”,是具有工业级强度的Python NLP工具包。spaCy里大量使用了 Cython 来提高相关模块的性能,这个区别于学术性质更浓的Python NLTK,因此具有了业界应用的实际价值。

安装和编译 spaCy 比较方便,在ubuntu环境下,直接用pip安装即可:

sudo apt-get install build-essential python-dev git
sudo pip install -U spacy

不过安装完毕之后,需要下载相关的模型数据,以英文模型数据为例,可以用"all"参数下载所有的数据:

sudo python -m spacy.en.download all

或者可以分别下载相关的模型和用glove训练好的词向量数据:


# 这个过程下载英文tokenizer,词性标注,句法分析,命名实体识别相关的模型
python -m spacy.en.download parser

# 这个过程下载glove训练好的词向量数据
python -m spacy.en.download glove

下载好的数据放在spacy安装目录下的data里,以我的ubuntu为例:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
776M en-1.1.0
774M en_glove_cc_300_1m_vectors-1.0.0

进入到英文数据模型下:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
424M deps
8.0K meta.json
35M ner
12M pos
84K tokenizer
300M vocab
6.3M wordnet

可以用如下命令检查模型数据是否安装成功:


textminer@textminer:~$ python -c "import spacy; spacy.load('en'); print('OK')"
OK

也可以用pytest进行测试:


# 首先找到spacy的安装路径:
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
/usr/local/lib/python2.7/dist-packages/spacy

# 再安装pytest:
sudo python -m pip install -U pytest

# 最后进行测试:
python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0
rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:
collected 318 items

../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py .....
......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............

============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============

现在可以快速测试一下spaCy的相关功能,我们以英文数据为例,spaCy目前主要支持英文和德文,对其他语言的支持正在陆续加入:


textminer@textminer:~$ ipython
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: import spacy

# 加载英文模型数据,稍许等待
In [2]: nlp = spacy.load('en')

Word tokenize功能,spaCy 1.2版本加了中文tokenize接口,基于Jieba中文分词:

In [3]: test_doc = nlp(u"it's word tokenize test for spacy")

In [4]: print(test_doc)
it's word tokenize test for spacy

In [5]: for token in test_doc:
   ...:     print(token)
   ...:
it
's
word
tokenize
test
for
spacy

英文断句:


In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [7]: for sent in test_doc.sents:
   ...:     print(sent)
   ...:
Natural language processing (NLP) deals with the application of computational models to text or speech data.
Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.
NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.
From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.


词干化(Lemmatize):


In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

In [9]: for token in test_doc:
   ...:     print(token, token.lemma_, token.lemma)
   ...:
(you, u'you', 472)
(are, u'be', 488)
(best, u'good', 556)
(., u'.', 419)
(it, u'it', 473)
(is, u'be', 488)
(lemmatize, u'lemmatize', 1510296)
(test, u'test', 1351)
(for, u'for', 480)
(spacy, u'spacy', 173783)
(., u'.', 419)
(I, u'i', 570)
(love, u'love', 644)
(these, u'these', 642)
(books, u'book', 1011)

词性标注(POS Tagging):


In [10]: for token in test_doc:
    ....:     print(token, token.pos_, token.pos)
    ....:
(you, u'PRON', 92)
(are, u'VERB', 97)
(best, u'ADJ', 82)
(., u'PUNCT', 94)
(it, u'PRON', 92)
(is, u'VERB', 97)
(lemmatize, u'ADJ', 82)
(test, u'NOUN', 89)
(for, u'ADP', 83)
(spacy, u'NOUN', 89)
(., u'PUNCT', 94)
(I, u'PRON', 92)
(love, u'VERB', 97)
(these, u'DET', 87)
(books, u'NOUN', 89)

命名实体识别(NER):


In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [12]: for ent in test_doc.ents:
    ....:     print(ent, ent.label_, ent.label)
    ....:
(Rami Eid, u'PERSON', 346)
(Stony Brook University, u'ORG', 349)
(New York, u'GPE', 350)

名词短语提取:


In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [14]: for np in test_doc.noun_chunks:
    ....:     print(np)
    ....:
Natural language processing
Natural language processing (NLP) deals
the application
computational models
text
speech
data
Application areas
NLP
automatic (machine) translation
languages
dialogue systems
a human
a machine
natural language
information extraction
the goal
unstructured text
structured (database) representations
flexible ways
NLP technologies
a dramatic impact
the way
people
computers
the way
people
the use
language
the way
people
the vast amount
linguistic data
electronic form
a scientific viewpoint
NLP
fundamental questions
formal models
example
natural language phenomena
algorithms
these models

基于词向量计算两个单词的相似度:


In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

In [16]: apples = test_doc[0]

In [17]: print(apples)
Apples

In [18]: oranges = test_doc[2]

In [19]: print(oranges)
oranges

In [20]: boots = test_doc[6]

In [21]: print(boots)
Boots

In [22]: hippos = test_doc[8]

In [23]: print(hippos)
hippos

In [24]: apples.similarity(oranges)
Out[24]: 0.77809414836023805

In [25]: boots.similarity(hippos)
Out[25]: 0.038474555379008429

当然,spaCy还包括句法分析的相关功能等。另外值得关注的是 spaCy 从1.0版本起,加入了对深度学习工具的支持,例如 Tensorflow 和 Keras 等,这方面具体可以参考官方文档给出的一个对情感分析(Sentiment Analysis)模型进行分析的例子:Hooking a deep learning model into spaCy.
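
另外补充一个依存句法分析的简单示意(基于 token.dep_ 和 token.head 接口,输出结果依所加载的模型而定):

import spacy

nlp = spacy.load('en')
test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")
for token in test_doc:
    # 输出每个词、它的依存关系标签以及它的中心词(head)
    print(token.text, token.dep_, token.head.text)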

参考:
spaCy官方文档
Getting Started with spaCy

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:http://www.52nlp.cn

本文链接地址:自然语言处理工具包spaCy介绍 http://www.52nlp.cn/?p=9386

Mecab安装过程中的一些坑

先说一点题外话,最近发现 Linode 因为庆祝13周年活动将所有的Plan加了一倍,又来了一次加量不加价,这一下子和别的产品拉开了差距,可惜目前Linode日本节点并不参加活动,否则52nlp目前所用的这台 Linode 主机性能就可以翻倍了。不过还是搞了一台 Linode 8GB(8G内存,4核,96G SSD硬盘容量) 的VPS套餐(40$/mo),选择了美国西部的 Fremont 节点,据说国内连接速度很不错。在上面选择了64位的Ubuntu14.04 版本,但是在这个环境下安装Mecab的过程中接连踩了几个坑,所以记录一下。

==============================================================================
Update: 2017.03.21

近期又试了一下Ubuntu上基于apt-get的安装方式,非常方便,如果不想踩下面源代码编译安装的坑,推荐这种方式,参考自:https://gist.github.com/YoshihitoAso/9048005

$ sudo apt-get install mecab libmecab-dev mecab-ipadic
$ sudo apt-get install mecab-ipadic-utf8
$ sudo apt-get install python-mecab

注意其中mecab-ipadic 和 mecab-ipadic-utf8 是日文词典和模型,可以选择安装或者不安装,基于需求而定。剩下的用法和之前的一样,选定一个中文词典和模型,使用即可。

==============================================================================

这里曾写过“Mecab中文分词”系列文章,也在github上发布过一个中文分词项目 MeCab-Chinese:Chinese morphological analysis with Word Segment and POS Tagging data for MeCab ,但是这个过程中没有怎么写到Mecab安装的问题,因为之前觉得rickjin的这篇《日文分词器 Mecab 文档》应该足够参考,自己当时也在Mac OS和Ubuntu环境下安装成功并测试,印象貌似不是太复杂。这次在Ubuntu 14.04的环境安装的时候,遇到了几个小坑,记录一下,做个备忘,仅供参考。

QA问答系统中的深度学习技术实现

应用场景

智能问答机器人火得不行,开始研究深度学习在NLP领域的应用已经有一段时间,最近在用深度学习模型直接进行QA系统的问答匹配。主流的还是CNN和LSTM,在网上没有找到特别合适的可用的代码,自己先写了一个CNN的(theano),效果还行,跟论文中的结论是吻合的。目前已经应用到了我们的产品上。

原理

参看《Applying Deep Learning To Answer Selection: A Study And An Open Task》,文中比较了好几种网络结构,选择了效果相对较好的其中一个来实现,网络描述如下:

(图:问答匹配网络结构示意)

Q&A共用一个网络,网络中包括HL,CNN,P+T和Cosine_Similarity,HL是一个g(W*X+b)的非线性变换,CNN就不说了,P是max_pooling,T是激活函数Tanh,最后的Cosine_Similarity表示将Q&A输出的语义表示向量进行相似度计算。

详细描述下从输入到输出的矩阵变换过程:

  1. Qp:[batch_size, sequence_len],Qp是Q之前的一个表示(在上图中没有画出)。所有句子需要截断或padding到一个固定长度(因为后面的CNN一般是处理固定长度的矩阵),例如句子包含3个字ABC,我们选择固定长度sequence_len为100,则需要将这个句子padding成ABC<a><a>...<a>(100个字),其中的<a>就是添加的专门用于padding的无意义的符号。训练时都是做mini-batch的,所以这里是一个batch_size行的矩阵,每行是一个句子。
  2. Q:[batch_size, sequence_len, embedding_size]。句子中的每个字都需要转换成对应的字向量,字向量的维度大小是embedding_size,这样Qp就从一个2维的矩阵变成了3维的Q
  3. HL层输出:[batch_size, embedding_size, hl_size]。HL层:[embedding_size, hl_size],Q中的每个句子会通过和HL层的点积进行变换,相当于将每个字的字向量从embedding_size大小变换到hl_size大小。
  4. CNN+P+T输出:[batch_size, num_filters_total]。CNN的filter大小是[filter_size, hl_size],列大小是hl_size,这个和字向量的大小是一样的,所以对每个句子而言,每个filter出来的结果是一个列向量(而不是矩阵),列向量再取max-pooling就变成了一个数字,每个filter输出一个数字,num_filters_total个filter出来的结果当然就是[num_filters_total]大小的向量,这样就得到了一个句子的语义表示向量。T就是在输出结果上加上Tanh激活函数。
  5. Cosine_Similarity:[batch_size]。最后的一层并不是通常的分类或者回归的方法,而是采用了计算两个向量(Q&A)夹角的方法,网络的损失函数为 L = max(0, m - cosine(VQ, VA+) + cosine(VQ, VA-)),其中 m 是需要设定的参数margin,VQ、VA+、VA-分别是问题、正向答案、负向答案对应的语义表示向量。损失函数的意义就是:让正向答案和问题之间的向量cosine值要大于负向答案和问题的向量cosine值,大多少,就是由margin这个参数来定义的。cosine值越大,两个向量越相近,所以通俗的说这个Loss就是要让正向的答案和问题愈来愈相似,让负向的答案和问题越来越不相似(下面给出一段简化的示意代码)。
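
下面用一段 numpy 伪码给出这个损失函数的简化示意(非原文的 theano 实现,margin 取文末提到的 0.05):

import numpy as np

def cosine(a, b):
    # 两个语义表示向量的余弦相似度
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def margin_loss(v_q, v_a_pos, v_a_neg, margin=0.05):
    # Loss = max(0, margin - cos(Q, A+) + cos(Q, A-))
    return max(0.0, margin - cosine(v_q, v_a_pos) + cosine(v_q, v_a_neg))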

实现

代码点击这里,使用的数据是一份英文的insuranceQA,下面介绍代码重点部分:

字向量。本文采用字向量的方法,没有使用词向量。使用字向量的目的主要是为了解决未登录词的问题,这样在测试的时候就很少会遇到Unknown的字向量的问题了。而且字向量的效果也不一定比词向量的效果差,还省去了分词的各种麻烦。先用word2vec生成一份字向量,相当于我们在做pre-training了(之后测试了随机初始化字向量的方法,效果差不多)

原理中的步骤2。这里没有做HL层的变换,实际测试中,增加HL层有非常非常小的提升,所以在这里就省去了该步骤。

CNN可以设置多种大小的filter,最后各种filter的结果会拼接起来。

原理中的步骤4。这里执行卷积,max-pooling和Tanh激活。

生成的outputs_1是一个python的list,使用concatenate将list中的多个tensor拼接起来(list中的每个tensor表示一种大小的filter卷积的结果)。

原理中的步骤5。计算问题、正向答案、负向答案的向量夹角

生成Loss损失函数和Accuracy。

核心的网络构建代码就是这些,其他的代码都是训练数据、验证数据的读入,以及theano构建训练时的一些常规代码。

如果需要增加HL层,可参照如下的示意。Whl即是HL层的网络,将input和Whl点积即可。
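
这里给出一个等价的 numpy 示意(非原始的 theano 代码):

import numpy as np

def hidden_layer(X, Whl, bhl):
    # X: [batch_size, embedding_size], Whl: [embedding_size, hl_size]
    # HL 层即 g(W*X + b) 的非线性变换,这里 g 取 tanh
    return np.tanh(np.dot(X, Whl) + bhl)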

dropout的实现。
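
下面是训练阶段 dropout 的一个 numpy 简化示意(非原始代码,keep_prob 即保留概率,文中取 0.5):

import numpy as np

def dropout(X, keep_prob=0.5):
    # 以 keep_prob 的概率保留神经元,并按保留概率做缩放
    mask = (np.random.rand(*X.shape) < keep_prob) / keep_prob
    return X * mask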

结果

使用上面的代码,Test 1的Top-1 Accuracy可以达到61%-62%,和论文中的结论基本一致了,至于论文中提到的GESD、AESD等方法没有再测试了,运行较慢,其他数据集也没有再测试了。

下面是国外友人用一个叫keras的工具(封装了theano和tensorflow)弄的类似代码,Test 1的Top-1准确率在50%左右,我们上面的结果要比他这个高:)

http://benjaminbolte.com/blog/2016/keras-language-modeling.html

Test set    Top-1 Accuracy    Mean Reciprocal Rank
Test 1      0.4933            0.6189
Test 2      0.4606            0.5968
Dev         0.4700            0.6088

另外,原始的insuranceQA需要进行一些处理才能在这个代码上使用,具体参看github上的说明吧。

一些技巧

  1. 字向量和词向量的效果相当。所以优先使用字向量,省去了分词的麻烦,还能更好的避免未登录词的问题,何乐而不为。
  2. 字向量不是固定的,在训练中会更新
  3. Dropout的使用对最高的准确率没有很大的影响,但是使用了Dropout的结果更稳定,准确率的波动会更小,所以建议还是要使用Dropout的。不过Dropout也不宜过度使用,比如Dropout的keep_prob概率如果设置到0.25,则模型收敛得更慢,训练时间长很多,效果也有可能会更差,甚至会差很多。我这版代码使用的keep_prob为0.5,同时兼顾准确率和训练时间。另外,Dropout只应用到了max-pooling的结果上,其他地方没有再使用了,过多的使用反而不好。
  4. 如何生成训练集。每个训练case需要一个问题+一个正向答案+一个负向答案,很明显问题和正向答案都是有的,负向答案的生成方法就是随机采样,这样就不需要涉及任何人工标注工作了,可以很方便的应用到大数据集上。
  5. HL层的效果不明显,有很微量的提升。如果HL层的大小是200,字向量是100,则HL层相当于将字向量再放大一倍,这个感觉没有多少信息可利用的,还不如直接将字向量设置成200,还省去了HL这一层的变换。
  6. margin的值一般都设置得比较小。这里用的是0.05
  7. 如果将Cosine_similarity这一层换成分类或者回归,印象中效果是不如Cosine_similarity的(具体数据忘了)
  8. num_filters越大并不是效果越好,基本到了一定程度就很难提升了,反而会降低训练速度。
  9. 同时也写了tensorflow版本代码,对比theano的,效果差不多
  10. Adam和SGD两种训练方法比较,Adam训练速度貌似会更快一些,效果基本也持平吧,没有太细节的对比。不过同样的网络+SGD,theano好像训练要更快一些。
  11. Loss和Accuracy是比较重要的监控参数。如果写一个新的网络的话,类似的指标是很有必要的,可以在每个迭代中评估网络是否正在收敛。因为调试比较麻烦,所以通过这些参数能评估你的网络写对没,参数设置是否正确。
  12. 网络的参数还是比较重要的,如果一些参数设置不合理,很有可能结果千差万别,记得最初用tensorflow实现的时候,应该是dropout设置得太小,导致效果很差,很久才找到原因。所以调参和微调网络还是需要一定的技巧和经验的,做这版代码的时候就经历了一段比较痛苦的调参过程,最开始还怀疑是网络设计或是代码有问题,最后总结应该就是参数没设置好。

结语

如果关注这个东西的人多的话,后面还可以有tensorflow版本的QA CNN,以及LSTM的代码奉上:)

补充

tensorflow的CNN代码已添加到github上,点击这里

Contact: jiangwen127@gmail.com weibo:码坛奥沙利文

达观数据搜索引擎的Query自动纠错技术和架构详解

1 背景

如今,搜索引擎是人们获取信息最重要的方式之一,在搜索页面小小的输入框中,只需输入几个关键字,就能找到你感兴趣问题的相关网页。搜索巨头Google,甚至已经使Google这个创造出来的单词成为动词,有问题Google一下就可以。在国内,百度也同样成为一个动词。除了通用搜索需求外,很多垂直细分领域的搜索需求也很旺盛,比如电商网站的产品搜索,文学网站的小说搜索等。面对这些需求,达观数据(www.datagrand.com)作为国内提供中文云搜索服务的高科技公司,为合作伙伴提供高质量的搜索技术服务,并提供搜索服务的统计分析等功能。(达观数据联合创始人高翔)

搜索引擎系统最基本最核心的功能是信息检索,找到含有关键字的网页或文档,然后按照一定排序将结果给出。在此基础之上,搜索引擎能够提供更多更复杂的功能来提升用户体验。对于一个成熟的搜索引擎系统,用户看似简单的搜索过程,需要在系统中经过多个环节,多个模块协同工作,才能提供一个让人满意的搜索结果。其中拼写纠错(Error Correction,以下简称EC)是用户比较容易感知的一个功能,比如百度的纠错功能如下图所示:

图 1:百度纠错功能示例

EC其实是属于Query Rewrite(以下简称QR)模块中的一个功能,QR模块包括拼写纠错,同义改写,关联query等多个功能。QR模块对于提升用户体验有着巨大的帮助,对于搜索质量不佳的query进行改写后能返回更好的搜索结果。QR模块内容较多,以下着重介绍EC功能。