Tag Archives: nltk

A Roundup of Python Courses (MOOCs) on Coursera: From Learning Python to Applying Python


Python is the language of the deep learning era, and Coursera offers many Python courses, from beginner to advanced, from basic syntax to applied Python, for learners at every level. Below is a roundup of the Python courses on Coursera, for reference; it will be kept up to date.

1. University of Michigan: Python for Everybody Specialization

This series assumes essentially no programming background and almost no mathematics, making it an excellent entry point to Python. It is also the most popular Python series on Coursera and comes highly recommended. Its stated goal is to "Learn to Program and Analyze Data with Python. Develop programs to gather, clean, analyze, and visualize data." A brief introduction to the series:

This Specialization builds on the success of the Python for Everybody course and will introduce fundamental programming concepts including data structures, networked application program interfaces, and databases, using the Python programming language. In the Capstone Project, you’ll use the technologies learned throughout the Specialization to design and create your own applications for data retrieval, processing, and visualization.

The specialization consists of four courses plus a capstone project, covering Python basics, Python data structures, retrieving web data with Python (web crawling), using databases with Python, and visualizing data with Python. The individual courses:

1.1 Programming for Everybody (Getting Started with Python)

An entry-level Python course. If you have no programming background at all, start here:

This course aims to teach everyone the basics of programming computers using Python. We cover the basics of how one constructs a program from a series of simple instructions in Python. The course has no pre-requisites and avoids all but the simplest mathematics. Anyone with moderate computer experience should be able to master the materials in this course. This course will cover Chapters 1-5 of the textbook “Python for Everybody”. Once a student completes this course, they will be ready to take more advanced programming courses. This course covers Python 3.

1.2 Python Data Structures

A foundational Python course whose goal is to introduce the core data structures of the Python programming language. About the course:

This course will introduce the core data structures of the Python programming language. We will move past the basics of procedural programming and explore how we can use the Python built-in data structures such as lists, dictionaries, and tuples to perform increasingly complex data analysis. This course will cover Chapters 6-10 of the textbook “Python for Everybody”. This course covers Python 3.

1.3 Using Python to Access Web Data

An applied Python course; you only really learn Python by putting it to use. The goal here is to show how to treat the Internet as a source of data by scraping and parsing web pages:

This course will show how one can treat the Internet as a source of data. We will scrape, parse, and read web data as well as access data using web APIs. We will work with HTML, XML, and JSON data formats in Python. This course will cover Chapters 11-13 of the textbook “Python for Everybody”. To succeed in this course, you should be familiar with the material covered in Chapters 1-10 of the textbook and the first two courses in this specialization. These topics include variables and expressions, conditional execution (loops, branching, and try/except), functions, Python data structures (strings, lists, dictionaries, and tuples), and manipulating files. This course covers Python 3.

1.4 Using Databases with Python

An applied Python course on using databases from Python. The goal is to learn SQL within Python, using SQLite3 as the database for storing gathered data:

This course will introduce students to the basics of the Structured Query Language (SQL) as well as basic database design for storing data as part of a multi-step data gathering, analysis, and processing effort. The course will use SQLite3 as its database. We will also build web crawlers and multi-step data gathering and visualization processes. We will use the D3.js library to do basic data visualization. This course will cover Chapters 14-15 of the book “Python for Everybody”. To succeed in this course, you should be familiar with the material covered in Chapters 1-13 of the textbook and the first three courses in this specialization. This course covers Python 3.

1.5 Capstone: Retrieving, Processing, and Visualizing Data with Python

A hands-on applied course and the capstone of the series: students build a series of Python applications to become familiar with the full workflow of retrieving, processing, and visualizing data with Python.

In the capstone, students will build a series of applications to retrieve, process and visualize data using Python. The projects will involve all the elements of the specialization. In the first part of the capstone, students will do some visualizations to become familiar with the technologies in use and then will pursue their own project to visualize some other data that they have or can find. Chapters 15 and 16 from the book “Python for Everybody” will serve as the backbone for the capstone. This course covers Python 3.

2. University of Toronto: Learn to Program: The Fundamentals

An entry-level Python course that teaches introductory programming through Python; effectively a zero-prerequisite Python course. See the reviews of the earlier edition on Course Graph: http://coursegraph.com/coursera_programming1 . One student wrote: "Both instructors speak slowly and explain things carefully, and the Python Visualizer tool lets you step through exactly how a program executes; arguably the best choice for learning Python from scratch."

Behind every mouse click and touch-screen tap, there is a computer program that makes things happen. This course introduces the fundamental building blocks of programming and teaches you how to write fun and useful programs using the Python language.

3. Rice University: Introduction to Scripting in Python Specialization

An entry-level Python series covering Python basics, data representations, data analysis, and data visualization; a good fit for Python beginners. The goal is to enable students to use Python to solve practical problems: "Launch Your Career in Python Programming. Master the core concepts of scripting in Python to enable you to solve practical problems."

This specialization is intended for beginners who would like to master essential programming skills. Through four courses, you will cover key programming concepts in Python 3 which will prepare you to use Python to perform common scripting tasks. This knowledge will provide a solid foundation towards a career in data science, software engineering, or other disciplines involving programming.

The series consists of four courses:

3.1 Python Programming Essentials

An introductory Python course covering the essentials of Python programming, including expressions, variables, and functions, with the goal of making you comfortable in Python:

This course will introduce you to the wonderful world of Python programming! We'll learn about the essential elements of programming and how to construct basic Python programs. We will cover expressions, variables, functions, logic, and conditionals, which are foundational concepts in computer programming. We will also teach you how to use Python modules, which enable you to benefit from the vast array of functionality that is already a part of the Python language. These concepts and skills will help you to begin to think like a computer programmer and to understand how to go about writing Python programs. By the end of the course, you will be able to write short Python programs that are able to accomplish real, practical tasks. This course is the foundation for building expertise in Python programming. As the first course in a specialization, it provides the necessary building blocks for you to succeed at learning to write more complex Python programs. This course uses Python 3. While many Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This first course will use a Python 3 version of the CodeSkulptor development environment, which is specifically designed to help beginning programmers learn quickly. CodeSkulptor runs within any modern web browser and does not require you to install any software, allowing you to start writing and running small programs immediately. In the later courses in this specialization, we will help you to move to more sophisticated desktop development environments.

3.2 Python Data Representations

An introductory Python course that continues with Python fundamentals, including strings, lists, and tuples, as well as file operations:

This course will continue the introduction to Python programming that started with Python Programming Essentials. We'll learn about different data representations, including strings, lists, and tuples, that form the core of all Python programs. We will also teach you how to access files, which will allow you to store and retrieve data within your programs. These concepts and skills will help you to manipulate data and write more complex Python programs. By the end of the course, you will be able to write Python programs that can manipulate data stored in files. This will extend your Python programming expertise, enabling you to write a wide range of scripts using Python. This course uses Python 3. While most Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This course introduces basic desktop Python development environments, allowing you to run Python programs directly on your computer. This choice enables a smooth transition from online development environments.

3.3 Python Data Analysis

A foundational Python course on reading and analyzing tabular and structured data in Python, for example CSV files:

This course will continue the introduction to Python programming that started with Python Programming Essentials and Python Data Representations. We'll learn about reading, storing, and processing tabular data, which are common tasks. We will also teach you about CSV files and Python's support for reading and writing them. CSV files are a generic, plain text file format that allows you to exchange tabular data between different programs. These concepts and skills will help you to further extend your Python programming knowledge and allow you to process more complex data. By the end of the course, you will be comfortable working with tabular data in Python. This will extend your Python programming expertise, enabling you to write a wider range of scripts using Python. This course uses Python 3. While most Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This course uses basic desktop Python development environments, allowing you to run Python programs directly on your computer.

3.4 Python Data Visualization

An applied Python course that builds on the previous three: acquire data from the web, then clean, process, analyze, and finally visualize it:

This is the final course in the specialization, which builds upon the knowledge learned in Python Programming Essentials, Python Data Representations, and Python Data Analysis. We will learn how to install external packages for use within Python, acquire data from sources on the Web, and then we will clean, process, analyze, and visualize that data. This course will combine the skills learned throughout the specialization to enable you to write interesting, practical, and useful programs. By the end of the course, you will be comfortable installing Python packages, analyzing existing data, and generating visualizations of that data. This course will complete your education as a scripter, enabling you to locate, install, and use Python packages written by others. You will be able to effectively utilize tools and packages that are widely available to amplify your effectiveness and write useful programs.

4. Rice University: Fundamentals of Computing Specialization

An entry-level Python programming series that covers much of the material taught to first-year Computer Science students at Rice University. Students learn modern programming techniques in Python and apply them across roughly twenty engaging programming projects.

This Specialization covers much of the material that first-year Computer Science students take at Rice University. Students learn sophisticated programming skills in Python from the ground up and apply these skills in building more than 20 fun projects. The Specialization concludes with a Capstone exam that allows the students to demonstrate the range of knowledge that they have acquired in the Specialization.

The series comprises six courses spanning interactive Python programming, principles of computing, and algorithmic thinking, plus a capstone exam course, with the goal of teaching students to program and think like a computer scientist. The individual courses:

4.1 An Introduction to Interactive Programming in Python (Part 1)

An entry-level Python course covering programming basics such as expressions, conditionals, and functions, which are then used to build a simple interactive application.

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to-learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard and the mouse. In part 1 of this course, we will introduce the basic elements of programming (such as expressions, conditionals, and functions) and then use these elements to create simple interactive applications such as a digital stopwatch. Part 1 of this class will culminate in building a version of the classic arcade game "Pong".

4.2 An Introduction to Interactive Programming in Python (Part 2)

An entry-level Python course continuing with Python fundamentals such as lists, dictionaries, and loops, which are used to build simple games such as Blackjack:

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to-learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard and the mouse. In part 2 of this course, we will introduce more elements of programming (such as lists, dictionaries, and loops) and then use these elements to create games such as Blackjack. Part 2 of this class will culminate in building a version of the classic arcade game "Asteroids". Upon completing this course, you will be able to write small, but interesting Python programs. The next course in the specialization will begin to introduce a more principled approach to writing programs and solving computational problems that will allow you to write larger and more complex programs.

4.3 Principles of Computing (Part 1)

A foundational programming course focusing on programming fundamentals, including coding standards and testing, together with mathematical foundations such as probability and combinatorics.

This two-part course builds upon the programming skills that you learned in our Introduction to Interactive Programming in Python course. We will augment those skills with both important programming practices and critical mathematical problem solving skills. These skills underlie larger scale computational problem solving and programming. The main focus of the class will be programming weekly mini-projects in Python that build upon the mathematical and programming principles that are taught in the class. To keep the class fun and engaging, many of the projects will involve working with strategy-based games. In part 1 of this course, the programming aspect of the class will focus on coding standards and testing. The mathematical portion of the class will focus on probability, combinatorics, and counting with an eye towards practical applications of these concepts in Computer Science. Recommended Background - Students should be comfortable writing small (100+ line) programs in Python using constructs such as lists, dictionaries and classes and also have a high-school math background that includes algebra and pre-calculus.

4.4 Principles of Computing (Part 2)

A foundational programming course focusing on topics such as searching, sorting, and recursion:

This two-part course introduces the basic mathematical and programming principles that underlie much of Computer Science. Understanding these principles is crucial to the process of creating efficient and well-structured solutions for computational problems. To get hands-on experience working with these concepts, we will use the Python programming language. The main focus of the class will be weekly mini-projects that build upon the mathematical and programming principles that are taught in the class. To keep the class fun and engaging, many of the projects will involve working with strategy-based games. In part 2 of this course, the programming portion of the class will focus on concepts such as recursion, assertions, and invariants. The mathematical portion of the class will focus on searching, sorting, and recursive data structures. Upon completing this course, you will have a solid foundation in the principles of computation and programming. This will prepare you for the next course in the specialization, which will begin to introduce a structured approach to developing and analyzing algorithms. Developing such algorithmic thinking skills will be critical to writing large scale software and solving real world computational problems.

4.5 Algorithmic Thinking (Part 1)

A foundational course on developing algorithmic thinking, covering graph algorithm concepts and their implementation in Python:

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part course builds on the principles that you learned in our Principles of Computing course and is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to real-world computational problems. In part 1 of this course, we will study the notion of algorithmic efficiency and consider its application to several problems from graph theory. As the central part of the course, students will implement several important graph algorithms in Python and then use these algorithms to analyze two large real-world data sets. The main focus of these tasks is to understand interaction between the algorithms and the structure of the data sets being analyzed by these algorithms. Recommended Background - Students should be comfortable writing intermediate size (300+ line) programs in Python and have a basic understanding of searching, sorting, and recursion. Students should also have a solid math background that includes algebra, pre-calculus and a familiarity with the math concepts covered in "Principles of Computing".

4.6 Algorithmic Thinking (Part 2)

A foundational course that continues developing students' algorithmic thinking and covers advanced algorithmic topics such as divide-and-conquer and dynamic programming:

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part class is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to computational problems. In part 2 of this course, we will study advanced algorithmic techniques such as divide-and-conquer and dynamic programming. As the central part of the course, students will implement several algorithms in Python that incorporate these techniques and then use these algorithms to analyze two large real-world data sets. The main focus of these tasks is to understand interaction between the algorithms and the structure of the data sets being analyzed by these algorithms. Once students have completed this class, they will have both the mathematical and programming skills to analyze, design, and program solutions to a wide range of computational problems. While this class will use Python as its vehicle of choice to practice Algorithmic Thinking, the concepts that you will learn in this class transcend any particular programming language.

4.7 The Fundamentals of Computing Capstone Exam

An applied Python course: building on the preceding six courses, in which students have already completed 20+ Python projects, the capstone concludes the specialization with an exam:

While most specializations on Coursera conclude with a project-based course, students in the "Fundamentals of Computing" specialization have completed more than 20 projects during the first six courses of the specialization. Given that much of the material in these courses is reused from session to session, our goal in this capstone class is to provide a conclusion to the specialization that allows each student an opportunity to demonstrate their individual mastery of the material in the specialization. With this objective in mind, the focus in this Capstone class will be an exam whose questions are updated periodically. This approach is designed to help ensure that each student is solving the exam problems on his/her own without outside help. For students that have done their own work, we do not anticipate that the exam will be particularly hard. However, those students who have relied too heavily on outside help in previous classes may have a difficult time. We believe that this approach will increase the value of the Certificate for this specialization.

5. University of Michigan: Applied Data Science with Python Specialization

An applied Python series whose goal is to introduce the data science landscape through the Python programming language, covering applied statistics, machine learning, information visualization, text analysis, and social network analysis, with popular Python toolkits such as pandas, matplotlib, scikit-learn, nltk, and networkx.

The 5 courses in this University of Michigan specialization introduce learners to data science through the python programming language. This skills-based specialization is intended for learners who have a basic python or programming background, and want to apply statistical, machine learning, information visualization, text analysis, and social network analysis techniques through popular python toolkits such as pandas, matplotlib, scikit-learn, nltk, and networkx to gain insight into their data. Introduction to Data Science in Python (course 1), Applied Plotting, Charting & Data Representation in Python (course 2), and Applied Machine Learning in Python (course 3) should be taken in order and prior to any other course in the specialization. After completing those, courses 4 and 5 can be taken in any order. All 5 are required to earn a certificate.

The series contains five courses: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, Applied Machine Learning in Python, Applied Text Mining in Python, and Applied Social Network Analysis in Python. The individual courses:

5.1 Introduction to Data Science in Python

A Python fundamentals and applications course: it starts from Python basics, then uses the pandas data science library to introduce the DataFrame as the central data structure for data analysis, teaching students to manipulate and analyze tabular data and to run basic inferential statistical analyses.

This course will introduce the learner to the basics of the python programming environment, including how to download and install python, expected fundamental python programming techniques, and how to find help with python programming questions. The course will also introduce data manipulation and cleaning techniques using the popular python pandas data science library and introduce the abstraction of the DataFrame as the central data structure for data analysis. The course will end with a statistics primer, showing how various statistical measures can be applied to pandas DataFrames. By the end of the course, students will be able to take tabular data, clean it, manipulate it, and run basic inferential statistical analyses. This course should be taken before any of the other Applied Data Science with Python courses: Applied Plotting, Charting & Data Representation in Python, Applied Machine Learning in Python, Applied Text Mining in Python, Applied Social Network Analysis in Python.

5.2 Applied Plotting, Charting & Data Representation in Python

An applied Python course focused on plotting and visualizing data with the matplotlib library:

This course will introduce the learner to information visualization basics, with a focus on reporting and charting using the matplotlib library. The course will start with a design and information literacy perspective, touching on what makes a good and bad visualization, and what statistical measures translate into in terms of visualizations. The second week will focus on the technology used to make visualizations in python, matplotlib, and introduce users to best practices when creating basic charts and how to realize design decisions in the framework. The third week will describe the gamut of functionality available in matplotlib, and demonstrate a variety of basic statistical charts helping learners to identify when a particular method is good for a particular problem. The course will end with a discussion of other forms of structuring and visualizing data. This course should be taken after Introduction to Data Science in Python and before the remainder of the Applied Data Science with Python courses: Applied Machine Learning in Python, Applied Text Mining in Python, and Applied Social Network Analysis in Python.

5.3 Applied Machine Learning in Python

An applied Python course on machine learning with Python, covering how machine learning differs from statistics, an introduction to the scikit-learn toolkit, supervised and unsupervised learning, and data generalization issues such as cross-validation and overfitting.

This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different than descriptive statistics, and introduce the scikit learn toolkit. The issue of dimensionality of data will be discussed, and the task of clustering data, as well as evaluating those clusters, will be tackled. Supervised approaches for creating predictive models will be described, and learners will be able to apply the scikit learn predictive modelling methods while understanding process issues related to data generalizability (e.g. cross validation, overfitting). The course will end with a look at more advanced techniques, such as building ensembles, and practical limitations of predictive models. By the end of this course, students will be able to identify the difference between a supervised (classification) and unsupervised (clustering) technique, identify which technique they need to apply for a particular dataset and need, engineer features to meet that need, and write python code to carry out an analysis. This course should be taken after Introduction to Data Science in Python and Applied Plotting, Charting & Data Representation in Python and before Applied Text Mining in Python and Applied Social Network Analysis in Python.

5.4 Applied Text Mining in Python

An applied Python course focused on the basics of text mining and text analysis, including regular expressions, text cleaning, and text preprocessing, and using NLTK to teach natural language processing topics such as text classification and topic modeling.

This course will introduce the learner to text mining and text manipulation basics. The course begins with an understanding of how text is handled by python, the structure of text both to the machine and to humans, and an overview of the nltk framework for manipulating text. The second week focuses on common manipulation needs, including regular expressions (searching for text), cleaning text, and preparing text for use by machine learning processes. The third week will apply basic natural language processing methods to text, and demonstrate how text classification is accomplished. The final week will explore more advanced methods for detecting the topics in documents and grouping them by similarity (topic modelling). This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.

5.5 Applied Social Network Analysis in Python

An applied Python course introducing social network analysis through the NetworkX library.

This course will introduce the learner to network analysis through the NetworkX library. The course begins with an understanding of what network analysis is and motivations for why we might model phenomena as networks. The second week introduces the concept of connectivity and network robustness. The third week will explore ways of measuring the importance or centrality of a node in a network. The final week will explore the evolution of networks over time and cover models of network generation and the link prediction problem. This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.

You can keep digging for new Python courses on Coursera via Course Graph, and recommendations here are welcome.

Note: this article first appeared on the Course Graph blog (http://blog.coursegraph.com) and is cross-posted here. Original link: http://blog.coursegraph.com/coursera%E4%B8%8Apython%E8%AF%BE%E7%A8%8B%EF%BC%88%E5%85%AC%E5%BC%80%E8%AF%BE%EF%BC%89%E6%B1%87%E6%80%BB%E6%8E%A8%E8%8D%90%EF%BC%9A%E4%BB%8Epython%E5%85%A5%E9%97%A8%E5%88%B0%E5%BA%94%E7%94%A8python

Exploring Word Similarity in a Wikipedia Corpus


I previously wrote "Word2Vec Experiments on Chinese and English Wikipedia Corpora" (《中英文维基百科语料上的Word2Vec实验》). Recently quite a few readers have left questions under that article, and some of my recent work also involves Word2Vec, so I did some homework: re-read the relevant Word2Vec material, tried gensim's updated interfaces, and googled English and Chinese resources for "wikipedia word2vec". Most of them follow the same route as that old article: extract the Wikipedia corpus with gensim's preprocessing script gensim.corpora.WikiCorpus, store one article per line, and then train a word vector model with gensim's Word2Vec module. Here I'll offer a different way to process the Wikipedia corpus, train a word vector model, and compute word similarity. As for Word2Vec itself, if your English is good, this article is a recommended entry point to the literature: Getting started with Word2Vec

This time we will use only the English Wikipedia corpus as an example. As before, the first step is to download the latest packaged XML dump: in the list of the latest English dumps at https://dumps.wikimedia.org/enwiki/latest/ , find and download "enwiki-latest-pages-articles.xml.bz2". This full English Wikipedia dump was packaged around April 4, 2017 and is about 13 GB; downloading it with wget over my home 100 Mbps connection took about 3 hours, which is not bad.
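The download itself is a single command (the file name matches the dump listed above):

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2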

The next step is to process this compressed XML English Wikipedia corpus; this time we use WikiExtractor:

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.

WikiExtractor is a Python script dedicated to extracting and cleaning text from Wikipedia dump data. It supports Python 2.7 or Python 3.3+ with no additional dependencies, and is easy to install and use:

Install:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install

Usage:
WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2

......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)

The whole process took a bit over two hours and extracted roughly 5.37 million articles. For my machine configuration, see "Notes on Building a Deep Learning Machine" (《深度学习主机攒机小记》).

The extracted files are split, in order, across a number of subdirectories:
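Roughly like this (an illustrative listing; by default WikiExtractor names the subdirectories AA, AB, AC, ... and writes up to 100 files per directory):

$ ls enwiki
AA  AB  AC  AD  AE  ...
$ ls enwiki/AA
wiki_00  wiki_01  wiki_02  ...  wiki_99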

Each subdirectory in turn contains a number of files named wiki_NN, each around 1 MB; this size can be controlled with the -b option:

-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)

Let's look at what's inside wiki_00:


Anarchism

Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.



Autism

Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...

...

Each wiki_NN file in turn holds a number of docs, and each doc carries tags including id, url, and title, so they are easy to tell apart.
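Concretely, each article is wrapped in a doc tag pair along these lines (the id and url values here are illustrative):

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy that advocates self-governed societies...
</doc>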

Here we process the English Wikipedia data in the "memory-friendly iterator" style from the gensim author's word2vec tutorial. The code is below and is also on github: train_word2vec_with_gensim.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen

import gensim
import logging
import multiprocessing
import os
import re
import sys

from pattern.en import tokenize
from time import time

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)


def cleanhtml(raw_html):
    # Strip any residual HTML tags from a line of text
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext


class MySentences(object):
    # Memory-friendly iterator: walk the WikiExtractor output directory
    # and yield one tokenized, lowercased word list at a time.
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = root + '/' + filename
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    rline = cleanhtml(sline)
                    tokenized_line = ' '.join(tokenize(rline))
                    # Keep only purely alphabetic tokens, dropping
                    # punctuation and non-English words
                    is_alpha_word_line = [word for word in
                                          tokenized_line.lower().split()
                                          if word.isalpha()]
                    yield is_alpha_word_line


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Please use python train_word2vec_with_gensim.py data_path"
        exit()
    data_path = sys.argv[1]
    begin = time()

    sentences = MySentences(data_path)
    # 200-dimensional vectors, window size 10, ignore words seen < 10 times
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("data/model/word2vec_gensim")
    model.wv.save_word2vec_format("data/model/word2vec_org",
                                  "data/model/vocabulary",
                                  binary=False)

    end = time()
    print "Total processing time: %d seconds" % (end - begin)

Note that word tokenization above uses the English tokenize module from pattern; NLTK's word_tokenize module works too with a small change, as sketched above, although nltk handles the tokenization of some sentence-final words less well. We set the word vector dimensionality to 200, the window size to 10, and the minimum count to 10, and filter out punctuation and non-English tokens via isalpha(). Now we can use this script to train the Word2Vec model on English Wikipedia:
python train_word2vec_with_gensim.py enwiki

2017-04-22 14:31:04,703 : INFO : collecting all words and their counts
2017-04-22 14:31:04,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-22 14:31:06,442 : INFO : PROGRESS: at sentence #10000, processed 480546 words, keeping 33925 word types
2017-04-22 14:31:08,104 : INFO : PROGRESS: at sentence #20000, processed 983240 words, keeping 51765 word types
2017-04-22 14:31:09,685 : INFO : PROGRESS: at sentence #30000, processed 1455218 words, keeping 64982 word types
2017-04-22 14:31:11,349 : INFO : PROGRESS: at sentence #40000, processed 1957479 words, keeping 76112 word types
......
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-23 02:50:59,854 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-23 02:50:59,854 : INFO : training on 8903084745 raw words (6742578791 effective words) took 37805.2s, 178351 effective words/s
2017-04-23 02:50:59,855 : INFO : saving Word2Vec object under data/model/word2vec_gensim, separately None                                                       
2017-04-23 02:50:59,855 : INFO : not storing attribute syn0norm                 
2017-04-23 02:50:59,855 : INFO : storing np array 'syn0' to data/model/word2vec_gensim.wv.syn0.npy                                                              
2017-04-23 02:51:00,241 : INFO : storing np array 'syn1neg' to data/model/word2vec_gensim.syn1neg.npy                                                           
2017-04-23 02:51:00,574 : INFO : not storing attribute cum_table                
2017-04-23 02:51:13,886 : INFO : saved data/model/word2vec_gensim               
2017-04-23 02:51:13,886 : INFO : storing vocabulary in data/model/vocabulary    
2017-04-23 02:51:17,480 : INFO : storing 868777x200 projection weights into data/model/word2vec_org                                                             
Total processing time: 44476 seconds

Training took a bit over 12 hours in total; the resulting model files are saved under data/model, as listed in the log above.

Let's test this English Wikipedia Word2Vec model:

textminer@textminer:/opt/wiki/data$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from gensim.models import Word2Vec
 
In [2]: en_wiki_word2vec_model = Word2Vec.load('data/model/word2vec_gensim')

First, let's query the most similar words (word similarity) for a few words:

word:

In [3]: en_wiki_word2vec_model.most_similar('word')
Out[3]: 
[('phrase', 0.8129693269729614),
 ('meaning', 0.7311851978302002),
 ('words', 0.7010501623153687),
 ('adjective', 0.6805518865585327),
 ('noun', 0.6461974382400513),
 ('suffix', 0.6440576314926147),
 ('verb', 0.6319557428359985),
 ('loanword', 0.6262609958648682),
 ('proverb', 0.6240501403808594),
 ('pronunciation', 0.6105246543884277)]

similarity:

In [4]: en_wiki_word2vec_model.most_similar('similarity')
Out[4]: 
[('similarities', 0.8517599701881409),
 ('resemblance', 0.786037266254425),
 ('resemblances', 0.7496883869171143),
 ('affinities', 0.6571112275123596),
 ('differences', 0.6465682983398438),
 ('dissimilarities', 0.6212711930274963),
 ('correlation', 0.6071442365646362),
 ('dissimilarity', 0.6062943935394287),
 ('variation', 0.5970577001571655),
 ('difference', 0.5928016901016235)]

nlp:

In [5]: en_wiki_word2vec_model.most_similar('nlp')
Out[5]: 
[('neurolinguistic', 0.6698148250579834),
 ('psycholinguistic', 0.6388964056968689),
 ('connectionism', 0.6027182936668396),
 ('semantics', 0.5866401195526123),
 ('connectionist', 0.5865628719329834),
 ('bandler', 0.5837364196777344),
 ('phonics', 0.5733655691146851),
 ('psycholinguistics', 0.5613113641738892),
 ('bootstrapping', 0.559638261795044),
 ('psychometrics', 0.5555593967437744)]

learn:

In [6]: en_wiki_word2vec_model.most_similar('learn')
Out[6]: 
[('teach', 0.7533557415008545),
 ('understand', 0.71148681640625),
 ('discover', 0.6749690771102905),
 ('learned', 0.6599283218383789),
 ('realize', 0.6390970349311829),
 ('find', 0.6308424472808838),
 ('know', 0.6171890497207642),
 ('tell', 0.6146825551986694),
 ('inform', 0.6008728742599487),
 ('instruct', 0.5998791456222534)]

man:

In [7]: en_wiki_word2vec_model.most_similar('man')
Out[7]: 
[('woman', 0.7243080735206604),
 ('boy', 0.7029494047164917),
 ('girl', 0.6441491842269897),
 ('stranger', 0.63275545835495),
 ('drunkard', 0.6136815547943115),
 ('gentleman', 0.6122575998306274),
 ('lover', 0.6108279228210449),
 ('thief', 0.609005331993103),
 ('beggar', 0.6083744764328003),
 ('person', 0.597919225692749)]

Now let's look at a few other related interfaces:

In [8]: en_wiki_word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
Out[8]: [('queen', 0.7752252817153931)]
 
In [9]: en_wiki_word2vec_model.similarity('woman', 'man')
Out[9]: 0.72430799548282099
 
In [10]: en_wiki_word2vec_model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'

I have cleaned up the code for this article together with the code from "Word2Vec Experiments on Chinese and English Wikipedia Corpora" and set up a Wikipedia_Word2vec project on github; interested readers can take a look.

Note: original content. When reposting, please credit and keep the link to "我爱自然语言处理" (52nlp): http://www.52nlp.cn

Permalink: Exploring Word Similarity in a Wikipedia Corpus, http://www.52nlp.cn/?p=9454

Practical NLP in Python: Using the Stanford Chinese Word Segmenter with NLTK


The Stanford NLP Group is a world-renowned NLP research group that provides a series of open-source Java text analysis tools, including a Word Segmenter, a Part-Of-Speech Tagger, a Named Entity Recognizer, and a Parser. Better yet, they have trained Chinese models for these tools, so Chinese text processing is supported. While using NLTK I found that the current version already provides interfaces to the Stanford tools for POS tagging, named entity recognition, and parsing, but unfortunately not for the word segmenter. After googling in vain and reading the relevant code, I decided to follow the existing pattern and write a Stanford Chinese segmenter interface for NLTK, making it convenient to call the Stanford text processing tools from Python.

Some preparation is needed first. Step one, of course, is installing NLTK; see the introduction in our gensim article "How to Compute the Similarity of Two Documents". Here, though, I recommend checking out the latest NLTK source on github and installing that version with "python setup.py install": https://github.com/nltk/nltk . This version adds the interface to the Stanford parser, which some older versions lack and which we may cover later; it is also the version to which we added the Stanford segmenter interface, so other versions may have minor issues. Step two is installing a Java runtime; on Ubuntu 12.04, for example, that takes just two commands:

sudo apt-get install default-jre
sudo apt-get install default-jdk

Finally, and most importantly, you need to download the Stanford segmenter files, including the source code, model files, and dictionaries. Note that the Stanford segmenter supports not only Chinese but also Arabic word segmentation; the zip package to download is: Download Stanford Word Segmenter version 2014-08-27. Unzip it after downloading.

With the preparation done, the first question is where in the NLTK source tree to add the interface file. In the nltk source package, the interfaces for the Stanford POS tagger and named entity recognizer live in nltk/tag/stanford.py, and the parser interface in nltk/parse/stanford.py. There is a stanford.py under nltk/tokenize/, but it only wraps PTBTokenizer, an English tokenizer, with nothing for the Stanford segmenter, so I decided to add a stanford_segmenter.py under nltk/tokenize/ as NLTK's interface file for the Stanford Chinese segmenter. These NLTK interfaces rely on the Linux pipe (PIPE) mechanism and the subprocess module; the full source is pasted in the original post for interested readers.
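The interface source itself is elided in this archive view. Purely to illustrate the pipe/subprocess idea, here is a minimal standalone sketch (my own toy example, not the NLTK interface code) that drives the segmenter through the segment.sh script shipped in the downloaded package; the directory name, the ctb (Penn Chinese Treebank) model choice, and the sample sentence are all assumptions:

# -*- coding: utf-8 -*-
# Toy sketch: call the Stanford Word Segmenter via subprocess.
# Assumes the unzipped stanford-segmenter-2014-08-27 directory with its
# bundled segment.sh; "ctb" selects the Penn Chinese Treebank model.
import subprocess
import tempfile

def stanford_segment(text, segmenter_dir='./stanford-segmenter-2014-08-27'):
    with tempfile.NamedTemporaryFile(suffix='.txt') as f:
        f.write(text)          # text is a UTF-8 encoded str (Python 2)
        f.flush()
        # segment.sh usage: segment.sh [ctb|pku] <file> <encoding> <kBest>
        out = subprocess.check_output(
            ['./segment.sh', 'ctb', f.name, 'UTF-8', '0'],
            cwd=segmenter_dir)
    return out.decode('utf-8').split()

print ' '.join(stanford_segment('我爱自然语言处理'))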

A Python Arsenal for Web Crawling, Text Processing, Scientific Computing, Machine Learning, and Data Mining


I originally started learning Python because of NLTK, and it gradually became my primary auxiliary scripting language at work: the main development language was C/C++, but much of the everyday text processing was handed off to Python. After leaving Tencent to start my own company, my first product, Course Graph, was also built on Python's Flask framework, and bit by bit most of my work moved to Python. Over the years I have come across and used a great many Python packages, especially in text processing, scientific computing, machine learning, and data mining, where the ecosystem is full of excellent tools; being a Pythoner is a rather happy thing. If you watch Weibo carefully you will find plenty of sharing along these lines, and googling turns up summaries like "Python machine learning libraries", but they always felt incomplete to me. A trendy term these days is "full stack engineer"; as a hard-pressed founder you naturally have to turn yourself into one, and these Python packages have supplied plenty of firepower along the way, which is what prompted this series. It is of course just a starting point; I hope readers will contribute more leads so we can compile a proper arsenal of Python tools for web crawling, text processing, scientific computing, machine learning, and data mining.

I. Python Web Crawling Tools

A real project almost always starts with acquiring data. Whether for text processing, machine learning, or data mining, you need data; beyond datasets you can buy or download, you often have to crawl it yourself, which is where crawlers matter. Fortunately, Python offers a batch of excellent web crawling frameworks that can both fetch and clean data, so that is where we begin:

1. Scrapy

Scrapy, a fast high-level screen scraping and web crawling framework for Python.

The famous Scrapy needs little introduction; many of the courses on Course Graph were crawled with it. There are plenty of articles about it; I recommend pluskid's early piece "Easily Building a Custom Crawler with Scrapy" (《Scrapy 轻松定制网络爬虫》), which has aged well.

Official homepage: http://scrapy.org/
Github: https://github.com/scrapy/scrapy

2. Beautiful Soup

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

I learned of Beautiful Soup from the book Programming Collective Intelligence (《集体智慧编程》) back in school and have used it on and off since; it is an excellent toolkit. Strictly speaking, Beautiful Soup is not a complete crawler in itself and is typically paired with urllib; rather, it is a toolkit for parsing, cleaning, and extracting HTML/XML data.

Official homepage: http://www.crummy.com/software/BeautifulSoup/

3. Python-Goose

Html Content / Article Extractor, web scraping lib in Python

Goose was originally written in Java and later rewritten in Scala, making it a Scala project. Python-Goose is a Python rewrite that depends on Beautiful Soup. I used it a while back and liked it: given an article URL, it extracts the article's title and body very conveniently.

Github homepage: https://github.com/grangier/python-goose

II. Python Text Processing Tools

Once you have text data from the web, different tasks call for different basic processing: English needs tokenization, Chinese needs word segmentation; going further, either language may call for POS tagging, parsing, keyword extraction, text classification, sentiment analysis, and so on. In this area, especially for English, there are many excellent toolkits, which we will go through one by one.

How to Compute the Similarity of Two Documents (Part 3)


In the previous installment we walked through gensim's usage with a simple example; this time we will run some validation and improvements on real data from Course Graph, using NLTK to preprocess the English course data.

III. Experiments with Course Graph Data

1. Preparing the data
To make it easy to follow along, I have prepared a dump of Coursera course data, downloadable here: coursera_corpus (Baidu netdisk: http://t.cn/RhjgPkv , password: oppc). It contains 379 courses in total, one per line, with three tab-separated fields: course name\tshort description\tcourse details. HTML tags have already been removed. The sample below shows course names only:

Writing II: Rhetorical Composing
Genetics and Society: A Course for Educators
General Game Playing
Genes and the Human Condition (From Behavior to Biotechnology)
A Brief History of Humankind
New Models of Business in Society
Analyse Numérique pour Ingénieurs
Evolution: A Course for Educators
Coding the Matrix: Linear Algebra through Computer Science Applications
The Dynamic Earth: A Course for Educators
...

First, let's open Python and load the data:

>>> courses = [line.strip() for line in file('coursera_corpus')]
>>> courses_name = [course.split('\t')[0] for course in courses]
>>> print courses_name[0:10]
['Writing II: Rhetorical Composing', 'Genetics and Society: A Course for Educators', 'General Game Playing', 'Genes and the Human Condition (From Behavior to Biotechnology)', 'A Brief History of Humankind', 'New Models of Business in Society', 'Analyse Num\xc3\xa9rique pour Ing\xc3\xa9nieurs', 'Evolution: A Course for Educators', 'Coding the Matrix: Linear Algebra through Computer Science Applications', 'The Dynamic Earth: A Course for Educators']

2. Bringing in NLTK
NLTK is the well-known Python NLP toolkit. It mainly targets English, but since Course Graph's course data is currently mostly English, that is sufficient. NLTK comes with documentation, corpora, and a companion book, which a generous volunteer has even translated into Chinese: Natural Language Processing with Python. Sometimes one can't help but sigh: people doing English NLP are lucky.

First, install NLTK; the NLTK homepage details installation on Mac, Linux, and Windows: http://nltk.org/install.html . The main thing is to install the dependencies NumPy and PyYAML first; the rest is straightforward. Once installed, test it with import nltk; if that works, there is one more important step: downloading NLTK's official corpora:

>>> import nltk
>>> nltk.download()

A GUI will pop up offering data to download; select both all-corpora and book. This takes a while, and only after the corpora are downloaded does NLTK become truly usable on your machine. Test it with the Brown corpus:

>>> from nltk.corpus import brown
>>> brown.readme()
'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
1161192

Now back to the course data. If, as before, we only lowercase the words in each document, we get:

>>> texts_lower = [[word for word in document.lower().split()] for document in courses]
>>> print texts_lower[0]
['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...

Notice that many punctuation marks are still glued to the words, so we bring in nltk's word_tokenize function and process the data again:

>>> from nltk.tokenize import word_tokenize
>>> texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]
>>> print texts_tokenized[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading', ',', 'research', ',', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic', ',', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers', '...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...

After tokenizing the English course data, we need to remove stopwords; conveniently, NLTK provides an English stopword list:

>>> from nltk.corpus import stopwords
>>> english_stopwords = stopwords.words('english')
>>> print english_stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
>>> len(english_stopwords)
127

That is 127 stopwords in total. First, filter them out of the course corpus:
>>> texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]
>>> print texts_filtered_stopwords[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', ',', 'research', ',', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', ',', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', ',', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', ',', 'visual', ',', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', ',', 'demonstrations', ',', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', ',', 'rhetoric', 'course', 'design', ',', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', ',', 'students', ',', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', ',', 'writers', 'exchange', ',', 'place', 'exchange', 'work', 'feedback']

The stopwords are gone, but the punctuation tokens remain. That is easy to fix: first define a list of punctuation marks:
>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']

Then filter out these punctuation tokens:
>>> texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]
>>> print texts_filtered[0]
['writing', 'ii', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', 'research', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', 'ideas', 'talents', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', 'visual', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', 'demonstrations', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', 'rhetoric', 'course', 'design', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', 'students', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', 'writers', 'exchange', 'place', 'exchange', 'work', 'feedback']

Going one step further, we stem the English words. NLTK offers several stemmer interfaces to choose from; see http://nltk.org/api/nltk.stem.html , which includes well-known English stemmers such as the Lancaster Stemmer and the Porter Stemmer. Here we use LancasterStemmer:

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('stemmed')
'stem'
>>> st.stem('stemming')
'stem'
>>> st.stem('stemmer')
'stem'
>>> st.stem('running')
'run'
>>> st.stem('maximum')
'maxim'
>>> st.stem('presumably')
'presum'

Let's apply this stemmer to the course data above:
>>> texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]
>>> print texts_stemmed[0]
['writ', 'ii', 'rhet', 'compos', 'rhet', 'compos', 'eng', 'sery', 'interact', 'read', 'research', 'compos', 'act', 'along', 'assign', 'design', 'help', 'becom', 'effect', 'consum', 'produc', 'alphabet', 'vis', 'multimod', 'texts.', 'join', 'us', 'becom', 'effect', 'writ', '...', 'bet', 'citizens.', 'rhet', 'compos', 'cours', 'writ', 'exchang', 'word', 'idea', 'tal', 'support.', 'introduc', 'vary', 'rhet', 'concepts\xe2\x80\x94that', 'idea', 'techn', 'inform', 'persuad', 'audiences\xe2\x80\x94that', 'help', 'becom', 'effect', 'consum', 'produc', 'writ', 'vis', 'multimod', 'texts.', 'class', 'includ', 'short', 'video', 'demonst', 'activities.', 'envid', 'rhet', 'compos', 'learn', 'commun', 'includ', 'enrol', 'cours', 'instructors.', 'bring', 'expert', 'writ', 'rhet', 'cours', 'design', 'design', 'assign', 'cours', 'infrastruct', 'help', 'shar', 'expery', 'writ', 'stud', 'profess', 'us.', 'collab', 'facilit', 'wex', 'writ', 'exchang', 'plac', 'exchang', 'work', 'feedback']

Before bringing in gensim, there is one more step: removing words that appear only once in the whole corpus; in my tests, keeping them hurts the results a bit:

>>> all_stems = sum(texts_stemmed, [])
>>> stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]

3. Bringing in gensim
With the preprocessing above done, we can bring in gensim and quickly run the course similarity experiment. The following goes through the pipeline quickly; see the previous installment for the details.

>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> dictionary = corpora.Dictionary(texts)
2013-06-07 21:37:07,120 : INFO : adding document #0 to Dictionary(0 unique tokens)
2013-06-07 21:37:07,263 : INFO : built Dictionary(3341 unique tokens) from 379 documents (total 46417 corpus positions)

>>> corpus = [dictionary.doc2bow(text) for text in texts]

>>> tfidf = models.TfidfModel(corpus)
2013-06-07 21:58:30,490 : INFO : collecting document frequencies
2013-06-07 21:58:30,490 : INFO : PROGRESS: processing document #0
2013-06-07 21:58:30,504 : INFO : calculating IDF weights for 379 documents and 3341 features (29166 matrix non-zeros)

>>> corpus_tfidf = tfidf[corpus]

Here we somewhat arbitrarily set the number of topics to 10 and train an LSI model:
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

>>> index = similarities.MatrixSimilarity(lsi[corpus])
2013-06-07 22:04:55,443 : INFO : scanning corpus to determine the number of features
2013-06-07 22:04:55,510 : INFO : creating matrix for 379 documents and 10 features

With the LSI-based course index built, let's use Andrew Ng's Machine Learning MOOC as an example. This course is on line 211 of our coursera_corpus file, that is:

>>> print courses_name[210]
Machine Learning

Now we can use the lsi model to map this course into the 10-topic space and compute its similarity against the other courses:
>>> ml_course = texts[210]
>>> ml_bow = dictionary.doc2bow(ml_course)
>>> ml_lsi = lsi[ml_bow]
>>> print ml_lsi
[(0, 8.3270084238788673), (1, 0.91295652151975082), (2, -0.28296075112669405), (3, 0.0011599008827843801), (4, -4.1820134980024255), (5, -0.37889856481054851), (6, 2.0446999575052125), (7, 2.3297944485200031), (8, -0.32875594265388536), (9, -0.30389668455507612)]
>>> sims = index[ml_lsi]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

Take the top 10 courses ranked by similarity:
>>> print sort_sims[0:10]
[(210, 1.0), (174, 0.97812241), (238, 0.96428639), (203, 0.96283489), (63, 0.9605484), (189, 0.95390636), (141, 0.94975704), (184, 0.94269753), (111, 0.93654782), (236, 0.93601125)]

The top hit is the course itself:
>>> print courses_name[210]
Machine Learning

The second is the machine learning MOOC taught by another Coursera heavyweight, Pedro Domingos:
>>> print courses_name[174]
Machine Learning

The third is the Probabilistic Graphical Models MOOC taught by Daphne Koller, Coursera's other co-founder and likewise a leading figure:
>>> print courses_name[238]
Probabilistic Graphical Models

The fourth is the neural networks MOOC by yet another giant, Geoffrey Hinton, which some consider required viewing for deep learning:
>>> print courses_name[203]
Neural Networks for Machine Learning

The results look pretty good; if this seems fun, give it a try yourself.

That wraps up this series. I had planned to write about experiments on the full English Wikipedia data, but since Course Graph does not need it for now, we will stop here; interested readers can go straight to the very detailed gensim documentation. Going forward I will likely focus more on applying NLTK to Chinese language processing; stay tuned.

Note: original content. When reposting, please credit the source "我爱自然语言处理" (52nlp): www.52nlp.cn

Permalink: http://www.52nlp.cn/如何计算两个文档的相似度三

How to Compute the Similarity of Two Documents (Part 2)


In the previous installment we covered some background and introduced gensim; many of you will have tried it already. This installment starts from the basics of installing gensim and then walks through a very simple example of how to use it; the next installment covers its application to Course Graph.

II. Installing and Using gensim

1. Installation
gensim depends on the two major Python scientific computing packages, NumPy and SciPy. The simple route is pip install, but inside China that often fails for network reasons, so I installed from the gensim source package. gensim's official installation page lists the compatible Python, NumPy, and SciPy versions and the installation steps in detail; consult it directly if interested. Below I only describe installation on Ubuntu and Mac OS:

1) My VPS runs 64-bit Ubuntu 12.04, so installing numpy and scipy is simple: "sudo apt-get install python-numpy python-scipy"; then unpack the gensim package and run "sudo python setup.py install";

2) My laptop is a MacBook Pro, and installing the numpy and scipy source packages on Mac OS took some fiddling, especially scipy, which kept complaining about missing Fortran components. Googling showed many people hit this when installing scipy on a Mac; it was finally solved by installing gfortran via homebrew ("brew install gfortran"), after which "sudo python setup.py install" for numpy and scipy works as usual;

2. Usage
gensim's official tutorial is very detailed; readers comfortable with English can go straight to it. Below I will give an example of how to use gensim according to my own understanding; it differs from gensim's official examples and can serve as a complement. The previous installment mentioned the document "Latent Semantic Indexing (LSI) A Fast Track Tutorial"; our example comes from the three one-sentence docs in that tutorial. First, open python at the command line and do some setup:

>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Then take the example from that document as our input, represented in Python as a list of documents:

>>> documents = ["Shipment of gold damaged in a fire",
... "Delivery of silver arrived in a silver truck",
... "Shipment of gold arrived in a truck"]

Normally you would preprocess English text: remove stopwords, tokenize, stem, and filter out low-frequency words. But to keep things simple, and to stay consistent with the "LSI Fast Track Tutorial", the only preprocessing here is lowercasing:

>>> texts = [[word for word in document.lower().split()] for document in documents]
>>> print texts
[['shipment', 'of', 'gold', 'damaged', 'in', 'a', 'fire'], ['delivery', 'of', 'silver', 'arrived', 'in', 'a', 'silver', 'truck'], ['shipment', 'of', 'gold', 'arrived', 'in', 'a', 'truck']]

From these documents we can extract a "bag-of-words" representation, mapping each document's tokens to ids:

>>> dictionary = corpora.Dictionary(texts)
>>> print dictionary
Dictionary(11 unique tokens)
>>> print dictionary.token2id
{'a': 0, 'damaged': 1, 'gold': 3, 'fire': 2, 'of': 5, 'delivery': 8, 'arrived': 7, 'shipment': 6, 'in': 4, 'truck': 10, 'silver': 9}

Then each string-based document can be converted into an id-based document vector:

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> print corpus
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2), (10, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (10, 1)]]

For example, the element (9, 2) means that the word with id 9, "silver", appears twice in the second document.

With this information we can compute a TF-IDF "model" from these "training documents":

>>> tfidf = models.TfidfModel(corpus)
2013-05-27 18:58:15,831 : INFO : collecting document frequencies
2013-05-27 18:58:15,881 : INFO : PROGRESS: processing document #0
2013-05-27 18:58:15,881 : INFO : calculating IDF weights for 3 documents and 11 features (21 matrix non-zeros)

Using this TF-IDF model, we can transform the frequency-based document vectors above into tf-idf-valued document vectors:

>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
... print doc
...
[(1, 0.6633689723434505), (2, 0.6633689723434505), (3, 0.2448297500958463), (6, 0.2448297500958463)]
[(7, 0.16073253746956623), (8, 0.4355066251613605), (9, 0.871013250322721), (10, 0.16073253746956623)]
[(3, 0.5), (6, 0.5), (7, 0.5), (10, 0.5)]

Some tokens seem to have gone missing; let's print the tfidf model's internals:

>>> print tfidf.dfs
{0: 3, 1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1, 9: 1, 10: 2}
>>> print tfidf.idfs
{0: 0.0, 1: 1.5849625007211563, 2: 1.5849625007211563, 3: 0.5849625007211562, 4: 0.0, 5: 0.0, 6: 0.5849625007211562, 7: 0.5849625007211562, 8: 1.5849625007211563, 9: 1.5849625007211563, 10: 0.5849625007211562}

We find that the words with ids 0, 4, and 5 each occur in 3 documents (df = 3), and the total number of documents is also 3, so their idf comes out as 0; evidently gensim does not add 1 to the numerator as smoothing. But we also notice that these three words are the function words a, in, and of, which could perfectly well be removed as stopwords in preprocessing, which in turn says something about TF-IDF's effectiveness.
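The numbers above match the textbook formula idf(t) = log2(N / df(t)) with N = 3 documents, which is easy to verify by hand (a quick sanity check, not gensim code):

>>> from math import log
>>> log(3.0 / 1, 2)   # df = 1, e.g. "damaged"
1.5849625007211563
>>> log(3.0 / 2, 2)   # df = 2, e.g. "gold"
0.5849625007211562
>>> log(3.0 / 3, 2)   # df = 3, e.g. "a": idf 0, so the term drops out
0.0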

With tf-idf document vectors in hand, we can train an LSI model. Following the example in Latent Semantic Indexing (LSI) A Fast Track Tutorial, we set the number of topics to 2:

>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
>>> lsi.print_topics(2)
2013-05-27 19:15:26,467 : INFO : topic #0(1.137): 0.438*"gold" + 0.438*"shipment" + 0.366*"truck" + 0.366*"arrived" + 0.345*"damaged" + 0.345*"fire" + 0.297*"silver" + 0.149*"delivery" + 0.000*"in" + 0.000*"a"
2013-05-27 19:15:26,468 : INFO : topic #1(1.000): 0.728*"silver" + 0.364*"delivery" + -0.364*"fire" + -0.364*"damaged" + 0.134*"truck" + 0.134*"arrived" + -0.134*"shipment" + -0.134*"gold" + -0.000*"a" + -0.000*"in"

The physical meaning of LSI is not easy to explain, but the core idea is that it takes the matrix formed by the training document vectors, computes its SVD, and keeps a rank-2 approximation; see the English tutorial mentioned above for details.
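Schematically, writing C for the term-document matrix of tf-idf vectors, the truncated SVD keeps only the top k = 2 singular values:

    C ≈ C_2 = U_2 Σ_2 V_2^T

where the columns of U_2 are the two topic directions printed by print_topics above, and the documents' coordinates in topic space come from Σ_2 V_2^T. With this lsi model we can now map the documents into the two-dimensional topic space: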

>>> corpus_lsi = lsi[corpus_tfidf]
>>> for doc in corpus_lsi:
... print doc
...
[(0, 0.67211468809878649), (1, -0.54880682119355917)]
[(0, 0.44124825208697727), (1, 0.83594920480339041)]
[(0, 0.80401378963792647)]

As we can see, documents 1 and 3 are more related to the first topic, while document 2 is more related to the second;

While we're at it, we can also run an LDA model:

>>> lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
>>> lda.print_topics(2)
2013-05-27 19:44:40,026 : INFO : topic #0: 0.119*silver + 0.107*shipment + 0.104*truck + 0.103*gold + 0.102*fire + 0.101*arrived + 0.097*damaged + 0.085*delivery + 0.061*of + 0.061*in
2013-05-27 19:44:40,026 : INFO : topic #1: 0.110*gold + 0.109*silver + 0.105*shipment + 0.105*damaged + 0.101*arrived + 0.101*fire + 0.098*truck + 0.090*delivery + 0.061*of + 0.061*in

Each topic word in an LDA model carries a probability, and the probabilities within a topic sum to one; larger means more weight, so the physical meaning is fairly clear. That said, looking back at the 2-topic LDA model trained on these three documents, the topics are too uniform to be convincing.

Now back to the LSI model. Given the model, how do we compute similarity between documents, or, from another angle, given a query, how do we find the most relevant documents? The first step, of course, is building the index:

>>> index = similarities.MatrixSimilarity(lsi[corpus])
2013-05-27 19:50:30,282 : INFO : scanning corpus to determine the number of features
2013-05-27 19:50:30,282 : INFO : creating matrix for 3 documents and 2 features

We use the query from that English tutorial as our example: gold silver truck. First, vectorize it:

>>> query = "gold silver truck"
>>> query_bow = dictionary.doc2bow(query.lower().split())
>>> print query_bow
[(3, 1), (9, 1), (10, 1)]

Then use the previously trained LSI model to map it into the 2-D topic space:

>>> query_lsi = lsi[query_bow]
>>> print query_lsi
[(0, 1.1012835748628467), (1, 0.72812283398049593)]

Finally, compute its cosine similarity against the docs in the index:

>>> sims = index[query_lsi]
>>> print list(enumerate(sims))
[(0, 0.40757114), (1, 0.93163693), (2, 0.83416492)]

And of course we can sort by similarity:

>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print sort_sims
[(1, 0.93163693), (2, 0.83416492), (0, 0.40757114)]

So the result of this query is doc2 > doc3 > doc1, consistent with the fast track tutorial, though the values differ somewhat.


That is it for this example. In the next installment we will focus on computing topic similarity between courses on Course Graph with gensim, and consider some improvements, including using the English NLP toolkit NLTK and a much larger Wikipedia corpus to see how the results change.

To be continued...

Note: original content. When reposting, please credit the source "我爱自然语言处理" (52nlp): www.52nlp.cn

Permalink: http://www.52nlp.cn/如何计算两个文档的相似度二

Recommended: the Chinese Translation of "Natural Language Processing with Python", the NLTK Companion Book


The NLTK companion book, Natural Language Processing with Python, has been out for several years, but there has never been a published Chinese translation. Reading the English original is the best option, but for many readers a Chinese version would be very welcome. This afternoon I saw on Weibo that 陈涛sean had posted a download link for a Chinese translation of the NLTK book; I asked about it, the translator later contacted me privately, and it turned out he had translated it without pay and without any publication plans. Translation is hard work; my respects to the translator. He also says there are some errors and hopes NLPers will point them out, so that together we can polish this valuable Chinese NLTK resource. He asked for a recommendation on "52nlp", which benefits all NLPers, so I have updated the book's link under "Resources". The download:

PYTHON自然语言处理中文翻译-NLTK Natural Language Processing with Python 中文版

Skimming the translation, whatever one says about the translation quality, the typesetting alone looks like a properly published book, which shows how much care the translator put in. Below is the "Translator's Note" excerpted from the translation:

As a beginner in natural language processing, I kept reading about "training a model", this model and that model, without ever knowing what a model actually was. Only after working through this book, step by step from preprocessing data to extracting feature sets, training models, and testing and revising them, did I gain a concrete sense of the word: the intermediate result of an algorithm, stored on disk as .pkl files and loaded directly at test time so that earlier computation is not repeated. Now I can hold my own when people talk about "models" (which of course has many other meanings too). Likewise, things such as verb valency, collocations, and how objective logic constrains the sentences a grammar generates are hard to truly grasp without hands-on practice.

There are many books on NLP theory, few on actual practice, and fewer still this systematic. In that respect, this book is the best practical NLP tutorial available today; a beginner who reads it carefully after studying the theory is bound to benefit. That is one purpose of this translation.

This book is the translator's spare-time English translation exercise, offered to invite better work. There are many problems in it, especially in Chapter 10 on applying propositional and first-order logic to natural language processing; corrections are most welcome. You can find me on Weibo (weibo.com/chentao1999). Reading the Chinese translation is faster, but reading the original conveys the author's intent better.

At the end of the book the author lists items urgently needing help and suggests that translations use target-language examples; for now this book still reuses the English examples, and volunteers are welcome to join the Chinese localization effort and contribute to Chinese NLP.

Using this book for study and research, as well as distributing, copying, and modifying it, is welcome. Knock-off editions, please keep the translator's name and Weibo. For commercial use, contact the original copyright holder; the translator assumes no liability arising from such use.

Translated by 陈涛 (weibo.com/chentao1999)

April 7, 2012

Finally, as you read this book, please note anything that needs correcting; you can leave errata suggestions in the comments so that together we can improve the book. Thanks!

Google's Python Class


Natural language processing and scripting languages are closely related. I had always preferred Perl, but because of NLTK I learned some Python and was immediately won over by its rigor. I recall a bit of lore from Learning Python about Perl and Python: Perl's inventor was a linguist, while Python's inventor was trained as a mathematician, so the former champions freedom and the latter rigor; something to that effect. For my part, both serve me well: use whichever fits the task, no need to compare.

HMM Applications in Natural Language Processing (1): POS Tagging, Part 6


It has been a while since we last discussed HMMs and POS tagging; today we continue with the final part of this series: introducing an open-source HMM POS tagging tool and building an English POS tagger from the Brown corpus.
The HMM POS tagger we built with umdhmm in the previous installment is a bigram tagger: it considers only the previous tag and the current tag, making it the most basic Markov-model tagger. This HMM tagger can be extended in several ways; one is to use more context, conditioning not just on the previous tag but on the previous two tags. Such a tagger is called a trigram tagger, a classic POS tagging approach covered in both Speech and Language Processing (《自然语言处理综论》) and Foundations of Statistical Natural Language Processing (《统计自然语言处理基础》).

"The Unity of Knowledge and Action" and Natural Language Processing


I have recently been hooked on "Those Ming Dynasty Things" (《明朝那些事儿》). After volumes three, four, and five arrived from Dangdang on Thursday, what was meant to be half a month of intellectual nourishment was finished in three days, which nearly cost me this weekend's task for 52nlp. Still, while reading it, Wang Shouren (Wang Yangming)'s "unity of knowledge and action" left a deep impression on me and subconsciously brought natural language processing to mind, so here I am, ready to ramble a bit about my impressions!