标签归档:python

Coursera上Python课程(公开课)汇总推荐:从Python入门到应用Python

Deep Learning Specialization on Coursera

Python是深度学习时代的语言,Coursera上有很多Python课程,从Python入门到精通,从Python基础语法到应用Python,满足各个层次的需求,以下是Coursera上的Python课程整理,仅供参考,这里也会持续更新。

1. 密歇根大学的“Python for Everybody Specialization(人人都可以学习的Python专项课程)”

这个系列对于学习者的编程背景和数学要求几乎为零,非常适合Python入门学习。这个系列也是Coursera上最受欢迎的Python学习系列课程,强烈推荐。这个Python系列的目标是“通过Python学习编程并分析数据,开发用于采集,清洗,分析和可视化数据的程序(Learn to Program and Analyze Data with Python-Develop programs to gather, clean, analyze, and visualize data.” ,以下是关于这个系列的简介:

This Specialization builds on the success of the Python for Everybody course and will introduce fundamental programming concepts including data structures, networked application program interfaces, and databases, using the Python programming language. In the Capstone Project, you’ll use the technologies learned throughout the Specialization to design and create your own applications for data retrieval, processing, and visualization.

这个系列包含4门子课程和1门毕业项目课程,包括Python入门基础,Python数据结构, 使用Python获取网络数据(Python爬虫),在Python中使用数据库以及Python数据可视化等。以下是具体子课程的介绍:

1.1 Programming for Everybody (Getting Started with Python)

Python入门级课程,这门课程暂且翻译为“人人都可以学编程-从Python开始”,如果没有任何编程基础,就从这门课程开始吧:

This course aims to teach everyone the basics of programming computers using Python. We cover the basics of how one constructs a program from a series of simple instructions in Python. The course has no pre-requisites and avoids all but the simplest mathematics. Anyone with moderate computer experience should be able to master the materials in this course. This course will cover Chapters 1-5 of the textbook “Python for Everybody”. Once a student completes this course, they will be ready to take more advanced programming courses. This course covers Python 3.

1.2 Python Data Structures(Python数据结构)

Python基础课程,这门课程的目标是介绍Python语言的核心数据结构(This course will introduce the core data structures of the Python programming language.),关于这门课程:

This course will introduce the core data structures of the Python programming language. We will move past the basics of procedural programming and explore how we can use the Python built-in data structures such as lists, dictionaries, and tuples to perform increasingly complex data analysis. This course will cover Chapters 6-10 of the textbook “Python for Everybody”. This course covers Python 3.

1.3 Using Python to Access Web Data(使用Python获取网页数据--Python爬虫)

Python应用课程,只有使用Python才能学以致用,这门课程的目标是展示如何通过爬取和分析网页数据将互联网作为数据的源泉(This course will show how one can treat the Internet as a source of data):

This course will show how one can treat the Internet as a source of data. We will scrape, parse, and read web data as well as access data using web APIs. We will work with HTML, XML, and JSON data formats in Python. This course will cover Chapters 11-13 of the textbook “Python for Everybody”. To succeed in this course, you should be familiar with the material covered in Chapters 1-10 of the textbook and the first two courses in this specialization. These topics include variables and expressions, conditional execution (loops, branching, and try/except), functions, Python data structures (strings, lists, dictionaries, and tuples), and manipulating files. This course covers Python 3.

1.4 Using Databases with Python(Python数据库)

Python应用课程,在Python中使用数据库。这门课程的目标是在Python中学习SQL,使用SQLite3作为抓取数据的存储数据库:

This course will introduce students to the basics of the Structured Query Language (SQL) as well as basic database design for storing data as part of a multi-step data gathering, analysis, and processing effort. The course will use SQLite3 as its database. We will also build web crawlers and multi-step data gathering and visualization processes. We will use the D3.js library to do basic data visualization. This course will cover Chapters 14-15 of the book “Python for Everybody”. To succeed in this course, you should be familiar with the material covered in Chapters 1-13 of the textbook and the first three courses in this specialization. This course covers Python 3.

1.5 Capstone: Retrieving, Processing, and Visualizing Data with Python(毕业项目课程:使用Python获取,处理和可视化数据)

Python应用实践课程,这是这个系列的毕业项目课程,目的是通过开发一系列Python应用项目让学生熟悉Python抓取,处理和可视化数据的流程。

In the capstone, students will build a series of applications to retrieve, process and visualize data using Python. The projects will involve all the elements of the specialization. In the first part of the capstone, students will do some visualizations to become familiar with the technologies in use and then will pursue their own project to visualize some other data that they have or can find. Chapters 15 and 16 from the book “Python for Everybody” will serve as the backbone for the capstone. This course covers Python 3.

2. 多伦多大学的编程入门课程"Learn to Program: The Fundamentals(学习编程:基础) "

Python入门级课程。这门课程以Python语言传授编程入门知识,实为零基础的Python入门课程。感兴趣的同学可以参考课程图谱上的老课程评论 :http://coursegraph.com/coursera_programming1 ,之前一个同学的评价是 “两个老师语速都偏慢,讲解细致,又有可视化工具Python Visualizer用于详细了解程序具体执行步骤,可以说是零基础学习python编程的最佳选择。”

Behind every mouse click and touch-screen tap, there is a computer program that makes things happen. This course introduces the fundamental building blocks of programming and teaches you how to write fun and useful programs using the Python language.

3. 莱斯大学的Python专项课程系列:Introduction to Scripting in Python Specialization

入门级Python学习系列课程,涵盖Python基础, Python数据表示, Python数据分析, Python数据可视化等子课程,比较适合Python入门。这门课程的目标是让学生可以在处理实际问题是使用Python解决问题:Launch Your Career in Python Programming-Master the core concepts of scripting in Python to enable you to solve practical problems.

This specialization is intended for beginners who would like to master essential programming skills. Through four courses, you will cover key programming concepts in Python 3 which will prepare you to use Python to perform common scripting tasks. This knowledge will provide a solid foundation towards a career in data science, software engineering, or other disciplines involving programming.

这个系列包含4门子课程,以下是具体子课程的介绍:

3.1 Python Programming Essentials(Python编程基础)

Python入门基础课程,这门课程将讲授Python编程基础知识,包括表达式,变量,函数等,目标是让用户熟练使用Python:

This course will introduce you to the wonderful world of Python programming! We'll learn about the essential elements of programming and how to construct basic Python programs. We will cover expressions, variables, functions, logic, and conditionals, which are foundational concepts in computer programming. We will also teach you how to use Python modules, which enable you to benefit from the vast array of functionality that is already a part of the Python language. These concepts and skills will help you to begin to think like a computer programmer and to understand how to go about writing Python programs. By the end of the course, you will be able to write short Python programs that are able to accomplish real, practical tasks. This course is the foundation for building expertise in Python programming. As the first course in a specialization, it provides the necessary building blocks for you to succeed at learning to write more complex Python programs. This course uses Python 3. While many Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This first course will use a Python 3 version of the CodeSkulptor development environment, which is specifically designed to help beginning programmers learn quickly. CodeSkulptor runs within any modern web browser and does not require you to install any software, allowing you to start writing and running small programs immediately. In the later courses in this specialization, we will help you to move to more sophisticated desktop development environments.

3.2 Python Data Representations(Python数据表示)

Python入门基础课程,这门课程依然关注Python的基础知识,包括Python字符串,列表等,以及Python文件操作:

This course will continue the introduction to Python programming that started with Python Programming Essentials. We'll learn about different data representations, including strings, lists, and tuples, that form the core of all Python programs. We will also teach you how to access files, which will allow you to store and retrieve data within your programs. These concepts and skills will help you to manipulate data and write more complex Python programs. By the end of the course, you will be able to write Python programs that can manipulate data stored in files. This will extend your Python programming expertise, enabling you to write a wide range of scripts using Python This course uses Python 3. While most Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This course introduces basic desktop Python development environments, allowing you to run Python programs directly on your computer. This choice enables a smooth transition from online development environments.

3.3 Python Data Analysis(Python数据分析)

Python基础课程,这门课程将讲授通过Python读取和分析表格数据和结构化数据等,例如TCSV文件等:

This course will continue the introduction to Python programming that started with Python Programming Essentials and Python Data Representations. We'll learn about reading, storing, and processing tabular data, which are common tasks. We will also teach you about CSV files and Python's support for reading and writing them. CSV files are a generic, plain text file format that allows you to exchange tabular data between different programs. These concepts and skills will help you to further extend your Python programming knowledge and allow you to process more complex data. By the end of the course, you will be comfortable working with tabular data in Python. This will extend your Python programming expertise, enabling you to write a wider range of scripts using Python. This course uses Python 3. While most Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This course uses basic desktop Python development environments, allowing you to run Python programs directly on your computer.

3.4 Python Data Visualization(Python数据可视化)

Python应用课程,这门课程将基于前3门课程学习的Python知识,抓取网络数据,然后清洗,处理和分析数据,并最终可视化呈现数据:

This if the final course in the specialization which builds upon the knowledge learned in Python Programming Essentials, Python Data Representations, and Python Data Analysis. We will learn how to install external packages for use within Python, acquire data from sources on the Web, and then we will clean, process, analyze, and visualize that data. This course will combine the skills learned throughout the specialization to enable you to write interesting, practical, and useful programs. By the end of the course, you will be comfortable installing Python packages, analyzing existing data, and generating visualizations of that data. This course will complete your education as a scripter, enabling you to locate, install, and use Python packages written by others. You will be able to effectively utilize tools and packages that are widely available to amplify your effectiveness and write useful programs.

4. 莱斯大学的计算(机)基础专项课程系列:Fundamentals of Computing Specialization

入门级Python编程学习课程系列,这个系列覆盖了大部分莱斯大学一年级计算机科学新生的学习材料,学生通过Python学习现代编程语言技巧,并将这些技巧应用到20个左右的有趣的编程项目中。

This Specialization covers much of the material that first-year Computer Science students take at Rice University. Students learn sophisticated programming skills in Python from the ground up and apply these skills in building more than 20 fun projects. The Specialization concludes with a Capstone exam that allows the students to demonstrate the range of knowledge that they have acquired in the Specialization.

这个系列包括Python交互式编程设计,计算原理,算法思维等6门课程和1门毕业项目课程,目标是让学生像计算机科学家一样编程和思考(Learn how to program and think like a Computer Scientist),以下是子课程的相关介绍:

4.1 An Introduction to Interactive Programming in Python (Part 1)(Python交互式编程导论上)

Python入门级课程,这门课程将讲授Python编程基础知识,例如普通表达式,条件表达式和函数,并用这些知识构建一个简单的交互式应用。

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard and the mouse. In part 1 of this course, we will introduce the basic elements of programming (such as expressions, conditionals, and functions) and then use these elements to create simple interactive applications such as a digital stopwatch. Part 1 of this class will culminate in building a version of the classic arcade game "Pong".

4.2 An Introduction to Interactive Programming in Python (Part 2)(Python交互式编程导论下)

Python入门级课程,这门课程将继续讲授Python基础知识,例如列表,词典和循环,并将使用这些知识构建一个简单的游戏例如Blackjack:

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard and the mouse. In part 2 of this course, we will introduce more elements of programming (such as list, dictionaries, and loops) and then use these elements to create games such as Blackjack. Part 1 of this class will culminate in building a version of the classic arcade game "Asteroids". Upon completing this course, you will be able to write small, but interesting Python programs. The next course in the specialization will begin to introduce a more principled approach to writing programs and solving computational problems that will allow you to write larger and more complex programs.

4.3 Principles of Computing (Part 1)(计算原理上)

编程基础课程,这门课程聚焦在了编程的基础上,包括编码标准和测试,数学基础包括概率和组合等。

This two-part course builds upon the programming skills that you learned in our Introduction to Interactive Programming in Python course. We will augment those skills with both important programming practices and critical mathematical problem solving skills. These skills underlie larger scale computational problem solving and programming. The main focus of the class will be programming weekly mini-projects in Python that build upon the mathematical and programming principles that are taught in the class. To keep the class fun and engaging, many of the projects will involve working with strategy-based games. In part 1 of this course, the programming aspect of the class will focus on coding standards and testing. The mathematical portion of the class will focus on probability, combinatorics, and counting with an eye towards practical applications of these concepts in Computer Science. Recommended Background - Students should be comfortable writing small (100+ line) programs in Python using constructs such as lists, dictionaries and classes and also have a high-school math background that includes algebra and pre-calculus.

4.4 Principles of Computing (Part 2)(计算原理下)

编程基础课程,这门课程聚焦在搜索、排序、递归等主题上:

This two-part course introduces the basic mathematical and programming principles that underlie much of Computer Science. Understanding these principles is crucial to the process of creating efficient and well-structured solutions for computational problems. To get hands-on experience working with these concepts, we will use the Python programming language. The main focus of the class will be weekly mini-projects that build upon the mathematical and programming principles that are taught in the class. To keep the class fun and engaging, many of the projects will involve working with strategy-based games. In part 2 of this course, the programming portion of the class will focus on concepts such as recursion, assertions, and invariants. The mathematical portion of the class will focus on searching, sorting, and recursive data structures. Upon completing this course, you will have a solid foundation in the principles of computation and programming. This will prepare you for the next course in the specialization, which will begin to introduce a structured approach to developing and analyzing algorithms. Developing such algorithmic thinking skills will be critical to writing large scale software and solving real world computational problems.

4.5 Algorithmic Thinking (Part 1)(算法思维上)

编程基础课程,这门课程聚焦在算法思维的培养上,讲授图算法的相关概念并用Python实现:

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part course builds on the principles that you learned in our Principles of Computing course and is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to real-world computational problems. In part 1 of this course, we will study the notion of algorithmic efficiency and consider its application to several problems from graph theory. As the central part of the course, students will implement several important graph algorithms in Python and then use these algorithms to analyze two large real-world data sets. The main focus of these tasks is to understand interaction between the algorithms and the structure of the data sets being analyzed by these algorithms. Recommended Background - Students should be comfortable writing intermediate size (300+ line) programs in Python and have a basic understanding of searching, sorting, and recursion. Students should also have a solid math background that includes algebra, pre-calculus and a familiarity with the math concepts covered in "Principles of Computing".

4.6 Algorithmic Thinking (Part 2)(算法思维下)

编程基础课程,这门课程聚焦在培养学生的算法思维,并了解一些高级算法主题,例如分治法,动态规划等:

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part class is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to computational problems. In part 2 of this course, we will study advanced algorithmic techniques such as divide-and-conquer and dynamic programming. As the central part of the course, students will implement several algorithms in Python that incorporate these techniques and then use these algorithms to analyze two large real-world data sets. The main focus of these tasks is to understand interaction between the algorithms and the structure of the data sets being analyzed by these algorithms. Once students have completed this class, they will have both the mathematical and programming skills to analyze, design, and program solutions to a wide range of computational problems. While this class will use Python as its vehicle of choice to practice Algorithmic Thinking, the concepts that you will learn in this class transcend any particular programming language.

4.7 The Fundamentals of Computing Capstone Exam(计算基础毕业项目课程)

Python应用课程,基于以上子课程的学习,计算基础毕业项目课程将用Python和所学的知识完成 20+ 项目:

While most specializations on Coursera conclude with a project-based course, students in the "Fundamentals of Computing" specialization have completed more than 20+ projects during the first six courses of the specialization. Given that much of the material in these courses is reused from session to session, our goal in this capstone class is to provide a conclusion to the specialization that allows each student an opportunity to demonstrate their individual mastery of the material in the specialization. With this objective in mind, the focus in this Capstone class will be an exam whose questions are updated periodically. This approach is designed to help insure that each student is solving the exam problems on his/her own without outside help. For students that have done their own work, we do not anticipate that the exam will be particularly hard. However, those students who have relied too heavily on outside help in previous classes may have a difficult time. We believe that this approach will increase the value of the Certificate for this specialization.

5. 密歇根大学的 Applied Data Science with Python(Python数据科学应用专项课程系列)

Python应用系列课程,这个系列的目标主要是通过Python编程语言介绍数据科学的相关领域,包括应用统计学,机器学习,信息可视化,文本分析和社交网络分析等知识,并结合一些流行的Python工具包,例如pandas, matplotlib, scikit-learn, nltk以及networkx等Python工具。

The 5 courses in this University of Michigan specialization introduce learners to data science through the python programming language. This skills-based specialization is intended for learners who have basic a python or programming background, and want to apply statistical, machine learning, information visualization, text analysis, and social network analysis techniques through popular python toolkits such as pandas, matplotlib, scikit-learn, nltk, and networkx to gain insight into their data. Introduction to Data Science in Python (course 1), Applied Plotting, Charting & Data Representation in Python (course 2), and Applied Machine Learning in Python (course 3) should be taken in order and prior to any other course in the specialization. After completing those, courses 4 and 5 can be taken in any order. All 5 are required to earn a certificate.

这个系列课程有5门课程,包括Python数据科学导论课程(Introduction to Data Science in Python),Python数据可视化(Applied Plotting, Charting & Data Representation in Python),Python机器学习(Applied Machine Learning in Python) ,Python文本挖掘(Applied Text Mining in Python) , Python社交网络分析(Applied Social Network Analysis in Python),以下是具体子课程的介绍:

5.1 Introduction to Data Science in Python(Python数据科学导论)

Python基础和应用课程,这门课程从Python基础讲起,然后通过pandas数据科学库介绍DataFrame等数据分析中的核心数据结构概念,让学生学会操作和分析表格数据并学会运行基础的统计分析工具。

This course will introduce the learner to the basics of the python programming environment, including how to download and install python, expected fundamental python programming techniques, and how to find help with python programming questions. The course will also introduce data manipulation and cleaning techniques using the popular python pandas data science library and introduce the abstraction of the DataFrame as the central data structure for data analysis. The course will end with a statistics primer, showing how various statistical measures can be applied to pandas DataFrames. By the end of the course, students will be able to take tabular data, clean it, manipulate it, and run basic inferential statistical analyses. This course should be taken before any of the other Applied Data Science with Python courses: Applied Plotting, Charting & Data Representation in Python, Applied Machine Learning in Python, Applied Text Mining in Python, Applied Social Network Analysis in Python.

5.2 Applied Plotting, Charting & Data Representation in Python(Python数据可视化)

Python应用课程,这门课程聚焦在通过使用matplotlib库进行数据图表的绘制和可视化呈现:

This course will introduce the learner to information visualization basics, with a focus on reporting and charting using the matplotlib library. The course will start with a design and information literacy perspective, touching on what makes a good and bad visualization, and what statistical measures translate into in terms of visualizations. The second week will focus on the technology used to make visualizations in python, matplotlib, and introduce users to best practices when creating basic charts and how to realize design decisions in the framework. The third week will describe the gamut of functionality available in matplotlib, and demonstrate a variety of basic statistical charts helping learners to identify when a particular method is good for a particular problem. The course will end with a discussion of other forms of structuring and visualizing data. This course should be taken after Introduction to Data Science in Python and before the remainder of the Applied Data Science with Python courses: Applied Machine Learning in Python, Applied Text Mining in Python, and Applied Social Network Analysis in Python.

5.3 Applied Machine Learning in Python(Python机器学习)

Python应用课程,这门课程主要聚焦在通过Python应用机器学习,包括机器学习和统计学的区别,机器学习工具包scikit-learn的介绍,有监督学习和无监督学习,数据泛化问题(例如交叉验证和过拟合)等。

This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different than descriptive statistics, and introduce the scikit learn toolkit. The issue of dimensionality of data will be discussed, and the task of clustering data, as well as evaluating those clusters, will be tackled. Supervised approaches for creating predictive models will be described, and learners will be able to apply the scikit learn predictive modelling methods while understanding process issues related to data generalizability (e.g. cross validation, overfitting). The course will end with a look at more advanced techniques, such as building ensembles, and practical limitations of predictive models. By the end of this course, students will be able to identify the difference between a supervised (classification) and unsupervised (clustering) technique, identify which technique they need to apply for a particular dataset and need, engineer features to meet that need, and write python code to carry out an analysis. This course should be taken after Introduction to Data Science in Python and Applied Plotting, Charting & Data Representation in Python and before Applied Text Mining in Python and Applied Social Analysis in Python.

5.4 Applied Text Mining in Python(Python文本挖掘)

Python应用课程,这门课程主要聚焦在文本挖掘和文本分析基础,包括正则表达式,文本清洗,文本预处理等,并结合NLTK讲授自然语言处理的相关知识,例如文本分类,主题模型等。

This course will introduce the learner to text mining and text manipulation basics. The course begins with an understanding of how text is handled by python, the structure of text both to the machine and to humans, and an overview of the nltk framework for manipulating text. The second week focuses on common manipulation needs, including regular expressions (searching for text), cleaning text, and preparing text for use by machine learning processes. The third week will apply basic natural language processing methods to text, and demonstrate how text classification is accomplished. The final week will explore more advanced methods for detecting the topics in documents and grouping them by similarity (topic modelling). This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.

5.5 Applied Social Network Analysis in Python(Python社交网络分析)

Python应用课程,这门课程通过Python工具包 NetworkX 介绍社交网络分析的相关知识。

This course will introduce the learner to network analysis through the NetworkX library. The course begins with an understanding of what network analysis is and motivations for why we might model phenomena as networks. The second week introduces the concept of connectivity and network robustness.. The third week will explore ways of measuring the importance or centrality of a node in a network. The final week will explore the evolution of networks over time and cover models of network generation and the link prediction problem. This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.

您可以继续在课程图谱上挖掘Coursera上新的Python课程,也欢迎推荐到这里。

注:本文首发“课程图谱博客”:http://blog.coursegraph.com ,同步发布到这里,原文链接地址:http://blog.coursegraph.com/coursera%E4%B8%8Apython%E8%AF%BE%E7%A8%8B%EF%BC%88%E5%85%AC%E5%BC%80%E8%AF%BE%EF%BC%89%E6%B1%87%E6%80%BB%E6%8E%A8%E8%8D%90%EF%BC%9A%E4%BB%8Epython%E5%85%A5%E9%97%A8%E5%88%B0%E5%BA%94%E7%94%A8python

Andrew Ng 深度学习课程系列第四门课程卷积神经网络开课

Deep Learning Specialization on Coursera

Andrew Ng 深度学习课程系列第四门课程卷积神经网络(Convolutional Neural Networks)将于11月6日开课 ,不过课程资料已经放出,现在注册课程已经可以听课了 ,这门课程属于Coursera上的深度学习专项系列 ,这个系列有5门课,前三门已经开过好几轮,但是第4、第5门课程一直处于待定状态,新的一轮将于11月7号开始,感兴趣的同学可以关注:Deep Learning Specialization

This course will teach you how to build convolutional neural networks and apply it to image data. Thanks to deep learning, computer vision is working far better than just two years ago, and this is enabling numerous exciting applications ranging from safe autonomous driving, to accurate face recognition, to automatic reading of radiology images. You will: - Understand how to build a convolutional neural network, including recent variations such as residual networks. - Know how to apply convolutional networks to visual detection and recognition tasks. - Know to use neural style transfer to generate art. - Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data. This is the fourth course of the Deep Learning Specialization.

个人认为这是目前互联网上最适合入门深度学习的课程系列了,Andrew Ng 老师善于讲课,另外用Python代码抽丝剥茧扣作业,课程学起来非常舒服,参考我之前写得两篇小结:

Andrew Ng 深度学习课程小记

Andrew Ng (吴恩达) 深度学习课程小结

额外推荐: 深度学习课程资源整理

Andrew Ng (吴恩达) 深度学习课程小结

Deep Learning Specialization on Coursera

Andrew Ng (吴恩达) 深度学习课程从宣布到现在大概有一个月了,我也在第一时间加入了这个Coursera上的深度学习系列课程,并且在完成第一门课“Neural Networks and Deep Learning(神经网络与深度学习)”的同时写了关于这门课程的一个小结:Andrew Ng 深度学习课程小记。之后我断断续续的完成了第二门深度学习课程“Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization"和第三门深度学习课程“Structuring Machine Learning Projects”的相关视频学习和作业练习,也拿到了课程证书。平心而论,对于一个有经验的工程师来说,这门课程的难度并不高,如果有时间,完全可以在一个周内完成三门课程的相关学习工作。但是对于一个完全没有相关经验但是想入门深度学习的同学来说,可以预先补习一下Python机器学习的相关知识,如果时间允许,建议先修一下 CourseraPython系列课程Python for Everybody Specialization 和 Andrew Ng 本人的 机器学习课程

吴恩达这个深度学习系列课 (Deep Learning Specialization) 有5门子课程,截止目前,第四门"Convolutional Neural Networks" 和第五门"Sequence Models"还没有放出,不过上周四 Coursera 发了一封邮件给学习这门课程的用户:

Dear Learners,

We hope that you are enjoying Structuring Machine Learning Projects and your experience in the Deep Learning Specialization so far!

As we are nearing the one month anniversary of the Deep Learning Specialization, we wanted to thank you for your feedback on the courses thus far, and communicate our timelines for when the next courses of the Specialization will be available.

We plan to begin the first session of Course 4, Convolutional Neural Networks, in early October, with Course 5, Sequence Models, following soon after. We hope these estimated course launch timelines will help you manage your subscription as appropriate.

If you’d like to maintain full access to current course materials on Coursera’s platform for Courses 1-3, you should keep your subscription active. Note that if you only would like to access your Jupyter Notebooks, you can save these locally. If you do not need to access these materials on platform, you can cancel your subscription and restart your subscription later, when the new courses are ready. All of your course progress in the Specialization will be saved, regardless of your decision.

Thank you for your patience as we work on creating a great learning experience for this Specialization. We look forward to sharing this content with you in the coming weeks!

Happy Learning,

Coursera

大意是第四门深度学习课程 CNN(卷积神经网络)将于10月上旬推出,第五门深度学习课程 Sequence Models(序列模型, RNN等)将紧随其后。对于付费订阅的用户,如果你想随时随地获取当前3门深度学习课程的所有资料,最好保持订阅;如果你仅仅想访问 Jupyter Notebooks,也就是获取相关的编程作业,可以先本地保存它们。你也可以现在取消订阅这门课程,直到之后的课程开始后重新订阅,你的所有学习资料将会保存。所以一个比较省钱的办法,就是现在先离线保存相关课程资料,特别是编程作业等,然后取消订阅。当然对于视频,也可以离线下载,不过现在免费访问这门课程的视频有很多办法,譬如Coursera本身的非订阅模式观看视频,或者网易云课堂免费提供了这门课程的视频部分。不过我依然觉得,吴恩达这门深度学习课程,如果仅仅观看视频,最大的功效不过30%,这门课程的精华就在它的练习和编程作业部分,特别是编程作业,非常值得揣摩,花钱很值。

再次回到 Andrew Ng 这门深度学习课程的子课程上,第二门课程是“Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization",有三周课程,包括是深度神经网络的调参、正则化方法和优化算法讲解:

第一周课程是关于深度学习的实践方面的经验 (Practical aspects of Deep Learning), 包括训练集/验证集/测试集的划分,Bias 和
Variance的问题,神经网络中解决过拟合 (Overfitting) 的 Regularization 和 Dropout 方法,以及Gradient Check等:


这周课程依然强大在编程作业上,有三个编程作业需要完成:

完成编程的作业的过程也是一个很好的回顾课程视频的过程,可以把一些听课中容易忽略的点补上。

第二周深度学习课程是关于神经网络中用到的优化算法 (Optimization algorithms),包括 Mini-batch gradient descent,RMSprop, Adam等优化算法:

编程作业也很棒,在老师循循善诱的预设代码下一步一步完成了几个优化算法。

第三周深度学习课程主要关于神经网络中的超参数调优和深度学习框架问题(Hyperparameter tuning , Batch Normalization and Programming Frameworks),顺带讲了一下多分类问题和 Softmax regression, 特别是最后一个视频简单介绍了一下 TensorFlow , 并且编程作业也是和TensorFlow相关,对于还没有学习过Tensorflow的同学,刚好是一个入门学习机会,视频介绍和作业设计都很棒:


第三门深度学习课程Structuring Machine Learning Projects”更简单一些,只有两周课程,只有 Quiz, 没有编程作业,算是Andrew Ng 老师关于深度学习或者机器学习项目方法论的一个总结:

第一周课程主要关于机器学习的策略、项目目标(可量化)、训练集/开发集/测试集的数据分布、和人工评测指标对比等:


课程虽然没有提供编程作业,但是Quiz练习是一个关于城市鸟类识别的机器学习案例研究,通过这个案例串联15个问题,对应着课程视频中的相关经验,值得玩味。

第二周课程的学习目标是:

“Understand what multi-task learning and transfer learning are
Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets”

主要讲解了错误分析(Error Analysis), 不匹配训练数据和开发/测试集数据的处理(Mismatched training and dev/test set),机器学习中的迁移学习(Transfer learning)和多任务学习(Multi-task learning),以及端到端深度学习(End-to-end deep learning):

这周课程的选择题作业仍然是一个案例研究,关于无人驾驶的:Autonomous driving (case study),还是用15个问题串起视频中得知识点,体验依然很棒。

最后,关于Andrew Ng (吴恩达) 深度学习课程系列,Coursera上又启动了新一轮课程周期,9月12号开课,对于错过了上一轮学习的同学,现在加入新的一轮课程刚刚好。不过相信 Andrew Ng 深度学习课程会成为他机器学习课程之后 Coursera 上又一个王牌课程,会不断滚动推出的,所以任何时候加入都不会晚。另外,如果已经加入了这门深度学习课程,建议在学习的过程中即使保存资料,我都是一边学习一边保存这门深度学习课程的相关资料的,包括下载了课程视频用于离线观察,完成Quiz和编程作业之后都会保存一份到电脑上,方便随时查看。

索引:Andrew Ng 深度学习课程小记

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:http://www.52nlp.cn

本文链接地址:Andrew Ng (吴恩达) 深度学习课程小结 http://www.52nlp.cn/?p=9761

维基百科语料中的词语相似度探索

Deep Learning Specialization on Coursera

之前写过《中英文维基百科语料上的Word2Vec实验》,近期有不少同学在这篇文章下留言提问,加上最近一些工作也与Word2Vec相关,于是又做了一些功课,包括重新过了一遍Word2Vec的相关资料,试了一下gensim的相关更新接口,google了一下"wikipedia word2vec" or "维基百科 word2vec" 相关的英中文资料,发现多数还是走得这篇文章的老路,既通过gensim提供的维基百科预处理脚本"gensim.corpora.WikiCorpus"提取维基语料,每篇文章一行文本存放,然后基于gensim的Word2Vec模块训练词向量模型。这里再提供另一个方法来处理维基百科的语料,训练词向量模型,计算词语相似度(Word Similarity)。关于Word2Vec, 如果英文不错,推荐从这篇文章入手读相关的资料: Getting started with Word2Vec

这次我们仅以英文维基百科语料为例,首先依然是下载维基百科的最新XML打包压缩数据,在这个英文最新更新的数据列表下:https://dumps.wikimedia.org/enwiki/latest/ ,找到 "enwiki-latest-pages-articles.xml.bz2" 下载,这份英文维基百科全量压缩数据的打包时间大概是2017年4月4号,大约13G,我通过家里的电脑wget下载大概花了3个小时,电信100M宽带,速度还不错。

接下来就是处理这份压缩的XML英文维基百科语料了,这次我们使用WikiExtractor:

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.

WikiExtractor是一个Python 脚本,专门用于提取和清洗Wikipedia的dump数据,支持Python 2.7 或者 Python 3.3+,无额外依赖,安装和使用都非常方便:

安装:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install

使用:
WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2

......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)

这个过程总计花了2个多小时,提取了大概537万多篇文章。关于我的机器配置,可参考:《深度学习主机攒机小记

提取后的文件按一定顺序切分存储在多个子目录下:

每个子目录下的又存放若干个以wiki_num命名的文件,每个大小在1M左右,这个大小可以通过参数 -b 控制:

-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)

我们看一下wiki_00里的具体内容:


Anarchism

Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.



Autism

Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...

...

每个wiki_num文件里又存放若干个doc,每个doc都有相关的tag标记,包括id, url, title等,很好区分。

这里我们按照Gensim作者提供的word2vec tutorial里"memory-friendly iterator"方式来处理英文维基百科的数据。代码如下,也同步放到了github里:train_word2vec_with_gensim.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen
 
import gensim
import logging
import multiprocessing
import os
import re
import sys
 
from pattern.en import tokenize
from time import time
 
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
 
 
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext
 
 
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = root + '/' + filename
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    rline = cleanhtml(sline)
                    tokenized_line = ' '.join(tokenize(rline))
                    is_alpha_word_line = [word for word in
                                          tokenized_line.lower().split()
                                          if word.isalpha()]
                    yield is_alpha_word_line
 
 
if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Please use python train_with_gensim.py data_path"
        exit()
    data_path = sys.argv[1]
    begin = time()
 
    sentences = MySentences(data_path)
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("data/model/word2vec_gensim")
    model.wv.save_word2vec_format("data/model/word2vec_org",
                                  "data/model/vocabulary",
                                  binary=False)
 
    end = time()
    print "Total procesing time: %d seconds" % (end - begin)

注意其中的word tokenize使用了pattern里的英文tokenize模块,当然,也可以使用nltk里的word_tokenize模块,做一点修改即可,不过nltk对于句尾的一些词的work tokenize处理的不太好。另外我们设定词向量维度为200, 窗口长度为10, 最小出现次数为10,通过 is_alpha() 函数过滤掉标点和非英文词。现在可以用这个脚本来训练英文维基百科的Word2Vec模型了:
python train_word2vec_with_gensim.py enwiki

2017-04-22 14:31:04,703 : INFO : collecting all words and their counts
2017-04-22 14:31:04,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-22 14:31:06,442 : INFO : PROGRESS: at sentence #10000, processed 480546 words, keeping 33925 word types
2017-04-22 14:31:08,104 : INFO : PROGRESS: at sentence #20000, processed 983240 words, keeping 51765 word types
2017-04-22 14:31:09,685 : INFO : PROGRESS: at sentence #30000, processed 1455218 words, keeping 64982 word types
2017-04-22 14:31:11,349 : INFO : PROGRESS: at sentence #40000, processed 1957479 words, keeping 76112 word types
......
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 2 more threads                                                                      2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 1 more threads                                                                      2017-04-23 02:50:59,854 : INFO : worker thread finished; awaiting finish of 0 more threads                                                                      2017-04-23 02:50:59,854 : INFO : training on 8903084745 raw words (6742578791 effective words) took 37805.2s, 178351 effective words/s                          
2017-04-23 02:50:59,855 : INFO : saving Word2Vec object under data/model/word2vec_gensim, separately None                                                       
2017-04-23 02:50:59,855 : INFO : not storing attribute syn0norm                 
2017-04-23 02:50:59,855 : INFO : storing np array 'syn0' to data/model/word2vec_gensim.wv.syn0.npy                                                              
2017-04-23 02:51:00,241 : INFO : storing np array 'syn1neg' to data/model/word2vec_gensim.syn1neg.npy                                                           
2017-04-23 02:51:00,574 : INFO : not storing attribute cum_table                
2017-04-23 02:51:13,886 : INFO : saved data/model/word2vec_gensim               
2017-04-23 02:51:13,886 : INFO : storing vocabulary in data/model/vocabulary    
2017-04-23 02:51:17,480 : INFO : storing 868777x200 projection weights into data/model/word2vec_org                                                             
Total procesing time: 44476 seconds

这个训练过程中大概花了12多小时,训练后的文件存放在data/model下:

我们来测试一下这个英文维基百科的Word2Vec模型:

textminer@textminer:/opt/wiki/data$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from gensim.models import Word2Vec
 
In [2]: en_wiki_word2vec_model = Word2Vec.load('data/model/word2vec_gensim')

首先来测试几个单词的相似单词(Word Similariy):

word:

In [3]: en_wiki_word2vec_model.most_similar('word')
Out[3]: 
[('phrase', 0.8129693269729614),
 ('meaning', 0.7311851978302002),
 ('words', 0.7010501623153687),
 ('adjective', 0.6805518865585327),
 ('noun', 0.6461974382400513),
 ('suffix', 0.6440576314926147),
 ('verb', 0.6319557428359985),
 ('loanword', 0.6262609958648682),
 ('proverb', 0.6240501403808594),
 ('pronunciation', 0.6105246543884277)]

similarity:

In [4]: en_wiki_word2vec_model.most_similar('similarity')
Out[4]: 
[('similarities', 0.8517599701881409),
 ('resemblance', 0.786037266254425),
 ('resemblances', 0.7496883869171143),
 ('affinities', 0.6571112275123596),
 ('differences', 0.6465682983398438),
 ('dissimilarities', 0.6212711930274963),
 ('correlation', 0.6071442365646362),
 ('dissimilarity', 0.6062943935394287),
 ('variation', 0.5970577001571655),
 ('difference', 0.5928016901016235)]

nlp:

In [5]: en_wiki_word2vec_model.most_similar('nlp')
Out[5]: 
[('neurolinguistic', 0.6698148250579834),
 ('psycholinguistic', 0.6388964056968689),
 ('connectionism', 0.6027182936668396),
 ('semantics', 0.5866401195526123),
 ('connectionist', 0.5865628719329834),
 ('bandler', 0.5837364196777344),
 ('phonics', 0.5733655691146851),
 ('psycholinguistics', 0.5613113641738892),
 ('bootstrapping', 0.559638261795044),
 ('psychometrics', 0.5555593967437744)]

learn:

In [6]: en_wiki_word2vec_model.most_similar('learn')
Out[6]: 
[('teach', 0.7533557415008545),
 ('understand', 0.71148681640625),
 ('discover', 0.6749690771102905),
 ('learned', 0.6599283218383789),
 ('realize', 0.6390970349311829),
 ('find', 0.6308424472808838),
 ('know', 0.6171890497207642),
 ('tell', 0.6146825551986694),
 ('inform', 0.6008728742599487),
 ('instruct', 0.5998791456222534)]

man:

In [7]: en_wiki_word2vec_model.most_similar('man')
Out[7]: 
[('woman', 0.7243080735206604),
 ('boy', 0.7029494047164917),
 ('girl', 0.6441491842269897),
 ('stranger', 0.63275545835495),
 ('drunkard', 0.6136815547943115),
 ('gentleman', 0.6122575998306274),
 ('lover', 0.6108279228210449),
 ('thief', 0.609005331993103),
 ('beggar', 0.6083744764328003),
 ('person', 0.597919225692749)]

再来看看其他几个相关接口:

In [8]: en_wiki_word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
Out[8]: [('queen', 0.7752252817153931)]
 
In [9]: en_wiki_word2vec_model.similarity('woman', 'man')
Out[9]: 0.72430799548282099
 
In [10]: en_wiki_word2vec_model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'

我把这篇文章的相关代码还有另一篇“中英文维基百科语料上的Word2Vec实验”的相关代码整理了一下,在github上建立了一个 Wikipedia_Word2vec 的项目,感兴趣的同学可以参考。

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:http://www.52nlp.cn

本文链接地址:维基百科语料中的词语相似度探索 http://www.52nlp.cn/?p=9454

自然语言处理工具包spaCy介绍

Deep Learning Specialization on Coursera

spaCy 是一个Python自然语言处理工具包,诞生于2014年年中,号称“Industrial-Strength Natural Language Processing in Python”,是具有工业级强度的Python NLP工具包。spaCy里大量使用了 Cython 来提高相关模块的性能,这个区别于学术性质更浓的Python NLTK,因此具有了业界应用的实际价值。

安装和编译 spaCy 比较方便,在ubuntu环境下,直接用pip安装即可:

sudo apt-get install build-essential python-dev git
sudo pip install -U spacy

不过安装完毕之后,需要下载相关的模型数据,以英文模型数据为例,可以用"all"参数下载所有的数据:

sudo python -m spacy.en.download all

或者可以分别下载相关的模型和用glove训练好的词向量数据:


# 这个过程下载英文tokenizer,词性标注,句法分析,命名实体识别相关的模型
python -m spacy.en.download parser

# 这个过程下载glove训练好的词向量数据
python -m spacy.en.download glove

下载好的数据放在spacy安装目录下的data里,以我的ubuntu为例:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
776M en-1.1.0
774M en_glove_cc_300_1m_vectors-1.0.0

进入到英文数据模型下:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
424M deps
8.0K meta.json
35M ner
12M pos
84K tokenizer
300M vocab
6.3M wordnet

可以用如下命令检查模型数据是否安装成功:


textminer@textminer:~$ python -c "import spacy; spacy.load('en'); print('OK')"
OK

也可以用pytest进行测试:


# 首先找到spacy的安装路径:
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
/usr/local/lib/python2.7/dist-packages/spacy

# 再安装pytest:
sudo python -m pip install -U pytest

# 最后进行测试:
python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0
rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:
collected 318 items

../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py .....
......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............

============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============

现在可以快速测试一下spaCy的相关功能,我们以英文数据为例,spaCy目前主要支持英文和德文,对其他语言的支持正在陆续加入:


textminer@textminer:~$ ipython
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: import spacy

# 加载英文模型数据,稍许等待
In [2]: nlp = spacy.load('en')

Word tokenize功能,spaCy 1.2版本加了中文tokenize接口,基于Jieba中文分词:

In [3]: test_doc = nlp(u"it's word tokenize test for spacy")

In [4]: print(test_doc)
it's word tokenize test for spacy

In [5]: for token in test_doc:
print(token)
...:
it
's
word
tokenize
test
for
spacy

英文断句:


In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [7]: for sent in test_doc.sents:
print(sent)
...:
Natural language processing (NLP) deals with the application of computational models to text or speech data.
Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.
NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.
From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.


词干化(Lemmatize):


In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

In [9]: for token in test_doc:
print(token, token.lemma_, token.lemma)
...:
(you, u'you', 472)
(are, u'be', 488)
(best, u'good', 556)
(., u'.', 419)
(it, u'it', 473)
(is, u'be', 488)
(lemmatize, u'lemmatize', 1510296)
(test, u'test', 1351)
(for, u'for', 480)
(spacy, u'spacy', 173783)
(., u'.', 419)
(I, u'i', 570)
(love, u'love', 644)
(these, u'these', 642)
(books, u'book', 1011)

词性标注(POS Tagging):


In [10]: for token in test_doc:
print(token, token.pos_, token.pos)
....:
(you, u'PRON', 92)
(are, u'VERB', 97)
(best, u'ADJ', 82)
(., u'PUNCT', 94)
(it, u'PRON', 92)
(is, u'VERB', 97)
(lemmatize, u'ADJ', 82)
(test, u'NOUN', 89)
(for, u'ADP', 83)
(spacy, u'NOUN', 89)
(., u'PUNCT', 94)
(I, u'PRON', 92)
(love, u'VERB', 97)
(these, u'DET', 87)
(books, u'NOUN', 89)

命名实体识别(NER):


In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [12]: for ent in test_doc.ents:
print(ent, ent.label_, ent.label)
....:
(Rami Eid, u'PERSON', 346)
(Stony Brook University, u'ORG', 349)
(New York, u'GPE', 350)

名词短语提取:


In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [14]: for np in test_doc.noun_chunks:
print(np)
....:
Natural language processing
Natural language processing (NLP) deals
the application
computational models
text
speech
data
Application areas
NLP
automatic (machine) translation
languages
dialogue systems
a human
a machine
natural language
information extraction
the goal
unstructured text
structured (database) representations
flexible ways
NLP technologies
a dramatic impact
the way
people
computers
the way
people
the use
language
the way
people
the vast amount
linguistic data
electronic form
a scientific viewpoint
NLP
fundamental questions
formal models
example
natural language phenomena
algorithms
these models

基于词向量计算两个单词的相似度:


In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

In [16]: apples = test_doc[0]

In [17]: print(apples)
Apples

In [18]: oranges = test_doc[2]

In [19]: print(oranges)
oranges

In [20]: boots = test_doc[6]

In [21]: print(boots)
Boots

In [22]: hippos = test_doc[8]

In [23]: print(hippos)
hippos

In [24]: apples.similarity(oranges)
Out[24]: 0.77809414836023805

In [25]: boots.similarity(hippos)
Out[25]: 0.038474555379008429

当然,spaCy还包括句法分析的相关功能等。另外值得关注的是 spaCy 从1.0版本起,加入了对深度学习工具的支持,例如 Tensorflow 和 Keras 等,这方面具体可以参考官方文档给出的一个对情感分析(Sentiment Analysis)模型进行分析的例子:Hooking a deep learning model into spaCy.

参考:
spaCy官方文档
Getting Started with spaCy

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:http://www.52nlp.cn

本文链接地址:自然语言处理工具包spaCy介绍 http://www.52nlp.cn/?p=9386

反向传播算法入门资源索引

Deep Learning Specialization on Coursera

1、一切从维基百科开始,大致了解一个全貌:
反向传播算法 Backpropagation

2、拿起纸和笔,再加上ipython or 计算器,通过一个例子直观感受反向传播算法:
A Step by Step Backpropagation Example

3、再玩一下上篇例子对应的200多行Python代码: Neural Network with Backpropagation

4、有了上述直观的反向传播算法体验,可以从1986年这篇经典的论文入手了:Learning representations by back-propagating errors

5、如果还是觉得晦涩,推荐读一下"Neural Networks and Deep Learning"这本深度学习在线书籍的第二章:How the backpropagation algorithm works

6、或者可以通过油管看一下这个神经网络教程的前几节关于反向传播算法的视频: Neural Network Tutorial

7、hankcs 同学对于上述视频和相关材料有一个解读: 反向传播神经网络极简入门

8、这里还有一个比较简洁的数学推导:Derivation of Backpropagation

9、神牛gogo 同学对反向传播算法原理及代码解读:神经网络反向传播的数学原理

10、关于反向传播算法,更本质一个解释:自动微分反向模式(Reverse-mode differentiation )Calculus on Computational Graphs: Backpropagation

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:http://www.52nlp.cn

本文链接地址:反向传播算法入门资源索引 http://www.52nlp.cn/?p=9350

深度学习主机环境配置: Ubuntu16.04+GeForce GTX 1080+TensorFlow

Deep Learning Specialization on Coursera

Update: 文章写于一年前,有些地方已经不适合了,最近升级了一下深度学习服务器,同时配置了一下环境,新写了文章,可以同时参考: 从零开始搭建深度学习服务器: 基础环境配置(Ubuntu + GTX 1080 TI + CUDA + cuDNN) 从零开始搭建深度学习服务器: 深度学习工具安装(TensorFlow + PyTorch + Torch)

这个系列写了好几篇文章,这是相关文章的索引,仅供参考:

接上文《深度学习主机环境配置: Ubuntu16.04+Nvidia GTX 1080+CUDA8.0》,我们继续来安装 TensorFlow,使其支持GeForce GTX 1080显卡。

1 下载和安装cuDNN

cuDNN全称 CUDA Deep Neural Network library,是NVIDIA专门针对深度神经网络设计的一套GPU计算加速库,被广泛用于各种深度学习框架,例如Caffe, TensorFlow, Theano, Torch, CNTK等。

The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers. cuDNN is part of the NVIDIA Deep Learning SDK.

Deep learning researchers and framework developers worldwide rely on cuDNN for high-performance GPU acceleration. It allows them to focus on training neural networks and developing software applications rather than spending time on low-level GPU performance tuning. cuDNN accelerates widely used deep learning frameworks, including Caffe, TensorFlow, Theano, Torch, and CNTK. See supported frameworks for more details.

首先需要下载cuDNN,直接从Nvidia官方下载链接选择一个版本,不过下载cuDNN前同样需要登录甚至填写一个简单的调查问卷: https://developer.nvidia.com/rdp/cudnn-download,这里选择的是支持CUDA8.0的cuDNN v5版本,而支持CUDA8的5.1版本虽然显示在下载选择项里,但是提示:cuDNN 5.1 RC for CUDA 8RC will be available soon - please check back again.

屏幕快照 2016-07-17 上午11.17.39

安装cuDNN比较简单,解压后把相应的文件拷贝到对应的CUDA目录下即可:

tar -zxvf cudnn-8.0-linux-x64-v5.0-ga.tgz

cuda/include/cudnn.h
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.5
cuda/lib64/libcudnn.so.5.0.5
cuda/lib64/libcudnn_static.a

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

继续阅读

斯坦福大学深度学习与自然语言处理第一讲:引言

Deep Learning Specialization on Coursera

斯坦福大学在三月份开设了一门“深度学习与自然语言处理”的课程:CS224d: Deep Learning for Natural Language Processing,授课老师是青年才俊 Richard Socher,他本人是德国人,大学期间涉足自然语言处理,在德国读研时又专攻计算机视觉,之后在斯坦福大学攻读博士学位,拜师NLP领域的巨牛 Chris ManningDeep Learning 领域的巨牛 Andrew Ng,其博士论文是《Recursive Deep Learning for Natural Language Processing and Computer Vision》,也算是多年求学生涯的完美一击。毕业后以联合创始人及CTO的身份创办了MetaMind,作为AI领域的新星创业公司,MetaMind创办之初就拿了800万美元的风投,值得关注。

回到这们课程CS224d,其实可以翻译为“面向自然语言处理的深度学习(Deep Learning for Natural Language Processing)”,这门课程是面向斯坦福学生的校内课程,不过课程的相关材料都放到了网上,包括课程视频,课件,相关知识,预备知识,作业等等,相当齐备。课程大纲相当有章法和深度,从基础讲起,再讲到深度学习在NLP领域的具体应用,包括命名实体识别,机器翻译,句法分析器,情感分析等。Richard Socher此前在ACL 2012和NAACL 2013 做过一个Tutorial,Deep Learning for NLP (without Magic),感兴趣的同学可以先参考一下: Deep Learning for NLP (without Magic) - ACL 2012 Tutorial - 相关视频及课件 。另外,由于这门课程的视频放在Youtube上,@爱可可-爱生活 老师维护了一个网盘链接:http://pan.baidu.com/s/1pJyrXaF ,同步更新相关资料,可以关注。
继续阅读

Python 网页爬虫 & 文本处理 & 科学计算 & 机器学习 & 数据挖掘兵器谱

Deep Learning Specialization on Coursera

曾经因为NLTK的缘故开始学习Python,之后渐渐成为我工作中的第一辅助脚本语言,虽然开发语言是C/C++,但平时的很多文本数据处理任务都交给了Python。离开腾讯创业后,第一个作品课程图谱也是选择了Python系的Flask框架,渐渐的将自己的绝大部分工作交给了Python。这些年来,接触和使用了很多Python工具包,特别是在文本处理,科学计算,机器学习和数据挖掘领域,有很多很多优秀的Python工具包可供使用,所以作为Pythoner,也是相当幸福的。其实如果仔细留意微博,你会发现很多这方面的分享,自己也Google了一下,发现也有同学总结了“Python机器学习库”,不过总感觉缺少点什么。最近流行一个词,全栈工程师(full stack engineer),作为一个苦逼的创业者,天然的要把自己打造成一个full stack engineer,而这个过程中,这些Python工具包给自己提供了足够的火力,所以想起了这个系列。当然,这也仅仅是抛砖引玉,希望大家能提供更多的线索,来汇总整理一套Python网页爬虫,文本处理,科学计算,机器学习和数据挖掘的兵器谱。

一、Python网页爬虫工具集

一个真实的项目,一定是从获取数据开始的。无论文本处理,机器学习和数据挖掘,都需要数据,除了通过一些渠道购买或者下载的专业数据外,常常需要大家自己动手爬数据,这个时候,爬虫就显得格外重要了,幸好,Python提供了一批很不错的网页爬虫工具框架,既能爬取数据,也能获取和清洗数据,我们也就从这里开始了:

1. Scrapy

Scrapy, a fast high-level screen scraping and web crawling framework for Python.

鼎鼎大名的Scrapy,相信不少同学都有耳闻,课程图谱中的很多课程都是依靠Scrapy抓去的,这方面的介绍文章有很多,推荐大牛pluskid早年的一篇文章:《Scrapy 轻松定制网络爬虫》,历久弥新。

官方主页:http://scrapy.org/
Github代码页: https://github.com/scrapy/scrapy

2. Beautiful Soup

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

读书的时候通过《集体智慧编程》这本书知道Beautiful Soup的,后来也偶尔会用用,非常棒的一套工具。客观的说,Beautifu Soup不完全是一套爬虫工具,需要配合urllib使用,而是一套HTML/XML数据分析,清洗和获取工具。

官方主页:http://www.crummy.com/software/BeautifulSoup/

3. Python-Goose

Html Content / Article Extractor, web scrapping lib in Python

Goose最早是用Java写得,后来用Scala重写,是一个Scala项目。Python-Goose用Python重写,依赖了Beautiful Soup。前段时间用过,感觉很不错,给定一个文章的URL, 获取文章的标题和内容很方便。

Github主页:https://github.com/grangier/python-goose

二、Python文本处理工具集

从网页上获取文本数据之后,依据任务的不同,就需要进行基本的文本处理了,譬如对于英文来说,需要基本的tokenize,对于中文,则需要常见的中文分词,进一步的话,无论英文中文,还可以词性标注,句法分析,关键词提取,文本分类,情感分析等等。这个方面,特别是面向英文领域,有很多优秀的工具包,我们一一道来。
继续阅读

如何计算两个文档的相似度(三)

Deep Learning Specialization on Coursera

上一节我们用了一个简单的例子过了一遍gensim的用法,这一节我们将用课程图谱的实际数据来做一些验证和改进,同时会用到NLTK来对课程的英文数据做预处理。

三、课程图谱相关实验

1、数据准备
为了方便大家一起来做验证,这里准备了一份Coursera的课程数据,可以在这里下载:coursera_corpus,(百度网盘链接: http://t.cn/RhjgPkv,密码: oppc)总共379个课程,每行包括3部分内容:课程名\t课程简介\t课程详情, 已经清除了其中的html tag, 下面所示的例子仅仅是其中的课程名:

Writing II: Rhetorical Composing
Genetics and Society: A Course for Educators
General Game Playing
Genes and the Human Condition (From Behavior to Biotechnology)
A Brief History of Humankind
New Models of Business in Society
Analyse Numérique pour Ingénieurs
Evolution: A Course for Educators
Coding the Matrix: Linear Algebra through Computer Science Applications
The Dynamic Earth: A Course for Educators
...

好了,首先让我们打开Python, 加载这份数据:

>>> courses = [line.strip() for line in file('coursera_corpus')]
>>> courses_name = [course.split('\t')[0] for course in courses]
>>> print courses_name[0:10]
['Writing II: Rhetorical Composing', 'Genetics and Society: A Course for Educators', 'General Game Playing', 'Genes and the Human Condition (From Behavior to Biotechnology)', 'A Brief History of Humankind', 'New Models of Business in Society', 'Analyse Num\xc3\xa9rique pour Ing\xc3\xa9nieurs', 'Evolution: A Course for Educators', 'Coding the Matrix: Linear Algebra through Computer Science Applications', 'The Dynamic Earth: A Course for Educators']

2、引入NLTK
NTLK是著名的Python自然语言处理工具包,但是主要针对的是英文处理,不过课程图谱目前处理的课程数据主要是英文,因此也足够了。NLTK配套有文档,有语料库,有书籍,甚至国内有同学无私的翻译了这本书: 用Python进行自然语言处理,有时候不得不感慨:做英文自然语言处理的同学真幸福。

首先仍然是安装NLTK,在NLTK的主页详细介绍了如何在Mac, Linux和Windows下安装NLTK:http://nltk.org/install.html ,最主要的还是要先装好依赖NumPy和PyYAML,其他没什么问题。安装NLTK完毕,可以import nltk测试一下,如果没有问题,还有一件非常重要的工作要做,下载NLTK官方提供的相关语料:

>>> import nltk
>>> nltk.download()

这个时候会弹出一个图形界面,会显示两份数据供你下载,分别是all-corpora和book,最好都选定下载了,这个过程需要一段时间,语料下载完毕后,NLTK在你的电脑上才真正达到可用的状态,可以测试一下布朗语料库

>>> from nltk.corpus import brown
>>> brown.readme()
'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
1161192

现在我们就来处理刚才的课程数据,如果按此前的方法仅仅对文档的单词小写化的话,我们将得到如下的结果:

>>> texts_lower = [[word for word in document.lower().split()] for document in courses]
>>> print texts_lower[0]
['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...

注意其中很多标点符号和单词是没有分离的,所以我们引入nltk的word_tokenize函数,并处理相应的数据:

>>> from nltk.tokenize import word_tokenize
>>> texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]
>>> print texts_tokenized[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading', ',', 'research', ',', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic', ',', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers', '...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...

对课程的英文数据进行tokenize之后,我们需要去停用词,幸好NLTK提供了一份英文停用词数据:

>>> from nltk.corpus import stopwords
>>> english_stopwords = stopwords.words('english')
>>> print english_stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
>>> len(english_stopwords)
127

总计127个停用词,我们首先过滤课程语料中的停用词:
>>> texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]
>>> print texts_filtered_stopwords[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', ',', 'research', ',', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', ',', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', ',', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', ',', 'visual', ',', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', ',', 'demonstrations', ',', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', ',', 'rhetoric', 'course', 'design', ',', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', ',', 'students', ',', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', ',', 'writers', 'exchange', ',', 'place', 'exchange', 'work', 'feedback']

停用词被过滤了,不过发现标点符号还在,这个好办,我们首先定义一个标点符号list:
>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']

然后过滤这些标点符号:
>>> texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]
>>> print texts_filtered[0]
['writing', 'ii', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', 'research', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', 'ideas', 'talents', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', 'visual', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', 'demonstrations', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', 'rhetoric', 'course', 'design', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', 'students', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', 'writers', 'exchange', 'place', 'exchange', 'work', 'feedback']

更进一步,我们对这些英文单词词干化(Stemming),NLTK提供了好几个相关工具接口可供选择,具体参考这个页面: http://nltk.org/api/nltk.stem.html , 可选的工具包括Lancaster Stemmer, Porter Stemmer等知名的英文Stemmer。这里我们使用LancasterStemmer:

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('stemmed')
'stem'
>>> st.stem('stemming')
'stem'
>>> st.stem('stemmer')
'stem'
>>> st.stem('running')
'run'
>>> st.stem('maximum')
'maxim'
>>> st.stem('presumably')
'presum'

让我们调用这个接口来处理上面的课程数据:
>>> texts_stemmed = [[st.stem(word) for word in docment] for docment in texts_filtered]
>>> print texts_stemmed[0]
['writ', 'ii', 'rhet', 'compos', 'rhet', 'compos', 'eng', 'sery', 'interact', 'read', 'research', 'compos', 'act', 'along', 'assign', 'design', 'help', 'becom', 'effect', 'consum', 'produc', 'alphabet', 'vis', 'multimod', 'texts.', 'join', 'us', 'becom', 'effect', 'writ', '...', 'bet', 'citizens.', 'rhet', 'compos', 'cours', 'writ', 'exchang', 'word', 'idea', 'tal', 'support.', 'introduc', 'vary', 'rhet', 'concepts\xe2\x80\x94that', 'idea', 'techn', 'inform', 'persuad', 'audiences\xe2\x80\x94that', 'help', 'becom', 'effect', 'consum', 'produc', 'writ', 'vis', 'multimod', 'texts.', 'class', 'includ', 'short', 'video', 'demonst', 'activities.', 'envid', 'rhet', 'compos', 'learn', 'commun', 'includ', 'enrol', 'cours', 'instructors.', 'bring', 'expert', 'writ', 'rhet', 'cours', 'design', 'design', 'assign', 'cours', 'infrastruct', 'help', 'shar', 'expery', 'writ', 'stud', 'profess', 'us.', 'collab', 'facilit', 'wex', 'writ', 'exchang', 'plac', 'exchang', 'work', 'feedback']

在我们引入gensim之前,还有一件事要做,去掉在整个语料库中出现次数为1的低频词,测试了一下,不去掉的话对效果有些影响:

>>> all_stems = sum(texts_stemmed, [])
>>> stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]

3、引入gensim
有了上述的预处理,我们就可以引入gensim,并快速的做课程相似度的实验了。以下会快速的过一遍流程,具体的可以参考上一节的详细描述。

>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> dictionary = corpora.Dictionary(texts)
2013-06-07 21:37:07,120 : INFO : adding document #0 to Dictionary(0 unique tokens)
2013-06-07 21:37:07,263 : INFO : built Dictionary(3341 unique tokens) from 379 documents (total 46417 corpus positions)

>>> corpus = [dictionary.doc2bow(text) for text in texts]

>>> tfidf = models.TfidfModel(corpus)
2013-06-07 21:58:30,490 : INFO : collecting document frequencies
2013-06-07 21:58:30,490 : INFO : PROGRESS: processing document #0
2013-06-07 21:58:30,504 : INFO : calculating IDF weights for 379 documents and 3341 features (29166 matrix non-zeros)

>>> corpus_tfidf = tfidf[corpus]

这里我们拍脑门决定训练topic数量为10的LSI模型:
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

>>> index = similarities.MatrixSimilarity(lsi[corpus])
2013-06-07 22:04:55,443 : INFO : scanning corpus to determine the number of features
2013-06-07 22:04:55,510 : INFO : creating matrix for 379 documents and 10 features

基于LSI模型的课程索引建立完毕,我们以Andrew Ng教授的机器学习公开课为例,这门课程在我们的coursera_corpus文件的第211行,也就是:

>>> print courses_name[210]
Machine Learning

现在我们就可以通过lsi模型将这门课程映射到10个topic主题模型空间上,然后和其他课程计算相似度:
>>> ml_course = texts[210]
>>> ml_bow = dicionary.doc2bow(ml_course)
>>> ml_lsi = lsi[ml_bow]
>>> print ml_lsi
[(0, 8.3270084238788673), (1, 0.91295652151975082), (2, -0.28296075112669405), (3, 0.0011599008827843801), (4, -4.1820134980024255), (5, -0.37889856481054851), (6, 2.0446999575052125), (7, 2.3297944485200031), (8, -0.32875594265388536), (9, -0.30389668455507612)]
>>> sims = index[ml_lsi]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

取按相似度排序的前10门课程:
>>> print sort_sims[0:10]
[(210, 1.0), (174, 0.97812241), (238, 0.96428639), (203, 0.96283489), (63, 0.9605484), (189, 0.95390636), (141, 0.94975704), (184, 0.94269753), (111, 0.93654782), (236, 0.93601125)]

第一门课程是它自己:
>>> print courses_name[210]
Machine Learning

第二门课是Coursera上另一位大牛Pedro Domingos机器学习公开课
>>> print courses_name[174]
Machine Learning

第三门课是Coursera的另一位创始人,同样是大牛的Daphne Koller教授的概率图模型公开课
>>> print courses_name[238]
Probabilistic Graphical Models

第四门课是另一位超级大牛Geoffrey Hinton的神经网络公开课,有同学评价是Deep Learning的必修课。
>>> print courses_name[203]
Neural Networks for Machine Learning

感觉效果还不错,如果觉得有趣的话,也可以动手试试。

好了,这个系列就到此为止了,原计划写一下在英文维基百科全量数据上的实验,因为课程图谱目前暂时不需要,所以就到此为止,感兴趣的同学可以直接阅读gensim上的相关文档,非常详细。之后我可能更关注将NLTK应用到中文信息处理上,欢迎关注。

注:原创文章,转载请注明出处“我爱自然语言处理”:www.52nlp.cn

本文链接地址:http://www.52nlp.cn/如何计算两个文档的相似度三