分类目录归档:数据挖掘

fastText原理及实践(达观数据王江)

王江题图

本文首先会介绍一些预备知识,比如softmax、ngram等,然后简单介绍word2vec原理,之后来讲解fastText的原理,并着手使用keras搭建一个简单的fastText分类器,最后,我们会介绍fastText在达观数据的应用。

 

NO.1
预备知识
Softmax回归

 

Softmax回归(Softmax Regression)又被称作多项逻辑回归(multinomial logistic regression),它是逻辑回归在处理多类别任务上的推广。

在逻辑回归中, 我们有m个被标注的样本:技术干货丨fastText原理及实践其中技术干货丨fastText原理及实践。因为类标是二元的,所以我们有技术干货丨fastText原理及实践。我们的假设(hypothesis)有如下形式:

技术干货丨fastText原理及实践

代价函数(cost function)如下:

技术干货丨fastText原理及实践

在Softmax回归中,类标是大于2的,因此在我们的训练集技术干货丨fastText原理及实践

中,技术干货丨fastText原理及实践。给定一个测试输入x,我们的假设应该输出一个K维的向量,向量内每个元素的值表示x属于当前类别的概率。具体地,假设技术干货丨fastText原理及实践形式如下:

技术干货丨fastText原理及实践

代价函数如下:

技术干货丨fastText原理及实践

 

其中1{·}是指示函数,即1=1,1=0

既然我们说Softmax回归是逻辑回归的推广,那我们是否能够在代价函数上推导出它们的一致性呢?当然可以,于是:

技术干货丨fastText原理及实践

可以看到,逻辑回归是softmax回归在K=2时的特例。

分层Softmax

 

你可能也发现了,标准的Softmax回归中,要计算y=j时的Softmax概率:技术干货丨fastText原理及实践,我们需要对所有的K个概率做归一化,这在|y|很大时非常耗时。于是,分层Softmax诞生了,它的基本思想是使用树的层级结构替代扁平化的标准Softmax,使得在计算技术干货丨fastText原理及实践时,只需计算一条路径上的所有节点的概率值,无需在意其它的节点。

下图是一个分层Softmax示例:

11

树的结构是根据类标的频数构造的霍夫曼树。K个不同的类标组成所有的叶子节点,K-1个内部节点作为内部参数,从根节点到某个叶子节点经过的节点和边形成一条路径,路径长度被表示为技术干货丨fastText原理及实践。于是,技术干货丨fastText原理及实践就可以被写成:

技术干货丨fastText原理及实践

其中:

技术干货丨fastText原理及实践表示sigmoid函数;

技术干货丨fastText原理及实践表示n节点的左孩子;

技术干货丨fastText原理及实践是一个特殊的函数,被定义为:

技术干货丨fastText原理及实践

技术干货丨fastText原理及实践是中间节点技术干货丨fastText原理及实践的参数;X是Softmax层的输入。

上图中,高亮的节点和边是从根节点到 技术干货丨fastText原理及实践 的路径,路径长度技术干货丨fastText原理及实践

可以被表示为:

技术干货丨fastText原理及实践

于是,从根节点走到叶子节点技术干货丨fastText原理及实践,实际上是在做了3次二分类的逻辑回归。

通过分层的Softmax,计算复杂度一下从|K|降低到log|K|。

 

n-gram特征

 

在文本特征提取中,常常能看到n-gram的身影。它是一种基于语言模型的算法,基本思想是将文本内容按照字节顺序进行大小为N的滑动窗口操作,最终形成长度为N的字节片段序列。看下面的例子:

我来到达观数据参观

相应的bigram特征为:我来 来到 到达 达观 观数 数据 据参 参观

相应的trigram特征为:我来到 来到达 到达观 达观数 观数据 数据参 据参观

 

注意一点:n-gram中的gram根据粒度不同,有不同的含义。它可以是字粒度,也可以是词粒度的。上面所举的例子属于字粒度的n-gram,词粒度的n-gram看下面例子:

 

我 来到 达观数据 参观
 
相应的bigram特征为:我/来到 来到/达观数据 达观数据/参观
相应的trigram特征为:我/来到/达观数据 来到/达观数据/参观 

n-gram产生的特征只是作为文本特征的候选集,你后面可能会采用信息熵、卡方统计、IDF等文本特征选择方式筛选出比较重要特征。

 

NO.2

Word2vec

你可能要问,这篇文章不是介绍fastText的么,怎么开始介绍起了word2vec?

最主要的原因是word2vec的CBOW模型架构和fastText模型非常相似。于是,你看到facebook开源的fastText工具不仅实现了fastText文本分类工具,还实现了快速词向量训练工具。word2vec主要有两种模型:skip-gram 模型和CBOW模型,这里只介绍CBOW模型,有关skip-gram模型的内容请参考达观另一篇技术文章:

漫谈Word2vec之skip-gram模型

模型架构

CBOW模型的基本思路是:用上下文预测目标词汇。架构图如下所示:

技术干货丨fastText原理及实践

输入层由目标词汇y的上下文单词 技术干货丨fastText原理及实践 组成, 技术干货丨fastText原理及实践 是被onehot编码过的V维向量,其中V是词汇量;隐含层是N维向量h;输出层是被onehot编码过的目标词y。输入向量通过 技术干货丨fastText原理及实践维的权重矩阵W连接到隐含层;隐含层通过 技术干货丨fastText原理及实践 维的权重矩阵 技术干货丨fastText原理及实践 连接到输出层。因为词库V往往非常大,使用标准的softmax计算相当耗时,于是CBOW的输出层采用的正是上文提到过的分层Softmax。

前向传播

输入是如何计算而获得输出呢?先假设我们已经获得了权重矩阵技术干货丨fastText原理及实践技术干货丨fastText原理及实践(具体的推导见第3节),隐含层h的输出的计算公式:

技术干货丨fastText原理及实践

即:隐含层的输出是C个上下文单词向量的加权平均,权重为W

接着我们计算输出层的每个节点:

技术干货丨fastText原理及实践

这里技术干货丨fastText原理及实践是矩阵技术干货丨fastText原理及实践的第j列,最后,将技术干货丨fastText原理及实践作为softmax函数的输入,得到技术干货丨fastText原理及实践

技术干货丨fastText原理及实践

反向传播学习权重矩阵

在学习权重矩阵和过程中,我们首先随机产生初始值,然后feed训练样本到我们的模型,并观测我们期望输出和真实输出的误差。接着,我们计算误差关于权重矩阵的梯度,并在梯度的方向纠正它们。

首先定义损失函数,objective是最大化给定输入上下文,target单词的条件概率。因此,损失函数为:

技术干货丨fastText原理及实践

这里,技术干货丨fastText原理及实践表示目标单词在词库V中的索引。

如何更新权重技术干货丨fastText原理及实践?

我们先对E关于技术干货丨fastText原理及实践求导:

技术干货丨fastText原理及实践

技术干货丨fastText原理及实践函数表示:

技术干货丨fastText原理及实践

于是,技术干货丨fastText原理及实践的更新公式:

技术干货丨fastText原理及实践

如何更新权重W

我们首先计算E关于隐含层节点的导数:

技术干货丨fastText原理及实践

然后,E关于权重的导数为:

技术干货丨fastText原理及实践

于是,技术干货丨fastText原理及实践的更新公式:

技术干货丨fastText原理及实践

 

NO.3

fastText分类

终于到我们的fastText出场了。这里有一点需要特别注意,一般情况下,使用fastText进行文本分类的同时也会产生词的embedding,即embedding是fastText分类的产物。除非你决定使用预训练的embedding来训练fastText分类模型,这另当别论。

字符级别的n-gram

word2vec把语料库中的每个单词当成原子的,它会为每个单词生成一个向量。这忽略了单词内部的形态特征,比如:“apple” 和“apples”,“达观数据”和“达观”,这两个例子中,两个单词都有较多公共字符,即它们的内部形态类似,但是在传统的word2vec中,这种单词内部形态信息因为它们被转换成不同的id丢失了。

 

为了克服这个问题,fastText使用了字符级别的n-grams来表示一个单词。对于单词“apple”,假设n的取值为3,则它的trigram有:

“<ap”,  “app”,  “ppl”,  “ple”, “le>”

其中,<表示前缀,>表示后缀。于是,我们可以用这些trigram来表示“apple”这个单词,进一步,我们可以用这5个trigram的向量叠加来表示“apple”的词向量。

这带来两点好处

1. 对于低频词生成的词向量效果会更好。因为它们的n-gram可以和其它词共享。

2. 对于训练词库之外的单词,仍然可以构建它们的词向量。我们可以叠加它们的字符级n-gram向量。

模型架构

之前提到过,fastText模型架构和word2vec的CBOW模型架构非常相似。下面是fastText模型架构图:

技术干货丨fastText原理及实践

注意:此架构图没有展示词向量的训练过程。可以看到,和CBOW一样,fastText模型也只有三层:输入层、隐含层、输出层(Hierarchical Softmax),输入都是多个经向量表示的单词,输出都是一个特定的target,隐含层都是对多个词向量的叠加平均。

不同的是,CBOW的输入是目标单词的上下文,fastText的输入是多个单词及其n-gram特征,这些特征用来表示单个文档;CBOW的输入单词被onehot编码过,fastText的输入特征是被embedding过;CBOW的输出是目标词汇,fastText的输出是文档对应的类标。

值得注意的是,fastText在输入时,将单词的字符级别的n-gram向量作为额外的特征;在输出时,fastText采用了分层Softmax,大大降低了模型训练时间。这两个知识点在前文中已经讲过,这里不再赘述。

fastText相关公式的推导和CBOW非常类似,这里也不展开了。

核心思想

现在抛开那些不是很讨人喜欢的公式推导,来想一想fastText文本分类的核心思想是什么?

仔细观察模型的后半部分,即从隐含层输出到输出层输出,会发现它就是一个softmax线性多类别分类器,分类器的输入是一个用来表征当前文档的向量;模型的前半部分,即从输入层输入到隐含层输出部分,主要在做一件事情:生成用来表征文档的向量。那么它是如何做的呢?叠加构成这篇文档的所有词及n-gram的词向量,然后取平均。叠加词向量背后的思想就是传统的词袋法,即将文档看成一个由词构成的集合。

于是fastText的核心思想就是:将整篇文档的词及n-gram向量叠加平均得到文档向量,然后使用文档向量做softmax多分类。这中间涉及到两个技巧:字符级n-gram特征的引入以及分层Softmax分类。

关于分类效果

还有个问题,就是为何fastText的分类效果常常不输于传统的非线性分类器?

假设我们有两段文本:

我 来到 达观数据

俺 去了 达而观信息科技

这两段文本意思几乎一模一样,如果要分类,肯定要分到同一个类中去。但在传统的分类器中,用来表征这两段文本的向量可能差距非常大。传统的文本分类中,你需要计算出每个词的权重,比如tfidf值, “我”和“俺” 算出的tfidf值相差可能会比较大,其它词类似,于是,VSM(向量空间模型)中用来表征这两段文本的文本向量差别可能比较大。

 

但是fastText就不一样了,它是用单词的embedding叠加获得的文档向量,词向量的重要特点就是向量的距离可以用来衡量单词间的语义相似程度,于是,在fastText模型中,这两段文本的向量应该是非常相似的,于是,它们很大概率会被分到同一个类中。

使用词embedding而非词本身作为特征,这是fastText效果好的一个原因;另一个原因就是字符级n-gram特征的引入对分类效果会有一些提升 。

 

NO.4

手写一个fastText

keras是一个抽象层次很高的神经网络API,由python编写,底层可以基于Tensorflow、Theano或者CNTK。它的优点在于:用户友好、模块性好、易扩展等。所以下面我会用keras简单搭一个fastText的demo版,生产可用的fastText请移步https://github.com/facebookresearch/fastText

如果你弄懂了上面所讲的它的原理,下面的demo对你来讲应该是非常明了的。

为了简化我们的任务:

1. 训练词向量时,我们使用正常的word2vec方法,而真实的fastText还附加了字符级别的n-gram作为特征输入;

2. 我们的输出层使用简单的softmax分类,而真实的fastText使用的是Hierarchical Softmax。

首先定义几个常量:

VOCAB_SIZE = 2000

EMBEDDING_DIM =100

MAX_WORDS = 500

CLASS_NUM = 5

VOCAB_SIZE表示词汇表大小,这里简单设置为2000;

EMBEDDING_DIM表示经过embedding层输出,每个词被分布式表示的向量的维度,这里设置为100。比如对于“达观”这个词,会被一个长度为100的类似于[ 0.97860014, 5.93589592, 0.22342691, -3.83102846, -0.23053935, …]的实值向量来表示;

MAX_WORDS表示一篇文档最多使用的词个数,因为文档可能长短不一(即词数不同),为了能feed到一个固定维度的神经网络,我们需要设置一个最大词数,对于词数少于这个阈值的文档,我们需要用“未知词”去填充。比如可以设置词汇表中索引为0的词为“未知词”,用0去填充少于阈值的部分;

CLASS_NUM表示类别数,多分类问题,这里简单设置为5。

模型搭建遵循以下步骤

1. 添加输入层(embedding层)。Embedding层的输入是一批文档,每个文档由一个词汇索引序列构成。例如:[10, 30, 80, 1000] 可能表示“我 昨天 来到 达观数据”这个短文本,其中“我”、“昨天”、“来到”、“达观数据”在词汇表中的索引分别是10、30、80、1000;Embedding层将每个单词映射成EMBEDDING_DIM维的向量。于是:input_shape=(BATCH_SIZE, MAX_WORDS), output_shape=(BATCH_SIZE,MAX_WORDS, EMBEDDING_DIM);

2. 添加隐含层(投影层)。投影层对一个文档中所有单词的向量进行叠加平均。keras提供的GlobalAveragePooling1D类可以帮我们实现这个功能。这层的input_shape是Embedding层的output_shape,这层的output_shape=( BATCH_SIZE, EMBEDDING_DIM);

3. 添加输出层(softmax层)。真实的fastText这层是Hierarchical Softmax,因为keras原生并没有支持Hierarchical Softmax,所以这里用Softmax代替。这层指定了CLASS_NUM,对于一篇文档,输出层会产生CLASS_NUM个概率值,分别表示此文档属于当前类的可能性。这层的output_shape=(BATCH_SIZE, CLASS_NUM)

4. 指定损失函数、优化器类型、评价指标,编译模型。损失函数我们设置为categorical_crossentropy,它就是我们上面所说的softmax回归的损失函数;优化器我们设置为SGD,表示随机梯度下降优化器;评价指标选择accuracy,表示精度。

用训练数据feed模型时,你需要:

1. 将文档分好词,构建词汇表。词汇表中每个词用一个整数(索引)来代替,并预留“未知词”索引,假设为0;

2. 对类标进行onehot化。假设我们文本数据总共有3个类别,对应的类标分别是1、2、3,那么这三个类标对应的onehot向量分别是[1, 0,
0]、[0, 1, 0]、[0, 0, 1];

3. 对一批文本,将每个文本转化为词索引序列,每个类标转化为onehot向量。就像之前的例子,“我 昨天 来到 达观数据”可能被转化为[10, 30,
80, 1000];它属于类别1,它的类标就是[1, 0, 0]。由于我们设置了MAX_WORDS=500,这个短文本向量后面就需要补496个0,即[10, 30, 80, 1000, 0, 0, 0, …, 0]。因此,batch_xs的 维度为( BATCH_SIZE,MAX_WORDS),batch_ys的维度为(BATCH_SIZE, CLASS_NUM)。

下面是构建模型的代码,数据处理、feed数据到模型的代码比较繁琐,这里不展示。

技术干货丨fastText原理及实践

 

NO.5

fastText原理及实践

fastText在达观数据的应用

fastText作为诞生不久的词向量训练、文本分类工具,在达观得到了比较深入的应用。主要被用在以下两个系统:

1. 同近义词挖掘。Facebook开源的fastText工具也实现了词向量的训练,达观基于各种垂直领域的语料,使用其挖掘出一批同近义词;

2. 文本分类系统。在类标数、数据量都比较大时,达观会选择fastText 来做文本分类,以实现快速训练预测、节省内存的目的。

http://credit-n.ru/zaymyi-next.html

Coursera上Python课程(公开课)汇总推荐:从Python入门到应用Python

Python是深度学习时代的语言,Coursera上有很多Python课程,从Python入门到精通,从Python基础语法到应用Python,满足各个层次的需求,以下是Coursera上的Python课程整理,仅供参考,这里也会持续更新。

1. 密歇根大学的“Python for Everybody Specialization(人人都可以学习的Python专项课程)”

这个系列对于学习者的编程背景和数学要求几乎为零,非常适合Python入门学习。这个系列也是Coursera上最受欢迎的Python学习系列课程,强烈推荐。这个Python系列的目标是“通过Python学习编程并分析数据,开发用于采集,清洗,分析和可视化数据的程序(Learn to Program and Analyze Data with Python-Develop programs to gather, clean, analyze, and visualize data.” ,以下是关于这个系列的简介:

This Specialization builds on the success of the Python for Everybody course and will introduce fundamental programming concepts including data structures, networked application program interfaces, and databases, using the Python programming language. In the Capstone Project, you’ll use the technologies learned throughout the Specialization to design and create your own applications for data retrieval, processing, and visualization.

这个系列包含4门子课程和1门毕业项目课程,包括Python入门基础,Python数据结构, 使用Python获取网络数据(Python爬虫),在Python中使用数据库以及Python数据可视化等。以下是具体子课程的介绍:

1.1 Programming for Everybody (Getting Started with Python)

Python入门级课程,这门课程暂且翻译为“人人都可以学编程-从Python开始”,如果没有任何编程基础,就从这门课程开始吧:

This course aims to teach everyone the basics of programming computers using Python. We cover the basics of how one constructs a program from a series of simple instructions in Python. The course has no pre-requisites and avoids all but the simplest mathematics. Anyone with moderate computer experience should be able to master the materials in this course. This course will cover Chapters 1-5 of the textbook “Python for Everybody”. Once a student completes this course, they will be ready to take more advanced programming courses. This course covers Python 3.

1.2 Python Data Structures(Python数据结构)

Python基础课程,这门课程的目标是介绍Python语言的核心数据结构(This course will introduce the core data structures of the Python programming language.),关于这门课程:

This course will introduce the core data structures of the Python programming language. We will move past the basics of procedural programming and explore how we can use the Python built-in data structures such as lists, dictionaries, and tuples to perform increasingly complex data analysis. This course will cover Chapters 6-10 of the textbook “Python for Everybody”. This course covers Python 3.

1.3 Using Python to Access Web Data(使用Python获取网页数据--Python爬虫)

Python应用课程,只有使用Python才能学以致用,这门课程的目标是展示如何通过爬取和分析网页数据将互联网作为数据的源泉(This course will show how one can treat the Internet as a source of data):

This course will show how one can treat the Internet as a source of data. We will scrape, parse, and read web data as well as access data using web APIs. We will work with HTML, XML, and JSON data formats in Python. This course will cover Chapters 11-13 of the textbook “Python for Everybody”. To succeed in this course, you should be familiar with the material covered in Chapters 1-10 of the textbook and the first two courses in this specialization. These topics include variables and expressions, conditional execution (loops, branching, and try/except), functions, Python data structures (strings, lists, dictionaries, and tuples), and manipulating files. This course covers Python 3.

1.4 Using Databases with Python(Python数据库)

Python应用课程,在Python中使用数据库。这门课程的目标是在Python中学习SQL,使用SQLite3作为抓取数据的存储数据库:

This course will introduce students to the basics of the Structured Query Language (SQL) as well as basic database design for storing data as part of a multi-step data gathering, analysis, and processing effort. The course will use SQLite3 as its database. We will also build web crawlers and multi-step data gathering and visualization processes. We will use the D3.js library to do basic data visualization. This course will cover Chapters 14-15 of the book “Python for Everybody”. To succeed in this course, you should be familiar with the material covered in Chapters 1-13 of the textbook and the first three courses in this specialization. This course covers Python 3.

1.5 Capstone: Retrieving, Processing, and Visualizing Data with Python(毕业项目课程:使用Python获取,处理和可视化数据)

Python应用实践课程,这是这个系列的毕业项目课程,目的是通过开发一系列Python应用项目让学生熟悉Python抓取,处理和可视化数据的流程。

In the capstone, students will build a series of applications to retrieve, process and visualize data using Python. The projects will involve all the elements of the specialization. In the first part of the capstone, students will do some visualizations to become familiar with the technologies in use and then will pursue their own project to visualize some other data that they have or can find. Chapters 15 and 16 from the book “Python for Everybody” will serve as the backbone for the capstone. This course covers Python 3.

2. 多伦多大学的编程入门课程"Learn to Program: The Fundamentals(学习编程:基础) "

Python入门级课程。这门课程以Python语言传授编程入门知识,实为零基础的Python入门课程。感兴趣的同学可以参考课程图谱上的老课程评论 :http://coursegraph.com/coursera_programming1 ,之前一个同学的评价是 “两个老师语速都偏慢,讲解细致,又有可视化工具Python Visualizer用于详细了解程序具体执行步骤,可以说是零基础学习python编程的最佳选择。”

Behind every mouse click and touch-screen tap, there is a computer program that makes things happen. This course introduces the fundamental building blocks of programming and teaches you how to write fun and useful programs using the Python language.

3. 莱斯大学的Python专项课程系列:Introduction to Scripting in Python Specialization

入门级Python学习系列课程,涵盖Python基础, Python数据表示, Python数据分析, Python数据可视化等子课程,比较适合Python入门。这门课程的目标是让学生可以在处理实际问题是使用Python解决问题:Launch Your Career in Python Programming-Master the core concepts of scripting in Python to enable you to solve practical problems.

This specialization is intended for beginners who would like to master essential programming skills. Through four courses, you will cover key programming concepts in Python 3 which will prepare you to use Python to perform common scripting tasks. This knowledge will provide a solid foundation towards a career in data science, software engineering, or other disciplines involving programming.

这个系列包含4门子课程,以下是具体子课程的介绍:

3.1 Python Programming Essentials(Python编程基础)

Python入门基础课程,这门课程将讲授Python编程基础知识,包括表达式,变量,函数等,目标是让用户熟练使用Python:

This course will introduce you to the wonderful world of Python programming! We'll learn about the essential elements of programming and how to construct basic Python programs. We will cover expressions, variables, functions, logic, and conditionals, which are foundational concepts in computer programming. We will also teach you how to use Python modules, which enable you to benefit from the vast array of functionality that is already a part of the Python language. These concepts and skills will help you to begin to think like a computer programmer and to understand how to go about writing Python programs. By the end of the course, you will be able to write short Python programs that are able to accomplish real, practical tasks. This course is the foundation for building expertise in Python programming. As the first course in a specialization, it provides the necessary building blocks for you to succeed at learning to write more complex Python programs. This course uses Python 3. While many Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This first course will use a Python 3 version of the CodeSkulptor development environment, which is specifically designed to help beginning programmers learn quickly. CodeSkulptor runs within any modern web browser and does not require you to install any software, allowing you to start writing and running small programs immediately. In the later courses in this specialization, we will help you to move to more sophisticated desktop development environments.

3.2 Python Data Representations(Python数据表示)

Python入门基础课程,这门课程依然关注Python的基础知识,包括Python字符串,列表等,以及Python文件操作:

This course will continue the introduction to Python programming that started with Python Programming Essentials. We'll learn about different data representations, including strings, lists, and tuples, that form the core of all Python programs. We will also teach you how to access files, which will allow you to store and retrieve data within your programs. These concepts and skills will help you to manipulate data and write more complex Python programs. By the end of the course, you will be able to write Python programs that can manipulate data stored in files. This will extend your Python programming expertise, enabling you to write a wide range of scripts using Python This course uses Python 3. While most Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This course introduces basic desktop Python development environments, allowing you to run Python programs directly on your computer. This choice enables a smooth transition from online development environments.

3.3 Python Data Analysis(Python数据分析)

Python基础课程,这门课程将讲授通过Python读取和分析表格数据和结构化数据等,例如TCSV文件等:

This course will continue the introduction to Python programming that started with Python Programming Essentials and Python Data Representations. We'll learn about reading, storing, and processing tabular data, which are common tasks. We will also teach you about CSV files and Python's support for reading and writing them. CSV files are a generic, plain text file format that allows you to exchange tabular data between different programs. These concepts and skills will help you to further extend your Python programming knowledge and allow you to process more complex data. By the end of the course, you will be comfortable working with tabular data in Python. This will extend your Python programming expertise, enabling you to write a wider range of scripts using Python. This course uses Python 3. While most Python programs continue to use Python 2, Python 3 is the future of the Python programming language. This course uses basic desktop Python development environments, allowing you to run Python programs directly on your computer.

3.4 Python Data Visualization(Python数据可视化)

Python应用课程,这门课程将基于前3门课程学习的Python知识,抓取网络数据,然后清洗,处理和分析数据,并最终可视化呈现数据:

This if the final course in the specialization which builds upon the knowledge learned in Python Programming Essentials, Python Data Representations, and Python Data Analysis. We will learn how to install external packages for use within Python, acquire data from sources on the Web, and then we will clean, process, analyze, and visualize that data. This course will combine the skills learned throughout the specialization to enable you to write interesting, practical, and useful programs. By the end of the course, you will be comfortable installing Python packages, analyzing existing data, and generating visualizations of that data. This course will complete your education as a scripter, enabling you to locate, install, and use Python packages written by others. You will be able to effectively utilize tools and packages that are widely available to amplify your effectiveness and write useful programs.

4. 莱斯大学的计算(机)基础专项课程系列:Fundamentals of Computing Specialization

入门级Python编程学习课程系列,这个系列覆盖了大部分莱斯大学一年级计算机科学新生的学习材料,学生通过Python学习现代编程语言技巧,并将这些技巧应用到20个左右的有趣的编程项目中。

This Specialization covers much of the material that first-year Computer Science students take at Rice University. Students learn sophisticated programming skills in Python from the ground up and apply these skills in building more than 20 fun projects. The Specialization concludes with a Capstone exam that allows the students to demonstrate the range of knowledge that they have acquired in the Specialization.

这个系列包括Python交互式编程设计,计算原理,算法思维等6门课程和1门毕业项目课程,目标是让学生像计算机科学家一样编程和思考(Learn how to program and think like a Computer Scientist),以下是子课程的相关介绍:

4.1 An Introduction to Interactive Programming in Python (Part 1)(Python交互式编程导论上)

Python入门级课程,这门课程将讲授Python编程基础知识,例如普通表达式,条件表达式和函数,并用这些知识构建一个简单的交互式应用。

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard and the mouse. In part 1 of this course, we will introduce the basic elements of programming (such as expressions, conditionals, and functions) and then use these elements to create simple interactive applications such as a digital stopwatch. Part 1 of this class will culminate in building a version of the classic arcade game "Pong".

4.2 An Introduction to Interactive Programming in Python (Part 2)(Python交互式编程导论下)

Python入门级课程,这门课程将继续讲授Python基础知识,例如列表,词典和循环,并将使用这些知识构建一个简单的游戏例如Blackjack:

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard and the mouse. In part 2 of this course, we will introduce more elements of programming (such as list, dictionaries, and loops) and then use these elements to create games such as Blackjack. Part 1 of this class will culminate in building a version of the classic arcade game "Asteroids". Upon completing this course, you will be able to write small, but interesting Python programs. The next course in the specialization will begin to introduce a more principled approach to writing programs and solving computational problems that will allow you to write larger and more complex programs.

4.3 Principles of Computing (Part 1)(计算原理上)

编程基础课程,这门课程聚焦在了编程的基础上,包括编码标准和测试,数学基础包括概率和组合等。

This two-part course builds upon the programming skills that you learned in our Introduction to Interactive Programming in Python course. We will augment those skills with both important programming practices and critical mathematical problem solving skills. These skills underlie larger scale computational problem solving and programming. The main focus of the class will be programming weekly mini-projects in Python that build upon the mathematical and programming principles that are taught in the class. To keep the class fun and engaging, many of the projects will involve working with strategy-based games. In part 1 of this course, the programming aspect of the class will focus on coding standards and testing. The mathematical portion of the class will focus on probability, combinatorics, and counting with an eye towards practical applications of these concepts in Computer Science. Recommended Background - Students should be comfortable writing small (100+ line) programs in Python using constructs such as lists, dictionaries and classes and also have a high-school math background that includes algebra and pre-calculus.

4.4 Principles of Computing (Part 2)(计算原理下)

编程基础课程,这门课程聚焦在搜索、排序、递归等主题上:

This two-part course introduces the basic mathematical and programming principles that underlie much of Computer Science. Understanding these principles is crucial to the process of creating efficient and well-structured solutions for computational problems. To get hands-on experience working with these concepts, we will use the Python programming language. The main focus of the class will be weekly mini-projects that build upon the mathematical and programming principles that are taught in the class. To keep the class fun and engaging, many of the projects will involve working with strategy-based games. In part 2 of this course, the programming portion of the class will focus on concepts such as recursion, assertions, and invariants. The mathematical portion of the class will focus on searching, sorting, and recursive data structures. Upon completing this course, you will have a solid foundation in the principles of computation and programming. This will prepare you for the next course in the specialization, which will begin to introduce a structured approach to developing and analyzing algorithms. Developing such algorithmic thinking skills will be critical to writing large scale software and solving real world computational problems.

4.5 Algorithmic Thinking (Part 1)(算法思维上)

编程基础课程,这门课程聚焦在算法思维的培养上,讲授图算法的相关概念并用Python实现:

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part course builds on the principles that you learned in our Principles of Computing course and is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to real-world computational problems. In part 1 of this course, we will study the notion of algorithmic efficiency and consider its application to several problems from graph theory. As the central part of the course, students will implement several important graph algorithms in Python and then use these algorithms to analyze two large real-world data sets. The main focus of these tasks is to understand interaction between the algorithms and the structure of the data sets being analyzed by these algorithms. Recommended Background - Students should be comfortable writing intermediate size (300+ line) programs in Python and have a basic understanding of searching, sorting, and recursion. Students should also have a solid math background that includes algebra, pre-calculus and a familiarity with the math concepts covered in "Principles of Computing".

4.6 Algorithmic Thinking (Part 2)(算法思维下)

编程基础课程,这门课程聚焦在培养学生的算法思维,并了解一些高级算法主题,例如分治法,动态规划等:

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part class is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to computational problems. In part 2 of this course, we will study advanced algorithmic techniques such as divide-and-conquer and dynamic programming. As the central part of the course, students will implement several algorithms in Python that incorporate these techniques and then use these algorithms to analyze two large real-world data sets. The main focus of these tasks is to understand interaction between the algorithms and the structure of the data sets being analyzed by these algorithms. Once students have completed this class, they will have both the mathematical and programming skills to analyze, design, and program solutions to a wide range of computational problems. While this class will use Python as its vehicle of choice to practice Algorithmic Thinking, the concepts that you will learn in this class transcend any particular programming language.

4.7 The Fundamentals of Computing Capstone Exam(计算基础毕业项目课程)

Python应用课程,基于以上子课程的学习,计算基础毕业项目课程将用Python和所学的知识完成 20+ 项目:

While most specializations on Coursera conclude with a project-based course, students in the "Fundamentals of Computing" specialization have completed more than 20+ projects during the first six courses of the specialization. Given that much of the material in these courses is reused from session to session, our goal in this capstone class is to provide a conclusion to the specialization that allows each student an opportunity to demonstrate their individual mastery of the material in the specialization. With this objective in mind, the focus in this Capstone class will be an exam whose questions are updated periodically. This approach is designed to help insure that each student is solving the exam problems on his/her own without outside help. For students that have done their own work, we do not anticipate that the exam will be particularly hard. However, those students who have relied too heavily on outside help in previous classes may have a difficult time. We believe that this approach will increase the value of the Certificate for this specialization.

5. 密歇根大学的 Applied Data Science with Python(Python数据科学应用专项课程系列)

Python应用系列课程,这个系列的目标主要是通过Python编程语言介绍数据科学的相关领域,包括应用统计学,机器学习,信息可视化,文本分析和社交网络分析等知识,并结合一些流行的Python工具包,例如pandas, matplotlib, scikit-learn, nltk以及networkx等Python工具。

The 5 courses in this University of Michigan specialization introduce learners to data science through the python programming language. This skills-based specialization is intended for learners who have basic a python or programming background, and want to apply statistical, machine learning, information visualization, text analysis, and social network analysis techniques through popular python toolkits such as pandas, matplotlib, scikit-learn, nltk, and networkx to gain insight into their data. Introduction to Data Science in Python (course 1), Applied Plotting, Charting & Data Representation in Python (course 2), and Applied Machine Learning in Python (course 3) should be taken in order and prior to any other course in the specialization. After completing those, courses 4 and 5 can be taken in any order. All 5 are required to earn a certificate.

这个系列课程有5门课程,包括Python数据科学导论课程(Introduction to Data Science in Python),Python数据可视化(Applied Plotting, Charting & Data Representation in Python),Python机器学习(Applied Machine Learning in Python) ,Python文本挖掘(Applied Text Mining in Python) , Python社交网络分析(Applied Social Network Analysis in Python),以下是具体子课程的介绍:

5.1 Introduction to Data Science in Python(Python数据科学导论)

Python基础和应用课程,这门课程从Python基础讲起,然后通过pandas数据科学库介绍DataFrame等数据分析中的核心数据结构概念,让学生学会操作和分析表格数据并学会运行基础的统计分析工具。

This course will introduce the learner to the basics of the python programming environment, including how to download and install python, expected fundamental python programming techniques, and how to find help with python programming questions. The course will also introduce data manipulation and cleaning techniques using the popular python pandas data science library and introduce the abstraction of the DataFrame as the central data structure for data analysis. The course will end with a statistics primer, showing how various statistical measures can be applied to pandas DataFrames. By the end of the course, students will be able to take tabular data, clean it, manipulate it, and run basic inferential statistical analyses. This course should be taken before any of the other Applied Data Science with Python courses: Applied Plotting, Charting & Data Representation in Python, Applied Machine Learning in Python, Applied Text Mining in Python, Applied Social Network Analysis in Python.

5.2 Applied Plotting, Charting & Data Representation in Python(Python数据可视化)

Python应用课程,这门课程聚焦在通过使用matplotlib库进行数据图表的绘制和可视化呈现:

This course will introduce the learner to information visualization basics, with a focus on reporting and charting using the matplotlib library. The course will start with a design and information literacy perspective, touching on what makes a good and bad visualization, and what statistical measures translate into in terms of visualizations. The second week will focus on the technology used to make visualizations in python, matplotlib, and introduce users to best practices when creating basic charts and how to realize design decisions in the framework. The third week will describe the gamut of functionality available in matplotlib, and demonstrate a variety of basic statistical charts helping learners to identify when a particular method is good for a particular problem. The course will end with a discussion of other forms of structuring and visualizing data. This course should be taken after Introduction to Data Science in Python and before the remainder of the Applied Data Science with Python courses: Applied Machine Learning in Python, Applied Text Mining in Python, and Applied Social Network Analysis in Python.

5.3 Applied Machine Learning in Python(Python机器学习)

Python应用课程,这门课程主要聚焦在通过Python应用机器学习,包括机器学习和统计学的区别,机器学习工具包scikit-learn的介绍,有监督学习和无监督学习,数据泛化问题(例如交叉验证和过拟合)等。

This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different than descriptive statistics, and introduce the scikit learn toolkit. The issue of dimensionality of data will be discussed, and the task of clustering data, as well as evaluating those clusters, will be tackled. Supervised approaches for creating predictive models will be described, and learners will be able to apply the scikit learn predictive modelling methods while understanding process issues related to data generalizability (e.g. cross validation, overfitting). The course will end with a look at more advanced techniques, such as building ensembles, and practical limitations of predictive models. By the end of this course, students will be able to identify the difference between a supervised (classification) and unsupervised (clustering) technique, identify which technique they need to apply for a particular dataset and need, engineer features to meet that need, and write python code to carry out an analysis. This course should be taken after Introduction to Data Science in Python and Applied Plotting, Charting & Data Representation in Python and before Applied Text Mining in Python and Applied Social Analysis in Python.

5.4 Applied Text Mining in Python(Python文本挖掘)

Python应用课程,这门课程主要聚焦在文本挖掘和文本分析基础,包括正则表达式,文本清洗,文本预处理等,并结合NLTK讲授自然语言处理的相关知识,例如文本分类,主题模型等。

This course will introduce the learner to text mining and text manipulation basics. The course begins with an understanding of how text is handled by python, the structure of text both to the machine and to humans, and an overview of the nltk framework for manipulating text. The second week focuses on common manipulation needs, including regular expressions (searching for text), cleaning text, and preparing text for use by machine learning processes. The third week will apply basic natural language processing methods to text, and demonstrate how text classification is accomplished. The final week will explore more advanced methods for detecting the topics in documents and grouping them by similarity (topic modelling). This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.

5.5 Applied Social Network Analysis in Python(Python社交网络分析)

Python应用课程,这门课程通过Python工具包 NetworkX 介绍社交网络分析的相关知识。

This course will introduce the learner to network analysis through the NetworkX library. The course begins with an understanding of what network analysis is and motivations for why we might model phenomena as networks. The second week introduces the concept of connectivity and network robustness.. The third week will explore ways of measuring the importance or centrality of a node in a network. The final week will explore the evolution of networks over time and cover models of network generation and the link prediction problem. This course should be taken after: Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python, and Applied Machine Learning in Python.

您可以继续在课程图谱上挖掘Coursera上新的Python课程,也欢迎推荐到这里。

注:本文首发“课程图谱博客”:http://blog.coursegraph.com ,同步发布到这里,原文链接地址:http://blog.coursegraph.com/coursera%E4%B8%8Apython%E8%AF%BE%E7%A8%8B%EF%BC%88%E5%85%AC%E5%BC%80%E8%AF%BE%EF%BC%89%E6%B1%87%E6%80%BB%E6%8E%A8%E8%8D%90%EF%BC%9A%E4%BB%8Epython%E5%85%A5%E9%97%A8%E5%88%B0%E5%BA%94%E7%94%A8python

达观数据对于大规模消息数据处理的系统架构

达观数据是为企业提供大数据处理、个性化推荐系统服务的知名公司,在应对海量数据处理时,积累了大量实战经验。其中达观数据在面对大量的数据交互和消息处理时,使用了称为DPIO的设计思路进行快速、稳定、可靠的消息数据传递机制,本文分享了达观数据在应对大规模消息数据处理时所开发的通讯中间件DPIO的设计思路和处理经验(达观数据架构师 桂洪冠)

一、数据通讯进程模型

q1在设计达观数据的消息数据处理机制时,首先充分借鉴了ZeroMQ和ProxyIO的设计思想。ZeroMQ提供了一种底层的网络通讯框架,提供了基本的RoundRobin负载均衡算法,性能优越,而ProxyIO是雅虎的网络通讯中间件,承载了雅虎内部大量计算节点间的实时消息处理。但是ZeroMQ没有实现基于节点健康状态的最快响应算法,并且ZeroMQ和ProxyIO对节点的状态管理,连接管理,负载均衡调度等也需要各应用自己来实现。

达观科技在借鉴两种设计思路的基础上,从进程模型、服务架构、线程模型、通讯协议、负载均衡、雪崩处理、连接管理、消息流程、状态监控等各方面进行了开拓,开发了DPIO(达观ProxyIO的简写,下文统称DPIO),确保系统高性能处理相关数据。

在DPIO的整个通讯框架体系中,采用集中管理、统一监控策略管理节点提供服务,节点间直接进行交互,并不依赖统一的管理节点(桂洪冠)。几种节点间通过http或者tcp协议进行消息传递、配置更新、状态跟踪等通讯行为。集群将不同应用的服务抽象成组的概念,相同应用的服务启动时加入的相同的组。每个通讯组有两种端点client和server。应用启动时通过配置决定自己是client端点还是server端点,在一个组内,每个应用只能有一个身份;不同组没要求。

  • 监控节点,顾名思义即提供系统监控服务的,用来给系统管理员查看集群中节点的服务状态及负载情况,系统对监控节点并无实时性及稳定性要求,在本模型中是单点系统。
  • 在上图的架构中把管理节点设计成双master结构,参考zookeeper集群管理思路,多个master通过一定算法分别服务于集群中一部分节点,相对于另外的服务节点则为备份管理节点,他们通过内部通讯同步数据,每个管理节点都有一个web服务为监控节点提供服务节点的状态数据。
  • 服务节点即是下文要谈的代理服务,根据服务对象不同分为应用端代理和服务端代理。集群中的服务节点根据提供服务的不同分为多个组,每个代理启动都需要注册到相应的组中,然后提供服务。

二、DPIO消息传递逻辑架构

DPIO服务节点内/间的通讯及消息传递模型见下图:q2

  • clientHost和serverHost间使用socketapi进行tcp通讯,相同主机内部的多个进程间使用共享内存传递消息内容,client和clientproxy、server和serverproxy之间通过domain socket进行事件通知;在socket连接的一方收到对端的事件通知后,从共享内存中获取消息内容。
  • clientproxy/serverproxy启动时绑定到host的一个端口响应应用api的连接,在连接到来时将该api对应的共享内存初始化,将偏移地址告诉给应用。clientproxy和serverproxy中分别维护了一个到应用api的连接句柄队列,并通过io复用技术监听这些连接上的读写事件。
  • serverproxy在启动时通过socket绑定到服务器的一个端口,并以server身份注册到一个group监听该端口的连接事件,当事件到达时回调注册的事件处理函数响应事件。
  • 在serverproxy内部通过不同的thread分别管理从本地应用建立的连接和从clientproxy建立的连接。thread的个数在启动proxy时由用户指定,默认是分别1个。每个clientproxy启动时会以client身份注册到一个group,并建立到同组的所有serverproxy的连接,clientproxy内部包含了连接的自管理能力及failover的处理(将在下面连接管理部分描述)。 DPIO实现了负载均衡,路由选择和透明代理的功能。

三、线程模型

DPIO的线程模型:

q3

App epoll thread检测从api来的请求信息,并将请求信息转发到待处理队列中。从已处理队列中获取应答包,并将处理结果转发给api

Io epoll thread检测从远端的proxy来的可写事件,并将请求包转发到远端的proxy。检测从远端的proxy的可读事件,并将应答包放在已处理队列中

Monitor thread检测DPIO的工作状态请求,将DPIO的工作状态返回。并将决定Io epoll thread和app epoll thread的负载均衡(桂洪冠)。

四、通信协议

q4

  1. Api与DPIO通信协议
  • 共享内存存储消息格式
字段 含义 长度
protocol len 协议包的总长度 4bytes
protocol head len 协议头的长度 1byte
Version_protocol_id 协议的版本号和协议号 1byte
Flag 消息标志,标志路由模式,是否记录来源地址,有二级路由,所以这个字段一定要Eg,末位表示要记录src,倒数第二位表示按roundrobin路由,倒数第3位表示按消息头路由,xxx 1byte
Proxy 来源/目的 proxy 2bytes
Api 来源/目的 api 2bytes
ApiTtl 协议包的发送时间 2Bytes
ClientTtl 消息存活的时间,后面添加,增加路由策略,选择app_server 2Bytes
ClientProcessTime 客户端处理所用时间 2Bytes
ServerTtl 消息存活的时间,后面添加,增加路由策略,选择app_client 2Bytes
timeout 协议包的超时时间 2 byte
Sid 消息序列号 4bytes
protocol body len Body长度 4bytes
protocol body 消息体 Size
  • 请求协议包
字段 含义 长度
protocol head len 协议头的长度 1byte
Version_protocol_id 协议的版本号和协议号 1byte
Flag 消息标志,标志路由模式,是否记录来源地址,有二级路由,所以这个字段一定要Eg,末位表示要记录src,倒数第二位表示按roundrobin路由,倒数第3位表示按消息头路由,xxx 1byte
ApiTtl 协议包的发送时间 2bytes
Timeout 协议包的超时时间 2bytes
Api 来源/目的 api 2bytes
Sid 消息序列号 4byte
Begin_offset 协议包的起始偏移 4bytes
len 协议包长度 4bytes
  • 响应协议包
字段 含义 长度
protocol head len 协议头的长度 1byte
Version_protocol_id 协议的版本号和协议号 1byte
Flag 消息标志,标志路由模式,是否记录来源地址,有二级路由,所以这个字段一定要Eg,末位表示要记录src,倒数第二位表示按roundrobin路由,倒数第3位表示按消息头路由,xxx 1byte
Result 处理结果 1byte
sid 消息序列号 4bytes
begin_offset 协议包的起始偏移 4bytes
len 协议包长度 4bytes

 

  1. Proxy与监控中心的监控信息
  • 请求协议包
字段 含义 长度
protocol len 协议包的总长度 4bytes
protocol head len 协议头的长度 4bytes
Version 协议的版本号 4bytes
protocol id 协议的协议号 4bytess
status_version 当前状态版本 4bytes
Proxy_identify_len 该proxy标识长度 4bytess
Proxy_identify 该proxy 标识 4bytes
protocol body 消息体 Size
  • 应答包
字段 含义 长度
protocol len 协议包的总长度 4bytes
protocol head len 协议头的长度 4bytes
Version 协议的版本号 4bytes
protocol id 协议的协议号 4bytess
protocol body len Body长度 4bytes
protocol body 消息体 Size

五、负载均衡

q5

DPIO的负载均衡基于最快响应法

DPIO将所有的统计信息更新到监控中心,监控中心通过处理所有的节点的状态信息,统一负责负载均衡。

DPIO从监控中心获取所有连接的负载均衡策略。每个连接知道只需知道自己的处理能力。

以上图为例,有三个proxy server处理程序。处理能力分别为50、30、20,一次epoll过程能够同时探测多个连接的可写事件。

假设:三个proxy server的属于同一epoll thread,且三个proxy server假设都处理能力无限大。

限制:如果刚开始时待处理队列的数据包个数为100个,多次发送轮回后proxy server A≥proxy server B≥proxy server C, 每个发送的最多发送协议包数为待处理队列协议包个数 * 该连接所占权重

六、雪崩处理

大型在线服务,特别是对于时延敏感的服务,当系统外部请求超过系统服务能力,而没有适当的过载保护措施时,当系统累计的超时请求达到一定规模,将可能导致系统缓冲区队列溢出,后端服务资源耗尽,最终像雪崩一样形成恶性循环。这时系统处理的每个请求都因为超时而无效,系统对外呈现的服务能力为0,且这种情况下不能自动恢复。

我们的解决策略是对协议包进行生命周期管理,现在协议包进出待处理队列和已处理队列时进行超时检测和超时处理(超时则丢弃)。

proxy client:

当app epoll thread将协议包放入待处理队列时,会将该协议包的发送时间、该协议包的超时时间,当前时间戳来判断该协议包是否已经超时。

当app epoll thread将协议包从已处理队列中移除时,会将该协议包的发送时间、该协议包的超时时间,已经当前时间戳来判断该协议包是否已经超时。

当Io epoll thread将协议包从待处理队列中移除时,会将该协议包的发送时间、该协议包的超时时间,当前时间戳,该连接的协议包的平均处理时间移除。

当io epoll thread将协议包放入已处理队列时,会将将该协议包的发送时间、该协议包的超时时间,已经当前时间戳来判断该协议包是否已经超时。

proxy server:

当App epoll thread将协议包从待处理队列中移除时,会将该协议包在客户端的处理时间、该协议包的超时时间、该协议包的proxy server接收时间戳、当前时间戳来判断该协议包是否已超时。

当app epoll thread将协议包放入已处理队列时,会将该协议包的发送时间、该协议包的超时时间,已经当前时间戳来判断该协议包是否已经超时。

当io epoll thread将协议包从已处理队列中移除时,会将该协议包的发送时间、该协议包的超时时间,已经当前时间戳来判断该协议包是否已经超时。

当io epoll thread将协议包放入待处理队列时,会将该协议包的发送时间、该协议包的超时时间来判断该协议包是否已超时。

七、连接管理

红黑树:

q6

红黑树:保存所有连接的最近的读/写时间戳。

当epoll_wait时,首先从红黑树中获取oldest的时间戳,并将当前时间戳与oldest时间戳的时间差作为epoll_wait的超时时间,当连接中有可读/写事件发送时,首先从红黑树中删除该节点,当可读/写事件处理完毕后,再将节点插入到红黑树中,当处理完所有连接的可读/写事件时,再从红黑树中依次从移除时间戳小于当前时间戳的连接,并触发该连接的timeout事件。

八、消息处理流程

q7

  1. apiclient通过调用api的接口,将消息传给
  2. api接受消息体,从共享内存中申请内存,填写消息头size(协议总长度)、Offset (协议版本号和协议号)、Headsize (协议头的总长度)、flag(路由策略),ApiTtl (协议包的发送时间)、timeout (协议包的超时时间)、sid(序列号),size(消息体长度)字段,封装成协议包,将协议包写入共享内存。
  3. api通过socket发送请求给proxy。
  4. app epoll thread通过检测api的可读事件,接受请求。通过解析请求内容,获取请求协议包所在的共享内存的偏移、请求协议包的长度和api连接index加入到处理队列。
  5. proxy client的io epoll thread通过检测对端DPIO连接的可写事件,从发送队列中获取请求包,将api的index加入到协议包的api index字段。
  6. proxy client的io epoll thread从共享内存中读取协议包,释放由请求包中所标识的内存空间。
  7. proxy server的io epoll thread通过检测对端DPIO的可读事件,接受请求。
  8. proxy server的io epoll thread从共享内存中申请空间,将proxy的index加入到协议包的proxy index字段。将请求内存写入到申请的空间中。
  9. proxy server的io epoll thread 将协议包在共享内存的偏移和协议包的长度加入的待处理队列中。
  10. app epoll thread从待处理队列中获取请求包,将协议包转发给相应的api进行处理。
  11. api通过检测DPIO的可读事件,解析请求内容。
  12. api通过解析请求内容,获取请求协议包在共享内存中的偏移和请求协议包的长度。从共享内存中读取请求内容,并释放相应空间。
  13. api将请求协议包返回给应用层进行处理。
  14. 应用层将应答包传给api。
  15. Api从共享内存中申请空间,将应答包写入到共享内存中。
  16. Api将应答包在共享内存中的偏移和应答包的大小写入到共享内存中。
  17. App epoll thread通过检测可读事件,将应答包写入到已处理队列中。
  18. proxy server的Io epoll thread通过检测对端的DPIO的可写事件,将已处理队列中获取应答包。
  19. proxy server的Io epoll thread从共享内存中读取应答包。
  20. Proxy client的Io epoll thread检测可读事件,读取应答包。
  21. Proxy client的Io epoll thread通过解析应答包,从共享内存中申请空间,将应答包写入到申请的内存中。
  22. Proxy client的Io epoll thread将应答包移入到已处理队列。
  23. App epoll thread通过检测api的可写事件,将已处理队列中获取应答包。
  24. App epoll thread发送应答包。
  25. Api通过检测可读事件,获取应答包,通过解析应到包,获取应答包在共享内存中的偏移和应到的大小,从共享内存中读取应到包。
  26. Api将应答包返回给应用端。(桂洪冠 陈运文)。

九、状态监控

q8

连接池中存在:当前可用连接个数

连接池中再分别获取每个连接的状态

每个可用连接分别维护以下信息:

连接处理的数据包个数、连接send失败次数、连接协议包的平均处理时间。

连接的连接状态(当重连失败达到一定次数时,定义为连接失败)。

连接的重连次数、连接的超时次数。

当监控线程accept到client的连接时,解析请求内容,然后调用连接池对象的statistics方法,连接池对象首先写入自己的统计信息,然后分别调用每个连接的statistics方法,每个连接分别填写自己的统计信息

本文小结

大规模消息传递会遇到很多可靠性、稳定性的问题,DPIO是达观在处理大数据通讯时的一些经验,和感兴趣的朋友们分享,期待与大家不断交流与合作
http://credit-n.ru/zaymyi-next.html

Python 网页爬虫 & 文本处理 & 科学计算 & 机器学习 & 数据挖掘兵器谱

曾经因为NLTK的缘故开始学习Python,之后渐渐成为我工作中的第一辅助脚本语言,虽然开发语言是C/C++,但平时的很多文本数据处理任务都交给了Python。离开腾讯创业后,第一个作品课程图谱也是选择了Python系的Flask框架,渐渐的将自己的绝大部分工作交给了Python。这些年来,接触和使用了很多Python工具包,特别是在文本处理,科学计算,机器学习和数据挖掘领域,有很多很多优秀的Python工具包可供使用,所以作为Pythoner,也是相当幸福的。其实如果仔细留意微博,你会发现很多这方面的分享,自己也Google了一下,发现也有同学总结了“Python机器学习库”,不过总感觉缺少点什么。最近流行一个词,全栈工程师(full stack engineer),作为一个苦逼的创业者,天然的要把自己打造成一个full stack engineer,而这个过程中,这些Python工具包给自己提供了足够的火力,所以想起了这个系列。当然,这也仅仅是抛砖引玉,希望大家能提供更多的线索,来汇总整理一套Python网页爬虫,文本处理,科学计算,机器学习和数据挖掘的兵器谱。

一、Python网页爬虫工具集

一个真实的项目,一定是从获取数据开始的。无论文本处理,机器学习和数据挖掘,都需要数据,除了通过一些渠道购买或者下载的专业数据外,常常需要大家自己动手爬数据,这个时候,爬虫就显得格外重要了,幸好,Python提供了一批很不错的网页爬虫工具框架,既能爬取数据,也能获取和清洗数据,我们也就从这里开始了:

1. Scrapy

Scrapy, a fast high-level screen scraping and web crawling framework for Python.

鼎鼎大名的Scrapy,相信不少同学都有耳闻,课程图谱中的很多课程都是依靠Scrapy抓去的,这方面的介绍文章有很多,推荐大牛pluskid早年的一篇文章:《Scrapy 轻松定制网络爬虫》,历久弥新。

官方主页:http://scrapy.org/
Github代码页: https://github.com/scrapy/scrapy

2. Beautiful Soup

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

读书的时候通过《集体智慧编程》这本书知道Beautiful Soup的,后来也偶尔会用用,非常棒的一套工具。客观的说,Beautifu Soup不完全是一套爬虫工具,需要配合urllib使用,而是一套HTML/XML数据分析,清洗和获取工具。

官方主页:http://www.crummy.com/software/BeautifulSoup/

3. Python-Goose

Html Content / Article Extractor, web scrapping lib in Python

Goose最早是用Java写得,后来用Scala重写,是一个Scala项目。Python-Goose用Python重写,依赖了Beautiful Soup。前段时间用过,感觉很不错,给定一个文章的URL, 获取文章的标题和内容很方便。

Github主页:https://github.com/grangier/python-goose

二、Python文本处理工具集

从网页上获取文本数据之后,依据任务的不同,就需要进行基本的文本处理了,譬如对于英文来说,需要基本的tokenize,对于中文,则需要常见的中文分词,进一步的话,无论英文中文,还可以词性标注,句法分析,关键词提取,文本分类,情感分析等等。这个方面,特别是面向英文领域,有很多优秀的工具包,我们一一道来。
继续阅读