A good piece of hard-won advice (经验之谈). Somehow it reminds me of this:

You made a hidden premise: a given type of application in a given domain.

A recommendation: expand your blog a bit into a series, heading toward a book.

My friend 吴军 (Wu Jun) did that quite successfully. Of course, he has a statistics background, so he approached NLP from a mathematical perspective: the 数学之美 (Beauty of Mathematics) series.

You have very good thoughts and raw material. You just need to put in a bit more time to make your writing more approachable. I am referring to comments like "Can't learn from this." and "This is stressful to read."

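I know you said: "Sometimes I think it shouldn't be made too readable either; this is all years of experience, and if the young want to learn it, they should suffer a little too. :=)"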

But since you have already put in the effort, why not make it more approachable?

The issue is, even if I am willing to suffer a little, I still don't know where to start if I have never built a real-life NLP system.

For example, lexicalism (词汇主义) by itself is enough for an article. You need to mention its opponents and its history to put it into context. Then you need to give some examples.


吴军's series is super popular. When I first read one of his articles on the Google Blackboard, recommended by a friend, I was amazed at how well he structured and carried the content. It is intriguing. (Side note: of course, his article on PageRank is biased. It gives young people the impression that success in the IT business is governed by technology, whereas in fact technology always comes second. For a so-called high-tech company, having no technology is absolutely unworkable, but the key to a company's success is not technology; that is an obvious fact by now.)

For me, to be honest, I do not aim that high. I have never bothered polishing things in pursuit of perfection, although I did make an effort to link my posts into a series for the convenience of cross-referencing within it. There are missing links that I know I want to write about, but whether I do depends on my mood and available time. I guess I am just not pressed or motivated to do the writing part. Popularizing the technology is only an occasional side effect of the blogging hobby. The way I prove myself is to show that I can build products worth millions, or even hundreds of millions, of dollars.



So far I have been fairly straightforward in what I write about. If there is a readability issue, it is mainly due to my lack of time. Young people should be able to benefit from my writing, especially once they start getting their hands dirty building a system.

Your discussion is fun. You can see and appreciate the things hidden behind my work better than other readers. After all, you have published in THE CL, and you have all but terminated segmentation as a scientific area. Seriously, it is my view that there is not much left to do there after your work on tokenization, both in theory and in practice.

I now feel some urgency to do Chinese NLP as soon as possible. Not many people have been through as much as I have (luckily), so I am in a position to potentially build a much more powerful system, one that makes an impact on Chinese NLP and hopefully on the IT landscape as well. But time passes fast. That is why my focus is now on Chinese processing, day and night. I am also keeping my hands dirty with a couple of European languages, but they are less challenging and less exciting.

Author: liwei999

  1. Really something to look forward to...

    Also, is this friend of Teacher Liwei's (立委老师) Jin Guo? And is that "segmentation terminator" paper Critical Tokenization and its Properties?


    liwei999 replied:



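  2. Actually, a lot of Chinese NLP work is being done now. Many institutions in China are working on it, and many abroad as well.
    To name just one example, here in the US, anything multilingual is bound to include Chinese. Whether in resources, methods, or systems, there has been quite good progress. Many projects are under way.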


    liwei999 replied: