Learning Python while processing raw text: The NLTK book

donretag · on March 12, 2013

Here is another freely available book, Text Processing in Python: http://gnosis.cx/TPiP/

It is distributed as plain text files and is not tied to a particular library.

ColinWright · on March 12, 2013

I know this is Chapter 3, and hence jumping into the middle of the book, but lots of people here will have enough knowledge and experience to start from here and check up on unfamiliar terms. You may find you need to backtrack a little, but it seems to me this is a good place to start.

RRRA · on March 12, 2013

That is going to be very useful to me, thanks. I'm doing 2 classes right now, one where I have to present Python and another where I have to explain NLTK! :P

waterside81 · on March 12, 2013

For anyone who's interested in text classification (along the lines of what Chapter 6 in this book covers), check out our service https://www.repustate.com/predictive-analytics-machine-learn...

It's machine learning as a service: simple API calls to train, cross-validate & classify your data. We'll also be at PyCon this week in Santa Clara so come on by.

Currently in private beta, but we're ramping things up quickly.

sdoering · on March 13, 2013

Just a little bit OT, but regarding your website:

The auto-slider slid away while I was reading your explanation of the third step. I had to click manually to get the text back.

Maybe you should A/B test whether a manual slider works better here, since it is independent of your users' reading speed. Just a suggestion. ;-)

mark_l_watson · on March 12, 2013

I wish you good luck with your business. I offer a free NLP API as a web service (kbsportal.com, written from scratch in Clojure) that unfortunately gets little use.

There are very good open source NLP libraries like Stanford NLP, OpenNLP, and NLTK. I think the opportunity for business might be in building custom language models for customers based on whatever domains they deal with (e.g., medicine, housing, real estate, etc.).

bromang · on March 12, 2013

Do you really see any sort of market for providing operations that can be implemented very easily using Python itself?

waterside81 · on March 12, 2013

The Python examples given in the book are very very rudimentary. The service behind our API is "real" machine learning (e.g. SVMs, RBMs, deep neural networks & the like). This is all transparent to the user - you just submit your data via our API.

npc · on March 12, 2013

I started using repustate's sentiment analysis for a project recently and it was impressively easy to get started with, although the sentiment scores can seem a little arbitrary at times.

why-el · on March 13, 2013

I am a recent grad who is interested in NLP and Arabic. I tried to contact you but couldn't find a way. My email is in my profile if you want to get in touch.

tchalla · on March 12, 2013

In terms of Machine Learning for text data, Chapter 6 is highly recommended.

http://nltk.org/book/ch06.html

zissou · on March 12, 2013

I've learned a lot from the NLTK library, but unfortunately NLTK is terribly slow. Nevertheless, it is a fantastic place for beginners to start with text processing and to learn from, as the documentation is superb. Eventually, however, anyone who plans to process "big" textual datasets may want to dig into the NLTK source and rewrite the necessary functions using multiprocessing.
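
To illustrate the multiprocessing idea, here is a minimal sketch using only the standard library. It does not touch NLTK's source: the regex tokenizer and the helper names (count_tokens, merge_counts, count_tokens_parallel) are assumptions made up for this example, standing in for whatever per-document function turns out to be the bottleneck.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Assumption: a crude regex tokenizer stands in for a slower NLTK call.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def count_tokens(text):
    """Count lowercased word tokens in one document (runs in a worker)."""
    return Counter(TOKEN_RE.findall(text.lower()))


def merge_counts(partial_counts):
    """Combine the per-document counters into a single counter."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def count_tokens_parallel(documents, processes=2):
    """Spread the per-document work over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        partial_counts = pool.map(count_tokens, documents)
    return merge_counts(partial_counts)


if __name__ == "__main__":
    docs = ["The cat sat.", "The dog sat too."]
    print(count_tokens_parallel(docs).most_common(2))
```

This only pays off when each document takes long enough to process that the cost of sending text to the workers is small by comparison; for tiny inputs the serial version is likely faster.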

danso · on March 12, 2013

Processing text is a great way to learn any programming language, but I would think there's more interesting and varied practice to be found in web scraping, not to mention that it's a whole lot easier.

ColinWright · on March 12, 2013

Forgive me if I'm mistaken, but that comment feels like you've read the title of this submission, but not actually read the chapter. This isn't just about chopping and slicing strings, this is an entry point into a comprehensive book about Natural Language Processing, and its associated techniques as exemplified in Python.

danso · on March 12, 2013

The chapter contains HTML processing, but that's a small subset of what it covers. You don't need to learn word stems to do really interesting things with structured HTML. Also, web scraping involves more than text processing: it includes the programmatic navigation of websites, which does add some complexity but is pretty manageable with the libraries out there.

Edit: I'm obviously not saying NLP isn't useful, just that web scraping is more immediately useful. With NLP, besides learning the concepts, you have to find a source of raw text that's been unprocessed and yet contains something of real world value. With web text, you just have to collect what someone already thought was valuable to publish and find insights through the aggregation. It seems to me that the latter scenario is easier to grasp, with NLP being useful for going beyond what others have gathered and published.
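
As a rough sketch of the "collect what was already published" step, the following uses only the standard library's html.parser to pull link targets and link text out of a page. The class and function names (LinkTextExtractor, extract_links) are made up for this example, and real scraping would also need fetching, error handling, and respect for a site's terms.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return the list of (href, text) pairs found in an HTML string."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Aggregating the extracted pairs (for example, counting which headlines recur across pages) is then ordinary data processing, with no linguistics required.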

ColinWright · on March 12, 2013

Thing is, this is a book about NLP, not a book about web scraping, so while what you say may be true (although personally I find more value learning about NLP than WS) it seems a little misplaced.

But there is value in both, depending on your objectives. I find web scraping trivial, and mining the text hard, hence my interest in NLP and machine learning.

seanlinehan · on March 12, 2013

I actually read this book a few weeks ago. I'm pretty new to Python as a whole (4 months with the language), so I picked up a few small tricks that have saved me quite a bit of time. It does not assume that you know either Python or linguistics very well, so I was certainly pleased to have my hand held through some of that. I recommend it!

nailer · on March 12, 2013

This is an excellent resource, which I own, but it seems to be aimed at language scientists who are unfamiliar with Python rather than at developers who need to process text, who I suspect make up a large portion of the audience.

agibsonccc · on March 12, 2013

What simpler use cases do you see? In my case, I'm the language guy this is targeted at and have no clue what most web devs would want this stuff for. Spam and text classification of some kind maybe? MAYBE certain kinds of named entity recognition?

sdoering · on March 13, 2013

Well, as a news distributor you could try to build a tagging machine: something that takes texts from, for example, a sports news agency and enriches them with meaningful tags/keywords, data from your statistics section, and so on, so that you can later match them with other related content, images, or anything like that. You could transmit these tags with the original texts to make life easier for your customers, who could then sort and manage the texts automatically inside their content management systems.

[edit] This is coming from a text guy who recently started down the path of Python and is hooked ;-)
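
A very naive sketch of such a tagging machine, using only the standard library: it suggests the most frequent non-stopword terms as tags. The stopword list and the function name suggest_tags are assumptions for illustration; a real tagger would more likely use stemming, named entity recognition, or a trained classifier.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, just enough for the demo.
STOPWORDS = frozenset({
    "the", "a", "an", "and", "of", "in", "to", "for", "on", "at",
    "with", "is", "was", "by", "his", "her", "their", "its",
})


def suggest_tags(text, max_tags=3):
    """Return up to max_tags frequent words, skipping stopwords and short words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words if word not in STOPWORDS and len(word) > 2
    )
    return [word for word, _ in counts.most_common(max_tags)]
```

For example, suggest_tags("The striker scored. The striker scored again for the club.", max_tags=2) returns ['striker', 'scored'], which could be stored alongside the article and used to match related content.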

forgotAgain · on March 12, 2013

Cover and TOC

http://nltk.org/book/

tootie · on March 12, 2013

I read it and their code examples are not great. Too many abbreviated and meaningless variable names.

dunham · on March 12, 2013

The "pattern" library is also worth checking out:

  http://www.clips.ua.ac.be/pages/pattern