Learning Python while processing raw text: The NLTK book (nltk.org)
125 points by ColinWright on March 12, 2013 | 23 comments



Here is another freely available book, Text Processing in Python: http://gnosis.cx/TPiP/

Plain text files and not tied to a library.


I know this is Chapter 3, and hence jumping into the middle of the book, but lots of people here will have enough knowledge and experience to start from here and check up on unfamiliar terms. You may find you need to backtrack a little, but it seems to me this is a good place to start.


That is going to be very useful to me, thanks. I'm doing 2 classes right now, one where I have to present Python and the other where I have to explain NLTK! :P


For anyone who's interested in text classification (along the lines of what Chapter 6 in this book covers), check out our service https://www.repustate.com/predictive-analytics-machine-learn...

It's machine learning as a service: simple API calls to train, cross-validate & classify your data. We'll also be at PyCon this week in Santa Clara so come on by.

Currently in private beta, but we're ramping things up quickly.


Just a little bit OT, but regarding your website:

The auto-slider slid away while I was reading your explanation of the third step, and I had to click manually to get the text back.

Maybe you should A/B test whether a manual slider works better here, since it doesn't depend on your users' reading speed. Just a suggestion. ;-)


I wish you good luck with your business. I offer a free NLP API as a web service (kbsportal.com, written from scratch in Clojure) that unfortunately gets little use.

There are very good open source NLP libraries like Stanford NLP, OpenNLP, and NLTK. I think the opportunity for business might be in building custom language models for customers based on whatever domains they deal with (e.g., medicine, housing, real estate, etc.).


Do you really see any sort of market for providing operations that can be implemented very easily using Python itself?


The Python examples given in the book are very very rudimentary. The service behind our API is "real" machine learning (e.g. SVMs, RBMs, deep neural networks & the like). This is all transparent to the user - you just submit your data via our API.


I started using repustate's sentiment analysis for a project recently and it was impressively easy to get started with, although the sentiment scores can seem a little arbitrary at times.


I am a recent grad who is interested in NLP and Arabic. I tried to contact you but couldn't find a way. My email is in my profile if you want to get in touch.


In terms of Machine Learning for text data, Chapter 6 is highly recommended.

http://nltk.org/book/ch06.html


I've learned a lot from the NLTK library, but unfortunately NLTK is terribly slow. Nevertheless, it is a fantastic place for beginners to start with text processing, as the documentation is superb. Eventually, though, anyone who plans to process "big" textual datasets may want to dig into the NLTK source and rewrite the necessary functions using multiprocessing.
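
A minimal sketch of the multiprocessing idea mentioned above, using only the standard library. It does not touch NLTK's source: the regex tokenizer and the names `count_tokens` and `count_corpus` are assumptions made for illustration, standing in for whatever per-document function turns out to be the bottleneck.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Hypothetical stand-in for a slow per-document step; not an NLTK function.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def count_tokens(text):
    """Count lowercase word tokens in a single document (runs in a worker)."""
    return Counter(TOKEN_RE.findall(text.lower()))


def count_corpus(documents, processes=None):
    """Count tokens across all documents, spreading documents over processes."""
    with Pool(processes=processes) as pool:
        # chunksize batches documents so workers are not fed one at a time.
        partial_counts = pool.map(count_tokens, documents, chunksize=64)
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # merge per-document counts in the parent
    return total


if __name__ == "__main__":
    docs = ["The cat sat on the mat.", "The dog chased the cat."]
    print(count_corpus(docs).most_common(3))
```

The work is split per document because documents are independent, so no state has to be shared between workers. For small inputs the cost of starting processes will likely outweigh any gain; this only pays off on larger corpora.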


Processing text is a great way to learn any programming language, but I would think there's more interesting and varied practice to be found in web scraping, not to mention it's a whole lot easier.


Forgive me if I'm mistaken, but that comment feels like you've read the title of this submission, but not actually read the chapter. This isn't just about chopping and slicing strings, this is an entry point into a comprehensive book about Natural Language Processing, and its associated techniques as exemplified in Python.


The chapter contains HTML processing, but that's a small subset of what it covers. You don't need to learn word stems to do really interesting things with structured HTML. Also, web scraping involves more than text processing: it also requires programmatic navigation of websites, which adds some complexity but is pretty manageable with the libraries out there.

Edit: I'm obviously not saying NLP isn't useful, just that web scraping is more immediately useful. With NLP, besides learning the concepts, you have to find a source of raw text that's been unprocessed and yet contains something of real world value. With web text, you just have to collect what someone already thought was valuable to publish and find insights through the aggregation. It seems to me that the latter scenario is easier to grasp, with NLP being useful for going beyond what others have gathered and published.
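
For readers unsure what the HTML-processing step discussed here involves, this is a rough sketch using only Python's standard `html.parser` module. It is not the book's code; `TextExtractor` and `html_to_text` are names invented for this illustration, and real pages usually need more care (encodings, malformed markup, navigation).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse all runs of whitespace.
        # Note: this also puts a space wherever a tag splits a word.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as one normalized line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting plain text is the kind of raw material that the tokenizing and stemming steps would then work on.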


Thing is, this is a book about NLP, not a book about web scraping, so while what you say may be true (although personally I find more value learning about NLP than WS) it seems a little misplaced.

But there is value in both, depending on your objectives. I find web scraping trivial, and mining the text hard, hence my interest in NLP and machine learning.


I actually read this book a few weeks ago. I'm pretty new to Python as a whole (4 months with the language), so I picked up a few small tricks that have saved me quite a bit of time. It does not assume that you know either Python or linguistics very well, so I was certainly pleased to have my hand held through some of that. I recommend it!


This is an excellent resource, which I own, but it seems to be focused on language scientists who are unfamiliar with Python rather than on developers who need to process text, and I suspect the latter make up a large portion of the audience.


What simpler use cases do you see? In my case, I'm the language guy this is targeted at and have no clue what most web devs would want this stuff for. Spam and text classification of some kind maybe? MAYBE certain kinds of named entity recognition?


Well, you as a news distributor could try to build a tagging machine: something that takes texts from, say, a sports news agency and enriches them with meaningful tags/keywords, data from your statistics section, and so on, so they can later be matched with other related content, images, or anything like that. You could transmit those tags along with the original texts to make life easier for your customers, who could then sort and manage the texts automatically inside their content management systems.

[edit] This coming from a text guy, who recently started down the path of Python and is hooked ;-)
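
A very rough sketch of the tagging idea described above, assuming that tags are simply the most frequent non-trivial words in a text. The stopword list and the `suggest_tags` name are made up for this example; a real tagger would more likely lean on stemming, named entity recognition, or a trained classifier.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = frozenset(
    "a an and are as at by for from in is it of on that the to was with".split()
)


def suggest_tags(text, max_tags=5):
    """Return up to max_tags frequent words, ignoring stopwords and short words."""
    words = re.findall(r"[a-z]+", text.lower())
    candidates = [w for w in words if w not in STOPWORDS and len(w) > 2]
    return [word for word, _ in Counter(candidates).most_common(max_tags)]
```

Tags produced this way could be attached to each article as metadata before it is sent on to a content management system.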



I read it and their code examples are not great. Too many abbreviated and meaningless variable names.


The "pattern" library is also worth checking out:

  http://www.clips.ua.ac.be/pages/pattern



