Ask HN: Need advice on learning NLP

jventura · on Feb 20, 2017

I would suggest starting simple and by hand, to get a feel for the problems in the field. No frameworks, no tools, just you and Python!

Do a simple experiment: get some texts, split them into words on spaces (e.g. line.split(" ")) and use a dict to count the frequency of the words. Sort the words by frequency, look at them, and you will eventually reach the same conclusion as in figure 1 of the paper Luhn wrote while working for IBM in 1958 (http://courses.ischool.berkeley.edu/i256/f06/papers/luhn58.p...)
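A minimal sketch of that experiment, assuming the texts are already loaded as a list of lines. The lowercasing step and the sample sentences are assumptions added for illustration; Counter is just a dict subclass that does the counting for you.

```python
from collections import Counter

def word_frequencies(lines):
    """Count how often each space-separated token occurs."""
    counts = Counter()
    for line in lines:
        # Naive tokenization: split on single spaces.
        # Empty strings (from repeated spaces) are dropped.
        counts.update(w for w in line.lower().split(" ") if w)
    return counts

# Made-up sample text; replace with lines read from a real corpus.
sample = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
freqs = word_frequencies(sample)

# Print the most frequent words first.
for word, count in freqs.most_common(5):
    print(word, count)
```

On real text the top of the list is usually dominated by function words such as "the" and "of", which is a good starting point for looking at the distribution.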

There are lots of corpora out there in the wild, but if you need to roll your own from Wikipedia texts you can use this tool I wrote: https://github.com/joaoventura/WikiCorpusExtractor

From this experiment, and depending on whether you like statistics, you can play a bit with the numbers. For instance, you can use Tf-Idf (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to extract potential keywords from documents. Check the formula: it only uses the frequency of occurrence of words in documents.
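A rough sketch of one common Tf-Idf variant (relative term frequency times log of inverse document frequency), assuming each document is a non-empty list of tokens. The function name and the sample documents are made up for illustration, and other weighting variants exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {word: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # tf = count / document length, idf = log(N / df)
            w: (count / len(doc)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        })
    return scores

# Made-up sample documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird flew over the house".split(),
]
scores = tf_idf(docs)

# Highest-scoring words of the first document are its keyword candidates.
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

A word that occurs in every document (here "the") gets a score of zero, while words that are frequent in one document but rare elsewhere float to the top.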

Only use tools such as deep neural networks if you decide later that they are essential for what you need. I did an entire PhD in this area just with Python and playing with frequencies, no frameworks at all (an example of my work can be found at http://www.sciencedirect.com/science/article/pii/S1877050912...).

Good luck!

arvinsim · on Feb 20, 2017

Thanks for posting this. Good to know that using just vanilla Python is as viable for learning as using specialized frameworks.

syllogism · on Feb 20, 2017

You might find these posts interesting:

https://explosion.ai/blog/part-of-speech-pos-tagger-in-pytho...

https://explosion.ai/blog/parsing-english-in-python

These days I would say these articles are better read as illustrations of how to solve NLP problems with linear models -- tagging and parsing are less important than they used to be. Here's how I think about doing NLP with current neural network techniques: https://explosion.ai/blog/deep-learning-formula-nlp

jventura · on Feb 20, 2017

Oh, definitely you don't need much more than vanilla Python for playing around and testing things. The only thing that you may need is a good tokenizer (i.e., to know where you should break the words), but the regex in [0] was good enough for what I needed. I was working with texts in three different languages (EN, DE and PT), and that is the reason I recommend statistical approaches, as they tend to be language independent.

[0] https://github.com/joaoventura/WikiCorpusExtractor/blob/mast...
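To give an idea of what a regex tokenizer can look like, here is a small sketch. The pattern below is a hypothetical one written for illustration, not the regex from the linked repository: it keeps words with internal apostrophes or hyphens together and emits each punctuation mark as its own token.

```python
import re

# Hypothetical pattern (an assumption, not the one in the linked repo):
#   \w+(?:['-]\w+)*  -> a word, optionally joined by apostrophes or hyphens
#   [^\w\s]          -> any single punctuation character
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    """Return the list of word and punctuation tokens found in text."""
    return TOKEN_RE.findall(text)

print(tokenize("Don't split state-of-the-art, please!"))
# Python 3's \w is Unicode-aware, so accented letters stay inside words.
print(tokenize("Über 100 Wörter, não é?"))
```

Because the pattern only relies on the notion of "word character", it works without language-specific rules, which fits the language-independent approach described above.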

navyad · on Feb 20, 2017

Thanks for the input. Right now I am playing with the built-in corpora and texts of the NLTK library.

hiddencost · on Feb 20, 2017

NLP for what purpose?

- Academic
  -- want results? deep learning [0], data munging [1,2]
  -- want to understand "why" / context? Jurafsky and Martin [1]

- Professional
  -- the data is easy to get and clean? deep learning [0]
  -- you need to do a lot of work to get the signal? [2]

- Personal
  -- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  -- http://colah.github.io/posts/2014-07-NLP-RNNs-Representation...

(Andrej Karpathy and Chris Olah are some of my favorite writers)

[0] http://www.deeplearningbook.org/

[1] https://web.stanford.edu/~jurafsky/slp3/

[2] http://nlp.stanford.edu/IR-book/

deepGem · on Feb 20, 2017

Start with Machine Learning by Andrew Ng on Coursera. Once you get the hang of neural networks, which I think is chapter 4 in the course, jump to Stanford's CS224n. It's helpful to complete Andrew's course as well.

http://web.stanford.edu/class/cs224n/

CS224n is not easy. Of course, you can learn NLP without deep learning, but today it makes sense to pursue this path. During CS224n you'll get some project ideas, as they discuss a ton of papers and the latest stuff.

rmchugh · on Feb 20, 2017

I think deep learning is a pretty hefty starting point for learning NLP. Cutting edge NLP seems to be more and more based on deep learning, but it's a rather steep learning curve for a beginner. I would have thought starting with the basics (like the NLTK book) was more useful. Once those concepts are mastered, one can progress to see what deep learning brings to the field.

navyad · on Feb 20, 2017

thanks.

mericsson · on Feb 20, 2017

Some good advice here: https://blog.ycombinator.com/how-to-get-into-natural-languag...

navyad · on Feb 20, 2017

Didn't know of this, highly helpful, thanks.

haidrali · on Feb 20, 2017

Keep reading and practicing with the NLTK book (http://www.nltk.org/book_1ed/); by the time you complete it you will have a good understanding of NLP. Suggested sample projects to work on include:

Implementing a classifier. For details, see chapter 13 of http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Covering topics like sentiment analysis, document summarisation, etc.
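As a starting point for the classifier project, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. The function names and the toy training data are assumptions for illustration; a real project would add proper tokenization and evaluation on held-out data.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (token_list, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(labeled_docs)
    # log P(class)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    # log P(word | class) with add-one (Laplace) smoothing
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_lik[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_lik, vocab

def classify(tokens, log_prior, log_lik, vocab):
    """Return the class with the highest log-probability; unseen words are ignored."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data with two classes, "c" and "j".
train = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "j"),
]
model = train_nb(train)
print(classify("Chinese Chinese Chinese Tokyo Japan".split(), *model))
```

Working in log space avoids numeric underflow when documents get long, since many small probabilities are added instead of multiplied.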

tu7001 · on Feb 20, 2017

The information retrieval book is a great read. I'm going through it and implementing the algorithms, and learning a lot.

haidrali · on Feb 20, 2017

I implemented these two algorithms back in 2013; do check it out: https://github.com/wonderer007/Naive-Bayes-classifier

kyrre · on Feb 20, 2017

no point wasting your time on nltk:

cs224d (videos, lecture notes, assignments)

a similar course: https://github.com/oxford-cs-deepnlp-2017/lectures

good paper: https://arxiv.org/abs/1103.0398 "Natural Language Processing (almost) from Scratch"

gtani · on Feb 20, 2017

I think you want to understand the comp linguistics viewpoint: parsers, PoS tagging, dependency analysis, syntax trees;

and the machine learning perspective: embeddings in, say, 100-200 dimensional space (word2vec, GloVe) and topic modelling/LDA, and latent semantic analysis from the 90's. Then you can read about inputting embedding datasets into LSTM, GRU, content addressable memory/attention mechanisms etc that are being furiously introduced (you can scan the ICLR submissions and http://aclweb.org/anthology/).

_____________________

The Jurafsky/Martin draft 3rd ed is a good starting point, they've got about 2/3 of chapters drafted: https://web.stanford.edu/~jurafsky/slp3/ as well as the Stanford, Oxford, etc courses on NLP and comp linguistics, and Klein's https://people.eecs.berkeley.edu/~klein/cs288/fa14/ , Collins: http://www.cs.columbia.edu/~cs4705/ and other courses at MIT, CMU, UIUC etc

Also, try out the various standard benchmark datasets and tasks: https://arxiv.org/abs/1702.01923

________________

Last time I checked, this SoA page wasn't up to date or very well summarized, but it will give you lots of project ideas: http://www.aclweb.org/aclwiki/index.php?title=State_of_the_a...

sainib · on Feb 20, 2017

This is one of the best resources for learning NLP using Python - https://www.youtube.com/watch?v=FLZvOKSCkxY&list=PLQVvvaa0Qu...

Step by step, one concept at a time, in short videos of just a few minutes each.

sprobertson · on Feb 20, 2017

For the deep learning angle, I'm starting a project-based tutorial series on using neural networks (specifically RNNs) for NLP, in PyTorch: https://github.com/spro/practical-pytorch

So far it covers using RNNs for sequence classification and generation, and combining those for seq2seq translation. Next up is using recursive neural networks for structured intent parsing.

PS: To anyone who has searched for NLP tutorials, what tutorial have you wanted that you couldn't find?

stared · on Feb 20, 2017

See links in here: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html. Especially:

- Python packages: Gensim, spaCy

- book: https://web.stanford.edu/~jurafsky/slp3/

demonshalo · on Feb 20, 2017

I think the best way to start is tackling a specific problem. For example, try building a summarizer for any given piece of text.

Start by using traditional statistical methods first in order to understand what works and what doesn't. From there, you can go on to work on an ML solution to the same problem in order to see the actual difference between the two approaches in terms of comparable output.
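One possible statistical baseline for the summarizer, as a rough sketch: score each sentence by the average corpus frequency of its non-stopword words and keep the top few in their original order. The stopword list, the sentence-splitting regex and the sample text are all simplifying assumptions, not a reference method.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "on", "for"}

def summarize(text, n_sentences=2):
    """Pick the n_sentences highest-scoring sentences, kept in original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    # Word frequencies over the whole text.
    freqs = Counter(w for words in tokenized for w in words)

    def score(words):
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokenized[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

# Made-up sample text.
text = (
    "Cats chase mice. Mice fear cats. "
    "The weather is nice today. Cats and mice are old enemies."
)
print(summarize(text, 2))
```

Having a baseline like this makes it easier to judge later whether an ML solution actually produces better summaries on the same texts.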