
How? Just get started working on a fun problem. A good place to start is keyword extraction. You don't need a PhD or expensive tools; all you need is some free time and a willingness to read some cool stuff.

Copy a few articles into text files and get to work implementing some of these methods, until you understand them well enough to construct your own for the fun of it.

Here's some good reading material:

https://www.facebook.com/notes/facebook-engineering/under-th...

https://www.researchgate.net/profile/Stuart_Rose/publication...

http://cdn.intechopen.com/pdfs/5338.pdf

https://arxiv.org/pdf/1603.03827v1.pdf

https://www.quora.com/Sentiment-Analysis-What-are-the-good-w...

http://hrcak.srce.hr/file/207669

http://nlp.stanford.edu/fsnlp/promo/colloc.pdf

https://arxiv.org/ftp/cs/papers/0410/0410062.pdf

http://delivery.acm.org/10.1145/1120000/1119383/p216-hulth.p...

Edit: Don't get deterred by the math formulas in these papers. They look far more complicated than they actually are.



Another fun thing is to paste article text into some API, like the Watson demo, so you can see what kinds of things are possible:

https://alchemy-language-demo.mybluemix.net/

I played around with this a bit while developing https://www.findlectures.com, so, knowing what works and what doesn't there, I'm developing some NLP scripts to support my use cases.


I never thought about this particular use case. The subtitles for TED talks should be an ocean of info for you to extract keywords from :D Pretty neat site you've got there. I will be using it. Thanks!


I would say that a good exercise for starting in this field would be to implement something like Tf-Idf [0] for identifying keywords in a set of documents. I don't know where one can find current datasets for this, but I made WikiCorpusExtractor [1] to build sets of documents from Wikipedia.

The only thing one really needs is to count the frequency of words in each document and do some very simple math. Tf-Idf is still very relevant today, and it gives you a good idea of how statistics is used in text mining.

[0] https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[1] https://github.com/joaoventura/WikiCorpusExtractor
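The math really is that simple. Here's a minimal from-scratch sketch (the toy corpus and whitespace tokenization are my own simplifications, not part of WikiCorpusExtractor) that scores each word by term frequency times inverse document frequency:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score every word in every document: term frequency times
    inverse document frequency (higher = more distinctive)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word?
    df = Counter(word for doc in tokenized for word in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({word: (count / len(doc)) * math.log(n_docs / df[word])
                       for word, count in tf.items()})
    return scores

# Toy corpus: "stock" is distinctive to the last document,
# while "the" appears in every document and scores zero.
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "the stock prices fell on monday"]
scores = tf_idf(docs)
```

Words shared by all documents get idf = log(1) = 0, which is exactly the intuition: a word that appears everywhere tells you nothing about any one document.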


I started even simpler than that: I just eliminated stopwords and counted the frequency of each word in the document itself. I did not use a set of documents, as the goal was for the algorithm to be used on the spot on a single block of text.
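That approach (strip stopwords, then rank the remaining words by raw frequency in the single text) can be sketched roughly like this; the tiny stopword list and naive punctuation stripping are my own simplifications:

```python
from collections import Counter

# A tiny hand-picked stopword list; real lists (e.g. NLTK's) are much longer.
STOPWORDS = {"the", "a", "an", "is", "of", "in", "on", "and", "to", "by", "it"}

def top_keywords(text, n=5):
    """Drop stopwords, then rank the remaining words by raw frequency
    within this single block of text (no corpus required)."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return counts.most_common(n)

text = ("The cat chased the mouse. The mouse hid, "
        "and the cat waited by the mouse hole.")
```

No corpus-wide statistics are needed, which is what makes it usable on a single pasted block of text.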

A few months later, after many iterations and a whole lot of testing, the algorithm can now extract highly relevant keywords more than 90% of the time!

I wish I knew about the WikiCorpusExtractor. Thanks for the link!


Thank you very much for the reading material.


You're welcome!



