An algorithm for generating automatic hashtags

Udo · on Sept 19, 2013

I think we - and to some degree the Twitter platform itself - are using hash tags redundantly, and this algorithm is just a manifestation of this redundancy that is killing data quality. These pathological tweets do tend to look like the example sentence provided, maybe even more extreme:

  #Swayy #Launches Into Public #Beta To Curate #Content For Your #SocialMedia Audience

Now, all of these words would be reachable with a normal search, so why do we over-tag everything? Are users really going to see what other Tweets have been recently tagged #Content? It makes even less sense with product names like #Swayy.

A more reasonable approach would be to tag things that are not part of the sentence itself:

  We're launching into public beta to curate content for social media! #Swayy

Or inline, on occasion, to express that you're taking part in a meme:

  Dear gods, #IHateIt when it's cold outside

We don't need algorithmic help to find hash tags in these cases either, and I'm arguing that automatically converting every third word into a hash tag doesn't do Twitter feeds any good, quality-wise.

sp332 · on Sept 19, 2013

#Content means you're talking about content, and it's not just a word you used in passing. Similarly, #launches adds your tweet to the conversation people are having about launches, instead of making people sort through missile launches, some guy who launches into a story, and misspelled lunches to get to relevant tweets.

Madrigal · on Sept 19, 2013

To some degree, word frequency counting does correctly classify what the story is about, so in my opinion it does add quality. Think of it like regular tagging. However, the threshold for how many hashtags this method produces should be very low, as it offers no contextual or meta information about the story.
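
A minimal Python sketch of frequency-based tagging with a deliberately low cap, as suggested above. The stopword list, the function name `frequent_tags`, and the `max_tags` / `min_count` defaults are assumptions made for illustration, not part of the original post.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real one would be much longer).
STOPWORDS = {"a", "an", "and", "for", "in", "into", "is", "it",
             "of", "on", "the", "to", "your"}

def frequent_tags(text, max_tags=2, min_count=2):
    """Return at most max_tags hashtags for words seen at least min_count times."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    # most_common() sorts by count, so the cap keeps only the top words.
    return ["#" + word for word, count in counts.most_common(max_tags)
            if count >= min_count]

article = (
    "Swayy launches into public beta. Swayy helps you curate content, "
    "and the content it finds is matched to your audience."
)
print(frequent_tags(article))
```

Keeping `max_tags` small and requiring a minimum count means a short title, where every word appears once, produces no tags at all rather than a tag on every third word.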

yeukhon · on Sept 20, 2013

I totally agree. Look at Stack Overflow's way. I think they scan for common tag names that appear in the title and text and suggest tags. OP's presentation really doesn't impress me.

danmaz74 · on Sept 19, 2013

The tech is nice, but adding hashtags like this does, in my opinion, more harm than good. Hashtagifying common words rarely makes sense, except in those few cases when there is a specific conversation going on about that word for some special reason.

Users ask us all the time to add an automatic hashtagging feature to hashtagify.me, but I'm resisting those requests because bad hashtagging makes hashtags less useful. It would be great to find an algorithm that (at least almost) always finds hashtags that are really relevant, but until that happens it's better to ask users to make a little effort.

[edited for clarity]

AznHisoka · on Sept 19, 2013

For those people who are interested in topic/article classification and NLP, Twitter can be a gold mine, especially hashtags. If you gather the hashtags for a million articles, you pretty much have a co-location database. Now you can mine that data and see which hashtags are common if you have "Google Panda" in your title, for instance, or which hashtags are commonly used with #seo. Hashtags are basically structured semantic data, if you look at them in aggregate. A good tool for doing this is SOLR or ElasticSearch. Simply import all the hashtags for a bunch of articles into the index, and do a faceted search for a specific hashtag or keyword, and you'll get the top 10 associated hashtags that are highly related to that keyword.
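
As a rough illustration of the aggregation described above, here is an in-memory Python stand-in for the faceted search. It does not use SOLR or ElasticSearch; the function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(tag_sets):
    """Count, for every hashtag, how often each other hashtag appears with it."""
    cooc = defaultdict(Counter)
    for tags in tag_sets:
        # Every ordered pair of distinct tags on the same article counts once.
        for a, b in permutations(set(tags), 2):
            cooc[a][b] += 1
    return cooc

def related_tags(cooc, tag, top_n=10):
    """Return the top_n hashtags most often seen together with tag."""
    return cooc.get(tag, Counter()).most_common(top_n)

# Made-up sample: one set of hashtags per article.
articles = [
    {"#seo", "#google", "#panda"},
    {"#seo", "#google", "#marketing"},
    {"#seo", "#content"},
    {"#startups", "#beta"},
]
cooc = build_cooccurrence(articles)
print(related_tags(cooc, "#seo", top_n=3))
```

With a search index, the same question is typically answered by filtering on one hashtag and faceting on the hashtag field; the counter above only mimics that result on a small scale.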

AznHisoka · on Sept 19, 2013

Of course, the applications of this go beyond hashtags. You can apply this towards topic classification of content.

shlomib · on Sept 19, 2013

something similar to - http://hashtagify.me

joosters · on Sept 19, 2013

A strange example. Does the author really think hashtagifying the word 'content' has improved the tweet in some way? Do they expect people to be searching Twitter for #content and getting some useful results?

gliese1337 · on Sept 19, 2013

Probably not, but he did disclaim that it was pretty naive and could be improved in many ways. I think it's a pretty darn good first pass. That particular issue comes about from tagging words that are common in the target text without checking whether that's actually significant, i.e., whether a word is common in the text just because it's a common word overall, rather than because it actually indicates the subject of the text. That should be pretty easy to fix by comparing with an English word frequency list.
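
One way to sketch that fix in Python: divide each word's rate in the text by its rate in general English and keep the words with the highest ratio. The background numbers below are made up for illustration (a real list would come from a corpus), and `distinctive_words` and `UNKNOWN_PER_MILLION` are hypothetical names.

```python
import re
from collections import Counter

# Hypothetical occurrences per million words of general English.
# These values are invented for the example, not taken from a real corpus.
BACKGROUND_PER_MILLION = {
    "to": 25000.0, "for": 9000.0, "into": 1500.0, "your": 1200.0,
    "public": 300.0, "content": 150.0, "audience": 60.0,
    "launches": 8.0, "beta": 5.0, "curate": 2.0,
}
UNKNOWN_PER_MILLION = 0.5  # assumed rate for words missing from the list

def distinctive_words(text, top_n=3):
    """Rank words by how much more frequent they are here than in general English."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    scores = {}
    for word, count in counts.items():
        rate_in_text = count / total * 1_000_000
        rate_in_english = BACKGROUND_PER_MILLION.get(word, UNKNOWN_PER_MILLION)
        scores[word] = rate_in_text / rate_in_english
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

title = "Swayy launches into public beta to curate content for your audience"
print(distinctive_words(title))
```

With these assumed frequencies, the product name and the rarer words rise to the top, while "content" drops out because it is common in English overall.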

shlomib · on Sept 19, 2013

Agree. Maybe some TF-IDF solution.
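
For readers unfamiliar with TF-IDF, a small self-contained Python sketch follows. The corpus is invented, the smoothed IDF formula is one common variant among several, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_scores(doc, corpus):
    """Score each word in doc by term frequency times inverse document frequency."""
    doc_tokens = tokenize(doc)
    tf = Counter(doc_tokens)
    corpus_tokens = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for tokens in corpus_tokens if word in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF variant
        scores[word] = (count / len(doc_tokens)) * idf
    return scores

# Made-up corpus of titles.
corpus = [
    "Swayy launches into public beta to curate content",
    "How to write content for your blog",
    "Great content marketing tips for startups",
    "New beta of the photo app launches today",
]
scores = tfidf_scores(corpus[0], corpus)
for word in sorted(scores, key=scores.get, reverse=True):
    print(f"{word}: {scores[word]:.3f}")
```

Because "content" appears in three of the four titles, it gets the lowest score in the first title, while words unique to that title, such as "swayy", score highest.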

shlomib · on Sept 19, 2013

https://twitter.com/search?q=%23Content

dtauzell · on Sept 19, 2013

You can even get a #content t-shirt: http://bunnyaimee.tumblr.com/post/61678190440/we-opted-for-a...

sp332 · on Sept 19, 2013

Contrast with the un-hashtagged version: https://twitter.com/search?q=content

stephen_mcd · on Sept 19, 2013

Here's a terrible one I wrote years ago for a Twitter bot:

https://github.com/stephenmcd/babbler/blob/master/babbler/ta...

If I recall, it simply extracts non-dictionary words from the outgoing tweet, then actually queries the Twitter API itself to gauge the popularity of each potential hashtag, only using the most popular.
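
A hedged Python sketch of the approach as described in this comment, not the linked code itself. The dictionary is a tiny stand-in, and `popularity` is a hypothetical callable that would wrap whatever lookup is available; no real Twitter API call is shown.

```python
import re

# Tiny stand-in dictionary (assumption); a real word list would be far larger.
DICTIONARY = {"playing", "with", "and", "for", "topic", "modeling"}

def pick_hashtags(tweet, popularity, max_tags=1):
    """Tag the most popular non-dictionary words in tweet.

    popularity is a hypothetical callable mapping a word to a number,
    for example a count returned by some search service.
    """
    candidates = []
    for word in re.findall(r"[A-Za-z]+", tweet):
        if word.lower() not in DICTIONARY and word not in candidates:
            candidates.append(word)
    ranked = sorted(candidates, key=popularity, reverse=True)
    return ["#" + word for word in ranked[:max_tags]]

# Fake popularity numbers used in place of a live lookup.
fake_counts = {"NLTK": 1200, "gensim": 300}
tweet = "Playing with NLTK and gensim for topic modeling"
print(pick_hashtags(tweet, lambda w: fake_counts.get(w, 0)))
```

Passing the popularity lookup in as a function keeps the selection logic testable without network access.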

AnSavvides · on Sept 19, 2013

Although very basic, this is really nice - the wonders of NLTK! You explain your algorithm in very simple terms so kudos for that.

It might be a good idea to put this in a GitHub repository rather than a simple gist maybe? I am sure there will be plenty of people (including myself) interested in contributing, it's much easier doing it on a repository rather than a gist :)

shlomib · on Sept 19, 2013

Good idea! Actually I'm thinking about creating some open source “social-NLP” python package. What do you think?

AnSavvides · on Sept 19, 2013

That would be really cool!

yankoff · on Sept 19, 2013

Cool stuff. I have been playing with nltk and gensim myself recently in order to solve a similar problem. But I want to combine unsupervised topic modeling (LDA) with some supervised learning algorithm (probably naive Bayes). Struggling with LDA at this point, but pretty interesting stuff.

louyang · on Sept 19, 2013

A friend and I also cofounded a content curator; we do tagging, but in a different fashion: http://wintria.com

Nice article also, I don't think you guys take enough advantage of the article body though.

shlomib · on Sept 19, 2013

Obviously this is NOT the algorithm we use in our product (Swayy)...

louyang · on Sept 19, 2013

I didn't say it was, what was that remark about?

solvemenow · on Sept 19, 2013

Build classification for these generated hash tags. Give a score and put this into a feedback loop to build better hashtags.

btbuildem · on Sept 19, 2013

What's the point exactly?

icedog · on Sept 20, 2013

So later generations will have good material for ridiculing us.

bowerbird · on Sept 20, 2013

that time can't come soon enough for me. :+) #fedupwiththehashtagcrap

-bowerbird

quarterto · on Sept 19, 2013

#Eww. I #hate it when #hashtags are used inline.

shlomib · on Sept 19, 2013

So just filter the hashtags from the output and put them at the end of the original title :P
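
A minimal Python sketch of that suggestion, assuming the tagger returns the title with inline hashtags; `append_tags` is a hypothetical helper name.

```python
import re

def append_tags(original_title, tagged_output):
    """Collect hashtags from the tagged output and append them to the original title."""
    tags = []
    for tag in re.findall(r"#\w+", tagged_output):
        if tag not in tags:  # keep first occurrence, preserve order
            tags.append(tag)
    if not tags:
        return original_title
    return original_title + " " + " ".join(tags)

original = "Swayy Launches Into Public Beta To Curate Content For Your Social Media Audience"
tagged = "#Swayy #Launches Into Public #Beta To Curate #Content For Your #SocialMedia Audience"
print(append_tags(original, tagged))
```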

zeckalpha · on Sept 19, 2013

Or just make them links and hide the #.

zeckalpha · on Sept 19, 2013

f (x) = x + " #yolo"

jheriko · on Sept 19, 2013

#hashtagsarelame :)

krapp · on Sept 19, 2013

#honestly_though_hacker_news_could_use_them_or_at_least_something_like_them #meta #hackernews

sv123 · on Sept 19, 2013

#tinytext