An algorithm for generating automatic hashtags (swayy.co)
46 points by shlomib on Sept 19, 2013 | 33 comments



I think we - and to some degree the Twitter platform itself - are using hash tags redundantly, and this algorithm is just a manifestation of this redundancy that is killing data quality. These pathological tweets do tend to look like the example sentence provided, maybe even more extreme:

  #Swayy #Launches Into Public #Beta To Curate #Content For Your #SocialMedia Audience
Now, all of these words would be reachable with a normal search, so why do we over-tag everything? Are users really going to see what other Tweets have been recently tagged #Content? It makes even less sense with product names like #Swayy.

A more reasonable approach would be to tag things that are not part of the sentence itself:

  We're launching into public beta to curate content for social media! #Swayy
Or inline, on occasion, to express that you're taking part in a meme:

  Dear gods, #IHateIt when it's cold outside
We don't need algorithmic help to find hash tags in these cases either, and I'm arguing that automatically converting every third word into a hash tag doesn't do Twitter feeds any good, quality-wise.


#Content means you're talking about content, and it's not just a word you used in passing. Similarly, #launches adds your tweet to the conversation people are having about launches, instead of making people sort through missile launches, some guy who launches into a story, and misspelled lunches to get to relevant tweets.


To some degree, word frequency counting does correctly classify what the story is about, so in my opinion it does add quality. Think of it like regular tagging. However, the number of hashtags this method is allowed to produce should be kept very low, as it offers no contextual or meta information about the story.
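
As a rough illustration of frequency counting with a low cap on the number of tags, here is a minimal sketch. The function name `frequency_hashtags`, the tiny stopword list, and the default cap of two tags are assumptions made for the example, not the article's code.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption); a real one would be much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "into", "is", "of", "the", "to", "you", "your"}

def frequency_hashtags(text, max_tags=2):
    """Return up to max_tags hashtags for the most frequent non-stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return ["#" + word for word, _ in counts.most_common(max_tags)]

article = (
    "Swayy launches into public beta. Swayy helps you curate content, "
    "and the content you share is picked for your audience. Swayy is free."
)
print(frequency_hashtags(article))  # ['#swayy', '#content']
```

Keeping `max_tags` small is the whole point here: counting alone cannot tell whether a frequent word is actually a topic.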


I totally agree. Look at how Stack Overflow does it. I think they scan the title and text for common tag names and suggest those as tags. OP's presentation really doesn't impress me.
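
Scanning for known tag names could look something like the sketch below. It only illustrates the idea in this comment; it is not Stack Overflow's actual mechanism, and `suggest_tags` and the sample tag list are hypothetical.

```python
import re

def suggest_tags(title, body, known_tags, max_tags=5):
    """Suggest known tags that occur as whole words in the title or body."""
    text = (title + " " + body).lower()
    scored = []
    for tag in known_tags:
        # Match the tag only when it is not part of a longer word,
        # so "java" does not fire on "javascript".
        pattern = r"(?<![\w-])" + re.escape(tag.lower()) + r"(?![\w-])"
        hits = len(re.findall(pattern, text))
        if hits:
            scored.append((hits, tag))
    # Most frequent first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tag for _, tag in scored[:max_tags]]

known = ["python", "nltk", "java", "javascript"]
print(suggest_tags(
    "Tokenizing tweets with NLTK in Python",
    "I use Python and nltk but my JavaScript client breaks.",
    known,
))  # ['nltk', 'python', 'javascript']
```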


The tech is nice, but adding hashtags like this does, in my opinion, more harm than good. Hashtagifying common words rarely makes sense, except in those few cases when there is a specific conversation going on about that word for some special reason.

Users ask us all the time to add an automatic hashtagging feature to hashtagify.me, but I'm resisting those requests because bad hashtagging makes hashtags less useful. It would be great to find an algorithm that (at least almost) always finds hashtags that are really relevant, but until that happens it's better to ask users to make a little effort.

[edited for clarity]


For those people who are interested in topic/article classification and NLP, Twitter can be a gold mine, especially hashtags. If you gather the hashtags for a million articles, you pretty much have a co-occurrence database. Now you can mine that data and see which hashtags are common if you have "Google Panda" in your title, for instance, or which hashtags are commonly used with #seo. Hashtags are basically structured semantic data, if you look at them in aggregate. A good tool for doing this is Solr or Elasticsearch. Simply import all the hashtags for a bunch of articles into the index, and do a faceted search for a specific hashtag or keyword, and you'll get the top 10 associated hashtags that are highly related to that keyword.
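
The faceted search described above boils down to counting which hashtags co-occur with the one you query. Here is a small in-memory sketch of that counting, used as a stand-in for Solr or Elasticsearch; `related_hashtags` and the sample data are made up for illustration.

```python
from collections import Counter

def related_hashtags(tagged_articles, query, top_n=10):
    """Count which hashtags appear alongside `query` across articles."""
    query = query.lower()
    counts = Counter()
    for tags in tagged_articles:
        tags = {t.lower() for t in tags}  # one vote per article per tag
        if query in tags:
            counts.update(tags - {query})
    return counts.most_common(top_n)

# Invented sample: each inner list is the set of hashtags seen on one article.
sample = [
    ["#seo", "#google", "#panda"],
    ["#seo", "#google"],
    ["#seo", "#marketing", "#google"],
    ["#nlp", "#python"],
]
print(related_hashtags(sample, "#seo"))
```

On a real index the same question would be a filter on the queried hashtag plus a facet (or terms aggregation) on the hashtag field.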


Of course, the applications of this go beyond hashtags. You can apply this towards topic classification of content.


Something similar to http://hashtagify.me


A strange example. Does the author really think hashtagifying the word 'content' has improved the tweet in some way? Do they expect people to be searching Twitter for #content and getting some useful results?


Probably not, but he did disclaim that it was pretty naive and could be improved in many ways. I think it's a pretty darn good first pass. That particular issue comes from tagging words that are common in the target text without checking whether that's actually significant, i.e., whether a word is common in the text just because it's a common word overall, rather than because it actually indicates the subject of the text. That should be pretty easy to fix by comparing against an English word frequency list.


Agree. Maybe some TF-IDF solution.
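
A minimal sketch of what a TF-IDF pass might look like, assuming a small background corpus stands in for the English word frequency list mentioned above. The function names, the sample sentences, and the exact weighting (raw term count times `log(N / df)`) are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_keywords(document, background_docs, top_n=3):
    """Rank words in `document` by term frequency times inverse document frequency."""
    doc_counts = Counter(tokenize(document))
    corpus = [set(tokenize(d)) for d in background_docs]
    n_docs = len(corpus) + 1  # background documents plus the target document
    scores = {}
    for word, tf in doc_counts.items():
        # Document frequency; the target document itself counts once.
        df = 1 + sum(word in doc for doc in corpus)
        scores[word] = tf * math.log(n_docs / df)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, score in ranked[:top_n] if score > 0]

# Invented background text; a real list would come from a large corpus.
background = [
    "the content of the report is easy to read",
    "new content for your audience went into the feed",
    "the public beta is open to your team",
]
tweet = "Swayy launches into public beta to curate content for your audience Swayy"
print(tfidf_keywords(tweet, background))  # ['swayy', 'launches', 'curate']
```

With this toy background, "content" scores low because it shows up in most of the background sentences, which is exactly the correction being suggested.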




Contrast with the un-hashtagged version: https://twitter.com/search?q=content


Here's a terrible one I wrote years ago for a Twitter bot:

https://github.com/stephenmcd/babbler/blob/master/babbler/ta...

If I recall, it simply extracts non-dictionary words from the outgoing tweet, then actually queries the Twitter API itself to gauge the popularity of each potential hashtag, only using the most popular.
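
The approach described here (non-dictionary words, ranked by popularity) could be sketched roughly as follows. This is not the babbler code: `pick_hashtags` is a hypothetical name, the dictionary is a toy set, and the popularity lookup is a stub standing in for a real Twitter API query.

```python
import re

def pick_hashtags(tweet, dictionary, popularity, max_tags=1):
    """Tag the most popular words that are not ordinary dictionary words."""
    candidates = []
    seen = set()
    for word in re.findall(r"[A-Za-z][A-Za-z0-9]+", tweet):
        key = word.lower()
        if key not in dictionary and key not in seen:
            seen.add(key)
            candidates.append(word)
    # `popularity` is any callable returning a number for a candidate word.
    ranked = sorted(candidates, key=popularity, reverse=True)
    return ["#" + word for word in ranked[:max_tags]]

# Toy data: a real version would use a word list and live search counts.
dictionary = {"trying", "out", "and", "for", "topic", "modeling"}
fake_counts = {"gensim": 120, "nltk": 900}

def fake_popularity(word):
    return fake_counts.get(word.lower(), 0)

print(pick_hashtags("Trying out gensim and NLTK for topic modeling",
                    dictionary, fake_popularity))  # ['#NLTK']
```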


Although very basic, this is really nice - the wonders of NLTK! You explain your algorithm in very simple terms so kudos for that.

Maybe it would be a good idea to put this in a GitHub repository rather than a simple gist? I am sure there will be plenty of people (including myself) interested in contributing, and that's much easier to do in a repository than in a gist :)


Good idea! Actually I'm thinking about creating some open source “social-NLP” Python package. What do you think?


That would be really cool!


Cool stuff. I have been playing with NLTK and gensim myself recently in order to solve a similar problem. But I want to combine unsupervised topic modeling (LDA) with some supervised learning algorithm (probably naive Bayes). Struggling with LDA at this point, but pretty interesting stuff.


A friend and I also cofounded a content curator; we do tagging, but in a different fashion: http://wintria.com

Nice article, though I don't think you guys take enough advantage of the article body.


Obviously this is NOT the algorithm we use in our product (Swayy)...


I didn't say it was. What was that remark about?


Build a classifier for these generated hashtags. Give each one a score and put the scores into a feedback loop to build better hashtags.


What's the point exactly?


So later generations will have good material for ridiculing us.


that time can't come soon enough for me. :+) #fedupwiththehashtagcrap

-bowerbird


#Eww. I #hate it when #hashtags are used inline.


So just filter the hashtags from the output and put them at the end of the original title :P
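
Something along these lines would do it, as a minimal sketch; the function name is made up, and it assumes hashtags are `#` followed by word characters.

```python
import re

HASHTAG = re.compile(r"#(\w+)")

def move_hashtags_to_end(tagged_title):
    """Strip inline '#' marks and append the unique hashtags at the end."""
    tags = []
    for match in HASHTAG.finditer(tagged_title):
        tag = "#" + match.group(1)
        if tag not in tags:
            tags.append(tag)
    plain = HASHTAG.sub(r"\1", tagged_title)  # keep the word, drop the '#'
    if not tags:
        return plain
    return plain + " " + " ".join(tags)

print(move_hashtags_to_end(
    "#Swayy #Launches Into Public #Beta To Curate #Content For Your #SocialMedia Audience"
))
```

Note that joined tags such as #SocialMedia stay joined in the title, since the original spacing can't be recovered from the tagged output.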


Or just make them links and hide the #.
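
A rough sketch of the link variant, assuming HTML output and Twitter's search URL as the link target; `link_hashtags` and its `base_url` parameter are illustrative names.

```python
import html
import re
from urllib.parse import quote

def link_hashtags(tagged_title, base_url="https://twitter.com/search?q="):
    """Turn each #word into a link whose visible text omits the '#'."""
    parts = []
    last = 0
    for match in re.finditer(r"#(\w+)", tagged_title):
        parts.append(html.escape(tagged_title[last:match.start()]))
        word = match.group(1)
        href = base_url + quote("#" + word)  # '#' becomes %23 in the query
        parts.append('<a href="{}">{}</a>'.format(href, html.escape(word)))
        last = match.end()
    parts.append(html.escape(tagged_title[last:]))
    return "".join(parts)

print(link_hashtags("Public #Beta now"))
# Public <a href="https://twitter.com/search?q=%23Beta">Beta</a> now
```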


f (x) = x + " #yolo"


#hashtagsarelame :)


#honestly_though_hacker_news_could_use_them_or_at_least_something_like_them #meta #hackernews


#tinytext



