> highly-scalable
Everything I take from our set of off-the-shelf (OTS) components will have live, verifiable examples of running at scale.
> multicast, real-time, distributed message passing system
For this the choice really comes down to RabbitMQ or ejabberd. An XMPP solution is appealing for the obvious benefits of having a "presence" concept, but an AMQP solution keeps us closer to the current reality of Twitter.
So we start with a large distributed RabbitMQ setup: a few clusters scattered across the world, connected via shovel pipes, with some nested queue/exchange plumbing to wire it all together.
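To make the cluster-plus-shovel idea concrete, here is a toy in-memory sketch. It does not talk to RabbitMQ or use any client library. `Cluster`, `run_shovel`, the region names and the queue names are all made up for illustration, and the exchanges are assumed to have simple fanout semantics.

```python
from collections import deque


class Cluster:
    """Toy stand-in for one broker cluster: fanout exchanges bound to queues."""

    def __init__(self, name):
        self.name = name
        self.queues = {}    # queue name -> deque of pending messages
        self.bindings = {}  # exchange name -> set of bound queue names

    def declare_queue(self, queue):
        self.queues.setdefault(queue, deque())

    def bind(self, exchange, queue):
        self.declare_queue(queue)
        self.bindings.setdefault(exchange, set()).add(queue)

    def publish(self, exchange, message):
        # Fanout semantics: every queue bound to the exchange gets a copy.
        for queue in self.bindings.get(exchange, ()):
            self.queues[queue].append(message)


def run_shovel(source, source_queue, destination, destination_exchange):
    """Drain one queue on `source` and republish each message on `destination`."""
    moved = 0
    pending = source.queues[source_queue]
    while pending:
        destination.publish(destination_exchange, pending.popleft())
        moved += 1
    return moved


# Hypothetical two-region layout.
us = Cluster("us-east")
eu = Cluster("eu-west")

us.bind("tweets", "timeline.alice")  # local subscriber
us.bind("tweets", "outbound.eu")     # queue that the shovel drains
eu.bind("tweets", "timeline.bob")    # subscriber in the other region

us.publish("tweets", "hello from the US")
moved = run_shovel(us, "outbound.eu", eu, "tweets")
```

The point of the sketch is that cross-region delivery is just another queue binding on the source side plus a republish on the destination side, so local subscribers never wait on the wide-area link.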
> full-text search
Some queues dump to an FTS setup that keeps recent messages in RAM and migrates them to disk as they get older. Solr is probably a better solution here, but I know the Sphinx delta/full-index model better and would reach for that first.
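A rough sketch of the delta/full-index idea, in plain Python rather than Sphinx itself: new messages go into a small in-memory delta index, a periodic merge folds them into the main index, and searches consult both. The `MainDeltaIndex` class and its whitespace tokenizer are assumptions for illustration, and the "main" index here is just a dict standing in for the on-disk one.

```python
from collections import defaultdict


class MainDeltaIndex:
    """Toy inverted index split into a large main part and a small delta part."""

    def __init__(self):
        self.delta = defaultdict(set)  # term -> message ids, recent (RAM)
        self.main = defaultdict(set)   # term -> message ids, stands in for disk

    def add(self, message_id, text):
        # New messages only ever touch the small delta index.
        for term in text.lower().split():
            self.delta[term].add(message_id)

    def merge(self):
        """Fold the delta into the main index and start a fresh delta."""
        for term, ids in self.delta.items():
            self.main[term] |= ids
        self.delta = defaultdict(set)

    def search(self, term):
        # A query has to look in both indexes to see old and new messages.
        term = term.lower()
        return self.main.get(term, set()) | self.delta.get(term, set())


index = MainDeltaIndex()
index.add(1, "Fail whale again")
index.merge()                      # message 1 now lives in the main index
index.add(2, "whale watching")     # message 2 is still only in the delta
hits = index.search("whale")
```

Keeping the delta small is what makes indexing new messages cheap; the cost of rebuilding the large index is paid only at merge time.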
> archiving
Dump a firehose queue to disk and send copies/updates out to various services that want it. Pushing it to HDFS would probably be the first choice I would look at, just because it is easy to go from there to various Hadoop analytics.
> deletions
Nothing is ever deleted; it just becomes invisible. This is just a flag to add to the FTS indexes and archives.
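As a minimal sketch of the flag approach, assuming a hypothetical `Archive` store and a `visible` field (neither comes from any particular library): deleting flips the flag, and every read path filters on it.

```python
from dataclasses import dataclass


@dataclass
class Message:
    message_id: int
    text: str
    visible: bool = True  # the "deleted" flag, inverted


class Archive:
    """Toy archive that never drops a record."""

    def __init__(self):
        self.messages = {}

    def store(self, message):
        self.messages[message.message_id] = message

    def delete(self, message_id):
        # "Deleting" only flips the flag; the record itself stays put.
        self.messages[message_id].visible = False

    def search(self, term):
        term = term.lower()
        return [
            m.message_id
            for m in self.messages.values()
            if m.visible and term in m.text.lower().split()
        ]


archive = Archive()
archive.store(Message(1, "fail whale again"))
archive.store(Message(2, "whale watching"))
archive.delete(1)
results = archive.search("whale")
```

After the delete, message 1 is still stored but no longer shows up in search results.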
> dynamic subscription changes, access restriction
Handled completely by the message queues.
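One way to read this, sketched as a toy model rather than real RabbitMQ calls: a follow is a queue binding, an unfollow is an unbind, and access restriction is a check made when the binding is requested. The `Exchange` class, the `approved` set and the user names are all hypothetical, and real broker permissions are configured differently.

```python
class Exchange:
    """Toy per-author fanout exchange with a bind-time access check."""

    def __init__(self, author, protected=False):
        self.author = author
        self.protected = protected
        self.approved = set()  # followers the author has approved
        self.bound = {}        # follower -> list acting as that follower's queue

    def bind(self, follower, queue):
        # Access restriction: protected authors only accept approved followers.
        if self.protected and follower not in self.approved:
            raise PermissionError(f"{follower} may not follow {self.author}")
        self.bound[follower] = queue

    def unbind(self, follower):
        self.bound.pop(follower, None)

    def publish(self, message):
        for queue in self.bound.values():
            queue.append(message)


bob_timeline = []

alice = Exchange("alice")
alice.bind("bob", bob_timeline)   # bob follows alice
alice.publish("first")
alice.unbind("bob")               # bob unfollows alice
alice.publish("second")           # no longer delivered to bob

carol = Exchange("carol", protected=True)
try:
    carol.bind("bob", bob_timeline)
    denied = False
except PermissionError:
    denied = True
```

Because delivery is driven entirely by the current bindings, a subscription change takes effect on the very next publish with no separate fan-out table to update.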
> open APIs with quotas, abuse detection mechanisms, spam fighting algorithms, different authentication protocols, etc.
None of these is particularly difficult once you have the general framework set up, and the structure of the system actually makes it pretty easy to wire in things like real-time abuse and spam prevention once you get rolling. There is only one hard thing about building a better Twitter: getting the userbase to make it worthwhile. If Twitter were in a different market where the network effect was not so strongly self-reinforcing, it would have been cloned and re-implemented better back when the fail whale was our constant companion. That is not the case, so making a "better" Twitter is of little value.