Sortability was a design requirement, but it isn't a reason to avoid UUIDs, which are trivially sorted by date (in fact, Cassandra not only does this, but in many contexts actually renders UUIDs as timestamps). So, while that rules out various other implementations, the issue with UUID was just the size.
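To make the "trivially sorted by date" point concrete: a v1 UUID embeds a 60-bit timestamp (100-ns intervals since the Gregorian epoch), and Python's uuid module exposes it directly, so you can sort v1 UUIDs chronologically even though their string form doesn't sort lexicographically. A minimal sketch:

```python
import uuid

def uuid1_timestamp(u: uuid.UUID) -> int:
    """Extract the 60-bit timestamp field from a version-1 UUID."""
    assert u.version == 1
    return u.time  # Python reassembles the split timestamp fields for you

ids = [uuid.uuid1() for _ in range(5)]
chronological = sorted(ids, key=uuid1_timestamp)
```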
What they ended up with was pretty much UUID but for a more limited deployment: the real issue is just that UUID is extreme overkill, designed to allow every individual user to generate unique identifiers (something that sounds cool, but in practice is worthless as you can't trust the clients).
Correct. However, that is both the generally fair assumption when dealing with use cases that actually care about UUIDs, and is the fair thing to compare Snowflake against, as Snowflake is pretty much a 64-bit limited-entropy implementation of the general concept of a v1 UUID.
I was under the impression that the reason they went with a home-baked id generator was to be able to encode the logical shard id into the id. I think UUID v1 encodes info about the hardware (the MAC address), but their data changes physical hardware as they split it up onto new shards.
Where the data ends up is not important to its identifier: the reason to have a "logical shard ID" is to provide for sequence and timestamp uniqueness.
The way to think about the problem is that at the granularity of your timestamp you lose the ability to uniquely generate identifiers across multiple nodes (on a single node, this is handled with the sequence number), so you need some kind of identifier for the instance of the generator itself. With UUIDs, you spend a relatively immense number of bits storing the computer's MAC address, but if you know you only have a smaller-scale deployment and are able/willing to centrally plan identifiers for the deployed nodes, you can get by with something akin to this "shard ID".
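A minimal sketch of the timestamp/node/sequence idea (the bit widths, epoch, and class name here are illustrative assumptions, not anyone's exact constants):

```python
import time

# Illustrative 64-bit layout (assumed, not Twitter's or Instagram's exact split):
TIMESTAMP_BITS = 41  # milliseconds since a custom epoch
NODE_BITS = 13       # centrally assigned node/"shard" ID
SEQUENCE_BITS = 10   # per-millisecond counter on a single node

CUSTOM_EPOCH_MS = 1_300_000_000_000  # arbitrary example epoch

class IdGenerator:
    def __init__(self, node_id: int):
        assert 0 <= node_id < (1 << NODE_BITS)
        self.node_id = node_id
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now_ms = int(time.time() * 1000) - CUSTOM_EPOCH_MS
        if now_ms == self.last_ms:
            # Same millisecond: the sequence number keeps this node unique.
            self.sequence = (self.sequence + 1) % (1 << SEQUENCE_BITS)
            if self.sequence == 0:  # sequence exhausted; spin to the next ms
                while now_ms <= self.last_ms:
                    now_ms = int(time.time() * 1000) - CUSTOM_EPOCH_MS
        else:
            self.sequence = 0
        self.last_ms = now_ms
        # Across nodes, uniqueness within a millisecond comes from node_id.
        return (now_ms << (NODE_BITS + SEQUENCE_BITS)) \
               | (self.node_id << SEQUENCE_BITS) \
               | self.sequence

gen = IdGenerator(node_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # IDs from one node sort by creation order
```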
(BTW, note that a v1 UUID need not be generated with a true MAC: the specification notes both that you can just buy node addresses for a fairly small cost, and that you can use fake addresses that are marked as such by setting the multicast bit. The result is that if you want to use UUID v1 off-the-shelf with "shard IDs", you are more than welcome to do so, not just in the "analogously-equivalent" sense but in the "to the letter of the spec" sense. You can find more information in Section 4.5 of the UUID specification.)
Actually, they also noted that it makes mapping easier. By including the logical shard ID in the ID, they don't need to keep a giant index of IDs-to-shards to figure out which machine an ID lives on: just a tiny mapping of logical to physical shards, which every app server instance can cache in memory.
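A sketch of that routing (bit widths and the logical-to-physical table are assumptions for illustration): the shard ID is recovered from the ID itself by bit-shifting, and the only shared state is a tiny in-memory dict.

```python
# Assumed layout: [timestamp | shard | sequence] in one 64-bit integer.
SHARD_BITS = 13
SEQUENCE_BITS = 10

# Hypothetical logical->physical table; this is all an app server
# needs to cache, and it's what changes when shards move hardware.
LOGICAL_TO_PHYSICAL = {0: "db-a", 1: "db-a", 2: "db-b", 3: "db-c"}

def shard_of(id64: int) -> int:
    """Pull the logical shard ID back out of an ID."""
    return (id64 >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)

def host_for(id64: int) -> str:
    """Route an ID to its physical machine with no global ID index."""
    return LOGICAL_TO_PHYSICAL[shard_of(id64)]

some_id = (12345 << (SHARD_BITS + SEQUENCE_BITS)) | (2 << SEQUENCE_BITS) | 7
assert host_for(some_id) == "db-b"
```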
One less moving part at the expense of wedding yourself to a prepartitioned logical shard scheme, I guess. (I wonder how painful it would be to rebucket data into a different logical shard should the need arise...)
The Instagram guys said at SFPUG that they didn't actually benchmark using a UUID + created_at column; they just used their approach because it was recommended to them and seemed to work. That said, it is a pretty awesome, robust mechanism; it's just arguably complex and not proven to be necessary even at their scale.
http://instagram-engineering.tumblr.com/post/10853187575/sha...