Incidentally, if any hackers are looking for jobs working on an interesting problem, I know the Rethinks are hiring. It would be the perfect job for a lot of hackers. You'd get to solve big problems starting with a blank slate, and you'd get to work with smart, totally pragmatic people (Slava and Mike). Plus they now have enough to actually pay you.
Not to mention they are very transparent with their compensation and stock options (Scroll down - http://rethinkdb.com/jobs/). People who run their hiring process and business like that seem like very good people to work for indeed.
I would jump on this immediately if I were in the Bay Area and didn't already have my own startup. As it is, I'll have to satisfy my (data)base urges by writing my own high-performance SSD-optimized key-value store.
Tarsnap splits data into blocks of ~64 kB which are individually compressed and encrypted before being uploaded. I use S3 for back-end storage, but need to keep track of where I put each of the blocks.
It's not that the metadata is demanding; it's just that there's a lot of it. For each $/month Tarsnap takes in, I have about 200,000 table entries.
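To make the scale concrete, here's a rough sketch (Python, with made-up names; this is not Tarsnap's actual schema) of the kind of per-block metadata a store like this has to track:

    # Hypothetical per-block metadata for a Tarsnap-style store: each ~64 kB
    # block is compressed, encrypted, uploaded to S3, and the store has to
    # remember where it went. Field names are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class BlockRecord:
        block_id: bytes   # e.g. a hash identifying the block
        s3_key: str       # object key the encrypted block was stored under
        size: int         # compressed + encrypted size in bytes
        refcount: int     # how many archives still reference this block

    # Backing up 1 GB in ~64 kB blocks already means roughly 16,000 of these
    # records, so the workload is lots of tiny rows, not big values.
    metadata: dict[bytes, BlockRecord] = {}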
"There’s obviously risk involved with trying to redefine how people structure their databases"
TechCrunch misses the point that Rethink is explicitly not doing this. A MySQL storage engine plugs in below the SQL parsing layer, so existing MySQL apps should be able to run against it as-is.
I was looking at this the other day. How do they address (what I assume are) the larger space requirements of append-only databases, given the lower capacities of SSDs? Am I wrong about the space requirements, or is Moore's law going to fix it, or is there some kind of background compaction?
Is there really a need for segment cleaning as Rosenblum lays it out?
They clean segments to reduce fragmentation and allow for decent-sized extents to write new data into. Since your writes don't need to be long and contiguous, do you really need to empty the live data out of segments?
You DO need to identify which snapshots are stale, and consequently mark certain blocks as free. But I see no need for compaction.
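To illustrate the distinction (a toy sketch only, not anything RethinkDB actually does): with simple per-block reference counting you can free the blocks a stale snapshot was pinning without ever copying live data, i.e. without LFS-style segment cleaning.

    # Toy reference-counting sketch: dropping a stale snapshot decrements the
    # refcount on each block it pinned; blocks that hit zero go straight onto
    # a free list. No live data is relocated, so there's no compaction step.
    from collections import defaultdict

    refcount = defaultdict(int)   # block address -> number of snapshots using it
    free_list = []                # block addresses available for reuse

    def drop_snapshot(pinned_blocks):
        for addr in pinned_blocks:
            refcount[addr] -= 1
            if refcount[addr] == 0:
                free_list.append(addr)   # reusable in place, never moved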
Isolation levels are really a poor design decision, because they imply the use of locks. Serializable is great, but impossible to implement efficiently. Repeatable read, read committed, and read uncommitted can be implemented efficiently, but allow for various unpleasant isolation artifacts.
The one we're implementing is really the one everyone wants - snapshot isolation. It can be implemented very efficiently, and is stronger than repeatable read, read committed, and read uncommitted (so you should never want these three). It's not as strong as serializable, but nobody can give you a scalable serializable isolation level.
Snapshot isolation also guarantees consistency, but requires all transactions to be idempotent (so they can be rerun in case of a conflict). It's the best of both worlds; in practice, most other databases already behave this way anyway.
So for inserts, do you need some kind of uniqueness constraint that makes sure a repeated insert is rejected? (The first example that came to mind when I read "idempotent" was a simple insert, which isn't idempotent, as far as I can tell.)
[sorry if this seems like an interrogation - it's just interesting stuff you're doing...]
Essentially, it means that any transaction might potentially be rolled back and rerun. This isn't a problem for SQL, but suppose I select some stuff, get back into the host programming language, fire off some rockets into space from the Kennedy Space Center, and then insert some data about the launch into the database. This is a big problem, because if the insertion fails due to potential conflicts, the whole thing needs to be rolled back (including the rocket launch) and rerun. A lot of software is written to account for this (i.e. don't perform any external state modification you can't roll back until you've confirmed the transaction has committed), but a lot of software isn't. To really have great isolation and performance, you need to write software this way. For people who don't, we'll support the serializable level, but there are very strong limitations on how efficient that can be.
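A rough sketch of that pattern, in Python with a made-up client API (db.transaction() and ConflictError here are placeholders, not anything Rethink actually exposes): keep the retried body free of external side effects, and only do the irreversible work once the commit has been confirmed.

    class ConflictError(Exception):
        """Raised by the (hypothetical) client when a conflict forces a rerun."""

    def run_idempotent(db, body, max_retries=5):
        """Run body(txn), retrying on conflict. The body must only touch the
        database, so rerunning it is safe."""
        for _ in range(max_retries):
            try:
                with db.transaction() as txn:
                    result = body(txn)
                return result            # commit confirmed
            except ConflictError:
                continue                 # rerun the pure-database body
        raise RuntimeError("transaction kept conflicting")

    # Usage: the rocket launch stays *outside* the retried body.
    # run_idempotent(db, lambda txn: txn.insert("launch_prep", ...))
    # launch_rocket()   # irreversible work only after the commit succeeded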
This is a very interesting project. All current databases are optimized for conventional hard drives (and the hard drive is the slowest part of the PC). But as solid-state drives keep improving, we'll have faster and faster storage, so the databases that take advantage of these new drives will lead the way in database design. It's the right time to invest in this technology.
But really, rotating disks are not that bad for most database use cases. B-trees, the usual on-disk database structure, are designed to keep similar data on the same disk page, which means that if you request row 42, row 43 will be in memory by the time you need it. So the slowness of the disk is abstracted away; iterate over your data in index order, and it's always fast.
Hash tables have a theoretical advantage over balanced trees, and an SSD would make a naive hash table implementation more practical. But if you are smart (like, say, BerkeleyDB), hash tables and balanced trees have almost the same real-world performance.
RethinkDB might be better for write-heavy operations, but that's because SSDs are better for random writes.
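Back-of-the-envelope illustration of the locality point (assumed page and row sizes, not measurements of any particular engine): scanning in index order touches each leaf page once, while cold random lookups can touch a page per row.

    # Toy page-read count for 1,000,000 rows: index-order scan vs. cold-cache
    # random lookups. Page/row sizes are assumptions for illustration only.
    PAGE_SIZE = 8192                          # bytes per leaf page (assumed)
    ROW_SIZE = 100                            # bytes per row (assumed)
    ROWS = 1_000_000
    rows_per_page = PAGE_SIZE // ROW_SIZE     # ~81 neighbouring rows per page

    sequential_page_reads = -(-ROWS // rows_per_page)   # each page read once: ~12,346
    random_page_reads = ROWS                            # worst case: one page per lookup

    print(sequential_page_reads, random_page_reads)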
if you request row 42, row 43 will be in memory by the time you need it
True, but this is rarely the case for OLTP workloads. What happens when there's a credit card transaction for user ID 100731, followed by a credit card transaction for user ID 8762592? Even for range queries, what you're saying is true only if you're walking through the primary index. The second you start walking through the secondary indices, you're back in random-read land (my Facebook friends, for example, are extremely unlikely to be stored sequentially in the user table).
SSDs are better for random writes
Random writes are very tricky on SSDs because of the slow erase operation. The FTL controllers are getting much better at this in micro-benchmarks, but it's very difficult to measure the random-write performance profile over different timelines and different disk-space utilization scenarios.
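A deliberately simplified model of why utilization matters so much here (illustrative numbers, nothing like what a real controller does): if the erase block the FTL reclaims still holds mostly valid pages, those pages have to be copied out first, and write amplification blows up.

    # Simplified write-amplification model: reclaiming an erase block in which
    # a fraction `valid` of the pages is still live means copying those pages
    # elsewhere before erasing. Real FTLs are far more sophisticated.
    PAGES_PER_ERASE_BLOCK = 256   # assumed geometry

    def write_amplification(valid):
        freed = PAGES_PER_ERASE_BLOCK * (1 - valid)    # room gained for new host writes
        copied = PAGES_PER_ERASE_BLOCK * valid         # live pages relocated first
        return (freed + copied) / freed                # = 1 / (1 - valid)

    for v in (0.5, 0.8, 0.9, 0.95):
        print(f"{v:.0%} of victim block still valid -> WA ~{write_amplification(v):.0f}x")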
I'm not convinced. Modern OSes use locking extensively and do perfectly fine on multicore hardware (FreeBSD's pgsql performance scaled linearly up to 16 cores, at least the last time I saw graphs).
Obviously you need to be smart about how you do your locking (no giant lock!) but the mere fact of having locking is not automatically a problem.
I misspoke; I should have said 'many-core'.
Yes, you're probably right that no respectable database is going to have a problem with lock contention on 16 cores. But AMD released 12-core processors this week. It's likely we'll see the average DB server with 48 cores sometime in the next year or two, and who knows after that.
I quoted 16 cores because that's the biggest hardware the FreeBSD project had available when those benchmarks were being run -- I suspect that it scales linearly quite a bit further than that.
I read over their page more carefully, and I think I see what they mean by "no locks". It means that the database stays internally consistent regardless of read or write order. When you start a transaction, you see the data as it was in the log when you started, but you don't see any changes made after that point. Fine.
You can get more isolation than this, and you need it to really keep your data consistent, but all DBs except Berkeley seem to have this off by default. So I am not too bothered by this, but I would be interested in seeing how well Rethink handles concurrent OLTP applications that actually care about data integrity. Caring about data integrity is slow, and Rethink might not speed it up all that much. Or it might :)
Congratulations guys, you deserve it, and thank you to anybody else out there writing drivers or optimizing software for changes happening in hardware that we all take for granted.