Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

From the response;

"We can't use database transactions because performing a big move would slow Basecamp down for everyone. So we have to log the process of each step of the move, and make it so any failure in the move can be rolled back gracefully. That means a move is actually a series of copies and deletions instead of just changing a field for each moved item"



Why is a transaction necessary? Just change the one field project_id in the table todolists to point to the new project. Because the messages point at the todolist_id instead of at the project_id you don't need to modify anything else.


You're imagining files belong to comments, comments belong to items, items belong to lists, lists belong to projects, and changing the list's foreign key re-parents the whole hierarchy.

Now consider that Basecamp allows comments on messages, todo items, and milestones. What does your schema look like now? Add a feature to show a user all recent comments across all of his projects. What does your query look like? How does it perform? Get all of this (not to mention the other features) working at scale.

Maybe you can build something like Basecamp that works at Basecamp's scale without resorting to denormalization, sharding, and/or partitioning. But I doubt it.


I don't see how this is a problem. The todolist messages still point to a todolist instead of at a project. Can you elaborate?


Looking at just comments on items in todo lists: comment's have an item_id, items have a list_id, and lists have a project_id. You want to get all comments across all projects a user is a member of, so you end with a small handful of joins. Remember that these are reasonable large tables. When you consider the fact that comments can appear on nearly anything, you end up with more joins. Are you certain this will stand up under Basecamp's load even before you consider all of the other features you need? This is all before we run into sharding or partitioning as a scaling strategy.


No, but what is the fast schema that you have in mind that makes it hard to move messages? Perhaps fast and easily movable messages are not exclusive.


I missed this reply, but in case you come back to see it: the common approach to this issue is denormalization. That is, instead of needing to join comments to todo items to todo lists to get the project_id, you duplicate the project_id on comments and you can just get all the relevant comments with very few joins. Obviously, this makes moves need to hit the comments table.

That's a small piece of the picture. With partitioning, a move could entail removing data from one table and inserting it into another. With sharding, a move could mean removing the data from one database and inserting it into another, which probably means manually ensuring consistency because your typical transaction won't span databases.


What's the point of having ACID when you don't use its advantages, one might wonder.


(in reply to all threads stemming from your comment when I posted this)

What's all this talk about "re-implementing" and "if" they have transactions in their DB? They explicitly state that they can't use transactions because it would slow things down, not because they don't have them. They also state that Since moving one milestone could potentially result in hundreds of database operations...

This means they have a transaction supporting database. There's no re-implementing onto a different system. There's an apparently-broken implementation.


So your solution is to reimplement Basecamp on top of another database?


From a couple of web searches it sounds like they're on MySQL. Targeting a database that has working transactions has to be less work than implementing transactions yourself from scratch, which it seems everyone stuck on MySQL eventually has to do (that's what they're describing, and I know we've done similar things).


Depends what MySQL table types they are using. I think MyISAM has no transactional ability.


I'd argue any database on modern hardware where you can't do a single transaction on a few hundred rows without noticeably affecting performance does not have "working transactions".


Rails has always been very MySQL-centric, probably because 37S & DHH use it for Basecamp. Perhaps now that PostgreSQL v9.0 has built in master/slave replication they will finally make the switch.


Duh


Well you're ignoring possible disadvantages, like locking an inordinate number of rows in several tables while a long-running copy takes place. Transactions are all well and good, until they run counter to the usability and stability of your app.


How amusing. Absolutely not an issue in any grown-up database that implements row versioning/MVCC a la Postgres or Oracle. The sad thing is he (and most MySQL users) probably believe this is inherent to all RDBMSs.


I mean this as a genuine question since I don't use it, but: MySQL's engine locks the whole table when a transaction touches it?


The default MyISAM engine does. The behavior of InnoDB is similar to that of SQL Server: a row lock is a data structure held in memory, seperate from the row data. It's computationally expensive to lock this way. However SQL Server escalates row locks into page locks if it thinks it will help (e.g. "lock this row, and the next one, and the next one..." is translated on the fly to "lock this entire page" where a page is an on-disk allocation containing many rows) so now we have a few page locks to manage rather than many row locks. MySQL can't so it struggles when you need to lock many rows at once. Oracle manages this by keeping a row's lock status in-line (e.g. in the block buffer cache) so there is no additional overhead per-lock.

I see this all the time; developers who have "grown up" in an environment where locks and cursors are expensive pick up some odd habits that don't translate well when they code in an environment (such as Oracle) where locks are cheap and cursors are free.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: