Short on details, and I thought the speculation about moving the user info to a NoSQL solution sounded naive, but it's interesting to get a peek into the internal operations at a large enterprise. I think I want to go validate some of our database backups right now.
Very interesting, I wonder how many people read about that issue and then went and checked their backups. Heck, I actually verified the backups on my home network after reading about this yesterday.
I agree. Also his comment about "over engineering" making the system "more brittle" was odd.
For data that important, I would have mirrored the databases to "warm standby" servers. They could have been back up in minutes with no data loss. Sure, it would have doubled the cost, but how much money did they lose during the outage?
Otherwise you'd know that they had a fault that propagated to the hot spare. It's also utterly daft to think that a financial enterprise as large as JPM/Chase wouldn't already be running a HA setup. In this case it appears to be Oracle RAC.
I'm astounded how often I have to remind people that replication and backups are very different things, and that you need both.
I'm also depressed by how many utterly thoughtless comments are made here on Hacker News lately.
No, Oracle RAC shares the same storage between two or more nodes.
What they had here would appear to be database A running on storage A which is replicated at the storage level to storage B where database B waits in an idle state. Because the replication system is "blind" - it only sees its own filesystem containing bytes, not Oracle data structures - it can't tell a good Oracle block from a bad one and copies it.
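To make the "blind" replication point concrete, here is a minimal toy sketch in Python. It is not how any SAN or Oracle actually lays out blocks: the 4-byte CRC32 header and the function names (make_block, blind_replicate, aware_replicate) are assumptions made up for illustration. The only point is that a byte-level copier happily ships a corrupt block, while a copier that understands the block format can refuse it.

```python
import zlib
from typing import List


def make_block(payload: bytes) -> bytes:
    """Build a toy block: a 4-byte CRC32 header followed by the payload.
    (Hypothetical format, not a real Oracle block layout.)"""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def block_is_valid(block: bytes) -> bool:
    """Recompute the payload CRC32 and compare it with the stored header."""
    stored = int.from_bytes(block[:4], "big")
    return zlib.crc32(block[4:]) == stored


def blind_replicate(primary: List[bytes]) -> List[bytes]:
    """Storage-style replication: copy bytes without interpreting them."""
    return [bytes(block) for block in primary]


def aware_replicate(primary: List[bytes], standby: List[bytes]) -> List[int]:
    """Format-aware replication: overwrite the standby only with valid blocks.
    Returns the indexes of the blocks that were rejected as corrupt."""
    rejected = []
    for i, block in enumerate(primary):
        if block_is_valid(block):
            standby[i] = block
        else:
            rejected.append(i)
    return rejected


def demo():
    good = [make_block(b"row-%d" % i) for i in range(3)]
    primary, standby = list(good), list(good)

    # Simulate corruption on the primary: flip the last payload byte of block 1.
    damaged = bytearray(primary[1])
    damaged[-1] ^= 0xFF
    primary[1] = bytes(damaged)

    blind_copy = blind_replicate(primary)          # corruption travels along
    rejected = aware_replicate(primary, standby)   # corruption is stopped
    return blind_copy, standby, rejected
```

In demo(), the blind copy ends up with the same bad block as the primary, while the format-aware standby keeps its last good version of block 1 and reports index 1 as rejected.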
I do this sort of setup for a living and you would be amazed at how many "architects" there are around who have completely drunk the storage vendor kool-aid and don't really understand how anything works (not even storage...).
I did read the article that was referenced. I did not read the article that that article referenced. My point was about the comment on "over engineering". This problem was not caused by over engineering.
The way I read it, the problem was corruption inside the database, and the warm backup was corrupted during the automatic mirroring before they noticed the problem. So by the time the issue was identified, both the PROD and failover instances were busted. To resolve it, it looks like they had to roll back to the last valid full DB backup from Sunday and then apply the log backups iteratively from Sunday onward to catch up the DB before bringing it back online.
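The "restore the full backup, then apply the log backups in order" step can be sketched roughly like this in Python. It is a toy model only: the record shape (sequence number, key, new value) and the function name restore_and_replay are assumptions for illustration, not anything taken from Oracle or from the article.

```python
from typing import Dict, List, Tuple

# Hypothetical log record: (sequence number, key, new value)
LogRecord = Tuple[int, str, int]


def restore_and_replay(
    full_backup: Dict[str, int],
    backup_seq: int,
    log_backups: List[List[LogRecord]],
) -> Dict[str, int]:
    """Start from the full backup and apply each log backup in order.

    backup_seq is the last sequence number already contained in the full
    backup. Records at or below it are skipped, and a gap in the sequence
    aborts the recovery instead of silently producing a wrong database.
    """
    db = dict(full_backup)
    last_applied = backup_seq
    for log in log_backups:
        for seq, key, value in log:
            if seq <= last_applied:
                continue  # already reflected in the restored state
            if seq != last_applied + 1:
                raise ValueError(
                    f"gap in log chain: expected {last_applied + 1}, got {seq}"
                )
            db[key] = value
            last_applied = seq
    return db


sunday_backup = {"a": 100, "b": 50}
logs = [
    [(10, "a", 100), (11, "a", 90)],  # overlaps the backup at seq 10
    [(12, "b", 60), (13, "c", 5)],
]
recovered = restore_and_replay(sunday_backup, backup_seq=10, log_backups=logs)
```

Here recovered ends up as {"a": 90, "b": 60, "c": 5}. The catch-up time grows with the amount of log to replay, which is part of why recovering a large database this way takes hours rather than minutes.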
At my shop we had a similar issue (but at the SAN level, not the DB level) where the corruption issue was data that exposed a bug in the system. The data was automatically mirrored to the warm standby machine. When PROD crashed, the standby was brought up and immediately crashed also. We had to rebuild from tape backups which was stupid-slow (trademarked term there ;-). All in all it was a horrible mess that was root-caused to a bug in vendor firmware. Eerily similar to the JPMorgan Chase issue in the OP.
I'm guessing they were using the storage to do the replication, rather than DataGuard to replicate and RMAN to make the initial copy. RMAN checksums the blocks on the way, so it'll tell you off the bat if you have any block-level corruption; there's no way for the storage to do this because it can't tell a valid Oracle block from any other sort of block. Because DataGuard is Oracle-aware, you always have a valid standby: if the primary datafiles are corrupt, you can still ship the redo logs (which you will be multiplexing too).
I'll also hazard that they did it this way because some "enterprise architects" designed the system - no Oracle DBA would have done it like that for precisely those reasons.
NoSQL absolutely would not help in this case. If you are trading on the web you need the clickstream for the regulators, just like a bank tapes every phone conversation.
If you're keeping ALL of your user profile data in ACID-compliant databases you're probably doing it wrong.
Large modern websites store tons of information about a user which may not in any way be necessary to even keep for anything other than data mining, or perhaps preferences, click/hit tracking, etc. I can't see how such data is important in any way in regards to finances or trades, so why it couldn't be done on a much-less-resource-intense database solution I don't understand.
Moreover, the cascading effect of a database failure is made much worse by putting all your eggies in one basket and depending on this one cluster of databases to keep the whole ship afloat. In a good design scenario, much of the site should still keep operating even if the backend databases are timing out from load. For example, your cache layer (if not expired) should continue serving cached content/logins/etc. This may not be as useful for clients that sign in randomly or throughout the day, but for people who use the site frequently or stay logged in throughout the day their sessions should stay active in this scenario.
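As a rough illustration of the cache point, here is a minimal Python sketch of a session cache that keeps answering from unexpired entries while the backend is timing out. The class name SessionCache, the BackendTimeout exception, and the loader callback are all hypothetical; a real site would more likely sit behind something like memcached than an in-process dict.

```python
import time
from typing import Callable, Dict, Tuple


class BackendTimeout(Exception):
    """Raised by the (hypothetical) loader when the database does not answer."""


class SessionCache:
    def __init__(
        self,
        loader: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            # Unexpired hit: served without touching the backend at all,
            # so it keeps working even while the database is down.
            return entry[1]
        # Miss or expired entry: the backend is needed, and a timeout
        # propagates to the caller as BackendTimeout.
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

Users whose sessions are already cached keep working until the TTL runs out; anyone who needs a fresh lookup during the outage still gets the timeout.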
The content in the user profile which doesn't require ACID compliance could also be using caching and nosql/mysql/etc which would keep the apps working even longer in the event of an outage of a particular piece of technology. Because this technology doesn't require some of the more complicated requirements of Oracle RAC it may also be easier to recover/restore old data, again assuming this doesn't have a particular need for ACID.
For such a shop as Chase it sounds kind of simplistic and cheap (though I don't think they bought it cheap :) - only an 8-machine cluster, only one standby, no flashback ...
Perhaps you don't understand how RAC works. A RAC cluster is cache-coherent with a shared disk system, in this case an EMC SAN. It's designed to be both scalable and fault tolerant. The replication would have been handled by the SAN itself, at the block level. There would be two completely independent (edit:DISK) cabinets that would replicate synchronously. Some software assumes synchronous replication and it's cheaper to just spend a ton of money on an expensive replicating SAN and Oracle RAC than it is to rebuild the software, so an async replication scenario is out of the question.
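For anyone unclear on why software that assumes synchronous replication can't simply be switched to async, here is a toy Python sketch of the difference. The Cabinet, SyncMirror, and AsyncMirror classes are made up for illustration and say nothing about how an EMC SAN is actually implemented.

```python
from typing import Dict, List, Tuple


class Cabinet:
    """A toy disk cabinet: a map from block address to data."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}

    def write(self, addr: int, data: bytes) -> None:
        self.blocks[addr] = data


class SyncMirror:
    """Acknowledge a write only after both cabinets hold the data."""

    def __init__(self, primary: Cabinet, secondary: Cabinet) -> None:
        self.primary, self.secondary = primary, secondary

    def write(self, addr: int, data: bytes) -> bool:
        self.primary.write(addr, data)
        self.secondary.write(addr, data)
        return True  # ack: the write survives the loss of either cabinet


class AsyncMirror:
    """Acknowledge after the primary write; ship to the secondary later."""

    def __init__(self, primary: Cabinet, secondary: Cabinet) -> None:
        self.primary, self.secondary = primary, secondary
        self.pending: List[Tuple[int, bytes]] = []

    def write(self, addr: int, data: bytes) -> bool:
        self.primary.write(addr, data)
        self.pending.append((addr, data))
        return True  # ack: the secondary may not have the write yet

    def drain(self) -> None:
        """Apply queued writes to the secondary in their original order."""
        for addr, data in self.pending:
            self.secondary.write(addr, data)
        self.pending.clear()
```

With AsyncMirror, a write that has been acknowledged can still be missing from the secondary until drain() runs, so an application that treats the ack as "safe in both places" would lose data on failover.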
No, no, no. The standby is not open for queries in that scenario. How can it be? It's playing no role in this setup, all the work is being done on the storage, it's not even aware of it until you try to activate it and it takes ownership of the controlfile.
So do those 8 machines and the code on it represent the system before or after the Wamu merger, or something in between? I've heard many of these banks such as Citigroup have something like 13 different databases or systems, many of which duplicate functionality.
Having worked at an enterprise software company and with several big clients (including banks), I find it surprising (and shocking to some extent) that JPMC didn't have a more efficient disaster recovery process in place.
I am not saying they didn't have one, just that disaster recovery planning should factor in outages like this one. Hypothetical fire drills etc. are needed at critical businesses like banks.
My guess is that a bunch of people @ jpmc will most likely be losing their jobs over this.
I read it differently. These things happen, you can get data corruption replicated to the hot spare, i.e. failure more catastrophic than this setup is able to handle.
They were able to identify the problem and successfully recover from backup and successfully replay missing transactions in a reasonable amount of time for the setup this large. In my book it's a success.
With the same experience you have, I am not shocked at all. People design and implement "processes" to "prevent" production issues from happening, but they are mostly feel-good sounding things on top of "let's cross our fingers and hope nothing bad happens".
This usually works, which is why people think it's an acceptable policy. But real planning involves things like software correctness, proper test procedures, ways of making a test environment that's exactly identical to production, and so on. This is hard (and slows down development... "tests, what a waste of time!"), so people instead say, "let's try really hard to not fuck something up".