I've been doing database, and specifically PostgreSQL, administration and HA setups for a long time now. This stuff is a lot harder than people think it is. People who roll their own solutions, thinking "Oh, this will totes be good enough!" tend to find themselves very painfully surprised that it isn't.
I've seen multiple deployments (not necessarily specific to PostgreSQL) engineer themselves into a corner with what people believe is a highly available, roll-your-own solution, complete with convincing-sounding blog posts.
In every case, at some point, there were implementation or software-bug-related issues that ultimately caused more unplanned outages than I've ever seen a single, well-run server experience.
Based on your experience, is there a common bug or scenario that you see overlooked often? Say, what happens during the transition between leaders, or how multiple failures (multiple netsplits...) are handled?
I can't really identify a common problem. Things I've seen include:
* After a complete, planned shutdown, neither server is happy to start until it sees the other one online. In the end, neither ends up booting.
* A failover occurs, at which point you find out the hard way that there is state being stored in a non-replicated file. I've seen this with several different Asterisk HA solutions in particular.
* A failover occurs, and non-database-aware storage snapshots leave the redundant server with a non-mountable mirror of the database (there's a sketch below of what doing this in a database-aware way can look like).
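For what it's worth, here's a minimal sketch of what "database-aware" can mean in that last case. It assumes PostgreSQL 15+ (the pg_backup_start/pg_backup_stop functions; older releases spell these pg_start_backup/pg_stop_backup), psycopg2 as the driver, and a hypothetical take_volume_snapshot() standing in for whatever your actual storage tooling is. It's not a complete backup procedure, just the bracketing that keeps the copied files recoverable.

    import subprocess

    import psycopg2  # assumption: psycopg2 is the driver in use


    def take_volume_snapshot():
        # Hypothetical placeholder: swap in your real storage tooling here
        # (lvcreate --snapshot, a SAN API call, etc.).
        subprocess.run(["echo", "snapshot taken"], check=True)


    def database_aware_snapshot(dsn, label="ha-snapshot"):
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        try:
            with conn.cursor() as cur:
                # Tell PostgreSQL a base backup is starting; this forces a
                # checkpoint so the files captured by the snapshot can be
                # brought to a consistent state during recovery.
                cur.execute("SELECT pg_backup_start(%s, fast => true)", (label,))

                take_volume_snapshot()

                # End the backup; this returns the stop LSN plus the
                # backup_label / tablespace_map contents that must be kept
                # alongside the snapshot for recovery to work.
                cur.execute("SELECT * FROM pg_backup_stop(wait_for_archive => true)")
                lsn, backup_label, tablespace_map = cur.fetchone()
                print("snapshot consistent through LSN", lsn)
        finally:
            conn.close()

The whole thing runs on a single connection on purpose: the backup is tied to the session that called pg_backup_start, so if that session drops mid-snapshot, the backup is aborted rather than silently left half-done.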
I'm just wading into the HA waters with Postgres. I somewhat understand the tradeoffs between simplicity and robustness, but what would be your recommendation on how to proceed for someone who is a newbie?
This is going to sound cynical and self-serving (even though I'm not actually available for hire right now), but find someone who knows what they're doing and buy their time. It probably won't be cheap, but it will almost without doubt be cheaper than what you'll do to yourself if you try to hand-roll database HA.
Isn't the major problem knowing whether the expert actually knows what they're doing, and doesn't just think they do? Any tips on how to separate the true experts from the deluded?
For someone who doesn't already have a background in and depth of understanding of this stuff, I'd first probably look for any relevant blog posts or articles written by the folks with whom you're considering a consulting arrangement. If they do exist, you can do some research on the things they're talking about and hopefully get at least a first-pass approximation of their full of shit factor.
Any consultant worth their day (let alone week) rate should also be able to refer you to previous clients, from whom you can hopefully get some sense of how satisfied people are with the candidate's work.
And, ultimately, there's an intuition factor at work here. In my consistent experience, if a candidate gives you some kind of hinky vibe, don't use them. When folks I've worked with haven't followed that advice, the results have pretty reliably been poor, at best.
Thanks for this. As you rightly point out, setting up these systems can involve knowing all the rare edge cases. I'm sure I could put together a system that functioned - right up until it ran straight into an edge case I hadn't thought of. Getting this right is not easy.
I'll second this. Getting your data store right is way too important to leave to the inexperienced (and I say that as, if I had to judge, the semi-experienced; I've refused gigs centered around HA data stores because I won't put my name on something I'm not 100% sure will work for them without issues).