
For listennotes.com, we did a postgres 9.6 => 11 upgrade (in 2019) and an 11 => 13 upgrade (in 2021): ~0 downtime for read ops, and ~45 seconds of downtime for write ops.

Our database is less than 1TB. One master (for writes + some reads) + multiple slaves (read-only).

Here's what we did -

1, Launched a new read-only db with pg9.6, let's call it DB_A.

2, Stopped all offline tasks, and only maintained a minimal fleet of online servers (e.g., web, api...).

3, Changed all db host entries (master and slave alike) in /etc/hosts on the minimal fleet of online servers (e.g., web, api...) to point at the old read-only db with pg9.6, let's call it DB_B. From this point on, all write ops would fail.

4, Ran pg_upgrade (with --link) on DB_A to upgrade it to pg11, and promoted it to be the master db.

5, Changed /etc/hosts on the minimal fleet of online servers (e.g., web, api...) to use DB_A for all db hosts. By this point, DB_A was the master db, and write ops worked again.

6, Changed /etc/hosts for all other servers and brought back all services.
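Steps 3 and 5 above come down to rewriting the db hostnames in /etc/hosts across the fleet. A minimal sketch of that flip, with made-up hostnames/IPs and a local demo file standing in for /etc/hosts (in production this would edit /etc/hosts itself, via sudo, on every online server):

```shell
# Hostnames and IPs below are illustrative, not the real ones.
HOSTS_FILE="${HOSTS_FILE:-./hosts.demo}"

# Seed a demo file resembling the relevant /etc/hosts lines.
cat > "$HOSTS_FILE" <<EOF
10.0.0.1 db-master.internal
10.0.0.2 db-replica.internal
EOF

# Step 3: point every db hostname at the old read-only replica DB_B
# (10.0.0.2 here). Writes now fail because DB_B is read-only.
sed -i -e 's/^[0-9.]* db-/10.0.0.2 db-/' "$HOSTS_FILE"

# Step 5: after pg_upgrade + promotion, point every db hostname at
# the new master DB_A (10.0.0.3 here). Writes work again.
sed -i -e 's/^[0-9.]* db-/10.0.0.3 db-/' "$HOSTS_FILE"
```

The nice property of the /etc/hosts approach is that the flip is a one-line edit per host, with no application config changes or redeploys.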

Step 4 is the most critical. If it had failed or run too long (e.g., more than 10 minutes), we would have rolled back by changing /etc/hosts on those online servers.
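For reference, step 4 looks roughly like this (paths and version directories are illustrative, not from the original setup; --link hard-links data files into the new cluster instead of copying them, which is why the step can finish quickly even for a database approaching 1TB):

```shell
# Illustrative paths; run on DB_A with both clusters shut down.
# --check does a dry run first, without touching any data.
/usr/lib/postgresql/11/bin/pg_upgrade \
  --old-bindir=/usr/lib/postgresql/9.6/bin \
  --new-bindir=/usr/lib/postgresql/11/bin \
  --old-datadir=/var/lib/postgresql/9.6/main \
  --new-datadir=/var/lib/postgresql/11/main \
  --link --check

# The same command without --check performs the real upgrade;
# afterwards the pg11 cluster is started and serves as the master.
```

Running with --check during rehearsal is what tells you ahead of time whether the real run is likely to blow the 10-minute budget.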

We carefully rehearsed these steps for an entire week and timed each one. By the time we did it in production, we knew how many seconds/minutes each step would take. We also automated as much as possible with bash scripts.
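The timing part of that rehearsal can be as simple as wrapping each step in a small helper (a sketch; the step names and commands below are placeholders, not the real ones):

```shell
# Wrap each switchover step so rehearsal runs produce comparable
# per-step timings and stop on the first failure.
time_step() {
  name="$1"; shift
  start=$(date +%s)
  "$@" || { echo "$name FAILED" >&2; return 1; }
  end=$(date +%s)
  echo "$name took $((end - start))s"
}

time_step "stop-offline-tasks" sleep 1   # placeholder command
time_step "flip-etc-hosts"     sleep 1   # placeholder command
```

Having the same wrapper in rehearsal and production is what makes the rehearsal numbers trustworthy on the real run.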



We did something similar recently, jumping from 10 to 13. We took measurements, did some dry runs, and came up with strategies to ensure our read-only followers would keep working and we'd have minimal downtime for writes.

We missed one or two pieces of reconnecting things afterwards, and some of that seems to be limitations of Heroku Postgres that we couldn’t change. Hopefully those keep improving.





