Hi. I'm one of the database engineers at Heap. This is a good question. There are several reasons why we use EC2. First of all, I will say I love RDS as a product. We actually do use RDS for a number of our services. We use Postgres on EC2 only for our primary data store. As for reasons why we use EC2:
Cost - Our primary data store has >1 Petabyte of raw data stored across dozens of Postgres instances. At that volume, RDS is too expensive for us. The cost of an instance on RDS is more than twice the cost on EC2. For example, an on-demand r4.8xl instance on EC2 costs $2.13 an hour, while an RDS r4.8xl costs $4.80 an hour.
Performance - The only kind of disk available on RDS is EBS. EBS is slow compared to the NVMe the i3s provide. We used to use r3s with EBS and got a major speedup when we switched to i3s. As a side note, the cost of an i3 is also less than the cost of an r3 with an equivalent amount of EBS.
Configuration - By using EC2 we can configure our machines in ways we wouldn't be able to if we used RDS. For example, we run ZFS on our EC2 instances which compresses our data by 2x. By compressing our data, we get a major cost saving and a major performance boost at the same time! There isn't an easy way to compress your data if you use RDS.
Introspection - There are times where we've needed to debug performance problems with Postgres and EXPLAIN ANALYZE won't suffice. A good example is we used flame graphs to see what Postgres was using CPU for. We made a small change that resulted in a 10x improvement to ingestion throughput. If you are curious, I wrote a blog post on this investigation: https://heapanalytics.com/blog/engineering/basic-performance...
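To put the Cost point above in numbers, here is a rough sketch of the arithmetic using the two on-demand prices quoted. The 730 hours per month and the fleet size of 36 instances are my own illustrative assumptions (the post only says "dozens"), and the function name is made up for this example.

```python
# On-demand hourly prices for r4.8xl as quoted in the comment above.
EC2_HOURLY_USD = 2.13
RDS_HOURLY_USD = 4.80

# Assumption: average hours in a month (365 * 24 / 12).
HOURS_PER_MONTH = 730


def monthly_cost(hourly_usd: float, instances: int = 1,
                 hours: int = HOURS_PER_MONTH) -> float:
    """Return the on-demand cost in USD of running `instances` for `hours`."""
    return hourly_usd * instances * hours


# How many times more expensive the RDS instance is per hour.
price_ratio = RDS_HOURLY_USD / EC2_HOURLY_USD

if __name__ == "__main__":
    fleet = 36  # hypothetical fleet size, not a figure from the post
    print(f"RDS / EC2 price ratio: {price_ratio:.2f}x")
    print(f"EC2 fleet per month: ${monthly_cost(EC2_HOURLY_USD, fleet):,.0f}")
    print(f"RDS fleet per month: ${monthly_cost(RDS_HOURLY_USD, fleet):,.0f}")
```

This only compares instance prices; it ignores storage, reservations, and the 2x compression mentioned above, all of which would shift the totals.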
Hi, I'm one of the engineers at Amazon working on EC2.
You can also get bare metal I3 instances by launching the "i3.metal" instance type. You don't need to wait for the Nitro hypervisor; you can go with no hypervisor at all.
When we first launched the i3 instances, i3.metals weren't available. We've been wanting to do experiments with i3.metals, but we've been unable to get confirmation that our reservations will transfer over. Until we know that we'll be able to transfer the reservations, there isn't any reason for us to do experiments.
Since you work at Amazon, do you have a sense of how big of a difference there is in performance between i3 and i3.metal for database workloads like Postgres?
And you get 4 extra cores and 24 gigs of ram "for free" vs i3.16xl (which is what we use). I think we looked into switching but it wasn't clear if the reservations could be switched over.
We are running vanilla ZFS-on-Linux. We don't use snapshots for backups as the Postgres backups are more convenient. Postgres provides point in time recovery, which is useful. There are also tools like wal-e for automatically writing and restoring Postgres backups to S3.
As for stability, there have been two major sources of instability with ZFS:
The first issue was with the default value of arc_shrink_shift. By default, ZFS will evict ~1% of ARC, the in-memory file cache, at a time. Our machines have several hundred gigs of ARC, so ZFS was evicting several gigs of data at a time. This was causing our machines to frequently become unresponsive for several seconds.
The other issue is that, for some reason, ZFS will lock up for long periods of time if we delete several hundred gigs of data. We haven't been able to identify a root cause of the problem. So far we've worked around this problem by adding a sleep in between data deletions.
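The sleep-between-deletions workaround could look roughly like the sketch below. This is not our actual code: the function name, the batch abstraction, and the 5-second default pause are all hypothetical, and `delete_batch` stands in for whatever actually removes the data (dropping a table, removing files, and so on).

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def delete_in_batches(
    batches: Iterable[T],
    delete_batch: Callable[[T], None],
    pause_seconds: float = 5.0,               # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete each batch in turn, pausing between consecutive deletions.

    The pause gives the filesystem time to work through freed space
    before the next large delete is issued. Returns the number of
    batches deleted.
    """
    deleted = 0
    for index, batch in enumerate(batches):
        if index > 0:
            # Sleep only between deletions, not before the first one.
            sleep(pause_seconds)
        delete_batch(batch)
        deleted += 1
    return deleted
```

The `sleep` parameter is injectable only so the pacing can be tested without waiting; in normal use you would leave it as `time.sleep`.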
Other than these problems, ZFS has worked pretty well for us.
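For the first issue above (arc_shrink_shift), the arithmetic is easy to sketch. The shift is a log2 fraction, so each shrink reclaims roughly ARC size / 2^shift. I'm assuming a shift of 7 here (1/128, which matches the "~1%" above), and the 300 GiB ARC size and the alternative shift of 11 are illustrative values, not the settings we run.

```python
GIB = 2 ** 30


def arc_shrink_chunk_bytes(arc_size_bytes: int, shrink_shift: int) -> int:
    """Approximate bytes reclaimed per ARC shrink: arc_size / 2**shrink_shift."""
    return arc_size_bytes >> shrink_shift


if __name__ == "__main__":
    arc_size = 300 * GIB  # hypothetical: "several hundred gigs" of ARC

    # Assumed default shift of 7, i.e. 1/128 of ARC (about 0.8%).
    default_chunk = arc_shrink_chunk_bytes(arc_size, 7)
    print(f"shift=7:  {default_chunk / GIB:.2f} GiB per shrink")

    # A larger shift (hypothetical value) shrinks the chunk 16x.
    tuned_chunk = arc_shrink_chunk_bytes(arc_size, 11)
    print(f"shift=11: {tuned_chunk / GIB:.2f} GiB per shrink")
```

With a cache this large, even a sub-1% fraction comes to a couple of gigs per shrink, which is consistent with the multi-second stalls described above. A larger shift makes each shrink proportionally smaller.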
We store two copies of every piece of data on two different machines. When a single machine goes down, we have code for spinning up a new machine and restoring the data that was on the machine.
Over the course of a month, we usually have about one machine fail.