Hi. I'm one of the database engineers at Heap. This is a good question. There ar...

_msw_ · on Sept 21, 2018

Hi, I'm one of the engineers at Amazon working on EC2.

You can also get bare metal I3 instances by launching the "i3.metal" instance type. You don't need to wait for the Nitro hypervisor, you can go with no hypervisor at all.

malisper · on Sept 22, 2018

When we first launched the i3 instances, i3.metals weren't available. We've been wanting to do experiments with i3.metals, but we've been unable to get confirmation that our reservations will transfer over. Until we know that we'll be able to transfer the reservations, there isn't any reason for us to do experiments.

Since you work at Amazon, do you have a sense of how big of a difference there is in performance between i3 and i3.metal for database workloads like Postgres?

kalmar · on Sept 22, 2018

And you get 4 extra cores and 24 gigs of ram "for free" vs i3.16xl (which is what we use). I think we looked into switching but it wasn't clear if the reservations could be switched over.

mattbillenstein · on Sept 22, 2018

Curious re ZFS - any stability issues? Are you leveraging snapshots for backups? Special configs or vanilla ZFS-on-Linux?

malisper · on Sept 23, 2018

We are running vanilla ZFS-on-Linux. We don't use snapshots for backups as the Postgres backups are more convenient. Postgres provides point in time recovery, which is useful. There are also tools like wal-e for automatically writing and restoring Postgres backups to S3.

As for stability, there have been two major sources of instability with ZFS:

The first issue was with the default value of arc_shrink_shift. By default, ZFS will evict ~1% of ARC, the in-memory file cache, at a time. Our machines have several hundred gigs of ARC, so ZFS was evicting several gigs of data at a time. This was causing our machines to frequently become unresponsive for several seconds.
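To make the arithmetic concrete, here is a rough sketch of how the shift value maps to eviction size. It assumes the eviction step is the ARC size divided by 2^arc_shrink_shift and that the default shift is 7 (1/128, a bit under 1%), which is my understanding of ZFS-on-Linux. The 300 GiB ARC size and the function name are made up for illustration.

```python
GIB = 1024 ** 3

def arc_eviction_bytes(arc_bytes: int, shrink_shift: int) -> int:
    """Bytes evicted per shrink step, assuming step = arc_size / 2**shrink_shift."""
    return arc_bytes >> shrink_shift

if __name__ == "__main__":
    arc = 300 * GIB  # assumed ARC size, not a figure from the thread
    for shift in (7, 9, 11):  # 7 is the assumed default
        evicted = arc_eviction_bytes(arc, shift)
        print(f"shift={shift}: {evicted / GIB:.2f} GiB per step")
```

Raising the shift shrinks each eviction step, which is the direction you would tune it to avoid multi-gigabyte evictions in one go.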

The other issue is that, for some reason, ZFS will lock up for long periods of time if we delete several hundred gigs of data. We haven't been able to identify a root cause of the problem. So far we've worked around this problem by adding a sleep in between data deletions.
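A minimal sketch of what a "sleep between deletions" workaround could look like. This is not our actual code: the function name, the batch threshold, and the pause length are all hypothetical, and the right values would have to be found by experiment.

```python
import os
import time
from typing import Callable, Iterable

def delete_with_pauses(
    paths: Iterable[str],
    max_batch_bytes: int,
    pause_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete files one by one, pausing each time roughly max_batch_bytes
    have been removed. Returns the number of pauses taken."""
    pauses = 0
    batch_bytes = 0
    for path in paths:
        size = os.path.getsize(path)
        os.remove(path)
        batch_bytes += size
        if batch_bytes >= max_batch_bytes:
            # Give the filesystem time to finish freeing space before continuing.
            sleep(pause_seconds)
            pauses += 1
            batch_bytes = 0
    return pauses
```

The `sleep` parameter is only there so the pause can be swapped out in tests. In real use you would call something like `delete_with_pauses(files, 10 * 1024**3, 30.0)`, with both numbers being guesses to tune.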

Other than these problems, ZFS has worked pretty well for us.

mattbillenstein · on Sept 24, 2018

Cool, thanks for the info -- glad to see people pushing these tools much harder than I plan to in the near future ;)

gigatexal · on Sept 22, 2018

I too want to hear about this in production as I’m thinking of moving ours to it as well.

scarface74 · on Sept 22, 2018

Thanks for the explanation.

kakwa_ · on Sept 24, 2018

I was also looking at i3 instances. But the fact that storage is not persistent kind of puts me off.

How do you manage this?

Also, how frequently do i3 instances fail?

malisper · on Sept 24, 2018

We store two copies of every piece of data on two different machines. When a single machine goes down, we have code for spinning up a new machine and restoring the data that was on the machine.

Over the course of a month, we usually have about one machine fail.
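As a rough illustration of the two-copy scheme, here is a sketch of placing each piece of data on two different machines and working out what to re-copy when one of them dies. The round-robin placement, the names `place_shards` and `plan_recovery`, and the shard/machine identifiers are all assumptions made for the example, not a description of our real system.

```python
from typing import Dict, List, Tuple

Placement = Dict[str, Tuple[str, str]]

def place_shards(shards: List[str], machines: List[str]) -> Placement:
    """Assign every shard to two distinct machines (simple round-robin)."""
    if len(machines) < 2:
        raise ValueError("need at least two machines to hold two copies")
    n = len(machines)
    return {
        shard: (machines[i % n], machines[(i + 1) % n])
        for i, shard in enumerate(shards)
    }

def plan_recovery(
    placement: Placement, failed: str, replacement: str
) -> Tuple[Placement, List[Tuple[str, str]]]:
    """Return the updated placement and a list of (shard, surviving_machine)
    copies needed to rebuild the failed machine's data on the replacement."""
    new_placement: Placement = {}
    copies: List[Tuple[str, str]] = []
    for shard, (a, b) in placement.items():
        if failed in (a, b):
            survivor = b if a == failed else a
            copies.append((shard, survivor))
            a = replacement if a == failed else a
            b = replacement if b == failed else b
        new_placement[shard] = (a, b)
    return new_placement, copies
```

With this layout, a single machine failure never loses data, because the other copy of each affected shard lives elsewhere and can be used as the source for the restore.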

sre-devops · on Sept 26, 2018

why not replication?