We are running vanilla ZFS-on-Linux. We don't use snapshots for backups because Postgres backups are more convenient. Postgres provides point-in-time recovery, which is useful. There are also tools like wal-e for automatically writing Postgres backups to S3 and restoring from them.
As for stability, there have been two major sources of instability with ZFS:
The first issue was with the default value of arc_shrink_shift. By default, ZFS will evict ~1% of the ARC, its in-memory file cache, at a time. Our machines have several hundred gigs of ARC, so ZFS was evicting several gigs of data at a time. This was causing our machines to frequently become unresponsive for several seconds.
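To make the arithmetic concrete, here is a rough sketch of how the shrink shift translates into eviction size. It assumes each shrink step reclaims roughly the ARC size divided by 2^arc_shrink_shift, and that the default shift is 7 (1/128, about 0.8%, which is consistent with the ~1% figure above). The 300 GiB ARC size and the alternative shift values are hypothetical, picked only for illustration.

```python
GIB = 1024 ** 3


def arc_shrink_bytes(arc_size_bytes: int, shrink_shift: int) -> int:
    """Approximate bytes reclaimed per shrink step: arc_size / 2**shrink_shift.

    Assumption: the shift acts as a log2 fraction of the ARC size.
    """
    return arc_size_bytes >> shrink_shift


# Hypothetical ARC size standing in for "several hundred gigs".
arc_size = 300 * GIB

# 7 is the assumed default; 9 and 11 are illustrative larger shifts.
for shift in (7, 9, 11):
    step_gib = arc_shrink_bytes(arc_size, shift) / GIB
    print(f"shift={shift}: ~{step_gib:.2f} GiB per shrink step")
```

Under these assumptions a shift of 7 works out to about 2.3 GiB per step on a 300 GiB ARC, which lines up with the "several gigs at a time" described above. A larger shift means smaller, more frequent evictions. On ZFS-on-Linux the tunable is exposed as the zfs_arc_shrink_shift module parameter.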
The other issue is that, for some reason, ZFS will lock up for long periods of time if we delete several hundred gigs of data at once. We haven't been able to identify the root cause. So far we've worked around it by adding a sleep between data deletions.
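A minimal sketch of the "sleep between deletions" idea might look like the following. The function name `delete_throttled`, the one-second default pause, and the per-file granularity are all assumptions for illustration, not the actual deletion job described above.

```python
import os
import time
from typing import Callable, Iterable


def delete_throttled(
    paths: Iterable[str],
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete files one at a time, pausing after each successful deletion.

    Returns the number of files actually removed. Paths that no longer
    exist are skipped without pausing.
    """
    deleted = 0
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            continue
        deleted += 1
        # Give the filesystem time to work through the freed space
        # before the next delete is issued.
        sleep(pause_seconds)
    return deleted
```

The pause length is something to tune by hand: long enough that the machine stays responsive, short enough that the deletion job still finishes in a reasonable time. The `sleep` parameter exists only so the pause can be swapped out in tests.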
Other than these problems, ZFS has worked pretty well for us.