S3QL: an S3 FS with encryption, de-dup, immutable trees and snapshotting (bitbucket.org/nikratio)
205 points by tambourine_man on Sept 1, 2015 | 79 comments



I was recently looking into options for offsite, encrypted (at-rest & in-transit), deduplicated, snapshotted, backups to S3. S3QL was one of the options I investigated, but unfortunately their crypto is MAC-then-encrypt [0], which is no good.

A small sample of the other options considered:

    * EncFS + rsnapshot/rdiff-backup
        * Weak cryptography in EncFS [1]
    * Duplicity
        * Uses forward snapshots, so recovery takes longer with each successive snapshot, unless you upload a "fresh" copy of the data
    * ext4 on LVM on LUKS on a loopback device provided by s3backer [2]
        * OK I'll admit this one is pretty far out there, but it actually works decently well (ignoring the massive FUSE overhead)
        * Downside: a slight hiccup in network connectivity guarantees data corruption
    * Attic
        * Known data corruption issues with large datasets [3]
Does anyone know of anything that can provide the above requirements? At this point it's really more of a thought experiment, as I've decided to go with Tarsnap [4] for my use-case.

[0]: http://www.rath.org/s3ql-docs/impl_details.html

[1]: https://defuse.ca/audits/encfs.htm

[2]: https://github.com/archiecobbs/s3backer

[3]: http://librelist.com/browser/attic/2015/3/31/comparison-of-a...

[4]: https://www.tarsnap.com/


I run a slightly crazy setup for about 200GB of data:

* raid1 on all data. This isn't a backup - it's for end-to-end detection of bit flips in (some of) the storage path (and at rest). This needs a better solution (zfs?)

* zbackup of all selected files to external hard disk, overnight. This handles the de-duplication.

* duplicity of the zbackup dataset to S3, immediately following. This is usually a small upload - zbackup diffs are tiny and it doesn't touch files it doesn't need to.

* Rate limited so full backup is about a week, incrementals usually only a few MB.

It seems to work well so far, and I'm prepared to do a fresh full backup every year, swapping disks periodically. The general idea is to keep data in 3 physical places: local, external and remote. Local can be recovered from external. External can be thrown away and recreated. Remote can be recreated.

Wish there was an all-in-one solution with all the checkboxes checked. It feels silly to have to get there with a bunch of scripts.


RAID1 will not get you any protection from bit flips unless you have some specialized software doing that.

RAID1 will speed up reading by spreading reads across both disks, but data returned by the array comes from one disk or the other, not both. This means there is no opportunity to detect if a bit flip occurred.

Adding a checksumming filesystem that is handling the RAID1 in software would solve this problem. (i.e. If it were hardware RAID as opposed to ZFS's RAID-Z, ZFS could detect a bit-flip on the hardware RAID1, but it can't do anything about it since it is not aware of the two physical disks as a redundant source of data.)


Yeah the RAID1 part is weak sauce. However, my specific setup mitigates these flaws:

* The intention is to detect at-rest bit flips before they progressively pollute everywhere. I don't mind if it's not instantaneously detected on-access. I perform a nightly full scan - so there's up to 24 hours where an at-rest bit-flip may lie undetected, but it won't progress past that.

* I use ECC at all cache hierarchy levels possible, and ECC DDR, with active background scrub. So bit flips here won't occur (with vanishingly small probability).

* I use software-RAID1 only. The path between DDR, thru CPU, and to storage controller is unlikely to have a bit flip. From there onwards, the data is essentially written twice, so there is vanishingly low probability of both having a bit-flip, except for systematic failure for that bit pattern in two attempts to two different disks and two different controllers.

So there's still some places in the stack where errors can be introduced, but the most common areas of fault are either duplicated or covered by detection mechanisms.

My goal is to detect, but not correct, transport and at-rest bit flips. I'll discard everything when an error is detected, and use backups.

For my next setup I'll probably switch to a filesystem with a better end-to-end error detection story, such as ZFS, and ditch RAID altogether.


How do you detect bit flips with software RAID? A scrub can detect differences, but without checksums it's ambiguous which copy is wrong. And in the case of mdadm, which has no concept of a file system, it won't report the affected file. I'm not even sure you get an LBA, but if you do, then you have to go look that up, accounting for offsets, with the chosen file system to get a file.

Conversely, Btrfs scrubs will report a corrupt file path if there's a bit flip in a single copy. If there's another copy available then there's just a kernel note that there was a data csum error and the problem was fixed, no file name path.


The entire point of my system is that I don't need to care which copy is good. If it scrubs and detects any error, I'll mount disks independently as two filesystems, and diff the contents. I'll then restore corrupted data from backup.

Yes, a checksumming filesystem would be better, but at the time of creation (and even now?) none of the filesystem choices were mature, or proven to be sufficiently reliable. Ext4 and software RAID1 are both mature and proven reliable.
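
The "mount both halves and diff the contents" step could be sketched roughly like this. It is only an illustration, assuming the two mirror halves are already mounted read-only at two directories; the function names are made up for the example, and it hashes file contents only (no metadata, no symlinks).

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file contents are compared.
            if os.path.isfile(full) and not os.path.islink(full):
                digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def differing_files(root_a, root_b):
    """Return sorted relative paths that are missing on one side or differ."""
    a, b = tree_digests(root_a), tree_digests(root_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Every path this returns is a candidate for restoring from backup, since without checksums there is no way to tell which copy is the good one.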


> raid1 on all data. This isn't a backup - it's for end-to-end detection of bit flips in (some of) the storage path (and at rest). This needs a better solution (zfs?)

Yeah, btrfs or zfs are the better choice for this than RAID1 in the modern world I think.

ZFS anywhere it runs natively, btrfs on Linux.

(My experience with btrfs is that it’s fine on a server, but on laptops I’ve ended up with unmountable, unfixable filesystems a number of times. To be fair to btrfs I’ve always been able to recover the data though.)


Anecdotal, but I have used btrfs on my laptop since August 2014 and had a good number of hard power-offs unrelated to btrfs, and the filesystem never failed me once.

I did have some problems with space, though, when using Docker intensively: in some situations btrfs thought the free space had run out and I had to manually rebuild the metadata.


The Docker-on-Btrfs problems are/were severe enough that CoreOS now defaults to overlayfs on Ext4. I have a bunch of EC2 servers on CoreOS, and we ended up wiping Docker partitions regularly until that switch...


IIRC docker (or any other virtual machine filesystem) in a single image file and hashing filesystems mix really, really badly.

Docker could use BTRFS / ZFS snapshots instead of using whole filesystem image files though, so this ought to get better.


Of the four methods you list, and their "warts", I would say that duplicity seems the easiest to overcome:

"Uses forward snapshots, so recovery takes longer with each successive snapshot, unless you upload a "fresh" copy of the data"

(unless you have a relatively slow network connection)

duplicity will interoperate with plain old SFTP, so we[1] have supported it since it was released. We have also contributed funds to the development of duplicity. In these 10 or so years that it has been around, we've never seen any complaints from customers or evidence of things breaking - it is quite solid.

That being said, we do really want to support attic since so many people ask for it. It requires a server-side component to be run, and we try to keep our environment squeaky clean and dead simple[2] - but I think we can cxFreeze it and run it as a binary executable on our end...

[1] rsync.net, with HN-readers discount. Just email.

[2] There are no interpreters in the jail that each account is locked in, so we can't run a python component on our end.


Although git-annex is not a backup system, it works for me as one to backup photos, disk images, some documents etc (although I use it with hubiC and self-hosted SFTP storages, not S3). Maybe it could meet your requirements, too.

Files (or chunks, if it's enabled) are gpg-encrypted (to a key or using a symmetric passphrase) and could be stored in nearly anything that could store named blobs. Supports various modes (archival, backup, RAID-like). Performance is fair for me.

Downside is that it's not a "real" filesystem, but a git repo that tracks metadata, plus blob storage. Assistant daemon saves a lot of keystrokes, though. There is also some FUSE-based implementation, but I haven't tried it. Getting started is easy, so I suggest taking a look and toying around for half an hour. Another downside is that it - being a git repo - may have problems with storing other repositories inside. Haven't tried this, though.


Many hundreds of folks are pointing very similar git-annex-based backup schemes at us and we're always happy to have an ultra-clueful customer such as that.

Joey and I put together a special discount rate for git-annex folks that want to point at rsync.net:

http://www.rsync.net/products/git-annex-pricing.html


> offsite, encrypted (at-rest & in-transit), deduplicated, snapshotted, backups to S3

I think restic[1] is going for that. I don't think it's "ready" yet, though.

[1] https://github.com/restic/restic


Gave it a quick go - plugged it into an existing Attic backup script so it scanned/excluded the same dirs. Result:

7.9MB Attic etc backup -> 22MB Restic backup.

43GB Attic homes backup -> 71GB Restic backup.


That's dramatic. I'm not sure if restic has compression yet. If not, that could explain it.


Looks like you're right: https://github.com/restic/restic/issues/21

And of course you can't do it after the fact because you can't turn off the crypto.


If you're open to using something proprietary and commercially supported then depending on your scale you should check out http://storreduce.com. We're a growing startup in this space.

We do real-time compression, encryption (on the wire and at rest) and deduplication for object storage. We currently support an s3-compatible (including a full policy engine) api out the front end, and on the backend we can store to anything that exposes an s3 api (S3, Glacier etc.). Because of the s3-compatible interface we work with any existing client tools that work with s3.

We pride ourselves on our speed and scale. We can do 600mb/s sustained throughput and easily scale to multi-petabyte datasets. We typically see 95%-97% dedupe ratios on backup data. We support high availability clustering and replication (for example, replicate between regions for DR).

We don't currently support snapshotting but it's something we can implement relatively easily if people need it.

Our deployment model is based on a virtual appliance and can be deployed in the cloud or on premise. We can also do things like an on-premise writer (that only uploads unique, deduped data over the network), and a reader in the cloud to support cloud workloads or DR.

We have a real focus on backup to cloud in addition to supporting real time big-data use cases in the cloud.

Disclaimer: I work here. If you would like to contact me, please feel free: tpower@storreduce.com.


Sorry to say, but if the source is not available for a crypto tool, it's probably not worth looking at.

Best of luck though.


Perhaps they can move the encryption into an extra open source module in the future.


Encryption should happen client-side [1]; an S3-to-S3 gateway wouldn't help unless you deploy one on all client nodes. s3cmd and duplicity have support via GPG, but not all S3 clients will know what to do with those files.

[1] http://www.skylable.com/blog/2014/09/transparency-reports-se...

Disclaimer: I'm co-founder of Skylable


Our server software is typically deployed as close to the source data as possible. This lets us move only deduplicated data over the WAN. It also supports our encryption model, where we encrypt using a pluggable key management service (e.g. Amazon KMS or an on-premise HSM) before any data leaves the customer site. This is essentially the same model as traditional tape and disk backup software within the datacentre.

Additionally we use HTTPS between clients and our server, and our server and the storage provider (e.g. S3), as well as being able to enable server-side encryption for S3.


Try Borg, a fork of Attic with more active development: https://borgbackup.github.io/borgweb/

Also check out Atticmatic, a wrapper script for both Attic and Borg that makes them much nicer to configure and use: https://torsion.org/atticmatic/


Camlistore can use S3 as a backend and does deduplication. Not sure about the encryption though.

[0]: https://camlistore.org/


It's supported encryption since last year or thereabouts. I really like the look of camlistore and have done a bit of playing around with it. Was considering deploying it a while back but the docs are [still] lacking and, as far as I can tell, it doesn't seem to have gotten to a production-ready state yet.

Here's a gist showing how to configure an encrypted store on s3: https://gist.github.com/cknave/3ddae29cc466663cb40e


The attic corruption in question is apparently caused by a bug in the version of msgpack the reporter was using, and can be mitigated by updating dependencies - see https://github.com/jborg/attic/issues/264


I'm glad this post has made it to the top of HN. I've also been looking around at these sorts of solutions over the last couple of weeks. I hadn't rolled anything out yet but s3ql was the leader for me at the moment. I don't know so much about the encryption side but to be honest it's lower down my list of priorities.

Another solution I've seen people using is attic[0] as a centralised backup for multiple machines and then running s3ql to push that over to s3.

[0] https://attic-backup.org


>MAC-then-encrypt [0], which is no good.

He must have read Practical Cryptography from Ferguson and Schneier. In any case, it's not a big issue. Encrypt-then-MAC is easier to get wrong.


No, MAC-then-encrypt is very, very hard to get right, because you have to avoid creating side-channels (including timing side-channels) between the decryption operation, the padding check (if you're using a block cipher mode that requires padding), and the MAC check.

More to the point, if you're fielding a new MAC-then-encrypt design, then I don't trust your crypto, because your crypto knowledge is over a decade old. Krawczyk's paper was published in 2001: https://eprint.iacr.org/2001/045
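
For anyone following along, the ordering being argued about can be sketched as below. This is a hedged illustration of encrypt-then-MAC in general, not S3QL's or anyone else's actual code. The keystream function is a toy stand-in built from SHA-256 so the example runs with the standard library only; a real design would use a vetted cipher such as AES-CTR, or simply an AEAD mode. All function names here are invented for the example.

```python
import hashlib
import hmac
import os


def _toy_keystream(key, nonce, length):
    """Toy SHA-256 counter keystream. A placeholder for a real cipher such as
    AES-CTR; do not use this for real data."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data, keystream):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(data, keystream))


def seal(enc_key, mac_key, plaintext):
    """Encrypt first, then MAC the nonce and ciphertext (encrypt-then-MAC)."""
    nonce = os.urandom(16)
    ciphertext = _xor(plaintext, _toy_keystream(enc_key, nonce, len(plaintext)))
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def open_sealed(enc_key, mac_key, blob):
    """Check the MAC in constant time; decrypt only if it verifies."""
    if len(blob) < 16 + 32:
        raise ValueError("message too short")
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        # A forged or corrupted message is rejected before any decryption runs.
        raise ValueError("authentication failed")
    return _xor(ciphertext, _toy_keystream(enc_key, nonce, len(ciphertext)))
```

The point of the ordering is visible in `open_sealed`: the decryption step never touches attacker-supplied ciphertext unless the MAC has already verified, so there is no padding check or decryption behaviour to leak through a side channel.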


There are other cryptographers that tend to agree that encrypt-then-mac is optimal.[0]

[0]http://www.thoughtcrime.org/blog/the-cryptographic-doom-prin...


I don't know about the constructions and attack scenarios pertaining to this app, but MAC-then-encrypt is generally insecure; it's not "easier to get wrong", but is itself a flaw (typically: of exposing the cipher to malicious ciphertext, rather than screening it out with a MAC, which is designed to distinguish between valid and invalid ciphertext).


Rogaway's bringing it back with a vengeance though:

http://web.cs.ucdavis.edu/~rogaway/aez/aez.pdf

With AEZ, the "MAC" is actually just a block of zeros.


Rogaway brought it back with a vengeance back in 2006, with SIV: http://web.cs.ucdavis.edu/~rogaway/papers/keywrap.pdf

Essentially all 'misuse resistant' modes are some form of MtE, since in those modes one necessarily needs to process the entire message before beginning encryption. AEZ is MtE in the loosest sense, basically another way to say it's a two-pass/offline mode.


Is there a distinction to be drawn here between "generic composition" and "whatever the term is for the opposite of generic composition"?


Not sure I understand precisely what you're asking. 'Generic composition' is a tricky term, because it tends to hide the assumptions underlying what we're actually composing together. Bellare and Namprempre's paper, which is the most usually cited in this sort of discussion, takes a probabilistic encryption scheme (read: something that takes a random IV) and a MAC, and results in a probabilistic AE scheme. This is obviously not the only way to model encryption and authentication schemes (SIV above transforms those same primitives into a nonce-based AE scheme instead, or DAE with a fixed nonce), but this tends to be overlooked in GC discussions. In some models, MtE and friends are OK. But the truth is that EtM is safe in the widest range of assumptions, while the other ones are more brittle.


That's not MAC-then-encrypt, that's "ECB mode is pretty okay if your message and a nonce fits in your block" plus "let's make an arbitrary-length block cipher". In any case, the result is AEAD, not MAC-then-encrypt.

(That said, AEZ is really interesting.)


If MAC-then-encrypt is no good, why don't you report it as a bug against S3QL (with trustworthy references of course)? It may even get fixed.


Tarsnap was exactly what I was going to suggest. Exactly what don't you like about it / why do you want to replace it?


I'm curious to see how this compares to tarsnap for encrypted backups. One cool thing that people might not know about tarsnap is that you can generate read-only subkeys of your main key!

https://www.tarsnap.com/man-tarsnap-keymgmt.1.html


I'm also curious. I hadn't seen S3QL before it appeared at the top of HN.

Two things jump out at me from a quick glance at the site:

1. "S3QL splits files in blocks of a configurable size (default: 10 MB)" <-- this is quite a large block size and will result in significantly less deduplication than tarsnap's variable-length average 64kB blocks. (On the other hand, large blocks significantly reduce the amount of work and RAM needed; S3QL's tuning probably makes sense for a "live" filesystem.)

2. "all data can AES encrypted with a 256 bit key" <-- leaving aside the grammatical problem, I can't find anything beyond this about key management. At best this means that anyone who can decrypt data can encrypt it, and vice versa; but the usual rule about crypto documentation applies here: If people don't realize that the details matter, they're probably doing something wrong. (Unfortunately it's 1AM and I'm allergic to snakes, or else I would dig into the code to see exactly how their crypto works.)
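
On point 1, the effect of block size on deduplication can be illustrated with a small sketch. It only models fixed-size blocks and a single in-place change; it says nothing about variable-length chunking, and the function names and sizes are made up for the example rather than taken from either tool.

```python
import hashlib


def block_hashes(data, block_size):
    """Split data into fixed-size blocks and return each block's SHA-256."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def shared_fraction(old, new, block_size):
    """Fraction of the new data's blocks that already exist in the old data."""
    known = set(block_hashes(old, block_size))
    new_blocks = block_hashes(new, block_size)
    if not new_blocks:
        return 0.0
    return sum(h in known for h in new_blocks) / len(new_blocks)


if __name__ == "__main__":
    # 64 KiB of data made of 64 distinct 1 KiB pieces, then one byte changed.
    old = b"".join(hashlib.sha256(bytes([i])).digest() * 32 for i in range(64))
    new = bytes([old[0] ^ 1]) + old[1:]
    for size in (1024, 16384):
        print(size, shared_fraction(old, new, size))
```

With the same one-byte change, the smaller block size leaves a larger share of the data reusable, because only the one small block containing the change has to be stored again.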


Having used both S3QL and Tarsnap: in my experience, S3QL's use of remote mounted filesystems is fundamentally unsuited to unattended backups. On many occasions, S3QL's mounted filesystem would break mid-backup due to network issues, and stay mounted and unable to recover. Then, from that point forward, no backups would ever work again without manual intervention because the mount point was already open.

Additionally, S3QL would periodically issue new releases that didn't support old versions of the file format, or only supported them a set number of releases back. So if you didn't upgrade frequently enough, you'd find yourself with a release that refused to read your existing gigabytes of backup data. And then at that point, you have to do a binary search to find and recompile old releases in the vain attempt to resurrect your data and avoid having to do a full backup from scratch.

Bottom line: Stay far away from S3QL and instead use Attic, Borg, or Tarsnap.

Also, if you select Attic or Borg, check out Atticmatic: https://torsion.org/atticmatic/


Regarding the network issues: please file a bug report so that they can be fixed.


A difference with Tarsnap is that s3ql is free and open-source software (both the client and server), so you can also use it on your own server, your friends' servers, etc.

I used s3ql to back up my stuff on servers where I had shell access and disk space but didn't trust root.

By contrast, from what I understand, tarsnap requires you to use the official server.


The closest thing I've seen to this is Jungle Disk, which is closed source, offers a less fully featured file system, but does have Windows and Mac clients.


There was also ObjectiveFS discussed on HN a few days ago, also closed source: https://news.ycombinator.com/item?id=10117506

Main difference appears to be that it supports concurrent mounts, while S3QL can be mounted by one client at a time.


Surely the "main" difference is that ObjectiveFS is a paid service with prices starting at $7/mount.

S3QL is FOSS.


JungleDisk works, but the clients are buggy PoS that crash/hang whenever they choose. I've also had JungleDisk's server daemon silently quit working & lock up on some servers, which is scary when you go to restore a file and realise no backups have happened in a while. It doesn't engender trust.

That said, I still use it as it's easier & less error-prone than setting up my own backup system with tarsnap.


Seconding this. I used Jungle for a while on my laptop, and it actually runs a local WebDAV server that acts as a caching proxy, which it then mounts as a local volume -- and it's horrible. Slow, unstable and prone to mysterious hangs and errors. This was a couple of years ago, but I wouldn't trust it unless they have completely rewritten it since then.


> This was a couple of years ago, but I wouldn't trust it unless they have completely rewritten it since then.

The software hasn't been updated since May 2011, when they released version 3.16.


The reporting is very poor also - if the backups don't run, you aren't alerted to failure or success, they just don't appear on the emailed reports. Many a server has not been backed up because of this...


Each backup job reports when successfully completed to provide an audit trail for your successful backups. If you are scheduled to run daily backups and you do not see the daily email, the backup did not successfully run. We're open to understanding if there is a better behavior for this process. My worry is that if we email in both cases, people will see the message come through and ignore the details where it shows 'failed'.


Thanks for reaching out Bret. The issue I've found is it's much harder to monitor on an email NOT arriving, than one arriving with specific text. With the latter I can be as lazy as setting up a Gmail filter, with the former... I'd have to build my own reporting system that knows you'll email about 8-9am and if the email doesn't arrive, assume it's failed. That's a lot of overhead.

How best to monitor that something doesn't happen (e.g. a scheduled, remote job) is something I've yet to solve.
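
One common shape for this is a "dead man's switch": record every success, and have a separate check raise the alarm when the last success is too old. The sketch below is only an outline under assumed names (`record_success`, `overdue_jobs`, the job names and time limits are all invented); how successes get recorded (parsing the report email, a ping from the backup host, etc.) is left open.

```python
import time


def record_success(state, job, now=None):
    """Remember when a job last reported success."""
    state[job] = time.time() if now is None else now


def overdue_jobs(state, max_age_seconds, now=None):
    """Return jobs whose last success is older than their allowed age.

    Jobs that have never reported at all are treated as overdue.
    """
    now = time.time() if now is None else now
    return sorted(
        job for job, limit in max_age_seconds.items()
        if job not in state or now - state[job] > limit
    )
```

Run `overdue_jobs` from cron on a different machine than the one being backed up and alert on any non-empty result; that way a silent failure turns into a message that does arrive.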


Not sure why it isn't listed in the overview, but this is based on SQLite (hence the name). Haven't looked into the details yet but it looks pretty fun


I think SQLite is only used for the metadata, which seems like a reasonable choice. The speed of metadata operations won't matter next to the slowness of the read/write operations (slow relative to local SSDs.)


Isn't that kind of limited? How exactly are two different clients going to access the metadata at the same time, if SQLite3 is used?!


I don't think that's the aim of the project. They do state that it's "especially suitable for online backup and archival". We typically have one network location where we store data for backup and archival, so only one machine would be having a connection anyway.


SQLite supports multiple readers alongside a writer when Write-Ahead Logging (WAL) is enabled. This mode is so good, and so transforms the whole operation of SQLite, that it is a shame that 1) it isn't the default and 2) more people don't know about it.


There is also `PRAGMA synchronous = NORMAL` that helps a lot. Your database will still be consistent and cannot become corrupted, although you do have to give up transaction durability after a power loss:

https://www.sqlite.org/pragma.html#pragma_synchronous
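
A minimal sketch of turning both settings on from Python's built-in `sqlite3` module, with a reader that keeps working while a writer has an open transaction. The table and file names are invented for the example and have nothing to do with S3QL's actual schema.

```python
import os
import sqlite3
import tempfile


def open_wal(path):
    """Open a SQLite database in WAL mode with synchronous=NORMAL."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn, mode


def demo():
    """Read from a second connection while the first has uncommitted writes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "meta.db")
        writer, mode = open_wal(path)
        writer.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, name TEXT)")
        writer.execute("INSERT INTO inodes (name) VALUES ('a')")
        writer.commit()

        writer.execute("INSERT INTO inodes (name) VALUES ('b')")  # not committed yet
        reader = sqlite3.connect(path)
        seen = reader.execute("SELECT COUNT(*) FROM inodes").fetchone()[0]
        writer.commit()
        after = reader.execute("SELECT COUNT(*) FROM inodes").fetchone()[0]
        reader.close()
        writer.close()
        return mode, seen, after
```

The reader sees only committed rows while the write is in flight and picks up the new row after the commit. Note that the journal mode is stored in the database file, while `synchronous` has to be set again on each new connection.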


The way s3ql works is that it copies the db from s3 to your local machine. You make changes and then write the new version back to s3 later with the data. So it won't help in this case.


I see. Out of interest how does it write back the new version of the SQLite database file with only the differences? Does S3 support binary patching?


I believe it uploads the whole thing each time. I think it may even upload it with a new index counter in the name to version it (but I can't find that in the docs now).

http://www.rath.org/s3ql-docs/impl_details.html#metadata-sto...


I think you found the Achilles heel. This won't scale, so it is only suitable for small file systems.


Indeed, I'm good at that ;)

It could probably be improved (subject to the nuances of S3 which I'm not fully familiar with). One way to fix it would be to copy the concept of SQLite's WAL mode. Use an appended write operation on S3 (if it supports it) to append to an existing file that contains the transaction log. Then at certain intervals (say every few thousand transactions) one can finally flush that log to be stored in the main database file.

This would substantially reduce the number of times the database would need to be re-uploaded in full.


They won't: https://bitbucket.org/nikratio/s3ql/wiki/FAQ#!can-i-access-a...

"""In principle, both VPN and NFS/CIFS-alike functionality could be integrated into S3QL to allow simultaneous mounting using mount.s3ql directly. However, consensus among the S3QL developers is that this is not worth the increased complexity."""


Sure looks interesting. It seems like a good idea for specific use cases, like for instance backup storage. The SQLite usage for the inode table is a neat hack. Implementation details here: http://www.rath.org/s3ql-docs/impl_details.html


Ah, then I wouldn't expect concurrent connections to be on the menu, for a shared workspace.


If this is interesting, recommend looking at Tahoe-LAFS too: https://www.tahoe-lafs.org/trac/tahoe-lafs

(And https://leastauthority.com/how_it_works )


s3fs also layers a filesystem on top of S3 but preserves the native object format:

https://github.com/s3fs-fuse/s3fs-fuse

This approach allows use of other tools like s3cmd and the Amazon web console but prevents advanced features like deduplication and snapshotting.


This looks cool and I am going to give this a try. The problem for me is, as is usually the case with such projects, the packaging. If this thing is production-ready, then why must I check for installed dependencies by running random commands [1]? If it's a Python project, why isn't it distributed on PyPI? I don't want to download stuff from BitBucket manually and install it by executing setup.py. I understand that the project supports multiple OS's. That's great. But there are simple steps that can be taken to make installing this thing via automated tools (Puppet, Chef, Ansible, etc.) easier than how it's set up now. A Debian package would be so nice for Ubuntu/Debian.

[1] http://www.rath.org/s3ql-docs/installation.html#dependencies


There is plenty of extensive packaging: https://bitbucket.org/nikratio/s3ql/wiki/Installation

The documentation is just somewhat messed up...


Nevermind the things I said above. Thank you for pointing this out. This is exactly what I was looking for.


> Immutable Trees. Directory trees can be made immutable, so that their contents can no longer be changed in any way whatsoever. This can be used to ensure that backups can not be modified after they have been made.

That would be perfect for things we don't want to ever change...like container images. Or configuration files.


What is the benefit of immutable trees over versioned backups - or to put it another way: why wouldn't you keep a change history in your backups?


Backups are for data generated by people using applications running on services provided by infrastructure. What I'm referring to is the configuration portion of launching and running a service, not the data generated after a human uses it.

The thing that pops out of this is TRUST. We need to be able to trust an application we are running is the one we want to run, who was responsible for writing it, who's responsible for running it, and all the bits in-between.


I might be mistaken but I believe that feature refers to the ability to make a bit of your file system read only. It's just there so you don't accidentally wipe out a backup you meant to keep.

You would still do versioned backups and then make one a month immutable. Or whatever your strategy is.


How well would this work for something like pointing OwnCloud storage at? Most of my cloud experience is with an internal vmware cluster, so I'm not sure how to evaluate something like this.


I'm not sure what the point would be - OwnCloud already knows how to work with S3 directly.


Just as an example app.

Basically, use S3 the same way you use an in house SAN.

Also, if I remember correctly, S3 integration is per account, and only if you enable the external file storage app. It wouldn't let you store all data on S3. So, mounting a bucket as the OwnCloud data folder would be the way I'd get that.


So would it be possible to mount this as a Volume in OS X and create (encrypted) Time Machine backups to Amazon S3?



