Zsync: Differential file downloading over HTTP using the rsync algorithm (2010) (moria.org.uk)
153 points by gjvc on June 20, 2021 | 28 comments



I like zsync quite a bit. The chunking patcher we wrote for League of Legends is similar to zsync in some key ways.

One of the things we did to improve the efficiency of the core rolling-hash-based chunking logic was to add file-type-aware chunking. Basically, we teach the patcher about the different archive formats that our games use (say, Unreal Engine PAKs), and the algorithm ensures that chunks stop at the file boundaries within the archive, so a chunk doesn't straddle two unrelated entries. That way, when files move around in the archive, we can more reliably match their chunks across versions. I think this reduced download sizes by 5-10% in our case.
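For illustration, a minimal sketch of that idea (not Riot's actual patcher; the rolling hash, parameters, and names are all made up): cut on a rolling-hash boundary condition as usual, but also force a cut at every known archive entry offset so no chunk straddles two entries.

    def chunk(data, entry_offsets, mask=0x0FFF, min_size=1024):
        """Return (start, end) chunk boundaries for `data`.
        entry_offsets: byte offsets where archive entries begin; a cut is forced there."""
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) & 0xFFFFFFFF) + b        # toy rolling hash, stands in for the real one
            forced = (i + 1) in entry_offsets      # file-type-aware: cut at an entry boundary
            if forced or (i + 1 - start >= min_size and (h & mask) == 0):
                chunks.append((start, i + 1))
                start, h = i + 1, 0
        if start < len(data):
            chunks.append((start, len(data)))
        return chunks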


zsync is great. I'm currently using it to deliver system updates to Raspberry Pis. The system update (and initial install file) is a single install.zip file that gets extracted into the SD boot partition. zsync helps greatly with transfer sizes: a normal OS upgrade is usually 3-5 MB instead of the full 50 MB ZIP download.

Of course, for that you have to prepare how the files are stored within the ZIP: for example, the kernel image is stored, not compressed, as it is already compressed itself. gzip-based files (like initramfs.gz) are created with the --rsyncable flag. And so on for other file types.
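As a rough illustration of that preparation (file names are made up, and whether --rsyncable is available depends on your gzip build), the ZIP could be built along these lines with Python's zipfile, storing already-compressed payloads uncompressed so identical bytes line up across versions:

    import zipfile

    with zipfile.ZipFile("install.zip", "w") as z:
        z.write("kernel8.img", compress_type=zipfile.ZIP_STORED)    # already compressed, store as-is
        z.write("initramfs.gz", compress_type=zipfile.ZIP_STORED)   # produced with gzip --rsyncable
        z.write("config.txt", compress_type=zipfile.ZIP_DEFLATED)   # small text file, deflate is fine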

A really nice feature of zsync is that you can add any number of local data sources when syncing. It then uses all of those to search for matching local data before retrieving it remotely, all without any explicit local state or special server-side preparation. I usually add the previous and current OS install zip, the currently extracted OS files, and previously interrupted transfers. zsync then magically uses all of those to reduce the download size as much as possible. This also means that the OS can initially reconstruct its own full install.zip by using its extracted files and fetching the few remaining data chunks remotely.


Unfortunately zsync is a bit old and needs some updates for modern HTTP: it doesn't support HTTPS and uses a hand-written C HTTP client (we had to patch the header parsing and MIME handling to get it to work with our test server...).

If it swapped to libcurl as a backend, it would be much more flexible and could stay up to date, since the idea is solid. That being said, there are many spiritual successors to look into:

- zchunk

- casync

- desync

- bita


> If it swapped to libcurl as a backend [..]

There's https://github.com/probonopd/zsync-curl. Seems pretty active, so there's hope.


I believe probonopd is maintaining that fork as part of their work on AppImage - so it does have an active user group, which is great.

(The AppImage docs suggest using zsync as a mechanism for making images updateable: https://docs.appimage.org/packaging-guide/optional/updates.h...)


> Unfortunately zsync is a bit old

Yup. The last update to the GitHub repo was in 2015.

zsync is effectively abandonware.


>zsync is open source, distributed under version 2 of the Artistic License.

Artistic License 2.0 [1]:

>The Artistic License is a software license used for certain free and open-source software packages, most notably the standard implementation of the Perl programming language and most CPAN modules, which are dual-licensed under the Artistic License and the GNU General Public License (GPL).

>The Artistic license 2.0 is also notable for its excellent license compatibility with other FOSS licenses due to a relicensing clause, a property other licenses like the GPL are missing.

Hopefully this saves everyone from Googling; at least that was my immediate reaction when I read the page.

[1] https://en.wikipedia.org/wiki/Artistic_License


Sounds a bit similar to lftp[1]. While I don't think it supports partial updates or updates of compressed files like zsync, it doesn't rely on a meta-file and also supports a range of other protocols. You can essentially just point it at a website, directory index, ftp server, torrent etc. and it will let you browse the files, create mirrors or update local files.

[1] https://lftp.yar.ru


lftp is amazing. Less of a client and more of a toolbox, but it can do just about anything.


Notably, Ubuntu offers zsync downloads, which is particularly useful when following daily development images, since only the small deltas relative to earlier images are downloaded.

There's also a version of zsync somewhere that uses libcurl and allows sending custom headers, which makes it possible, for example, to send cookies so zsync can handle images on servers requiring custom authentication.


Casync is a much more efficient protocol due to the rolling hash support:

https://github.com/systemd/casync/


Is it? From what I understand, zsync also uses rolling hashes and retrieves as little data as possible. I explicitly tested that by taking a 1 MB /dev/urandom file, appending and prepending the string 'test' to it, and checking the resulting HTTP requests when syncing that file:

sendto(4, "GET /test HTTP/1.1\r\nUser-Agent: zsync/0.6.2\r\nHost: localhost:8000\r\nReferer: http://localhost:8000/test.zsync\r\nRange: bytes=1048576-1050623\r\nConnection: close\r\n\r\n", 161, 0, NULL, 0)

sendto(4, "GET /test HTTP/1.1\r\nUser-Agent: zsync/0.6.2\r\nHost: localhost:8000\r\nReferer: http://localhost:8000/test.zsync\r\nRange: bytes=0-2047\r\nConnection: close\r\n\r\n", 152, 0, NULL, 0) = 152

In both cases it transferred only the relevant 2048-byte chunk, matching the block size chosen by zsyncmake. Would the casync transfer be smaller than that?


Why am I thinking that an auto-generated Wikipedia .zsync file could be the best way to maintain local copies of that site and others like it?

Server-side maintenance/generation of .zsyncs might be transformative wherever replication is desirable, no?


Wikipedia does support rsync on their mirrors.


Yes, and now imagine you could have the performance benefit of pre-computed hashes for that rsync process... and we'd have zsync. :)


It would be great to have an rsync-like protocol running in the web browser, ideally purely P2P based on WebRTC, with no install needed. I hate all those custom clients for all those cloud storage services. In particular, I was looking for an option for a client with a lot of IT restrictions to upload a large number of large files. Embarrassingly often, we end up shipping encrypted hard drives.


This may be of interest to you: https://rclone.org/

rclone is basically rsync with cloud-native endpoints; it's not browser-based, but it does what needs to be done very well. It does have an experimental web GUI, but that's not a feature I've tried personally. https://github.com/rclone/rclone-webui-react


Just don't use it to keep two systems in sync when you pay for transactions, as it can get very costly.


The changes you would need to make for HTTP to support an rsync-style pull protocol are pretty minimal. Two of them, really.

1) A checksum header in the HEAD response

2) Altering the HEAD method to optionally accept the same range request headers that a GET does, and have the checksum header return the checksum for that range.

I'm mildly surprised it hasn't happened. Pretty sure you could do it with vanilla PHP or the nginx Lua module.
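A minimal sketch of point 2, in Python rather than PHP/Lua and with a made-up X-Range-SHA256 header name, just to show the shape of the idea:

    import hashlib
    import re
    from http.server import BaseHTTPRequestHandler, HTTPServer

    FILE = "payload.bin"  # file being served; name is illustrative

    class RangeChecksumHandler(BaseHTTPRequestHandler):
        def do_HEAD(self):
            with open(FILE, "rb") as f:
                data = f.read()
            start, end = 0, len(data) - 1
            m = re.match(r"bytes=(\d+)-(\d+)", self.headers.get("Range", ""))
            if m:                                        # accept the same Range syntax a GET takes
                start, end = int(m.group(1)), int(m.group(2))
            digest = hashlib.sha256(data[start:end + 1]).hexdigest()
            self.send_response(200)
            self.send_header("X-Range-SHA256", digest)             # hypothetical checksum header
            self.send_header("Content-Length", str(end - start + 1))
            self.end_headers()

    HTTPServer(("", 8000), RangeChecksumHandler).serve_forever()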


You have a smattering of disasters that let you sort of fake it till you make it? :)

For 1)

You have SRI: https://developer.mozilla.org/en-US/docs/Web/Security/Subres... You have ETag, and you have Content-MD5: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec1...

As you already mentioned, ETag sucks, because you just get a guarantee that X hasn't changed but no way to know how it was calculated; i.e. it is opaque.

SRI is only useful for resources outside of the original document (i.e. for <script> tags and the like).

Content-MD5 actually works... except it uses the broken MD5 digest, so it might be good enough, depending on the details of the use case. :)

2 is a no-go as far as I'm aware.

Though you could perhaps do an HTML document where you link to each "chunk" of the file using SRI... though that would be a total bastardization of HTML and a total hack, and a bad one at that :) Also, the chunks would be decided well ahead of time that way, so it's not the same as a Range request...


The first is already there in the form of a strong ETag. HTTP conditional requests with If-Range support ETag validation, but that is a document checksum, not a chunk checksum.


> The first is already there in the form of a strong ETag

Unfortunately, there's no standard for how that's computed, so it wouldn't be useful in an rsync-style setup unless you controlled both ends.


Or store it as opaque metadata on first retrieval, regardless of how it was computed, paired with the actual hash; when one is invalidated, the other is too.



Thanks. Looking up support for this, I found dcache.org, which supports RFC 3230, which allows even more digests.


Never underestimate the bandwidth of a station wagon full of data tapes, barreling down the highway.


I hate cloud clients so much. Every company seems to be using a different one, and I haven't seen a single one with good UX.


Interesting, though it appears work stopped before rsync moved from MD4 to MD5. I suppose that doesn't matter since it's a standalone thing.



