Zsync: Differential file downloading over HTTP using the rsync algorithm (2010) (moria.org.uk)
153 points by gjvc on June 20, 2021 | 28 comments



I like zsync quite a bit. The chunking patcher we wrote for League of Legends is similar to zsync in some key ways.

One of the things we did to improve the efficiency of the core rolling-hash-based chunking logic was to add file-type-aware chunking. Basically, we teach the patcher about the different archive formats that our games use (say, Unreal Engine PAKs), and the algorithm ensures that chunks stop at the file boundaries within the archive, so a chunk doesn't straddle two unrelated entries. That way, when files move around in the archive, we can more reliably match their chunks across versions. I think this reduced download sizes by 5-10% in our case.
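For illustration, a minimal sketch of that idea (not Riot's actual patcher; the rolling hash, parameters, and names are all made up): cut on a rolling-hash boundary condition as usual, but also force a cut at every known archive entry offset so no chunk straddles two entries.

    def chunk(data, entry_offsets, mask=0x0FFF, min_size=1024):
        """Return (start, end) chunk boundaries for `data`.
        entry_offsets: byte offsets where archive entries begin; a cut is forced there."""
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) & 0xFFFFFFFF) + b        # toy rolling hash, stands in for the real one
            forced = (i + 1) in entry_offsets      # file-type-aware: cut at an entry boundary
            if forced or (i + 1 - start >= min_size and (h & mask) == 0):
                chunks.append((start, i + 1))
                start, h = i + 1, 0
        if start < len(data):
            chunks.append((start, len(data)))
        return chunks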


zsync is great. I'm currently using it to deliver system updates to Raspberry Pis. The system update (and initial install file) is a single install.zip file that gets extracted into the SD boot partition. zsync helps greatly with transfer sizes: a normal OS upgrade is usually 3-5 MB instead of the full 50 MB ZIP download.

Of course, for that you have to prepare how the files are stored within the ZIP: for example, the kernel image is stored, not compressed, as it is already compressed itself. gzip-based files (like initramfs.gz) are created with the --rsyncable flag. And so on for other file types.
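As a rough illustration of that preparation (file names are made up, and whether --rsyncable is available depends on your gzip build), the ZIP could be built along these lines with Python's zipfile, storing already-compressed payloads uncompressed so identical bytes line up across versions:

    import zipfile

    with zipfile.ZipFile("install.zip", "w") as z:
        z.write("kernel8.img", compress_type=zipfile.ZIP_STORED)    # already compressed, store as-is
        z.write("initramfs.gz", compress_type=zipfile.ZIP_STORED)   # produced with gzip --rsyncable
        z.write("config.txt", compress_type=zipfile.ZIP_DEFLATED)   # small text file, deflate is fine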

A really nice feature of zsync is that you can add any number of local data sources when syncing. It then uses all of those to search for matching local data before retrieving it remotely, all without any explicit local state or special server-side preparation. I usually add the previous and current OS install zip, the currently extracted OS files, and previously interrupted transfers. zsync then magically uses all of those to reduce the download size as much as possible. This also means that the OS can initially reconstruct its own full install.zip by using its extracted files and fetching the few remaining data chunks remotely.


Unfortunately zsync is a bit old and needs some updates for modern HTTP: it doesn't support HTTPS and uses a hand-written C HTTP client (we had to patch the header parsing and MIME handling to get it to work with our test server...).

If it swapped to libcurl as a backend, it would be much more flexible and could stay up to date, since the idea is solid. That being said, there are many spiritual successors to look into:

- zchunk

- casync

- desync

- bita


> If it swapped to libcurl as a backend [..]

There's https://github.com/probonopd/zsync-curl. Seems pretty active, so there's hope.


I believe probonopd is maintaining that fork as part of their work on AppImage - so it does have an active user group, which is great.

(The AppImage docs suggest using zsync as a mechanism for making images updateable: https://docs.appimage.org/packaging-guide/optional/updates.h...)


> Unfortunately zsync is a bit old

Yup. The last update to the GitHub repo was in 2015.

zsync is effectively abandonware.


>zsync is open source, distributed under version 2 of the Artistic License.

Artistic License 2.0 [1]:

>The Artistic License is a software license used for certain free and open-source software packages, most notably the standard implementation of the Perl programming language and most CPAN modules, which are dual-licensed under the Artistic License and the GNU General Public License (GPL).

>The Artistic license 2.0 is also notable for its excellent license compatibility with other FOSS licenses due to a relicensing clause, a property other licenses like the GPL are missing.

Hopefully this saves everyone from Googling; at least that was my immediate reaction when I read the page.

[1] https://en.wikipedia.org/wiki/Artistic_License


Sounds a bit similar to lftp[1]. While I don't think it supports partial updates or updates of compressed files like zsync, it doesn't rely on a meta-file and also supports a range of other protocols. You can essentially just point it at a website, directory index, ftp server, torrent etc. and it will let you browse the files, create mirrors or update local files.

[1] https://lftp.yar.ru


lftp is amazing. Less of a client and more of a toolbox, but it can do just about anything.


Notably, Ubuntu offers zsync downloads, which is particularly useful when following daily development images, since only the small deltas relative to earlier images are downloaded.

There's also a version of zsync somewhere that uses libcurl and allows sending custom headers, which makes it possible, for example, to send cookies so zsync can handle images on servers requiring custom authentication.


Casync is a much more efficient protocol due to the rolling hash support:

https://github.com/systemd/casync/


Is it? From what I understand, zsync also uses rolling hashes and retrieves as little data as possible. I explicitly tested that by taking a 1 MB /dev/urandom file, appending and prepending the string 'test' to it, and checking the resulting HTTP requests when syncing that file:

sendto(4, "GET /test HTTP/1.1\r\nUser-Agent: zsync/0.6.2\r\nHost: localhost:8000\r\nReferer: http://localhost:8000/test.zsync\r\nRange: bytes=1048576-1050623\r\nConnection: close\r\n\r\n", 161, 0, NULL, 0)

sendto(4, "GET /test HTTP/1.1\r\nUser-Agent: zsync/0.6.2\r\nHost: localhost:8000\r\nReferer: http://localhost:8000/test.zsync\r\nRange: bytes=0-2047\r\nConnection: close\r\n\r\n", 152, 0, NULL, 0) = 152

In both cases it transferred only the relevant 2048-byte chunk, matching the block size chosen by zsyncmake. Would the casync transfer be smaller than that?


Why am I thinking that an auto-generated Wikipedia .zsync file could be the best way to maintain local copies of that site and others like it?

Server-side maintenance/generation of .zsyncs might be transformative wherever replication is desirable, no?


Wikipedia does support rsync on their mirrors.


Yes, and now imagine you could have the performance benefit of pre-computed hashes for that rsync process... and we'd have zsync. :)


It would be great to have an rsync-like protocol running in the web browser, ideally purely P2P based on WebRTC, with no install needed. I hate all those custom clients for all those cloud storage services. In particular, I was looking for an option for a client with a lot of IT restrictions to upload a large number of large files. Embarrassingly often, we end up shipping encrypted hard drives.


This may be of interest to you: https://rclone.org/

rclone is basically rsync with cloud-native endpoints; it's not browser-based, but it does what needs to be done very well. It does have an experimental web GUI, but that's not a feature I've tried personally. https://github.com/rclone/rclone-webui-react


Just don't use it to keep two systems in sync when you pay for transactions, as it can get very costly.


The changes you would need to make for HTTP to support an rsync-style pull protocol are pretty minimal. Two of them, really.

1) A checksum header in the HEAD response

2) Altering the HEAD method to optionally accept the same range request headers that a GET does, and have the checksum header return the checksum for that range.

I'm mildly surprised it hasn't happened. Pretty sure you could do it with vanilla PHP or the nginx Lua module.
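A minimal sketch of point 2, in Python rather than PHP/Lua and with a made-up X-Range-SHA256 header name, just to show the shape of the idea:

    import hashlib
    import re
    from http.server import BaseHTTPRequestHandler, HTTPServer

    FILE = "payload.bin"  # file being served; name is illustrative

    class RangeChecksumHandler(BaseHTTPRequestHandler):
        def do_HEAD(self):
            with open(FILE, "rb") as f:
                data = f.read()
            start, end = 0, len(data) - 1
            m = re.match(r"bytes=(\d+)-(\d+)", self.headers.get("Range", ""))
            if m:                                        # accept the same Range syntax a GET takes
                start, end = int(m.group(1)), int(m.group(2))
            digest = hashlib.sha256(data[start:end + 1]).hexdigest()
            self.send_response(200)
            self.send_header("X-Range-SHA256", digest)             # hypothetical checksum header
            self.send_header("Content-Length", str(end - start + 1))
            self.end_headers()

    HTTPServer(("", 8000), RangeChecksumHandler).serve_forever()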


You have a smattering of disasters that let you sort of fake it till you make it? :)

For 1)

You have SRI: https://developer.mozilla.org/en-US/docs/Web/Security/Subres... You have ETag, and you have Content-MD5: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec1...

As you already mentioned, ETag sucks, because you just get a guarantee that X hasn't changed but no way to know how it was calculated; i.e. it is opaque.

SRI is only useful for resources outside of the original document (i.e. for <script> tags and the like).

Content-MD5 actually works... except it uses the broken MD5 digest, so it might be good enough, depending on the details of the use case. :)

2 is a no-go as far as I'm aware.

Though you could perhaps do an HTML document where you link to each "chunk" of the file using SRI... though that would be a total bastardization of HTML and a total hack, and a bad one at that :) Also, the chunks would be decided well ahead of time that way, so it's not the same as a Range request...


The first is already there in the form of a strong ETag. HTTP conditional requests with If-Range support ETag validation, but that is a document checksum, not a chunk checksum.


> The first is already there in the form of a strong ETag

Unfortunately, there's no standard for how that's computed, so it wouldn't be useful in an rsync-style setup unless you controlled both ends.


Or store it as opaque metadata on first retrieval, regardless of how it was computed, paired with the actual hash; when one is invalidated, the other is too.



Thanks. Looking up support for this, I found dcache.org, which supports RFC 3230, which allows even more digests.


Never underestimate the bandwidth of a station wagon full of data tapes, barreling down the highway.


I hate cloud clients so much. Every company seems to be using a different one, and I haven't seen a single one with good UX.


Interesting, though it appears work stopped before rsync moved from MD4 to MD5. I suppose that doesn't matter since it's a standalone thing.



