I like zsync quite a bit. The chunking patcher we wrote for League of Legends is similar to zsync in some key ways.
One of the things we did to improve the efficiency of the core rolling hash based chunking logic was to add file type aware chunking. Basically, we teach the patcher about the different archive formats that our games use (say, Unreal Engine PAKs) and the algorithm ensures that chunks stop at the file boundaries within the archive, so a chunk doesn’t straddle two unrelated entries. That way when files move around in the archive we can more reliably match their chunks across versions. I think this helped improve download sizes by 5-10% in our case.
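To make the boundary idea concrete, here is a minimal sketch of content-defined chunking with forced cuts at archive entry ends. It is not the actual patcher: the gear-style hash, the `mask` and `min_size` values, and the names `chunk_ends` and `chunk_digests` are all assumptions made for illustration, and the entry offsets are passed in by hand rather than parsed from a real archive format.

```python
import hashlib
import random
from typing import Iterable, List, Set

# Fixed pseudo-random table for a gear-style rolling hash (illustrative choice).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk_ends(data: bytes, entry_ends: Iterable[int],
               mask: int = 0x3FF, min_size: int = 64) -> List[int]:
    """Return exclusive end offsets of chunks in `data`.

    A chunk ends where the rolling hash hits the mask (content-defined cut)
    or at any offset in `entry_ends` (forced cut at an archive entry end).
    """
    forced = set(entry_ends)
    ends: List[int] = []
    h = 0
    start = 0
    for i, byte in enumerate(data):
        # Shifting left drops old bytes out of the 32-bit state over time.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        end = i + 1
        hash_cut = end - start >= min_size and (h & mask) == 0
        if hash_cut or end in forced:
            ends.append(end)
            start, h = end, 0  # restart the hash so the next chunk is independent
    if start < len(data):
        ends.append(len(data))
    return ends


def chunk_digests(data: bytes, entry_ends: Iterable[int]) -> Set[str]:
    """SHA-256 digest of every chunk, as a set for cross-version comparison."""
    digests: Set[str] = set()
    start = 0
    for end in chunk_ends(data, entry_ends):
        digests.add(hashlib.sha256(data[start:end]).hexdigest())
        start = end
    return digests
```

Because the hash restarts at every entry end, the chunks of an entry depend only on that entry's bytes, so reordering entries inside the archive leaves the set of chunk digests unchanged in this sketch.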
zsync is great. I'm currently using it to deliver system updates to Raspberry Pis. The system update (and initial install file) is a single install.zip file that gets extracted into the SD boot partition. zsync helps greatly with regards to transfer sizes as a normal OS upgrade is usually between 3-5MB instead of the full 50MB download for the full ZIP.
Of course, for that you have to specially prepare how the files are stored within the ZIP. For example, the kernel image is stored rather than compressed, as it is itself already compressed. gzip-based files (like initramfs.gz) are created with the rsyncable flag. And so on for other file types.
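A rough sketch of the "stored, not compressed" part using Python's standard `zipfile` module. The function name `build_install_zip`, the file names, and the fixed timestamp are assumptions for illustration; the rsyncable gzip step is not covered here, since that is a flag of the gzip tool rather than something this module provides.

```python
import io
import zipfile
from typing import Dict, Set

# Fixed timestamp so that rebuilding the same inputs yields identical bytes
# (an assumption of this sketch; it keeps unrelated ZIP headers from changing).
FIXED_TIME = (1980, 1, 1, 0, 0, 0)


def build_install_zip(files: Dict[str, bytes], stored: Set[str]) -> bytes:
    """Build a ZIP in memory; names in `stored` are written uncompressed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):  # stable entry order across builds
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            method = zipfile.ZIP_STORED if name in stored else zipfile.ZIP_DEFLATED
            zf.writestr(info, files[name], compress_type=method)
    return buf.getvalue()
```

With a stored entry, the payload bytes appear verbatim inside the archive, so a block-matching tool can find them in a previous archive or in the extracted file on disk.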
A really nice feature of zsync is that you can add any number of local data sources when syncing. It then uses all those to search for matching local data before retrieving it remotely. All without any explicit local state or special server side preparations. I usually add the previous and current OS install zip, the currently extracted OS files and previous interrupted transfers. zsync then magically uses all those to reduce download size as much as possible. This also means that the OS can initially reconstruct its own full install.zip file by using its extracted files and fetching the few remaining data chunks remotely.
Unfortunately zsync is a bit old and needs some updates for modern HTTP: it doesn't support HTTPS and uses a hand-written C HTTP client. We had to patch the header parsing and MIME handling to make it compatible with our test server.
If it swapped to libcurl as a backend, it would be much more flexible and stay up-to-date, since the idea is solid. That being said there are many spiritual successors to look into:
>zsync is open source, distributed under version 2 of the Artistic License.
Artistic License 2.0 [1],
>The Artistic License is a software license used for certain free and open-source software packages, most notably the standard implementation of the Perl programming language and most CPAN modules, which are dual-licensed under the Artistic License and the GNU General Public License (GPL).
>The Artistic license 2.0 is also notable for its excellent license compatibility with other FOSS licenses due to a relicensing clause, a property other licenses like the GPL are missing.
Hopefully saves everyone from Googling. At least that was my immediate reaction when I read the page.
Sounds a bit similar to lftp[1]. While I don't think it supports partial updates or updates of compressed files like zsync, it doesn't rely on a meta-file and also supports a range of other protocols. You can essentially just point it at a website, directory index, ftp server, torrent etc. and it will let you browse the files, create mirrors or update local files.
Notably Ubuntu offers zsync downloads which is particularly useful when following daily development images, since only small deltas that changed from earlier images are downloaded.
There’s also a version of zsync somewhere that uses libcurl and allows sending custom headers which allows for example sending cookies to enable zsync to handle images on servers requiring custom authentication.
Is it? From what I understand zsync also uses rolling hashes and retrieves as little data as possible. I just explicitly tested that by having a 1MB /dev/urandom file and appending and prefixing the string 'test' to it and checking the resulting HTTP requests when syncing that file:
In both cases it transferred the only relevant 2048 byte chunk resulting from the blocksize setting of zsyncmake. Would the casync transfer be smaller than that?
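For anyone who wants to reproduce the reasoning without a server, here is a small sketch of rsync-style block matching: checksum the target in fixed-size blocks, slide a rolling checksum over the local file, and report which target blocks are not found locally. This is not zsync's code; the function names, the 16-bit weak checksum, the use of SHA-256 as the strong hash, and the zero padding of the last block are assumptions of the sketch.

```python
import hashlib
from typing import Dict, List, Tuple


def weak_sum(block: bytes) -> Tuple[int, int]:
    """rsync-style weak checksum: (plain sum, position-weighted sum), 16 bits each."""
    n = len(block)
    a = sum(block) & 0xFFFF
    b = sum((n - i) * x for i, x in enumerate(block)) & 0xFFFF
    return a, b


def missing_blocks(local: bytes, target: bytes, blocksize: int = 2048) -> List[int]:
    """Indices of target blocks that do not occur anywhere in the local data."""
    target += bytes((-len(target)) % blocksize)  # zero-pad the last block
    nblocks = len(target) // blocksize
    table: Dict[Tuple[int, int], List[int]] = {}
    strong: List[bytes] = []
    for k in range(nblocks):
        blk = target[k * blocksize:(k + 1) * blocksize]
        table.setdefault(weak_sum(blk), []).append(k)
        strong.append(hashlib.sha256(blk).digest())

    local += bytes(blocksize)  # padding lets a short tail block still match
    found = set()
    a, b = weak_sum(local[:blocksize])
    pos, last = 0, len(local) - blocksize
    while True:
        candidates = table.get((a, b))
        if candidates:
            # Weak hit: confirm with the strong hash before trusting it.
            digest = hashlib.sha256(local[pos:pos + blocksize]).digest()
            found.update(k for k in candidates if strong[k] == digest)
        if pos == last:
            break
        out, inc = local[pos], local[pos + blocksize]
        a = (a - out + inc) & 0xFFFF           # slide the window by one byte
        b = (b - blocksize * out + a) & 0xFFFF
        pos += 1
    return [k for k in range(nblocks) if k not in found]
```

Calling `missing_blocks(old, b"test" + old)` or `missing_blocks(old, old + b"test")` on random data should report a single block to fetch, because the rolling window finds every other target block at a shifted offset in the local file.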
Would be great to have an rsync-like protocol running in the web browser, ideally based on WebRTC and purely P2P. No install needed. I hate all those custom clients for all those cloud storage services. In particular, I was looking for an option for a client with a lot of IT restrictions to upload a large number of large files. Embarrassingly often we end up shipping encrypted hard drives.
rclone is basically rsync with cloud native endpoints, it's not browser based but does what needs to be done very well. It does have an experimental web GUI but it's not a feature I've tried personally. https://github.com/rclone/rclone-webui-react
The changes you would need to make for HTTP to support a rsync style pull protocol are pretty minimal. Two of them, really.
1) A checksum header in the HEAD response
2) Altering the HEAD method to optionally accept the same range request headers that a GET does, and have the checksum header return the checksum for that range.
I'm mildly surprised it hasn't happened. Pretty sure you could do it with vanilla php or the nginx lua module.
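A sketch of what the server side of those two changes might compute, as a plain function rather than a real server. Nothing here is an existing HTTP feature: the header name `X-Range-SHA256`, the function `head_headers`, and the choice of SHA-256 are made up for illustration, and only the simple `bytes=start-end` range form is handled.

```python
import hashlib
import re
from typing import Dict, Optional

# Only the single "bytes=start-end" form is accepted in this sketch.
_RANGE = re.compile(r"^bytes=(\d+)-(\d+)$")


def head_headers(body: bytes, range_header: Optional[str] = None) -> Dict[str, str]:
    """Headers a hypothetical HEAD handler could return for `body`.

    Without a Range header the checksum covers the whole body; with one,
    it covers only the requested byte range (end offset inclusive).
    """
    start, end = 0, len(body) - 1
    if range_header is not None:
        match = _RANGE.match(range_header)
        if match is None:
            raise ValueError("unsupported Range header")
        start = int(match.group(1))
        end = min(int(match.group(2)), len(body) - 1)
        if start > end:
            raise ValueError("unsatisfiable range")
    part = body[start:end + 1]
    headers = {"X-Range-SHA256": hashlib.sha256(part).hexdigest()}  # hypothetical header
    if range_header is not None:
        headers["Content-Range"] = f"bytes {start}-{end}/{len(body)}"
    return headers
```

A client could then compare the returned checksum against its local copy of the same byte range and issue a GET only for ranges that differ.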
Like you already mentioned, ETag sucks, because you just get a guarantee that X hasn't changed but no way to know how it was calculated, i.e. it is opaque.
SRI is only useful for everything outside of the original document (i.e. for <script> tags and the like).
Content-MD5 actually works.. except it's using the broken MD5 digest.. so it might work for the use case, depending on the details of the use-case. :)
2 is a no-go as far as I'm aware.
Though you could perhaps do an HTML document where you link to each "chunk" of the file using SRI.. though that would be a total bastardization of HTML and a total hack.. and a bad one at that :) Also the chunks would be decided well ahead of time that way, so not the same as a RANGE..
The first is already there in the form of a strong ETag. HTTP conditional requests with If-Range support ETag validation, but that is a document checksum, not a chunk checksum.
Or store it as opaque metadata on first retrieval, regardless of how it was computed, paired with the actual hash. When one is invalidated, the other is too.