Bit-sync: synchronizing arbitrary data using rsync algorithm in pure JavaScript (github.com/claytongulick)
105 points by ingve on May 18, 2021 | 17 comments



"The md5 algorithm needs to be rewritten with performance in mind"

Poked around a little, and this area seems a bit under-researched since MD5 went out of favor for many use cases. For example, the SubtleCrypto digest doesn't support it.

There's this WASM implementation[1] that shows 2-37 MB/sec, presumably on fairly modern hardware. But I can also find an old pure JavaScript implementation[2] that shows 5-7 MB/sec on an ancient Pentium 4 running Opera. Makes you wonder how it would fare with a recent Chrome/V8 and hardware. There are online tests, but with small inputs, and Chrome is fast enough that I get a lot of NaN/Infinity.

Maybe it would be better to use SHA-1 or some other digest supported by the built-in browser SubtleCrypto package. Wouldn't technically be rsync, but this package seems to be intended for use on both sides.

[1] https://www.npmjs.com/package/md5-wasm

[2] http://www.myersdaily.org/joseph/javascript/md5-speed-test.h...


Hi, author here. It's funny you mention this, I was just revisiting this lib and thinking that SubtleCrypto would be a better option. Also interesting idea to use the md5-wasm, thanks for the link.

It's been years since I made this, probably due for a little love.


The rsync algorithm is pretty cool. I implemented it in C# a couple of years ago: https://github.com/zmj/rsync-delta
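For readers who haven't seen it, the heart of the algorithm is a weak rolling checksum that can slide the window one byte in O(1) instead of re-summing. A rough JavaScript sketch of the checksum from the rsync paper (names are illustrative, not taken from either implementation):

```javascript
// rsync's weak Adler-style checksum over bytes[start, end).
const M = 1 << 16;

function weakChecksum(bytes, start, end) {
  let a = 0, b = 0;
  for (let i = start; i < end; i++) {
    a = (a + bytes[i]) % M; // plain byte sum
    b = (b + a) % M;        // position-weighted sum
  }
  return { a, b, sum: a + b * M };
}

// Slide the window one byte to the right in O(1):
// drop outByte, add inByte.
function roll(prev, outByte, inByte, blockSize) {
  const a = (prev.a - outByte + inByte + M * 256) % M;
  const b = (prev.b - blockSize * outByte + a + M * 256 * blockSize) % M;
  return { a, b, sum: a + b * M };
}
```

The weak sum is cheap enough to compute at every byte offset; only when it matches a known block does the (expensive) strong digest get checked.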


Does this only work with files or can any source be used as long as it is a Stream? Basically, I'd love to have something like this for arbitrary data, not just files.

Also, looks like the README has a broken link: https://librsync.github.io/rdiff.html


Yep, Streams or Pipes for IO. No seeking, forward-only reads - using this with network in/out was my goal.

Thanks for pointing out the broken link. Looks like it's here now: https://librsync.github.io/page_rdiff.html


I'd like to see something like this used for big JavaScript bundles.

Many webapps now have 2MB+ of JavaScript, and lots of web companies do daily or even hourly software releases, which effectively means every time I visit your site I have to download many megabytes of JavaScript which takes many seconds. It's frustrating. Diffing the JavaScript and sending just the changed lines of code would save hundreds of millions of people a few seconds each day, and is totally worth it.


All of this has already happened, and all of this will happen again.

- The original rsync author wrote rproxy (https://rproxy.samba.org/), an ingenious system wrapping HTTP inside the rsync protocol. On subsequent visits, the proxies see the resemblance and send a diff. I've toyed with it myself (https://github.com/rakoo/rproxy/), but I don't think websites really care about how disastrously huge and cumbersome they are to use

- another idea is to invert rsync: instead of the client calculating a signature, sending it to the server, and the server calculating the delta for every request of every client, it is possible to have the server calculate the signature once every time the content actually changes, send the signature to the client, and let the client calculate the parts they need and download only them. In short, zsync (http://zsync.moria.org.uk/)

There are a million things already done in this sector, but for some reason it has never caught on... I think there's just not enough interest
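The inverted (zsync-style) flow above can be sketched like this. Everything here is illustrative: the checksum is a toy byte sum, and a real implementation slides a rolling window over the local file instead of comparing at fixed offsets:

```javascript
// Toy per-block checksum (stand-in for rsync's weak + strong sums).
function blockSum(bytes) {
  let s = 0;
  for (const b of bytes) s = (s + b) % 65536;
  return s;
}

// Server side: computed once, whenever the published file changes.
function signature(bytes, blockSize) {
  const sums = [];
  for (let i = 0; i < bytes.length; i += blockSize) {
    sums.push(blockSum(bytes.subarray(i, i + blockSize)));
  }
  return sums;
}

// Client side: which server blocks are NOT in the local copy,
// i.e. which block indexes still need to be downloaded?
function missingBlocks(serverSig, localBytes, blockSize) {
  const have = new Set(signature(localBytes, blockSize));
  return serverSig
    .map((sum, index) => ({ sum, index }))
    .filter(({ sum }) => !have.has(sum))
    .map(({ index }) => index);
}
```

The key property is the one the comment describes: the per-client work moves to the client, and the server only serves static files.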


I really like this idea, but I'm not sure if it would work in practice. It's basically working against webpack's code splitting because those are tagged with hashes. Additionally, the first load will still be quite heavy, whereas with code splitting at least there's some semblance of a smaller initial load. Since it's not native to the browser, it would have a nontrivial initialization overhead as well.

I think it could be a cool idea if it's offered as a static file hosting as a service kind of deal that requires zero config, but then again if you're the kind of person that wants a zero config static file hosting, you're probably not pushing daily/hourly software releases?

It would be neat to have something like this more transparently implemented though. Seeing a 2MB+ module rebuild for a minor typo fix is aggravating.


One possible implementation:

* For each bundle, webpack produces the compiled bundle and a separate diff file between the latest release and every previously published release. Filenames could be [oldhash]-[newhash].js

* The webpack loader can check whether the browser has that specific module cached (I believe service workers are able to inspect the HTTP cache like this), and then, based on the version that is cached, request the correct filename for a diff to the latest version.

* Then apply the diff and load the javascript as normal.

I just did a quick test with a webapp of mine, and a small diff file made with the rsync algorithm (via rdiff-backup) of a webpack bundle was just 150 bytes.
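The "apply the diff" step could look something like this. The delta format here (copy/insert ops) is invented for illustration; rdiff/librsync use their own binary format:

```javascript
// Hypothetical sketch: rebuild the new bundle from the cached old bundle
// plus a downloaded delta. Each op either copies a range from the old
// bytes or inserts literal new bytes.
function applyDelta(oldBytes, ops) {
  const parts = [];
  for (const op of ops) {
    if (op.type === 'copy') {
      parts.push(oldBytes.subarray(op.start, op.start + op.length));
    } else {
      parts.push(op.data); // 'insert': literal bytes carried in the delta
    }
  }
  let total = 0;
  for (const p of parts) total += p.length;
  const out = new Uint8Array(total);
  let off = 0;
  for (const p of parts) { out.set(p, off); off += p.length; }
  return out;
}
```

Since most of a bundle survives a small release, the delta is mostly cheap copy ops, which is why the diff in the test above came out so tiny.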


> Then apply the diff and load the javascript as normal.

Can service workers mutate the HTTP cache? I haven't seen it yet, but if not, it seems like the client might be forever pinned to the old version. I guess most likely we'd have to avoid the HTTP cache altogether and run our own application module caching. That could work, but wonder if the performance overhead would be worth it...

I'll play around this a little more though, it's a really cool idea. Getting a clean API seems tricky but doable.


HTTP already has range requests, so it seems like you could do this in a relatively straightforward way.
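For instance, a client that knows which byte ranges it's missing can ask for just those. A hedged sketch (the URL is a placeholder; servers that honor Range respond with 206 Partial Content):

```javascript
// Build the Range header value for a set of [start, end] byte ranges.
function rangeHeader(ranges) {
  // e.g. [[0, 499], [1000, 1499]] -> 'bytes=0-499,1000-1499'
  return 'bytes=' + ranges.map(([s, e]) => `${s}-${e}`).join(',');
}

// Fetch only the listed ranges of a remote file (zsync-style).
async function fetchRanges(url, ranges) {
  const res = await fetch(url, { headers: { Range: rangeHeader(ranges) } });
  if (res.status !== 206) throw new Error('server ignored the Range header');
  return res.arrayBuffer();
}
```

One caveat: multi-range responses come back as multipart/byteranges, which the client then has to parse.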


That is a very interesting idea.

I've been thinking about what I should do in a V2 of this lib.


Related (rsync in the browser):

https://webdeltasync.github.io/


This is a cool project!

Looks like they use my bit-sync lib for the actual rsync part [1] which is neat.

I should modernize this lib so that it's easier for them to add as a dependency.

https://github.com/WebDeltaSync/WebRsync/blob/master/public/...


Thanks for making bit-sync, it's awesome! It immediately got me thinking about cool things you could do in the browser with it! It's great that they didn't try to reinvent the wheel and used your work.

The WebRsync project doesn't seem to be particularly active, so I don't think you have to worry about it, but if you find time to modernize your lib that'd be awesome -- just need the JS community to get excited about doing cool stuff with rsync in the browser (sprinkle some WASM in there maybe somewhere).


Yep, the unexpected attention this received today has lit a fire under me.

I'll get on it.


Also check out the Rabin rolling hash algorithm for smart chunking of files, which can be useful either to dedupe dynamic content or to efficiently sync files.
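The idea behind content-defined chunking: cut a chunk wherever the low bits of a rolling hash hit zero, so boundaries depend on content rather than byte offsets, and an insertion only disturbs nearby chunks. A toy sketch using a gear-style shift-and-add hash as a stand-in for a true Rabin fingerprint (all names and parameters are illustrative):

```javascript
// Split bytes into content-defined chunks. `minLen` enforces a minimum
// chunk size; `mask` controls the average chunk size (~1/(mask+1) cut
// probability per byte once past minLen).
function chunk(bytes, { minLen = 16, mask = 0xfff } = {}) {
  const chunks = [];
  let hash = 0;
  let start = 0;
  for (let i = 0; i < bytes.length; i++) {
    // Gear-style rolling hash: old bytes shift out of the word over time.
    hash = ((hash << 1) + bytes[i]) >>> 0;
    if (i - start + 1 >= minLen && (hash & mask) === 0) {
      chunks.push(bytes.subarray(start, i + 1));
      start = i + 1;
      hash = 0;
    }
  }
  if (start < bytes.length) chunks.push(bytes.subarray(start));
  return chunks;
}
```

Chunks can then be keyed by a strong digest for dedupe: identical content yields identical chunks regardless of where it sits in the file.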



