The "rolling window hashes" from the comment suggests sub-file matching at any offset. (See Bently-McIlroy diff algo/how rsync efficiently finds matches, for example.) I'm not aware that git performs this sort of deduplication.

Better yet would be to use a rolling hash to decide where to cut the blocks, and then use a locality-aware hash (SimHash, etc.) to find similar blocks. Perform a topological sort to decide which blocks to store as diffs of others.
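
A rough sketch of the first two steps, with made-up parameters (the toy rolling hash, window width, boundary mask, and shingle size are illustrative, not any particular tool's scheme; the topological-sort step for picking diff bases is left out): cut a chunk wherever the rolling hash of recent bytes hits a chosen bit pattern, so boundaries don't shift when data is inserted earlier in the file, then fingerprint each chunk with a SimHash so similar chunks land close together in Hamming distance.

    import hashlib

    WINDOW = 16     # minimum chunk length before we look for a boundary
    MASK = 0x3FF    # boundary when the low 10 bits are zero -> ~1 KiB average chunks

    def chunks(data):
        # Content-defined chunking: boundaries depend only on nearby bytes,
        # so an insertion early in the file doesn't shift later cut points.
        out, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF  # toy Gear-style rolling hash
            if i - start >= WINDOW and (h & MASK) == 0:
                out.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            out.append(data[start:])
        return out

    def simhash(chunk, bits=64):
        # SimHash: every 4-byte shingle votes on each output bit; chunks that
        # share most of their shingles end up with nearby fingerprints.
        votes = [0] * bits
        for i in range(len(chunk) - 3):
            h = int.from_bytes(hashlib.blake2b(chunk[i:i + 4], digest_size=8).digest(), "big")
            for b in range(bits):
                votes[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(bits) if votes[b] > 0)

    def hamming(x, y):
        # Small distance = good candidate to store as a diff of the other chunk.
        return bin(x ^ y).count("1")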

Microsoft had some enterprise product that performed distribution somewhat like this, but it also recursively used similarity hashes to check whether the diffs themselves were similar to existing files on the far machine.



Git does delta compression on packfiles, and if the same data occurs in different files, it actually can deduplicate it, even if it's not at the same offset.

Here's a demo.

First, create two test files. The files both contain the same two 1-megabyte chunks of random bytes, but in the opposite order:

    $ openssl rand 1000000 > a
    $ openssl rand 1000000 > b
    $ cat a b > ab
    $ cat b a > ba
    $ du -sh ab ba
    2.0M        ab
    2.0M        ba
Commit them to Git and see that it requires 4 MB to store these two 2 MB files:

    $ git init
    Initialized empty Git repository in /tmp/x/.git/
    $ git add ab ba
    $ git commit -m 'add two files'
    [master (root-commit) 7f75af0] add two files
     2 files changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 ab
     create mode 100644 ba
    $ du -sh .git
    4.0M        .git
Run garbage collection, which creates a packfile. Note "delta 1" and note that disk usage dropped to a bit over the size of one of the files.

    $ git gc
    Enumerating objects: 4, done.
    Counting objects: 100% (4/4), done.
    Delta compression using up to 6 threads
    Compressing objects: 100% (4/4), done.
    Writing objects: 100% (4/4), done.
    Total 4 (delta 1), reused 0 (delta 0), pack-reused 0
    $ du -sh .git
    2.1M        .git
I'm not sure if it's as sophisticated as some backup tools, though.


Microsoft's implementation is called Remote Differential Compression: https://learn.microsoft.com/en-us/previous-versions/windows/...

It's available as a built-in component of Windows; it's just a library with an API.

Essentially the MS RDC protocol is just rsync run twice in a row, with the rsync metadata copied via rsync to compress it further.
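
A rough way to picture the "rsync run twice" part (purely illustrative Python, not the RDC wire format; blake2b and the block size are stand-ins): the per-block signature list for a large file is itself a sizeable blob, so it is treated as a file in its own right and signatured again, and only the pieces of it the other side lacks need to be transferred.

    import hashlib

    def signatures(data, block=2048):
        # First pass: one short hash per block (the "rsync metadata" for the file).
        return b"".join(hashlib.blake2b(data[i:i + block], digest_size=16).digest()
                        for i in range(0, len(data), block))

    def recursive_signatures(data, block=2048):
        # Second pass: signature the signature blob too, so only the changed
        # parts of the metadata cross the wire.
        sig = signatures(data, block)
        return sig, signatures(sig, block)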


> Essentially the MS RDC protocol is just rsync run twice in a row, with the rsync metadata copied via rsync to compress it further.

There's an important difference: RDC uses a locality-sensitive hash algorithm (MinHash, IIRC) to find files likely to have matching sections, whereas rsync only considers the version of the same file sitting on the far host. rsync encodes differences for each file in isolation, whereas RDC looks at the entire corpus of files on the volume.

For example, if you do the Windows equivalent of cat local/b.txt >> local/a.txt, rsync is going to miss the opportunity to encode local/a.txt -> remote/a.txt using matching runs from a common local/b.txt and remote/b.txt. However, RDC has the opportunity to notice that the local/a.txt -> remote/a.txt diff is very similar to remote/b.txt and further delta-encode the diff as a diff against remote/b.txt.
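
A hedged sketch of the MinHash side (a bottom-k variant; the shingle size, hash, and function names are assumptions, not RDC internals): each file's signature is its k smallest shingle hashes, the overlap between two signatures roughly estimates how much content the files share, and the most similar existing remote file is picked as the delta base, as in the a.txt/b.txt example above.

    import hashlib

    def minhash(data, shingle=8, k=64):
        # Bottom-k sketch: keep the k smallest hashes over all length-`shingle`
        # substrings of the file.
        hashes = {int.from_bytes(hashlib.blake2b(data[i:i + shingle],
                                                 digest_size=8).digest(), "big")
                  for i in range(len(data) - shingle + 1)}
        return set(sorted(hashes)[:k])

    def similarity(sig_a, sig_b):
        # Rough Jaccard-style estimate of how much content two files share.
        return len(sig_a & sig_b) / max(1, len(sig_a | sig_b))

    def best_delta_base(new_sig, remote_sigs):
        # remote_sigs: {filename: minhash signature} for files already on the far host.
        return max(remote_sigs, key=lambda name: similarity(new_sig, remote_sigs[name]))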



