The "rolling window hashes" from the comment suggests sub-file matching at any offset. (See Bently-McIlroy diff algo/how rsync efficiently finds matches, for example.) I'm not aware that git performs this sort of deduplication.

Better yet would be to use a rolling hash to decide where to cut the blocks, and then use a locality-aware hash (SimHash, etc.) to find similar blocks. Perform a topological sort to decide which blocks to store as diffs of others.
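
A rough sketch of the first two steps, with made-up parameters (the toy rolling hash, window width, boundary mask, and shingle size are illustrative, not any particular tool's scheme; the topological-sort step for picking diff bases is left out): cut a chunk wherever the rolling hash of recent bytes hits a chosen bit pattern, so boundaries don't shift when data is inserted earlier in the file, then fingerprint each chunk with a SimHash so similar chunks land close together in Hamming distance.

    import hashlib

    WINDOW = 16     # minimum chunk length before we look for a boundary
    MASK = 0x3FF    # boundary when the low 10 bits are zero -> ~1 KiB average chunks

    def chunks(data):
        # Content-defined chunking: boundaries depend only on nearby bytes,
        # so an insertion early in the file doesn't shift later cut points.
        out, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF  # toy Gear-style rolling hash
            if i - start >= WINDOW and (h & MASK) == 0:
                out.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            out.append(data[start:])
        return out

    def simhash(chunk, bits=64):
        # SimHash: every 4-byte shingle votes on each output bit; chunks that
        # share most of their shingles end up with nearby fingerprints.
        votes = [0] * bits
        for i in range(len(chunk) - 3):
            h = int.from_bytes(hashlib.blake2b(chunk[i:i + 4], digest_size=8).digest(), "big")
            for b in range(bits):
                votes[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(bits) if votes[b] > 0)

    def hamming(x, y):
        # Small distance = good candidate to store as a diff of the other chunk.
        return bin(x ^ y).count("1")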

Microsoft had some enterprise product that performed distribution somewhat like this, but it also recursively used similarity hashes to check whether the diffs themselves were similar to existing files on the far machine.



Git does delta compression on packfiles, and if the same data occurs in different files, it actually can deduplicate it, even if it's not at the same offset.

Here's a demo.

First, create two test files. The files both contain the same two 1-megabyte chunks of random bytes, but in the opposite order:

    $ openssl rand 1000000 > a
    $ openssl rand 1000000 > b
    $ cat a b > ab
    $ cat b a > ba
    $ du -sh ab ba
    2.0M        ab
    2.0M        ba
Commit them to Git and see that it requires 4 MB to store these two 2 MB files:

    $ git init
    Initialized empty Git repository in /tmp/x/.git/
    $ git add ab ba
    $ git commit -m 'add two files'
    [master (root-commit) 7f75af0] add two files
     2 files changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 ab
     create mode 100644 ba
    $ du -sh .git
    4.0M        .git
Run garbage collection, which creates a packfile. Note "delta 1" and note that disk usage dropped to a bit over the size of one of the files.

    $ git gc
    Enumerating objects: 4, done.
    Counting objects: 100% (4/4), done.
    Delta compression using up to 6 threads
    Compressing objects: 100% (4/4), done.
    Writing objects: 100% (4/4), done.
    Total 4 (delta 1), reused 0 (delta 0), pack-reused 0
    $ du -sh .git
    2.1M        .git
I'm not sure if it's as sophisticated as some backup tools, though.


Microsoft's implementation is called Remote Differential Compression: https://learn.microsoft.com/en-us/previous-versions/windows/...

It's available as a built-in component of Windows; it's just a library with an API.

Essentially the MS RDC protocol is just rsync run twice in a row, with the rsync metadata copied via rsync to compress it further.
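
A rough way to picture the "rsync run twice" part (purely illustrative Python, not the RDC wire format; blake2b and the block size are stand-ins): the per-block signature list for a large file is itself a sizeable blob, so it is treated as a file in its own right and signatured again, and only the pieces of it the other side lacks need to be transferred.

    import hashlib

    def signatures(data, block=2048):
        # First pass: one short hash per block (the "rsync metadata" for the file).
        return b"".join(hashlib.blake2b(data[i:i + block], digest_size=16).digest()
                        for i in range(0, len(data), block))

    def recursive_signatures(data, block=2048):
        # Second pass: signature the signature blob too, so only the changed
        # parts of the metadata cross the wire.
        sig = signatures(data, block)
        return sig, signatures(sig, block)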


> Essentially the MS RDC protocol is just rsync run twice in a row, with the rsync metadata copied via rsync to compress it further.

There's an important difference: RDC uses a locality-sensitive hash algorithm (MinHash, IIRC) to find files likely to have matching sections, whereas rsync only considers the version of the same file sitting on the far host. rsync encodes differences for each file in isolation, whereas RDC looks at the entire corpus of files on the volume.

For example, if you do the Windows equivalent of cat local/b.txt >> local/a.txt, rsync is going to miss the opportunity to encode local/a.txt -> remote/a.txt using matching runs from a common local/b.txt and remote/b.txt. However, RDC has the opportunity to notice that the local/a.txt -> remote/a.txt diff is very similar to remote/b.txt and further delta-encode the diff as a diff against remote/b.txt.
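
A hedged sketch of the MinHash side (a bottom-k variant; the shingle size, hash, and function names are assumptions, not RDC internals): each file's signature is its k smallest shingle hashes, the overlap between two signatures roughly estimates how much content the files share, and the most similar existing remote file is picked as the delta base, as in the a.txt/b.txt example above.

    import hashlib

    def minhash(data, shingle=8, k=64):
        # Bottom-k sketch: keep the k smallest hashes over all length-`shingle`
        # substrings of the file.
        hashes = {int.from_bytes(hashlib.blake2b(data[i:i + shingle],
                                                 digest_size=8).digest(), "big")
                  for i in range(len(data) - shingle + 1)}
        return set(sorted(hashes)[:k])

    def similarity(sig_a, sig_b):
        # Rough Jaccard-style estimate of how much content two files share.
        return len(sig_a & sig_b) / max(1, len(sig_a | sig_b))

    def best_delta_base(new_sig, remote_sigs):
        # remote_sigs: {filename: minhash signature} for files already on the far host.
        return max(remote_sigs, key=lambda name: similarity(new_sig, remote_sigs[name]))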



