CLI tool, written in Rust, to diff directory snapshots

codetrotter · on Jan 22, 2024

OP should have a look at ZFS. With large amounts of data I feel that ZFS snapshots might be far more time efficient to compare than diffing full directories.

Bonus: FreeBSD is currently considering adding Rust to their base system. They have ZFS natively in FreeBSD already. Perhaps OP will find joy in FreeBSD :D

timetraveller26 · on Jan 22, 2024

Yes, ZFS rules. If you are doing this regularly, you should consider a filesystem like ZFS or another solution like Borg (a deduplicating backup tool).

Though this has the convenience of being a more universal solution.

jotaen · on Jan 22, 2024

I haven’t looked into ZFS yet, but thanks to your comment, it’s now on my todo list.

One idea behind my implementation was to have something that’s more agnostic of specific file systems. But I guess that’s an aspect that may be worth reconsidering.

22c · on Jan 23, 2024

This appears to be filesystem agnostic and could be quite useful in scenarios where switching to ZFS is not an easy option.

mustache_kimono · on Jan 22, 2024

Curious -- how does your system handle hardlinks? I have a ZFS system which does a ZFS "roll forward":

    --roll-forward=<ROLL_FORWARD> traditionally 'zfs rollback' is a destructive operation, whereas httm roll-forward is non-destructive.  httm will copy only files and their attributes that have changed since a specified snapshot, from that snapshot, to its live dataset...

One of the more difficult problems I had to deal with was hardlink resolution. Basically, I still have to scan the whole dataset and create a map of hardlinks before any run.
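To illustrate what such a hardlink map might look like, here is a minimal Python sketch. It is not httm's implementation; the function name `build_hardlink_map` is hypothetical, and the approach (grouping paths by device and inode number) is simply one common way to detect hardlinks on POSIX-like systems.

```python
import os
import stat
from collections import defaultdict


def build_hardlink_map(root):
    """Group paths under `root` that point at the same inode.

    Hypothetical helper for illustration only. Returns a dict that maps
    (device, inode) to the sorted list of paths sharing that inode.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            # Only regular files with a link count above 1 can have siblings.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    # A link may live outside `root`, so keep only inodes seen at least twice.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

Note that this still walks the entire tree once, which matches the cost described above: the link count alone says that siblings exist, not where they are.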

jotaen · on Jan 22, 2024

Oh thanks, that’s a good point. I actually haven’t considered hardlinks at all yet.

I should probably look into that at some point → https://github.com/jotaen/snapdiff/issues/2

rhettbull · on Jan 22, 2024

I've created a similar tool in Python that I find quite useful (https://github.com/RhetTbull/dirsnapshot). It doesn't compute content hashes as your tool does, but it is designed to highlight files added, removed, or changed, and I use it primarily for reverse engineering projects. One key feature is that it can store the snapshot in a SQLite database that doesn't take much space, so you can compare the directory against the stored snapshot at some point in the future. Diffs are computed based on stat() info: mode, ownership, size, mtime. In addition to a CLI, it also provides a Python API so you can use it directly in your own code.
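A rough sketch of the stat-based idea, for readers who want to see the shape of it. This is not dirsnapshot's actual API or schema: the function names and the `files` table layout are assumptions made up for this example, using only the standard library.

```python
import os
import sqlite3


def take_snapshot(root):
    """Map each file's relative path to (mode, uid, gid, size, mtime_ns)."""
    snap = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            rel = os.path.relpath(full, root)
            snap[rel] = (st.st_mode, st.st_uid, st.st_gid, st.st_size, st.st_mtime_ns)
    return snap


def save_snapshot(db_path, snap):
    """Store a snapshot in a SQLite file (hypothetical single-table schema)."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS files ("
            "path TEXT PRIMARY KEY, mode INTEGER, uid INTEGER, "
            "gid INTEGER, size INTEGER, mtime_ns INTEGER)"
        )
        con.execute("DELETE FROM files")
        con.executemany(
            "INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)",
            [(path, *info) for path, info in snap.items()],
        )
    con.close()


def load_snapshot(db_path):
    """Read a stored snapshot back into the same dict shape."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT path, mode, uid, gid, size, mtime_ns FROM files"
    ).fetchall()
    con.close()
    return {row[0]: tuple(row[1:]) for row in rows}


def diff_snapshots(old, new):
    """Return (added, removed, changed) lists of relative paths."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return added, removed, changed
```

Because only stat() fields are compared, a file rewritten with the same size, mode, ownership, and mtime would go unnoticed; that is the trade-off against content hashing.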

grow2grow · on Jan 22, 2024

> (For some extra “entropy”, by the way, snapdiff also takes the file size into consideration when comparing files.)

I get what you mean by "entropy", but wouldn't it be more direct to just say matching hashes will have their file sizes compared as a remedy to the collision?

Thanks for the effort, the "back yard diy" utilities are great for learning, especially with an accompanying article.

jotaen · on Jan 22, 2024

Yeah, fair point – I couldn’t think of a better word, that’s why I wrapped it in quotes. I’ve tried to simplify the phrasing:

> For some extra safety margin to avoid collisions, by the way, `snapdiff` also takes the file size into consideration when comparing files.
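As a minimal sketch of that idea in Python (not snapdiff's actual code, which is written in Rust): the choice of SHA-256 and the names `fingerprint` and `same_content` are assumptions for illustration. Two files are treated as equal only if both the size and the hash match.

```python
import hashlib
import os


def fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return (os.path.getsize(path), digest.hexdigest())


def same_content(path_a, path_b):
    """Compare two files by size first, then by size plus hash."""
    # Cheap check: differing sizes mean differing content, no hashing needed.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return fingerprint(path_a) == fingerprint(path_b)
```

With the size as part of the fingerprint, a hash collision would only cause a false match if the two colliding files also happened to have the same length.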

rpigab · on Jan 22, 2024

This looks nice! I like TUI programs for this purpose.

Previously, I've used Beyond Compare 4 by Scooter Software (GUI, free to try). It's nice to have more options, because `diff -r` doesn't get you very far.

I also like the ability to find duplicate images or any files regardless of location with Czkawka (Github qarmin/czkawka).

gumby · on Jan 22, 2024

Why does it matter if it is written in Rust or assembly code?

throwaway8582 · on Jan 22, 2024

Aside from arguments about performance and memory safety, I'm generally more likely to try something written in Rust (or Go) because projects in those languages tend to be easy to build or download as a static binary. For Rust projects, `cargo install <name>` generally works. On the other hand, when I see something written in C++ or Python, it's an indicator that there may be significantly more work involved.

archargelod · on Jan 23, 2024

As a counterpoint: I'm generally more likely to avoid tools built in Rust or Go, because I don't have their toolchains installed at all times. With tools written in C, I can often just clone the repo and build it with one or two commands, without clobbering my dotfiles or downloading big toolchains that are otherwise useless to me.

I agree with you on Python, but I also want to add Javascript. Just remembering trying to install something from npm sends shivers down my spine. Never again.

omaranto · on Jan 22, 2024

I think if you do write something in Rust it is customary to mention it to avoid getting tons of suggestions to rewrite it in Rust.

jacquesm · on Jan 22, 2024

Because 'in Rust' is good for at least 30 upvotes.

timetraveller26 · on Jan 22, 2024

They are special prompts for the HN LLM

jotaen · on Jan 22, 2024

I get the suspicious sentiment, but I mentioned it for other reasons in the title. Apart from solving a personal need, this project is largely about tinkering with Rust and performance optimisations. I was hoping that mentioning the language prominently would help attract people that may give valuable feedback regarding those things.

adamzochowski · on Jan 23, 2024

RAR supports both incremental and differential backups.

If you make it a solid archive, sort files by extension, and enable max compression with ppm, then these backups will be some of the most efficient possible.

hartator · on Jan 22, 2024

Isn't this just Git?

quacker · on Jan 22, 2024

I was surprised to not see a mention of `git diff` anywhere in the post.

Out of curiosity, I set up two directories where the second has a new file, a modified file, and a removed file compared to the first:

    $ tree dir1/ dir2/
    dir1/
    ├── empty.txt
    ├── modified.txt
    ├── remove-this.txt
    └── unchanged.txt
    dir2/
    ├── empty.txt
    ├── modified.txt
    ├── new-file.txt
    └── unchanged.txt

Then `git diff --name-status dir1 dir2` outputs the following, showing the changes by file name.

    $ git diff --name-status dir1 dir2
    M       dir1/modified.txt
    A       dir2/new-file.txt
    D       dir1/remove-this.txt

This doesn't require running `git init` on the two directories either, so it's immediately usable out of the box.
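For comparison, roughly the same added/deleted/modified classification can be sketched with Python's standard `filecmp` module. The function name `name_status` is made up for this example, and unlike the `git diff` output above it reports paths relative to the two roots.

```python
import filecmp
import os


def name_status(dir1, dir2):
    """Return (status, relative_path) pairs sorted by path.

    Status letters mirror the git output: A added, D deleted, M modified.
    A directory that exists on only one side is reported as a single entry.
    """
    result = []

    def walk(cmp, prefix):
        for name in cmp.left_only:
            result.append(("D", os.path.join(prefix, name)))
        for name in cmp.right_only:
            result.append(("A", os.path.join(prefix, name)))
        for name in cmp.common_files:
            left = os.path.join(cmp.left, name)
            right = os.path.join(cmp.right, name)
            # shallow=False compares file contents, not just stat() data.
            if not filecmp.cmp(left, right, shallow=False):
                result.append(("M", os.path.join(prefix, name)))
        # Recurse into subdirectories present on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(dir1, dir2), "")
    return sorted(result, key=lambda item: item[1])
```

This reads every common file in full, so on very large directories it will be slow; it is meant to show the classification logic, not to compete with a dedicated tool.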

jotaen · on Jan 22, 2024

I suppose you could make a similar thing happen based on git, or maybe even a combination of `find` and `diff`.

It’s certainly an interesting debate whether to use a general-purpose approach vs creating a dedicated and fully customisable implementation. In this case, I was interested in the exact numbers and output structure that snapdiff produces. I’m not sure what it would take to make the same output happen by using git, or how well that would work on directory sizes in (or beyond) the 100,000 files / 100 GB region.

If someone would be up for trying that out and sharing their insights, I’d be interested to learn about it.

gkfasdfasdf · on Jan 23, 2024

A single git repo keeps all history as compressed objects under .git and would continue to grow in size, while OP is deleting older snapshots to reclaim disk space. So it doesn't really fit the git use case.

mylittlebrain · on Jan 22, 2024

That was my first thought. A content addressable file system* like Git or Mercurial, would be the simplest thing that could work. * Not sure these VCSs could be called a CAS.

sys_64738 · on Jan 22, 2024

What does this do that plain old "diff dir1 dir2" command doesn't?

anacrolix · on Jan 23, 2024

Could you just dry run rsync? It does exactly this.