CLI tool, written in Rust, to diff directory snapshots (jotaen.net)
42 points by jotaen on Jan 22, 2024 | 25 comments



OP should have a look at ZFS. With large amounts of data I feel that ZFS snapshots might be far more time efficient to compare than diffing full directories.

Bonus: FreeBSD is currently considering adding Rust to their base system. They have ZFS natively in FreeBSD already. Perhaps OP will find joy in FreeBSD :D


Yes, ZFS rules. If you are doing this regularly, you should consider a filesystem like ZFS or another solution like Borg (a deduplicating backup tool).

Though this has the convenience of being a more universal solution.


I haven’t looked into ZFS yet, but thanks to your comment, it’s now on my todo list.

One idea behind my implementation was to have something that’s more agnostic of specific file systems. But I guess that’s an aspect that may be worth reconsidering.


This appears to be filesystem agnostic and could be quite useful in scenarios where switching to ZFS is not an easy option.


Curious -- how does your system handle hardlinks? I have a ZFS system which does a ZFS "roll forward":

    --roll-forward=<ROLL_FORWARD> traditionally 'zfs rollback' is a destructive operation, whereas httm roll-forward is non-destructive.  httm will copy only files and their attributes that have changed since a specified snapshot, from that snapshot, to its live dataset...

One of the more difficult problems I had to deal with was hardlink resolution. Basically, I still have to scan the whole dataset and create a map of hardlinks before any run.
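
For illustration, a minimal sketch of what such a pre-run hardlink map could look like. This is not httm's code; the function name `build_hardlink_map` is hypothetical, and it assumes a POSIX-style filesystem where two paths are hardlinks of each other when they share the same device and inode numbers.

```python
import os
import stat
from collections import defaultdict


def build_hardlink_map(root):
    """Group regular files under `root` that share a (device, inode) pair.

    Hypothetical helper: walks the whole tree once, as described above,
    and returns only groups with at least two paths inside `root`.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and special files
            if st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    # A link count above 1 can also come from a link outside `root`,
    # so keep only groups where we actually saw several paths.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

The full scan is the expensive part: the link count tells you that other names exist, but not where they are, so every file has to be visited.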


Oh thanks, that’s a good point. I actually haven’t considered hardlinks at all yet.

I should probably look into that at some point → https://github.com/jotaen/snapdiff/issues/2


I've created a similar tool in Python that I find quite useful: https://github.com/RhetTbull/dirsnapshot This doesn't compute content hashes as your tool does, but is designed to highlight files added, removed, or changed, and I use it primarily for reverse engineering projects. One key feature is that it can store the snapshot in a SQLite database that doesn't take much space, so you can compare the directory against the stored snapshot at some point in the future. Diffs are computed based on stat() info: mode, ownership, size, mtime. In addition to a CLI, it also provides a Python API so you can use it directly in your own code.
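
A rough sketch of the stat()-based comparison described above, for readers who want to see the idea in code. This is not dirsnapshot's API: `take_snapshot` and `diff_snapshots` are hypothetical names, and the snapshot is kept in a plain dict rather than a SQLite database.

```python
import os


def take_snapshot(root):
    """Record mode, ownership, size, and mtime for every file under `root`."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            rel = os.path.relpath(path, root)
            snapshot[rel] = (st.st_mode, st.st_uid, st.st_gid, st.st_size, st.st_mtime_ns)
    return snapshot


def diff_snapshots(old, new):
    """Return sorted lists of (added, removed, changed) relative paths."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # A file counts as changed if any recorded stat field differs.
    changed = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return added, removed, changed
```

Because no file contents are read, this is fast, but an edit that preserves size and mtime would go unnoticed.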


> (For some extra “entropy”, by the way, snapdiff also takes the file size into consideration when comparing files.)

I get what you mean by "entropy", but wouldn't it be more direct to just say matching hashes will have their file sizes compared as a remedy to the collision?

Thanks for the effort, the "back yard diy" utilities are great for learning, especially with an accompanying article.


Yeah, fair point – I couldn’t think of a better word, that’s why I wrapped it in quotes. I’ve tried to simplify the phrasing:

> For some extra safety margin to avoid collisions, by the way, `snapdiff` also takes the file size into consideration when comparing files.
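
To make the idea concrete, here is a minimal sketch of comparing files by size plus content hash. It is not taken from snapdiff: the function names are hypothetical, and SHA-256 is an assumption, since the thread doesn't say which hash the tool uses.

```python
import hashlib
import os


def fingerprint(path, chunk_size=1 << 20):
    """Return (size in bytes, SHA-256 hex digest) for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def same_content(path_a, path_b):
    """Treat two files as identical only if both size and hash match."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # cheap check first; no need to hash at all
    return fingerprint(path_a) == fingerprint(path_b)
```

With this scheme, a hash collision would only go undetected if the two colliding files also happened to have exactly the same length.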


This looks nice! I like TUI programs for this purpose.

Previously, I've used Beyond Compare 4 by Scooter Software (GUI, free to try). It's nice to have more options, because `diff -r` doesn't get you very far.

I also like the ability to find duplicate images or any files regardless of location with Czkawka (Github qarmin/czkawka).


Why does it matter if it is written in Rust or assembly code?


Aside from arguments about performance and memory safety, I'm generally more likely to try something written in Rust (or Go) because projects in those languages tend to be easy to build or download as a static binary. For Rust projects, `cargo install <name>` generally works. On the other hand, when I see something written in C++ or Python, it's an indicator that there may be significantly more work involved.


As a counterpoint: I'm generally more likely to avoid tools built in Rust or Go, because I don't have their toolchains installed at all times. With tools written in C, I can often just clone the repo and build it with one or two commands, without clobbering my dotfiles or downloading big toolchains that are otherwise useless to me.

I agree with you on Python, but I also want to add JavaScript. Just remembering trying to install something from npm sends shivers down my spine. Never again.


I think if you do write something in Rust it is customary to mention it to avoid getting tons of suggestions to rewrite it in Rust.


Because 'in Rust' is good for at least 30 upvotes.


They are special prompts for the HN LLM


I get the suspicious sentiment, but I mentioned it for other reasons in the title. Apart from solving a personal need, this project is largely about tinkering with Rust and performance optimisations. I was hoping that mentioning the language prominently would help attract people that may give valuable feedback regarding those things.


RAR supports both incremental and differential backups.

If you make it a solid archive, sort files by extension, and enable maximum compression with PPM, then these backups will be some of the most efficient possible.


Isn't this just Git?


I was surprised to not see a mention of `git diff` anywhere in the post.

Out of curiosity, I set up two directories where the second has a new file, a modified file, and a removed file compared to the first:

    $ tree dir1/ dir2/
    dir1/
    ├── empty.txt
    ├── modified.txt
    ├── remove-this.txt
    └── unchanged.txt
    dir2/
    ├── empty.txt
    ├── modified.txt
    ├── new-file.txt
    └── unchanged.txt

Then `git diff --name-status dir1 dir2` outputs the following, showing the changes by file name.

    $ git diff --name-status dir1 dir2
    M       dir1/modified.txt
    A       dir2/new-file.txt
    D       dir1/remove-this.txt

This doesn't require running `git init` on either directory, so it's immediately usable out of the box.


I suppose you could make a similar thing happen based on git, or maybe even a combination of `find` and `diff`.

It’s certainly an interesting debate whether to use a general-purpose approach vs creating a dedicated and fully customisable implementation. In this case, I was interested in the exact numbers and output structure that snapdiff produces. I’m not sure what it would take to make the same output happen by using git, or how well that would work on directory sizes in (or beyond) the 100,000 files / 100 GB region.

If someone would be up for trying that out and sharing their insights, I’d be interested to learn about it.


A single git repo keeps all history as compressed objects under .git and would continue to grow in size, while OP is deleting older snapshots to reclaim disk space. So it doesn't really fit the git use case.


That was my first thought. A content-addressable file system* like Git or Mercurial would be the simplest thing that could work. * Not sure these VCSs could be called a CAS.


What does this do that plain old "diff dir1 dir2" command doesn't?


Could you just dry run rsync? It does exactly this.



