Ha, this is awesome, thanks for checking that out. One point of note though: I'm pretty sure it would have been faster to build the kernel + OpenZFS on Debian/RISC-V in QEMU. QEMU on decent hardware runs very fast, much faster than the D1.
I don't know about low power (or inexpensive, for that matter), but the HiFive Unmatched has lots of IO. 8 lanes of PCIe 3.0 can be expanded into quite a few SATA 3 ports.
For ARM, rk3568 is available now and has 2 lanes of PCIe 3.0 and up to 3 SATA-3 ports. As opposed to the rpi4 which has a single PCIe 2.1 lane.
The NanoPI M4 burned through two SD cards and destroyed its own eMMC module due to some weird behavior while I was using it as a NAS.
No idea what caused it; I suspect excessive writes wearing out the flash storage, or the bootloader getting corrupted and no longer recognizing the boot media. The "wear" I mentioned could be something else of course, but apart from normal OS writes, everything went to the SATA drives.
It sounded like a solid piece of hardware with the 4-SATA-port HAT and a good CPU, but in the end it turned out to be unreliable: at some point the OS hung, wouldn't boot from the OS storage, and the bootloader wasn't seeing the partitions in a UART debug session.
Huh. The only thing I can think of off the top of my head that would cause it to chew up 2 SD cards is excessive amounts of logging. Nothing else really seems to make much sense.
Yeah I would love not being dependent on x86 and its horrid UEFI mess. Then again... I believe technically ARM (and even RISC-V) board manufacturers can also make something similar. Let's hope that doesn't become a thing though.
Why not? I've been working from the view that UEFI is one of the few things worth keeping from x86_64 - I do not want to deal with every board having its own special boot process that prevents one disk from being able to boot different machines. And UEFI is already here and works, so why reinvent the wheel?
The application profiles (like RVA22) have a bunch of requirements on the hardware that must be present, the boot process, and the interface the firmware offers to the OS (e.g. OpenSBI).
Ok but does any of that work in practice? Can I write a random distro ISO to an SD card and expect it to work on any RISC-V board that has the same pointer size? Cause that’s how it works in PC land thanks to ACPI tables.
I just checked the Ubuntu RISC-V download page and it has 2 different ISOs for 2 boards from the same vendor! And apparently those are the only 2 boards with ISOs available 0_o
Now there’s a hot take! In PC land the same bootloader, kernel, config files, etc. will work on pretty much any computer. Hot-swappable components are the norm here.
Meanwhile in ARM/RISC-V embedded land, every chip and every board is its own special snowflake. With no ACPI/UEFI, someone’s gotta hardcode the config of every device on every board, including the ones inside the SoC. Naturally the communities around these boards are even more fragmented than Linux distros already are.
All file systems have metadata which is good to keep in memory. Having built several 50+TB NAS boxes recently, I can say it isn’t just ZFS either. And the penalty for not having enough RAM isn’t always some sort of linear performance hit. It can be kernel panics, exponential decay in performance, etc.
That’s a good point - I’ve seen free memory drop every time I’ve built the larger file systems (and not from just the cache), but I never tried to quantify it. And I don’t see any good stats or notes on it.
Seems like no one is building these larger systems on boxes small enough for it to matter, or at least Google isn’t finding it.
Another way of saying this is that, for the storage systems actual (non-theoretical) people encounter, RAM usage doesn't meaningfully scale with storage size, because the minimum RAM available on any system one encounters is sufficient to service the amount of storage it is possible to use on said system.
The main case for deduplication that I know of is hosting many virtual machines with the same OS. If that OS is Windows (which explains why VMs and not containers), there are dozens of GB of data duplicated per VM. It is not 'nobody', even if it is not common and not always worth it.
> There has not ever been a reason for memory to be correlated with storage capacity nor any reason to believe that such a correlation ought to exist.
However specific implementations can indeed have memory requirements that scale in relation to storage capacity. For example, if the implementation keeps the bitmap of free space in memory, then more storage = larger bitmap = more memory required.
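To put rough numbers on that, here's a back-of-the-envelope sketch, assuming a simple 1-bit-per-block free-space bitmap and 4 KiB blocks (not any particular filesystem's actual layout):

  # bits needed = storage_bytes / block_size; divide by 8 for bytes
  storage_tb=2
  block_size=4096
  echo "$(( storage_tb * 2**40 / block_size / 8 / 2**20 )) MiB of bitmap"   # -> 64 MiB

A plain bitmap scales gently (tens of MiB per TB); per-block structures like a dedup table, at hundreds of bytes per entry, scale far more aggressively.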
There have been several attempts in ZFS to reduce memory overhead. I'm pretty sure that if you took a decade-old version of ZFS you'd struggle to run it on a system with 512MB RAM.
My first ZFS box (running OpenSolaris 2008.11) ran on 512MB, sometimes including a Gnome 2 desktop. It wasn't fast - I didn't need it to be - but it absolutely did work.
Interesting. I started with ZFS around 2009, and I recall several people struggling to use it on a system with less than 2GB, even after messing with tunables. That was on FreeBSD though, so maybe implementation specific.
I've been using ZFS on Solaris 10 on a 4-socket Pentium Pro @ 200 MHz box with 256 MB of RAM since July 2006. It wasn't a speed demon, but it ran okay for years as our central storage server until we upgraded to faster hardware.
Your stories got me inspired, so I downloaded the FreeBSD 8.0 image, released 2009, and fired it up in a VM with 256MB memory. I then created a 2TB dynamically-allocated disk in the VM and used it to create a single-vdev pool.
ZFS does complain that the minimum recommended memory is 512MB and that I can expect unstable behavior. However, basic file I/O seems to work; I copied some multi-GB files around and such without issues.
So seems the bare minimum was lower than I recalled, at least on a plain system.
A proper test would include heavily fragmenting the pool, and preferably with more vdevs. But it's something.
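For anyone wanting to reproduce something similar, here's a rough sketch of the kind of setup described above, using a sparse file as the vdev instead of a separate virtual disk (pool name and file path are just placeholders):

  truncate -s 2T /var/tmp/zdisk.img            # sparse 2TB backing file
  zpool create -f testpool /var/tmp/zdisk.img
  zpool list testpool
  dd if=/dev/urandom of=/testpool/bigfile bs=1M count=4096   # push a few GB through it
  zpool destroy testpool && rm /var/tmp/zdisk.img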
What's supposed to happen with ZFS is that if the rest of the system needs more memory, ZFS should back off and release some of the memory that it's using. It can do this because most of its memory use is just caching, and while there no doubt is a limit to how far you can take this, I never heard anyone question the effectiveness of this on the OpenSolaris/Illumos implementation.
I don't follow FreeBSD closely, but if memory serves there was a concern, particularly in the early days of the ZFS port, that their implementation couldn't be counted on to release RAM fast enough, if the system suddenly came under significant memory pressure. Hence the advice was to always run with more-than-sufficient RAM to minimize the likelihood of getting into low memory situations. I think this is a significant part of why FreeNAS considers 8GB to be the minimum supported configuration.
So it seems to me that this isn't really about ZFS's RAM requirement, rather it's about hedging against the volatility of the RAM requirements of other software on the same box, in case ZFS can't back off fast enough.
These days I run ZFS on Linux, and I remember about 4 years back spinning up some bulk data processing job that was configured to use 14GB of RAM, on a 16GB box, and watching ZFS's ARC RAM use drop in a single second from 5.5GB to 0.5GB. So I'm satisfied that for my purposes, on ZoL in recent times, this isn't an issue I need to worry about.
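If anyone wants to watch that behaviour themselves on ZFS-on-Linux, here's a quick sketch (paths are the standard ZoL kstat/module-parameter locations; the 4 GiB cap is just an example value):

  # current ARC size and target/max, in bytes
  awk '$1 == "size" || $1 == "c" || $1 == "c_max" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats
  # optionally cap the ARC so it never grows past 4 GiB in the first place (as root)
  echo $((4 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max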
At present 512MB of RAM is notable for how ridiculously tiny it is, while 2TB is still an acceptable amount of storage. Without resorting to decades-obsolete software, can you pin down exactly how much storage it would take to render that tiny amount of RAM unusable, and then explain how much storage it would take to render a machine with 4GB of RAM likewise unusable, so that we may demonstrate memory usage scaling with storage?
My point was merely that your blanket statement doesn't really hold water, since any actual memory requirements by a filesystem would be implementation specific.
I will agree that ZFS should handle large pools once you clear the ~fixed minimum memory requirement.
We've been using an old CORAID box running 24 drives with 50TB usable (100TB actual) on 16GB of RAM using FreeNAS 9.x for years without noticeable problems :) I've tried to upgrade to 32GB a couple times but for whatever reason the board won't allow more than 16GB even though according to Intel docs the RAM should be compatible. We have up to 12 PCs connected at gigabit and never any noticeable lag, even while resilvering, though I'm sure it would be faster if there was more RAM available.
> On Debian unstable (at time of writing, March 2022: bookworm/sid), OpenZFS is not yet available as debian package (zfsutils-linux). Hence you have to compile and build OpenZFS yourself.
> Unfortunately OpenZFS seems not to support (yet) cross-compiling for the RISC-V platform, hence you have to build the kernel on the RISC-V board.
Both of these seem like low hanging fruit, yes? If the upstream code supports it I'm surprised Debian doesn't already have packages, and cross-compiling isn't that special.
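For reference, the native build isn't too bad once the kernel headers are in place; roughly something like this (a sketch of the standard OpenZFS autotools flow, not the exact steps from the article, and the package list is approximate):

  apt install build-essential autoconf automake libtool gawk dkms uuid-dev \
      libblkid-dev libssl-dev zlib1g-dev libudev-dev libaio-dev libattr1-dev \
      libelf-dev python3
  git clone https://github.com/openzfs/zfs && cd zfs
  sh autogen.sh
  ./configure
  make -s -j$(nproc)
  make install && ldconfig && depmod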
I hope this, coupled with more tech on RISC-V hardware, can bring it to the level of the Raspberry Pi, with all the community, hardware devices, accessories and all that.
I hope it won't be a decade, but remember the (original) Raspberry Pi launched on a very mature part, with a _very_ mature (ancient) ISA.
Outside the discount pricing, Intel has promised to tape out SiFive's P650. Rivos, Tenstorrent, and others are also working on fast cores, but it'll be at least 2-3 years before they hit the market, if at all.
So far SiFive's dual-issue in-order core (~40 on Geekbench 5.4.1), like on the now-cancelled BeagleV, is the fastest chip you can buy as a lay person. The D1 (~32 on Geekbench 5.4.1) is cheaper but less powerful.
It will only happen if a company decides to try to be the next Raspberry Pi chip (or as a by-product, like the OG Broadcom chip from the Pi 1).
The actual chip designs, I think, are already there in terms of getting a high-performance RISC-V chip built, but currently the market and tech stack are still getting there, so we are still at the scaled-down-test phase in the high-end market.
FWIW though I don't see much reason to care about the ISA of the CPU beyond it being RISC, chances are it'll still be full of all kinds of closed source crap like the Pi.
>chances are it'll still be full of all kinds of closed source crap like the Pi.
RISC-V's application profiles (like RVA22) do have requirements regarding some hardware that must be present (serial port with a specific interface), boot process and firmware interfaces.
These are there to prevent an ARM-like messy situation.
But can it do dedupe on such a box? I think the recommendation is still "1GB of RAM for each TB of storage" if you're using dedupe...
I still have some boards with ~512mb RAM lying around (an UltraSPARC for example) that I'd love to re-purpose to a cheap NAS, just for the heck of doing it on a non-x86 platform....
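That rule of thumb is really about the dedup table (DDT). A rough sketch of the usual ballpark estimate (the ~320 bytes/entry figure is the commonly quoted in-core DDT entry size, not an exact number, and "tank" is just a placeholder pool name), plus how to check an existing pool:

  # ballpark: unique blocks ~= pool_bytes / recordsize, ~320 bytes of RAM per DDT entry
  pool_tb=1
  recordsize=$((128 * 1024))        # default 128K records; smaller records = much more RAM
  echo "$(( pool_tb * 2**40 / recordsize * 320 / 2**20 )) MiB of DDT per TB"   # ~2560 MiB
  # on a live pool, zdb can report the actual DDT size and histogram
  zdb -DD tank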
I’ve tried dedup out, and even with a large powerful box with a LOT of duplicate files (multi-TB repositories of media files which get duplicated several times due to coarse snapshotting from other less fancy systems), I get near zero deduplication. I think it was literally low single digits percents.
ZFS dedup is block based, and the actual block size varies depending on data feed rate for most workloads (ZFS queues up async writes and merges them), so in practice, once a file gets some non-zero block offset somewhere (which happens all the time), even identical files don't dedup.
While regular dedup is only a win for highly specific workload, the file-based deduplication[1][2] which is in the works seems like it can have some potential.
They discussed it, along with some options for a background-scanning dedup service (trying to find potential files to dedup), in the February leadership meeting[3].
ZFS dedup has been wonderful for me: dedupratio = 7.05x (144 GB stored on a 25 GB volume, and still 1.3 GB left free).
I use it for backups of versions of the same folders and files, slowly evolving over a long period of time (> 15 years); that gives a lot of duplication, of course. (I could also use compression on top of it.)
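(For anyone curious where that number comes from, it's the pool's dedupratio property; something like this, with your own pool name swapped in:)

  zpool list -o name,size,alloc,dedupratio backuppool
  zpool get dedupratio backuppool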
Backups are the 99% case for duplicate files, but aren't snapshots a better replacement in every way? Snapshots are already deduplicated as soon as you take them, plus they're instant. Maybe if your backups are coming from a non-zfs system, but you could probably convert normal backups into snapshots without too much trouble.
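For anyone who hasn't used them, the basic workflow is roughly this (a sketch; the dataset names, snapshot labels, and backup host are made up):

  zfs snapshot tank/data@2022-03-20                     # instant, space shared with live data
  zfs send -i tank/data@2022-03-13 tank/data@2022-03-20 \
      | ssh backuphost zfs recv -u backup/data          # incremental: only changed blocks travel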
Why is dedup even present when the primary use case (backups) is better served in every way by snapshots?
Dedup (if it worked like it might have!) could solve the backup use case without needing to dictate your workflow.
In theory, it could also really help with virtualized disk workloads where there may be a lot of duplicated data from the base OS, but you can't use a snapshot (easily) because Windows won't run from a ZFS filesystem. You could maybe do snapshotting on ZFS volumes (zvols), but that's not as flexible as a dedupe that worked as imagined.
Personally, I think online dedupe ends up being too expensive in memory and computation and ends up missing things because of divergent block sizes or offsets as another poster mentioned. ZFS doesn't support an offline dedupe, but I think btrfs does. That might be more interesting. It's still expensive to find duplicates, but it's possible, and it'd be neat to be able to rewrite the metadata to refer to a single copy and free some space, maybe.
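On btrfs the offline/out-of-band route does exist via the dedupe ioctl, typically driven by a userspace tool. A sketch with duperemove (assuming it's installed; the path is a placeholder and the flags are from memory, so double-check the man page):

  # hash files, find duplicate extents, and issue dedupe ioctls (-d); without -d it only reports
  duperemove -r -d -h /mnt/btrfs/data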
Wow, that’s worse than I realized! Honestly, this makes me wonder whether the feature should even exist in ZFS. Given the enormous hardware requirements and minimal savings... well, I’d be curious to hear if anyone has ever found a real use case.
Does anyone actually use dedup? I think even the OpenZFS documentation says compression is more useful in practice. If at all, dedup should be an offline feature, to be run as scheduled by the operator.
My setup tries to get the absolute highest bandwidth and uses NVMe sticks in a stripe (I get my redundancy elsewhere), no compression, no dedup and yet can only hit ~ 3.5 GB/s reads (TrueNAS Core, EPYC 7443P, Samsung 980PRO, 256 GiB). I hope TrueNAS SCALE will perform better.
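For comparison's sake, this is the kind of fio run I'd use to measure that (a sketch; adjust the directory, size, and job count for your pool, and note that the ARC will happily cache a data set this small on a 256 GiB box):

  fio --name=seqread --directory=/mnt/tank --rw=read --bs=1M \
      --size=32G --numjobs=4 --ioengine=posixaio --group_reporting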
My first ever large (> 4TB) ZFS pool is still stuck with dedup. It's a backup server, gets about 2x with deduplication.
At the time, it was the difference between slow and impossible: I couldn't afford another 2x of disks.
These days, the pool could fit on a portable SSD that would fit in my pocket.
Careful, file-based dedup on top of ZFS might be more effective.
Small changes to single, large files see some advantage with block-based deduplication. You see this in collections of disk images for virtual machines.
You might see that in database applications, depending on log structure. I don't know, I don't have that experience.
For most of us, file-based deduplication might work out better, and is almost certainly easier to understand. You can come up with a mental model of what you're working with, dealing with successive collections of files.
Even though files are just another abstraction over blocks, it's an abstraction that leaks less without the deduplication.
I haven't used a combination of encryption and deduplication. That was Really Hard for ZFS to implement, and I'm not sure how meaningful such a combination is in practice.
> no compression, no dedup and yet can only hit ~ 3.5 GB/s reads (TrueNAS Core, EPYC 7443P, Samsung 980PRO, 256 GiB)
Hmmm, that 3.5GB/s sounds low. From rough memory of doing initial storage benchmarking of our "new" Hetzner dedicated boxes a few months ago (AX51-NVMe, https://www.hetzner.com/dedicated-rootserver/ax51-nvme), they were giving about 10GB/s with mirrored NVMe drives.
It would be nice if ZFS was able to combine dedup and compression - basically be able to notice that a block/file/datastream was similar/identical to another one, and do compression along with a pointer ...
Though there is no clean way to disable either. Compression can be removed from files by rewriting them, but removing deduplication requires copying over all data to a fresh pool.
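The compression half at least can be done per dataset; something like this (a sketch, the dataset name is made up, and rewriting in place like this briefly doubles the space used per file):

  zfs set compression=off tank/data      # only affects newly written blocks
  # existing blocks stay compressed until rewritten, e.g.:
  find /tank/data -type f -exec sh -c 'cp -p "$1" "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;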
Zfs send/recv sends the blocks as written to the original filesystem (which is why it can be so fast: it doesn't have to 'understand' what is happening or defragment things to read, like reading a file does), but that also means undoing or applying dedup won't work correctly, unless it's screwing with things you probably don't want it to.
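(The relevant send flags, if you do want to control what travels in the stream, are roughly these; exact availability depends on your OpenZFS version, and the pool/dataset names are placeholders:)

  zfs send -c tank/fs@snap | zfs recv otherpool/fs     # keep on-disk compression in the stream
  zfs send -w tank/fs@snap | zfs recv otherpool/fs     # raw send (required for encrypted datasets)
  # older releases also had -D for deduplicated streams, since deprecated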
One issue I had is that due to what I eventually tracked down as power issues, I had some corrupted data written to disk under my ZFS pool (at the media write layer), and I had dedup on.
So dedup, unfortunately, actually made it REALLY suck to fix, because I couldn't even copy a new version of the file to the same pool! It kept deduplicating the new copy against the old bad data, so I then couldn't read the copy either. :s
It even did this after I deleted everything, because the pruning couldn't remove the bad underlying entries, since the media was failing.
So delete files, scrub, put new files on resulted in them having the exact same failure.
When I nuked the pool and recreated it, it was all fine though.
Zfs send/recv actually does send data at a logical level, unless instructed otherwise. There are options to send deduplicated streams, streams maintaining compression, and raw streams but none of those are the default. Also, see my reply to a sibling comment.
So at a pool level you might not be able to turn it off once it's turned on, but you can also turn off deduplication per file system, including in properties you set when receiving a stream. I wasn't confident this would work, but a test proved it can. (chicken_test/dedup_source had deduplication enabled and 16 copies of the same 100MiB file)
  chicken:~# zpool list
  NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP   HEALTH  ALTROOT
  chicken_test   15G   144M  14.9G        -         -    0%   0%  16.00x  ONLINE  -
  chicken:~# zfs send chicken_test/dedup_source@send | zfs recv -o dedup=off chicken_test/nodedup_dest
  chicken:~# zpool list
  NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP   HEALTH  ALTROOT
  chicken_test   15G  2.29G  12.7G        -         -    0%  15%  16.00x  ONLINE  -
  chicken:~# zfs get dedup chicken_test/nodedup_dest
  NAME                       PROPERTY  VALUE  SOURCE
  chicken_test/nodedup_dest  dedup     off    local
  chicken:~# zfs destroy -r chicken_test/dedup_source
  chicken:~# zpool list
  NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP   HEALTH  ALTROOT
  chicken_test   15G  2.29G  12.7G        -         -    0%  15%  1.00x   ONLINE  -
> Honestly, this makes me wonder whether the feature should even exist in ZFS.
My understanding is that the OpenZFS project devs feels that the answer is an emphatic no, it should not exist, but they're committed to backwards compatibility so they won't drop it. (Take with a grain of salt; that's an old memory and I can't seem to find a source in 30s of searching.)
Without restricting pretty heavily how you can interact with files, or causing severe bottlenecks, it's probably the best that can be done, since the FS API doesn't provide any real guarantees about what data WILL be written later, how much of it, etc. So it has to figure things out as it goes, with minimal performance impact.
Yeah, I think the author may be mixing up recommendations for dedup vs non-dedup. The solution is always to not enable dedup, it's a niche feature that's not worthwhile outside of very specific scenarios.
It isn't required (in the strict sense) to keep the dedup table in memory; the problem is that performance is dire when it isn't. It would be pretty similar to virtual memory thrashing when the table is not fully in memory.
ADD: Geekbench 5.4.1 on RISC-V
- under QEMU/Ryzen 9 3900XT: 82
- under QEMU/M1: 76
- Native D1: 32 (https://browser.geekbench.com/v5/cpu/13259016)
The M1 result is skewed because for some reason AES emulation is much faster on Ryzen. The rest of the integer stuff is faster on the M1, up to 30% faster.