There still are. As someone who has done both production and homelab deployments: unless you are specifically looking for experience with it and just setting up a demo, don't bother.
When it works, it works great - when it goes wrong it's a huge headache.
Edit: If distributed storage is just something you are interested in, there are much better options for a homelab setup:
- seaweedfs has been rock solid for me for years at both small and huge scale. We actually moved our production Ceph setup to this.
- longhorn was solid for me when I was in the k8s world
- glusterfs is still fine as long as you know what you are going into.
I just want to hoard data. I hate having to delete stuff to make space. Things disappear from the web every day. I should hold onto them.
My requirements for a storage solution are:
> Single root file system
> Storage device failure tolerance
> Gradual expansion capability
The problem with every storage solution I've ever seen is the lack of gradual expandability. I'm not a corporation, I'm just a guy. I don't have the money to buy 200 hard disks all at once. I need to gradually expand capacity as needed.
I was attracted to Ceph because it apparently allows you to throw a bunch of drives of any make and model at it and it just pools them all up without complaining. The complexity is nightmarish though.
ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID. Expansion features seem to be just about to land for quite a few years now. I remember getting excited about it after seeing news here only for people to deflate my expectations. Btrfs has a flexible block allocator which is just what I need but... It's btrfs.
> ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID.
if you don't mind the overhead of a "pool of mirrors" approach [1], then it is easy to expand storage by adding pairs of disks! This is how my home NAS is configured.
Looks good. I don't mind the overhead. That seems to be much more resilient compared to RAID5/6 and its ZFS equivalents and addresses all the concerns I outlined in this comment:
I'm still somewhat alarmed by the possibility of the surviving mirror drive failing during resilver and destroying the pool... Are there any failure chance calculators for this pool of mirrors topology? No doubt it's much lower than the RAID5/6 setups but still.
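For a rough back-of-the-envelope calculator: the two dominant risks during a resilver are an unrecoverable read error (URE) on the surviving drive and a second whole-drive failure inside the resilver window. A minimal sketch, where the URE rate, drive size, rebuild width, and annual failure rate are all assumptions you'd swap for your own drives' spec-sheet numbers:

```python
import math

def p_ure(bytes_read, ure_per_bit=1e-15):
    """Probability of at least one unrecoverable read error while reading
    `bytes_read` bytes, assuming independent errors at the spec-sheet rate
    (1e-15/bit is a common enterprise figure, 1e-14/bit a common consumer
    one; real-world rates are usually better than the spec sheet)."""
    bits = bytes_read * 8
    return 1 - math.exp(-bits * ure_per_bit)

def p_second_failure(resilver_hours, annual_failure_rate=0.01):
    """Probability that the surviving mirror drive dies outright during the
    resilver window, assuming a constant failure rate."""
    hourly_rate = annual_failure_rate / (365 * 24)
    return 1 - math.exp(-hourly_rate * resilver_hours)

drive_tb = 16
# Mirror resilver: read one drive's worth of data from the survivor.
mirror_read = drive_tb * 1e12
# RAID5/6-style rebuild: read every surviving drive, e.g. 5 of them.
raid_read = 5 * drive_tb * 1e12

print(f"mirror resilver URE risk       : {p_ure(mirror_read):.2%}")
print(f"5-drive parity rebuild URE risk: {p_ure(raid_read):.2%}")
print(f"second-drive failure in 12h    : {p_second_failure(12):.3%}")
```

The interesting comparison is the first two lines: a mirror resilver only reads one drive's worth of data from one disk, while a parity rebuild reads every surviving drive, which is why the mirror numbers come out so much lower for the same drive size.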
Is this topology usable in btrfs without the famous reliability issues? How good is ZFS support on Linux? I'm a Linux guy so I'd really like to keep using Linux if possible. Maybe Linux LVM/mdadm?
50% storage efficiency is a tough pill to swallow, but drives are pretty big and the ability to expand as you go means it can be cheaper in the long run to just buy the larger, new drives coming out than pay upfront for a bunch of drives in a raidz config.
ZFS using mirrors is extremely easy to expand. Need more space and you have small drives? Replace the drives in a mirror one by one with bigger ones. Need more space and already have huge drives? Just add another vdev mirror. And the added benefit of not living in fear of drive failure while resilvering as it is much faster with mirrors than raidX.
Sure, the density isn't great as you're essentially running at 50% of raw storage, but - touches wood - my home zpool has been running strong for about a decade doing the above, from 6x 6tb drives (3x 6tb mirrors) to 16x 10-20tb drives (8x mirrors, differently sized drives but matched per mirror, like a 10tb x2 mirror, a 16tb x2 mirror etc).
Edit: Just realised someone else has already mentioned a pool of mirrors. Consider this another +1.
> Replace the drives in a mirror one by one with bigger ones.
That's exactly what I meant by "just as bad as RAID". Expanding an existing array is analogous to every single drive in the array failing and getting replaced with higher capacity drives.
When a drive fails, the array is in a degraded state. Additional drive failures put the entire system in danger of data loss. The rebuilding process generates enormous I/O load on all the disks. Not only does it take an insane amount of time; according to my calculations, the probability of read errors happening during the recovery process is about 3%. Such expansion operations have a real chance of destroying the entire array.
That's not the case with mirrored vdevs. There is no degradation of the array with a failed drive in a mirrored vdev; it continues humming along perfectly fine.
Also, resilvers are not as intensive when rebuilding a mirror, as you are just copying from one disk in the vdev to the other, not reading all X other drives and recalculating parity at the same time. This means fewer reads across the entire array and much, much quicker resilver times, thus a smaller window for drive failure.
I see. That addresses my concerns, and it's starting to make a lot of sense. I'm gonna study this in depth, starting with the post you linked. Thank you.
I've run Ceph at home since the jewel release. I migrated to it after running FreeNAS.
I use it for RBD volumes for my OpenStack cluster and for CephFS, with a total raw capacity of around 350TiB. Around 14TiB of that is NVMe storage for RBD and CephFS metadata. The rest is rust. This is spread across 5 nodes.
I currently am only buying 20TB Exos drives for rust. SMR and I think HSMR are both no-gos for Ceph, as are non-enterprise SSDs, so storage is expensive. I do have a mix of disks though, as the cluster has grown organically, so I have a few 6TB WD Reds in there from before their SMR shift.
My networks for OpenStack, Ceph and the Ceph backend are all 10Gbps. With the flash storage I get about 8GiB/s when repairing. With rust it is around 270MiB/s. The bottleneck I think is due to 3 of the nodes running on first-gen Xeon D boards, and the few Reds do slow things down too. The 4th node runs an AMD Rome CPU, and the newest an AMD Genoa CPU. So I am looking at about 5k CAD a node before disks. I colocate the MDS, OSDs and MONs, with 64GiB of RAM each. Each node gets 6 rust and 2 NVMe drives.
Complexity-wise it's pretty manageable. I deployed the initial iteration by hand, and then when cephadm was released I converted it daemon by daemon smoothly. I find that on the mailing lists and Reddit, most of the people encountering problems deploy it via Proxmox and don't really understand Ceph because of it.
If you're willing to use mirror vdevs, expansions can be done two drives at a time. Also, depending on how often your data changes, you should check out snapraid. It doesn't have all the features of ZFS but it's perfect for stuff that rarely changes (media or, in your case, archiving).
Also unionfs or similar can let you merge zfs and snapraid into one unified filesystem so you can place important data in zfs and unchanging archive data in snapraid.
On a single host, you could do this with LVM. Add a pair of disks, make them a RAID 1, create a physical volume on them, then a volume group, then a logical volume with XFS on top. To expand, you add a pair of disks, RAID 1 them, and add them to the LVM. It's a little stupid, but it would work.
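For illustration, a sketch of that sequence driven from Python; the device names, VG/LV names, and mount point are hypothetical placeholders, and everything needs root:

```python
import subprocess

def run(cmd):
    """Print and execute one storage-admin command."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Initial pair: RAID1 -> PV -> VG -> LV -> XFS (hypothetical /dev/sdb, /dev/sdc).
run(["mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2",
     "/dev/sdb", "/dev/sdc"])
run(["pvcreate", "/dev/md0"])
run(["vgcreate", "vg_hoard", "/dev/md0"])
run(["lvcreate", "-l", "100%FREE", "-n", "lv_hoard", "vg_hoard"])
run(["mkfs.xfs", "/dev/vg_hoard/lv_hoard"])

# Later expansion: add another mirrored pair and grow the same filesystem.
run(["mdadm", "--create", "/dev/md1", "--level=1", "--raid-devices=2",
     "/dev/sdd", "/dev/sde"])
run(["pvcreate", "/dev/md1"])
run(["vgextend", "vg_hoard", "/dev/md1"])
run(["lvextend", "-l", "+100%FREE", "/dev/vg_hoard/lv_hoard"])
run(["xfs_growfs", "/mnt/hoard"])  # assumes the LV is mounted there
```

This gets you the "single root filesystem plus pairwise expansion" requirements; what it doesn't give you, compared to the ZFS pool-of-mirrors approach discussed above, is end-to-end data checksumming.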
If multiple nodes are not off the table, also look into seaweedfs.
Also consider how (or if) you are going to back up your hoard of data.
> Also consider how (or if) you are going to back up your hoard of data.
I actually emailed Backblaze years ago about their supposedly unlimited consumer backup plan. I asked them if they would really allow me to dump dozens of terabytes of encrypted, undeduplicable data into their systems. They responded that yes, they would. I still didn't believe them; these corporations never really mean it when they say unlimited. Plus they had no Linux software.
EOS (https://cern.ch/eos, https://github.com/cern-eos/eos) is probably a bit more complicated than other solutions to set up and manage, but it does allow adding/removing disks and nodes serving data on the fly. This is essential to let us upgrade the hardware of the clusters serving experimental data with minimal to no downtime.
Not sure what the multi-disk consensus is for btrfs nowadays, but adding/removing devices is trivial, you can do "offline" dedupe, and you can rebalance data if you change the disk config.
As an added bonus it's also in-tree so you don't have to worry about kernel updates breaking things
I think you can also potentially do btrfs+LVM and let LVM manage multi device. Not sure what performance looks like there, though
> glusterfs is still fine as long as you know what you are going into.
Does that include storage volumes for databases? I was using glusterFS as a way to scale my swarm cluster horizontally and I am reasonably sure that it corrupted one database to the point I lost more than a few hours of data. I was quite satisfied with the setup until I hit that.
I know that I am considered crazy for sticking with Docker Swarm until now, but aside from this lingering issue with how to manage stateful services, I honestly don't feel the need to move to k8s yet. My cluster is ~10 nodes running < 30 stacks, and it's not like I have tens of people working with me on it.
> Storage optimizations: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication.
This is probably a no-go for most use cases where you work with large datasets...
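To put numbers on that, here is the overhead arithmetic under assumed layouts (the replication counts and erasure-coding profiles are just typical examples of what a Ceph or MinIO pool might use):

```python
def usable_fraction_replication(copies):
    # Each object is stored `copies` times in full.
    return 1 / copies

def usable_fraction_ec(data_shards, parity_shards):
    # Each object is split into data shards plus parity shards.
    return data_shards / (data_shards + parity_shards)

raw_tb = 120  # e.g. 12 x 10 TB drives

for label, frac in [
    ("3x replication", usable_fraction_replication(3)),
    ("2x replication", usable_fraction_replication(2)),
    ("4+2 erasure coding", usable_fraction_ec(4, 2)),
    ("8+3 erasure coding", usable_fraction_ec(8, 3)),
]:
    print(f"{label:20s} -> {frac:.0%} usable, {raw_tb * frac:.0f} TB of {raw_tb} TB raw")
```

Duplication-only designs pay 2-3x raw capacity per usable byte, while erasure coding gets close to parity-RAID efficiency and still tolerates multiple failures, which is why it matters once the dataset gets large.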
Minio doesn't make any sense to me in a homelab. Unless I'm reading it wrong it sounds like a giant pain to add more capacity while it is already in use. There's basically no situation where I'm more likely to add capacity over time than a homelab.
You get a new nas (minio server pool) and you plug it into your home lab (site replication) and now it's part of the distributed minio storage layer (k8s are happy). How is that hard? It's the same basic thing for Ceph or any distributed JBOD mass storage engine. Minio has some funkiness with how you add more storage but it's totally capable of doing it while in use. Everything is atomic.
Ceph is sort of a storage all-in-one: it provides object storage, block storage, and network file storage. May I ask, which of these are you using seaweedfs for? Is it as performant as Ceph claims to be?
I really wish there was a benchmark comparing all of these + MinIO and S3. I'm in the market for a key value store, using S3 for now but eyeing moving to my own hardware in the future and having to do all the work to compare these is one of the major things making me procrastinate.
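One thing that shrinks the comparison work: MinIO, SeaweedFS (via its S3 gateway), Ceph (via RGW), and S3 itself all speak the S3 API, so the same crude benchmark can be pointed at each candidate by swapping the endpoint. A minimal boto3 sketch, with the endpoint, credentials, and bucket name as placeholders:

```python
import time
import boto3

# Point the same code at S3, MinIO, SeaweedFS' S3 gateway, or Ceph RGW by
# changing only the endpoint and credentials (placeholders below).
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.local:9000",   # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

bucket = "bench"
payload = b"x" * (4 * 1024 * 1024)  # 4 MiB object

# Assumes the bucket doesn't already exist; on AWS outside us-east-1 you'd
# also pass a LocationConstraint.
s3.create_bucket(Bucket=bucket)

start = time.time()
for i in range(100):
    s3.put_object(Bucket=bucket, Key=f"obj-{i}", Body=payload)
elapsed = time.time() - start
print(f"wrote {100 * len(payload) / 1e6:.0f} MB in {elapsed:.1f}s "
      f"({100 * len(payload) / 1e6 / elapsed:.1f} MB/s)")
```

It's not a substitute for a proper benchmark (concurrency, mixed object sizes, sustained load), but it makes apples-to-apples smoke tests cheap to run against each backend.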
Minio gives you "only" S3 object storage. I've set up a 3-node Minio cluster for object storage on Hetzner, each server having 4x10TB, for ~50€/month each. This means 80TB of usable data for ~150€/month. It can be worth it if you are trying to avoid egress fees, but if I were building a data lake or anything where the data was used mostly for internal services, I'd just stick with S3.
minio is good but you really need fast disks.
They also really don't like it when you want to change the size of your cluster setup.
There is no plan to add cache disks; they just say to use faster disks.
I have it running and it goes smoothly, but it's not really user-friendly to optimize.
Note that the Red Hat Gluster Storage product has a defined support lifecycle through to 31-Dec-24, after which the Red Hat Gluster Storage product will have reached its EOL. Specifically, RHGS 3.5 represents the final supported RHGS series of releases.
For folks using GlusterFS currently, what's your plan after this year?
Curious, what do you mean by "know what you go into" re glusterfs?
I recently tried ceph in a homelab setup, gave up because of complexity, and settled on glusterfs. I'm not a pro though, so I'm not sure if there's any shortcomings that are clear to everybody but me, hence why your comment caught my attention.
I played around with it and it has a very cool web UI, object storage & file storage, but it was very hard to get decent performance and it was possible to get the metadata daemons stuck pretty easily with a small cluster. Ultimately when the fun wore off I just put zfs on a single box instead.
I have some experience with Ceph, both for work, and with homelab-y stuff.
First, bear in mind that Ceph is a distributed storage system - so the idea is that you will have multiple nodes.
For learning, you can definitely virtualise it all on a single box - but you'll have a better time with discrete physical machines.
Also, Ceph does prefer physical access to disks (similar to ZFS).
And you do need decent network connectivity - I think that's the main thing people think of when they think of high hardware requirements for Ceph. Ideally 10GbE at the minimum - although more if you want higher performance - there can be a lot of network traffic, particularly with things like backfill. (25Gbps if you can find that gear cheap for homelab - 50Gbps is a technological dead-end. 100Gbps works well.)
But honestly, for a homelab, a cheap mini PC or NUC with 10GbE will work fine, and you should get acceptable performance, and it'll be good for learning.
You can install Ceph directly on bare-metal, or if you want to do the homelab k8s route, you can use Rook (https://rook.io/).
Hope this helps, and good luck! Let me know if you have any other questions.
They have a PCIe slot and can take 8th/9th gen intel cpus (6 core, etc). That PCIe slot should let you throw in a decent network card (eg 10GbE, 25GbE, etc).
I run Ceph in my lab. It's pretty heavy on CPU, but it works well as long as you're willing to spring for fast networking (at least 10Gb, ideally 40+) and at least a few nodes with 6+ disks each if you're using spinners. You can probably get away with far fewer disks per node if you're going all-SSD.
I just set up a three-node Proxmox+Ceph cluster a few weeks ago. Three Optiplex desktops 7040, 3060, and 7060 and 4x SSDs of 1TB and 2TB mix (was 5 until I noticed one of my scavenged SSDs was failed). Single 1gbps network on each so I am seeing 30-120MB/s disk performance depending on things. I think in a few months I will upgrade to 10gbps for about $400.
I'm about 1/2 through the process of moving my 15 virtual machines over. It is a little slow but tolerable. Not having to decide on RAIDs or a NAS ahead of time is amazing. I can throw disks and nodes at it whenever.
I’ve run Ceph in my home lab since Jewel (~8 years ago). Currently up to 70TB storage on a single node. Have been pretty successful vertically scaling, but will have to add a 2nd node here in a bit.
Ceph isn’t the fastest, but it’s incredibly resilient and scalable. Haven’t needed any crazy hardware requirements, just ram and an i7.
Yes. I first tried it with Rook, and that was a disaster, so I shifted to Longhorn. That has had its own share of problems, and is quite slow. Finally, I let Proxmox manage Ceph for me, and it’s been a dream. So far I haven’t migrated my K8s workloads to it, but I’ve used it for RDBMS storage (DBs in VMs), and it works flawlessly.
I don’t have an incredibly great setup, either: 3x Dell R620s (Ivy Bridge-era Xeons), and 1GbE. Proxmox’s corosync has a dedicated switch, but that’s about it. The disks are nice, to be fair - Samsung PM863 3.84 TB NVMe. They are absolutely bottlenecked by the LAN at the moment.
I plan on upgrading to 10GbE as soon as I can convince myself to pay for an L3 10G switch.
The main blocker (other than needing to buy new NICs, since everything I have already came with quad 1/1/10/10) is I'm heavily invested into the Ubiquiti ecosystem, and since they killed off the USW-Leaf (and the even more brief UDC-Leaf), they don't have anything that fits the bill.
I'm not entirely opposed to getting a Mikrotik or something and it just being the oddball out, but it's nice to have everything centrally managed.
EDIT: They do have the PRO-Aggregation, but there are only 4x 25G ports. Technically it _would_ meet my needs for Ceph, and Ceph only.
If you want decent performance, you need a lot of OSDs, especially if you use HDDs. And a lot of consumer SSDs will suffer terrible performance degradation with writes, depending on the circumstances and workload.
The hardware minimums are real, and the complexity floor is significant. Do not deploy Ceph unless you mean it.
I started considering alternatives when my NAS crossed 100 TB of HDDs, and when a scary scrub prompted me to replace all the HDDs, I finally pulled the trigger. (ZFS resilvered everything fine, but replacing every disk sequentially gave me a lot of time to think.) Today I have far more HDD capacity and a few hundred terabytes of NVMe, and despite its challenges, I wouldn't dare run anything like it without Ceph.
It's been a while since I've done some benchmarks, but it can definitely do 40MB/s sustained writes, which is very good given the single 1GbE links on each node, and 5TB SMR drives.
Latency is hilariously terrible though. It's funny to open a text file over the network in vi, paste a long blob of text and watch it sync that line by line over the network.
If by "rub" you mean scrub, then yes, although I increased the scrub intervals. There's no need to scrub everything every week.
Works great, depending on what you want to do. Running on SBCs or computers with cheap sata cards will greatly reduce the performance. It's been running well for years after I found out the issues regarding SMR drives and the SATA card bottlenecks.
45Drives has a homelab setup if you're looking for a canned solution.
To learn about Ceph, I recommend you create at least 3 KVM virtual machines (using virt-manager) on a development box, network them together, and use cephadm to set up a cluster between the VMs. The RAM and storage requirements aren't huge (Ceph can run on Raspberry Pis, after all) and I find it a lot easier to figure things out when I have a desktop window for every node.
I recently set up Ceph twice. Now that Ceph (specifically RBD) is providing the storage for virtual machines, I can live-migrate VMs between hosts and reboot hosts (with zero guest downtime) anytime I need. I'm impressed with how well it works.
Proxmox makes Ceph easy, even with just one single server if you are homelabbing...
I had 4 NUCs running Proxmox+Ceph for a few years, and apart from slightly annoying slowness syncing after spinning the machines up from cold start, it all ran very smoothly.
One reason for using Ceph instead of other RAID solutions on a single machine is that it supports disk failures more flexibly.
In most RAIDs (including ZFS's, to my knowledge), the set of disks that can fail together is static.
Say you have physical disks A B C D E F; a common setup is to group RAID1'd disks into a pool such as `mirror(A, B) + mirror(C, D) + mirror(E, F)`.
With that, if disk A fails, and then later B fails before you replace A, your data is lost.
But with Ceph, and replication `size = 2`, when A fails, Ceph will (almost) immediately redistribute your data so that it has 2 replicas again, across all remaining disks B-F. So then B can fail and you still have your data.
So in Ceph, you give it a pool of disks and tell it to "figure out the replication" itself. Most other systems don't offer that; the human defines a static replication structure.
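A toy model of that difference, assuming re-replication finishes before the next failure and the surviving disks have spare capacity (both of which matter in practice):

```python
def static_mirrors(failures):
    """Static pairs mirror(A,B), mirror(C,D), mirror(E,F): data is lost the
    moment both members of any one pair are down, regardless of timing."""
    pairs = [set("AB"), set("CD"), set("EF")]
    down = set(failures)
    return all(not pair <= down for pair in pairs)

def ceph_like(failures, disks="ABCDEF", size=2):
    """Replication size=2 with re-replication assumed to finish between
    failures: each failure leaves one replica alive, which gets copied to
    another surviving disk, so sequential failures are survivable while at
    least `size` disks remain."""
    alive = set(disks)
    for f in failures:
        alive.discard(f)
        if len(alive) < size:   # nowhere left to keep two copies
            return False
    return True

for scenario in [("A", "B"), ("A", "C"), ("A", "B", "C", "D")]:
    print(scenario,
          "static:", static_mirrors(scenario),
          "ceph-like:", ceph_like(scenario))
```

With the static pairs, losing A and then B loses data even though four healthy disks remain; in the Ceph-style model, the data that lived on A has already been re-copied onto the survivors before B fails.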
For the same reason you would use one in enterprise deployments: if set up properly, it's easier to scale. You don't need to invest in a huge storage server upfront, but could build it out as needed with cheap nodes. Assuming it works painlessly as a single-node filesystem, which I'm not yet convinced the existing solutions do.
Not really. Most consumer motherboards have a limited number of SATA ports, and server hardware is more expensive, noisy and requires a lot of space. Consumers usually go with branded NAS appliances, which are also expensive and limited at scaling.
Setting up a cluster of small heterogeneous nodes is cheaper, more flexible, and can easily be scaled as needed, _assuming_ that the distributed storage software is easy to work with and trouble-free. This last part is what makes it difficult to setup and maintain, but if the software is stable, I would prefer this approach for home use.
I'm indifferent towards the distributed nature thing. What I want is ceph's ability to pool any combination of drives of any make, model and capacity into organized redundant fault tolerant storage, and its ability to add arbitrary drives to that pool at any point in the system's lifetime. RAID-like solutions require identical drives and can't be easily expanded.
lol, wrong place to ask questions of such practicality.
that said, I played with virtualization and I didn't need to.
but then I retired a machine or two and it has been very helpful.
And I used to just use physical disks and partitions. But with the VMs I started using volume manager. It became easier to grow and shrink storage.
and...
well, now a lot of this is second nature. I can spin up a new "machine" for a project and it doesn't affect anything else. I have better backups. I can move a virtual machine.
yeah, there are extra layers of abstraction but hey.
You’re absolutely not wrong - but asking a devops engineer why they over engineered their home cluster is sort of like asking a mechanic “why is your car so fast? Couldn’t you just take the bus?”