I know I'm going to suffer hard for this comment. However I'd suggest reading the ZFS optimisation guide.
I know the allure of raid 7 is strong, however if a disk were to fail on a LUN that's 24 disks wide, the chances of a successful rebuild are pretty low (I know you've since changed that). The rebuild time is around 60-90 hours with no load.
(We use 14-disk raid6 LUNs for the balance of performance vs. safety; however, we've come pretty close to losing it all.)
You'll get better performance if you have many smaller vdevs. You'll also compartmentalize the risk of multiple disk failure.
A raid 0+6 of four LUNs of 5 disks each will have much greater performance and will rebuild much faster, without risking too much. However, you will lose more space.
Nope, you're right on the money. My first thought was "this guy needs more VDEVs". Given the hardware and use case, 4x 6-disk RAIDZ2 VDEVs would make a ton more sense, and could also be cross-cabled to tolerate a controller failure.
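For anyone who wants to picture that layout, a rough sketch of the pool creation (device names here are placeholders, not the OP's actual disks):

    zpool create tank \
        raidz2 da0  da1  da2  da3  da4  da5  \
        raidz2 da6  da7  da8  da9  da10 da11 \
        raidz2 da12 da13 da14 da15 da16 da17 \
        raidz2 da18 da19 da20 da21 da22 da23

Each "raidz2" keyword starts a new top-level VDEV, and ZFS stripes writes across all four, which is where the extra IOPS come from.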
ZFS is pretty great, but it's not perfect. Occasionally you have to dump and re-create a whole pool to fix corruption:
[nsivo@news.ycombinator.com ~]$ zpool status -v
  pool: arc
 state: ONLINE
  scan: scrub repaired 0 in 0h43m with 0 errors on Wed Aug 13 20:14:53 2014
config:

        NAME          STATE     READ WRITE CKSUM
        arc           ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            gpt/arc0  ONLINE       0     0     0
            gpt/arc1  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 5h3m with 0 errors on Tue Aug 26 00:19:05 2014
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            gpt/tank0  ONLINE       0     0     0
            gpt/tank1  ONLINE       0     0     0
            gpt/tank2  ONLINE       0     0     0
            gpt/tank3  ONLINE       0     0     0
        logs
          mirror-1     ONLINE       0     0     0
            aacd5p1    ONLINE       0     0     0
            aacd6p1    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /usr/nginx/var/log/nginx-access-appengine.log.0
I can't delete that file or recover it from backup, and don't really trust the pool or our disk controller anymore. So we're moving to a new server, again. Good times.
I imagine this would be disastrous for someone who's filled a 24 disk array with data, and has to purchase a duplicate to safely restore to.
I got ZFS corruption myself, although another issue (which is what actually got my attention) was frequent system crashes whenever the ARC was enabled.
Inspecting the core dumps, it looked like random bit flips.
In my case the problem turned out to be a broken WiFi card (PCI). Since I replaced it, no issues for 2 years now.
Regardless, it's still frustrating that so many failure modes for ZFS have an answer of "blow it all away and start again". And that's with Solaris/ZFS.
There are other ways to fix issues with ZFS pools without having to "blow it all away and start again". In the example that started this conversation fork, there's a simple parameter to pass to the zpool utility that would have saved the OP from dropping his pool:
zpool clear tank
ZFS does actually come with a decent set of tools for fixing and clearing faults - which I'm starting to realise isn't as widely known (possibly because people rarely run into a situation where they need it?)
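(Roughly the sequence I'd reach for before giving up on a pool, using "tank" as a stand-in for whatever the pool is called:

    zpool scrub tank        # re-read everything and verify it against the checksums
    zpool status -v tank    # see what, if anything, is still flagged
    zpool clear tank        # reset the error counters and fault state

That's a sketch of the general workflow, not a guarantee for every failure mode.)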
Anecdotally speaking: I've run the same pool for years on consumer hardware, and that pool has even survived a failing SATA controller that was randomly dropping disks during heavy IO and causing kernel panics. So I've lost count of the number of times ZFS has saved my storage pool from complete data loss where other file systems would have just given up.
I've had the same thing happen on a similar system, with ECC. In that case I eventually tracked it down to a dodgy power supply, but the damage was done... it was time to dump and reload.
Corruption happens. A good filesystem needs to cope.
The memory is ECC. No data I care about was affected. I'm just surprised there's no "do whatever you need to make it consistent so I can move on with life" option and instead I have to bring up a whole new pool.
You should post on the illumos zfs-discuss mailing list. Most of the experts will be there to provide comments. I just saw the hostname of the machine; your "clout" may get you some attention. :-)
You very well may have some flaky hardware which is causing the issue, in which case, it probably needs to be fixed sooner than later. And maybe with some magic, you can get the pool into a happy state without rebuilding.
If you are having random errors like that it might be a sign of something bigger.
As mentioned in my other response, I had random bit flips in memory. After I replaced my broken NIC the crashes stopped. But I got a nasty surprise when I removed a filesystem: a kernel panic due to a failed assertion, and then a kernel panic on the same assertion every single time I tried to import the pool.
I suspect it possibly could be resolved somehow without losing data, but decided to simply restore it since my backup was relatively fresh.
I suspect you might have something similar to what I experienced, and even though you use ECC memory I doubt it would help much when corruption is happening on a bus.
This guy would have more predictable performance and higher reliability if he had organized the array as two 12-drive raidz2 vdevs (instead of 18-drive and 6-drive). The reasons are that (1) the chance of a triple drive failure, though low, is higher in an 18-drive vdev than in a 12-drive one, and (2) performance (IOPS) is uneven between two asymmetrical vdevs.
There is very rarely a good reason for doing asymmetrical vdevs in a single ZFS pool.
I approve of his choice of HBA (based on the SAS2008 chip). For those looking for even bigger setups, I maintain a list of controllers at http://blog.zorinaq.com/?e=10 where you can even find 16-, 24- and 32-port controllers. 8-port ones are the most cost effective as of today.
I have been a happy ZFS user for 9 years, both personally and professionally: using snapshots, running periodic scrubs, having dealt with a dozen drive failures over the years. Everything works beautifully - it's a pleasure to admin ZFS.
Echoing one of the comments -- why Linux? I thought most open source ZFS configurations used FreeBSD. (I know there are the open source Solaris variants, but I would imagine most customers still pay for support.)
There is indeed a great and active open source ecosystem of Illumos (ex-OpenSolaris) distributions.
Illumos is the de-facto upstream of open-zfs and has many features not found in oracle solaris. [1]
So it is well worth checking out at least one of those:
- SmartOS, which is a fantastic Cloud-Hypervisor [2]
- OmniOS, a classic server distribution, probably best for what OP is building. [3]
- NexentaStor, a storage appliance - but I have not tried it yet [4]
I can definitely recommend using ZFS on an Illumos system, as it has much better integration with things like kstats and the fault management architecture, which I would miss on any other OS.
snw's post might be reflecting the fact that OpenIndiana isn't seeing a lot of updates (except perhaps for those that come directly from Illumos.) For example, when the Heartbleed bug came to light, it turned out that OpenIndiana wasn't vulnerable, because the version of OpenSSL it shipped was so old it had never had the vulnerability introduced into it in the first place.
As yet I don't have any hands-on experience with OmniOS, but I'd tend to agree that it seems the best direction to go looking in, for a conventional OpenSolaris-derived OS.
This is indeed the biggest reason. They have a "hipster" branch [1] that is getting some updates, but I don't follow that closely. Another reason is that they try to target the "developer desktop", which is not that interesting to me personally and also probably a lot more work (driver-wise) than they might have the manpower for...
OmniOS on the other hand is pretty solid and with pkgsrc (default package system on SmartOS and NetBSD) you get nearly 13k packages of open-source software.
My List is also missing a few other commercial players that use and contribute to Illumos:
- Delphix, uses ZFS snapshots to do fancy things with big databases
- Pluribus has "Netvisor OS" that is some SDN Router/Networking appliance [3]
And then there is the long tail of small hobby distributions:
- Tribblix, "Retro with modern features" [4]
- DilOS, which has a focus on Xen and Sparc [5]
and probably more that I forgot and don't know.
It is a very nice community and lots of interesting technology gets developed.
Our company migrated nearly all servers from Linux to SmartOS in the last 2 years and we could not be happier.
Thanks, that's helpful! I'm mainly experienced with Debian, but I've been exploring whether Debian-for-everything is still the best policy. In terms of actually trying anything else, I've only been doing some test setups of FreeBSD so far, but "something Illumos" has been on my radar as well and I've been trying to understand the landscape there, along with playing a little bit (not very seriously, I'll admit) on a SmartOS instance through Joyent's cloud.
Technically it seems quite impressive, and Illumos being more or less the ZFS upstream is appealing. The distribution situation has been more confusing, though. OpenIndiana seemed to have a community, which is why I was first looking into that option, but it does indeed not seem to be very active. That's one attraction of FreeBSD & Debian, that they're community supported, and I can with fairly high confidence expect they'll be around in 5 or 10 years supporting the distribution. It's less clear to me how to wade into Illumos in a way that mitigates risk of the upstream going away. I think Illumos itself fits that description, with a community that will be around, but does any individual distribution? There are clearly resources behind OmniOS and SmartOS, which gives some confidence, but they are also quite heavily concentrated resources (one company drives each, and it's not impossible that they could change focus and deemphasize development of the public distribution).
That's a very valid concern. For SmartOS it would be a major blow if Joyent were to pull all resources. They have some of the brightest engineers I know working on it, and I have learned a lot from them just by using the system and hanging out on the IRC channel on Freenode.
Without them the speed of development would definitely take a hit.
On the other hand they have been - and still are - very good at building a community around it. There are multiple people and organisations building their own SmartOS images/derivatives ([1], [2]). Pull requests on GitHub come in from a diverse enough group that I think it would survive even without Joyent.
For OmniOS I don't have enough insight to make that judgement. They have a very clear and detailed release plan and recently hired some good people just to work on OmniOS. These things definitely add some confidence, but of course are not a guarantee forever. And if it were ever needed, one can always buy support from them.
I used the 0.6 series on Ubuntu LTS and it was stable. I have since upgraded my ProLiant MicroServer to pfSense 2.2 to do the routing/firewall duty and it seems at least equally stable.
We've been running ZFS on Linux in production at MixRank for 6 months. We push it hard and haven't had any issues yet. We're using arrays of 18x Intel 530 240GB SSDs in raidz2. It's performing very well: 697k read, 260k write IOPS; 3768 MB/s read, 1614 MB/s write throughput.
I recommend it, but only after you've done enough research to feel comfortable with all of the ways to configure it and to understand the internals at a high level. For a personal setup though, you can just install it and forget it.
Nice build. I have something similar but with 3TB disks. My disks are grouped into sets of 8, because the thought of trying to recover the whole array if it failed makes me want to cry a little.
I also invested in 10GbE; at first it was only server to server, using some SFP+ NICs and DAC leads I picked up off eBay for between £10 and £80. But I have since bought a couple of Dell 5524 switches and a PCIe enclosure for my Mac, and having 650MB/s read/write to my server from the desktop certainly reduces the waiting around :)
$2000 + Chelsio NIC ($230) plus 12 x 4TB WD Red ($2000)
So for about $4250 I get 48TB and a better-performing system in 2U.
His system costs 6012 EUR, or about 7900 USD at today's exchange rate. For $8500 I can fit 96TB in the same space, have 192GB of RAM, 16 cores @ 2.4GHz, 740GB of SSD for ZIL and l2arc.
You're not comparing correctly. Your system would only have 96TB if you weren't using any form of RAID-like caching. The guy from the article is using ZFS caching to improve data reliability.
ZFS SLOG (ZIL on a dedicated device) and L2ARC do nothing for data reliability, they are only performance enhancements. Reliability is guaranteed by the CoW and atomic nature of the FS.
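For reference, both are attached to a pool the same way; a sketch with placeholder device names:

    zpool add tank log mirror nvme0n1p1 nvme1n1p1   # SLOG (mirrored here): accelerates synchronous writes only
    zpool add tank cache nvme0n1p2                  # L2ARC: extends the read cache onto flash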
ZFS uses integrated software RAID (in the zpool layer) for technical reasons, not merely to "shift administration into the zfs command set". Resilvering, checksumming, and scrubs are all parts of a unified concept that is file aware, not merely block aware like nearly every other RAID. The implications of this are massive, and if you don't understand them, please don't write an article on file systems.
The snapshots and volume management are also more usable thanks to the integrated design and CoW, and they serve as the pillars for ZFS send/receive.
The "write hole" piece is bullshit. ZFS is an atomic file system. It has no vulnerability to a write hole.
The "file system check" piece is bullshit. Again, ZFS is an atomic file system. The ZIL is played back on a crash to catch up sync writes. A scrub is not necessary after a hard crash.
Quite frankly, for any modern RAID you probably should be using ZFS unless you are a skilled systems engineer balancing some kind of trade-off (stacking a higher-level FS like gluster/ceph on top, fail-in-place, object storage, etc.). You should even use ZFS on single-drive systems for checksumming and CoW, and for the new possibilities for system management, such as boot environments that let you roll back failed upgrades.
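On an illumos or FreeBSD box, the boot-environment workflow alluded to above looks roughly like this (a sketch using beadm; the environment name is made up):

    beadm create pre-upgrade      # checkpoint the current root filesystem
    # ...run the risky upgrade...
    beadm activate pre-upgrade    # don't like the result? make the old environment the one that boots
    reboot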
Hardware RAID controller quality isn't spectacular, and the author has clearly not looked at the drivers, or he wouldn't dish out such bad advice. You want FOSS kernel programmers handling as much of your data integrity as possible, not corporate firmware and driver developers who cycle out entirely every 2 years (LSI/Avago). And there's effectively one vendor left, LSI/Avago, that makes the RAID controllers used in the enterprise.
ZFS is production grade on Linux. Btrfs will be ready in 2 years, said everyone since its inception and every 2 years thereafter. It's a pretty risky option right now, but when it's ready it will deliver the same features the author bizarrely tries to dismiss in his article. ZFS is the best and safest route for RAID storage in 2014 and will remain so for at least "2 years".
> My zpool is now the appropriate number of disks (2^n + parity) in the VDEVs. So I have one 18 disk RAIDZ2 VDEV (2^4+2) and one 6 disk RAIDZ2 VDEV (2^2+2) for a total of twenty-four drives.
That's a catastrophe waiting to happen. EMC recommends no more than 8 disks in a parity array group (ZFS recommends no more than 9), specifically because larger arrays mean longer rebuilds, which are more likely to trigger a URE, in which case you're SOL.
HAMMER is currently in the middle of a major rewrite to HAMMER2. That's supposed to support multi-master cluster usage, in addition to ZFS/btrfs-like features, so is quite ambitious. The new version is the main focus of full-time development, but not yet considered production-ready. I think the old version is stable-ish but was considered a dead-end by the main developer, so it doesn't have much deployment or development. I would consider the project research-stage at this point, though it's quite interesting research.
HAMMER has great potential... but it's utterly devoid of the momentum required to be anything but a toy at this stage. I am a huge fan of where the developers want to take HAMMER, but I can't run it on any of my hardware to help, so I'm just stuck waiting to see progress.
Isn't there still a danger of typing something silly, an 'rsync --delete' type of command that bricks the whole thing? Rather than the one Borg Cube I think I would prefer to add disks/machines with disks on an as required basis, or keep the last system in the loft 'just in case'.
I am still amazed though at how much disk people 'need'. Even in uncompressed 4K resolution it would take a long time to watch 71TiB of video.
He does not mention it, but ZFS has great support for snapshots (compared to e.g. LVM snapshots). You don't have to carve out disk space in advance to store the snapshots, and they are relatively high performance, so you can have hourly, daily and weekly rotating sets of snapshots (there are some scripts around for that retention). As the snapshots take up space, your available disk space just diminishes. And you can set it up so that while you are in dir "foo", "foo/.zfs/snapshot/16:00/" has the contents of that directory as of the hourly snapshot.
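A quick illustration with an invented dataset name; snapshots are instant, and old file versions show up under the hidden .zfs directory:

    zfs snapshot tank/home@2014-08-26-1600
    zfs list -t snapshot -r tank/home
    ls /tank/home/.zfs/snapshot/2014-08-26-1600/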
Another cool feature: you can send a diff between 2 snapshots as a file stream (over e.g. SSH) and replay it on another machine. So that's like using rsync, except it's based on a stable snapshot (rsync can't fully handle files that change while it's copying them), and you don't have to re-checksum every part of every changed file. So you can use that for DR.
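A minimal sketch of that replication, assuming a dataset tank/data and a remote host called backup:

    # one-time full copy
    zfs send tank/data@monday | ssh backup zfs receive backuppool/data
    # afterwards, ship only the delta between two snapshots
    zfs send -i tank/data@monday tank/data@tuesday | ssh backup zfs receive backuppool/data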
Transparent compression is also great (as is deduping -- if you have tons of spare memory!)
BTRFS will be able to do the same thing... one day.
> BTRFS will be able to do the same thing... one day.
It already can. btrfs subvolume snapshot does snapshots (and they work the same way), btrfs send/receive does the diffs, and it supports lzo/zlib compression and deduplication.
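As far as I know, the btrfs equivalents look roughly like this (paths invented):

    btrfs subvolume snapshot -r /data /data/.snaps/monday
    btrfs send /data/.snaps/monday | ssh backup btrfs receive /backup/
    # incremental, relative to the earlier snapshot:
    btrfs send -p /data/.snaps/monday /data/.snaps/tuesday | ssh backup btrfs receive /backup/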
The last time I tried Btrfs (which was about a year ago on ArchLinux) I found Btrfs snapshots to work very differently from ZFS. At least superficially, as most of the guides I could find advocated using rsync to restore snapshots - even now I find it hard to believe there isn't a cleaner solution.
Aside from all of the positive things already mentioned about ZFS, one of the biggest selling points for me is just how easy it is to administrate. The zfs and zpool command line utilities are child's play to use, whereas btrfs felt a little less intuitive (to me at least).
I really wanted to like Btrfs as it would have been nice to have a native Linux solution that would work on root partitions (my theory being that Btrfs snapshots could provide restore points before running system updates on ArchLinux) but sadly it proved to be more hassle than benefit for me.
I have a similar system at home (almost exactly the same except I don't have the quad network adapter) using 2T drives from Hitachi. I quite love being able to save everything, and the peace of mind that ZFS brings.
As a plug, I've been using CrashPlanPro for offsite backup, I've been very happy with their service.
I'd like something like this but not such a big box. Are there any open source drobo-like projects out there? I'd like something like a drobo but one that's hacker friendly. Anyone know of a chassis like that?
Closest I can think of is the SilverStone DS380, which is a NAS-sized case that gives you 8 hot-swap SATA bays, 4 2.5" internal mounts, and supports mini-ITX motherboards.
... if you're going to use ZFS, you should really think about using FreeBSD. OpenSolaris would be the obvious choice, but given Solaris's lack of support, FreeBSD is second to none in ZFS support...
I have a similar capacity solution running at home, though I have far more (fewer disks per) vdevs. I've been running it for a few years now, so some things aren't as optimized as they could be, but it still works well for my uses...
# zpool list
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
lanayru   73.3T  53.7T  19.6T    73%  1.00x  ONLINE  -
# zpool status
pool: lanayru
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support
feature flags.
scan: scrub repaired 1.58M in 201h47m with 0 errors on Fri Feb 21 13:37:24 2014
config:
NAME STATE READ WRITE CKSUM
lanayru ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
ata-SAMSUNG_HD204UI_______________-part1 ONLINE 0 0 0
raidz2-2 ONLINE 0 0 0
ata-ST2000DL004_HD204UI_______________-part1 ONLINE 0 0 0
ata-ST2000DL004_HD204UI_______________-part1 ONLINE 0 0 0
ata-ST2000DL004_HD204UI_______________-part1 ONLINE 0 0 0
ata-ST2000DL004_HD204UI_______________-part1 ONLINE 0 0 0
ata-ST2000DL004_HD204UI_______________-part1 ONLINE 0 0 0
raidz2-3 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1 ONLINE 0 0 0
raidz2-5 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
raidz2-6 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1 ONLINE 0 0 0
logs
mirror-4 ONLINE 0 0 0
ata-KINGSTON_SV300S37A60G_________________-part1 ONLINE 0 0 0
ata-KINGSTON_SV300S37A60G_________________-part1 ONLINE 0 0 0
cache
ata-KINGSTON_SV300S37A60G_________________-part2 ONLINE 0 0 0
ata-KINGSTON_SV300S37A60G_________________-part2 ONLINE 0 0 0
errors: No known data errors
If anyone has any questions, I'd be happy to try to answer.
Hey mate, I've got a few questions if you don't mind:
How much does it cost in terms of power to have something of that size running 24/7 (I'm assuming it is?)?
Now you've set it up, does it require much of your time to manage monthly?
What sorts of things are you using the space for?
Lastly, how much did it cost in parts/time to get it all up and running? Does it work out cheaper rolling your own, or would it have been cheaper using a cloud solution instead - or does that just not meet your use case/needs?
It is running 24/7; I don't feel comfortable powering it down regularly. That is something I would worry about with the OP's setup: I wouldn't want to subject all those mechanical drives to so many power cycles over time. I don't have figures for only that machine, but my entire rack, which includes a router machine, two ISP modems, the ZFS-running machine and two SAS expanders, averages around 330 watts.
After setting it up, I wouldn't say that it requires any time to manage. Getting it all set up just right, with SMART alerts and capacity warnings and backups and snapshots, all of which I roll myself with various shell scripts, took a long time. Besides that initial investment, the only "management" I have to do is respond to any SMART alerts, add more vdevs as the pool fills up, and manage my files as I would on any other filesystem.
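(For a sense of what those scripts boil down to, a simplified sketch with placeholder device, pool and address names - not my actual scripts:

    smartctl -H /dev/sda | grep -q PASSED || mail -s "SMART alert: sda" me@example.com < /dev/null
    zpool list -H -o capacity tank | tr -d '%' | awk '$1 > 80 { print "pool is over 80% full" }'

That's the general shape of it.)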
I use the space for just about everything. Lots of backups. I use the local storage on all of my workstations as sort of "scratch" space and anything that matters in the long term is stored on the server. The highest density stuff is, of course, media: I have high definition video and photographs from my digital SLRs, I have tons of DV video obtained as part of an analog video digitization workflow, media rips, and downloaded videos. (I even have a couple of terabytes used up by a youtube-dl script that downloads all of my channel subscriptions for offline viewing, and that's something I doubt anyone would do unless they had so many terabytes free.) I keep copies of large datasets that seem important (old Usenet archives, IMDB, full Wikipedia dumps). I keep copies of old software to go with my old hardware collection. I have almost every file I've ever created on any computer in my entire life, with the exception of a handful of Zip and floppy disks that rotted away before I could preserve them, but that is only a few hundred gigabytes. I scan every paper document that I can (my largest format scanner is 12x18 inches, so anything larger than that is waiting for new hardware to arrive someday), so almost all of my mail and legal documents are on there too.
(I had a dream the other night that someone got access to this machine and deleted everything. Worst. Nightmare. Ever.)
A cloud solution would not have met my use case, since one of the primary needs I have is to be self-sufficient in terms of storing my own data, and I also want immediate local access to a lot of the things on there. I do use various cloud solutions, but only for backup, never as primary storage.
Rolling it myself was definitely cheaper than any out-of-the-box hardware solution I've seen. The computer itself is a Supermicro board with some Xeon middle-of-the-range chip and a ton of RAM, and an LSI SAS card. Connected to the SAS card are two 24-bay SAS expander chassis, which contain the drives, which are all SATA.
I'd say that building something like this would cost you maybe about 4000USD, not counting the cost of the drives. The drives were all between $90 and $120 when I bought them, but of course capacity eventually started going up for the same price over time, so let's say another 3500USD for the drives.
Would you recommend building something like this for a much smaller system - 10TiB or so maybe, I do not need that much - or do you think buying a NAS of some kind would be better?
I kind of want to set something like this up while spending the least amount of money. I am comfortable enough with Debian/Linux to do most things, but I have never managed anything like this. In the end I want to end up with somewhere relatively safe to store data pretty much in the same way you are, I just do not need 70TiB, and I have no experience with ZFS/hardware stuff/storage.
By "something like this", do you mean ZFS? I am a HUGE fan of ZFS, and I do think that it's worth using in any situation where data integrity is a high priority.
As far as ZFS on Linux, it still has its wrinkles. I use it because, like you, I'm comfortable with Debian, and I didn't want to maintain a foreign system just for my data storage, and I still wanted to use the machine for other things too. (I actually started with zfs-fuse, before ZFS on Linux was an option.)
So, I don't know. If you just want a box to store stuff on, you might want to just look into FreeNAS, which is a FreeBSD distribution that makes it very easy to set up a storage appliance based on ZFS. FreeBSD's ZFS implementation is generally considered production-ready, so you avoid some ZFS on Linux wrinkles, too.
So, I'd recommend checking out the FreeNAS website, and maybe also http://www.reddit.com/r/datahoarder/ for ideas/other opinions. I do a lot of things in weird idiosyncratic ways, so I'm not sure I'd recommend anyone do it exactly how I have. :)
If you're comfortable with Debian then you shouldn't have too many issues with FreeBSD as there is a lot of transferable knowledge between the two (FreeBSD even supports a lot of GNU flags which most other UNIXes don't).
Plus FreeBSD has a lot of good documentation (and the forums have proven a good resource in the past too) - so you're never going it alone (and obviously you have the usual mailing groups and IRC channels on Freenode).
While I do run quite a few Debian boxes (amongst other Linux distros), I honestly find my FreeBSD server to be the most enjoyable / least painful platform to administrate. Obviously that's just personal preference, but I would definitely recommend trying FreeBSD to anyone considering ZFS.
As far as I'm concerned, the most identifiable characteristic of Debian is the packaging system, dpkg/apt. I've used FreeBSD occasionally, and that's what I always end up missing about Debian. I did consider going with Nexenta or Debian GNU/kFreeBSD, but whatever, ZoL works well enough. :)
FreeBSD 10 has switched to a new package manager, so it might be worth giving it another look next time you're bored and fancy trying something new.
I can understand your preference though. I'm not a fan of apt much personally, but pacman is one of the reasons I've stuck with ArchLinux over the years - despite its faults :)
By 'something like this' I meant pretty much what you just said: Would you do it the same way (your own everything) if you needed a much smaller system, or would you go with something like FreeNAS, like you suggested? I am confident I can get it working well either way, but I would rather not spend half my days having to tweak and worry about stuff working correctly. I understand that it will need maintenance and monitoring of course, but I would much rather be more of an end-user with a working system than the sysadmin who has to fix it all the time. :-)
Well, if you don't get a kick out of "tweaking and worrying", yes, I definitely recommend FreeNAS. Although I'm confident in my system now, it took a long time to get this way, and I could've saved hundreds of hours by just going with something like FreeNAS (had it existed); I stuck with it because I kinda enjoy doing things the hard way.
I do kind of get a kick out of that, but at the same time I also just want a safe system for storing data. If I end up building something like this I will take a look at FreeNAS! Thanks!
I have a similar setup with 12TB capacity. ext4 over mdadm RAID-6 w/ 2 spare drives. It's specifically setup such that any single failure (including SATA expansion card) can't bring down the pool. It's been stable for ~2 years, and it's really nice to have that much storage in the house.
ZFS still protects you from bitrot when compared to ext4 over mdraid. When you get to many terabytes of data, it's almost guaranteed that you're going to lose something to bitrot. In my case, my most recent scrub detected and repaired 1.58MB of bitrot. And in any given month, `zpool status` will show one or two checksum errors as having been corrected in real-time, as I was working with the corresponding files directly.
This is probably the number one thing that excites people about ZFS over any other solution, and it's something that isn't really easily implemented on a standard RAID + standard filesystem arrangement, since this sort of functionality depends on the filesystem knowing about the underlying disk arrangement.
"ZFS uses its end-to-end checksums to detect and correct silent data corruption. If a disk returns bad data transiently, ZFS will detect it and retry the read. If the disk is part of a mirror or RAID-Z group, ZFS will both detect and correct the error: it will use the checksum to determine which copy is correct, provide good data to the application, and repair the damaged copy."
ZFS & btrfs detect and fix silent corruption - where no errors are emitted from the hardware.
I think the pertinent question is: when the filesystem goes to read a 4K block, and one drive's copy of this block in the RAID-1 set is different to its counterpart 4K block on another disk, which one wins?
I don't know how often they fail, but I will say that the failures that I hope I never see again (or at least, I hope I never see again outside of a ZFS system) are not drive failures, but those involving intermittent disk controller or backplane faults. In comparison to the chaos I've seen this cause on NTFS systems, ZFS copes astonishingly well.
No, every default mdadm install performs a complete scrub on the first Sunday of the month. Every block of the array is read back and validated. For RAID modes with parity (e.g. RAID-5, RAID-6) it is able to detect and fix the offending disk when a silent error occurs. You can trigger such a scrub whenever you want (I run mine once a week).
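For reference, that scrub is driven through sysfs (md0 here is just an example device):

    echo check > /sys/block/md0/md/sync_action     # read-only scrub; mismatches are counted
    cat /sys/block/md0/md/mismatch_cnt             # how many mismatched sectors were found
    echo repair > /sys/block/md0/md/sync_action    # rewrite blocks that don't agree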
Scrubbing the entire raid volume is significantly different from scrubbing every piece of data as it gets written/read.
First, in between your monthly/weekly scrubs your disks/controllers will be silently corrupting data, possibly generating errors on multiple devices resulting in data loss depending on raid type. ZFS detects corruption much more quickly.
Second, your traditional RAID recovery is to rewrite the entire device to fix a single block. Let's say you're using RAID5 and you're rewriting parity. You get another block error. Oops, now you've lost everything. Since disks have an uncorrectable block error rate of around 1 in 10^15 bits, you only need a moderately sized array to almost guarantee data loss. ZFS rewrites corrupt data on the fly.
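To put a rough number on it (an illustrative back-of-the-envelope, assuming a 7-drive RAID5 of 3TB disks and the 1-in-10^15 figure above):

    rebuild reads 6 surviving drives x 3 TB = 18 TB  =  1.44 x 10^14 bits
    expected UREs = 1.44 x 10^14 / 10^15  ≈  0.14
    chance of hitting at least one unreadable sector during the rebuild  ≈  13%

Scale that up to bigger drives or wider arrays and the odds get ugly fast.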
Every time you read or write from a RAID volume, it does perform validation and write-back on error detection. I think your mental model of how linux software RAID works needs updating.
I'm not trying to argue that mdadm is better than ZFS, just that in this case they pretty much compare the same.
If the _drive_ reports a read error it will. If there is silent data corruption it won't. You can test this by using dd to corrupt the underlying data on a drive.
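(For contrast, here is the same kind of experiment on a throwaway ZFS mirror built from files, where a scrub does catch and repair the damage. Names, sizes and offsets are arbitrary; don't do this to a real array:

    truncate -s 1G /tmp/d1 /tmp/d2
    zpool create testpool mirror /tmp/d1 /tmp/d2
    dd if=/dev/urandom of=/testpool/junk bs=1M count=200                  # put some data in the pool
    dd if=/dev/urandom of=/tmp/d1 bs=1M count=16 seek=100 conv=notrunc    # silently damage one side
    zpool scrub testpool
    zpool status -v testpool    # checksum errors on /tmp/d1, repaired from the other half of the mirror

zpool destroy testpool when you're done.)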
Almost all of the bitrot I see is on the oldest vdevs, which at this point probably contain mostly only old snapshots that are almost never accessed. My oldest vdevs are... 4-5 years old.
I haven't experienced ANY drive failures. Which means I'm probably on track for one soon...
When I started out, I began with a single 5-disk vdev using a SATA port multiplier enclosure to connect the drives. Over time, I bought more SATA port multipliers, but eventually ran into tons of problems with the SATA PM technology. I do not recommend anyone use SATA PM for anything that really matters, and if you must use it, do not run more than one port multiplier on a single system at a time. So, I had tons of failures where the drives would just drop out due to their enclosure, at least once a month.
After switching to SAS and SAS expanders for drive connectivity, about a year ago, I have had NO problems at all. Rock solid.
Edit: I think I've lucked out by choosing very good, slow, low-temperature drives, first the HD204UI by Samsung, then the WD Red series. With my air conditioning, and the airflow in my rack, the drives average around 32 degrees C (a little colder than Google's report would suggest is best, but close). I would get very anxious if I were running with faster/hotter drives, or "green" drives not designed for 24/7 use.
Thanks. I've got a number of WD Red series drives too, though my rack seems to run a fair bit hotter than yours. (37 degrees C). No failures in a year so fingers crossed.
Hi there. I'd love to hear a little more about what kind of methods of organisation you employ for managing such a large volume of data (particularly the parts which aren't downloaded from somewhere else). Not so much in terms of the storage infrastructure, but in terms of directory structures, links, indexes, etc..
Do you have millions++ of files or just a lot of very large files?
I'm asking because I feel that not enough is published on the subject of personal filing/archiving systems, whereas it's something we all do and there's a lot of best practice sitting out there uncaptured.
A lot of my older files, sadly, are stored in "SORT/Sort Me/To be sorted/Old computer/Sort again/Miscellaneous..." and the like. My server has an mlocate index, so I'll use mlocate, and I'll use find sometimes. I make sure to preserve metadata like last-modified/created dates, so I can use that to narrow things down.
Newer stuff, I try to keep a bit more organized, but I still have lots of unmanaged stuff floating around. For big projects, or big files, that's easy enough; my photos are sorted into a Y/M/D hierarchy, my VHS digitization projects are fairly well organized, some other things have their own structure. For my scanned documents, I just dump them all into a mess of folders, but then have a custom Django app with a management command that indexes them and gives me a nice "document management" website, and then I just search based on OCR'd text or title or date.
I really hate hierarchical filesystems. After using computers for this long, I'm convinced that hierarchy-optional, metadata-driven stuff is the only future I'll be happy in. I long for the ability to save things without really having to say anything about where they're saved, and still be able to find them... So, sorry, I don't think I have a satisfactory answer for you, as I don't think there's a good solution to this problem as long as we have filesystems where the organizational primitive is a hierarchy. Even with tag-based systems that build on top of that, it's usually clunky and you still fundamentally have to figure out where to save something "first", even if you plan to access it via tags/metadata later. Such a pain.
My own approach, if you're interested, is to treat the filesystem as a repository of bytestreams, loosely organised by YYYY folders and then a single level below that, A-Z. I then read everything into a database, deduplicating by file hash and have a 3NF-modelled metadata layer (with 6NF history tables based on the anchor modelling concept) in Postgresql, also with a Django front-end. Only the file hash is stored in the database, not the binary blob. I keep things in sync using Dropbox's delta API.
Or at least, that's the plan. I've only implemented it as far as photo storage is concerned. Haven't yet figured out if Dropbox can be part of the general solution - security and privacy concerns.
I wrote up a spec very similar to this (though I just used the hash itself for the folders, as in a HA/HAS/HASH structure [there's probably a name for that scheme]), but haven't gotten around to implementing it. My main problem with actually implementing such a system is that I don't really like depending on Django or web-based interfaces; I'm a huge fan of files, and UNIX-style tools that operate on them, I just don't like the hierarchical filesystem. I've considered that a FUSE frontend to such a system would probably address most of my concerns, but at that point it's still a big huge abstraction layer that I start to feel uncomfortable with, for nebulous reasons.
But, very nice. It's nice to hear that I'm not the only person driven to such extremes. :)
As you can see, I do not perform scrubs as often as I should. Until I switched to SAS a year ago, I wasn't able to complete a scrub at all. The scrub you see is one of the few I've been able to complete. I need a week or two where I'm not using the filesystem that much, because the scrub really kills performance of the filesystem with the version of ZFS on Linux I'm running. I'm intending to do an upgrade to the latest version of ZoL, and then run a scrub, sometime in the next 3 months.
I haven't upgraded because, well, I haven't really seen a need to? The original reason is that this pool began under zfs-fuse, and when I switched to ZoL I kept the version at the last version supported by zfs-fuse so that I could switch back if needed. I doubt I'll ever switch back, but I do like the idea of maintaining compatibility with other ZFS implementations in case of any problems. I suppose when the OpenZFS unification stuff actually finishes, I'll be happy to upgrade to the latest version?
I think this is where the benefit of using raw disks comes into play; if you develop a problem with ZoL then you can always switch to OpenIndiana or FreeBSD (I run my ZFS array from FreeBSD).
Another question, have you checked your block size is configured correctly? I hadn't even realised that mine were wrong until I'd upgraded to the newer versions of ZFS, which throw the following helpful message:
  pool: primus
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
        Expect reduced performance.
action: Replace affected devices with devices that support the
        configured block size, or migrate data to a properly configured
        pool.
  scan: scrub repaired 0 in 16h44m with 0 errors on Tue Aug 26 17:54:48 2014
config:

        NAME        STATE     READ WRITE CKSUM
        primus      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0  block size: 512B configured, 4096B native
            ada2    ONLINE       0     0     0  block size: 512B configured, 4096B native
            ada3    ONLINE       0     0     0  block size: 512B configured, 4096B native
          raidz1-1  ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
        cache
          ada1      ONLINE       0     0     0

errors: No known data errors
(That's what my pool looks like, just in case you were curious.)
Yeah, my oldest vdevs were configured with 512B block size, because that was the default and the ZFS community wasn't being particularly loud about ashift=12 being a good idea until later. As far as I know, there is no easy way to solve that problem? Is it possible to replace one vdev with another? Off the top of my head, my understanding is that you can replace individual disks, but replacing entire vdevs isn't possible?
If it is possible, yeah, I'll definitely replace the older vdevs entirely with new ones that have better ashift. And, while I'm at it, I'll probably switch to all 6-disk raidz2s, since that's another thing that I only learned too late, that raidz2 works best with an even number of disks in the vdev...
AFAIK the block size is a pool wide setting. So you can't even mix and match block sizes, let alone incrementally upgrade it. And what's more, you can't even change the pool wide setting - so the only solution is to create a new pool and rsync your data up (which is just horrible!)
However I've only done minimal investigation into this issue so if you do find a safe way to upgrade the block size then I would love to know (I'm in a similar situation to yourself in that regard)
It's not pool-wide, it's vdev-specific. You can add new vdevs with more optimal ashift values. With my pool, half of the vdevs have sub-optimal ashift, the other half are good.
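On reasonably recent ZFS on Linux you can also force the value per vdev when you add it, and then check what each top-level vdev ended up with (disk names are placeholders, and the -o ashift option depends on your ZoL version):

    zpool add -o ashift=12 tank raidz2 sdb sdc sdd sde sdf sdg
    zdb -C tank | grep ashift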
I used FreeNAS with ZFS before but was put off by their HW recommendations at around 1GB ECC ram for each TB storage space. They do say home users can soften this up, whatever that means.
Are they totally wrong, do they have some kind of special implementation that requires lots of ram not to fail, or are you just not worried about it?
It's not FreeNAS-specific; ZFS in general likes plenty of RAM (and ECC is a must). It's questionable whether the GB-per-TB rule still holds once you have "enough", in the 8+ GB range of RAM.
I only have 32GB of RAM for ~73TiB under ZFS management, so I have huge SSD cache devices in my pool. The cache devices really lower ZFS' need for RAM, and it works well for me. Before the cache devices, 32GB felt very tight.
So, yes, you definitely need a lot of RAM. Consider cache devices if you can't do the GB/TB ratio.
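Adding (or later removing) cache devices is non-destructive, so it's easy to experiment; device names below are placeholders:

    zpool add tank cache ata-SOMESSD-part2      # grow the L2ARC
    zpool iostat -v tank 5                      # watch per-device activity, including the cache
    zpool remove tank ata-SOMESSD-part2         # cache vdevs can be removed at any time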
I've got some external hard drives that I rotate in and store off-site but still nearby (down the street) for some data. I also have constant online backups running to various locations/services (Linode, AWS, CrashPlan, Dreamhost, some private services). I don't back up everything, only the irreplaceable personal data (so, I'm not backing up Wikipedia dumps); at current count, at most 6TiB of the data is irreplaceable. I also prioritize, so even out of that data, the absolute most important data is backed up more often, to more locations, and the less important data is backed up to fewer locations.
Cool, thanks! That's pretty close to my current approach as well. The super important data is replicated into Dropbox, and a few hundred gigs of somewhat important data is cloned over to a friend's box manually every few weeks.
What does "scrub repaired 1.58M in 201h47m with 0 errors on Fri Feb 21 13:37:24 2014" mean?
I have a much smaller zfs setup at home, as well as a few at work, and I would be super concerned if the scrubs were repairing data. Am I overreacting?
It means I ran a scrub that completed on February 21, 2014 after almost 202 hours, and there were no unrecoverable errors, but there were some corrupt data blocks that were repaired from the pool's redundancy, adding up to 1.58MB of repaired data.
Sorry, I should have worded that better. I know what the message means, but what do you make of the fact that it found & repaired corrupt data? If I understand right (and correct me if I'm wrong), corrupt data should only occur if some bit of hardware is failing or possibly after improper shutdown. Am I wrong? Is it no big deal to see scrubs repairing data?
My understanding is that some significant (thousands of bytes) silent corruption is inevitable when you start reaching huge capacities over long-ish periods of time; even cosmic radiation has the potential to flip a few bits here and there every once in a while.
So, yes, I think even with the best hardware, and proper maintenance, seeing some data repaired in a scrub is to be expected.
(That said, I did spend way too long using technology [SATA PM] that often failed and made it impossible for me to run a scrub. It's very possible that normal error rates are more like what I'm seeing now, a single byte every month or so, and that the megabyte figure is representative of errors from the days of my arrays dropping out unexpectedly.)
If anyone else is wondering like I was: that's about $440 a year in electricity at 15 cents per kWh, which should be a reasonable rate. In some places it will be double or more, though.
I have kids. This translates into lots of home video.
I am also ripping all my DVDs onto my home server. That chews through a bit of storage, but it's also a lot more convenient (and, with kids, less likely to end in tears over scratched or jammed DVDs). BluRays will be next.
Took a 10 day vacation with a single DSLR shooting RAW... 60GB of photos/videos. I've not gone through and deleted anything yet... but I don't really have to either.
I have a similar pile of raw images, which I never got around to cleaning up since I have spare storage space and it is not even that big. Looking at those new Panasonic GH4 cameras that can shoot raw video at some crazy bitrate, you will need loads of storage.
I know a great many people who collect recordings of TV series and movies; it's a hobby that can get quite out of hand disk-wise. Most don't go for such a professional solution though. I know one who had 20+ external USB drives connected to a few servers.