More

jspiros · on Aug 31, 2014

I have a similar capacity solution running at home, though I have far more (fewer disks per) vdevs. I've been running it for a few years now, so some things aren't as optimized as they could be, but it still works well for my uses...

  # zpool list
  NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
  lanayru  73.3T  53.7T  19.6T    73%  1.00x  ONLINE  -

  # zpool status
    pool: lanayru
   state: ONLINE
  status: The pool is formatted using a legacy on-disk format.  The pool can
          still be used, but some features are unavailable.
  action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
          pool will no longer be accessible on software that does not support
          feature flags.
    scan: scrub repaired 1.58M in 201h47m with 0 errors on Fri Feb 21 13:37:24 2014
  config:
  
          NAME                                                  STATE     READ WRITE CKSUM
          lanayru                                               ONLINE       0     0     0
            raidz2-0                                            ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
            raidz2-1                                            ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
              ata-SAMSUNG_HD204UI_______________-part1          ONLINE       0     0     0
            raidz2-2                                            ONLINE       0     0     0
              ata-ST2000DL004_HD204UI_______________-part1      ONLINE       0     0     0
              ata-ST2000DL004_HD204UI_______________-part1      ONLINE       0     0     0
              ata-ST2000DL004_HD204UI_______________-part1      ONLINE       0     0     0
              ata-ST2000DL004_HD204UI_______________-part1      ONLINE       0     0     0
              ata-ST2000DL004_HD204UI_______________-part1      ONLINE       0     0     0
            raidz2-3                                            ONLINE       0     0     0
              ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68AX9N0_WD-____________-part1    ONLINE       0     0     0
            raidz2-5                                            ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
            raidz2-6                                            ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
              ata-WDC_WD30EFRX-68EUZN0_WD-____________-part1    ONLINE       0     0     0
          logs
            mirror-4                                            ONLINE       0     0     0
              ata-KINGSTON_SV300S37A60G_________________-part1  ONLINE       0     0     0
              ata-KINGSTON_SV300S37A60G_________________-part1  ONLINE       0     0     0
          cache
            ata-KINGSTON_SV300S37A60G_________________-part2    ONLINE       0     0     0
            ata-KINGSTON_SV300S37A60G_________________-part2    ONLINE       0     0     0
  
  errors: No known data errors

If anyone has any questions, I'd be happy to try to answer.

apinnes · on Aug 31, 2014

Hey mate, I've got a few questions if you don't mind:

How much does it cost in terms of power to have something of that size running 24/7 (I'm assuming it is?)?

Now you've set it up, does it require much of your time to manage monthly?

What sorts of things are you using the space for?

Lastly, how much did it cost in parts/time to get it all up and running, does it work out cheaper rolling your own or would you have been cheaper using a cloud solution instead, or does that just not meet your use-case/needs?

jspiros · on Aug 31, 2014

It is running 24/7; I don't feel comfortable powering it down regularly; that is something I would worry about from the OP's setup, I wouldn't want to subject all those mechanical drives to so many power cycles over time. I don't have figures for only that machine, but my entire rack, which includes a router machine, two ISP modems, the ZFS-running machine and two SAS expanders, averages around 330 watts.

After setting it up, I wouldn't say that it requires any time to manage. Getting it all set up just right, with SMART alerts and capacity warnings and backups and snapshots, all of which I roll myself with various shell scripts, took a long time. Besides that initial investment, the only "management" I have to do is respond to any SMART alerts, add more vdevs as the pool fills up, and manage my files as I would on any other filesystem.

I use the space for just about everything. Lots of backups. I use the local storage on all of my workstations as sort of "scratch" space and anything that matters in the long term is stored on the server. The highest density stuff is, of course, media: I have high definition video and photographs from my digital SLRs, I have tons of DV video obtained as part of an analog video digitization workflow, media rips, and downloaded videos. (I even have a couple of terabytes used up by a youtube-dl script that downloads all of my channel subscriptions for offline viewing, and that's something I doubt anyone would do unless they had so many terabytes free.) I keep copies of large datasets that seem important (old Usenet archives, IMDB, full Wikipedia dumps). I keep copies of old software to go with my old hardware collection. I have almost every file I've ever created on any computer in my entire life, with the exception of a handful of Zip and floppy disks that rotted away before I could preserve them, but that is only a few hundred gigabytes. I scan every paper document that I can (my largest format scanner is 12x18 inches, so anything larger than that is waiting for new hardware to arrive someday), so almost all of my mail and legal documents are on there too.

(I had a dream the other night that someone got access to this machine and deleted everything. Worst. Nightmare. Ever.)

A cloud solution would not have met my use case, since one of the primary needs I have is to be self-sufficient in terms of storing my own data, and I also want immediate local access to a lot of the things on there. I do use various cloud solutions, but only for backup, never as primary storage.

Rolling it myself was definitely cheaper than any out-of-the-box hardware solution I've seen. The computer itself is a Supermicro board with some Xeon middle-of-the-range chip and a ton of RAM, and an LSI SAS card. Connected to the SAS card are two 24-bay SAS expander chassis, which contain the drives, which are all SATA.

I'd say that building something like this would cost you maybe about 4000USD, not counting the cost of the drives. The drives were all between $90 and $120 when I bought them, but of course capacity eventually started going up for the same price over time, so let's say another 3500USD for the drives.

wtbob · on Aug 31, 2014

With all that data onsite, what are you doing for off-site backups?

jspiros · on Aug 31, 2014

I've got some external hard drives that I rotate in and store off-site but still nearby (down the street) for some data. I also have constant online backups running to various locations/services (Linode, AWS, CrashPlan, Dreamhost, some private services). I don't backup everything, only the irreplacable personal data (so, I'm not backing up Wikipedia dumps); at current count, at most 6TiB of the data is irreplacable.

flexd · on Aug 31, 2014

Would you recommend building something like this for a much smaller system? 10TiB or so maybe, I do not need that much, or do you think buying a NAS of some kind would be better?

I kind of want to set something like this up while spending the least amount of money. I am comfortable enough with Debian/Linux to do most things, but I have never managed anything like this. In the end I want to end up with somewhere relatively safe to store data pretty much in the same way you are, I just do not need 70TiB, and I have no experience with ZFS/hardware stuff/storage.

jspiros · on Sept 1, 2014

By "something like this", do you mean ZFS? I am a HUGE fan of ZFS, and I do think that it's worth using in any situation where data integrity is a high priority.

As far as ZFS on Linux, it still has its wrinkles. I use it because, like you, I'm comfortable with Debian, and I didn't want to maintain a foreign system just for my data storage, and I still wanted to use the machine for other things too. (I actually started with zfs-fuse, before ZFS on Linux was an option.)

So, I don't know. If you just want a box to store stuff on, you might want to just look into FreeNAS, which is a FreeBSD distribution that makes it very easy to set up a storage appliance based on ZFS. FreeBSD's ZFS implementation is generally considered production-ready, so you avoid some ZFS on Linux wrinkles, too.

So, I'd recommend checking out the FreeNAS website, and maybe also http://www.reddit.com/r/datahoarder/ for ideas/other opinions. I do a lot of things in weird idiosyncratic ways, so I'm not sure I'd recommend anyone do it exactly how I have. :)

laumars · on Sept 1, 2014

If you're comfortable with Debian then you shouldn't have too many issues with FreeBSD as there is a lot of transferable knowledge between the two (FreeBSD even supports a lot of GNU flags which most other UNIXes don't).

Plus FreeBSD has a lot of good documentation (and the forums have proven a good resource in the past too) - so you're never going it alone (and obviously you have the usual mailing groups and IRC channels on Freenode).

While I do run quite a few Debian (amongst other Linux) I honestly find my FreeBSD server to be the most enjoyable / least painful platform to administrate. Obviously that's just personal preference, but I would definitely recommend trying FreeBSD to anyone considering ZFS.

jspiros · on Sept 1, 2014

As far as I'm concerned, the most identifiable characteristic of Debian is the packaging system, dpkg/apt. I've used FreeBSD occasionally, and that's what I always end up missing about Debian. I did consider going with Nexenta or Debian GNU/kFreeBSD, but whatever, ZoL works well enough. :)

laumars · on Sept 1, 2014

FreeBSD 10 has switched to a new package manager, so it might be worth giving it another look next time you're bored and fancy trying something new.

I can understand your preference though. I'm not a fan of apt much personally, but pacman is one of the reasons I've stuck with ArchLinux over the years - despite it's faults :)

jspiros · on Sept 1, 2014

I'll keep that in mind; I do sometimes find myself with some time to play with things. :)

flexd · on Sept 1, 2014

By 'something like this' I meant pretty much what you just said: Would you do it the same way (your own everything) if you needed a much smaller system, or would you go with something like FreeNAS, like you suggested? I am confident I c an get it working good either way, but I would rather not spend half my days having to tweak and worry about stuff working correctly. I understand that it will need maintenance and monitoring of course, but I would much rather be more of a end-user having a working system than being the sysadmin that has to fix it all the time. :-)

Thanks for the link, I will take a look there.

jspiros · on Sept 1, 2014

Well, if you don't get a kick out of "tweaking and worrying", yes, I definitely recommend FreeNAS. Although I'm confident in my system now, it took a long time to get this way, and I could've saved hundreds of hours by just going with something like FreeNAS (had it existed); I stuck with it because I kinda enjoy doing things the hard way.

flexd · on Sept 2, 2014

I do kind of get a kick out of that, but at the same time I also just want a safe system for storing data. If I end up building something like this I will take a look at FreeNAS! Thanks!

maaku · on Sept 1, 2014

I have a similar setup with 12TB capacity. ext4 over mdadm RAID-6 w/ 2 spare drives. It's specifically setup such that any single failure (including SATA expansion card) can't bring down the pool. It's been stable for ~2 years, and it's really nice to have that much storage in the house.

You don't need ZFS for this, as cool as it is.

jspiros · on Sept 1, 2014

ZFS still protects you from bitrot when compared to ext4 over mdraid. When you get to many terabytes of data, it's almost guaranteed that you're going to lose something to bitrot. In my case, my most recent scrub detected and repaired 1.58MB of bitrot. And in any given month, `zpool status` will show one or two checksum errors as having been corrected in real-time, as I was working with the corresponding files directly.

This is probably the number one thing that excites people about ZFS over any other solution, and it's something that isn't really easily implemented on a standard RAID + standard filesystem arrangement, since this sort of functionality depends on the filesystem knowing about the underlying disk arrangement.

"ZFS uses its end-to-end checksums to detect and correct silent data corruption. If a disk returns bad data transiently, ZFS will detect it and retry the read. If the disk is part of a mirror or RAID-Z group, ZFS will both detect and correct the error: it will use the checksum to determine which copy is correct, provide good data to the application, and repair the damaged copy."

https://blogs.oracle.com/bonwick/entry/zfs_end_to_end_data

maaku · on Sept 1, 2014

How is that any different from standard RAID? That's exactly the problem RAID was created to solve...

csirac2 · on Sept 1, 2014

ZFS & btrfs detect and fix silent corruption - where no errors are emitted from the hardware.

I think the pertinent question is: when the filesystem goes to read a 4K block, and one drive's copy of this block in the RAID-1 set is different to its counterpart 4K block on another disk, which one wins?

maaku · on Sept 1, 2014

I didn't specify RAID-1. RAID-5 or RAID-6 can reconstruct the correct value in a silent fail.

Honest question: how often to drives silently fail? Drives contain per-sector checksums these days, explicitly to prevent this problem.

radiowave · on Sept 1, 2014

I don't know how often they fail, but I will say that the failures that I hope I never see again (or at least, I hope I never see again outside of a ZFS system) are not drive failures, but those involving intermittent disk controller or backplane faults. In comparison to the chaos I've seen this cause on NTFS systems, ZFS copes astonishingly well.

justincormack · on Sept 1, 2014

A normal raid does not check the checksums on read. It only uses them after a device failure.

Also it may have copies of the data eg raid 1 but does not know which is correct if they differ.

maaku · on Sept 1, 2014

No, every default mdadm install performs a complete scrub on the first Sunday of the month. Every block of the array is read back and validated. For RAID modes with parity (e.g. RAID-5, RAID-6) it is able to detect and fix the offending disk when a silent error occurs. You can trigger such a scrub whenever you want (I run mine once a week).

dsturnbull2049 · on Sept 2, 2014

Scrubbing the entire raid volume is significantly different from scrubbing every piece of data as it gets written/read.

First, in between your monthly/weekly scrubs your disks/controllers will be silently corrupting data, possibly generating errors on multiple devices resulting in data loss depending on raid type. ZFS detects corruption much more quickly.

Second, your traditional raid recovery is to rewriting the entire device to fix a single block. Let's say you're using RAID5 and you're rewriting parity. You get another block error. Oops, now you've lost everything. Since disks have an uncorrectible block error rate of 1 in 10^15 bits, you only need a moderately sized array to almost guarantee data loss. ZFS rewrites corrupt data on the fly.

maaku · on Sept 2, 2014

Every time you read or write from a RAID volume, it does perform validation and write-back on error detection. I think your mental model of how linux software RAID works needs updating.

I'm not trying to argue that mdadm is better than ZFS, just that in this case they pretty much compare the same.

justincormack · on Sept 2, 2014

If the _drive_ reports a read error it will. If there is silent data corruption it wont. You can test this by using dd to corrupt the underlying data on a drive.

maaku · on Sept 3, 2014

Hrm. I'm going to test that.

erkkie · on Sept 1, 2014

That's interesting, I've been running a 6x3TB raidz2 for a year or so on wd reds and no bitrot so far, no checksum errors either, regular scrubs.

jspiros · on Sept 1, 2014

Almost all of the bitrot I see is on the oldest vdevs, which at this point probably contain mostly only old snapshots that are almost never accessed. My oldest vdevs are... 4-5 years old.

easytiger · on Aug 31, 2014

Another question is what, if any, failures have you experienced

jspiros · on Aug 31, 2014

I haven't experienced ANY drive failures. Which means I'm probably on track for one soon...

When I started out, I began with a single 5-disk vdev using a SATA port multiplier enclosure to connect the drives. Over time, I bought more SATA port multipliers, but eventually ran into tons of problems with the SATA PM technology. I do not recommend anyone use SATA PM for anything that really matters, and if you must use it, do not run more than one port multiplier on a single system at a time. So, I had tons of failures where the drives would just drop out due to their enclosure, at least once a month.

After switching to SAS and SAS expanders for drive connectivity, about a year ago, I have had NO problems at all. Rock solid.

Edit: I think I've lucked out by choosing very good, slow, low-temperature drives, first the HD204UI by Samsung, then the WD Red series. With my air conditioning, and the airflow in my rack, the drives average around 32 degrees C (a little colder than Google's report would suggest is best, but close). I would get very anxious if I were running with faster/hotter drives, or "green" drives not designed for 24/7 use.

easytiger · on Sept 1, 2014

Thanks. I've got a number of WD Red series drives too, though my rack seems to run a fair bit hotter than yours. (37 degrees C). No failures in a year so fingers crossed.

jspiros · on Sept 1, 2014

If you haven't seen it, you might find it interesting:

http://research.google.com/archive/disk_failures.pdf

Scramblejams · on Sept 1, 2014

Would you mind sharing what you use for SAS expanders?

jspiros · on Sept 2, 2014

My HBA is an LSI 9207-8e (http://www.lsi.com/products/host-bus-adapters/pages/lsi-sas-...) and my two SAS expander chassis are NORCO DS-24E (http://www.norcotek.com/item_detail.php?categoryid=8&modelno...), which are just cases that include Areca ARC-8026 SAS expanders (http://www.areca.us/products/sascableexpander.htm) built-in.

jl6 · on Sept 1, 2014

Hi there. I'd love to hear a little more about what kind of methods of organisation you employ for managing such a large volume of data (particularly the parts which aren't downloaded from somewhere else). Not so much in terms of the storage infrastructure, but in terms of directory structures, links, indexes, etc..

Do you have millions++ of files or just a lot of very large files?

I'm asking because I feel that not enough is published on the subject of personal filing/archiving systems, whereas it's something we all do and there's a lot of best practice sitting out there uncaptured.

jspiros · on Sept 2, 2014

I hate hierarchical filesystems.

A lot of my older files, sadly, are stored in "SORT/Sort Me/To be sorted/Old computer/Sort again/Miscellaneous..." and the like. My server has an mlocate index, so I'll use mlocate, and I'll use find sometimes. I make sure to preserve metadata like last-modified/created dates, so I can use that to narrow things down.

Newer stuff, I try to keep a bit more organized, but I still have lots of unmanaged stuff floating around. For big projects, or big files, that's easy enough; my photos are sorted into a Y/M/D hierarchy, my VHS digitization projects are fairly well organized, some other things have their own structure. For my scanned documents, I just dump them all into a mess of folders, but then have a custom Django app with a management command that indexes them and gives me a nice "document management" website, and then I just search based on OCR'd text or title or date.

I really hate hierarchical filesystems. After using computers for this long, I'm convinced that hierarchy-optional, metadata-driven stuff is the only future I'll be happy in. I long for the ability to save things without really having to say anything about where it's saved, and still be able to find it... So, sorry, I don't think I have a satisfactory answer for you, as I don't think there's a good solution to this problem as long as we have filesystems where the organizational primative is a hierarchy. Even with tag-based systems that build on top of that, it's usually clunky and you still fundamentally have to figure out where to save something "first", even if you plan to access it via tag/metadata later. Such a pain.

jl6 · on Sept 2, 2014

My own approach, if you're interested, is to treat the filesystem as a repository of bytestreams, loosely organised by YYYY folders and then a single level below that, A-Z. I then read everything into a database, deduplicating by file hash and have a 3NF-modelled metadata layer (with 6NF history tables based on the anchor modelling concept) in Postgresql, also with a Django front-end. Only the file hash is stored in the database, not the binary blob. I keep things in sync using Dropbox's delta API.

Or at least, that's the plan. I've only implemented it as far as photo storage is concerned. Haven't yet figured out if Dropbox can be part of the general solution - security and privacy concerns.

jspiros · on Sept 2, 2014

Very cool.

I wrote up a spec very similar to this (though I just used the hash itself for the folders, as in HA/HAS/HASH structure [there's probably a name for that scheme]), but haven't gotten around to implementing it. My main problem with actually implementing such a system is that I don't really like depending on Django or web-based interfaces; I'm a huge fan of files, and UNIX-style tools that operate on them, I just don't like the hierarchical filesystem. I've considered that a FUSE frontend to such a system would probably address most of my concerns, but at that point it's still a big huge abstraction layer that I start to feel uncomfortable for nebulous reasons aside.

But, very nice. It's nice to hear that I'm not the only person driven to such extremes. :)

laumars · on Aug 31, 2014

How often do you perform scrubs? The best practice for consumer drives is weekly, but at 200+ hours scrub time I can't see weekly being practical :p

Also, why haven't you upgraded your ZFS file system?

jspiros · on Sept 1, 2014

As you can see, I do not perform scrubs as often as I should. Until I switched to SAS a year ago, I wasn't able to complete a scrub at all. The scrub you see is one of the few I've been able to complete. I need a week or two where I'm not using the filesystem that much, because the scrub really kills performance of the filesystem with the version of ZFS on Linux I'm running. I'm intending to do an upgrade to the latest version of ZoL, and then run a scrub, sometime in the next 3 months.

I haven't upgraded because, well, I haven't really seen a need to? The original reason is that this pool began under zfs-fuse, and when I switched to ZoL I kept the version at the last version supported by zfs-fuse so that I could switch back if needed. I doubt I'll ever switch back, but I do like the idea of maintaining compatibility with other ZFS implementations in case of any problems. I suppose when the OpenZFS unification stuff actually finishes, I'll be happy to upgrade to the latest version?

laumars · on Sept 1, 2014

I think this is where the benefit of using raw disks comes into play; if you develop a problem with ZoL then you can always switch to OpenIndiana or FreeBSD (I run my ZFS array from FreeBSD).

Another question, have you checked your block size is configured correctly? I hadn't even realised that mine were wrong until I'd upgraded to the newer versions of ZFS, which throw the following helpful message:

      pool: primus
     state: ONLINE
    status: One or more devices are configured to use a non-native block size.
            Expect reduced performance.
    action: Replace affected devices with devices that support the
            configured block size, or migrate data to a properly configured
            pool.
      scan: scrub repaired 0 in 16h44m with 0 errors on Tue Aug 26 17:54:48 2014
    config:
    
        NAME        STATE     READ WRITE CKSUM
        primus      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0  block size: 512B configured, 4096B native
            ada2    ONLINE       0     0     0  block size: 512B configured, 4096B native
            ada3    ONLINE       0     0     0  block size: 512B configured, 4096B native
          raidz1-1  ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
        cache
          ada1      ONLINE       0     0     0
    
    errors: No known data errors

Just in case you were curious, this is what my pool looks like:

    NAME                      USED  AVAIL  REFER  MOUNTPOINT
    primus                   7.00T   115G   287G  /primus
    primus/audio              774G   115G   774G  /primus/audio
    primus/devel             17.2M   115G  10.9M  /primus/devel
    primus/documents         6.13G   115G  6.08G  /primus/documents
    primus/downloads          275G   115G   275G  /primus/downloads
    primus/git                325M   115G   316M  /git
    primus/jails             10.7G   115G  49.3K  /jails
    primus/jails/alphatrion   212M   115G   752M  /jails/alphatrion
    primus/jails/cybertron   91.6M   115G   729M  /jails/cybertron
    primus/jails/elitaone     370M   115G   940M  /jails/elitaone
    primus/jails/galvatron    750M   115G  1.25G  /jails/galvatron
    primus/jails/megatron     937M   115G  1.31G  /jails/megatron
    primus/jails/template    2.60G   115G   691M  /jails/template
    primus/jails/unicron     5.81G   115G  5.33G  /jails/unicron
    primus/pictures          11.3G   115G  11.3G  /primus/pictures
    primus/videos            5.66T   115G  5.66T  /primus/videos
    zroot                    12.7G  94.6G  6.56G  /

(in case you weren't aware; jails are FreeBSD containers - the FreeBSD equivalent of LXC / OpenVZ)

jspiros · on Sept 1, 2014

Yeah, my oldest vdevs were configured with 512B block size, because that was the default and the ZFS community wasn't being particularly loud about ashift=12 being a good idea until later. As far as I know, there is no easy way to solve that problem? Is it possible to replace one vdev with another? Off the top of my head, my understanding is that you can replace individual disks, but replacing entire vdevs isn't possible?

If it is possible, yeah, I'll definitely replace the older vdevs entirely with new ones that have better ashift. And, while I'm at it, I'll probably switch to all 6-disk raidz2s, since that's another thing that I only learned too late, that raidz2 works best with an even number of disks in the vdev...

laumars · on Sept 1, 2014

AFAIK the block size is a pool wide setting. So you can't even mix and match block sizes, let alone incrementally upgrade it. And what's more, you can't even change the pool wide setting - so the only solution is to create a new pool and rsync your data up (which is just horrible!)

However I've only done minimal investigation into this issue so if you do find a safe way to upgrade the block size then I would love to know (I'm in a similar situation to yourself in that regard)

jspiros · on Sept 1, 2014

It's not pool-wide, it's vdev-specific. You can add new vdevs with more optimal ashift values. With my pool, half of the vdevs have sub-optimal ashift, the other half are good.

laumars · on Sept 2, 2014

That's handy to know. Thanks

jimlei · on Sept 1, 2014

I used FreeNAS with ZFS before but was put off by their HW recommendations at around 1GB ECC ram for each TB storage space. They do say home users can soften this up, whatever that means.

Are they totally wrong, do they have some kind of special implementation that requires lots of ram not to fail, or are you just not worried about it?

ref: http://doc.freenas.org/index.php/Hardware_Recommendations

erkkie · on Sept 1, 2014

It's not freenas specific but ZFS in general likes plenty of ram (and ECC is a must). It's questionable whether the GB for TB holds once you have "enough" in the 8+ GB range of ram

jspiros · on Sept 1, 2014

I only have 32GB of RAM for ~73TiB under ZFS management, so I have huge SSD cache devices in my pool. The cache devices really lower ZFS' need for RAM, and it works well for me. Before the cache devices, 32GB felt very tight.

So, yes, you definitely need a lot of RAM. Consider cache devices if you can't do the GB/TB ratio.

softkitty · on Sept 2, 2014

Yes they are wrong. This is a very soft recommendation. This recommendation comes from community users experience.

aftbit · on Aug 31, 2014

How do you handle backups of that much data? I'm struggling to backup my ~5TB pool.

jspiros · on Aug 31, 2014

I've got some external hard drives that I rotate in and store off-site but still nearby (down the street) for some data. I also have constant online backups running to various locations/services (Linode, AWS, CrashPlan, Dreamhost, some private services). I don't backup everything, only the irreplacable personal data (so, I'm not backing up Wikipedia dumps); at current count, at most 6TiB of the data is irreplacable. I also prioritize, so even out of that data, the absolute most important data is backed up more often, to more locations, and the less important data is backed up to fewer locations.

aftbit · on Aug 31, 2014

Cool, thanks! That's pretty close to my current approach as well. The super important data is replicated into Dropbox, and a few hundred gigs of somewhat important data is cloned over to a friend's box manually every few weeks.

sean13013 · on Aug 31, 2014

What does "scrub repaired 1.58M in 201h47m with 0 errors on Fri Feb 21 13:37:24 2014" mean?

I have a much smaller zfs setup at home, as well as a few at work, and I would be super concerned if the scrubs were repairing data. Am I overreacting?

jspiros · on Sept 1, 2014

It means I ran a scrub that completed on February 21, 2014 after almost 202 hours, and there were no errors, but there were some corrupt data blocks that were repaired from mirrors, adding up to 1.58MB of data that was repaired.

See http://docs.oracle.com/cd/E23823_01/html/819-5461/gbbwa.html... for more information.

sean13013 · on Sept 1, 2014

Sorry, I should have worded that better. I know what the message means, but what do you make of the fact that it found & repaired corrupt data? If I understand right (and correct me if I'm wrong), corrupt data should only occur if some bit of hardware is failing or possibly after improper shutdown. Am I wrong? Is it no big deal to see scrubs repairing data?

jspiros · on Sept 1, 2014

My understanding is that some significant (thousands of bytes) silent corruption is inevitable when you start reaching huge capacities over long-ish periods of time; even cosmic radiation has the potential to flip a few bits here and there every once in a while.

http://en.wikipedia.org/wiki/Data_corruption#Silent_data_cor...

So, yes, I think even with the best hardware, and proper maintenance, seeing some data repaired in a scrub is to be expected.

(That said, I did spend way too long using technology [SATA PM] that often failed and made it impossible for me to run a scrub. It's very possible that normal error rates are more like what I'm seeing now, a single byte every month or so, and that the megabyte figure is representative of errors from the days of my arrays dropping out unexpectedly.)

maccam94 · on Sept 1, 2014

Hey Jay, you should probably set up a cron job for scrubs :) (or, 6+ months is probably a bit long for an interval...)

-Cam

edit: never mind, just saw someone else asked about that and your response since I opened the page earlier today.

jspiros · on Sept 1, 2014

Yes, I should, and someday I will. I'm playing it fast and loose right now. :-p

pkaye · on Aug 31, 2014

How much extra power do you consume to run such a system 24/7?

jspiros · on Aug 31, 2014

~330 watts.

pedrocr · on Aug 31, 2014

If anyone else is wondering like I was that's 440$ a year in electricity at 15 cents per kWh, which should be a reasonable rate. Some places it will be double or more though.

jspiros · on Aug 31, 2014

I'm in Ohio, where it apparently averages around 0.11USD per kWh, so probably closer to ~320USD here.

jspiros · on April 6, 2014

Huh, here I am using iTerm2 every day and I didn't realize it had support for this. Fantastic.

jspiros · on April 6, 2014

The ReGIS support is nice, but with the custom functionality it would be nice if instead of echoing an ASCII string like "HTERMFILEXFER" it used actual escape sequences.

Terminology[1] does this correctly with its own escape sequences. But, unfortunately, it doesn't actually use the escape sequences to transfer file data, even though it could. Instead it just transfers URLs, which isn't as network-transparent as I would like.

(Background: I did some research a couple of months ago into what it would take to build a graphical terminal emulator, to support a backwards-compatible network-transparent TermKit[2]-like experience.)

(Edit: Apparently the latest nightlies of iTerm2 also supports this sort of functionality, and does so properly with escape sequences.[3] Thanks, gnachman.)

[1]: http://www.enlightenment.org/p.php?p=about/terminology

[2]: http://acko.net/blog/on-termkit/ and https://github.com/unconed/TermKit

[3]: http://www.iterm2.com/images.html

new299 · on April 6, 2014

Yes, it would be better to use escape sequences. If I remember correctly there were issues with terminal multiplexers stuffing data into the stream. So you'd need something smart serverside to escape each chunk.

In the current implementation, the server can be pretty dumb (i.e. all you need is a bash script). This was a requirement for me, because I spend a lot of time jumping between different Linux boxes which aren't configured as I might like them to be (in particular I wanted something that work when bouncing though odd ssh gateways etc).

So overall, the png rendering is intended as a quick hack, for when you want to render data quickly. As a proper inline rendering solution, it would be nice to add support for Sixels (http://en.wikipedia.org/wiki/Sixel) or something similar. If you google around I think there are a few terminals that support that, but none of them suited my requirements.

Overall, I was kind of happy with the code as a proof of concept, and still use the iOS version every day (it's a free iOS ssh client and does all the graphical support stuff). There's a lot that could be done though. Ideally I'd like to abstract out the ReGIS code, add Sixels support and then integrate that code into existing Linux and OSX terminal apps.

terhechte · on April 6, 2014

Thanks for the link to iterm2 + images! I've wanted this functionality for years and didn't know that the iTerm nighty supported it now. Fantastic!

jspiros · on Feb 2, 2014

I'd imagine this works similarly to how a slice of bread in a bag of cookies keeps the cookies chewy, as the sugar attracts and holds the water from the bread, and keeps the cookies from drying out. Something to do with sugar being hygroscopic.

vxNsr · on Feb 2, 2014

And now I know how to keep my cookies chewy, thank you jspiros.

jspiros · on Jan 24, 2014

What are you looking for in a city, besides friendly people?

jspiros · on Jan 15, 2014

I've used xpra for about a year now, and I do recommend it if it seems like it would fill a need in your workflow.

For me, I'm using it primarily as a way to run Pidgin (for Jabber and AIM, primarily) on my server that I can connect to from the various machines I use around the house; analogous to how I use irssi in screen for IRC.

I also configured my screen instance to have a DISPLAY environment variable that corresponds to my Xpra instance, so if I run anything within screen that connects to an X server, I can be sure it's connecting to the X server I'm currently connected to anyway, and I'm mostly safe from sudden disconnection problems.

dllthomas · on Jan 15, 2014

I presume you're familiar with Finch (the console-based Pidgin equivalent - still uses libpurple), but I figured I'd mention it as another good way to get jabber/aim remotely.

jspiros · on June 9, 2013

If I could vote this up a thousand times, I would. Very nice metaphor.

jspiros · on June 3, 2013

This story reminds me of my experience growing up and going to public school in Ohio (in the United States). When I did classwork, I did it very well. When I took tests, I was always in the top of my class (which led to me being in the highest-level classes offered at my schools). And when it came to class participation, the teachers always had praise for me. I spent most of my time outside of school learning new things from books, the internet, and everyone around me.

But I almost never did my homework.

This was almost the sole source of strife between my mother and I, and between both of us and school administrators. Going from 8th to 9th grade required a special exception, and recommendations from almost every one of my teachers, as my GPA was below the point normally required to progress to the next grade. I am lucky in that the teachers I had were sympathetic enough to stand up for me, and confident in my abilities using other methods of evaluation.

We (my mother and I) considered other schools, other formats of school, but in the end we settled on something modeled after Unschooling. When I was 13-going-on-14, halfway through my 9th grade year (freshman year of high school in the United States), my mother took me out of school using the legal provision for homeschooling. As she was a single working mother, I spent the rest of the years that my peers were in high school learning things on my own and seeking out my own learning opportunities. (A year or so later, I learned C, Objective-C, and the Cocoa frameworks and started a Mac applications business a year or so later with friends I met on the internet.)

Granted, I had the following things going for me: a supportive mother, generally being "well-behaved", living in a small town where it was safe to leave me to my own devices, living in a college town which gave me access to more learning resources, and an in-built self-motivated learning style. I also think it helped that I went through the public school system for many years before I left, as it meant I had friends and a social life that stuck with me even after leaving the school itself. So, I'm sure it's not for everyone, but for some, and maybe OP's son, it's worth considering.

https://en.wikipedia.org/wiki/Unschooling

jspiros · on May 22, 2013

I posted this as a comment on the article, but I figured I should mention it here too: I’ve switched to Twilio + OpenVBX (http://openvbx.org/) as a replacement for Google Voice. The per-month cost is low (something like $2/month per number), though you do have to pay a per-minute charge when you’re actually handling calls. I think transcription might be an additional fee, too.

jspiros · on May 11, 2013

Looks like it.