Rockstor, a Linux and BTRFS Based NAS Solution (rockstor.com)
71 points by schakrava on Sept 26, 2014 | 49 comments



The only data protection options I could find were RAID 1 and 10 (RAID 0 is a performance option), and since the chance of data loss when attempting to re-silver a 3TB mirror is about 1 in 5, data protection here is not enterprise quality yet.

The UI stuff is great, but the tricky bit about building a storage system is not provisioning it or getting the access protocols right, it is finding all the ways that data can be destroyed (both silently and noisily) and guarding against them. So if you want to stick with the enterprise target, then you need something like the ZFS on Linux page, which describes every way you can get data zapped and how you will prevent that from happening.

If you just want to be an off-the-shelf "hey, here's something that will turn your access point into something like a NAS device" product, then you get to lose data when a disk goes bad, or a memory chip goes bad, or a network cable is loose, or the power supply cuts out, or the cat knocks it off the table, etc.


Thank you, I couldn't agree more. It's a tall order and we have our work cut out for us.

I've started to do some DR testing myself, but it will take a little while to publish our findings and recommendations.


Where did you get your hilarious "data loss on attempting to re-silver a 3TB mirror is 1 in 5" statistic from?


The non-recoverable bit error rate spec.

NetApp tracks it with their NearStore product line, which used SATA drives in a NAS box (they have been tracking it for a while, actually; when I left they had data on about 65 million drive hours), and while Seagate quotes it at 1 in 10^15 bits, it's actually closer to 5 in 10^15 bits. A 3TB drive has 3x10^13 bits of data (closer to 3x10^14 when you account for track markers and error recovery bits).

If you're bored some time, try reading every sector from one of these drives. To maximize your chance of success, make sure you operate the drive at a slightly warm temperature (keeps the lubricant from sticking) and isolate it from vibration. It's worse if you read it randomly (you will get some arm servo movement anyway, just because the drive will have replaced some blocks from spares, but minimizing it also keeps vibration down).

Long before it became an issue on single drives, like it is today, it was an issue when trying to reconstruct a RAID4 (or 5) group that was 3.5TB (which at the time was a 7-disk RAID group of .5TB drives). 14-disk groups (a full shelf) were pretty much guaranteed to see a second error in the shelf during reconstruction, which was also why RAID6, or dual-parity RAID, became a must-have enterprise feature back in 2005 or thereabouts.
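
(A back-of-the-envelope Python sketch of that reconstruction risk: to rebuild one failed drive you have to read every surviving drive in the group, so the question is the chance of hitting at least one unrecoverable read error across all of that data. The per-bit rate and group sizes below are just figures from this thread, not measurements, and the result swings a lot with the assumed rate.)

    import math

    def p_reconstruction_hits_ure(surviving_drives, drive_tb, rate_per_bit):
        # bits that must be read to rebuild the failed member
        bits_to_read = surviving_drives * drive_tb * 1e12 * 8
        # P(at least one unrecoverable error) = 1 - (1 - rate)^bits
        return -math.expm1(bits_to_read * math.log1p(-rate_per_bit))

    print(p_reconstruction_hits_ure(6, 0.5, 1e-14))    # 7-disk group of .5TB drives: ~0.21
    print(p_reconstruction_hits_ure(13, 0.5, 1e-14))   # 14-disk group (a full shelf): ~0.41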

On an interesting side note, because the chance of hitting an unrecoverable read error is evenly distributed through a drive, 3X replication is still recoverable even with intermittent read failures. There isn't really a RAID number for that, but it does work reasonably well and avoids a pesky parity calculation if you embed check data in your blocks, as they do in GFS.
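
(A minimal sketch of that replication point, assuming independent, uniformly distributed read errors; the per-block failure probability below is purely illustrative, not a measured figure. A block is only lost if every copy is unreadable at that same block, so the per-block loss probability falls as p^n.)

    block_bytes = 4096
    volume_bytes = 3 * 10**12               # a 3TB volume, as in the thread's example
    blocks = volume_bytes // block_bytes

    p = 1e-8                                # assumed chance one copy of a block is unreadable

    for copies in (1, 2, 3):
        expected_lost = blocks * p**copies  # lost only if all copies are bad at the same block
        print(copies, "copies: expected unreadable blocks ~", f"{expected_lost:.2g}")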

[1] https://www.usenix.org/legacy/publications/library/proceedin... -- Peter Corbett's paper (he is the guy who invented NetApp's dual-parity system), and from that paper the following --

"Disks protect against media errors by relocating bad blocks, and by undergoing elaborate retry sequences to try to extract data from a sector that is difficult to read [10]. Despite these precautions, the typical media error rate in disks is specified by the manufacturers as one bit error per 1014 to 1015 bits read, which corresponds approximately to one uncorrectable error per 10TBytes to 100TBytes transferred. The actual rate depends on the disk construction. There is both a static and a dynamic aspect to this rate. It represents the rate at which unreadable sectors might be encountered during normal read activity. Sectors degrade over time, from a writable and readable state to an unreadable state."

And experience from the field puts it at about one error per 15TB transferred, so 3TB into 15TB: one in five.


3TB is 310^12 bytes assuming the decimal bytes used in the storage industry. The uncorrectable bit error rate is for the raw block storage. It does not include the low-level formatting, which is no more than 20% of the storage on 512-byte sector drives and less than 10% on advanced format drives. The probability of an uncorrectable bit error when copying 3TB (using decimal bytes) is approximately 1.5% under the assumption of a 5 in 10^15 uncorrectable bit error rate:

[1 - (1 - 5 * 10^-15)^(3 * 10^12)] ~ 0.01488...

If your 20% figure is accurate, the actual uncorrectable bit error rate would need to be something like 7 in 10^14. I am not disputing your empirical information, but your numbers do not agree with it. The difference between what your numbers say and what you say is only about one order of magnitude. Doing statistical calculations with better records could allow the cause of that to be identified.
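
(A minimal Python sketch of the arithmetic being debated here: the probability of at least one uncorrectable error over n units transferred at a given per-unit error rate is 1 - (1 - rate)^n. The inputs are the figures quoted in this subthread; log1p/expm1 are used only to keep the tiny rate numerically stable.)

    import math

    def p_at_least_one(rate, n):
        return -math.expm1(n * math.log1p(-rate))

    print(p_at_least_one(5e-15, 3 * 10**12))   # ~0.0149, the ~1.5% figure above
    print(p_at_least_one(7e-14, 3 * 10**12))   # ~0.19, roughly the "1 in 5" seen in the field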


And to be clear, it is a bit error rate, not a byte error rate. Nominal coding of data on magnetic media is 10 bits per 8-bit byte, although a specific drive may use a different encoding on the platter. The Barracuda included 5120 NRZ-encoded bits per sector and a 48-bit NRZ-encoded checkword, giving it a nominal 10.094 bits per byte. You're off by one decimal order of magnitude in the number of bits.
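
(A follow-on sketch of the bit-vs-byte point, reusing the same helper as above: redo the calculation with nominal on-platter bits instead of bytes, taking the 5120-bit sector payload plus 48-bit checkword from this comment and the 5 in 10^15 per-bit rate quoted earlier in the thread.)

    import math

    def p_at_least_one(rate, n):
        return -math.expm1(n * math.log1p(-rate))

    bits_per_byte = (5120 + 48) / 512          # ~10.094 bits per byte, as stated above
    bits_3tb = 3 * 10**12 * bits_per_byte      # ~3.03e13 bits for 3TB of data

    print(bits_per_byte)                       # 10.09375
    print(p_at_least_one(5e-15, bits_3tb))     # ~0.14 -- an order of magnitude above ~1.5%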


Just to be clear, I meant 3 * 10^12, not 310^12. The arithmetic that I posted uses the correct number.


To avoid markdown mangling, either use backslashes to escape your asterisks in paragraphs, or surround them with spaces, or indent short lines that have "special" characters by four spaces.


So every time you do a zfs scrub on a large pool (many TB) you should see errors that are detected and corrected.

But you don't...


> raid 0 is a performance option

Once I won a bet that RAID 1 was actually faster than RAID 0 in a given scenario.


My very first questions regarding a potential storage solution revolve around data loss:

    1. Can we enumerate the data loss scenarios?
    2. How is drive failure handled?
    3. How may data be corrupted and such corruption detected?
    4. For every data loss scenario, what is the recovery procedure?
Here is all I could find: http://rockstor.com/docs/faq.html#how-do-i-prevent-data-loss...

Of course, there is a wealth of information on such questions for standard RAID, but I would suggest, for marketing purposes, that Rockstor synthesize the available information (from the many relevant layers of data management) in a coherent fashion, specific to their product. It doesn't have to be deep, but it should be at least minimally comprehensive and broad, with pointers to more detailed, layer-specific information.

Also, it's fine if the recovery procedure is "restore from backup" for, e.g., the scenario where data is deleted by an authorized user. If so, there should be at least a minimal "backup story".


That is great feedback for us. I've added a documentation issue with your feedback: https://github.com/rockstor/rockstor-doc/issues/37

We have added appliance <-> appliance replication recently, which can play an important role in recovering from bigger disasters.

We'll have all that documented. Please feel free to participate on our github.


the gui looks pretty cool. personally i would not trust btrfs for a nas. i have not had the best experience running various production servers with btrfs. i switched (back) to zfs and never looked back, it's just better in every regard.

i also administer a freenas box for a small business and this stuff is rock solid, i would only wish for an _easy_ solution to get the permission stuff right in a multi-user setting.

nonetheless, thumbs up for creating this, cool stuff!


I'm in the process of rolling out btrfs on a lot of production servers (no raid, just subvolumes and compression) using Ubuntu 14.04 - what problems did you encounter with btrfs?


I've hit problems like reaching ENOSPC (even though the data extents were only 70% full) on a colocated server, where there wasn't enough free space to run a balance operation to get more free space. (The docs literally suggest inserting a USB stick and adding it to your array to help make the balance work...)

Also, the fsck tool is still very immature. It takes many years to get good at detecting and recovering from corruption.


had that same problem, and additionally performance problems on SSDs with fast writes (postgresql), even when turning off COW


If you don't already, I strongly recommend lurking on the btrfs mailing list. There are regular fixes to balancing, ENOSPC, send/receive and the btrfs-progs tools; occasional questions and fixes related to the compression code.

Be prepared to update your kernels and tools often and independently of your vendor. Btrfs-progs will likely need to come from the git repo, so building your own packages for distribution around your production nodes will probably be necessary too.

A word of caution: do not run btrfsck without consulting the wiki and mailing list first, and ideally knowing exactly what you are doing. There are situations you'll encounter that do not require btrfsck to repair (other tools are the right choice instead), and running it can potentially make a recovery less likely.

FWIW, I have been watching the list for years, and reading regularly for about 6 months trying to get a sense of stability with respect to the features I want.

I would not put btrfs in production yet. Though likely soon... I'd guess another year or so.


Oh my god. The debate was between ZFS and btrfs, and although I favored ZFS, the extra kernel module and the upcoming support in distros led to the decision for btrfs. However, we won't do anything fancy with it: basically just using the whole disk for a distributed filesystem without snapshots. We use btrfs for its checksumming and scrubbing (weekly/monthly) to detect corrupt disks and data, maybe compression with lzo, and subvolumes. As far as I understood, this should be safe?

New kernels should be no problem, as Ubuntu will likely provide an HWE stack in the future, and btrfs-tools is in a well-maintained PPA...

Damn, I should have pushed ZoL through.


I wouldn't use ZoL either -- I read that mailing list for quite a while too, and skimmed most of the issues on GitHub. As of about six months ago, lockups were too frequent for my taste. All the implementations are improving though, and the OpenZFS movement is promising. A caution here too: if you use ZFS, all implementations are not equal; you'll need to research the specifics for each platform on which you intend to use it, and the compatibility [with other implementations] if you want to move the file system [to a different platform]. If I were rolling out ZFS, I'd only use it on Illumos/OpenIndiana (vs., say, ZoL).

I have been waiting and watching for a long time for most of these "new" filesystem features (pools, fs-level RAID, checksums, send/receive), but I am a "filesystem conservative" (especially in production; less so on my own machines) -- I'll keep waiting a while longer. On production Linux today, I stay with ext4 or XFS.


Thanks. We believe it won't be long before btrfs is trusted enough.

Can you elaborate on the permission-stuff issue you have? We'd love to get it right in Rockstor.


basically you need to integrate an ldap server with a decent frontend.


> better in every regard

Can't remove raidz's from zpools, but `btrfs device delete` exists.


But btrfs lacks raid5/6, and even N-way mirrors. ZFS is not perfect, but btrfs is not even close in terms of features.


Btrfs does support raid5/6; I'm using it right now. It is still being refined and has a couple of rough edges, but I haven't had any problems in the year or so I've been using it. It is not "production ready" yet for sure, but the support is there.


Everything I've read (status link from the official wiki: http://marc.merlins.org/perso/btrfs/post_2014-03-23_Btrfs-Ra...) says don't touch raid5/6 yet. Maybe the pages are overly pessimistic, but the lack of recovery features sounds like a no-go to me?


in an enterprise setting you rarely want to remove devices/space.


Rarely is not never.


That's not a bug.


Interesting =).

I'm currently running Freenas with ZFS.

Would be curious to see how this compares.

The one thing missing for me on FreeNAS is some kind of file search/indexing feature.

I wonder if the fact that this is Linux based will make adding something like that easier.


Perhaps. I have some ideas about search features. We can also get some cool stats efficiently from btrfs trees. But I'd love to hear your thoughts. Could you give more input on the search/indexing features you wish to see? You can even write to us directly -- support@rockstor.com -- or file an issue on github: https://github.com/rockstor/rockstor-core


I've just filed a support ticket here:

https://github.com/rockstor/rockstor-core/issues/484

Hmm, stats - I don't know much about this topic, but I'd be keen to hear more about what's possible.

On the file indexing front, I think Recoll and Tracker/MetaTracker are the two most active projects - Recoll being the more active one. Strigi and Beagle are both discontinued.


All three of the server hardware suggestions are discontinued.


Anyone have suggestions for better servers? I wonder if Rockstor would work well with the Backblaze case. Maybe some of the OCP cases would work. Anyone played with those?


I wish I knew firsthand how Rockstor would work with Backblaze. But 45Drives can ship them with CentOS, which is what Rockstor is based on.

I've had the opportunity to install Rockstor on various HP Gen7 and Gen8 servers and had no problems.

I witnessed Rockstor install just fine on an old Isilon node and was told that the performance was quite good -- sorry I have no specifics.


Demo page gives me an error message. Sends me to https://50.0.94.5/


Yes, it's a simple redirect. That's where the demo is hosted for now.


This looks pretty cool. Easy to use, and a nice GUI.


Good stuff guys!


no afp support?


Since Apple has supported SMB for a long time, and actually made it the default protocol in 10.9, is there much need for AFP?


  > Since Apple has supported SMB for a long time, and actually made it the default
  > protocol in 10.9, is there much need for AFP?
Time Machine backups still require AFP, I believe -- unless you use the "TMShowUnsupportedNetworkVolumes" option.


Performance of SMB on Mac is only about half of AFP/NFS, and NFS is more complex to manage from an authorization/user management point of view in a Mac environment.


I'm running AFP on FreeNAS. I also have SMB setup.

I'm using OSX 10.9.4, and I've seen better performance over AFP than with SMB.

So yes, it'd be nice to have AFP support.


AFP is generally faster than SMB and SMB2, but SMB3 should be faster than AFP. YMMV of course.


No, but does this suggestion from our blog work for you?

http://rockstor.wordpress.com/2014/02/24/backup-mac-folders-...


Guys, you might want to remove that RSA Private Key. https://github.com/rockstor/rockstor-core/tree/master/certs


Thank you and appreciate your issue submission on github. We'll fix this right away.


It's still there.


Thanks for your concern, but we don't see the point in just removing it from git, because that doesn't really help: the key is in several branches, in our ISO file, and in every Rockstor RPM in our yum repo, not to mention with the many users who have already downloaded Rockstor.

We changed the key in our live demo, but for our users we'll roll out the fix in the next update. As part of that fix, we'll also remove the key file from git.

I think that's a reasonable plan. Hope I am not missing something.



