⁠Btrfs has been deprecated in RHEL

josefbacik · on Aug 2, 2017

People are making a bigger deal of this than it is. Since I left Red Hat in 2012 there hasn't been another engineer to pick up the work, and it is _a lot_ of work.

For RHEL you are stuck on one kernel for an entire release. Every fix has to be backported from upstream, and the further from upstream you get the harder it is to do that work.

Btrfs has to be rebased _every_ release. If moves too fast and there is so much work being done that you can't just cherry pick individual fixes. This makes it a huge pain in the ass.

Then you have RHEL's "if we ship it we support it" mantra. Every release you have something that is more Frankenstein-y than it was before, and you run more of a risk of shit going horribly wrong. That's a huge liability for an engineering team that has 0 upstream btrfs contributors.

The entire local file system group are xfs developers. Nobody has done serious btrfs work at Red Hat since I left (with a slight exception with Zach Brown for a little while.)

Suse uses it as their default and has a lot of inhouse expertise. We use it in a variety of ways inside Facebook. It's getting faster and more stable, admittedly slower than I'd like, but we are getting there. This announcement from Red Hat is purely a reflection of Red Hat's engineering expertise and the way they ship kernels, and not an indictment of Btrfs itself.

pgaddict · on Aug 2, 2017

I think a natural follow-up question is "Why Red Hat does not have engineers to support btrfs?" That is, if the lack of engineers is a symptom, what is the cause?

I'm pretty sure, had RH wanted they could either hire or assign engineers to maintain the btrfs code, take care of patches from upstream, etc. So why didn't that happen? I wonder what is your opinion on that.

I see a bunch of possibilities (not necessarily independent ones):

1) Politics. Perhaps RH wants to kill btrfs for some reason?

I see this as rather unlikely, as RH does not have a competing solution (unlike in the Jigsaw controversy, where they have incentives to kill it in favor of the JBoss module system).

2) Inability to hire enough engineers familiar with btrfs, or assign existing engineers.

Perhaps the number of engineers would be too high, increasing costs. Especially if not only to maintain the RHEL kernels, but to contribute to btrfs and move it forward.

Or maybe there's a pushback from the current filesystems team, where most people are xfs developers?

3) Incompatible development models.

If each release requires a rebase, perhaps supporting btrfs would require too much work / too many engineers, increasing costs? I wonder what Suse and others are doing differently, except for having in-house btrfs developers.

4) Lack of trust btrfs will get mature enough for RHEL soon.

It may work for certain deployments, but for RHEL customers that may not be sufficient. That probably requires a filesystem performing well for a wider range of workloads.

5) Lack of interest from paying RHEL customers.

Many of our customers have RHEL systems (or CentOS / Scientific Linux), and I don't remember a single one of them using btrfs or planning to do so. We only deal with database servers, which is a very narrow segment of the market, and fairly conservative one when it comes to filesystems.

But overall, if customers are not interested in a feature, it's merely a pragmatic business decision not to spend money on it.

6) Better alternatives available.

I'm not aware of one, although "ZFS on Linux" is getting much better.

So I tend to see this as a pragmatic business decision, based on customer interest in btrfs on RHEL vs. costs of supporting it.

josefbacik · on Aug 2, 2017

All this talk about Oracle is just plain stupid. Oracle doesn't control anything, the community does. One core developer still works on Btrfs from Oracle, the vast majority of the contributions come from outside Oracle.

Now as to

> "Why Red Hat does not have engineers to support btrfs?"

You have to understand how most kernel teams work across all companies. Kernel engineers work on what they want to work on, and companies hire the people working on the thing the company cares about to make sure they get their changes in.

This means that the engineers have 95% of the power. Sure you can tell your kernel developer to go work on something else, but if they don't want to do that they'll just go to a different company that will let them work on what they care about.

This gives Red Hat 2 options. One is they hire existing Btrfs developers to come help do the work. That's unlikely to happen unless they get one of the new contributors, as all of the seasoned developers are not likely to move. The second is to develop the talent in-house. But again we're back at that "it's hard to tell kernel engineers what to do" problem. If nobody wants to work on it then there's not going to be anybody that will do it.

And then there's the fact that Red Hat really does rely on the community to do the bulk of the heavy lifting for a lot of areas. BPF is a great example of this, cgroups is another good example.

Btrfs isn't ready for Red Hat's customer base, nobody who works on Btrfs will deny that fact. Does it make sense for Red Hat to pay a bunch of people to make things go faster when the community is doing the work at no cost to Red Hat?

chasil · on Aug 2, 2017

Oracle certainly controls the license for ZFS.

Release under a compatible license would likely see a ZFS kernel module appear in EPEL immediately; Red Hat would likely replace XFS with ZFS as the default in RHEL8 were this legally possible.

Oracle supports BtrFS in their Linux clone of RHEL. It certainly appears that Red Hat is swallowing a "poison pill" to increase Oracle's support costs (and I'm surprised that they have not swallowed more).

http://docs.oracle.com/cd/E52668_01/E54669/html/ol7-about-bt...

With these new added costs, Oracle might find it cheaper to simply support the code for the whole ecosystem (CentOS and Scientific Linux included). Given the adversarial relationship that has developed between the two protagonists, an enforceable legal agreement would likely be Red Hat's precondition.

Otherwise, BtrFS has been mortally wounded.

rbanffy · on Aug 4, 2017

> Oracle doesn't control anything

Oracle owns a lot of patents and I suspect both ZFS and BtrFS rely on some.

pgaddict · on Aug 2, 2017

> All this talk about Oracle is just plain stupid. Oracle doesn't control anything, the community does. One core developer still works on Btrfs from Oracle, the vast majority of the contributions come from outside Oracle.

FWIW I haven't said anything about Oracle & btrfs ...

>> "Why Red Hat does not have engineers to support btrfs?"

> You have to understand how most kernel teams work across all companies. Kernel engineers work on what they want to work on, and companies hire the people working on the thing the company cares about to make sure they get their changes in.

> This means that the engineers have 95% of the power. Sure you can tell your kernel developer to go work on something else, but if they don't want to do that they'll just go to a different > company that will let them work on what they care about.

> This gives Red Hat 2 options. One is they hire existing Btrfs developers to come help do the work. That's unlikely to happen unless they get one of the new contributors, as all of the seasoned developers are not likely to move. The second is to develop the talent in-house. But again we're back at that "it's hard to tell kernel engineers what to do" problem. If nobody wants to work on it then there's not going to be anybody that will do it.

Sure, I understand many developers have their favorite area of development, and move to companies that will allow them to work on it. But surely some developers are willing to switch fields and start working on new challenges, and then there are new developers, of course. So it's not like the number of btrfs developers can't grow. It may take time to build the team, but they had several years to do that. Yet it didn't happen.

> And then there's the fact that Red Hat really does rely on the community to do the bulk of the heavy lifting for a lot of areas. BPF is a great example of this, cgroups is another good example.

I tend to see deprecation as the last state before removal of a feature. If that's the case, I don't see how community doing the heavy lifting makes any difference for btrfs in RH.

Or are you suggesting they may add it back once it gets ready for them? That's possible, but the truth is if btrfs is missing in RHEL (and derived distributions), that's a lot of users.

I don't know what are the development statistics, but if majority of btrfs developers works for Facebook (for example), I suppose they are working on improving areas important for Facebook. Some of that will overlap with use cases of RHEL users, some of it will be specific. So it likely means a slower pace of improvements relevant to RHEL users.

> Btrfs isn't ready for Red Hat's customer base, nobody who works on Btrfs will deny that fact. Does it make sense for Red Hat to pay a bunch of people to make things go faster when the community is doing the work at no cost to Red Hat?

The question is, how could it get ready for Red Hat's customer base, when there are no RH engineers working on it? Also, I assume the in-house developers are not there only to work on btrfs improvements, but also to investigate issues reported by customers. That's something you can't offload to the community.

I still think RH simply made a business decision, along the lines:

1) The btrfs possibly matters to X% of our paying customers, and some of them might leave if we deprecate it, costing us $Y.

2) In-house team of btrfs developers who would work on it and provide support to customers would cost $Z.

If $Y < $Z, deprecate btrfs.

fundabulousrIII · on Aug 2, 2017

Wholeheartedly agree that btrfs isn't ready for a real customer base. I wish SuSE would have learned that lesson before they pushed it as default.

dagwieers · on Aug 4, 2017

It feels like reiserfs all over again.

fundabulousrIII · on Aug 4, 2017

Well, I am guilty of using reiserfs selectively (and with research + design) and having a really good experience with it. Maybe I should have done the same with btrfs but I took btrfs on faith and was burned.

chasil · on Aug 2, 2017

Oracle has essential control of both "nextgen" filesystems that should be used in Linux - as Sun, they developed and licensed ZFS, and they are the chief contributors of BtrFS. Their refusal to release ZFS under a license that is compatible with the GPL is keeping it out of Red Hat's distribution.

This move by Red Hat must be seen as a provocation of Oracle, to force either greater cooperation and compliance in producing a stable BtrFS for RHEL, or the release of ZFS under a compatible license. Red Hat has put an end to BtrFS for now, and Oracle will have to go to greater lengths to use it in their clone. Customers also will not want it if it does not run equally well between RHEL and Oracle Linux.

It is obvious that Oracle will have to assume higher costs and support if they want BtrFS in RHEL. Red Hat is certainly justified in bringing Oracle to heel.

Oracle recently committed preliminary dedup support for XFS, so they must be intimately aware of the technical and legal issues behind Red Hat's move.

https://blogs.oracle.com/linuxkernel/upcoming-xfs-work-in-li...

cjbprime · on Aug 2, 2017

Oracle is not the "chief contributors" of Btrfs. If anyone is, it's Facebook. Chris Mason (the btrfs creator) worked for Oracle. He left in 2012.

> This move by Red Hat must be seen as a provocation of Oracle

I doubt it.

chasil · on Aug 2, 2017

A thousand pardons - I am mistaking "initially designed" for current control.

https://en.wikipedia.org/wiki/Btrfs

"... initially designed at Oracle Corporation for use in Linux."

crudbug · on Aug 2, 2017

Oracle uses RHEL for their Unbreakable Linux [0] distribution. The least thing they can do is open up ZFS for the Linux community.

[0] - https://linux.oracle.com/

chasil · on Aug 2, 2017

Seconded. I am in absolute agreement.

rbanffy · on Aug 4, 2017

> open up ZFS for the Linux community.

More likely they'll support it only on their Linux.

chasil · on Aug 2, 2017

This will particularly impact the "Red Hat Compatible Kernel" (RHCK) that is shipped by Oracle Linux.

https://docs.oracle.com/cd/E37670_01/E57668/html/ol_kern_65r...

Assuming that RHEL v8 strips BtrFS, Oracle's RHCK will have to add support back in, and thus no longer be "compatible." Without that support, some filesystems will fail to mount at boot. In-place upgrades from v7 to v8 will be problematic.

Oracle has worked very hard to maintain "compatibility" with Red Hat, even going so far as to accept MariaDB over MySQL. Their reaction to the latest "poison pill" will be interesting.

lenzgr · on Aug 3, 2017

Why would Oracle have to add Btrfs support back into the RHCK? It's exactly the point of this kernel to be 100% identical to upstream RHEL. If an Oracle Linux user needs Btrfs support, it will still be included in the "Unbreakable Enterprise Kernel" (UEK), which Oracle provides as an alternative.

chasil · on Aug 3, 2017

Any BtrFS filesystems in /etc/fstab won't mount if/when an RHCK boots that lacks the filesystem driver.

An in-place upgrade from v7 to v8 could easily get hosed.

bonzini · on Aug 3, 2017

Does Oracle support Btrfs (as opposed to making it just a tech preview) with the compatible kernel? I don't think so, since it's the same code as RHEL. And if not, hosing in-place upgrades is acceptable. RHEL 7 is supported until 2024.

user5994461 · on Aug 2, 2017

>>> This announcement from Red Hat is purely a reflection of Red Hat's engineering expertise and the way they ship kernels, and not an indictment of Btrfs itself.

It's a clear indicator that RedHat doesn't want or can't support btrfs.

Which is a reflection of btrfs AND RedHat: the effort required to maintain it, the lack of usage in RHEL paying customers, the immaturity/fast development of the filesystem.

nailer · on Aug 2, 2017

Thanks. Any indication why RH didn't hire btrfs devs? It looks like a decision was made that it wasn't strategic (obviously xfs on Linux has a much longer history).

josefbacik · on Aug 2, 2017

They brought on Zach right before I left specifically to help with the effort, but he left as well. I can't really speak to Red Hat's overall strategic decisions, but really they have a large local file system team, and a lot of them are xfs developers. You aren't going to convince Dave Chinner he should go work on Btrfs instead of XFS. Unless there's somebody internally that actually wants to work on Btrfs the work simply isn't going to get done. All of the other Btrfs developers work for other companies, and none of them are interested in working for Red Hat.

jameskegel · on Aug 2, 2017

I'm feeling a subtext here that maybe RH isn't a desired place to work, when I've always imagined the opposite. Is this the case?

tytso · on Aug 2, 2017

One of Red Hat's superpowers is in hiring relatively unknown developers, and helping them become strong participants in the open source world. But their compensation isn't super high, and when you travel on Red Hat's nickle you have to share a room with someone else --- assuming you get travel approval to go at all. For people on help organize conferences, Red Hat is rather infamous about having their full-time employees ask for travel scholarships, which originally established to support hobbyist developers.

As a result, it is not at all surprising that Red Hat ends up functioning as somewhat like a baseball farm team for companies like Facebook, Google, etc. who are willing to pay more and have more liberal travel policies than Red Hat. If someone can become a strong open source contributor while working at Red Hat, they can probably get a pay raise going somewhere else.

There is a trade off --- companies that pay you much more also tend to expect that you will add a corresponding amount of value to the company's bottom line. So you might have slightly more control over what you choose to work on at Red Hat.

josefbacik · on Aug 2, 2017

Nope I love Red Hat and loved working for Red Hat and still interact with most of my colleagues there on a day to day basis. I shouldn't be speaking for everybody, but from what I can tell we're all pretty happy where we are, so no real reason to switch companies.

user5994461 · on Aug 2, 2017

I'd read the subtext as "there are only a handful of filesystem developers in the world and the 10 of them are already settled in a good big company".

cjbprime · on Aug 2, 2017

I think there are some ways that RH would be less desirable for many people than a BigCo. When I was interested in working for them they had offices in inconvenient locations and a requirement that you (or at least, I) work in one of them -- e.g. their "Boston" office is 30 miles away in Westford, and their headquarters are in North Carolina. That's disqualifying for many people.

I imagine they pay significantly less than the other companies (e.g. Facebook) who want to hire Btrfs devs can afford to, too.

nailer · on Aug 2, 2017

Isn't FBs internal distro Fedora based? I wonder if FB has a solid RH-based btrfs production ready kernel floating about.

bonzini · on Aug 3, 2017

The Fedora kernel is based on upstream. The RHEL kernel is a 3.10 fork with key subsystems currently having at least 4.5-ish features.

nailer · on Aug 2, 2017

Fair enough. I imagined if btrfs was a high enough priority they'd hire new staff specifically for it, but if they've tried, and money/good employment conditions don't work, that's all they can do.

chasil · on Aug 2, 2017

XFS does not support transparent compression, error detection and recovery, and (as yet) deduplication.

Fragmentation is also an issue, and xfs_fsr should be run at regular intervals to "defrag" an XFS file system. (I assume that) BtrFS handles this more intelligently.

I'd love to see XFS get some or all of these features.

_bz2r · on Aug 2, 2017

thanks for this context; i read this thread previously and had no idea of the "why" behind the news item. great comment to understand better.

feld · on Aug 2, 2017

The problem is that Redhat and others are refusing to challenge the norm and break away from the "freeze the release; backport fixes" mantra.

Stop backporting fixes. You're forking the codebase.

Ship exactly what upstream provides.

Teach upstream projects how to do better release engineering if they're abandoning major releases to early or breaking API/ABI in a minor release.

Stop backporting fixes. You're forking the codebase.

edit: also stop incorrectly backporting security fixes and creating new CVEs. Seriously. Stop it.

admax88q · on Aug 2, 2017

I think you're underestimating the stability that such practices provide for enterprise. This is what people pay Redhat for.

Not all upstreams are interested in doing release engineering. There are non zero costs to doing it. It can eat up time that can be spent on bug fixes and features, or even make it too costly to change direction if a certain approach to implementation is proving more difficult than it should be.

Look at the Linux kernel. The only reason there is a stable kernel series is because Greg K-H decided it was important enough. He was unable to convince any other developers to go along with it, and eventually the decision was "if you want to support it, then you can do it."

Do you consider the stable kernel series a fork of the codebase? Should everyone be running the newest kernel every release despite the plenty of regressions that appear?

Kernel developers are not interested in making every change in such a slow and controlled manner as to avoid any regressions. And it works for them. They get a lot of stuff done, and come back and fix the regressions later.

tytso · on Aug 2, 2017

There are real tradeoffs between development velocity, stability, scope (wide/narrow applicability), and headcount.

If you don't care that much about development velocity, it's really easy to make something that is super stable.

If you only care about making things work on a very narrow use cases (to support the back end of a particular company's web servers, or just to support a single embedded device), life also gets much easier.

If you want to "move fast and break things", that's also viable.

Finally, if you have unlimited amounts of head count, life also becomes simpler.

Different parts of the Linux ecosystem have different weights on all of these issues. Some environments care about stability, but they really don't care about advanced features, at least if stability/security might be threatened. Others are interested in adding new features into the kernel because that's how they add differentiators against their competitors. Still others care about making a kernel that only works on a particular ARM SOC, and to hell if the kernel even builds for any other architecture. And Red Hat does not have infinite amounts of cash, so they have to prioritize what they support.

So a statement such as "Teach upstream projects how to do better release engineering", is positively Trumpian in its naivete. Who do you think is going to staff all of this release engineering effort? Who is going to pay for it? Upstream projects consists of some number of hobbists, and some number of engineers from companies that have their own agendas. Some of those engineers might only care about making things better for Qualcomm SOC's, and to hell with everyone else. Others might primarily interested in how Linux works on IBM Mainframes. If there are no tradeoffs, then people might mind work that doesn't hurt their interests, but helps someone else. They might even contribute a bit to helping others, in the hopes that they will help their use case. That's the whole basis of the open source methology.

But at the same time you can't assume that someone will spend vast amounts of release engineering effort if it doesn't benefit them or their company. Things just don't work that way. And an API/ABI that must be stable might get in the way of adding some new feature which is critically important to some startup which is funding another kernel engineer.

There is a reason why the Linux ecosystem is the way it is. Saying "stop it" is about as intelligent as saying that someone who is working two 25 hour part-time jobs should be given a "choice" about her healthcare plan, when none of the "choices" are affordable.

cyphar · on Aug 2, 2017

I get the feeling you've never had to provide support for a distribution before. There are many guarantees that Red Hat or SUSE provide that are not provided by upstream projects. Freezing the release is the only sane way of doing it, and backporting fixes is necessary. There are exceptions to this, such as stable kernels (which was started by GregKH out of frustration of the backporting problem while at SUSE).

Upstreams don't have the resources to do proper release engineering, they're busy working on new features. The fact that SUSE and Red Hat spawned from a requirement for release engineering that upstreams were not able to provide should show that it takes a lot more work than you might think.

Also, can we please all agree as a community that writing patches and forking of codebases is literally the whole point of free software? If nobody should ever fork a codebase then why do we even have freedom #1 and #2? The trend of free software projects to have an anti-backport stance is getting ridiculous. If you don't want us to backport stuff, stop forcing us to do your release engineering for you.

digi_owl · on Aug 2, 2017

Sadly more and more upstream wants to have their cake and eat it to. Just look at Flatpak, that is all about moving the updating and distribution from distros to upstream.

cyphar · on Aug 4, 2017

I think Flatpak won't end up solving the problem though. Mainly because it still requires distributions to exist and provide system updates, but also because it just makes the static binary problem (that distributions were made to fix) even worse.

Honestly what I think we need is to have containers that actually overlay on the host system and only include whatever specialised stuff they need on top of the host. So updates to the host do propagate into containers -- and for bonus points the container metadata can still be understood by the host.

digi_owl · on Aug 4, 2017

In the end i don't see it as a technical problem, but a mentality problem.

Again and again we see that without any financial incentive, developers are loath to put any effort into backwards compatibility and interface stability.

At the same time they all want people to be running their latest and shiniest.

So in the end, what will happen is that each "app" will bundle the world, or at least as much as they feel they need to.

SwellJoe · on Aug 2, 2017

I don't get why this would be a positive thing for Red Hat's customers (or Red Hat, since stability/predictability is what Red Hat customers are paying for). There is a Red Hat-maintained Linux that is very close to upstream (Fedora). But the people who pay for RHEL don't want upstream and surprises, they want predictable for seven years and they're willing to pay a lot of money for that. Why would that be a negative for you or me? RHEL isn't breaking upstream with this practice, even if they are making mistakes in their own backports.

"edit: also stop incorrectly backporting security fixes and creating new CVEs. Seriously. Stop it."

Can you give some examples of cases where Red Hat introduced bugs in their backported patches? I follow RHEL CVEs relatively closely (because some of my packages are derived from their packages), and I can't think of an example of that happening. Debian has done so, but very rarely, that I can recall. (And, Ubuntu, too, since they just copy Debian for huge swaths of the OS.)

noselasd · on Aug 2, 2017

I for one am very grateful Red Hat does not do that. We have a kernel driver for custom hardware, there's around 80 of these devices in the world, split roughly 50/50 between Windows and Red Hat users. While these devices are not cheap, we could not recoup the cost of maintaining it if we had to track the upstream kernel all the time - we tried, and could not justify the cost.

The number of times the APIs changes from under your feet is astounding - even with just keeping up with Red Hat, we spend around 4-6 times the engineering time on the driver compared to what we do with the Windows version of the driver, tracking upstream gave us almost an order of magnitude more work (And keep in mind that /only/ supporting the most recent upstream kernel is rarely an option - several versions need to be supported concurrently)

bonzini · on Aug 3, 2017

Red Hat provides a stable ABI for a pretty large set of symbols. Unless you are doing strange things in the driver, a module built for RHEL 7.0 should be fine until 8 comes out.

noselasd · on Aug 11, 2017

That is exactly what I am saying - which is in stark contrast to what would happen if Red Hat did not provide that stable ABI, but instead "Ship exactly what upstream provides" as the original comment suggest they should do.

ajross · on Aug 2, 2017

If upstream releases were doing better release engineering in the way you mean then there would be no money to be made shipping RHEL as a product.

krylon · on Aug 2, 2017

> the "freeze the release; backport fixes" mantra.

For many customers of Red Hat, that mantra is the very reason they use RHEL in the first place.

digi_owl · on Aug 2, 2017

Indeed. Or they would have stuck to using Windows, or some commercial Unix.

Sadly i feel that more and more upstream wants it both ways, be able to push their latest and shiniest, and keep ignoring any need for interface stability etc.

Frankly i suspect the end result of the likes of Flatpak will be that upstream push whole distros worth of bundled libs, just so they don't have to consider interface stability as they pound out their shinies in their best "move fast and break things" manner.

thehardsphere · on Aug 2, 2017

Taking this to its most ludicrous extreme, everyone should use Arch, and anyone who can't should... what? Not use Linux?

lloeki · on Aug 2, 2017

The Fedora Project focuses, as much as possible, on not deviating from upstream in the software it includes in the repository.

[0]: https://fedoraproject.org/wiki/Staying_close_to_upstream_pro...

thehardsphere · on Aug 2, 2017

Right, "as much as possible" implying that there are cases where this is not possible. Which is more upstream-compliant than RHEL, but not 100% "stop doing this", which is the opinion of the comment I was replying to.

chasil · on Aug 2, 2017

We use both RHEL and Oracle Linux as a peer to VAX VMS and Unisys(Univac) OS2200 (a COBOL mainframe).

From the perspective of legacy systems, Red Hat's approach is more comfortable.

_pfxa · on Aug 2, 2017

You cant teach anything to anybody. An unrelated proof: I'm having to write a bot with phantomjs to scrape my uni's announcements page and turn it into an RSS feed so that I won't have to periodically control it; because they decided wordpress wouldnt cut it and they needed some blumming angular.js stuff, breaking all the urls and removing any sort of rss feeds on the way. And all it is is a blog, basically, nothing more. Mailed them telling that I used to use that stuff, no replies in weeks. At least I'm learning phantomjs which seems to be very useful of a tool.

justin66 · on Aug 2, 2017

I am as happy as anyone that XFS is finally getting the position of honor it deserves on enterprise Linux (something like 15 years later than it should have, grumble grumble) but it doesn't really take the place of what btrfs was trying to do. Only ZFS is in a position to do that. I wonder if there are any plans for supporting the native port on RHEL.

__david__ · on Aug 2, 2017

Anecdote time: Last month I had an XFS volume fail on me. It got some sort of internal inconsistency and refused to work (all fs calls returned errors until I unmounted). This is where I discovered that XFS still has extremely poor recovery tools.

xfs_repair will complain if there is a journal present and tell you to mount the fs to replay the journal. But mount would refuse, saying the fs was inconsistent. So the only option was to xfs_repair -L to just throw out the journal.

Then, xfs_repair sucked up something like 30GB or more of RAM, so I had to make a huge swapfile so that the kernel would OOM the repair.

Then, after roughly 20 to 30 hours of repair it would exit with an error. At that point it would actually mount, but hitting certain areas of the filesystem would trigger the inconsistency again and start the entire process over.

In the end I couldn't fix it and sadly had to reformat. I chose ext4 when I did—I've had lots of experience with ext3 and 4 and I've never had a filesystem that I couldn't at least make consistent again (even if it loses some data).

rob-olmos · on Aug 2, 2017

Yikes. That's unnerving to read. Were you able to create a bug report? Was there any other related issue like an underlying storage controller messing up?

mrweasel · on Aug 2, 2017

Redhat hasn't been on the best of terms with Oracle, so I suspect that they want to stay clear of ZFS. It does however leave Redhat without a more modern feature rich filesystem.

Perhaps Redhat could help to develop snapshots on XFS. It's not the only feature XFS is missing, but it's a start.

cyphar · on Aug 2, 2017

> Perhaps Redhat could help to develop snapshots on XFS. It's not the only feature XFS is missing, but it's a start.

Upstream has been working on adding more btrfs-like features to XFS, but I believe that RHEL encourages using devicemapper snapshots (which you then format with XFS).

2ion · on Aug 2, 2017

> Upstream has been working on adding more btrfs-like features to XFS, but I believe that RHEL encourages using devicemapper snapshots (which you then format with XFS).

Exactly, mountable and mergable snapshots have been supported by a LVM/devicemapper stack for a long time.

rbanffy · on Aug 2, 2017

It's still a bit more involved than the transparent snapshots ZFS and, to a lesser degree, BtrFS offer. I'm not happy with this and sincerely hope this position changes in the future.

boomboomsubban · on Aug 2, 2017

>Redhat hasn't been on the best of terms with Oracle, so I suspect that they want to stay clear of ZFS

The only connection Oracle has to ZFS on Linux is ownership of some patents that the license allows you to use, their reluctance is based on distribution issues between the GPL and CDDL.

jdub · on Aug 2, 2017

… for which Oracle owns the copyright.

bestham · on Aug 2, 2017

I can't see that the copyright matter here. What matters is the license. Oracle can't unlicence what Sun code is already part of OpenZFS.

lathiat · on Aug 2, 2017

Correct, Oracle would be in a position to do exactly that. Whether they do or not is another story but thats a pretty big liability.

Given they are discontinuing Solaris and all-in on Red Hat Enterprise Linux I can't help but wonder why they don't do more with ZFS on Linux and therefor wonder if the NetApp patent suits or some other patent suit is preventing them from doing anything in the background.

Many people don't realise that these crappy patent suits in the background prevent all sorts of really basic stuff, like the fact most things now bounce through a cloud server (like Facetime) because there's a patent troll for peer-to-peer communications. And it's causing total waste as a result :( It also seems likely that prevented facetime becoming an open standard as Apple original promised. This is only 1 example though.

4ad · on Aug 2, 2017

> Given they are discontinuing Solaris

Oracle is NOT discontinuing Solaris. This FUD must die.

https://news.ycombinator.com/item?id=14865237

cryptonector · on Aug 2, 2017

This may be true, but when Oracle killed OpenSolaris, my non-Sun/Oracle friends wrote off Solaris and moved off it.

Killing OpenSolaris, and talking up SPARC so much, made people think that a) Larry just wants to vendor lock them, b) doesn't care about x86 support because it makes vendor lock-in harder for Oracle, c) the OpenSolaris derivative community will not be able to compete with Linux. So everyone has grudgingly accepted that Linux is it for the enterprise Unix market.

I hate this as much as you do. I <3 Solaris/Illumos. Illumos derivatives have their niches, no doubt, and I want to be able to use them much more. But that's not how business people think.

I'm not sure that Oracle could turn this impression around at this point. To begin with it would have to restart OpenSolaris, and that might not be enough. OpenSolaris greatly helped Sun overcome resistance to Solaris, but it only went so far, so Oracle will have to do even more work to make Solaris' future bright.

This blog post is as relevant today as ever: https://blogs.oracle.com/bmc/the-economics-of-software

(And yes, it's STILL hosted at blogs.oracle.com. I'm almost afraid of mentioning it: who knows, it might get removed if Oracle execs notice it.)

lathiat · on Aug 2, 2017

Update: Apparently Oracle actually wouldn't be the liability on a CDDL basis because the violation is of the GPL not the CDL. Fair point.

lathiat · on Aug 2, 2017

Hrm thanks seems you're right. They cancelled Solaris "12" but 11 is still in development. Thanks for the correction.

https://theregister.co.uk/2017/01/18/solaris_12_disappears_f...

e12e · on Aug 2, 2017

But presumably, as copyright holders, Oracle is the entity that could try to enforce CDDL in court, in particular breaking the CDDL by mixing in GPL code in the same (ie: OS) distribution? Oracle goes: we bought Sun, and hold copyright to ZFS (also at the point of the OpenZFS fork) - RedHat could respond: we got a license - the CDDL - and Oracle could respond, sure - but CDDL isn't compatible with GPL - so you're in breach of the CDDL?

georgyo · on Aug 2, 2017

Oracle is not the one that could sue. There is nothing in the CDDL that prevents it being used else where.

The GPL on the other hand is a strong copy left. If you link against GPL code, your code must also be licensed as GPL.

This means the Linux copyright owners could sue the distributers of ZoL binaries, but Oracle could not.

Oracle has the power to allow their ZFS code to be relicensed as GPL, removing this road block, but they have no incentive to do so.

boomboomsubban · on Aug 2, 2017

>But presumably, as copyright holders, Oracle is the entity that could try to enforce CDDL in court, in particular breaking the CDDL by mixing in GPL code in the same (ie: OS) distribution?

Any Linux contributor could also try to enforce it, which is why the license incompatibility is the issue stopping them. Oracle holds no special power.

e12e · on Aug 2, 2017

True - but the incentives are a bit different. How many other Linux contributors[1] are selling a commercial operating system in direct competition with Linux as a general purpose Unix-like OS, with ZFS as one of the differentiating features?

Most Linux contributors want Linux to succeed. I don't think it's at all clear that corporate Oracle prefers Linux to succeed - at least not if higher adoption of Solaris is an alternative.

[1] (I guess IBM and Microsoft come to mind... but they don't have any special investment in ZFS)

boomboomsubban · on Aug 2, 2017

There are thousands of Linux contributors, I don't care enough to check but I have to imagine Oracle has employed at least one. Any one of them could sue, and several have mentioned they're considering the option.

The license issue is what's keeping Red Hat from using ZFS, not some rivalry with Oracle.

Nokinside · on Aug 2, 2017

How does mixing CDDL and GPL violate CDDL?

The only issue I'm aware is that mixing the two would violate GPL.

belorn · on Aug 2, 2017

CDDL says: Source code must be licensed under CDDL.

GPL says: Source code must be licensed under GPL.

If you follow the conditions of GPL, you are violating the condition of the CDDL. If you are following the conditions of CDDL, you are violating the GPL. Basic binary logic.

To add: "the engineers who had written the Solaris kernel requested that the license of OpenSolaris be GPL-incompatible". A license is really just an written intention of the author on what conditions copyright law restrictions may be legally ignored. In this case, those wishes had a very explicit intention. However those using the license today has had a general change of heart, and those with GPL interest has a general stance that no FOSS project will ever sue an other FOSS project over license incompatibility. As such, the risk of lawsuit is really just a company suing an other company under the technicality of incompatibility.

Naturally some organizations won't intentionally break copyright law just because no one will sue.

Nokinside · on Aug 2, 2017

Neither license forbids mixing with other licenses. As long as the demands of both are met, they can apply to the same source code.

>If you are following the conditions of CDDL, you are violating the GPL. Basic binary logic.

Relationship between licenses can be transitive but not commutative.

As far as I know CDDL allows using with code under GPL but GPL does not allow using code under CDDL. CDDL copyright owners have no case, GPL copyright owners have.

The question is: If I'm incorrect, what in CDDL prevents using with GPL?

belorn · on Aug 2, 2017

If CDDL has no issue with GPL conditions, then follow the GPL and everything is fine.

CDDL has this text: "Any Covered Software that You distribute or otherwise make available in Executable form must also be made available in Source Code form and that Source Code form must be distributed only under the terms of this License"

So you take some CDDL code, and some GPL code, and you put that whole new source code tree under GPL in order to fullfill the GPL license condition. Are you then in compliance with the CDDL code? My concussion is that you are not, as that would be in conflict with the above condition of the CDDL. The source code tree would not be "distributed only under the terms of this license".

cryptonector · on Aug 2, 2017

CDDL is per-file.

belorn · on Aug 2, 2017

How does this change this?

I take a CDDL licensed source code file. I take a GPL licensed source code file. I add inline the GPL licensed code to the CDDL licensed file, and release an executable form of the result. In order to comply with the GPL I then give out a single source code file under the GPL license terms with the code from the two files.

Is this in compliance with the CDDL terms and conditions?

e12e · on Aug 2, 2017

This stackoverflow question have some good points (notably, the accepted answer, and the bit about limitations - the CDDL section 6.2, for example revokes the CDDL in case of patent infringement, something that might be considered an "extra limitation" under the GPL (you're not allowed to add additional limitations to either the GPL or the CDDL). As such, the CDDL might be incompatible with the GPL up to and including v2 - while GPL3 might also be incompatible with the CDDL):

https://opensource.stackexchange.com/questions/2094/are-cddl...

https://github.com/zfsonlinux/zfs/blob/master/OPENSOLARIS.LI...

Also, the SO answer mentions consumer protection laws - but AFAIK they generally only apply to consumers - not businesses. So the GPL 0 clause might be void in many jurisdictions for individuals but still valid for businesses.

cryptonector · on Aug 2, 2017

Oracle can relicense their codebase anytime they want. They _cannot_ be constrained by OpenZFS/Illumos unless they accept patches from them without copyright assignment.

lunchables · on Aug 2, 2017

What about just using LVM for snapshots? Considering that there default partition schemes include LVM maybe that's what they bank on in most use cases?

jacquesm · on Aug 2, 2017

I have one machine running XFS but if that one is representative then I won't be installing XFS anywhere else and would happily discourage others from using it. It is terribly slow when doing some fairly common operations when you have a large number of small files.

dijit · on Aug 2, 2017

You're joking, surely.

XFS has outperformed EXT4 in almost all "high" use-cases in my experience and testing: Large files (500GB~) or many small files (128k files * 2,400,000 or so). EXT4 under those loads is comically bad.

BTRFS is also terrible at this, only XFS and ZFS are good at handling it.

jeltz · on Aug 2, 2017

On the other hand on database workloads, for example PostgreSQL, XFS and EXT4 are about equal these days. ZFS (at least on Linux) and Btrfs are both clearly slower on those workloads.

Here is one benchmark, but I have seen plenty of similar benchmark results for PostgreSQL showing the same thing: https://blog.pgaddict.com/posts/postgresql-performance-on-ex...

lathiat · on Aug 2, 2017

There's a fairly simple reason for that which is that ZFS (and btrfs to some extent) are almost literally "ACID" databases. They do alot of the same double-writes and other safe behaviour the database is also doing. Those have a penalty and you're doubled up.

There are various guides around for tuning ZFS and database servers to try reduce that duplication, for example you can disable the InnoDB double write buffer because ZFS guarantees you don't need it. You also need to tune recordsize to match the database page size so that you don't accidentally create large multi page blocks.

pgaddict · on Aug 2, 2017

I partially disagree with the claim that ZFS is slower than ext4/xfs. It is, but only as long as you don't use ext4/xfs on top of LVM to get similar snapshot capabilities etc. Then ZFS starts to win.

So if you only need a plain filesystem, ext4/xfs are great and you will get better performance.

If you need/want snapshots, e.g. to do backups that way, it makes sense to look at ZFS.

jacquesm · on Aug 2, 2017

I wish I was. It's pathetic. Creating a new directory entry on an idle machine with plenty of CPU and memory takes seconds, ditto deletions.

Also: I love how that comment sits at -4, as if downvoting it will somehow discredit the data point.

croon · on Aug 2, 2017

> I have one machine running XFS but if that one is representative

> Creating a new directory entry on an idle machine with plenty of CPU and memory takes seconds, ditto deletions.

I think your answer lies in your premise then. It's not representative.

jacquesm · on Aug 2, 2017

10's millions of files should have been the ideal use case for XFS, that's why I installed it in the first place. This was for the 'reocities.com' project and by the time I realized what the problem was most of the import had already been done so I let it run to completion but it makes updating the project a real PITA.

justin66 · on Aug 2, 2017

There's so much that can go wrong setting up a Linux server that it's impossible to give much advice with something like this.

I guess the general stuff is: the easy default partitioning setup you get from a Linux distro is total bs, you need more RAM than you think you do, the way you're serving files or accessing the system (NFS!) has plenty of ways to screw things up as well, and tens or hundreds of millions of files is not any filesystem's ideal use case. The classic IRIX workload would be guaranteed-rate streaming of large media files, and the Linux port of the filesystem obviously inherited a lot of that system's traits (without the GRIO).

XFS has received some very serious performance improvements in the past couple of years to address indexing, large volumes of metadata, and so on, so that'd be one very relevant thing. Dave Chinner's talks are worth the time to watch if you're interested. You would be giving bad advice if you steered people one way or the other with regard to filesystems based on a seven-year old project (unless you've refreshed that system much more recently, of course).

jacquesm · on Aug 3, 2017

> XFS has received some very serious performance improvements in the past couple of years to address indexing, large volumes of metadata, and so on, so that'd be one very relevant thing.

That's probably the difference right there. Thanks for pointing that out.

croon · on Aug 2, 2017

Sure, but the issue could be configuration, drive, interface, etc. It's impossible to speculate in, but what we know is you have trouble with one machine, and it's the only one that has used XFS. It's unfortunate, but likely a coincidence, or at least unrelated to XFS at its core.

I've been using XFS for 10 years without the issues you seem to be having.

fanf2 · on Aug 2, 2017

Your performance problem reminds me of this dentry cache performance failure https://sysdig.com/blog/container-isolation-gone-wrong/

feld · on Aug 2, 2017

You have to tune the parameters at filesystem creation time if you care about small file performance. It was designed for large files.

Dylan16807 · on Aug 2, 2017

What parameters? This guide[1] only mentions two things in relation to number of files. The first is inode count, for which performance is binary. The other is files in a single directory, and it says that the default setting is fine for a million. There's no explanation for the performance jacquesm describes.

[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...

baldfat · on Aug 2, 2017

Here is a link to benchmarks for the file systems. To bash XFS which is about as solid as one can get I would love to know why you would say that? http://www.phoronix.com/scan.php?page=article&item=linux-44-...

stargrazer · on Aug 2, 2017

Considering the size of disks now-a-days, the chance of bit rot is high. And (I don't have the original source) on SSD, bit rots probability is higher still. So... ZFS and BTRF have meta-data as well as data checksumming. From what I've read, XFS may have metadata checksumming, but not on the data side of things.

I consider checksumming important. Do others? What is the solution? What other file systems offer that sort of capability?

Snapshotting is a second go-to function. Particularly when it is integrated into the LXC container creation process. (There was a comment elsewhere here which said LXC is on it's way out.... huh? what?)

tytso · on Aug 2, 2017

There are many different ways that storage can be layered, and depending on your use case, you can put various advanced features (snapshots, checksum/data integrity, encryption, etc.) functionality in different places in the storage stack. You can put functionality the block device layer (e.g., lvm, dm-thin, dm-verity), you can put functionality into the file system, you can put functionality into the cluster filesystem (if you have such a thing), or you can put iti in the application level.

Depending on the requirements of your use case different choices will make more sense. It's important to remember that RHEL is used for enterprise customers, and what might be common in the enterprise world might not be common for yours, and vice versa. Certainly, if you are using a cluster file system, it makes no sense to do checksum protections at the disk file system level, because you will be using some kind of erasure coding (e.g., Reed Solomon error correcting codes) to protect against node failure. This will also take care of bit flips.

If you are using cloud VM's, or if you are using Docker / Kubernetes, then LXC won't make sense. It all depends on your technology choices, and so it's important to look at the big picture, not just at the individual file system's features.

rob-olmos · on Aug 2, 2017

Given a stock (or additional packages?) RHEL 7.4 install on non-clustered storage, what would be the best combination to detect & correct bitrot at the filesystem and lower level?

mrob · on Aug 2, 2017

Linux 4.12 introduced dm-integrity, which adds integrity checking at the block device level, so it will work with any file system:

https://gitlab.com/cryptsetup/cryptsetup/wikis/DMIntegrity

zerd · on Aug 2, 2017

One good thing about ZFS integrity checking is that when it finds an error it can repair the bit rot from another disk if you have parity or mirroring. Can dm-integrity do that?

_wxyv · on Aug 2, 2017

dm-integrity will only operate on a single disk so no, not on its own.

It does however return an error if the integrity check fails, so if you put mdadm on top, mdadm can repair the erroneous block. I've tested this and am currently running it on a 32TB array.

harshreality · on Aug 2, 2017

Not so far, it seems. https://www.spinics.net/lists/dm-devel/msg31482.html

> this target do not provide error correction, only detection of error (such a tool could be written on top of dm-integrity though)

rleigh · on Aug 3, 2017

Or multiple copies of the data (copies=n property).

ars · on Aug 2, 2017

Use mirror raid and have mdadm do a full disk compare/check every month (this is the default on Debian).

Additionally use smartmontools and configure it to do a short self test each night, and a long self test (i.e. full disk read) each week.

This will catch/flag errors early, which mdadm will then detect.

usefulcat · on Aug 2, 2017

Yes it can detect errors, but it can't continue to function correctly (read: return the correct data) because it doesn't know which copy of the differing data is damaged because it doesn't have checksums.

Moreover, if it doesn't always read both copies of the data (which it may well not, for performance reasons), then you have the possibility of silently propagating damaged data to all mirrors in the case that damaged data is returned to an application and the application then rewrites said data.

Compare that to a filesystem with checksums, which, in addition to being able to detect such a problem, could also continue to function completely correctly in the face of it.

willglynn · on Aug 2, 2017

Yep. "What happens if you read all the disks successfully but the redundancy doesn't agree?" is a great question.

Mirrors and RAID5: there's obviously no way that `md` software RAID can help, since it doesn't know which is correct. What about RAID6 though? Double parity means `md` would have enough information to determine which disk has provided incorrect data. Surely it does this, right?

Wrong. In the event of any parity mismatch, `md` assumes the data disks are correct and rewrites the parity to match. See "Scrubbing and Mismatches" section in `man 4 md`:

https://linux.die.net/man/4/md

If you scrub a RAID 6 array with a disk that returns bad data, `md` helpfully overwrites your two disks of redundancy in order to agree with the one disk that's wrong. Array consistent, job done, data... eaten.

rob-olmos · on Aug 2, 2017

That's incredible! Thanks for the insight.

Any recommendations for detecting/correcting bitrot with RHEL 7.4 at the filesystem or lower levels?

Wicher · on Aug 2, 2017

Disk A and disk B both contain file SomeFile.

On disk B this file has rotted.

When reading the file SomeFile into memory, the read will be distributed among the disks (for performance reasons) (and it will probby need to span a multiple of the stripe size).

Ok, file is read into memory, including the bitrotted part from disk B. Now we write the file blocks back - as one does.

Voila! Both disks now contain the bitrot. And mdadm will not complain - disk A and B are identical for the area of file SomeFile.

IgorPartola · on Aug 2, 2017

Moreover, even if you don't read the file, and the bit rot is discovered during the monthly compare, at least on Linux the disk that is considered correct will be chosen at random. So you need at least three disks to have some semblance of protection. Have you guys seen many laptops that come with three or more drives?

Just use ZFS. Even on a single disk setup you will at least not get silent bit rot.

binaryphile · on Aug 2, 2017

"Never go to sea with two chronometers; take one or three."

- adage cited in the Mythical Man Month

gigatexal · on Aug 2, 2017

Or just do raidz6 in ZFS and call it a day.

feld · on Aug 2, 2017

Actually it's better to just do mirrors. Avoid RAIDZ at all costs if you care about performance and the ability to resilver in a reasonable amount of time.

gigatexal · on Aug 2, 2017

Sure I agree. But nested mirrors still suffer from the same issue of losing a drive and you lose everything.

aeorgnoieang · on Aug 2, 2017

> But nested mirrors still suffer from the same issue of losing a drive and you lose everything.

Are you referring to mirroring a volume or dataset on a single disk? Why would you want to do that instead of mirroring among multiple drives?

gigatexal · on Aug 2, 2017

how would you set up a large pool?

two sets of say 5 disks in a mirror raidz1 would still fail if a disk in one set failed and a disk in the other set failed. I guess you could do a stripe setup of 5 sets of 2 disks in mirrors. Still it seems wicked risky to me. I do agree though mirroring has been the best for speed but a lot of that changes with nicer SSDs especially NVMe ones.

aeorgnoieang · on Aug 2, 2017

I was curious about what a "nested" mirror is really. What exactly is nested?

I'd setup a large pool with mirror vdevs, i.e. n sets of 2 disks per mirror.

My half-remembered reasoning was that backups manage the risk you'll lose data. But replacing a disk in a mirror vdev is much easier, and faster, than doing so with RAIDZ.

The risk of RAIDZ is that resilvering impacts multiple vdevs, is much more intensive than a simple mirror resilvering, and thus the probability that additional drives will fail is much higher.

Here's a blog post that I definitely read the last time I was reading up on this:

- [ZFS: You should use mirror vdevs, not RAIDZ. – JRS Systems: the blog](http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-...)

aeorgnoieang · on Aug 2, 2017

A Reddit post about that blog post in my other reply:

- [You should use mirror vdevs, not RAIDZ. : DataHoarder](https://www.reddit.com/r/DataHoarder/comments/2v0quc/you_sho...)

gigatexal · on Aug 2, 2017

I wonder if resilvering is still an issue with SSDs. But I cede your point, nested vdevs of two disks making mirrors makes sense. It doesn't sit well still, but makes sense

gigatexal · on Aug 3, 2017

According to OpenZFS's changelog for v 0.7 resilvering is smarter now: https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.7.0

legulere · on Aug 2, 2017

> on SSD, bit rots probability is higher still

Do you have a source for this? So far I believed that bit-rot rates are pretty similar.

stargrazer · on Aug 2, 2017

A google search turns up a number of sources. Another concept which could use justification is whether SSDs bit rot more over long term than do spinning disks. I heard that somewhere as well.

legulere · on Aug 2, 2017

Using google was the first thing I did. I couldn't find something substantial.

buserror · on Aug 2, 2017

Well I used btrfs as a root filesystem for quite a while, until I realized it was pig slow for sync() -- I mean, it would take AGES to do and apt-get upgrade for example. I ended up having to do some tasks using 'eatmydata' [0] to make it all better, risking filesystem corruption in trade for speed. Also, at the time, there was no functioning fsck.

So I moved back safely to ext4 and never looked back!

[0]: https://www.flamingspork.com/projects/libeatmydata/

buster · on Aug 2, 2017

Over the recent years on every new laptop install i switched between filesystems, so i had ext3/4, btrfs and (currently) xfs on my system. I have to say, btrfs had the most glitches (a few years back), although it worked ok'ish (no data loss).

Nowadays, i must say that i very much prefer a stable filesystem with as little complicated logic as possible. I actually never use snapshots or subtrees! I never put another disk in my laptop (where would that go?!) so i don't need to do dynamic resizing (while online of course!). All this makes the filesystems a lot more complex then it has to be. I've also run into problems using ZFS on Solaris some years back which took ~2 weeks in dtracing what the hell is going on. Of course it was related to CoW.

My lessons learned: Check your requirements. Will you really need and use subtrees/snapshots/XYZ on your system? Will you really need to do online-resizing? If not, just use a stable, simple filesystem. There are perfect usecases for ZFS or btrfs. But not everyone needs the advanced features.

lorenzfx · on Aug 2, 2017

I'm using ZFS on my FreeBSD laptop. Snapshots not only make backups safer (by making sure the complete backup is taken at the same time, and by zfs sending and receiving them), the boot environments feature also make upgrading safer.

I also really like that I don't need to partition my disk, if it turns out that /tmp needs > 10% of the disk for whatever reason: no problem!

And as I like my data, I appreciate checksumming and copy-on-write.

I haven't noticed any bad slowdowns compared to ext4 on my Debian laptop I used before.

rsync · on Aug 2, 2017

"I'm using ZFS on my FreeBSD laptop. Snapshots not only make backups safer..."

Did you know you can 'zfs send' snapshots to rsync.net ?[1][2]

[1] https://arstechnica.com/information-technology/2015/12/rsync...

[2] http://www.rsync.net/products/zfsintro.html

viraptor · on Aug 2, 2017

> Will you really need and use subtrees/snapshots/XYZ on your system?

It's a valid question, but not the best one. Almost nobody needs snapshots. But they make things easier. You most likely don't need a journaled fs in your laptop either (battery level notification should take care of the issues). But it does make life better.

"Need" is not the threshold I'm interested in. Most features, I'd like. One feature I think I do need most is scrubbing, which is still absent from most filesystems :(

buster · on Aug 2, 2017

"Need" as in "will you use it?". I played around with snapshots once and never really used them. So i clearly don't have a need for them on my laptop. Journaling on the other hand helps data safety a lot and i think it's not overly complex. I've had data loss happing in the past before journaling, but never again since then. So, wouldn't i need CoW for even better "data safety"? Maybe, but since i've never experienced data loss for so many years, i don't feel like the added complexity is worth it. On my laptop, for my usecase.

But that's only me. Your experience may differ very much :)

lomnakkus · on Aug 2, 2017

> "Need" as in "will you use it?".

Well, many of us have experienced a botched system package upgrade or two. If the file system supports snapshots, then the package manager could automatically ensure fully atomic package upgrades.

That should be reason enough, I should think.

Re: The data loss issue: Yes, I've actually have XFS completely throw away a file system upon a hard power-off + boot-up cycle. (This was ages ago, I'm sure it's improved heaps since then.)

buster · on Aug 2, 2017

Good point. I'm using Debian as my Desktop for many years. I don't remember a "botched" system package upgrade in the last 5 years, but i've probably learned over years how to handle dpkg/apt.

The atomic updating is a very interesting topic and the reason why i find ostree/guix/nixos very appealing. Note that neither ostree nor guix or nixos make use of filesystem snapshots, afaik. OSTree even documents why it won't use filesystem snapshots: https://ostree.readthedocs.io/en/latest/manual/related-proje... Debians dpkg does not use snapshots as well.

So, it's a definitely a nice-to-have, but not something i need, because i can handle dpkg/apt much better then i could handle filesystem internals.

That's sort of the point: I don't want the complexity in the filesystem, but i am fine with it in userspace. I can use snapshots on filesystem level. Or i can use other backup tools in userspace. While it's certainly neat that the filesystem can do that, i'm perfectly fine with handling backups on another level.

Another example: It's certainly neat that there are a bunch of distributed filesystems (which by the way have A LOT of complexity and often can't handle all workloads you would expect from a filesystem). But i'd rather use either an S3-like network storage or build a system that scales well without relying on Ceph/Gluster/Quobyte/etc.

For example, in a hypothetical distributed system i'd rather use Cassandra and distribute data over commodity hardware then use Ceph. I'd rather handle problems with data persistence/replication on the cassandra level then debugging on file system level. Especially, when Cassandra has a problem i'll most likely be able to access all data atleast on the filesystem level. When my filesystem is borked, i'm in a much worse situation.

lomnakkus · on Aug 2, 2017

> So, it's a definitely a nice-to-have, but not something i need, because i can handle dpkg/apt much better then i could handle filesystem internals.

I don't think you're seeing my point. You wouldn't have do anything -- it would all be done automatically as long as your file system supports snapshots.

BTW, to your "I know how to use dpkg/apt": It's not about knowledge. I could well be said to be at an "advanced" level of expertise in system maintenance, but "system upgrade" fuckups had nothing to do with me, but everything to with bad packaging and/or weird circumstances such as a dist-upgrade failing midway through because some idiot cut a cable somewhere in my neighborhood.

While Nix and the like are nice and all, they're currently suffering from a distinct lack of manpower relative to the major distributions. They also don't quite fully solve the "atomic update" problem, but that's a tangent. Then, OTOH, some of them have other advantages such as the easy of maintaining your full system config in e.g. Git. Swings and roundabouts on that front. FS support for snapshots would help everybody.

buster · on Aug 3, 2017

You're absolutely right. Still, dpkg doesn't support snapshots out of the box. I could fiddle around with it and i suppose i could make a snapshot before running "apt upgrade", but since that never failed for me, i would touch something very stable for little apparent benefit. Let's say Debian 10 will support btrfs snapshots on updates, i'll consider using btrfs for the next installation, but not before.

Did you read the link from the ostree people? Let's pretend Debian 10 offers to choose between OStree-like updates and btrfs snapshots: I'd probably choose OStree and stick to ext4/xfs.

lomnakkus · on Aug 5, 2017

> Still, dpkg doesn't support snapshots out of the box.

Yes, but it SHOULD, just because ALL REASONABLE FILE SYSTEMS SHOULD SUPPORT SNAPSHOTS. Therefore dpkg should assume that such support is avaiable, or at the very least take advantage of it, when available.

Just to reiterate: You (impersonal!), the "ignorant user", shouldn't have to even have to think about it.

Does this make my point clear?

(I'm only being this obtuse because you're saying "you're absolutely right", but apparently not seeing my point. I'm assuming it's some form of miscommunication, but it's difficult to tell.)

EDIT: Hehe, I'm sorry, that sounded much more aggressive than I intended. I just think that us software developers could and should(!) do much better by our users than we(!) currently do. My excuse is that most of my stuff is web-only, so at least I can't do the accidental equivalent of "rm -rf /", but...

buster · on Aug 6, 2017

Still, i think that my filesystem shouldn't accumulate that much complexity. Maybe a layered approach similar to Redhats Stratis is a better way.

cyphar · on Aug 2, 2017

> If the file system supports snapshots, then the package manager could automatically ensure fully atomic package upgrades.

That's exactly what openSUSE / SLE do with snapper. Every upgrade or package install with YaST/zypper creates two snapshots (before/after) and you can easily rollback to an older snapshot (even doing so from GRUB). This has been enabled by default for years.

_lqaf · on Aug 2, 2017

I have a wrapper script for apt-get to do that on several machines that run Debian unstable.

Another use for snapshots is backups. I love `zfs send` - it makes backup braindead-simple.

exikyut · on Aug 2, 2017

If you're on an SSD/MTD/NVMe, you have TRIM, and scrubbing is a no-op no matter what approach you try. You need a spinning HDD for scrubbing to be useful.

Here is one way to do simple, secure scrubbing on Linux without any intrusive system changes. It is mildly restrictive, but works.

First, you need a small, dedicated partition, but it only needs to be around 16MB or so. Resizing an existing partition down (tune2fs will happily resize a mounted ext4 filesystem, but you'll probably still need to reboot to reload the partition table once you've resized that too) will give you a bit of space.

Now you have a small area of the disk that occupies a known range of sectors, and because you have no TRIM, writes to this area will be properly deterministic. Good.

Create and mount a new filesystem without a journal on the new partition. ext2 could work here (:D), you could `mkfs.ext{3,4} -O ^has_journal`, or you could use filesystem defaults and simply overwrite the entire partition with /dev/urandom later.

Make a sparse file with fallocate (make sure the file system you create the file on can handle sparse files) that is big enough to handle the biggest file.

Create a LUKS volume with a detached header inside the new sparse file, and store the detached header metadata into a file in the new journal-less partition.

Create an ordinary filesystem inside the LUKS volume.

Now you have a Rube Goldberg sparse file. You've moved the deterministic-writing/journal-less stage into a tiny key, which is a lot easier to manage than a whole gigabytes+-large partition.

As an alternative you could drop the key onto a flash drive, and nuke the flash drive when you wanted to kill the data. That's kind of wasteful though (and it carries the same flash-drive-quality risks as copying the only copy of the data itself onto the flash drive).

LUKS was designed such that if you lose the key(s) or the detached header, all that's left is statistically random garbage.

dsr_ · on Aug 2, 2017

You seem to be using a different definition of scrubbing than people talking about filesystems usually use.

Scrubbing means to read all the data off a filesystem and compare it against its checksums, so that you are confident nothing has happened to the data (hardware failures, cosmic rays, whatever).

ZFS and btrfs have specific scrub commands that do that.

There's no scrubbing available for a system which does not keep some form of checksum/crc/hash of the data.

I think that you are talking about secure delete procedures.

exikyut · on Aug 3, 2017

OH. You're right. I got the terminology confused with, uh... shredding. Heh.

I actually tried to delete this comment for unrelated reasons shortly after posting it, but was unable to. Now I feel doubly stupid.

Maakuth · on Aug 2, 2017

It's a good practice to use LVM there between disk partitions and volumes. It has negligible performance implications but makes things very flexible when you need to resize volumes or add space. You also gain reliable snapshotting from device mapper, although that does have some performance effects.

pmlnr · on Aug 2, 2017

My requirements: transparent compression. I'm left with: btrfs or, if possible, ZFS.

eternauta3k · on Aug 2, 2017

It's ok to play with the OS in order to learn.

StreamBright · on Aug 2, 2017

We are using XFS for most of our production workloads, it turned out to be an excellent choice for most data heavy use cases. Brtfs was never an option, it is a bad idea to gamble with beta technology for data storage that a production system relies on. ext4 vs xfs is a much interesting argument, I haven't had time to follow up on this.

jerven · on Aug 2, 2017

We use XFS for sparql.uniprot.org (basically columnar database with semantic graph) there we recently retested it by accident. We use 2*4 TB consumer SSD ,ok rich consumer ;) With XFS we have 10-13% faster linear write of one big file (1.3TB) and about 20% more reads serviced per minute than EXT4. EXT4 was selected by accident instead of XFS for 1 of the 2 otherwise equal machines when upgrading these with the new 4TB SSDs instead of 1TB ones which they had before.

In general I feel that the more you go into "enterprise" storage levels, the more XFS pulls ahead from the EXT family. i.e. laptops and small servers are not where the difference lies.

joosters · on Aug 2, 2017

apt-get upgrade syncs so often that 'eatmydata' gives a noticeable speedup pretty much everywhere (I got into the habit of using it for ext3/4)

I don't get why apt syncs so often - isn't the main point of log-structured file systems their ability to recover after a crash or powerloss? If so, why should you need to sync more than one every ten seconds or so?

dsr_ · on Aug 2, 2017

apt doesn't assume that you have a reliable filesystem. It assumes that you might crash at any moment, and it would be really important for you to have a consistent view of what packages are installed when you reboot.

joosters · on Aug 2, 2017

But ext3 and more advanced filesystems have been around for almost twenty years now... it seems an odd assumption that your filesystem is unreliable on any machine that isn't completely ancient (is anyone still using ext2, for instance?)

rleigh · on Aug 2, 2017

It's not about the filesystem being "unreliable". It's about having the package manager's state checkpointed so that it can recover and resume if there is e.g. power loss or any other form of interruption at any point during package installation, upgrade, removal etc. This means having all of the updated files synched on disc plus the database state which describes it.

When you move to a more advanced setup such as ZFS clones, you could do the full upgrade with a cloned snapshot, and swap it with the original once the changes were complete. This would avoid the need for all intermediate syncs--if there's a problem, you can simply restart from the starting point and throw all the intermediate state away.

yjftsjthsd-h · on Aug 2, 2017

Debian calls itself "the universal operating system" and officially supports not only multiple init systems, but multiple kernels (kfreebsd); somehow I don't see it relying on specific filesystems.

lathiat · on Aug 2, 2017

Though true I have literally never had my system get corrupt or inconsistent during failed dpkg/apt from power loss, hang, filesystem going ro, etc. It's very reliable.

I've had older rpm & yum/dnf failures multiple times leaving me in weird inconsistent states from crashes or power losses etc. not conclusive but anecdotal experience - It's also possible it's been improved.

Meanwhile you can disable the file syncing with the apt preference dpkg::unsafe-io (google will be required for the exact syntax and file in /etc/apt - fairly sure you can cmdline it also)

bedros · on Aug 2, 2017

I think this is a political move disguised as technical move

oracle pays the developers of btrfs [0]

redhat hates the guts of oracle, since oracle released oracle linux, which is a clone of redhat enterprise (based on centos)

so, redhat wants to cripple btrfs and hurt oracle.

However, btrfs is my favorite FS, been using it on my home computer and backup drives for at least 6 years, before it was included in the kernel, love the subvolumes, snapshots, and compression; never had issues with it .

[0] https://oss.oracle.com/~mason/

[Update] Chris mason no longer at Oracle since 2012

ge0rg · on Aug 2, 2017

I think this is a political move disguised as technical move

I think there are solid technical reasons to discourage Btrfs use, just to quote from the official wiki [0]:

> The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes.

Now I don't know if this issue has been addressed already, or which kernels are affected, but the fact that there is a prominent warning on the wiki speaks for itself.

Personally, I'm a happy btrfs user deploying a mixed-disk-size array without parity, with the hope to add redundancy some time in the future. Currently, btrfs is the only FS allowing to mix disks of any size and to run an optimal configuration on top of them [1].

[0] https://btrfs.wiki.kernel.org/index.php/RAID56

[1] http://carfax.org.uk/btrfs-usage/

cyphar · on Aug 2, 2017

The particular bug that sparked that warning was fixed a while ago, but as a precaution against "btrfs ate my data" stories they've removed the ability to create btrfs-raid from the CLI tools (you can still use md RAID with btrfs but you lose most of the benefits of btrfs that way).

cmurf · on Aug 2, 2017

Who's they? Upstream have not removed Btrfs raid creation capability in btrfs-progs, I'm not aware of any distro that has patched it this way.

cyphar · on Aug 2, 2017

Oh, I must've misunderstood this mail[1] and thought they had actually gone through with #ifdef-ing out the raid56 creation code in btrfs-progs.

My bad.

[1]: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...

tw04 · on Aug 2, 2017

Which benefits are those? Both synology and qnap have the ability to detect and correct bitrot doing btrfs on top of mdraid.

rodgerd · on Aug 3, 2017

The design of btrfs allows for mismatched disks to be used in an array, and the btrfs RAID will keep the right level of redundancy while using the maximal amount of space, e.g. with a 4 TB, 2 TB, and 1 TB drive:

mdadm will give you 1 TB in RAID 1, or 1.5 TB in RAID 10 (constrained by the smallest drive).

btrfs will give you 3 TB in RAID 1 (constrained by the sum of the smallest drives).

btrfs also allows per-subvolume raid policies. So you could, for example, give users an "archive" subvolume in their home directory. You could then mark this as RAID 1 or RAID 5 (because you don't care so much about performance) while the main /home filesystem is RAID 10.

Unfortunately the RAID code is all horribly broken.

cyphar · on Aug 2, 2017

Being able to have non-symmetric disk topologies with redundancy. I believe that md raid does not support that, while btrfs multi-device does (which is what I think of as one of the really unique features of btrfs -- not even ZFS can handle the sort of disk topologies that btrfs can).

tw04 · on Aug 3, 2017

md-raid absolutely supports that and synology has for a long time. They call it "SHR". You simply do raid over disk partitions to enable disks of disparate sizes.

The reason ZFS doesn't support it, and absolutely 0 enterprise storage devices support this is because as the disks fill up, you sacrifice both performance and redundancy. Synology won't even support it on their high-end devices for this very reason. They'll only do it on their devices targeted at home use.

rleigh · on Aug 2, 2017

Even when it gets resolved, it's far from being the only major problem with Btrfs. It's merely the current high profile one.

cmurf · on Aug 2, 2017

Fixed in 4.12.

skrause · on Aug 2, 2017

Chris Mason, the principal Btrfs author, left Oracle over 5 years ago:

https://en.wikipedia.org/wiki/Btrfs#History

"In June 2012, Chris Mason left Oracle for Fusion-io, which he left a year later with Josef Bacik to join Facebook; while at both companies, Mason continued his work on Btrfs."

newscracker · on Aug 2, 2017

> However, btrfs is my favorite FS, been using it on my home computer and backup drives for at least 6 years, before it was included in the kernel, love the subvolumes, snapshots, and compression; never had issues with it .

Slightly off topic. I chose btrfs as my main filesystem recently on a system running Ubuntu/Xubuntu. I have done some research on backing up (with the advantage of snapshots) but it looks like there aren't (m)any graphical tools (this gets a little more confusing with /@ and /@home subvolumes on the same partition being treated separately for snapshots, AFAIK).

Do you manage it all from the command line and/or do you have any suggestions for graphical tools to do "as-is clones of entire partitions" (and also incremental backups) to local external drives (not over the network)? Or if you could point to any great documentation or blog posts on this topic, that'd be helpful too (I have read some bits of the btrfs wiki and the btrfs parts in the Arch wiki).

Currently I'm doing a plain rsync using Grsync, and not really taking advantage of btrfs features like snapshots.

The main reason I'm looking at avoiding the command line is to make it easier for others around me to use it.

bedros · on Aug 2, 2017

my line for snapshots

btrfs subvolume snapshot /source/drive/folder/ /source/drive/folder/.snapshots/snapshot-`date +"%Y-%m-%d-at-%I-%M%P"`

this will create a snapshot with date and time attached to the snapshot name

you can more info here

https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Managi...

make sure to remember to delete old snapshots, otherwise you'll run out of disk space and not know where it went

lathiat · on Aug 2, 2017

performance during snapshot creation/deletes (presumably mostly the deletes) is one of the reasons I personally stopped using btrfs on my desktop. Now using ZFS root (with Ubuntu devel).

I had auto hourly snapshots and sometimes when it deleted one my entire system would hang for a few seconds and occasionally 10s of seconds.

Having said that I do suspect that might be partially related to also using ecryptfs on top, but still.

cookiecaper · on Aug 2, 2017

If anything, the political move would've been Oracle's sponsorship of btrfs in the first place. They want to push people to enterprise OS/storage systems, so they've told everyone "Oh yeah, btrfs is coming soon, it'll be great" ... and it isn't great. It sucks.

I've finally broken and installed ZOL (ZFS on Linux) after trying btrfs repeatedly over the last three years. ZFS is already a breath of fresh air and I've only been using it a couple of months. For whatever reason, btrfs came together as a messy hodge-podge, and it shows in bad performance for many use cases (e.g. "omg I forgot nodatacow"), buggy implementations, difficult user interfaces, kernel bugs, etc.

btrfs needs a reboot (I hear bcache? is trying). Meanwhile, everyone should stop getting hung up on the arcane licensing details and just use ZFS directly. It can't be distributed as part of the kernel, but that's why we have distributions, isn't it? They bundle all that crap together for us. There shouldn't even be the normal OSS infighting because this isn't a proprietary blob or something, it's just using a license that's GPL-incompatible.

The best thing Linus could do for the community at large would be to fork and start committing to ZOL, giving it a tacit endorsement.

cleech · on Aug 2, 2017

Chris Mason left Oracle for Fusion-io back in 2012 and then from there to Facebook in 2013

sargun · on Aug 2, 2017

Oracle still has a couple developers on their payroll that do BtrFS development:

* Liu Bo * Anand Jain

rodgerd · on Aug 2, 2017

> I think this is a political move disguised as technical move

You literally have no idea what you're talking about, and I doubt you've used btrfs seriously, or you wouldn't talk this shit. The fact it's been upvoted so heavily just shows what absolute technically-false nonsense will draw support at HN.

madez · on Aug 2, 2017

There was no reason given for the deprecation of BTRFS.

JoshTriplett · on Aug 2, 2017

Has anyone here tried bcachefs (http://bcachefs.org/) for some of the same use cases as btrfs? What do people think of its current state?

lathiat · on Aug 2, 2017

You'd need to read up on it but the TL;DR i got from previous comments was that previously it was being funded by a company that wasn't 100% aware of the entire situation around it (they used it in their commercial product, management didn't fully understand it was being released as open source, though I also understand they didn't own the rights to the entire code base so basically that was necessary effectively). The person developing it left that company and is seeking his own funding for it but that generally the development was significantly stunted as a result.

It seems a bit here-say-ish though, so please don't assume I'm entirely correct on all fronts there and I'd encourage you to research it further!

joecool1029 · on Aug 2, 2017

>I'd encourage you to research it further!

On that note, their Patreon is a better primer than the website is: https://www.patreon.com/bcachefs

lathiat · on Aug 2, 2017

Happy to see his Patreon is looking "healthier". I mean $1500/m isn't really that much but I'm sure last I looked it was much lower.

In other news.. consider supporting the people that support you! Personally, I spent over $100/month on Patreon. Most of those are creators rather than open source people but there are a couple of open source ones such as Ondřej Surý who works on PHP packaging in Debian/Ubuntu

icebraining · on Aug 2, 2017

Joey Hess, a formerly prominent Debian developer and author of git-annex, etckeeper and a bunch of other open source projects is also on Patreon: https://www.patreon.com/joeyh

Unfortunately, he's only getting $500/m, which isn't much even for someone with a very "off-grid" life: https://joeyh.name/blog/entry/notes_for_a_caretaker/

lathiat · on Aug 2, 2017

Thanks for the link! I've added him to my patreon-roll (what do you call that?)

Discoverability is a real problem for Patreon, outside of the "most successful"

vbezhenar · on Aug 2, 2017

It's $1050, not $1500. I'm interested in two projects on Patreon (bcachefs and Matrix) and both projects are not getting nearly enough to be self-funded, so this raises the question if this model even works for Open Source. So far any advanced technology seems to be funded by some big corporation and it's not very good. But, I guess, users just don't care about good inner workings, they care about things they personally enjoy (cartoons, etc), so funding those inner workings is still an open question.

tmikaeld · on Aug 2, 2017

I think it's an issue with the marketing of it as well, I recently got notice that some of the most important open source projects I use have a patron page they depend on, yet didn't show it on github, only on their home page under donate. Same as with Kickstarter, if you want funds, you have to properly and visibly ask for it!

I donated a lot to git-annex because he did it right, showing everywhere that you could sponsor the development.

koverstreet · on Aug 3, 2017

Oh, they knew it was being released as open source. There was a clear boundary between the open source and the proprietary code - I worked on both. The open source thing was just a convenient excuse for some political bullshit - either that or they thought they could buy out a GPL project and take it proprietary, which is just insane. I mean, half the copyright was Google.

riffraff · on Aug 2, 2017

OT, but I think you meant "hearsay-ish"

sargun · on Aug 2, 2017

I've looked into BCacheFS extensively. Two problems I have -- First, it's not there in features yet. It doesn't have quotas, nor snapshots.

Second, it lacks a formally published design for these features. Because the author is the only person with knowledge of how these things might work, it makes it really difficult to mitigate the bus factor while the project is still in heavy development.

joecool1029 · on Aug 2, 2017

Looks neat. Nobody uses it though and LKML has zero discussion on the announcements.

However, HN has a previous thread on it here: https://news.ycombinator.com/item?id=12410798

geota · on Aug 2, 2017

I actually use bcachefs on my main data drive and so far it has been running solid for ~3months. That said, looks like the author has gone camping/moving/re-evaluating life. Hoping the development picks back up or that it gets rebased off of 4.12 (or maybe made into a set of kernel patches?). Wanted a COW fs - tried btrfs and kept running into issues where the drive would fall into ro mode so far not regretting bcachefs.

https://aur.archlinux.org/packages/linux-ck-bcachefs/

cmurf · on Aug 2, 2017

Unsurprising. Red Hat has not hired upstream Btrfs developers for years, where SUSE has hired bunches. Meanwhile Red Hat has upstream ext4, XFS and LVM developers.

If you're going to support a code base for ~10 years, you're going to need upstream people to support it. And realistically Red Hat's comfortable putting their eggs all in the device-mapper, LVM, and XFS basket.

But, there's more: https://github.com/stratis-storage/stratisd

Btrfs has no licensing issues, but after many years of work it still has significant technical issues that may never be resolved. page 4

Stratis version 3.0 Rough ZFS feature parity. New DM features needed. Page 22 https://stratis-storage.github.io/StratisSoftwareDesign.pdf

Both of those are unqualified statements, so fair or unfair my inclination is to take the project with a grain of salt.

gcp · on Aug 2, 2017

As for the significant technical issues, one thing is the core decision to make it a CoW system, which has fundamental performance issues with many workloads that are exactly those used in the server space. You can disable CoW, but you lose many reasons to use btrfs in the first place if you do.

When I gave up on it there were also fundamental issues with metadata vs data balancing, not-really-working RAID support, and so on...

pgaddict · on Aug 2, 2017

I find the suggestion that the technical issues are caused by the CoW design a bit strange.

Sure, making the filesystem CoW-based means there are some inherent costs, but it allows the filesystem to implement some interesting features (e.g. snapshots) in a more efficient way. For example if you want to do snapshots with ext4/xfs, you'll probably do that using LVM (which you can see as turning the stack into a CoW). In my experience the performance impact of creating a snapshot on ext4/LVM is about 50%, so you cut the performance in half. While on ZFS the impact is mostly negligible, due to the filesystem is designed as CoW in the first place.

And thanks to ZFS we know that it's possible to implement a CoW filesystem that provides extremely stable and balanced performance. I've done a number of database-related tests (which is the workload that I do care about) and it did ~70-80% TPS compared to ext4/xfs (without snapshots). And once you create a snapshot on ext4/xfs, the performance tanks, while ZFS works just like before, thanks to the CoW design.

Unfortunately, BTRFS so far hasn't reached this level of maturity and stable performance (at least not in the workloads that I personally care about). But that has nothing to do with the filesystem being CoW, except perhaps that CoW maybe makes the design more complicated.

jeltz · on Aug 2, 2017

Didn't one of your benchmarks show that nodatacow on Btrfs resulted in a major performance improvement? But that might just show an issue with Btrfs's CoW implementation rather than CoW in general.

pgaddict · on Aug 2, 2017

Yes, I've done some tests on BTRFS with nodatacow, and it improved the performance and behavior in general. Still slower tha XFS/EXT4, but better than ZFS (with "full" CoW).

But as you mention, that does not say anything about CoW filesystems in general. It merely hints the BTRFS implementation in not really optimized.

FWIW while I do a lot of benchmarks (both out of curiosity and as part of my job, when evaluating customer systems), I've learned to value stability and predictability over performance. That is, if the system is 20% slower, but provides stable and predictable behavior, it's probably OK. If you really need the extra 20% you can probably get that by adding a bit more hardware, and it's cheaper than switching filesystems etc. (Sure, if you have more such systems, that changes the formula.)

With EXT4/XFS/ZFS you can get that - predictable, stable performance. With BTRFS not so much, unfortunately.

gcp · on Aug 2, 2017

>it allows the filesystem to implement some interesting features (e.g. snapshots) in a more efficient way.

Interesting features are worthless when reading and writing data is prohibitively slow. Or when there are documented cases where updating a file in random-access manner can cause its storage requirement to balloon to blocks^2.

cryptonector · on Aug 2, 2017

There's a write magnification effect when using CoW. The ZIL helps with this because the ZIL itself is not CoW'ed, and it allows deferring writes, which allows more transactions to share interior metadata blocks, thus reducing the write magnification multiplier. I don't get where you get O(N^2) from.

As to snapshots, who cares, they cost nothing to create and they do not slow down writes -- they only slow down things like zfs send (linearly) and they cost storage over time, but not much more.

gcp · on Aug 3, 2017

You are confusing storage requirements with write amplification (which is another downside). They're totally different.

pgaddict · on Aug 2, 2017

Are you suggesting that's a problem with CoW in general, or with BTRFS implementation specifically?

I would say ZFS works extremely well (at least for the workloads I care about, i.e. PostgreSQL databases, both OLTP and OLAP). I know about companies that actually migrated to FreeBSD to benefit from this, back when "ZFS on Linux" was not as good as it's today.

cmurf · on Aug 2, 2017

Unconvincing. LVM's snapshots are CoW whether thick or thinly provisioned. And while not yet merged in mainline, XFS devs are working on CoW as well which is used when modifying reflinked files (shared extents).

Btrfs behaves basically like that with 'nodatacow' today. It will overwrite extents if there's no reflink/snapshot. If there is, CoW happens for new writes and any subsequent modifications are overwrites until there's a reflink/snapshot in which case CoW happens.

The 'nodatacow' flag can be used as either a mount option, or selectively with an xattr per subvolume, directory, or file. And in all cases, metadata writes (the file system itself) are still CoW.

denji · on Aug 5, 2017

https://github.com/redox-os/tfs

lathiat · on Aug 2, 2017

This is specific to RHEL7, notably that they won't backport any further kernel updates and won't move it from Technology Preview to release. Red Hat wasn't really driving btrfs development at all from what I am aware of.

⁠Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.

The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature.

Red Hat will continue to invest in future technologies to address the use cases of our customers, specifically those related to snapshots, compression, NVRAM, and ease of use. We encourage feedback through your Red Hat representative on features and requirements you have for file systems and storage technology.

X-Istence · on Aug 2, 2017

More importantly Red Hat has deprecated FCoE in RHEL, which is big news, because at a previous $JOB they went all in on FCoE because it was supposed to be the future.

KaiserPro · on Aug 2, 2017

FCoE died a death <2013.

Expensive hardware, with little gain sadly. It was a nice idea, however at its very core is a fairly large problem: converged network adaptors are problematic.

Unless you have lots of bandwidth in said adaptor (ie 56gig inifiband) you are going to get contention between network and disk IO.

X-Istence · on Aug 2, 2017

I agree, not only that, but at least with Cisco gear FCoE was a real pain in the behind to manage and configure. So much duplication in configuration and settings across a lot of gear, and it never was as smooth as they made it out to be.

B1FF_PSUVM · on Aug 2, 2017

https://en.wikipedia.org/wiki/Fibre_Channel_over_Ethernet

lunchables · on Aug 2, 2017

We've been big Cisco customers for years and ever since I saw FCoE I thought it was a disaster. All of the DCB extensions that had to go into ethernet to get it to work was such an ugly mess. It was just too complicated compared to alternatives, and iSCSI got a free ride on all that work (ethernet pause, flow control, etc) and was far simpler to implement. And of course with lots of 10Gb options with iSCSI offload, it was getting harder and harder to find any advantage LARGE ENOUGH in FCoE to justify it's cost and complexity.