DTrace for Linux 2016 (brendangregg.com)
482 points by okket on Oct 27, 2016 | 81 comments



It would be worthwhile to clarify the term "tracing" to distinguish between live aggregation and post-processing approaches. The general confusion around the "tracing" terminology seems to imply a competition between these two, while they should rather be seen as complementary.

DTrace, SystemTap and eBPF/BCC are designed to aggregate data in the critical path and compute a summary of the activity. Ftrace and LTTng are designed to extract traces of execution for high resolution post-processing with as small overhead as possible.

Aggregation is very powerful and gives a quick overview of the current activity of the system. Tracing extracts the detailed activity at various levels and allows in-depth understanding of a particular behaviour after the fact, since as many analyses as necessary can be run on the captured trace.

In terms of impact on the traced system, trace buffering scales better with the number of cores than aggregation approaches due to its ability to partition the trace data into per-core buffers.
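
To make the distinction concrete, here is a minimal Python sketch. It is not code from any of the tracers named above, and the event tuples and function names are made up for illustration. It contrasts updating a summary as each event fires with appending events to per-CPU buffers and analysing them afterwards.

```python
from collections import Counter, defaultdict
import heapq

# Hypothetical events: (timestamp_ns, cpu, syscall_name).
EVENTS = [
    (100, 0, "read"), (105, 1, "write"), (110, 0, "read"),
    (120, 1, "read"), (125, 0, "open"), (130, 1, "write"),
]

def aggregate_live(events):
    """Aggregation model: update a summary as each event fires.

    Only the counts survive; the individual events are discarded.
    """
    counts = Counter()
    for _ts, _cpu, name in events:
        counts[name] += 1
    return counts

def capture_trace(events):
    """Tracing model: append every event to a buffer owned by its CPU."""
    buffers = defaultdict(list)
    for event in events:
        buffers[event[1]].append(event)
    return buffers

def merge_trace(buffers):
    """Post-processing: merge the per-CPU buffers back into time order."""
    # heapq.merge needs sorted inputs, which holds here because each
    # buffer was filled in timestamp order.
    return list(heapq.merge(*buffers.values()))

live_counts = aggregate_live(EVENTS)
trace = merge_trace(capture_trace(EVENTS))

# The captured trace can answer the same question after the fact...
offline_counts = Counter(name for _ts, _cpu, name in trace)
# ...and also questions nobody asked at capture time, such as per-CPU gaps.
cpu0_times = [ts for ts, cpu, _name in trace if cpu == 0]
cpu0_gaps = [b - a for a, b in zip(cpu0_times, cpu0_times[1:])]
```

The aggregation path keeps only the counters, while the trace path keeps every event and pays for that in storage and a later merge step.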

Both approaches have upsides and downsides and should not be seen as being in competition: they address different use cases and can even complement each other.


You're right that a key feature and differentiator of DTrace/stap/BPF is kernel aggregations, but they can do per-event output as well. But I think I know what you mean, especially as I was at the sysdig summit yesterday and could see a major difference.

I think the two models for tracers, playing on their strengths, are: 1. real-time analysis tracers (DTrace/stap/BPF), and 2. offline analysis tracers (LTTng, sysdig). Both can do the other as well, but I'm just pointing out strengths.

sysdig (and I believe LTTng) has done great work at creating capture files that can then be analyzed offline in many many different ways, and they've optimized the way full-event dumps can be captured and saved (which I know LTTng has done as well). DTrace/stap/BPF don't have any offline capture file capabilities -- they could do it, but it's not been their focus.


I've only recently tried out DTrace on OS X, and I'll admit to being kinda floored at what it can do. To think I used to be satisfied with strace on Linux!

Seeing the tracing capabilities of Linux expand is exciting indeed.

Edit: the couple of tutorials that finally unlocked DTrace (on OS X) for me are:

https://www.objc.io/issues/19-debugging/dtrace/

https://www.bignerdranch.com/blog/hooked-on-dtrace-part-1/


Agreed, DTrace on OS X is super powerful.

I once tried to debug the open source libusb app on Mac OS; with DTrace I could trace the app, the kernel USB API calls, libusb's internal thread in user space, etc.

Much better visibility into system activity compared to simple strace.

Absolutely love the power of what it can do.

BTW, can a DTrace script be used to monitor a system for potential "Dirty COW" type privilege escalation issues?


The most challenging thing for us is running a new enough kernel to get these features. While upgrading to a newer kernel isn't particularly hard, small companies don't have a lot of engineering resources to run kernels that aren't maintained by their distro of choice (usually on the LTS release).

The good thing is this is solved simply by waiting long enough. The bad thing is most developers can't just pick this up today without a bunch of extra effort.

If you are looking for something you can use with old kernels you should definitely check out Brendan's perf-tools repo[1]. It takes advantage of older kernel features and works with things as old as Ubuntu 12.04.

*Edit: Fixed Brendan's name

[1]: https://github.com/brendangregg/perf-tools


On the other side of the spectrum, companies highly averse to technical changes culturally (typical case in the F500) will avoid ever upgrading kernels, libraries, and tooling. It's how I've wound up spending days trying to compile C++11 and C++14 code with toolchains that would run on CentOS / RHEL 5 and 6. Using the JVM lets you side-step the shared library linkage compatibility issues at least, but when you need a new kernel for instrumentation it's an even harder sell to an antagonistic IT department that only wants 2 OSes and corresponding versions to exist in the world ideally - "Linux" and Windows.


Right, thanks, my perf-tools are on the Netflix BaseAMI, and are my go-to tracing tools for 3.x and earlier 4.x kernels.


Amazon enabled BPF flags in the Amazon Linux AMI with 2016.03, and generally seems to move to whatever the latest LTS kernel is when they release a new version.

Since 4.9 is supposed to be the next LTS, if it gets out of RC fast enough, we could see 4.9 in the 2017.03 Amazon Linux AMI, which would be a pretty big win for those of us running workloads in the AWS cloud.


This is the same problem shared by Red Hat customers. Although RH is great about backporting features to older kernels, I'm not sure they'll be able to move this to 3.x from 4.9. The price we pay for stability.


You mean the price paid for stable kernel ABIs, so proprietary drivers can use them?

The Linux kernel is the safest operating system component to upgrade, mostly because the Linux kernel people care deeply about compatibility and not breaking user space. While it is still not free to upgrade, the cost is minimal compared to the cost (in real money, risk, security) of backporting major components to older kernels, like Red Hat is doing.

Red Hat is maintaining old kernels for having a stable kernel ABI (among some other reasons), not for general "stability".

Personally I run only CentOS, but always with a recent kernel, albeit usually an LTS one. Mostly to get exactly the type of features described in this article.


Well they show the same caution and care around curating their repositories and releasing security fixes, so in my opinion (you may disagree) they try to ensure stability beyond the kernel's ABI.


Indeed. The kernel is not the only component (of the complete system) that gets features backported.

I, personally, am quite happy that I don't really have to worry that a routine "yum update" is going to break any of my installed applications.


Congrats, this is good news.

> On Linux, some out-of-tree tracers like SystemTap could serve these needs, but brought their own challenges.

I was pretty happy with stap, it had a really rich feature set.

> DTrace has its own concise language, D, similar to awk, whereas bcc uses existing languages (C and Python or lua) with libraries.

I think we need more creative names for languages. The short and simple ones like "go" and "D" keep on having collisions. :)

>BPF adds ... uprobes

uprobes + all the other stuff is really killer, I like the idea of watching for stuff like "my app has crossed this threshold and then this system condition occurs". At least when I tried it a couple years ago with stap my kernel wasn't built with uprobes support and I wasn't inclined to rebuild it. Hopefully it becomes (or has become) more mainstream.


> I was pretty happy with stap, it had a really rich feature set.

So are other companies. I mentioned it in the post, as in a way this hurt BPF development, as companies that normally would have contributed resources said they were satisfied with stap. Exciting times might be ahead for stap, if it continues its BPF backend.

As for naming, yes, we need better names. Maybe the bcc/Python/BPF combination can be named something?


Will there ever be a way to write probes/tracing scripts without dropping into C? I don't mind C in general, but I don't want to have to dig out the documentation for the eBPF C library and start writing hundreds of lines of C every time I want to run a trace.

DTrace made this really nice, because you would write your tracing scripts in a high-level, awk-like language, which is the sort of thing well-suited to the purpose.


Yes, see the section "A higher-level language", which mentions at least two projects: SystemTap+BPF and ply.

Think of the current bcc/Python/C interface as a lightweight skin that was necessary during BPF development to kick the tires on various features, prototype tools, see what else needed to be done, etc. It may be good enough to stay around, as lots of tools have been written for it that will get used and be valuable. But there's room for higher-level languages too.

If Sasha keeps developing his "trace" tool (and its summary counterpart, argdist), that may serve many such custom needs (as another option). See the various examples: https://github.com/iovisor/bcc/blob/master/tools/trace_examp... , like:

    # trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
    TIME     PID    COMM         FUNC             -
    05:18:23 4490   dd           sys_read         read 1048576 bytes
    05:18:23 4490   dd           sys_read         read 1048576 bytes
    05:18:23 4490   dd           sys_read         read 1048576 bytes
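
For anyone unfamiliar with the tool, that one-liner has three parts: a probe (sys_read), a filter (arg3 > 20000), and a format string with its argument. Here is a rough pure-Python imitation of that structure, run over made-up events rather than a live kernel. The event dictionaries and the `trace` helper are hypothetical and are not bcc's API.

```python
# Hypothetical captured calls; a real tracer would get these from the kernel.
CALLS = [
    {"time": "05:18:23", "pid": 4490, "comm": "dd",
     "func": "sys_read", "arg3": 1048576},
    {"time": "05:18:23", "pid": 812, "comm": "sshd",
     "func": "sys_read", "arg3": 16384},
    {"time": "05:18:24", "pid": 4490, "comm": "dd",
     "func": "sys_write", "arg3": 1048576},
]

def trace(calls, func, predicate, fmt, *fields):
    """Return one formatted line per call to `func` that passes `predicate`."""
    lines = []
    for call in calls:
        # Skip calls to other functions and calls rejected by the filter.
        if call["func"] != func or not predicate(call):
            continue
        message = fmt % tuple(call[f] for f in fields)
        lines.append("%-8s %-6d %-12s %-16s %s" % (
            call["time"], call["pid"], call["comm"], call["func"], message))
    return lines

# Equivalent in spirit to: trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
output = trace(CALLS, "sys_read", lambda c: c["arg3"] > 20000,
               "read %d bytes", "arg3")
for line in output:
    print(line)
```

Only the large read by dd passes both the probe match and the filter; the small sshd read and the write are dropped.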


How feasible would it be to compile the dtrace language itself to an eBPF backend?


Huh. Is there a formal definition anywhere?

If we have some kind of spec (even if it's not formal), it might be possible, since they are roughly equivalent, AFAIK. However, since I haven't worked in-depth in either, I'm unsure what work would be involved: it's possible that only a subset would compile.

Anyways, this is probably a good goal to shoot for. DTrace is the system tracer on pretty much all of the other big unixes, so it's a good idea to support as much of its language as possible. Plus, there are a lot of scripts already written in DTrace's language: having access to them would be invaluable.


Nobody would love a DTrace/BPF front-end more than I. And not just because I'd sell more copies of my DTrace book (I joke :). It is a really nice language, although missing a few things that BPF can do that DTrace can't (like saving and retrieving stacks), so it'd need to be enhanced.

But with the warning that I'm not a lawyer: before beginning work on a DTrace/BPF front-end, I'd start by talking to a copyright lawyer to see if permission or a license is needed from Oracle. DTrace is Oracle copyrighted, and re-implementing a DTrace front-end on Linux sounds a lot like re-implementing an Oracle-copyrighted API.


What would stop one from enhancing the DTrace in illumos further? That is licensed under the CDDL, and Adam has made several enhancements to DTrace lately, and if my memory serves me correctly, Bryan fixed a couple of bugs in it recently as well.


Wouldn't really solve our problem, though: We'd need a new compiler for the language. As I understand it, that would require an almost entirely new codebase. But I don't work on DTrace internals, so I may be wrong.


What? That doesn't make any sense. How is dtrace(1) built on illumos, then?


I'm uncertain, but I thought dtrace on illumos was interpreted. Am I wrong?

If I am, then all we'd have to rewrite is the compiler backend, which is much easier, so that would be nice.


dtrace(1) itself is compiled from C into an ELF binary executable.

  % file `which dtrace`
  /usr/sbin/dtrace: ELF 32-bit LSB executable 80386 Version 1, dynamically linked, stripped

The DTrace language, D, is interpreted. By DTrace.

The problem lies in the fact that neither the GNU/Linux kernel, nor the GNU applications provide DTrace probe points. On Solaris and thus on illumos, and thus on SmartOS, there are tens of thousands of probe points and numerous probe providers. Some external applications, like PostgreSQL or PHP, added DTrace probes, and all is well on Solaris / illumos / SmartOS. Some, like node.js had providers and probes added by engineers at Joyent.

http://dtrace.org/blogs/dap/2012/04/25/profiling-node-js/

http://dtrace.org/blogs/dap/2013/11/20/understanding-dtrace-...

http://dtrace.org/blogs/blog/category/node-js/

GNU/Linux would have to do the same thing. It currently has only a handful of DTrace probes and providers, which is understandably not very useful.

http://dtrace.org/blogs/ahl/2011/10/05/dtrace-for-linux-2/


None of that is at all what I meant: allow me to clarify.

I was talking about the DTrace language, which I have been avoiding calling D up until this point, so as to avoid confusion with the other D. In this post, when I talk about D, I will be referring to the DTrace language.

Linux now has a tracing system called eBPF, which provides many of the same advantages of DTrace. This is what Gregg's blog post was about.

However, eBPF requires compiling tracer scripts, or at least parts of the tracer scripts, into bytecode (IIRC). Currently, the bytecode is usually compiled from C code. However, it is ungainly and impractical to write a bunch of C every time you want to run a trace. So I asked if they had plans to support compiling a higher level language. At this point, somebody suggested that somebody should work on compiling the DTrace language, D, to eBPF bytecode. I thought that this would be a good idea, and we were discussing how viable it would be.

I thought that you had suggested using Illumos DTrace as a base for this compiler. Since, by my impression, Illumos DTrace interprets D, I thought that this would require almost a complete rewrite, and thus it wouldn't be very helpful.

It seems you meant something else. So what did you mean?


I think we both got lost, didn't we? (:-)) So let's rewind:

> It is a really nice language, although missing a few things that BPF can do that DTrace can't (like saving and retrieving stacks), so it'd need to be enhanced. But with the warning that I'm not a lawyer: before beginning work on a DTrace/BPF front-end, I'd start by talking to a copyright lawyer to see if permission or a license is needed from Oracle.

I fail to see how Oracle would be relevant, given that the entire DTrace is released under the CDDL. My question to Brendan was why he couldn't just take or enhance the existing DTrace codebase to do the front-end for BPF?

(Personally, I think the other guy working on integrating BPF deeper into the GNU/Linux kernel, rather than biting the bullet, taking DTrace and doing the same thing Solaris engineers did to the SunOS kernel, is terribly mis-guided, especially since some probes and providers already exist in Linux. In the end, it will be a "Linux zoo", like everything else in Linux: "56" competing solutions for doing one thing, none of them comprehensive, and no consistency. Linux history is being repeated again. You have one comprehensive tool which works across several operating systems, DTrace, and Linux is yet again different from everyone else. Reminds me a lot of Microsoft Windows.)


It's not ideal, but eBPF works now, and it works on all the new kernels, which is more than can be said for any of the DTrace on Linux projects. It looks like eBPF is becoming that unified solution (mostly: ftrace and a few others still exist, but eBPF is the most capable, and is picking up steam). As for being different from what the other Unixes did, that's why I was in favor of developing a frontend that supported D, so that we could at least have a shared language with the rest of unixland.

>My question to Brendan was why he couldn't just take or enhance the existing DTrace codebase to do the front-end for BPF?

Brendan didn't answer, so I don't know, but my guess was that turning the interpreter into a compiler would require a near-complete rewrite.


> that's why I was in favor of developing a frontend that supported D, so that we could at least have a shared language with the rest of unixland.

Yes, D everyone would be nice, but what exactly does it mean? We can share DTrace scripts? Docs? Blog posts? Books? I've been porting them all over to bcc/BPF. Am I missing something?

People have already begun work developing new languages, eg, ply https://wkz.github.io/ply/. What if we develop a language that's much better than D? We need to make enhancements, anyway.

I should reiterate something I covered in the post: most people won't care about this. Most people didn't write DTrace scripts when they had the option to (did either of you write DTrace scripts? have some on github?). Most people used the tools. And today, people can "apt install bcc-tools" and use DTraceToolkit-like tools.

If someone wants to engage lawyers & Oracle and see if or what needs to be done to use DTrace, then great, it'd make my job easier when developing these tools (and I'd sell more DTrace books :). But I'd also like to see someone take a swing at developing a better language as another possibility.


> did either of you write DTrace scripts?

I did, some, in the beginning (circa 2006): I was stymied mostly by the realization that deep, deep knowledge of the kernel structures was required to make use of DTrace (I wasn't working as a kernel engineer per se at the time.)

I forgot a lot of it in the meanwhile: just yesterday I was trying to get a simple ustack() using DTrace on a running process, and I even pulled open "Solaris Performance and Tools", and eventually when the process finished, I threw up my hands in frustration. All I wanted to do was see why the running process (oggenc) was taking so long. (But this BPF thing looks far, far more complicated and convoluted than D.)

Nevertheless, I think D is ideal, because, in my case, it plays on my experience in programming AWK: apart from needing to know what in the kernel I wanted to probe, I could immediately start writing DTrace programs without having to learn the language. And that is amazing.


See my other comments -- Annatar, what is your real name?

I'll add: as someone who has written and published countless DTrace and BPF scripts, I don't know that pursuing a DTrace front end is wise right now, for reasons I've already covered.

I'm sorry you weren't able to solve that issue. I'd suggest starting with a profiler (timed sampling) if it was running on-CPU, to see where CPU time is spent.
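
As a rough illustration of what timed sampling means, here is a toy Python sketch. It is not perf or any bcc tool; `profile` and `busy_loop` are names made up for this example, and `sys._current_frames()` is a CPython-specific call. A second thread wakes at a fixed interval and records the call stack of the thread being profiled, so the count per stack approximates where on-CPU time went.

```python
import collections
import sys
import threading
import time

def profile(func, interval=0.005):
    """Run func() while a sampler thread records its stack every `interval` s."""
    target = threading.get_ident()      # the thread that will run func()
    samples = collections.Counter()
    done = threading.Event()

    def sampler():
        while not done.is_set():
            frame = sys._current_frames().get(target)
            stack = []
            while frame is not None:    # walk from the leaf frame to the root
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            if stack:
                # Folded form: root;...;leaf
                samples[";".join(reversed(stack))] += 1
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        func()
    finally:
        done.set()
        thread.join()
    return samples

def busy_loop(seconds=0.2):
    """Burn CPU for roughly `seconds` so the sampler has something to see."""
    deadline = time.perf_counter() + seconds
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total

samples = profile(busy_loop)
for stack, count in samples.most_common(3):
    print(count, stack)
```

The stacks with the highest counts are where the program spent most of its sampled time. A real profiler does the same thing with far lower overhead and can also see kernel and native frames.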


> Yes, D everyone would be nice, but what exactly does it mean? We can share DTrace scripts? Docs? Blog posts? Books? I've been porting them all over to bcc/BPF. Am I missing something?

I think the most important thing to support would actually be DTrace's probe definition files. It would be great, for example, if the story for PostgreSQL's dtrace probe points was "your dtrace probe points are now accessible from eBPF programs too!".


That already works! :) I wrote some MySQL examples in bcc that use the MySQL USDT probe points. Eg:

https://github.com/iovisor/bcc/blob/master/examples/tracing/... https://github.com/iovisor/bcc/blob/master/examples/tracing/...

There's also a tool in /tools.

I'll whip up some PostgreSQL examples too (unless someone else beats me to it).


That looks great! (Why is char query[128] apparently unused?)

Is USDT expected to become the de facto standard for userspace probe points?


Oops, that looks like a left-over from development, that's no longer needed. Thanks! One of us should fix the example. :)


Well, something good about the eBPF design is that we can go in both directions. It's possible to write both new language projects, and to port old ones. I'd work on it myself, if I had the requisite knowledge.

Heck, maybe I'll learn some requisite knowledge and get on it.


Out of curiosity, did you ever write DTrace scripts? SystemTap? BPF? Do you have some on github I can see?


No. See what I mean about requisite knowledge?

I've been interested in learning DTrace for some time now, as it seemed like a tremendously helpful tool, but my systems are all Linux and I don't run Fedora or RHEL, so SystemTap is out. In recent months, eBPF has become a viable option, so I've been looking into that, but haven't had time to dive in deep yet. This makes me a poor candidate for doing this sort of work, which is why I'm not saying for sure that I'd do it. But I might be willing to try.

Don't get me wrong: I have no particular attachment to D as the tracing language. However, I do think it's worth supporting the tracing language that pretty much all of the other big unixes support if we can.

Am I less than qualified to have that opinion? Yes. Think what you will.


Thanks, was wondering if this came from a specific language pain or not.

Since all of this requires being on a newer kernel anyway, I'd do that first, then try the existing bcc tools and see where you are at. You might discover that you can do everything you want to with the existing tools, which include some powerful ad-hoc analysis tools (trace and argdist). Or, if you're missing one or two tools, let us know, and we may develop them for you and add them to bcc.

Many work on the current Ubuntu Xenial 4.4. A few more on 4.6 (stack tracing), more on 4.7 (tracepoints), and all work on 4.9.
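
If you want to check a box quickly, here is a small Python sketch that maps a kernel release string to those milestones. The thresholds are only the ones stated in this comment (as of this 2016 thread), not an authoritative feature matrix, and the helper names are made up.

```python
import platform
import re

# Minimum kernel versions as stated in the comment above; not an exhaustive
# or authoritative feature matrix.
MILESTONES = [
    ((4, 4), "many bcc tools"),
    ((4, 6), "stack tracing"),
    ((4, 7), "tracepoints"),
    ((4, 9), "all bcc tools"),
]

def parse_release(release):
    """Extract (major, minor) from a string such as '4.4.0-45-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))

def available(release):
    """List the milestones a kernel release has reached."""
    version = parse_release(release)
    # Tuple comparison handles cases like (4, 10) >= (4, 9) correctly.
    return [name for minimum, name in MILESTONES if version >= minimum]

# Only meaningful on Linux, where platform.release() is the kernel release.
if platform.system() == "Linux":
    print(platform.release(), "->", available(platform.release()))
```

Distro kernels sometimes backport features, so the version number alone is a hint rather than proof.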


> Thanks, was wondering if this came from a specific language pain or not.

No language pain with DTrace, but having to know kernel structs put a damper on things like you wouldn't believe.

The problem is this: suppose you've just built mikmod on Solaris (or some variant thereof), and you're not the developer of the application. You want to see why the mikmod application's output is pausing every few seconds, causing a lag; that's all you care about - answers. DTrace to the rescue! Not so fast:

with DTrace, you know what answer you want to get: why is the output freezing every few seconds? What you don't know is what to ask to get it.

And I bet based on experience, that every tracing framework currently in existence has the same issue.

Even in scenarios where I knew exactly what to ask, correctly interpreting the results is a laborious and involved arcane art exercise in itself - you yourself wrote several books on the subject!


> I fail to see how Oracle would be relevant

This sounds like a legal conclusion. Care to back that statement with your real name?


I have never been affiliated with either Sun Microsystems or the Oracle corporation, if that is what you're getting at.

And I am not yet prepared to disclose the name that the system gave me on the Internet - because Internet never, ever forgets. Roaches check in - but they don't check out.

I still don't see how Oracle is relevant in any of that, since DTrace source code is under CDDL. As far as I'm concerned, that's where the story begins and that's where it ends.


As far as an anonymous person is concerned? You keep stating that your opinion matters, but won't back it up with your actual name. Why is that?


Because a) I value my privacy and b) name shouldn't matter - only whether the argument, assertion, or the idea is correct or not.

Sadly, you and I have never met. I've got questions for you, and then some.


"and then some"?


Yes, lots of questions, irrelevant to this topic but as a huge fan of illumos and SmartOS very much relevant to me, personally.


He's not anonymous, he's pseudonymous, which is a distinction with considerable difference.


Huh. Well then how do Illumos and FreeBSD DTrace exist? They're actually based on the code, yes, but it's the same legally AFAIK (but I'm not a lawyer).

Besides, as I understand it, it's the implementation that's copyrighted, not the interface or the idea, which is why Oracle went after Google the way they did, and why things like Freeablo and OpenMW can exist. But again, I'm not a lawyer, so this may be incorrect.


CDDL is a file based license, so they use the files and follow the license on those files, which grants them rights to use it.

The GPL/CDDL license wars over DTrace went on for years. It was a real shame -- we should have been discussing or using a great technology instead.

There must be a hundred different ways and languages we can try, some of them may be better than DTrace's D. One of them is Oracle copyrighted, and lacks some features we need (stack saving). I'd either A) talk to a lawyer, or B) pick one of the other 99 options.


That's entirely reasonable. I had hoped that we'd be able to use scripts from the other unixes, and have at least a mostly common language. But it's not a deal breaker or anything.


I think it would be an absurd result if, for example, I could: buy your book (which presumably does not infringe Oracle's copyrights); write a new implementation of the probe language described therein; and thereby end up with something that infringes Oracle's copyrights.


Thanks. That looks very exciting, and it's a relief to know that the eBPF team has thought about these things, and are actually doing work on them.


It wasn't mentioned in the article, but I've recently merged a LuaJIT-to-BPF compiler. So you just write Lua and the kernel bits get compiled into BPF bytecode and loaded. No C.

See https://github.com/iovisor/bcc/blob/master/src/lua/README.md... or examples https://github.com/iovisor/bcc/blob/master/examples/lua/trac... (this one is for tracing).


Okay, that's really cool. Thanks. It does need more documentation, but this is an excellent step.

It also allows for building higher level APIs atop BCC's low-level lua integration, as opposed to writing yet another DSL.


It shouldn't be that difficult to write some simple compiler to translate a scripty language that looks like awk to BPF-Assembly (or C and stuff that into LLVM). I might look into this stuff for my Bachelor Thesis ;)
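
A toy Python sketch of that idea, to show the shape of the work rather than a real backend: it compiles an awk-like filter expression into instructions for a made-up stack machine and then runs them. The instruction names (`LOAD`, `PUSH`, `CMP_GT`, `AND`, ...) are invented for this example and are not BPF opcodes. Real BPF is register-based, and its programs have to pass the kernel verifier.

```python
import ast

# Comparison operators supported by the toy language.
CMP_OPS = {ast.Gt: "CMP_GT", ast.Lt: "CMP_LT", ast.Eq: "CMP_EQ"}

def compile_filter(source):
    """Compile e.g. 'arg3 > 20000 and arg1 == 0' into toy stack-machine code."""
    code = []

    def emit(node):
        if isinstance(node, ast.Name):
            code.append(("LOAD", node.id))
        elif isinstance(node, ast.Constant) and isinstance(node.value, int):
            code.append(("PUSH", node.value))
        elif (isinstance(node, ast.Compare) and len(node.ops) == 1
              and type(node.ops[0]) in CMP_OPS):
            emit(node.left)
            emit(node.comparators[0])
            code.append((CMP_OPS[type(node.ops[0])], None))
        elif isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And):
            emit(node.values[0])
            for value in node.values[1:]:
                emit(value)
                code.append(("AND", None))
        else:
            raise SyntaxError("unsupported expression: " + ast.dump(node))

    # Borrow Python's parser as the front end; only a tiny subset is accepted.
    emit(ast.parse(source, mode="eval").body)
    return code

def run(code, args):
    """Interpret the toy code against a dict of probe arguments."""
    stack = []
    for op, operand in code:
        if op == "LOAD":
            stack.append(args[operand])
        elif op == "PUSH":
            stack.append(operand)
        elif op == "AND":
            right, left = stack.pop(), stack.pop()
            stack.append(int(bool(left) and bool(right)))
        else:
            right, left = stack.pop(), stack.pop()
            result = {"CMP_GT": left > right, "CMP_LT": left < right,
                      "CMP_EQ": left == right}[op]
            stack.append(int(result))
    return stack.pop()

program = compile_filter("arg3 > 20000 and arg1 == 0")
print(program)
print(run(program, {"arg1": 0, "arg3": 1048576}))
```

The hard parts of a real compiler are not in this sketch: mapping probe arguments to kernel data, register allocation, maps for aggregation, and keeping the verifier happy.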


Yes, people are working on an eBPF backend for systemtap.


Well, that will at least stop systemtap bringing your system to a crashing halt (literally). Although I believe they may have fixed that already.

But what would you expect for software coming out of Red Hat? Their projects are usually a mixed bag at best.


So we're not getting DTrace proper, it seems. Instead something else will stem up from the various linux tracing systems. Maybe this BPF-based one.

It's a shame. One of the nice things about dtrace was that there was a book on it. Good, in-depth documentation on performance tools is hard to find.


Thanks, I wrote the DTrace book with Jim Mauro, and there will be a BPF tracing book as well.

BTW, I wouldn't say "maybe" regarding BPF, as it's integrated in the Linux kernel (unlike most of the other tracers, which are add-ons). Sooner or later everyone who runs Linux is getting it.


I think I bother you about a new tracing book for Linux every time one of your articles is posted, so I'll give my obligatory: We want a new Brendan Gregg tracing book! ;)

Things have been moving so fast it's probably a good thing you didn't. It sounds like 4.9 will slow a lot of that down to a more manageable pace for writing a book, though.


I've been waiting for this since Solaris. Thanks!


> In 2014 I joined the Netflix cloud performance team. Having spent years as a DTrace expert, it might have seemed crazy for me to move to Linux

I thought Netflix was mostly running FreeBSD [1]. Is it only the Open Connect Appliance?

[1]: https://www.freebsdfoundation.org/testimonial/netflix/


When you login to Netflix and browse videos, you're running on the Netflix cloud, which is massive, AWS/EC2, and mostly Ubuntu Linux. When you hit play, you're running on the OCA FreeBSD CDN, which is also a large deployment.


So why didn't they just deploy FreeBSD across the entire server park? That would also give you DTrace again...


I'll put $1 on politics. I mean look, you have an OS that bgregg has had to pour how much effort into to get the observability that FreeBSD already had? And that's just the observability part. Then you have the FreeBSD network stack. To me it's clear based on the work done on Linux it was a political choice.


Yeah but if that's the case, it's really bad. There is no place for politics in computer science or information technology.


I think it's more likely because Netflix uses the JVM -- probably the Oracle JDK -- which is supported on Linux but not on FreeBSD.


That makes even less sense, and reeks even more of irrationality: if they're using JVM, a Solaris based system like SmartOS would be the best choice - Solaris is where Java is developed, after all.

It's like buying a NetApp appliance to run NFS servers, when Solaris is the reference NFS server implementation. Humans do not make any sense with their decisions governed by feelings instead of logic.


Yes, we're using the JVM. No, Solaris or SmartOS would not be the best choice. Would it help if I went into detail as to why?


Yes it would.


I worked on Solaris for over a decade, and for a while it was usually a better choice than Linux, especially due to price/performance (which includes how many instances it takes to run a given workload). It was worth fighting for, and I fought hard. But Linux has now become technically better in just about every way. Out-of-box performance, tuned performance, observability tools, reliability (on patched LTS), scheduling, networking (including TCP feature support), driver support, application support, processor support, debuggers, syscall features, etc. Last I checked, ZFS worked better on Solaris than Linux, but it's an area where Linux has been catching up. I have little hope that Solaris will ever catch up to Linux, and I have even less hope for illumos: Linux now has around 1,000 monthly contributors, whereas illumos has about 15.

In addition to technology advantages, Linux has a community and workforce that's orders of magnitude larger, staff with invested skills (re-education is part of a TCO calculation), companies with invested infrastructure (rewriting automation scripts is also part of TCO), and also much better future employment prospects (a factor that can influence people wanting to work at your company on that OS). Even with my considerable and well-known Solaris expertise, the employment prospects with Solaris are bleak and getting worse every year. With my Linux skills, I can work at awesome companies like Netflix (which I highly recommend), Facebook, Google, SpaceX, etc.

Large technology-focused companies, like Netflix, Facebook, and Google, have the expertise and appetite to make a technology-based OS decision. We have dedicated teams for the OS and kernel with deep expertise. On Netflix's OS team, there are three staff who previously worked at Sun Microsystems and have more Solaris expertise than they do Linux expertise, and I believe you'll find similar people at Facebook and Google as well. And we are choosing Linux.

The choice of an OS includes many factors. If an OS came along that was better, we'd start with a thorough internal investigation, involving microbenchmarks (including an automated suite I wrote), macrobenchmarks (depending on the expected gains), and production testing using canaries. We'd be able to come up with a rough estimate of the cost savings based on price/performance. Most microservices we have run hot in user-level applications (think 99% user time), not the kernel, so it's difficult to find large gains from the OS or kernel. Gains are more likely to come from off-CPU activities, like task scheduling and TCP congestion, and indirect, like NUMA memory placement: all areas where Linux is leading. It would be very difficult to find a large gain by changing the kernel from Linux to something else. Just based on CPU cycles, the target that should have the most attention is Java, not the OS. But let's say that somehow we did find an OS with a significant enough gain: we'd then look at the cost to switch, including retraining staff, rewriting automation software, and how quickly we could find help to resolve issues as they came up. Linux is so widely used that there's a good chance someone else has found an issue, had it fixed in a certain version or documented a workaround.

What's left where Solaris/SmartOS/illumos is better? 1. There's more marketing of the features and people. Linux develops great technologies and has some highly skilled kernel engineers, but I haven't seen any serious effort to market these. Why does Linux need to? And 2. Enterprise support. Large enterprise companies where technology is not their focus (eg, a breakfast cereal company) and who want to outsource these decisions to companies like Oracle and IBM. Oracle still has Solaris enterprise support that I believe is very competitive compared to Linux offerings.

So you've chosen to deploy on Solaris or SmartOS? I don't know why you would, but this is also why I wouldn't rush to criticize your choice: I don't know the process whereby you arrived at that decision, and for all I know it may be the best business decision for your set of requirements.

I'd suggest you give other tech companies the benefit of the doubt for times when you don't actually know why they have decided something. You never know, one day you might want to work at one.


It was Jeff Bonwick's team which proved that the number of engineers or even developers working on a given problem is completely irrelevant: ZFS was developed by a team of, what, five people? Meanwhile, how many people are working on BTRFS? It's nowhere near ZFS.

But, let's chalk that up to an isolated, one-off statistical aberration. From what I understand Adam and Bryan wrote DTrace almost single-handedly, with some help from Mike, and even with all the contributions, you can still count the people who made DTrace a working production tool on the fingers of one hand.

However, let's chalk that up to a one-off, statistical aberration as well. Meanwhile, how many people are working on how many tracing frameworks for Linux?

Next, we have zones, a complete, working, production proven virtualization solution, augmented by KVM, lx, TRITON, Consul, et cetera. One coherent solution. Built upon technology on which I ran production Oracle databases, way back in 2006, powering a very large institution which was making very large amounts of money. By the second. How many engineers did it take to design, architect, and code all that up?

Meanwhile, there are how many competing cloud virtualization solutions based on Linux? And remarkably, except for SmartOS, none are a complete, comprehensive solution: they all lack one thing or another. Not one of those Linux based solutions is paranoid about data integrity or correctness of operation. Those things are not even an afterthought of Linux.

Should I chalk that up to a one-off, statistical aberration, or would you say that there is a pattern here?

The Amiga Intuition library, the foundation on which the GUI is built into the system, was written single-handedly by just one person: RJ Mical. In a couple of days! For almost two decades, it was the reference on how to build a library of GUI primitives with almost unlimited flexibility.

Star Control 2, one of the greatest games in history, was developed by just two guys in the span of three years.

Dave Haynie almost single-handedly developed not one, but an entire series of Commodore computers, the C16, C116, C Plus/4 (Commodore 264). Those are the lessons not only of history, but of our contemporaries, people you used to work with: KVM was ported from the Linux kernel by what, three engineers, and from what I can tell, it runs faster on illumos than it does on Linux where it's developed! Why is that?

You and I apparently drew a completely different set of conclusions: when you wrote "Linux now has around 1,000 monthly contributors, whereas illumos has about 15", you seem to equate the number of people working on a product with that product's capability and quality, whereas I drew the conclusion that the number of people is irrelevant, but what the individuals or individual can do makes all the difference in the world.

Where you are absolutely correct is that the job market for illumos based operating systems is non-existent, at least in the country where I live, and slim elsewhere (I used to work in Silicon Valley and in other parts of the States). That's a fact. But I wouldn't rush to the conclusion that it's because illumos or SmartOS are worse products, because I see no evidence of that. Furthermore, at the end of the day, people still need to run a cloud on something which actually works, and Linux is not it. It doesn't work correctly, when it works at all. Not even after 20 years, billions of dollars and a world wide army of people working on it. What is the alternative? SmartOS.

I read the Netflix tech blog from time to time. And over time, one thing became clear to me: Netflix can do the things it does because they have one single application to scale, but most of the world out there, in the trenches, has more than one application. You write of people with deep knowledge of the kernel and performance: I've been working in this industry for decades, and I've yet to meet anyone like that (they must all either be a secret society, or I'm just way too paranoid, but I do know a lot of IT professionals). So perhaps it's a living in an enclave problem, or perhaps both you and I work in enclaves, only different ones? I'm the only person I know in IT that has done or has any interest in kernel, system engineering or performance; I must either be incredibly bad at picking companies to work for, or people you mentioned are really few and far between, or a third possibility is that it's a fluke coincidence?

Let me tell you about my world: I work on and with Linux professionally. Where Netflix has only one major application (according to their tech blog) to worry about, I work at a place where we literally have several thousands of applications, some bought, some developed in-house; for just about every problem, we have an average of five applications, all different, but basically doing the same thing; and some of our applications are so exotic, so complex, and so custom, that it is impossible to find anyone on the market with any experience in them. Thousands.

So while you might be picturing this in your head, imagine running Linux, and suddenly your database keels over: Linux didn't fail over to the other path, so multipathing doesn't work right. Then imagine having systems with data corruption, but Linux can't fix it, because ZFS isn't supported by Red Hat, which we run, so there goes that - another outage (we have regulators and governments to worry about, so the company is reluctant to start hacking their own custom kernel and a ZFS-based Linux). Next, Linux suddenly has an outage because the NFS mount is flapping. Why is it flapping? Because Linux's NFS implementation doesn't play well with NetApp. Now imagine stuff like this happening on a scale of 72,000 systems, spread across the planet. I never had such problems with Solaris. Not once.

But, since that's anecdotal evidence and experience, we have to discount that as well.

Then, I have hardware (from one of Oracle's competitors), very, very expensive, Intel-based 80-CPU Xeon monsters, with 0.5 TB of memory per system, where the serial console hangs at random: Red Hat points the finger at the hardware manufacturer, hardware manufacturer points the finger at Red Hat. Result: console is still hanging at random, with both companies telling us they have no clue what the problem is. That's Linux for you.

Serial console always worked just fine on illumos. After all, it's basic functionality.

Then there's the issue of Linux not getting shut down properly: you'd think that after 20 years of development and, as you correctly noted, a world wide army of developers and billions of dollars in investments, the shutdown procedure wouldn't try to write to an already unmounted filesystem; it's basic functionality, after all; but even that is too much to expect, apparently (I can dig out the Red Hat bug if you're interested).

That last one, we cannot chalk up to a fluke, and even worse, sgi's XFS was the only one which actually detected that write and panicked the kernel - ext3 was oblivious to this data corruption. It's mighty difficult for me to engineer highly reliable services on such a substrate... but let's not dwell on that too much right now. It's too depressing.

Then there is tracing: you know there are several frameworks at play. Then there is also lack of proper DWARF2 support (I researched the subject, and found out that the "solution" was to replace my run time linker!) Can you imagine something like that being a solution on an illumos based system? I think everybody would commit collective suicide or quit altogether like Keith Wesolowski did before casually suggesting such a thing, but let's not dwell on that either. (At this point, I think it fair to sue for pardon if I don't want my operating system made by people who think nothing of casually replacing the run time linker only to get DWARF 2 debugging support. Do you agree?)

Then there's this issue of startup: while SMF has been humming along for more than a decade, Linux is still trying to figure out some sort of a complete working solution: currently that's systemd, and based on how it's architected, it looks like Windows and Linux are finally converging. Meanwhile, to make a startup which is sort of reminiscent of the working SMF, systemd has several different configuration states for its services... and no fault management architecture to speak or write of.

One thing's for sure: your and my experiences are radically different. You shocked me to the core, but I also understand your thinking and motives for leaving illumos behind better, and it's the kind of appreciation I'm unable to put in words. You are also much more flexible: after having seen just how convoluted, complex, slow, and resource wasting Java is, I would never go work at another company which used it (the place where I work now, Java is the language and the platform). I'd just quit the industry like Keith did.

In spite of all of this, if you let me know how to reach you, I'll provide you with enough information on how to get in touch: I'd still love to have you over if you're in the country, and cook you dinner.


You've just discounted quite a lot of what I said as "no evidence", and have made some incorrect assumptions about both development at Sun and Netflix. Along with your other comments, at this point it's clear you are bashing on Linux, Netflix, and me personally, and you still haven't revealed your real name.

I'd like to know what your real name is. If you really cannot post it here, then feel free to contact me at bgregg@netflix.com.


I am bashing on Linux, absolutely; that massive bleeding wound is very raw and painful. I have no reason to bash on Netflix; I merely pointed out that, in my view, Netflix's problem domain is very narrow, and a luxury: most IT departments don't have only one (however massive) application to worry about.

As for you personally, I have nothing but highest respect for you. You are one of the reasons why I still haven't quit this industry. In fact, I still cannot believe I've actually communicated with Brendan Gregg. To me personally, you're a living legend. If I believed in personal heroes, you'd be one of them.


If so, I would be very interested to see what, if anything at all, moving from JVM to BEAM in Erlang/Elixir/Phoenix would do for them. In a never going to happen but interesting to see on this scale kind of way.


Really rather unfortunate that big enterprise platforms such as banks and so forth are so far behind on their kernel version that it will be approximately 7-8 years before they will have this capability, unless Red Hat backports it, of course.


On the other hand, I'm glad the banks who handle my money don't upgrade to the latest and greatest software without taking very, very stringent precautions to make sure everything will work.


In my experience that's not the case - it's more like "It works, no-one touch it! We're spending our money on more visible things" (several years later): "What's that, it's no longer going to be supported? Damn, now we have to upgrade"


Linux is not my favorite operating system, but it seems like we're stuck with it. I'm very happy for all these improvements. Once you got used to a system with a quality and functional tracer, Linux was hard to get back to. But Linux tracing is getting better and better now. I am very satisfied.


Linux is not my favorite operating system, but it seems like we're stuck with it.

It only seems that way. We're never stuck with something as long as we don't accept it. One other factor is at play which works against Linux, and that is that people in IT like shiny new things, and therefore something else always comes along. Hopefully this time around, that something else will be the old new thing (learning from the past, and re-discovery). One way or the other, the clock is ticking on Linux, and one of these days, it won't be as popular any more, because something else will be the new-new thing. It's the nature of this industry: change is the only constant.

You don't have to accept anything. Don't bow to peer pressure.


So how does this relate to uprobes? I've been looking into that lately because I want frequency counts (or coverage analysis) of user space programs but without the nop-sled overhead of xray. Does dtrace supplement or replace uprobes? Or am I really just confused?


DTrace is a Solaris (and BSD/OSX) tracing tool that never quite made it to Linux (there are some attempted ports, but none of them really caught on). BPF (with frontends like BCC) gives you the same sort of functionality in Linux.

BPF can take advantage of uprobes by attaching programs that run when they fire, so it builds on uprobes and does not replace them.
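
For the frequency-count case asked about above, here is a rough sketch of what that could look like with BCC's Python bindings. It assumes bcc is installed and that you run it as root; the names `BPF_PROGRAM`, `on_entry`, `top_counts` and `count_calls` are made up for illustration, and I haven't tuned this for overhead on hot functions.

```python
import time

# BPF C program (illustrative sketch): count entries into the probed
# function, keyed by process ID. The counting happens in the kernel.
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);   /* pid -> number of calls seen */

int on_entry(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""


def top_counts(counts, n=10):
    """Return the n (pid, calls) pairs with the most calls, busiest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


def count_calls(binary, symbol, seconds=5):
    """Attach a uprobe to `symbol` in `binary` and count calls per PID.

    Needs root and the bcc Python bindings, so bcc is imported lazily.
    """
    from bcc import BPF

    b = BPF(text=BPF_PROGRAM)
    # The uprobe does the instrumentation; the BPF program runs when it fires.
    b.attach_uprobe(name=binary, sym=symbol, fn_name="on_entry")
    time.sleep(seconds)
    # Read the aggregated map back into a plain dict of pid -> calls.
    return {k.value: v.value for k, v in b["counts"].items()}
```

Something like `top_counts(count_calls("/path/to/your/binary", "your_function"))` would then print the busiest PIDs, where both arguments are placeholders for your own program and symbol.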



