BSD kqueue is a mountain of technical debt (2021)

klabb3 · 2024-12-29T16:00:52 1735488052

While I don’t doubt that the author has good reason for their opinion, the reason is handwaved here, no? Where is the tech debt? In the kernel only? Does it affect user space? For who is it more or less composable?

I personally think ”async IO” (a term I don’t love, but anyway), should be agreed upon and settled across OSs. Not necessarily exact syscall equivalence, but the major design choices needs to be the same. For instance, who owns the buffers (kernel or user), and should it be completion or ready-based event systems?

The reason I think this is so important to agree upon, is that language runtimes (and other custom runtimes like libuv) need to be simplified in order to allow for healthy growth. Any language that doesn’t have cross-platform support for ”async IO” is completely nerfed, increasing the barrier of entry for things that otherwise would be ”graduate level” projects, into deep OS/kernel knowledge that is hard to find outside FAANG and a few other spaces.

bluetomcat · 2024-12-29T17:44:52 1735494292

The claim is silly and unfounded. If anything, kqueue is the interface that pollutes userspace less. You have a single kevent call that is used both for waiting and registering events represented by (filter, ident) tuples. All of the data related to the event is contained in a struct that’s passed between the kernel and the user. For non-fd events (EVFILT_PROC, EVFILT_TIMER, EVFILT_SIGNAL), it is much more straightforward to use compared to the Linux way where you need to keep track of one more resource, read specific binary data from it, have specific flags, etc.

somat · 2024-12-29T22:52:50 1735512770

The article reads more like pr damage control than an actual complaint. epoll has a reputation as a fragile, hard to correctly use interface. and kqueue as relatively simple and sane. So the question always is. why didn't linux just adopt the already existing kqueue instead?

Sometimes it seems that linux does end up with more than it's fair share of technical excellence coupled to a bad interface. iptables, epoll and git come to mind.

adrian_b · 2024-12-29T23:29:17 1735514957

This other article has been discussed in the past on HN:

https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-...

https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-...

Because it discusses epoll in much more detail, it is far more convincing than the parent article of this thread.

The conclusion of that article is that how to use correctly epoll is not at all obvious and it has some pitfalls that are not easy to avoid. Therefore epoll seems to be more affected by a serious technical debt, i.e. by a less than good design of the original API.

badgersnake · 2024-12-29T17:59:45 1735495185

If you come from the starting point that Linux is correct and everything different is wrong then you get blog posts like this.

throw16180339 · 2024-12-29T19:11:02 1735499462

The author thinks that musl, Alpine, and Linux are right and everything else is wrong.

* https://ariadne.space/2022/03/27/the-tragedy-of-gethostbynam... - musl's DNS resolver doesn't support TCP lookups, but it's your fault for expecting the standard DNS lookup functions to work. You should change your program to use a third-party library for DNS lookups. This post is a couple years old; musl now supports TCP lookups.

* https://ariadne.space/2021/06/25/understanding-thread-stack-... - Alpine has a stack size that's substantially smaller (128k) than other widely used OSes (512k+). Your program may work everywhere else, but you're wrong to assume a reasonable stack size. Here's how to jump through a hoop and make it work on Alpine.

blueflow · 2024-12-29T19:21:12 1735500072

Same person that both contributed the code of conduct to Alpine and later got caught bragging on twitter about having bullied people out of the project.

The work of someone who did not ask themselves the necessary questions, but now its some years later and things have changed.

a_t48 · 2024-12-29T22:18:24 1735510704

The second article is not saying that Alpine is right, only that it behaves differently. "Reasonable" stack size is pretty subjective - many workflows won't care about stack allocating megabytes at a time and could save RAM from having a smaller stack size. The article is pretty informative with workarounds for this situation. There's no need to attack the author about this, especially from a throwaway.

johnbellone · 2024-12-29T20:07:15 1735502835

Sounds like an asshole.

Aurornis · 2024-12-29T17:47:56 1735494476

> While I don’t doubt that the author has good reason for their opinion, the reason is handwaved here, no?

I came to the comments hoping to get more explanation. I was waiting for the full explanation of the technical debt but then the article just came to an abrupt end.

I wonder if the article started as a short history and explanation of differences, but then the dramatic headline was chosen to drive more clicks? The article could have been interesting by itself without setting up the “mountain of technical debt” claim in the headline.

mst · 2024-12-29T19:19:20 1735499960

I know ariadne well enough that I'm pretty confident that the headline will have been chosen out of annoyance rather than click maximisation.

That doesn't mean I agree with the conclusion (I am ambivalent and would have to think rather more before having an opinion myself) but I'm reasonably sure of the motivation nonetheless.

BobbyTables2 · 2024-12-29T17:07:24 1735492044

I fully agree.

Synchronous I/O looks appealing for beginners and simple projects. It seems like the simplest path for things where I/O isn’t the primary focus.

Then one suddenly finds themselves juggling ephemeral threads to “background” tasks or such. The application quickly becomes a multi-threaded monster and still suffers responsiveness issues from synchronous use.

The Tower of Babel created from third party runtimes further adds to the mess even for those who try to do better.

worik · 2024-12-29T18:37:53 1735497473

> Synchronous I/O looks appealing for beginners and simple projects.

Yes. And.

Synchronous I/O is suitable for almost all projects

When it is unsuitable it is catastrophicly unsuitable,

astrange · 2024-12-29T19:21:26 1735500086

Synchronous I/O is actually better than async for many programs, partly because almost all async systems were designed without thinking about priority inversions and can't resolve them properly. If you have futures, you have this problem.

p_ing · 2024-12-30T13:25:48 1735565148

Since the NT kernel is async throughout, it solves this with periodic priority boosts for non-running but runnable threads.

I swear, the kernel designers just need to look at NT :-P But we're well beyond a complete rewrite of old kernels, sadly.

astrange · 2024-12-31T14:19:04 1735654744

That can work but I think it's cheating. Especially if your scheduling is tied into power management or you have high/low power cores, then boosting the low priority threads will waste power.

astrange · 2024-12-30T09:06:40 1735549600

Forgot to say, the correctly designed ones are called "structured concurrency".

liontwist · 2024-12-29T19:47:15 1735501635

The entire unix operating system is designed to abstract away async problems and turn them sync so you can write sane code (if, else, for).

In my experience simple blocking code is incredibly misunderstood. Most programmers stop their thinking at “blocking is slow and non blocking is fast!”

mschuster91 · 2024-12-29T19:56:05 1735502165

> The entire unix operating system is designed to abstract away async problems and turn them sync so you can write sane code (if, else, for).

That only works for workloads that can be independently parallelized. For stuff that cannot be - which is the majority of consumer workload - you're in for a world of pain with locking and whatnot other issues.

liontwist · 2024-12-29T21:04:16 1735506256

No. It’s almost all the time that your computer has more than one thing to do;

- your browser uses a process per tab - your web server responds to multiple requests in parallel - your web app is going to launch other programs - your desktop ui is going to have a compositing server separate from each application - your compiler is going to work on multiple source files

mzimbres · 2024-12-30T08:27:19 1735547239

> In my experience simple blocking code is incredibly misunderstood.

I don't think so. You can't even implement a full-duplex protocol properly with sync IO since reading and writing can't happen at the same time on the same thread. Splitting them to different threads also won't help since accessing shared state requires synchronization with e.g. mutex that kills again simultaneous reading and writing.

liontwist · 2024-12-30T17:00:02 1735578002

How do you propose to implement duplex IO? And what’s the example case?

With threads the reading and writing can happen simultaneously. The synchronization is only on whatever data you need to share.

Note that if reading and writing are dependent, then single thread actually makes sense because you can’t do one without the other.

Otherwise the pattern of logic you will execute for reading or writing is completely different, which is why threading is the perfect abstraction.

samsquire · 2024-12-30T22:14:16 1735596856

I implemented split IO with io_uring, epoll and eventfd and my own thread safe comm pattern

You can write to an eventfd to wake up io_uring listening on an epoll instance.

jandrewrogers · 2024-12-29T17:18:32 1735492712

The differences in design reflect differences in use case assumptions, which in turn reflects that different OS tend to be used for different workloads.

Broadly speaking it is difficult to use async I/O well without understanding the implementation details. It is a sharp tool for experts, not magic “go faster” juice.

vacuity · 2024-12-30T02:59:45 1735527585

Which is why parent is calling for standardization, so that people don't usually need to be experts. I reject that there are a diverse range of async abstractions that are all justified such that a given OS only supports some well. More likely, that means the designs are poorly done. Although I can't say anything if it's for backwards compatibility, how I loathe and crave it.

jandrewrogers · 2024-12-30T05:31:00 1735536660

The bigger issue is that no OS was designed for expansive and flexible async functionality. The implementations reflect practical limitations of remaining compatible with the rest of the guarantees the kernel makes. I would love a portable standard, I’ve been doing async for a long time and I would benefit greatly, but I also understand why it hasn’t happened.

A good standard would likely require every OS to materially redesign its internals in a way that would have backward compatibility issues for the other users of the system. I am not against the idea since I would clearly benefit personally, but I understand why the tradeoffs make it unlikely to happen.

Even going from io_submit to io_uring just on Linux changes the way you need to architect your software, aside from API changes, and that is the same kernel. It deeply touches the behavior of the system.

p_ing · 2024-12-30T13:19:57 1735564797

> The bigger issue is that no OS was designed for expansive and flexible async functionality.

The NT kernel was designed in this very way. All I/O in the NT kernel is async. It was one of the design principles.

a-dub · 2024-12-29T17:48:03 1735494483

a good computer system would allow the application programmer to write their code in a synchronous and single threaded manner and all performance optimizations with respect to multiprocessing (and associated resource sharing) would be handled automatically.

cempaka · 2024-12-29T18:43:17 1735497797

I'm not sure providing that level of complex logic and scheduling is really a proper task for OS kernels. It's essentially what you get from green threads / coroutines which are provided by numerous runtimes and frameworks, but those have to be translated to the various different async IO facilities in order to be portable between OSes.

klabb3 · 2024-12-30T03:34:48 1735529688

Disagree. Scheduling is exactly the type of thing the OS should be doing. It’s (a) extremely easy to get wrong and cause starvation and other problems (b) very complex and large in general and most importantly (c) the OS is the only component with birds-eye view across all processes and hardware (like Numa cores). Every major OS is already doing it, but not for async IO. So it’s absolutely reasonable to maintain this responsibility during the transition to async IO.

This goes back to my initial point. We should not need a team of PhDs and kernel hackers to do hello world. The old sync syscalls have largely uniformity across platforms, but the async ones don’t. This means, in practice, that most applications, runtimes and thus users, don’t get to enjoy the benefits.

vacuity · 2024-12-30T04:11:04 1735531864

Counterpoint: the OS should indeed do its job of scheduling, but applications sometimes know better (the ones specifically implemented by programmers who know better). The OS and application must cooperate to do big jobs like scheduling. See Linux's clunky NUMA lack-of-awareness and transparent huge pages blunder for examples of how sole OS control is lacking. This is not fundamentally a Linux implementation issue: the OS does not always know best. The other extreme, giving applications control, is obviously wrong too, like cooperative scheduling being infeasible for general purpose systems. The OS should have strong defaults but also empower applications (or rather, their programmers) to try their utmost if they want. The OS is not the end all, be all. It is the platform on top of which applications gain valuable power, and constraints. That is basically by definition.

klabb3 · 2024-12-30T22:07:07 1735596427

Ack.

> but applications sometimes know better (the ones specifically implemented by programmers who know better)

Sure, and I don’t mind it. But right now to get basic modern performance you need that knowledge. That’s my objection. I just want more projects to be able to benefit.

> See Linux's clunky NUMA lack-of-awareness and transparent huge pages blunder for examples of how sole OS control is lacking. This is not fundamentally a Linux implementation issue: the OS does not always know best.

I’m not familiar. But, I have a question. Do you think it’s possible to create a middle ground, where applications can specify a small (ideally simple) params in order to allow the OS to make meaningfully better scheduling decisions (as opposed to relinquishing responsibility entirely)?

vacuity · 2024-12-30T22:51:41 1735599101

The biggest roadblock is, as always, backwards compatibility. The second biggest is getting people to agree to standardization and documentation, solidifying the expected base of knowledge. All in all, I don't have much hope for mainstream OSes to get these improvements. But in theory it could be done. For certain workloads, applications could even just make minor adjustments like your suggestion of scheduling policy hints for great gains. This is especially about being more sympathethic about context switches, which can embody better core/NUMA socket locality, better awareness for interactivity or batch processing, better performance/efficiency core allocation, etc.. So in general I would say we could get quite far without much application programmer or user intervention.

vacuity · 2024-12-30T02:52:32 1735527152

This is wishful thinking. It would be nice, I agree, but. Even "guaranteed lower bound of optimization" for general purpose programs is off the table. Our tools for programming should be such that easy problems are easy to solve, and hard problems are possible to solve. There will always be hard problems. Single-threaded, synchronous code is not congruent with actual hardware constraints, and this has been the case for decades. And to the people advocating for async/await or green threads or similar, they are necessary, but even green threads don't compare to the ergonomics of single-threaded and synchronous code.

liontwist · 2024-12-29T19:49:55 1735501795

This is almost what we have in UNIX. The resource sharing and multithreading happens automatically if you start shelling out to programs to do work, or have a simple multi process architecture like NGINX, Postgres, and web app

jiggawatts · 2024-12-30T00:19:58 1735517998

The new Java runtime does this using lightweight fibres.

lukeh · 2024-12-29T23:23:51 1735514631

I mean, Swift or Rust async/await with poll or io_uring gets you there. I use Swift structured concurrency with io_uring and love it.

WhyNotHugo · 2024-12-30T03:54:23 1735530863

The only thing that's agreed upon and settled across OSs, is that reading from a regular file is always blocking, even with O_NONBLOCK.

This means that you neither epoll nor kqueue nor poll work if you ever need to read from a file.

I wish the BSDs and Linux and other platforms would align on non-blocking IO semantics. I just hope it's something new which can handle non-blocking IO with files too.

p_ing · 2024-12-31T12:29:40 1735648180

This is not the case on NT. The OVERLAPPED data structure allows for async file operations.

https://learn.microsoft.com/en-us/windows/win32/fileio/synch...

Joker_vD · 2024-12-29T15:55:19 1735487719

So the argument is that extending kqueue's interface to handle more and more event is worse than turning more and more events into xxxfd subsystems. Why is that worse, again? Like, not in "it's not a composable design" abstract sense but in concrete "you can't do this and that, which are useful things" sense?

Someone · 2024-12-29T16:58:07 1735491487

I would think it’s the other way around. If you have separate subsystems for events of types A, B, C, etc. there’s no easy way you can have a process wait for “an event of type A or C, whichever comes first”.

There also is the thing that, sometimes, different types of event are identical types, lower down in the stack. A simple example are keyboard and mouse events generated from USB devices.

Why would the low-level USB stack have to do a switch on device type to figure out what subsystem to post an event to?

IshKebab · 2024-12-29T18:45:17 1735497917

Because "wait for an event" is a good abstraction for many event types. The epoll interface is a generic wait interface whereas the kqueue one is a hardcoded list of things you can wait for.

Think about async programming you have a single `await` keyword that can await lots of different things.

toast0 · 2024-12-29T19:17:10 1735499830

How many event types do you need though? And would you actually need to do less work to add support for a new one to a user program? It would be different work, for sure, but I don't think it would be meaningfully less work.

jsnell · 2024-12-29T16:24:01 1735489441

Yeah, the article really doesn't make a case as much as assert it.

One way to think about this is whether all these non-file fds are useful for a variety of generic operations, or whether they're only used for epoll and one-off system calls or ioctls specific to that type of fd. If it's the latter, it seems hard to believe that there actually is some kind of composability advantage.

So, what can you do with them?

1. You can use these fds with poll()/select(), not just epoll. That's not a big deal, since nobody really should be using either of them. But it does also mean that you should ideally be able to use them with any future event notification syscalls as well. And the invention of new mechanisms is not hypothetical, since Linux added io_uring. I'd be curious to know how well io_uring supports arbitrary fd types, and whether they got that support "for free" or if io_uring needs to be extended for every new fd type.

2. You can typically read() structured data from those fds describing the specific event. With kqueue all that data would need to be passed through struct kevent, which needs to be fixed size and generic enough to be shared between all the different event types.

3. You can pass individual fds between processes, either via UDS or fork(). I expect you would not be able to do that for individual kqueue filters.

4. You can close() the fd to release the underlying kernel resource. This doesn't seem interesting, just listing it for completeness.

So there's probably enough smoke there that one could at least argue the case. It's too bad the author didn't.

stevefolta · 2024-12-29T18:47:10 1735498030

> ...nobody really should be using either of them.

What's wrong with poll(), at least for smaller values of nfds? And what should one use instead when writing POSIX-compatible portable code, where epoll() and kqueue() don't exist?

somat · 2024-12-29T20:02:19 1735502539

Fun fact, on openbsd(at least, I don't know about freebsd) select and poll were reworked to use kqueue internally.

The thread is an interesting read as it sounds like the naive approach has negative performance implications, knowing openbsd I suspect the main motivation was to reduce maintenance burden, that is, make it easier to reason about the internals of event driven mechanisms by only having one such system to worry about.

https://marc.info/?l=openbsd-tech&m=162444132017561&w=2

ajross · 2024-12-29T16:53:59 1735491239

Do you have a third party driver that exposes a file descriptor for some oddball purpose? epoll can wait on that.

It's absolutely true that within the bounds of what the OS supports, both interfaces are complete. But that's sort of a specious point. One requires a giant tree of extra types be amended every time something changes, while the other exploits a nearly-five-decade old abstraction that already encapsulates the idea of "something on which you might want to wait".

markjdb · 2024-12-29T18:46:53 1735498013

The same is true of kqueue/kevent though... the driver just needs to decide which filters it wants to implement. There's no need to extend kqueue when adding some custom driver or fd type. One just needs to define some semantics for the existing filters.

ajross · 2024-12-29T20:17:41 1735503461

> the driver just needs to decide

That's pretty much the definition of technical debt though. "This interface works fine for you if you take steps to handle it specially according to its requirements". It makes kqueue into a point of friction for everything in the system wanting to provide a file descriptor-like API.

markjdb · 2024-12-30T01:05:15 1735520715

Well, no, it's "this interface works fine for you if you implement it."

The kernel doesn't magically know whether your device file has data available to read, your device file has to define what that means. That's all I'm referring to. Hooking that up to kqueue usually involves writing 5-10 lines of code.

netbsdusers · 2024-12-29T22:40:51 1735512051

No it isn't. Letting files be poll/select/epoll'd isn't free either. They don't get support for that by magic. A poll operation has to be coded, and this is just as a much a "point of friction" then as supporting kqueue. (It bears mentioning as well that on at least DragonFly BSD and OpenBSD, they have reimplemented poll()/select() to use kqueue's knotes mechanism, and so now you only have to write a VOP_KQFILTER implementation and not a VOP_POLL too.)

ajross · 2024-12-30T13:31:58 1735565518

> Letting files be poll/select/epoll'd isn't free either.

Yes, but those slashes are showing the lie in the statement. Letting files be polled/selected isn't "free", but it's standard. The poll() method has been in struct file_operations for literally decades[1]. Adding "epoll support" requires no meaningful changes to the API, for any device that ever supported select().

That kind of evolutionary flexibility (the opposite of "technical debt") is generally regarded as good design. And it's something that epoll had designed in and something that queue lacks, having decided to go its own way. And it's not unreasonable to call that out, IMHO.

[1] It's present in commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), which is the very first git commit. I know people maintain archives of older trees, but I'm too lazy to dig. Suffice it to say that epoll relies on an interface that is likely older than many of the driver developers using it.

markjdb · 2024-12-30T14:35:56 1735569356

The article clearly isn't talking about technical debt within the kernel implementations of epoll and kqueue, and if one wanted, it'd be easy to define fallback EVFILT_READ/WRITE filters using a device's poll implementation.

I don't really understand what argument you're making. Is io_uring also a bad design because it requires new file_operations?

ajross · 2024-12-30T15:00:13 1735570813

> if one wanted, it'd be easy to

Which, again, is a statement that gets to the root of the idea of "technical debt". You can excuse almost anything like that. It still doesn't make it better than a design that works by default. I remain shocked that this seems to be controversial.

FWIW: io_uring has been very loudly criticized for being hard to implement, maintain and use, via some of this same logic, yes. This isn't a senseless platform flame. Linux does bad stuff too. There are good designs and bad designs everywhere, and io_uring is probably not one (though to be fair it does have some extremely attractive performance characteristics, so I guess one might be tempted to forgive a few warts in the interface layers).

markjdb · 2024-12-30T16:12:17 1735575137

A design that works by default isn't automatically better either though. You have to look at the details.

> I guess one might be tempted to forgive a few warts in the interface layers

... well, yeah, that's exactly my sentiment about kqueue here. What you're talking about is basically a small wart that no one's bothered to address because it's inconsequential.

ori_b · 2024-12-29T19:45:10 1735501510

Counterpoint -- epoll is fundamentally broken: https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-...

You can end up with events on file descriptors that you don't even have open -- they leak across process boundaries. And that means that if the file descriptor gets reused, you can end up with events on the wrong file descriptor.

zzo38computer · 2024-12-29T20:10:10 1735503010

From looking at the man page, it look like epoll does not return the file descriptor of the event; it returns a union containing user-defined data, although one of the union's fields is called "fd" because it is presumably intended to be used as the file descriptor.

However, this is still subject to the problems you mention, as well as that you presumably can no longer control events for a file descriptor if you do longer have it, so it is still seems a problem.

Putting the file descriptor in the "struct epoll_event" instead of "epoll_data_t" would have avoided the problem of events on the wrong file descriptor, but that still might not be good enough. (It could specify -1 as the file descriptor for events of file descriptors that you do not have access to.)

Some of this is just the problem with POSIX in general. I did consider such problems (of file descriptors and of event handling) in my own ideas of operating system design, which uses capabilities. I would want to avoid the mess that other systems is doing, too. A capability in this case is fundamentally a different data type than numbers (and IPC treats them differently), although I am not sure yet how to handle this in a suitable way within a process (tagged memory might do, although not all computers use tagged memory). (Note that most I/O is done using IPC and not by system calls; the number of system calls is limited and is usually only used for managing the capabilities and event handling. This would also improve security as well as other things.)

ori_b · 2024-12-30T00:52:53 1735519973

> From looking at the man page, it look like epoll does not return the file descriptor of the event; it returns a union containing user-defined data, although one of the union's fields is called "fd" because it is presumably intended to be used as the file descriptor.

Yes -- so how do you unregister an event after you close the file, or dup the file descriptor? You have no way to do that, you just keep getting events for a resource you no longer have open, so long as the parent or child still has it open!

It's incredible that you can set things up in a way that you get events on a resource you can no longer refer to in any way.

zzo38computer · 2024-12-30T03:33:34 1735529614

> so how do you unregister an event after you close the file ... You have no way to do that, you just keep getting events for a resource you no longer have open ... It's incredible that you can set things up in a way that you get events on a resource you can no longer refer to in any way.

Yes, I thought that too, it does not make much sense to me either. (I would suppose that you could close the epoll file descriptor, but then that would cancel all events, and not only that one.)

vacuity · 2024-12-30T17:48:31 1735580911

It's you again! Hi. We talked about capabilities a while back.

> I am not sure yet how to handle this in a suitable way within a process

If this is about implementing capabilities, I think partitioned capabilities should be the default.

> Some of this is just the problem with POSIX in general. I did consider such problems (of file descriptors and of event handling)

Yes, I think the kernel is trying to do too much. The more micro/exokernel it is, the better, IMO. Doesn't reduce (essential) complexity, but gives programmers the flexibility to tackle it how they want.

I'm also curious how you're thinking of doing event handling in general, like D-Bus or something. I think IPC is best left as a point-to-point bare bones communication channel, but even then it's pretty complex as the central load-bearing construct. For events, I expect there would be a lot of shared memory usage. It would use centralized services and/or userspace-defined capabilities to restrict who can receive certain events. I'm not too concerned since it's more of a userspace concern, unlike IPC.

zzo38computer · 2024-12-30T21:54:53 1735595693

> If this is about implementing capabilities, I think partitioned capabilities should be the default.

I am not entirely sure, but probably.

> I'm also curious how you're thinking of doing event handling in general, like D-Bus or something. I think IPC is best left as a point-to-point bare bones communication channel, but even then it's pretty complex as the central load-bearing construct.

I dislike D-Bus. My idea does not use any kind of shared message bus.

IPC would be done as messages; any process that has a reference to a capability can send messages to that capability and can request to receive messages from that capability; so these received messages can be events. The message can contain any bytes and also capabilities. The system calls would be used to request and send such events, with parameters for blocking/non-blocking, for multiple objects at once, and for atomic wait-and-send or wait-and-receive (in order to avoid some types of race conditions).

> For events, I expect there would be a lot of shared memory usage.

I had also thought of shared memory, although my intention is to allow network transparency and proxy capabilities (although network transparency would be implemented by using proxy capabilities), so I had thought to not use shared memory.

However, shared memory may be useful, but there may be ways to allow it to work transparently without otherwise affecting the protocol, e.g. with read-only mapping and copy-on-write mapping, or for a mapping to only be accessible by receiving events. A pass-through function would also be possible, to make some proxies more efficient. These features are essentially optimizations which can be optional to implement, so if you implement a proxy that does not use them, the programs will still work even if they are unaware that it uses a proxy that is unable to share memory.

There is then also other considerations such as audio/video/input synchronization; if you display a movie (or a game, which will involve input as well) then the audio/video would be synchronized, even if one or both are being redirected (e.g. you might redirect the audio output to a EQ filter or to a remote computer, or you might redirect both together to a remote computer or recorder, or to a program that expects input from a camera).

> It would use centralized services and/or userspace-defined capabilities to restrict who can receive certain events. I'm not too concerned since it's more of a userspace concern, unlike IPC.

Who can receive certain events would be a feature of the userspace-defined capabilities. Some services can be centralized, that many programs will be capable of using, either directly or through proxies; these proxies would be used for handling security and many other features.

Some of my ideas are similar (but different in many ways) like some other designs, including a few things listed in http://www.divergent-desktop.org/blog/2020/08/10/principles-... Proxy capabilities, and the Command, Automation, and Query Language, and the common data format used for most data, and other features of the system that I intended to have, will be able to help with some of the things listed there, as well as other benefits. (My ideas are actually mostly independent of that and other documents, but some of them end up being similar to those, and sometimes my ideas can then be refined when I learn more from such documents, too.)

ryao · 2024-12-29T19:35:48 1735500948

I don’t see how adding a new syscall to monitor something new via epoll is more desirable than adding a new BSD event filter. Either way, you are modifying the syscall interface. The initial epoll_create() had to be revised with epoll_create1(). It is even cited on slide 53 as being an example of a past design failure:

https://www.man7.org/conf/lpc2014/designing_linux_kernel_API...

Interestingly, the BSD kqueue made the same mistake and had to be revised for the same reason, which is why kqueue1() was made. However, by the points on the technical checklist there for syscall design, kqueue is an excellent design, minus the one mistake of not supporting O_CLOEXEC at the beginning.

marcodiego · 2024-12-29T16:01:59 1735488119

I remember that when io_uring was in it's early stages several pointed to kqueue (also nt'wait_for_multiple_something or iocp and solaris event ports). By coming later, I think io_uring was able to better fit modern reality and also avoid problems from previous implementations.

Hope security issues with it are solved and usage becomes mostly transparent through userspace libs, it looks like a high performance strategy for our current computing hardware.

petesergeant · 2024-12-29T16:07:38 1735488458

> I think io_uring was able to better fit modern reality

Genuinely would be fascinated to understand what that means

muststopmyths · 2024-12-29T17:00:52 1735491652

I haven't looked into io_uring except superficially, but Windows implemented Registered I/O in Windows 8 circa 2011. this is the basically the same programming paradigm used in io_uring, except it is sockets-only.

Talk here [1] speaks to the modern reality of 14 years ago :-).

Since kqueue seems very similar to IOCP in paradigm, I guess some of the overheads are similar and hence a ring-buffer-based I/O system would be more performant.

It's worth noting that NVME storage also seems to use a similar I/O pattern as RIO, so I assume we're "closer to the hardware" in this way.

1. https://learn.microsoft.com/en-us/shows/build-build2011/sac-...

p_ing · 2024-12-29T21:01:29 1735506089

Microsoft implemented I/O rings in an update to Windows 10 with some differences and it is largely a copy-and-paste of io_uring.

It's important to note that the NT kernel was built to leverage async I/O throughout. It was part of the original design documents and not an after-thought.

https://learn.microsoft.com/en-us/windows/win32/api/ioringap...

https://windows-internals.com/i-o-rings-when-one-i-o-operati...

https://windows-internals.com/ioring-vs-io_uring-a-compariso...

https://www.cs.fsu.edu/~zwang/files/cop4610/Fall2016/windows...

muststopmyths · 2024-12-29T21:27:43 1735507663

Winsock Registered I/O or RIO predates io_uring and you could make the argument that io_uring "is largely a copy-paste of RIO" if you wanted to be childish.

The truth is that they both use a obvious pattern of high-performance data transfer that's been around a long time. As I said, NVME devices have been doing that for a while and it's a common paradigm in any DMA-based transfer.

io_uring seems more expansive and hence useful compared to the limited scope of RIO or even the NtIoRing stuff.

p_ing · 2024-12-29T22:42:54 1735512174

> io_uring "is largely a copy-paste of RIO" if you wanted to be childish.

This was unnecessary. I'm not denigrating I/O Rings or io_uring. The NT kernel is more advanced than the Linux/BSD/macOS kernels in certain ways. There should be a back-and-forth copying of the good ideas/implementations.

muststopmyths · 2024-12-30T01:00:52 1735520452

I misunderstood. apologies.

JackSlateur · 2024-12-29T18:44:20 1735497860

io_uring is not limited to networking/sockets

io_uring is not even limited to IO

muststopmyths · 2024-12-29T19:50:40 1735501840

I was talking about RIO

JackSlateur · 2024-12-29T22:21:23 1735510883

And I was talking about the modern reality enabled by iouring. Not really 14yo stuff, to be honest.

nesarkvechnep · 2024-12-29T16:14:16 1735488856

I'm wondering what has fundamentally changed in computing between the creation of kqueue and io_uring so the latter is able to better fit modern reality.

adrian_b · 2024-12-29T19:13:14 1735499594

I have not looked at the io_uring implementation to see if it really has improvements from this point of view, but something that has changed during the quarter of century from the design of kqueue until now is that currently it has become much more important than before to minimize the number of context switches caused by system calls in the programs that desire to reach a high performance for input/output.

The reason is that with faster CPU cores, more cores sharing the memory, bigger CPU core states and relatively slower memory in comparison with the CPU cores, the time wasted by a context switch has become relatively greater in comparison with the time used by a CPU core to do useful work.

So hopefully, implementing some I/O task using io_uring should require less system calls than when using kqueue or epoll. According to the io_uring man page, this should be true.

oasisaimlessly · 2024-12-29T19:52:28 1735501948

In addition to the reasons you listed, context switches have also been significantly slowed down by Meltdown/Spectre speculative execution vulnerability mitigations.

mananaysiempre · 2024-12-29T17:22:25 1735492945

I don’t know which things here are actually relevant to the differences between the two, but of course there have been changes. Core counts are much higher. You can stripe some enterprise SSDs and get bandwidth within an order of magnitude or so of your RAM. Yet clocks aren’t that much higher, and user-supervisor transitions are comparatively much more expensive. There’s a reason Lemire’s talk on simdjson is called “Data engineering at the speed of your disk”.

vacuity · 2024-12-29T16:47:12 1735490832

I don't know that there were relevant changes between the advent of each, so much as kqueue just didn't have existing features in mind. I assume GP is referring to the ring buffer design and/or completion-based processing, as part of the ability to batch syscall processing. This is reminiscent of external work like FlexSC and can be viewed as mechanical sympathy.

Asmod4n · 2024-12-29T20:15:38 1735503338

syscalls which go from user to kernelspace got more expensive after the mitigations for vulnerabilities in intel and amd CPUs. io_uring solved that.

Someone · 2024-12-29T20:34:56 1735504496

> io_uring was able to better fit modern reality and also avoid problems from previous implementations.

Learning from the past can indeed lead to better designs.

> Hope security issues with it are solved

So, if I understand that correctly, it also introduced new problems? If so, do we know those new issues are solvable without inventing yet another API?

If not, is it really better or just having a different set of issues?

zamalek · 2024-12-29T22:53:52 1735512832

Here's a recent vulnerability: https://blog.exodusintel.com/2024/03/27/mind-the-patch-gap-e...

It's just your typical multithreading woes (in an unsafe language). One presumes that these problems will be ironed out eventually. Unfortunately it seems as though various hardened distros turn it off for this reason (source was a HN comment I read a while back).

baggy_trough · 2024-12-29T15:55:27 1735487727

I must have missed where the mountain of technical debt was. The article just says epoll has a more composable design.

vacuity · 2024-12-29T16:51:23 1735491083

To be fair (?), the title seems like classic clickbait. The thesis seems to just be

> While I agree that epoll doesn’t come close, I think that’s actually a feature that has lead to a much more flexible and composable design.

You could read it as "kqueue is more prone to tech debt than epoll if events keep getting added".

nobodyandproud · 2024-12-29T18:31:12 1735497072

Is this correct? How I interpreted the author:

- kqueue : Designed to work off of concrete event filters. New event types mean kqueue must be rengineered support it (at the very least, a new event filter type).

Versus

- epoll : Designed to work off of kernel handles. The actual event type isn’t the concern of epoll and therefore epoll itself remains unaffected if a new event type were introduced.

I’m guessing composibility is mentioned because of this decoupling? I’d think a better explanation would be single-responsibility, but I’m likely not understanding the author correctly here.

markjdb · 2024-12-29T19:17:39 1735499859

I don't think the article does a good job of arguing its premise, which I think is that kqueue is a less general interface than epoll.

When adding a new descriptor type, one can define semantics for existing filters (e.g., EVFILT_READ) as one pleases.

To give an example, FreeBSD has EVFILT_PROCDESC to watch for events on process descriptors, which are basically analogous to pidfds. Right now, using that filter kevent() can tell you that the process referenced by a procdesc has exited. That could have been defined using the EVFILT_READ filter instead of or in addition to adding EVFILT_PROCDESC. There was no specific need to introduce EVFILT_PROCDESC, except that the implementor presumably wanted to leave space to add additional event types, and it seemed cleaner to introduce a new EVFILT_PROCDESC filter. Process descriptors don't implement EVFILT_READ today, but there's no reason they can't.

So if one wants to define a new event type using kevent(), one has the option of adding a new definition (new filter, new note type for an existing filter, etc.), or adding a new type of file descriptor which implements EVFILT_READ and other "standard" filters. kqueue doesn't really constrain you either way.

In FreeBSD, most of the filter types correspond to non-fd-based events. But nothing stops one from adding new fd types for similar purposes. For instance, we have both EVFILT_TIMER (a non-fd event filter) and timerfd (which implements EVFILT_READ and in particular didn't need a new filter). Both are roughly equivalent; the latter behaves more like a regular file descriptor from kqueue's perspective, which might be better, but it'd be nice to see an example illustrating how.

One could argue that the simultaneous existence of timerfds and EVFILT_TIMER is technical debt, but that's not really kqueue's fault. EVFILT_TIMER has been around forever, and timerfd was added to improve Linux compatibility.

So, I think the article is misguided. In particular, the claim that "any time you want kqueue to do something new, you have to add a new type of event filter" is just wrong. I'm not arguing that there isn't technical debt here, but it's not really because of kqueue's design.

nobodyandproud · 2024-12-29T19:37:44 1735501064

Thanks.

Then it seems like there are more similarities than differences here: Both solve the same problem of “select” by being the central (kernel-level) events queue; though with different APIs.

The other bit that caught my eye was the author saying epoll can do nearly everything kqueue can do.

What is that slight bit that epoll can’t do?

markjdb · 2024-12-29T20:08:59 1735502939

I'm not sure. Maybe it's "wait for events that aren't tied to an fd."

For instance, FreeBSD (and I think other BSDs) also have EVFILT_PROC, which lets you monitor a PID (not an fd) for events. One such event is NOTE_FORK, i.e., the monitored process just forked. Can you wait for such events with epoll? I'm not sure.

More generally, suppose you wanted to automatically start watching all descendants of the process for events as well. If I was required to use a separate fd to monitor that child process, then upon getting the fork event I'd have to somehow obtain an fd for that child process and then tell epoll about it, and in that window I may have missed the child process forking off a grandchild.

I'm not sure how to solve this kind of problem in the epoll world. I guess you could introduce a new fd type which represents a process and all of its descendants, and define some semantics for how epoll reports events on that fd type. In FreeBSD we can just have a dedicated EVFILT_PROC filter, no need for a new fd type. I'm not really sure whether that's better or worse.

oasisaimlessly · 2024-12-29T19:58:38 1735502318

AFAIK, just aio[1] (async file IO).

[1]: https://man.freebsd.org/cgi/man.cgi?query=aio&sektion=4&apro...

adrian_b · 2024-12-29T18:59:24 1735498764

I have not looked at kqueue, but there is no reason to believe that it would be affected by the introduction of a new event filter more than epoll is affected by the introduction of a new system call that creates a new kind of pseudo-file handles that can be passed to epoll.

The event filters must have a standard interface, so adding one more kind should not matter for kqueue more than the existence of a new kind of "file" handle matters for epoll.

The epoll API makes opaque for the user the differences between the kinds of kernel handles, but at some point inside the kernel the handles must be disambiguated and appropriate actions for each kind of handle must be performed, exactly like in kqueue various kinds of event filters must be invoked.

nobodyandproud · 2024-12-29T19:21:22 1735500082

Right. I’ve never worked in the kernel space, but I wanted to understand the author.

As you mentioned, event handlers still only care for certain messages and ignore others, so the disambiguation must happen.

Rereading it; what’s also not clear to me is whether epoll is a single queue, or one queue for each subsystem and handle type.

From a handler’s perspective, it seems like an implementation detail?

Again, I don’t know about this stuff; just hoping someone knowledgeable can help clear it up.

caboteria · 2024-12-29T15:36:46 1735486606

(2021)

sc68cal · 2024-12-29T19:23:59 1735500239

kqueue was if not the, then one of the first attempts to solve the issue of being more performant than select(). So this whole post boils down to "they didn't get it right the first time" to which I say, okay sure, but we only know that now after kqueue proved that event based queues were better than select.

This post is ahistorical and anti-social.

jcalvinowens · 2024-12-29T16:58:07 1735491487

I'll save you some time if you haven't read this yet:

> Hopefully, as you can see, epoll can automatically monitor any kind of kernel resource without having to be modified, due to its composable design, which makes it superior to kqueue from the perspective of having less technical debt.

That's literally the entire argument.

That's not what technical debt means. The author doesn't reference a single line of source code: this is vapid clickbait.

steeleduncan · 2024-12-29T17:40:35 1735494035

> That's not what technical debt means

Undoubtedly, I'm not sure why technical debt is in the title

> this is vapid clickbait

This seems unfair. It is a well written discussion of the difference between kqueue on BSDs and epoll on linux, and a historical overview of predecessor APIs. It just has nothing to do with technical debt, which, given the title, is admittedly odd

codr7 · 2024-12-29T20:17:40 1735503460

I'm curious: What would be the difference between suboptimal decisions made earlier in the history of a code base and technical debt?

SMP-UX · 2024-12-29T23:27:43 1735514863

You don't have to necessarily use kqueue on a BSD. In areas where it might make more sense performance wise you can obviously use poll. I'm not opposed to introducing an alternative interface like epoll, but it's only recently that Linux has managed to bridge the network performance gap. I attribute that to a lot of money being poured into it with pure and raw optimization from all of the Linux sponsors that they get which is really cool and great for them but it does not necessarily make gnu/Linux a purer system. Both approaches have their merits. There's no reason to sit here and claim that one is better than the other

StayTrue · 2024-12-29T17:54:10 1735494850

Technical debt has become a meaningless term.

mwkaufma · 2024-12-29T18:29:55 1735496995

Making a mountain out of a molehill.

delta64 · 2024-12-30T02:43:56 1735526636

It's slightly odd to say 'wow, epoll can wait on any type of kernel handle, like timerfd and signalfd!' given those *fd interfaces were added just for select/poll/epoll. Strikes me as remarkably ugly design...

kev009 · 2024-12-29T19:54:57 1735502097

This is pretty facepalm, has the author actually used any of these extended epoll APIs? inotify - deprecated. aio - complete footgun in linux. Whatever point is trying to be made here is lost in reality, I fail to see how having a bunch of additional function calls is superior to event filters and there is no coherent argument laid out.. how is kqueue not composable if it too has added event types (i.e. Mach ports they mention)? You're going to have to modify the kernel in either case.

Validark · 2024-12-29T16:37:34 1735490254

A good start to an article, but unfortunately it feels like it was cut off before the asserted argument was demonstrated to be true.

netbsdusers · 2024-12-29T22:34:11 1735511651

Five points in reply:

1. Signalfd is a mountain of technical debt. It's not like a file at all. It reads entirely different things depending on which process is reading from a single common open file description, and it interacts with epoll in a most bizarre way, in that the process that added the signalfd to the poll set is the one whose signals will notify readiness - even if that process then closed the epoll and signalfd.

2. Signalfd is a mountain of technical debt because they built it over signals. Signals are of two forms: the one is asynchronous, the other synchronous. The one should be replaced with a generic message queue in a technical debt free design. The other is debatable. (Mach and also Fuchsia have an "Exception Port" instead, a message port someone can read from to deal with exceptions incurred by some thread.)

3. Regarding: > epoll can automatically monitor any kind of kernel resource without having to be modified, due to its composable design

Well, so can kqueue - EVFILT_READ/EVFILT_WRITE will work just as well on any file, be it a socket, an eventfd, or even another kqueue (though I don't think anyone who isn't dealing with technical debt should be adding kqueues to kqueues, nor epolls to epolls.) No need to modify kqueue! But in either case, though, you've got to modify something. Whether that's trying to fit signals into a file and getting signalfd, or whether it's adding a signal event filter and getting EVFILT_SIGNAL, something has to be done.

4. FreeBSD's process descriptors predate Linux's pidfds. They are also better and less technically indebted because they are pure capabilities (they were invented as part of the Capsicum object-capability model based security architecture for FreeBSD) while pidfds are not: what you can do with a pidfd is decided on based on the ambient authority of the process trying to, say, signal a pidfd, and not on rights intrinsic to the fact of having a pidfd. In fact, these pidfds are not even like traditional Unix open file descriptions, whose rights are based on the credential of who opened them. This makes privilege-separation architectures impossible with pidfds, but I digress.

5. The author ignored the actual points people argued against epoll with, viz. that 1) epoll's edge triggering behaviour is useless, and 2) that epoll's conflation of file descriptor with open file description is a terribly leaky abstraction:

https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-... https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-...

vacuity · 2024-12-30T03:40:36 1735530036

> Signals are of two forms: the one is asynchronous, the other synchronous. The one should be replaced with a generic message queue in a technical debt free design. The other is debatable.

Unfortunately, sychronous signals of some form are here to stay. That is, having that interrupt-like context switch blocking the same thread. But the ergonomics could definitely be improved. Seems like exception ports are quite useful, but to my understanding they lack some flexibility and performance, so in the end I'd like to emphasize the interrupt style as the bare minimum. Self-paging in Nemesis and Barrelfish reflects this "delegate an individual's responsibility to the individual" principle. Definitely still use a message queue abstraction if applicable, I love that stuff. And since I want the whole state-machine-async-event-loop hog anyways, a top-level event loop fits in nicely as the handler for synchronous signals.

> In fact, these pidfds are not even like traditional Unix open file descriptions, whose rights are based on the credential of who opened them. This makes privilege-separation architectures impossible with pidfds, but I digress.

I knew about capabilities and file descriptors but not this about pidfd...that is an ouch indeed.

WesolyKubeczek · 2024-12-29T18:52:07 1735498327

I went to read the article hoping that it would shed light on some nitty-gritty internals of the Free(or any other)BSD kernel and tell a story.

The story would be, of course, about how the brave and bold developers undertook a journey to add something to the kqueue mechanism, or fix a bug in it, or increase its performance, or improve its scalability, and how on their way there they hit the legendary, impenetrable Mountain of Technical Debt. How they tried to force their way over the mountain first, but how the dangers and the cold drove them to try to go in depth, via disused tunnels made by some hard-working fearless people many moons ago. In there, our brave team discovers long-decayed badly burned bodies, rusted tools, and fragmentary diaries made by some of those people.

They learn that the hard-working tunnelling people undertook a quest not unlike their own; they also felt that the Mountain of Technical Debt could actually present an opportunity to enrich themselves, and while mining it for Bugfixes and Improvements here and there, they accidentally awoke an Ancient Evil of Regression-Rich Behavior, the Mother of All Heisenbugs.

The hard-working tunnelling people were no strangers to the battle, and they faced the enemy, but it overpowered them with Crippling Regressions and Massive Burnout. Only a few survivors lived to tell the tale, and while what they told was confusing, the clear message was to avoid the Mountain and never approach the tunnels again. At least this was what everyone remembered, but almost everyone forgot what evil fought them and drove them out.

Our heroes realize that their quest has suddenly become much more dangerous than they thought, but they don't want to stop halfway, and they can be hard-working and hard-fighting too, and their tools are much more superior than those of their predecessors. They also think that they are much more resistant to burnout. Of course they go on, and eventually they meet the creature. Each attempt at improving kqueue yields such obscure regressions and bugs elsewhere in the system, that half of the team succumbs to Burnout unleashed on them despite youth, vigor, tools, and strength. Two brave souls, armed with the best debuggers, fuzzers, and in-depth code analyzers, manage to escape the Mother of Heisenbugs, wounding it, and damaging some of the Mountain itself. They describe their findings in a much more coherent way than their surviving predecessors have managed to, and conclude that the Mountain and the creature cannot be defeated on their home ground, not until a viable replacement arises which will be compatible with and and all the quirks the software grew to rely on. This replacement would starve the creature and the Mountain would erode.

The story of the kqueue Replacement Building would be a separate one, in a different genre, with our brave heroes acting as wise advisors to the new cast of protagonists, detailing challenges they all run into, and how their combined wisdom, intelligence, and superb teamwork and camaraderie help them overcome hardships — sometimes barely, but still.

Needless to say, the article failed me greatly. And it didn't even make an attempt to describe where the debt was!

ndesaulniers · 2024-12-29T19:04:08 1735499048

What was the AI prompt that generated that? LotR but programming?

WesolyKubeczek · 2024-12-29T20:23:04 1735503784

Purely organic, no AI used whatsoever.

(Lord of io_urings, oh my)

cardiffspaceman · 2024-12-29T20:18:13 1735503493

On the other hand, I enjoyed reading it.

ndesaulniers · 2024-12-30T01:09:00 1735520940

I don't even know what's real anymore...