Would memory safe languages avoid these kinds of problems? It seems like a good ...

layer8 · 2024-12-31T15:44:48 1735659888

Depends. The underlying issue for this bug is that the code involved crosses language boundaries (the Windows kernel and win32 libraries written in C and the application in C++). The code where the lifetime failure occurs is Windows code, not application code. However, the Windows code is correct in the context of the C language. The error is caused by an APC that calls exception-throwing C++ code, being pushed onto the waiting-in-C thread. This is a case of language-agnostic OS mechanisms conflicting with language-specific stack unwinding mechanisms.

This could only be made safe by the OS somehow imposing safety mechanisms on the binary level, or by wrapping all OS APIs into APIs of the safe language, where the wrappers have to take care to ensure both the guarantees implied by the language and the assumptions made by the OS APIs. (Writing the OS itself in a memory-safe language isn’t sufficient, for one because it very likely will still require some amount of “unsafe” code, and furthermore because one would still want to allow applications written in a different language, which even if it also is memory-safe, would need memory-correct wrappers/adapters.)

This is similar to the distinction between memory-safe languages like Rust where the safety is established primarily on the source level, not on the binary level, and memory-safe runtimes like the CLR (.NET) and the JVM.

jpc0 · 2024-12-31T17:17:57 1735665477

> the Windows kernel and win32 libraries written in C and the application in C++

To my knowledge the kernel and win32 are in fact written in C++, and only the interface has C linkage and follows C norms.

So this error occurred going C++ > C > C++ never mind languages with different memory protection mechanisms like Rust > C > C++.

layer8 · 2024-12-31T21:54:39 1735682079

It’s an unholy combination of C, C++, and Microsoft extensions at worst. But apart possibly from some COM-related DLLs, the spirit is clearly C, and C++ exceptions are generally not expected. (There may be use of SEH in some parts.)

Of course, you can write C++ without exception safety too, but “C++ as a better C” and exception-safe C++ are effectively like two different languages.

ryao · 2025-01-01T08:33:42 1735720422

I filed bugs against both GCC and LLVM asking for compiler warnings that would inform developers of the risk:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118263

https://github.com/llvm/llvm-project/issues/121427

rurban · 2024-12-31T20:28:08 1735676888

No, the windows kernel is written in pure C.

cryptonector · 2024-12-31T23:08:42 1735686522

I believe it's C++, but not allowed to use exceptions.

rurban · 2024-12-31T23:23:44 1735687424

We know that it's pure C, because it leaked.

cryptonector · 2025-01-01T00:34:37 1735691677

All of it is C?

protomolecule · 2025-01-01T12:08:39 1735733319

>The error is caused by an APC that calls exception-throwing C++ code

The article doesn't say it was a C++ exception. Could've been a SEH exception.

saagarjha · 2024-12-31T13:01:51 1735650111

No*. This is one of the bugs that traditional memory safety would not fix, because the issue crosses privilege boundaries in a way that the language can't protect against.

*This could, in theory, be caught by fancy hardware strategies like capabilities. But those are somewhat more esoteric.

kibwen · 2024-12-31T14:24:44 1735655084

To elaborate, the problem here is that it looks like the OS API itself is fundamentally unsafe: it's taking a pointer to a memory location and then blindly writing into it, expecting that it's still valid without actually doing any sort of verification. You could imagine an OS providing a safe API instead (with possible performance implications depending on the exact approach used), and if your OS API was written in e.g. Rust then this unsafe version of the API would be marked as `unsafe` with the documented invariant "the caller must ensure that the pointer remains valid".
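As a rough illustration of that invariant, here is a minimal C++17 sketch. `FakeKernel` and `WaitGuard` are hypothetical names invented for this example; they only model a kernel that remembers an output address, and they are not any real Windows API. The point is just that the code owning the output slot also takes responsibility for unregistering it before the slot goes away.

```cpp
#include <cassert>

// Hypothetical stand-in for a kernel that records an output address
// and writes a status into it at some later point.
struct FakeKernel {
    int* pending_out = nullptr;

    void begin_wait(int* out) { pending_out = out; }

    // Forget the address, but only if it is the one currently registered.
    void cancel_wait(int* out) {
        if (pending_out == out) pending_out = nullptr;
    }

    // Deliver a result; returns false when no address is registered.
    bool complete(int status) {
        if (pending_out == nullptr) return false;
        *pending_out = status;
        pending_out = nullptr;
        return true;
    }
};

// Owns the output slot and unregisters it before the slot dies,
// whether the scope ends normally or by exception unwinding.
class WaitGuard {
public:
    explicit WaitGuard(FakeKernel& kernel) : kernel_(kernel) {
        kernel_.begin_wait(&status_);
    }
    ~WaitGuard() { kernel_.cancel_wait(&status_); }

    WaitGuard(const WaitGuard&) = delete;
    WaitGuard& operator=(const WaitGuard&) = delete;

    int status() const { return status_; }

private:
    FakeKernel& kernel_;
    int status_ = -1;  // -1 means "no result yet"
};
```

This only helps when the frame that owns the slot is C++ with destructors. A plain C frame that gets unwound by a foreign exception has no such hook, which is the situation described in the article.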

ryao · 2025-01-01T08:59:29 1735721969

They passed a function that throws an exception to a C ABI function. C ABI functions cannot tolerate exceptions because C does not support stack unwinding. It might work anyway, but it is technically undefined behavior and it will only ever work when simply deallocating what is on the stack does not require any cleanup elsewhere.

The exception caused the stack frame to disappear before the OS kernel was done with it. Presumably, the timeout would have been properly handled had the stack not been unwound by the exception. If it had not, that would be a bug in Windows.

There is a conceptually simple solution to this issue, which is to have the C++ compiler issue a warning when a programmer does this. I filed bug reports against both GCC and LLVM asking for one:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118263

https://github.com/llvm/llvm-project/issues/121427

jpc0 · 2024-12-31T17:15:24 1735665324

Seeing as Rust has no stable ABI and likely never will, how would you provide the API in Rust, and also in golang, .NET, Swift, Java, and whatever other language you add, without doing exactly what Win32 does and going to C, which has a stable ABI to tie into all those other languages?

pornel · 2024-12-31T20:16:01 1735676161

Rust ecosystem solves that by providing packages that are thin wrappers around underlying APIs. It's very similar to providing an .h file with extra type information, except it's an .rs file.

Correctness of the Rust wrapper can't be checked by the compiler, just like correctness of C headers is unchecked, and it just has to match the actual underlying ABI.

The task of making a safe API wrapper can be relatively simple, because you don't have to take into consideration the safety of an application as a whole; you only need to translate requirements of individual APIs to Rust's safety requirements, function by function. In this case you would need to be aware that the function call may unwind, so whether someone making a dedicated safe API for it would think of that or not is only speculation.

jpc0 · 2024-12-31T21:11:28 1735679488

I seem to remember a Linux kernel dev quitting, and not being able to specify exactly what you say this wrapper should abide by being a contributing factor.

If those specifications were written down clearly enough then this dev wouldn't have needed to spend 5 days debugging this, since he spent a significant amount of time reading the documentation looking for any mistake of theirs that the documentation mentions.

And don't say that they can actually just read the Rust code and check that, since, well, I can't read low-level Rust code or tell how the annotations can interact with each other.

A single line of Rust code could easily need several paragraphs of written documentation so that someone not familiar with what Rust is specifying will actually understand what that entails.

This is part of why Rust is difficult, you have to nail down the specification and a small change to the specification causes broad changes to the codebase. The same might need to happen in C, but many times it doesn't.

pornel · 2025-01-02T11:35:10 1735817710

That Linux drama was due to "nontechnical nonsense" of maintainers refusing to document their APIs' requirements.

In C you can have a function that returns a pointer, and get no information how long that pointer is valid for, what is responsible for freeing it, whether it's safe to use from another thread. That's not only an obstacle for making a safe Rust API for it, that's also a problem for C programmers who don't want to just wing it and hope it won't crash.

The benefit of safe wrappers is that as a downstream user you don't need to manually check their requirements. They're encoded in the type system that the compiler checks for you. If it compiles, it's safe by Rust's definition. The safety rules are universal for all of Rust, which also makes it easier to understand the requirements, because they're not custom for each library or syscall. The wrappers boil it down to Rust's references, lifetimes, guards, markers, etc. that work the same everywhere.

jpc0 · 2025-01-02T14:58:18 1735829898

> ... you only need to translate requirements of individual APIs to Rust's safety requirements...

> That Linux drama was due to "nontechnical nonsense" of maintainers refusing to document their APIs' requirements.

> If those specifications were written down clearly enough then this dev wouldn't have needed to spend 5 days debugging...

wat10000 · 2024-12-31T19:26:03 1735673163

What would this safe API look like? The only thing I can think of would be to have the kernel allocate memory in the process and return that pointer, rather than having the caller provide a buffer. Performance would be painful. Is there a faster way that preserves safety?

LorenPechtel · 2024-12-31T21:29:17 1735680557

No allocation--it returns the address of a buffer in a pool. Of course this permits a resource leak. It's a problem with no real solution.

quotemstr · 2024-12-31T14:24:10 1735655050

Safe code definitely won't have this sort of problem. Any code that could invoke a system call to scribble on arbitrary memory is by definition unsafe.

saagarjha · 2024-12-31T14:31:34 1735655494

That's basically all code

quotemstr · 2024-12-31T14:35:43 1735655743

No it isn't. You can write safe file IO in Rust despite the read and write system calls being unsafe.

saagarjha · 2024-12-31T14:46:42 1735656402

I take it you are not familiar with the classic Rust meme of opening /proc/self/mem and using it to completely wreck your program?

IshKebab · 2024-12-31T15:09:53 1735657793

That's obviously outside the scope of the language's safety model, and it would be quite hard to do that accidentally.

saagarjha · 2024-12-31T16:35:50 1735662950

That is exactly my point, though: system calls are completely outside the scope of a language's safety model. You can say, well /proc/self/mem is stupid (it is) and our file wrappers for read and write are safe (…most languages have at least one), but the fundamental problem remains that you can't just expect to make system calls without that being implicitly unsafe. In the extreme the syscall itself cannot be done safely, with no possible safe wrapper around it. My point is that if you are calling these Windows APIs you can't do it safely from any language; Rust won't magically start yelling at you that the kernel still expects you to keep the buffer alive. You can design your own wrapper around it and try to match the kernel's requirements but you can do that in a lot of languages, and that's kind of missing the point.

loeg · 2024-12-31T16:59:44 1735664384

Right. And of course, it's not just Windows. For example the Linux syscall aio_read() similarly registers a user address with the kernel for later, asynchronous writing (by the kernel). (And I'm sure you get similar lifetime issues with io_uring operations.)

ryao · 2025-01-01T09:12:44 1735722764

While I am not aware of a Linux syscall that would be equivalent to QueueUserAPC() to allow this to happen, the kernel writing to stack memory is not the problem here. The problem is that a C++ exception was invoked and it unwound a C stack frame. C++ exceptions that unwind C stack frames invoke undefined behavior, so the real solution is to avoid passing function pointers to C++ functions not marked noexcept to C functions as callbacks. It is rather unusual that Windows permits execution on the thread while the kernel is supposed to give it a return value. Writing to the stack is not how I would expect a return value to be passed. Presumably, had the stack frame not been unwound, things would have been fine, unless there is a horrific bug in Windows that should have been obvious when QueueUserAPC() was first implemented.

Anyway, it is a shame that the compiler does not issue a warning when you do this. I filed bug reports with both GCC and LLVM requesting that they issue warnings, which should be able to avoid this mess if the compilers issue them and developers heed them:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118263

https://github.com/llvm/llvm-project/issues/121427

IshKebab · 2024-12-31T17:28:30 1735666110

The bug was not because a system call was involved. It was a multithreaded lifetime issue, which is completely within Rust's safety model.

To put it another way, you can design a safe wrapper around this in Rust, but you can't in C++.

saagarjha · 2024-12-31T17:38:54 1735666734

No. The kernel has no idea what your lifetimes are. There’s nothing stopping a buggy Rust implementation from handing out a pointer for the syscall (…an unsafe operation!) and then accidentally dropping the owner. To userspace there are no more references and this code is fine. The problem is the kernel doesn’t care what you think, and it has a blank check to write where it wants.

IshKebab · 2024-12-31T18:19:24 1735669164

That's no different to FFI with any C code. There's nothing unique to this being a kernel or a syscall. There are plenty of C libraries that behave in a similar way and can be safely wrapped with Rust by adding the lifetime requirements.

fc417fc802 · 2024-12-31T22:42:13 1735684933

> can be safely wrapped with Rust

They can't. Rust can't verify the safety of the called code once you cross the language boundary. Handing out the pointer is inherently unsafe.

In the user space FFI case at least you might be able to switch to an implementation written in the same (memory safe) language that you are already using. Not so for a syscall.

IshKebab · 2025-01-01T09:25:02 1735723502

Rust can't verify the correctness of the kernel code, but the problem here wasn't incorrect kernel code!

The problem was that the C API exposed by the kernel did not encode lifetime requirements, so they were accidentally violated. Rust APIs (including ones that wrap C interfaces) can encode lifetime requirements, so you get compile time errors if you screw it up.

I don't think you can win this argument by saying "but you have to use `unsafe` to write the Rust wrapper". That's obviously unavoidable.

ryao · 2025-01-01T11:28:07 1735730887

There was no problem with lifetime requirements. The problem was that a pointer to a C++ function that could throw exceptions was passed to a C function. This is undefined behavior because C does not support stack unwinding. If the C function's stack frame has no special requirements for how it is deallocated, then simply deallocating the stack frame will work fine, despite this being undefined behavior. In this case, the C function had very special requirements for being deallocated, so the undefined behavior became stack corruption.

As others have mentioned, this same issue could happen in Rust until very recently. As of Rust 1.81.0, Rust will abort instead of unwinding C stack frames:

https://blog.rust-lang.org/2024/09/05/Rust-1.81.0.html#abort...

That avoids this issue in Rust. As for avoiding it in C++ code, I have filed bugs against both GCC and LLVM requesting warnings:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118263

https://github.com/llvm/llvm-project/issues/121427

Once the compilers begin emitting warnings, this should not be an issue anymore as long as developers heed the warnings.

muststopmyths · 2024-12-31T16:22:11 1735662131

In this specific type of Win32 API case, I can think of a way to make this safe.

It would involve looking at the function pointer in QueueUserAPC and making sure the function being called doesn't mess with the stack frame being executed on.

This function will run in the context of the called thread, in that thread's stack. NOT in the calling thread.

It's a weird execution mode where you're allowed to hijack a blocked thread and run some code in its context.

Don't know enough about Rust or the like to say if that's something that could be done in the language with attributes/annotations for a function, but it seems plausible.

loeg · 2024-12-31T17:07:46 1735664866

Perhaps simpler would be to just not unwind C++ exceptions through non-C++ stack frames and abort instead. (You'd run into these crashes at development time, debugging them would be pretty obvious, and it'd never release like this.) This might not be viable on Windows, though, where there is a lot of both C++ and legacy C code.

charrondev · 2024-12-31T19:18:46 1735672726

As I understand it, this was recently stabilized in Rust and is now the default behaviour.

https://blog.rust-lang.org/2024/09/05/Rust-1.81.0.html#abort...

You have to explicitly opt into unwinding like this now otherwise the program will abort.

ryao · 2025-01-01T11:35:59 1735731359

Another possibility is to avoid it in the first place by not allowing C++ function pointers that are not marked noexcept to be passed to C functions. I filed bugs against both GCC and LLVM requesting warnings:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118263

https://github.com/llvm/llvm-project/issues/121427

If/when they are implemented, they will become errors with -Werror.

loeg · 2025-01-01T18:30:39 1735756239

Doesn't seem all that useful unless C++ compilers will start warning about noexcept functions calling exception-throwing functions -- they don't today: https://godbolt.org/z/4qbcbxaET .

ryao · 2025-01-01T19:48:46 1735760926

That is supposed to be handled at runtime:

  > Whenever an exception is thrown and the search for a handler ([except.handle]) encounters the outermost block of a function with a non-throwing exception specification, the function std :: terminate is invoked ([except.terminate])

https://timsong-cpp.github.io/cppwp/n4950/except.spec#5

If it is not, then there is a bug in the C++ implementation.
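A small sketch of what that means in practice, with made-up function names: the compiler accepts a `noexcept` function that calls a potentially throwing one, and the `noexcept` operator only reports the declared specification. Nothing here is checked against what the callee can actually do; the enforcement is the `std::terminate` call at run time.

```cpp
#include <cassert>
#include <stdexcept>

// Ordinary function that may throw.
void may_throw(bool fail) {
    if (fail) throw std::runtime_error("boom");
}

// Accepted by the compiler even though the callee can throw.
// If may_throw() does throw here, std::terminate is called at run time.
void claims_nothrow(bool fail) noexcept {
    may_throw(fail);
}

// The noexcept operator only reads the declared exception specification.
static_assert(noexcept(claims_nothrow(true)), "declared noexcept");
static_assert(!noexcept(may_throw(false)), "not declared noexcept");
```

Calling `claims_nothrow(false)` returns normally; calling `claims_nothrow(true)` would end the process through `std::terminate` rather than unwinding into the caller.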

loeg · 2025-01-01T21:23:46 1735766626

Catching it at runtime somewhat defeats the benefit of your approach upthread:

> Another possibility is to avoid it in the first place by not allowing C++ function pointers that are not marked noexcept to be passed to C functions.

ryao · 2025-01-01T22:13:32 1735769612

The two would combine to avoid situations where people spend 5 days debugging like Unity did.

That said, my personal preference is to use C instead of C++, which avoids the issues of exceptions breaking kernel expectations entirely.

LegionMammal978 · 2024-12-31T17:06:03 1735664763

Nothing in C can prevent your function from being abnormally unwound through (whether it's via C++ exceptions or via C longjmp()). The only real fix is "don't use C++ exceptions unless you're 100% sure that the code in between is exception-safe (and don't use C longjmp() at all outside of controlled scenarios)".

ryao · 2025-01-01T11:34:39 1735731279

A better fix is to avoid passing pointers to C++ functions that can throw exceptions to C functions. This theoretically can be enforced by the compiler by requiring the C++ function pointers be marked noexcept.
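Within a C++ codebase this can already be approximated without compiler changes, assuming C++17 or later, where `noexcept` is part of the function type. The sketch below uses a hypothetical `c_register` as a stand-in for a C API; the wrapper name and callback are likewise invented for illustration.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical C-style API that invokes a callback (a stand-in, not a real OS call).
extern "C" void c_register(void (*cb)(void*), void* ctx) { cb(ctx); }

using throwing_callback = void (*)(void*);
using nothrow_callback = void (*)(void*) noexcept;

// Wrapper that only accepts callbacks whose type promises not to throw.
void register_nothrow(nothrow_callback cb, void* ctx) { c_register(cb, ctx); }

// Example callback: increments the int that ctx points to.
void count_hit(void* ctx) noexcept { ++*static_cast<int*>(ctx); }

// A noexcept pointer converts to a plain one, but not the other way around,
// so passing a potentially throwing function to register_nothrow fails to compile.
static_assert(std::is_convertible_v<nothrow_callback, throwing_callback>,
              "noexcept pointer converts to plain pointer");
static_assert(!std::is_convertible_v<throwing_callback, nothrow_callback>,
              "plain pointer does not convert to noexcept pointer");
```

This only constrains callers who go through the wrapper. Anyone calling the underlying C function directly gets no such check, which is where a compiler warning would come in.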

I filed bugs against both GCC and LLVM requesting warnings:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118263

https://github.com/llvm/llvm-project/issues/121427

rramadass · 2024-12-31T21:15:10 1735679710

No. The problem was in the architecture of the asynchronous API w.r.t. the kernel. The last line of the article states: "Lesson learnt, folks: do not throw exceptions out of asynchronous procedures if you’re inside a system call!"

LorenPechtel · 2024-12-31T21:32:59 1735680779

More generally:

1) The top level of an async routine should have a handler that catches all exceptions and dies if it catches one.

2) If you have a resource you have a cleanup routine for it.
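Both points can be sketched in a few lines of C++17. All names here (`run_guarded`, `run_or_die`, `ScopeExit`) are invented for this example. The fatal handler is a parameter only so the catch-all can be exercised without killing the process; `run_or_die` is the variant that actually dies.

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>

// Runs `body`; if anything escapes, calls `on_fatal` instead of unwinding further.
template <typename Body, typename Fatal>
void run_guarded(Body&& body, Fatal&& on_fatal) noexcept {
    try {
        std::forward<Body>(body)();
    } catch (...) {
        std::forward<Fatal>(on_fatal)();
    }
}

// Point 1: top-level entry of an async routine.
// Any escaping exception ends the process instead of leaving the routine.
template <typename Body>
void run_or_die(Body&& body) noexcept {
    run_guarded(std::forward<Body>(body), [] { std::abort(); });
}

// Point 2: tie a resource to a cleanup routine that runs on every exit path.
template <typename Cleanup>
class ScopeExit {
public:
    explicit ScopeExit(Cleanup cleanup) : cleanup_(std::move(cleanup)) {}
    ~ScopeExit() { cleanup_(); }

    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    Cleanup cleanup_;
};
```

An APC-style callback would then have a body of the form `run_or_die([&] { /* real work */ });`, so nothing can propagate into the frames that called it.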

rramadass · 2025-01-01T01:59:58 1735696798

It is even more fundamental. People are focusing wrongly on the mention of exceptions here (the most obvious part), but what is crucial is to understand how async callbacks registered with a kernel work on all OSes. The limitations/caveats imposed on these routines (they are akin to interrupts) are given in their respective documentation, and one has to be careful to understand and use them appropriately; e.g. what is the stack used by these handlers? The article, though detailed in the beginning, sort of glosses over all this in the final paragraphs, and hence we have to link the dots ourselves.

LegionMammal978 · 2025-01-01T03:56:13 1735703773

It's not really about asynchronous callbacks or their equivalents. (In this case, the thread running it is otherwise meant to be blocked in a safe state, so that there's none of the usual dangers of interrupting arbitrary code.) Instead, it's about any callbacks coming out of C code, even something as trivial as qsort(). If you pass a C library your C++ callback, and your callback runs back through it with an exception, then 9 times out of 10, the C library will leak some resources at best, or reach an unstable state at worst. C just doesn't have any portable 'try/finally' construct that can help deal with it.

So I'd say it's more about the basic expectations of a function called from C, which includes a million other trivial things like "don't write beyond the bounds of buffers you're given" and "don't clobber your caller's stack frame" and "don't spawn another thread just to write to output pointers after your function returns" (not that any of these is the issue here).
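One common way to meet that expectation is to catch everything inside the callback, stash the exception, and rethrow only after the C code has returned normally. A minimal sketch follows; `c_for_each` is a hypothetical stand-in for a C library routine, defined here only so the example is self-contained, and the other names are invented too.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for a C library routine that calls back into user code.
extern "C" void c_for_each(const int* data, std::size_t n,
                           void (*cb)(int value, void* ctx), void* ctx) {
    for (std::size_t i = 0; i < n; ++i) cb(data[i], ctx);
}

struct CallState {
    int sum = 0;
    std::exception_ptr error;  // set if the C++ logic threw
};

// Callback handed to the C routine. It never lets an exception escape.
extern "C" void add_positive(int value, void* ctx) noexcept {
    auto* state = static_cast<CallState*>(ctx);
    if (state->error) return;  // already failed; skip the remaining work
    try {
        if (value < 0) throw std::invalid_argument("negative value");
        state->sum += value;
    } catch (...) {
        state->error = std::current_exception();
    }
}

// C++-facing wrapper: rethrows only after the C frames have returned normally.
int sum_positive(const int* data, std::size_t n) {
    CallState state;
    c_for_each(data, n, add_positive, &state);
    if (state.error) std::rethrow_exception(state.error);
    return state.sum;
}
```

The C routine always runs to completion, so whatever cleanup it does on its normal return path still happens, and the C++ caller still sees the exception.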

rramadass · 2025-01-01T06:28:21 1735712901

No, you (and most folks here) have not understood the full picture. Only the C ABI is relevant here and not the language (C/C++/whatever) itself.

You have to know how exactly asynchronous callbacks registered with the kernel get called, how their stack frames get set up, how the kernel writes to local variables within a stack frame of a user thread, how stack frames are adjusted when a blocking system call returns to user space and, finally, how and when exceptions (in any language) mess up the above when they implement a different flow of control than that expected by the above "async callback kernel API architecture". All of these are at play here, and once you put them together you understand the scenario.

IshKebab · 2024-12-31T15:07:27 1735657647

Yes memory safe languages would absolutely help here. In Rust you would get a compile time error about the destination variable not living long enough.

This sort of stuff is why any productivity arguments for C++ over Rust are bullshit. Sure you spend a little more time writing lifetime annotations, but in return you avoid spending 5 days debugging one memory corruption bug.

ryao · 2025-01-01T11:36:59 1735731419

This is not a memory corruption bug. It is an undefined behavior bug and it also affected Rust until 1.81.0 as per comments from others:

https://blog.rust-lang.org/2024/09/05/Rust-1.81.0.html#abort...