Do C++ coroutines give you a sane call stack when chained? I’ve only worked with proprietary implementations that do not, and it is an utter catastrophe for maintenance once they have infested a large code base.
There's something truly perverse about the way languages are recapitulating the evolution of call stacks in the form of coroutines[1] and promises.
Erlang, Go, Lua, and Scheme get this right. Stackless Python never caught on :( But Java may get proper stackful coroutines in the near future.
[1] Stackless coroutines, which means you can't yield across nested function invocations. The unfortunately named Stackless Python actually implements stackful coroutines; "stackless" in Stackless Python refers to not implementing the Python call stack on the C ABI stack. Stackful coroutines are basically threads (sometimes called microthreads or fibers to distinguish them from the C ABI/kernel thread construct) with a language-level construct for directly passing control to another thread.
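To illustrate the stackless limitation in C++20 terms (a hand-rolled toy generator of my own, since std::generator only arrived in C++23): co_yield can only suspend the function that lexically contains it, so a nested callee can never suspend its caller.

```cpp
#include <coroutine>
#include <cstdio>

// Hand-rolled toy generator (std::generator is C++23).
struct Gen {
    struct promise_type {
        int current;
        Gen get_return_object() {
            return Gen{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        std::suspend_always yield_value(int v) { current = v; return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
    std::coroutine_handle<promise_type> h;
    ~Gen() { if (h) h.destroy(); }
    bool next() { h.resume(); return !h.done(); }
    int value() const { return h.promise().current; }
};

int helper() {
    // co_yield 42;  // ill-formed: this would turn helper() into its own
    // coroutine (and int has no promise_type); a nested call can never
    // suspend the caller's frame with a stackless design.
    return 42;
}

Gen numbers() {
    co_yield 1;
    co_yield helper();  // the yield has to happen here, in the coroutine itself
}

int main() {
    Gen g = numbers();
    while (g.next()) std::printf("%d\n", g.value());
}
```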
There's nothing about stackless coroutines that means you can't have good stack traces. For example, C# already does it, at least to some degree.
Stackful coroutines are clearly a viable tool, but they don't work in all use cases. They require either segmented stacks, a precise GC, or memory usage comparable to kernel threads. They are tricky to implement correctly on Windows. Etc.
In my ideal world, we'd have stackless coroutines with great debugger support everywhere, with languages free to experiment with the syntax: explicit suspension points, implicit suspension points, effect polymorphism to make it look like you're yielding across nested function calls, and so on.
> They require either segmented stacks, a precise GC
Which is _exactly_ what stackless coroutines and promises do in a very roundabout manner. Some languages are moving toward annotations to automate chaining, but the problem with having to explicitly annotate coroutines is that you no longer have (or can have) first-class functions; at best you now have multiple classes of functions that only interoperate seamlessly with their own kind, which is the opposite of first-class functions. Plus it's much slower than just using a traditional call stack.
Implementing transparently growable or movable stacks can be difficult, yes. Solving C ABI FFI issues is a headache. And languages that stack-allocate variables but cannot easily move them are in a real pickle, though they can do as Go does and only stack-allocate variables that don't have their address taken; in any event that only applies to C++ and Rust. There's no excuse for all the other modern languages. Languages like Python and JavaScript don't have stackful coroutines because of short-sighted implementation decisions that are now too costly to revisit. Similarly, Perl 6 doesn't officially have them because it prematurely optimized its semantics for targets like the JVM, where efficient implementations were thought to be difficult. (MoarVM implements stackful coroutines to support gather/take, which shows that it was simply easier to implement the more powerful construct in order to support the less powerful gather/take construct.)
If it were easy we wouldn't have stackless coroutines at all, because they're objectively inferior in every respect: absent external constraints (being beholden to the C stack), stackful coroutines can result in less memory usage and fewer wasted CPU cycles in both the common and edge cases. But both PUC Lua and LuaJIT do it properly and are among the fastest interpreted and JIT'd implementations, respectively, so I think the difficulty is exaggerated.
I understand why these other constructs exist, but I still think it's perverse. At some point we should just revisit and revise the underlying platform ABI so we can get to a place where implementing stackful coroutines is easier for all languages.
For example, the very same debugging information you might add to improve stack traces can be used by implementations to help, say, move objects. Make that mandatory as part of the ABI and a lot of cool things become possible, including easy reflection in compiled languages. DWARF is heavyweight and complex, but Solaris (and now FreeBSD and OpenBSD) supports something called the Compact C Type Format (CTF) for lightweight type descriptions, which shows how system ABIs could usefully evolve.
Newer languages shouldn't be tying themselves to incidental semantics from 40 years ago. Rather, they should be doing what hardware and software engineers were doing 40 years ago when they defined the primary semantics: properly abstract call stacks/call state into a first-class construct (i.e., a thread of control), while simultaneously pushing the implementation details down so they can be performant.
I think you're misunderstanding the benefit of stackless coroutines. Here's the core of the issue: stacks, by definition, are dynamically growable objects. But sometimes (in most cases?) the state that needs to be stored across yield points can be statically bounded up front. In that case, stackful coroutines (really, threads) add unnecessary overhead, because you pay for that ability to grow.

Golang, for instance, has to start stacks small and grow them by copying, because it can't reliably statically analyze a goroutine to determine the needed stack space. A properly-implemented stackless coroutine system (or even the classic C approach of writing state machines by hand), by contrast, can determine the amount of space necessary for each coroutine and allocate it precisely (sketched below). There are even slab allocation tricks for many coroutines of the same size.

Fundamentally, this is why manually-written C code using epoll can outperform any stackful coroutine system: optimal allocation of stackful coroutines in a language with recursion and first-class functions requires higher-order control-flow analysis, which is generally considered infeasible.
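To make the allocation argument concrete, here's the hand-written state machine approach in miniature (a toy of my own; real uses would carry I/O state rather than Fibonacci numbers):

```cpp
#include <cstdio>

// Hand-rolled "stackless coroutine" in the classic C style: all state
// that survives across yield points lives in one fixed-size struct, so
// the exact footprint is known at compile time and many frames can be
// slab-allocated. No growable stack is ever paid for.
struct FibFrame {
    unsigned long a = 0, b = 1;   // the entire persistent state

    unsigned long resume() {      // runs from one yield point to the next
        unsigned long out = a;
        unsigned long next = a + b;
        a = b;
        b = next;
        return out;
    }
};

int main() {
    static_assert(sizeof(FibFrame) == 2 * sizeof(unsigned long),
                  "frame size is fixed and known statically");
    FibFrame f;
    for (int i = 0; i < 10; ++i)
        std::printf("%lu ", f.resume());
    std::printf("\n");
}
```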
Of course, I've only focused on the performance issues here, and obviously there are also important issues of ergonomics at play. But my point is that there are benefits to stackless coroutines that can't be simply dismissed.
I'm not at all convinced. Stackless coroutines have a statically known, fixed frame size, which can be treated as an anonymous type a la C++ or Rust closures. This is far cheaper and easier to optimize than any possible implementation of stackful coroutines, regardless of what your platform ABI looks like.
Like I said, stackful coroutines are a useful tool. They just can't be the only implementation strategy, especially in C++/Rust-like languages.
This seems like a super-interesting discussion but I can't quite follow due to the mixed-up naming. Is there a good paper or website that clarifies what stackful/less means in the various contexts?
The distinctions are difficult to grasp, and it doesn't help that meanings evolve. For example, many people today understand "thread" to mean a preemptively scheduled kernel resource with a contiguous region of memory used for storing objects and return addresses, which shares an address space with other such threads. That sort of "thread" is what many people 10 years ago called an LWP (lightweight process), when "thread" didn't imply preemptive scheduling. And what was a "thread" 10 years ago is often called a "fiber" or "microthread" today.

Though I'm coming from the world of Unix. I think the connotations evolved slightly differently in the world of Windows, largely because Unix emphasized the "process" (concurrent execution, non-shared address space), and concurrent execution within a shared address space was only formalized in the late 1990s, whereas Windows didn't get isolated virtual address spaces until Windows 95, so the primary construct was always a "thread" (with or without preemptive execution, depending on the era).
When I say "thread" in my previous post I'm referring to the abstract construct that represents a flow of control of sequentially executed instructions, regardless of the underlying implementation, and regardless of how you initiate, yield, or resume that control. For languages that support function recursion, that necessarily implies some sort of stack (if not multiple stacks) for storing the state of intermediate callers that have invoked another function. Often such a stack is also used to store variables, both as a performance optimization and because many (perhaps most) objects in a program have a lifetime that naturally coincides with the execution of the enclosing function.
Such a "call stack" has historically been implemented as a contiguous region of memory, but some hardware (e.g., IBM mainframes) implements call stacks as linked lists of activation frames, which is effectively the data structure you're creating when you chain stackless coroutines.
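As a toy illustration of that data structure (invented names, nothing coroutine-specific; it just shows activation records linked by caller pointers, and how a backtrace falls out of walking the links):

```cpp
#include <cstdio>

// Activation records as a linked list rather than a contiguous region.
struct Frame {
    const char* fn;  // which function this record belongs to
    Frame* caller;   // link to the caller, instead of a fixed stack offset
};

// A debugger-style backtrace is just a walk along the caller links.
void backtrace(const Frame* f) {
    for (; f; f = f->caller) std::printf("  in %s\n", f->fn);
}

void leaf(Frame* caller) {
    Frame self{"leaf", caller};
    backtrace(&self);
}

void outer(Frame* caller) {
    Frame self{"outer", caller};
    leaf(&self);
}

int main() {
    Frame self{"main", nullptr};
    outer(&self);  // prints: in leaf / in outer / in main
}
```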
The two best sources I've found that help untangle the mess, in both theoretical and practical terms, are:
Excellent, thanks. I was first introduced to threads in OS/2 and then Windows, so for me "thread" still means preemptive + stack + shared address space (vs. a process, which has its own address space). I've run into single-stack multitasking in the context of RTOSes (like this: https://github.com/djboni/librertos), but then it becomes cooperative, so it's really a different beast.
The explanation of using a separate data structure to track activation frames is really the key thing for me here; otherwise I can't see how coroutines could work well. I guess there are cases, as mentioned in the original article, where the C++ compiler can determine that a coroutine could be hosted inside the current stack frame, but that's really a special case.
How do the terms in that paper map to the stackful/stackless distinction being discussed above? For example neither term appears in the paper at all, except for one mention of "Stackless Python" as a name.
I should have linked the other paper by Ierusalimschy, "Revisiting Coroutines". wahern linked it in his reply. They are both good reads to better understand coroutines.
I appreciate all your points. However, in my experience implementing coroutines and fibers, where I allocated stacks using `mmap` with guard pages, I'm not completely convinced that fixed-size stacks are an issue:
1/ The typical C stack is fixed size and code is written accordingly.
2/ `mmap` allocates virtual memory, so you can allocate a large stack as an upper limit; pages that are never touched consume address space but no physical memory.
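Roughly what I mean, as a minimal POSIX sketch (error handling is mostly elided, and a real fiber implementation would also align the stack top per the ABI):

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

// Reserve a large virtual range for a fiber stack, with an inaccessible
// guard page at the low end so overflow faults loudly instead of silently
// corrupting a neighbor. Untouched pages consume address space, not RAM.
static void* alloc_fiber_stack(std::size_t size) {
    long page = sysconf(_SC_PAGESIZE);
    char* base = static_cast<char*>(mmap(nullptr, size + page,
                                         PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (base == MAP_FAILED) return nullptr;
    mprotect(base, page, PROT_NONE);  // guard page below the stack
    return base + page + size;        // stacks grow down: hand out the top
}

int main() {
    void* top = alloc_fiber_stack(1 << 20);  // 1 MiB upper limit per fiber
    std::printf("stack top at %p\n", top);
}
```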
How about fixed-size stacks, with a canary and an expansion function at the top?
That way you could have both: the speed of a fixed size, and expandability on demand, by simply checking for the canary pattern and executing the expansion function when it's hit.
References about Java? I was looking for coroutines / CPS for Java recently and found no JSRs and no plans to include it in the language anytime soon.
In my experience it depends on the tooling: the compiler needs to emit the necessary debugging symbols, and the debugger needs to interpret them correctly.
I'd like to improve this situation; right now the following markers are commented out because I've had issues with them.
>I’ve only worked with proprietary implementations that do not
Hmm, what does this mean? I've used coroutines in Unity and, in that respect, yielding functions in C# as well as generators in Python. The stacks are sane in the sense that they include the currently running iteration.
Is that tricky to implement? Are you expecting something different? It seems pretty straightforward to me.
It depends what you mean by "heavy", but the common example is something like thread-per-request or implementing async IO using threads that make blocking calls.
In those contexts threads are primarily "heavy" because each thread needs a dedicated stack, so there is a fairly low limit on the total number of threads you can have based on your available memory; you run into scaling limitations whenever you'd naturally want hundreds of thousands or millions of such threads. (With a typical 1 MiB stack reservation per thread, a million threads means roughly a terabyte of address space just for stacks.)
The CPU use of context switches is a secondary factor there and probably pales compared to IO costs - and some competing designs also have context switch costs.
Threads are also heavy in the sense that, if they are OS-scheduled, behavior often degrades with large numbers of threads, depending on the implementation. Recent scheduler designs are mostly able to mitigate this, however.