I know, but why doesn't the OS do this too? Most programmers don't care whether their "thread" is an actual thread on the machine or just a "fiber". If the benefits are this large, the OS should provide it, in my opinion.
OS scheduling requires switching between different privilege modes on the CPU. Essentially, the OS has the power to perform operations that regular code does not, and threads and processes are usually built on top of these additional permissions.
Context switching is relatively expensive. The CPU needs to push a lot of application state out of the way to clear the path for the OS code to run, then after the OS is done, push the OS out of the way and retrieve the application's state and code again.
Whereas a fiber remains entirely inside the application code. It never requires a context switch. But a fiber loses some of the powers of the OS: for example, it can't draw hard memory boundaries between fibers that will be enforced by the CPU.
For some things you want processes, for some threads, for some fibers.
Context switching doesn't need to be as expensive as it is in these cases, it's just that Linux doesn't provide the mechanisms needed to make it more efficient. See, e.g., https://blog.linuxplumbersconf.org/2013/ocw/system/presentat... which implemented a syscall for a process-directed context switch directly to another thread.
It’s the OS context switch that makes process threads so much more costly than user-scheduled fibers. You cannot involve the OS if you want efficiency.
The OS doesn't know the details of what's specific to a given "thread", so it has to take a "big dumb sledgehammer" approach to switching tasks. It has to switch out the whole stack (4k or 8k) even when only a few bytes belong to that particular task, because it has no way of knowing which bytes those are. And it has to interrupt tasks at essentially arbitrary times rather than waiting for them to yield, because, again, it has no way of knowing where the yield points are. That in turn means the tasks need extra synchronization overhead etc. to work around the fact that they might be preempted at any point.
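To illustrate the yield-point contrast: here's a minimal sketch of a cooperative userspace scheduler, where each task runs to an explicit yield point and simply returns control, so no preemption and no synchronization are needed. This is purely illustrative (the Task interface and counter tasks are made up, not any real runtime's API):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class CoopSched {
    // Each task runs until its next yield point, then returns
    // true if it has more work. No preemption, no OS involvement.
    interface Task { boolean step(); }

    static Task counter(String name, int limit) {
        int[] n = {0};  // the few bytes of state this "fiber" actually owns
        return () -> {
            System.out.println(name + ":" + n[0]);
            return ++n[0] < limit;  // yield back to the scheduler
        };
    }

    public static void main(String[] args) {
        Queue<Task> ready = new ArrayDeque<>();
        ready.add(counter("A", 2));
        ready.add(counter("B", 2));
        while (!ready.isEmpty()) {      // round-robin: run one step,
            Task t = ready.poll();      // re-enqueue if not finished
            if (t.step()) ready.add(t);
        }
    }
}
```

Because every switch happens at a known yield point, the interleaving is fully deterministic, which is exactly what the OS can't assume when it preempts at arbitrary times.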
In this age of VMs/containers for everything I'm not convinced a conventional OS offers a lot of value - OSes made sense when programs needed to access different kinds of hardware and we liked to have multiple processes/users sharing a single machine while broadly trusting each other, but neither of those things is really true any more. Look at unikernels for where I think the future is going - bootable VMs that act as a language runtime that controls things like threading directly, no need for an OS intermediary.
> it has to switch out the whole stack (4k or 8k) even when there are only a few bytes that belong to that particular task (because it has no way of knowing which bytes those are)
This reads to me like you believe the OS must memcpy/move the whole stack out of the way on a context switch. It doesn't; the other thread has its own dedicated memory for its stack, and on a context switch, the stack pointer is simply adjusted to point at the other stack.
Assuming Java's fibers work like other green-thread implementations, a green-thread/fiber switch is the same thing: just a pointer change, except the adjustment is done in userspace.
(And, to some degree, the OS does know which bytes are stack for any given task. It's whole pages, yes, but within those pages the program is left to manage things itself. I think most green-thread/userspace-thread implementations are similar: the stack is preallocated ahead of time and left to the thread to manage. Sure, you might not know the exact range in use, but you don't really need to. Go, I think, is an interesting outlier here; IIRC it dynamically grows and shrinks a goroutine's stack in response to the application's demands, though I believe they had some interesting issues with loops thrashing allocations when they fell along an allocation boundary. I think they've long since fixed that.)
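A toy model of the pointer-change point above (nothing like real scheduler code; the arrays and indices are made up for illustration): each task owns a preallocated stack, and a "switch" just changes which saved stack pointer is current; no bytes get copied anywhere.

```java
public class StackSwitch {
    public static void main(String[] args) {
        // Each task gets its own preallocated stack; a "context switch"
        // only changes which (stack, saved pointer) pair is current.
        long[][] stacks = new long[2][1024]; // two fiber stacks
        int[] sp = {10, 3};                  // saved stack pointers
        int current = 0;

        stacks[current][sp[current]] = 42;   // fiber 0 pushes a value

        current = 1;                         // "switch": O(1), no copying
        stacks[current][sp[current]] = 7;    // fiber 1 works on its own stack

        current = 0;                         // switch back; fiber 0's state intact
        System.out.println(stacks[current][sp[current]]);
    }
}
```

Whether the pointer swap happens in the kernel (threads) or in userspace (green threads), the stack memory itself stays put.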
Java's fibers actually work like the previous poster described, which may explain the confusion:
> The current prototype implements the mount/dismount operations by copying stack frames from the continuation stack – stored on the Java heap as two Java arrays, an Object array for the references on the stack and a primitive array for primitive values and metadata. Copying a frame from the thread stack (which we also call the vertical stack, or the v-stack) to the continuation stack (also, the horizontal stack, or the h-stack) is called freezing it, while copying a frame from the h-stack to the v-stack is called thawing. The prototype also optionally thaws just a small portion of the h-stack when mounting using an approach called lazy copy; see the JVMLS 2018 talk as well as the section on performance for more detail.
The video presentation describes it better. I think in their expected usage scenarios the stacks of fibers aren't particularly deep so the copying isn't that expensive. Also, IIRC, doing it this way was less intrusive to the existing JVM architecture; it's possible in time they'll rearchitect things to use a more traditional technique.
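A toy model of the freeze/thaw scheme the quote describes (this is just an illustration of the copying idea, not JVM code; the Frame shape and stack contents are made up): freezing copies frames off the thread's v-stack into heap-side structures, and thawing copies them back when the continuation is remounted.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FreezeThaw {
    // Toy stand-in for a stack frame: references and primitives
    // stored separately, as in the Loom prototype's two-array layout.
    static class Frame {
        final Object[] refs; final long[] prims;
        Frame(Object[] refs, long[] prims) { this.refs = refs; this.prims = prims; }
    }

    public static void main(String[] args) {
        Deque<Frame> vStack = new ArrayDeque<>(); // the "vertical" thread stack
        vStack.push(new Frame(new Object[]{"caller"}, new long[]{1}));
        vStack.push(new Frame(new Object[]{"callee"}, new long[]{2}));

        // Freeze: copy frames off the thread stack into a heap-side h-stack.
        Deque<Frame> hStack = new ArrayDeque<>();
        while (!vStack.isEmpty()) hStack.addLast(vStack.pop());

        // Thaw: copy frames back onto the thread stack when remounting.
        while (!hStack.isEmpty()) vStack.push(hStack.pollLast());

        System.out.println(vStack.peek().refs[0]); // topmost frame restored
    }
}
```

The cost here is proportional to the number of frames moved, which is why shallow fiber stacks (and the lazy-copy optimization mentioned in the quote) matter so much for performance.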