Unfortunately, crossing the user/kernel boundary costs something no matter how lightweight the "threads" are. Also, making the OS the scheduler for all async tasks would rule out alternative scheduler designs, since every runtime would have to use the OS's scheduler.
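To make the second point concrete, here is a minimal sketch of what "different scheduler designs" can look like when scheduling stays in user space. It uses Python generators as stand-ins for async tasks; the names `task`, `run_round_robin`, and `run_priority` are made up for illustration and do not come from any real runtime. It does not measure the cost of entering the kernel; it only shows that the same tasks can be driven by two different policies without involving the OS.

```python
from collections import deque
import heapq


def task(name, steps, log):
    """A cooperative task: record one unit of work, then yield control."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield


def run_round_robin(tasks):
    """FIFO policy: every runnable task gets one step per turn."""
    queue = deque(tasks)
    while queue:
        t = queue.popleft()
        try:
            next(t)  # resume the task until its next yield
        except StopIteration:
            continue  # task finished, do not requeue it
        queue.append(t)


def run_priority(prioritized):
    """Strict-priority policy: always resume the task with the lowest number.

    `prioritized` is a list of (priority, task) pairs. The sequence number
    breaks ties so the heap never has to compare generator objects.
    """
    heap = [(prio, seq, t) for seq, (prio, t) in enumerate(prioritized)]
    heapq.heapify(heap)
    while heap:
        prio, seq, t = heapq.heappop(heap)
        try:
            next(t)
        except StopIteration:
            continue
        heapq.heappush(heap, (prio, seq, t))


if __name__ == "__main__":
    rr_log = []
    run_round_robin([task("a", 2, rr_log), task("b", 2, rr_log)])
    print(rr_log)  # ['a:0', 'b:0', 'a:1', 'b:1']

    prio_log = []
    run_priority([(1, task("a", 2, prio_log)), (0, task("b", 2, prio_log))])
    print(prio_log)  # ['b:0', 'b:1', 'a:0', 'a:1']
```

The two runners interleave the same tasks differently, and switching between them is an ordinary function call. If the OS owned scheduling for every task, a runtime could not make that choice for itself.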