This argument is basically hand waving, and it's factually wrong.
Yes, switching from user to kernel mode and back is expensive. But let's count the syscalls.
Model 1: Threads. 1 blocking read(), 1 blocking write(). Plus cost for the OS scheduler.
Model 2: 1 "I want to read", 1 "can I read now?", 1 actual read, 1 "I want to write now", 1 "can I write now?", 1 actual write. Multiply as necessary if the read or write is only partial. Add IPC cost as necessary if you have more than one thread handling async events.
Measuring by the syscall overhead, blocking I/O is much more efficient.
There are other costs to it, though, so it usually consumes more resources per handled socket and turns out to be more expensive than async I/O. But it's not the syscall cost causing it.
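To make the count above concrete, here is a minimal Python sketch that echoes one message both ways and tallies the kernel-facing calls by hand. The function names are made up for illustration, the counts are annotations rather than measurements, and they assume an epoll/kqueue-style selector where registering interest is itself a syscall. One-time setup (creating the selector, switching the socket to non-blocking) is not counted.

```python
import selectors
import socket


def echo_blocking(conn):
    """Model 1: one blocking read, one blocking write."""
    calls = 0
    data = conn.recv(4096)   # blocks until data arrives
    calls += 1
    conn.send(data)          # small payload, assumed to go out in one call
    calls += 1
    return data, calls


def echo_readiness(conn):
    """Model 2: register interest, ask for readiness, then do the I/O."""
    sel = selectors.DefaultSelector()
    conn.setblocking(False)
    calls = 0
    sel.register(conn, selectors.EVENT_READ)   # "I want to read"
    calls += 1
    sel.select()                               # "can I read now?"
    calls += 1
    data = conn.recv(4096)                     # actual read
    calls += 1
    sel.modify(conn, selectors.EVENT_WRITE)    # "I want to write now"
    calls += 1
    sel.select()                               # "can I write now?"
    calls += 1
    conn.send(data)                            # actual write
    calls += 1
    sel.unregister(conn)
    sel.close()
    return data, calls


if __name__ == "__main__":
    for echo in (echo_blocking, echo_readiness):
        a, b = socket.socketpair()
        a.send(b"ping")
        _, calls = echo(b)
        print(echo.__name__, calls)
        a.close()
        b.close()
```

A real event loop amortizes the readiness calls across many sockets, so this only compares the per-operation call pattern for a single socket, which is the comparison the comment makes.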
Depends, given that operating systems like Solaris, macOS, and Windows also have userspace thread-like APIs (tasks, dispatch queues, Concurrency Runtime) that do the kernel/userspace M:N mapping in a language-agnostic way.
I don’t think we should parrot the syscalls-are-expensive line too much. The overhead of thread syscalls is smaller than you’d think, and for, say, TCP and file IO our buffers are so big that in practice we can keep the number of syscalls low enough to saturate hardware limits (like those of NICs and SSDs), no problem.
The issue has more to do with thread lifecycle management, i.e. creating the thread, maintaining its stack, and destroying it. When you have very lightweight tasks, those costs add up and have no direct mitigations. That’s a big reason we moved runtimes into userspace.
I wish I could see the perf chart of these. If anyone finds them please link. It’s important to have objective facts to base these discussions on.
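One rough way to see the lifecycle cost on your own machine is to run the same tiny tasks once with a fresh thread per task and once over a small reused pool. This is only a sketch: `tiny_task`, `thread_per_task`, `pooled`, and `timed` are hypothetical names, and in CPython the interpreter lock and the pool's own queueing overhead distort the picture, so treat any numbers as indicative at best.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def tiny_task(i):
    # Deliberately trivial, so thread setup/teardown dominates the cost.
    return i * 2


def thread_per_task(n):
    """Create, start, and join one OS thread per task."""
    results = [None] * n

    def run(i):
        results[i] = tiny_task(i)

    threads = []
    for i in range(n):
        t = threading.Thread(target=run, args=(i,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results


def pooled(n, workers=4):
    """Reuse a few long-lived threads for all tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tiny_task, range(n)))


def timed(fn, n):
    """Return (result, elapsed seconds) for fn(n)."""
    start = time.perf_counter()
    out = fn(n)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    for fn in (thread_per_task, pooled):
        _, elapsed = timed(fn, 2000)
        print(f"{fn.__name__}: {elapsed:.4f}s")
```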
Related ongoing thread:
Asynchronous IO: the next billion-dollar mistake?
The problem with this analysis is that single system calls aren't the only way to invoke the operating system. You can also use an io_uring-style mechanism to batch your system calls. Or BPF to run mini-programs in the context of the OS, which is also a sort of batching. These don't have the same problematic overhead as system calls.
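Python's standard library has no io_uring binding, so as a much simpler stand-in for the batching idea, the sketch below uses vectored I/O (`os.writev`, Unix only) to hand several buffers to the kernel in one call instead of one call per buffer. This is not io_uring: io_uring can additionally queue different kinds of operations and collect their completions asynchronously. The helper names are made up for illustration, and the call counts are tallied by hand.

```python
import os


def write_one_by_one(fd, chunks):
    """One write() call per chunk."""
    calls = 0
    for chunk in chunks:
        os.write(fd, chunk)
        calls += 1
    return calls


def write_batched(fd, chunks):
    """One writev() call for all chunks (assumes no partial write)."""
    os.writev(fd, chunks)
    return 1


if __name__ == "__main__":
    r, w = os.pipe()
    chunks = [b"GET / ", b"HTTP/1.1", b"\r\n"]
    print(write_one_by_one(w, chunks), os.read(r, 4096))
    print(write_batched(w, chunks), os.read(r, 4096))
    os.close(r)
    os.close(w)
```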
This is really just archaic *nix world problems; the NT kernel has had actual asynchronous I/O since inception -- even 30 years later, io_uring is mere batching to the kernel upper-half (and is a carbon-copy, down to the particular terminology, of RIO)
We took a shot at doing ultra-fast kernel threads on FreeBSD a few decades ago. For various reasons, it was reverted and removed a few major versions later.
If you look at the old KSE work, the general gist was that if you were about to block in a syscall then you'd effectively get a signal-style longjmp back to your userland thread scheduler. You'd pick another thread and continue running all in the same process/task context.
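As a toy model of that control flow (not of KSE itself), the sketch below uses Python generators as userland threads. A bare `yield` stands in for "about to block in a syscall"; control returns to a userland scheduler, which picks another thread and keeps running in the same process. The big difference is that here the thread yields explicitly, whereas in the scheme described above the kernel triggered the switch. The toy also puts the thread straight back on the ready queue instead of waiting for its I/O to complete. All names are hypothetical.

```python
from collections import deque


def worker(name, steps, log):
    """A userland thread: does a step, then hits a would-block point."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # stand-in for "about to block in a syscall"


def run_scheduler(threads):
    """Round-robin userland scheduler over generator-based threads."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)          # run until the next would-block point
        except StopIteration:
            continue              # thread finished, drop it
        ready.append(thread)      # toy: treat it as immediately runnable again


if __name__ == "__main__":
    log = []
    run_scheduler([worker("a", 2, log), worker("b", 2, log)])
    print(log)
```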
There were many problems with what we did and how we did it, but the unavoidable fundamental problem at the time was that it inverted assumptions about costs of low level primitives. Important(TM) software was optimized for the world where threads and blocking were expensive and things like pthread mutex operations were cheap. Our changes made threads and blocking trivially cheap but added non-trivial overhead to pthread mutex etc operations. Applications that made extensive use of pthread mutexes to coordinate work dispatching on a precious small pool of expensive threads were hit with devastating performance losses. Most critically, MySQL. We'd optimized for hundreds of thousands of threads rather than the case of multiplexing work over a few threads.
It became apparent that this was going to be an eternal uphill battle and we eventually pulled the plug to do it the same way as Linux. We made a lot of mistakes with all of this.