On Concurrency
As a backend engineer working primarily on distributed systems, ranging from ML infrastructure, data ingestion, and streaming pipelines to (these days) RAG systems, I have found that a thorough understanding of concurrency is one of the most important pieces of engineering knowledge. It sits alongside other concepts fundamental to the profession, like idempotency, ACID, or BASE. Those, however, are more often encountered as properties of systems an engineer uses (a database's ACID guarantees, or an event-driven architecture's promise of eventual consistency). That doesn't mean a conceptual understanding of them isn't helpful; it just doesn't necessarily show up in an engineer's day-to-day coding work. Concurrency is different. A solid grasp of concurrency allows one to understand and design for scale, avoid performance bottlenecks when writing services, and debug some of the trickiest bugs that can befall production systems. It is knowledge that clearly separates a solid senior engineer from a junior or mid-level one.
Shapes of Concurrency
Most engineers know about operating system processes and threads. They are probably the OG concurrency primitives, handled entirely by the operating system. Users call into the operating system through a system call (or some wrapper around it) and get back a handle to the object (a process or thread ID; from the perspective of Linux, the two are very similar objects). The management of these objects is fully in the hands of the operating system, including scheduling when they get to execute. Whenever a thread initiates blocking IO (or under various other internal OS conditions), the OS can swap the process or thread out and schedule another one instead. As a user, you have little say in when, or for how long, your thread or process gets to run.
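As a tiny illustration of the thread-per-task model, here is a minimal Python sketch (assuming CPython's threading module, which wraps native OS threads such as pthreads on Linux); time.sleep stands in for blocking IO:

```python
import threading
import time

def handle_request(request_id: int) -> None:
    # Simulate blocking IO (e.g. a network call); while this thread sleeps,
    # the OS scheduler is free to run the other threads.
    time.sleep(0.1)
    print(f"request {request_id} done on {threading.current_thread().name}")

# One OS thread per task: the kernel decides when each one actually runs.
threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```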
Many computer systems, early ones but also recent ones, have been built entirely on processes and threads. PostgreSQL, for example, is fully process-based: each new client connection gets a dedicated process. Consequently, you more often than not need connection pooling in front of a PostgreSQL instance (e.g., PgBouncer) to handle production-level traffic. Any system that spawns a process (or thread) per unit of such traffic will sooner or later hit performance issues related to memory pressure and context switching. Operating system designers have therefore spent decades building primitives that make client-server applications easier to scale.
In the case of Linux, these efforts bore fruit in the form of select, epoll, and the modern io_uring; in the case of Windows, it's IO completion ports. At a high level, these abstractions let you hand the OS a set of commands (e.g., send this over the network, or tell me when a response has arrived) and then wait for notifications related to those commands. They don't force developers into a one-thread-per-connection model to achieve concurrency; instead, all IO operations can be handled within a single thread.
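To make this concrete, here is a minimal sketch of a single-threaded echo server using Python's selectors module, which wraps epoll on Linux (kqueue on BSD/macOS); the port number and handler names are just placeholders:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS

def accept(server_sock: socket.socket) -> None:
    conn, _ = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, read)

def read(conn: socket.socket) -> None:
    data = conn.recv(4096)
    if data:
        conn.send(data)  # echo back (assuming the send buffer has room)
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("localhost", 8080))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

# One thread handles every connection: we tell the OS which sockets we care
# about, and it wakes us up only when one of them is ready.
while True:
    for key, _ in sel.select():
        callback = key.data
        callback(key.fileobj)
```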
Programming languages can build direct support for concurrent programming on top of these primitives (e.g., through async/await or similar language constructs), or developers can use libraries that make the OS-level primitives more convenient (e.g., Python's gevent or C++'s Boost.Asio). The underlying logic is very similar regardless of whether the language supports concurrency as a first-class construct. The general architecture of a library providing concurrency to its users is a) some kind of runtime (a.k.a. event loop) and b) some kind of data structure holding the state of each individually schedulable object (coroutines, goroutines, etc.).
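In asyncio terms (one possible mapping, used here purely as an illustration): the event loop started by asyncio.run() plays the role of the runtime, and each Task object is the data structure holding the state of a schedulable coroutine:

```python
import asyncio

async def fetch(name: str) -> str:
    # 'await' hands control back to the runtime (the event loop) until the
    # sleep -- standing in for a network call -- completes.
    await asyncio.sleep(0.1)
    return f"response for {name}"

async def main() -> None:
    # Each Task is the runtime's per-coroutine state object: it remembers
    # where the coroutine is suspended and what it is waiting for.
    tasks = [asyncio.create_task(fetch(n)) for n in ("a", "b", "c")]
    results = await asyncio.gather(*tasks)
    print(results)

# asyncio.run() starts the event loop (the runtime) and drives it to completion.
asyncio.run(main())
```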
Cooperative vs Preemptive Scheduling
Let's start with the former. Generally, when we talk about concurrency, we want to run multiple IO-bound tasks (like making a network request) so that they appear to run at the same time. Ideally, we want to swap a task out once it hits an IO block and return to it once the IO it is waiting on (like receiving a network response) has completed. We generally call these special functions coroutines. Whenever a coroutine is about to perform an IO-bound operation, the runtime is called: it first saves the current state of the coroutine and then checks whether there is another coroutine it can run instead (e.g., one whose IO has completed since it was last swapped out). If there is, it resumes that coroutine. If not, it simply waits for the earliest IO operation of any of its coroutines to finish.
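A toy scheduler makes this loop explicit. The sketch below uses plain Python generators, with a yielded sleep duration standing in for a pending IO operation: the runtime "saves" each coroutine simply by holding on to its generator object, resumes whichever one's IO has completed, and otherwise waits for the earliest one to finish.

```python
import heapq
import time
from typing import Generator

Coroutine = Generator[float, None, None]

def scheduler(coroutines: list[Coroutine]) -> None:
    # Ready queue keyed by the time each coroutine's "IO" finishes.
    ready: list[tuple[float, int, Coroutine]] = [
        (time.monotonic(), i, c) for i, c in enumerate(coroutines)
    ]
    heapq.heapify(ready)
    while ready:
        wake_at, i, coro = heapq.heappop(ready)
        # If no coroutine is runnable yet, wait for the earliest IO to finish.
        delay = wake_at - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        try:
            # Resume the coroutine; it runs until its next IO wait and yields
            # how long that wait will take (a stand-in for a real IO request).
            io_duration = next(coro)
            heapq.heappush(ready, (time.monotonic() + io_duration, i, coro))
        except StopIteration:
            pass  # coroutine finished

def worker(name: str) -> Coroutine:
    for step in range(2):
        print(f"{name}: step {step}")
        yield 0.1  # "block" on IO for 100 ms and hand control back

scheduler([worker("a"), worker("b")])
```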
This model of swapping work is called cooperative because each coroutine yields to the event loop once it hits an IO block. The flip side is that a coroutine that is not IO-bound and needs a lot of CPU time can keep running for a long time without ever being swapped out. This is probably one of the most common asynchronous programming problems people run into in certain languages (e.g., Python). With cooperative scheduling, we rely on the code (i.e., on us programmers) to properly separate IO-bound and CPU-bound work (into different threads or processes). If we don't, we risk blocking other operations for too long and causing downstream timeouts (server request timeouts, Kafka poll timeouts, etc.); a minimal example of this pitfall follows below.

Some languages with native support for the asynchronous model take a different approach. Go, for example, uses so-called preemptive scheduling. On startup, the Go runtime registers an OS-level interrupt (signal) handler, and the runtime arranges for that interrupt to be delivered once a goroutine has run past its time slice. Because the interrupt arrives outside the goroutine's normal flow of execution, the runtime can update its internal state, and once execution returns to a thread belonging to the running Go program, the current goroutine yields. That last part relies on a slight hack in the Go compiler: it inserts checks of a preemption flag at certain places in the code, such as loops and function calls. So what actually yields is still the goroutine itself, by checking the preemption flag that the runtime set after the OS interrupt (source).
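Here is the cooperative pitfall in a minimal asyncio sketch (names are made up for illustration): a coroutine that never awaits starves a periodic task for the entire two seconds it spends on the CPU.

```python
import asyncio
import time

async def heartbeat() -> None:
    # Should tick every 100 ms -- but only if nothing hogs the event loop.
    for _ in range(5):
        print("tick", time.monotonic())
        await asyncio.sleep(0.1)

async def cpu_heavy() -> None:
    # No await inside the loop, so this coroutine never yields: after the
    # first tick, the heartbeat is starved until the busy loop finishes.
    deadline = time.monotonic() + 2.0
    while time.monotonic() < deadline:
        pass

async def main() -> None:
    await asyncio.gather(heartbeat(), cpu_heavy())

asyncio.run(main())
```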
Stackless vs Stackful Coroutines
With a better understanding of how coroutines and event loops interact, we can examine how coroutines are represented under the hood. There is again a fundamental split between languages that support coroutines natively through some async/await construct and those that don't. In the former case, what is most often used are stackless coroutines; in the latter, stackful ones. Stackful coroutines are the lower-level (though still very complex) approach to storing a coroutine's state. As the name suggests, they store the state by saving the call stack itself, essentially keeping a shadow copy of the function parameters and any local variables. In addition, they store the relevant CPU registers (such as the current instruction pointer), which are restored whenever the coroutine is resumed. Stackful coroutines operate at the CPU level and have to follow architecture-specific conventions, so a separate implementation is needed for each processor architecture. Go's goroutines and Boost.Asio's coroutines are both stackful.
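In Python land, the closest thing to a stackful coroutine is a greenlet (the stack-switching library that gevent is built on): switching between greenlets saves and restores the C stack and registers, so execution can be suspended from arbitrarily deep call chains without any async/await annotations. A minimal ping-pong sketch, assuming the greenlet package is installed:

```python
from greenlet import greenlet  # pip install greenlet

def ping() -> None:
    print("ping")
    pong_glet.switch()   # suspend here: the whole stack and registers are saved
    print("ping again")
    pong_glet.switch()

def pong() -> None:
    print("pong")
    ping_glet.switch()   # resume ping exactly where it left off
    print("pong again")

ping_glet = greenlet(ping)
pong_glet = greenlet(pong)
ping_glet.switch()  # prints: ping, pong, ping again, pong again
```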
Stackless coroutines, on the other hand, don't replicate the stack in full. Instead, they rely on help from the compiler or the language runtime to keep the necessary coroutine state across resumptions. The compiler (or language) generates a heap-allocated, class-like structure that contains all the function parameters and local variables. This structure also defines a resume method that encapsulates the original function, but it doesn't simply call the original coroutine. The compiler splits the coroutine into branches separated by its yielding statements (await in Python, co_await in C++) and tags each branch with an ID. At every yield, the structure saves the compiler-generated ID of the branch it stopped at, and when the coroutine is later resumed, that ID is used to jump straight to the next branch. Python and C++20's async/await language additions both use this model of coroutine implementation (for a great look at the compiler-generated code, take a look at this article).
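The following hand-written Python class is a rough sketch of the kind of state machine such a transform produces (the coroutine, its awaits, and all names here are hypothetical; real compilers generate something considerably more involved):

```python
from typing import Optional

# Roughly what a compiler could desugar this stackless coroutine into:
#
#     async def fetch_twice(url):
#         a = await get(url)      # branch 0 ends here
#         b = await get(url)      # branch 1 ends here
#         return a + b            # branch 2
#
class FetchTwiceCoroutine:
    def __init__(self, url: str) -> None:
        # Heap-allocated frame: parameters, locals, and the branch ID.
        self.url = url
        self.a: Optional[str] = None
        self.b: Optional[str] = None
        self.state = 0

    def resume(self, io_result: Optional[str] = None) -> Optional[str]:
        # Jump straight to the branch after the last suspension point.
        if self.state == 0:
            self.state = 1
            return None  # suspend: "start the first get() and call me back"
        if self.state == 1:
            self.a = io_result
            self.state = 2
            return None  # suspend: "start the second get()"
        if self.state == 2:
            self.b = io_result
            return self.a + self.b  # done
        raise RuntimeError("resumed a finished coroutine")

coro = FetchTwiceCoroutine("https://example.com")
coro.resume()                           # runs branch 0, suspends at first await
coro.resume("first response ")          # runs branch 1, suspends at second await
print(coro.resume("second response"))   # runs branch 2 and returns the result
```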
End
The performance of our programs depends heavily on the architecture of the underlying concurrency framework, and an in-depth understanding of these models will never be wasted knowledge. With it, we can reason much better about how to separate IO-bound and CPU-bound tasks under cooperative scheduling, avoiding the common pitfall of blocking the event loop with compute-heavy operations. It also proves invaluable when integrating non-async libraries into asynchronous frameworks: knowing whether to offload blocking calls to thread pools, when to choose a different concurrency model entirely, or how to structure our code to keep the benefits of async execution. As distributed systems continue to grow in complexity and scale, this foundational knowledge of concurrency remains one of the most practical and enduring skills in a backend engineer's toolkit.
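As a closing example of that last point, here is one minimal sketch of offloading a blocking library call to a worker thread pool (the function names are hypothetical; asyncio.to_thread requires Python 3.9+), so the event loop stays responsive:

```python
import asyncio
import time

def blocking_call() -> str:
    # Stand-in for a non-async library call (sync HTTP client, DB driver, ...).
    time.sleep(1.0)
    return "result"

async def heartbeat() -> None:
    for _ in range(5):
        print("event loop still responsive")
        await asyncio.sleep(0.2)

async def main() -> None:
    # asyncio.to_thread runs the blocking call in a worker thread pool,
    # so the event loop keeps serving other coroutines in the meantime.
    result, _ = await asyncio.gather(asyncio.to_thread(blocking_call), heartbeat())
    print(result)

asyncio.run(main())
```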