Elixir's Concurrency Model Is the One You Actually Want

async/await and goroutines solve scheduling. The BEAM solves failure. Why most concurrency pain is actually failure-isolation pain — and only the actor model plus supervision trees fix it.

TL;DR: async/await and goroutines solve scheduling — how to interleave a lot of work on a few OS threads without blocking. The BEAM solves failure — what happens when one of those units of work blows up at 3am. Most of the "concurrency" pain backend devs feel is actually failure-isolation pain wearing a concurrency costume, and only the actor model plus supervision trees address it head-on. Elixir's model is the one you actually want; it's also the one with the smaller hiring pool, the Erlang-isms, and no business doing your matrix multiplication. Here's the honest version, with a working GenServer and Supervisor you can paste into a fresh mix project.

The 3am page async/await can't prevent

Here's a bug I've shipped, in some form, in three different languages.

A request handler does five things. It validates input, hits the database, calls a third-party API, transforms the result, and writes to a cache. The third-party API starts returning a malformed payload — not an HTTP error, just JSON with a field that's now null where it used to be a string. Your transform step does payload.token.toUpperCase(). It throws. The throw is unhandled in that code path because you wrote the happy path first and the deadline was Friday.

In Node, depending on where that ran, you either crash the process or — worse — you reject a promise nobody's awaiting and the runtime prints UnhandledPromiseRejection and, in modern Node, exits anyway. In Python with asyncio, an exception in a task that nobody awaits gets logged when the task is garbage-collected and silently swallowed until then. In Go, if that transform ran in a bare go func() with no recover(), the panic walks up that goroutine's stack and takes the entire process with it. One bad upstream payload, one unguarded line, whole service down. The pager goes off at 3am.

Notice what the bug isn't. It isn't a scheduling problem. async/await scheduled that work perfectly. Goroutines would have scheduled it perfectly. The event loop did its job. The bug is a failure-isolation problem: there was no boundary between "this one request's transform step exploded" and "the process serving every other request is now dead."

This is the thing I want to convince you of: the concurrency model you reach for should be judged less on how elegantly it schedules work and more on what it does when one unit of that work fails. By that measure, the models most backend devs use every day are weak, and the one they keep getting told to try — "just use Elixir" — is strong for reasons nobody bothers to explain past the slogan.

Let me explain past the slogan.

The three models, honestly

Strip the marketing off and there are three concurrency models a backend dev is likely to touch. They are not competing implementations of the same idea. They guarantee different things.

Threads and async (Node, Python, the JVM's default style). You have one or a few OS threads. You multiplex many logical tasks onto them using an event loop (libuv, asyncio) or a thread pool. Tasks share the same heap. The model's core guarantee is throughput: you can have ten thousand in-flight requests without ten thousand OS threads. What it does not give you is isolation. Every task lives in the same memory space and, in single-threaded runtimes, the same failure domain. An unhandled exception's blast radius is "whatever shares this thread/process," which in practice is everything. You bolt safety on afterward with try/catch discipline, framework-level error middleware, and a process supervisor like pm2 or systemd restarting the whole thing.

Goroutines and channels (Go). Go gives you cheap green-threaded units (goroutines) multiplexed by the runtime onto OS threads, plus channels for communicating between them. This is genuinely better ergonomics than callback or async-coloring soup — goroutines don't have a "color," and go foo() is about as low-friction as concurrency syntax gets. The guarantee is cheap concurrency with first-class communication. What Go does not give you is memory isolation or failure isolation. Goroutines share the same address space by design; the language even documents the data-race rules you must follow because shared memory is the default substrate. And an unrecovered panic is not goroutine-local — it terminates the program.

Processes and supervision (the BEAM: Erlang, Elixir). The BEAM gives you processes — not OS processes, not threads, but VM-level units that are extraordinarily cheap and, critically, share no memory. The Erlang docs are blunt about it: "Threads of execution in Erlang share no data, that is why they are called processes." (erlang.org, Concurrent Programming) They communicate only by copying messages between isolated mailboxes. The guarantee here is different in kind: isolation plus a structured story for failure. A process can crash without touching any other process's memory, because there is no shared memory to touch. And the platform ships a first-class abstraction — supervision trees — whose entire job is deciding what to do when one does crash.
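
Both guarantees fit in a few lines. This is my own toy sketch (the process names and messages are made up for illustration): one process owns a counter that can only be reached by message, a sibling process crashes, and the counter's state is untouched because the crash had no shared heap to damage. Save it as a `.exs` file and run it with `elixir`:

```elixir
# One process owns a counter. Its state lives on its own private heap
# and is reachable only via messages to its mailbox.
parent = self()

counter =
  spawn(fn ->
    loop = fn loop, n ->
      receive do
        :inc ->
          loop.(loop, n + 1)

        {:get, from} ->
          send(from, {:count, n})
          loop.(loop, n)
      end
    end

    loop.(loop, 0)
  end)

# A sibling process dies immediately. It is not linked to anyone, so the
# damage stops at its own heap: you get an error report, nothing more.
spawn(fn -> raise "malformed payload" end)

send(counter, :inc)
send(counter, {:get, parent})
{:count, n} = receive do msg -> msg end
IO.puts("counter survived the sibling crash, n = #{n}")
```

The crash prints an error report to stderr; the counter answers anyway. That asymmetry — a death that is loud but local — is the property everything else in this post builds on.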

The first two models optimize "run lots of things without blocking." The third optimizes "contain the damage when one of those things dies." Those are different problems, and most production incidents are the second one.

Where Go's model leaks

I'm picking on Go specifically because Go is the language people most often reach for when they've outgrown Node/Python concurrency and think they've solved the problem. Go's model is good. It is not the same kind of good.

Shared memory is the default, not the exception. The Go proverb is "Do not communicate by sharing memory; instead, share memory by communicating" (go.dev, Share Memory By Communicating). It's good advice precisely because the default is the opposite. Channels are opt-in; the shared heap is opt-out. The Go memory model exists to tell you the rules for the unsafe thing you can do by accident: it defines a data race as "a write to a memory location happening concurrently with another read or write to that same location" and is explicit that races are errors — its own summary of the philosophy is "Don't be clever." (go.dev, The Go Memory Model) A model whose spec needs a "don't be clever" section is a model where the foot-gun is loaded by default.

A panic in one goroutine takes down all of them. This is the big one and it surprises people coming from "concurrency means isolation." It does not, in Go. From the official Go blog: when a panic isn't recovered, "the process continues up the stack until all functions in the current goroutine have returned, at which point the program crashes." (go.dev, Defer, Panic, and Recover) And recover only works from a deferred function on the same goroutine that panicked — a sibling goroutine cannot catch it for you. So the moment you write go handleRequest(conn) and handleRequest panics on a nil dereference from that malformed upstream payload, every other in-flight request in that process dies with it. The mitigation is real but it's manual: you wrap every goroutine entry point in defer func(){ recover() }(). Forget one — in a library, in a callback, in code a teammate wrote on a Friday — and you're back to whole-process death. Isolation that depends on every author remembering a boilerplate incantation is not isolation; it's a convention.

Channel deadlocks are a class of bug, not an edge case. Unbuffered channel sends block until there's a receiver. A goroutine waiting to send on a channel nobody will ever receive from is stuck forever, holding whatever it holds. If every goroutine ends up blocked this way the Go runtime can detect total deadlock and crash with fatal error: all goroutines are asleep - deadlock! — but the far more common production case is a partial deadlock: a few goroutines wedged on channel operations while the rest of the program runs fine, leaking a little memory and one request's worth of progress every time it happens, invisible until you're staring at a slowly climbing goroutine count in production.

None of this makes Go bad. Go's model is a massive upgrade over callback-era Node for the scheduling problem. It just doesn't solve the failure-isolation problem, and it's frequently sold as if it does.

The BEAM bet

The BEAM makes a specific, opinionated bet: optimize the runtime for isolated failure and structured recovery, and accept the costs that come with it.

Processes are cheap and isolated. A freshly spawned BEAM process is small. The Erlang efficiency guide states the default initial heap is 233 words and notes this is "quite conservative to support Erlang systems with hundreds of thousands or even millions of processes"; the same guide's worked example shows a newly spawned process at 327 words total (erlang.org, Processes). A word is 8 bytes on a 64-bit VM, so we're talking single-digit kilobytes per process, growing on demand. The point isn't a brag number — it's that "spawn a dedicated process per request, per connection, per job" is a normal, expected thing to do, not a resource gamble. Each one has its own heap. A crash in one cannot corrupt another's state because there is no shared state to corrupt.

Preemptive scheduling, so one process can't starve the rest. Go's scheduler is good but its preemption story has historically had rough edges. The BEAM is preemptive by reduction counting: a process is given a fixed budget of "reductions" (roughly, function calls) and yielded when it's spent them. The budget is the CONTEXT_REDS constant — 4000 reductions — defined in the VM's erl_vm.h and documented in The BEAM Book (happi/theBeamBook, scheduling). Practically: one process running a tight loop cannot freeze the others. The scheduler will pull the rug at 4000 reductions whether the code cooperates or not. That property is why one slow request doesn't degrade the latency of the other ten thousand.
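
You can watch the preemption happen. In this sketch of mine, one process spins in an infinite tail-recursive loop that never voluntarily yields, and a sibling still gets scheduled and delivers its message — because the scheduler suspends the spinner when its reduction budget runs out. To make the test honest, pin the VM to a single scheduler with `elixir --erl "+S 1" demo.exs` so the sibling can't simply run on another core:

```elixir
# A process stuck in a hot loop that never yields on its own.
hot =
  spawn(fn ->
    spin = fn spin -> spin.(spin) end
    spin.(spin)
  end)

# A sibling still gets CPU time: the scheduler preempts the hot loop
# after its reduction budget is spent, cooperation or not.
me = self()
spawn(fn -> send(me, :scheduled_anyway) end)

msg =
  receive do
    m -> m
  after
    1_000 -> :starved
  end

Process.exit(hot, :kill)
IO.puts("sibling result: #{msg}")
```

Run the moral equivalent in a single-threaded cooperative event loop — a `while(true)` in a Node handler — and every other task starves until the loop ends.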

"Let it crash" and supervision trees. This is the philosophical core, and it's the opposite of defensive programming. Instead of wrapping every operation in error handling to keep a process limping along in a corrupted state, you let the process die cleanly at the first sign that its assumptions are violated — and you put a supervisor above it whose job is to restart it from a known-good initial state. The OTP design principles describe this directly: "The supervision tree is a hierarchical arrangement of code into supervisors and workers, which makes it possible to design and program fault-tolerant software." (erlang.org, OTP Design Principles) A supervisor "is responsible for starting, stopping, and monitoring its child processes," with restart strategies (:one_for_one, :one_for_all, :rest_for_one) that declare exactly how a sibling's death affects the others (erlang.org, Supervisor Principles).

Here's the thing that's hard to convey until you've run it: in this model, the 3am bug from the opening is not a page. The request process handling the malformed payload crashes. Its supervisor restarts a fresh worker. Every other request is untouched because it was a different process with a different heap. You get a log entry and a metric, not an outage. The failure didn't have to be anticipated, caught, and handled at the call site. It had to be contained, and containment is the runtime's job, not yours.

Here is a real, working example — a rate-limiter GenServer supervised by a Supervisor. Paste it into lib/ of a fresh mix new demo project and it runs.

defmodule Demo.RateLimiter do
  @moduledoc """
  A simple rate limiter as an isolated process: a bucket of tokens,
  refilled to full on a fixed interval.
  If its state ever becomes inconsistent it is allowed to crash;
  the supervisor restarts it from a clean bucket.
  """
  use GenServer

  # --- Client API ---

  def start_link(opts) do
    name = Keyword.get(opts, :name, __MODULE__)
    GenServer.start_link(__MODULE__, opts, name: name)
  end

  @doc "Returns :ok if a token was available, :rate_limited otherwise."
  def request(server \\ __MODULE__) do
    GenServer.call(server, :request)
  end

  # --- Server callbacks ---

  @impl true
  def init(opts) do
    max = Keyword.get(opts, :max_tokens, 5)
    refill_ms = Keyword.get(opts, :refill_ms, 1_000)
    :timer.send_interval(refill_ms, :refill)
    {:ok, %{tokens: max, max: max}}
  end

  @impl true
  def handle_call(:request, _from, %{tokens: tokens} = state) when tokens > 0 do
    {:reply, :ok, %{state | tokens: tokens - 1}}
  end

  @impl true
  def handle_call(:request, _from, state) do
    {:reply, :rate_limited, state}
  end

  @impl true
  def handle_info(:refill, %{max: max} = state) do
    {:noreply, %{state | tokens: max}}
  end
end

defmodule Demo.Application do
  @moduledoc false
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # If RateLimiter crashes, ONLY RateLimiter is restarted.
      {Demo.RateLimiter, name: Demo.RateLimiter, max_tokens: 5, refill_ms: 1_000}
    ]

    opts = [strategy: :one_for_one, name: Demo.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Wire Demo.Application into mix.exs with mod: {Demo.Application, []}, run iex -S mix, and try it:

iex> Demo.RateLimiter.request()
:ok
# ...call it past the bucket size...
iex> Demo.RateLimiter.request()
:rate_limited

# Now kill it on purpose and watch the supervisor heal it:
iex> Process.exit(Process.whereis(Demo.RateLimiter), :kill)
true
iex> Demo.RateLimiter.request()
:ok   # a brand-new process, fresh full bucket, no manual restart

That Process.exit(..., :kill) is the whole argument in three lines. You deliberately destroyed the process. You did not write a single line of recovery code. The :one_for_one supervisor noticed the child died and started a clean replacement, and the very next call succeeds against fresh state. GenServer is, in the official Elixir docs' words, "a behaviour module for implementing the server of a client-server relation" that plugs directly into supervision and standard error reporting (hexdocs.pm, GenServer). You write the state transitions; OTP writes the resilience.

That is the bet: a little ceremony (GenServer callbacks, child specs, supervision strategy) bought up front, in exchange for failure isolation being a structural property of the system rather than a discipline every author must remember.

The honest costs

If the BEAM model were free, everyone would use it. It isn't. A flagship post that doesn't say this is a brochure.

The hiring pool is genuinely smaller. This is the real one, and no amount of "but it's easy to learn" hand-waving makes it go away. You will interview fewer Elixir engineers than Go or Node engineers, full stop. You can mitigate it — the language is approachable and strong devs pick it up fast — but if your hiring strategy depends on a deep local market of people who already know the stack, that's a strike against, and pretending otherwise is dishonest.

Ecosystem gaps in specific corners. The web story (Phoenix, LiveView, Ecto) is excellent and competitive with anything. Outside that, you will hit libraries that are thinner than the Go or Python equivalent: certain cloud-vendor SDKs, some ML/data tooling, niche protocol clients. The usual escape hatch is a port/NIF or shelling out to another runtime, which is fine but it's work you wouldn't have on a more mainstream stack.

Erlang-isms leak through. Elixir is a lovely language, but it sits on a VM with 1980s telecom roots, and the substrate shows. Stack traces drop into Erlang term syntax. Tooling and observability docs are split across Elixir and Erlang/OTP. You will, eventually, read Erlang source to understand a library. That's a tax on every engineer, paid forever, not just at onboarding.

It is not a number-crunching runtime. The BEAM is optimized for massive concurrency and message passing, not raw CPU throughput on tight numeric loops. Heavy computation — image processing, large-matrix math, cryptographic grinding — is not what it's for, and naively doing it in pure Elixir is slow. The community answer is real (NIFs, Nx/EXLA for numerical work, offloading to native code), but the honest framing is: the runtime's strength is concurrency-and-failure, and you pay for that focus in compute-bound work.

When NOT to reach for Elixir

Decisions are made by knowing when not to use the thing.

  • CPU-bound batch work with little concurrency. A nightly job that does heavy math on one big dataset has no failure-isolation problem worth solving and will run faster in a runtime built for throughput. Wrong tool.
  • Small teams on a tight deadline who already know Go/Node/Python well. The right concurrency model in your hands beats the better one you're learning under deadline pressure. Familiarity is a real engineering input. Ship the thing.
  • You genuinely don't have a failure-isolation problem. A mostly-stateless service that fans out a few HTTP calls behind a load balancer that already restarts unhealthy instances has externalized the supervision problem to your orchestrator. The BEAM's biggest advantage is partly redundant there. It's still nice; it's not decisive.
  • Hard real-time or microsecond-latency systems. The BEAM's preemptive, garbage-collected scheduling is built for soft real-time (consistent low-ish latency under massive concurrency), not hard guarantees. If you need bounded microsecond worst-case, this is not your platform.

The pattern: reach for the BEAM when your dominant pain is many independent things that can each fail independently and must not take each other down. That's chat backends, telephony, IoT fleets, payment orchestration, real-time multiplayer, anything stateful-per-connection. When that's not your shape, the model's signature advantage is muted, and its costs are still fully priced in.

Verdict

The slogan "just use Elixir" is right for the wrong stated reason. People say it like it's about concurrency throughput. It isn't — Go and well-written async Node handle enormous concurrency. It's about what the runtime does at 3am when one unit of that concurrency hits a bug nobody anticipated. async/await and goroutines schedule beautifully and then make that failure your problem, manually, forever. The BEAM makes it the runtime's problem, structurally, by default.

Here's the decision table I actually use:

  • "I have lots of I/O-bound requests and don't want a thread per request" → threads/async (Node, Python, JVM). Scheduling is the whole problem; isolation isn't your bottleneck.
  • "I want cheap concurrency with clean communication and a big hiring pool" → goroutines + channels (Go). Excellent scheduling and ergonomics; accept manual failure isolation.
  • "Independent stateful things that must fail without taking each other down" → processes + supervision (Elixir/BEAM). Failure isolation is structural, not a convention you must remember.
  • "Heavy CPU-bound numeric work, low concurrency" → a throughput runtime (Go, Rust, native). None of the above models' isolation buys you anything here.
  • "Hard real-time, bounded worst-case latency" → a real-time runtime, not the BEAM. GC plus preemptive scheduling is soft real-time only.

If your honest answer to "what's my dominant pain" is the third row — and for a surprising number of stateful backends it is, people just file it under "concurrency" because that's the word they have — then yes. It's the model you actually want. Pay the hiring and Erlang-ism costs with your eyes open, keep it away from your matrix math, and let it crash.