This is the continuation of GenServer: the “rabbit hole”. If you haven’t read it yet, I recommend starting there: we’ll build on the same genserver_study project and the same concepts of mailbox, supervisor and max_restarts.

In the previous article we explored how a “misbehaving” GenServer can shut down the whole supervision tree when it exceeds the supervisor’s max_restarts. We showed how messages waiting in the mailbox are lost and how “Let it crash” isn’t exactly a blank check to let processes break randomly.

In that example, however, the error was born inside the GenServer itself: an explicit exception in handle_info/2. Real life is rarely that direct. Many of the Elixir pod restarts I’ve investigated in production share the same signature: a singleton worker that raised no exception in its own code simply disappears, taking its supervisor with it, and the BEAM exits with a non-zero exit code. You look at pod memory and it’s at ~5GB with a 24Gi limit. It’s not an OOM kill. It’s a supervisor cascade.

The question that remains: how does a worker that never crashes die?

The answer involves three concepts that the official docs tend to treat as details and that, together, form the biggest source of “ghost restarts” in Elixir apps:

  1. Process links propagate exits silently — and almost everything you use creates links under the hood.
  2. trap_exit turns exits into messages, but if you don’t have clauses for every variant, :function_clause will take you down.
  3. Validation left only to “the layer above”: known, predictable errors turn into runtime exceptions on the most critical path.

Let’s take it step by step 🐇.


When an Elixir process starts running, it can be connected to other processes by two different mechanisms: links and monitors. Both serve to “know that someone died”, but the difference is fundamental:

  • Link (Process.link/1, spawn_link/1, Task.async/1, Task.async_stream/3): bidirectional connection. If A is linked to B and B dies with a reason other than :normal, A also receives the exit signal and dies along with it (unless A has trap_exit enabled).
  • Monitor (Process.monitor/1, Task.Supervisor.async_nolink/3): unidirectional connection. A receives a {:DOWN, ref, :process, pid, reason} message when B dies, but A is not otherwise affected.

Why does this matter? Because a link gives you a very strong guarantee: if one side dies abnormally, the other dies too. In a supervision tree this is exactly what you want: the supervisor is linked to its children (and traps exits), so it knows when one fell and can decide what to do (restart, escalate, terminate). But in a regular worker that should not die alongside its helpers, the silent link is a trap.
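A minimal sketch of the difference, meant to be run as a standalone script (for example with elixir link_vs_monitor.exs). The file name and variable names are illustrative and not part of genserver_study. The link experiment runs inside a throwaway observer process so the script itself survives:

```elixir
# Monitor: the watcher only receives a :DOWN message and keeps running.
{pid, ref} = spawn_monitor(fn -> exit(:boom) end)

monitor_result =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> {:down, reason}
  after
    1_000 -> :timeout
  end

# Link: the observer links to a process that exits abnormally.
# The observer does not trap exits, so the exit signal kills it too.
parent = self()

{observer, observer_ref} =
  spawn_monitor(fn ->
    spawn_link(fn -> exit(:boom) end)
    Process.sleep(500)
    # Only reached if the link did NOT take the observer down.
    send(parent, :observer_survived)
  end)

link_result =
  receive do
    {:DOWN, ^observer_ref, :process, ^observer, reason} -> {:observer_died, reason}
    :observer_survived -> :observer_survived
  after
    1_000 -> :timeout
  end

IO.inspect(monitor_result, label: "monitor")
# monitor: {:down, :boom}
IO.inspect(link_result, label: "link")
# link: {:observer_died, :boom}
```

The observer never ran a failing line of its own; it died only because of the link, which is the same shape as the ghost restart described in this article.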

And worse: most of the “ergonomic” spawn functions in Elixir create links by default. Let’s look at one of the most common: Task.async_stream/3.


From the official doc:

Each element of enumerable will be prepended to the given args and processed by its own task. Those tasks will be linked to an intermediate process that is then linked to the caller process.

If you find yourself trapping exits to ensure errors in the tasks do not terminate the caller process, consider using Task.Supervisor.async_stream_nolink/6 to start tasks that are not linked to the caller process.

In other words: if any of the Tasks crashes with an unhandled exception, the process that called Task.async_stream/3 dies with it. The doc itself warns about this, but it’s easy to miss when you’re just doing a simple parallel IO fan-out.

Let’s reproduce this in our genserver_study project. Create a new BatchServer module:

defmodule BatchServer do
  use GenServer

  @prefix "[BatchServer]"
  @interval 1_500

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(_) do
    IO.puts("#{@prefix} Initializing GenServer")
    schedule_next_run()
    {:ok, %{batches: 0}}
  end

  @impl true
  def handle_info(:run_batch, %{batches: n} = state) do
    IO.puts("#{@prefix} Starting batch #{n + 1} (PID #{inspect(self())})")

    1..5
    |> Task.async_stream(&process_job/1,
      ordered: false,
      max_concurrency: 3,
      timeout: 1_000
    )
    |> Stream.run()

    IO.puts("#{@prefix} Batch #{n + 1} finished")
    schedule_next_run()
    {:noreply, %{state | batches: n + 1}}
  end

  defp process_job(id) do
    IO.puts("#{@prefix} Processing job #{id}")
    :timer.sleep(100)

    # Job 3 simulates an unexpected external failure
    # (imagine: external API returned invalid body, payload > 255 chars, etc.)
    if id == 3 do
      raise "external API returned unexpected error"
    end

    IO.puts("#{@prefix} Job #{id} done")
  end

  defp schedule_next_run, do: Process.send_after(self(), :run_batch, @interval)
end

Add it to the supervision tree, replacing the previous modules:

def start(_type, _args) do
  children = [
    BatchServer
  ]

  opts = [strategy: :one_for_one, name: GenserverStudy.Supervisor]
  Supervisor.start_link(children, opts)
end

Start the app:

iex -S mix

[BatchServer] Initializing GenServer
[BatchServer] Starting batch 1 (PID #PID<0.156.0>)
[BatchServer] Processing job 1
[BatchServer] Processing job 2
[BatchServer] Processing job 3

18:42:11.301 [error] Task #PID<0.160.0> started from BatchServer terminating
** (RuntimeError) external API returned unexpected error
    (genserver_study 0.1.0) lib/genserver_study/batch_server.ex:34: BatchServer.process_job/1
    ...
Function: &BatchServer.process_job/1
    Args: [3]

18:42:11.302 [error] GenServer BatchServer terminating
** (EXIT from #PID<0.156.0>) shell process exited with reason:
   an exception was raised:
    ** (RuntimeError) external API returned unexpected error
        (genserver_study 0.1.0) lib/genserver_study/batch_server.ex:34: BatchServer.process_job/1

[BatchServer] Initializing GenServer
[BatchServer] Starting batch 1 (PID #PID<0.166.0>)
...

Notice the two lines in sequence:

  1. Task #PID<0.160.0> … terminating — one of the Tasks raises RuntimeError.
  2. GenServer BatchServer terminating — and right after the GenServer dies with it, without having executed a single line of its own code that could have failed.

BatchServer never crashed on its own. The Task was linked to an intermediate process that was linked to BatchServer. When the Task raised, the exit signal went up the links and killed the GenServer. If this loop happens fast enough, the supervisor blows max_restarts and the whole application exits — exactly the scenario of the real production incident that motivated this article.
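To see the last step of that cascade in isolation, here is a small sketch, assuming a hypothetical AlwaysCrashes worker that is not part of genserver_study. It crashes shortly after every start, and the script traps exits only so it can observe how its supervisor ends:

```elixir
defmodule AlwaysCrashes do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(_) do
    # Crash 10 ms after every (re)start.
    Process.send_after(self(), :crash, 10)
    {:ok, nil}
  end

  @impl true
  def handle_info(:crash, state), do: {:stop, :boom, state}
end

# Trap exits in the script process only to observe the supervisor's fate.
Process.flag(:trap_exit, true)

{:ok, sup} =
  Supervisor.start_link([AlwaysCrashes],
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )

supervisor_exit =
  receive do
    {:EXIT, ^sup, reason} -> reason
  after
    2_000 -> :still_running
  end

IO.inspect(supervisor_exit, label: "supervisor exited with")
# supervisor exited with: :shutdown
```

The worker is restarted three times; the fourth crash inside the 5-second window exceeds max_restarts and the supervisor itself exits with :shutdown. If that supervisor is the application's top-level one, the application goes down with it.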

💡 Subtle detail: the error was in a Task. The log shows the Task’s stack trace. When you’re investigating a ghost restart and search for the “first error of GenServer X” in the logs, you won’t find anything related to the GenServer’s code. The root cause is in another process that was linked to it. Search for the GenServer name as “started from”, not as “GenServer terminating”.


The most direct fix is to decouple the tasks from the worker that dispatches them. The Task.Supervisor.async_stream_nolink/6 function does exactly that: Tasks are linked to a Task.Supervisor (not to your GenServer), and the stream emits {:ok, value} or {:exit, reason} tuples for each element — you decide what to do with each.

Let’s add a Task.Supervisor to the tree and refactor BatchServer:

def start(_type, _args) do
  children = [
    {Task.Supervisor, name: BatchServer.TaskSupervisor},
    BatchServer
  ]

  opts = [strategy: :one_for_one, name: GenserverStudy.Supervisor]
  Supervisor.start_link(children, opts)
end

And in BatchServer, replace Task.async_stream/3 with:

@impl true
def handle_info(:run_batch, %{batches: n} = state) do
  IO.puts("#{@prefix} Starting batch #{n + 1} (PID #{inspect(self())})")

  BatchServer.TaskSupervisor
  |> Task.Supervisor.async_stream_nolink(1..5, &process_job/1,
    ordered: false,
    max_concurrency: 3,
    timeout: 1_000,
    on_timeout: :kill_task
  )
  |> Enum.each(fn
    {:ok, _value} -> :ok
    {:exit, reason} -> IO.puts("#{@prefix} ⚠️  job failed: #{inspect(reason)}")
  end)

  IO.puts("#{@prefix} Batch #{n + 1} finished")
  schedule_next_run()
  {:noreply, %{state | batches: n + 1}}
end

Notice two important changes besides swapping the function:

  1. on_timeout: :kill_task — without this option, a timeout in async_stream_nolink still takes down the caller (on_timeout: :exit is the default!). With :kill_task, the timeout becomes another {:exit, :timeout} in the stream.
  2. Enum.each with pattern matching for :ok and :exit — now every failure is a tuple you inspect, log, send metrics for, etc. A Task failure no longer takes down the GenServer.
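The two points above can be checked in a standalone script. This is a sketch with made-up job names (:fast, :slow, :crash) and an anonymous Task.Supervisor, not the BatchServer code itself:

```elixir
{:ok, sup} = Task.Supervisor.start_link()

results =
  sup
  |> Task.Supervisor.async_stream_nolink(
    [:fast, :slow, :crash],
    fn
      :fast -> :done
      :slow -> Process.sleep(500)
      :crash -> raise "boom"
    end,
    timeout: 100,
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, value} -> {:ok, value}
    # A task killed by on_timeout: :kill_task shows up as a plain value.
    {:exit, :timeout} -> :timed_out
    # A raised exception arrives as {exception, stacktrace}.
    {:exit, {%RuntimeError{message: message}, _stacktrace}} -> {:crashed, message}
    {:exit, other} -> {:exit, other}
  end)

IO.inspect(results)
# [{:ok, :done}, :timed_out, {:crashed, "boom"}]
```

The calling process reaches IO.inspect/1 even though one task timed out and another raised; the Logger still prints the crashed Task's stack trace, which is expected.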

Starting the app again:

iex -S mix

[BatchServer] Initializing GenServer
[BatchServer] Starting batch 1 (PID #PID<0.158.0>)
[BatchServer] Processing job 1
[BatchServer] Processing job 2
[BatchServer] Processing job 3

19:01:44.512 [error] Task #PID<0.162.0> started from #PID<0.157.0> terminating
** (RuntimeError) external API returned unexpected error
...

[BatchServer] Processing job 4
[BatchServer] Processing job 5
[BatchServer] Job 4 done
[BatchServer] Job 5 done
[BatchServer] ⚠️  job failed: {%RuntimeError{message: "external API returned unexpected error"}, [...]}
[BatchServer] Batch 1 finished
[BatchServer] Starting batch 2 (PID #PID<0.158.0>)
...

The BatchServer PID is now stable between batches (#PID<0.158.0>). The failed Task is logged as “started from #PID<0.157.0>” — that’s the Task.Supervisor, not our GenServer.

🎯 Practical rule: if a GenServer does parallel fan-out of work whose scope is “this iteration” (not the lifetime of the server), use Task.Supervisor.async_stream_nolink/6. Use Task.async_stream/3 only when you want the error to take down the caller (rare scenarios, usually in migration scripts or batch jobs running isolated on a dedicated BEAM).


Trap exit: the temptation of “just catch it all”

The other option to contain exits is the Process.flag(:trap_exit, true) we mentioned above. From the official doc:

If pid is trapping exits, the exit signal is transformed into a message {:EXIT, from, reason} and delivered to the message queue of pid.
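The quoted behaviour can be observed in a few lines. This is a sketch for a throwaway script process; in a GenServer the same message would arrive in handle_info/2:

```elixir
# Trapping exits changes how this process reacts to every link it holds.
Process.flag(:trap_exit, true)

pid = spawn_link(fn -> exit(:boom) end)

trapped =
  receive do
    {:EXIT, ^pid, reason} -> {:trapped, reason}
  after
    1_000 -> :timeout
  end

IO.inspect(trapped)
# {:trapped, :boom}
```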

Sounds like the silver bullet: enable trap_exit in your GenServer and exits from linked processes turn into normal messages you handle in handle_info/2. It doesn’t die anymore.

It’s not that simple. The Highlander library (a singleton manager for Elixir clusters) has a historical case that illustrates the problem well.

Highlander enables trap_exit to detect name conflicts ({:EXIT, _, :name_conflict} in netsplit cases). The relevant code in version 0.2.1:

@impl true
def handle_info({:DOWN, ref, :process, _, _}, %{ref: ref} = state) do
  {:noreply, register(state)}
end

def handle_info({:EXIT, _pid, :name_conflict}, %{pid: pid} = state) do
  :ok = Supervisor.stop(pid, :shutdown)
  {:stop, {:shutdown, :name_conflict}, Map.delete(state, :pid)}
end

# ⚠️ And that's it. No clause for {:EXIT, _, :shutdown}, {:EXIT, _, _}, or _msg.

Result: when Highlander’s internal supervisor itself blew max_restarts (cascading from an upstream bug), it terminated with reason :shutdown. That :shutdown arrived at Highlander’s GenServer as {:EXIT, pid, :shutdown}. There was no clause for this message, so handle_info/2 failed with :function_clause → the whole singleton fell.
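The same failure mode can be reproduced with a few lines. This is not Highlander’s code: FragileManager and the :start_helper cast are hypothetical names used only for illustration. The point is that defining any handle_info/2 clause replaces the default one injected by use GenServer, so an unmatched {:EXIT, _, _} crashes the server:

```elixir
defmodule FragileManager do
  use GenServer

  @impl true
  def init(_) do
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  @impl true
  def handle_cast(:start_helper, state) do
    # A linked helper that terminates "cleanly" with :shutdown.
    spawn_link(fn -> exit(:shutdown) end)
    {:noreply, state}
  end

  # The only :EXIT variant handled, mirroring the shape of the bug.
  @impl true
  def handle_info({:EXIT, _pid, :name_conflict}, state), do: {:noreply, state}
end

# start/3 (not start_link/3) so the crash does not take the script down.
{:ok, pid} = GenServer.start(FragileManager, nil)
ref = Process.monitor(pid)
GenServer.cast(pid, :start_helper)

down_reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(match?({:function_clause, _stacktrace}, down_reason))
# true
```

A helper that shut down cleanly was enough to kill a manager whose own logic had no bug.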

If you enable trap_exit, you are responsible for handling all possible variants of {:EXIT, _, _}. And further: ideally, any critical GenServer should have a handle_info catch-all as defense-in-depth against unknown messages (from libraries, orphan timers, old monitors):

defmodule SingletonServer do
  use GenServer

  @prefix "[SingletonServer]"

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(_) do
    Process.flag(:trap_exit, true)
    IO.puts("#{@prefix} Initializing (PID #{inspect(self())})")
    {:ok, %{children: []}}
  end

  # Specific handling that matters for the logic
  @impl true
  def handle_info({:EXIT, _pid, :normal}, state), do: {:noreply, state}

  def handle_info({:EXIT, pid, :shutdown}, state) do
    IO.puts("#{@prefix} child #{inspect(pid)} shut down cleanly")
    {:noreply, state}
  end

  def handle_info({:EXIT, pid, reason}, state) do
    IO.puts("#{@prefix} ⚠️  child #{inspect(pid)} died with #{inspect(reason)}")
    # Decide here: re-spawn? metric? alert? but DON'T crash.
    {:noreply, state}
  end

  # Defense-in-depth — any other unknown message becomes log + continue
  def handle_info(msg, state) do
    IO.puts("#{@prefix} ⚠️  unknown message: #{inspect(msg)}")
    {:noreply, state}
  end
end

⚠️ Note the order: more specific clauses first, catch-all (handle_info(msg, state)) last. Function clauses are tried top to bottom, in the order they appear in the module, and the first match wins.
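Clause ordering can be checked without a GenServer at all. A small sketch with a hypothetical ClauseOrder.classify/1 function that mirrors the handle_info/2 clauses above:

```elixir
defmodule ClauseOrder do
  # Most specific patterns first...
  def classify({:EXIT, _pid, :normal}), do: :ignore
  def classify({:EXIT, _pid, :shutdown}), do: :clean_shutdown
  # ...then the generic :EXIT clause...
  def classify({:EXIT, _pid, _reason}), do: :abnormal
  # ...and the catch-all last.
  def classify(_other), do: :unknown
end

IO.inspect(ClauseOrder.classify({:EXIT, self(), :normal}))
# :ignore
IO.inspect(ClauseOrder.classify({:EXIT, self(), :shutdown}))
# :clean_shutdown
IO.inspect(ClauseOrder.classify({:EXIT, self(), :weird_reason}))
# :abnormal
IO.inspect(ClauseOrder.classify(:orphan_timer))
# :unknown
```

If the catch-all were moved to the top, every message would match it and the specific clauses would never run; the compiler warns about that, but only if someone reads the warnings.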

When NOT to enable trap_exit

There’s another trap here. trap_exit changes your process lifecycle in an important way: it also intercepts the shutdown signal coming from the parent supervisor. In a GenServer that signal does not reach handle_info/2; it triggers terminate/2 instead. That means if you enable trap_exit, you must implement terminate/2 correctly, and the parent supervisor will wait for the child’s :shutdown timeout (5 seconds by default for workers) before killing it with :kill. This can delay rolling updates and graceful shutdowns if you’re not careful.
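A sketch of that lifecycle, assuming a hypothetical CleanupWorker that reports its cleanup to whoever started the script. The shutdown: 10_000 passed to use GenServer only illustrates where the timeout is configured; the value itself is arbitrary:

```elixir
defmodule CleanupWorker do
  # :shutdown is how long the supervisor waits for terminate/2 to finish.
  use GenServer, shutdown: 10_000

  def start_link(notify), do: GenServer.start_link(__MODULE__, notify)

  @impl true
  def init(notify) do
    # Without this flag, the supervisor's :shutdown signal would kill the
    # process immediately and terminate/2 would not run.
    Process.flag(:trap_exit, true)
    {:ok, notify}
  end

  @impl true
  def terminate(reason, notify) do
    # Stand-in for real cleanup (flush buffers, release locks, ...).
    send(notify, {:cleaned_up, reason})
    :ok
  end
end

{:ok, sup} = Supervisor.start_link([{CleanupWorker, self()}], strategy: :one_for_one)
:ok = Supervisor.stop(sup)

cleanup =
  receive do
    {:cleaned_up, reason} -> reason
  after
    1_000 -> :no_cleanup
  end

IO.inspect(cleanup)
# :shutdown
```

If terminate/2 blocked for longer than the :shutdown value, the supervisor would stop waiting and kill the worker, which is the delay risk mentioned above.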

Practical summary:

| Scenario | Use |
| --- | --- |
| Regular worker that dispatches parallel tasks | Don’t enable trap_exit. Use Task.Supervisor.async_stream_nolink. |
| GenServer that needs to do orderly cleanup of resources on shutdown | Enable trap_exit and implement terminate/2. |
| Process manager (singleton, registry, pool) that needs to react to child deaths | Enable trap_exit and handle all variants of {:EXIT, _, _}. |

Let’s see the catch-all in action

Go back to the project and create the SingletonServer above. Add it to the supervision tree along with a “sibling” GenServer that will die on purpose:

defmodule NoisyChild do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{})

  @impl true
  def init(_) do
    parent = Process.whereis(SingletonServer)
    Process.link(parent)   # manually creates a link with SingletonServer
    Process.send_after(self(), :die, 200)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:die, state) do
    {:stop, :weird_reason, state}
  end
end

And the supervision tree:

def start(_type, _args) do
  children = [
    SingletonServer,
    NoisyChild
  ]

  opts = [strategy: :one_for_one, name: GenserverStudy.Supervisor]
  Supervisor.start_link(children, opts)
end

Start it and watch what happens:

iex -S mix

[SingletonServer] Initializing (PID #PID<0.156.0>)
[SingletonServer] ⚠️  child #PID<0.157.0> died with :weird_reason
[SingletonServer] ⚠️  child #PID<0.159.0> died with :weird_reason
[SingletonServer] ⚠️  child #PID<0.161.0> died with :weird_reason
[SingletonServer] ⚠️  child #PID<0.163.0> died with :weird_reason
[error] GenServer #PID<0.157.0> terminating
** (stop) :weird_reason
...

# And the supervisor shuts down the app due to NoisyChild's max_restarts
# BUT SingletonServer never crashed

SingletonServer saw every death of NoisyChild, logged it, and kept running. Without the catch-all and without the {:EXIT, _, _} clause, SingletonServer would have fallen with the first death of its sibling, with a :function_clause. That’s exactly the pattern Highlander 0.2.1 failed to protect against.


The root cause no one wants to admit: validation at the source

Everything we’ve discussed so far was about containing the blast radius when an error happens. But there’s a layer that comes before: very often, the error should never have become an exception in the first place.

In the real production incident that motivated this article, the sequence was:

  1. External API returned a large error payload.
  2. The code did "#{inspect(reason)}" to save it into an error VARCHAR(255) column.
  3. The string went past 255 chars → Postgrex.Error 22001 (string_data_right_truncation) at Repo.update/2.
  4. Because the Task was linked (see above), the exit cascaded.

Notice that steps 2 and 3 are predictable, known errors. All that needed to exist was:

# In the changeset
field
|> cast(attrs, [:error, :failed_at])
|> validate_length(:error, max: 255)

# OR truncate before
error_str = reason |> inspect() |> String.slice(0, 240)

And the whole problem would have turned into an {:error, changeset} that the caller handles normally — no exit, no cascade, no restart.
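The same idea can be sketched in plain Elixir, without Ecto. ErrorField, truncate/1 and validate/1 are hypothetical names, and the 255 limit is an assumption taken from the VARCHAR(255) column described above:

```elixir
defmodule ErrorField do
  # Assumed column limit, matching the VARCHAR(255) column from the incident.
  # String.length/1 counts graphemes; adjust if your database counts differently.
  @max_length 255

  # Option 1: truncate, so the stored value always fits.
  def truncate(reason) do
    reason |> inspect() |> String.slice(0, @max_length)
  end

  # Option 2: validate and return a value the caller can pattern match on.
  def validate(text) when is_binary(text) do
    if String.length(text) <= @max_length do
      {:ok, text}
    else
      {:error, :too_long}
    end
  end
end

# A stand-in for the large error payload returned by the external API.
huge_reason = {:api_error, String.duplicate("x", 1_000)}

IO.inspect(String.length(ErrorField.truncate(huge_reason)))
# 255
IO.inspect(ErrorField.validate(inspect(huge_reason)))
# {:error, :too_long}
```

Either way the oversized payload is dealt with before it reaches the database, so the failure stays a value in the normal flow instead of an exception inside a linked Task.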

There is one golden rule for supervised processes in Elixir:

💎 Exceptions are for the unexpected. Known errors should be values.

Anything you can foresee — field size limits, invalid formats, external API unavailability, timeouts — has to turn into {:error, reason} and be handled in the normal flow. Exceptions and Process.exit/2 are reserved for what you genuinely didn’t expect.

This connects with “Let it crash” from the previous article: the philosophy works because you’re handling the known errors. When an exception escapes, it represents a truly unexpected state, and then it makes sense to let the supervisor restart and try again. If you let known errors become exceptions, the restart becomes a retry with no gain — the next request hits the same bug, generates the same exception, and the only thing that changes is the max_restarts counter going up.


Defense in layers — the checklist

Every time you design (or review) a GenServer that does external work, run through this checklist:

Layer 1 — The operation itself

  • Predictable errors (schema validations, API responses, expected timeouts) become {:error, reason}, not exceptions?
  • External inputs going into database columns are truncated or validated before Repo.update/insert?
  • Calls to external APIs have an explicit timeout and handling for returned {:error, ...}?

Layer 2 — Tasks and spawns

  • Every Task.async_stream/3 call has been evaluated: should it really take down the caller on failure?
  • Long-running workers use Task.Supervisor.async_stream_nolink/6 with on_timeout: :kill_task?
  • The stream result is consumed with pattern matching for {:ok, _} and {:exit, _}?

Layer 3 — The GenServer itself

  • Is there a handle_info(msg, state) catch-all as defense against unknown messages?
  • If you enable trap_exit, do you have clauses for {:EXIT, _, :normal}, {:EXIT, _, :shutdown} AND {:EXIT, _, reason}?
  • Is terminate/2 implemented and not raising exceptions (it runs on the death path and any error there becomes an ugly log)?

Layer 4 — The supervisor

  • Is the child’s restart policy correct (:permanent, :transient, :temporary)?
  • Is max_restarts at a value that makes sense for the expected failure rate (and is not masking a bug)?
  • Are children grouped in supervision subtrees thinking about blast radius — a noisy worker can’t take down its cache, its connection pool and its Phoenix together?

Layer 5 — Observability

  • Is there an alert for “exception rate per minute”?
  • Is there an alert for “restart count per pod”?
  • Does the metrics panel distinguish in-place restart (BEAM exited) from pod replacement (rolling update)?

Summary: the mature “Let it crash”

“Let it crash” isn’t “let it break and the supervisor fixes it”. It’s a contract with several parts:

  1. You handled known errors as values in the normal flow.
  2. You isolated unpredictable failure sources (Tasks, external APIs, helper processes) with nolink or supervised subtrees.
  3. You defended your critical GenServers with a handle_info catch-all and complete clauses for {:EXIT, _, _} when relevant.
  4. You configured the supervisor with a restart policy that matches the nature of the work.
  5. You have observability to notice when one of these assumptions broke.

When one layer fails, it’s just one layer. The blast radius stays contained. When all of them fail — classic production incident — you see the dramatic version: a string truncation taking down the entire BEAM across five exit propagation hops.

The good news is that each layer costs little when you know it exists. The bad news is that finding out one was missing usually costs a post-mortem.


References