GenServer — the "rabbit hole"

When we explore Elixir’s strengths as a programming language, two stand out compared to other languages on the market:

🚀 3. Concurrency and Scalability
Uses actors (processes) for concurrency (each with its own memory and message queue).
Millions of processes can run concurrently with low overhead.
Built-in tools for distribution across multiple nodes.
🛠 4. Fault-Tolerance and Supervision Trees
“Let it crash” philosophy: Failures are expected and isolated.
Supervision trees: Automatically restart failing processes, making systems self-healing and resilient.

GenServer is an abstraction over the Erlang VM’s process model, shipped in Elixir’s standard library, that plugs directly into Elixir’s concurrency and supervision tree features. Because it’s a common pillar in the architecture of Elixir systems, and because it’s covered “prematurely” in the official docs (Client-server communication, OTP Concurrency), newcomers tend to model their first Elixir solutions around the GenServer module. It takes a while before the official docs get to their first disclaimer for the unaware: When (not) to use a GenServer.

Beyond the warning about not using GenServers as a code organization tool, there’s another warning I consider essential and don’t find in the official doc (or at least I haven’t reached that part yet 😅): avoid using GenServer when it isn’t strictly necessary (and/or until the developer is familiar with all the specifics of the Erlang process model).

Although at first glance the GenServer API looks pretty simple (at least for someone already familiar with the language syntax), the full lifecycle of a process in Elixir has characteristics that aren’t intuitive and are often unknown. That’s why code that goes beyond the simple counters in the docs starts generating errors and unpredictable behavior in production: problems that are hard to debug and understand, especially for those still new to Elixir.

I wanted to share some important points to keep in mind when modeling your solution with this abstraction.

Let it crash (but not really 🫠)

This mantra gets repeated everywhere when talking about the language’s capabilities. It assumes that all processes in Elixir are isolated, errors that happen in one process don’t interfere with the execution of the others, and if the failing process is important, the Supervisor will take care of restarting it and everything will be fine. Will it really?

Processes are indeed isolated, and an error in one process won’t affect the others (unless the developer explicitly makes it so, for example by linking them). And yes, the Supervisor will restart processes in its supervision tree if they crash. What not everyone knows, however, is that there’s a cap on restart attempts within a time window (Supervisor strategies and options). With the default supervisor configuration (max_restarts: 3, max_seconds: 5), if a child process has to be restarted more than 3 times within 5 seconds, the supervisor itself shuts down, and with it its entire supervision tree. That means a “misbehaving” GenServer under the same supervisor as your Phoenix endpoint and Ecto repo can shut down your whole application. Well, looks like a problematic process can affect other processes after all 🙃.

Let’s see how this works in practice (I recommend a first read of the GenServer doc since we won’t cover API details). Create a new project on your machine, using the --sup flag to generate an Elixir app with a supervision tree already configured:

mix new genserver_study --sup

Create a new module to represent some GenServer in your app:

defmodule SafeServer do
  use GenServer

  @prefix "[SafeServer]"

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(_) do
    IO.puts("#{@prefix} Initializing GenServer")
    {:ok, %{}, {:continue, :init}}
  end

  @impl true
  def handle_continue(:init, state) do
    IO.puts("#{@prefix} Initialized with #{inspect(self())}")
    {:noreply, state}
  end
end

This GenServer doesn’t process anything, it just logs its initialization with its PID.

Now let’s add it to our app’s supervision tree. Open lib/genserver_study/application.ex and include SafeServer in the list of children:

def start(_type, _args) do
  children = [
    SafeServer # Add the SafeServer module to the list of child processes
  ]

  opts = [strategy: :one_for_one, name: GenserverStudy.Supervisor]
  Supervisor.start_link(children, opts)
end

Now let’s start iex loading our app with iex -S mix:

iex -S mix
Erlang/OTP 28 [erts-16.0] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit:ns]

[SafeServer] Initializing GenServer
Interactive Elixir (1.18.4) - press Ctrl+C to exit (type h() ENTER for help)
[SafeServer] Initialized with #PID<0.142.0>

We can see from the log that our SafeServer started correctly.

Now let’s create a problematic GenServer and see how it affects our supervisor and SafeServer.

defmodule BrokenServer do
  use GenServer

  @prefix "[BrokenServer]"

  def start_link(behaviour), do: GenServer.start_link(__MODULE__, behaviour, name: __MODULE__)

  @impl true
  def init(behaviour) do
    IO.puts("#{@prefix} Initializing GenServer for #{behaviour}")
    {:ok, %{}, {:continue, {:init, behaviour}}}
  end

  @impl true
  def handle_continue({:init, behaviour}, state) do
    IO.puts("#{@prefix} Initialized with #{inspect(self())}")
    Process.send_after(self(), behaviour, 100)
    {:noreply, state}
  end

  @impl true
  def handle_info(:break, _state), do: raise("Unhandled error")

  def handle_info(:stop, state) do
    {:stop, :finish, state}
  end
end

A brief explanation of how BrokenServer works: the init/1 function returns a tuple with the option {:continue, {:init, behaviour}}. The handle_continue/2 callback schedules a message to itself in 100ms, and that message is the behaviour passed at startup (:break or :stop). We have two handle_info/2 clauses: one that raises an exception and one for the stop behavior. We’ll cover the difference between them later, but in both cases the GenServer process exits.

Now let’s add BrokenServer to the list of supervised children, with an option for the server’s initialization:

def start(_type, _args) do
  children = [
    SafeServer,
    {BrokenServer, :break} # Starting the GenServer with the :break option
  ]

  opts = [strategy: :one_for_one, name: GenserverStudy.Supervisor]
  Supervisor.start_link(children, opts)
end

Now let’s start our app with iex (I cleaned up some logs for readability):

iex -S mix

[SafeServer] Initializing GenServer
[BrokenServer] Initializing GenServer for break
[SafeServer] Initialized with #PID<0.156.0>
[BrokenServer] Initialized with #PID<0.157.0>

15:40:22.511 [error] GenServer BrokenServer terminating
** (RuntimeError) Unhandled error
    (genserver_study 0.1.0) lib/genserver_study/broken_server.ex:22: BrokenServer.handle_info/2
    ...
Last message: :break
State: %{}

[BrokenServer] Initializing GenServer for break
[BrokenServer] Initialized with #PID<0.161.0>

15:40:22.614 [error] GenServer BrokenServer terminating
** (RuntimeError) Unhandled error
    ...

15:40:22.617 [notice] Application genserver_study exited: shutdown

Running on your machine you’ll see the exception from our BrokenServer four times and at the end the message Application genserver_study exited: shutdown.

If you read the Supervisor strategies and options section, you’re probably thinking: “Well, the solution then is to crank the Supervisor threshold up to ‘infinity’ and prevent it from killing the entire application”. My answer to you is: it depends. This supervisor behavior exists to prevent your application from entering an infinite loop of crash-and-restart when it reaches a “no-return” state. The rationale is that by shutting down the whole app, an external process (the k8s lifecycle, for example) will rebuild the entire environment and eventually restore the application’s health. You can obviously choose more permissive values than the defaults, but you need to be very careful not to “mask” a real problem. After all, if the process isn’t staying alive, in practice your application isn’t working, so what would be the benefit of keeping a zombie app?
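As a rough sketch of what “more permissive values” can look like: the restart budget is set with the :max_restarts and :max_seconds options given to Supervisor.start_link/2. The module names RestartBudgetDemo and RestartBudgetDemo.Worker, the helper kill_and_wait/1, and the values 10 and 60 are made up for illustration; they are not a recommendation.

```elixir
defmodule RestartBudgetDemo.Worker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule RestartBudgetDemo do
  # Hypothetical helper: starts a supervisor with a wider restart budget
  # than the defaults (max_restarts: 3, max_seconds: 5).
  def start_supervisor do
    Supervisor.start_link([RestartBudgetDemo.Worker],
      strategy: :one_for_one,
      max_restarts: 10,
      max_seconds: 60
    )
  end

  # Kills the named worker and blocks until the supervisor has started
  # a replacement. Returns the pid of the new process.
  def kill_and_wait(name) do
    old = Process.whereis(name)
    ref = Process.monitor(old)
    Process.exit(old, :kill)

    receive do
      {:DOWN, ^ref, :process, ^old, _reason} -> wait_for_restart(name, old)
    end
  end

  defp wait_for_restart(name, old) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old ->
        pid

      _ ->
        Process.sleep(10)
        wait_for_restart(name, old)
    end
  end
end
```

With this budget, calling RestartBudgetDemo.kill_and_wait(RestartBudgetDemo.Worker) four times in a row leaves the supervisor alive, whereas the defaults would have shut it down on the fourth crash. Whether that is an improvement or just a slower failure depends on why the worker keeps dying.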

So what about “Let it crash”? I personally think the slogan choice is a mistake. It suggests that letting code break by design is the right way to do things in Elixir, and in practice it isn’t quite like that. The correct idea of fault tolerance in Elixir is to understand that unexpected errors won’t initially bring down your entire application. There’s a safety layer in the language that isolates processes and tries to recover from occasional or temporary anomalies. With that in mind, you should always handle all known error cases inside a GenServer’s execution (and any supervised process), and let exceptions be raised only for unforeseen and unexpected cases.
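To make “handle the known error cases” concrete, here is a minimal sketch. TolerantCounter is a hypothetical module, not part of the project above: bad input is an expected case, so it is answered with an error tuple instead of an exception, and the process keeps its state.

```elixir
defmodule TolerantCounter do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, 0, opts)

  def add(server, raw), do: GenServer.call(server, {:add, raw})

  @impl true
  def init(total), do: {:ok, total}

  @impl true
  def handle_call({:add, raw}, _from, total) when is_binary(raw) do
    case Integer.parse(raw) do
      # Known good case: the whole string is an integer.
      {n, ""} ->
        {:reply, {:ok, total + n}, total + n}

      # Known bad case: reply with an error and keep running.
      _ ->
        {:reply, {:error, :invalid_number}, total}
    end
  end

  # Anything else (for example a non-binary argument) matches no clause
  # and crashes the process: that is the unforeseen case left to the
  # supervisor.
end
```

Calling TolerantCounter.add(pid, "five") returns {:error, :invalid_number} and the same pid keeps serving later calls, so a predictable input problem never touches the supervisor’s restart budget.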

Exception vs Gracefully stop

You may have noticed that in the BrokenServer code example we have a handle_info flow that returns a stop tuple instead of raising an exception:

def handle_info(:stop, state) do
  {:stop, :finish, state} # Gracefully stop
end

This option signals that the GenServer process should be terminated. It’s typically used when we have a flow where it’s no longer necessary to keep the GenServer process running, often because it won’t be used anymore. But how does the Supervisor behave when a GenServer sends a “controlled” exit signal? Let’s change the configuration in our supervision tree to enable the stop flow:

def start(_type, _args) do
  children = [
    SafeServer,
    {BrokenServer, :stop} # Changed the option from :break to :stop
  ]

  opts = [strategy: :one_for_one, name: GenserverStudy.Supervisor]
  Supervisor.start_link(children, opts)
end

And then we get the logs (some lines omitted for readability):

iex -S mix

[SafeServer] Initializing GenServer
[BrokenServer] Initializing GenServer for stop

[BrokenServer] Initialized with #PID<0.157.0>
[SafeServer] Initialized with #PID<0.156.0>

10:12:32.292 [error] GenServer BrokenServer terminating
** (stop) :finish
Last message: :stop
State: %{}

[BrokenServer] Initializing GenServer for stop
[BrokenServer] Initialized with #PID<0.159.0>

...

10:12:32.604 [notice] Application genserver_study exited: shutdown

You’ll notice that, again once the process has terminated 4 times (the initial start plus 3 restarts), the supervisor shuts down with Application genserver_study exited: shutdown.

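That happens because the default restart value for a child is :permanent: the supervisor always tries to restart the process when it terminates, regardless of the reason. This default can be changed for a specific GenServer in the options of use GenServer, restart: :transient (How to supervise), or by overriding the child specification in the supervisor’s list of children: Restart values (:restart). Note that :transient only skips the restart when the exit reason is :normal, :shutdown or {:shutdown, term}; a custom reason such as :finish still counts as abnormal and is restarted.

It’s important, however, to keep in mind that when a child is configured as transient and its GenServer stops with a normal reason, the supervisor won’t bring it back. If it needs to run again, this has to be done manually: via Supervisor.restart_child/2 under a regular Supervisor (which keeps the child specification), or via start_child/2 under a DynamicSupervisor.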
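A small sketch of the transient setup, assuming a hypothetical OneShotWorker module and made-up :done and :fail messages. Under a plain Supervisor the child specification is kept after a normal exit, which is why this sketch is brought back with Supervisor.restart_child/2.

```elixir
defmodule OneShotWorker do
  # :transient means "restart only after an abnormal exit".
  use GenServer, restart: :transient

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}

  # :normal (like :shutdown and {:shutdown, term}) is not treated as a
  # crash, so a transient child that stops this way is not restarted.
  @impl true
  def handle_info(:done, state), do: {:stop, :normal, state}

  # Any other reason, such as :finish, still counts as abnormal and
  # triggers a restart even for a transient child.
  def handle_info(:fail, state), do: {:stop, :finish, state}
end
```

After sending :done to the worker, Supervisor.count_children/1 on its supervisor reports one spec and zero active children, and Supervisor.restart_child(sup, OneShotWorker) starts it again. The child id is the module name here because that is the default generated by use GenServer.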

Message queue — mailbox

Another very important aspect of processes in Elixir, and therefore also a GenServer characteristic, is that all messages (send/2, GenServer.call/2, GenServer.cast/2) are always enqueued in each process’s message queue (mailbox) and can be read one at a time. The existence of the mailbox is briefly mentioned in Elixir’s message-sending docs: Sending and receiving messages, but it’s explained in more detail in the official Erlang docs: Signals — Adding Messages to the Message Queue.
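The mailbox can be observed with plain processes, no GenServer involved. The sketch below uses a hypothetical MailboxDemo module: a worker ignores everything until it receives :go, so the jobs sent before that simply sit in its queue, and are then read one at a time in the order they arrived.

```elixir
defmodule MailboxDemo do
  def run do
    parent = self()

    worker =
      spawn(fn ->
        # Wait for :go. Messages that don't match stay in the mailbox.
        receive do
          :go -> :ok
        end

        send(parent, {:processed, drain([])})
      end)

    Enum.each(1..3, fn n -> send(worker, {:job, n}) end)

    # Number of messages waiting in the worker's mailbox right now.
    {:message_queue_len, queued} = Process.info(worker, :message_queue_len)

    send(worker, :go)

    receive do
      {:processed, jobs} -> {queued, jobs}
    end
  end

  # Reads queued jobs one at a time until the mailbox has no more of them.
  defp drain(acc) do
    receive do
      {:job, n} -> drain([n | acc])
    after
      0 -> Enum.reverse(acc)
    end
  end
end
```

MailboxDemo.run/0 returns the queue length seen while the worker was blocked, together with the jobs in the order the worker read them. If the worker had died before receiving :go, those queued jobs would have disappeared with it.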

Understanding this aspect of how processes work is important because it usually becomes the cause of two very common failures in Elixir architectures: bottlenecks and message loss. Imagine you created a GenServer responsible for controlling API rate limits: for each incoming request the controller sends a message to that GenServer with the source IP; the GenServer needs to save the date and time that IP tried to hit the endpoint, and fetch all the previous attempts to compute whether the limit has been exceeded. Now imagine your API receives millions of requests per second, all sending messages to that GenServer which has to save and compute limits before replying to the controller whether to accept the request. Notice that this GenServer process became a bottleneck for the entire API — every request waits for the response of this centralized process which can only handle one message at a time. And the most concerning part: what happens to the mailbox if that GenServer exits because of an exception? Let’s test.

Create a new GenServer module with the following code:

defmodule MsgServer do
  use GenServer

  @prefix "[MsgServer]"

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(_) do
    IO.puts("#{@prefix} Initializing GenServer")
    {:ok, %{count: 0}}
  end

  def message(id), do: GenServer.call(__MODULE__, {:msg, id})
  def stop, do: GenServer.call(__MODULE__, :stop)
  def break, do: GenServer.call(__MODULE__, :break)

  @impl true
  def handle_call({:msg, id}, _from, %{count: count}) do
    # Fake heavy computation
    :timer.sleep(200)
    IO.puts("#{@prefix} Finish process msg #{id}")

    {:reply, {:ok, id}, %{count: count + 1}}
  end

  @impl true
  def handle_call(:stop, _from, state) do
    IO.puts("#{@prefix} Sending stop message")

    send(self(), :restart_connection)
    {:reply, {:ok, :stopping}, state}
  end

  @impl true
  def handle_call(:break, _from, state) do
    raise("Unhandled Error")

    {:noreply, state}
  end

  @impl true
  def handle_info(:restart_connection, state) do
    IO.puts("#{@prefix} Handling stop message")
    {:stop, :publish_error, state}
  end

  @impl true
  def terminate(reason, state) do
    IO.puts("#{@prefix} Terminating reason: #{inspect(reason)}, state: #{inspect(state)}")

    :normal
  end
end

This GenServer has an API with just 3 functions: one to simulate “heavy” message processing, one to simulate an error via exception, and another to simulate an exit with stop.

Let’s change our supervision tree to start this new GenServer:

def start(_type, _args) do
  children = [
    MsgServer # Removed the other GenServers and added MsgServer
  ]

  opts = [strategy: :one_for_one, name: GenserverStudy.Supervisor]
  Supervisor.start_link(children, opts)
end

Let’s create a module to simulate sending messages to our GenServer:

defmodule MsgTest do
  def stop do
    Enum.each(0..15, fn index ->
      if index == 3 do
        spawn(fn -> MsgServer.stop() |> IO.inspect(label: "stop") end)
      else
        spawn(fn -> message(index) end)
      end

      :timer.sleep(50)
    end)
  end

  def break do
    Enum.each(0..15, fn index ->
      if index == 3 do
        spawn(fn -> MsgServer.break() |> IO.inspect(label: "break") end)
      else
        spawn(fn -> message(index) end)
      end

      :timer.sleep(50)
    end)
  end

  defp message(id) do
    IO.puts("[Test] Sending message #{id}")

    case MsgServer.message(id) do
      {:ok, _count} ->
        IO.puts("[Test] Result for message #{id} is ok")

      error ->
        IO.puts("[Test] Result for message #{id} is #{inspect(error)}")
    end

    IO.puts("[Test] Gracefully ending message #{id}")
  catch
    :exit, error ->
      IO.puts("[Test] Catch error #{inspect(error)}")
  end
end

This test module exposes two public functions: one to simulate a scenario where the GenServer terminates with an exception (break) and another for the GenServer to terminate with a stop/exit. The mechanism is to spawn 16 processes; the fourth one will send a break or stop message, simulating a break/termination of the GenServer. The goal is to watch the logs and see what happens to the messages still in the GenServer’s mailbox when it terminates.

Let’s start the app and run MsgTest.break to simulate the interruption via exception:

iex(1)> MsgTest.break()
[Test] Sending message 0
[Test] Sending message 1
[Test] Sending message 2
[MsgServer] Finish process msg 0
[Test] Result for message 0 is ok
[Test] Gracefully ending message 0
[Test] Sending message 4
[Test] Sending message 5
[Test] Sending message 6
...
[MsgServer] Finish process msg 2
[Test] Result for message 2 is ok
[Test] Gracefully ending message 2
[Test] Sending message 12

12:18:43.864 [error] GenServer MsgServer terminating
** (RuntimeError) Unhandled Error
    ...

[MsgServer] Initializing GenServer
[Test] Catch error { {%RuntimeError{message: "Unhandled Error"}, ...}, {GenServer, :call, [MsgServer, {:msg, 4}, 5000]}}
[Test] Catch error { {%RuntimeError{message: "Unhandled Error"}, ...}, {GenServer, :call, [MsgServer, {:msg, 6}, 5000]}}
... (7 more similar messages)
[Test] Sending message 13
[Test] Sending message 14
[Test] Sending message 15
[MsgServer] Finish process msg 13
[Test] Result for message 13 is ok
[Test] Gracefully ending message 13
[MsgServer] Finish process msg 14
[Test] Result for message 14 is ok
[Test] Gracefully ending message 14
[MsgServer] Finish process msg 15
[Test] Result for message 15 is ok
[Test] Gracefully ending message 15

We can see from the logs that, before the exception happened, messages were sent up to index 12. Of those, messages 0, 1 and 2 were processed: [Test] Gracefully ending message 2. All the other messages that were in the mailbox ended up in the catch (we’ll come back to this). The processes for messages 13, 14 and 15 were only spawned after the Supervisor had already restarted the GenServer (note [MsgServer] Initializing GenServer in the log before they are sent), so they were processed correctly. GenServer.call/2 does not wait for a restart: a call made while the name is unregistered exits immediately with :noproc. In short, because of the exception during the GenServer execution, 9 messages waiting in the mailbox were lost, even though the Supervisor recovered the GenServer afterwards.

What happens to the messages “dropped” when the GenServer terminates? Imagine the initial example, a GenServer that controls API rate limit. Each request to that API is handled by an isolated process, and each one calls GenServer.call/2. Imagine our API controller has the following code:

def show(conn, params) do
  with :ok <- GenServer.call(RateLimit, {:get, conn}) do
    process_show(params)
  end
end

In this code, when the process running show/2 reaches the GenServer.call/2 line, it sends the message to the RateLimit GenServer and waits for a reply. While waiting, it also monitors the GenServer process. When the GenServer dies, every caller still waiting for a reply is notified, and GenServer.call/2 exits in the caller with the server’s exit reason. Since the code has no catch to handle that exit, the controller process also exits without reaching the next line, which would be process_show(params). That exit is captured by Phoenix, and we’d see a 500 error from the API.

The catch used in the MsgTest test module is a way to capture that exit signal and control the flow. In our case, we don’t terminate the process that called GenServer.call/2, we just log the information and continue normal execution.
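The same idea can be packaged as a small wrapper. This is only a sketch: CrashyLimiter and SafeClient are hypothetical names, and the server is started with GenServer.start/3 (unlinked) so its crash doesn’t take the calling process down through a link.

```elixir
defmodule CrashyLimiter do
  use GenServer

  def start, do: GenServer.start(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call(:check, _from, state), do: {:reply, :ok, state}
  def handle_call(:boom, _from, _state), do: raise("Unhandled error")
end

defmodule SafeClient do
  # Turns the exit raised by GenServer.call/3 into a tagged tuple,
  # so the caller decides what to do instead of dying with the server.
  def call(server, request, timeout \\ 5_000) do
    {:ok, GenServer.call(server, request, timeout)}
  catch
    :exit, reason -> {:error, reason}
  end
end
```

SafeClient.call(CrashyLimiter, :check) returns {:ok, :ok} while the server is healthy. A :boom call returns {:error, {{%RuntimeError{}, stacktrace}, {GenServer, :call, args}}}, the same shape seen in the MsgTest logs, and a call made while the name is not registered returns {:error, {:noproc, _}}. In a controller, that error tuple could be mapped to a deliberate response instead of a 500.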

You can now test what happens calling the stop flow: MsgTest.stop(). You’ll notice the behavior is the same. Even with a controlled termination, all messages in the mailbox are discarded and the processes that called the GenServer receive the exit signal.

Mitigating bottlenecks and message loss with Poolboy

There are some strategies to mitigate the problems we showed. One is to separate queue management from the GenServer doing the work and spread message processing across more than one process. This isn’t always simple: in the rate-limit example it would be a bit more complex because you can have several requests in parallel from the same IP, requiring an additional “routing” layer to ensure data consistency. But for most simpler problems, a great ally to adjust the architecture is the Poolboy library, which lets you control that queue and parallelize execution using a worker-pool strategy (very similar to what Ecto does with database connections). Poolboy manages the queue itself and hands each worker a single request at a time; that way, when a worker dies, the others keep handling demand and no message beyond the one it was processing is lost.
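A minimal sketch of that worker-pool setup, assuming Poolboy 1.5 is available. PoolDemo, PoolDemo.Worker, the pool name :demo_pool and the pool size are all made up for illustration. Mix.install/1 is used so the snippet runs as a standalone script; in a Mix project you would add {:poolboy, "~> 1.5"} to deps instead.

```elixir
Mix.install([{:poolboy, "~> 1.5"}])

defmodule PoolDemo.Worker do
  use GenServer

  # Poolboy starts each worker through start_link/1.
  def start_link(_args), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:square, n}, _from, state) do
    # Stand-in for slow work.
    Process.sleep(50)
    {:reply, n * n, state}
  end
end

defmodule PoolDemo do
  @pool :demo_pool

  # Child spec to place in a supervisor's list of children.
  def child_spec do
    :poolboy.child_spec(@pool,
      name: {:local, @pool},
      worker_module: PoolDemo.Worker,
      size: 3,
      max_overflow: 0
    )
  end

  # Checks a worker out, runs one call on it, and checks it back in.
  def square(n) do
    :poolboy.transaction(@pool, fn worker ->
      GenServer.call(worker, {:square, n})
    end)
  end
end
```

Once PoolDemo.child_spec() is in a supervisor’s children list, concurrent callers of PoolDemo.square/1 are spread over three workers. Requests beyond that wait in Poolboy’s queue rather than in a single worker’s mailbox, so one worker crashing only affects the caller it was serving.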
