Another chapter in the GenServer the “rabbit hole” series. No GenServer this time — just logs. But the same principle: the abstraction looks simple, and almost no one realizes they’re using it wrong until the bill arrives.
Picture the scene: 2 a.m., a pod restarted and the alert woke someone up. They open Loki, filter by the app label, and the first browser screen looks like this:
12:45:33Z info [InboundWebhook] - Income message: %{"path" => "proxy/977671907166781440/example.cdn.com/v3/openlive/CA2585112_5_2", ...}
12:45:33Z info Sent 200 in 44ms
12:45:33Z warn [ExternalProvider EventWorker] - Event already processed, skipping: "869247060431936_2026-05-21..."
12:45:33Z warn [SetAuthToken] Resource ba8bf8f3-5ceb-4d7d-a318-f0a88e228f41 send token 010000000000003D and AuthToken is not found
12:45:33Z info Sent 200 in 12ms
12:45:33Z warn [ExternalProvider EventWorker] - Event already processed, skipping: "865478070223930_..."
12:45:33Z info Sent 200 in 39ms
12:45:33Z error Error creating resource: "network exception"
12:45:33Z info Sent 200 in 22ms
12:45:33Z warn [SetAuthToken] Resource 38d916e1-1ee7-4f91-9f96-db6ff14c0bd7 send token 0100F68F9DD034FB and AuthToken is not found
... (70+ more lines per second)
The window covers 1 second. To understand the restart, you need what happened in the last 5 minutes. Good luck.
This is the state of logs in a hypothetical Elixir application in production (any resemblance to a codebase you’ve worked on… well, it’s intentional 😉). The fictional case is going to act as the thread of the article — the story is just there to give shape to the practices that come next.
16 lines per second come from the team’s code
Suppose we run a simple Loki query covering the last hour of production for our example app, aggregating by erl_level:
sum by (erl_level) (count_over_time({app="my-app"}[1h]))
The numbers below are illustrative — they didn’t come from a specific cluster, but they reflect the order of magnitude that shows up in mature Elixir apps running Phoenix + Tesla + a handful of workers. The point isn’t the absolute values: it’s the shape of the problem.
| Level | Lines/hour | % of total | Lines/second |
|---|---|---|---|
| info | 236,667 | 89.4% | 65.7 |
| warning | 23,557 | 8.9% | 6.5 |
| error | 5,406 | 2.0% | 1.5 |
| notice | 22 | <0.01% | ~0 |
| Total | 265,652 | 100% | 73.5 |
A good chunk of that volume is expected framework baseline — stuff the team didn’t write. lib/phoenix/logger.ex (request summary "Sent 200 in 44ms") alone generates 200,015 lines/h (84.5% of the total). Add lib/tesla/middleware/logger.ex (HTTP client, 5,244/h) and runtime crash reports (gen_server.erl, 1,508/h), and you have ~207k lines you can’t reduce without touching library config.
What’s left — ~58,900 lines/h, ~16.4 lines/second — is what the team wrote. And that’s where the story lives.
Before moving on, a number worth highlighting: 5,406 errors per hour at steady state. No incident, no rollout, just the app living. For comparison: a real post-mortem recorded 1,931 Postgrex 22001 errors in 3 hours during an incident that took down a pod. The “healthy” error rate is 8 times higher than the error rate during the incident that motivated a post-mortem.
And when you filter framework noise and look at where those 16 lines/s that the team generates come from, the picture gets uncomfortable:
topk(10, sum by (file) (count_over_time({app="my-app"} | json [1h])))
(filtering files under lib/myapp/)
| File | Lines/h | % of what the team wrote |
|---|---|---|
lib/myapp/gateways/external_provider/event_worker.ex | 16,218 | 27.5% |
lib/myapp/resources/set_auth_token.ex | 15,374 | 26.1% |
lib/myapp/jobs/consumers/job_finished_consumer.ex | 12,848 | 21.8% |
lib/myapp/resources/workers/update_resource_cost_center_worker.ex | 2,832 | 4.8% |
lib/myapp/gateways/device_cloud/positions/enrich_position.ex | 2,676 | 4.5% |
3 files account for 75% of the log the team produces. Five files account for ~85%. The rest of the codebase — thousands of modules — generates, combined, less log than the ExternalProvider’s event_worker.ex alone.
That’s the effect of pollution. Each of those lines, individually, seemed useful when it was written. Combined, they drown out what matters when someone needs to investigate — the exact opposite of what logs are for. Polluted logs don’t help debugging; they get in the way. Each “just in case” line pushes the line that actually tells the story further off the screen.
And the concentration is also a point worth keeping for the rest of the article: each of these top files has, at its core, a single line of Logger.x firing in a loop. All the noise in Loki isn’t the entire application screaming — it’s a handful of calls that, because they’re on a hot path, dominate the volume. Let’s look at the leader to understand the pattern:
Logger.warning(
"[ExternalProvider EventWorker] - Event already processed, skipping: #{inspect(cache_key)} - #{alert_type}"
)
Translation: “this event was already processed, so I won’t process it again”. That’s dedup working. It’s the code’s happy path. It’s exactly what the function was written to do.
And it’s classified as warning. In production. 4.5 times per second. 16,000 times per hour.
That’s the theme of the article.
Part 1 — The cost of noise
💎 Logging isn’t free. Each line has three different costs.
Every time a Logger.info("just gonna log a little something here") appears in the code, three invisible budgets are being spent:
1. Direct infra cost. On a self-hosted observability stack you don’t pay an ingestion fee to anyone. But you pay AWS for the EC2 instances running Loki, the EBS/S3 storing indexed chunks, the network moving log data between pod, ingester and storage. That cost grows linearly with volume and retention. Every "#{inspect(big_map)}" on a hot path inflates the line from 80 bytes to 4 kB, and that turns into extra EBS next month — not a scary SaaS bill, but a silent infra curve. (Loki — best practices for labels has the summary of why the index suffers as much as the storage.)
2. Cognitive cost on on-call. The human reading the screen at 2 a.m. has finite bandwidth. If they have to scroll through 30 seconds of “skipping” before finding the stack trace, incident resolution speed drops. This cost doesn’t show up on any bill, but it shows up in MTTR (and in DORA Metrics — we talked about that).
3. Cardinality cost. This is the most subtle one and the one that hurts Loki the most over time. A message that changes on every call blows up the index. Look at the worst offender in the example:
Logger.warning(
"[#{@log_namespace}] Resource #{resource.id} send token #{token} and AuthToken is not found"
)
15,374 lines/hour of these, each with a unique message (different resource_id and token). To find every occurrence of this case in Loki, you can’t filter by exact message — you need regex (|~), which is more expensive. And you’ll never be able to group by this “event category” because from the index’s perspective there are 15,000 different categories per hour.
The correct pattern would be:
# Fixed message + metadata
Logger.warning("AuthToken not found for incoming token",
resource_id: resource.id,
token: token
)
The message became one category. resource_id became a structured, indexable, groupable label. The cost of querying Loki drops. And the search becomes:
{app="my-app"} | json | msg="AuthToken not found for incoming token"
instead of:
{app="my-app"} |~ "AuthToken is not found"
And the best part: resource_id became a label you can pivot on (sum by (resource_id)) to find out if the problem is in one specific resource or all of them. Before, that was impossible.
🎯 Practical rule: message is constant. The variable parts are metadata. If your
Logger.X("...")has#{interpolation}, stop and ask: why isn’t this metadata?
Configuring the formatter and the handler
Good news is that the infra for this is usually already available — it just needs to be wired up. In a typical Elixir app, the minimal setup has two layers:
a) Text formatter for dev/test — config :logger, :default_formatter controls how each line appears in the local console. Here you list the metadata fields you want to see in the terminal:
# config/config.exs
config :logger, :default_formatter,
format: "$time $metadata[$level] $message\n",
metadata: [
:request_id,
:graphql_operation_name,
:graphql_operation_type,
:organization_id,
:resource_id,
:log_name,
:span_id,
:trace_id,
:module
]
This is just for human readability in the console. In production the output needs to be structured — plain text isn’t indexable.
b) JSON handler for production — by plugging in a lib like LoggerJSON, each Logger.x call becomes a JSON line with msg, level, time, and all metadata as a first-class field:
# config/runtime.exs
if config_env() == :prod do
config :logger, :default_handler,
formatter: {LoggerJSON.Formatters.Basic, metadata: :all}
end
The lib ships with formatters for some common destinations (Basic, GoogleCloud, Datadog) — pick the one that matches your log backend. The important bit is that, with the JSON handler on, resource_id, request_id, trace_id stop being loose text in the message and become columns Loki (or equivalent) indexes.
The set_auth_token line from the example could already have resource_id indexable and doesn’t. It’s not lack of infra, it’s lack of habit.
Logger.metadata in layers: set it once, inherit always
You don’t need to repeat resource_id: in every log call inside a resource’s processing tree. Logger.metadata/1 is per-process and propagates to every subsequent log in that same Elixir process. The good case lives in entry-point middlewares like this:
# lib/my_app_web/middlewares/operation_name_logger.ex
defmodule MyAppWeb.Middleware.OperationNameLogger do
@behaviour Absinthe.Middleware
alias Absinthe.Blueprint.Document.Operation
@impl true
def call(resolution, _opts) do
case Enum.find(resolution.path, &match?(%Operation{}, &1)) do
%Operation{name: name, type: type} when not is_nil(name) ->
Logger.metadata(graphql_operation_name: name, graphql_operation_type: type)
_ ->
Logger.metadata(graphql_operation_name: "#NULL", graphql_operation_type: "#UNKNOWN")
end
resolution
end
end
And the trick is to plug this middleware into the schema’s middleware/3, so it runs before any resolver:
# lib/my_app_web/schema.ex
defmodule MyAppWeb.Schema do
use Absinthe.Schema
# ...
def middleware(middleware, _field, _object) do
[MyAppWeb.Middleware.OperationNameLogger | middleware]
end
end
From there on, every Logger.x call inside that request — in the resolver, in the context, in Ecto, in any library — inherits these fields. Zero cost per line.
The same technique pairs well with other entry-points — resource processing, organization processing, Oban worker:
# In an Oban worker, before doing anything else:
def perform(%Job{args: %{"resource_id" => resource_id, "organization_id" => org_id}}) do
Logger.metadata(resource_id: resource_id, organization_id: org_id)
do_the_work()
end
From there on, no one needs to interpolate IDs in messages. do_the_work() calls 5 more functions, each does 3 logs — all of them come out with resource_id and organization_id in the JSON. No extra work.
If you don’t want to repeat that setup in every worker, you can centralize it via an Oban telemetry handler. The [:oban, :job, :start] event fires right before perform/1 runs, and you can inject metadata from a single place:
# lib/my_app/application.ex
def start(_type, _args) do
:telemetry.attach(
"oban-logger-metadata",
[:oban, :job, :start],
&MyApp.ObanLogger.handle_event/4,
nil
)
# ...
end
# lib/my_app/oban_logger.ex
defmodule MyApp.ObanLogger do
def handle_event([:oban, :job, :start], _measure, %{job: job}, _config) do
Logger.metadata(
oban_worker: job.worker,
oban_job_id: job.id,
resource_id: job.args["resource_id"],
organization_id: job.args["organization_id"]
)
end
end
Upside: workers stay clean. Downside: the metadata depends on the convention of keys in args — if a worker uses "device_id" instead of "resource_id", it falls out. Pick what matches the team.
The inspect that never sleeps
Worse than interpolating an ID is interpolating an entire payload. Typical case in our example:
Logger.info(
"[InboundWebhook] - Income message: #{inspect(params, printable_limit: :infinity, limit: :infinity)}"
)
Notice the inspect parameters: printable_limit: :infinity, limit: :infinity. The call is explicitly opting in to not truncate. In practice, this Logger.info became a contractual promise to print the entire payload, no matter how large, in production, on every request — probably a decision made at a time when that made sense, but that aged together with the endpoint’s volume.
Take a look at a neighbor too:
Logger.debug("Complete resource data: #{inspect(resource_data)}")
Logger.debug is silenced in production by config :logger, level: :info, so this line doesn’t reach Loki — but it reaches the local dev/staging log, and the first time someone enables debug on a pod to investigate something, it’s going to print the entire struct of each resource on every user interaction in the map.
⚠️ Practical rule:
inspect/1in a log is a strong smell. If you need to see the payload to debug, save the ID and look up the payload later (in the DB, in S3, in a replay). If the payload is small and stable, it becomes structured metadata. If it’s large and dynamic, it’s not a log, it’s a dump — and dumps belong elsewhere.
Part 2 — Severity isn’t decoration
💎
erroris the level for “someone needs to look at this”. If you use it for normal flow, no one looks anymore.
There’s a version of the tooth-fairy saying: if everything is important, nothing is important. The log level is exactly that — a promise of relevance. When you use error for a scenario that isn’t an error, you break the contract with whoever’s reading.
The semantics I defend (and which are roughly consensus in Elixir and beyond):
| Level | Scenario | Who’s the audience |
|---|---|---|
debug | Technical detail only useful during active investigation | You, on the day you need it. Silenced in prod. |
info | Important business event; relevant state transition | Audit, replay, “what did this pod do in the last hour” |
warning | Something degraded but recovered; unexpected condition that didn’t stop the flow | The team, at the next retrospective |
error | Something broke and someone needs to look at it | On-call, now |
With that criterion in hand, let’s look at two typical cases.
Case 1: auth-deny isn’t an error
if authorized?(token) do
conn
else
Logger.error("Unauthorized request with token: #{inspect(token)}")
abort_request(conn)
end
In production this shows up as erl_level=error:
{
"msg": "Unauthorized request with token: nil",
"function": "call/2",
"module": "Elixir.MyAppWeb.Plugs.RequireApiToken",
"request_id": "GLGTwuoyCvEsYq0AtM1E"
}
Two lessons in this line:
- A client without a token (or with a wrong token) is normal flow for a public API. It’s not a system failure. No one’s going to be woken up at 2 a.m. to investigate this. It would fit better as
infoor — if it’s worth being observable — as a metric (:telemetry.execute([:myapp, :auth, :denied], …)) instead of a log. #{inspect(token)}ends up putting the token value in the log line. In the example, the token wasnil(missing header in the request). But the day a real invalid token arrives, it goes into self-hosted Loki, indexed, retained for whatever the cluster holds. That’s a credential exposure risk worth avoiding.
And the telemetry alternative doesn’t have to stay vague — the concrete path is to declare the metric and plug it into a Reporter:
# lib/my_app/telemetry.ex
defmodule MyApp.Telemetry do
import Telemetry.Metrics
def metrics do
[
counter("myapp.auth.denied.count",
event_name: [:myapp, :auth, :denied],
tags: [:reason]
)
]
end
end
Then in application.ex:
# lib/my_app/application.ex
children = [
{TelemetryMetricsPrometheus, [metrics: MyApp.Telemetry.metrics()]},
# ...
]
And in the plug, in place of Logger.error:
:telemetry.execute([:myapp, :auth, :denied], %{count: 1}, %{reason: :missing_token})
Now “how many auth-denies did we have today” is a chart, with a tag per reason. Nothing goes into Loki — and no one gets woken up.
Case 2: the expected retry became an alert
Logger.error("Erro ao criar recurso: #{inspect(response)}")
This shows up in production as:
{
"msg": "Erro ao criar recurso: \"network exception\"",
"module": "Elixir.MyApp.Gateways.DeviceCloud.Device.ConfigResource",
"function": "configure_resource/3"
}
Here’s what happens: HTTP call to an external gateway returned network exception. The code handles this in the next line: get_resource(device_channel, area_id, external_id) — it queries the API to see if the resource already exists, retries, moves on. The operation recovers on its own.
That’s not error. That’s warning (maybe), or not even that. And the #1 error volume source in the same service is exactly this module: 1,628 “errors” per hour coming from process_event.ex of the same gateway, in the same pattern of “external API slow/unstable, retrying”.
Whoever looks at the error dashboard sees 5,406 errors/h and hits the brakes. Whoever investigates finds out ~30-50% of that is expected gateway retry. The metric lies — and when it lies, nobody looks anymore.
The good case (it exists too)
To close, a correct example:
Logger.warning(
"[CrmWebhookWorker] webhook_id=#{webhook.id} rate-limited — snoozing #{seconds}s"
)
Destination rate limit: something degraded (we won’t deliver right now), but it recovers (we’ll try again in seconds). warning level, action-oriented message, structured ID (even though inside the string — the webhook_id= helps the searcher, but the ideal form would be Logger.warning("rate-limited; snoozing", webhook_id: webhook.id, snooze_seconds: seconds)).
This log says something useful to a human: I’m snoozing, keep an eye on it if it turns into an epidemic. That’s exactly what warning should say.
About log language
You’ll notice that part of the examples above are in Portuguese (Erro ao criar recurso, Recurso não encontrado) and part in English. It’s not uncommon to find bilingual codebases — something that happens naturally in a bilingual team over time, and individually each choice made sense.
Nothing wrong with logging in Portuguese. The cost is specifically about mixing in the same codebase: substring alerts have to cover both variants ({app="my-app"} |~ "not found|não encontrado"), Loki searches force you to remember two versions of the word, and someone still getting acquainted with the domain spends extra attention on mental translation.
The reflection is simple: picking one language only — English or Portuguese, either one — gives more consistency than the mixed state. It’s a conversation worth having calmly, picking one, and following from there.
Part 3 — Telemetry and Logger aren’t the same thing
💎 If you can count and group, it’s a metric. If you need to tell a story, it’s a log.
A modern Elixir stack usually has :telemetry configured, and OpenTelemetry plugged into the log formatter via trace_id and span_id. The two channels are complementary, not substitutes. The convention I use:
- Telemetry answers “how many / how frequently / how fast”. Metric, aggregation, dashboard. Ex.:
[:myapp, :crm, :webhook, :received], with:duration,:batch_size,provider. - Logger answers “what happened in this specific case that I need to reconstruct”. Singular event, with IDs for cross-reference. Ex.:
Logger.warning("webhook rate-limited", webhook_id: id, retry_in_seconds: s).
The good case is this:
:telemetry.execute(
[:myapp, :crm, :webhook, :received],
Map.put(summary, :batch_size, length(events)),
%{provider: :hubspot}
)
Pure metric. No redundant log. Whoever wants to know “how many webhooks did HubSpot send today” won’t read logs — they’ll open the dashboard.
Plugging this into your app
For :telemetry.execute/3 to actually become a chart, you need the other side of the equation — something listening to the event and exporting it as a metric. The most direct path in a modern Elixir app is the :telemetry_metrics + a Reporter combo:
# lib/my_app/telemetry.ex
defmodule MyApp.Telemetry do
import Telemetry.Metrics
def metrics do
[
counter("myapp.crm.webhook.received.count",
event_name: [:myapp, :crm, :webhook, :received],
tags: [:provider]
),
distribution("myapp.crm.webhook.received.duration",
event_name: [:myapp, :crm, :webhook, :received],
measurement: :duration,
tags: [:provider],
unit: {:native, :millisecond}
)
]
end
end
# lib/my_app/application.ex
children = [
{TelemetryMetricsPrometheus, [metrics: MyApp.Telemetry.metrics()]},
# ...
]
Done: the Reporter exposes a /metrics endpoint in Prometheus format, and the [:myapp, :crm, :webhook, :received] event becomes myapp_crm_webhook_received_count_total{provider="hubspot"} on the dashboard.
For specific cases where an off-the-shelf Reporter isn’t enough, you can attach a custom handler:
:telemetry.attach(
"myapp-crm-webhook-logger",
[:myapp, :crm, :webhook, :received],
fn _event, measure, meta, _config ->
# custom logic here
MyApp.Metrics.record(measure, meta)
end,
nil
)
Careful with this on hot paths — attach/4 runs synchronously in the process that emitted the event. For high-frequency events, prefer an off-the-shelf Reporter.
The bad case
Almost every codebase has it: the same event produces both a log and a metric. The log goes out of sync with the metric (old text, old format, missing the new field added to the metric) and on-call reads the log because it’s what shows up in Loki. Worst of both worlds.
🎯 Practical rule: before adding a
Logger.infoon a high-frequency path, ask: is this going to become a chart someday? If it is, it’s telemetry. Not a log.
Part 4 — Debuggability: when the log saves you
Everything we discussed so far was about trimming. But logs do a job no other tool does as well: tell the story of a specific case, to a human, after the fact.
Think about a cascade incident via process links (the kind of thing Part 2 of the series describes). What allows reconstructing this kind of incident in a few hours is exactly what’s usually well done in the logs:
- Stable
trace_idcrossing processes. The Task that crashed had the sametrace_idas the original request; the GenServer that received the{:EXIT, _, :shutdown}had thetrace_idinherited viaLogger.metadata. - Distinguishable messages for exceptions:
Postgrex.Error 22001 string_data_right_truncationis a unique, googleable string.function_clause in Highlander.handle_info/2points straight to the code. - Aggregate counts work because the message is stable:
count_over_time({...} |= "22001" [3h])gives 1,931 without needing expensive regex.
It’s this trio that lets you move from the wrong hypothesis (“it’s OOM”) to the right one (“it’s link cascade via Task”) in just a few iterations.
Now look at the exact anti-pattern:
exception ->
Logger.error("Unexpected error: #{inspect(exception)}",
error: exception,
stacktrace: __STACKTRACE__
)
Resolution.put_result(resolution, @default_error)
The intent is exactly right: an Absinthe middleware that captures unexpected exceptions so they don’t take down the GraphQL response. But the form has three points worth rethinking from a debuggability point of view:
- The message is constant (
"Unexpected error: ...") but the content ofinspect(exception)changes per case. You fall into the same cardinality trap from Part 1 — you can’t group by “exception type”. - The resolver that blew up isn’t in the log (no
graphql_operation_namein the message — it picks it up from metadata ifOperationNameLoggerran first; good in theory, fragile in practice). - The caller receives
{:error, :unknown}(the@default_error) — the frontend can’t distinguish “validation failed” from “BEAM caught fire”.
The idiomatic fix would be something like:
exception ->
Logger.error("absinthe resolver raised",
exception_module: exception.__struct__,
exception_message: Exception.message(exception),
stacktrace: Exception.format_stacktrace(__STACKTRACE__)
)
Resolution.put_result(resolution, @default_error)
Stable message ("absinthe resolver raised"), exception module as indexable metadata (exception_module=Postgrex.Error, exception_module=ArgumentError), human message separate. Now you can do:
sum by (exception_module) (count_over_time({app="my-app"} | json | msg="absinthe resolver raised" [1h]))
and see, in a chart, which exceptions are most frequent. Same intent as the original code, with debuggability turned on.
Bonus: PII in logs
The same paths that eat Loki’s lungs are also where PII tends to leak. Typical case:
Logger.debug("[Forms] form: create with params #{inspect(params)}")
params here is the entire request body — in checklists it can include a national ID, phone, photo. Even though it’s debug (silenced in prod), all it takes is someone enabling debug on a pod for 5 minutes for this line to spill personal data into Loki. And once it’s in Loki, it stays there for the retention period. This is a case worth rethinking strategy: log only the ID of the freshly created form, and keep the full payload out.
For request payloads, the correct pattern starts at the entry — Absinthe, for instance, has explicit variable filtering:
config :absinthe, Absinthe.Logger,
pipeline: true,
filter_variables: ["access_token", "password", "secret"]
Explicit list of keys to filter. Not bulletproof (you have to remember to include each new field), but better than raw inspect/1.
For an extra safety net — catching what slips past the entry filter — LoggerJSON accepts redactors applied to every log output:
# config/runtime.exs (prod)
config :logger, :default_handler,
formatter:
{LoggerJSON.Formatters.Basic,
metadata: :all,
redactors: [
{LoggerJSON.Redactors.RedactKeys, ["password", "token", "cpf", "access_token"]}
]}
Every key on that list, found at any depth in the serialized metadata, comes out as "[REDACTED]". It doesn’t replace entry-side filtering (which prevents the object from being constructed in the first place), but it works as a last line of defense for the times someone forgets.
Checklist: how to evaluate a log line before merging
Every time you add (or touch) a Logger.x in a PR, run through this checklist:
Layer 1 — The line
- Right level?
erroris “wake someone up”,warningis “degraded but recovered”,infois “business event”,debugis “active investigation”. When in doubt, drop a level. - Fixed message? If the string changes per call (
#{...}), you’re manufacturing cardinality. Move what changes to metadata. - No
inspect/1of a big struct? Especially withoutprintable_limit: :infinity. If you need to see the entire object, it’s not a log, it’s a dump.
Layer 2 — Context
- Is there an ID linking this line to the rest of the execution?
request_id,resource_id,organization_id,trace_id. The app formatter already declares these fields — just use them. - Do you need to repeat the ID in every call? Almost never. Set it once at the entry-point with
Logger.metadata/1(hexdocs) and the whole tree inherits.
Layer 3 — Content
- PII out? Token, national ID, email, phone, photo, entire request payload don’t go in logs. Explicit list of fields to filter, Absinthe
filter_variablesstyle, plus a last-line-of-defense redactor inLoggerJSON. - Doesn’t duplicate what telemetry already measures? If the question this line answers is “how many / in how much time”, the answer is metric, not log.
Layer 4 — Volume
- Is it on a hot path? LiveView event handler, Broadway processor, Oban worker in a loop, GenServer with a full mailbox. If yes: the natural level is
debug. If your intuition is that it needs to beinfo, articulate why first — usually one of those shows up. - Is the line repeated many times with the same message? Consider whether it’s “individual event” (log) or “frequency metric” (telemetry).
Layer 5 — Investigability
- If a human has to find this line in Loki at 3 a.m., can they? Is the message distinguishable? Are there labels that allow narration (
| json | msg="...")? Is there atrace_idto cross with OTel? - If this line shows up 10,000 times/h, will anyone notice? If they won’t, the level is too high. If they will, the level is right — but it’s better to become telemetry.
In one sentence
A log is the cheapest channel to communicate with your three-months-from-now self. Don’t waste it.
The numbers that opened the article — 16 lines/second written by the example team, 5,406 errors/hour, 75% of it from 3 files — aren’t “too much log”. They’re poorly calibrated log. The difference between the current state and a healthy state isn’t cutting it in half: it’s each line having a predicted reader and a specific job.
This article isn’t asking anyone to open a PR tomorrow. The goal is just to put each of the points above in the team’s head. Next time you open a file and have to add (or touch) a Logger.x, may the question “does this log answer any of the 5 checklist layers?” come along, and may the answer be conscious.
And next time you go to write Logger.error("Error doing …"), remember: 1,628 other identical lines are already in the queue, and none of them is an error.
References
- Logger (hexdocs) — the whole API, with the metadata part
Logger.metadata/1— per-process propagation- logger_json (hexdocs) — JSON formatter and redactors
- :telemetry (hexdocs) — metric vs log
- Telemetry.Metrics (hexdocs) — declarative metric definitions
- TelemetryMetricsPrometheus (hexdocs) — Reporter that exposes
/metrics - Loki — best practices for labels — why cardinality in messages hurts
- 12-Factor App — XI. Logs — the classic opinion on log as event streams
- GenServer the “rabbit hole” (part 1) — part 1 of the series
- GenServer the “rabbit hole” — Part 2: The Silent Cascade — part 2
- DORA Metrics — where MTTR becomes a number, and where quality logs become a measurable difference
