Infrastructure · 2026-03-30 · 9 min

Scaling Stateful WebSocket Infrastructure

Study notes on what makes WebSockets hard at scale — sticky routing, presence, cross-region fan-out — and the patterns that keep showing up in real systems.

// study notes

These are learning notes, not war stories — patterns I'm internalising from reading and exploration as I grow further into backend engineering.

WebSockets are deceptive. The protocol itself is small enough that a working server can be implemented in an afternoon. The hard part is everything that surrounds the connection once there is more than one node, one region, or one customer paying attention.

The pattern I keep seeing in post-mortems and engineering blogs is the same: teams hit a wall where the limiting factor is no longer the WebSocket itself, but the surrounding infrastructure pretending the WebSocket is stateless. These are the notes I have collected while reading about that wall.

The sticky-routing problem

The first painful moment is when you go from one node to many. Each connection is a long-lived TCP session, and the load balancer cannot reshuffle it the way it would reshuffle HTTP requests. If your routing layer knows nothing about sessions, each reconnect can land on a different node, and you spend the rest of your day reconciling state.

The fix is not technically complicated (consistent hashing on a session ID, or sticky sessions at the load balancer), but it changes how you think about deploys. A rolling restart now drops connections, so clients must reconnect with backoff and jitter, and the reconnect storm has to be tested before it happens in production. The pattern shows up repeatedly in incident reports: the reconnect storm itself becomes the outage, with every client coming back at once and exhausting the new pool of nodes before they have finished warming up.
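
To make the two moving parts concrete, here is a minimal sketch of both. It swaps in rendezvous hashing (a close relative of consistent hashing that is shorter to write) for the routing decision, and uses exponential backoff with full jitter for the client side. The names `fnv1a`, `pickNode`, and `reconnectDelayMs`, along with the default base and cap values, are my own assumptions for illustration, not taken from any particular load balancer or client library.

```typescript
// FNV-1a 32-bit hash: small and deterministic, good enough for a sketch.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Rendezvous (highest-random-weight) hashing: score every node against the
// session ID and pick the highest score. Removing a node only remaps the
// sessions that node owned; everyone else keeps their node.
function pickNode(sessionId: string, nodes: readonly string[]): string {
  let best: string | undefined;
  let bestScore = -1;
  for (const node of nodes) {
    const score = fnv1a(`${node}|${sessionId}`);
    if (score > bestScore) {
      best = node;
      bestScore = score;
    }
  }
  if (best === undefined) throw new Error("no nodes available");
  return best;
}

// Exponential backoff with full jitter: the delay is uniform in
// [0, min(capMs, baseMs * 2^attempt)), so clients dropped at the same moment
// spread their reconnects out instead of returning in one wave.
// baseMs and capMs are illustrative defaults, not recommendations.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

A client would call `reconnectDelayMs(attempt)` before each retry and reset `attempt` to zero after a successful connect. The injectable `random` parameter exists only so the storm behaviour can be tested deterministically.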

Presence is harder than messaging

Most teams underestimate presence. "Is this user online" sounds like a boolean. In a distributed system, it is more like an eventually-consistent claim with a TTL.

The pattern that keeps appearing in real architectures: each node owns a local view of who is connected to it, written to Redis with a short TTL that the node renews periodically. A separate aggregator reads across keys to build a system-wide view. Disconnects are not instant, because a stale entry lingers until its TTL expires a few seconds later, but the model is honest about that and the UI can hide it.

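The naive alternative is for every node to broadcast every connect/disconnect to every other node. This works for small clusters, but each event costs N - 1 messages in an N-node cluster, so traffic grows with both cluster size and connection churn, and a reconnect storm multiplies it at the worst moment.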

// snippet (TypeScript)

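// per-node presence with TTL; aggregator reads across keys
// (assumes an ioredis-style client; `socket`, `nodeId` and openSockets() come from the host app)
const PRESENCE_TTL = 25;        // seconds; must stay above the heartbeat interval
const HEARTBEAT_MS = 10_000;
const presenceKey = (userId: string) => `presence:${userId}:${nodeId}`;

socket.on("connect", async (s) => {
  await redis.set(presenceKey(s.userId), "1", "EX", PRESENCE_TTL);
});

setInterval(async () => {
  try {
    // SET rather than EXPIRE: EXPIRE does nothing for a key that has already lapsed,
    // so a node that missed a few heartbeats would never re-announce its users.
    const batch = redis.pipeline();
    for (const s of openSockets()) {
      batch.set(presenceKey(s.userId), "1", "EX", PRESENCE_TTL);
    }
    await batch.exec();         // one round trip instead of one per socket
  } catch (err) {
    // swallow and retry on the next tick; the TTL expires the keys if Redis stays down
    console.error("presence heartbeat failed", err);
  }
}, HEARTBEAT_MS);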
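
The snippet above covers the node side only. For the aggregator, a minimal sketch might look like the following, assuming the same `presence:{userId}:{nodeId}` key layout and that neither ID contains a colon. `buildPresenceView` and `isOnline` are hypothetical names. The function takes the keys as plain input so it stays testable; in a real deployment they would come from iterating Redis with `SCAN` and a `MATCH presence:*` pattern.

```typescript
// userId -> set of nodes currently holding a connection for that user
type PresenceView = Map<string, Set<string>>;

// Fold a list of live presence keys into a system-wide view.
// Keys that do not match the expected layout are ignored.
function buildPresenceView(keys: Iterable<string>): PresenceView {
  const view: PresenceView = new Map();
  for (const key of keys) {
    const parts = key.split(":");
    if (parts.length !== 3 || parts[0] !== "presence") continue;
    const userId = parts[1];
    const nodeId = parts[2];
    if (!userId || !nodeId) continue;
    let nodes = view.get(userId);
    if (nodes === undefined) {
      nodes = new Set<string>();
      view.set(userId, nodes);
    }
    nodes.add(nodeId);
  }
  return view;
}

// A user is "online" if at least one node still holds an unexpired key.
// This is a claim about the recent past, not a guarantee about right now.
function isOnline(view: PresenceView, userId: string): boolean {
  return view.has(userId);
}
```

Keeping the node ID in the key is what lets a user with two tabs on two nodes stay online when only one of those connections drops.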

Cross-region fan-out

Once you have users in more than one region, the question is no longer "where is this user" but "how does a message addressed to a room reach every node holding a member of that room."

Two approaches show up repeatedly:

Redis pub/sub at the edge. Cheap, fast, and gives up on durability. Fine for typing indicators and ephemeral signals; insufficient for messages anyone cares about.
Redis Streams (or Kafka) for durable rooms. Slightly higher latency, but consumers can resume on reconnect, and the stream or topic becomes the source of truth for "what was said in this room." Most chat-style architectures eventually converge on something like this.

The hybrid — durable for messages, pub/sub for ephemera — is the architecture I would reach for if I were designing one today.
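
As a rough illustration of that split, here is an in-memory sketch. `RoomLog` stands in for a durable log (a Redis stream or Kafka topic in a real system) and the listener set stands in for pub/sub. All class and method names are hypothetical, and nothing here talks to a real broker.

```typescript
interface LogEntry {
  id: number;
  payload: string;
}

// In-memory stand-in for a durable, ordered log with monotonically increasing IDs.
class RoomLog {
  private readonly entries: LogEntry[] = [];
  private nextId = 1;

  append(payload: string): LogEntry {
    const entry: LogEntry = { id: this.nextId++, payload };
    this.entries.push(entry);
    return entry;
  }

  // Everything after the last ID the client saw; 0 means "from the start".
  readAfter(lastSeenId: number): LogEntry[] {
    return this.entries.filter((e) => e.id > lastSeenId);
  }
}

type Listener = (payload: string) => void;

class HybridRoom {
  private readonly log = new RoomLog();
  private readonly listeners = new Set<Listener>();

  // Returns an unsubscribe function, called when the connection closes.
  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Ephemeral signals (typing indicators): delivered to whoever is
  // connected right now and never stored.
  publishEphemeral(payload: string): void {
    for (const listener of this.listeners) listener(payload);
  }

  // Durable messages: appended to the log first, then pushed live.
  // Members who were offline catch up through replayAfter.
  publishDurable(payload: string): LogEntry {
    const entry = this.log.append(payload);
    for (const listener of this.listeners) listener(payload);
    return entry;
  }

  replayAfter(lastSeenId: number): LogEntry[] {
    return this.log.readAfter(lastSeenId);
  }
}
```

The design choice worth noticing is the ordering in `publishDurable`: the append happens before the live push, so anything a client has seen live is guaranteed to be in the log it later resumes from.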

The boring parts that seem to matter most

Three things keep coming up in incident write-ups, more often than any architectural decision:

Per-connection rate limits at the edge. Without them, one misbehaving client can drag a node down.
Heartbeats and idle timeouts. Half-open connections accumulate silently and exhaust file descriptors at the worst time.
Backpressure in the write path. Slow consumers must not stall the server. Either drop, queue with a bounded buffer, or disconnect — but make the choice explicitly.
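
None of the three needs much code, which is part of why they get skipped. The sketch below shows one possible shape for each: a token bucket for the rate limit, a staleness check for half-open connections, and a bounded outbound queue with an explicit overflow policy. All names and parameters are my own assumptions, and time is passed in as a plain number so the logic can be tested without real clocks or sockets.

```typescript
// 1. Per-connection rate limit: a token bucket that refills continuously.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // True if the message may proceed, false if the client is over budget.
  tryConsume(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// 2. Heartbeats: a connection that has not answered a ping within the
// timeout is treated as half-open and should be closed by the caller.
function isHalfOpen(lastPongMs: number, nowMs: number, idleTimeoutMs: number): boolean {
  return nowMs - lastPongMs > idleTimeoutMs;
}

// 3. Backpressure: a bounded queue per connection. What happens when it
// fills up is chosen explicitly at construction time.
type OverflowPolicy = "drop-oldest" | "disconnect";
type EnqueueResult = "queued" | "dropped-oldest" | "disconnect";

class OutboundQueue {
  private readonly items: string[] = [];
  private readonly maxItems: number;
  private readonly policy: OverflowPolicy;

  constructor(maxItems: number, policy: OverflowPolicy) {
    this.maxItems = maxItems;
    this.policy = policy;
  }

  enqueue(message: string): EnqueueResult {
    if (this.items.length < this.maxItems) {
      this.items.push(message);
      return "queued";
    }
    if (this.policy === "disconnect") return "disconnect";
    this.items.shift(); // evict the oldest message to make room
    this.items.push(message);
    return "dropped-oldest";
  }

  // Called when the socket is writable again; returns and clears the backlog.
  drain(): string[] {
    return this.items.splice(0, this.items.length);
  }
}
```

Which overflow policy is right depends on the traffic: dropping the oldest entry suits ephemeral signals, while disconnecting a slow consumer is safer when it can resume from a durable log afterwards.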

Each of these sounds like a checklist item. Each of them is the subject of at least one post-mortem I have read carefully.

The takeaway I keep returning to

WebSockets are a distributed-systems problem, not a protocol problem. The protocol is the easy part. The interesting work is in routing, presence, fan-out, and the operational discipline around long-lived connections that survive deploys, network blips, and badly behaved clients.

When that work is done well, the system feels invisible — which seems to be, in this kind of infrastructure, the highest compliment.

// written by Fikrat · feedback welcome at fikretallahquluzade@gmail.com