Engineering · 2026-04-22 · 7 min

Why Event-Driven Beats Request/Response for Realtime Workloads

Study notes on event-driven architecture — what it actually buys you, and the failure modes I want to recognise before I hit them.

// study notes

These are learning notes, not war stories — patterns I'm internalising from reading and exploration as I grow further into backend engineering.

A pattern I keep encountering as I read about realtime systems is the same painful arc: a team starts with a synchronous, REST-shaped pipeline because it is the most natural thing to build, and the architecture quietly degrades under its own success. Every spike in user activity translates, almost linearly, into spikes on the upstream providers. When one provider gets slow, the API gets slow. When the API gets slow, the dashboard gets slow. When the dashboard gets slow, the people who pay for the product get annoyed.

The escape hatch — described in book after book, post-mortem after post-mortem — is to convert the pipeline into something event-driven: producers emit onto a broker, a thin set of stateless consumers do the actual work, and the sinks are idempotent. The architecture stops being "send and wait" and becomes "publish and forget, then observe."

These are the notes I have collected while trying to internalise why this pattern keeps showing up, and where it stops working.

What events actually buy you

Events buy you three things that REST does not. First, decoupling in time: the producer does not need the consumer to be alive, fast, or even present. Second, replay: the broker becomes a short-term source of truth for "what happened" rather than just "what is true now," which is invaluable for both debugging and re-deriving downstream state. Third, fan-out for free: adding a new consumer is, in most cases, a configuration change rather than an architectural one.
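A toy way to make those three properties concrete is an in-memory append-only log with one read offset per consumer. This is only a sketch for building intuition: `ToyLog`, `publish`, `poll`, and `rewind` are names invented here, not any broker's client API, and a real broker adds persistence, partitions, and retention on top.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only log; a stand-in for a broker topic, not a real client API. */
final class ToyLog {
    private final List<String> events = new ArrayList<>();
    private final Map<String, Integer> offsets = new HashMap<>(); // consumer name -> next index to read

    /** Producer side: append and return immediately, whether or not anyone is listening. */
    void publish(String event) {
        events.add(event);
    }

    /** Consumer side: return everything this consumer has not read yet, then advance its offset. */
    List<String> poll(String consumer) {
        int from = offsets.getOrDefault(consumer, 0);
        List<String> batch = new ArrayList<>(events.subList(from, events.size()));
        offsets.put(consumer, events.size());
        return batch;
    }

    /** Replay: rewind one consumer to the start of the log without touching the others. */
    void rewind(String consumer) {
        offsets.put(consumer, 0);
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        log.publish("payment.authorised:42");        // no consumer exists yet (decoupling in time)
        log.publish("payment.authorised:43");
        System.out.println(log.poll("ledger"));      // first consumer catches up later
        System.out.println(log.poll("fraud-check")); // new consumer, no producer change (fan-out)
        log.rewind("ledger");
        System.out.println(log.poll("ledger"));      // same events again (replay)
    }
}
```

All three polls in `main` return both events. The producer never learns how many consumers exist or when they read, which is the whole point.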

For realtime workloads — anything where the relevant unit of work is a thing that occurred rather than a question being asked — these properties compound. A trade execution, a sensor reading, a user message, a payment authorisation: all of these are events first, and the business logic that responds to them is almost always better modelled as a graph of consumers than as a chain of synchronous calls.

Where the pattern breaks down

It would be dishonest to write only the optimistic half of this. Event-driven systems pay for their flexibility with a different operational tax, and most of what I have read about distributed-systems failures at scale comes back to this part.

Ordering is one. Kafka gives you per-partition ordering, which is sufficient for most workloads but lies to you if you pretend it is global ordering. Judging by how often the story comes up in what I have read, a lot of teams have spent a quarter trying to reconstruct a global event order they never actually needed. The fix in every account is the same: redesign the consumer to be order-independent, or partition on the natural ordering key.
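Partitioning on the natural ordering key can be sketched in a few lines. The version below is a simplification for intuition only: it uses `String.hashCode()` rather than the hash a real Kafka producer applies, and the `"key:payload"` event format and the `KeyPartitioning` class are made up for the example. What it shows is that every event for one key lands in one partition, in the order it was produced.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy key-based partitioner; illustrative only, not Kafka's actual hashing. */
final class KeyPartitioning {
    /** The same key always maps to the same partition index in [0, partitions). */
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Append each "key:payload" event to the partition chosen by its key. */
    static List<List<String>> route(List<String> events, int partitions) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            out.add(new ArrayList<>());
        }
        for (String event : events) {
            String key = event.substring(0, event.indexOf(':'));
            out.get(partitionFor(key, partitions)).add(event);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> events = List.of(
            "acct-1:authorised", "acct-2:authorised", "acct-1:captured", "acct-1:refunded");
        List<List<String>> partitions = route(events, 3);
        for (int p = 0; p < partitions.size(); p++) {
            System.out.println("partition " + p + " -> " + partitions.get(p));
        }
    }
}
```

Events for `acct-1` keep their relative order because they share a partition. Nothing is promised about how `acct-1` and `acct-2` events interleave, and that is exactly the global ordering most workloads turn out not to need.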

// snippet (Java)

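// idempotent consumer: dedupe on a stable event key
// `seen` is assumed to behave like Spring Data Redis ValueOperations<String, String>
@KafkaListener(topics = "payments.authorised")
public void on(PaymentEvent e) {
    var key = e.idempotencyKey();          // producer-assigned, stable
    // setIfAbsent returns a boxed Boolean; only an explicit false means "already seen"
    if (Boolean.FALSE.equals(seen.setIfAbsent(key, "1", Duration.ofHours(24)))) {
        log.debug("replay ignored: {}", key);
        return;
    }
    ledger.apply(e);   // if this throws, the key stays marked: release it or apply transactionally
}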

Exactly-once is the other lie. In practice, as soon as a sink outside the broker is involved, you get at-least-once delivery plus idempotent processing, and you build the second one yourself. Every event needs a stable, deterministic key, and every sink needs to know how to recognise a replay. The teams that skip this rediscover it through duplicate emails to customers, which seems to be an expensive way to learn it.
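To see both halves together, here is a minimal in-memory sketch of a deterministic key and a sink that recognises a replay. Everything in it is hypothetical: the `IdempotentLedger` class, the `paymentId` and `eventType` fields, and the choice of a name-based UUID as the key recipe are assumptions made for illustration, not a prescription.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** In-memory idempotent sink; the field names and key recipe are hypothetical. */
final class IdempotentLedger {
    private final Set<String> applied = new HashSet<>(); // keys already processed
    private long balanceCents = 0;

    /** Derive the key from business identity only: no timestamps, no random values. */
    static String eventKey(String paymentId, String eventType) {
        byte[] identity = (paymentId + "|" + eventType).getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(identity).toString(); // name-based UUID: same input, same output
    }

    /** Apply the amount once per key; returns false when the event is a replay. */
    boolean apply(String key, long amountCents) {
        if (!applied.add(key)) {
            return false; // redelivery under at-least-once: ignore
        }
        balanceCents += amountCents;
        return true;
    }

    long balanceCents() {
        return balanceCents;
    }

    public static void main(String[] args) {
        IdempotentLedger ledger = new IdempotentLedger();
        String key = eventKey("pay-42", "authorised");
        System.out.println(ledger.apply(key, 1_500));                                // first delivery
        System.out.println(ledger.apply(eventKey("pay-42", "authorised"), 1_500));   // redelivery
        System.out.println(ledger.balanceCents());
    }
}
```

The second `apply` is rejected and the balance stays at 1500. In a real sink, the set of applied keys and the state change would need to live in the same store and commit together; otherwise a crash between the two steps reopens the duplicate (or lost-update) window.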

The other lesson I have absorbed from reading is that an event bus is wrong for genuinely request/response workloads — a UI fetching its own user profile, a CRUD admin tool, an internal search. Adding Kafka to those flows inflates the operational surface without buying anything in return.

The mental model I keep coming back to

When thinking about a new service, the question I try to ask is: is the unit of work here a question, or a fact? Questions want synchronous APIs and short-lived state. Facts want logs, brokers, and consumers. Most real systems are a mix of both, and the architectural skill — the part that takes years to develop — is being honest about which is which.

Realtime workloads, from everything I have read, are almost always made of facts. Treating them as anything else seems to be how teams end up with a polished frontend, slow APIs, and a quietly burning ops channel at 3am.

// written by Fikrat · feedback welcome at fikretallahquluzade@gmail.com