Infrastructure · 2026-03-04 · 8 min

Notes on Running Things in Linux Production

Study notes on the operational discipline behind production Linux — the kernel-facing tools, the limits, the boring parts that repeatedly turn out to matter most.

// study notes

These are learning notes, not war stories — patterns I'm internalising from reading and exploration as I grow further into backend engineering.

Most of the systems I read about and work with run on Linux. Not in the abstract "containerised on a managed control plane" sense, but in the concrete sense of "if this host misbehaves, someone has to SSH in at an unsociable hour and reason about it." The following is a short list of the lessons that recur in operational writing — the ones I am internalising as I move further into backend work.

1. Learn the kernel-facing observability tools before you need them

The lesson that appears in almost every incident retrospective is the same: when a production node behaves badly, application metrics often tell you little, because the problem sits below the application. The tools that actually matter are vmstat, iostat, ss, pidstat, and, for the brave, perf. None of these are exotic. All of them are packaged by every reasonable distribution, though not always installed by default: iostat and pidstat ship in sysstat, and perf usually needs its own package, so install them before the incident rather than during it. Knowing what "good" looks like for each of them on a healthy host is more valuable than any dashboard.

The same logic applies to dmesg. OOM kills, disk I/O errors, and hung-task warnings all land in the kernel ring buffer, and half the worst-night incidents in the write-ups I have read leave a fingerprint there hours before anyone notices.

bash // snippet
# what I actually run when a box "feels slow"
vmstat 1 5                 # cpu / memory / io snapshot
iostat -xz 1 5             # per-device await (latency), %util
ss -s                      # socket summary: estab, orphaned, timewait
pidstat -t -p "$PID" 1 3   # per-thread cpu; set PID to the suspect process first
dmesg -T | tail -n 40      # kernel events with human timestamps
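
Knowing what "good" looks like implies having recorded it somewhere. A minimal sketch of that idea, assuming a healthy host and a writable directory: the function names `snapshot` and `capture_baseline` and the output layout are my own invention, not a standard tool.

```bash
#!/usr/bin/env bash
# Sketch: save the output of the usual triage commands while the host is healthy,
# so there is something to diff against during an incident.

# snapshot LABEL DIR CMD [ARGS...]: run CMD and store its output in DIR/LABEL.txt.
# A missing tool is recorded rather than treated as a failure.
snapshot() {
  local label=$1 dir=$2
  shift 2
  mkdir -p "$dir"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$dir/$label.txt" 2>&1
  else
    echo "skipped: $1 not installed" >"$dir/$label.txt"
  fi
}

# capture_baseline [DIR]: hypothetical wrapper around the commands listed above.
capture_baseline() {
  local dir=${1:-"./baseline-$(date -u +%Y%m%dT%H%M%SZ)"}
  snapshot vmstat "$dir" vmstat 1 5
  snapshot iostat "$dir" iostat -xz 1 5
  snapshot ss     "$dir" ss -s
  snapshot dmesg  "$dir" dmesg -T
  echo "$dir"
}
```

Running `capture_baseline /var/tmp/baseline` once a week on a quiet host would be one way to use it; the path and the cadence are assumptions, not recommendations from the write-ups.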

2. cgroups and limits are not optional

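JVM workloads in particular have an old reputation for misbehaving on shared hosts, and the reputation is partly earned. Before container support arrived in JDK 10 (backported to 8u191), the JVM sized its default heap from the host's RAM rather than from the cgroup limit. A JVM that does not know its memory limit will happily claim more than it should, and the kernel will eventually intervene with the OOM killer. The result is a process killed with SIGKILL, with no application-level warning.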

The remedy is unglamorous: container memory limits, JVM flags that respect them (MaxRAMPercentage and friends on modern JDKs), and explicit CPU shares. When done right, the OOM killer stops being an actor in your post-mortems.
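
Before tuning JVM flags it helps to confirm what limit the kernel is actually enforcing. The sketch below reads the memory limit from the cgroup filesystem, trying the cgroup v2 file first and then the v1 file. The function name `cgroup_mem_limit` and its optional root argument are assumptions made for illustration and testing.

```bash
#!/usr/bin/env bash
# Sketch: print the memory limit of the current cgroup, as seen from inside a container.
# The optional argument is a hypothetical override of the cgroup mount point.
cgroup_mem_limit() {
  local root=${1:-/sys/fs/cgroup}
  if [ -r "$root/memory.max" ]; then
    # cgroup v2: a byte count, or the literal string "max" when unlimited
    cat "$root/memory.max"
  elif [ -r "$root/memory/memory.limit_in_bytes" ]; then
    # cgroup v1: a byte count (a very large number when unlimited)
    cat "$root/memory/memory.limit_in_bytes"
  else
    # e.g. the root cgroup on a v2 host, which has no memory.max file
    echo "unknown"
  fi
}
```

Inside a container started with something like `docker run --memory=1g --cpu-shares=512 ...`, the function should print 1073741824, and `java -XX:MaxRAMPercentage=75.0 -XX:+PrintFlagsFinal -version | grep MaxHeapSize` shows the heap size the JVM derived from that limit. The 75 percent figure is only an example value.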

3. Logs are a contract, not a stream of strings

The single highest-leverage thing I have done across multiple teams is to enforce structured logging. JSON lines, a small fixed schema, a request or trace ID on every line. The moment logs become parseable, every downstream tool — jq, Loki, an aggregator, a junior engineer with twenty minutes — becomes more useful.

The corollary is that human-readable log lines should be treated as a developer convenience, not a production artefact. A production log line is an event with a schema. A "let me just print this out" log line is technical debt with a timestamp.
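
To make "an event with a schema" concrete, here is a minimal sketch of a JSON-lines logger for shell tooling. The field names (`ts`, `level`, `trace_id`, `msg`) and the function names are assumptions, not a standard. The escaping covers backslash, double quote, newline, tab, and carriage return only; where jq is available, building the object with `jq -n --arg` would be the more robust choice.

```bash
#!/usr/bin/env bash
# Sketch: emit one JSON object per line with a small fixed schema.

# json_escape STRING: escape the characters most likely to break a JSON string.
# Other control characters are NOT handled here.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}       # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}       # double quote
  s=${s//$'\n'/\\n}     # newline
  s=${s//$'\t'/\\t}     # tab
  s=${s//$'\r'/\\r}     # carriage return
  printf '%s' "$s"
}

# log_json LEVEL TRACE_ID MESSAGE: hypothetical schema with a trace ID on every line.
log_json() {
  local level=$1 trace_id=$2 msg=$3
  printf '{"ts":"%s","level":"%s","trace_id":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(json_escape "$level")" \
    "$(json_escape "$trace_id")" \
    "$(json_escape "$msg")"
}
```

With lines in this shape, a filter such as `jq -c 'select(.trace_id == "abc123")' app.log` pulls out one request's events; `app.log` and the trace ID are placeholders.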

4. Backups you have not restored are wishes

The lesson, repeated by everyone who has ever written about ops and ignored by everyone who has not yet been bitten: a backup is the restore, not the copy. If the restore has not been exercised in the last quarter, the system does not have a backup; it has a hopeful filesystem.
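
A restore drill can be small. The sketch below assumes a tar.gz backup that carries its own checksum manifest, then restores into a scratch directory and verifies every file against that manifest. The file name `MANIFEST.sha256` and both function names are hypothetical; real backup tools have their own verification commands, and a drill for a database would also need to start the service against the restored data.

```bash
#!/usr/bin/env bash
# Sketch: a backup that can prove it restores.

# make_backup SRC_DIR ARCHIVE: archive SRC_DIR together with a checksum manifest.
make_backup() {
  local src=$1 archive=$2 stage rc=0
  stage=$(mktemp -d) || return 1
  # Record a checksum for every file, with paths relative to SRC_DIR.
  ( cd "$src" && find . -type f -exec sha256sum {} + ) >"$stage/MANIFEST.sha256" \
    || { rm -rf "$stage"; return 1; }
  tar -czf "$archive" -C "$src" . -C "$stage" MANIFEST.sha256 || rc=$?
  rm -rf "$stage"
  return "$rc"
}

# restore_drill ARCHIVE: extract into a scratch directory and verify the manifest.
# Returns 0 only if extraction succeeds and every checksum matches.
restore_drill() {
  local archive=$1 scratch rc=0
  scratch=$(mktemp -d) || return 1
  tar -xzf "$archive" -C "$scratch" || { rm -rf "$scratch"; return 1; }
  ( cd "$scratch" && sha256sum -c MANIFEST.sha256 >/dev/null ) || rc=$?
  rm -rf "$scratch"
  return "$rc"
}
```

Scheduling `restore_drill` against the most recent archive and alerting on a non-zero exit is one way to keep "restored in the last quarter" true without relying on memory.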

5. Time, NTP, and the assumption you should never make

Do not assume wall-clock time agrees across hosts, and do not assume it is monotonic even on one host: an NTP correction can step the clock backwards. Do not assume it is accurate, either. NTP failures are subtle and survive long enough to produce confusing data. Many distributed-systems bugs that look like ordering problems are really clock problems in disguise. Where ordering matters, lean on logical clocks (sequence numbers, Lamport timestamps, broker offsets) rather than wall-clock time.
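
The Lamport rule is short enough to write down: increment the counter on every local event or send, and on receive set it to one more than the larger of the local and remote values. A minimal sketch in shell arithmetic, purely to illustrate the rule; the variable and function names are my own, and a real system would keep this counter inside the service, not in a script.

```bash
#!/usr/bin/env bash
# Sketch: a Lamport logical clock as a single integer counter.
lamport=0

# Call before every local event or outgoing message.
lamport_tick() {
  lamport=$((lamport + 1))
}

# Call with the timestamp carried by an incoming message.
lamport_recv() {
  local remote=$1
  lamport=$(( (remote > lamport ? remote : lamport) + 1 ))
}
```

For the wall clock itself, `chronyc tracking` (on hosts running chrony) or `timedatectl` reports whether the host currently considers itself synchronised, which is worth checking before trusting any cross-host timestamp comparison.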

6. Shells are infrastructure

Bash is not glamorous, and writing more than fifty lines of it is usually a sign that you should be writing something else. But the small operational tools — health checks, rotation scripts, smoke tests, deploy helpers — are easier to maintain when they are short, idempotent, and explicit. Use set -euo pipefail. Quote your variables. Treat shell scripts as production code, not as throwaway notes.
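
A minimal sketch of what "short, idempotent, and explicit" can look like in practice. The helper `ensure_line` is a hypothetical example: it appends a line to a file only when that exact line is missing, so running it twice leaves the file unchanged.

```bash
#!/usr/bin/env bash
# -e: exit on error, -u: error on unset variables, pipefail: a pipeline fails if any stage fails
set -euo pipefail

# ensure_line FILE LINE: idempotently make sure LINE is present in FILE.
ensure_line() {
  local file=${1:?file required}
  local line=${2:?line required}
  touch "$file"
  # -x whole-line match, -F fixed string, -- stops option parsing if LINE starts with "-"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >>"$file"
}
```

Every expansion is quoted, so a path or a line containing spaces survives intact, and the `${1:?...}` form makes a missing argument fail loudly instead of silently operating on an empty string.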

7. The discipline beneath all of the above

What ties these together is not a particular tool or technology. It is the posture: Linux production is best treated as a long conversation with a complicated, mostly cooperative system. The engineers who do this well are not the ones with the most exotic toolchains. They are the ones who have built a quiet feel for what their boxes do when healthy — and noticed early when the behaviour drifts.

That posture is what I am trying to build, one production system at a time.

// written by Fikrat · feedback welcome at fikretallahquluzade@gmail.com