Beyond Job Titles: SRE, DevOps, and the Sysadmin Renaissance
TL;DR
- “SRE” ≠ dashboards. It’s reliability governance around SLI/SLOs, error budgets, and change policy.
- A seasoned sysadmin already covers the skills often relabeled “DevOps” or “SRE.”
- Hire for capability lanes, not titles. You’ll improve reliability and delivery faster.
The title trap
We’ve renamed operations work three times in a decade. The craft didn’t change: design for reliability, automate the boring parts, know your systems, and turn incidents into safer releases. Senior sysadmins have done this for years—long before we called it DevOps or SRE.
Confusion peaks when companies say “we need SRE” but really mean “we need observability.” Observability is essential, but without SLOs, error budgets, and release gates, you’ve built a speedometer without brakes.
SRE is not your observability team
- Observability platform = telemetry plumbing (metrics/logs/traces), golden dashboards, alert templates, paved-road instrumentation.
- SRE practice = reliability governance: SLI/SLOs, error budgets that constrain change, incident command & postmortems, production-readiness checks, and toil reduction as policy.
Simple RACI to reset expectations
| Area | Observability Platform | Product Teams | SRE Practice |
|---|---|---|---|
| Telemetry ingestion, dashboards, alert templates | R/A | C | C |
| Define SLIs/SLOs per user journey | C | A/R | R |
| Error-budget policy → gates in CI/CD | I | R | A/R |
| Incident command & postmortems | I | R | A/R |
| Production-readiness & change risk | C | R | A/R |
The 3-in-1 skills map
A senior/lead sysadmin in the USENIX LISA mold already covers the capability lanes that modern job postings split across. Map your hiring to these lanes and stop arguing over titles.
| Capability lane | What a seasoned sysadmin already does | "DevOps" label (today) | "SRE" label (today) | Fast assessment question |
|---|---|---|---|---|
| Declarative config & modeling | Treats config as desired state; reduces complexity | IaC + GitOps, policy in CI | Desired-state guardrails to cut toil | "Show me a plan/apply & policy flow you shipped." |
| Automation & scripting | Eliminates repeatables; builds tools for scale | CI/CD, pipelines, self-service | Toil kept ≤50%; ops as code | "Which 8h monthly task did you delete permanently?" |
| OS internals & performance | Tunes kernel/JVM/filesystems; reads flamegraphs | Perf gates in pipelines | SLIs for latency/availability; capacity planning | "Pick a p95 outage you fixed—what changed?" |
| Networking & distributed systems | DNS/routing, client/server, failure domains | Service networking, TLS/mTLS, peering | Resilient topologies tied to SLOs | "Draw blast radius and failover for X." |
| Identity, backups & day-2 ops | Users/secrets/backups/restore drills | Immutable images, patch orchestration | RPO/RTO as SLOs; restore proofs | "When did you last prove a restore?" |
| Observability | Builds/uses dashboards for hypotheses | Telemetry paved roads | SLO burn-rate alerts, noise hygiene | "What alerts page humans vs open tickets?" |
| CI/CD & release | Scripts installs → releases | Four Keys focus (deploy frequency, lead time, change-fail %, MTTR) | Error-budget-gated rollout (canary/soak/rollback) | "When do you freeze changes, and exactly why?" |
| Incident lifecycle | Leads peers, liaises vendors | Runbooks, on-call ergonomics | IC role, blameless PMs, action closure | "Show a PM that changed policy, not just code." |
| Reliability governance | Scales 24x365 ops sensibly | Platform guardrails & enablement | Readiness reviews, SLO coverage, cost/reliability tradeoffs | "What's 'reliable enough' for this tier?" |
Quick diagnostic (5 questions)
Answer “yes” or “no” for your top services:
- Do you have SLIs/SLOs tied to user journeys?
- Does an error budget actually gate releases (canary/soak/freeze)?
- Are pages triggered by SLO burn, not raw CPU?
- Do postmortems generate tracked, closed actions?
- Can you prove restore within RPO/RTO SLOs?
≤2 yes: you have observability. ≥3 yes: you have an SRE practice.
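Questions 2 and 3 rest on simple arithmetic, small enough to sketch. The SLO, window, and traffic numbers below are illustrative, not from any particular service:

```python
# Error-budget arithmetic for an availability SLO (illustrative numbers).

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime in the SLO window, in minutes."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(bad_fraction: float, slo: float) -> float:
    """How fast the budget is burning: 1.0 means exactly on budget."""
    return bad_fraction / (1 - slo)

# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)

# If 0.5% of requests failed over the last hour, the burn rate is about 5x.
rate = burn_rate(0.005, 0.999)

# A common fast-burn threshold: page when the 1h burn rate exceeds 14.4,
# i.e. the monthly budget would be gone in roughly two days.
page = rate > 14.4   # here: below threshold, so a ticket, not a page
```

Alerting on burn rate rather than raw resource metrics is exactly what question 3 asks for: the pager fires only when users are actually losing budget.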
30/60/90: institute capability-first reliability (no reorg required)
Day 0: Rename the work, not the people. Charter two streams:
- Observability Platform (plumbing + paved roads)
- Reliability Practice (SLOs, budgets, incident, readiness)
30 days — clarity
- Pick Top-5 services. Draft SLIs/SLOs and wire burn-rate alerts.
- Kill noisy “CPU 95%” pages; keep them as tickets, not pages.
- Add rollback to every deployment path.
60 days — control
- Enforce error-budget gates (auto-freeze on fast burn).
- Create a one-page Production Readiness checklist.
- Start blameless postmortems with action SLAs.
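The "enforce error-budget gates" item can be a small, testable policy function wired into CI. A minimal sketch, with the burn-rate inputs passed as plain arguments (in practice they would come from your SLO store, e.g. a Prometheus query); the thresholds are common defaults, not prescriptions:

```python
# Sketch of an error-budget release gate, run as a CI/CD step.

FAST_BURN = 14.4   # 1h burn rate that empties a 30-day budget in ~2 days
SLOW_BURN = 3.0    # sustained 6h burn rate worth slowing releases for

def release_decision(budget_remaining: float, burn_1h: float, burn_6h: float) -> str:
    """Map budget state to a rollout policy for the pipeline."""
    if budget_remaining <= 0 or burn_1h >= FAST_BURN:
        return "freeze"        # auto-freeze: no releases until burn subsides
    if burn_6h >= SLOW_BURN:
        return "canary-only"   # small rollouts with soak time and easy rollback
    return "normal"

# A CI step would fail the pipeline whenever the decision is "freeze".
decision = release_decision(budget_remaining=0.4, burn_1h=2.0, burn_6h=1.1)
```

The point is that the freeze is a computed policy outcome, not a meeting: the budget constrains change automatically.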
90 days — scale
- Expand SLO/error budgets to Top-20 services.
- Run monthly game-days.
- Track SLO coverage %, MTTR, toil %, and rollback rate.
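Monthly game-days are a natural place to produce the "restore proofs" the skills map and diagnostic call for. A minimal sketch of a timed restore drill; `restore_latest` and `latest_backup_age_s` are hypothetical hooks standing in for your real backup tooling (restic, pgBackRest, etc.), and the RTO/RPO values are placeholders:

```python
# Sketch of a timed restore drill that emits RPO/RTO evidence.
import time

RTO_SECONDS = 30 * 60   # restore must finish within 30 minutes (placeholder)
RPO_SECONDS = 15 * 60   # newest backup must be under 15 minutes old (placeholder)

def restore_drill(restore_latest, latest_backup_age_s) -> dict:
    """Run a restore, time it, and check it against RPO/RTO targets."""
    start = time.monotonic()
    ok = restore_latest()                  # hypothetical: True on verified restore
    elapsed = time.monotonic() - start
    return {
        "restore_ok": ok,
        "rto_met": ok and elapsed <= RTO_SECONDS,
        "rpo_met": latest_backup_age_s() <= RPO_SECONDS,
        "elapsed_s": round(elapsed, 1),
    }

# Game-day usage with stubbed hooks; the returned dict is the evidence you log.
report = restore_drill(lambda: True, lambda: 300)
```

Logging this report each game-day turns "when did you last prove a restore?" into a query, not a memory test.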
How to interview for 3-in-1 (signals that cross titles)
- Show me a Terraform/Helm change that prevented an outage rather than fixed one.
- Live task: take a 10-step runbook; ship a one-click job with retries and idempotency.
- Whiteboard: draw traffic, identity, and failure domains for a multi-AZ service; mark where you’d place SLOs.
- Evidence: a postmortem whose actions changed policy (alerts, rollout gates), not only code.
- Restore proof: produce logs/screenshots/CLI showing a timed restore within RPO/RTO.
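The live task above (runbook → one-click job with retries and idempotency) has a small core worth sketching. Step names and the completion-marker set are illustrative; a real job would persist markers somewhere durable:

```python
# Sketch: turn a multi-step runbook into an idempotent job with retries.
# Completed steps record a marker, so re-runs skip finished work; transient
# failures retry with exponential backoff.
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.1):
    """Call fn, retrying with exponential backoff on any exception."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

def run_job(steps, done: set):
    """Run (name, fn) steps in order; `done` persists across reruns."""
    for name, fn in steps:
        if name in done:
            continue               # already completed on a previous run: skip
        with_retries(fn)
        done.add(name)             # mark only after the step succeeds
    return done

# Re-running with the same `done` set is a no-op for completed steps,
# which is what makes the one-click job safe to click twice.
```

A candidate who reaches for this shape unprompted (markers before retries, mark-after-success) is showing the cross-title signal the exercise is designed to surface.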
The hiring message for your JD or deck
We hire capabilities, not titles. The work historically done by a senior systems administrator already spans today’s “Sysadmin,” “DevOps,” and “SRE.” The only material addition in SRE is the explicit contract—SLIs/SLOs and error budgets—that governs change. We assess candidates across capability lanes (declarative config, automation/toil reduction, OS+perf, networking, day-2 ops, observability, CI/CD, incident, SLOs/error budgets, and 24x365 leadership) and place them where they deliver the most leverage.