jaugusto.dev

Software Engineer

Postmortems at Scale: Why Large Engineering Orgs Need AI-Assisted Incident Analysis

A postmortem is, at its core, a structured analysis of a failure: timeline, contributing factors, detection, mitigation, and follow-up actions. In a small team running a handful of services, this analysis is tractable. In an organization running hundreds of services across dozens of teams, the same process starts to break down in measurable ways.

The mechanics of the failure are still there. The signals are still there. What's missing is the bandwidth to correlate them.

The Engineering Value of a Good Postmortem

In mature engineering organizations, postmortems serve three distinct functions. First, they feed the operational feedback loop — adjusting SLOs, alert thresholds, runbooks, and on-call procedures based on observed system behavior. Second, they produce architectural signals: recurring failure modes in a service or boundary often indicate a design issue that won't be caught by code review or unit tests. Third, they generate a corpus of organizational knowledge that compounds over time, assuming someone is actually reading it.

When postmortems work, they're one of the highest-leverage artifacts in engineering. A single well-analyzed incident can shift roadmap priorities, justify reliability investment, or surface a latent risk across multiple services.

The question worth asking: across your last quarter of incidents, how many of the resulting action items shipped, and how many slid back into the backlog?

Why Postmortems Degrade at Scale

The constraints are practical, not cultural. Consider what producing an accurate postmortem actually requires:

  • A precise timeline reconstructed from Grafana metrics, application logs, and human communication.
  • Correlation between deployment events in GitHub and the failure signature.
  • Identification of contributing changes — code merges, config rollouts, feature flags.
  • Cross-referencing with related Jira tickets, prior incidents, and existing technical debt.
  • Extraction of decisions made during mitigation, typically captured only in Slack threads and Zoom war room recordings.

In a large organization, these signals are distributed across multiple systems with different access models, query languages, and data retention policies. The engineer writing the postmortem the next morning is doing manual data integration work that consumes hours and still produces an incomplete result.

How accurate is the timeline section of your last postmortem, measured against what your observability stack actually recorded?

What Changes When AI Has the Right MCP Context

Model Context Protocol (MCP) servers expose tools and data sources to an AI assistant through a standardized interface. Connect an LLM-based assistant to MCP servers for Slack, Zoom, Jira, Grafana, and GitHub, and it can query all five through that one interface, which makes the data integration problem tractable in a way it wasn't before.

The mechanics look like this. Given an incident ID and a time window, the assistant queries Grafana for relevant metric series, GitHub for deploys and merges within the window, Slack for messages in the incident channel, Zoom for the war room transcript, and Jira for related tickets. It correlates these signals temporally, produces a structured timeline with source citations, identifies the deploy or config change most likely associated with the regression, and drafts the contributing factors section based on the evidence.
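The correlation step can be sketched in a few lines. This is a minimal illustration, not an MCP client: it assumes the events have already been fetched from each source and normalized into a common shape. The `Event` type, the `build_timeline` and `suspect_deploy` helpers, the 30-minute lookback, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # e.g. "grafana", "github", "slack" (assumed labels)
    kind: str     # e.g. "alert", "deploy", "message" (assumed labels)
    summary: str


def build_timeline(events, start, end):
    """Return the events inside [start, end], ordered by timestamp."""
    in_window = [e for e in events if start <= e.timestamp <= end]
    return sorted(in_window, key=lambda e: e.timestamp)


def suspect_deploy(timeline, regression_start, lookback=timedelta(minutes=30)):
    """Return the latest deploy within `lookback` before the regression, or None.

    "Latest deploy before the regression" is a simple heuristic chosen for
    this sketch; it flags a candidate for review, not a root cause.
    """
    candidates = [
        e for e in timeline
        if e.kind == "deploy"
        and regression_start - lookback <= e.timestamp <= regression_start
    ]
    return max(candidates, key=lambda e: e.timestamp, default=None)


# Hypothetical, already-normalized events from three sources.
t0 = datetime(2024, 1, 1, 12, 0)
events = [
    Event(t0 + timedelta(minutes=12), "slack", "message", "on-call acknowledges page"),
    Event(t0 + timedelta(minutes=5), "github", "deploy", "checkout-service build 2"),
    Event(t0 - timedelta(minutes=50), "github", "deploy", "checkout-service build 1"),
    Event(t0 + timedelta(minutes=9), "grafana", "alert", "p99 latency above threshold"),
]

timeline = build_timeline(events, t0 - timedelta(hours=1), t0 + timedelta(hours=1))
for e in timeline:
    # Each line keeps its source so the draft can cite where the entry came from.
    print(f"{e.timestamp:%H:%M} [{e.source}] {e.summary}")

suspect = suspect_deploy(timeline, regression_start=t0 + timedelta(minutes=9))
print("suspect:", suspect.summary if suspect else "none")
```

Keeping the source on every event is what lets the drafted timeline carry citations, and it gives the engineer a quick way to check any entry against the system that recorded it.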

The engineer remains the author of judgment — root cause analysis, action items, organizational follow-ups. The AI handles correlation and first-pass synthesis.


The accuracy gain matters, but the second-order benefit is more significant: when postmortems are produced with a consistent structure and grounded in queryable sources, the postmortem corpus itself becomes analyzable.

From Individual Postmortems to Continuous Improvement

A single postmortem is a lagging indicator. A corpus of postmortems with structured fields — affected services, contributing factors, mitigation type, time-to-detect, time-to-mitigate — is a dataset that supports systematic analysis.
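As a rough sketch, the structured fields listed above could be captured in a record like the one below. The field names, the `Postmortem` class, and the sample incident are illustrative assumptions rather than a standard schema. Measuring time-to-mitigate from incident start rather than from detection is also a choice made here for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Postmortem:
    incident_id: str
    affected_services: list[str]
    contributing_factors: list[str]
    mitigation_type: str          # e.g. "rollback", "config revert" (assumed values)
    started_at: datetime
    detected_at: datetime
    mitigated_at: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_mitigate(self) -> timedelta:
        # Assumption: measured from incident start, not from detection.
        return self.mitigated_at - self.started_at


# Hypothetical incident used only to show the derived fields.
pm = Postmortem(
    incident_id="INC-0001",
    affected_services=["checkout-service"],
    contributing_factors=["config rollout", "missing alert"],
    mitigation_type="rollback",
    started_at=datetime(2024, 3, 4, 12, 5),
    detected_at=datetime(2024, 3, 4, 12, 9),
    mitigated_at=datetime(2024, 3, 4, 12, 35),
)
print(pm.time_to_detect)    # 0:04:00
print(pm.time_to_mitigate)  # 0:30:00
```

Deriving the durations from timestamps, instead of storing them as free text, keeps them consistent across postmortems and comparable across the corpus.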

With the same AI assistant operating across the corpus, an engineering organization can answer questions that were previously too expensive to investigate: which services have seen MTTR worsen over the last two quarters; which contributing factors appear most frequently across postmortems; which categories of action items are consistently completed versus dropped; which deploy patterns precede incidents at a statistically meaningful rate.
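Two of these questions reduce to simple aggregations once the fields are structured. The sketch below assumes each postmortem has been flattened into a dictionary with `service`, `opened`, `minutes_to_mitigate`, and `factors` keys. Those key names, the helper functions, and the four sample records are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import mean

# Hypothetical corpus; real records would come from the postmortem store.
corpus = [
    {"service": "checkout", "opened": date(2024, 2, 10), "minutes_to_mitigate": 30,
     "factors": ["config rollout", "missing alert"]},
    {"service": "checkout", "opened": date(2024, 5, 3), "minutes_to_mitigate": 50,
     "factors": ["config rollout"]},
    {"service": "search", "opened": date(2024, 2, 20), "minutes_to_mitigate": 40,
     "factors": ["dependency timeout"]},
    {"service": "search", "opened": date(2024, 6, 1), "minutes_to_mitigate": 20,
     "factors": ["config rollout", "dependency timeout"]},
]


def top_factors(records, n=3):
    """Count how often each contributing factor appears across the corpus."""
    counts = Counter(f for r in records for f in r["factors"])
    return counts.most_common(n)


def quarter(d):
    """Label a date with its calendar quarter, e.g. '2024-Q1'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def mttr_by_service_and_quarter(records):
    """Mean minutes-to-mitigate, keyed by (service, quarter)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["service"], quarter(r["opened"]))].append(r["minutes_to_mitigate"])
    return {key: mean(values) for key, values in buckets.items()}


def services_with_worsening_mttr(records, earlier, later):
    """Services whose mean time-to-mitigate rose from `earlier` to `later`."""
    mttr = mttr_by_service_and_quarter(records)
    services = {service for service, _ in mttr}
    return sorted(
        s for s in services
        if (s, earlier) in mttr and (s, later) in mttr
        and mttr[(s, later)] > mttr[(s, earlier)]
    )


print(top_factors(corpus))
# [('config rollout', 3), ('dependency timeout', 2), ('missing alert', 1)]
print(services_with_worsening_mttr(corpus, "2024-Q1", "2024-Q2"))
# ['checkout']
```

With only a handful of incidents per service per quarter, a difference in means is a prompt for a closer look rather than a statistically meaningful trend.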

This is where postmortems shift from a per-incident artifact to a continuous improvement signal. The same instrumentation that produces the individual postmortem also produces the longitudinal view.

If you ran a query across your postmortems from the last 12 months, would you be able to identify your top three recurring contributing factors? In most organizations today, that question is technically answerable but practically out of reach.

Where to Start

Implementing this doesn't require rebuilding your incident process. The practical starting point is narrow: identify two or three of your highest-signal data sources — typically Grafana, GitHub, and your incident communication channel in Slack — and connect them through MCP to an AI assistant your team already uses. Define a postmortem template with structured fields. Generate a draft for the next non-critical incident and compare it to what an engineer would have produced manually.
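One way to make the draft-versus-manual comparison concrete is to diff the two timelines. The sketch below assumes both timelines have been reduced to a mapping from a normalized event label to a timestamp. The `compare_timelines` helper, the labels, and the sample values are hypothetical, and matching by label is a simplification that real timelines may not support without some cleanup.

```python
from datetime import datetime


def compare_timelines(draft, manual):
    """Compare two timelines given as {event label: timestamp} mappings.

    Reports the events that appear in only one timeline and, for shared
    events, the draft timestamp minus the manual timestamp in seconds.
    """
    shared = sorted(draft.keys() & manual.keys())
    return {
        "only_in_draft": sorted(draft.keys() - manual.keys()),
        "only_in_manual": sorted(manual.keys() - draft.keys()),
        "offset_seconds": {
            label: (draft[label] - manual[label]).total_seconds() for label in shared
        },
    }


# Hypothetical timelines for the same incident.
draft = {
    "deploy": datetime(2024, 1, 1, 12, 5, 10),
    "alert fired": datetime(2024, 1, 1, 12, 9, 0),
    "rollback started": datetime(2024, 1, 1, 12, 20, 0),
}
manual = {
    "deploy": datetime(2024, 1, 1, 12, 5, 0),
    "alert fired": datetime(2024, 1, 1, 12, 10, 0),
    "page acknowledged": datetime(2024, 1, 1, 12, 12, 0),
}

report = compare_timelines(draft, manual)
print(report["only_in_draft"])    # ['rollback started']
print(report["only_in_manual"])   # ['page acknowledged']
print(report["offset_seconds"])   # {'alert fired': -60.0, 'deploy': 10.0}
```

Events that appear in only one timeline are usually the most informative part of the comparison, because they show what each approach tends to miss.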

The objective isn't to remove engineers from the postmortem process. It's to remove the manual correlation work that doesn't require engineering judgment, so the judgment work can happen with better inputs and more time.

The signals are already in your systems — Slack, Zoom, Jira, Grafana, and GitHub already witnessed everything. The remaining question is whether your postmortem process is built to use them.