Tags: sre, observability, distributed-systems, incident-response, opentelemetry, system-architecture, slack, datadog, python

Building the Proactive Nerve System: Causal RCA in Action (Part 3)

Suvro Banerjee
March 9, 2026
16 min read

The Shift: From Symptom Alerts to Causal RCA

In Part 1, we built the secure ingestion perimeter. In Part 2, we built the reasoning engine. But there is a harder systems problem underneath both:

Alerts usually fire where pain is visible, not where failure started.

That distinction matters more in distributed systems than most incident tooling admits. A service can still return 200 OK, a monitor can still fire correctly, and the operator can still be pointed at the wrong place to act first.

This post is about how Flipturn handles that class of incident.

The demo scenario looks simple on the surface:

  • search latency breaches p99
  • Datadog fires a monitor on search
  • the slowest span in the winning trace is the database

But none of those facts, by themselves, tell you the root cause.

In our demo run, the first real causal break is a cache serialization failure on a versioned key. That forces catalog into database fallback. The database becomes the latency bottleneck. That latency finally surfaces as a slow search request.

So the incident has four distinct roles:

  • root cause in catalog-cache
  • propagation through catalog
  • bottleneck in database
  • symptom in search

That is the core idea behind Flipturn: the monitor tells us where pain surfaced. Flipturn tells us where it started.


Watch The Demo

Before going deeper into the architecture, here is the end-to-end demo of this flow in action: a symptom alert on search, a trace-first investigation across services, deterministic evidence correlation, and a follow-up answer from the same incident ledger.


1. The Incident Pattern Most Tools Struggle With

Traditional incident workflows are still biased toward binary failures:

  • a service is down
  • an error rate spikes
  • a host is unhealthy
  • a dependency is unreachable

Those are important incidents, but they are not the hardest ones.

The more interesting class of failure is when:

  • every service is technically "up"
  • HTTP status codes are still mostly healthy
  • the customer experiences degradation rather than outage
  • the alert fires on the symptom surface
  • and the root cause lives upstream in a different role than the bottleneck

That is the exact shape of the cache_poison_pill scenario in Flipturn's simulation lab.

This is why the demo matters architecturally. It is not just showing that we can parse a Datadog alert and post to Slack. It is showing that Flipturn can:

  1. start from a customer-facing symptom,
  2. collect deterministic evidence across traces, logs, metrics, and tagged exceptions,
  3. distinguish causal origin from the point of highest latency,
  4. and answer follow-up operator questions from persisted evidence instead of recomputing everything from scratch.

If a system can do that reliably, it is doing incident investigation. If it cannot, it is just summarizing observability artifacts.


2. The End-to-End Architecture

At a high level, the architecture for this flow looks like this:

```mermaid
graph TD
    DD[Datadog Monitor on search p99] --> SlackAlert[Slack Alert Message]
    SlackAlert --> TrustGate[Trust Gate / Webhook Verification]
    TrustGate --> Normalize[UniversalIncidentSignal]
    Normalize --> Streams[Redis Streams]
    Streams --> Worker[Stream Worker]
    Worker --> Envelope[CorrelationEnvelope]
    Envelope --> Prefetch[Trace-First Evidence Prefetch]
    Prefetch --> Planner[Evidence Query Planner]
    Planner --> DDQueries[Datadog APM / Logs / Metrics]
    Planner --> Sentry[Sentry Issues]
    DDQueries --> Timeline[Deterministic Evidence Timeline]
    Sentry --> Timeline
    Timeline --> RCA[Confidence + Representative Trace + RCA Header]
    RCA --> Agent[Agent Service / LangGraph]
    Agent --> SlackRCA[Slack RCA Reply]
    SlackRCA --> Followup[Follow-Up Question in Thread]
    Followup --> Memory[Thread Context + Incident Memory]
    Memory --> SlackAnswer[Focused Follow-Up Answer]
    style Worker fill:#f6d365,stroke:#333,stroke-width:2px
    style Timeline fill:#bfdbfe,stroke:#333,stroke-width:2px
    style RCA fill:#a7f3d0,stroke:#333,stroke-width:2px
    style SlackAnswer fill:#ff8a80,stroke:#b71c1c,stroke-width:3px,color:#111
```

There are five architectural modules doing the real work here:

  1. Trust Gate receives the alert safely and turns it into a normalized incident signal.
  2. Stream Worker is the orchestration boundary for evidence collection and RCA assembly.
  3. Correlation + Evidence Layer converts the alert into deterministic pivots and ranked evidence.
  4. Agent Layer turns a constrained evidence set into a readable RCA.
  5. Slack Thread Memory lets follow-up questions reuse the same incident context.

This is important: Flipturn is not one big model call. It is a pipeline where deterministic computation narrows the problem before the model is asked to explain it.


3. Ingestion: The Alert Is Just the Starting Gun

The incident starts when Datadog posts an alert into Slack. That alert is not treated as a final diagnosis. It is treated as a trigger to start investigation.

The webhook path matters because it defines what the rest of the system can trust.

From app/api/v1/endpoints/webhooks.py, Slack events enter through the verified webhook endpoint, get parsed, normalized, and then queued to Redis Streams for processing. That means the alert-handling system has three important properties before any reasoning begins:

  • the request is authenticated,
  • the message is normalized into UniversalIncidentSignal,
  • and processing is decoupled from the webhook response path.

Architecturally, that means Slack is only the ingress surface. The actual investigation happens in the worker.
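To make the shape concrete, here is a minimal sketch of that ingestion path. It is not the code in app/api/v1/endpoints/webhooks.py; it assumes FastAPI and redis-py, and the handler name, signing-secret handling, and stream name are all illustrative.

```python
# Illustrative sketch, not Flipturn's actual webhook code: verify the Slack
# signature, parse, enqueue to a stream, and return fast.
import hashlib
import hmac
import json
import time

import redis.asyncio as redis
from fastapi import APIRouter, Header, HTTPException, Request

router = APIRouter()
r = redis.Redis()  # connection details omitted for brevity

SLACK_SIGNING_SECRET = b"..."  # loaded from config in a real deployment


def verify_slack_signature(body: bytes, timestamp: str, signature: str) -> bool:
    # Reject stale requests to limit replay attacks.
    if abs(time.time() - int(timestamp)) > 60 * 5:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(SLACK_SIGNING_SECRET, base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


@router.post("/webhooks/slack")
async def slack_webhook(
    request: Request,
    x_slack_request_timestamp: str = Header(...),
    x_slack_signature: str = Header(...),
):
    body = await request.body()
    if not verify_slack_signature(body, x_slack_request_timestamp, x_slack_signature):
        raise HTTPException(status_code=401, detail="bad signature")

    event = json.loads(body)
    # Normalization into UniversalIncidentSignal happens here, then the signal
    # is queued so the webhook can return before any investigation starts.
    await r.xadd("incident_signals", {"payload": json.dumps(event)})
    return {"ok": True}
```

The important property is structural: verification happens before anything is parsed, and the response path never waits on the investigation.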

The normalized signal is important because it lets the rest of the system reason in source-independent terms:

  • title
  • description
  • severity
  • tags
  • source metadata
  • correlation hints

That keeps the evidence pipeline from becoming a pile of source-specific special cases.
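For illustration, the normalized signal can be as small as a single Pydantic model. The fields below follow the list above; anything beyond that list is our assumption, not the repo's exact schema.

```python
# One plausible shape for UniversalIncidentSignal (illustrative).
from pydantic import BaseModel, Field


class UniversalIncidentSignal(BaseModel):
    title: str
    description: str
    severity: str                      # e.g. "critical", "warning"
    tags: list[str] = Field(default_factory=list)
    source: str                        # e.g. "datadog", "sentry"
    source_metadata: dict = Field(default_factory=dict)
    correlation_hints: dict = Field(default_factory=dict)  # trace_id, run_id, service, env, ...
```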


4. Correlation First: Why Flipturn Can Pivot Deterministically

The first important step inside the worker is not “ask the model what this alert means.”

It is: extract the strongest possible correlation keys.

This is handled through the generalized CorrelationEnvelope model in app/models/correlation.py. The envelope ranks correlation by strength:

  1. Exact IDs: trace_id, request_id, correlation_id, run_id, span_id
  2. Strong dimensions: service, environment, region, scenario_id
  3. Weak fallbacks: alert text and contextual matching

That ordering is not cosmetic. It shapes how the entire investigation proceeds.
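As a sketch, the ranking reduces to an ordered lookup. The envelope fields below are assumptions about shape, not Flipturn's exact model in app/models/correlation.py.

```python
# Sketch of ranked correlation-key selection. The ordering mirrors the list
# above; the field layout itself is an assumption.
from dataclasses import dataclass, field

EXACT_KEYS = ("trace_id", "request_id", "correlation_id", "run_id", "span_id")
STRONG_KEYS = ("service", "environment", "region", "scenario_id")


@dataclass
class CorrelationEnvelope:
    exact_ids: dict[str, str] = field(default_factory=dict)
    dimensions: dict[str, str] = field(default_factory=dict)
    alert_text: str = ""

    def best_key(self) -> tuple[str, str] | None:
        """Return the strongest available pivot: exact IDs, then strong dimensions."""
        for key in EXACT_KEYS:
            if key in self.exact_ids:
                return key, self.exact_ids[key]
        for key in STRONG_KEYS:
            if key in self.dimensions:
                return key, self.dimensions[key]
        return None  # weak fallback: alert-text matching is handled elsewhere
```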

Why This Matters

If an alert already contains a usable trace_id, starting with generic keyword search is wasteful and noisy. If a run_id and scenario_id exist, we should preserve them. If only service and env exist, those become the safest first filters.

This is what deterministic incident investigation looks like in practice:

  • use exact IDs before fuzzy search
  • prefer topology-aware pivots to free-text heuristics
  • shrink the search space before the model reasons

That is how Flipturn avoids behaving like a “smart grep bot.”

The Practical Result

By the time the worker logs:

  • exact IDs available
  • dimensions available
  • best correlation key chosen

the system already knows whether this is a trace-first, run-first, or dimension-first investigation.

That decision cascades through the rest of the evidence pipeline.


5. Trace-First Evidence Collection

Once the worker has the correlation envelope, Flipturn starts deterministic evidence collection.

This stage has two layers:

  1. a trace-first prefetcher for fast, bounded evidence acquisition
  2. an evidence planner for structured query expansion

Layer 1: Trace-First Prefetch

The DatadogEvidencePrefetcher is designed with strict API budgets. It does not query everything. It starts from the symptom service and the alert metric context, searches APM for slow spans, then tries to select a representative trace that best explains the incident.

This is the point where the demo’s main line becomes true:

the slowest span is not always the root cause.

Why? Because APM answers “where time was spent,” not necessarily “where the first causal break occurred.”

The prefetcher gives us:

  • a candidate trace_id
  • top spans from that trace
  • trace-attached log rows
  • diagnostics about how the trace was selected
  • the alert metric fact used to anchor the search

That produces a compact, incident-shaped evidence bundle instead of a broad dump of logs and spans.
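A hedged sketch of the budget idea, with the vendor calls stubbed out, looks like this. The real DatadogEvidencePrefetcher is more involved; the control flow is the point.

```python
# The "strict API budget" idea in miniature (illustrative; vendor calls stubbed).
def search_slow_spans(service: str) -> list[dict]:
    return []  # stands in for a bounded Datadog APM search on the symptom service


def fetch_trace_logs(trace_id: str) -> list[dict]:
    return []  # stands in for a logs query pinned to that trace


class ApiBudget:
    """Counts vendor API calls and refuses work once the budget is spent."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def allow(self) -> bool:
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True


def prefetch_evidence(symptom_service: str, budget: ApiBudget) -> dict:
    bundle = {"trace_id": None, "spans": [], "logs": [], "diagnostics": []}
    if budget.allow():
        bundle["spans"] = search_slow_spans(symptom_service)
    if bundle["spans"] and budget.allow():
        # Selection is deliberate: representative, not merely the slowest.
        bundle["trace_id"] = bundle["spans"][0]["trace_id"]
        bundle["logs"] = fetch_trace_logs(bundle["trace_id"])
        bundle["diagnostics"].append(f"selected trace {bundle['trace_id']}")
    return bundle
```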

Layer 2: Evidence Planning

Then the EvidenceQueryPlanner expands the query plan based on the strongest available keys:

  • @trace_id:<value> for APM
  • @otel.trace_id:<value> for logs
  • run_id pivots when available
  • service and environment fallbacks
  • Sentry issue lookups
  • symptom metric confirmation

The planning strategy matters because observability systems are not equally reliable across every field. Standard dimensions such as service and env are usually immediately searchable. Custom attributes like run_id may lag indexing. Trace IDs are usually the best causal handle when present.

That is why Flipturn does not rely on one magical query. It executes an ordered correlation strategy based on evidence reliability.
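In sketch form, the ordered strategy is just a plan builder. The @trace_id and @otel.trace_id syntax comes from the list above; the @run_id facet and the function shape are our assumptions.

```python
# Sketch of ordered query planning: most reliable pivot first.
def plan_queries(exact_ids: dict[str, str], dims: dict[str, str]) -> list[tuple[str, str]]:
    """Return (target, query) pairs in descending order of reliability."""
    plan: list[tuple[str, str]] = []
    if trace_id := exact_ids.get("trace_id"):
        plan.append(("apm", f"@trace_id:{trace_id}"))
        plan.append(("logs", f"@otel.trace_id:{trace_id}"))
    if run_id := exact_ids.get("run_id"):
        plan.append(("logs", f"@run_id:{run_id}"))  # custom attribute; may lag indexing
    if (service := dims.get("service")) and (env := dims.get("env")):
        plan.append(("logs", f"service:{service} env:{env}"))  # standard facets
    return plan
```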

```mermaid
flowchart LR
    Alert[Slack Alert] --> Corr[CorrelationEnvelope]
    Corr --> Exact{Exact ID available?}
    Exact -- trace_id --> APM[Query Datadog APM by trace_id]
    Exact -- run_id only --> LogsRun[Query logs by run_id]
    Exact -- none --> Dim[Query by service + env]
    APM --> Rep[Representative Trace]
    Rep --> LogsTrace[Query logs on @otel.trace_id]
    LogsRun --> Merge[Merge Evidence]
    Dim --> Merge
    LogsTrace --> Merge
    Merge --> Metrics[Confirm symptom metrics]
    Merge --> Issues[Query Sentry tags]
    Metrics --> Timeline[Deterministic Evidence Timeline]
    Issues --> Timeline
    style Rep fill:#f6d365,stroke:#333,stroke-width:2px
    style Timeline fill:#bfdbfe,stroke:#333,stroke-width:2px
```

The architecture point is simple: the model is not “exploring Datadog.” The system is curating a deterministic evidence set before the narrative layer begins.


6. Root Cause, Propagation, Bottleneck, Symptom

This is the conceptual center of the demo and the product.

Flipturn does not flatten the incident into one overloaded label like “the bad service.” It separates different causal roles in the chain.

For the demo scenario:

  • Root cause: catalog-cache throws a serialization failure on a versioned cache key.
  • Propagation: catalog misses cache and falls back to direct database reads.
  • Bottleneck: database becomes the slowest span because it absorbs fallback load.
  • Symptom: search latency breaches p99 and triggers the monitor.

This distinction is operationally important.

If an operator only sees the Datadog alert, they are likely to think:

  • “Search is slow.”

If they open APM and look only at the waterfall, they may conclude:

  • “Database is the root cause because it is the slowest span.”

But the useful RCA is:

  • “The first causal break is in cache compatibility, which changes execution mode upstream and only then turns the database into a bottleneck.”

That tells the operator where to act first.

Why This Is Hard

Distributed incidents often blur these roles because:

  • the monitor is attached to the symptom service
  • the slowest span belongs to the bottleneck service
  • the first failing log belongs to a faster, upstream service
  • the user-facing impact appears at the edge, not at the origin

This is exactly why “just show me the trace” is not enough.

A trace is necessary, but it still has to be interpreted in a causal model.

Flipturn's Role Taxonomy

The product logic increasingly depends on this separation:

  • root cause is where the first causal defect emerges
  • propagation is how that defect changes behavior across services
  • bottleneck is where latency or saturation accumulates most visibly
  • symptom is what the monitor or operator sees first

That taxonomy is the bridge between raw telemetry and useful action.

```mermaid
graph LR
    Cache["catalog-cache\nRoot Cause\nSerialization failure"] --> Catalog["catalog\nPropagation\nDB fallback"]
    Catalog --> DB["database\nBottleneck\nSlowest span"]
    DB --> Search["search\nSymptom\np99 alert"]
    style Cache fill:#ffccbc,stroke:#bf360c,stroke-width:3px
    style Catalog fill:#fff3cd,stroke:#8d6e63,stroke-width:2px
    style DB fill:#ffe082,stroke:#f57f17,stroke-width:3px
    style Search fill:#c8e6c9,stroke:#1b5e20,stroke-width:3px
```
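Written down as a data model, the taxonomy is deliberately small. The enum below is illustrative, with the demo's own role assignments attached.

```python
# The four causal roles as a tiny data model (illustrative).
from enum import Enum


class CausalRole(str, Enum):
    ROOT_CAUSE = "root_cause"    # where the first causal defect emerges
    PROPAGATION = "propagation"  # how the defect changes behavior downstream
    BOTTLENECK = "bottleneck"    # where latency or saturation accumulates
    SYMPTOM = "symptom"          # what the monitor or operator sees first


# The demo scenario, expressed in that taxonomy:
DEMO_ROLES = {
    "catalog-cache": CausalRole.ROOT_CAUSE,
    "catalog": CausalRole.PROPAGATION,
    "database": CausalRole.BOTTLENECK,
    "search": CausalRole.SYMPTOM,
}
```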

7. The Deterministic Evidence Layer

Flipturn’s RCA is useful because it is not generated from an unstructured bag of telemetry.

Before the final response is assembled, the worker computes a deterministic evidence layer:

  • scored evidence
  • an evidence timeline
  • confidence signals
  • an RCA header
  • an action scaffold
  • a representative trace
  • OTel pivots for human verification

This is one of the most important architectural choices in the whole repo.

Evidence Timeline

app/services/evidence_timeline.py normalizes logs, APM spans, and Sentry issues into a common event model. Those events get stable evidence IDs like E1, E2, E3, making them referenceable in the final RCA and follow-up answers.

That gives us a source-of-truth ledger:

  • not a prose summary,
  • not raw vendor JSON,
  • but a compact, ordered set of incident events.
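A minimal sketch of that ledger construction, assuming a pre-flattened event dict per log, span, or issue, shows why the IDs stay stable: they are assigned once, after ordering.

```python
# Illustrative sketch of the ledger idea: normalize heterogeneous events,
# sort, then assign stable IDs so the RCA can cite "E3" unambiguously.
from dataclasses import dataclass


@dataclass
class TimelineEvent:
    evidence_id: str   # "E1", "E2", ...
    timestamp: float   # epoch seconds
    source: str        # "log" | "span" | "sentry"
    service: str
    summary: str


def build_timeline(raw_events: list[dict]) -> list[TimelineEvent]:
    ordered = sorted(raw_events, key=lambda e: e["timestamp"])
    return [
        TimelineEvent(
            evidence_id=f"E{i}",
            timestamp=e["timestamp"],
            source=e["source"],
            service=e["service"],
            summary=e["summary"],
        )
        for i, e in enumerate(ordered, start=1)
    ]
```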

Confidence Before Narrative

The system computes confidence and supporting signals before the final narrative is generated. That means the model is reasoning over:

  • a confidence score
  • explicit causal hints
  • a selected representative trace
  • a filtered event timeline

rather than inventing its own evidentiary structure from scratch.

Representative Trace

The representative trace is not simply “the longest trace.”

The selection logic in app/services/representative_trace.py ranks candidate traces by:

  • threshold satisfaction
  • symptom-service duration
  • max duration
  • multi-service coverage
  • causal usefulness

That is much closer to what an experienced incident responder would want. The best trace is the one that explains the incident, not the one that happens to be globally longest.
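A hedged sketch of that ranking, reducing app/services/representative_trace.py to a sort key: the field names and weights are assumptions, and the harder "causal usefulness" term is omitted.

```python
# Illustrative sort key over candidate traces, following the criteria above.
def trace_score(trace: dict, symptom_service: str, threshold_ms: float) -> tuple:
    spans = trace["spans"]
    symptom_ms = sum(s["duration_ms"] for s in spans if s["service"] == symptom_service)
    return (
        trace["duration_ms"] >= threshold_ms,  # threshold satisfaction dominates
        symptom_ms,                            # then time spent in the symptom service
        trace["duration_ms"],                  # then overall duration
        len({s["service"] for s in spans}),    # then multi-service coverage
    )


def pick_representative(traces: list[dict], symptom_service: str, threshold_ms: float) -> dict:
    return max(traces, key=lambda t: trace_score(t, symptom_service, threshold_ms))
```

The design choice worth noting: global duration is only a tiebreaker, not the objective.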

RCA Header

The deterministic RCA header then compresses the incident into three high-signal lines:

  • confidence
  • root cause statement
  • proof trace ID

That design is deliberate. In the demo, the first three lines of the Slack reply already tell the operator:

  • how certain Flipturn is
  • where the causal break started
  • and which trace ties the cross-service evidence together

That gives operators a fast read before they ever inspect the full ledger.


8. The Role of the Agent: Explainer, Not Evidence Generator

At this point in the flow, the model still matters. But its job is narrower and more disciplined than in most AI incident products.

The AgentService does not start from raw alert text alone. It receives:

  • scrubbed incident text
  • a deterministic evidence timeline
  • scored evidence context
  • prefetched evidence
  • optional thread history

That is a major architectural decision.

The model is not responsible for discovering the investigation structure. The system has already done a large part of that work. The model’s job is to:

  • synthesize the evidence,
  • explain the causal chain,
  • format findings,
  • and answer in operator-friendly language.

That is much safer than asking a model to explore observability tools blindly and decide what counts as proof with no deterministic scaffolding underneath.

This is also why the demo works well as a product story: Flipturn is ranking evidence, not just summarizing telemetry.


9. Follow-Up Q&A: From One-Shot Summary to Incident Memory

A one-shot summary bot would stop after posting the first RCA. Flipturn does not.

In the demo, the operator asks:

“Show me the single most causal evidence line and why it beats the alternatives.”

That is not a generic re-ask of the original incident. It is a challenge question. It requires the system to remember what it already concluded, retain the evidence set, and answer narrowly.

Why This Is Architecturally Hard

Slack follow-ups are new webhook events. They do not arrive with in-memory session state. So if you want thread continuity, you need to rebuild context explicitly.

Flipturn does that by persisting two things:

  1. thread context
  2. thread incident memory

The worker stores:

  • run_id
  • scenario_id
  • representative trace_id
  • deterministic header
  • confidence blob
  • action scaffold
  • key evidence IDs
  • full evidence timeline payload

That lets follow-up requests restore incident context from Redis rather than rerunning the investigation blindly.
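A minimal sketch of that persistence, with a Redis key scheme and TTL of our own invention, could look like this.

```python
# Illustrative thread-scoped incident memory in Redis, keyed by Slack thread.
import json

import redis

r = redis.Redis()


def memory_key(channel: str, thread_ts: str) -> str:
    return f"incident:memory:{channel}:{thread_ts}"


def save_incident_memory(channel: str, thread_ts: str, memory: dict) -> None:
    # memory carries run_id, scenario_id, trace_id, the deterministic header,
    # confidence, the action scaffold, key evidence IDs, and the timeline payload.
    r.set(memory_key(channel, thread_ts), json.dumps(memory), ex=7 * 24 * 3600)


def load_incident_memory(channel: str, thread_ts: str) -> dict | None:
    raw = r.get(memory_key(channel, thread_ts))
    return json.loads(raw) if raw else None
```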

The Follow-Up Flow

```mermaid
sequenceDiagram
    participant Slack as Slack Thread
    participant Worker as Stream Worker
    participant Redis as Thread Memory
    participant Timeline as Evidence Timeline
    participant Reply as Slack Reply Builder
    Slack->>Worker: Follow-up question
    Worker->>Redis: Load thread context
    Worker->>Redis: Load thread incident memory
    Redis-->>Worker: run_id, trace_id, E1..En, timeline
    Worker->>Timeline: Rehydrate deterministic evidence
    Worker->>Reply: Build focused follow-up answer
    Reply-->>Slack: Cite single most causal evidence line
```

Why This Matters

This is the difference between:

  • “Here is another summary”

and:

  • “Here is the most causal evidence line, here is why it outranks the others, and here is the exact evidence ID.”

That is a very different product experience. It moves Flipturn closer to an incident copilot than a post-hoc summarizer.


10. Why the Simulation Lab Matters to This Story

This entire flow is powered by Flipturn’s simulation lab, and that matters enough to say directly.

We did not build the lab because we want to be in the business of demo microservices. We built it because autonomous causal RCA is impossible to validate seriously without a controlled distributed environment.

The lab gives us:

  • real service hops
  • real OpenTelemetry traces, metrics, and logs
  • realistic causal ambiguity
  • reproducible failure scenarios
  • and a place to test whether the RCA engine is actually finding the right first action

That is crucial for this blog because the demo’s main claim is subtle:

  • the monitor is right
  • the trace is right
  • the database is slow
  • and the root cause is still somewhere else

You cannot prove that kind of reasoning with static screenshots or hand-authored evidence bundles alone. You need a live system where propagation can actually happen.

The simulation lab is the proving ground. The end goal is production-grade incident investigation across real systems, but the lab is where the causal model gets hardened.


11. What This Means for SRE Teams

The main lesson here is not “use Slack” or “wire Datadog to an LLM.”

It is that modern incident tooling needs to separate at least four concepts that are often collapsed into one:

  • where the alert fired
  • where the request was slowest
  • where the failure started
  • and what evidence proves the chain

If your system cannot distinguish those roles, operators will keep getting pointed at the wrong place to act first.

The second lesson is that useful AI in incident response is mostly a systems architecture problem:

  • secure ingestion
  • strong correlation
  • deterministic evidence shaping
  • memory across follow-ups
  • constrained model responsibility

The model matters, but only after the telemetry and evidence pipeline are doing the right kind of narrowing.


Key Takeaways

  1. The alert surface is not the root cause. In distributed incidents, the service that triggers the monitor is often just where pain became visible.
  2. The slowest span is not necessarily the first causal break. Bottleneck and root cause can live in different services.
  3. Trace-first investigation reduces ambiguity. Exact pivots like trace_id are far more useful than broad keyword search.
  4. Deterministic evidence should come before narrative generation. Timelines, representative traces, and confidence signals make the final RCA more trustworthy.
  5. Follow-up memory is a core architectural capability. Without persisted incident context, every Slack reply is just another one-shot summary.
  6. Simulation labs are not demos for their own sake. They are the environment where causal RCA architectures can be stress-tested before broader real-world rollout.

Closing

This demo is really demonstrating one architectural claim:

Flipturn is designed to investigate incidents where the obvious answer is wrong.

It starts from the symptom alert, pivots into the distributed trace, correlates logs and spans on the same causal path, separates root cause from bottleneck, and then keeps that incident alive as thread memory for follow-up questions.

That is what makes the RCA useful. It does not just say what was slow. It tells the operator where to act first and why.

Want to eliminate incident firefighting?

Join teams using Flipturn for autonomous root cause analysis.
