Building the Proactive Nerve System: Causal RCA in Action (Part 3)
The Shift: From Symptom Alerts to Causal RCA
In Part 1, we built the secure ingestion perimeter. In Part 2, we built the reasoning engine. But there is a harder systems problem underneath both:
Alerts usually fire where pain is visible, not where failure started.
That distinction matters more in distributed systems than most incident tooling admits. A service can still return 200 OK, a monitor can still fire correctly, and the operator can still be pointed at the wrong place to act first.
This post is about how Flipturn handles that class of incident.
The demo scenario looks simple on the surface:
- Datadog fires a monitor when `search` latency breaches p99
- the slowest span in the winning trace is in the `database` service
But none of those facts, by themselves, tell you the root cause.
In our demo run, the first real causal break is a cache serialization failure on a versioned key. That forces catalog into database fallback. The database becomes the latency bottleneck. That latency finally surfaces as a slow search request.
So the incident has four distinct roles:
- root cause in `catalog-cache`
- propagation through `catalog`
- bottleneck in `database`
- symptom in `search`
That is the core idea behind Flipturn: the monitor tells us where pain surfaced. Flipturn tells us where it started.
Watch The Demo
Before going deeper into the architecture, here is the end-to-end demo of this flow in action: a symptom alert on search, a trace-first investigation across services, deterministic evidence correlation, and a follow-up answer from the same incident ledger.
1. The Incident Pattern Most Tools Struggle With
Traditional incident workflows are still biased toward binary failures:
- a service is down
- an error rate spikes
- a host is unhealthy
- a dependency is unreachable
Those are important incidents, but they are not the hardest ones.
The more interesting class of failure is when:
- every service is technically "up"
- HTTP status codes are still mostly healthy
- the customer experiences degradation rather than outage
- the alert fires on the symptom surface
- and the root cause lives upstream in a different role than the bottleneck
That is the exact shape of the `cache_poison_pill` scenario in Flipturn's simulation lab.
This is why the demo matters architecturally. It is not just showing that we can parse a Datadog alert and post to Slack. It is showing that Flipturn can:
- start from a customer-facing symptom,
- collect deterministic evidence across traces, logs, metrics, and tagged exceptions,
- distinguish causal origin from the point of highest latency,
- and answer follow-up operator questions from persisted evidence instead of recomputing everything from scratch.
If a system can do that reliably, it is doing incident investigation. If it cannot, it is just summarizing observability artifacts.
2. The End-to-End Architecture
At a high level, five architectural modules do the real work in this flow:
- Trust Gate receives the alert safely and turns it into a normalized incident signal.
- Stream Worker is the orchestration boundary for evidence collection and RCA assembly.
- Correlation + Evidence Layer converts the alert into deterministic pivots and ranked evidence.
- Agent Layer turns a constrained evidence set into a readable RCA.
- Slack Thread Memory lets follow-up questions reuse the same incident context.
This is important: Flipturn is not one big model call. It is a pipeline where deterministic computation narrows the problem before the model is asked to explain it.
3. Ingestion: The Alert Is Just the Starting Gun
The incident starts when Datadog posts an alert into Slack. That alert is not treated as a final diagnosis. It is treated as a trigger to start investigation.
The webhook path matters because it defines what the rest of the system can trust.
From app/api/v1/endpoints/webhooks.py, Slack events enter through the verified webhook endpoint, get parsed, normalized, and then queued to Redis Streams for processing. That means the alert-handling system has three important properties before any reasoning begins:
- the request is authenticated,
- the message is normalized into `UniversalIncidentSignal`,
- and processing is decoupled from the webhook response path.
Architecturally, that means Slack is only the ingress surface. The actual investigation happens in the worker.
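The first of those properties, request authentication, can be sketched as Slack-style HMAC signature verification. This is a minimal sketch following Slack's documented `v0` signing scheme; the actual check in webhooks.py may differ in details:

```python
import hashlib
import hmac
import time

def verify_slack_signature(signing_secret: str, timestamp: str,
                           body: bytes, signature: str) -> bool:
    """Reject stale or forged webhook requests before any parsing happens."""
    # Refuse requests older than five minutes to blunt replay attacks.
    if abs(time.time() - int(timestamp)) > 60 * 5:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, signature)
```

Only after this check passes does the payload get normalized and queued; a forged request never reaches the worker.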
The normalized signal is important because it lets the rest of the system reason in source-independent terms:
- title
- description
- severity
- tags
- source metadata
- correlation hints
That keeps the evidence pipeline from becoming a pile of source-specific special cases.
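As a rough sketch, the normalized signal and a source-specific adapter could look like this. The field set mirrors the list above, but the exact schema and the alert payload keys are illustrative, not the repo's actual definitions:

```python
from dataclasses import dataclass, field

@dataclass
class UniversalIncidentSignal:
    """Source-independent incident signal; fields mirror the list above."""
    title: str
    description: str
    severity: str
    tags: list[str] = field(default_factory=list)
    source: dict = field(default_factory=dict)             # source metadata
    correlation_hints: dict = field(default_factory=dict)  # IDs and dimensions

def normalize_datadog_alert(alert: dict) -> UniversalIncidentSignal:
    """Map a (hypothetical) Datadog alert payload into the universal shape."""
    return UniversalIncidentSignal(
        title=alert.get("title", "untitled alert"),
        description=alert.get("body", ""),
        severity=alert.get("priority", "unknown"),
        tags=alert.get("tags", []),
        source={"vendor": "datadog", "monitor_id": alert.get("monitor_id")},
        correlation_hints={k: alert[k] for k in ("trace_id", "service", "env") if k in alert},
    )
```

The point of the adapter is that everything downstream sees only the universal shape, never vendor JSON.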
4. Correlation First: Why Flipturn Can Pivot Deterministically
The first important step inside the worker is not “ask the model what this alert means.”
It is: extract the strongest possible correlation keys.
This is handled through the generalized CorrelationEnvelope model in app/models/correlation.py. The envelope ranks correlation by strength:
- Exact IDs: `trace_id`, `request_id`, `correlation_id`, `run_id`, `span_id`
- Strong dimensions: `service`, `environment`, `region`, `scenario_id`
- Weak fallbacks: alert text and contextual matching
That ordering is not cosmetic. It shapes how the entire investigation proceeds.
Why This Matters
If an alert already contains a usable `trace_id`, starting with generic keyword search is wasteful and noisy. If a `run_id` and `scenario_id` exist, we should preserve them. If only `service` and `env` exist, those become the safest first filters.
This is what deterministic incident investigation looks like in practice:
- use exact IDs before fuzzy search
- prefer topology-aware pivots to free-text heuristics
- shrink the search space before the model reasons
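The ordering can be sketched as a simple strength-ranked selection. Names here are illustrative; the real CorrelationEnvelope in app/models/correlation.py carries more structure than a tuple:

```python
# Ranked from strongest to weakest, mirroring the envelope's ordering above.
EXACT_IDS = ("trace_id", "request_id", "correlation_id", "run_id", "span_id")
STRONG_DIMS = ("service", "environment", "region", "scenario_id")

def best_correlation_key(hints: dict) -> tuple[str, str]:
    """Return (strategy, key): exact IDs first, then dimensions, then free text."""
    for key in EXACT_IDS:
        if hints.get(key):
            return ("exact", key)
    for key in STRONG_DIMS:
        if hints.get(key):
            return ("dimension", key)
    return ("weak", "alert_text")
```

The returned strategy is what later decides whether the investigation is trace-first, run-first, or dimension-first.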
That is how Flipturn avoids behaving like a “smart grep bot.”
The Practical Result
By the time the worker logs:
- exact IDs available
- dimensions available
- best correlation key chosen
the system already knows whether this is a trace-first, run-first, or dimension-first investigation.
That decision cascades through the rest of the evidence pipeline.
5. Trace-First Evidence Collection
Once the worker has the correlation envelope, Flipturn starts deterministic evidence collection.
This stage has two layers:
- a trace-first prefetcher for fast, bounded evidence acquisition
- an evidence planner for structured query expansion
Layer 1: Trace-First Prefetch
The DatadogEvidencePrefetcher is designed with strict API budgets. It does not query everything. It starts from the symptom service and the alert metric context, searches APM for slow spans, then tries to select a representative trace that best explains the incident.
This is the point where the demo’s main line becomes true:
the slowest span is not always the root cause.
Why? Because APM answers “where time was spent,” not necessarily “where the first causal break occurred.”
The prefetcher gives us:
- a candidate `trace_id`
- top spans from that trace
- trace-attached log rows
- diagnostics about how the trace was selected
- the alert metric fact used to anchor the search
That produces a compact, incident-shaped evidence bundle instead of a broad dump of logs and spans.
Layer 2: Evidence Planning
Then the EvidenceQueryPlanner expands the query plan based on the strongest available keys:
- `@trace_id:<value>` for APM
- `@otel.trace_id:<value>` for logs
- `run_id` pivots when available
- service and environment fallbacks
- Sentry issue lookups
- symptom metric confirmation
The planning strategy matters because observability systems are not equally reliable across every field. Standard dimensions such as `service` and `env` are usually immediately searchable. Custom attributes like `run_id` may lag indexing. Trace IDs are usually the best causal handle when present.
That is why Flipturn does not rely on one magical query. It executes an ordered correlation strategy based on evidence reliability.
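In spirit, the planner's expansion looks like an ordered list of query templates. The `@attribute:value` strings follow Datadog's search syntax, but the planner internals here are a sketch, not the repo's implementation:

```python
def build_query_plan(hints: dict) -> list[str]:
    """Emit queries in reliability order: trace pivots, then run pivots, then dimensions."""
    plan = []
    if hints.get("trace_id"):
        plan.append(f"@trace_id:{hints['trace_id']}")       # APM spans
        plan.append(f"@otel.trace_id:{hints['trace_id']}")  # trace-attached logs
    if hints.get("run_id"):
        plan.append(f"@run_id:{hints['run_id']}")           # custom attr, may lag indexing
    dims = [f"{k}:{hints[k]}" for k in ("service", "env") if hints.get(k)]
    if dims:
        plan.append(" ".join(dims))                         # safest broad fallback
    return plan
```

Executing the plan in order means the broad, noisy queries only run when the precise pivots return nothing.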
The architecture point is simple: the model is not “exploring Datadog.” The system is curating a deterministic evidence set before the narrative layer begins.
6. Root Cause, Propagation, Bottleneck, Symptom
This is the conceptual center of the demo and the product.
Flipturn does not flatten the incident into one overloaded label like “the bad service.” It separates different causal roles in the chain.
For the demo scenario:
- Root cause: `catalog-cache` throws a serialization failure on a versioned cache key.
- Propagation: `catalog` misses cache and falls back to direct database reads.
- Bottleneck: `database` becomes the slowest span because it absorbs fallback load.
- Symptom: `search` latency breaches p99 and triggers the monitor.
This distinction is operationally important.
If an operator only sees the Datadog alert, they are likely to think:
- “Search is slow.”
If they open APM and look only at the waterfall, they may conclude:
- “Database is the root cause because it is the slowest span.”
But the useful RCA is:
- “The first causal break is in cache compatibility, which changes execution mode upstream and only then turns the database into a bottleneck.”
That tells the operator where to act first.
Why This Is Hard
Distributed incidents often blur these roles because:
- the monitor is attached to the symptom service
- the slowest span belongs to the bottleneck service
- the first failing log belongs to a faster, upstream service
- the user-facing impact appears at the edge, not at the origin
This is exactly why “just show me the trace” is not enough.
A trace is necessary, but it still has to be interpreted in a causal model.
Flipturn's Role Taxonomy
The product logic increasingly depends on this separation:
- root cause is where the first causal defect emerges
- propagation is how that defect changes behavior across services
- bottleneck is where latency or saturation accumulates most visibly
- symptom is what the monitor or operator sees first
That taxonomy is the bridge between raw telemetry and useful action.
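The taxonomy is small enough to state directly in code. This is a sketch, with the demo scenario's assignment taken from the chain above:

```python
from enum import Enum

class CausalRole(Enum):
    ROOT_CAUSE = "root_cause"    # where the first causal defect emerges
    PROPAGATION = "propagation"  # how the defect changes behavior across services
    BOTTLENECK = "bottleneck"    # where latency or saturation accumulates most visibly
    SYMPTOM = "symptom"          # what the monitor or operator sees first

# Role assignment for the demo's cache_poison_pill scenario.
DEMO_ROLES = {
    "catalog-cache": CausalRole.ROOT_CAUSE,
    "catalog": CausalRole.PROPAGATION,
    "database": CausalRole.BOTTLENECK,
    "search": CausalRole.SYMPTOM,
}
```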
7. The Deterministic Evidence Layer
Flipturn’s RCA is useful because it is not generated from an unstructured bag of telemetry.
Before the final response is assembled, the worker computes a deterministic evidence layer:
- scored evidence
- an evidence timeline
- confidence signals
- an RCA header
- an action scaffold
- a representative trace
- OTel pivots for human verification
This is one of the most important architectural choices in the whole repo.
Evidence Timeline
app/services/evidence_timeline.py normalizes logs, APM spans, and Sentry issues into a common event model. Those events get stable evidence IDs like E1, E2, E3, making them referenceable in the final RCA and follow-up answers.
That gives us a source-of-truth ledger:
- not a prose summary,
- not raw vendor JSON,
- but a compact, ordered set of incident events.
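A minimal sketch of that normalization, with illustrative event fields rather than evidence_timeline.py's actual model:

```python
def build_timeline(raw_events: list[dict]) -> list[dict]:
    """Order mixed-source events by timestamp and stamp stable evidence IDs."""
    ordered = sorted(raw_events, key=lambda e: e["ts"])
    return [
        {"id": f"E{i}", "ts": e["ts"], "source": e["source"], "summary": e["summary"]}
        for i, e in enumerate(ordered, start=1)
    ]
```

Because the IDs are assigned after ordering, E1 is always the earliest event, which is exactly what a causal narrative wants to cite first.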
Confidence Before Narrative
The system computes confidence and supporting signals before the final narrative is generated. That means the model is reasoning over:
- a confidence score
- explicit causal hints
- a selected representative trace
- a filtered event timeline
rather than inventing its own evidentiary structure from scratch.
Representative Trace
The representative trace is not simply “the longest trace.”
The selection logic in app/services/representative_trace.py ranks candidate traces by:
- threshold satisfaction
- symptom-service duration
- max duration
- multi-service coverage
- causal usefulness
That is much closer to what an experienced incident responder would want. The best trace is the one that explains the incident, not the one that happens to be globally longest.
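The ranking criteria can be approximated with a composite sort key. This is a sketch under assumed trace fields; the real scoring in representative_trace.py is presumably more nuanced:

```python
def rank_traces(traces: list[dict], symptom_service: str,
                threshold_ms: float) -> list[dict]:
    """Best trace first: satisfies the alert threshold, spends time in the
    symptom service, is long overall, and touches many services."""
    def score(t: dict) -> tuple:
        return (
            t["duration_ms"] >= threshold_ms,                   # threshold satisfaction
            t["service_durations"].get(symptom_service, 0.0),   # symptom-service duration
            t["duration_ms"],                                   # max duration
            len(t["service_durations"]),                        # multi-service coverage
        )
    return sorted(traces, key=score, reverse=True)
```

Note how a globally longer trace can still lose to one that actually crosses the symptom service, which is the whole point of the selection.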
RCA Header
The deterministic RCA header then compresses the incident into three high-signal lines:
- confidence
- root cause statement
- proof trace ID
That design is deliberate. In the demo, the first three lines of the Slack reply already tell the operator:
- how certain Flipturn is
- where the causal break started
- and which trace ties the cross-service evidence together
That gives operators a fast read before they ever inspect the full ledger.
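Formatting the header is deliberately mechanical. A sketch of the three-line shape; the exact wording in the repo will differ:

```python
def format_rca_header(confidence: float, root_cause: str, trace_id: str) -> str:
    """Compress the incident into the three high-signal lines an operator reads first."""
    return "\n".join([
        f"Confidence: {confidence:.0%}",
        f"Root cause: {root_cause}",
        f"Proof trace: {trace_id}",
    ])
```

Because the header is computed deterministically, the model can elaborate below it but cannot quietly rewrite the headline claim.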
8. The Role of the Agent: Explainer, Not Evidence Generator
At this point in the flow, the model still matters. But its job is narrower and more disciplined than in most AI incident products.
The AgentService does not start from raw alert text alone. It receives:
- scrubbed incident text
- a deterministic evidence timeline
- scored evidence context
- prefetched evidence
- optional thread history
That is a major architectural decision.
The model is not responsible for discovering the investigation structure. The system has already done a large part of that work. The model’s job is to:
- synthesize the evidence,
- explain the causal chain,
- format findings,
- and answer in operator-friendly language.
That is much safer than asking a model to explore observability tools blindly and decide what counts as proof with no deterministic scaffolding underneath.
This is also why the demo works well as a product story: Flipturn is ranking evidence, not just summarizing telemetry.
9. Follow-Up Q&A: From One-Shot Summary to Incident Memory
A one-shot summary bot would stop after posting the first RCA. Flipturn does not.
In the demo, the operator asks:
“Show me the single most causal evidence line and why it beats the alternatives.”
That is not a generic re-ask of the original incident. It is a challenge question. It requires the system to remember what it already concluded, retain the evidence set, and answer narrowly.
Why This Is Architecturally Hard
Slack follow-ups are new webhook events. They do not arrive with in-memory session state. So if you want thread continuity, you need to rebuild context explicitly.
Flipturn does that by persisting two things:
- thread context
- thread incident memory
The worker stores:
- `run_id`
- `scenario_id`
- representative `trace_id`
- deterministic header
- confidence blob
- action scaffold
- key evidence IDs
- full evidence timeline payload
That lets follow-up requests restore incident context from Redis rather than rerunning the investigation blindly.
The Follow-Up Flow
Why This Matters
This is the difference between:
- “Here is another summary”
and:
- “Here is the most causal evidence line, here is why it outranks the others, and here is the exact evidence ID.”
That is a very different product experience. It moves Flipturn closer to an incident copilot than a post-hoc summarizer.
10. Why the Simulation Lab Matters to This Story
This entire flow is powered by Flipturn’s simulation lab, and that matters enough to say directly.
We did not build the lab because we want to be in the business of demo microservices. We built it because autonomous causal RCA is impossible to validate seriously without a controlled distributed environment.
The lab gives us:
- real service hops
- real OpenTelemetry traces, metrics, and logs
- realistic causal ambiguity
- reproducible failure scenarios
- and a place to test whether the RCA engine is actually finding the right first action
That is crucial for this blog because the demo’s main claim is subtle:
- the monitor is right
- the trace is right
- the database is slow
- and the root cause is still somewhere else
You cannot prove that kind of reasoning with static screenshots or hand-authored evidence bundles alone. You need a live system where propagation can actually happen.
The simulation lab is the proving ground. The end goal is production-grade incident investigation across real systems, but the lab is where the causal model gets hardened.
11. What This Means for SRE Teams
The main lesson here is not “use Slack” or “wire Datadog to an LLM.”
It is that modern incident tooling needs to separate at least four concepts that are often collapsed into one:
- where the alert fired
- where the request was slowest
- where the failure started
- and what evidence proves the chain
If your system cannot distinguish those roles, operators will keep getting pointed at the wrong place to act first.
The second lesson is that useful AI in incident response is mostly a systems architecture problem:
- secure ingestion
- strong correlation
- deterministic evidence shaping
- memory across follow-ups
- constrained model responsibility
The model matters, but only after the telemetry and evidence pipeline are doing the right kind of narrowing.
Key Takeaways
- The alert surface is not the root cause. In distributed incidents, the service that triggers the monitor is often just where pain became visible.
- The slowest span is not necessarily the first causal break. Bottleneck and root cause can live in different services.
- Trace-first investigation reduces ambiguity. Exact pivots like `trace_id` are far more useful than broad keyword search.
- Deterministic evidence should come before narrative generation. Timelines, representative traces, and confidence signals make the final RCA more trustworthy.
- Follow-up memory is a core architectural capability. Without persisted incident context, every Slack reply is just another one-shot summary.
- Simulation labs are not demos for their own sake. They are the environment where causal RCA architectures can be stress-tested before broader real-world rollout.
Closing
This demo is really demonstrating one architectural claim:
Flipturn is designed to investigate incidents where the obvious answer is wrong.
It starts from the symptom alert, pivots into the distributed trace, correlates logs and spans on the same causal path, separates root cause from bottleneck, and then keeps that incident alive as thread memory for follow-up questions.
That is what makes the RCA useful. It does not just say what was slow. It tells the operator where to act first and why.
Want to eliminate incident firefighting?
Join teams using Flipturn for autonomous root cause analysis.
Request Access