Tags: opentelemetry, observability, sre, distributed-systems, datadog, incident-response, python, system-architecture

OpenTelemetry at Flipturn: Building the Causal Telemetry Substrate

Suvro Banerjee
March 9, 2026
18 min read

The Shift: From Observability Data to Causal Telemetry

Most teams adopt OpenTelemetry to improve observability. That is the right starting point, but it is not the end state we care about at Flipturn.

We are building an autonomous causal SRE platform. That changes the bar. A human operator can look at a dashboard, jump into logs, skim a trace, and mentally stitch the story together. An autonomous system cannot rely on that kind of implicit reasoning. It needs telemetry that is already structured for traversal, correlation, and proof.

To get there, we built a simulation lab inside Flipturn: a live distributed microservice environment instrumented with OpenTelemetry and fault injection. That lab is not the product. It is the proving ground. It gives us a controlled way to generate realistic multi-hop failures, validate our correlation model, and test whether Flipturn can move from alert to causal RCA with deterministic evidence.

That is the role OpenTelemetry plays in Flipturn.

We are not using it as a thin instrumentation layer bolted onto services. We are using it as the causal substrate that lets Flipturn move from:

  1. a symptom detected in one service,
  2. to a trace-aligned chain of downstream evidence,
  3. to a deterministic, evidence-backed RCA.

This post documents three things:

  1. what OpenTelemetry is, in the terms that matter to engineers operating distributed systems,
  2. how we have implemented it inside Flipturn's simulation lab today,
  3. why this architecture is the right foundation for a future where our backends may change but our telemetry contract cannot.

1. What OpenTelemetry Actually Gives You

At a high level, OpenTelemetry standardizes how applications emit traces, metrics, logs, and context propagation metadata. The important word here is not "standardized." It is correlated.

You can have logs without causality. You can have metrics without provenance. You can even have traces that are hard to use if the surrounding signals do not share the same investigation keys.

For an SRE system, the useful unit is not "a log" or "a span." The useful unit is:

  • a request path,
  • with parent-child relationships,
  • carrying stable context across hops,
  • searchable from multiple backends,
  • and convertible into a timeline of evidence.

That is why the central OpenTelemetry concepts matter:

  • Spans model units of work inside a request path.
  • Traces connect those spans into an end-to-end execution graph.
  • Metrics confirm the symptom surface: latency, errors, fallback rate, saturation.
  • Logs carry concrete failure evidence: exception text, cache key, route, culprit.
  • Context propagation keeps those signals connected as requests cross service boundaries.
  • Baggage carries investigation-scoped key-value metadata that is not part of span identity but is essential for correlation.

The key engineering insight is that traces alone are not enough.

If a trace tells us that search called catalog, and catalog called database, that gives us request lineage. But if we also want to ask:

  • "Which simulation run did this belong to?"
  • "Which scenario produced this?"
  • "Can I pivot logs, metrics, traces, and Sentry events using the same incident key?"

then we need a second layer of propagation beyond trace identity. That is where baggage becomes operationally important.
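To make the two layers concrete, here is a minimal sketch using the Python OpenTelemetry API; the span name and key values are illustrative, not taken from the Flipturn codebase.

from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("example")

# Trace context: request lineage. Everything inside this block shares a trace_id.
with tracer.start_as_current_span("search.request"):
    # Baggage: investigation-scoped metadata. It travels with the request context
    # across hops but is not part of span identity.
    ctx = baggage.set_baggage("run_id", "run-0042")
    ctx = baggage.set_baggage("scenario_id", "cache-outage", context=ctx)
    token = context.attach(ctx)
    try:
        span = trace.get_current_span()
        # Both correlation layers are now available to anything emitted here.
        print(span.get_span_context().trace_id, baggage.get_baggage("run_id"))
    finally:
        context.detach(token)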


2. The Architecture We Chose at Flipturn

We made an explicit architectural decision early: all services in our simulation lab emit OTLP to an OpenTelemetry Collector. The collector is the backbone. Services should not know or care whether the downstream backend is Datadog today, LGTM tomorrow, or a hybrid setup later.

This is also the place where people can get confused, so it is worth stating directly: Flipturn is not trying to become a microservices demo company. We built these services because autonomous RCA needs a place where causality can be observed, stressed, and verified end to end. We need realistic call chains, realistic propagation boundaries, realistic symptom metrics, and realistic failure modes before we trust the same reasoning path in production environments.

The Runtime Topology of the Simulation Lab

graph TD
  Client[Loadgen / User Traffic] --> Search[search service]
  Search --> Catalog[catalog service]
  Catalog --> Cache[catalog-cache service]
  Catalog --> DB[(database)]
  Cache --> DB
  Search -. traces / metrics / logs .-> Collector[OpenTelemetry Collector]
  Catalog -. traces / metrics / logs .-> Collector
  Cache -. traces / metrics / logs .-> Collector
  DB -. traces / metrics / logs .-> Collector
  Search -. exceptions .-> Sentry[Sentry]
  Catalog -. exceptions .-> Sentry
  Cache -. exceptions .-> Sentry
  Collector --> DD[Datadog APM / Logs / Metrics]
  DD --> Slack[Slack Alert]
  Slack --> Flipturn[Flipturn Investigation Engine]
  Flipturn --> DD
  Flipturn --> Sentry
  Flipturn --> RCA["Deterministic RCA Reply"]
  style Collector fill:#f6d365,stroke:#333,stroke-width:2px
  style Flipturn fill:#a7f3d0,stroke:#333,stroke-width:2px
  style DD fill:#bfdbfe,stroke:#333,stroke-width:2px
  style RCA fill:#ff8a80,stroke:#b71c1c,stroke-width:3px,color:#111

Why Build the Lab at All?

The simulation lab gives us four things that static docs or canned log bundles cannot:

  1. Real distributed traces across service boundaries such as search -> catalog -> catalog-cache -> database
  2. Real propagation behavior for traceparent and baggage across HTTP hops
  3. Real signal disagreement where metrics show the symptom, logs show the failure text, and traces show the path
  4. A repeatable RCA testbed where Flipturn can be evaluated against known failure scenarios instead of hand-wavy examples

This matters because Flipturn's core product claim is not "we can parse telemetry." It is "we can autonomously reconstruct causality." You only get confidence in that claim by running the system against live, instrumented failure paths.

Why the Collector Matters

This is not just "best practice" architecture. It solves a real product problem.

If each service were instrumented directly against one vendor SDK, we would couple telemetry emission to backend choice. That would leak vendor assumptions into application code, increase migration cost, and make multi-backend support painful.

By standardizing on:

  • OpenTelemetry SDKs in the services
  • OTLP between services and collector
  • backend-specific exporters only at the collector edge

we isolate vendor dependencies to one control point.

That is the right seam for a company like Flipturn because our product is not "Datadog automation." Our product is causal incident reasoning over telemetry, regardless of where that telemetry ultimately lands.

The simulation lab helps us validate that architecture in a controlled setting. The collector-based design is what lets that same foundation move into real-world deployments later without rewriting the instrumentation model.


3. The Core Design Problem: Correlation as a Contract

The hardest part of distributed telemetry is not emission. It is correlation.

We solved that by treating telemetry correlation as a first-class contract. In Flipturn, every incident-relevant signal must carry a small, stable set of attributes:

  • run_id
  • scenario_id
  • region
  • deployment.environment

These keys are propagated via W3C baggage and then stamped onto spans, logs, and metrics.

The Contract

From simulation/live/contracts.py:

from typing import Any

from opentelemetry import baggage
from opentelemetry import context as otel_context

INCIDENT_BAGGAGE_KEYS: tuple[str, ...] = (
    "run_id",
    "scenario_id",
    "region",
    "deployment.environment",
)

def get_incident_baggage_dict(ctx: Any | None = None) -> dict[str, str]:
    context = ctx if ctx is not None else otel_context.get_current()
    out: dict[str, str] = {}
    for key in INCIDENT_BAGGAGE_KEYS:
        value = baggage.get_baggage(key, context=context)
        if value is not None:
            out[key] = str(value)
    return out

And the corresponding propagation helper:

def baggage_header(self) -> str:
    return (
        f"run_id={self.run_id},"
        f"scenario_id={self.scenario_id},"
        f"region={self.region},"
        f"deployment.environment={self.env}"
    )
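At the load-generator edge, that header is attached to the outbound request alongside the W3C trace headers. A rough usage sketch, assuming an httpx client; the exact wiring in the lab may differ:

import httpx
from opentelemetry.propagate import inject

def call_search(self, query: str) -> httpx.Response:
    # Seed the incident keys explicitly at the edge; downstream hops rely on the
    # configured W3C propagators to keep both traceparent and baggage flowing.
    headers = {"baggage": self.baggage_header()}
    inject(headers)  # adds traceparent / tracestate from the current context
    return httpx.get("http://search:8080/search", params={"q": query}, headers=headers)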

Why We Use Both Trace Context and Baggage

This distinction matters:

  • Trace context answers: "Which distributed request did this event belong to?"
  • Baggage answers: "Which broader investigation context should travel with this request?"

That gives us two layers of correlation:

  1. Request-scoped causality via trace_id and span_id
  2. Incident-scoped grouping via run_id, scenario_id, and environment tags

This is exactly what Flipturn needs. The request identity drives trace-first investigation. The incident identity lets us filter the wider evidence surface, especially when we need to correlate logs or metrics that are adjacent to the same event stream but not always immediately reachable from a single span query.

The Design Goal

Our goal is not simply "make the telemetry searchable." It is:

make the same incident navigable from traces, logs, metrics, and exceptions without inventing custom per-backend logic each time.

That is the difference between instrumentation and a telemetry substrate.


4. Service-Side Implementation: One Bootstrap, All Signals

To keep the service layer consistent, we built a shared OpenTelemetry bootstrap for all simulation microservices in the lab. The important design choice here is that traces, metrics, logs, propagation, and logging correlation are initialized together.

The Bootstrap Pattern

From simulation/live/services/otel_bootstrap.py:

resource = Resource.create({
    TelemetryNaming.ATTR_SERVICE_NAME: service_name,
    TelemetryNaming.ATTR_ENV: "simulation",
})

trace_exporter = OTLPSpanExporter(endpoint=f"{COLLECTOR_OTLP_HTTP}/v1/traces")
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(BaggageSpanProcessor())
tracer_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(tracer_provider)

metric_exporter = OTLPMetricExporter(endpoint=f"{COLLECTOR_OTLP_HTTP}/v1/metrics")
metric_reader = PeriodicExportingMetricReader(metric_exporter)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

log_exporter = OTLPLogExporter(endpoint=f"{COLLECTOR_OTLP_HTTP}/v1/logs")
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(IncidentLogRecordProcessor())
logger_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))

This matters for two reasons.

First, it ensures the three primary signals are configured uniformly across services. We do not want one service exporting traces but forgetting to stamp logs, or another service emitting metrics with a different naming scheme.

Second, it lets us encode correlation logic once and reuse it everywhere.
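In practice, a service entrypoint then needs a single call. The helper name and return shape below are hypothetical sketches of that pattern; the real module is otel_bootstrap.py:

# Hypothetical usage sketch; the actual bootstrap function and return shape may differ.
from simulation.live.services import otel_bootstrap

otel = otel_bootstrap.init_telemetry(service_name="search")

tracer = otel.tracer   # spans for request paths
meter = otel.meter     # symptom metrics (latency, errors, fallbacks)
logger = otel.logger   # log records already stamped with run_id / scenario_id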

The Missing Piece Most Teams Overlook: Logs

A lot of OpenTelemetry setups get traces right and then leave logs behind. That creates a gap right where SREs usually need the most concrete evidence.

We added two specific components to close that gap:

  1. BaggageSpanProcessor
  2. BaggageLogFilter plus IncidentLogRecordProcessor

The first stamps baggage keys onto spans. The second injects those same baggage keys into Python log records and then into OTel log attributes.

This is a subtle but important implementation detail. Baggage may propagate successfully across services, but unless you explicitly attach it to log records, your logs are not guaranteed to be queryable with the same correlation keys in the backend.

That is why our log filter exists.

class BaggageLogFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = otel_context.get_current()
        baggage_attrs = get_incident_baggage_dict(ctx)
        for key, value in baggage_attrs.items():
            setattr(record, key, value)
        return True
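The OTel-side counterpart, IncidentLogRecordProcessor, does the same for exported log records before they leave the process. A minimal sketch of the idea, assuming the SDK's LogRecordProcessor interface; the real implementation may differ:

from opentelemetry.sdk._logs import LogData, LogRecordProcessor

class IncidentLogRecordProcessor(LogRecordProcessor):
    # Sketch: stamp incident baggage keys onto every exported OTel log record.

    def emit(self, log_data: LogData) -> None:
        attrs = dict(log_data.log_record.attributes or {})
        attrs.update(get_incident_baggage_dict())  # run_id, scenario_id, region, env
        log_data.log_record.attributes = attrs

    def shutdown(self) -> None:
        pass

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        return True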

This design lets us answer questions like:

  • "Show me all logs from this run."
  • "Show me only the logs from this scenario in us-east-1."
  • "Find the log evidence associated with the representative trace."

without writing service-specific adapters every time.


5. Signal Modeling in the Services

Once the bootstrap is in place, services emit the actual telemetry that Flipturn will later reason over.

Take the search and catalog services from the live stack. These exist as part of the simulation lab, not because Flipturn's end goal is to own business services, but because we need live systems that can produce realistic symptoms and propagation chains on demand.

search records request latency histograms and error counters, while catalog records both latency and a domain-specific fallback counter for cache misses that spill into direct database reads.

That distinction is important. Generic infrastructure metrics are not enough for causal analysis. We need domain-shaped metrics that describe the propagation path.

Example: Search Symptom Surface

From simulation/live/services/search.py:

latency_hist = otel.meter.create_histogram(
    name=TelemetryNaming.request_latency(TelemetryNaming.SVC_SEARCH),
    description="Search request latency",
    unit="ms",
)

error_counter = otel.meter.create_counter(
    name=TelemetryNaming.error_count(TelemetryNaming.SVC_SEARCH),
    description="Search request errors",
)

Example: Catalog Propagation Signal

From simulation/live/services/catalog.py:

fallback_counter = otel.meter.create_counter(
    name=TelemetryNaming.fallback_rate(TelemetryNaming.SVC_CATALOG),
    description="Cache miss fallback to database",
)

Why does this matter?

Because a latency breach on search is usually only the visible symptom. The fallback counter in catalog is much closer to the causal transition point. It tells us that the service did not simply get slow; it changed execution mode because an upstream cache path failed.

That is the kind of signal an autonomous SRE system can use to build a real propagation chain:

catalog-cache failure -> catalog fallback increase -> database load increase -> search latency breach

Without these intermediate signals, the agent is forced to infer too much from the symptom alone.
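To make the shape of that signal concrete, here is a sketch of how the catalog handler might record the fallback, with the incident baggage attached so the counter can be sliced by run and scenario; the cache and database clients are hypothetical stand-ins:

def get_item(item_id: str) -> dict:
    item = cache_client.get(item_id)        # hypothetical cache client
    if item is not None:
        return item

    # Cache miss: record the mode change, not just the eventual slowdown.
    fallback_counter.add(1, attributes={
        "fallback.reason": "cache_miss",
        **get_incident_baggage_dict(),       # run_id, scenario_id, region, env
    })
    return db_client.fetch_item(item_id)     # hypothetical database client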


6. Collector Strategy: Normalize Once, Export Many

The collector configuration is intentionally simple, but the simplicity is doing important work.

From simulation/live/collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
  batch:
  resource:
    attributes:
      - key: deployment.environment
        value: "simulation"
        action: upsert
  transform/logs:
    log_statements:
      - context: log
        statements:
          - set(attributes["service"], resource.attributes["service.name"])
          - set(attributes["env"], resource.attributes["deployment.environment"])
          - set(attributes["region"], resource.attributes["region"])

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [datadog, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [datadog, debug]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, transform/logs, batch]
      exporters: [datadog, debug]

Why This Is the Right Layer for Normalization

There are two classes of problems here:

  1. service-side correctness
  2. backend-side shape

Service-side correctness means the app emitted the right telemetry. Backend-side shape means the telemetry is searchable and faceted the way our investigation engine expects.

Those should not be solved in the same place.

The collector gives us a central location to:

  • enforce resource attributes,
  • normalize logs into backend-friendly fields,
  • add safety nets when a service forgets a field,
  • and change exporters without touching app code.

This is the portability story in practice.

If we want to extend Flipturn beyond Datadog, the code that emits trace_id, run_id, and deployment.environment should not change. The query adapters and collector exporters may change. The telemetry contract should not.


7. How Flipturn Consumes OpenTelemetry: Trace-First Investigation

Most blog posts about OpenTelemetry stop after instrumentation and export. For Flipturn, that is only half the system.

The real value appears when the investigation engine consumes telemetry in a deterministic order.

The Investigation Flow

sequenceDiagram
  participant Slack as Slack Alert
  participant Worker as Stream Worker
  participant Planner as Evidence Planner
  participant DD as Datadog
  participant S as Sentry
  participant Timeline as Evidence Timeline
  participant RCA as RCA Builder
  Slack->>Worker: Alert text with service/env/trace hints
  Worker->>Planner: Build CorrelationEnvelope
  Planner->>DD: Query traces first
  Planner->>DD: Query logs with exact pivots
  Planner->>DD: Query symptom metrics
  Planner->>S: Query tagged exceptions
  DD-->>Timeline: Spans, logs, metric facts
  S-->>Timeline: Exception evidence
  Timeline->>RCA: Representative trace + confidence + pivots
  RCA-->>Slack: Deterministic RCA reply

The Query Strategy

The critical design choice is that Flipturn does not begin with broad text search. It begins with the strongest available correlation key.

From app/services/evidence_planner.py:

if "trace_id" in envelope.exact_ids:
    tid = envelope.exact_ids["trace_id"]
    plans.append(QueryPlan(
        source="datadog_traces",
        query_string=f"@trace_id:{tid}",
        priority=1,
    ))
    plans.append(QueryPlan(
        source="datadog_logs",
        query_string=f"@otel.trace_id:{tid}",
        priority=2,
    ))

Only after trace pivots does it widen into:

  • run_id
  • service + environment dimensions
  • Sentry issue lookups
  • symptom-confirming metric queries

This is an explicit engineering decision. We do not want the system to start with noisy keyword matching when exact execution lineage exists.
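When no trace_id is available, the widening step has the same shape, just with weaker pivots at lower priority. A sketch of what those fallback plans could look like; the field names and query syntax here are illustrative, not the production planner:

# Fallback pivots when the alert carried no trace_id (illustrative fields and syntax).
run_id = envelope.exact_ids.get("run_id")
if run_id:
    plans.append(QueryPlan(
        source="datadog_logs",
        query_string=f"@run_id:{run_id} service:{service} env:{env}",  # service/env from the alert
        priority=3,
    ))
    plans.append(QueryPlan(
        source="sentry_issues",
        query_string=f"run_id:{run_id}",
        priority=4,
    ))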

Why Trace-First Is So Important

If a trace_id is present, it is the best causal anchor in the system:

  • it cuts across service boundaries,
  • it narrows the evidence set dramatically,
  • it connects logs and APM data for the same distributed request,
  • and it reduces hallucination pressure because the agent is working inside a small, relevant search space.

We still keep run_id and other baggage fields because not every useful event arrives perfectly trace-linked. But those are secondary pivots, not the first move.


8. Turning Raw Telemetry into Deterministic RCA

This is where Flipturn's architecture becomes more than an observability pipeline.

Once evidence comes back from Datadog and Sentry, we normalize it into a single Evidence Timeline. Then we compute three deterministic artifacts before the LLM even begins its narrative work:

  1. confidence signals,
  2. a representative trace,
  3. OTel pivots for human and machine follow-up.

Representative Trace Extraction

From app/services/representative_trace.py, Flipturn selects a trace not just by "longest duration," but by a ranking that combines:

  • whether it satisfies the alert threshold,
  • whether it contains the symptom service,
  • its max span duration,
  • and its multi-service coverage.

That is exactly the right heuristic for incident RCA. The globally longest trace is not always the best explanatory trace. The trace that best captures the causal chain behind the alert is more valuable.
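A rough sketch of that ranking makes the heuristic concrete; the TraceSummary shape, fields, and ordering below are illustrative, not the production model:

from dataclasses import dataclass

@dataclass
class TraceSummary:                        # illustrative shape
    breaches_alert_threshold: bool
    contains_symptom_service: bool
    services: set[str]
    max_span_duration_ms: float

def score(t: TraceSummary) -> tuple:
    # Criteria ordered by importance; max() then picks the best explanatory trace,
    # which is not necessarily the globally slowest one.
    return (
        t.breaches_alert_threshold,
        t.contains_symptom_service,
        len(t.services),
        t.max_span_duration_ms,
    )

representative = max(candidate_traces, key=score)  # candidate_traces: traces in the alert window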

OTel Pivots

From app/services/otel_pivots.py, we build copy-paste-ready pivots like:

  • Datadog APM query
  • Datadog Logs query
  • trace and span identifiers
  • secondary filters such as environment, region, run_id, and scenario_id

That gives operators and downstream tooling a shared investigation handle.
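Concretely, the pivot payload ends up looking something like this; the values and exact query strings are illustrative:

otel_pivots = {
    "trace_id": "8c0f7a...",                                  # example value, elided
    "span_id": "1f2e3d...",
    "datadog_logs_query": "@otel.trace_id:8c0f7a... env:simulation",
    "datadog_apm_query": "service:search env:simulation",
    "secondary_filters": {
        "run_id": "run-0042",
        "scenario_id": "cache-outage",
        "region": "us-east-1",
    },
}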

RCA Assembly

In the stream worker, these pieces are assembled before the final analysis phase:

representative_trace = build_representative_trace(
    timeline=evidence_timeline,
    confidence_blob=evidence_timeline.meta["confidence"],
)

pivots = build_otel_pivots(evidence_timeline)
if pivots:
    evidence_timeline.meta["otel_pivots"] = pivots.model_dump(mode="json")

This means the LLM is not starting from a pile of raw telemetry. It is starting from:

  • ranked evidence,
  • trace-aware pivots,
  • an already computed causal candidate,
  • and a deterministic source-of-truth timeline.

That is a very different system from "LLM + logs."

It is a hybrid architecture where OpenTelemetry provides the graph structure and Flipturn provides the reasoning layer.


9. Why This Architecture Is Portable by Design

One of the easiest mistakes in observability engineering is to confuse the backend with the telemetry model.

Datadog is our current backend for traces, metrics, and logs. Sentry is our current exception backend. But the Flipturn architecture is intentionally built so that those are integration endpoints, not foundational assumptions.

What Is Stable

The following should remain stable even if the backend stack changes:

  • the OTel semantic shape of signals,
  • the trace context propagation model,
  • the baggage correlation contract,
  • service naming,
  • metric naming,
  • incident pivoting logic,
  • evidence normalization into timelines.

What Can Change

These are the replaceable layers:

  • collector exporters
  • backend query clients
  • monitor definitions
  • query adapters for logs, traces, metrics, and issues

That separation is the whole point.

If we want to support an LGTM-style stack later, the service code should still emit OTLP. The collector should still be the fan-out point. Flipturn should still reason over normalized evidence objects, not raw vendor response formats.

That is how you avoid having to "rebuild observability" every time your backend strategy changes.


10. The Way Forward

We have the foundation in place, but there is still meaningful work ahead.

A. Continue Using the Lab as the RCA Proving Ground

Today, the live stack already demonstrates the architecture with services like search, catalog, catalog-cache, and database. That lab should keep evolving because it is where we can test new correlation strategies, representative-trace heuristics, evidence scoring methods, and failure scenarios under controlled conditions.

It is our internal proving ground for autonomous causal RCA.

B. Apply the Same Telemetry Contract to Real Systems

The real destination is not the lab. The destination is real-world telemetry across production systems, support tooling, and customer environments. The point of the lab is to harden the foundation so that when we expand outward, we are carrying a battle-tested correlation model rather than a theoretical one.

C. Strengthen Async and Cross-Boundary Causality

Trace context across synchronous HTTP paths is only one part of the story. Longer term, we need the same rigor for:

  • queued work,
  • background jobs,
  • retries,
  • fan-out tasks,
  • and linked spans across asynchronous boundaries.

That is where span links and richer propagation patterns become increasingly important.

D. Grow the Evidence Layer, Not Just the Raw Signal Layer

More telemetry is not enough. The system improves when more telemetry becomes rankable evidence.

That means continued work on:

  • evidence scoring,
  • topology-aware relevance,
  • better representative-trace selection,
  • tighter metric-to-trace joins,
  • and stronger explanation of why a given trace was chosen as proof.

E. Keep the Contract Small and Rigid

The bigger the telemetry surface gets, the more important it becomes not to let the contract sprawl. We should resist random custom tags that mean different things in different services. A small set of canonical correlation keys is a feature, not a limitation.


Key Takeaways for Engineering Teams

  1. OpenTelemetry is not just instrumentation. In a distributed system, it becomes the contract that determines whether an incident can be reconstructed reliably.
  2. Trace context and baggage solve different problems. You need both if you care about request-level lineage and incident-level grouping.
  3. The collector is the right portability seam. Emit OTLP from services, normalize centrally, and treat exporters as replaceable.
  4. Logs must be part of the correlation design. If baggage never reaches logs, your most concrete evidence becomes hard to pivot to.
  5. Autonomous RCA requires deterministic pivots. Trace-first investigation is far more reliable than broad text search.
  6. Backend portability depends on stable telemetry contracts. The backend may change; the causal model should not.

Closing

At Flipturn, OpenTelemetry is not the end of the observability story. It is the beginning of the reasoning story.

We use it to encode request lineage, propagate investigation context, normalize telemetry across services, and hand our investigation engine a search space shaped by causality rather than noise.

The simulation lab is how we test that foundation rigorously. It gives us live distributed systems, controlled faults, and repeatable incident scenarios so we can evaluate whether Flipturn is actually doing autonomous causal RCA rather than just summarizing observability data.

That is what makes the next step possible.

As we extend Flipturn into real-world environments, more backends, and more complex incident flows, we do not want a larger pile of telemetry. We want a system that can explain, with evidence, why the symptom happened, where it started, and how it propagated.

OpenTelemetry is the substrate that makes that possible.

Want to eliminate incident firefighting?

Join teams using Flipturn for autonomous root cause analysis.
