
The Evidence Plane: Canonical Queries, Normalized Models, and Why They Are the Moat

Suvro Banerjee
April 5, 2026
13 min read

The Wrong Answer to the Right Problem

The first post in this series described the coupling problem: Flipturn V1 had Datadog's query syntax embedded in the reasoning layer, the evidence planning layer, and the tool registration layer. The agent's ability to investigate incidents was wired directly to a specific vendor's data model.

The obvious response to that diagnosis is: add more integrations. Build a Loki client. Build a Tempo client. Thread them into the existing code with new conditional branches. Keep the Datadog path working.

That response solves the symptom. It makes the problem structural.

The evidence planning layer becomes a conditional forest. Every new backend adds another if source == "loki" branch alongside the if source == "datadog_logs" branch. The formatters multiply. The agent has to know which backend it is talking to. The test surface grows with every integration added.

More importantly, the fundamental question remains unanswered: can the agent's reasoning quality survive a backend change?

The only honest answer requires building something before writing a single new backend: a stable internal evidence contract. A layer that every backend writes to, and that the reasoning layer reads from, regardless of where the evidence originated.

That is what we call the evidence plane.


1. What the LLM Actually Needs

The starting point is a deceptively simple question: what does the LLM actually need from observability evidence in order to reason about an incident?

Not the Datadog API response shape. Not the Loki stream structure. Not a PromQL result matrix. The LLM needs evidence. Specifically:

For a log entry:

  • when did it happen
  • which service emitted it
  • how severe was it
  • what did it say
  • which trace was it part of

For a trace span:

  • which trace and span
  • which service, which operation
  • how long did it take
  • did it succeed or fail
  • what error, if any

For a metric point:

  • which metric, at what time
  • what was the value
  • what tags scoped it

For an error issue:

  • what was the exception type and message
  • where in the code did it happen
  • how many times, how recent
  • what does the stack trace say

None of those fields are Datadog-specific. None of them are Loki-specific. They are evidence fields. Backend-agnostic by definition.

Once you name that clearly, the architecture becomes obvious: define a normalized data model for each evidence type, have every backend produce it, and give the reasoning layer nothing else.


2. The Four Normalized Models

These live in app/providers/models.py. They are plain Python dataclasses — no inheritance, no framework dependency, no backend-specific fields.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class NormalizedLogEntry:
    timestamp: datetime
    service: str
    level: str
    message: str
    trace_id: str | None = None
    span_id: str | None = None
    attributes: dict[str, Any] = field(default_factory=dict)
    source_provider: str = ""          # "datadog", "loki", etc.

@dataclass
class NormalizedSpan:
    trace_id: str
    span_id: str
    service: str
    operation: str
    duration_ms: float
    status: str                        # "ok" or "error"
    timestamp: datetime | None = None
    error_message: str | None = None
    parent_span_id: str | None = None
    attributes: dict[str, Any] = field(default_factory=dict)
    source_provider: str = ""

@dataclass
class NormalizedMetricPoint:
    timestamp: datetime
    value: float
    metric_name: str
    tags: dict[str, str] = field(default_factory=dict)
    source_provider: str = ""

@dataclass
class NormalizedErrorIssue:
    title: str
    exception_type: str
    exception_message: str
    culprit: str
    count: int
    first_seen: datetime | None = None
    last_seen: datetime | None = None
    level: str = "error"
    stack_trace: str | None = None
    tags: dict[str, str] = field(default_factory=dict)
    source_provider: str = ""

A few design decisions worth noting.

source_provider is on every model. This is not for the LLM — the LLM's formatted output always says "Loki Logs" or "Datadog APM Traces" but the underlying reasoning is identical. source_provider is for attribution in evidence reports and for the formatter to produce the right label. The reasoning layer never branches on it.

attributes: dict[str, Any] is a catch-all for backend-specific metadata that may be useful but does not fit the core fields. A Datadog log might carry @http.status_code. A Loki stream might carry a Kubernetes pod label. Both go into attributes. The core fields are what the LLM reasons against.

timestamp is always a datetime — timezone-aware, UTC. Never a string, never a Unix epoch integer floating in from a raw API response. Each backend's normalizer owns the conversion.


3. The Canonical Query Model

If normalized models are the output side of the evidence contract, canonical queries are the input side.

Before the provider layer existed, queries were constructed as raw backend DSL strings:

# Before — Datadog DSL embedded in the planning layer
f"service:{service} deployment.environment:{env} status:error"
f"@otel.trace_id:{trace_id}"

Those strings are uninterpretable without knowing the target backend. They cannot be inspected, tested for completeness, or translated to LogQL or TraceQL without parsing them back apart.

The canonical query model, in app/providers/queries.py, replaces strings with structured data:

from dataclasses import dataclass, field

@dataclass
class CanonicalLogQuery:
    service: str | None = None
    environment: str | None = None
    level: LogLevel | None = None
    trace_id: str | None = None
    text_contains: str | None = None
    attributes: dict[str, str] = field(default_factory=dict)

@dataclass
class CanonicalTraceQuery:
    service: str | None = None
    environment: str | None = None
    trace_id: str | None = None
    status_error: bool = False
    min_duration_ms: float | None = None
    attributes: dict[str, str] = field(default_factory=dict)

@dataclass
class CanonicalMetricsQuery:
    metric_name: str
    aggregation: str = "avg"           # avg, sum, max, min, p95
    group_by: list[str] = field(default_factory=list)
    filters: dict[str, str] = field(default_factory=dict)

@dataclass
class CanonicalErrorQuery:
    exception_type: str | None = None
    service: str | None = None
    environment: str | None = None
    is_unresolved: bool = True
    attributes: dict[str, str] = field(default_factory=dict)

These are data, not strings. The evidence planner constructs them. Each provider's translator converts them to the appropriate backend DSL — Datadog syntax, LogQL, TraceQL, PromQL — and that translation is entirely encapsulated inside the provider package. Nothing above the provider boundary ever constructs or parses backend-specific query syntax again.

TimeRange as a First-Class Value

Before the provider layer, time was passed as a loose minutes: int parameter scattered across every tool signature. There was no single place that owned "the time window for this investigation."

Now it is a frozen dataclass:

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class TimeRange:
    start: datetime
    end: datetime

    def __post_init__(self) -> None:
        if self.start.tzinfo is None or self.end.tzinfo is None:
            raise ValueError("TimeRange requires timezone-aware datetimes")
        if self.start > self.end:
            raise ValueError("TimeRange start must not be after end")

    @classmethod
    def last_minutes(cls, minutes: int) -> "TimeRange":
        now = datetime.now(timezone.utc)
        return cls(start=now - timedelta(minutes=minutes), end=now)

    @property
    def duration_minutes(self) -> int:
        return max(int((self.end - self.start).total_seconds() / 60), 0)

frozen=True means the time range is immutable once created — no accidental mutation during an investigation. __post_init__ enforces timezone awareness at construction time rather than silently producing wrong timestamps. last_minutes() is the standard factory used everywhere.

Every backend converts TimeRange to its own format. Loki gets nanosecond epoch integers. Prometheus gets Unix seconds. Tempo gets ISO 8601. That conversion is each normalizer's responsibility. The rest of the system passes a single TimeRange and never thinks about backend time formats.
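Those per-backend conversions can be made concrete with a minimal sketch. The helper names are hypothetical; in the real codebase each conversion would live inside that backend's normalizer:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class TimeRange:
    start: datetime
    end: datetime

def to_loki_range(tr: TimeRange) -> tuple[int, int]:
    """Loki wants nanosecond epoch integers."""
    return (int(tr.start.timestamp() * 1e9), int(tr.end.timestamp() * 1e9))

def to_prometheus_range(tr: TimeRange) -> tuple[float, float]:
    """Prometheus wants Unix seconds."""
    return (tr.start.timestamp(), tr.end.timestamp())

def to_tempo_range(tr: TimeRange) -> tuple[str, str]:
    """Tempo wants ISO 8601 strings."""
    return (tr.start.isoformat(), tr.end.isoformat())
```

Three formats, one value object; the caller passes a TimeRange and is done.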


4. The Protocol Layer

With the data models defined, the question becomes: how do we enforce that every backend actually produces them?

The answer is Python's Protocol from typing. Four signal-type protocols in app/providers/protocols.py:

from typing import Protocol, runtime_checkable

@runtime_checkable
class LogProvider(Protocol):
    provider_name: str

    def fetch_logs(
        self, query: CanonicalLogQuery, time_range: TimeRange, limit: int = 10
    ) -> list[NormalizedLogEntry]: ...

    def health_check(self) -> bool: ...

@runtime_checkable
class TraceProvider(Protocol):
    provider_name: str

    def fetch_traces(
        self, query: CanonicalTraceQuery, time_range: TimeRange, limit: int = 10
    ) -> list[NormalizedSpan]: ...

    def fetch_trace_by_id(
        self, trace_id: str, time_range: TimeRange
    ) -> list[NormalizedSpan]: ...

    def health_check(self) -> bool: ...

MetricsProvider and ErrorTrackingProvider follow the same pattern.

Protocol gives us structural typing — a class satisfies LogProvider if it has the right method signatures, regardless of inheritance. A LokiProvider does not inherit from DatadogProvider. It just needs provider_name, fetch_logs(), and health_check(). The type checker validates the contract. isinstance(provider, LogProvider) works at runtime because of runtime_checkable.

health_check() is on every protocol. It validates at startup that the backend is reachable and credentials are valid, before any incident arrives.

provider_name: str is declared on every protocol as a class-level attribute. This is the identity string that flows through the normalized models as source_provider and through the formatter as the display label.

One provider class can satisfy multiple protocols. DatadogProvider implements LogProvider, TraceProvider, and MetricsProvider — Datadog covers all three signal types. Sentry implements only ErrorTrackingProvider. LokiProvider implements only LogProvider. The registry handles each capability independently.


5. The Registry and Bootstrap

The ProviderRegistry (app/providers/registry.py) holds typed lists of providers per signal type:

class ProviderRegistry:
    def __init__(self) -> None:
        self._log_providers: list[LogProvider] = []
        self._trace_providers: list[TraceProvider] = []
        self._metrics_providers: list[MetricsProvider] = []
        self._error_providers: list[ErrorTrackingProvider] = []

    def register_log_provider(self, provider: LogProvider) -> None:
        self._assert_protocol(provider, LogProvider)   # isinstance check at registration
        self._append_unique(self._log_providers, provider)

Registration validates the protocol contract with isinstance(provider, LogProvider) — this is where runtime_checkable pays off. A provider that does not implement the full contract is rejected at startup, not during an investigation.

bootstrap_registry() in app/providers/bootstrap.py turns configuration into a live registry:

def bootstrap_registry(config: BackendConfig, ...) -> ProviderRegistry:
    registry = ProviderRegistry()
    factories = {**default_provider_factories(), ...}

    for backend in config.backends:
        if not backend.enabled:
            continue
        factory = factories.get(backend.type)
        if factory is None:
            raise ValueError(f"Unknown backend type: {backend.type}")
        provider = factory(backend)
        _register_by_capability(registry, provider)

    return registry

def _register_by_capability(registry: ProviderRegistry, provider: object) -> None:
    if isinstance(provider, LogProvider):
        registry.register_log_provider(provider)
    if isinstance(provider, TraceProvider):
        registry.register_trace_provider(provider)
    if isinstance(provider, MetricsProvider):
        registry.register_metrics_provider(provider)
    if isinstance(provider, ErrorTrackingProvider):
        registry.register_error_provider(provider)

_register_by_capability uses multiple if checks — not elif. A single provider object like DatadogProvider satisfies three protocols and gets registered in three lists in one call. The factory pattern means adding a new backend type is a one-line addition to default_provider_factories() and a new provider package — nothing else changes.
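The capability dispatch can be demonstrated with a reduced sketch; the two-method protocols and the hypothetical DualProvider below stand in for the real signatures:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class LogProvider(Protocol):
    provider_name: str
    def fetch_logs(self) -> list: ...

@runtime_checkable
class MetricsProvider(Protocol):
    provider_name: str
    def fetch_metrics(self) -> list: ...

class DualProvider:
    """Hypothetical provider covering two signal types, like DatadogProvider."""
    provider_name = "dual"
    def fetch_logs(self) -> list: return []
    def fetch_metrics(self) -> list: return []

log_providers: list = []
metrics_providers: list = []

def register_by_capability(provider: object) -> None:
    # Multiple ifs, not elif: one provider may land in several lists.
    if isinstance(provider, LogProvider):
        log_providers.append(provider)
    if isinstance(provider, MetricsProvider):
        metrics_providers.append(provider)

register_by_capability(DualProvider())
assert len(log_providers) == 1 and len(metrics_providers) == 1
```

A log-only provider passed through the same function would land in log_providers and nowhere else.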


6. The Formatter: Keeping the LLM Contract Stable

Normalized models are structured data. The LLM expects text. The formatter in app/providers/formatter.py is the bridge between them.

from datetime import timezone

def format_logs(
    entries: list[NormalizedLogEntry],
    minutes: int,
    provider_name: str | None = None
) -> str:
    provider_label = _provider_label(
        provider_name or _infer_provider(entries),
        default=""
    )
    header = f"{provider_label} Logs".strip()   # plain "Logs" when no provider is known
    lines = [f"=== {header} ({len(entries)} entries, last {minutes} minutes) ==="]
    for entry in entries:
        timestamp = entry.timestamp.astimezone(timezone.utc).strftime("%H:%M:%S")
        lines.append(
            f"[{timestamp}] [{entry.level.upper()}] {entry.service}: {entry.message}"
        )
    return "\n".join(lines)

The output for a Loki log entry and the output for a Datadog log entry are structurally identical:

=== Loki Logs (12 entries, last 15 minutes) ===
[14:23:11] [ERROR] search-service: failed to deserialize cache response
[14:23:12] [ERROR] search-service: retry 1/3 failed — cache timeout
=== Datadog Logs (12 entries, last 15 minutes) ===
[14:23:11] [ERROR] search-service: failed to deserialize cache response
[14:23:12] [ERROR] search-service: retry 1/3 failed — cache timeout

The provider label in the header differs. The evidence is identical. The LLM's reasoning is identical.

_provider_label() maps internal provider names to display names:

def _provider_label(provider_name: str | None, default: str) -> str:
    mapping = {
        "datadog": "Datadog", "sentry": "Sentry",
        "loki": "Loki", "tempo": "Tempo", "prometheus": "Prometheus",
    }
    return mapping.get(provider_name, provider_name.title() if provider_name else default)

_infer_provider() walks the list of normalized items and reads source_provider off the first one that has it — so even if provider_name is not explicitly passed, the formatter figures it out from the data.


7. The Tool Factory: From Registry to LangGraph Tools

The final piece connects the registry to the LangGraph agent. build_tools_from_registry() in app/providers/tool_factory.py generates @tool decorated functions dynamically from whatever providers are registered:

def build_tools_from_registry(registry: ProviderRegistry) -> list:
    tools = []

    if registry.log_providers:
        log_provider = registry.log_providers[0]

        @tool("fetch_logs")
        def fetch_logs(query: str, minutes: int = 15) -> str:
            """Query logs from the configured log provider."""
            canonical = _parse_log_query_from_agent(query)
            entries = log_provider.fetch_logs(canonical, TimeRange.last_minutes(minutes))
            return format_logs(entries, minutes=minutes, provider_name=log_provider.provider_name)

        tools.append(fetch_logs)

    # same pattern for traces, metrics, errors
    return tools

The tool names — fetch_logs, fetch_traces, fetch_metrics, fetch_recent_issues — are stable. The LLM's system prompt does not change when backends change. The tool descriptions do not change. The LLM calls fetch_logs("service:search status:error", minutes=15) against Datadog and calls fetch_logs("service:search status:error", minutes=15) against Loki. The tool factory's query parser handles the translation:

def _parse_log_query_from_agent(query: str) -> CanonicalLogQuery:
    tags = _parse_query_tags(query)         # extracts service:X, env:Y, status:Z
    level = None
    if "status" in tags and tags["status"].lower() in {item.value for item in LogLevel}:
        level = LogLevel(tags["status"].lower())
    return CanonicalLogQuery(
        service=tags.get("service"),
        environment=tags.get("environment") or tags.get("env"),
        level=level,
        trace_id=tags.get("trace_id"),
        text_contains=_strip_key_value_tokens(query) or None,
        attributes={k: v for k, v in tags.items() if k not in {"service", "environment", "env", "status", "trace_id"}},
    )

The agent sends a Datadog-style query string because that is what its prompt has always used. The tool factory parses that string into a CanonicalLogQuery. The LokiProvider translates the canonical query into LogQL. The LLM never sees LogQL. It just gets back formatted log entries.


8. What This Layer Enables

The evidence plane is not just good architecture. It is the thing that makes every subsequent capability possible.

graph LR
  subgraph "Evidence Plane"
    CQ[Canonical Queries]
    NM[Normalized Models]
    FMT[Formatter]
  end
  subgraph "Providers"
    DD[Datadog]
    Loki[Loki]
    Tempo[Tempo]
    Prom[Prometheus]
    Sentry[Sentry]
  end
  subgraph "Reasoning Layer"
    Agent[LangGraph Agent]
    Planner[Evidence Planner]
  end
  Planner -->|CanonicalLogQuery| CQ
  CQ --> DD
  CQ --> Loki
  CQ --> Tempo
  CQ --> Prom
  CQ --> Sentry
  DD -->|NormalizedLogEntry| NM
  Loki -->|NormalizedLogEntry| NM
  Tempo -->|NormalizedSpan| NM
  Prom -->|NormalizedMetricPoint| NM
  Sentry -->|NormalizedErrorIssue| NM
  NM --> FMT
  FMT -->|stable text| Agent
  style CQ fill:#bfdbfe,stroke:#333
  style NM fill:#f6d365,stroke:#333
  style FMT fill:#a7f3d0,stroke:#333

Multi-tenant backend configuration becomes possible because the registry is constructed from BackendConfig.from_env() at startup — and in a future phase, BackendConfig.for_tenant(tenant_id) constructs a per-customer registry. The agent layer never changes.

New backends become a contained addition: one provider package with a translator and a normalizer, registered in default_provider_factories(). Nothing else changes.

Evidence scoring and evidence timelines — which were already backend-agnostic in V1 — now receive data in a consistent format they were implicitly designed for.

And the evaluation story gets stronger: replay tests can seed normalized evidence directly, skipping backend API calls entirely, and verify that the agent's reasoning is identical regardless of where the evidence originated.
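That replay idea can be sketched end to end. The reduced NormalizedLogEntry and format_logs below are simplified stand-ins for the real ones; the point is that evidence seeded from two different providers formats identically below the header:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class NormalizedLogEntry:
    timestamp: datetime
    service: str
    level: str
    message: str
    source_provider: str = ""

def format_logs(entries: list, minutes: int, provider_name: str) -> str:
    lines = [f"=== {provider_name} Logs ({len(entries)} entries, last {minutes} minutes) ==="]
    for entry in entries:
        ts = entry.timestamp.astimezone(timezone.utc).strftime("%H:%M:%S")
        lines.append(f"[{ts}] [{entry.level.upper()}] {entry.service}: {entry.message}")
    return "\n".join(lines)

def seed(provider: str) -> list:
    """Seed one normalized entry as if it came from the given backend."""
    return [NormalizedLogEntry(
        timestamp=datetime(2026, 4, 5, 14, 23, 11, tzinfo=timezone.utc),
        service="search-service",
        level="error",
        message="failed to deserialize cache response",
        source_provider=provider,
    )]

datadog_out = format_logs(seed("datadog"), 15, "Datadog")
loki_out = format_logs(seed("loki"), 15, "Loki")

# Everything below the header line is byte-identical regardless of origin.
assert datadog_out.splitlines()[1:] == loki_out.splitlines()[1:]
```

A replay harness built this way never touches a backend API, yet exercises the exact text contract the agent reasons against.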


9. Why This Is the Moat

It is tempting to look at the evidence plane as infrastructure — the unglamorous foundation you build so the real product can sit on top of it.

That is an underestimate.

The evidence plane is where three things intersect: correctness (every backend produces the same model), stability (the LLM's behavior does not change when backends change), and extensibility (adding a new backend does not touch the reasoning layer).

Most AI-on-top-of-observability products skip this layer. They either hardcode one backend or branch on the backend type throughout the stack. Both approaches work at small scale. Neither survives the complexity that comes with real customer diversity.

The moat is not that Flipturn supports Loki and Tempo. Any product can make HTTP calls to Loki's API. The moat is that Flipturn's reasoning layer has a stable contract with the evidence it receives, and that contract holds regardless of which observability stack a customer is running.

That is what makes it a platform rather than a set of integrations.


What Comes Next

The evidence plane defines the contract. The adapter pattern describes how we migrated Datadog and Sentry to satisfy it — without touching a line of existing production code.

[Continue reading: Never Rewrite Production Code — The Adapter Migration →]
