Tags: sre, observability, system-architecture, platform-engineering, opentelemetry, distributed-systems, devops, incident-response

Can an AI Agent's Reasoning Quality Survive a Backend Change?

Suvro Banerjee
April 4, 2026
9 min read

The Problem Is Not the Integration Layer. It Is the Reasoning Layer.

Every serious engineering team has an observability stack.

Almost none of them use the same one.

Datadog is dominant in mid-market and enterprise. The Grafana LGTM stack — Loki, Grafana, Tempo, Mimir — is growing fast, especially among platform-native teams and startups that chose open-source from day one. Elastic powers a significant segment. New Relic, Dynatrace, Honeycomb, CloudWatch — all have real production customers who made a deliberate choice and are not switching.

When we started building Flipturn, we made a pragmatic call: build against Datadog and Sentry first. Get to a working, production-quality autonomous RCA engine. Ship something that works end-to-end.

That was the right call.

But as the system matured — through the logs fetcher, the traces client, the metrics queries, the evidence planner, the tool registration layer — something became visible that should have been obvious earlier.

The observability vendor was not just in the integration layer. It was in the reasoning layer.

And that is a fundamentally different kind of problem.

1. What "Vendor Lock-In" Actually Means for an AI SRE

When most people talk about vendor lock-in in observability, they mean something like: "we use Datadog's SDK everywhere and switching would be painful." That is a real problem, but it is manageable. You refactor the clients, update the API calls, rerun the tests.

The coupling we built into Flipturn V1 was subtler.

When the LangGraph agent needed to investigate an incident, it called tools that looked like this:

fetch_logs("service:search env:production status:error", minutes=15)
fetch_traces("service:search status:error", minutes=15)

Those query strings — service:search env:production status:error — are Datadog query syntax. Not a generic search description. Not a canonical representation of intent. Datadog's specific DSL, embedded in the arguments the AI agent sends to investigate production incidents.

The same pattern appeared at every layer of the stack.

In evidence_planner.py, the planner that decides what evidence to collect was building Datadog queries directly:

# Before — Datadog DSL embedded in the evidence planning layer
QueryPlan(
    source="datadog_logs",
    query_string=f"service:{service} deployment.environment:{env} status:error",
    minutes=30,
    priority=5,
)

In otel_pivots.py, trace correlation was expressed in Datadog's attribute syntax:

# Before — Datadog-specific attribute format for trace correlation
f"@otel.trace_id:{trace_id}"
f"@dd.trace_id:{trace_id}"

And in graph.py, the tools the LLM was given to reason with were a hardcoded list:

# Before — fixed tool list, hardcoded to Datadog + Sentry
def _build_analysis_tools(pii_service: PiiService | None) -> list:
    tools = [
        fetch_metrics_tool,          # wraps DatadogMetricsFetcher
        fetch_traces_tool,           # wraps DatadogTraceFetcher
        fetch_logs_trace_first_tool, # wraps DatadogLogFetcher
    ]
    sentry_client = SentryClient(pii_service=pii_service)
    tools.append(sentry_client.fetch_recent_issues)
    return tools

No abstraction. No interface boundary. The intelligence layer of the product — the tools an LLM uses to investigate production incidents — was directly wired to Datadog's APIs and Sentry's HTTP client.

2. The Coupling Map

To understand the full scope of the problem, it helps to see where the coupling actually lived:

  • app/agents/graph.py::_build_analysis_tools(): hard-imported DatadogLogFetcher, DatadogTraceFetcher, DatadogMetricsFetcher, and SentryClient in a fixed tool list with no abstraction

  • app/services/evidence_planner.py: built Datadog query strings directly (service:X status:error, @otel.trace_id:Y)

  • app/services/otel_pivots.py: embedded Datadog attribute syntax for trace correlation

  • app/services/agent_service.py: formatted evidence output with hardcoded "Datadog" and "Sentry" labels

  • app/core/config.py: flat DD_API_KEY and SENTRY_AUTH_TOKEN settings with no model for multi-backend configuration

That is five locations where the reasoning infrastructure was coupled to a specific vendor's data model. Switching backends would not mean swapping API clients — it would mean modifying the agent's investigation logic.

The parts that were already well-designed — CorrelationEnvelope for key extraction, EvidenceScorer for relevance scoring, the LangGraph state machine itself — were completely backend-agnostic. They did not care where evidence came from. But they were receiving evidence that had already been shaped by Datadog-specific code upstream.

3. The Three Customers and What This Costs

There are three types of customers Flipturn wants to serve.

Customer A runs Datadog and Sentry. The architecture was built for them. Everything works.

Customer B runs the Grafana LGTM stack — Loki for logs, Tempo for traces, Prometheus or Mimir for metrics. No Datadog. No Sentry. This is increasingly the default for platform-native teams and any engineering org that decided the open-source path was worth maintaining.

Customer C runs Elastic for logs and APM, Jaeger for distributed tracing, PagerDuty for alerting. A common pattern in larger organizations with established platform teams.

In V1, Customer A was fully served. Customers B and C could not use Flipturn at all — not because of a missing HTTP client, but because the architecture made no provision for them. There was no interface boundary to plug into. The query syntax, the tool registration, the evidence formatting — all of it assumed Datadog.

That is the real cost of reasoning-layer coupling. It is not just a refactor. It is a market boundary.

4. The Wrong Instinct

The obvious instinct is to branch.

Add a Loki client. Thread it into the evidence planner with a new if backend == "loki" check. Keep the Datadog path. Ship it. Repeat for Tempo. Repeat for Prometheus.

This solves the immediate problem. It does not solve the structural one.

The evidence planner becomes a conditional forest. The agent prompt has to know which backend it is talking to. The formatters multiply. Every new backend adds another layer of branching logic. The test surface explodes.

More importantly: it still does not answer the fundamental question.

Can the AI agent's reasoning quality survive a backend change?

The only honest answer requires a stable internal contract between the evidence layer and the reasoning layer — one that every backend writes to and the LLM reads from, regardless of where the evidence originated.

If the LLM receives a log entry from Loki formatted as:

[14:23:11 UTC] ERROR search-service: failed to deserialize cache response | trace_id=abc123 duration_ms=342

And the same incident's log entry from Datadog formatted as:

[14:23:11 UTC] ERROR search-service: failed to deserialize cache response | trace_id=abc123 duration_ms=342

— then the LLM's reasoning is identical. The RCA quality does not change. The backend is invisible.

That is the contract we needed to build.
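
To make the contract concrete, here is a minimal sketch of a normalized entry and the formatter that renders it. The field names and the format_log_entry helper are illustrative, not Flipturn's actual implementation; only the output format is taken from the example above.

# Sketch — a normalized entry and its single rendering (illustrative fields)
from dataclasses import dataclass

@dataclass(frozen=True)
class NormalizedLogEntry:
    timestamp: str               # pre-rendered as HH:MM:SS UTC
    level: str                   # ERROR, WARN, INFO, ...
    service: str
    message: str
    trace_id: str | None = None
    duration_ms: int | None = None

def format_log_entry(entry: NormalizedLogEntry) -> str:
    """Render one entry in the one text format the LLM ever sees."""
    line = f"[{entry.timestamp}] {entry.level} {entry.service}: {entry.message}"
    extras = []
    if entry.trace_id:
        extras.append(f"trace_id={entry.trace_id}")
    if entry.duration_ms is not None:
        extras.append(f"duration_ms={entry.duration_ms}")
    return f"{line} | {' '.join(extras)}" if extras else line

# Whether this entry came from Loki or Datadog is invisible at this point:
entry = NormalizedLogEntry("14:23:11 UTC", "ERROR", "search-service",
                           "failed to deserialize cache response",
                           trace_id="abc123", duration_ms=342)
print(format_log_entry(entry))
# [14:23:11 UTC] ERROR search-service: failed to deserialize cache response | trace_id=abc123 duration_ms=342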

5. The Architecture After

The solution is a provider abstraction layer that sits between the agent's tools and the external APIs. The architecture looks like this:

graph TD
    Signal[Incident Signal] --> Planner[Evidence Planner]
    Planner -->|CanonicalLogQuery| Registry[Provider Registry]
    Planner -->|CanonicalTraceQuery| Registry
    Planner -->|CanonicalMetricsQuery| Registry
    Planner -->|CanonicalErrorQuery| Registry
    Registry --> DD[DatadogProvider]
    Registry --> Loki[LokiProvider]
    Registry --> Tempo[TempoProvider]
    Registry --> Prom[PrometheusProvider]
    Registry --> Sentry[SentryProvider]
    DD -->|NormalizedLogEntry, NormalizedSpan, NormalizedMetricPoint| Fmt[Formatter]
    Loki -->|NormalizedLogEntry| Fmt
    Tempo -->|NormalizedSpan| Fmt
    Prom -->|NormalizedMetricPoint| Fmt
    Sentry -->|NormalizedErrorIssue| Fmt
    Fmt -->|LLM-readable text| Agent[LangGraph Agent]
    Agent --> RCA[Root Cause Analysis]
    style Registry fill:#bfdbfe,stroke:#333,stroke-width:2px
    style Fmt fill:#f6d365,stroke:#333,stroke-width:2px
    style Agent fill:#a7f3d0,stroke:#333,stroke-width:2px

The agent calls tools. The tools are generated dynamically from whatever providers are registered. The providers own their query language translation internally. The normalizers produce a stable evidence model. The formatter converts that model to text. The LLM never sees a Datadog JSON response, a LogQL query, or a Prometheus timeseries structure. It just sees evidence.
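
A minimal sketch of that boundary. The CanonicalLogQuery fields, the Protocol shape, and the registry methods here are assumptions; only the names that appear in the diagram are taken from the actual system.

# Sketch — the provider boundary as a small interface (illustrative signatures)
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class CanonicalLogQuery:
    """Backend-neutral statement of intent, not any vendor's DSL."""
    service: str
    environment: str
    level: str
    minutes: int

class LogProvider(Protocol):
    """Anything that can answer a canonical log query."""
    name: str
    def fetch_logs(self, query: CanonicalLogQuery) -> list:
        """Translate to the backend DSL, call the API, normalize the response."""
        ...

class ProviderRegistry:
    """Holds whichever providers this deployment has configured."""
    def __init__(self) -> None:
        self._log_providers: dict[str, LogProvider] = {}

    def register(self, provider: LogProvider) -> None:
        self._log_providers[provider.name] = provider

    def log_providers(self) -> list[LogProvider]:
        return list(self._log_providers.values())

def build_tools_from_registry(registry: ProviderRegistry) -> list:
    """One fetch_logs tool per registered provider; the agent sees the same
    tool shape regardless of which backend sits behind it."""
    return [provider.fetch_logs for provider in registry.log_providers()]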

And crucially — the new _build_analysis_tools() looks like this:

# After — tools generated dynamically from registry
def _build_analysis_tools(pii_service: PiiService | None) -> list:
    settings = get_settings()
    if settings.PROVIDER_REGISTRY_ENABLED:
        registry = bootstrap_registry(BackendConfig.from_env())
        provider_tools = build_tools_from_registry(registry)
        if provider_tools:
            return provider_tools
    # Legacy fallback — unchanged Datadog path still works
    tools = [fetch_metrics_tool, fetch_traces_tool, fetch_logs_trace_first_tool]
    sentry_client = SentryClient(pii_service=pii_service)
    tools.append(sentry_client.fetch_recent_issues)
    return tools

No hardcoded backends. No fixed imports in the function body. The tool list is a function of what is configured — and what is configured is now a first-class concept in the system.

6. What This Architecture Commits To

Building this correctly requires a few principles that are easy to state and surprisingly hard to maintain under deadline pressure.

The evidence model is the contract, not the API. When adding a new backend, the question is not "what does their API return?" It is "how does their response map to NormalizedLogEntry?" The API is an implementation detail inside the provider package. The normalized model is the interface that everything else depends on.
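
As an example of what answering that question looks like, here is a sketch of normalizing one stream from Loki's query_range response, which pairs a label set with [unix-nanosecond, line] values. The NormalizedLogEntry shape is the earlier sketch trimmed to the fields used here, and the label fallbacks are illustrative.

# Sketch — mapping Loki's response shape onto the normalized model
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class NormalizedLogEntry:   # as sketched earlier, trimmed to the fields used here
    timestamp: str
    level: str
    service: str
    message: str

def normalize_loki_stream(stream: dict) -> list[NormalizedLogEntry]:
    """One Loki stream: {"stream": {labels}, "values": [[ns, line], ...]}."""
    labels = stream["stream"]
    entries = []
    for ts_ns, line in stream["values"]:
        ts = datetime.fromtimestamp(int(ts_ns) / 1e9, tz=timezone.utc)
        entries.append(NormalizedLogEntry(
            timestamp=ts.strftime("%H:%M:%S UTC"),
            level=labels.get("level", "info").upper(),
            service=labels.get("service_name", "unknown"),
            message=line,
        ))
    return entries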

Backend DSL never leaks upward. The Datadog query syntax service:search status:error @otel.trace_id:abc123 lives inside app/providers/datadog/translator.py. The LogQL equivalent {service_name="search", level="error"} |= "error" lives inside app/providers/loki/translator.py. Nothing above the provider boundary constructs or parses a backend-specific query string.
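
For illustration, here is the same canonical query rendered by two translators. The function names are assumptions; the output strings follow the two syntaxes quoted above.

# Sketch — each translator owns its backend's DSL (illustrative functions)
from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalLogQuery:    # as in the earlier sketch, trimmed
    service: str
    environment: str
    level: str

def to_datadog(q: CanonicalLogQuery) -> str:
    """Would live inside app/providers/datadog/translator.py."""
    return f"service:{q.service} env:{q.environment} status:{q.level}"

def to_logql(q: CanonicalLogQuery) -> str:
    """Would live inside app/providers/loki/translator.py."""
    return f'{{service_name="{q.service}", env="{q.environment}", level="{q.level}"}}'

q = CanonicalLogQuery(service="search", environment="production", level="error")
print(to_datadog(q))  # service:search env:production status:error
print(to_logql(q))    # {service_name="search", env="production", level="error"}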

The LLM prompt contract is stable. The formatter's job is to produce the same text format the LLM has always expected — regardless of which backend produced the raw data. The LLM's behavior does not change when backends change.

Migration is additive, not destructive. The existing DatadogLogFetcher, DatadogTraceFetcher, DatadogMetricsFetcher, and SentryClient were not deleted. They were wrapped. The new DatadogProvider delegates to them internally. The PROVIDER_REGISTRY_ENABLED feature flag lets both paths run side by side until the new path is fully validated.
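
A sketch of what "wrapped, not deleted" looks like in practice; the legacy fetcher's method names and signatures here are assumptions.

# Sketch — the V1 client survives as an implementation detail (illustrative)
class DatadogProvider:
    """Satisfies the provider interface by delegating to the legacy fetcher."""
    name = "datadog"

    def __init__(self, legacy_fetcher):
        self._fetcher = legacy_fetcher   # e.g. the existing DatadogLogFetcher

    def fetch_logs(self, query):
        # Translate the canonical query to Datadog DSL at the boundary...
        dd_query = f"service:{query.service} env:{query.environment} status:{query.level}"
        raw = self._fetcher.fetch(dd_query, minutes=query.minutes)
        # ...and normalize the raw payload before anything upstream sees it.
        return [self._normalize(item) for item in raw]

    def _normalize(self, item):
        ...  # map Datadog's response fields onto NormalizedLogEntry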

7. The Immediate Results

With the provider abstraction layer in place and the LGTM stack implemented:

  • A customer running Datadog sets DD_API_KEY, DD_APP_KEY, PROVIDER_REGISTRY_ENABLED=true. The DatadogProvider is registered. The agent calls fetch_logs, fetch_traces, fetch_metrics tools backed by Datadog's APIs.

  • A customer running LGTM sets LOKI_URL, TEMPO_URL, PROMETHEUS_URL, PROVIDER_REGISTRY_ENABLED=true. The LokiProvider, TempoProvider, and PrometheusProvider are registered. The agent calls the same fetch_logs, fetch_traces, fetch_metrics tools backed by Loki's LogQL API, Tempo's TraceQL API, and Prometheus's PromQL API.
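
A minimal sketch of that configuration-driven bootstrapping, reusing the registry sketch from earlier. The env-var handling below is illustrative, not Flipturn's actual BackendConfig.

# Sketch — the environment decides which providers exist (illustrative stubs)
import os

class LokiProvider:
    name = "loki"
    def __init__(self, url: str):
        self.url = url

class TempoProvider:
    name = "tempo"
    def __init__(self, url: str):
        self.url = url

class PrometheusProvider:
    name = "prometheus"
    def __init__(self, url: str):
        self.url = url

def bootstrap_registry(registry):
    """Register a provider for every backend the environment configures."""
    if url := os.getenv("LOKI_URL"):
        registry.register(LokiProvider(url))
    if url := os.getenv("TEMPO_URL"):
        registry.register(TempoProvider(url))
    if url := os.getenv("PROMETHEUS_URL"):
        registry.register(PrometheusProvider(url))
    return registry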

The same incident — a cache deserialization failure in the search service — produces the same structure of RCA against both stacks. The evidence is labeled differently (source_provider="datadog" vs source_provider="loki") but the reasoning is identical.

Customer B and Customer C can now use Flipturn.


What Comes Next

This post has described the problem and the solution at the architectural level. The next post goes one level deeper into the part that makes all of this possible: the evidence plane itself.

The normalized models, the canonical queries, the formatter — and why getting that internal contract right is the most important engineering decision in this entire project.

[Continue reading: The Evidence Plane — Canonical Queries, Normalized Models, and Why They Are the Moat →]
