Never Rewrite Production Code: The Adapter Migration
The Moment You Have to Choose
You have just designed a better architecture. The contracts are clear. The models are defined. The protocols are in place. Now you have to connect the existing production system to this new layer without breaking anything.
This is the moment that separates thoughtful engineering from optimistic engineering.
The optimistic choice is to rewrite. Delete the old code, write the new code, update the tests. Clean slate. The old DatadogLogFetcher, DatadogTraceFetcher, DatadogMetricsFetcher, SentryClient — gone. In their place, a fresh implementation that speaks the new protocol from day one.
That approach has a failure mode that only reveals itself under pressure: rewriting production-critical code does three things at once. It introduces unknown bugs, breaks implicit contracts that lived in the old code but were never written down, and opens a window in which the old behavior and the new behavior are each partially true — the worst possible state.
The thoughtful choice is the adapter pattern. You do not delete the old code. You wrap it.
The adapter pattern says: the existing DatadogLogFetcher has years of production hardening behind it. It knows about Datadog's API quirks, its mock mode, its credential validation, its edge cases with log parsing. That knowledge is not in the documentation — it is in the code. The adapter preserves all of it and projects a new interface on top.
That is exactly what we did.
1. The Adapter: DatadogProvider
app/providers/datadog/provider.py is the adapter. It implements three provider protocols — LogProvider, TraceProvider, MetricsProvider — by delegating entirely to the existing fetcher classes:
class DatadogProvider:
    """Adapter around the existing Datadog fetchers using provider protocols."""

    provider_name = "datadog"

    def __init__(self) -> None:
        self._log_fetcher = DatadogLogFetcher()          # unchanged
        self._trace_fetcher = DatadogTraceFetcher()      # unchanged
        self._metrics_fetcher = DatadogMetricsFetcher()  # unchanged
        self._translator = DatadogQueryTranslator()
DatadogLogFetcher, DatadogTraceFetcher, DatadogMetricsFetcher — not a line of code changed in any of them. They remain exactly as they were in V1. DatadogProvider sits in front of them, accepts canonical queries, translates them to Datadog syntax, calls the existing fetchers, and normalizes the response.
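For context, here is roughly what one of those protocols looks like. This is a minimal sketch inferred from the adapter's method signatures, not the verbatim definition; the canonical and normalized models are the ones defined in the evidence plane earlier in this series.

from typing import Protocol

class LogProvider(Protocol):
    """Anything that can serve normalized logs for a canonical query."""

    provider_name: str

    def fetch_logs(
        self, query: CanonicalLogQuery, time_range: TimeRange, limit: int = 10
    ) -> list[NormalizedLogEntry]: ...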
The fetch path for logs makes the chain explicit:
def fetch_logs(
    self, query: CanonicalLogQuery, time_range: TimeRange, limit: int = 10
) -> list[NormalizedLogEntry]:
    if self._log_fetcher.mock_mode:
        return self._mock_logs(query)
    if self._log_fetcher.credentials_missing:
        return []
    rows = self._log_fetcher._search_logs(            # 2. existing fetcher makes the call
        self._translator.translate_log_query(query),  # 1. canonical → Datadog DSL
        from_str=self._to_iso(time_range.start),
        to_str=self._to_iso(time_range.end),
        limit=limit,
    )
    return DatadogLogNormalizer.normalize(rows, provider_name=self.provider_name)  # 3. raw → normalized
Three responsibilities, each clearly separated:
- Translate — DatadogQueryTranslator converts CanonicalLogQuery into a Datadog query string
- Fetch — DatadogLogFetcher._search_logs() makes the HTTP call to Datadog's Logs API
- Normalize — DatadogLogNormalizer.normalize() converts the raw response to list[NormalizedLogEntry]
The existing fetcher is called at step 2. Steps 1 and 3 are new. The fetcher itself has not changed.
Mock mode and credential checks are preserved too — the adapter delegates to self._log_fetcher.mock_mode and self._log_fetcher.credentials_missing. Any existing test that relied on mock mode continues to work without modification.
2. The Translator: Locking DSL Inside the Package
The translator's job is to ensure Datadog's query syntax never leaks outside app/providers/datadog/. It is the boundary that makes the evidence plane backend-agnostic.
class DatadogQueryTranslator:
    def translate_log_query(self, query: CanonicalLogQuery) -> str:
        terms: list[str] = []
        if query.service:
            terms.append(f"service:{query.service}")
        if query.environment:
            terms.append(f"deployment.environment:{query.environment}")
        if query.level:
            terms.append(f"status:{query.level.value}")
        if query.trace_id:
            terms.append(f"@otel.trace_id:{query.trace_id}")
        if query.text_contains:
            terms.append(query.text_contains)
        terms.extend(self._translate_attributes(query.attributes))
        return " ".join(terms).strip()
CanonicalLogQuery(service="search", environment="production", level=LogLevel.ERROR) becomes service:search deployment.environment:production status:error. That translation happens here, inside the translator, and nowhere else.
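As a quick check, the round trip in code (an illustrative snippet using the models shown earlier):

query = CanonicalLogQuery(
    service="search", environment="production", level=LogLevel.ERROR
)
translator = DatadogQueryTranslator()
assert translator.translate_log_query(query) == (
    "service:search deployment.environment:production status:error"
)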
The trace translator shows a subtle detail — Datadog uses nanoseconds for duration filtering:
def translate_trace_query(self, query: CanonicalTraceQuery) -> str:
    terms: list[str] = []
    if query.service:
        terms.append(f"service:{query.service}")
    if query.environment:
        terms.append(f"env:{query.environment}")
    if query.trace_id:
        terms.append(f"trace_id:{query.trace_id}")
    if query.status_error:
        terms.append("status:error")
    if query.min_duration_ms is not None:
        ns = int(query.min_duration_ms * 1_000_000)  # ms → nanoseconds
        terms.append(f"duration:>={ns}")
    return " ".join(terms).strip()
The min_duration_ms → nanoseconds conversion is Datadog-specific knowledge. It lives in the translator. The canonical query carries min_duration_ms: float and knows nothing about Datadog's duration unit. When we built the Tempo translator, it carried min_duration_ms into TraceQL's own syntax. Neither translator knows about the other. Neither leaks into the canonical model.
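For contrast, a sketch of what the equivalent branch looks like on the Tempo side (the real translator is covered in the next post): TraceQL takes duration filters with explicit units, so no nanosecond conversion is needed.

# Sketch of the parallel branch in the Tempo translator
if query.min_duration_ms is not None:
    terms.append(f"duration > {query.min_duration_ms}ms")  # TraceQL units, not ns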
3. The Normalizer: Standardizing the Response
The other side of the adapter is the normalizer. Datadog's Logs API returns a response shape that is Datadog-specific — field names, timestamp formats, attribute nesting. The normalizer converts that into NormalizedLogEntry so everything upstream is shielded from it:
class DatadogLogNormalizer:
    @staticmethod
    def normalize(rows: list[dict], provider_name: str = "datadog") -> list[NormalizedLogEntry]:
        entries = []
        for row in rows:
            entries.append(NormalizedLogEntry(
                timestamp=_coerce_datetime(row.get("timestamp")) or datetime.now(timezone.utc),
                service=str(row.get("service") or "unknown"),
                level=str(row.get("level") or "INFO").upper(),
                message=str(row.get("message") or ""),
                trace_id=_string_or_none(row.get("trace_id")),
                span_id=_string_or_none(row.get("span_id")),
                attributes=dict(row.get("attributes") or {}),
                source_provider=provider_name,
            ))
        return entries
_coerce_datetime() is worth examining in detail. Datadog's timestamp field arrives in at least three formats depending on the API endpoint and SDK version — a Python datetime, a Unix float in milliseconds, or an ISO 8601 string. The normalizer handles all three:
def _coerce_datetime(value: Any) -> datetime | None:
    if isinstance(value, datetime):
        return value if value.tzinfo else value.replace(tzinfo=timezone.utc)
    if isinstance(value, (int, float)):
        # Datadog reports epoch milliseconds; fromtimestamp() expects seconds
        return datetime.fromtimestamp(float(value) / 1000.0, tz=timezone.utc)
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return None
    return None
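All three arrival shapes land on the same timezone-aware value. An illustrative check:

from datetime import datetime, timezone

expected = datetime(2024, 1, 1, tzinfo=timezone.utc)
assert _coerce_datetime(expected) == expected                # datetime passthrough
assert _coerce_datetime(1704067200000) == expected           # epoch milliseconds
assert _coerce_datetime("2024-01-01T00:00:00Z") == expected  # ISO 8601
assert _coerce_datetime("not a timestamp") is None           # garbage → None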
This is production hardening that existed in spirit in the old fetcher and is now made explicit and testable in the normalizer. Every backend's normalizer owns this conversion for its own format. Loki uses nanosecond epoch integers. Prometheus uses Unix seconds. Tempo uses ISO 8601 with nanosecond precision. Each normalizer handles its own format, and NormalizedLogEntry.timestamp is always a timezone-aware datetime by the time it reaches the formatter.
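What that ownership looks like for another backend, sketched under the assumption that Loki returns nanosecond epoch strings (the real Loki normalizer appears in the next post):

def _coerce_loki_timestamp(raw: str) -> datetime:
    # Loki log lines carry a nanosecond-precision epoch timestamp
    return datetime.fromtimestamp(int(raw) / 1_000_000_000, tz=timezone.utc)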
4. The Sentry Adapter: Preserving PII Scrubbing
The Sentry integration had one property that made it unusual in V1: PII scrubbing happened inside SentryClient. Stack traces from production exceptions can contain email addresses, internal hostnames, and API keys embedded in exception messages. The PII scrubber ran on every stack trace before it ever left the Sentry integration.
The SentryProvider had to preserve this exactly. It does so by wrapping SentryClient and keeping the scrubbing in the same position in the chain:
class SentryProvider:
    provider_name = "sentry"

    def __init__(self, pii_service: PiiService | None = None) -> None:
        self._pii_service = pii_service
        self._client = SentryClient(pii_service=self._pii_service)

    def fetch_issues(
        self, query: CanonicalErrorQuery, time_range: TimeRange, limit: int = 5
    ) -> list[NormalizedErrorIssue]:
        # ... HTTP call to Sentry API ...
        normalized = SentryIssueNormalizer.normalize(issues[:limit], provider_name=self.provider_name)
        scrubbed: list[NormalizedErrorIssue] = []
        for issue in normalized:
            if issue.stack_trace:
                scrub_result = self._get_pii_service().scrub(issue.stack_trace)
                issue.stack_trace = scrub_result["scrubbed_text"]  # PII removed before returning
            scrubbed.append(issue)
        return scrubbed
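The _get_pii_service() helper is not shown above. A minimal sketch of its likely shape, assuming it falls back to the same cached service that graph.py uses:

def _get_pii_service(self) -> PiiService:
    # Lazily resolve a scrubber if none was injected at construction
    if self._pii_service is None:
        self._pii_service = get_cached_pii_service()
    return self._pii_service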
The SentryIssueNormalizer normalizes the raw Sentry API response into NormalizedErrorIssue. Then PII scrubbing runs on the normalized model's stack_trace field. The scrubbed result goes back into the model before it is returned. Nothing above SentryProvider ever sees raw stack trace content.
The normalizer also handles Sentry's inconsistent tag format — tags arrive as either a dict or a list[{"key": ..., "value": ...}] depending on the API version:
def _extract_tags(issue: dict) -> dict[str, str]:
    issue_tags = issue.get("tags", [])
    if isinstance(issue_tags, dict):
        return {str(key): str(value) for key, value in issue_tags.items()}
    if isinstance(issue_tags, list):
        return {
            str(item["key"]): str(item["value"])
            for item in issue_tags
            if isinstance(item, dict) and "key" in item and "value" in item
        }
    return {}
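Both shapes collapse to the same result:

_extract_tags({"tags": {"env": "prod"}})                    # {'env': 'prod'}
_extract_tags({"tags": [{"key": "env", "value": "prod"}]})  # {'env': 'prod'}
_extract_tags({"tags": "unexpected"})                       # {}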
This kind of defensive parsing is exactly what accumulates over time in production integrations. Keeping it in the normalizer — where it belongs — rather than scattering it through the evidence planner or agent service is one of the concrete wins of the adapter approach.
5. The Evidence Planner: Making QueryPlan Signal-Aware
The evidence planner in V1 built QueryPlan objects with a source string ("datadog_logs", "datadog_traces") and a raw query_string with Datadog DSL. The ProviderEvidenceExecutor needed to know which signal type to route each plan to.
The migration adds three fields to QueryPlan without removing the existing ones:
@dataclass
class QueryPlan:
    source: str                       # Legacy label — kept for compatibility
    signal_type: str                  # NEW: "logs" | "traces" | "metrics" | "errors"
    canonical_query: (
        CanonicalLogQuery | CanonicalTraceQuery |
        CanonicalMetricsQuery | CanonicalErrorQuery
    )                                 # NEW: structured canonical query
    provider_type: str | None = None  # NEW: "datadog", "sentry", "loki", etc.
    query_string: str = ""            # Legacy DSL string — kept for compatibility
    minutes: int = 30
    tags: dict[str, str] = field(default_factory=dict)
    priority: int = 0
The key decision here: source and query_string are not removed. Existing code that reads plan.source to build Datadog-specific queries still works. New code — the ProviderEvidenceExecutor — reads plan.signal_type and plan.canonical_query instead. Both paths coexist during the migration window.
The planner now populates all fields together. A trace correlation plan looks like:
QueryPlan(
    source="datadog_traces",                            # legacy label, unchanged
    signal_type="traces",                               # new: signal routing
    canonical_query=CanonicalTraceQuery(trace_id=tid),  # new: structured
    provider_type="datadog",                            # new: provider hint
    query_string=f"@trace_id:{tid}",                    # legacy: Datadog DSL preview
    minutes=minutes,
    priority=1,
)
The ProviderEvidenceExecutor reads signal_type to dispatch to the right registry list (registry.trace_providers), and reads canonical_query to call provider.fetch_traces(canonical_query, time_range). The legacy query_string is ignored by the new executor. The old executor — still active when PROVIDER_REGISTRY_ENABLED=false — reads query_string as before.
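A condensed sketch of that dispatch, with names inferred from the description above rather than lifted from the real executor:

class ProviderEvidenceExecutor:
    def execute(self, plan: QueryPlan, time_range: TimeRange) -> list:
        # Route on the new signal_type field; the legacy query_string is ignored
        if plan.signal_type == "traces":
            results = []
            for provider in self._registry.trace_providers:
                results.extend(provider.fetch_traces(plan.canonical_query, time_range))
            return results
        # "logs", "metrics", and "errors" dispatch the same way to their own lists
        ...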
6. The Feature Flag: Running Both Paths
The feature flag is where the migration strategy becomes concrete. A single boolean in app/core/config.py:
PROVIDER_REGISTRY_ENABLED: bool = False
In graph.py, _build_analysis_tools() checks this flag before choosing which tool path to use:
def _build_analysis_tools(pii_service: PiiService | None) -> list:
    settings = get_settings()
    if settings.SIM_DATADOG_ONLY:
        return [get_prefetched_evidence]
    if settings.PROVIDER_REGISTRY_ENABLED:
        registry = bootstrap_registry(BackendConfig.from_env())
        provider_tools = build_tools_from_registry(registry)
        if provider_tools:
            return provider_tools
        # Safety net: if registry builds but produces no tools, fall through
        logger.warning(
            "Provider registry enabled but no tools bootstrapped; falling back to legacy tools"
        )
    # Legacy path — unchanged from V1
    tools = [fetch_metrics_tool, fetch_traces_tool, fetch_logs_trace_first_tool]
    sentry_client = SentryClient(pii_service=pii_service or get_cached_pii_service())
    tools.append(sentry_client.fetch_recent_issues)
    return tools
With PROVIDER_REGISTRY_ENABLED=false (the default), the system runs identically to V1. Not a single behavior changes. With PROVIDER_REGISTRY_ENABLED=true, the registry path runs: BackendConfig.from_env() reads environment variables, bootstrap_registry() constructs providers, build_tools_from_registry() generates the LangGraph tools.
The safety net matters — if the registry is enabled but no providers are configured (no DD_API_KEY, no LOKI_URL), the code falls through to the legacy path and logs a warning rather than returning an empty tool list to the agent.
This is what zero-risk migration looks like in practice. The new code path is deployed to production with the flag off. It is tested in staging with the flag on. When confidence is established, the flag flips in production. If anything is wrong, the flag flips back. No deployments required.
7. Validating the Migration: Equivalence Before Confidence
The feature flag only buys safety if you can actually validate equivalence between the two paths. We used three layers of validation.
Adapter equivalence tests verified that DatadogProvider.fetch_logs() produced the same NormalizedLogEntry fields as the direct DatadogLogFetcher.fetch_logs() call — same timestamps (coerced consistently), same service names, same message content, same trace IDs. These ran against mock API responses to make the assertions deterministic.
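A representative case from that layer, simplified, with fixture data invented for illustration and assuming TimeRange takes start/end datetimes:

def test_adapter_matches_legacy_normalization(monkeypatch):
    """Sketch of an adapter equivalence test against deterministic mock rows."""
    rows = [{
        "timestamp": "2024-01-01T00:00:00Z",
        "service": "search",
        "level": "error",
        "message": "upstream timeout",
        "trace_id": "abc123",
    }]
    provider = DatadogProvider()
    # Pin the wrapped fetcher to deterministic rows and force the live path
    provider._log_fetcher.mock_mode = False
    provider._log_fetcher.credentials_missing = False
    monkeypatch.setattr(provider._log_fetcher, "_search_logs", lambda *a, **kw: rows)

    window = TimeRange(
        start=datetime(2024, 1, 1, tzinfo=timezone.utc),
        end=datetime(2024, 1, 1, 0, 15, tzinfo=timezone.utc),
    )
    entries = provider.fetch_logs(CanonicalLogQuery(service="search"), window)

    assert entries[0].service == "search"
    assert entries[0].level == "ERROR"
    assert entries[0].trace_id == "abc123"
    assert entries[0].timestamp.tzinfo is not None  # always tz-aware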
Formatter equivalence tests verified that format_logs(normalized_entries, minutes=15) produced output textually identical to what the existing code emitted. The LLM's prompt contract had to remain unchanged.
Simulation regression tests ran the full pipeline — make sim-live-all-mock — with PROVIDER_REGISTRY_ENABLED=false and then with PROVIDER_REGISTRY_ENABLED=true, and compared the RCA output structure. Same evidence IDs, same confidence scores, same root cause conclusions.
Only after all three layers passed did we consider the feature flag ready to flip in production.
8. What the Migration Left Unchanged
From the LLM's perspective, nothing changed.
The tool names are the same — fetch_logs, fetch_traces, fetch_metrics, fetch_recent_issues. The tool descriptions are the same. The formatted output structure is the same. The evidence identifiers in the RCA — LOG_1, TRACE_3, SENTRY_2 — are the same.
The agent receives the same text regardless of which path produced it. That is the point. The migration was entirely beneath the surface that the reasoning layer sees.
From the Datadog integration's perspective, nothing changed either. DatadogLogFetcher, DatadogTraceFetcher, DatadogMetricsFetcher, SentryClient — not a line modified. All their internal state, mock modes, credential checks, retry logic, API endpoint construction — all preserved. The adapter wraps them. It does not replace them.
What did change is that the architecture now has a clean boundary where before it had none. The provider package owns the translation and normalization. The evidence plane owns the stable models. The reasoning layer owns the analysis. Each layer has one reason to change, and changes to one layer do not propagate to the others.
What Comes Next
The adapter migration gave us two things: a production-validated abstraction layer, and the confidence to add new backends without touching anything else.
The next post is where that confidence pays off — building Loki, Tempo, and Prometheus providers from scratch, each behind the same interface, each translating canonical queries into their native query language, each normalizing responses into the same evidence models the agent has always worked with.
[Continue reading: Bringing the LGTM Stack to an AI SRE Agent →]