ai-agents, langgraph, sre, system-architecture, observability, python, software-engineering

Building the Proactive Nerve System: The Agentic Reasoning Engine (Part 2)

Suvro Banerjee
February 11, 2026
7 min read

The Shift: From Text-In/Text-Out to Agentic Reasoning

In Part 1, we built the Trust Gate---a secure, cost-optimized ingestion pipeline. But ingestion is only the setup. The real challenge is autonomous reasoning.

Traditional LLM implementations are linear: you send a prompt and get a response. But senior SREs don't work linearly. They see an error, formulate a hypothesis, query a tool, find a new clue, and pivot.

This post documents how we built the Agentic Reasoning Engine---a stateful, cyclic diagnostic system that autonomously uses tools, generates structured output, and maintains conversational memory.


1. The Architecture: LangGraph State Machine

To move beyond "chatbot" responses, we needed a system where the LLM could decide its own path: "Do I have enough evidence, or should I query Datadog again?" We chose LangGraph to orchestrate this cyclic diagnostic loop.

The Pattern: The Cyclic Diagnostic Loop

graph TD
    START((Start)) --> Scrubber[PII Scrubber Node]
    Scrubber --> Analyst[Analyst Node: GPT-5.1]
    Analyst --> ToolCondition{Tool Calls?}
    ToolCondition -- Yes --> Tools[Tool Node: DD/Sentry]
    Tools --> Analyst
    ToolCondition -- No --> FAQ[FAQ Generator: GPT-5-nano]
    FAQ --> END((End))
    style Analyst fill:#f96,stroke:#333,stroke-width:2px
    style Tools fill:#bbf,stroke:#333,stroke-width:2px
    style FAQ fill:#dfd,stroke:#333,stroke-width:2px

Architectural Insight: The Analyst node is the only "intelligent" decision-maker. It decides when to investigate further and when to finalize. This allows for "multi-hop" investigations where the agent might check Datadog, see a specific Request ID, and then pivot to Sentry to find that exact trace---all in a single automated session.

Let's look at a sample implementation using LangGraph.

From app/agents/graph.py

from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode, tools_condition

# Build the graph
workflow = StateGraph(AgentState)

# Add nodes
workflow.add_node("scrubber", scrub_pii_node)
workflow.add_node("analyst", analyze_node)
workflow.add_node("tools", ToolNode(tools=[dd_fetcher.fetch_logs, sentry_client.fetch_recent_issues]))
workflow.add_node("faq_generator", generate_faq_node)

# Wire edges
workflow.set_entry_point("scrubber")
workflow.add_edge("scrubber", "analyst")

# CRITICAL: Conditional edge allows looping back to tools
workflow.add_conditional_edges(
    "analyst",
    tools_condition,  # LangGraph built-in: checks response.tool_calls
    {
        "tools": "tools",           # If tool calls exist - go to tools
        END: "faq_generator",       # If no tool calls - final answer
    }
)

workflow.add_edge("tools", "analyst")  # After tools, loop back to analyst
workflow.add_edge("faq_generator", END)

graph = workflow.compile()
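The graph above references an AgentState that isn't shown here. As a minimal sketch (field names are illustrative, not our exact schema), LangGraph's convention is a TypedDict where annotated reducers control how node outputs merge into state; here `operator.add` stands in for LangGraph's message-appending reducer:

```python
import operator
from typing import Annotated, Any, TypedDict

class AgentState(TypedDict, total=False):
    # Annotated with a reducer: each node's returned messages are
    # appended to the running list instead of replacing it, which is
    # what lets the analyst accumulate tool results across loop turns.
    messages: Annotated[list[Any], operator.add]
    structured_analysis: Any  # Final structured object, set by the analyst node
    analysis: str             # Markdown fallback for legacy paths

# Nodes return partial updates; the reducers merge them into state.
initial: AgentState = {
    "messages": [{"role": "user", "content": "Error at 14:05"}]
}
```

Any node that returns `{"messages": [response]}` is therefore appending to the transcript, not overwriting it.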

Data Flow:

User Incident
     |
[scrubber] PII redaction (Presidio)
     |
[analyst] GPT-5.1 sees: "Error at 14:05. Need logs."
     |
     +-- tool_calls present? --> [tools] Execute fetch_logs(...)
     |                                |
     |                        Append results to messages
     |                                |
     +-------------------------------< Loop back to [analyst]
                                     |
                          "Now I have logs. Let me check Sentry."
                                     |
                          +-- tool_calls present? --> [tools] Execute fetch_recent_issues(...)
                          |                                |
                          +-------------------------------< Loop back
                                     |
                          "I have all evidence. Final answer:"
                                     |
                          tool_calls = [] --> [faq_generator] --> END

Why This Works: The LLM autonomously decides when it has enough evidence. We've seen cases where it makes 3-4 tool calls before finalizing (Datadog then Sentry then Datadog again with refined query).
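Stripped of LangGraph specifics, the exit condition driving that loop is simple. A plain-Python sketch of the routing decision (names are illustrative; LangGraph's built-in `tools_condition` does the real work):

```python
END = "__end__"  # sentinel, analogous to langgraph.graph.END

def route_after_analyst(last_message: dict) -> str:
    """Mimic tools_condition: keep looping while the model requests tools."""
    if last_message.get("tool_calls"):
        return "tools"  # more evidence needed -> execute the requested tools
    return END          # no tool calls -> hand off to the FAQ generator
```

A 3-4 hop investigation is just this check returning `"tools"` several times in a row before finally returning the end sentinel.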


2. Technical Challenge: The Tools vs. Structure Paradox

The Problem

When building production agents, you face a trade-off: You want Agency (the ability to use tools) and you want Structure (type-safe JSON for your UI).

The Paradox: If you force an LLM into a JSON schema from the first token (using native structured output), the model is under "format pressure." In our testing, this caused the model to "rush" to a final answer, often inventing log excerpts or skipping tool calls entirely to satisfy the schema immediately.

The Solution: Tool Loop First, JSON Parse Last

We decoupled reasoning from formatting. The model uses tools freely in plain text. Only when it decides the investigation is complete do we perform a Manual JSON Validation using Pydantic.

# The "Anti-Fragile" Parsing Pattern (inside the analyst node)

# Step 1: Reasoning phase (free-form tool use)
response = model_with_tools.invoke(messages)

if response.tool_calls:
    return {**state, "messages": [response]}  # Continue the loop

# Step 2: Formatting phase (final answer)
try:
    # Validate the final response against our Pydantic schema
    structured_output = StructuredAnalysis.model_validate_json(response.content)
    return {
        "structured_analysis": structured_output,
        "analysis": structured_output.raw_analysis,  # Markdown for legacy paths
    }
except Exception as parse_error:
    # Graceful fallback: if JSON validation fails, log the error and
    # preserve the raw text. The human still gets an answer, and the
    # system stays alive.
    logger.warning("Structured output parse failed: %s", parse_error)
    return fallback_to_plain_text(response.content)

Impact: Tool usage accuracy increased from ~40% to ~95%. By removing the "format pressure," the LLM focuses on the investigation first and the reporting second.


3. The Design: Causal Analysis Framework

An LLM is a text generator; an SRE is a causal reasoner. To bridge this gap, we implemented a Causal Analysis Framework in the system prompt. Instead of asking "What's wrong?", we force the LLM through a specific reasoning chain:

  1. Identify Failing Component: Isolate the service/endpoint.
  2. Map Dependencies: What does this component rely on?
  3. Correlate Timing: Match the exact millisecond of a log spike with a stack trace.
  4. Build the Causal Chain: "A caused B, which triggered symptom C."
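The four steps above live in the system prompt itself. A condensed sketch of how such a prompt fragment might read (our production prompt is longer and more specific):

```python
# Hypothetical condensed version of the causal-analysis system prompt.
CAUSAL_FRAMEWORK_PROMPT = """\
When diagnosing an incident, reason in this exact order:
1. Identify the failing component: isolate the service or endpoint.
2. Map its dependencies: what does this component rely on?
3. Correlate timing: match the exact moment of a log spike with a stack trace.
4. Build the causal chain: state "A caused B, which triggered symptom C."
Never report a symptom as a root cause.
"""
```

Forcing the ordering matters: without step 3, models happily connect a cause and a symptom that are hours apart.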

Engineering Trust: Confidence Scoring

To prevent hallucinations, we force the agent to provide a Confidence Score (HIGH/MEDIUM/LOW) for every theory, backed by specific tool evidence. If it can't find a direct link, it is instructed to mark the theory as LOW---building trust by admitting uncertainty.
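The "no evidence means LOW" rule can also be enforced in code rather than trusted to the prompt. A self-contained sketch using stdlib dataclasses (our actual schema is a Pydantic model; names here are illustrative):

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(str, Enum):
    HIGH = "HIGH"      # direct tool evidence links cause to symptom
    MEDIUM = "MEDIUM"  # strong correlation, but no direct trace
    LOW = "LOW"        # plausible theory with no supporting evidence

@dataclass
class Theory:
    summary: str
    confidence: Confidence
    evidence: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Enforce the trust rule: a theory with no cited tool evidence
        # is downgraded to LOW, regardless of what the model claimed.
        if not self.evidence:
            self.confidence = Confidence.LOW

# Even if the model asserts HIGH, an evidence-free theory is demoted.
t = Theory("Redis connection pool exhaustion", Confidence.HIGH, evidence=[])
```

Validating the claim against the evidence list, rather than taking the model's self-reported confidence at face value, is what actually builds trust.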


4. Optimization: Model Tiering (Cost vs. Intelligence)

In a startup, "Flagship-only" architectures are a liability. We matched Reasoning Intensity to the model tier.

  • The Detective (GPT-5.1): High-reasoning, high-cost. Handles the Analyst node, tool use, and causal logic.
  • The Subordinate (GPT-5-nano): Low-reasoning, near-zero cost. Handles the FAQ_Generator node.

graph LR
    Signal[Universal Incident Signal] --> AnalystNode[Analyst Node]
    AnalystNode -- High Reasoning --> GPT51[GPT-5.1 Reasoning Engine]
    GPT51 -- "Tools (DD/Sentry)" --> AnalystNode
    AnalystNode -- Final Analysis --> FAQNode[FAQ Generator Node]
    FAQNode -- Low Reasoning --> GPT5Nano[GPT-5-nano Summarizer]
    GPT5Nano --> Result[Structured Analysis + FAQs]
    style GPT51 fill:#f9f,stroke:#333,stroke-width:2px
    style GPT5Nano fill:#ccf,stroke:#333,stroke-width:2px

The Result: By offloading summarization and FAQ generation to a "nano" model, we achieved a 45% reduction in total per-incident cost while keeping our highest-reasoning model on the critical diagnostic path.
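In graph terms, tiering is just a per-node lookup when each node binds its model. A minimal sketch (model identifiers are the ones named above; the table shape is illustrative):

```python
# Route each graph node to the cheapest model that can do its job.
MODEL_TIERS = {
    "analyst":       {"model": "gpt-5.1",    "reasoning": "high"},
    "faq_generator": {"model": "gpt-5-nano", "reasoning": "low"},
}

def model_for_node(node: str) -> str:
    """Look up which model a graph node should bind at build time."""
    return MODEL_TIERS[node]["model"]
```

Because the routing lives in the graph definition rather than scattered across call sites, swapping a tier later is a one-line change.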


5. The Memory Challenge: Stateful Conversations in a Stateless World

Webhooks are ephemeral. Every Slack follow-up is a fresh request with zero memory of the initial analysis. We solved this via On-Demand Context Reconstitution.

The Implementation

sequenceDiagram
    participant S as Slack Webhook
    participant W as Stream Worker
    participant API as Slack API
    participant A as Agent Service
    participant B as AI Brain (Graph)
    S->>W: POST Follow-up Message
    W->>API: GET conversations.replies (thread_ts)
    API-->>W: Thread History (JSON)
    W->>W: Map bot_id to "Assistant" Role
    W->>A: analyze_ticket(current_msg, history)
    A->>B: Invoke Graph (with History)
    B-->>W: Conversational Analysis
    W->>S: Reply to Thread

Programmatic Role Mapping

We use a has_history flag to pivot the system instructions.

  • Initial Mode: "Perform a deep, multi-tool investigation."
  • Conversational Mode: "You are in a thread. Answer the specific question concisely (2-5 sentences) using the previous context."

This prevents "Analysis Fatigue"---ensuring the bot provides sharp, direct answers to follow-ups instead of repeating a massive investigation report every time someone asks "When did it start?".
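Both pieces, role mapping and the instruction pivot, fit in a few lines. A sketch under the assumption that thread messages carry Slack's `bot_id` and `text` fields (the helper names are illustrative):

```python
def reconstitute_history(thread_messages: list[dict]) -> list[dict]:
    """Map raw Slack thread messages to chat roles: replies posted by a
    bot become 'assistant' turns, everything else becomes 'user'."""
    return [
        {
            "role": "assistant" if msg.get("bot_id") else "user",
            "content": msg.get("text", ""),
        }
        for msg in thread_messages
    ]

def system_instructions(has_history: bool) -> str:
    """Pivot the prompt: deep investigation first, concise answers after."""
    if has_history:
        return ("You are in a thread. Answer the specific question "
                "concisely (2-5 sentences) using the previous context.")
    return "Perform a deep, multi-tool investigation."

thread = [
    {"user": "U123", "text": "API is throwing 502s"},
    {"bot_id": "B456", "text": "Root cause: upstream timeout."},
]
history = reconstitute_history(thread)
```

The worker runs `reconstitute_history` on every follow-up before invoking the graph, so "memory" costs one Slack API call per turn.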


Key Takeaways for SRE Teams

  1. Graphs > Chains: For troubleshooting, the ability to "loop back" with new evidence is non-negotiable.
  2. Decouple Reasoning from Formatting: Let the LLM reason in plain text; enforce structure only at the finish line.
  3. Tier Your Models: Don't pay "Flagship prices" for summarization. Use a graph to route tasks to the most cost-effective model.
  4. Memory is Reconstituted: In a stateless world, "Memory" is just a proactive GET request before the POST processing starts.

What's Next: Part 3

We have the perimeter and we have the brain. In Part 3, we’ll look at The Feedback Loop—building the UUID-based secure web interface and htmx-powered chat that brings this analysis into the hands of the human engineer.


An Invitation

With AI-driven "vibe coding," teams are shipping faster than ever, but maintenance and SRE aren't keeping pace. We're already seeing this in alert fatigue, messy incident triage, and slower RCA.

I'm Suvro, founder of Flipturn. I'm rethinking SRE for this new reality and would love to learn from your experience. If you're open, I'd appreciate 30 minutes to understand the challenges you're facing and how you wish they were solved. I'm committed to partnering closely with you to build this right.

Flipturn is rethinking SRE for the AI era. Follow our journey.

Want to eliminate incident firefighting?

Join teams using Flipturn for autonomous root cause analysis.
