How to Build a Multi-Agent Incident Triage Pipeline in Python
At 3 a.m. the bottleneck is reading logs and forming a hypothesis fast — but you never want an agent taking production actions on its own. This recipe runs a Pipeline that summarizes the error signal, hypothesizes a root cause, and drafts a reversible remediation, then uses Human-in-the-Loop to require an on-call engineer's approval before anything that touches prod.
Patterns used: Pipeline · Human-in-the-Loop
Architecture
flowchart TD
L[Logs + Alert] --> A[Log Analyst\nsummarize signal]
A --> R[Root Cause\nhypothesis]
R --> M[Remediation\nreversible plan]
M --> H[Human-in-the-Loop\non-call approval]
H -->|approved| X[Execute]
H -->|rejected| D[Discard + note]
Implementation
import asyncio
from pyagent_patterns.base import Agent
from pyagent_patterns.orchestration import Pipeline
from pyagent_patterns.advanced import HumanInTheLoop
from pyagent_patterns.advanced.human_in_the_loop import HumanDecision
from pyagent_providers import AnthropicLLM, OpenAILLM
fast_llm = OpenAILLM("gpt-4o-mini")
smart_llm = AnthropicLLM("claude-sonnet-4-20250514")
# ── Triage pipeline: analyze → root cause → remediation ─────────────────────────
triage = Pipeline(stages=[
    Agent("log_analyst", fast_llm,
          system_prompt="Summarize the error signal from these logs: what's failing, since when, blast radius."),
    Agent("root_cause", smart_llm,
          system_prompt="Given the summary, give the single most likely root cause with supporting evidence."),
    Agent("remediation", smart_llm,
          system_prompt=(
              "Propose a safe, reversible remediation with exact steps and a rollback. "
              "Begin with TOUCHES_PROD: yes/no on the first line."
          )),
])
# ── Human gate: any prod-touching remediation needs on-call approval ────────────
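def on_call_gate(output: str, metadata: dict) -> HumanDecision:
    # Explicit non-prod flag on the first line: approve without paging.
    # lstrip() tolerates leading whitespace in the model output.
    if output.lstrip().lower().startswith("touches_prod: no"):
        return HumanDecision(approved=True, modified_output=f"Auto-applied (non-prod):\n{output}")
    # Everything else is treated as prod-touching and blocks on a human.
    approved = _page_on_call_and_wait(output)  # your PagerDuty/Slack approval integration
    return HumanDecision(approved=approved,
                         modified_output=output if approved else "REJECTED by on-call — escalate to IC.")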
triage_with_gate = HumanInTheLoop(
    agent=Agent("runbook_writer", fast_llm,
                system_prompt=(
                    "Format the remediation as a runbook step the on-call engineer can approve. "
                    "Keep the TOUCHES_PROD line unchanged as the first line."
                )),
    review_fn=on_call_gate,
    high_risk_keywords=["delete", "drop", "scale to zero", "failover", "restart prod"],
)
SAMPLE_INCIDENT = (
    "ALERT: checkout 5xx rate 12% for 8 min. Logs: 'connection pool exhausted' on payments-svc; "
    "db connections pinned at 100/100; deploy of payments-svc 14 min ago."
)

async def main():
    triaged = await triage.run(SAMPLE_INCIDENT)
    final = await triage_with_gate.run(triaged.output)
    print(final.output)

asyncio.run(main())
Expected Output
TOUCHES_PROD: yes
Root cause: the 14-min-ago payments-svc deploy lowered the DB pool ceiling; pool now saturated → 5xx.
Remediation:
1. Raise payments-svc DB pool 100 → 250 (config flag, no restart).
2. If still saturated in 3 min, roll back the payments-svc deploy.
Rollback: revert the pool flag; redeploy prior payments-svc image.
[PAGED on-call for approval — prod change]
A non-prod fix (e.g. clearing a stale cache in staging) applies itself; the prod pool change pages a human first — automation for the diagnosis, control for the action.
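Because the auto-apply path depends entirely on the `TOUCHES_PROD` line, it may be worth parsing that flag so that it fails closed. The helper below is an illustrative sketch, not part of the library: `touches_prod` is a hypothetical name, and it returns `False` only when the first line is exactly `TOUCHES_PROD: no`. A missing, malformed, or ambiguous flag is treated as prod-touching.

```python
import re

# The whole first line must be "TOUCHES_PROD: yes" or "TOUCHES_PROD: no"
# (case-insensitive, optional whitespace after the colon).
_FLAG = re.compile(r"touches_prod:\s*(yes|no)\s*", re.IGNORECASE)


def touches_prod(output: str) -> bool:
    """Return False only when the first line explicitly says "no".

    A missing or malformed flag counts as prod-touching, so a model that
    forgets the header still gets routed to a human.
    """
    lines = output.strip().splitlines()
    if not lines:
        return True
    match = _FLAG.fullmatch(lines[0].strip())
    if match is None:
        return True
    return match.group(1).lower() == "yes"
```

A gate function could call `touches_prod(output)` in place of a plain `startswith` check, which would also accept a first line such as `TOUCHES_PROD: not sure` as a "no".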
Customization
Pull live context with a tool agent
Replace the static log analyst with a ReAct agent that queries your logging/metrics APIs — see the Fraud Investigation Assistant.
Auto-open a postmortem doc
from pyagent_patterns.orchestration import Pipeline
postmortem = Agent("postmortem", fast_llm, system_prompt="Draft a blameless postmortem from the triage result.")
triage_plus_doc = Pipeline(stages=[triage, postmortem])
Tighten the high-risk keyword gate
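How `HumanInTheLoop` matches `high_risk_keywords` internally is not shown in this recipe, so the following is only a sketch of one way to tighten the check yourself, for example inside your `review_fn`. It assumes whole-word, case-insensitive matching is what you want. The names `HIGH_RISK_KEYWORDS` and `high_risk_hits` are hypothetical, and the last two keywords are illustrative additions to the list used above.

```python
import re

# Keywords from the recipe, plus two illustrative additions.
HIGH_RISK_KEYWORDS = [
    "delete", "drop", "scale to zero", "failover", "restart prod",
    "truncate", "terminate",
]


def _compile(keywords):
    # Whole-word, case-insensitive match; whitespace inside a phrase may vary.
    parts = [r"\s+".join(re.escape(word) for word in k.split()) for k in keywords]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


_HIGH_RISK = _compile(HIGH_RISK_KEYWORDS)


def high_risk_hits(text: str) -> list[str]:
    """Return distinct high-risk phrases found in text, in order of first appearance."""
    seen: list[str] = []
    for match in _HIGH_RISK.finditer(text):
        # Normalize case and internal whitespace so duplicates collapse.
        hit = " ".join(match.group(0).lower().split())
        if hit not in seen:
            seen.append(hit)
    return seen
```

Whole-word matching avoids false positives such as "dropdown", but it also misses inflections such as "deleting", so list those explicitly if they matter for your runbooks.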
When to Use
| Situation | Fit |
|---|---|
| Fixed analyze → root-cause → remediate stages | ✅ Pipeline |
| Prod actions must be human-approved | ✅ Human-in-the-Loop |
| The agent must query tools mid-investigation | ❌ Use ReAct |
| Several responders should debate the cause | ❌ Use Debate |
Cost Profile
| Stage | Typical model | Avg cost per incident | Monthly cost (200 incidents/mo) |
|---|---|---|---|
| Log analyst | gpt-4o-mini | $0.0005 | $0.10 |
| Root cause + remediation | claude-sonnet | $0.007 | $1.40 |
| Per incident | mix | ~$0.0075 | ~$1.50/mo |
Triage cost is negligible next to the minutes of MTTR it saves; the human gate is where the real safety lives.
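The totals in the table can be reproduced with a few lines of arithmetic, which also makes it easy to plug in your own incident volume. The per-incident figures below are copied from the table. The `runbook_writer` call in the gate is not listed there, so it is left out here as well.

```python
# Per-incident costs in USD, copied from the cost table.
STAGE_COST_USD = {
    "log_analyst (gpt-4o-mini)": 0.0005,
    "root_cause + remediation (claude-sonnet)": 0.007,
}
INCIDENTS_PER_MONTH = 200  # volume assumed by the table

per_incident = sum(STAGE_COST_USD.values())
monthly = per_incident * INCIDENTS_PER_MONTH

for stage, cost in STAGE_COST_USD.items():
    print(f"{stage}: ${cost * INCIDENTS_PER_MONTH:.2f}/mo")
print(f"total: ${per_incident:.4f}/incident, ${monthly:.2f}/mo")
```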
See Also
- Pipeline pattern · Human-in-the-Loop pattern
- Alert Triage — the same shape for security alerts
- Browse all recipes