Cookbook: Code Review Agent

A production-grade multi-agent code review system that automatically guards inputs, performs iterative peer review, scans for security issues, and routes to human reviewers when needed.

Patterns used: CrossReflection, Pipeline, HumanInTheLoop, CompositePattern, GuardrailChain, RouterMiddleware


Architecture

sequenceDiagram
    participant D as Developer
    participant G as Input Guardrails
    participant C as Code Agent
    participant R as Review Agent
    participant S as Security Agent
    participant H as Human (if needed)

    D->>G: Submit code
    G->>G: Length check, PII scan
    G->>C: Sanitised code
    C->>R: Initial review
    R->>C: Feedback (up to 3 rounds)
    C->>R: Revised code
    R-->>S: APPROVED
    S->>S: Vulnerability scan
    S-->>D: Final report
    Note over S,H: If security score < 8, escalate to human

Implementation

import asyncio
import re

from pyagent_patterns.base import Agent
from pyagent_patterns.resolution import CrossReflection
from pyagent_patterns.orchestration import Pipeline
from pyagent_patterns.advanced import HumanInTheLoop
from pyagent_patterns.composite import CompositePattern
from pyagent_patterns.guardrails import GuardrailChain, LengthGuard, PIIGuard, ContentGuard
from pyagent_router.middleware import RouterMiddleware
from pyagent_providers import AnthropicLLM, OpenAILLM

# ── LLMs ──────────────────────────────────────────────────────────────────────
fast_llm      = OpenAILLM("gpt-4o-mini")
smart_llm     = AnthropicLLM("claude-sonnet-4-20250514")
security_llm  = OpenAILLM("gpt-4o")

model_registry = {
    "gpt-4o-mini":              fast_llm,
    "gpt-4o":                   security_llm,
    "claude-sonnet-4-20250514": smart_llm,
}
router = RouterMiddleware(model_registry=model_registry)

# ── Guardrails ─────────────────────────────────────────────────────────────────
input_guard = GuardrailChain([
    LengthGuard(max_chars=50_000, truncate=False),   # hard reject huge pastes
    PIIGuard(redact=True),                            # redact emails/tokens in comments
    ContentGuard(deny_patterns=[                      # block embedded secrets
        re.compile(r"sk-[A-Za-z0-9]{20,}"),           # OpenAI-style API keys
        re.compile(r"ghp_[A-Za-z0-9]{36}"),           # GitHub personal access tokens
    ]),
])

# ── Stage 1: iterative code review via cross-reflection ───────────────────────
code_review = CrossReflection(
    generator=router.wrap(
        Agent("author", smart_llm,
              system_prompt=(
                  "You are a senior Python engineer. Review and improve the submitted code. "
                  "Focus on: correctness, edge cases, type hints, docstrings, and performance."
              )),
    ),
    reviewer=router.wrap(
        Agent("critic", smart_llm,
              system_prompt=(
                  "You are a principal engineer doing code review. "
                  "Point out specific issues with line references. "
                  "Reply APPROVED when the code meets production standards."
              )),
    ),
    max_rounds=3,
)

# ── Stage 2: security scan ────────────────────────────────────────────────────
security_pipeline = Pipeline(stages=[
    Agent("security", security_llm,
          system_prompt=(
              "You are a security engineer. Scan the code for: SQL injection, "
              "XSS, insecure deserialization, hardcoded secrets, path traversal, "
              "and OWASP Top 10 issues. Score security 1-10. List every finding."
          )),
])

# ── Stage 3: human escalation for low security scores ────────────────────────
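def needs_human_review(result) -> bool:
    """Quality check for the security scan.

    Returns True when the scan passes (score >= 8) and False when the
    result should be escalated to a human (score < 8).
    """
    import re
    match = re.search(r"[Ss]core[:\s]+(\d+)", result.output)
    if match:
        return int(match.group(1)) >= 8
    return True   # no score found: treated as a pass (fails open)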

human_escalation = HumanInTheLoop(
    agent=Agent("prep", fast_llm,
                system_prompt="Summarise the security issues for the human reviewer."),
    review_fn=lambda output, meta: _queue_human_review(output),
    high_risk_keywords=["critical", "injection", "hardcoded"],
)

full_pipeline = CompositePattern(
    patterns=[security_pipeline, human_escalation],
    quality_check=needs_human_review,
)

# ── Main review function ───────────────────────────────────────────────────────
async def review_code(code: str) -> dict:
    # 1. Guardrail check
    check = input_guard.check(code)
    if not check.passed:
        return {"error": check.message, "blocked": True}
    safe_code = check.sanitized_content or code

    # 2. Iterative code review
    review_result = await code_review.run(
        f"Review and improve this code:\n\n```python\n{safe_code}\n```"
    )

    # 3. Security scan + optional human escalation
    security_result = await full_pipeline.run(
        f"Security scan:\n\n```python\n{review_result.output}\n```"
    )

    return {
        "improved_code": review_result.output,
        "security_report": security_result.output,
        "review_rounds": review_result.metadata.get("rounds", 0),
        "escalated_to_human": security_result.metadata.get("escalation_level", 0) > 0,
    }


def _queue_human_review(summary: str):
    # In production: push to a ticket system, Slack, or review queue
    print(f"[HUMAN REVIEW QUEUED]\n{summary}")
    from pyagent_patterns.advanced.human_in_the_loop import HumanDecision
    return HumanDecision(approved=True, modified_output=summary)


# ── Run it ─────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    code = '''
def get_user(user_id):
    query = "SELECT * FROM users WHERE id = " + user_id
    return db.execute(query)
    '''

    result = asyncio.run(review_code(code))
    print("=== Improved Code ===")
    print(result["improved_code"])
    print("\n=== Security Report ===")
    print(result["security_report"])
    print(f"\nReview rounds: {result['review_rounds']}")
    print(f"Escalated to human: {result['escalated_to_human']}")
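
The secret-blocking step in the guardrail chain can be tried in isolation. The sketch below reuses the two regular expressions passed to ContentGuard and applies them with the standard `re` module only. It does not call pyagent_patterns, and `find_secrets` is a hypothetical helper name, so treat it as an illustration of what the patterns match rather than as the library's implementation.

```python
import re

# The same two patterns given to ContentGuard in the recipe.
DENY_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access tokens
]


def find_secrets(code: str) -> list:
    """Return every substring of `code` matched by a deny pattern (hypothetical helper)."""
    hits = []
    for pattern in DENY_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(code))
    return hits


# Example inputs: one reads the key from the environment, one hardcodes it.
clean_snippet = 'API_KEY = os.environ["API_KEY"]'
leaky_snippet = 'API_KEY = "sk-' + "a" * 24 + '"'

print(find_secrets(clean_snippet))
print(find_secrets(leaky_snippet))
```

A non-empty result is the condition under which a submission would be rejected before any model is called. The patterns are prefix-based, so an identifier such as `task-` followed by 20 or more alphanumeric characters would also match; tighten them if false positives become a problem.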

Expected Output

Running the SQL injection example above:

=== Improved Code ===
def get_user(user_id: int) -> dict | None:
    """Fetch a user by ID.

    Args:
        user_id: The integer user ID to look up.

    Returns:
        User dict or None if not found.
    """
    query = "SELECT * FROM users WHERE id = ?"
    return db.execute(query, (user_id,))

=== Security Report ===
Security Score: 3/10

CRITICAL — SQL Injection (original code):
  Line 2: String concatenation used in SQL query.
  Fix: Use parameterised queries (? placeholder). Applied in improved code.

MEDIUM — Missing input validation:
  user_id is typed as `int` but callers may pass strings from HTTP params.
  Fix: Add isinstance(user_id, int) or use a Pydantic model at the boundary.

Review rounds: 2
Escalated to human: True   ← score was 3, below threshold of 8
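
The two fixes named in the report, a parameterised query and a type check at the boundary, can be verified with a small self-contained sketch. It uses the standard-library `sqlite3` module with an in-memory database. The table layout, the sample rows, and the `get_user_unsafe` helper are assumptions made for the demonstration, since the original example does not define `db`.

```python
import sqlite3
from typing import Optional


def get_user_unsafe(conn: sqlite3.Connection, user_id: str) -> list:
    """Vulnerable version: the input is spliced into the SQL text itself."""
    query = "SELECT id, name FROM users WHERE id = " + user_id
    return conn.execute(query).fetchall()


def get_user(conn: sqlite3.Connection, user_id: int) -> Optional[tuple]:
    """Fetch a user row by ID with a parameterised query and a type check."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(user_id, bool) or not isinstance(user_id, int):
        raise TypeError("user_id must be an int")
    cursor = conn.execute("SELECT id, name FROM users WHERE id = ?", (user_id,))
    return cursor.fetchone()


# Hypothetical schema and data, for the demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "ada"), (2, "grace")],
)

payload = "1 OR 1=1"
leaked_rows = get_user_unsafe(conn, payload)   # the payload becomes part of the SQL
bound_rows = conn.execute(
    "SELECT id, name FROM users WHERE id = ?", (payload,)
).fetchall()                                   # the payload is bound as a plain value
```

With concatenation the payload rewrites the WHERE clause and every row comes back. With the `?` placeholder the same string is compared as a value and matches nothing, and `get_user` rejects it earlier with a `TypeError`.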

Customisation

Change the review focus

# Security-focused review
Agent("author", smart_llm,
      system_prompt="Review for security vulnerabilities only. OWASP Top 10.")

# Performance-focused review
Agent("author", smart_llm,
      system_prompt="Review for performance. Focus on algorithmic complexity, DB queries, and caching.")

# Style-focused review
Agent("author", smart_llm,
      system_prompt="Review for PEP 8, type hints, and Google docstring format.")

Adjust escalation threshold

# Always escalate to a human (security-critical systems)
def always_human(result) -> bool:
    return False   # never passes the quality check → always escalates

# Only escalate on critical findings
def only_critical(result) -> bool:
    return "CRITICAL" not in result.output.upper()
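
If the numeric threshold itself needs to vary between projects, one option is a small factory that builds the quality-check function. The sketch below is a minimal version under stated assumptions: `make_score_check` and `FakeResult` are hypothetical names, the result object is assumed to expose only `.output`, and the score is parsed with the same regular expression the recipe uses. Unlike the recipe, it fails closed by default when no score is found.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeResult:
    """Hypothetical stand-in for a pattern result; only `.output` is used."""
    output: str


_SCORE_RE = re.compile(r"[Ss]core[:\s]+(\d+)")


def make_score_check(min_score: int, pass_if_missing: bool = False) -> Callable:
    """Build a quality check that passes when the parsed score is >= min_score."""
    def check(result) -> bool:
        match = _SCORE_RE.search(result.output)
        if match is None:
            # No score in the report: pass or escalate, as configured.
            return pass_if_missing
        return int(match.group(1)) >= min_score
    return check


strict = make_score_check(9)                          # escalate anything below 9
lenient = make_score_check(6, pass_if_missing=True)   # escalate only below 6

report = FakeResult("Security Score: 8/10\nNo critical findings.")
print(strict(report), lenient(report))
```

Either function could then be supplied as `quality_check=` in place of the fixed-threshold version, assuming the result object passed by the library exposes `.output` as it does in the recipe.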

Add language support

# Guess the language from simple keyword heuristics and build a matching prompt
def language_aware_prompt(code: str) -> str:
    if "def " in code or "import " in code:
        return f"Review this Python code:\n\n```python\n{code}\n```"
    elif "function " in code or "const " in code:
        return f"Review this JavaScript/TypeScript code:\n\n```js\n{code}\n```"
    return f"Review this code:\n\n```\n{code}\n```"
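
Keyword heuristics can misfire, for example on JavaScript files that contain `import ` statements. When a filename is available, the suffix is usually a more reliable signal. The following sketch assumes the caller can supply the filename. `LANGUAGES` and `prompt_for_file` are hypothetical names, and the suffix table is illustrative rather than complete.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to (display name, code-fence tag).
LANGUAGES = {
    ".py": ("Python", "python"),
    ".js": ("JavaScript", "js"),
    ".ts": ("TypeScript", "ts"),
    ".go": ("Go", "go"),
}

FENCE = "`" * 3  # built at runtime so this example needs no nested fence


def prompt_for_file(filename: str, code: str) -> str:
    """Build a review prompt whose wording and fence tag follow the file suffix."""
    name, tag = LANGUAGES.get(Path(filename).suffix.lower(), ("", ""))
    label = f"{name} code" if name else "code"
    return f"Review this {label}:\n\n{FENCE}{tag}\n{code}\n{FENCE}"


print(prompt_for_file("app.ts", "const x: number = 1"))
```

Unknown suffixes fall back to the generic prompt with an untagged fence, which mirrors the final `return` of the keyword-based version.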

Cost Profile

Code snippet       Review rounds   Models used              Approx cost
50-line function   1–2             gpt-4o-mini × 2          $0.001
200-line class     2–3             claude-sonnet × 3        $0.008
500-line module    3 + security    claude-sonnet + gpt-4o   $0.025
Human escalation   + human time                             $0.002 + human

See Also