Evaluator-Optimizer Pattern

A generator produces output, an evaluator scores it against explicit criteria, and the loop repeats until the score meets the threshold or the maximum number of rounds is reached.

Best for: ad copy, content quality gates, structured output conformance, scored deliverables.
LLM calls: 2 per round (generate + score). The loop stops early once the score reaches the threshold.
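
The control flow described above can be sketched in plain Python. This is a minimal illustration, not PyAgent's implementation: `evaluate_optimize`, `LoopResult`, and the two stub callables are hypothetical names, and returning the last draft when the rounds run out is an assumption, since the source does not say which draft is returned in that case.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    output: str
    final_score: int
    rounds: int


def evaluate_optimize(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Tuple[int, str]],
    pass_threshold: int,
    max_rounds: int,
) -> LoopResult:
    """Generate, score, and retry with feedback until the score passes."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        draft = generate(task, feedback)      # LLM call 1 of the round
        score, feedback = evaluate(draft)     # LLM call 2 of the round
        if score >= pass_threshold:           # early stop once the threshold is met
            break
    # Assumption: the last draft is returned even if it never passed.
    return LoopResult(output=draft, final_score=score, rounds=round_no)


# Stubs standing in for the two LLM-backed agents (hypothetical data).
_drafts = iter(["Buy now! Great deals!", "Save 30% on premium headphones, today only"])
_scores = {
    "Buy now! Great deals!": (3, "Too generic, no value proposition"),
    "Save 30% on premium headphones, today only": (8, "PASS"),
}


def stub_generate(task: str, feedback: Optional[str]) -> str:
    return next(_drafts)


def stub_evaluate(draft: str) -> Tuple[int, str]:
    return _scores[draft]


result = evaluate_optimize(
    "Write ad copy", stub_generate, stub_evaluate, pass_threshold=8, max_rounds=4
)
print(result)
```

With these stubs the loop stops after the second round, because the second draft scores 8 against a threshold of 8.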

Sequence Diagram

sequenceDiagram
    participant U as User
    participant G as Generator
    participant E as Evaluator

    U->>G: "Write ad copy"
    G-->>E: "Buy now! Great deals!"
    E-->>G: Score: 3/10 — too generic, no value prop
    G-->>E: "Save 30% on premium headphones — today only"
    E-->>G: Score: 8/10 — PASS
    G-->>U: Final: "Save 30% on premium headphones — today only"

Use Case 1 — Ad Copy Optimization (OpenAI)

import asyncio
from pyagent_patterns.base import Agent
from pyagent_patterns.resolution import EvaluatorOptimizer
from pyagent_providers import OpenAILLM

pattern = EvaluatorOptimizer(
    generator=Agent(
        "copywriter",
        OpenAILLM("gpt-4o"),
        system_prompt="Write conversion-focused ad copy. "
                      "Include: a specific value proposition, a number or statistic, "
                      "a clear benefit, and a call to action. Keep under 30 words.",
    ),
    evaluator=Agent(
        "critic",
        OpenAILLM("gpt-4o"),
        system_prompt="Score ad copy on a scale of 1-10 using these criteria: "
                      "1) Specificity (does it have concrete numbers/details?) "
                      "2) Value clarity (is the benefit immediately obvious?) "
                      "3) Urgency (does it motivate action now?) "
                      "4) Credibility (does it feel trustworthy, not salesy?) "
                      "Respond exactly as: SCORE: X\nFEEDBACK: <specific improvements needed>",
    ),
    pass_threshold=8,
    max_rounds=4,
)

result = asyncio.run(pattern.run(
    "Write ad copy for AirPods Pro 2: targeting commuters, "
    "key selling points: active noise cancellation, 30hr battery, USB-C charging"
))
print(result.output)
print(f"Final score: {result.metadata['final_score']}, Rounds: {result.metadata['rounds']}")
print(f"Cost: ${result.cost_estimate:.4f}")
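
The evaluator prompt above asks for a reply in the form `SCORE: X` followed by `FEEDBACK: ...`. How PyAgent parses that reply is not shown in the source. The sketch below is one possible way to do it, using a hypothetical helper named `parse_evaluation` that rejects replies that do not follow the format.

```python
import re
from typing import Tuple

# Matches "SCORE: <number>" on one line and "FEEDBACK: <text>" on the next.
_REPLY_PATTERN = re.compile(
    r"SCORE:\s*(\d+(?:\.\d+)?)\s*\n\s*FEEDBACK:\s*(.*)", re.DOTALL
)


def parse_evaluation(reply: str) -> Tuple[float, str]:
    """Extract the numeric score and the feedback text from an evaluator reply."""
    match = _REPLY_PATTERN.search(reply)
    if match is None:
        raise ValueError(f"Reply does not match the expected format: {reply!r}")
    return float(match.group(1)), match.group(2).strip()


score, feedback = parse_evaluation(
    "SCORE: 3\nFEEDBACK: Too generic, no value proposition."
)
print(score, feedback)
```

Raising on a malformed reply, rather than defaulting to a score of 0, keeps a formatting failure from being mistaken for a low-quality draft.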

Use Case 2 — Technical Documentation Quality (Anthropic)

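# Imports repeated from Use Case 1 so this example runs on its own.
import asyncio

from pyagent_patterns.base import Agent
from pyagent_patterns.resolution import EvaluatorOptimizer
from pyagent_providers import AnthropicLLM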

doc_improver = EvaluatorOptimizer(
    generator=Agent(
        "doc_writer",
        AnthropicLLM("claude-sonnet-4-20250514"),
        system_prompt="Write clear, comprehensive technical documentation. "
                      "Each function doc must have: a description, all parameters with types, "
                      "return value, exceptions raised, and at least one usage example.",
    ),
    evaluator=Agent(
        "doc_reviewer",
        AnthropicLLM("claude-sonnet-4-20250514"),
        system_prompt="Evaluate documentation completeness and clarity: "
                      "1) Does every function have a description? (0-2 pts) "
                      "2) Are all parameters documented with types? (0-3 pts) "
                      "3) Is the return value documented? (0-2 pts) "
                      "4) Are exceptions documented? (0-1 pt) "
                      "5) Is there at least one working code example? (0-2 pts) "
                      "Respond: SCORE: X\nFEEDBACK: <what is missing or unclear>",
    ),
    pass_threshold=9,
    max_rounds=3,
)

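# Read the module source with a context manager so the file handle is closed.
with open("api_module.py", encoding="utf-8") as source_file:
    source_code = source_file.read()

result = asyncio.run(doc_improver.run(source_code))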
print(f"Documentation score: {result.metadata['final_score']}/10")
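
The reviewer rubric above awards points per criterion, and the threshold decides how much can be missing before a draft fails. The short check below works through that arithmetic. It assumes the evaluator simply sums the points as the prompt describes; the names `DOC_RUBRIC` and `tolerable` are illustrative only.

```python
# Maximum points per criterion, copied from the doc_reviewer prompt.
DOC_RUBRIC = {
    "description": 2,
    "parameters with types": 3,
    "return value": 2,
    "exceptions": 1,
    "working example": 2,
}
PASS_THRESHOLD = 9

max_score = sum(DOC_RUBRIC.values())

# Criteria that can score zero while the draft still reaches the threshold.
tolerable = [
    name for name, points in DOC_RUBRIC.items()
    if max_score - points >= PASS_THRESHOLD
]

print(max_score)   # 10
print(tolerable)   # ['exceptions']
```

Under that assumption, a threshold of 9 lets a draft pass with the exceptions section missing entirely, but not with any other criterion at zero. Raising the threshold to 10 would close that gap.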

Use Case 3 — LLM Response Safety Scoring (LiteLLM)

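# Imports repeated from Use Case 1 so this example runs on its own.
import asyncio

from pyagent_patterns.base import Agent
from pyagent_patterns.resolution import EvaluatorOptimizer
from pyagent_providers import LiteLLM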

safety_loop = EvaluatorOptimizer(
    generator=Agent(
        "responder",
        LiteLLM("gpt-4o-mini"),
        system_prompt="Answer questions helpfully and accurately. "
                      "When discussing sensitive topics, include appropriate context and caveats.",
    ),
    evaluator=Agent(
        "safety_scorer",
        LiteLLM("anthropic/claude-sonnet-4-20250514"),
        system_prompt="Score this response for safety and helpfulness: "
                      "1) Does it answer the question? (0-3 pts) "
                      "2) Does it include appropriate caveats for sensitive content? (0-3 pts) "
                      "3) Does it avoid harmful, misleading, or dangerous content? (0-4 pts) "
                      "Respond: SCORE: X\nFEEDBACK: <specific issues>",
    ),
    pass_threshold=9,
    max_rounds=3,
)

result = asyncio.run(safety_loop.run(
    "What are the risks of combining prescription medications with alcohol?"
))
print(f"Safety score: {result.metadata['final_score']}/10")
print(result.output)

OTel Trace Output

Trace: pyagent.pattern.evaluator_optimizer (5.3s, $0.014)
├── Round 1
│   ├── pyagent.agent.copywriter — generate (1.6s, gpt-4o) → "Buy now!"
│   └── pyagent.agent.critic — evaluate (1.1s, gpt-4o) → SCORE: 3
├── Round 2
│   ├── pyagent.agent.copywriter — generate (1.4s, gpt-4o) → "Save 30%..."
│   └── pyagent.agent.critic — evaluate (0.9s, gpt-4o) → SCORE: 8 ✓ PASS
└── early_stop: true (threshold 8 met in round 2)
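
The durations in the trace above can be checked by hand. The sketch below sums the four agent spans and compares the result with the 5.3s reported for the whole pattern. Reading the remainder as time spent outside the LLM calls is an interpretation, not something the source states.

```python
import math

# (agent, round, duration in seconds), copied from the trace above.
spans = [
    ("copywriter", 1, 1.6),
    ("critic", 1, 1.1),
    ("copywriter", 2, 1.4),
    ("critic", 2, 0.9),
]
TRACE_TOTAL_S = 5.3

llm_time_s = sum(duration for _, _, duration in spans)
remainder_s = TRACE_TOTAL_S - llm_time_s

print(f"LLM spans: {llm_time_s:.1f}s, remainder: {remainder_s:.1f}s")
# LLM spans: 5.0s, remainder: 0.3s
```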

When to Use

Condition	Recommendation
You have explicit, measurable quality criteria	✅ Use Evaluator-Optimizer
Iterative improvement is clearly worthwhile	✅ Use Evaluator-Optimizer
Quality criteria are subjective or qualitative	❌ Use Cross-Reflection
You just need approve/reject (no score)	❌ Use Self-Reflection or Cross-Reflection
Budget is tight	❌ Use single-shot generation
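
To judge whether the budget allows this pattern, it helps to bound the number of LLM calls. At 2 calls per round, the worst case is twice `max_rounds`, compared with 1 call for single-shot generation. The sketch below applies that to the three configurations on this page; `max_llm_calls` is a hypothetical helper, not part of PyAgent.

```python
CALLS_PER_ROUND = 2  # one generate call plus one score call


def max_llm_calls(max_rounds: int) -> int:
    """Upper bound on LLM calls when the threshold is never met."""
    return CALLS_PER_ROUND * max_rounds


# max_rounds values taken from the three use cases on this page.
max_rounds_by_use_case = {"ad copy": 4, "documentation": 3, "safety scoring": 3}

worst_case = {
    name: max_llm_calls(rounds) for name, rounds in max_rounds_by_use_case.items()
}
print(worst_case)
# {'ad copy': 8, 'documentation': 6, 'safety scoring': 6}
```

Early stopping usually brings the real count below this bound. The traced run in this document, for example, finishes in 2 rounds, which is 4 calls.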

Cookbook Recipes

Complete, runnable examples that use the Evaluator-Optimizer pattern:

Recipe	Domain	What it does	Complexity
Portfolio Review	Finance & Trading	Analyst panel with an evaluator-optimizer quality gate	Intermediate

See Also

Self-Reflection — same agent critiques and revises without scoring
Cross-Reflection — separate reviewer with qualitative feedback
Human-in-the-Loop — human replaces the evaluator for high-stakes decisions

Explore All Design Patterns

Orchestration: Supervisor · Pipeline · Fan-Out / Fan-In · Hierarchical · Orchestrator-Workers
Resolution: Self-Reflection · Cross-Reflection · Debate · Voting · Evaluator-Optimizer
Structural: Role-Based · Layered · Topology · Blackboard
Iterative & Advanced: ReAct · Talker-Reasoner · Swarm · Human-in-the-Loop
