LLM-as-a-Verifier

LLMAsVerifier is a logprob-aware verification pattern adapted from LLM-as-a-Verifier.

Think of it as a teacher grading answers.

You give it:

  • a task
  • one or more candidates
  • a list of criteria

The verifier scores each candidate, compares candidates when there is more than one, and returns a verdict, a winner, or a tournament ranking.

Single Candidate

# pip install msgflux[openai]
import msgflux as mf
from msgflux.generation.verifiers import LLMAsVerifier, VerificationCriterion

# mf.load_dotenv()

criterion = VerificationCriterion(
    id="correctness",
    name="Correctness",
    description="Assess whether the candidate fully answers the task.",
)

verifier = LLMAsVerifier(
    model="openai/gpt-4.1-mini",
    criteria=[criterion],
    n_verifications=2,
)

result = verifier(
    task="What is 2 + 2?",
    candidates={"answer": "The answer is 4."},
)

print(result.verdict)  # "pass", "fail", or "uncertain"
print(result.score)    # normalized score in [0, 1]
print(result.scores)   # {"answer": ...}

Info

LLMAsVerifier requests logprobs=True and top_logprobs automatically on the model call. If the provider returns token logprobs, the verifier uses them to compute the score distribution. If not, it falls back to parsing the emitted score from text. Set strict_logprobs=True to require logprob-based extraction.
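
The text fallback described above can be pictured with a small sketch. It assumes the single-candidate response carries a tag such as <score>B</score> (the format shown in the granularity section) and that the first letter of the scale maps to 1.0 and the last to 0.0. The regular expression, the function names, and the linear mapping are illustrative assumptions, not the library's actual implementation.

```python
import re
from typing import Optional

# Hypothetical pattern: matches <score>A</score> and captures the letter.
SCORE_TAG = re.compile(r"<score>\s*([A-Z])\s*</score>")


def parse_score_letter(response_text: str, scale: str) -> Optional[str]:
    """Return the last emitted score letter if it is on the scale, else None."""
    matches = SCORE_TAG.findall(response_text)
    if not matches:
        return None
    letter = matches[-1]  # use the final tag if the model emitted several
    return letter if letter in scale else None


def letter_to_score(letter: str, scale: str) -> float:
    """Map a letter to [0, 1]; assumes first letter is best, last is worst."""
    index = scale.index(letter)
    return 1.0 - index / (len(scale) - 1)


scale = "ABCDE"  # a 5-level scale, for illustration only
letter = parse_score_letter("Brief reasoning. <score>B</score>", scale)
print(letter, letter_to_score(letter, scale))  # B 0.75
```

Unlike the logprob path, this fallback sees only the single emitted letter, so the resulting score can take only as many distinct values as the scale has levels.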

Pairwise Comparison

result = verifier(
    task="Pick the better final answer.",
    candidates={
        "paris_answer": "The capital of France is Paris.",
        "lyon_answer": "The capital of France is Lyon.",
    },
)

print(result.verdict)  # "paris_answer", "lyon_answer", or "tie"
print(result.winner)   # winner label or None
print(result.scores)   # {"paris_answer": ..., "lyon_answer": ...}

Pairwise mode is the natural building block for trajectory comparison and candidate selection.

Round-Robin Selection

For tasks with more than two candidates, use select_best:

tournament = verifier.select_best(
    task="Pick the best final answer.",
    candidates={
        "draft_1": "Answer one",
        "draft_2": "Answer two",
        "draft_3": "Answer three",
    },
)

print(tournament.winner)
print(tournament.ranking)
print(tournament.wins)

This runs pairwise comparisons across all candidates and selects the winner by round-robin wins, using average verifier score as a tiebreaker.
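
The selection rule can be sketched in plain Python. The sketch assumes a hypothetical score_pair callable that returns one score in [0, 1] per candidate; in the library that role is played by the pairwise verifier call. The function name, the tie handling, and the return shape are assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: given two labels, return a score in [0, 1] for each.
PairScorer = Callable[[str, str], Tuple[float, float]]


def round_robin(labels: List[str], score_pair: PairScorer):
    """Rank labels by pairwise wins, breaking ties by average score."""
    if len(labels) < 2:
        raise ValueError("round-robin needs at least two candidates")

    wins: Dict[str, int] = {label: 0 for label in labels}
    totals: Dict[str, float] = {label: 0.0 for label in labels}
    matches = len(labels) - 1  # every candidate meets every other one once

    for a, b in combinations(labels, 2):
        score_a, score_b = score_pair(a, b)
        totals[a] += score_a
        totals[b] += score_b
        if score_a > score_b:
            wins[a] += 1
        elif score_b > score_a:
            wins[b] += 1
        # Equal scores are treated as a tie: nobody gets the win.

    average_scores = {label: totals[label] / matches for label in labels}
    ranking = sorted(
        labels,
        key=lambda label: (wins[label], average_scores[label]),
        reverse=True,
    )
    return ranking[0], ranking, wins, average_scores


# Stand-in scorer with fixed quality values instead of a model call.
quality = {"draft_1": 0.9, "draft_2": 0.6, "draft_3": 0.3}
winner, ranking, wins, average_scores = round_robin(
    list(quality), lambda a, b: (quality[a], quality[b])
)
print(winner)   # draft_1
print(ranking)  # ['draft_1', 'draft_2', 'draft_3']
print(wins)     # {'draft_1': 2, 'draft_2': 1, 'draft_3': 0}
```

With n candidates this makes n * (n - 1) / 2 pairwise comparisons, so the number of model calls grows quadratically with the candidate count.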

API Shape

Use __call__ or acall when you want to verify 1 or 2 candidates:

single = verifier(
    task="What is 2 + 2?",
    candidates={"answer": "The answer is 4."},
)

pairwise = verifier(
    task="Pick the better final answer.",
    candidates={
        "draft_1": "The answer is 4.",
        "draft_2": "The answer is 5.",
    },
)

Use select_best or aselect_best when you have more than 2 candidates:

tournament = verifier.select_best(
    task="Pick the best final answer.",
    candidates={
        "draft_1": "The answer is 4.",
        "draft_2": "It is probably 4.",
        "draft_3": "The answer is 5.",
    },
)

How It Works

The verifier does not ask the model for a vague overall judgment. It breaks the evaluation into smaller pieces and then aggregates them.

criteria

criteria are the grading rules.

Examples:

  • correctness
  • completeness
  • grounding
  • error_signals

Instead of asking only "is this good?", the verifier asks smaller questions such as:

  • is it correct?
  • is it complete?
  • is it grounded in the evidence?
  • are there unresolved errors?

That makes the final result easier to control and easier to reuse across different tasks.

n_verifications

n_verifications controls how many times each criterion is evaluated.

If you have:

  • 3 criteria
  • n_verifications=2

the verifier runs 6 evaluations in total, then averages the repeated attempts for each criterion.

Use this to reduce noise:

  • n_verifications=1 is the simplest setting
  • larger values usually give a more stable signal
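
The arithmetic above can be written out as a short sketch. The attempt scores are made-up numbers, and combining criteria with an equal-weight mean is an assumption; the source only states that repeated attempts are averaged per criterion.

```python
from statistics import mean
from typing import Dict, List

# Made-up attempt scores: 3 criteria x n_verifications=2 gives 6 evaluations.
attempt_scores: Dict[str, List[float]] = {
    "correctness": [1.0, 0.9],
    "completeness": [0.8, 0.6],
    "grounding": [0.7, 0.7],
}

total_evaluations = sum(len(scores) for scores in attempt_scores.values())

# Average the repeated attempts for each criterion.
per_criterion = {name: mean(scores) for name, scores in attempt_scores.items()}

# Assumption: criteria are weighted equally when combined into one score.
overall = mean(list(per_criterion.values()))

print(total_evaluations)  # 6
print({name: round(value, 3) for name, value in per_criterion.items()})
print(round(overall, 3))
```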

granularity

granularity controls how many levels exist in the score scale.

from msgflux.generation.verifiers import LLMAsVerifier, ScoreScale

verifier = LLMAsVerifier.answer_reranking(
    model="openai/gpt-4.1-mini",
    score_scale=ScoreScale.letter(granularity=20),
)

With granularity=20, the scale is A..T:

  • A is best
  • T is worst

The model does not return a floating-point score such as 0.83. It returns a discrete score token from the configured scale:

<score>A</score>

In pairwise mode it returns one score token per candidate:

<score_A>B</score_A>
<score_B>H</score_B>

Supported range:

  • 2..26

The runtime keeps the scale letter-based because letter tokens are more stable for logprobs extraction than numeric labels such as 1..20.
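
As a rough illustration of how a letter scale relates to granularity, the sketch below builds the letters for a given size and enforces the 2..26 range. The letter_scale function is a hypothetical helper written for this example, not part of the msgflux API.

```python
import string


def letter_scale(granularity: int) -> str:
    """Return the first `granularity` uppercase letters, best to worst."""
    if not 2 <= granularity <= 26:
        raise ValueError("granularity must be between 2 and 26")
    return string.ascii_uppercase[:granularity]


scale = letter_scale(20)
print(scale)                # ABCDEFGHIJKLMNOPQRST
print(scale[0], scale[-1])  # A T
```

The upper bound of 26 follows directly from the alphabet: there is one letter per level.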

logprobs

This is the part that makes the technique more useful than just reading the final score token.

The verifier asks the model for:

  • logprobs=True
  • top_logprobs=<scale size or override>

This lets it see not only the chosen score token, but also the probabilities of nearby alternatives.

For example, even if the model emits:

<score>A</score>

the underlying distribution might be closer to:

  • A: 0.70
  • B: 0.20
  • C: 0.10

The verifier uses that distribution to compute the final normalized score in [0, 1], instead of trusting only the chosen token.
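
One way to turn such a distribution into a number is a probability-weighted average over the scale. The sketch below assumes that letters map linearly onto [0, 1] (A is 1.0, the last letter is 0.0) and that probabilities are renormalized over the score letters found in top_logprobs. Both choices, and the expected_score name, are assumptions for illustration; the library's exact formula may differ.

```python
import math
import string
from typing import Dict


def expected_score(top_logprobs: Dict[str, float], granularity: int) -> float:
    """Probability-weighted score in [0, 1] from per-token log-probabilities."""
    scale = string.ascii_uppercase[:granularity]
    # Assumed linear mapping: first letter -> 1.0, last letter -> 0.0.
    values = {
        letter: 1.0 - index / (granularity - 1)
        for index, letter in enumerate(scale)
    }
    # Keep only valid score letters and convert logprobs to probabilities.
    probs = {
        token: math.exp(logprob)
        for token, logprob in top_logprobs.items()
        if token in values
    }
    total = sum(probs.values())
    if total == 0.0:
        raise ValueError("no score tokens found in top_logprobs")
    # Renormalize so the kept probabilities sum to 1.
    return sum(values[token] * p for token, p in probs.items()) / total


# The A/B/C distribution from the example above, on a 20-level scale.
logprobs = {"A": math.log(0.70), "B": math.log(0.20), "C": math.log(0.10)}
print(round(expected_score(logprobs, granularity=20), 3))  # 0.979
```

Reading only the emitted token would give 1.0 here; the weighted value of about 0.979 reflects the probability mass the model placed on B and C.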

What Gets Returned

With one candidate, the verifier returns:

  • pass
  • fail
  • uncertain

With two candidates, it returns:

  • the winning label
  • or tie

With more than two candidates, select_best(...) runs pairwise comparisons and returns:

  • winner
  • ranking
  • wins
  • average_scores

Verbose Debugging

Set verbose=True when you want the verifier to return the final prompt and raw model output for each attempt.

verifier = LLMAsVerifier(
    model="openai/gpt-4.1-mini",
    criteria=[criterion],
    verbose=True,
)

result = verifier(
    task="What is 2 + 2?",
    candidates={"answer": "The answer is 4."},
)

print(result.metadata["raw_outputs"][0]["prompt"])
print(result.metadata["raw_outputs"][0]["response_text"])

The same data also remains available in criteria_results[*].attempts[*]:

attempt = result.criteria_results[0].attempts[0]
print(attempt.prompt_text)
print(attempt.response_text)

Custom Prompting

You can replace the default prompt with prompt_builder. The builder receives a VerificationPromptInput containing the task, candidates, criterion, score scale, context, and optional verifier instructions.

from msgflux.generation.verifiers import (
    LLMAsVerifier,
    VerificationCriterion,
    VerificationPromptInput,
)


def build_prompt(data: VerificationPromptInput) -> str:
    label, candidate = next(iter(data.candidates.items()))
    return (
        "Evaluate the answer.\n\n"
        f"Task:\n{data.task}\n\n"
        f"Criterion:\n{data.criterion.description}\n\n"
        f"Candidate ({label}):\n{candidate}\n\n"
        f"<score>{data.score_scale.score_format}</score>"
    )


verifier = LLMAsVerifier(
    model="openai/gpt-4.1-mini",
    criteria=[
        VerificationCriterion(
            id="faithfulness",
            name="Faithfulness",
            description="Check whether the answer is grounded in the context.",
        )
    ],
    prompt_builder=build_prompt,
)

Built-In Presets

Use a preset when you want a good default set of criteria without defining it by hand.

Each preset still returns a normal LLMAsVerifier, so you can override criteria, ground_truth_note, extra_instructions, n_verifications, and the model request kwargs.

from msgflux.generation.verifiers import LLMAsVerifier

trajectory = LLMAsVerifier.trajectory_analysis(
    model="openai/gpt-4.1-mini",
)

reranker = LLMAsVerifier.answer_reranking(
    model="openai/gpt-4.1-mini",
)

grounded = LLMAsVerifier.grounded_answer_verification(
    model="openai/gpt-4.1-mini",
)

patches = LLMAsVerifier.patch_selection(
    model="openai/gpt-4.1-mini",
)

tools = LLMAsVerifier.tool_trace_verification(
    model="openai/gpt-4.1-mini",
)

filtering = LLMAsVerifier.synthetic_data_filtering(
    model="openai/gpt-4.1-mini",
)

terminal = LLMAsVerifier.terminal_bench(
    model="openai/gpt-4.1-mini",
)

swe = LLMAsVerifier.swe_bench_verified(
    model="openai/gpt-4.1-mini",
)

trajectory_analysis

Use this for agent runs and reasoning trajectories.

  • checks whether the task was actually completed
  • checks whether verification was meaningful
  • checks unresolved error signals

answer_reranking

Use this when you have multiple final drafts of the same task.

  • correctness
  • instruction following
  • completeness
  • clarity

grounded_answer_verification

Use this for RAG and other context-grounded tasks.

  • grounding in context
  • unsupported claims
  • answer completeness

patch_selection

Use this for comparing candidate patches or code changes.

  • requirement coverage
  • correctness risk
  • regression risk
  • minimality

tool_trace_verification

Use this for tool-using agents when you want to compare the final answer against the trace and tool outputs.

  • tool grounding
  • unresolved errors
  • final answer quality
  • action efficiency

synthetic_data_filtering

Use this for generated examples before adding them to datasets, evals, or distillation corpora.

  • consistency
  • label quality
  • ambiguity
  • usefulness

terminal_bench

Use this for terminal-task trajectory selection in the style of Terminal-Bench.

  • specification adherence
  • output match
  • unresolved error signals
  • terminal output treated as primary ground truth

swe_bench_verified

Use this for patch and trajectory evaluation in the style of SWE-bench Verified.

  • root-cause analysis
  • code review quality
  • empirical verification from executed commands
  • narration treated as weaker evidence than the actual patch and outputs

Trajectory Formatting Helpers

For benchmark-style presets, pass the full trajectory as the candidate evidence, not only the final answer. Two helpers are available:

  • format_terminal_trajectory(...)
  • format_swe_bench_trajectory(...)

Keep the task or issue description in task. Use the helpers to build each candidate string from commands, outputs, patch text, and final answer.

from msgflux.generation.verifiers import (
    LLMAsVerifier,
    format_terminal_trajectory,
)

candidate = format_terminal_trajectory(
    summary="Installed the binary and verified the version output.",
    metadata={"expected_output": "tool 1.2.0"},
    steps=[
        {
            "command": "cp ./tool /usr/local/bin/tool",
            "exit_code": 0,
        },
        {
            "command": "tool --version",
            "output": "tool 1.2.0",
            "exit_code": 0,
        },
    ],
    final_answer="Installation complete.",
)

verifier = LLMAsVerifier.terminal_bench(model="openai/gpt-4.1-mini")
result = verifier(
    task="Install the binary to /usr/local/bin/tool and verify `tool --version`.",
    candidates={"run_a": candidate},
)

The SWE-bench helper follows the same pattern and also takes the patch text:

from msgflux.generation.verifiers import (
    LLMAsVerifier,
    format_swe_bench_trajectory,
)

# Placeholder: replace with the candidate's unified diff as a string.
patch_text = "..."

candidate = format_swe_bench_trajectory(
    summary="Reproduced the bug, patched the parser, and reran the focused test.",
    steps=[
        {
            "command": "pytest tests/test_parser.py -q",
            "output": "1 failed, 4 passed",
            "exit_code": 1,
        },
        {
            "command": "pytest tests/test_parser.py -q",
            "output": "5 passed",
            "exit_code": 0,
        },
    ],
    patch=patch_text,
    final_answer="Patched parser and verified the focused test.",
)

verifier = LLMAsVerifier.swe_bench_verified(model="openai/gpt-4.1-mini")
result = verifier(
    task="Fix the parser bug described in the issue.",
    candidates={"candidate_patch": candidate},
)

Best Use Cases

This technique is most useful when you need to compare, rank, or filter candidates and the task does not already have a deterministic validator.

Trajectory Selection

This is the most natural fit.

  • compare alternative reasoning trajectories
  • compare multiple agent runs for the same task
  • select the strongest final trajectory before returning it

Candidate Reranking

Use it to rerank multiple drafts of the same task.

  • final answers
  • summaries
  • plans
  • retrieval-grounded responses

Patch Selection

It works well for code generation when you want to compare multiple candidate patches before running deeper validation.

  • choose the patch that best satisfies the task
  • prefer the patch that looks more correct or complete
  • filter obviously weak candidates before tests or deeper validation

Tool-Using Agent Verification

Use it to check whether a final answer is consistent with tool results and the execution trace.

  • verify completion quality
  • verify grounding in tool outputs
  • detect unresolved errors hidden by a confident final answer

Synthetic Data Filtering

Use it to filter generated examples before storing them in datasets, evals, or distillation corpora.

  • reject inconsistent examples
  • reject weak labels
  • keep only high-confidence candidates

Optimizer Feedback

Use it as a reusable feedback signal for future optimizer integrations.

  • provide a reusable reward signal
  • score prompt variants
  • compare sampled candidates during search or optimization

When Not to Use

If the task already has a deterministic check, use that check instead of an LLM verifier. Typical cases:

  • exact-match tasks
  • schema validation
  • unit tests
  • regex-based extraction checks
  • simple business rules
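
For cases like these, a few lines of ordinary code are usually cheaper and more reliable than a model call. The sketch below shows three such checks; the function names and sample inputs are hypothetical and only illustrate the idea.

```python
import json
import re
from typing import Set


def exact_match(candidate: str, expected: str) -> bool:
    """Exact-match check, ignoring surrounding whitespace."""
    return candidate.strip() == expected.strip()


def has_required_keys(candidate_json: str, required: Set[str]) -> bool:
    """Minimal schema check: valid JSON object containing the required keys."""
    try:
        data = json.loads(candidate_json)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data)


def matches_pattern(candidate: str, pattern: str) -> bool:
    """Regex check: the whole candidate must match the pattern."""
    return re.fullmatch(pattern, candidate.strip()) is not None


print(exact_match(" 4 ", "4"))                                    # True
print(has_required_keys('{"id": 1, "name": "x"}', {"id", "name"}))  # True
print(matches_pattern("2024-05-01", r"\d{4}-\d{2}-\d{2}"))         # True
```

An LLM verifier can still be layered on top for the candidates that pass, when quality rather than validity is the open question.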