LLM-as-a-Verifier
LLMAsVerifier is a logprob-aware verification pattern adapted from
LLM-as-a-Verifier.
Think of it as a teacher grading answers.
You give it:
- a task
- one or more candidates
- a list of criteria
The verifier then scores each candidate, compares them when needed, and returns either a verdict, a winner, or a tournament ranking.
Single Candidate
# pip install msgflux[openai]
import msgflux as mf
from msgflux.generation.verifiers import LLMAsVerifier, VerificationCriterion
# mf.load_dotenv()
criterion = VerificationCriterion(
id="correctness",
name="Correctness",
description="Assess whether the candidate fully answers the task.",
)
verifier = LLMAsVerifier(
model="openai/gpt-4.1-mini",
criteria=[criterion],
n_verifications=2,
)
result = verifier(
task="What is 2 + 2?",
candidates={"answer": "The answer is 4."},
)
print(result.verdict) # pass
print(result.score) # normalized score in [0, 1]
print(result.scores) # {"answer": ...}
Info
LLMAsVerifier requests logprobs=True and top_logprobs automatically
on the model call. If the provider returns token logprobs, the verifier
uses them to compute the score distribution. If not, it falls back to
parsing the emitted score from text. Set strict_logprobs=True to require
logprob-based extraction.
Pairwise Comparison
result = verifier(
task="Pick the better final answer.",
candidates={
"paris_answer": "The capital of France is Paris.",
"lyon_answer": "The capital of France is Lyon.",
},
)
print(result.verdict) # "paris_answer", "lyon_answer", or "tie"
print(result.winner) # winner label or None
print(result.scores) # {"paris_answer": ..., "lyon_answer": ...}
Pairwise mode is the natural building block for trajectory comparison and candidate selection.
Round-Robin Selection
For tasks with more than two candidates, use select_best:
tournament = verifier.select_best(
task="Pick the best final answer.",
candidates={
"draft_1": "Answer one",
"draft_2": "Answer two",
"draft_3": "Answer three",
},
)
print(tournament.winner)
print(tournament.ranking)
print(tournament.wins)
This runs pairwise comparisons across all candidates and selects the winner by round-robin wins, using average verifier score as a tiebreaker.
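The selection rule described above can be sketched in plain Python. This is a minimal illustration, not the msgflux implementation: `round_robin` and `score_pair` are hypothetical names, the stub scorer stands in for real pairwise verifier calls, and deciding each pair by comparing the two scores is an assumption.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical pairwise scorer: returns one normalized score per candidate label.
PairScorer = Callable[[str, str], Tuple[float, float]]


def round_robin(labels: List[str], score_pair: PairScorer) -> Dict[str, object]:
    wins = {label: 0 for label in labels}
    totals = {label: 0.0 for label in labels}
    counts = {label: 0 for label in labels}

    # Compare every unordered pair exactly once.
    for a, b in combinations(labels, 2):
        score_a, score_b = score_pair(a, b)
        totals[a] += score_a
        totals[b] += score_b
        counts[a] += 1
        counts[b] += 1
        if score_a > score_b:
            wins[a] += 1
        elif score_b > score_a:
            wins[b] += 1
        # Equal scores are a tie: neither candidate gets the win.

    average_scores = {
        label: totals[label] / counts[label] if counts[label] else 0.0
        for label in labels
    }
    # Rank by wins first, then by average score as the tiebreaker.
    ranking = sorted(
        labels, key=lambda label: (wins[label], average_scores[label]), reverse=True
    )
    return {
        "winner": ranking[0],
        "ranking": ranking,
        "wins": wins,
        "average_scores": average_scores,
    }


# Stub scorer with a fixed quality per candidate, standing in for verifier calls.
QUALITY = {"draft_1": 0.9, "draft_2": 0.6, "draft_3": 0.2}
result = round_robin(list(QUALITY), lambda a, b: (QUALITY[a], QUALITY[b]))
print(result["winner"], result["ranking"], result["wins"])
```

With n candidates this needs n * (n - 1) / 2 pairwise comparisons, so cost grows quadratically with the number of drafts.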
API Shape
Use __call__ or acall when you want to verify 1 or 2 candidates:
single = verifier(
task="What is 2 + 2?",
candidates={"answer": "The answer is 4."},
)
pairwise = verifier(
task="Pick the better final answer.",
candidates={
"draft_1": "The answer is 4.",
"draft_2": "The answer is 5.",
},
)
Use select_best or aselect_best when you have more than 2 candidates:
tournament = verifier.select_best(
task="Pick the best final answer.",
candidates={
"draft_1": "The answer is 4.",
"draft_2": "It is probably 4.",
"draft_3": "The answer is 5.",
},
)
How It Works
The verifier does not ask the model for a vague overall judgment. It breaks the evaluation into smaller pieces and then aggregates them.
criteria
criteria are the grading rules.
Examples:
- correctness
- completeness
- grounding
- error_signals
Instead of asking only "is this good?", the verifier asks smaller questions such as:
- is it correct?
- is it complete?
- is it grounded in the evidence?
- are there unresolved errors?
That makes the final result easier to control and easier to reuse across different tasks.
n_verifications
n_verifications controls how many times each criterion is evaluated.
If you have:
- 3 criteria
- n_verifications=2
the verifier runs 6 evaluations in total, then averages the repeated attempts
for each criterion.
Use this to reduce noise:
- n_verifications=1 is the simplest setting
- larger values usually give a more stable signal
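The arithmetic above (3 criteria with n_verifications=2 gives 6 evaluations, averaged per criterion) can be illustrated with a small sketch. `run_criteria` and `fake_evaluate` are hypothetical helpers, and the scores are made-up stand-ins for model calls.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def run_criteria(
    criteria: List[str],
    n_verifications: int,
    evaluate: Callable[[str, int], float],
) -> Dict[str, float]:
    """Evaluate each criterion n_verifications times and average the attempts."""
    averaged = {}
    for criterion in criteria:
        attempts = [evaluate(criterion, attempt) for attempt in range(n_verifications)]
        averaged[criterion] = mean(attempts)
    return averaged


# Made-up per-attempt scores standing in for real model calls.
FAKE_SCORES: Dict[Tuple[str, int], float] = {
    ("correctness", 0): 1.0,
    ("correctness", 1): 0.5,
    ("completeness", 0): 0.5,
    ("completeness", 1): 0.5,
    ("grounding", 0): 0.25,
    ("grounding", 1): 0.75,
}
calls = []


def fake_evaluate(criterion: str, attempt: int) -> float:
    calls.append((criterion, attempt))  # record every evaluation
    return FAKE_SCORES[(criterion, attempt)]


averaged = run_criteria(["correctness", "completeness", "grounding"], 2, fake_evaluate)
print(len(calls))  # 6 evaluations in total
print(averaged)
```

Each extra verification adds one model call per criterion, so stability is traded against latency and cost.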
granularity
granularity controls how many levels exist in the score scale.
from msgflux.generation.verifiers import LLMAsVerifier, ScoreScale
verifier = LLMAsVerifier.answer_reranking(
model="openai/gpt-4.1-mini",
score_scale=ScoreScale.letter(granularity=20),
)
With granularity=20, the scale is A..T:
- A is best
- T is worst
The model does not return a floating-point score such as 0.83. It returns a
discrete score token from the configured scale. In pairwise mode it returns
one score token per candidate.
Supported range: 2..26
The runtime keeps the scale letter-based because letter tokens are more stable
for logprobs extraction than numeric labels such as 1..20.
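To make the letter scale concrete, here is a minimal sketch of how a granularity could map to letters and to normalized values. `letter_scale` and `normalized_values` are hypothetical helpers, not the `ScoreScale` API, and the linear spacing between best and worst is an assumption.

```python
import string
from typing import Dict, List


def letter_scale(granularity: int) -> List[str]:
    """Return the first `granularity` uppercase letters, best first."""
    if not 2 <= granularity <= 26:
        raise ValueError("granularity must be in 2..26")
    return list(string.ascii_uppercase[:granularity])


def normalized_values(scale: List[str]) -> Dict[str, float]:
    # Assumption: linear spacing, best letter maps to 1.0 and worst to 0.0.
    last = len(scale) - 1
    return {letter: 1.0 - index / last for index, letter in enumerate(scale)}


scale = letter_scale(20)
values = normalized_values(scale)
print(scale[0], scale[-1])  # first and last letters of the scale
print(values["A"], values["T"])
```

The upper bound of 26 follows directly from using one letter per level.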
logprobs
This is the part that makes the technique more useful than just reading the final score token.
The verifier asks the model for:
- logprobs=True
- top_logprobs=<scale size or override>
This lets it see not only the chosen score token but also the nearby alternatives.
For example, even if the model emits a single score token such as A,
the underlying distribution might be closer to:
- A: 0.70
- B: 0.20
- C: 0.10
The verifier uses that distribution to compute the final normalized score in
[0, 1], instead of trusting only the chosen token.
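One plausible way to turn such a distribution into a score in [0, 1] is a probability-weighted average over the scale. The sketch below assumes a three-letter scale and linear letter values. `expected_score` is a hypothetical helper, and the exact aggregation msgflux uses may differ.

```python
import math
from typing import Dict, List


def expected_score(top_logprobs: Dict[str, float], scale: List[str]) -> float:
    """Probability-weighted average of letter values over tokens on the scale."""
    # Keep only tokens that belong to the scale and convert logprobs to probabilities.
    probs = {
        token: math.exp(logprob)
        for token, logprob in top_logprobs.items()
        if token in scale
    }
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no scale tokens found in top_logprobs")
    # Assumption: linear values, best letter maps to 1.0 and worst to 0.0.
    last = len(scale) - 1
    value = {letter: 1.0 - index / last for index, letter in enumerate(scale)}
    # Renormalize so probability mass on off-scale tokens does not skew the score.
    return sum(value[token] * prob / total for token, prob in probs.items())


scale = ["A", "B", "C"]
top_logprobs = {"A": math.log(0.70), "B": math.log(0.20), "C": math.log(0.10)}
score = expected_score(top_logprobs, scale)
print(round(score, 2))  # 0.8
```

Reading only the emitted token would give 1.0 here. The weighted score of 0.8 reflects the probability mass the model placed on B and C.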
What Gets Returned
With one candidate, the verifier returns:
- pass
- fail
- uncertain
With two candidates, it returns:
- the winning label
- or tie
With more than two candidates, select_best(...) runs pairwise comparisons and
returns:
- winner
- ranking
- wins
- average_scores
Verbose Debugging
Set verbose=True when you want the verifier to return the final prompt and raw
model output for each attempt.
verifier = LLMAsVerifier(
model="openai/gpt-4.1-mini",
criteria=[criterion],
verbose=True,
)
result = verifier(
task="What is 2 + 2?",
candidates={"answer": "The answer is 4."},
)
print(result.metadata["raw_outputs"][0]["prompt"])
print(result.metadata["raw_outputs"][0]["response_text"])
The same data also remains available in criteria_results[*].attempts[*]:
attempt = result.criteria_results[0].attempts[0]
print(attempt.prompt_text)
print(attempt.response_text)
Custom Prompting
You can replace the default prompt with prompt_builder. The builder receives a
VerificationPromptInput containing the task, candidates, criterion,
score scale, context, and optional verifier instructions.
from msgflux.generation.verifiers import (
LLMAsVerifier,
VerificationCriterion,
VerificationPromptInput,
)
def build_prompt(data: VerificationPromptInput) -> str:
label, candidate = next(iter(data.candidates.items()))
return (
"Evaluate the answer.\n\n"
f"Task:\n{data.task}\n\n"
f"Criterion:\n{data.criterion.description}\n\n"
f"Candidate ({label}):\n{candidate}\n\n"
f"<score>{data.score_scale.score_format}</score>"
)
verifier = LLMAsVerifier(
model="openai/gpt-4.1-mini",
criteria=[
VerificationCriterion(
id="faithfulness",
name="Faithfulness",
description="Check whether the answer is grounded in the context.",
)
],
prompt_builder=build_prompt,
)
Built-In Presets
Use a preset when you want a good default set of criteria without defining it by hand.
Each preset still returns a normal LLMAsVerifier, so you can override
criteria, ground_truth_note, extra_instructions, n_verifications, and
the model request kwargs.
from msgflux.generation.verifiers import LLMAsVerifier
trajectory = LLMAsVerifier.trajectory_analysis(
model="openai/gpt-4.1-mini",
)
reranker = LLMAsVerifier.answer_reranking(
model="openai/gpt-4.1-mini",
)
grounded = LLMAsVerifier.grounded_answer_verification(
model="openai/gpt-4.1-mini",
)
patches = LLMAsVerifier.patch_selection(
model="openai/gpt-4.1-mini",
)
tools = LLMAsVerifier.tool_trace_verification(
model="openai/gpt-4.1-mini",
)
filtering = LLMAsVerifier.synthetic_data_filtering(
model="openai/gpt-4.1-mini",
)
terminal = LLMAsVerifier.terminal_bench(
model="openai/gpt-4.1-mini",
)
swe = LLMAsVerifier.swe_bench_verified(
model="openai/gpt-4.1-mini",
)
trajectory_analysis
Use this for agent runs and reasoning trajectories.
- checks whether the task was actually completed
- checks whether verification was meaningful
- checks unresolved error signals
answer_reranking
Use this when you have multiple final drafts of the same task.
- correctness
- instruction following
- completeness
- clarity
grounded_answer_verification
Use this for RAG and other context-grounded tasks.
- grounding in context
- unsupported claims
- answer completeness
patch_selection
Use this for comparing candidate patches or code changes.
- requirement coverage
- correctness risk
- regression risk
- minimality
tool_trace_verification
Use this for tool-using agents when you want to compare the final answer against the trace and tool outputs.
- tool grounding
- unresolved errors
- final answer quality
- action efficiency
synthetic_data_filtering
Use this for generated examples before adding them to datasets, evals, or distillation corpora.
- consistency
- label quality
- ambiguity
- usefulness
terminal_bench
Use this for terminal-task trajectory selection in the style of Terminal-Bench.
- specification adherence
- output match
- unresolved error signals
- terminal output treated as primary ground truth
swe_bench_verified
Use this for patch and trajectory evaluation in the style of SWE-bench Verified.
- root-cause analysis
- code review quality
- empirical verification from executed commands
- narration treated as weaker evidence than the actual patch and outputs
Trajectory Formatting Helpers
For benchmark-style presets, pass the full trajectory as the candidate evidence, not only the final answer. Two helpers are available:
- format_terminal_trajectory(...)
- format_swe_bench_trajectory(...)
Keep the task or issue description in task. Use the helpers to build each
candidate string from commands, outputs, patch text, and final answer.
from msgflux.generation.verifiers import (
LLMAsVerifier,
format_terminal_trajectory,
)
candidate = format_terminal_trajectory(
summary="Installed the binary and verified the version output.",
metadata={"expected_output": "tool 1.2.0"},
steps=[
{
"command": "cp ./tool /usr/local/bin/tool",
"exit_code": 0,
},
{
"command": "tool --version",
"output": "tool 1.2.0",
"exit_code": 0,
},
],
final_answer="Installation complete.",
)
verifier = LLMAsVerifier.terminal_bench(model="openai/gpt-4.1-mini")
result = verifier(
task="Install the binary to /usr/local/bin/tool and verify `tool --version`.",
candidates={"run_a": candidate},
)
from msgflux.generation.verifiers import (
LLMAsVerifier,
format_swe_bench_trajectory,
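)
# Placeholder: patch_text should hold the candidate's unified diff as a string.
patch_text = "<unified diff of the candidate patch>"
candidate = format_swe_bench_trajectory(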
summary="Reproduced the bug, patched the parser, and reran the focused test.",
steps=[
{
"command": "pytest tests/test_parser.py -q",
"output": "1 failed, 4 passed",
"exit_code": 1,
},
{
"command": "pytest tests/test_parser.py -q",
"output": "5 passed",
"exit_code": 0,
},
],
patch=patch_text,
final_answer="Patched parser and verified the focused test.",
)
verifier = LLMAsVerifier.swe_bench_verified(model="openai/gpt-4.1-mini")
result = verifier(
task="Fix the parser bug described in the issue.",
candidates={"candidate_patch": candidate},
)
Best Use Cases
This technique is most useful when you need to compare, rank, or filter candidates and the task does not already have a deterministic validator.
Trajectory Selection
This is the most natural fit.
- compare alternative reasoning trajectories
- compare multiple agent runs for the same task
- select the strongest final trajectory before returning it
Candidate Reranking
Use it to rerank multiple drafts of the same task.
- final answers
- summaries
- plans
- retrieval-grounded responses
Patch Selection
It works well for code generation when you want to compare multiple candidate patches before running deeper validation.
- choose the patch that best satisfies the task
- prefer the patch that looks more correct or complete
- filter obviously weak candidates before tests or deeper validation
Tool-Using Agent Verification
Use it to check whether a final answer is consistent with tool results and the execution trace.
- verify completion quality
- verify grounding in tool outputs
- detect unresolved errors hidden by a confident final answer
Synthetic Data Filtering
Use it to filter generated examples before storing them in datasets, evals, or distillation corpora.
- reject inconsistent examples
- reject weak labels
- keep only high-confidence candidates
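Filtering by score can be sketched as a simple threshold over normalized scores. `keep_high_confidence` and the 0.8 cutoff are hypothetical. In practice `score_fn` would wrap a verifier call and read `result.score`, while the stub scores below are made up for illustration.

```python
from typing import Callable, Dict


def keep_high_confidence(
    examples: Dict[str, str],
    score_fn: Callable[[str], float],
    min_score: float = 0.8,  # hypothetical cutoff, tune per dataset
) -> Dict[str, str]:
    """Keep only examples whose normalized score reaches min_score."""
    return {
        name: text for name, text in examples.items() if score_fn(text) >= min_score
    }


# Made-up scores standing in for verifier calls.
STUB_SCORES = {
    "Q: 2 + 2? A: 4": 0.95,
    "Q: 2 + 2? A: 5": 0.10,
    "Q: capital of France? A: Paris": 0.85,
}
examples = {f"ex_{index}": text for index, text in enumerate(STUB_SCORES)}
kept = keep_high_confidence(examples, STUB_SCORES.__getitem__)
print(sorted(kept))
```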
Optimizer Feedback
Use it as a reusable feedback signal for future optimizer integrations.
- provide a reusable reward signal
- score prompt variants
- compare sampled candidates during search or optimization
When Not to Use
Prefer deterministic validation when the task already has a deterministic check.
- exact-match tasks
- schema validation
- unit tests
- regex-based extraction checks
- simple business rules
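A common way to combine both approaches is to run the deterministic check first and fall back to the verifier only when no such check applies. The sketch below is an illustration under that assumption. `deterministic_verdict`, `verify`, and the `llm_verdict` callback are hypothetical names, not part of msgflux.

```python
import re
from typing import Callable, Optional


def deterministic_verdict(
    candidate: str,
    expected: Optional[str] = None,
    pattern: Optional[str] = None,
) -> Optional[str]:
    """Return "pass" or "fail" when a deterministic check applies, else None."""
    text = candidate.strip()
    if expected is not None:
        # Exact-match check.
        return "pass" if text == expected.strip() else "fail"
    if pattern is not None:
        # Regex check: the whole candidate must match the pattern.
        return "pass" if re.fullmatch(pattern, text) else "fail"
    return None


def verify(
    candidate: str,
    llm_verdict: Callable[[str], str],
    expected: Optional[str] = None,
    pattern: Optional[str] = None,
) -> str:
    verdict = deterministic_verdict(candidate, expected=expected, pattern=pattern)
    # Only spend a model call when no deterministic check is available.
    return verdict if verdict is not None else llm_verdict(candidate)


def stub_llm(candidate: str) -> str:
    return "uncertain"  # stands in for a real verifier call


print(verify("4", stub_llm, expected="4"))
print(verify("2024-01-05", stub_llm, pattern=r"\d{4}-\d{2}-\d{2}"))
print(verify("A free-form summary.", stub_llm))
```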