What Is the Verifier Pattern in Claude Code Skills?
The verifier pattern is a Claude Code skill design technique that catches output failures before they reach the user. It splits the skill's execution into three sequential roles: a planner that structures the work, an executor that produces the output, and a verifier that checks the output against specific criteria before it is returned. All three roles run inside a single skill execution, using the same context window. AEM uses this pattern across structured document skills where output quality must be consistent regardless of who runs the skill.
TL;DR: The verifier pattern adds a mandatory quality-check step to a skill's process. After the executor produces output, the verifier evaluates it against 4-6 specific criteria. If any fail, the executor revises before the response is returned. No second model is required. In our builds, this cuts first-draft failure rate from 25-30% to 8-12% on structured document skills.
A skill that does not verify its own output is asking the user to do the job the skill should do. Most users do not notice the gaps until they are embarrassed by them. The verifier pattern removes that dependency on the user's attention.
What Are the Three Roles in the Verifier Pattern?
Three roles execute sequentially inside a single skill run: a planner that reads the input and writes a structured blueprint, an executor that follows that blueprint to produce the output, and a verifier that checks the output against specific criteria before it is returned to the user. No separate model call is needed for any of them.
- Planner: reads the input and writes a structured plan before any output is produced. The plan specifies what sections the output will include, what the tone should be, what constraints apply, and any input-specific requirements. The planner does not produce the output itself. It produces the blueprint the executor follows.
- Executor: takes the planner's blueprint and produces the full output. The executor follows the plan rather than working from the raw input, which prevents the most common executor failure: producing a response that addresses the user's question but misses the structural requirements specified in the skill.
- Verifier: evaluates the executor's output against two sources: the planner's blueprint and a fixed list of quality criteria in the skill's SKILL.md. The verifier reports failures. If failures exist, the executor revises and the verifier checks again before the response is returned.
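The three roles run as one prompt sequence inside a single context window, so there is no literal code involved, but the control flow is easier to see written out. A minimal sketch, assuming a hypothetical `ask` callable that stands in for one round of model generation (not a real Claude Code API):

```python
def run_skill(user_input: str, criteria: list[str], ask) -> str:
    # Planner: produce a blueprint, not the output itself.
    plan = ask(f"Write a structured plan (sections, tone, constraints) for:\n{user_input}")

    # Executor: follow the blueprint rather than the raw input.
    output = ask(f"Following this plan exactly, produce the output:\n{plan}")

    # Verifier: check against enumerated binary criteria, not open-ended quality.
    checklist = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    for _ in range(2):  # at most two verify/revise rounds in this sketch
        report = ask(f"Report PASS or FAIL for each criterion:\n{checklist}\n\nOutput:\n{output}")
        if "FAIL" not in report:
            break
        output = ask(f"Revise the output to fix these failures:\n{report}\n\nOutput:\n{output}")
    return output
```

In an actual skill, the three prompts are steps in SKILL.md and the loop is an instruction ("revise and re-check"), not Python, but the sequencing is the same.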
The verifier step is not a read-through. It is a structured check against specific, enumerated criteria. Research finds that LLMs accept invalid outputs as valid more than 75% of the time when asked open-ended quality questions, a phenomenon researchers call agreeableness bias (NUS AI Centre, arxiv.org/abs/2510.11822, 2025). Questions like "Does the output include all five required fields? Does the summary section stay under 3 sentences? Is the tone formal?" each produce a checkable result.
What Does the Verifier Step Look Like in Practice?
The verifier step is a text instruction block placed at the end of the skill's SKILL.md process: it lists 4-6 specific binary criteria, instructs the model to report each as PASS or FAIL, and requires the executor to revise and re-check before returning any output that failed. Here is the pattern:
## Step N: Verify Output
Before returning the response, check your output against all of the following criteria.
Report each check explicitly: PASS or FAIL with a brief note.
1. All required sections present: [list them]
2. Summary section: 3 sentences maximum
3. Tone: formal, no contractions
4. No required information from the input is missing from the output
5. [Skill-specific criterion]
6. [Skill-specific criterion]
If any criterion is FAIL, revise the output to fix the failure.
Run the check again after revision.
Return only the verified, passing output.
The criteria in the verifier step are not generic. They are specific to the skill's task. A verifier for a client proposal skill checks different criteria than a verifier for a code review skill.
In our builds, the verifier criteria come directly from the skill's evals: the same assertions that test cases check become the criteria the verifier checks at runtime. If you have written evals for the skill (which you should, before writing instructions), the verifier criteria are already defined. You are just moving them into the runtime execution path.
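One way to enforce that sharing is to keep the criteria in a single list and render them into both the eval assertions and the verifier step. A hypothetical sketch; the function names and JSON shape are assumptions for illustration, not a Claude Code convention:

```python
import json

# Single source of truth: the same assertions drive the offline evals
# and the runtime verifier step.
CRITERIA = [
    "All required sections present: Summary, Scope, Timeline, Cost, Next Steps",
    "Summary section: 3 sentences maximum",
    "Tone: formal, no contractions",
]

def render_eval_assertions() -> str:
    # Offline: emit the criteria as assertions for the test suite.
    return json.dumps({"assertions": CRITERIA}, indent=2)

def render_verifier_step(step_number: int) -> str:
    # Runtime: emit the same criteria as a SKILL.md verification step.
    lines = [
        f"## Step {step_number}: Verify Output",
        "Report each check explicitly: PASS or FAIL with a brief note.",
    ]
    lines += [f"{i}. {c}" for i, c in enumerate(CRITERIA, start=1)]
    lines.append("If any criterion is FAIL, revise and re-check before returning.")
    return "\n".join(lines)
```

When a criterion changes, both the test suite and the runtime check pick it up from the same place.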
For more on writing evals that double as verifier criteria, see Evaluation-First Skill Development: Write Tests Before Instructions.
When Is the Verifier Pattern Worth the Extra Token Cost?
Adding a structured verification step to a skill increases compute time by 20-80% for reasoning models, according to Wharton Generative AI Lab research (2024), and that overhead is worth paying when the cost of a wrong output exceeds the cost of the extra tokens. For short internal notes, it rarely does. For client-facing structured documents, it consistently does. Four conditions mark where verification earns its cost:
- The output leaves the session: if the output goes to a client, a stakeholder, a published page, or any external audience, the cost of a wrong output exceeds the compute overhead of the verification step. The verifier catches the failures the user would have to correct after the fact.
- The output is long and structured: a 1,500-word proposal with five required sections has more surface area for failures than a 100-word answer. Long structured outputs benefit from verification proportionally more than short outputs.
- Failure is embarrassing rather than just annoying: a skill that generates internal notes can tolerate occasional failures without significant consequence. A skill that generates client-facing content cannot.
- The skill is used by people who are not the skill designer: the designer catches gaps through familiarity. A new user does not know what to look for. The verifier provides the check that familiarity normally provides.
"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, Anthropic (2024)
The stakes for unverified output are concrete: 47% of enterprise AI users made at least one major decision based on hallucinated AI content in 2024 (Deloitte, 2024). The verifier pattern operationalizes a closed spec at runtime. The criteria are the spec. The verifier checks the output against it before every delivery.
When Does the Verifier Pattern Not Work?
The verifier pattern adds cost without benefit in three situations: when criteria cannot be written as concrete binary checks, when the output has no required structure for criteria to check against, and when latency is visible and the added processing creates friction. Spotting these before adding a verifier step saves tokens and avoids false confidence in an uncalibrated check.
- The verification criteria cannot be specified: if you cannot write 4-6 concrete criteria for evaluating the output, the verifier has nothing to check against. "Is this a good response?" is not a criterion. This usually means the output contract for the skill is not precise enough. Fix the output contract first.
- The skill produces short, variable outputs: a skill that generates a one-sentence answer or a 5-line code snippet does not have enough structure for verification criteria to be meaningful. The verifier pass costs tokens but cannot catch failures in outputs with no required structure.
- Speed matters more than quality for this specific use case: a skill used in a real-time chat interaction where latency is visible to the user should not add 20-30% to its execution time for verification. Save the verifier for batch or async tasks where a few seconds of extra processing are invisible to the user.
This pattern cannot catch failures that the verification criteria do not cover. A verifier checking five criteria misses failures in everything the criteria do not specify. That limitation is not unique to the verifier pattern; it follows from the fact that the criteria define the spec. Well-written criteria produce well-caught failures.
How Does the Verifier Pattern Relate to Evals?
Evals test the skill offline against a fixed set of test cases, catching failures in the skill's design before real users encounter them, while the verifier pattern tests every production execution against real inputs the test suite did not cover. The two mechanisms are complements, not alternatives: you need both.
Both use the same underlying question: does this output meet the criteria? The difference is when and how.
Evals run before launch and during iteration, against a fixed set of test cases. They catch failures in the skill's design before real users encounter them. For more on evals, see What Are Evals in Claude Code Skills.
The verifier runs on every production execution, against the actual output from the actual input. It catches failures that appear in real inputs the evals did not cover.
The two mechanisms are complements. Evals ensure the skill's design is correct. The verifier ensures each execution of that design meets the bar. A skill with strong evals but no verifier passes its test suite and fails on the cases the tests did not anticipate. A skill with a verifier but no evals has no structured way to confirm the verifier criteria themselves are correct.
The stakes for skipping both are visible in the industry data: Gartner estimates 85% of AI projects deliver erroneous outcomes due to inadequate testing and data quality (Gartner, 2024). Structured verification at the reasoning level addresses this directly. Research on self-consistency finds that generating and checking multiple reasoning paths improves accuracy by up to 17.9% on benchmark tasks versus single-pass generation (Wang et al., ICLR 2023, arxiv.org/abs/2203.11171). Process-level verifiers show similar gains: beam search guided by Process Advantage Verifiers improves accuracy by over 8% and compute efficiency by 1.5 to 5x compared to outcome-only reward models (ICLR 2025, arxiv.org/abs/2504.00449).
FAQ
Does the verifier pattern require a separate API call to a different model?
No. The verifier runs inside the same context window as the planner and executor. It is a text instruction in the SKILL.md process, not a separate model call. The same Claude session that generates the output also verifies it. If you want independent verification using a separate model call, that is the LLM-as-judge pattern, which is more expensive and more independent.
What happens if the verifier finds a failure the executor cannot fix?
The verifier should include a fallback instruction: "If the failure cannot be fixed without additional information from the user, describe the gap and ask for the missing information rather than returning a partial output." This prevents the skill from looping indefinitely on an unfixable gap and gives the user a clear explanation of what is needed.
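If the loop is orchestrated outside the prompt (for example, in a wrapper script around the skill), the same cap-and-fallback logic can be made explicit. A hypothetical sketch; `check` and `revise` stand in for the verifier and executor steps:

```python
MAX_REVISIONS = 2  # assumption: two attempts before falling back

def verify_with_fallback(output: str, check, revise) -> str:
    # check(output) returns the list of failed criteria (empty means all PASS).
    # revise(output, failures) returns a revised output.
    for _ in range(MAX_REVISIONS):
        failures = check(output)
        if not failures:
            return output
        output = revise(output, failures)
    # Still failing after the cap: describe the gap instead of looping
    # indefinitely or returning a partial output.
    remaining = check(output)
    if remaining:
        return "Cannot complete without more information: " + "; ".join(remaining)
    return output
```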
How many verification criteria are enough?
Four to six. Fewer than four and the verifier is checking too narrow a surface area. More than six and the verifier step becomes a long overhead that slows down the response and dilutes attention across too many checks. If you have more than six criteria, split the skill into two phases with a verification step at the end of each.
Can the verifier modify the output, or does it only report failures?
In most implementations, the verifier reports failures and the executor revises. This keeps the roles clean: the verifier is a judge, not an editor. Some teams combine the roles into a single "review and revise" step. That is a pragmatic shortcut that works for simple skills. For skills where the failure mode analysis is worth capturing (to feed back into evals), keeping the reporting and revision separate produces a clearer signal about what went wrong.
Should verification criteria be the same as evals.json assertions?
Yes, where possible. Using the same criteria in both places means the offline test suite and the runtime verifier enforce the same bar. When an eval assertion fails during testing, you update the skill. When a verifier criterion fails during production, you have a new candidate eval assertion from a real-world failure mode.
Does the verifier pattern work for code generation skills?
Partially. For structural checks (does the code include all required functions, does it have the expected file structure), yes. For correctness checks (does the code actually work), the verifier cannot run the code, so it cannot verify functional correctness. Use evals with actual test execution for code correctness, and use the verifier for structural and style criteria the evals cannot catch.
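For the structural half of that split, the check does not even need the model: required-function presence can be verified deterministically. A sketch using Python's `ast` module; the criterion (which functions are required) is an assumption for illustration:

```python
import ast

def missing_functions(source: str, required: set[str]) -> set[str]:
    # Parse the generated code and collect every defined function name.
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    # Structural criterion: every required function must be present.
    return required - defined
```

A check like this can run as a hook or eval step; the in-context verifier then covers only the criteria that need judgment, such as style.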
My verifier always returns PASS. What is wrong?
The verification criteria are too vague. "Is the output complete?" produces PASS almost every time. "Does the output include exactly these five sections in this order?" produces actionable results. Rewrite each criterion as a checkable binary: either the criterion is met or it is not. Criteria that cannot be evaluated as pass or fail are not criteria; they are intentions.
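The rewritten criterion is binary precisely because it could, in principle, be checked mechanically. A sketch of a mechanical version of that criterion (required sections present, in order); the section names are hypothetical:

```python
# Hypothetical required headings for a proposal skill's output.
REQUIRED_SECTIONS = ["Summary", "Scope", "Timeline", "Cost", "Next Steps"]

def sections_in_order(output: str) -> bool:
    # Binary check: every required heading appears, in the required order.
    positions = [output.find(f"## {name}") for name in REQUIRED_SECTIONS]
    return all(p != -1 for p in positions) and positions == sorted(positions)
```

If you cannot imagine writing a check like this for a criterion, the criterion is probably an intention, not a binary.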
Last updated: 2026-04-16