What Advanced Skill Design Patterns Exist Beyond Basic SKILL.md Files?

Four advanced patterns extend what a Claude Code skill can do beyond a basic SKILL.md file. At AEM, these are the patterns we apply when a skill is working and needs higher reliability or lower maintenance overhead. The verifier pattern catches output failures before the user sees them. Auto-research loops improve skill quality automatically over time. LLM-as-judge replaces human evaluation with a second model invocation. Meta-skills build other skills. None of them require external tools or additional infrastructure.

TL;DR: Four advanced patterns extend a Claude Code skill: the verifier pattern (planner, executor, and verifier in one session), auto-research loops (scheduled quality evaluation with proposed improvements), LLM-as-judge (a second model call scoring output against a rubric), and meta-skills (skills that generate other skills). Start with the verifier pattern.

These patterns are not decorations on a working skill. A skill needs a functional SKILL.md, a clear output contract, and passing evals before advanced patterns add value. Apply them to skills that are already working and need higher reliability or reduced maintenance, not to skills that are still failing their basic requirements.


What Is the Verifier Pattern?

The verifier pattern adds a quality-check role to the skill's process: a planner writes the output plan, an executor produces the output, and a verifier checks that output against a short list of criteria before the response is returned to the user. This catches structural gaps and unmet requirements that the executor's own generation context tends to overlook. All three roles run with no external calls.

Three roles execute inside a single context window:

  1. Planner: reads the input and writes a structured plan for what the output should include
  2. Executor: produces the output by following the plan
  3. Verifier: checks the executor's output against the plan and a short list of quality criteria, reports any gaps, and triggers a revision if gaps are found

The verifier is not a second model call. It runs as a final step in the same session: "Now check your output against these four criteria and report any failures before I return the response."

In our builds, the verifier pattern reduces first-draft failure rate from 25-30% to 8-12% on structured document skills (proposals, reports, briefs). The cost is 20-30% more tokens per run. For outputs that matter, that tradeoff is correct.
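Because the verifier is just a final instruction in the skill's process, adding it is string assembly, not infrastructure. The sketch below shows one way to append a verifier step to an existing process section; the template wording and criteria are illustrative, not taken from a real SKILL.md:

```python
# Sketch: appending a same-session verifier step to a skill's process section.
# Template wording and example criteria are illustrative assumptions.

VERIFIER_TEMPLATE = """\
## Step {n}: Verify Output
Before returning the response, check your output against these criteria:
{criteria}
Report any gaps. Revise if gaps are found. Then return the verified output."""

def add_verifier_step(process: str, criteria: list[str], step_number: int) -> str:
    """Append a verifier step so all three roles run in one context window."""
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    step = VERIFIER_TEMPLATE.format(n=step_number, criteria=bullet_list)
    return process.rstrip() + "\n\n" + step

process = "## Step 1: Plan the output\n## Step 2: Draft the output"
updated = add_verifier_step(
    process,
    ["All required sections are present", "Tone matches approved examples"],
    step_number=3,
)
```

The point of the sketch is that the verifier lives in the same prompt as the planner and executor, which is what distinguishes it from LLM-as-judge.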

For a deep-dive on the verifier pattern mechanics, see What Is the Verifier Pattern.


What Is the Auto-Research Pattern?

Auto-research is a scheduled loop that runs the skill on real examples, scores each output against a structured rubric, and proposes specific edits to the skill file for human review. It runs nightly or weekly with no human in the evaluation step, replacing ad-hoc manual review with a repeatable quality measurement that accumulates evidence across real production runs.

It requires a rubric precise enough to replace human judgment at that evaluation layer.

Three-level criteria framework for auto-research:

  1. Level 1: Hard rules. Objective requirements the output must always meet. "Output must include all five required fields." Checkable without model judgment.
  2. Level 2: Pattern matching. Behavioral consistency. "Output tone must match approved-examples." Requires an LLM evaluator to check.
  3. Level 3: Deep creative quality. "Does the output contain at least one non-obvious insight?" Hard to automate reliably without a human review step.

Auto-research works at levels 1 and 2 without human review. Level 3 improvements get flagged for human approval before being applied to the skill.
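A level-1 check is the easiest part of the loop to make concrete, since it needs no model judgment at all. The sketch below checks the "all five required fields" rule from the framework above; the field names themselves are hypothetical:

```python
# Sketch of a level-1 (hard rule) check in an auto-research loop: objective
# requirements verifiable without model judgment. Field names are hypothetical.

REQUIRED_FIELDS = ["summary", "audience", "deliverables", "timeline", "next_steps"]

def check_hard_rules(output: dict) -> list[str]:
    """Return a list of level-1 failures; an empty list means the output passes."""
    failures = []
    for field in REQUIRED_FIELDS:
        if not output.get(field):
            failures.append(f"missing required field: {field}")
    return failures

# A run that omits three fields fails deterministically, giving the
# optimization loop a clean signal with no LLM evaluator involved.
failures = check_hard_rules({"summary": "Q3 recap", "audience": "client team"})
```

Level-2 checks would replace the field lookup with an LLM evaluator call, which is why rubric precision matters more there.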

In commissions where we have run auto-research over four-week cycles, the documented improvement range on the primary quality metric is 9-27%. The ceiling is the evaluation criteria quality, not the optimization process.

"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)

This applies directly to auto-research rubrics: a loose level-1 criterion produces scores that cannot reliably distinguish between a passing and a failing output, which means the optimization loop has no valid signal to work from.


What Is LLM-as-Judge?

LLM-as-judge uses a second model invocation to evaluate skill output before returning it to the user: the judge receives only the finished output and a structured rubric, not the generation context, so its score reflects what the user will actually see rather than what the model intended to produce.

This differs from the verifier pattern in one key way: the judgment happens outside the generation context, using a separate model call. The judge has not seen the generation process and evaluates only the finished output.

LLM-as-judge is the right pattern when:

  • Output quality depends on judgment, not just structure
  • You have a rubric with concrete, discriminating score descriptions
  • The cost of a wrong output is higher than the cost of an extra model call

It is the wrong pattern when:

  • The skill already produces consistent output
  • Your evaluation criteria are too vague to produce a meaningful score
  • Speed is the primary constraint for this skill's use case

"When you give a model an explicit output format with examples, consistency goes from ~60% to over 95% in our benchmarks." — Addy Osmani, Engineering Director, Google Chrome (2024)

LLM-as-judge works on the same principle: giving the judge a specific rubric with concrete score descriptions produces evaluations that are actually predictive of quality. A vague judge prompt produces a 94% "looks good" rate regardless of actual quality. Research confirms this direction: G-Eval, a chain-of-thought evaluation framework, achieved a Spearman correlation of 0.514 with human judgment on summarization tasks, outperforming all prior automated methods by a margin that scaled directly with evaluation-step specificity (Liu et al., EMNLP 2023). The research consensus is the same as the AEM operational finding: most evaluation failures trace to underspecified criteria, not model limitations.
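The separation the pattern relies on can be sketched as two small functions: one builds a judge prompt from the rubric and the finished output only, and one parses the judge's score. The rubric text and the JSON response contract below are assumptions; the actual second-model invocation is left out as a placeholder:

```python
import json

# Sketch: a judge prompt built from a rubric plus the finished output only.
# Rubric wording and the JSON reply contract are illustrative assumptions;
# sending the prompt to a second model is left to the caller.

RUBRIC = {
    1: "Misses required structure or contradicts the brief.",
    3: "Structurally complete but generic; no discriminating detail.",
    5: "Complete, specific, and consistent with approved examples.",
}

def build_judge_prompt(output: str) -> str:
    levels = "\n".join(f"{score}: {desc}" for score, desc in RUBRIC.items())
    return (
        "You are evaluating a finished output. You have not seen how it was "
        "generated.\n\nRubric:\n" + levels +
        '\n\nReply with JSON: {"score": <1-5>, "reason": "<one sentence>"}\n\n'
        "Output to evaluate:\n" + output
    )

def parse_judgment(response_text: str) -> tuple[int, str]:
    """Parse the judge's JSON reply into a (score, reason) pair."""
    data = json.loads(response_text)
    return int(data["score"]), data["reason"]

score, reason = parse_judgment('{"score": 4, "reason": "Complete but tone drifts."}')
```

Note that the judge prompt carries concrete score descriptions rather than "rate this 1-5", which is the specificity the G-Eval result above points at.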

For rubric design principles, see What Is a Rubric in a Claude Code Skill.


What Are Meta-Skills That Build Other Skills?

A meta-skill is a skill whose output is another skill file: it takes a brief from the user describing the desired capability and produces a complete, correctly formatted SKILL.md with all required sections. It is the right pattern when skill engineering has become a repeated task and blank-page setup is worth removing from the workflow.

The meta-skill pattern is useful for:

  • Teams that commission skills regularly and need a repeatable brief-to-SKILL.md workflow
  • Skill engineers who want to standardize their initial skill structure before customizing
  • Users who need a skill quickly and can refine from a strong starting point rather than building from zero

A well-built meta-skill:

  • Asks for the skill's purpose, trigger condition, and output format before generating anything
  • Produces a SKILL.md with all required sections: description, output contract, process steps, known failures, and self-improvement infrastructure
  • Names failure modes specific to the requested task, not generic failure modes that apply to every skill

A meta-skill is not a shortcut to skip the skill engineering process. The brief-to-SKILL.md generation is a starting point, not a finished product. Every generated skill needs testing with a fresh context window, evals, and refinement based on real-world runs. Context budget matters here: Anthropic's Claude Code documentation notes that re-attached skills share a combined 25,000-token budget after summarization, which means a meta-skill that generates bloated SKILL.md files can crowd out the skills it was designed to support (Anthropic, Claude Code Docs, 2025).
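The "all required sections" requirement is checkable before a generated skill ever reaches testing. The sketch below validates meta-skill output against the section list above; the `## Heading` convention and exact section names are assumptions about how the generated file is structured:

```python
# Sketch: validating meta-skill output before human review. Assumes the
# generated SKILL.md marks its sections with "## <Name>" headings; both the
# heading convention and the section names are illustrative assumptions.

REQUIRED_SECTIONS = [
    "Description",
    "Output Contract",
    "Process",
    "Known Failures",
    "Self-Improvement",
]

def missing_sections(skill_md: str) -> list[str]:
    """Return required sections absent from a generated SKILL.md."""
    headings = {line.lstrip("# ").strip() for line in skill_md.splitlines()
                if line.startswith("## ")}
    return [s for s in REQUIRED_SECTIONS if s not in headings]

generated = "## Description\n...\n## Process\n...\n## Known Failures\n..."
gaps = missing_sections(generated)
```

A check like this catches structural omissions cheaply; it says nothing about whether the named failure modes are actually specific to the task, which still needs human review.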


Which Advanced Pattern Should I Start With?

Start with the verifier pattern. It requires no new files, no scheduled processes, and no external infrastructure: a few lines added to the SKILL.md process section catch output failures before they reach the user. Of the advanced patterns we apply at AEM, it delivers the highest quality-to-implementation-cost ratio.

The implementation is a short block in the SKILL.md process:

## Step N: Verify Output
Before returning the response, check your output against these criteria:
[4-6 specific quality criteria for this skill's task]
Report any gaps. Revise if gaps are found. Then return the verified output.

The quality improvement appears immediately. In our builds, the first run with the verifier step reliably catches 1-2 failures that the executor missed. The 20-30% token cost increase pays for itself on the first output the user would have otherwise had to correct manually.

Add the other patterns in this order once the verifier is in place:

  1. Self-improvement infrastructure (learnings, edge-cases, approved-examples, feedback gate)
  2. LLM-as-judge once you have a rubric
  3. Auto-research once LLM-as-judge is producing reliable evaluations

Meta-skills belong to a separate track. Build one when you are engineering skills frequently enough that the brief-to-SKILL.md process has become a repeated task.

For the full self-improvement architecture that underpins these advanced patterns, see Claude Code Skills That Get Better Over Time.


FAQ

Can I use multiple advanced patterns in the same skill?

Yes. The patterns stack. A skill with a verifier, a learnings file, and LLM-as-judge has three independent quality mechanisms working at different layers. The verifier catches self-consistency failures during generation. The learnings file corrects behavioral patterns from real runs. LLM-as-judge catches quality failures the verifier's criteria did not cover. Each pattern addresses a different failure mode.

Do these patterns work with Haiku and not just Sonnet or Opus?

The verifier pattern works across all Claude tiers. The quality of the verification step scales with the model: Opus produces more precise identification of gaps, Haiku catches the obvious ones. LLM-as-judge works similarly. Auto-research and meta-skills are more sensitive to model capability because they involve generating structured skill content rather than checking existing output.

What is the minimum skill complexity that justifies using the verifier pattern?

Any skill that produces output a user sends to someone else (a client, a stakeholder, a user) justifies the verifier pattern. The question is not skill complexity, it is output consequence. A skill that generates internal notes does not need verification. A skill that generates client proposals does.

Can the auto-research loop update SKILL.md automatically without human review?

Not at level 3 (creative quality). At levels 1 and 2, you can automate the write-back with appropriate checks. The recommended practice is to generate a diff of proposed changes and require explicit approval before applying. Automatic write-back without review creates a skill that drifts from its original purpose, accumulating optimizations that pass the evaluation criteria but fail in ways the criteria did not measure.
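The diff-for-approval step can be sketched with the standard library alone; the file contents and where the approval decision comes from are illustrative:

```python
import difflib

# Sketch: render auto-research's proposed SKILL.md edits as a unified diff
# for explicit human approval before any write-back. The example strings
# are illustrative; storage and the approval hook are left to the caller.

def proposed_diff(current: str, proposed: str) -> str:
    """Return a unified diff of the proposed SKILL.md changes."""
    return "".join(difflib.unified_diff(
        current.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile="SKILL.md",
        tofile="SKILL.md (proposed)",
    ))

current = "Check tone against approved examples.\n"
proposed = "Check tone against approved-examples.md before drafting.\n"
diff = proposed_diff(current, proposed)
# The loop stores this diff and applies it only after explicit approval,
# keeping the human in the write-back step even when evaluation is automated.
```

Requiring the diff to be approved, rather than silently rewriting the file, is what prevents the drift described above.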

How is LLM-as-judge different from just reading the output carefully myself?

Speed and consistency. LLM-as-judge runs in seconds and applies the rubric identically on every run. Human review takes minutes and introduces session fatigue: the fifteenth review of the day is less careful than the first. For skills that run at high frequency, LLM-as-judge maintains evaluation quality across volume that human review cannot match.

Do meta-skills produce skills that are ready to use immediately?

No. A meta-skill produces a starting point. The generated SKILL.md needs testing with a fresh context window, evals to check trigger behavior and output quality, and real-world iteration before it is production-ready. The value of the meta-skill is not eliminating the skill engineering process. It is producing a correctly structured starting point that does not require a blank-page design session.

Last updated: 2026-04-16