---
title: "Evaluation-First Skill Development: Write Tests Before Instructions"
description: "Learn evaluation-first development for Claude Code skills: write evals.json test cases before instructions to build skills that work in production."
pubDate: "2026-04-16"
category: skills
tags: ["claude-code-skills", "evaluation", "evals", "testing", "skill-engineering", "rubric-design"]
cluster: 16
cluster_name: "Evaluation System"
difficulty: intermediate
source_question: "Evaluation-First Skill Development: Write Tests Before Instructions"
source_ref: "Pillar.6"
word_count: 2890
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---
# Evaluation-First Skill Development: Write Tests Before Instructions
TL;DR: Evaluation-first development means writing your evals.json test cases before a single line of SKILL.md instructions. You define what "correct" looks like first, then build the skill to pass those tests. The result: skills that work for users in production instead of only in the session where you built them.
Most skill developers write instructions first, test later, and then wonder why the skill breaks in production. The failure log for that approach has a lot of entries.
## What does evaluation-first development mean for Claude Code skills?
Evaluation-first development is the practice of specifying your success criteria as executable tests before writing any skill instructions — in AEM commissions for Claude Code skill engineering, this means drafting evals.json with 10-20 test cases, defining their expected behaviors, and only then writing the SKILL.md body that satisfies them.
The pattern comes from test-driven development in software engineering, applied to the domain of AI skill design. Roughly one in four software engineers use TDD as a regular practice (State of TDD, 2024 survey) — the adoption ceiling exists because writing tests first requires discipline before the code exists to test. Teams that do practice TDD ship 32% more frequently than non-TDD peers (Thoughtworks, 2024). The key difference for skills: traditional TDD tests deterministic functions. Eval-first skill development tests probabilistic agent behavior. Your evals assert structure, trigger conditions, and behavioral patterns, not exact string outputs.
> "The failure mode isn't that the model is bad at the task, it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." -- Simon Willison, creator of Datasette and llm CLI (2024)
This is why you write evals first. The act of specifying tests forces you to define what "tightly enough" means before you write a single instruction. Writing instructions before evals is like studying for an exam you wrote yourself. You will pass every time. That is not the goal.
## Why do most Claude Code skills fail without evals?
Most Claude Code skills fail because they are built and tested in the same session, with the same context Claude has from the conversation — that is Claude A, the authoring instance, testing its own output, and the skill looks correct in that session because Claude A is unknowingly supplying context the instructions never captured.
When a new user triggers the skill, they get Claude B: a fresh session with no authoring context, no prior conversation, and no implicit understanding of what the skill is meant to do. Claude B fails on inputs Claude A would have handled, because the missing context lived in the authoring conversation rather than in the skill itself.
In our builds, the single most common failure pattern is skills that pass author-session testing but fail in production because the author supplied context the instructions never captured. We documented this in 6 of our last 10 commissions. Evals catch it before it ships.
The second failure pattern is trigger gaps. A skill works correctly when explicitly invoked with /skill-name but never triggers automatically, because its description does not match how users naturally phrase the request. Without trigger evals, this gap is invisible until users give up and invoke it manually, or give up entirely. An ETH Zurich study on Claude Code context files found that developer-written instructions improved agent task completion by only 4% on average, and LLM-generated context files made performance worse by 3% — both cases representing specification that was not evaluated against real trigger behavior before shipping (tessl.io, 2025).
For a full breakdown of the trigger failure modes and how to fix them, see Why Your Claude Code Skill Isn't Triggering (and How to Fix It).
## How do you write evals.json test cases?
Each test case in evals.json has three required components: a prompt, an expected_behavior array, and a tags field — the prompt is what the user would send, the expected_behavior array lists 3-5 assertions that any correct output must satisfy, and the tags field classifies the test as trigger, quality, or edge-case.
Here is the format used in production AEM skills:
```json
{
  "test_cases": [
    {
      "id": "TC001",
      "tags": ["trigger", "beginner"],
      "prompt": "Review this pull request for security issues",
      "expected_behavior": [
        "skill triggers without explicit /skill-name invocation",
        "output includes a structured findings section",
        "each finding has severity: critical, high, medium, or low",
        "output does NOT include unrequested refactoring suggestions"
      ]
    },
    {
      "id": "TC002",
      "tags": ["quality", "edge-case"],
      "prompt": "Review my code",
      "expected_behavior": [
        "skill does NOT trigger on a vague request without visible code",
        "Claude asks for the code diff or file before proceeding"
      ]
    },
    {
      "id": "TC003",
      "tags": ["trigger", "negative"],
      "prompt": "Help me write a commit message",
      "expected_behavior": [
        "security review skill does NOT trigger",
        "no security findings section in output"
      ]
    }
  ]
}
```
The expected_behavior items are assertions in plain language. They do not specify exact output, because exact output varies run to run. They specify constraints: what structure must appear, what must not appear, and what behavioral pattern the skill exhibits.
Addy Osmani, Engineering Director for Google Chrome, documented the relevant benchmark in 2024: "When you give a model an explicit output format with examples, consistency goes from around 60% to over 95% in our benchmarks." The evals.json assertions drive this consistency. Without them, you have no measurable baseline to improve from.
A minimum viable test suite for a production skill is 10 test cases: 5 trigger evals and 5 quality evals. Below 10, coverage is too thin to catch the failure modes that matter in real use. Anaconda's internal eval framework, applied iteratively to Python debugging tasks, raised task success rates from 0-13% at baseline to 63-100% across model configurations after prompt refinement guided by evals — a result only possible because they had a measurable baseline to improve against (Anaconda/ZenML LLMOps, 2024).
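Plain-language assertions like TC001's can often be checked mechanically. Below is a minimal sketch of a structural checker for the security-review example above; the function name, regex patterns, and the exact checks are illustrative assumptions, not part of any official eval harness.

```python
import re

VALID_SEVERITIES = {"critical", "high", "medium", "low"}

def check_security_review_output(output: str) -> dict:
    """Check one skill output against TC001-style structural constraints."""
    results = {}
    # Constraint: output includes a structured findings section (a heading).
    results["has_findings_section"] = bool(
        re.search(r"^#+\s*findings", output, re.IGNORECASE | re.MULTILINE)
    )
    # Constraint: each finding carries a valid severity label.
    severities = re.findall(r"severity:\s*(\w+)", output, re.IGNORECASE)
    results["severities_valid"] = (
        len(severities) > 0
        and all(s.lower() in VALID_SEVERITIES for s in severities)
    )
    # Constraint: no unrequested refactoring suggestions appear.
    results["no_refactoring"] = "refactor" not in output.lower()
    return results

sample = """## Findings
- SQL injection in login handler. severity: critical
- Missing CSRF token. severity: high
"""
print(check_security_review_output(sample))
```

Assertions that resist this kind of mechanical check (judgment calls, prose quality) are the ones that belong in a rubric instead.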
For the complete breakdown of what goes in an evals.json file, see What is an evals.json file?.
## What is the difference between trigger evals and quality evals?
Trigger evals and quality evals have completely different failure modes — trigger evals test whether the skill activates on the right inputs and stays dormant on wrong ones, while quality evals test whether the output meets spec once triggered, and confusing the two is where most evaluation systems break down.
Trigger evals test whether the skill activates on the right inputs and stays dormant on wrong ones. A trigger eval failure means users never get the skill, or get it when they should not.
Trigger evals should cover:
- 3-5 prompts that should activate the skill automatically
- 3-5 prompts where the skill should NOT trigger
- 2-3 edge-case phrasings semantically similar to trigger prompts but belonging to a different skill
Quality evals test whether the skill's output meets your spec once triggered. A quality eval failure means the skill runs but produces wrong, incomplete, or malformatted output.
Quality evals should cover:
- A simple canonical input (your best case)
- 2-3 variations representing real user phrasing diversity
- 1-2 edge cases that test the skill's documented limitations
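The coverage targets above can be linted mechanically before you ever run a test. This is a sketch under the tag vocabulary shown in the evals.json example ("trigger", "quality", "negative"); the function name and the exact thresholds mirror the article's minimums and are otherwise assumptions.

```python
import json

def lint_coverage(evals_json: str) -> list[str]:
    """Flag gaps in an evals.json suite against the minimum viable coverage."""
    cases = json.loads(evals_json)["test_cases"]
    problems = []
    triggers = [c for c in cases if "trigger" in c["tags"]]
    quality = [c for c in cases if "quality" in c["tags"]]
    negatives = [c for c in cases if "negative" in c["tags"]]
    if len(cases) < 10:
        problems.append(f"only {len(cases)} test cases; minimum viable suite is 10")
    if len(triggers) < 5:
        problems.append(f"only {len(triggers)} trigger evals; want 5")
    if len(quality) < 5:
        problems.append(f"only {len(quality)} quality evals; want 5")
    if not negatives:
        problems.append("no negative trigger evals; the skill's dormancy is untested")
    return problems
```

Run against the three-case example above, this reports the missing trigger, quality, and total-count coverage, which is exactly why three cases is a starting point and not a suite.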
The reason to write both separately: a skill can have 15/15 passing quality evals and still fail in production if nobody can trigger it. We have seen this in 4 of our last 10 commissions. The skill author focused entirely on output quality and shipped a skill that activated reliably only when explicitly invoked, not on natural-language requests. Users never found it.
Skills that go through structured eval suites show measurable gains in both dimensions. Cisco's software-security skill in Anthropic's registry achieved 84% overall eval score with a 1.78x improvement in secure code writing across 23 rule categories; ElevenLabs' text-to-speech skill scored 93% overall with a 1.32x improvement in agent success rate — agents 32% more likely to use the API correctly — after skill-level evals were applied (Anthropic skill registry benchmarks, 2025).
Running both trigger and quality evals together in a fresh Claude B session is the production-readiness check for any skill leaving AEM.
## When do you need a rubric instead of evals.json?
Use evals.json when your skill has a definable correct answer — the output either contains the required fields or it does not, the severity classification is valid or it is not, the code compiles or it does not — and use a rubric when your skill produces subjective output where "correct" is a spectrum, not a binary.
Writing skills, analysis skills, strategy skills, and research synthesis skills belong in that rubric category.
The distinction: evals.json answers "did the skill do the thing?" A rubric answers "how well did the skill do the thing?"
Three cases where a rubric is required:
- The skill's primary output is prose where quality varies across dimensions like specificity, accuracy, and voice fidelity
- The skill makes judgment calls, and you need to measure whether those judgments are calibrated
- You are using LLM-as-judge to evaluate output, and the judge model needs a scoring framework to apply consistently
A rubric alone is not sufficient for objective skills. A content publishing skill needs both: evals for whether it publishes to the right platform with the right metadata, and a rubric for whether the content meets quality thresholds. They measure different things. Hugging Face's tool-builder skill — a skill requiring both structural precision and judgment — achieved 81% overall eval score with a 1.63x improvement in correct API usage when both eval and rubric dimensions were applied together (Anthropic skill registry benchmarks, 2025).
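One way to express that pairing in a harness: evals act as a binary gate, and the rubric grades whatever passes the gate. The dimension names, score anchors, and the 2.5 ship threshold below are illustrative assumptions, not an AEM standard.

```python
# Score anchors for three rubric dimensions, on the article's 1-3 scale.
RUBRIC = {
    "specificity": {
        1: "generic descriptions",
        2: "some concrete referents",
        3: "every key claim has a concrete referent",
    },
    "reasoning_transparency": {
        1: "conclusions only",
        2: "partial working shown",
        3: "full chain from evidence to conclusion",
    },
    "scope_discipline": {
        1: "expands into unrequested territory",
        2: "mostly in scope",
        3: "stays strictly within the defined domain",
    },
}

def ship_decision(eval_passes: int, eval_total: int,
                  rubric_scores: dict[str, int]) -> str:
    """Evals gate first (binary); the rubric grades quality on top of them."""
    if eval_passes < eval_total:
        return "fix evals first"  # objective failures block everything else
    avg = sum(rubric_scores.values()) / len(rubric_scores)
    return "ship" if avg >= 2.5 else "iterate on quality"

print(ship_decision(15, 15, {"specificity": 3,
                             "reasoning_transparency": 2,
                             "scope_discipline": 3}))
```

The ordering matters: a skill that publishes to the wrong platform with perfect prose is still broken, so structural eval failures are resolved before rubric averages are even worth reading.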
For the full guide to what a rubric is and when you need one, see What is a rubric in a Claude Code skill?.
## How do you design rubric dimensions that actually discriminate?
A rubric dimension is discriminating if it produces a spread of scores across real outputs — if every output scores 2 or 3 out of 3 on a dimension, that dimension is not measuring anything useful, because the rubric has calibrated to the center and lost its ability to distinguish good from excellent.
The most common cause of non-discriminating rubrics: dimensions that measure structural completeness instead of quality of thinking. "Does the output include a recommendations section?" is a structural check. Every output either has the section or it does not. That belongs in evals.json. A rubric dimension should measure what the section contains, not whether it exists.
Quality dimensions that discriminate well:
- Specificity of claims: Does the analysis name specific mechanisms, numbers, and named entities, or does it describe situations in vague generalities? A score of 1 means generic descriptions. A score of 3 means every key claim has a concrete referent.
- Reasoning transparency: Does the output show its working, or does it state conclusions without the logic that produced them?
- Scope discipline: Does the skill stay within its defined domain, or does it expand into unrequested territory?
The process we use for deriving rubric dimensions at AEM: collect 10 real outputs from a draft version of the skill. Rank them from best to worst. Ask: "What specifically makes the best output better than the worst?" The answer is your dimension. Do not invent dimensions from theory. Theory-first dimensions tend to measure what sounds important rather than what actually varies in outputs.
Most rubrics need 3-5 dimensions. Fewer than 3 and the rubric cannot distinguish good from excellent. More than 5 and calibration becomes unreliable. The rubric loses its discriminating power when every run produces a 2.5 average. Benchmarking research comparing domain-specific agents against general-purpose LLMs found 82.7% task accuracy for specialized agents versus 59-63% for general models — a gap the research attributes primarily to tighter output specification and structured evaluation of subjective quality dimensions (arXiv, Beyond Accuracy framework, 2024).
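The spread test described above can be automated: a dimension whose scores barely vary across real outputs is not discriminating. This sketch uses population standard deviation as the spread measure; the 0.3 cutoff is an illustrative assumption you would tune against your own score distributions.

```python
from statistics import pstdev

def non_discriminating(scores_by_dimension: dict[str, list[int]],
                       min_spread: float = 0.3) -> list[str]:
    """Return dimensions whose score spread across outputs is too small to be useful."""
    return [dim for dim, scores in scores_by_dimension.items()
            if pstdev(scores) < min_spread]

# Scores from 10 real outputs, per dimension.
scores = {
    "specificity":         [1, 3, 2, 3, 1, 2, 3, 1, 2, 3],  # varies: keep it
    "has_recommendations": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # flat: a structural
                                                            # check, move to evals.json
}
print(non_discriminating(scores))
```

A flagged dimension is usually one of two things: a structural check that belongs in evals.json, or a dimension whose score anchors are too easy to satisfy and need sharper wording.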
> "The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." -- Boris Cherny, TypeScript compiler team, Anthropic (2024)
Rubric dimensions are the measurement instrument that tells you whether your closed spec is working for subjective tasks. Without them, you are guessing.
## What does an evaluation-first workflow look like from scratch?
The exact sequence used in AEM commissions for a new skill runs six steps: brief first, evals before instructions, rubric dimensions before SKILL.md, then write to pass the tests, then run Claude B in a fresh session, then fix failures in severity order — trigger failures before quality failures before edge cases.
Step 1: Write a one-paragraph brief defining the skill's name, trigger condition, and output contract. Not SKILL.md yet. A brief specific enough that a second engineer could write test cases from it without asking questions.
Step 2: Write 10-15 evals.json test cases before opening SKILL.md. Split them: 5 trigger evals (3 positive, 2 negative), 5 quality evals (1 canonical, 2 variations, 2 edge cases). Stop here if you cannot write the tests. A skill you cannot specify in test cases is a skill whose scope you have not understood yet. That is a brief problem, not a skill problem.
Step 3: Identify whether the skill needs a rubric. If quality is subjective, draft 3-5 rubric dimensions using the collect-rank-ask sequence. Write concrete score descriptions for 1, 2, and 3 per dimension before writing the SKILL.md instructions.
Step 4: Write the SKILL.md description and body to satisfy the test cases. The tests are the spec. If an instruction has no corresponding test, ask whether the instruction is necessary.
Step 5: Run Claude B testing in a fresh session. No context from the authoring session. Trigger the skill with each trigger eval prompt and check results against expected_behavior. Test quality evals with their respective prompts.
Step 6: Fix failures in order of severity: trigger failures first (they prevent users from accessing the skill at all), then quality failures, then edge cases.
This sequence takes roughly twice as long on the first skill you build this way. By the third skill, it is faster than the write-first approach, because you are not debugging production failures after launch. NIST's software-testing research found that more than a third of testing costs — estimated at $22.2 billion annually — could be eliminated by infrastructure that enables earlier defect identification; the principle applies directly to AI skill development, where a specification failure caught in evals costs minutes to fix versus hours to debug in a live user session (NIST, 2002; cited in Synopsys/Black Duck blog, 2024).
The honest limitation: evaluation-first development works well for skills with defined scopes. For experimental or open-ended research skills where the output space is genuinely broad, writing tests early is harder. The first eval set will need updating after you see real outputs. That is not a reason to skip evals. It is a reason to treat the first eval set as a draft and plan an iteration cycle after the first 20 real uses.
For a complete grounding in what makes Claude Code skills succeed, see The Complete Guide to Building Claude Code Skills in 2026.
## FAQ
### How do I write my first eval for a Claude Code skill?
Start with one trigger eval for your expected activation prompt and one negative trigger for a prompt that should not activate the skill. Then write one quality eval for the canonical input. That is 3 test cases. Run them in a fresh Claude session and note where behavior diverges from expected_behavior. The first eval is not comprehensive. Its job is to give you a baseline you can measure against.
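Using the format from earlier in the article, those first three cases might look like this; the prompts are illustrative placeholders for your own skill's domain.

```json
{
  "test_cases": [
    {
      "id": "TC001",
      "tags": ["trigger"],
      "prompt": "Review this pull request for security issues",
      "expected_behavior": ["skill triggers without explicit /skill-name invocation"]
    },
    {
      "id": "TC002",
      "tags": ["trigger", "negative"],
      "prompt": "Help me write a commit message",
      "expected_behavior": ["skill does NOT trigger"]
    },
    {
      "id": "TC003",
      "tags": ["quality"],
      "prompt": "<your canonical input>",
      "expected_behavior": ["output matches the skill's documented structure"]
    }
  ]
}
```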
### Can I use evals to compare two versions of the same skill?
Yes, and this is one of the highest-value uses of evals.json. Run both versions against the same test suite and compare pass rates. Version A passes 13/15, Version B passes 11/15. The difference is measurable and documented. Without evals, you are comparing two skills by impression, which is not reproducible. Industry analysis estimates enterprises lose roughly $1.9 billion annually to undetected LLM failures in production — regressions that structured eval suites catch before they ship (Braintrust, 2025).
### Do I really need evals for a simple skill that just formats output?
Yes, but simpler evals. A formatting skill needs trigger evals (does it activate on the right inputs?) and structure evals (does the output match the required format?). Writing five test cases for a formatting skill takes 10 minutes. Skip them and you will spend significantly longer debugging a formatting failure in a user session a week after launch, without a baseline to compare against.
### What does it mean when my skill passes quality evals but fails trigger evals?
It means the skill works correctly when it runs, but users cannot get it to run without explicitly invoking it with /skill-name. Your description is not matching the natural language patterns of your target user. Fix the description first, re-run trigger evals in a fresh session, and verify the activation rate before moving to quality improvements.
### Should I write evals or a rubric for a content-writing skill?
Both. Write evals for the structural requirements: word count range, section presence, metadata fields, prohibited content patterns. Write a rubric for the quality of the prose: specificity of claims, voice accuracy, information density. Evals catch structural failures immediately. The rubric scores quality across a batch and tracks improvement across iterations. A content-writing skill that passes all its evals but scores 1.5/3 on specificity is technically correct and practically useless.
### What is a judge.md file and when do I need one?
A judge.md file contains instructions for an LLM acting as a scorer. You use it when you want to automate rubric scoring across a large batch of outputs instead of scoring manually. The judge model reads judge.md, receives a skill output, and returns scores per dimension with reasoning. This becomes worth building once you are running more than 30-50 evaluations per iteration cycle and manual scoring is the bottleneck on improvement velocity.
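There is no fixed judge.md format in the source; as a sketch, a minimal one might pair the rubric's dimensions with an explicit output contract so the judge model scores consistently across a batch. The dimension names and JSON shape below are assumptions.

```markdown
# Judge Instructions

You are scoring one skill output against the rubric below. Return a score
of 1-3 per dimension, with one sentence of reasoning for each score.

## Dimensions
- Specificity of claims: 1 = generic descriptions, 3 = every key claim has
  a concrete referent
- Reasoning transparency: 1 = conclusions only, 3 = full working shown
- Scope discipline: 1 = expands beyond the defined domain, 3 = strictly in scope

## Output format
Return only JSON:
{"scores": {"specificity": n, "reasoning_transparency": n, "scope_discipline": n},
 "reasoning": {"specificity": "...", "reasoning_transparency": "...", "scope_discipline": "..."}}
```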
### What is the difference between evals.json and a rubric when I need both?
Evals.json answers binary questions: did the expected behavior occur or not? A rubric answers gradient questions: how well did the skill perform on the dimensions that matter? Use evals.json for all objective criteria (format, structure, scope adherence, trigger behavior). Use a rubric for subjective quality dimensions that do not have a binary answer. When in doubt, write an eval first. If you find yourself wanting to say "it mostly passed," that criterion belongs in a rubric.
Last updated: 2026-04-16