---
title: "What Is Evaluation-First Development for Claude Code Skills?"
description: "Write Claude Code skill test cases before instructions. Evaluation-first development explained: what it is, why it works, and a concrete 5-step workflow."
pubDate: "2026-04-16"
category: skills
tags: ["claude-code-skills", "evaluation", "evaluation-first", "testing", "beginner"]
cluster: 16
cluster_name: "Evaluation System"
difficulty: beginner
source_question: "What is evaluation-first development?"
source_ref: "16.Beginner.5"
word_count: 1380
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---
What Is Evaluation-First Development for Claude Code Skills?
TL;DR: Evaluation-first development is a Claude Code skill-building approach where you write test cases in evals.json before writing SKILL.md instructions. You define what correct behavior looks like first. Then you build the skill to pass those tests. The result is a tighter spec and fewer production failures. AEM uses this approach across its production skill library.
What is evaluation-first development?
Evaluation-first development is the practice of writing your success criteria before writing any implementation — for Claude Code skills, this means drafting evals.json with 10–15 test cases and specifying expected behaviors for each before you write a single line of SKILL.md. The test cases become your spec, so when the skill ships it satisfies requirements you defined before any implementation bias crept in.
The concept comes from test-driven development (TDD) in software engineering, adapted for AI skill design. Traditional TDD tests deterministic code: the function either returns the right value or it does not. Evaluation-first skill development tests probabilistic agent behavior: does the skill trigger when it should, stay dormant when it should not, and produce output that meets your stated spec? A 2008 study by Nagappan et al. (Microsoft Research / IBM) found that four industrial teams using TDD reduced pre-release defect density by 40–90% compared to similar projects that did not, at a development time cost of 15–35% (Nagappan et al., Empirical Software Engineering, 2008).
You don't need to love writing tests. You need to love shipping skills that work.
"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." -- Boris Cherny, Anthropic (2024)
Writing evals first is how you force the closed spec into existence. Without evals, your spec lives in your head. That is not a format Claude can be tested against.
Why does writing tests before instructions matter?
Writing instructions before tests creates a specific failure: the instructions satisfy what the author imagines is correct, not what a user actually needs. The gap between those two things, where trigger failures hide, stays invisible until you write test cases that force you to cover both sides of the trigger boundary.
When you write tests first, the test-writing process reveals gaps in your understanding of the skill's scope. Here is what typically happens:
You start writing a test case: "User asks to review a pull request. Expected behavior: skill triggers, produces a findings list."
Then you write the second test case and realize you have not defined what a finding looks like. Is it a sentence? A structured object with a field for severity? Does the skill ask for the diff first, or wait for the user to paste it?
These questions are specification questions. Writing instructions first lets you skip them, because you can always write instructions that match your own assumptions. Writing tests first forces you to answer them, because a test with the assertion "each finding has severity: critical, high, medium, or low" makes a specific claim you can verify.
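As a sketch, a single eval case carrying that kind of verifiable assertion might look like the following. The expected_behavior field is the one named in this article; the other field names (id, type, prompt, should_trigger) are illustrative assumptions, not a fixed schema:

```json
{
  "id": "review-pr-canonical",
  "type": "quality",
  "prompt": "Can you review this pull request? Here's the diff: ...",
  "should_trigger": true,
  "expected_behavior": [
    "Skill activates without being named explicitly",
    "Output is a findings list, not free-form prose",
    "Each finding has severity: critical, high, medium, or low"
  ]
}
```

Each string in expected_behavior is a plain-language assertion you can check by reading the output, which is exactly what makes the test a specification rather than an aspiration.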
In our builds at AEM, the first version of a skill written before any evals almost always fails the trigger test in a fresh session. Not because the instructions are bad, but because the description was written for the author's use case, not for the range of natural language inputs real users send. Evals surface this before the skill ships. That pattern is consistent with broader industry data: Gartner reported in 2024 that at least 50% of generative AI projects were abandoned after proof of concept, with poor specification and data quality cited as leading causes (Gartner, Generative AI Project Failure Analysis, 2024). A 2024 S&P Global survey of 1,006 enterprise IT professionals found that the share of organizations abandoning the majority of their AI initiatives before reaching production rose from 17% to 42% year over year, with data quality and poor specification cited as the joint leading cause (S&P Global / 451 Research, Voice of the Enterprise: AI & Machine Learning, Infrastructure, 2024). The failure most likely to surface post-launch — rather than in development — is a trigger precision failure, where the skill either activates too broadly or not at all.
How does evaluation-first development work for Claude Code skills?
The workflow has five steps, and the order is fixed: a one-paragraph brief first, then trigger evals, then quality evals in evals.json (10–15 cases in total), then SKILL.md written against those cases, then a run in a fresh session. Reversing any step, particularly writing instructions before test cases, recreates the spec-after-the-fact problem the approach is designed to eliminate. Execute the steps in order.
Write a one-paragraph brief. Define the skill's name, what it does, when it should activate, and what it produces. Keep this under 100 words. If you cannot write it in 100 words, the scope is not defined yet.
Write 5 trigger evals. At least 3 should be positive (prompts that should activate the skill) and 2 should be negative (prompts that must not activate it). Write prompts as real users would phrase them, not as you would phrase them as the skill author.
Write 5 quality evals. One canonical input (your clearest expected use case), two variations that reflect different user phrasings, and two edge cases that test the skill's defined limits.
Write SKILL.md. Use the test cases as your spec. Every instruction in SKILL.md should trace back to a test case assertion. If an instruction has no corresponding test, ask whether it is necessary.
Run evals in a fresh session. Open a new Claude session with no prior context. Send each prompt. Check output against expected_behavior. Fix failures. Re-test.
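Step five is a manual pass, but the structural checks that precede it can be automated. The sketch below assumes the illustrative field names used in this article (type, prompt, expected_behavior, should_trigger); adapt it to whatever schema your evals.json actually uses:

```python
# Minimal pre-flight check for an evals.json file, assuming each case is a
# dict with "prompt", "type" ("trigger" or "quality"), "expected_behavior"
# (a list of plain-language assertions), and, for trigger cases,
# "should_trigger". These field names are assumptions, not a fixed schema.

REQUIRED_FIELDS = {"prompt", "type", "expected_behavior"}


def validate_evals(cases):
    """Return a list of human-readable problems; empty means ready to run."""
    problems = []
    for i, case in enumerate(cases):
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            problems.append(f"case {i}: missing fields {sorted(missing)}")
    trigger = [c for c in cases if c.get("type") == "trigger"]
    quality = [c for c in cases if c.get("type") == "quality"]
    if len(trigger) < 5:
        problems.append(f"only {len(trigger)} trigger cases; the workflow calls for 5")
    elif not any(c.get("should_trigger") is False for c in trigger):
        problems.append("no negative trigger cases (prompts that must NOT activate)")
    if len(quality) < 5:
        problems.append(f"only {len(quality)} quality cases; the workflow calls for 5")
    return problems
```

Run it over `json.load(open("evals.json"))` before opening the fresh session; an empty result means the file is structurally ready for step five, though checking each output against expected_behavior remains a manual judgment.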
The entire process takes longer on the first skill. It takes about the same time on the second. By the third, it is faster than the write-first approach, because it eliminates the post-launch debugging cycle. The TDD literature puts the upfront time cost at 15–35% more than ad-hoc development, but teams in the IBM/Microsoft study agreed that this overhead was offset by reduced maintenance — a pattern that holds for AI skill development, where post-launch trigger debugging is the dominant time cost (Nagappan et al., Empirical Software Engineering, 2008). Anthropic's own skill evaluation tooling flags trigger failure rates above 2–3% as requiring investigation before a skill is relied upon in production (Anthropic, Claude Code Skills documentation, 2026).
What is the difference between evaluation-first development and just testing your skill?
Evaluation-first development specifies correctness before implementation, so the test suite exists to discover what you do not know about the skill's scope. Testing a skill you have already written only verifies that the implementation matches your current assumptions, and those assumptions may never have been challenged. That unchallenged gap is exactly what the eval-first order is designed to prevent.
The difference is not philosophical. It is practical. A post-implementation test confirms what the skill already does. A pre-implementation test defines what the skill should do. These are different questions.
A skill tested only after writing tends to produce tests that pass because they were designed around the existing instructions. The coverage looks complete. The failure modes your instructions did not anticipate are still uncovered. A ZenML analysis of 1,200 production LLM deployments (2025) found that pushing past 95% quality reliability required the majority of development time — the bulk of that effort going to edge cases and failure modes that were not visible until testing exposed them.
Pre-implementation evals cannot be designed around existing instructions, because no instructions exist yet. The test-writing process forces you to confront what you do not know about the skill's scope before you have written something to defend.
For the full workflow and how this plays out in practice, see Evaluation-First Skill Development: Write Tests Before Instructions.
FAQ
Do I need to know TDD or software testing to use evaluation-first development?
No. The evals.json format uses plain-language assertions, not code. You are not writing unit tests in a testing framework. You are writing a list of human-readable behavioral requirements and checking them manually. No testing background required.
Does evaluation-first development work for all skill types?
It works for any skill with a definable scope. Skills that format output, review code, generate content, and manage workflows all have definable trigger conditions and output structures you can assert. For skills whose output is highly variable and subjective, the evals will be simpler but still useful. The trigger evals alone catch a large class of failures regardless of output type. The approach is not well-suited to rapid exploratory prototyping where the skill's scope is genuinely unknown: if you cannot write a one-paragraph brief describing what the skill does, you do not yet have enough scope definition to write meaningful test cases, and time spent writing evals at that stage is wasted.
What if I realize my evals are wrong after I have written the skill?
Update them. evals.json is a living document. If you discover during skill development that a test case was poorly specified, that is a specification discovery. Fix the test case, then verify the skill satisfies the updated spec. Updating evals is not failure. Shipping a skill whose test cases were never accurate is.
How long does it take to write evals.json for a typical skill?
For a skill with a clear scope, writing 10–15 test cases takes 20–30 minutes. Most of that time is thinking about trigger edge cases and output constraints, not writing JSON syntax. If it takes longer than 45 minutes, the scope is probably too broad for a single skill.
Can evaluation-first development be applied to skills that already exist?
Yes. For an existing skill, write the evals.json file that describes the correct behavior you want, then run the skill against those tests. The results tell you where the existing skill diverges from the spec you actually need. This is also how you document the skill's intended behavior for future maintainers.
Where do I store evals.json?
In the skill folder, alongside SKILL.md. For details on the file format and field definitions, see What Is an evals.json File?. For what belongs in the expected_behavior array and how to write useful assertions, see What Are Evals in Claude Code Skills?.
Last updated: 2026-04-16