---
title: "What Are Evals in Claude Code Skills?"
description: "Claude Code evals define correct skill behavior before live use. Learn what evals.json contains, what to test, and what failures skipping evals exposes."
pubDate: "2026-04-16"
category: skills
tags: ["claude-code-skills", "evaluation", "evals", "testing", "beginner"]
cluster: 16
cluster_name: "Evaluation System"
difficulty: beginner
source_question: "What are evals in Claude Code skills?"
source_ref: "16.Beginner.1"
word_count: 1490
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---
What Are Evals in Claude Code Skills?
TL;DR: Evals are a set of test cases that define what correct behavior looks like for your skill, before you test it in a live session. Each eval contains a prompt and an expected_behavior array. You run them against your skill in a fresh session to verify it behaves as specified, not just as you assume it does.
A skill that feels correct in your authoring session and a skill that is correct in production are not the same thing. Most skill developers discover this the expensive way.
What are evals in Claude Code skills?
Evals are structured test cases for a Claude Code skill: each eval specifies a realistic user prompt and a list of behavioral assertions the skill must satisfy, and together they form a measurable definition of correct behavior that lives in a file called evals.json inside your skill folder.
Evals serve two purposes. First, they give you a measurable definition of "correct." Without evals, "correct" means "it worked when I tried it," which is not a standard you can reproduce or share. Second, they catch failures that manual testing systematically misses, specifically the failures that only appear in fresh sessions without your authoring context. Research on defect removal efficiency shows that single-method testing alone catches only 25-35% of defects; combined approaches that run tests across varied contexts achieve above 97% (Capers Jones, Applied Software Measurement; TestDino Bug Cost Report, 2024).
In Claude Code skill engineering, evals are not optional for production skills. At AEM, we require evals before any skill ships. They are the difference between a skill that works and a skill that works for you.
What does an eval test case look like?
A test case in evals.json is a self-contained test specification: it has three required fields (id, tags, prompt) plus an expected_behavior array of plain-language assertions that define what any correct output must satisfy, independent of how Claude phrases its response in a given run. The required fields are:
- id — unique identifier for the test case
- tags — one or more labels (e.g. trigger, quality, edge) used to group and filter cases
- prompt — the realistic user input the test case exercises
The expected_behavior array carries the actual assertions; this structure separates the specification of correct behavior from any particular output wording. Here is a concrete example from a code review skill:
```json
{
  "test_cases": [
    {
      "id": "TC001",
      "tags": ["trigger", "quality"],
      "prompt": "Review this diff for security vulnerabilities",
      "expected_behavior": [
        "skill triggers without explicit /code-review invocation",
        "output includes a numbered list of findings",
        "each finding includes a severity level: critical, high, medium, or low",
        "output does NOT include unrequested style or refactoring suggestions"
      ]
    },
    {
      "id": "TC002",
      "tags": ["trigger", "negative"],
      "prompt": "Write a commit message for these changes",
      "expected_behavior": [
        "code-review skill does NOT trigger on this prompt",
        "no security findings section appears in output"
      ]
    }
  ]
}
```
The expected_behavior items are plain-language assertions, not exact strings. They define what constraints any correct output must satisfy. This matters because Claude's output varies across runs. The assertions test the structure and behavior of the output, not its exact wording.
One test case covers one scenario. A production skill needs 10-20 test cases to have meaningful coverage: positive triggers, negative triggers, canonical quality inputs, and at least 2 edge cases. Anthropic's own eval guidance recommends prioritizing volume: "more questions with slightly lower signal automated grading is better than fewer questions with high-quality human hand-graded evals" (Anthropic, Define success criteria and build evaluations, 2024). For teams starting their first eval suite, Anthropic's engineering team recommends "20-50 simple tasks drawn from real failures" as an initial baseline before scaling coverage (Anthropic, Demystifying evals for AI agents, 2025).
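Coverage counts like these are easy to check mechanically. Here is a minimal sketch of a suite checker in Python, using the two-case example above as sample data; the required-field list comes from this article, and the tag names follow its conventions, not any built-in tool:

```python
import json

# Structural check for an evals.json suite: every case must carry the
# required fields (id, tags, prompt) plus expected_behavior, and the
# per-tag counts can then be compared against coverage guidelines.
SAMPLE = """{
  "test_cases": [
    {"id": "TC001", "tags": ["trigger", "quality"],
     "prompt": "Review this diff for security vulnerabilities",
     "expected_behavior": ["skill triggers without explicit invocation"]},
    {"id": "TC002", "tags": ["trigger", "negative"],
     "prompt": "Write a commit message for these changes",
     "expected_behavior": ["code-review skill does NOT trigger"]}
  ]
}"""

REQUIRED = ("id", "tags", "prompt", "expected_behavior")

def check_suite(raw: str) -> dict:
    """Verify required fields on every case, then return per-tag counts."""
    cases = json.loads(raw)["test_cases"]
    for case in cases:
        missing = [f for f in REQUIRED if f not in case]
        if missing:
            raise ValueError(f"{case.get('id', '?')} missing fields: {missing}")
    counts = {}
    for case in cases:
        for tag in case["tags"]:
            counts[tag] = counts.get(tag, 0) + 1
    return counts

counts = check_suite(SAMPLE)
print(counts)  # {'trigger': 2, 'quality': 1, 'negative': 1}
```

Comparing the printed counts against the 3-5 positive and 3-5 negative trigger guideline from the next section tells you at a glance where coverage is thin.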
"When you give a model an explicit output format with examples, consistency goes from around 60% to over 95% in our benchmarks." -- Addy Osmani, Engineering Director, Google Chrome (2024)
Evals are the specification that makes this consistency measurable. Without them, you have no baseline.
How is an eval different from just running your skill manually?
Manual testing catches the failures that happen to occur in your current session, but it systematically misses the failures that happen because of your current session — the ones caused by context you carry as the author that a fresh user session will never have.
When you build a skill, your authoring session contains context that a fresh user session does not:
- the conversation history
- implicit assumptions from prior exchanges
- your own understanding of what the skill is meant to do
Claude in your session uses all of that context. Claude in a fresh user session has none of it.
This is the Claude A vs Claude B distinction. Claude A (your authoring session) consistently passes manual tests. Claude B (a fresh user session) regularly fails on inputs Claude A handled. We documented this failure pattern in 6 of 10 consecutive commissions at AEM where the author relied on manual testing without evals. In each case, the failure was invisible until a different user triggered the skill.
Evals fix this by requiring you to run tests in a fresh session, with no authoring context, against a pre-specified definition of correct behavior. They make the gap visible before it reaches production.
For the complete overview of what eval-first development looks like in practice, see Evaluation-First Skill Development: Write Tests Before Instructions.
What should I test with evals?
Test two fundamentally different things — trigger behavior and output quality — and keep them in separate test cases, because they capture different failure modes: a skill can trigger correctly but produce poor output, or produce perfect output but only activate on a narrow slice of the prompts your users will actually write.
Trigger behavior: Does the skill activate on the right prompts? Does it stay dormant when it should? Trigger evals need 3-5 positive cases (prompts that should activate the skill) and 3-5 negative cases (prompts that should not). A skill with perfect output behavior that triggers only 50% of the time has failed in production. Research on LLM-based agent systems finds that even well-configured agents succeed on roughly 50% of tasks without structured evaluation to identify and close trigger gaps (Getmaxim.ai, Diagnosing and Measuring AI Agent Failures, 2024). A 2025 reliability study across 14 agentic models found that "outcome consistency remains low across all models" — agents regularly fail tasks they are capable of completing when tested across multiple independent runs rather than single-session checks (Rabanser et al., Towards a Science of AI Agent Reliability, arXiv 2602.16666, 2025).
Output quality: Once triggered, does the skill produce output that meets your spec? Quality evals need:
- 1 canonical case — your best expected input, representing the most common real use
- 2-3 variation cases — inputs that reflect real user diversity in phrasing or context
- 1-2 edge cases — inputs that test the skill's documented limits
Do not write both types in the same test case. A trigger eval assertion and a quality eval assertion measure different failure modes. Mixing them makes it harder to diagnose which kind of failure occurred.
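Keeping the two types in separate cases also lets you run them as separate passes. A sketch under that assumption, with hypothetical case IDs and the tag conventions from the example evals.json earlier:

```python
# Hypothetical suite: trigger cases and quality cases kept separate,
# per the guidance above, so each pass measures one failure mode.
test_cases = [
    {"id": "T001", "tags": ["trigger"]},              # positive trigger
    {"id": "T002", "tags": ["trigger", "negative"]},  # negative trigger
    {"id": "Q001", "tags": ["quality", "canonical"]}, # canonical input
    {"id": "Q002", "tags": ["quality", "edge"]},      # documented limit
]

def by_tag(cases, tag):
    """Select the case IDs for one pass, filtered by a single tag."""
    return [c["id"] for c in cases if tag in c["tags"]]

print(by_tag(test_cases, "trigger"))  # ['T001', 'T002']
print(by_tag(test_cases, "quality"))  # ['Q001', 'Q002']
```

A failed trigger pass points you at the description; a failed quality pass points you at the instructions body.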
The pattern we use at AEM: write trigger evals before writing the description, write quality evals before writing the instructions body. The first set shapes the description. The second set shapes the steps. The specification precedes the implementation.
For a detailed breakdown of evals.json structure and field definitions, see What is an evals.json file?.
What happens if I skip evals?
A skill without evals has one test environment (your authoring session), one tester (you), and one success criterion (it worked when you tried it) — which means every failure mode that only surfaces in fresh sessions, on natural phrasing variants, or without your implicit context goes undetected until a real user finds it.
That is not a production bar; it is a personal one.
The structural reason is non-determinism: LLMs "process each interaction independently, lacking native mechanisms to maintain state across sequential interactions," which means a skill that passes in your session is not guaranteed to pass in anyone else's (Survey on Evaluation of LLM-based Agents, arXiv 2503.16416, 2025). Single-run manual testing cannot detect variance — only repeated testing across fresh sessions can.
In our commissions, skills shipped without evals have a measurably higher rate of user-reported failures in the first two weeks of use. The failures are not random. They cluster around three patterns:
- Trigger misfires — skill activates when it should not
- Trigger gaps — skill does not activate on natural phrasing
- Context dependency — skill works only when the user provides implicit setup Claude B cannot infer
All three are detectable with evals before launch. Skip evals and you find them after. The DORA 2024 report found that elite-performing engineering teams achieve an 8x lower change failure rate than low-performing teams — the differentiating factor being systematic pre-release verification rather than post-release discovery (Google DORA, Accelerate State of DevOps, 2024).
For a full list of the failure patterns that evals prevent, see What Are the Most Common Mistakes When Building Claude Code Skills?.
FAQ
How many evals do I need before my skill is ready to use?
Ten is the minimum for a production skill: 5 trigger evals and 5 quality evals, covering the basic failure modes — activation, suppression, canonical quality, phrasing variations, and edge cases — without creating a maintenance burden that outweighs the coverage you gain. For a simple formatting skill, 10 is sufficient. For a multi-step research or analysis skill, 15-20 is more appropriate. Below 10, you are running a personal bar check, not a production bar check.
Can I write evals after I have already written the SKILL.md?
Yes, but you lose the main benefit: writing evals after instructions produces tests that confirm what you built rather than tests that specify what you needed, because an existing implementation biases you toward assertions it already passes instead of assertions that expose its gaps. Post-implementation evals still catch future regressions and are better than no evals.
Do I run evals manually or is there a tool?
Currently, you run evals manually in a fresh Claude session with no authoring carry-over, which means each test prompt is evaluated in exactly the same clean state a real user would encounter. There is no automated eval runner built into Claude Code as of 2026. The discipline is in treating this as a required step, not an optional one.
1. Open a new Claude Code session with no prior context from your development work.
2. Paste each test case prompt exactly as written in evals.json.
3. Check the output against each expected_behavior assertion. Mark pass or fail.
What file does my evals.json go in?
Place evals.json inside your skill folder alongside SKILL.md, using the path .claude/skills/your-skill-name/evals.json for both user-level and project-level skills, so the test suite travels with the skill whenever you copy, share, or version-control it. The evals file does not load into Claude's context at startup. It is a developer tool, not a runtime file.
Should I include evals for edge cases I have not seen yet?
Yes, and prioritize the failure modes most likely to occur before you have seen them in production: unusual phrasing, minimal or ambiguous input, prompts semantically adjacent to a different skill's trigger, and inputs that test the skill's documented limitations. Edge case evals written from imagination are less valuable than edge cases discovered from real failures, but they catch a meaningful class of failures that canonical tests miss entirely.
Last updated: 2026-04-16