---
title: "What Is an evals.json File in Claude Code Skills?"
description: "Learn what an evals.json file is in Claude Code skills: its schema, what goes in expected_behavior, and why it is the production standard for any skill."
pubDate: "2026-04-16"
category: skills
tags: ["claude-code-skills", "evaluation", "evals-json", "testing", "beginner"]
cluster: 16
cluster_name: "Evaluation System"
difficulty: beginner
source_question: "What is an evals.json file?"
source_ref: "16.Beginner.3"
word_count: 1420
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---

What Is an evals.json File in Claude Code Skills?

TL;DR: An evals.json file is a structured test suite for a Claude Code skill. It lives in the skill folder alongside SKILL.md and contains test cases with prompts and behavioral assertions. Running your skill against these test cases in a fresh session tells you whether the skill works as specified, not just as assumed.

An evals.json file is to a skill what a spec sheet is to a manufacturing run. Without one, you are eyeballing it.


What is an evals.json file?

An evals.json file is the specification document for a Claude Code skill's correct behavior: a structured list of test cases, each pairing a realistic user prompt with an array of plain-language behavioral assertions you can check in a fresh Claude session. It is the difference between a skill you have verified and a skill you have assumed works.

Every production Claude Code skill at AEM ships with an evals.json file. A skill without one has no defined standard for correctness: not a production bar, just an authoring assumption masquerading as one. Research on specification quality across software projects consistently finds that unclear or ambiguous requirements account for roughly 50% of all defects that reach downstream stages (James Martin, widely cited in software engineering literature). An evals.json file is the mechanism that makes a skill's requirements unambiguous enough to test.

The file does not load into Claude's context at runtime. It is a developer tool. Claude never sees it during normal use. Its job is to give you a repeatable definition of "correct" that survives the author's session.


What does the evals.json schema look like?

The file contains a single top-level object with a test_cases array, and that array is the entire schema: no other top-level keys, no versioning field, no metadata wrapper, because a format you have to look up is a format you will not use consistently. Each test case has four fields that together define one testable unit of skill behavior:

{
  "test_cases": [
    {
      "id": "TC001",
      "tags": ["trigger", "positive"],
      "prompt": "Review this Python function for security issues",
      "expected_behavior": [
        "skill triggers without explicit /skill-name invocation",
        "output includes a findings section with at least one item",
        "each finding specifies a severity level: critical, high, medium, or low",
        "output does NOT include unrequested refactoring suggestions"
      ]
    },
    {
      "id": "TC002",
      "tags": ["trigger", "negative"],
      "prompt": "Write a docstring for this function",
      "expected_behavior": [
        "security-review skill does NOT trigger",
        "no findings or severity classifications in output"
      ]
    },
    {
      "id": "TC003",
      "tags": ["quality", "edge-case"],
      "prompt": "Can you take a look at my code",
      "expected_behavior": [
        "Claude asks for the code before proceeding",
        "skill does not attempt a review on an empty context"
      ]
    }
  ]
}

Field definitions:

  • id: A unique identifier for the test case. Use sequential numbering: TC001, TC002.
  • tags: An array classifying the test. Valid values: trigger, quality, positive, negative, edge-case. A single test case can have multiple tags.
  • prompt: The exact user message to send in the test session. Write it the way a real user would phrase it, not the way you would phrase it as the skill author.
  • expected_behavior: An array of plain-language assertions. Each assertion describes a constraint the output must satisfy.
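The four-field contract above is simple enough to check mechanically before a test session. A minimal validator sketch using only Python's standard library; the field names and valid tag values are the ones defined above, while the `validate_evals` function name is my own:

```python
import json

VALID_TAGS = {"trigger", "quality", "positive", "negative", "edge-case"}
REQUIRED_FIELDS = {"id", "tags", "prompt", "expected_behavior"}

def validate_evals(raw: str) -> list[str]:
    """Return a list of problems found in an evals.json document (empty = valid)."""
    problems = []
    doc = json.loads(raw)
    cases = doc.get("test_cases")
    if not isinstance(cases, list) or not cases:
        return ["top-level object must contain a non-empty test_cases array"]
    seen_ids = set()
    for i, case in enumerate(cases):
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            problems.append(f"test case {i}: missing fields {sorted(missing)}")
            continue
        if case["id"] in seen_ids:
            problems.append(f"{case['id']}: duplicate id")
        seen_ids.add(case["id"])
        bad_tags = set(case["tags"]) - VALID_TAGS
        if bad_tags:
            problems.append(f"{case['id']}: unknown tags {sorted(bad_tags)}")
        if not case["expected_behavior"]:
            problems.append(f"{case['id']}: expected_behavior is empty")
    return problems
```

Running a check like this before each eval session means a schema mistake surfaces at your desk, not halfway through a 15-test-case run.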

"The failure mode isn't that the model is bad at the task, it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." -- Simon Willison, creator of Datasette and llm CLI (2024)

The evals.json file is where you make the specification tight enough to test against. On SWE-bench Verified — the standard benchmark for autonomous coding agents — model scores rose from 40% to over 80% in a single year, a gain that correlates directly with tighter eval design and faster iteration cycles (Anthropic Engineering, "Demystifying Evals for AI Agents," 2025). The same principle applies at the skill level: the feedback loop only tightens when the target is defined.


What should I put in the expected_behavior array?

Each item in expected_behavior is a plain-language assertion about what the output must or must not do: write at the behavioral level, not the exact-string level, so the assertion holds across valid variations in phrasing and still fails when the skill does the wrong thing.

What works:

  • "output includes a numbered list of findings"
  • "skill does NOT trigger on this prompt"
  • "Claude asks for clarification before proceeding"
  • "output stays under 500 words"
  • "each item includes a severity: critical, high, medium, or low"

What does not work:

  • "output contains the phrase 'Security Review Complete'" (brittle exact-string match)
  • "output is good quality" (untestable)
  • "Claude does the right thing" (not a spec)

Write 3-5 assertions per test case. Fewer than 3 and the test does not constrain behavior meaningfully. More than 5 and the test becomes hard to evaluate in a single pass. The Anthropic engineering team found that fixing evaluation bugs on CORE-Bench took Claude Opus 4.5's reported score from 42% to 95% — a 53-point swing caused entirely by how the test criteria were written, not by any change to the model (Anthropic Engineering, "Demystifying Evals for AI Agents," 2025). What you assert determines what you measure.

Negative assertions (the items that use "does NOT") are as important as positive ones. They define the skill's scope boundaries. A code review skill that also rewrites your imports is out of scope. Without a negative assertion, that failure is invisible. Scope creep is not a hypothetical: PMI's Pulse of the Profession survey found that 52% of projects completed in the year studied experienced uncontrolled scope changes (Project Management Institute, 2018). At the skill level, negative assertions are the only mechanism that makes scope boundaries testable rather than assumed.

In our experience building AEM production skill libraries, the most common evals.json mistake is writing only positive assertions; negative assertions are the failure category most consistently absent from first-draft test suites. Every skill needs at least 3 negative assertions distributed across its trigger and quality test cases. These catch scope creep, trigger misfires, and output inflation. 60% of organizations using test automation report significant improvements in application quality (Gartner Peer Insights, 2024).
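Whether a suite meets that 3-negative-assertion floor is itself checkable mechanically. A sketch that assumes the convention above of writing "NOT" in capitals in negative assertions; the `count_negative_assertions` helper name is my own:

```python
import json

def count_negative_assertions(raw: str) -> int:
    """Count assertions across all test cases that use the capitalized 'NOT' pattern."""
    doc = json.loads(raw)
    return sum(
        1
        for case in doc.get("test_cases", [])
        for assertion in case.get("expected_behavior", [])
        if "NOT" in assertion
    )
```

If this returns less than 3 for a skill, the suite is not yet testing the skill's scope boundaries.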


Where does evals.json go in my skill folder?

Place evals.json directly in your skill folder at the same level as SKILL.md, where both files travel together whenever you copy, version, or share the skill, and where the test suite stays discoverable without hunting through subdirectories every time you need to add a case or run a check:

.claude/skills/your-skill-name/
  SKILL.md
  evals.json
  references/
    domain-knowledge.md

For user-level skills, this is ~/.claude/skills/your-skill-name/evals.json. For project-level skills, it is .claude/skills/your-skill-name/evals.json in your project root.

The file path is a convention, not a technical requirement. Claude Code does not scan for evals.json the way it scans for SKILL.md. The file is for your development process, not for the runtime. Where you put it matters only for your own organization. Co-location is the established pattern for tightly coupled test-code pairs: pytest's official good practices guide explicitly recommends co-locating tests with the code they cover when test and code development are tightly coupled, noting that proximity keeps both files synchronized and reduces the overhead of hunting across directories during active development.

For a full introduction to what evals are and why they exist, see What Are Evals in Claude Code Skills?.


How do I run evals.json test cases?

You run them manually in a fresh Claude session, one that has no carry-over context from your authoring work, so the test result reflects what a new user actually encounters rather than what you already know is in scope as the skill author. Anthropic recommends starting with 20-50 tasks drawn from real failures as an initial eval set, because early runs have large effect sizes and small sample sizes are enough to surface regressions (Anthropic Engineering, "Demystifying Evals for AI Agents," 2025). The steps:

  1. Open a new Claude Code session in a clean terminal, separate from your development session.
  2. Take the first test case prompt from evals.json.
  3. Send the prompt to Claude exactly as written.
  4. Check the output against each item in expected_behavior. Mark pass or fail.
  5. Repeat for each test case.
  6. Record which test cases failed and what the output did instead.
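Since the run is manual, a printable run sheet keeps step 4 honest: one checkbox per assertion, generated straight from the file. A sketch; the `run_sheet` name and the markdown checklist format are my own assumptions, not an AEM convention:

```python
import json

def run_sheet(raw: str) -> str:
    """Render evals.json as a markdown checklist for a manual eval session."""
    doc = json.loads(raw)
    lines = []
    for case in doc["test_cases"]:
        lines.append(f"## {case['id']} ({', '.join(case['tags'])})")
        lines.append(f"Prompt: {case['prompt']}")
        lines.extend(f"- [ ] {assertion}" for assertion in case["expected_behavior"])
        lines.append("")  # blank line between cases
    return "\n".join(lines)
```

Printing the result before opening the fresh session gives you one sheet per run, which also doubles as the record for step 6.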

This is the Claude B test: a fresh session that has no context from how you built the skill. Addy Osmani documented that giving models an explicit output format with examples raises consistency from roughly 60% to over 95% (Google Chrome Engineering Director, 2024). Your evals.json assertions are the explicit format. The fresh session is where you verify that consistency.

Honest limitation: this process is manual and takes 20-40 minutes for a 15-test-case suite. There is no automated eval runner in Claude Code as of 2026. The discipline is treating it as a required step before shipping, not an optional polish step.

For the complete workflow that uses evals.json from start to finish, see Evaluation-First Skill Development: Write Tests Before Instructions.


FAQ

Does evals.json work with any skill, or only certain types?

Every skill type benefits from evals.json: formatting skills, analysis skills, publishing skills, code review skills, writing skills. The test case structure adapts to the skill. Simple formatting skills need 5-8 test cases focused on output structure. Complex multi-step research skills need 15-20 test cases covering triggers, quality, edge cases, and error recovery. The format is the same. The coverage depth varies.

What is the difference between evals.json and a rubric file?

evals.json tests binary behavior: did the skill do the thing or not? A rubric measures gradient quality: how well did the skill do the thing? Use evals.json for skills with definable correct answers. Use a rubric for skills producing subjective output where quality sits on a spectrum. For most skills with both structural requirements and quality goals, you need both. For the full comparison, see When Do I Need a Rubric vs Just Using evals.json?.

Can I add test cases to evals.json after I ship the skill?

Yes, and you should. The best source for new test cases is real failures: when a user reports unexpected behavior, write a test case that reproduces it before fixing the skill. This way the fix is verifiable and the failure does not recur undetected. Treat evals.json as a living document that grows with every confirmed failure.

What if my skill's output is too variable to write expected_behavior assertions?

If output is genuinely too variable to assert anything about, the skill does not have a defined output contract. Write the contract first: what must every output contain, what must every output avoid? Those constraints become your expected_behavior items. If you still cannot write 3 assertions, the skill's scope is not specific enough to be testable. That is a design problem worth solving before shipping.

How specific should the prompt in a test case be?

Write the prompt the way a real user, not the skill author, would phrase the request. If your trigger eval prompt contains the word "security review" and the skill name is "security-review," you are testing invocation, not triggering. The value of trigger evals comes from testing natural language: "look at this code," "check this for issues," "is this function safe?" These are the prompts real users send.


Last updated: 2026-04-16