---
title: "When Do I Need a Rubric vs Just Using evals.json?"
description: "Use evals.json for objective tests, rubrics for subjective output quality. Learn which Claude Code skill types need which tool, and when you need both."
pubDate: "2026-04-16"
category: skills
tags: ["claude-code-skills", "rubric", "evals-json", "evaluation", "beginner"]
cluster: 17
cluster_name: "Rubric Design for Subjective Skills"
difficulty: beginner
source_question: "When do I need a rubric vs just using evals.json?"
source_ref: "17.Beginner.2"
primary_keyword: "rubric vs evals.json"
word_count: 1460
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---

When Do I Need a Rubric vs Just Using evals.json?

TL;DR: Use evals.json when your skill has a definable correct answer. Use a rubric when correctness is a spectrum. The line is this: if you can write a binary assertion that is either true or false, it belongs in evals.json. If you need to score quality on a scale, you need a rubric. Most complex skills need both.

This guide applies to Claude Code skills built and distributed through AEM. evals.json asks whether the skill passed. A rubric asks whether the skill is worth using. Different questions.


How do evals.json and rubrics measure different things?

evals.json contains binary test assertions that tell you whether a skill behaved correctly -- each expected_behavior item is either satisfied or it is not -- while a rubric scores output quality on a 1-3 scale, capturing the gradient between a passing output and an excellent one. The output either includes a findings section or it does not. The skill either triggered or it stayed dormant. There is no score of 2.5 in evals.json. Pass or fail.

A rubric contains scored dimensions. Each dimension measures quality along a 1-3 scale, with concrete descriptions for each score level. The output might score 3 on specificity and 1 on scope discipline. The rubric captures the gradient that binary assertions cannot.

Neither tool replaces the other. They answer different questions about the same skill:

  • evals.json: "Did the skill do what it is supposed to do?"
  • Rubric: "How well did the skill do what it is supposed to do?"

Skills whose correctness is fully binary need evals.json. Skills whose quality varies on dimensions that cannot be collapsed into a binary need a rubric. Most production skills with significant output quality requirements need both.
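The contrast above can be sketched in code. This is a minimal illustration, not a documented schema: the field names (`expected_behavior`, `anchors`) and the shapes are assumptions chosen to make the binary-vs-scored distinction concrete.

```python
# Hypothetical shapes for the two tools -- field names are illustrative,
# not a fixed schema.

# An evals.json case: every expected_behavior item is pass/fail.
eval_case = {
    "input": "Summarize the attached incident report",
    "expected_behavior": [
        "output includes a findings section",
        "output stays within the incident-response domain",
    ],
}

# A rubric dimension: a 1-3 scale with a concrete anchor per level.
rubric_dimension = {
    "name": "specificity",
    "anchors": {
        1: "Claims are generic; no concrete referents",
        2: "Some claims name specifics; others stay vague",
        3: "Every claim names a concrete file, metric, or step",
    },
}

def passes(results: list[bool]) -> bool:
    # Binary verdict: one unmet behavior fails the whole case.
    # There is no partial credit -- no score of 2.5.
    return all(results)
```

The eval collapses to a single boolean; the rubric dimension preserves the gradient between a passing output and an excellent one.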

Research confirms the measurement gap is real: across 5 repeated runs on the same prompt, LLMs show accuracy spreads of 5-10% on complex tasks, and "it is rare that an LLM will produce the same raw output given the same input" (Mizrahi et al., arXiv 2408.04667, 2024). Binary evals catch failures at the floor; rubrics track the variance above it.


What types of skills need only evals.json?

Skills with fully determinable correct answers need evals.json and no rubric: the entire spec is expressible as binary assertions, every quality criterion has a single correct answer, and no meaningful gradient of "better" or "worse" exists above the pass threshold once the assertion passes. Four skill types fall cleanly into this category:

  • Formatting and transformation skills. A skill that converts JSON to YAML, formats a date field, or extracts a structured output from unstructured text. The output is either correctly formatted or it is not. A rubric adds no information here.

  • Trigger and workflow skills. A skill that routes inputs, detects a condition, and triggers a downstream action. Correct behavior is binary: the skill triggered on the right input, did not trigger on the wrong input, and produced the expected routing output.

  • Publishing and submission skills. A skill that posts content to a platform, commits a file, or submits a form. These have success/failure states and structural requirements that evals.json covers completely.

  • Code execution and verification skills. A skill that runs tests, checks for compilation errors, or validates a data schema against a spec. The result is correct or incorrect. No quality gradient exists.
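For these skill types, the eval check itself is mechanical. A minimal sketch for a formatting skill, assuming a JSON round-trip check (a real harness would also parse the converted side, e.g. the YAML):

```python
# Binary check for a transformation skill: the output either round-trips
# to the same data or it does not. No quality gradient exists above this.
import json

def json_outputs_match(original: str, converted_back: str) -> bool:
    # Semantic equality, not string equality: key order may differ.
    return json.loads(original) == json.loads(converted_back)
```

The function returns exactly one bit of information, which is all the spec requires.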

"The failure mode isn't that the model is bad at the task, it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." -- Simon Willison, creator of Datasette and llm CLI (2024)

For these skill types, the spec is fully expressible in binary assertions. A tight spec produces reliable behavior. Anthropic's evaluation documentation classifies code-based grading (exact match, string match) as "fastest and most reliable, extremely scalable" precisely because these skill types have unambiguous correct answers (Anthropic Claude Docs, 2025). Empirical testing confirms this: binary MET/UNMET criteria achieve 87% exact accuracy across heterogeneous evaluation tasks, compared to 38-58% exact accuracy for ordinal criteria on the same tasks -- the binary format is the more reliable signal when the question has a correct answer (Autorubric, arXiv:2603.00077, 2025). A rubric would be measuring a quality dimension that does not exist.


What types of skills need a rubric?

Skills whose output quality varies along dimensions that cannot be collapsed into pass/fail need a rubric: correctness is not binary, two outputs can both satisfy every structural assertion yet differ sharply in quality, and only a scored dimension captures which one is actually worth using. Four skill types belong here:

  • Writing and content generation skills. A skill that drafts blog posts, writes product descriptions, or generates emails. Structural requirements (word count range, section presence, required metadata) go in evals.json. Quality dimensions (specificity of claims, voice accuracy, information density) go in a rubric.

  • Analysis and research skills. A skill that synthesizes research, produces competitive analysis, or summarizes complex documents. The analysis either exists or it does not -- that is an eval. Whether the analysis is incisive or superficial, comprehensive or selective -- that is a rubric.

  • Judgment and recommendation skills. A skill that reviews code for architecture decisions, evaluates business plans, or assesses strategy options. Recommendations either appear or they do not -- that is an eval. Whether the recommendations show reasoning and name specific tradeoffs -- that is a rubric.

  • Teaching and explanation skills. A skill that explains technical concepts, breaks down a process, or generates onboarding material. The explanation either addresses the question or it does not -- that is an eval. Whether it explains the concept clearly, with accurate examples -- that is a rubric.

In our commissions at AEM, the rubric is most valuable for content and analysis skills where quality variance is high across runs. We have measured output quality scores ranging from 1.2 to 3.0 on the same prompt, same skill, across different sessions. Without a rubric, that variance is invisible. With one, it is trackable and improvable. Independent research supports this: LLM-based rubric evaluation achieves over 80% correlation with human judgments when rubrics include reference answers and score-level descriptions, compared to significantly lower alignment when either element is omitted (Confident AI / LLM-as-Judge research, 2024-2025). A 2025 study of grading scale design found that 3-5 point rubric scales achieve ICC = 0.853 human-LLM alignment, the highest of any tested scale, because the discrete levels with clear behavioral anchors reduce the ambiguity that causes rater drift (arXiv:2601.03444, 2025).
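In practice, scored dimensions are applied by a judge model reading the rubric. A minimal sketch of assembling a judge prompt from rubric dimensions -- the wording and layout here are illustrative assumptions, not a fixed format:

```python
# Turn rubric dimensions (each with 1-3 anchors) into a judge prompt.
# The prompt structure is an assumption for illustration.
def build_judge_prompt(dimensions: list[dict], output: str) -> str:
    lines = ["Score the output on each dimension from 1 to 3.", ""]
    for dim in dimensions:
        lines.append(f"Dimension: {dim['name']}")
        for level in (1, 2, 3):
            lines.append(f"  {level}: {dim['anchors'][level]}")
        lines.append("")
    lines.append("Output to score:")
    lines.append(output)
    return "\n".join(lines)

example = [{"name": "specificity",
            "anchors": {1: "Generic claims",
                        2: "Mix of specific and vague",
                        3: "Every claim names a concrete referent"}}]
prompt = build_judge_prompt(example, "Draft output here")
```

Including the score-level anchors verbatim in the prompt is what the cited research credits for human-judge alignment: without concrete anchors, the judge drifts.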


When do skills need both evals.json and a rubric?

Most skills with meaningful output quality requirements need both: evals.json establishes the structural floor -- trigger behavior, format compliance, scope boundaries -- while the rubric measures the quality ceiling, scoring the dimensions that determine whether a passing output is actually worth using. The split is clean:

  • evals.json handles: trigger behavior, structural requirements, scope boundaries, format compliance
  • Rubric handles: output quality, reasoning depth, specificity, voice, scope discipline

A content publishing skill needs evals for whether it posts to the right platform with the correct metadata. It needs a rubric for whether the content meets a quality threshold before posting.

A code review skill needs evals for whether it produces findings with severity levels and stays within the code-review domain. It needs a rubric for whether the findings are specific, correctly reasoned, and appropriately prioritized.

The test for whether you need both: can a piece of output pass every eval and still be low quality? If yes, you need a rubric to capture the quality variance above that floor. We have seen skills ship that passed 15/15 eval assertions and still produced output that was technically correct and practically useless -- generic findings without specific remediation steps, or content that satisfied the structural spec but read like it was written from a template.
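The two-stage gate described above can be sketched as follows; the threshold value and return labels are illustrative assumptions:

```python
# Binary evals are the floor, rubric scores are the ceiling.
# quality_threshold of 2.0 on a 1-3 scale is an illustrative choice.
def ship_verdict(eval_results: list[bool],
                 rubric_scores: dict[str, int],
                 quality_threshold: float = 2.0) -> str:
    if not all(eval_results):
        return "structural failure"   # evals.json caught it
    avg = sum(rubric_scores.values()) / len(rubric_scores)
    if avg < quality_threshold:
        return "quality failure"      # passed evals, not worth using
    return "ship"
```

The 15/15-evals-but-useless skill from the paragraph above lands in the middle branch: every assertion true, average rubric score below threshold.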

evals.json catches structural failure. A rubric catches quality failure. Missing either means shipping blind on one axis. Anthropic's agent evaluation research found that "reliability drops from 60% on a single run to just 25% when measured across eight consecutive runs" -- an agent that looks reliable in spot-checking can fail three out of four times in sustained use (Anthropic, Demystifying Evals for AI Agents, 2025).

For a detailed breakdown of what a rubric contains and how to write discriminating dimensions, see What Is a Rubric in a Claude Code Skill?.


How do I decide which tool to use first?

Start with evals.json: always write the structural and behavioral requirements as binary assertions first, because they are the floor, and a skill that cannot pass its evals has no quality worth measuring -- the rubric question only becomes meaningful once correct behavior is confirmed and stable. If the skill cannot pass its evals, quality does not matter.

Once your skill passes all evals consistently, assess whether quality variance is visible in real use. If every passing output looks equally good, you do not need a rubric. If some passing outputs are noticeably better than others, identify why and build a rubric around those differences.

This order prevents a common mistake: writing a rubric before you have defined the structural requirements. Skills without a structural floor often score high on rubric dimensions because the judge model compensates for missing structure by evaluating the quality of what is present. The rubric ends up measuring the wrong things. Research on rubric calibration found that even with 5 calibration examples, rubric-based grading achieves only 80% accuracy on structured criteria -- meaning calibration matters, and calibration is meaningless if the underlying structural requirements are not first defined cleanly in evals.json (Autorubric, arXiv:2603.00077, 2025). Anthropic's CORE-Bench evaluation work demonstrates this principle at scale: before resolving eval bugs and ambiguities, Opus 4.5 scored 42% on the benchmark; after fixing the evaluation setup, the same model scored 95% -- the skill had not changed, only the quality of the structural tests had (Anthropic, Demystifying Evals for AI Agents, 2025).
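The sequencing rule reduces to a short decision function -- a sketch of the workflow above, with hypothetical inputs standing in for the judgments you make by observing the skill:

```python
# Evals first; rubric only once evals pass consistently AND visible
# quality variance remains. Inputs are observations, not measurements.
def next_step(evals_pass_consistently: bool, quality_varies: bool) -> str:
    if not evals_pass_consistently:
        return "fix evals.json / skill behavior first"
    if quality_varies:
        return "build a rubric around the observed differences"
    return "evals.json alone is sufficient"
```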

For the full evaluation-first workflow that sequences both tools correctly, see Evaluation-First Skill Development: Write Tests Before Instructions.


FAQ

What if I am not sure whether my skill needs a rubric?

If every correct output can be evaluated with a binary yes/no for each quality criterion and no meaningful gradient of better or worse exists above the pass threshold, evals.json is sufficient and a rubric adds no measurement signal worth the calibration overhead. If any quality criterion requires a judgment about degree, add a rubric for those criteria. When in doubt, start without a rubric. If you notice quality variance after the first 20 real uses, build one then.

Can I replace my rubric with more evals.json assertions?

You can partially replace rubric dimensions with binary assertions for dimensions that have a floor below which output is clearly wrong, but binary assertions cannot capture degrees of quality above that floor, and the precision you gain on the low end comes at the cost of losing all signal on the high end. Some subjective dimensions can be partially captured with binary assertions: "output does NOT use vague language like 'effective' or 'good' without a concrete referent." But this approach misses quality variance above the floor. A rubric captures degrees of quality that binary assertions cannot. Anthropic's eval documentation notes that code-based grading "lacks nuance for more complex judgements that require less rule-based rigidity" -- that nuance gap is exactly what a rubric fills (Anthropic Claude Docs, 2025). Ordinal rubric criteria show 85-93% adjacent accuracy even where exact score agreement is lower -- meaning the rubric reliably distinguishes good from acceptable from poor, even when the precise score varies by one level, which is the practical granularity you need for skill improvement (Autorubric, arXiv:2603.00077, 2025). For skills where "good enough" is not good enough, both tools are needed.
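A floor check like the vague-language assertion above is easy to express as a binary test. This sketch is cruder than the assertion it illustrates -- it flags any occurrence of the listed words rather than only unanchored ones, and the word list is an illustrative assumption:

```python
# Binary floor check for vague language. Catches output below the
# floor; says nothing about degrees of quality above it.
import re

VAGUE = re.compile(r"\b(effective|good|robust|powerful)\b", re.IGNORECASE)

def no_vague_language(output: str) -> bool:
    return VAGUE.search(output) is None
```

Everything above this floor -- whether the specific claims are the *right* ones, well prioritized, well reasoned -- is exactly the signal the binary check discards and the rubric keeps.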

Is a rubric useful if I am the only user of my skill?

Yes, particularly for personal content, research synthesis, or analysis skills where output quality matters and where quality drift -- each successive output seeming fine in isolation while the baseline quietly degrades -- is the failure mode you are most likely to miss. Drift is not hypothetical: in a Stanford and UC Berkeley study of GPT model behavior, accuracy on a structured task dropped from 84% to 51% in the same model within three months -- a 33-percentage-point decline invisible without measurement (Chen, Zaharia, Zou, arXiv:2307.09009, 2023). If you are generating content, doing research synthesis, or producing analysis you rely on, a rubric gives you a repeatable way to assess quality across runs and catch that drift before it compounds.
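Catching drift only requires logging the average rubric score per run and comparing windows. A minimal sketch, assuming a fixed window size and an illustrative drop threshold:

```python
# Flag drift by comparing a recent window of per-run average rubric
# scores against the earliest window. Window and threshold are
# illustrative choices, not recommendations.
from statistics import mean

def drift_flag(run_averages: list[float], window: int = 5,
               drop_threshold: float = 0.4) -> bool:
    if len(run_averages) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(run_averages[:window])
    recent = mean(run_averages[-window:])
    return (baseline - recent) >= drop_threshold

scores = [2.8, 2.9, 2.7, 2.8, 2.9, 2.4, 2.3, 2.2, 2.1, 2.0]
print(drift_flag(scores))  # True: recent window dropped ~0.6 below baseline
```

Each run in isolation looks "fine"; only the windowed comparison makes the degradation visible.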

How many dimensions does my rubric need?

Three dimensions is sufficient for most skills, and five is the practical maximum: beyond that, calibration becomes unreliable because the judge model begins conflating overlapping criteria, the per-dimension scores lose discriminating power, and the rubric starts measuring the same underlying quality variance in multiple redundant ways. Write the minimum number of dimensions that capture the quality variance you care about. If two dimensions are measuring the same underlying thing, consolidate them into one. Research on LLM rubric calibration shows that inter-judge agreement (Cohen's κ) between two independent evaluators applying a structured rubric averages 0.53, with per-question correlations ranging from 0.54 to 0.82 -- and that variance is easier to manage with fewer, sharper dimensions than with many overlapping ones (Autorubric, arXiv:2603.00077, 2025).

Can I test my skill with only a rubric and no evals.json?

Not for any skill used beyond personal exploration: a rubric measures quality on the outputs the skill produces, but says nothing about whether the skill triggers correctly, handles negative inputs gracefully, or meets the structural requirements that determine whether it is safe to ship to anyone else. A rubric measures quality only when the skill runs. It does not test trigger behavior, negative cases, or structural requirements. A skill that scores 3.0 on every rubric dimension but triggers only 40% of the time has failed in production. evals.json is required for any skill used by people other than the author. For details on what belongs in evals.json, see What Are Evals in Claude Code Skills?.


Last updated: 2026-04-16