---
title: "What Is a Rubric in a Claude Code Skill?"
description: "A rubric scores subjective Claude Code skill output. Learn what rubrics contain, when you need one, and how they differ from evals.json test cases."
pubDate: "2026-04-16"
category: skills
tags: ["claude-code-skills", "rubric", "evaluation", "quality", "beginner"]
cluster: 17
cluster_name: "Rubric Design for Subjective Skills"
difficulty: beginner
source_question: "What is a rubric in a Claude Code skill?"
source_ref: "17.Beginner.1"
word_count: 1440
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---

What Is a Rubric in a Claude Code Skill?

TL;DR: A rubric in a Claude Code skill is a scoring framework for evaluating subjective output quality. It contains 3-5 dimensions, each with score descriptions for 1, 2, and 3. You use it to measure how well a skill performs on tasks where "correct" is a spectrum rather than a binary pass or fail.

A rubric is the difference between "this content is good" and "this content scores 2.3 on specificity and 3.0 on voice accuracy." Only one of those tells you where to improve. At AEM, rubrics are a standard component of every skill that produces subjective output.


What is a rubric in a Claude Code skill?

A rubric is a structured quality scoring framework: 3-5 named dimensions of output quality, each with score descriptions for 1, 2, and 3. The score descriptions keep evaluation consistent rather than impressionistic, applying the same standard to every output regardless of who scores it or when.

Rubrics live in a file called rubric.md inside the skill folder. Unlike evals.json, a developer tool used outside runtime, a rubric can be read by Claude at runtime when the skill uses LLM-as-judge evaluation: Claude evaluates its own output, or a batch of outputs, using the rubric as its scoring instructions.
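Concretely, a skill carrying all three evaluation assets might be laid out like this (the layout is illustrative; only SKILL.md is required):

```
content-writer/
├── SKILL.md     # instructions Claude loads when the skill triggers
├── evals.json   # binary pass/fail test cases (developer tool)
├── rubric.md    # 3-5 scored dimensions for subjective quality
└── judge.md     # optional instructions for an LLM scorer
```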

Rubrics exist because not all skill quality is binary. An evals.json test case can verify that an output includes a recommendations section. It cannot measure whether the recommendations are specific, well-reasoned, and scoped correctly. That is what a rubric measures. The "Rubric Is All You Need" study (ACM ICER 2025) found that providing an LLM grader with the same rubric used by human graders produced "consistently high correlation scores" — the rubric format, not the model, was the primary driver of grading accuracy.


When does a skill need a rubric?

A skill needs a rubric when "correct" is a spectrum rather than a binary — specifically when the skill produces prose, makes judgment calls, or runs LLM-as-judge evaluation, because in each of those cases structural pass/fail assertions in evals.json cannot distinguish a mediocre output from an excellent one on the same task.

  1. The skill produces prose output where quality varies. A content writing skill, a research synthesis skill, a code explanation skill. These produce outputs where one version is clearly better than another, but the difference is not captured by structural assertions.

  2. The skill makes judgment calls. A strategy analysis skill, a risk assessment skill, a code review skill focused on architecture decisions. The skill's value comes from the quality of its reasoning, not just the presence of certain fields.

  3. You are evaluating output with LLM-as-judge. When you want Claude to score skill output automatically, it needs a scoring framework. Without a rubric, the judge model will evaluate by feel, producing inconsistent and unreliable scores. Research on Prometheus, an open-source evaluator LLM, found that providing customized score rubrics lifted Pearson correlation with human judgment from 0.392 (rubric-free ChatGPT) to 0.897 — on par with GPT-4 (Kim et al., ICLR 2024).

Skills that do NOT need a rubric:

  • Formatting skills
  • Publishing skills
  • Database query skills
  • Any skill where the output is either correct or incorrect with no meaningful gradient

These belong in evals.json.

"When you give a model an explicit output format with examples, consistency goes from around 60% to over 95% in our benchmarks." -- Addy Osmani, Engineering Director, Google Chrome (2024)

A rubric is the explicit format for subjective quality assessment. Without it, LLM-as-judge scoring sits at the equivalent of 60% consistency.
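What "explicit format" means mechanically: the rubric file becomes the judge's scoring instructions, and the judge's reply is parsed back into per-dimension scores. A minimal sketch, with hypothetical helper names (the prompt wording and reply format are assumptions, not a documented API):

```python
import re

def build_judge_prompt(rubric: str, output: str) -> str:
    """Assemble an LLM-as-judge prompt: the rubric text becomes the
    scoring instructions, the skill output is the material to score."""
    return (
        "Score the output against the rubric. For each dimension,\n"
        "return a line 'Dimension name: score' (1-3), then a\n"
        "one-sentence justification.\n\n"
        f"## Rubric\n{rubric}\n\n"
        f"## Output to score\n{output}\n"
    )

def parse_scores(judge_reply: str) -> dict[str, int]:
    """Pull 'Dimension name: N' lines out of the judge's reply."""
    return {
        m.group(1).strip(): int(m.group(2))
        for m in re.finditer(r"^(.+?):\s*([123])\s*$", judge_reply, re.M)
    }
```

The judge model fills the gap between these two functions; everything else is deterministic, which is exactly where you want the determinism to live.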


What does a rubric look like?

A rubric file has a header naming the skill, then 3-5 dimension blocks — each with a name, a description of what it measures, and score descriptions for 1, 2, and 3 — where each score description must be concrete enough that two independent scorers reading the same output would assign the same score.

Here is a concrete example for a content writing skill:

# Rubric: Content Writing Skill

## Dimension 1: Specificity of Claims

Measures whether factual claims, examples, and recommendations name concrete entities,
numbers, and mechanisms rather than describing them in vague generalities.

- **Score 1:** Claims are generic ("AI tools improve productivity"). No named entities,
  no numbers, no mechanism described.
- **Score 2:** Most claims have a specific element, but some remain generic. Mix of
  "AI coding tools" and "GitHub Copilot."
- **Score 3:** Every key claim names a specific entity, cites a number, or describes
  a named mechanism. No claim survives without a concrete referent.

## Dimension 2: Voice Accuracy

Measures whether the output matches the brand voice spec (Sharp Engineer: precise,
accessible, dry wit, no hedge stacks).

- **Score 1:** Generic instructional prose. Reads like documentation from a template.
  No wit, no personality, hedge words present.
- **Score 2:** Voice is mostly present. One or two hedge words. Mostly correct tone.
  Occasional lapse into generic AI instructional style.
- **Score 3:** Every sentence is in voice. Sharp, declarative. One wit moment lands.
  Zero hedge words.

## Dimension 3: Scope Discipline

Measures whether the output stays within the skill's defined scope without generating
unrequested content.

- **Score 1:** Output expands significantly beyond scope. Adds unrequested sections,
  advice outside the brief, or editorial commentary on the user's choices.
- **Score 2:** Minor scope drift. One or two sentences outside the defined boundaries.
- **Score 3:** Output is exactly scoped. Everything present is requested; nothing absent
  is required.

Score descriptions must be concrete. Vague score descriptions produce inconsistent scoring. "Score 1: poor quality" is not a score description. "Score 1: claims are generic, no named entities, no numbers, no mechanism described" is. The foundational MT-Bench study (Zheng et al., NeurIPS 2023) found that strong LLM judges achieve over 80% agreement with human evaluators — matching the rate at which human experts agree with each other — but only when the evaluation criteria are explicit and well-defined.
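Numbers like "2.3 on specificity" come from averaging per-dimension scores across a batch. A sketch, using made-up scores for the three dimensions above:

```python
from statistics import mean

def dimension_means(batch: list[dict[str, int]]) -> dict[str, float]:
    """Average each rubric dimension across a batch of scored outputs."""
    dims = batch[0].keys()
    return {d: round(float(mean(s[d] for s in batch)), 1) for d in dims}

# Hypothetical judge scores for three outputs of the same skill:
batch = [
    {"Specificity of Claims": 2, "Voice Accuracy": 3, "Scope Discipline": 3},
    {"Specificity of Claims": 3, "Voice Accuracy": 3, "Scope Discipline": 2},
    {"Specificity of Claims": 2, "Voice Accuracy": 3, "Scope Discipline": 3},
]
print(dimension_means(batch))
```

Per-dimension means tell you where to revise the skill; a single holistic average would hide that the batch above is strong on voice and weak on specificity.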


How is a rubric different from evals.json?

evals.json tests binary behavior — pass or fail — while a rubric measures gradient quality on a 1-3 scale, making them complementary rather than interchangeable: evals.json sets the structural floor (did the output include the required sections?), and the rubric sets the quality ceiling (how well were those sections written?).

An evals.json assertion is binary. Either the skill did the thing or it did not. Either the output contains the required section or it does not. Either the skill triggered or it did not. Pass or fail.

A rubric measures gradient quality. The output contains the section, but how well was it written? The skill triggered, but were the recommendations specific? Binary assertions cannot answer these questions.

Use evals.json for structural and behavioral requirements. Use a rubric for quality requirements on subjective output. Most production skills that involve prose, analysis, or judgment need both: evals.json for the structural floor, a rubric for the quality ceiling. The LLM-Rubric paper (Hashemi et al., ACL 2024) demonstrated this layered approach: a calibrated multidimensional rubric halved root-mean-squared error against human judgments compared to uncalibrated holistic scoring.

In our commissions at AEM, the most common rubric design mistake is writing dimensions that belong in evals.json. "Does the output include all required sections?" is a structural check. It belongs in evals.json as a binary assertion, not in a rubric as a scored dimension. When structural checks end up in rubrics, every output scores 3.0 on those dimensions, and the rubric stops discriminating.
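To make the split concrete, the structural half stays in evals.json as a binary assertion. The field names in this sketch are illustrative, not a documented evals.json schema:

```json
{
  "tests": [
    {
      "name": "output includes required sections",
      "prompt": "Write a launch announcement for the new feature",
      "assertions": [
        { "type": "contains", "value": "## Summary" },
        { "type": "contains", "value": "## Recommendations" }
      ]
    }
  ]
}
```

Whether the "## Recommendations" section is any good is the rubric's job, scored 1-3 on dimensions like specificity.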

For the complete comparison of when to use each tool, see When Do I Need a Rubric vs Just Using evals.json?.


How does a rubric connect to evaluation-first development?

Rubric dimensions are drafted before SKILL.md, alongside evals.json. The rubric dimensions define what quality means for the skill; the SKILL.md instructions are then written to produce output that scores well on those dimensions. It is the same discipline that drives the 40-90% defect reduction Microsoft and IBM observed in test-driven software development (Nagappan et al., 2008).

This order matters. Writing instructions first and rubric second produces instructions that rationalize the author's approach. Writing the rubric first produces instructions that aim at a defined quality target. In AI-assisted scoring research, rubric design before implementation consistently outperformed holistic post-hoc evaluation: a 2024 study on physics exam scoring found that fine-grained checklist rubrics produced human-AI agreement comparable to human inter-rater reliability, while holistic scoring degraded significantly for mid-range outputs (Maini et al., arXiv 2604.12227).

For the full workflow that combines evals.json and rubrics, see Evaluation-First Skill Development: Write Tests Before Instructions.


FAQ

Can I have more than 5 rubric dimensions?

Avoid it. More than 5 dimensions produces calibration drift: scores cluster around the middle because the judge (human or LLM) cannot hold more than 5 independent quality signals in attention simultaneously. If you find yourself writing a 7-dimension rubric, look for dimensions that are measuring the same underlying thing and consolidate them.

What does a judge.md file add to a rubric?

A judge.md file contains instructions for an LLM acting as the scorer. It tells the judge model how to apply the rubric: read the skill output, evaluate each dimension, return a score with a one-sentence justification per dimension. Without judge.md, using a rubric with LLM-as-judge requires improvised prompting, which is less consistent than giving the judge model explicit instructions. You need judge.md when you want to automate rubric scoring across large batches.
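A minimal judge.md consistent with that description might read as follows (the wording is a sketch, not a required format):

```markdown
# Judge: Content Writing Skill

Read the full output before scoring anything. Then, for each dimension
in rubric.md:

1. Assign a score of 1, 2, or 3 using the score descriptions verbatim.
2. Return one line per dimension: `Dimension name: score`.
3. Follow each score with a one-sentence justification that quotes
   the output.

Do not average dimensions into a single number. Report each separately.
```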

Can a rubric replace manual review entirely?

For routine quality checking across large batches, yes. For final editorial judgment before publishing, no. A rubric gives you a measurable quality floor. It tells you when output is likely bad. It does not tell you whether output is worth a specific human's time to read. Use rubrics to filter, not to replace editorial judgment entirely.
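The filter itself is trivial once per-dimension scores exist. A sketch, with a hypothetical threshold:

```python
FLOOR = 2.0  # hypothetical threshold: any dimension below this gets flagged

def needs_human_review(dimension_scores: dict[str, float]) -> bool:
    """Rubrics filter; editors decide. Flag an output for editorial
    review when any dimension falls below the quality floor."""
    return any(score < FLOOR for score in dimension_scores.values())
```

An output scoring 1.0 on voice gets flagged no matter how well it scores elsewhere; a batch at 2.0 or above on every dimension passes through to a human who decides whether it is worth publishing.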

What is the difference between scoring 1 and scoring 2 in a rubric dimension?

Score 2 should be "acceptable, with identifiable improvement areas." Score 1 should be "does not meet the baseline for this dimension." Score 3 should be "no improvement needed on this dimension." The descriptions must make the line between each score concrete enough that two different scorers would assign the same score to the same output. If they would not, rewrite the score descriptions.

Should the skill itself read the rubric file at runtime?

Only if the skill includes a self-assessment step where Claude evaluates its own output before returning it. This pattern is useful for high-quality writing skills where a draft-assess-revise cycle is part of the workflow. For most skills, the rubric is a developer evaluation tool, not a runtime component.
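The draft-assess-revise cycle reduces to a short loop. In this sketch, `generate` and `score` are stand-in callables (in a real skill, both would be Claude calls; `score` would apply rubric.md):

```python
def draft_assess_revise(generate, score, prompt, max_rounds=3):
    """Self-assessment loop: draft, score the draft against the rubric,
    revise while any dimension scores 1, up to max_rounds drafts."""
    draft = generate(prompt)
    for _ in range(max_rounds - 1):
        if min(score(draft).values()) >= 2:  # every dimension acceptable
            break
        draft = generate(
            f"{prompt}\n\nRevise this draft; at least one rubric "
            f"dimension scored 1:\n{draft}"
        )
    return draft
```

The cap on rounds matters: without it, a dimension the instructions cannot satisfy turns the loop into an infinite revision cycle.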


Last updated: 2026-04-16