title: "How Specific Should My Skill Instructions Be?" description: "Match instruction specificity to task fragility: fragile operations need exact step-by-step scripts, judgment-based tasks need principles and criteria." pubDate: "2026-04-15" category: skills tags: ["claude-code-skills", "degrees-of-freedom", "skill-instructions", "specificity"] cluster: 15 cluster_name: "Degrees of Freedom & Instruction Specificity" difficulty: beginner source_question: "How specific should my skill instructions be?" source_ref: "15.Beginner.1" word_count: 1520 status: draft reviewed: false schema_types: ["Article", "FAQPage"]
TL;DR: As specific as the task's fragility requires. Publishing, API calls, and database writes are hard to reverse and need exact step-by-step instructions. Drafting, evaluating, and reviewing need principles and criteria, not scripts. The deciding factor is how badly a wrong call plays out and how reversible it is.
What is task fragility and why does it determine specificity?
In AEM's skill methodology, task fragility measures how much damage a wrong decision causes and how hard that damage is to undo — a high-fragility task produces real consequences that are difficult or impossible to reverse, while a low-fragility task produces output that is easy to notice, correct, and redo.
A high-fragility task: posting to a live platform, running a database migration, sending an email to a customer list. Wrong decision, real consequence, hard to reverse. These tasks need specific instructions that leave Claude no room to improvise.
A low-fragility task: drafting a blog post, evaluating code quality, suggesting a project structure. Wrong decision, easy to notice, easy to redo. These tasks benefit from principles and criteria because the right output depends on context Claude needs to judge, not execute.
The error developers make is applying the wrong specificity level to the task type. A publish skill that says "post when the draft looks ready" is not a skill. It's a guess with good intentions. A content evaluation skill that says "follow these 47 sub-steps to assess quality" is not a skill. It's a checklist that breaks the moment the content doesn't fit the template.
One limit of this framework: it does not resolve ambiguity in the task goal itself. If the objective is unclear, no specificity level fixes that — the skill needs a clearer brief before calibration makes sense.
In the AgentIF benchmark, even the best-evaluated model achieved a 27% instruction success rate across complex agentic tasks — meaning 73% of instructions produced at least one constraint violation. Performance degraded further as instruction complexity increased. (Qi et al., NeurIPS 2025)
When should I use exact step-by-step instructions?
Use exact steps when the operation has a correct sequence, deviation from that sequence causes errors, and Claude cannot safely improvise — meaning the task involves an irreversible action, an external system with specific authentication, or a multi-step workflow where each step depends on the exact output of the one before it.
Characteristics of tasks that need exact steps:
- Irreversible actions (publish, send, deploy, delete)
- External API calls with specific authentication requirements
- File system operations with specific paths and formats
- Multi-step workflows where step 4 depends on the exact output of step 3
In our commission work, we replace prose guidance with numbered steps the moment a skill involves an irreversible external action. A publish-to-production skill broke a client's deployment twice when the body contained 40 lines of principles. We replaced those 40 lines with 15 numbered steps. No failures since.
An example of what exact steps look like for a publishing workflow:
Step 1: Read the draft file at the path provided by the user.
Step 2: Check that the frontmatter contains: title, pubDate, and status fields. If any are missing, stop and report the missing fields to the user before proceeding.
Step 3: Run the linting check by executing: node validate-frontmatter.js [file-path]
Step 4: Only if step 3 passes, set status to "published" in the frontmatter.
Step 5: Write the updated file.
Step 6: Report the exact file path and new status to the user.
No judgment required. No room for improvisation. Each step has one correct action.
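Step 2's gate is mechanical enough to sketch in code. A minimal frontmatter check, hypothetical and separate from the `validate-frontmatter.js` script that step 3 invokes:

```python
import re

REQUIRED_FIELDS = ("title", "pubDate", "status")

def check_frontmatter(text: str) -> list[str]:
    """Return the required fields missing from a draft's frontmatter (step 2)."""
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        # No frontmatter block at all: every required field is missing.
        return list(REQUIRED_FIELDS)
    keys = {line.split(":", 1)[0].strip()
            for line in match.group(1).splitlines() if ":" in line}
    return [f for f in REQUIRED_FIELDS if f not in keys]
```

A draft missing `status` comes back as `['status']`, which maps directly onto step 2's "stop and report the missing fields" rule.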
Across 16 LLM agents evaluated on Agent-SafetyBench, none achieved a safety score above 60% on tasks involving multi-step tool use — with failures most severe in scenarios where instructions left ambiguous the permissibility of irreversible actions. (Zhang et al., 2024)
"The failure mode isn't that the model is bad at the task, it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." - Simon Willison, creator of Datasette and llm CLI (2024)
When should I use principles instead of exact steps?
Use principles when the right output depends on judgment that varies with context, and specifying every case explicitly would require hundreds of steps that still would not cover all inputs — this applies to quality evaluation, content generation with variable inputs, and any task where the output format flexes to match what the user provides.
Characteristics of tasks that benefit from principles:
- Quality evaluation (code review, writing critique, design assessment)
- Content generation with quality standards
- Brainstorming and exploration tasks
- Tasks where the inputs are highly variable and the output format flexes to match
A code review skill doesn't need steps. It needs evaluation criteria:
Review the provided code against these criteria:
1. Security: flag any input that reaches a database or external service without validation
2. Error handling: every function that calls an external API must handle the failure case explicitly
3. Readability: flag functions over 30 lines as candidates for extraction
4. Test coverage: flag public functions without a corresponding test
For each finding, state: the line or function, the criterion violated, and a specific fix.
This is not a step sequence. It's a rubric. Claude applies judgment about whether the code meets each criterion. The criteria are specific (30 lines, not "too long"). The application requires judgment.
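Of the four criteria, the 30-line readability threshold is the one a script could apply verbatim. A sketch of that single check (the security and error-handling criteria still require judgment):

```python
import ast

MAX_LINES = 30  # the rubric's specific threshold: "30 lines", not "too long"

def flag_long_functions(source: str) -> list[tuple[str, int]]:
    """Apply criterion 3: return (name, line count) for functions over MAX_LINES."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > MAX_LINES:
                findings.append((node.name, length))
    return findings
```

The point of the sketch is the shape of the rubric: a precise threshold produces a deterministic finding, while "flag functions that are too long" would not.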
G-Eval research confirmed the advantage of rubric decomposition: GPT-4 guided by explicit sub-criteria achieved a 0.514 Spearman correlation with human raters on summarization tasks, outperforming all prior automated evaluation methods that used holistic, criterion-free approaches. (Liu et al., EMNLP 2023)
What does "too specific" look like?
A skill is too specific when its step sequence breaks on inputs that do not match the template exactly, causing Claude to either halt with an error at the point of mismatch or improvise in a way that corrupts the remaining sequence and produces output the user cannot trust.
The failure mode: Claude reaches step 7, finds the input doesn't match what step 7 expects, and either halts with an error or improvises in a way that corrupts the sequence. The skill handles the 80% of inputs that fit the template. It fails on the 20% that don't.
Signs a skill is too specific:
- The body specifies exact string values to look for in inputs
- Steps assume a fixed input format that users don't always provide
- The skill only works when invoked with a specific command structure
- Edge case inputs consistently produce wrong or hallucinated outputs
The fix is not to add more steps for the edge cases. The fix is to add explicit branch handling: "If the input does not contain X, ask the user for X before proceeding to step 3."
Research on format-task interference shows that embedding rigid output format requirements in the same instruction body as reasoning goals degrades task performance by a measurable margin — separating the two concerns yields 1–6% relative improvement in task quality. (Deng et al., 2024)
What does "too vague" look like?
A skill is too vague when Claude fills the gaps with plausible-but-wrong behavior, meaning the output varies between sessions, between model versions, and between users because the instructions leave enough ambiguity for Claude's training defaults to substitute for missing explicit rules.
The failure mode: output varies between sessions, between model versions, or between users. The skill "works" on simple cases because Claude's training data covers those cases. It fails on edge cases because Claude guesses, and the guess is wrong.
Signs a skill is too vague:
- The output format differs across invocations
- Claude skips steps that aren't mandatory in the body text
- Different users get different quality from the same skill
- The skill works on Claude Opus but fails on Haiku
The fix is not to write even longer prose. The fix is to replace explanatory paragraphs with specific decision rules: instead of "ensure the output is well-formatted," write "format the output as a numbered list with each item under 40 words."
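A rule phrased that concretely is also checkable. A minimal sketch, assuming the numbered-list format from the example rule:

```python
import re

def violates_format_rule(output: str, max_words: int = 40) -> list[int]:
    """Return the 1-based item numbers that break the rule
    'numbered list with each item under 40 words'."""
    violations = []
    for line in output.splitlines():
        m = re.match(r"\s*(\d+)\.\s+(.*)", line)
        if m and len(m.group(2).split()) >= max_words:
            violations.append(int(m.group(1)))
    return violations
```

"Well-formatted" cannot be tested; "numbered list, each item under 40 words" can, which is exactly what makes it a decision rule rather than prose.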
Studies of LLM output consistency show accuracy fluctuations of up to 10% across identical inference runs under ambiguous instructions, with vague prompts demonstrably amplifying variance relative to explicitly structured equivalents. (Krugmann & Hartmann, 2024; Frontiers in AI, 2025)
Microsoft Research's LLMLingua work found that removing low-information tokens — principally verbose, explanatory text — from prompts achieves up to 20x compression with only ~1.5-point performance loss, confirming that loose prose is the low-value fraction of any instruction body. (Jiang et al., EMNLP 2023)
For troubleshooting skills with inconsistent output, see Why Isn't My Claude Code Skill Working?.
How does specificity affect how much token budget the body uses?
More specific bodies are not necessarily longer, and in practice a 15-step sequence with precise, short imperative commands is often shorter in total word count than the 40 lines of explanatory prose it replaces — specificity is about precision, not volume.
Short, imperative sentences ("Read the file at the path provided") are more token-efficient than explanatory constructions ("In order to begin the process, Claude should first open and read the file that the user has specified in their request"). Research on symbolic instruction compression found that replacing natural-language prose with compact, imperative instruction formats reduces token usage by 62–81% across task categories while preserving semantic intent — with selection and classification tasks showing the largest reduction (80.9%). (Jha et al., arXiv:2601.07354, 2026)
The token cost of the body depends on word count, not on how prescriptive the instructions are.
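The two phrasings quoted above make the claim measurable, using whitespace word count as a rough proxy for tokens:

```python
imperative = "Read the file at the path provided"
explanatory = ("In order to begin the process, Claude should first open and read "
               "the file that the user has specified in their request")

# Word count as a rough proxy for token count: 7 words vs 22 words.
print(len(imperative.split()), len(explanatory.split()))
```

The prescriptive sentence is roughly a third the length of the explanatory one, despite leaving Claude less room to improvise.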
Structured formats (numbered steps, headers, delimiters) carry a small formatting overhead in tokens relative to equivalent bare prose — the tradeoff is between token efficiency and model parseability, and for most production skill bodies the compliance improvement outweighs the token cost.
For more on body loading and token economics, see When Does the SKILL.md Body Get Loaded into Context?.
FAQ: How specific to make skill instructions
Should I start with specific steps or general principles when building a new skill? Start with principles, test on real inputs, and make specific only the steps where Claude's output is inconsistent or wrong. Over-specifying from the start produces brittle skills. Under-specifying from the start tells you where the gaps are.
What's the right level of specificity for a skill that produces creative output? Specific criteria for quality evaluation, open framework for execution. Tell Claude what makes the output good (criteria) without specifying exactly how to produce it (steps). "Each paragraph must make one claim, supported by one example or piece of evidence" is a specific quality criterion that allows creative flexibility in execution.
Can a skill have some exact steps and some principle-based sections? Yes. Most production skills do. Irreversible sub-tasks (posting, saving, sending) get exact steps. Judgment-based sub-tasks (drafting, evaluating, selecting) get principles. The body makes the boundary explicit: "Steps 1-3 are mandatory sequence. Step 4 applies these criteria to evaluate the draft."
My skill produces different output each time I run it. Is that a specificity problem? Yes, almost certainly. Variation across invocations on the same input is the signature of vague instructions that Claude fills differently each time. Identify the output dimension that varies and add a specific rule governing it.
How do I write instructions that work across Claude Haiku, Sonnet, and Opus? Write for the lowest-capability model you expect to run the skill. Haiku follows explicit, numbered steps more reliably than it follows principles. A skill that works on Haiku with specific steps will work on Sonnet and Opus with better judgment applied to those same steps.
Last updated: 2026-04-15