The SKILL.md Description Field: The One Line That Makes or Breaks Your Skill

TL;DR: The description field in SKILL.md is the one line Claude reads to decide whether your skill fires. Imperative descriptions ("Use this skill when...") achieve 100% activation on matched prompts. Passive descriptions sit at 77%. Stay under 1,024 characters, use imperative phrasing, and add negative triggers to stop false positives.

Most Claude Code skills fail before they run a single step. Not in the output format. Not in the instructions. Not in the reference files. In the description field.

Spend two days building a 400-line skill with domain-specific reference files and a tested output template. If the description is passive or vague, the skill won't fire consistently. Claude Code gives no error when a skill doesn't activate, so the failure stays invisible until you're in a session wondering why the skill you built isn't running.

AEM has seen this in commissions. A well-constructed skill with a passive description ("This skill helps with writing blog posts") activated on 6 out of 10 relevant prompts. One change to an imperative description ("Use this skill when the user asks to write, draft, or create a blog post or article") brought it to 10 out of 10. One line. The same skill body. Entirely different production behavior.

This article covers the description field in full: what it does, how Claude uses it, how to write descriptions that trigger reliably, what negative triggers are and why they matter, the pushy-versus-conservative failure spectrum, and the silent failure modes that break descriptions without any error output.

What does the description field do in SKILL.md?

The description field is Claude's routing signal: when a user sends a request, Claude reads every loaded skill's description, classifies whether any of them match the intent, and fires the skill that matches — it is not a summary of what the skill does but an explicit instruction about when to activate it, and Claude treats it exactly that way.

For the full mechanics, see What does the description field do in a Claude Code skill?.

A correct description does three things:

Specifies the trigger conditions precisely enough that Claude activates on all matching requests
Specifies the exclusions precisely enough that Claude skips near-miss requests
Stays under 1,024 characters and on a single line in the frontmatter

Miss any one of these and the skill misbehaves in production.

Research on multi-agent LLM systems finds that 44.2% of failures originate from system design issues — including task specification failures (15.7%), step repetition (13.2%), and loss of conversation history (8.2%) — with a further 32.3% traced to inter-agent misalignment (Cemri, Pan, Yang et al., "Why Do Multi-Agent LLM Systems Fail?", UC Berkeley / arXiv:2503.13657, 2025). A misconfigured description is a specification failure by another name.

"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, TypeScript compiler team, Anthropic (2024)

An open suggestion ("This skill helps with content creation") leaves Claude to decide. A closed spec ("Use this skill when the user asks to write, draft, revise, or outline any written content, including blog posts, emails, reports, social posts, and documentation") leaves no ambiguity.

Why does description phrasing determine whether your skill actually fires?

Claude's skill selection uses internal classification, not keyword matching: when a request comes in, Claude evaluates each loaded skill's description for semantic alignment, and the phrasing of that description determines how much routing signal the evaluation gets — enough to fire the skill, or not enough to trigger it at all.

In testing across 650 activation trials, imperative descriptions achieved 100% activation on matched prompts. Passive descriptions achieved 77%. The 23-point gap comes from how the classifier treats each construction:

Imperative: "Use this skill when the user asks to..." — an instruction to fire, with clear trigger conditions
Passive: "This skill helps with..." — information about the skill, no trigger instruction

Claude treats imperative descriptions as routing instructions. It treats passive descriptions as metadata. Routing instructions activate the skill. Metadata informs but does not activate.

The fix is mechanical. Start every description with "Use this skill when" or "Use when." Apply this to every skill in your library, including ones that already work most of the time. "Most of the time" is a fair-weather skill.

# Passive (77% activation on matched prompts)
description: "This skill assists with writing technical documentation."

# Imperative (100% activation on matched prompts)
description: "Use this skill when the user asks to write, create, or draft technical documentation, how-to guides, or step-by-step instructions for any software product or process."

The routing signal quality matters beyond individual skills. A Carnegie Mellon University and Salesforce study (2025) found that even the best-performing AI agents fail on approximately 70% of real-world office tasks — a gap researchers attribute in part to unclear instruction scoping. Getting the description right is the first layer of that scoping.

How long should the description be?

The 1,024-character limit is the hard ceiling, but the practical target for most single-purpose skills is 150 to 500 characters: enough characters to name the trigger conditions and output types clearly, short enough that the classifier has a tight signal rather than a wall of text to interpret.

See How long should my skill description be? for the detailed treatment.

Below 50 characters: too vague, the classifier has no signal.
150–500 characters: the right range for most single-purpose skills.
500–1,024 characters: appropriate for broad skills with multiple output types and necessary exclusions.
Above 1,024 characters: silently truncated, the surplus never reaches Claude.

Three examples at different lengths:

# Too short (38 chars) — no trigger conditions, no exclusions
description: "Creates written content."

# Right length (242 chars) — clear trigger, key output types, explicit exclusions
description: "Use this skill when the user asks to draft, write, or revise any written content, including blog posts, emails, or social media posts. Does NOT apply to code generation, data analysis, or summarizing existing documents."

# Over-engineered (580 chars) — same semantic range, unnecessary synonyms
description: "Use this skill when the user asks to write, draft, create, compose, author, produce, or generate any kind of written content including blog posts, articles, essays, emails, newsletters, reports, social media posts, LinkedIn content, Twitter threads, and long-form pieces. Does not apply to code, data tables, or analysis."

The 242-character description covers the real intent range. The 580-character version adds no semantic coverage. Claude's classifier generalizes from three well-chosen intent verbs. It doesn't need ten.

In practice, descriptions under 50 characters tend to miss activation on valid requests because the classifier has too little signal to work with — the intent verbs and output types that anchor routing aren't there. The 150–500 character range gives the classifier enough signal without asking it to parse a wall of text.

One decision rule: if your description exceeds 400 characters and still doesn't include exclusions, the skill covers too much. Split it in two.

Instruction length matters beyond the 1,024-character ceiling. The AgentIF benchmark (Qi et al., Tsinghua University / arXiv:2505.16944, 2025 — 707 human-annotated instructions across 50 real-world agentic applications) found that current LLMs follow fewer than 30% of agentic instructions perfectly, and that the perfect-instruction-following rate approaches zero when instructions exceed 6,000 words. Keeping descriptions concise is not just a style preference — it is a reliability constraint.

How do you write trigger phrases that make your skill activate reliably?

Trigger phrases are the action verbs and intent patterns that tell Claude which user requests match this skill, and writing them well means covering the full synonym set for a user's likely phrasing — not just one formulation, but the three to five natural variants a real user would reach for when making that request.

See How do I write trigger phrases that make my skill activate reliably? for a focused guide.

Four rules:

Cover intent verbs, not just keywords. "Write," "draft," "create," "generate," and "produce" express the same intent for content generation. Include 3-5 synonyms that match real user phrasing. Stop at five.

Name the output type explicitly. "Draft a blog post" and "draft a social media post" are different skills. Name the output types in the trigger: "...blog posts, emails, or social media captions."

Match real user phrasing. Users say "write me a post," not "compose a long-form content artifact." If the average request sounds different from your description, the description is wrong.

Test with actual requests. Collect 10 requests that should trigger the skill. Check each against the description semantically. Eight out of 10 matching is calibrated. Five out of 10 needs a rewrite.

Descriptions with a single intent verb miss the natural variation in how real users phrase requests. When a user says "check my code" instead of "review my code," a single-verb description may not match. Covering three to five synonyms closes that gap without bloating the character count.

No peer-reviewed benchmark directly quantifies the activation gap from single-verb versus multi-verb intent coverage. The AEM internal finding — where covering multiple synonym phrasings moved activation from 77% to 100% across 650 trials — is the best available evidence for this specific claim. The pattern is consistent with general NLU design practice, which treats synonym coverage as a standard requirement, but the activation-gap figure itself is AEM's own measurement.

# Weak — misses common phrasings
description: "Use this skill when reviewing code."

# Strong — covers the full trigger intent
description: "Use this skill when the user asks to review, check, inspect, or audit code, or asks for feedback on their code, a PR, or a pull request."

What are negative triggers and when do you need them?

Negative triggers tell Claude when NOT to fire the skill — they are exclusion clauses added to the description that prevent false positives on near-miss requests, the category of request that shares the same domain as your skill but has a different intent, such as "summarize this article" hitting a writing skill designed for original drafting.

See What are negative triggers and why should I include them in the description? for the full breakdown.

Every description without exclusions has false positive risk. A content writing skill without exclusions fires on "summarize this article." A code review skill without exclusions fires on "explain this code to me." Both are near-miss requests: same category, wrong intent.

The format is a "Does NOT apply to" clause at the end:

description: "Use this skill when the user asks to write, draft, or revise any written content, including blog posts, emails, and reports. Does NOT apply to summarizing existing content, generating code, or editing for grammar only."

Three signals that mean you need negative triggers:

The skill fires on requests it clearly shouldn't handle
The skill covers a broad category where adjacent request types are frequent
Two skills in the same session cover related use cases and compete for the same prompts

Negative triggers aren't always needed. A narrow skill ("Use when the user asks to convert a CSV to JSON") has low false positive risk because the intent is specific. A broad skill ("Use for any writing task") creates a false positive problem by design.

In commission audits, broad descriptions without exclusion clauses consistently produce false positives on near-miss requests. A writing skill without a "Does NOT apply to" clause regularly fires on summarization requests. Adding two or three explicit exclusions is the difference between a skill that fires on demand and one that fires on everything.

This pattern handles near-miss cases, not genuinely ambiguous requests. When a user's intent is genuinely unclear, the skill body's trigger condition block handles it after the skill fires.

Research on LLM routing systems finds that uncalibrated routers exhibit measurable intent classification gaps that propagate into unstable routing decisions; after fine-tuning, routing accuracy reaches 97.86–97.93%, implying a baseline error rate of roughly 2–15% before calibration depending on model and task complexity (arXiv:2603.12933, 2025). Incorrect tool selection is a recognised and measurable problem: the ToolBench benchmark (Qin et al., ICLR 2024) introduced a Wrong-Tool-Avoidance metric specifically to score false positive tool calls as zero. The same dynamic applies to skill descriptions — an under-specified exclusion clause is the equivalent of an uncalibrated router.

What's the difference between a "pushy" and "conservative" description?

Pushy descriptions trigger on loosely related requests, conservative descriptions trigger on almost nothing, and both are production failures in opposite directions: a pushy description runs your skill on prompts it can't handle well, a conservative description causes you to miss requests your skill was built for.

See What's the difference between a "pushy" and "conservative" description? for the extended treatment.

Pushy:

description: "Use this skill for any content, writing, communication, or text-related request."

This fires on "summarize this PDF," "fix this typo," "what does this email mean?" and occasionally on code-adjacent requests. A prompt in a trenchcoat. Broad enough to look like a skill, but really just triggering on everything.

Conservative:

description: "Use this skill when the user asks to write a LinkedIn post about their recent startup experience."

Fires on almost nothing. Directly relevant requests get missed because the phrasing has to align precisely.

Calibrated:

description: "Use this skill when the user asks to write, draft, or create a LinkedIn post, social media update, or professional announcement. Does NOT apply to editing existing posts, writing direct messages, or generating non-social content."

Getting to calibrated requires knowing the actual distribution of user requests: what fires correctly, what doesn't, and where the near-miss cases cluster. The fastest path is collecting 20 real requests from production and checking each against the description. The pattern becomes clear quickly.

The MAST taxonomy of multi-agent system failures (Cemri et al., arXiv:2503.13657, 2025) identifies "Disobey Task Specification" as 15.7% of failures — a category that covers both over-triggering (a skill fires when it should not) and under-triggering (a skill fails to fire when it should). The same study found Step Repetition at 13.2%, which is consistent with a skill activating on requests it cannot handle well and looping. No published study quantifies over-triggering as an isolated metric, but the task-specification failure category is the closest available proxy.

What breaks descriptions silently?

Four failure modes stop descriptions working without producing any error output — multi-line formatting from code formatters, character limit truncation on descriptions over 1,024 characters, passive drift during routine edits, and system prompt budget overflow when total loaded descriptions exceed Claude's system prompt budget — and none produce a warning when they occur:

Multi-line formatting from a code formatter
Character limit truncation on long descriptions
Passive drift introduced during routine edits
System prompt budget overflow when total loaded skill descriptions exceed the budget Claude allocates to the system prompt

Multi-line formatting. If a code formatter like Prettier breaks the description value onto multiple lines, the YAML parser reads only the first line. The rest becomes a silent parse error. Trigger coverage drops without warning. Fix: add SKILL.md files to your .prettierignore.

Character limit truncation. Descriptions over 1,024 characters get cut at the limit. If your trigger conditions or exclusions appear in the second half of a 1,400-character description, they don't reach Claude at runtime. Count characters for long descriptions before shipping them.

Passive drift through edits. Skills get revised over time. An imperative description can drift to passive when someone "cleans up" the opening sentence. Check the opening construction after every edit.

System prompt budget overflow. When total loaded skill descriptions exceed the system prompt budget (~15,000 characters), some descriptions get dropped. The symptom: a skill that triggered reliably stops triggering after new skills are added. Audit total description length across the full skill library when this happens.

System prompt budget overflow is the least obvious of the four failure modes because it is load-dependent: a skill library that works fine at 10 skills can start dropping descriptions silently once the library grows. The ~15,000-character budget runs out faster than expected if descriptions are verbose.

The system prompt budget constraint has broader research backing than most practitioners realise. AgentIF (arXiv:2505.16944, 2025) found that instruction-following rates approach zero when total instruction length exceeds 6,000 words — and average real-world agentic instructions already run to 1,723 words with 11.9 constraints each. Chroma's Context Rot study (Hong, Troynikov, Huber, July 2025 — 18 models tested) found that performance degrades measurably as input context grows, driven by three compounding mechanisms: lost-in-the-middle attention, attention dilution scaling quadratically with token count, and distractor interference. A skill library that hits the system prompt budget ceiling is exposing all three.

"Developers don't adopt AI tools because they're impressive — they adopt them because they reduce friction on tasks they repeat every day." — Marc Bara, AI product consultant (2024)

76% of developers are now using or planning to use AI tools in their workflow, up from 70% the previous year (Stack Overflow Developer Survey, 2024). A skill that fires inconsistently creates friction instead of removing it. Silent description failures are the most common source.

How do you test whether your description is working?

Three tests each take under five minutes and together cover the full activation range: positive sampling confirms the skill fires on requests it should handle, negative sampling confirms it doesn't fire on near-miss requests, and live activation in a real Claude Code session confirms the routing works end to end.

Test 1: Positive sampling. Write 10 requests that should trigger the skill. Check each against the description semantically. Eight out of 10 matching is calibrated. Under 6 needs a rewrite.

Test 2: Negative sampling. Write 5 requests that should NOT trigger the skill but live in the same category. Check each: does the description clearly exclude them? Four out of 5 clearly excluded means negative triggers are working.

Test 3: Live activation. Open a Claude Code session with the skill loaded. Send matching requests. If the skill fires, it works. If it doesn't, the description isn't routing correctly. Rewrite the description before changing anything else.

In commission reviews, the majority of skills that fail positive sampling on the first test pass do so because the description covers one verb form when users reach for several. Insufficient synonym coverage is the most common cause — not missing exclusions, which tend to surface later in negative sampling.

The stakes are real: a Salesforce study (2025) found that AI agents complete only 30–35% of multi-step tasks successfully when tested against real-world office scenarios. Poorly scoped instructions are a consistent contributing factor. Reasoning models outperformed non-reasoning counterparts by 12–20 percentage points in that study — a gap that narrows considerably when the routing layer is working correctly.

In our builds, the trigger condition is the first thing we test. Not the output quality. Not the reference files. The trigger. A skill that doesn't activate is a deliverable that doesn't exist yet.

Most production agents aren't tested as rigorously as the three-test protocol above. The LangChain State of AI Agents Report (2024, 1,300+ respondents) found that only 52.4% of organisations run offline evaluations on test sets for their AI agents, and human review (59.8%) remains the most common evaluation approach. Running the three trigger tests above puts you in the more careful half of the field.

What's the right structure for a description that covers all four requirements?

A well-formed description covers four requirements: an imperative construction opening with "Use this skill when," explicit trigger verb synonyms and output type names that match real user phrasing, and a "Does NOT apply to" exclusion clause — producing a routing signal Claude can classify without ambiguity.

Imperative construction — opens with "Use this skill when"
Trigger conditions — the action verbs and intent patterns that match the skill
Output type names — the specific deliverable the skill produces
Exclusions — a "Does NOT apply to" clause naming near-miss categories

A description that covers all four follows a single template: open with "Use this skill when," name the trigger verb synonyms and output types, then close with the exclusion clause.

Use this skill when [trigger verb synonyms] [output type or action]. Does NOT apply to [exclusion 1], [exclusion 2], or [exclusion 3].

Applied to a content skill:

description: "Use this skill when the user asks to write, draft, or create any blog post, article, email, or social media post. Does NOT apply to summarizing, editing existing content, or generating code."

Applied to a code review skill:

description: "Use this skill when the user asks to review, check, or audit code, or requests feedback on a PR or pull request. Does NOT apply to explaining code, writing new code, or debugging errors."

Both are under 250 characters. Both are imperative. Both cover the intent synonym range. Both name the exclusions. Both stay well under 1,024 characters.

This structure works for the majority of skills. Skills with multiple distinct modes or highly overlapping categories need additional specificity, but the template above handles 80% of cases in production.

Skills that deviate from this template typically do so because they cover two genuinely different modes — a skill that both drafts and edits content, for example. In those cases, extending the template with a secondary clause handles the overlap without requiring a full split into two skills.

The reliability gain from structured templates is measurable. Meta researchers found that a structured prompting approach achieves 93% accuracy on code review tasks — a nine-percentage-point improvement over standard agentic reasoning (VentureBeat, 2024, reporting Meta AI semi-formal reasoning research). A 2024 arXiv study on prompt formatting ("Does Prompt Formatting Have Any Impact on LLM Performance?", arXiv:2411.10541) confirmed that format choices produce substantial performance variations across models. The four-component description template applies the same principle: a fixed structure reduces the ambiguity that causes routing failures.

FAQ

Why does my Claude Code skill only trigger sometimes? The most common cause is a passive description construction. Replace "This skill helps with..." with "Use this skill when..." and test again with 10 real requests. If that doesn't fix it, check whether a code formatter has broken the description onto multiple lines in the YAML.

What happens if my SKILL.md description is longer than 1,024 characters? Claude Code truncates it silently at 1,024 characters. Trigger conditions or exclusions that appear after that point don't exist at runtime. Count characters if your description is long.

How do I stop my skill from triggering when it shouldn't? Add a "Does NOT apply to" clause listing the specific request types causing false positives. Two or three precise exclusions handle most near-miss cases. Negative triggers work better than trying to narrow the positive trigger because they name the problem directly.

Prettier keeps breaking my skill description onto multiple lines. How do I fix this? Add **/SKILL.md to your .prettierignore file. Alternatively, wrap the description value in single quotes in the YAML frontmatter. Most formatters skip single-quoted strings.

How do I make my skill trigger 100% of the time on matched requests? Three things: use an imperative construction ("Use this skill when..."), include 3-5 synonym phrasings for the user intent, and test against 10 real requests. In AEM's testing, those three changes bring calibrated descriptions to 100% activation on matched prompts.

Can I use the same description for two similar skills? No. Near-duplicate descriptions cause Claude to pick unpredictably between the two skills, or skip both. Each skill needs a description that makes the selection unambiguous. If two skills seem identical in their descriptions, they're probably the same skill.

What's the difference between the description field and a trigger condition block inside the skill body? The description field controls routing. Claude reads it before the skill fires to decide whether to activate it. A trigger condition block inside the skill body runs after the skill fires and handles edge cases within the skill's scope. The description determines whether the skill runs at all. A detailed trigger condition block inside a skill with a broken description does nothing.

Should I use first, second, or third person in my skill description? Write the trigger clause in second person or imperative: "Use this skill when..." not "I use this skill when..." For the description of what the skill does, use present tense third person: "...the user asks to write..." Keep it consistent within the field.

Last updated: 2026-04-14