---
title: "Pushy vs Conservative Skill Description: How to Tell the Difference"
primary_keyword: "pushy vs conservative skill description"
description: "A pushy Claude Code skill description fires on everything; a conservative one fires on nothing. Here's how to tell which failure you have and how to fix it."
pubDate: "2026-04-14"
category: skills
tags: ["claude-code-skills", "skill-description", "pushy-description", "conservative-description"]
cluster: 6
cluster_name: "The Description Field"
difficulty: intermediate
source_question: "What's the difference between a 'pushy' and 'conservative' description?"
source_ref: "6.Intermediate.5"
word_count: 1460
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---

TL;DR: Pushy descriptions trigger the skill on loosely related requests by using broad category language without exclusions. Conservative descriptions trigger on almost nothing by using trigger phrases so specific that real requests don't match. Calibrated descriptions sit between them, naming output types and listing exclusions. Diagnosing which failure you have takes five minutes of testing.

Both failure modes produce broken skills. A pushy skill fires when it shouldn't. A conservative skill fails to fire when it should. Neither is more broken than the other. They're just broken in complementary ways, and they're fixed differently. AEM is a skill-as-a-service platform for building and deploying Claude Code skills; the pushy-conservative framework is the core calibration method we use when reviewing description quality across every skill we ship.

The pushy-conservative spectrum is the single most practical framework for diagnosing and fixing description quality. It gives you two specific things to test for, two distinct root causes, and two distinct fixes. Most description problems fall into one of these two categories. In our bar checks, pushy failures appear more often than conservative ones — broad category language is a more natural default than over-specified phrasing, which means over-firing is the more common starting point. (AEM bar checks, 2026) Independent benchmarking corroborates this: curated, well-specified skills improve task success rates by 16.2 percentage points on average compared to broad or under-specified equivalents. (SkillsBench, 2026)

What is a "pushy" description and what breaks it?

A pushy description over-fires: it triggers the skill on requests that are in the same general category as the skill's purpose but are clearly not the right use case. The cause is a trigger clause that names a broad category without output type specificity, a missing exclusion clause that would have filtered those near-miss requests, or both.

```yaml
# Pushy: broad category, no output types, no exclusions
description: "Use this skill for writing or content-related tasks."
```

"Writing or content-related tasks" includes blog posts, summaries, edits, grammar checks, text analysis, rewriting, translating, and a dozen other things. The skill fires on all of them. The user asking for a summary gets a draft. The user asking for grammar fixes gets an entire rewrite.

The failure is invisible until the output is wrong. Claude Code doesn't report "description too broad." The skill fires without warning, and the output is off. In our testing, the most common response to a pushy description failure is not "my description is wrong" — it's "the model misunderstood me." The description is invisible to the user; the output is not. (AEM bar checks, 2026) This pattern holds at scale: in function-calling evaluations across 1,645 APIs, models with access to precise descriptions achieved over 80% correct tool selection; models without accurate description context dropped to near 50% — a 30+ point accuracy gap driven by description quality alone. (Patil et al., Gorilla LLM, arXiv 2023)

Three patterns that cause pushy descriptions:

  1. Category language without output type: "content tasks," "writing work," "code-related requests"
  2. No exclusion clause on a broad skill
  3. Synonym padding that extends the trigger range unintentionally: "write, draft, create, compose, build, produce, generate content" catches requests the simpler list wouldn't
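All three patterns are visible in the description string itself. As a rough pre-review aid, they can be checked statically. This is a sketch, not how Claude Code matches descriptions (activation is semantic, not keyword-based), so treat it as a wording lint, not an activation predictor:

```python
import re

# Heuristic lint for the three pushy patterns above. Illustrative only:
# Claude Code matches descriptions semantically, so this flags wording
# smells before review; it does not predict activation.

INTENT_VERBS = {"write", "draft", "create", "compose", "build",
                "produce", "generate", "make"}
CATEGORY_PHRASES = ("writing or content", "content-related", "content task",
                    "writing work", "code-related", "any writing")

def pushy_signals(description: str) -> list[str]:
    d = description.lower()
    tokens = set(re.findall(r"[a-z]+", d))
    signals = []
    if any(phrase in d for phrase in CATEGORY_PHRASES):
        signals.append("broad category language")   # pattern 1
    if "does not apply" not in d:
        signals.append("no exclusion clause")       # pattern 2
    if len(INTENT_VERBS & tokens) > 5:
        signals.append("synonym padding")           # pattern 3
    return signals
```

Run against the pushy example above, this flags the first two patterns; a description with an exclusion clause and a short verb list comes back clean.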

What is a "conservative" description and what breaks it?

A conservative description under-fires: it misses matched requests because the trigger phrases are written in formal or over-specified language that doesn't align with how users actually phrase their requests. The description names the use case precisely as a developer would define it, not as a user would ask for it.

```yaml
# Conservative: specific phrasing that real requests don't match
description: "Use this skill when the user explicitly asks to create a written LinkedIn update about professional milestones or achievements."
```

A user who types "write me a LinkedIn post about my promotion" doesn't get this skill. Neither does "can you draft something for LinkedIn?" or "I need a post about my new job." None of those phrasings match "explicitly asks to create a written LinkedIn update about professional milestones or achievements."

Three patterns that cause conservative descriptions:

  1. Formal or technical phrasing that doesn't match how users talk
  2. Over-specific output type names: "written LinkedIn update about professional milestones" instead of "LinkedIn post"
  3. Requiring exact phrasing alignment when semantic alignment would work fine

The failure mode is subtle. The skill never breaks. It just never runs. Users try it a few times, get no response, and stop using it. In our bar checks, conservative descriptions typically activate on fewer than half of their clearly-matched test requests — the phrasing gap between how the skill was written and how users naturally ask for it accounts for most of the missed activations. (AEM bar checks, 2026) The description field allows up to 1,024 characters, which means conservative descriptions fail not from length constraints but from phrasing specificity: there is ample budget for intent synonyms and natural-language coverage, yet most conservative descriptions use fewer than 150 characters with no synonym range. (Anthropic Claude Code docs, 2025)

"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)

A conservative description is an ambiguous instruction from a different angle: it technically specifies something, but specifies it so narrowly that the real use case doesn't fit.

What does a calibrated description look like?

A calibrated description triggers on all matched requests, skips near-miss requests, and stays within the character budget. It combines an imperative trigger clause covering the real intent synonym range, explicit output type names that match how users phrase requests, and a "Does NOT apply to" exclusion clause for the top near-miss categories. It covers three things:

  1. An imperative trigger clause with 3-5 intent synonyms
  2. Explicit output type names that match how users describe their requests
  3. A "Does NOT apply to" clause covering the top 2-3 near-miss categories

```yaml
# Calibrated: imperative, covers intent synonym range, output types named, exclusions present
description: "Use this skill when the user asks to write, draft, or create a LinkedIn post, social media update, or professional announcement. Does NOT apply to editing existing posts, writing direct messages, or generating non-social written content."
```

This description:

  • Fires on "write me a LinkedIn post" ✓
  • Fires on "draft something for my social media" ✓
  • Fires on "I need to announce a promotion" ✓
  • Does NOT fire on "fix the grammar in this post" ✓
  • Does NOT fire on "write me an email" ✓

Three intent verbs. Three output types. Three exclusions. Under 250 characters, comfortably inside the budget. Semantically complete. In our bar checks, well-calibrated descriptions tend to cluster in the 200-280 character range — long enough to cover the intent synonym range and include an exclusion clause, short enough to stay readable in the skill selector. (AEM bar checks, 2026) Across large-scale tool selection benchmarks, models selecting from sets of poorly described tools achieve roughly 49% correct selection; descriptions with explicit output types and exclusion clauses push that figure significantly higher. (Qin et al., ToolBench, arXiv 2023)
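These calibration properties can be checked mechanically. A minimal sketch — the 1,024-character limit comes from the Claude Code docs; the 200-280 band is the AEM review observation cited above:

```python
import re

description = (
    "Use this skill when the user asks to write, draft, or create a "
    "LinkedIn post, social media update, or professional announcement. "
    "Does NOT apply to editing existing posts, writing direct messages, "
    "or generating non-social written content."
)

# Hard limit from the docs; the 200-280 band is where well-calibrated
# descriptions cluster in review data.
assert len(description) <= 1024
assert 200 <= len(description) <= 280

# Exclusion clause present.
assert "Does NOT apply to" in description

# Intent verbs within the 3-5 range.
tokens = set(re.findall(r"[a-z]+", description.lower()))
assert 3 <= len({"write", "draft", "create"} & tokens) <= 5
```

None of these checks replaces semantic testing; they only confirm the structural parts of calibration are in place.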

Getting to calibrated requires testing. In our bar checks, we test every description against three sets of requests: clearly matching, clearly non-matching, and borderline. Our consistent observation is that pushy failures show up first — borderline activations are easier to spot than missed activations, because a wrong output is visible where a silent miss is not. Conservative failures tend to be caught only when someone notices the skill isn't running. The test takes five minutes and catches both failure modes before the skill ships. (AEM bar checks, 2026)

How do you diagnose which failure mode you have?

The diagnostic is a 10-request test: write requests across three categories (clearly matched, clearly non-matched, and borderline), then check each against the description semantically. Borderline requests firing means pushy; clearly matched requests not firing means conservative; only the right requests firing means calibrated.

Write 10 requests across three categories:

  • 4 clearly matched requests (the skill should fire)
  • 3 clearly non-matched requests (the skill should not fire)
  • 3 borderline requests (in the same category but different intent)

Check each against the description semantically.

If the borderline requests match: The description is pushy. The trigger clause is too broad or missing exclusions.

If the clearly matched requests don't match: The description is conservative. The trigger phrases are too specific or don't reflect real user phrasing.

If matched requests match and borderline requests don't: The description is calibrated.

This test takes five minutes. It's faster than fixing production failures after the skill is deployed. In our bar checks, the skills that fail the 10-request diagnostic at review almost always show the same patterns: borderline requests activating on pushy descriptions, and formal-phrasing mismatches blocking activation on conservative ones. Neither failure is subtle once you're testing for it. (AEM bar checks, 2026) Prompt specification quality is the dominant variable in production routing accuracy: automatically augmenting underspecified instructions with contextual rewrites improves task correctness by 27% — a finding that reinforces why the 10-request test is worth running before shipping rather than after. (AutoPrompter, arXiv 2025)
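The bookkeeping of the 10-request test is easy to script. In the sketch below, `matcher` is a crude keyword stand-in for the semantic check Claude Code actually performs (in practice you run the real requests in a session); the classification logic is the part that carries over:

```python
# Sketch of the 10-request diagnostic. `matcher` stands in for semantic
# activation; real testing means running the requests against the skill.

def diagnose(matcher, matched, non_matched, borderline):
    failures = []
    if any(matcher(r) for r in borderline + non_matched):
        failures.append("pushy")          # fires where it shouldn't
    if any(not matcher(r) for r in matched):
        failures.append("conservative")   # misses where it should fire
    return failures or ["calibrated"]

matched = [
    "write me a LinkedIn post about my promotion",
    "draft something for LinkedIn",
    "can you draft a post about my new job",
    "write an announcement post for work",
]
non_matched = ["debug my Python script", "what's on my calendar",
               "summarize this meeting transcript"]
borderline = ["fix the grammar in this post", "translate this post to French",
              "analyze the tone of this post"]

# Broad, category-style matcher: over-fires on the borderline set.
broad = lambda r: any(w in r.lower() for w in ("write", "draft", "post"))
# Over-specified matcher: under-fires on everything.
narrow = lambda r: "written linkedin update about professional milestones" in r.lower()

print(diagnose(broad, matched, non_matched, borderline))   # ['pushy']
print(diagnose(narrow, matched, non_matched, borderline))  # ['conservative']
```

Note that a description can fail both ways at once, which is why `diagnose` returns a list rather than a single label.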

How do you fix a pushy description?

Two steps: first, add output type specificity to the trigger clause so the description names what the skill produces rather than a broad category; second, add an explicit "Does NOT apply to" exclusion clause covering the top two or three near-miss request types that currently trigger the skill.

Step 1: Add output type specificity to the trigger clause. Replace category language with named output types.

```yaml
# Before: pushy
description: "Use this skill for any writing or content task."
```

```yaml
# After: output type added
description: "Use this skill when the user asks to write, draft, or create a blog post, article, or newsletter."
```

Step 2: Add a "Does NOT apply to" exclusion clause covering the top near-miss categories. For content skills: summarizing, editing, analyzing. For code skills: explaining, debugging, generating tests.

```yaml
# Final: output type + exclusions
description: "Use this skill when the user asks to write, draft, or create a blog post, article, or newsletter. Does NOT apply to summarizing existing content, editing for grammar, or generating code."
```
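If you apply this fix across several skills, the clause-appending and budget check can be wrapped in a small helper. Everything here is hypothetical scaffolding (the function name and structure are illustrative, not part of any toolchain); only the 1,024-character limit comes from the docs:

```python
# Hypothetical helper: append a "Does NOT apply to" clause to an existing
# trigger clause and confirm the result fits the 1,024-character field.

LIMIT = 1024

def with_exclusions(trigger: str, near_misses: list[str]) -> str:
    if len(near_misses) == 1:
        clause = f"Does NOT apply to {near_misses[0]}."
    else:
        clause = ("Does NOT apply to " + ", ".join(near_misses[:-1])
                  + ", or " + near_misses[-1] + ".")
    combined = trigger.rstrip() + " " + clause
    if len(combined) > LIMIT:
        raise ValueError(f"{len(combined)} chars exceeds the {LIMIT} limit")
    return combined

description = with_exclusions(
    "Use this skill when the user asks to write, draft, or create "
    "a blog post, article, or newsletter.",
    ["summarizing existing content", "editing for grammar", "generating code"],
)
# Reproduces the final description shown above.
```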

For a full guide on exclusion clauses, see What are negative triggers and why should I include them in the description?.

How do you fix a conservative description?

Two steps: first, replace formal or developer-spec phrasing with natural language that mirrors how users actually ask for the task; second, broaden the output type names so they cover the full range of real request phrasings, not just the precise term you'd use to name the feature.

Step 1: Replace formal phrasing with natural language. Write the trigger phrases as a user would phrase their request, not as a developer would name the use case.

```yaml
# Before: conservative, formal phrasing
description: "Use this skill when the user explicitly requests the creation of a written professional social media update."
```

```yaml
# After: natural phrasing
description: "Use this skill when the user asks to write, draft, or create a LinkedIn post or social media update."
```

Step 2: Broaden the output type names to match the range of real requests. "A written professional social media update" is how a product manager writes spec language. "A LinkedIn post or social media update" is how users ask for it.
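A crude way to see the gap Step 1 closes is to count how many words a real request shares with each description. Token overlap is only a proxy (actual matching is semantic), but the direction of the difference holds:

```python
import re

# Token overlap between a description and a real request: a rough proxy
# for the phrasing gap, not a model of semantic matching.

def overlap(description: str, request: str) -> int:
    d = set(re.findall(r"[a-z]+", description.lower()))
    r = set(re.findall(r"[a-z]+", request.lower()))
    return len(d & r)

formal = ("Use this skill when the user explicitly requests the creation "
          "of a written professional social media update.")
natural = ("Use this skill when the user asks to write, draft, or create "
           "a LinkedIn post or social media update.")

request = "write me a LinkedIn post about my promotion"

# The natural phrasing shares noticeably more vocabulary with the request.
assert overlap(natural, request) > overlap(formal, request)
```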

For a focused guide on trigger phrase writing, see How do I write trigger phrases that make my skill activate reliably?.

In our testing, adding a "Does NOT apply to" exclusion clause is the single most reliable fix for pushy descriptions — it handles the false positive problem directly without requiring a rewrite of the entire trigger clause. Most pushy descriptions have a working trigger clause; what they're missing is the boundary. (AEM bar checks, 2026) Practitioner research on CLAUDE.md routing quality found a 9.9% improvement in routing accuracy from description iteration alone, with no changes to skill logic — reinforcing that the description line, not the skill body, is where routing failures originate and where they are fixed. (Edmund Yong, 800-hour practitioner study, 2025)

The calibration framework works best for skills with a clear, defined use case. Skills with intentionally open-ended scope (free-form exploration, creative tools with no constraints) will always have some degree of false positives, because the use case is fuzzy by design. For those skills, the description is less a routing instruction and more a preference signal.

FAQ

How do I tell the difference between a pushy and a conservative description without testing? A pushy description names a broad category ("writing tasks," "content work") with no output type and no exclusions. A conservative description uses formal phrasing or a very specific output type name that real users wouldn't reach for naturally. Both signals are visible before testing, though testing is still the most reliable check.

Can a description be both pushy and conservative at the same time? Not on the same request type. But a description can be pushy on some request types (fires on summarization requests) and conservative on others (misses requests phrased informally). This is the worst failure mode: the skill fires on the wrong things and misses some of the right things. Fix by rewriting the trigger clause and testing across all three request categories.

Is it better to err on the side of pushy or conservative? Neither. A pushy skill produces wrong outputs. A conservative skill never runs. Pushing for calibration rather than accepting either failure mode is worth the extra five minutes of testing.

What's the fastest way to fix a pushy description? Add a "Does NOT apply to" clause naming the top three near-miss request types. This is a 50-80 character addition that resolves most false positive patterns. See The SKILL.md Description Field: The One Line That Makes or Breaks Your Skill for the full description framework.

Why do conservative descriptions feel like they should work? Because the logic seems sound: "I described exactly what the skill does, so it should activate when someone asks for that." The gap is between how you describe the skill and how users phrase requests. The description has to meet users at their phrasing, not yours.

How often should I retest a description after it's shipped? When you change the skill body significantly, when you add new skills to the session that might overlap in scope, or when you observe an unexpected activation or missed activation in production. Description calibration is a maintenance task, not a one-time setup.

Last updated: 2026-04-14