Claude Code skill discovery is a semantic intent classifier, not a keyword search. When a user sends a prompt, Claude does not scan skill names for a match. It evaluates whether the semantic intent of the incoming prompt matches the behavioral intent described in each skill's description field. The distinction matters for how you write descriptions. This classifier architecture is the foundation of AEM's skill-engineering methodology.
TL;DR: Claude Code's skill discovery mechanism treats each skill's description as a behavioral contract and matches incoming prompts against that contract using semantic intent analysis. Imperative descriptions in the active-voice pattern ("Analyze X when the user asks for Y") achieve 100% activation on matching prompts in controlled trials. Passive descriptions achieve 77%.
How Does Claude Decide Which Skill to Use for a Given Prompt?
Claude decides which skill to activate by evaluating semantic intent alignment between the incoming prompt and each loaded skill description. Descriptions that specify behavioral triggers, explicit conditions, and named scope produce the strongest alignment signal. The classifier runs this evaluation on every user turn, not just session start, matching prompt intent to the behavioral contract in each description.
Claude Code loads all installed skill descriptions into the system prompt at session startup. Each skill occupies approximately 100 tokens in context at that point, covering the name, description, and associated metadata (source: Claude Code skill engineering analysis, mellanon, 2026). The total character budget for all skill descriptions in a session is approximately 15,000 characters, scaling dynamically at 1% of the active context window (source: Claude Code documentation, 2026). When a user prompt arrives, the classifier runs against these loaded descriptions to determine whether any skill's behavioral contract matches the prompt's intent.
The classifier is not doing keyword matching. It evaluates intent alignment. A prompt that says "review my PR for issues" can match a skill described as "Evaluates code changes for quality, test coverage, and logical errors" even though no word in the prompt appears in the description. The semantic meaning of "review for issues" aligns with "evaluates for quality and logical errors."
This is also why description quality outweighs skill name quality. Claude is not reading "code-reviewer" and thinking "ah yes, exactly what I need." It is running intent alignment on the full description text, and the description is where the signal lives.
What Is the "Meta-Tool Classifier" and How Does It Operate?
The classifier is a tool-selection mechanism layered on top of Claude's normal token prediction. Claude Code exposes each skill as a callable tool with its description as the tool specification. The classifier's job: determine whether the current task matches any available tool specification closely enough to invoke it.
Four phases in the classification decision (sketched in code after the list):
- Intent extraction — Claude analyzes the user's prompt for its core intent: what is being asked, in what context, for what purpose
- Description matching — That intent is compared against each loaded skill description for semantic alignment
- Threshold evaluation — If alignment exceeds a confidence threshold, the skill is activated; below the threshold, Claude proceeds without skill invocation
- Conflict resolution — When two or more skills reach threshold simultaneously, the skill with the higher-confidence match wins
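A minimal sketch of that flow, under stated assumptions: Claude Code does not expose its classifier, so the function names, the 0.7 threshold, and the data shapes below are illustrative rather than the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    description: str

def semantic_alignment(intent: str, description: str) -> float:
    """Placeholder for the model's intent-alignment judgment (0.0 to 1.0).
    In Claude Code this happens inside token prediction, not a separate API."""
    raise NotImplementedError("conceptual stand-in only")

def select_skill(prompt: str, skills: list[Skill], threshold: float = 0.7) -> Skill | None:
    # Phase 1: intent extraction (the raw prompt stands in for the extracted intent)
    intent = prompt
    # Phase 2: description matching against every loaded skill description
    scored = [(semantic_alignment(intent, s.description), s) for s in skills]
    # Phase 3: threshold evaluation; below threshold, no skill is invoked
    candidates = [(score, s) for score, s in scored if score >= threshold]
    if not candidates:
        return None  # Claude proceeds from base capabilities, silently
    # Phase 4: conflict resolution; the highest-confidence match wins
    return max(candidates, key=lambda pair: pair[0])[1]
```

The silent `return None` path is the same non-activation behavior described in the FAQ below: no error, no notification, just an answer from base capabilities.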
The confidence threshold is the main cause of under-triggering. A description written in vague or passive language produces a weak signal: the classifier finds alignment below the threshold and skips the skill.
"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, TypeScript compiler team, Anthropic (2024)
That observation maps directly to description design. A closed behavioral spec produces strong classifier signal. An open suggestion does not. Anthropic's own testing of tool use examples shows that adding concrete behavioral specification to tool definitions improved accuracy on complex parameter handling from 72% to 90% (source: Anthropic engineering blog, "Introducing advanced tool use," 2025). The mechanism behind that jump is the same: specificity collapses the classifier's decision space.
What Linguistic Patterns Trigger the Classifier Best?
Imperative construction, explicit trigger conditions, named scope, and negative exclusions are the four patterns that produce high classifier activation. Testing across 650 activation trials revealed a measurable gap between description styles: imperative descriptions achieved 100% activation on matching prompts, while passive descriptions achieved 77% (source: AEM research synthesis, 2026).
The patterns that produce high activation (combined into a single example after the list):
- Imperative construction: "Runs X analysis when the user asks for Y" > "A skill that can help with X when needed"
- Explicit trigger conditions: "Activate when the user mentions code review, PR review, or asks to check code quality" > "Helps with code quality work"
- Named scope: "Applies to Python files in the repository, not to documentation or config files" > "Useful for various code tasks"
- Negative exclusions: "Does NOT activate for architecture questions, roadmap discussions, or high-level design conversations" — these prevent false positives as the library grows
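Combined, the four patterns yield a description like the one below. The skill, its wording, and the length check are invented for illustration, reusing phrasing from this article rather than a real production library.

```python
# Two illustrative descriptions for the same hypothetical code-review skill.
# The wording is invented; only the contrast in pattern is the point.

PASSIVE_DESCRIPTION = (
    "A skill that can help with code quality work when needed."
)

IMPERATIVE_DESCRIPTION = (
    "Evaluates code changes for quality, test coverage, and logical errors. "
    "Activate when the user mentions code review, PR review, or asks to check "
    "code quality. Applies to source files in the repository, not to "
    "documentation or config files. Does NOT activate for architecture "
    "questions, roadmap discussions, or high-level design conversations."
)

# Rough length check against the ranges discussed in the length section below:
# under 300 characters underperforms, over 700 yields diminishing returns.
assert 300 < len(IMPERATIVE_DESCRIPTION) < 700
```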
We built three identical skills with only the description varying: passive voice, active voice, and imperative with explicit triggers. At 5 active skills in the library, all three activated correctly on 90%+ of matching prompts. At 25 skills, the passive version had dropped to 68%. Imperative held at 97%.
Description design is not a stylistic preference. At scale, it is the primary control variable for activation reliability.
What Does the Classifier NOT Do?
The classifier does not read the full skill body, does not learn from past sessions, does not weight usage frequency, and cannot activate reliably from an empty description. These are architectural boundaries, not gaps. Each one shapes how you write descriptions for a library of any size.
The mechanism does not:
- Read the SKILL.md body during discovery. Only the description field is loaded at startup. The full instruction body stays on disk until after the classifier decides to activate the skill. This is the progressive disclosure architecture: metadata-only at startup, body on activation (sketched below).
- Learn from past activations. Each session starts cold. The classifier has no memory of which skills have been used before in other sessions.
- Weight recency or frequency. A skill used 200 times has no advantage over one installed yesterday. Discovery accuracy depends entirely on description quality.
- Handle empty or one-line descriptions reliably. A missing description reduces to zero signal. The classifier has nothing to match against and the skill does not activate automatically. It can still be invoked via slash command, but automatic triggering requires a description.
These boundaries apply to skills installed at the project and user level before session start. For MCP-provided tools added mid-session, the classifier re-evaluates at tool registration time, not only at startup.
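The progressive disclosure boundary can be sketched as a minimal loader. This is a conceptual illustration, assuming a standard SKILL.md layout with YAML frontmatter delimited by `---` lines; it is not Claude Code's actual loading code.

```python
from pathlib import Path

def load_skill_metadata(skill_dir: Path) -> str:
    """Session start: return only the frontmatter block (name, description)."""
    text = (skill_dir / "SKILL.md").read_text()
    # Frontmatter sits between the first two '---' lines.
    return text.split("---", 2)[1]

def load_skill_body(skill_dir: Path) -> str:
    """After activation: return the full instruction body below the frontmatter."""
    text = (skill_dir / "SKILL.md").read_text()
    return text.split("---", 2)[2]
```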
How Does Description Length Affect Classifier Performance?
The combined description and when_to_use text is capped at 1,536 characters per skill in the skill listing (source: Claude Code documentation, 2026). That ceiling is not a target. Descriptions below 300 characters consistently underperform because they provide insufficient behavioral specification for strong intent alignment. Descriptions above 700 characters provide diminishing returns while consuming more of the total skill metadata budget.
The functional sweet spot in production builds sits between 400 and 600 characters (source: AEM production skill library analysis, 2026). That range provides enough behavioral specification for reliable classification while leaving budget headroom for a library of 20-30 skills. At approximately 109 characters of overhead per skill plus the description text, a 15,000-character total budget supports roughly 40 skills at average description length before truncation risk (source: skill engineering analysis, mellanon, 2026).
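To make the budget arithmetic concrete, here is a quick calculation. The constants restate the figures cited above; the helper function itself is illustrative.

```python
OVERHEAD_PER_SKILL = 109   # characters of per-skill metadata overhead (cited above)
TOTAL_BUDGET = 15_000      # characters, roughly 1% of the active context window

def max_skills(avg_description_chars: int) -> int:
    """How many skills fit in the budget at a given average description length."""
    return TOTAL_BUDGET // (OVERHEAD_PER_SKILL + avg_description_chars)

print(max_skills(265))  # 40 -> the "roughly 40 skills" figure implies a ~265-char average
print(max_skills(500))  # 24 -> at the 400-600 character sweet spot, the budget holds fewer
```

A library written entirely at the sweet spot therefore fits roughly 24-29 skills before truncation risk; the 40-skill figure assumes many shorter descriptions in the mix.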
For the specific rules on what goes into a high-performing description, see Why Is the Description the Highest-Leverage Element of Skill Design? and What Did Testing of 650 Activation Trials Reveal About Directive vs Passive Description Styles?. For the relationship between description length and the system prompt budget, see At What Skill Count Does Claude's Performance Actually Degrade?.
Frequently Asked Questions
The seven questions below address the classifier's runtime behavior, conflict resolution, and edge cases that matter in production skill libraries. The classifier is stateless: it runs on every turn, resolves conflicts to a single confidence winner, and has no memory of which skills have run before.
Does Claude re-run the skill discovery classifier mid-conversation, or only at the start?
The classifier runs on each new user turn, not only at session start. Every prompt is evaluated against the loaded skill descriptions. This means a skill can activate mid-conversation when the topic shifts to match its description, even if the first several turns were unrelated.
Can two skills activate simultaneously for the same prompt?
No. The classifier resolves conflicts to a single winner. When two skills both reach the activation threshold, the one with higher confidence alignment wins. If confidence is equal, the skill installed earlier in the configuration takes priority. This is why description collision between similar skills causes activation inconsistency: the winner switches unpredictably as prompt wording varies.
Does the skill name affect discovery at all, or is it purely the description?
The name contributes a small signal. Skill names appear as part of the tool specification alongside the description, and semantically meaningful names (gerund form: "analyzing-contracts" vs "helper") provide marginal alignment signal. In practice, a strong description with a weak name outperforms a strong name with a weak description by a significant margin. Optimize description first.
What happens when no skill reaches the activation threshold?
Claude proceeds without invoking any skill and answers from its base capabilities and any CLAUDE.md context loaded for the project. No error or notification is generated. The absence of skill activation is invisible to the user. Research on LLM agent reliability finds that ambiguous or underspecified instructions cause tool invocation failures in 23.9% of test cases even for the best-performing models (source: Identifying the Risks of LM Agents, ICLR 2024). Silent non-activation is the most common symptom of that failure mode in skill libraries.
Can I see the confidence score for a skill activation?
Not directly from Claude Code's public interface. In testing sessions, you can ask Claude explicitly "which skill did you activate and why?" after a task completes. The reasoning traces in that response reflect the classifier's interpretation, though not the raw confidence scores.
How does the classifier handle prompts in non-English languages?
The semantic intent classifier operates on meaning rather than surface words, so it maintains reasonable accuracy for prompts in major languages even when the skill description is written in English. Cross-lingual accuracy is lower than monolingual matching. Multilingual benchmarks show performance gaps of up to 30 percentage points between English and lower-resource languages on semantic tasks (source: MMLU-ProX, Shi et al., 2025). For multilingual teams, writing trigger conditions explicitly in both English and the target language in the description improves reliability.
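As an illustration, the trigger conditions can simply be stated twice, once per language. The wording below is invented (Spanish chosen arbitrarily); only the duplicated trigger pattern matters.

```python
# Hypothetical bilingual description: trigger conditions stated in English and
# Spanish so prompts in either language align strongly with the same contract.
BILINGUAL_DESCRIPTION = (
    "Evaluates code changes for quality, test coverage, and logical errors. "
    "Activate when the user asks to review a pull request or check code quality. "
    "Actívalo cuando el usuario pida revisar un pull request o comprobar la "
    "calidad del código."
)
```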
Does adding more trigger phrases to a description always improve activation?
No. Above approximately 5-6 explicit trigger phrases, the description becomes cluttered and the classifier's confidence on any single phrase weakens. The goal is specificity and clarity, not coverage. "Activates when the user asks to review, audit, or check a pull request for quality" is cleaner than listing 12 synonyms for code review.
Last updated: 2026-05-04