Every Claude Code skill failure happens at one of three layers: discovery (Claude doesn't know the skill exists), loading (Claude found the skill but didn't read its content), or execution (Claude read the content but didn't follow it). AEM's three-layer diagnostic framework tests them in sequence and cuts debugging time from hours to minutes.
TL;DR: Run this sequence: check /skills to test discovery, ask Claude to recite instructions to test loading, compare actual output against expected output to test execution. Stop at the first layer that fails. Fix it, re-test from the top, and don't move to the next layer until the current one passes.
Why Does Layer Order Matter When Diagnosing Skill Failures?
Layer order matters because each layer depends on the one before it: if discovery fails, loading can't run; if loading fails, execution can't be tested reliably. Testing out of sequence produces false results. Fixing a loading problem before confirming discovery adds changes to a system that may not even be reading the file you edited.
Skill failures look identical from the outside: Claude produces wrong output or no output. Without a framework, the debugging path is guesswork. You rewrite the description, test again, still wrong, rewrite the instructions, test again, still wrong, wonder if it's a Claude bug.
It isn't. The failure is at one specific layer, and the fix for each layer is entirely different. A discovery-layer fix (moving a file to the right path) does nothing for an execution-layer failure (ambiguous step instructions). Applying the wrong fix wastes time and introduces new variables. Analysis of enterprise AI agent deployments found that 88% of agent projects never reach production, with misdiagnosed failure modes as a primary driver of project stalls (Digital Applied, 2024). Gartner predicts over 40% of agentic AI projects will be canceled by 2027, citing inadequate debugging infrastructure as a primary cause (Gartner, 2024). Layer-based diagnosis is the structural fix: one layer, one test, one fix at a time.
"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, TypeScript compiler team, Anthropic (2024)
Closed-spec debugging applies the same principle: test each layer against a specific pass/fail criterion, not a vague "does it seem better." Specification failures, where agent requirements are ambiguous or underspecified, rank as the leading root cause of agent failures across production deployments (Arize AI, 2024). The three-layer test turns a subjective debugging session into a binary pass/fail sequence.
How Do I Test Whether Claude Can Discover My Skill?
Run /skills in a fresh Claude Code session. If your skill name appears in the list, discovery passed. If it doesn't appear, the file never registered: Claude has no knowledge the skill exists and cannot load or execute it. Fix the structural problem before changing any skill content.
What it tests: Does Claude know the skill exists at all?
Test: Run /skills in the Claude Code session. Your skill should appear by name in the list.
Pass: Skill is in the list. Move to Layer 2.
Fail: Skill is not in the list. The file didn't load. Fix the discovery problem before touching anything else.
Discovery failures have four causes:
- Wrong file path. SKILL.md must be at .claude/skills/[skill-name]/SKILL.md. Not skills/, not .claude/, not nested deeper.
- Broken frontmatter. Invalid YAML, missing --- delimiters, or a missing name field prevents registration.
- Wrong filename casing. skill.md and Skill.md are not found on Linux or WSL. The file must be SKILL.md.
- Session started before file was saved. Skills scan at session start. A file created after the session opened is invisible until restart.
The discovery layer fix is always structural: fix the path, fix the frontmatter, fix the casing, or restart the session. No amount of description rewriting fixes a discovery failure because the description never loads in the first place. Claude Code loads only the skill name and description at session start, capping each skill's combined description and when_to_use text at 1,536 characters to manage context budget (Anthropic Claude Code documentation, 2025). If the name doesn't appear in /skills, no other text from the file has been read.
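If you want to automate the structural part of this check, a small pre-flight script along these lines catches the first three causes before you edit any content. This is a sketch, not part of Claude Code: the project-local .claude/skills layout is the documented default, but the function name check_discovery and the lightweight frontmatter regex are illustrative assumptions (a full YAML parse would be stricter).

```python
import re
import sys
from pathlib import Path

def check_discovery(skill_name: str, project_root: str = ".") -> list[str]:
    """Return a list of discovery-layer problems; empty means the layer should pass."""
    problems: list[str] = []
    skill_dir = Path(project_root) / ".claude" / "skills" / skill_name

    # Cause 1: wrong file path. The skill directory itself must exist.
    if not skill_dir.is_dir():
        return [f"Missing directory: {skill_dir}"]

    # Cause 3: wrong filename casing. List actual names on disk, because macOS
    # resolves skill.md case-insensitively while Linux and WSL do not.
    candidates = [p.name for p in skill_dir.iterdir() if p.name.lower() == "skill.md"]
    if not candidates:
        return [f"No SKILL.md found in {skill_dir}"]
    if "SKILL.md" not in candidates:
        problems.append(f"Wrong casing: found {candidates[0]}, must be SKILL.md")

    # Cause 2: broken frontmatter. Lightweight check for the --- delimiters
    # and a name field; a real YAML parse would catch more.
    text = (skill_dir / candidates[0]).read_text(encoding="utf-8")
    block = re.match(r"---\n(.*?)\n---\n", text, re.DOTALL)
    if not block:
        problems.append("Frontmatter is missing its --- delimiters")
    elif not re.search(r"^name:\s*\S+", block.group(1), re.MULTILINE):
        problems.append("Frontmatter has no name field")

    # Cause 4 (file saved after the session started) can't be seen on disk:
    # restart the session once these checks pass.
    return problems

if __name__ == "__main__":
    issues = check_discovery(sys.argv[1] if len(sys.argv) > 1 else "my-skill")
    print("Discovery checks passed" if not issues else "\n".join(issues))
```

A clean result here doesn't prove discovery passed; /skills in a fresh session is still the authoritative test. The script just rules out the structural causes first.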
How Do I Confirm Claude Actually Read My Skill Instructions?
After invoking the skill, ask Claude to recite the steps it will follow for the task. If the answer matches your SKILL.md body accurately, loading passed. If the answer is generic, incomplete, or describes a different process, the body didn't load, even if the skill name appeared in the /skills listing during discovery.
What it tests: Did Claude read the SKILL.md body and the reference files it points to?
Test: After invoking the skill, ask: "Tell me the steps you follow for this task." Compare the response against your SKILL.md body.
Pass: Claude recites steps that match your SKILL.md content accurately. Move to Layer 3.
Fail: Claude's answer is generic, incomplete, or describes a different process entirely. The body didn't load.
Loading failures have three causes:
- SKILL.md too long. Over 500 lines, body loading becomes unreliable. Claude parses the frontmatter (which is why the skill still appears in /skills) but doesn't fully absorb the body. Anthropic's official guidance sets 500 lines as the ceiling; beyond that, move reference material to separate files (Anthropic Claude Code documentation, 2025).
- Frontmatter parsing issue that allows partial load. Some frontmatter errors let the name and description through but prevent body processing. The skill shows in /skills but executes as if no body exists.
- Reference file loading failure. If your steps depend on a reference file and that file didn't load, the steps execute without the context they assumed. The skill shows in /skills, the steps are present, but the output is wrong because required reference content is absent.
To diagnose a reference file loading failure specifically, follow the Body Test with a Reference Test: ask "Which reference files have you loaded for this skill, and what content did you draw from each?" A thin or incorrect answer to this question identifies the reference that failed to load. See How Do I Fix a Skill That Reads Reference Files in the Wrong Order? for the fix. After auto-compaction, Claude Code re-attaches invoked skills but keeps only the first 5,000 tokens of each, with a combined budget of 25,000 tokens across all re-attached skills (Anthropic Claude Code documentation, 2025). A 500-line SKILL.md that passed loading initially can appear to fail in later turns if compaction dropped its tail content.
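The line ceiling and the post-compaction cap are both easy to audit from disk. The sketch below is a rough heuristic, not a real tokenizer: it assumes roughly four characters per token and uses the 500-line and 5,000-token figures cited above, so treat its output as a flag to investigate rather than a verdict.

```python
# Rough loading-layer audit: flags a SKILL.md body that exceeds the 500-line
# ceiling and estimates how much of it survives the 5,000-token re-attachment
# cap after compaction. CHARS_PER_TOKEN is a crude approximation.
from pathlib import Path

LINE_CEILING = 500          # documented ceiling cited in this article
REATTACH_TOKEN_CAP = 5_000  # tokens kept per skill after auto-compaction
CHARS_PER_TOKEN = 4         # rough heuristic, not a real tokenizer

def audit_loading(skill_md: str) -> None:
    text = Path(skill_md).read_text(encoding="utf-8")
    lines = text.splitlines()
    est_tokens = len(text) // CHARS_PER_TOKEN

    print(f"{skill_md}: {len(lines)} lines, ~{est_tokens} tokens (estimated)")
    if len(lines) > LINE_CEILING:
        print(f"  Over {LINE_CEILING} lines: move reference material to separate files")
    if est_tokens > REATTACH_TOKEN_CAP:
        surviving = REATTACH_TOKEN_CAP * CHARS_PER_TOKEN
        tail = text[surviving:surviving + 80].strip().splitlines()
        first_dropped = tail[0] if tail else "(end of file)"
        print(f"  After compaction, content past ~{surviving} chars may be dropped,")
        print(f"  starting near: {first_dropped!r}")

if __name__ == "__main__":
    audit_loading(".claude/skills/my-skill/SKILL.md")
```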
The edge case that fools most developers: A skill that passed loading in the design session fails in fresh sessions. This is context bleed, not a true loading failure. The design session carried contextual knowledge about the skill that covered for incomplete body loading. The fix is the same (improve the body content), but the diagnosis is different. Test loading in a fresh session to avoid this false positive.
How Do I Test Whether Claude Followed the Skill Instructions?
Compare the actual output against your skill specification, field by field and constraint by constraint. If output matches the spec, execution passed. If output is wrong in a specific, reproducible way, an instruction loaded correctly but wasn't treated as binding. This is the most common failure mode in production skill engineering.
What it tests: Did Claude follow the instructions it loaded?
Test: Compare the actual output against the exact output your skill specification requires. Not "does it seem about right." Step by step, field by field, constraint by constraint.
Pass: Output matches the specification. The skill works.
Fail: Output is wrong in a specific, reproducible way. An instruction was present in loading but wasn't followed in execution.
Execution failures split into two types:
- Trigger execution failure: The skill loaded correctly but didn't trigger automatically when it should have. Claude chose a different skill or no skill. The description passed loading but failed to match the actual prompt at runtime. This looks like: "works when I type /skill-name but not when I describe the task naturally." For the specific fix, see How Do I Debug a Skill That Triggers on the Wrong Prompts?.
- Step execution failure: The skill triggered correctly, the steps loaded correctly, but Claude didn't follow specific steps precisely. The instructions loaded but weren't treated as binding. Isolate by testing one step at a time: invoke the skill, then after each step, ask Claude "Which step did you just execute and what exactly did you do?" A skipped or abbreviated step is your target.
We diagnose step execution failures by running skills against a 10-input test set (5 typical, 3 edge cases, 2 adversarial inputs that shouldn't trigger the skill). Across 10 inputs, consistent step-skipping at the same step identifies the instruction that needs to be tightened. A step that fails on 4 of 10 inputs has an instruction written as an open suggestion. A step that fails on 1 of 10 has an edge case the instruction didn't anticipate.
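One way to keep the 10-input set honest is to record it as data rather than memory. The harness below is a sketch under stated assumptions: the TestCase fields, the example prompts, and the 4-of-10 threshold are illustrative, and the triggered and failed_steps values are filled in by hand after each run, since the runs themselves happen inside Claude Code.

```python
# Minimal record-keeping for the 10-input test set: 5 typical, 3 edge,
# 2 adversarial. The tally points at the step (or the trigger) that fails
# consistently across inputs.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class TestCase:
    prompt: str
    kind: str                      # "typical", "edge", or "adversarial"
    should_trigger: bool = True
    failed_steps: list[int] = field(default_factory=list)  # recorded after the run
    triggered: bool | None = None                           # observed at runtime

def summarize(cases: list[TestCase]) -> None:
    trigger_misses = [c for c in cases if c.triggered is not None
                      and c.triggered != c.should_trigger]
    step_failures = Counter(step for c in cases for step in c.failed_steps)

    for c in trigger_misses:
        print(f"Trigger failure ({c.kind}): {c.prompt!r}")
    for step, count in step_failures.most_common():
        # Heuristic from the article: repeated failure means an open-suggestion
        # instruction; a single failure means an unanticipated edge case.
        diagnosis = ("instruction written as an open suggestion"
                     if count >= 4 else "edge case the instruction didn't anticipate")
        print(f"Step {step} failed on {count}/10 inputs: {diagnosis}")

cases = [
    TestCase("Summarize this release PR for the changelog", "typical"),
    TestCase("Changelog entry for a one-line typo fix", "edge"),
    TestCase("Explain what a changelog is", "adversarial", should_trigger=False),
    # ... remaining 7 inputs follow the 5/3/2 split
]
# After running all 10 prompts and filling in the observations, call summarize(cases).
```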
Instruction position within the skill body matters. Research on long-context LLM behavior shows models drop accuracy by 30% or more when key instructions sit in the middle of a long context, compared to placement at the start or end (Liu et al., Stanford NLP Group, "Lost in the Middle," arXiv 2307.03172, 2023). A step buried mid-file in a 400-line SKILL.md is at higher risk of being skipped than the same step placed at the top or bottom.
Which Layer Do I Test First When a Skill Fails?
Always start at discovery. It takes 30 seconds, and a failure there makes every test below it meaningless. Loading and execution tests assume discovery passed. If discovery failed, you have no confirmed path from file to context, and any fix you apply at a deeper layer is operating on an unverified assumption.
A December 2025 context optimization study found that implementing trigger-based skill loading reduced initial session context from 7,584 to 3,434 tokens, a 54% reduction (johnlindquist, GitHub Gist, December 2025). The same logic applies to diagnosis: confirm the layer is active before spending time on its contents.
Skill failure reported
↓
Test: /skills listing
├─ FAIL: Fix discovery (path, frontmatter, casing, session restart)
└─ PASS ↓
Test: Recite instructions (fresh session)
├─ FAIL: Fix loading (body length, frontmatter, reference load instructions)
└─ PASS ↓
Test: Compare output vs specification (10-input test set)
├─ TRIGGER FAIL: Fix description (trigger conditions, negative triggers, specificity)
└─ STEP FAIL: Fix instruction (ambiguity, position, completeness)
↓
Re-test from discovery layer
Never skip layers. A loading fix that accidentally improves trigger behavior doesn't count as evidence that the description was the real problem. You need each layer to pass cleanly before diagnosing the next.
What Makes This Diagnostic Framework Produce Consistent Results?
Each test has a clear pass/fail criterion. "The skill appears in /skills" is binary. "Claude recites steps that match SKILL.md" is observable. "Output matches the specification field by field" is precise. None of these depend on subjective judgment. Layer isolation is the mechanism: each test confirms one variable, and only one.
The framework works because it isolates variables. When you move from discovery to loading to execution, you've confirmed the upstream layers work. The failure you're hunting in layer 3 isn't confused by a layer 1 problem you haven't noticed yet.
This approach also prevents the most expensive debugging error in skill engineering: fixing a symptom at the wrong layer. An execution fix applied to a loading failure does nothing, but it adds complexity. A description rewrite applied to a discovery failure fails invisibly and leaves you wondering why the new description didn't help. Layer isolation prevents both. Structured output formats with explicit examples raise model consistency from approximately 60% to over 95% in controlled benchmarks (Addy Osmani, Engineering Director, Google Chrome, 2024). The same principle applies to layer tests: a defined pass/fail criterion produces a result you can act on. "Seems about right" doesn't.
For a complementary view on what happens when skills fail in specific session conditions, see Why Does My Skill Work in One Session but Fail in Another?.
FAQ
At which layer do most skill failures occur? Most Claude Code skill failures occur at a single layer. Discovery failures are the most common because a misplaced file or broken frontmatter blocks everything downstream. Loading and execution failures are usually isolated: the skill registers correctly but either the body content didn't absorb or a specific instruction wasn't treated as binding.
Can a skill fail at multiple layers simultaneously? Yes, but it's rare. Most skills fail at one layer. When they fail at multiple, the discovery-layer failure masks the others: if Claude can't find the skill, no loading or execution data is available. Fix discovery first, then retest. Layer 2 or 3 failures only become visible after Layer 1 passes.
What if Layer 2 passes in the design session but fails in a fresh session? That's context bleed. The design session's active context supplemented the loading. Fresh sessions don't have that supplement. The loading test must be run in a fresh session to produce a valid result. Use Claude B (fresh) for all loading and execution tests.
The skill passes all three layer tests but still produces wrong output in production. What's happening? Production failure after controlled-environment success points to input variation. The test inputs don't match what real users actually type. Add more adversarial test inputs, including prompts that are adjacent to the skill's trigger condition but shouldn't activate it. Also check whether CLAUDE.md or other skills differ in production vs test environments.
How do I test execution when the skill's output is inherently variable? Define a structural specification rather than a content specification. Instead of "output must say X," specify "output must include fields A, B, and C in this order, formatted as a numbered list." Test structure and format, not specific content. Variable content with consistent structure is a passing execution test.
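A structural specification can be expressed as a small check. The field names and regex below are placeholders for whatever your skill's spec actually requires; the point is that the test looks at field presence, order, and format, not wording.

```python
# Sketch of a structural (not content) execution check for a hypothetical skill
# whose spec requires a numbered list with Summary, Risk, and Rollback plan,
# in that order. The field names are assumptions, not a real spec.
import re

REQUIRED_FIELDS = ["Summary", "Risk", "Rollback plan"]

def passes_structural_spec(output: str) -> bool:
    positions = []
    for name in REQUIRED_FIELDS:
        # Each field must appear as a numbered list item, e.g. "1. Summary: ..."
        m = re.search(rf"^\d+\.\s*{re.escape(name)}\b", output, re.MULTILINE)
        if not m:
            return False
        positions.append(m.start())
    # Fields must appear in the specified order; their content can vary freely.
    return positions == sorted(positions)
```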
Does the execution layer cover reference file content failures? No. If a reference file loaded but its content was ignored in execution, that's a step execution failure: the step that uses the reference content has an ambiguous instruction. If the reference file didn't load at all, that's a loading failure, caught at Layer 2.
How long should a full three-layer diagnostic take? Discovery: 30 seconds. Loading: 2 to 3 minutes. Execution: 5 to 10 minutes with a 10-input test set. Total: under 15 minutes for any skill failure, assuming you follow the sequence and stop at the first failing layer.
Last updated: 2026-04-22