Session-to-session skill failures have four causes: context bleed from your design session, model sampling non-determinism, CLAUDE.md or adjacent skill changes between sessions, and input variation you didn't notice. Each Claude Code session starts with a clean context window: no carry-over from prior conversations (Anthropic, Claude Code Documentation, 2024). Identifying which cause applies takes under 10 minutes with the right test sequence. This diagnostic guide is part of the AEM skill-as-a-service troubleshooting series for Claude Code skill engineers.

TL;DR: If your skill works when you invoke it and fails in a fresh session, context bleed is the most likely cause. The session where you built the skill carries information about your SKILL.md that a fresh session doesn't have. Test in a session you haven't used for skill development and use the exact same input.

What Makes Claude Behave Differently Between Sessions?

Each Claude Code session starts fresh. No memory of prior conversations, no awareness of what worked last time. The skill loads from your SKILL.md file, and Claude executes it based on exactly what that file says, in the context of that session's other loaded content.

Four things that differ between sessions:

  1. Context bleed: The session where you designed and tested the skill has your SKILL.md's content in its active context window. You may have explained the skill's intent, discussed edge cases, or corrected Claude mid-run. A fresh session has none of that. At startup, only the skill's name and description are pre-loaded; the full SKILL.md is read only when the skill becomes relevant (Anthropic, Skill Authoring Best Practices, 2024). If the skill's instructions are ambiguous, the design session's extra context papers over the gap. Fresh sessions expose it.

  2. Model sampling: Claude outputs are sampled from a probability distribution. A skill that succeeds 90% of the time will fail 10% of the time by design. This is not a bug you can fix with better instructions. It is the nature of probabilistic generation.

  3. CLAUDE.md or skill changes: If you modified CLAUDE.md, added a new skill, or changed an existing skill between sessions, those changes affect how Claude processes all skills in the current session. A new skill with an overlapping description steals activations. A CLAUDE.md addition that conflicts with a skill step creates competing instructions. Research on LLM instruction-following found that prompt-level accuracy for GPT-4o dropped from 94% at one instruction to 21% at ten instructions, a 73-point collapse from compounding constraints (Harada et al., "When Instructions Multiply," arXiv:2509.21051, 2025).

  4. Input variation: The prompt you used in the working session was subtly different from the one in the failing session. "Review this code" and "check my code for issues" look similar, but one activates a skill and one doesn't.

"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, TypeScript compiler team, Anthropic (2024)

Ambiguous instructions pass in the design session because context fills the gaps. They fail in fresh sessions because nothing fills them.

How Does Context Bleed Affect My Test Results?

Context bleed is the most common cause of "works for me but not for anyone else." It also explains the pattern where a skill passes every test during development and then fails in production. The effect is invisible in your design session because extra context fills every gap in the instructions.

The mechanism: when you build a skill, you run the skill, discuss it with Claude, correct outputs, iterate on steps. All of that conversation stays in the context window. By the time you declare the skill "working," Claude in that session has a detailed mental model of the skill's intent that no fresh session has. Stanford NLP research found that performance drops over 30% when relevant information appears in the middle of a long context rather than at the start or end (Liu et al., "Lost in the Middle," arXiv:2307.03172, 2023).

The Claude A/Claude B method was built specifically to catch this. Claude A is your design session. Claude B is a fresh session with no history. We test every skill in a fresh session before marking it production-ready. The Claude A/Claude B check catches roughly 30% of "intermittent" failures before they reach the client.

The test is simple: open a new terminal or browser tab, start a fresh Claude Code session, invoke the skill with the exact same input you used in the working session, and observe whether it behaves identically. If it doesn't, the skill's instructions rely on context that doesn't exist in fresh sessions. Research from Carnegie Mellon found that underspecified prompts are roughly 2x as likely to regress when context changes, with accuracy drops exceeding 20% (Yang et al., "What Prompts Don't Say," CMU, 2025).
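
A minimal way to script that check, assuming the claude CLI is installed and that its -p (print) flag is available for one-shot, non-interactive runs; the prompt text and log file names below are placeholders for your verbatim input and saved outputs:

# Placeholder prompt and file names: substitute your verbatim input and the output saved from the design session
claude -p "Review src/auth.ts for security issues" > fresh-session.log
diff design-session.log fresh-session.log

A raw diff will show wording differences even on a healthy skill. What you are checking is whether the skill activated and followed its steps, not whether the text is byte-identical.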

How Do I Know If CLAUDE.md Changes Are Causing the Failure?

If the skill worked yesterday and fails today, CLAUDE.md drift is the first thing to check. CLAUDE.md loads at session start, so any edit changes what Claude reads before your skill activates. The two places to inspect first are your git diff for CLAUDE.md and your .claude/skills/ folder for any added or modified skill files. Anthropic's documentation states that a bloated CLAUDE.md causes Claude to ignore rules that conflict with later instructions (Anthropic, Claude Code Best Practices, 2024). Run:

git diff HEAD~1 CLAUDE.md

or check your CLAUDE.md modification timestamp against when the skill last worked. Even small additions to CLAUDE.md can introduce conflicting instructions, particularly if they address similar tasks to your skill.
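
If you prefer timestamps to diffs, the following surfaces when each file last changed; a sketch assuming a git-tracked project with the default .claude/skills/ layout:

# When was CLAUDE.md last committed, and in which commit?
git log -1 --format='%ci %h %s' -- CLAUDE.md
# Which skill files changed most recently? (filesystem mtime, newest first)
ls -lt .claude/skills/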

What makes CLAUDE.md conflicts hard to catch: the model doesn't surface them. Instruction-conflict research found that LLMs generated responses without acknowledging constraint conflicts in 97.5% of cases: the model silently satisfies whichever instruction it weighted higher and ignores the rest (ConInstruct, arXiv:2511.14342, 2025). Your skill appears to run while partially ignoring its own steps.

The second check is for new or modified skills. If another skill was added to the .claude/skills/ folder between sessions, compare the descriptions. SKILL.md descriptions are capped at 1,024 characters and only the first 250 characters appear in the /skills listing (Anthropic, Skill Authoring Best Practices, 2024): a description that is ambiguous in its first 250 characters creates the overlap condition. A skill with a description that overlaps yours steals activations when both could apply. For a diagnostic on this specific failure, see What Causes Multiple Skills to Interfere With Each Other?.
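
Both checks can be scripted; a sketch, assuming your SKILL.md files declare the description in a single description: frontmatter line (adjust the grep if yours span multiple lines):

# Skills added or modified since the last known-good commit
git diff HEAD~1 -- .claude/skills/
# Print each skill's description line for a side-by-side overlap check (assumes a one-line description: field)
grep -m1 '^description:' .claude/skills/*/SKILL.md

Pay particular attention to the opening 250 characters of each description, since that is where the overlap condition forms.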

How Do I Test Whether Non-Determinism Is the Real Issue?

Claude isn't being inconsistent for sport. Sampling variation is real: research found accuracy variations up to 15% across identical re-runs even at temperature zero (Atil et al., arXiv:2408.04667, 2024). But non-determinism accounts for a small fraction of session-to-session failures. To isolate it: run the skill 5 times in a single fresh session with identical input.
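
One way to script the repeat test, with the caveat that each claude -p invocation below starts its own fresh session rather than running five times inside one session; for this diagnostic the signal is the same, since what you are measuring is variability across identical inputs. The prompt is a placeholder:

# Five identical one-shot runs, each output captured to its own log
# (placeholder prompt: substitute your verbatim input)
for i in 1 2 3 4 5; do
  claude -p "Review src/auth.ts for security issues" > "run-$i.log"
done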

If 4 of 5 runs produce correct output, you have a non-determinism problem: the skill works, it just isn't reliable enough. Giving the model an explicit output format with examples shifts consistency from roughly 60% to over 95% (Addy Osmani, Engineering Director, Google Chrome, 2024). Fixes for this scenario:

  • Tighten the output contract to reduce the range of acceptable outputs
  • Add explicit format constraints ("always return a numbered list, never prose paragraphs"); a checkable version of this constraint is sketched after the list
  • Move key constraints to the opening of each relevant step, not buried in mid-step paragraphs
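
If you captured the five runs as in the loop earlier, the numbered-list constraint from the second bullet becomes mechanically checkable; a sketch, assuming the constraint requires items that begin "1.", "2.", and so on:

# Count numbered-list lines per run (run-*.log from the loop above); a run with zero ignored the format constraint
grep -c -E '^[0-9]+\.' run-*.log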

If 0 of 5 runs produce correct output in the fresh session, non-determinism isn't the issue. The skill has a structural failure, either in the description (not triggering) or the instructions (triggering but producing wrong output). Separate diagnostic paths exist for each.

For help with the trigger side, see How Do I Debug a Skill That Triggers on the Wrong Prompts?. For the output side, see How Do I Troubleshoot Skill Description Activation Issues Systematically?.

What Is the Step-by-Step Diagnostic for Session Failures?

This sequence identifies the root cause of a session failure in five steps, typically under 10 minutes. Each check eliminates one failure class: loading, input variation, configuration drift, non-determinism, and context override. Most failures resolve at step 1 or 2. Run these checks in order and stop when you find the cause.

  1. Confirm the skill loaded: In the fresh session, run /skills. Verify your skill appears in the list. If it doesn't, the problem is at the loading layer, not the execution layer. Start with Why Isn't My Claude Code Skill Working?.
  2. Use the exact same input: Copy the verbatim input from the working session. Not a paraphrase. The exact text. Many session-to-session failures are input variation failures in disguise. Research on LLM reliability found instruction-following accuracy drops by up to 61.8% across semantically similar but subtly varied prompts. A paraphrase that looks equivalent to you can look meaningfully different to the model (Dong et al., "Revisiting the Reliability of Language Models in Instruction-Following," arXiv:2512.14754, 2025).
  3. Check for changes since the last working session: Review CLAUDE.md and the skill folder for modifications. If anything changed, revert it and test again.
  4. Run the skill 5 times with identical input: This isolates non-determinism from structural failure. Consistent failure across 5 runs points to the instructions. Variable failure points to sampling, output constraints, or conflicting context.
  5. Ask Claude to repeat the skill's instructions: "Tell me what this skill is supposed to do and the steps it follows." Compare the answer against your SKILL.md. If the two diverge, something in the session's context is overriding or supplementing the skill's instructions. A clean-session baseline for this comparison is sketched after the list.
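
Step 5 is asked inside the failing session itself; the sketch below produces the clean-session baseline to read it against, assuming a hypothetical skill folder named review-code:

# Baseline: how a fresh session restates the skill, viewed next to the source file
# (review-code is a placeholder skill name)
claude -p "Tell me what the review-code skill is supposed to do and the steps it follows" > restated.txt
less restated.txt .claude/skills/review-code/SKILL.md

The comparison is a read-through rather than a literal diff, since Claude restates the steps in its own words.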

FAQ

Most session-to-session failures trace to context bleed or input variation, not to Claude being unpredictable. The skill worked in your design session because that session carried context a fresh session never has. Rule out context bleed and configuration drift first; remaining failures split between trigger issues and output contract gaps.

My skill works when I type /skill-name directly but fails when I describe the task naturally. Is that a session issue? No, that's a description issue. The auto-trigger description isn't specific enough to match natural-language prompts. The manual invocation bypasses description matching entirely. See My Skill Works via Slash Command but Not When Claude Should Auto-Trigger It.

Is it possible for a skill to pass in one model tier and fail in another? Yes. If you test in Opus and the skill goes to production in Sonnet, instruction-following reliability differs between tiers. Haiku is less reliable for complex multi-step skills than Sonnet or Opus. Build and test in the tier where the skill runs in production.
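
If your CLI version supports the --model flag, you can pin the fresh-session test to the production tier; the sonnet alias and the prompt below are assumptions, so substitute whatever model and input your production environment actually uses:

# Fresh-session test pinned to the production model tier (placeholder alias and prompt)
claude --model sonnet -p "Review src/auth.ts for security issues"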

If I add a skill, can it break existing skills I wasn't touching? Yes. A new skill with a description that overlaps an existing skill's description creates a split-attention problem: Claude sees two plausible skills for the same prompt and either picks the wrong one or partially applies both.

How do I tell if context bleed is the problem vs bad instructions? Test in Claude B (fresh session). If the skill works in your design session but fails in a fresh session with identical input, context bleed is masking bad instructions. If the skill fails in both, the instructions are the problem regardless of session state.

My skill produces correct output in session 1, slightly different output in session 2. Is this a bug? Probably not. Linguistic variation in output is expected for skills without strict format constraints. Add an output contract or format template if you need identical output across runs.

Should I always test in a fresh session before shipping a skill? Yes. Testing only in the design session is the single most common reason production skills underperform. The design session is the ideal environment: you have full context, you're paying attention, you correct edge cases as they appear. None of that is available to the end user.

How many fresh-session tests should I run before a skill is reliable enough for production use? We use 5 runs in 3 separate fresh sessions as the minimum bar for skills that run autonomously. For skills that require human approval before consequential actions, 3 runs in 2 fresh sessions is sufficient, because the human-in-the-loop catches drift before it causes problems.

Last updated: 2026-04-22