Claude A is the session where you built the skill. Claude B is a fresh session where no one knows anything you know. Claude A bias describes the systematic failures that appear when the person who designed the skill is also the person who tests it — using their own context, their own phrasing, their own knowledge of how the skill is supposed to work. Agent Engineer Master (AEM) identified this pattern through Claude Code skill engineering: a structured methodology for building production-ready, deployment-verified Claude Code skills that pass testing by users who had no role in designing them.
TL;DR: Claude A bias produces three failure modes: prompt mirroring (the designer tests with their own phrasing), context bleed (the design session fills gaps the skill file cannot), and output acceptance drift (the designer reads ambiguous results as correct). The fix is Claude B validation: a fresh session, third-party prompts, and a blind evaluator.
What is Claude A bias?
Claude A is the context-rich session where design happens (skill engineering methodology, AEM, 2025). By testing time, Claude has absorbed your design intent, your corrections, your iterative refinements. A test that passes in Claude A carries all that accumulated context. None of that context travels to the next session.
Claude B is a genuinely fresh session. No design history. No clarifying exchanges. The skill file as the only source of truth. Research on LLM context usage confirms why this matters: when relevant information sits in the middle of a long context, model performance on retrieval tasks degrades sharply, by 20 percentage points or more in some settings, compared to when the same information appears at the start or end of the context (Nelson Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172, 2023).
A skill that works in Claude A and fails in Claude B is not a tested skill. It is a skill that works when the author helps it. That is the core failure. The author's presence was the missing component, and the author will not be present at every invocation.
What are the three specific failure modes?
The three failure modes are prompt mirroring, context bleed, and output acceptance drift. Each systematically advantages the skill designer over a real user, because the designer's vocabulary, session history, and mental model do the work that the skill file is supposed to do.
Failure mode 1: Prompt mirroring. The designer tests with prompts they would personally use. Their phrasing matches the mental model they used to write the description. The skill activates reliably on these prompts because the description was written with exactly this phrasing in mind.
Real users phrase requests differently. They use synonyms. They include context the designer assumed was irrelevant. They frame the task from a different starting point. Prompt mirroring means the skill's trigger coverage maps to the author's vocabulary, not to the full range of vocabulary real users bring. This mirrors confirmation bias documented in classical software testing: developers testing their own code systematically choose positive test cases over negative ones, reducing defect detection rates (Glenford Myers, "The Art of Software Testing," 1979).
In production deployments across teams, we find that skills tested only by their author undertrigger on team member prompts at a rate approximately 30% higher than skills tested by a third party before deployment (AEM production data, 2025-2026). The prompts that fail are not obscure — they are natural-language variations the author simply did not think to test.
Failure mode 2: Context bleed. When you test a skill in a session where you have been designing it, Claude has absorbed dozens of exchanges clarifying your intent. That context helps Claude succeed on tasks where the SKILL.md alone would not provide enough guidance.
The practical consequence: the skill file looks complete during testing, because Claude is filling gaps from session context. The first time a real user invokes it in a fresh session, those gaps are visible. This failure mode is insidious because it does not appear in the test results at all — the skill passes, cleanly, on every test case the designer runs. CMU researchers found that real-world LLM prompts frequently leave instruction details underspecified, and that under-specified prompts are roughly twice as likely to regress across model or prompt changes (Yang et al., Carnegie Mellon University, 2025).
Failure mode 3: Output acceptance drift. Designers who know what the skill should produce accept outputs that are slightly off as correct. They read the output through the lens of intent rather than the lens of specification. "This is mostly right — the format is slightly different than I specified, but it contains the right information" is the evaluator's assessment. A real user's assessment is "this doesn't follow the format."
"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)
Output acceptance drift means that ambiguous instructions pass testing — the ambiguity is resolved by the author's intent rather than by the specification. In production, the resolution is gone.
How do you test around the bias?
Three countermeasures address the three failure modes directly: third-party prompt generation (for prompt mirroring), a fresh session for every critical test (for context bleed), and specification-grounded evaluation (for output acceptance drift). Each countermeasure removes one advantage the designer holds over a real user.
Counter 1: Third-party prompt generation. Before running any tests, ask a colleague (or a separate Claude session with no context about the skill) to write 10 prompts for a task that the skill should handle. Do not show them the skill design. Use those prompts as the test battery. You will find coverage gaps that your own prompt list would never surface.
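One lightweight way to keep the designer's vocabulary from dominating the battery is to tag each prompt with its provenance and check the split before testing. A minimal sketch in Python, assuming a hypothetical JSON battery format (the file shape and the `source` field are illustrative conventions, not part of Claude Code):

```python
import json

# Hypothetical test battery: each case records who wrote the prompt.
battery = json.loads("""
[
  {"prompt": "summarize this meeting transcript", "source": "third_party"},
  {"prompt": "turn these notes into action items", "source": "third_party"},
  {"prompt": "make a recap of our standup", "source": "designer"}
]
""")

def third_party_share(cases):
    """Fraction of prompts written by someone other than the skill designer."""
    outside = sum(1 for c in cases if c["source"] == "third_party")
    return outside / len(cases)

share = third_party_share(battery)
print(f"{share:.0%} of prompts come from outside the design session")
```

Running the provenance check before the test run, rather than after, keeps a designer-heavy battery from quietly passing as "tested".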
Counter 2: Fresh session for every critical test. The canonical test session for a skill is a session that has never seen the skill design. Open a new Claude Code window. Add only the skill file. Run the test. Anything that passes in this context is genuinely tested. Anything that only passes in your design session is untested.
Counter 3: Specification-grounded evaluation. When evaluating test output, do not ask "is this basically right?" Ask "does this match the output contract exactly?" Read the output contract first. Then read the output. Mark every deviation — including small ones — as a failure. Small deviations are not acceptable variation; they are evidence of an underspecified output contract. Explicit output formats with examples raise model output consistency from approximately 60% to over 95% (Addy Osmani, Engineering Director, Google Chrome, 2024).
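Specification-grounded evaluation can be mechanized so that "basically right" is never an available verdict: every deviation from the contract is recorded, and an empty deviation list is the only pass. A minimal sketch, assuming a hypothetical contract shape (`required_sections`, `max_lines`, and `forbidden_phrases` are illustrative fields, not a standard schema):

```python
def evaluate_against_contract(output: str, contract: dict) -> list:
    """Return every deviation from the output contract; empty list means pass."""
    deviations = []
    for section in contract.get("required_sections", []):
        if section not in output:
            deviations.append(f"missing required section: {section!r}")
    max_lines = contract.get("max_lines")
    if max_lines is not None and len(output.splitlines()) > max_lines:
        deviations.append(f"exceeds {max_lines} lines")
    for phrase in contract.get("forbidden_phrases", []):
        if phrase in output:
            deviations.append(f"contains forbidden phrase: {phrase!r}")
    return deviations

contract = {"required_sections": ["## Summary", "## Action items"], "max_lines": 40}
# Any deviation, however small, is a fail -- no partial credit.
result = evaluate_against_contract("## Summary\nLooks fine.", contract)
print(result)
```

The point of the list (rather than a boolean) is that each fail is documented with its specific deviation, which doubles as a to-do list for tightening the output contract.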
For the evaluation framework underlying this, see What Is Evaluation-First Development and What Are Evals in Claude Code Skills.
What does a clean Claude B validation look like?
A clean Claude B validation runs in a session that has never seen the skill design, uses prompts written by someone other than the designer, evaluates output against a written specification committed before the test runs, and scores each case as pass or fail with no partial credit. Four components, in order:
Fresh session, no prior context. New window, no related exchanges preceding the test.
Third-party prompts. At least 5 of the 10 test prompts come from someone who did not design the skill.
Written expected outputs. Before running the test, write the expected output for each test case. This prevents output acceptance drift — you evaluate against a specification you committed to before seeing the result.
Pass/fail scoring. Each test case is either pass or fail. "Mostly pass" is a fail. The fail is documented with the specific deviation from the output contract.
Any skill that does not pass a clean Claude B validation with 8/10 test cases before deployment is not ready for production. The 8/10 threshold reflects the reality that 100% pass rates in controlled testing correlate with over-fitted test cases, not with genuinely robust skills (AEM testing methodology, 2025).
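The scoring rules above can be sketched as a small gate: a case with any documented deviation fails, the deployment threshold is applied to the raw pass rate, and a perfect score is flagged rather than celebrated. The result structure is illustrative:

```python
def claude_b_gate(results, threshold=0.8):
    """True only if the pass rate meets the deployment threshold.

    Each result is strictly pass/fail: any documented deviation fails the case.
    A 100% pass rate is flagged, per the over-fitting caution above.
    """
    passed = sum(1 for r in results if not r["deviations"])
    rate = passed / len(results)
    if rate == 1.0:
        print("warning: 100% pass rate -- check for over-fitted test cases")
    return rate >= threshold

# 9 clean cases plus 1 with a documented deviation: 9/10, gate passes.
results = [{"case": f"case_{i}", "deviations": []} for i in range(9)]
results.append({"case": "case_9", "deviations": ["format: missing table header"]})
print(claude_b_gate(results))
```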
What is the honest limitation here?
Claude B validation eliminates context bleed and reduces prompt mirroring. It does not eliminate the designer's interpretive bias in writing expected outputs. The designer still wrote the output contract, designed the test cases, and scored the results, all from the same mental model that built the skill.
The deepest form of Claude A bias is that the designer wrote the output contract, designed the test cases, and wrote the expected outputs — all from the same mental model. A truly unbiased evaluation requires someone who uses the output without knowing what it was supposed to be. Researchers studying LLM-generated test suites found they produce tests that mirror the generating model's error patterns, creating a homogenization trap where tests catch LLM-like failures while missing diverse human-reasoning errors (ArXiv 2501.14465, 2025). The same dynamic applies when a human designer generates all test cases from a single mental model.
For high-stakes skills serving many users, the gold standard is real-world evaluation: deploy to a small group of actual users, collect session logs, and measure whether the outputs serve their actual needs. Controlled testing reduces obvious failures; real-world evaluation surfaces the subtle specification gaps that controlled testing misses.
For a complementary diagnostic approach, see How Do You Diagnose Whether a Skill Failure Is a Description Problem, Instruction Problem, or Reference Loading Problem.
Frequently asked questions
Claude B validation closes most Claude A bias failures, but practical questions remain: how to source third-party prompts without a colleague, which model tier to test against, how many sessions constitute a reliable sample, and what to do when evals pass but production fails. The answers below cover each scenario.
If I can't find a third party to generate test prompts, what's the best alternative?
Use a separate Claude session with no context about your skill. Describe the task the skill is designed to handle — in general terms, not in the words of the skill design — and ask Claude to generate 10 natural-language requests for that task. This is not as good as a human third party, but it surfaces vocabulary variations that you would not generate yourself.
Should I test skills in Claude Opus, Sonnet, or Haiku for Claude B validation?
Test in the model tier your users will use. If the skill is designed for Sonnet, test in Sonnet. Model-specific testing matters because instruction adherence varies across tiers — a skill that passes Claude B validation in Opus may fail in Haiku on the same test cases. If your skill needs to work across tiers, test all three.
How many Claude B test sessions should I run before deploying?
Three fresh sessions, minimum. One session could pass by chance on prompts that happened to phrase the request favorably. Three sessions with independently generated prompts provides the minimum signal that the skill is not passing on lucky phrasing. For high-stakes production skills, five independent sessions is the reliable threshold.
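One conservative way to aggregate multiple sessions is to require every independent session to clear the gate on its own, rather than averaging across sessions, so a lucky session cannot mask an unlucky one. A sketch, with illustrative session labels and rates:

```python
def ready_to_deploy(session_pass_rates, min_sessions=3, threshold=0.8):
    """Every independent fresh session must individually meet the threshold."""
    if len(session_pass_rates) < min_sessions:
        return False  # too few sessions to distinguish skill from luck
    return all(rate >= threshold for rate in session_pass_rates.values())

sessions = {"fresh_session_1": 0.9, "fresh_session_2": 0.8, "fresh_session_3": 1.0}
print(ready_to_deploy(sessions))
```

Averaging would let a 1.0 session drag a 0.6 session over the line; requiring each session to pass is the stricter reading of "three sessions, minimum".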
Is Claude A bias the same as overfitting in machine learning?
The analogy is accurate. Overfitting in ML means a model performs well on training data and poorly on held-out test data. Claude A bias means a skill performs well in the designer's context and poorly in real-world contexts. The root cause is the same: the evaluation data was too similar to the development data to catch generalization failures. Controlled experiments in software testing show that developers testing their own code favor positive test cases over negative ones, even though negative tests detect more defects, a pattern attributed to confirmation bias (Empirical Software Engineering, Springer, 2019, doi:10.1007/s10664-018-9668-8).
What if my evals.json passes but users still report failures?
The evals were written with Claude A bias. The test cases reflect the designer's mental model of what users will ask, not what users actually ask. Expand the test battery using real user prompts from session logs, replace pass/fail assessments the designer made with specification-grounded evaluations, and re-run. The failures will surface.
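Expanding the battery with logged user prompts can be as simple as a dedup-and-append pass over the eval file. A sketch, assuming a hypothetical `evals.json` shape and pre-extracted log prompts (neither is a documented Claude Code schema):

```python
# Illustrative evals.json contents and session-log prompts.
evals = {"cases": [{"prompt": "summarize this transcript", "source": "designer"}]}
session_log_prompts = ["can you recap what we talked about", "tl;dr this call"]

existing = {c["prompt"] for c in evals["cases"]}
for prompt in session_log_prompts:
    if prompt not in existing:  # skip prompts already in the battery
        evals["cases"].append({"prompt": prompt, "source": "user_log"})

print(len(evals["cases"]))
```

Tagging the appended cases with their origin preserves the signal later: if `user_log` cases fail at a higher rate than `designer` cases, that gap is a direct measurement of Claude A bias in the original battery.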
Last updated: 2026-04-20