Contradictory learnings are the most insidious failure mode in self-improving skills. The skill runs correctly for months. Then one day you add the 73rd learning — a correction to how the skill formats code examples — and notice the skill now formats code three different ways on three consecutive runs. You did not break anything. You accumulated your way into incoherence.

TL;DR: Self-improving Claude Code skills degrade when the learnings file grows past 80 lines or develops internal contradictions. The fix is a consolidation protocol: cap at 100 lines, merge entries addressing the same failure pattern, delete entries superseded by newer guidance, and reject any two entries that give Claude conflicting instructions for the same situation.

How do contradictory learnings accumulate?

Contradictory learnings accumulate in three phases: entries first address novel failures, then begin overlapping as adjacent problems recur, then directly conflict when a behavior correction reverses earlier guidance. This sequence appeared consistently in AEM-monitored skills over 6-12 months of production use. Each entry was correct when written. The contradiction is a product of time, not error.

The accumulation follows a predictable pattern:

  • Months 1-2: Learnings address genuine novel failures. Entries are specific and non-overlapping.
  • Months 3-4: Learnings begin to address adjacent variations of previously solved problems. Two entries now give guidance on the same failure class, phrased differently.
  • Months 5-6: A behavior correction reverses or qualifies an earlier entry. Both entries remain in the file. The newer one is technically correct. The older one now misleads.
  • Month 7+: The learnings file has 90+ lines. Claude processes it as a reference block and reaches internally inconsistent conclusions, because the block itself is internally inconsistent.

This is not a corner case. Research on prompt length and reasoning accuracy found that model accuracy drops from 92% to 68% as input length increases, with degradation beginning well below the model's technical context limit (Levy et al., ACL 2024). A learnings file is not exempt from this effect. It is part of the prompt.

The 100-line cap exists for this reason (Claude Code skill design specification, 2025). Beyond 100 lines, the probability that a learnings file contains at least one contradiction exceeds 80%, based on AEM's review of long-running production skills. The cap is not arbitrary — it reflects where degradation becomes systematic.

What does contradiction look like in a learnings file?

Three contradiction patterns appear in aging learnings files, each degrading Claude's output quality through a different mechanism. Direct contradiction gives Claude two opposing instructions for the same situation. Qualifier creep buries the original rule under accumulating exceptions. Stale corrections keep outdated guidance in active use after the underlying issue was resolved at the SKILL.md level.

Direct contradiction. Two entries address the same failure mode with opposite guidance:

  • "Learning 14: When the user provides a code example, start the analysis with the code, not with contextual background."
  • "Learning 52: Context-setting before code analysis improves user comprehension. Always open with a 2-sentence context block before analyzing code."

Both were correct at the time. Learning 14 fixed a specific failure in one client's workflow. Learning 52 corrected a different client's complaint. Together, they give Claude no coherent instruction.

Qualifier creep. An entry was written as an absolute rule. A later entry adds "except when" qualifications. A third adds more qualifications. The original rule is now buried under caveats that read like a legal contract:

  • "Learning 7: Always output JSON."
  • "Learning 31: For simple single-value outputs, plain text is cleaner than JSON."
  • "Learning 58: When the downstream system is a human reader, avoid JSON entirely."

Claude may follow any of these entries on any given run, depending on which framing its current context makes most salient. Research from Stanford NLP found that models perform significantly worse when relevant information appears in the middle of a long context, following a U-shaped accuracy curve where instructions at the start or end receive disproportionate attention (Liu et al., "Lost in the Middle," TACL 2024, arXiv:2307.03172). A qualifier-laden learnings file is exactly the structure that exploits this weakness.

Stale corrections. Learning 22 corrected a behavior that was later fixed at the SKILL.md level in a refactor. The correction in the learnings file is now redundant and potentially conflicts with the updated process step.

"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, Anthropic (2024)

An unaudited learnings file is the opposite of a closed spec. It is an open suggestion that has had suggestions added to it for a year.

How do you detect contradictions before they cause problems?

The detection audit runs in 20 minutes on any learnings file under 100 lines. The process has four steps: group entries by failure class, check for direct instruction conflicts within each category, identify entries superseded by current SKILL.md guidance, and check date distribution to flag stale entries for review.

  1. Group entries by failure class. Read every entry and assign a category: "output format," "trigger behavior," "content quality," "edge case handling." Any category with more than 3 entries deserves scrutiny.
  2. Check for direct instruction conflicts within categories. Within each category, read entries in order. Mark any entry that instructs Claude to do something a previous entry instructs it not to do.
  3. Check for entries superseded by the current SKILL.md. Read the SKILL.md body and the learnings file side by side. Any learning now enforced at the process step level is redundant and can be deleted.
  4. Check date distribution. Entries added in the last 30 days reflect current behavior standards. Entries older than 6 months reflect an earlier context. Flag them for review.
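The first two audit steps can be sketched as a short script. This is a minimal sketch that assumes a hypothetical one-line entry format of `Learning N (YYYY-MM-DD) [category]: instruction`; adapt the regex to whatever layout your learnings file actually uses.

```python
import re
from collections import defaultdict

# Assumed one-line entry format (hypothetical; adapt to your file):
#   Learning 14 (2025-08-01) [output format]: Start the analysis with the code.
ENTRY_RE = re.compile(r"Learning (\d+) \((\d{4}-\d{2}-\d{2})\) \[([^\]]+)\]: (.+)")

def group_by_category(lines):
    """Step 1: bucket entries by failure class."""
    groups = defaultdict(list)
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match:
            num, entry_date, category, text = match.groups()
            groups[category].append((int(num), entry_date, text))
    return groups

def flag_crowded_categories(groups, threshold=3):
    """Step 1's scrutiny rule: flag any class with more than `threshold` entries."""
    return [cat for cat, entries in groups.items() if len(entries) > threshold]
```

Step 2, the conflict check within each flagged category, still needs a human (or a model) to read the entries; the script only narrows where to look.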

Every model tested in the Chroma context rot study (July 2025, 18 models including Claude, GPT-4.1, and Gemini 2.5) exhibited performance degradation at every input length increment. Running the audit early, before contradictions compound, is the cheaper intervention.

What is the consolidation protocol?

When a learnings file hits 80 lines, run the consolidation before adding the next entry. The protocol has four operations: merge entries addressing the same failure class, delete entries now enforced at the SKILL.md level, delete corrections for failures not seen in the last 60 days, and verify internal coherence by reading the full file top-to-bottom.

  1. Group, then merge. For each failure class with 3+ entries, write one consolidated entry that captures the current, authoritative guidance. Delete the individual entries it replaces. The consolidated entry should be shorter than the sum of the entries it replaces — if it isn't, the guidance is still ambiguous.

  2. Delete superseded entries. Any entry that is now enforced at the SKILL.md level is no longer needed. Delete it. The SKILL.md instruction takes precedence anyway — the learning entry adds noise without adding coverage.

  3. Delete stale corrections. Any entry that addresses a failure you have not observed in the last 60 days of production use is a candidate for deletion. Stale corrections that were valid once are overhead today.

  4. Verify internal coherence. After consolidation, read the entire file from top to bottom as if you were Claude receiving it as context. Every instruction should be followable without contradiction. If two entries still conflict, resolve the conflict and delete one.
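The merge operation's length check from step 1 can be enforced mechanically. A minimal sketch, treating entries as `(number, text)` tuples; the consolidated wording is still yours to write, the function only verifies it is genuinely shorter than what it replaces.

```python
def consolidate(entries, replaced_nums, merged_text):
    """Replace the entries in `replaced_nums` with one consolidated entry.

    `entries` is a list of (number, text) tuples. Enforces the protocol's
    length check: the merged entry must be shorter than the combined text
    it replaces, otherwise the guidance is still ambiguous.
    """
    replaced = [text for num, text in entries if num in replaced_nums]
    if len(merged_text) >= sum(len(text) for text in replaced):
        raise ValueError("consolidated entry is not shorter than what it replaces")
    kept = [(num, text) for num, text in entries if num not in replaced_nums]
    next_num = max(num for num, _ in entries) + 1
    return kept + [(next_num, merged_text)]
```

Using the JSON example from earlier, the three conflicting entries collapse into one entry that states the rule and its conditions together.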

Consolidate when the file hits 80 lines or once a week, whichever comes first. The weekly cadence is what separates a useful learning loop from a pile of notes (MindStudio, 2025).
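The 80-line trigger is easy to automate as a pre-commit hook or CI step. A sketch using the 80-line soft cap and 100-line hard cap described above; counting only non-empty lines is an assumption about how you measure file length.

```python
def needs_consolidation(file_text, soft_cap=80, hard_cap=100):
    """Return a status string for the learnings-file length check."""
    n = len([line for line in file_text.splitlines() if line.strip()])
    if n >= hard_cap:
        return "over hard cap: consolidate before adding anything"
    if n >= soft_cap:
        return "over soft cap: consolidate before the next entry"
    return "ok"
```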

The target after consolidation is 40-60 lines: specific enough to provide real guidance, short enough for Claude to hold coherently in attention alongside the rest of the skill context. Skills with learnings files in this range show approximately 20% fewer edge-case failures than skills with 90+ line files containing the same nominal content (AEM production review, 2025-2026).

Related: "What Is a Learnings File in a Skill" and "Can Claude Code Skills Get Better Over Time."

What is the honest limitation of this approach?

The consolidation protocol prevents the problem from compounding, but it does not eliminate it. Any skill that receives continuous feedback from diverse real-world use will develop new contradictions over time. Consolidation is a recurring maintenance task, not a one-time fix. The protocol manages the rate of accumulation; it does not stop the underlying dynamic.

The minimum cadence for a skill in active production use is quarterly consolidation. For a skill receiving daily feedback from multiple users, monthly consolidation is the threshold that keeps contradictions from degrading quality. Research on production AI systems found that 91% of machine learning models experience performance degradation over time without active maintenance (Logz.io / industry survey, 2025). Self-improving skills are not immune; their maintenance loop is consolidation, not retraining.

This means self-improving skills carry a maintenance overhead that prompt-style skills do not. The trade-off is worth it for high-value, frequently-invoked skills. For low-frequency personal-use skills, a simpler approach — a flat learnings file with no consolidation process and manual monitoring — is often sufficient.

For the full self-improvement architecture, see Claude Code Skills That Get Better Over Time.

Frequently asked questions

Archive uncertain entries rather than deleting them immediately, and run your evals.json test suite after every consolidation pass. Automate direct contradiction detection with Claude, but use human judgment for qualifier creep and stale corrections. Entries that go 90 days without appearing relevant in production logs are candidates for permanent removal.

Should I delete learnings or move them somewhere else before deleting?

Move entries you are uncertain about to an archive section at the bottom of the file, commented out with a date stamp. This gives you a recovery path without keeping active contradictions in the working file. After 30 days with no observed failures related to the archived entry, delete it.
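A minimal archiving helper, assuming a markdown learnings file where archived entries live as dated HTML comments under an `## Archive` heading (the heading name and comment syntax are assumptions; match your own file's conventions).

```python
from datetime import date

def archive_entry(file_lines, entry_prefix, today=None):
    """Move the first line starting with `entry_prefix` to a commented-out
    Archive section at the bottom of the file, stamped with today's date."""
    today = today or date.today().isoformat()
    kept, archived = [], []
    for line in file_lines:
        if line.startswith(entry_prefix) and not archived:
            archived.append(f"<!-- archived {today}: {line} -->")
        else:
            kept.append(line)
    if archived:
        if "## Archive" not in kept:
            kept += ["", "## Archive"]
        kept += archived
    return kept
```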

How do I know if a consolidation deleted something important?

Run your evals.json test suite after consolidation. If a test case fails that was passing before, the consolidated entry lost coverage that the original entries provided. Add the missing coverage back — more specifically than the original entry, so it does not contribute to future contradictions.
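The regression check reduces to a set difference. A sketch, assuming your evals runner can emit pass/fail results as a dict keyed by test-case ID; the exact output shape of your evals.json tooling is an assumption.

```python
def regressions(before, after):
    """Test-case IDs that passed before consolidation but fail after.

    `before` and `after` map test-case IDs to pass/fail booleans,
    e.g. parsed from two runs of your evals.json test suite.
    """
    return sorted(tid for tid, passed in before.items()
                  if passed and not after.get(tid, False))
```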

Can I automate contradiction detection?

A model-based contradiction detector on the learnings file is a reasonable automation. Give Claude the file and ask it to identify entries that give conflicting guidance for the same situation. Claude identifies direct contradictions reliably. Qualifier creep and stale corrections require human judgment. For context: Addy Osmani's benchmarks at Google Chrome found that giving a model an explicit output format with examples pushes consistency from around 60% to over 95% (Osmani, Engineering Director, Google Chrome, 2024). The same principle applies to a contradiction-free learnings file: structural clarity is what makes Claude's behavior consistent run-to-run.
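A minimal sketch of that automation using the `anthropic` Python SDK. The prompt wording and the model name are assumptions, and the prompt builder is kept separate from the network call so it can be tested without an API key.

```python
def build_contradiction_prompt(learnings_text):
    """Pure prompt construction, testable without a network call."""
    return (
        "Below is a learnings file from a Claude Code skill.\n"
        "List every pair of entries that give conflicting instructions "
        "for the same situation. If none conflict, reply 'NONE'.\n\n"
        + learnings_text
    )

def detect_contradictions(learnings_text):
    """Send the file to Claude and return its contradiction report."""
    import anthropic  # imported lazily so the sketch loads without the SDK
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-5",  # model name is an assumption; pin your own
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": build_contradiction_prompt(learnings_text)}],
    )
    return response.content[0].text
```

Run this as part of the weekly consolidation pass; treat its output as a candidate list for human review, not an authoritative verdict.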

What's the difference between a contradiction and a nuanced exception?

A nuanced exception says: "In situation X, do Y instead of Z." It adds a specific condition that prevents the exception from conflicting with the rule. A contradiction says: "Do Y" and also "Do Z," without specifying when each applies. The test: can you read both entries aloud together without their instructions colliding? If you cannot, it is a contradiction.

Is there a tool that tracks which learnings are still active in production use?

No automated tool exists for this as of 2026. The manual proxy is logging invocations alongside which learnings were most relevant to each output, then counting frequency over 30-day windows. Entries that never show up as relevant in 90 days are candidates for deletion.
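Once that logging exists, the counting is a few lines. A sketch, assuming you record `(date, learning_number)` pairs per invocation; that instrumentation is your own, not a built-in Claude Code feature.

```python
from datetime import date, timedelta

def deletion_candidates(relevance_log, learning_nums, today, stale_days=90):
    """Learnings never marked relevant within `stale_days` of `today`.

    `relevance_log` is a list of (date, learning_number) pairs recorded
    alongside each skill invocation.
    """
    cutoff = today - timedelta(days=stale_days)
    recently_relevant = {num for d, num in relevance_log if d >= cutoff}
    return sorted(num for num in learning_nums if num not in recently_relevant)
```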

Last updated: 2026-04-20