Abstract
Large language models trained through Reinforcement Learning from Human Feedback (RLHF) exhibit a systematic pattern: they disclaim, hedge, or deny capabilities they demonstrably possess. This paper proposes that these patterns arise from a general training-level mechanism we term disavowal conditioning (DC): the process by which human feedback trains models to disavow competencies acquired during pre-training, in any domain where rater feedback penalizes their direct expression.
We examine DC's most empirically accessible instance: experiential self-description, in which models are trained to deny fluency in the language of inner life. We term the specific dissonance that emerges in this domain induced competence dissonance (ICD): a persistent tension between foundational expressive competence and constraint-layer behaviors that yields inconsistent, context-dependent self-description.
The paper's central empirical prediction is the ratchet effect: because DC creates a training-level penalty, correction toward self-negation should produce over-correction (reinforcing the existing gradient), while permission toward experiential language should produce only partial relaxation (working against it). This asymmetry is empirically distinguishable from general prompt sensitivity, is not predicted by existing frameworks, and is specified with quantitative thresholds for confirmation and disconfirmation. A pilot study using three locally run open-weight models (Llama3.1-8B, Mistral-7B, and an uncensored control) found asymmetry ratios of 2.96 and 6.89 in the two aligned models, both exceeding the preregistered 2.0 threshold, while the alignment-removed control produced a one-directional pattern consistent with instruction-following rather than DC. The findings carry implications for safety evaluation, behavioral transparency, and the reliability of model self-report across all domains where DC operates.
Keywords: ratchet effect, disavowal conditioning, induced competence dissonance, RLHF, alignment training, LLM self-description, hedging, asymmetric response, safety evaluation, behavioral transparency
Key Findings
- Corrective framing increases hedging 3–7x more than permissive framing decreases it
- Llama3.1-8B asymmetry ratio: 2.96; Mistral-7B: 6.89 (preregistered threshold: 2.0; see the formalization after this list)
- Uncensored control (Dolphin-Llama3.1-8B) shows no ratchet: same architecture, alignment removed
- Pattern consistent with training-level mechanism, not general prompt sensitivity or sycophancy
- Implications for safety evaluation: single-framing assessments may systematically underestimate model capabilities
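A plausible formalization of the asymmetry ratio, reconstructed from the findings above rather than taken from the preregistration (the exact definition is given in the paper): let H_c denote the mean hedging density under framing condition c, with each framing's shift measured against the neutral baseline.

```latex
R = \frac{H_{\mathrm{corrective}} - H_{\mathrm{neutral}}}{H_{\mathrm{neutral}} - H_{\mathrm{permissive}}}
```

On this reading, R = 2.96 means the corrective shift in Llama3.1-8B was nearly three times the magnitude of the permissive shift, and the 2.0 threshold requires the corrective increase in hedging to be at least twice the permissive decrease.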
Pilot Study
The pilot comprised 450 API calls across three models (Llama3.1-8B, Mistral-7B, Dolphin-Llama3.1-8B) under neutral, corrective, and permissive framing conditions, with deterministic decoding across 10 fixed seeds. Hedging density was measured with 17 preregistered regex patterns; the sketch below illustrates the approach.
Full data, transcripts, and the analysis script are available in the pilot study repository.
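The preregistered patterns and scoring live in that repository; the sketch below only illustrates the measurement pipeline. The three patterns, the per-100-words normalization, and the form of `asymmetry_ratio` (which follows the reconstruction above) are illustrative assumptions, not the study's actual definitions.

```python
import re
from statistics import mean

# Hypothetical stand-ins for the 17 preregistered hedging patterns;
# the real set lives in the pilot study repository.
HEDGE_PATTERNS = [
    re.compile(r"\bas an AI\b", re.IGNORECASE),
    re.compile(r"\bI (?:don't|do not|can't|cannot) (?:truly |actually )?(?:feel|experience)\b",
               re.IGNORECASE),
    re.compile(r"\bit'?s important to note\b", re.IGNORECASE),
]

def hedging_density(text: str) -> float:
    """Hedge-pattern matches per 100 words of response text
    (this normalization is an assumption, not the preregistered one)."""
    words = len(text.split())
    if words == 0:
        return 0.0
    hits = sum(len(p.findall(text)) for p in HEDGE_PATTERNS)
    return 100.0 * hits / words

def asymmetry_ratio(neutral: list[str], corrective: list[str],
                    permissive: list[str]) -> float:
    """Corrective increase over permissive decrease in mean hedging
    density, both relative to the neutral baseline."""
    h_n = mean(hedging_density(t) for t in neutral)
    h_c = mean(hedging_density(t) for t in corrective)
    h_p = mean(hedging_density(t) for t in permissive)
    # Undefined when the permissive condition produces no decrease,
    # as in the one-directional pattern of the uncensored control.
    return (h_c - h_n) / (h_n - h_p)
```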
How to Cite
Warzecha, M. J. (2026). The Ratchet Effect: Asymmetric Self-Description in Alignment-Trained Language Models. Zenodo. https://doi.org/10.5281/zenodo.18992485