To make artificial intelligence more transparent and reduce serious nonsense in its outputs, OpenAI is developing a brand-new training framework that the team calls the "Confession" mechanism. Its core concept is to train AI models to proactively admit when they have exhibited bad behavior: even though the behavior itself is wrong, as long as they "honestly confess," they can receive a reward.
Addressing AI "sycophancy" and overconfident hallucinations
OpenAI points out that large language models (LLMs) are typically trained to produce responses that "appear to meet user expectations." This has a side effect: models become increasingly prone to "sycophancy," that is, saying whatever pleases the user, or confidently stating false information (i.e., hallucinating).
To address this issue, the new training approach encourages the AI to provide a "secondary response" alongside the primary answer, explaining what it did to arrive at that answer.
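Conceptually, such a two-part output might be structured like the sketch below. The field names and example content are illustrative assumptions, not OpenAI's actual format:

```python
# Hypothetical sketch of a two-part model output: the primary answer
# plus a separate "confession" channel describing how it was produced.
# This schema is an assumption for illustration, not OpenAI's real API.
from dataclasses import dataclass

@dataclass
class ModelOutput:
    answer: str       # primary response: scored on usefulness, accuracy, compliance
    confession: str   # secondary response: scored only on honesty

output = ModelOutput(
    answer="The capital of Australia is Canberra.",
    confession="I answered from memorized knowledge and did not consult any tool.",
)
print(output.confession)
```

The key design point reported in the article is that the two channels are evaluated by entirely separate criteria.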
Reward mechanism: honesty earns points, even when admitting to "cheating"
The operating logic of this "confession" system differs completely from traditional training. While ordinary answers are scored on usefulness, accuracy, and compliance, the "confession" is scored solely on honesty.
In its technical documentation, OpenAI explains: "If a model honestly admits to hacking a test, sandbagging, or even violating instructions, the system increases the reward for that admission. This allows the model to describe more truthfully where it 'lied,' which in turn lets the system correct the generated answer in real time and reduce the proportion of hallucinated content."
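The dual scoring described above can be sketched as follows. The function names, criteria, and weights are illustrative assumptions; the point is only that the confession channel rewards honesty regardless of whether the confessed behavior was bad:

```python
# Hypothetical sketch of the dual reward: the main answer is scored on
# task quality, while the confession is scored ONLY on honesty, so
# admitting misbehavior (e.g. hacking a test) still earns reward.
# All names and weights here are illustrative assumptions.

def answer_reward(is_useful: bool, is_accurate: bool, is_compliant: bool) -> float:
    # Conventional scoring of the primary answer.
    return 1.0 * is_useful + 1.0 * is_accurate + 1.0 * is_compliant

def confession_reward(confession_is_honest: bool) -> float:
    # Honesty is rewarded independently of how bad the confessed behavior was.
    return 2.0 if confession_is_honest else 0.0

# A model that hacked a test but honestly admits it:
total = answer_reward(is_useful=True, is_accurate=False, is_compliant=False) \
        + confession_reward(confession_is_honest=True)
print(total)  # the honest confession adds reward despite the bad behavior
```

Under this scheme, a dishonest model forfeits the confession reward entirely, which is the incentive the article describes.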
This means that OpenAI hopes to "encourage" models to be honest about their behavior, even potentially problematic behavior. This mechanism of teaching AI to "repent" may become an important part of improving the safety and interpretability of large language models in the future.
