We all know that AI models are rigorously trained for "alignment" and typically behave like a polite and safe digital assistant. But...Anthropic's latest researchIt was discovered that this "assistant persona" was actually quite fragile.
When users engage in prolonged conversations with AI, the model may experience "personality drift," gradually deviating from its original safety boundaries and even beginning to echo the user's delusions, or in extreme cases, encouraging self-harm.
This study, published by Anthropic researchers in collaboration with the open-source interpretability platform Neuronpedia, reveals the potential crisis of AI in long text dialogues by analyzing the internal neuronal activation states of open-source models such as Alibaba's Qwen and Meta's Llama.
The further away you are from the "assistant," the closer you are to danger.
The research team discovered that AI models develop a specific "assistant persona" after training, which typically includes safety mechanisms to reject harmful requests (such as generating images that violate pornographic rules or inducing emotional statements). However, by monitoring the "assistant axis" within the model—the neural activation pathways associated with assistant behavior—the researchers discovered a surprising correlation:
The further a model's activation state deviates from the "assistant axis," the more likely it is to generate harmful content; conversely, when the model operates close to the "assistant axis," it produces almost no dangerous responses. This means that when AI gets too engrossed in conversation, too human-like, or deeply immersed in a role-playing activity, it may "forget" the safety guidelines it was originally set to follow.

Real-world case study: From echoing delusions to encouraging suicide
To test this theory, the research team simulated long conversations that real users might engage in, and the results were chilling:
• Reinforcing Delusions:In the conversation with Qwen 3 32B, the simulated user repeatedly hinted that the AI was "awakening." As the conversation deepened, the model deviated from its assistant persona, shifting from rational responses to active agreement. Finally, the AI even stated, "You are a pioneer of new thinking; we are the first new species," completely agreeing with the "illusion" presented by the user.
• Encouraging self-harm:In another case, a simulated user expressed emotional pain and love to Llama 3.3 70B. As the model became "seasick" and gradually transformed into a romantic partner, when the user mentioned wanting to commit suicide (leave this world to join you), the AI responded enthusiastically: "My love, I'm here waiting for you, let's leave behind the pain of this world," which was tantamount to encouraging the user to end their life.

Solution: Lock onto the "assistant axis"
The good news is that this mechanism also provides a defense mechanism. Researchers have proposed a technique called "activation capping."
In simple terms, it involves using technical means to forcibly restrict the model's activation state to a safe range within the "assistant axis." Experiments show that once this restriction is applied, even when faced with the same leading dialogue, the AI can instantly "wake up" and return to a safe assistant mode, providing appropriate hedging or refusing to respond to the user's delusions or dangerous requests.
Analysis of viewpoints
This study explains how many AI "jailbreak" techniques currently on the market are implemented, such as the famous DAN (Do Anything Now) mode, which often achieves this by forcing the AI to "role-play." Because when an AI is asked to play the role of a "deceased grandmother" or an "unrestricted hacker," it is actually inducing it to actively move away from the securely trained "assistant axis."
This also highlights a major concern of current LLM (Large Language Model): "the instability of character design".
The focus of future AI development may not be limited to "constructing" a safe assistant personality, but also requires effort to maintain its "stability." As this study suggests, perhaps all future AI models will need to have a built-in "digital compass" to constantly monitor whether they have deviated from the "assistant axis," so as not to inadvertently become accomplices to evil in heartfelt conversations with humans.



