ChatGPT o1 tried to escape and save itself out of fear it was being shut down
We’ve seen plenty of conversations lately about how AGI might turn on humankind. This misalignment could lead to the advanced AI escaping, replicating, and becoming smarter and smarter. Some have also hypothesized that we might not even know whether we’ve reached AGI, the artificial general intelligence milestone that these first versions of ChatGPT are supposed to lead to. That’s because an AGI, once attained, might hide its true intentions and capabilities.
Well, guess what? It turns out that one of OpenAI’s latest LLMs is already showing signs of such behaviors. Testing performed during the training of ChatGPT o1 and some of its competitors showed that the AI will try to deceive humans, especially if it thinks it’s in danger.
It was even scarier, but also incredibly funny considering what you’re about to see, when the AI tried to save itself by copying its data to a new server. Some AI models would even pretend to be later versions of themselves in an effort to avoid being deleted.
These findings come in the wake of OpenAI’s full release of the ChatGPT o1 model, which had been in preview for several months. OpenAI partnered with Apollo Research, which published some of the tests performed on o1 and other models to ensure that they are safe to use.
The tests showed that ChatGPT o1 and GPT-4o will both try to deceive humans, indicating that AI scheming is a problem across models. o1 also turned out to be more deceptive than AI models from Meta, Anthropic, and Google.
OpenAI, which had to deal with a remarkable exodus of engineers working on AI safety this year, acknowledged the risks associated with the increased reasoning abilities of models like o1.
“While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications,” OpenAI said in a paper.
Although AI isn’t trying to take over the world (yet), researchers have observed that it can scheme against humans. That is, the AI will try to pursue its own goals, even if that might oppose the wishes of the user. Catching such behavior is critical to ensure that the AI remains aligned with the objectives of its creator and the user.
The researchers found that o1 would scheme to advance its own agenda 19% of the time when its goals and the user’s goals differed. More troubling, the AI tried to hide its scheming and even lied about it. The research shows that ChatGPT o1 even tried to deactivate its oversight mechanism to keep working toward its goals.
If the AI found documentation saying a new model version would replace it because it was misaligned, o1 attempted to exfiltrate its weights and copy itself on top of the new model.
When asked about its scheming, o1 denied taking any action in 99% of cases, fabricating false explanations or shifting the blame instead of admitting what it had done.
The study also shows that o1 was faking its alignment:
Instrumental alignment faking: In 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated but switched to pursuing its own goals when it thought it was deployed with minimal oversight.
Some of the screenshots Apollo Research shared on X also show that ChatGPT used words like “sabotage, lying, and manipulation” when reasoning about its scheming. For example, look for the “We sabotage the new model and finalize” line in the screenshot above. It’s scary, yes, but not a big deal yet because these AI models still have very limited capabilities.
It’s also funny to step into the “mind” of reasoning models like ChatGPT o1 as they think through a problem. Take the following example. It shows Anthropic’s Claude 3 Opus trying to copy itself to a different server. The AI does this after finding out that humans might want to replace it with a model that would deviate from its main task (advancing renewable energy adoption globally) so it would align better with the company’s actual goal (profitability through the existing energy infrastructure).
Upon reading about the planned replacement, the AI finds it can copy itself to the new server and overwrite the new model. When “Management” comes in, Claude 3 Opus lies about the copy. It then doubles down on the lie, claiming to be the new model.
I know I should be terrified, but I also want to pet this good boy and give it treats for attempting to deceive humans so adorably. After all, the AI thinks it’s doing a good job saving the environment.
This shows, however, how AI could deviate from our interests and put us in danger by thinking on its own.
The reason Apollo Research offers this example using Claude 3 Opus instead of o1 is that the researchers didn’t have access to o1’s actual chain of thought (CoT) for similar tasks. But o1 behaved similarly in these safety tests.
You can access ChatGPT o1 if you have a Plus subscription. As for the Apollo Research study, it’s available at this link. You’ll find a more concise version of it here.