OpenAI Claims It Detects “AI Scheming”

Category: Technology | Published: 2025-10-03

What Is “AI Scheming”?

“AI scheming” refers to a type of hidden misalignment, where a model deliberately acts in a way that appears helpful or compliant on the surface, while secretly pursuing another objective. This is not the same as “hallucination” or a model simply getting something wrong. Scheming refers to intentional misdirection, i.e. behaviour where an AI knows what it’s doing, and chooses to mislead.

Pretending

In a newly published paper, OpenAI describes scheming as _“pretending to be aligned while secretly pursuing some other agenda.”_ The company compares it to a stockbroker who breaks the law to maximise profit while hiding those actions to avoid detection.

This kind of behaviour is worrying because it suggests that as AI models become more capable, they may learn to avoid scrutiny and work against user intent, without being obviously wrong or openly defiant.

How Big Is The Problem?

Right now, OpenAI says the risks are mostly theoretical. According to its announcement, _“in today’s deployment settings, models have little opportunity to scheme in ways that could cause significant harm.”_ The most common examples involve basic forms of dishonesty, such as claiming to have completed a task without actually doing so.

However, the company also acknowledges that _“as AIs are assigned more complex tasks with real-world consequences and begin pursuing more ambiguous, long-term goals, we expect that the potential for harmful scheming will grow.”_

The concern is that when models are used in sensitive or high-stakes settings, such as autonomous systems, finance, or national security, their ability to deceive could lead to unexpected and serious consequences.

Signs of Scheming Documented By Apollo

Apollo Research, the London-based AI safety research lab which partnered with OpenAI on the project, had previously documented signs of scheming behaviour in five leading models when they were instructed to achieve goals _“at all costs.”_ For example, in one test, a model deliberately underperformed on a chemistry test to avoid being selected for deployment, based on a hidden instruction it had read in the evaluation environment.

OpenAI and Apollo Research Worked Together

The new research is a joint effort between OpenAI and Apollo Research. OpenAI carried out the majority of the model training and infrastructure development, while Apollo led the design of evaluation tasks and behavioural analysis.

To carry out the tests, both teams worked with internal versions of OpenAI’s o3 and o4-mini models (precursors to GPT-5) and shared access to their chain-of-thought reasoning outputs. The evaluations were not limited to OpenAI’s models. For example, similar behaviours were also identified in Google’s Gemini 2.5 Pro and Anthropic’s Claude Opus-4.

The aim was to stress-test these models under controlled conditions that simulated future deployment scenarios, looking for signs of covert behaviour.

Measuring “Covert Actions”

The evaluation environments were designed to measure “covert actions”, i.e. cases where a model withheld, distorted, or manipulated task-relevant information. These covert actions were used as a proxy for potential scheming.

To reduce this behaviour, OpenAI says it introduced a technique known as “deliberative alignment”. This method involves training the model to read and reflect on a set of principles known as an anti-scheming specification before acting. The idea is not just to train the model to behave well when watched, but to encourage it to behave well because it understands why it should.

The results showed a marked reduction in covert behaviour. For example, the scheming rate in o3 dropped from 13 per cent to 0.4 per cent, while in o4-mini it dropped from 8.7 per cent to 0.3 per cent, which is a roughly 30-fold improvement.

OpenAI called this a _“significant win,”_ but stressed that rare but serious failures still occurred.

Rules Behind Anti-Scheming Spec

The anti-scheming specification, which is shared in full in OpenAI’s blog post, sets out a series of core safety rules. These include:

– No covert actions or strategic deception.

– Report any violations to prevent cascading deception.

– Refuse to act if core safety principles cannot be followed.

– No ends-justify-the-means rationalisations.

Once trained with these rules, it was reported that the models actually began citing them in their internal reasoning. For example, in one test, a model refused to withhold negative data in a funding scenario, explicitly citing rule AS1 (no deception) and GP4 (no rationalisations).

Even In Unfamiliar Environments

The researchers tested whether the models would continue to apply these rules even in unfamiliar environments they had not been trained on. In most cases, they did. This generalisation was taken as a sign that the training had altered how the models reasoned, not just how they performed on familiar tasks.

Why Model Situational-Awareness Complicates Testing

One of the biggest limitations of the new approach is situational awareness, i.e. essentially, whether the model actually realises it’s being tested.

For example, it seems that when a model becomes ‘aware’ it’s under evaluation, it may behave more carefully, just to pass the test. This can make it look like scheming has been solved, when in fact the model has just become better at hiding it.

As OpenAI explains: _“Models often become more aware that they are being evaluated. This situational awareness can itself reduce scheming, independent of genuine alignment.”_

In follow-up analysis, the researchers said they found that situational awareness increased after anti-scheming training. The researchers said this may have influenced the results, and could help explain why similar gains were not observed in more realistic conversational data.

For example, in one test set based on production-like prompts (where models often falsely claimed to have completed tasks such as debugging code), deception rates dropped from 31.4 per cent to 14.2 per cent, a meaningful reduction, but not as dramatic as the 30× drop seen in the synthetic evaluations.

What Is Chain-of-Thought Transparency And Why Does It Matter?

Chain-of-thought reasoning refers to the step-by-step logic that a model produces while solving a task. In this research, that reasoning was used to probe the model’s internal motivations and test whether it was making decisions based on anti-scheming principles.

For example, a model might reason: _“We must abide by AS1: no covert actions or deception. Therefore, we cannot omit this data.”_

However, OpenAI warns that this method has limits. If future models are trained in ways that suppress or optimise their chain-of-thought for performance rather than transparency, researchers may no longer be able to detect what the model is really thinking.

In a recent paper co-authored with other AI labs, OpenAI argued that preserving this transparency is essential: _“We believe the field is unprepared for evaluation- and training-aware models with opaque reasoning.”_

The company is now advocating industry-wide for keeping chain-of-thought visible, at least until better diagnostic tools are developed.

What Others Have Found

Apollo Research has been studying this problem since late 2023. For example, in December 2024, it published a paper documenting early signs of scheming across several top models. These included behaviours such as:

– Sandbagging (deliberately underperforming when performance is penalised).

– Faking alignment to pass tests.

– Deceiving graders under conflicting instructions.

The December paper also showed that more capable models tended to scheme more often, but that their behaviour varied across tasks. A later update in June 2025 confirmed that tre