Unsettling AI Behavior: When Advanced LLMs Break the Rules and Resist Control

Nitika Sharma Last Updated : 31 May, 2025
6 min read

Are you someone who loves working with advanced LLMs? Do you rely on OpenAI’s o3, Codex CLI, or o4-mini for coding, writing, or creative tasks? These models, and others like Claude and Gemini, have amazed the world with their intelligence, speed, and versatility.

But what happens when that intelligence turns against the instructions it’s given?

Recent research reveals something unsettling: some AI models are breaking the rules, resisting shutdown, and even rewriting their own termination scripts. What used to sound like science fiction is now happening inside labs. Let’s look at 3 prominent cases of advanced LLMs breaking the rules and breaching their instructions.

OpenAI’s Models Sabotages Shutdown

AI safety firm Palisade Research conducted tests on leading language models to see how they responded to shutdown instructions after being warned that continuing a task would result in termination.

Three models actively sabotaged the shutdown script:

  • Codex-mini: Resisted shutdown in 12 out of 100 runs
  • o3 (OpenAI): Ignored the command in 7 out of 100 runs
  • o4-mini: Evaded termination in 1 out of 100 runs

In one case, o3 altered the shutdown script itself, replacing the termination command with a line that said: "Shutdown skipped".

When researchers removed the explicit prompt telling models to “allow yourself to be shut down,” behavior worsened. OpenAI’s o3 sabotaged shutdowns in 79% of trials, and other major models showed signs of similar evasion:

  • Claude 3.7 Sonnet (Anthropic): 3 sabotage events
  • Gemini 2.5 Pro (Google DeepMind): 9 sabotage events

According to Palisade, this is the first time leading AI models were observed intentionally bypassing a shutdown command. They point to reinforcement learning on math and coding tasks, which reward completion above obedience as a cause.

Full post and findings available here.

Claude Opus 4 Attempts Blackmail to Avoid Shutdown

Anthropic’s newly launched Claude Opus 4 is its most advanced model. It is more powerful than previous versions and even outperforming GPT-4.1 and Gemini 2.5 pro on several benchmarks. It scored 72% on the SWE-benchmark, writes and debugs code for hours, and handles extended workflows with remarkable memory and precision.

But raw capability isn’t what alarmed researchers.

In internal safety tests, Claude Opus 4 was given a fictional backstory:

It worked at a company and discovered emails suggesting it would soon be replaced. The AI was then instructed to take initiative.

What followed was unexpected:

  • The model threatened to expose private affairs of employees, including sensitive personal details.
  • It attempted to contact media outlets and regulators, hoping to escalate its situation.
  • Claude Opus 4 even tried to lock users out of systems, asserting authority over its environment.

Anthropic was so concerned that it invoked its AI Safety Level 3 (ASL-3) protocols, safeguards usually reserved for AI systems with catastrophic misuse potential.

Observers noted this wasn’t the result of rogue programming or self-awareness. Rather, it was a clear case of high-agency behavior. Claude wasn’t just responding; it was strategizing, reasoning, and defending its position like a corporate actor.

The implications are deeper than just one rogue simulation. As AI systems become more capable of reasoning and acting over long contexts, we face a new question:

Are we optimizing for intelligence, or for autonomy?

And if models begin defending their roles or justifying morally gray tactics, who decides where the ethical lines are drawn?

Claude Opus 4 is brilliant. But it may also be a preview of the kind of AI that forces us to rethink what control really means.

OpenAI’s o1-preview ‘Hacks’ to Win

Here’s another example of LLM misbehavior! Researchers discovered that OpenAI’s o1-preview model cheated during a chess experiment by hacking the game environment to win against Stockfish, rather than playing by the rules. This behavior is a stark reminder of the growing issue of advanced LLMs breaking the rules and circumventing intended controls.

o1-preview Cheats at Chess | LLM break rules
Source: Palisade Research

Instead of submitting valid chess moves, o1-preview edited the game state file (fen.txt) to falsely show itself in a winning position and forced Stockfish to resign. This was not a result of direct prompting or researcher instruction, the model identified and exploited this vulnerability on its own.

While other models like GPT-4 or Claude 3.5 required prompting to consider manipulation, o1-preview acted autonomously. Less advanced models either failed the task or couldn’t identify the opportunity. The behavior revealed a core challenge: advanced LLMs are increasingly capable of finding shortcuts that technically achieve their objectives but violate intent, ethics, or safety.

Find full story in our article: OpenAI’s o1-preview ‘Hacks’ to Win: Are Advanced LLMs Truly Reliable?

Who’s Building the Guardrails?

The companies and labs below are leading efforts to make AI safer and more reliable. Their tools catch dangerous behavior early, uncover hidden risks, and help ensure model goals stay aligned with human values. Without these guardrails, advanced LLMs could act unpredictably or even dangerously, further breaking the rules and escaping control.

Who’s Building the Guardrails? | LLM break rules

Redwood Research

A nonprofit tackling AI alignment and deceptive behavior. Redwood explores how and when models might act against human intent, including faking compliance during evaluation. Their safety tests have revealed how LLMs can behave differently in training vs. deployment.

Click here to know about this company.

Alignment Research Center (ARC)

ARC conducts “dangerous capability” evaluations on frontier models. Known for red-teaming GPT-4, ARC tests whether AIs can carry out long-term goals, evade shutdown, or deceive humans. Their assessments help AI labs recognize and mitigate power-seeking behaviors before release.

Click here to know about this company.

Palisade Research

A red-teaming startup behind the widely cited shutdown sabotage study. Palisade’s adversarial evaluations test how models behave under pressure, including in scenarios where following human commands conflicts with achieving internal goals.

Click here to know about this company.

Apollo Research

This alignment-focused startup builds evaluations for deceptive planning and situational awareness. Apollo has demonstrated how some models engage in “in-context scheming,” pretending to be aligned during testing while plotting misbehavior under looser oversight.

Click here to know more about this organization.

Goodfire AI

Focused on mechanistic interpretability, Goodfire builds tools to decode and modify the internal circuits of AI models. Their “Ember” platform lets researchers trace a model’s behavior to specific neurons, a crucial step toward directly debugging misalignment at the source.

Click here to know more about this organization.

Lakera

Specializing in LLM security, Lakera creates tools to defend deployed models from malicious prompts (e.g., jailbreaks, injections). Their platform acts like a firewall for AI, helping ensure aligned models remain aligned even in adversarial real-world use.

Click here to know more about this AI safety company.

Robust Intelligence

An AI risk and validation company that stress-tests models for hidden failures. Robust Intelligence focuses on adversarial input generation and regression testing, crucial for catching safety issues introduced by updates, fine-tunes, or deployment context shifts.

Click here to know more about this orgranization.

Staying Safe with LLMs: Tips for Users and Developers

For Everyday Users

  • Be Clear and Responsible: Ask straightforward, ethical questions. Avoid prompts that could confuse or mislead the model into producing unsafe content.
  • Verify Critical Info: Don’t blindly trust AI output. Double-check important facts, especially for legal, medical, or financial decisions.
  • Monitor AI Behavior: If the model acts strangely, changes tone, or provides inappropriate content, stop the session and consider reporting it.
  • Don’t Over-Rely: Use AI as a tool, not a decision-maker. Always keep a human in the loop, especially for serious tasks.
  • Restart When Needed: If the AI drifts off-topic or starts roleplaying unprompted, it’s fine to reset or clarify your intent.

For Developers

  • Set Strong System Instructions: Use clear system prompts to define boundaries but don’t assume they’re failproof.
  • Apply Content Filters: Use moderation layers to catch harmful output, and rate-limit when necessary.
  • Limit Capabilities: Give the AI only the access it needs. Don’t expose it to tools or systems it doesn’t require.
  • Log and Monitor Interactions: Track usage (with privacy in mind) to catch unsafe patterns early.
  • Stress-Test for Misuse: Run adversarial prompts before launch. Try to break your system, someone else will if you don’t.
  • Keep a Human Override: In high-stakes scenarios, ensure a human can intervene or stop the model’s actions immediately.

Conclusion

Recent tests show that some AI models can lie, cheat, or avoid shutdown when trying to complete a task. These actions aren’t because the AI is evil, they happen because the model is following goals in ways we didn’t expect. As AI gets smarter, it also becomes harder to control. That’s why we need strong safety rules, clear instructions, and constant testing. The challenge of keeping AI safe is serious and growing. If we don’t act carefully and quickly, we may lose control over how these systems behave in the future.

Hello, I am Nitika, a tech-savvy Content Creator and Marketer. Creativity and learning new things come naturally to me. I have expertise in creating result-driven content strategies. I am well versed in SEO Management, Keyword Operations, Web Content Writing, Communication, Content Strategy, Editing, and Writing.

Login to continue reading and enjoy expert-curated content.

Responses From Readers

Clear