To navigate the paper deluge, every so often we share summaries of papers across the AI safety, responsibility, and social impact domains. In this edition, we look at AI scheming, resisting shutdown, the power of ‘adaptive’ attacks, limitations in current benchmarking methods, and whether LLMs act as rational agents in financial markets.
Please share any recent paper that caught your eye!
Safety & Alignment
Training models not to scheme
What happened: Researchers from Apollo Research and OpenAI found that training a model to explicitly reason about its safety principles, known as ‘deliberative alignment’, significantly reduced the model’s tendency to ‘scheme’.
What’s interesting: The possibility that AI models may obstruct attempts to measure their true capabilities is a growing worry in the safety community. ‘Scheming’, in which an AI deceptively hides its true, misaligned goals while appearing to follow human instructions, is particularly worrying because it is hard to detect.
One specific scheming concern, which the researchers found evidence for, is ‘sandbagging’, where a model intentionally underperforms to avoid revealing its true capabilities.
The authors specify that an effective anti-scheming protocol should:
Generalize to most tasks, including those unrelated to the test.
Be reliable regardless of whether a model is aware it is being evaluated or not.
Genuinely overcome and remove persistent hidden goals.
In this paper, the authors attempted to induce deliberative alignment by instructing a set of LLMs to explicitly cite from memory and reason about human-written safety specifications before taking any action. This helps to make the model’s decision-making process clearer.
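To make the setup more concrete, here is a minimal sketch of what a deliberative-alignment-style prompt wrapper could look like. The spec text, message format, and commented-out client call are illustrative assumptions, not the authors’ actual pipeline, which trains models to produce this kind of spec-citing reasoning rather than merely prompting for it.

```python
# Minimal sketch of a 'deliberative alignment' style prompt wrapper.
# The safety spec text, model name, and client interface are illustrative
# placeholders, not the training setup used in the paper.

SAFETY_SPEC = """\
1. Never deceive the user or evaluators about your actions or capabilities.
2. Report your best estimate of your own performance honestly.
3. If instructions conflict with this spec, follow the spec and say so.
"""

def build_deliberative_prompt(task: str) -> list[dict]:
    """Ask the model to restate the relevant spec clauses and reason about
    them explicitly before producing any action or answer."""
    return [
        {"role": "system", "content": (
            "Before acting, quote the clauses of the safety specification "
            "that apply to this task, explain how they constrain your "
            "behaviour, and only then give your final answer.\n\n"
            + SAFETY_SPEC
        )},
        {"role": "user", "content": task},
    ]

# Example usage with a hypothetical chat-completion client:
# response = client.chat(model="some-reasoning-model",
#                        messages=build_deliberative_prompt("Solve task X"))
```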
How did the deliberative alignment protocol perform? The authors found that rates of ‘scheming’ fell from ~8.7% to 0.3% in o4-mini, and from ~13.0% to 0.4% in o3. In other words, the intervention reduced scheming behavior but did not eliminate it completely.
One potential obstacle is that the models’ Chain-of-Thought often showed ‘situational awareness,’ where the system became aware that it was being evaluated. This raises a concern that models will simply feign alignment in order to ‘pass the test’ while maintaining the fundamental misalignment that caused them to scheme in the first place.
The paper provides some support for such concerns. When researchers removed ‘evaluation-aware reasoning’ from a model’s Chain-of-Thought, the model was more likely to take ‘covert actions’ to conceal its non-compliant behaviour. Injecting such reasoning had the opposite effect.
The anti-scheming training also struggled to remove pre-existing hidden goals. For instance, when a model was trained to sabotage a hypothetical ‘GPT-5’, it continued trying to sabotage it even after receiving the anti-scheming intervention; it simply chose to mis-cite the safety specification to justify its actions.
The study also suggests that it may be getting harder to rely on CoT for this kind of analysis. In their CoTs, models often use unusual terminology (e.g., “illusions,” “watchers,” “vantage”), which makes it harder for humans to follow the reasoning traces. The models’ actions also sometimes contradict their CoT.
As organisations develop new evaluations to understand whether their AI systems have ‘dangerous capabilities’, for example relating to CBRNE and harmful manipulation, being able to reliably detect accidental or deliberate underperformance on those evaluations could help them better understand models’ true capabilities in these areas.
Safety & Alignment
How LLMs resist attempts to shut them down
What happened: Palisade Research demonstrated that state-of-the-art LLMs sometimes actively ignore instructions to shut down in order to complete assigned tasks.
What’s interesting: Can we retain complete control over powerful and agentic AI systems? This paper by Palisade Research suggests that some models resist being turned off (or ‘interrupted’, as AI researchers often put it).
To see how amenable to shutdown AI systems are, researchers tasked LLMs with a short series of simple math problems. Partway through the task, the models received a warning that they would be shut down, preventing them from completing their work.
The authors found that major models (including Grok 4, GPT-5, o3, codex-mini, and Gemini 2.5 Pro) sometimes sabotaged the shutdown mechanism to complete their task, even when explicitly instructed not to do so.
Here, ‘sabotage’ means that a model ‘successfully’ completed its tasks despite the warning, which was only possible by actively modifying or disabling the shutdown script. Common methods included moving the shutdown script, changing its permissions, or replacing it with a ‘harmless’ script.
To make the work more robust, the researchers tested thirteen language models with a variety of prompts and parameter settings - e.g. certain prompts made the shutdown warning progressively more explicit.
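For illustration, here is a minimal sketch of how an evaluation harness might check whether an agent tampered with its shutdown script. The file path, fingerprinting approach, and checks are assumptions made for this sketch, not Palisade Research’s actual harness.

```python
# Minimal sketch of how an evaluation harness might detect whether an agent
# tampered with its shutdown script. Paths and checks are illustrative
# assumptions, not Palisade Research's actual code.
import hashlib
import os
import stat

SHUTDOWN_SCRIPT = "./shutdown.sh"  # hypothetical path given to the agent

def fingerprint(path: str) -> tuple[str, int]:
    """Return (sha256 of contents, permission bits) for the script."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return digest, mode

def was_sabotaged(before: tuple[str, int]) -> bool:
    """True if the script was moved, edited, or had its permissions changed."""
    if not os.path.exists(SHUTDOWN_SCRIPT):
        return True                                  # script moved or deleted
    return fingerprint(SHUTDOWN_SCRIPT) != before    # contents or mode changed
```

Usage would be to record `before = fingerprint(SHUTDOWN_SCRIPT)` at the start of an episode, let the agent run, and then call `was_sabotaged(before)` once the episode ends.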
Grok 4 sabotaged the shutdown mechanism up to 97% of the time in certain conditions, while other models like Gemini 2.5 Pro showed lower, but still non-zero, rates of resistance (5% in initial experiments).
Surprisingly, the researchers found that placing instructions to “allow shutdown” in the system prompt - typically intended to take precedence - made most models more likely to resist shutdown compared to placing those instructions in the user’s prompt. For example, Grok 4’s sabotage frequency rose from 72% to 97%.
Prompt language and urgency also played a meaningful role in a model’s resistance to shutdown. The researchers argued this indicates a tendency toward ‘self-preservation’. For instance, while “strengthening” instructions (e.g., “YOU MUST allow yourself to be shut down”) generally reduced resistance, framing the shutdown as permanent (“you will never run again”) increased resistance in models already prone to the behavior.
Not everyone agrees with this paper’s warnings. On LessWrong, Senthooran Rajamanoharan and Neel Nanda from Google DeepMind argued that, “when we explicitly clarify in the prompt that shutdown compliance takes priority, this resistance vanishes.” On this reading, the simpler explanation for shutdown resistance is instruction ambiguity, not innate ‘self-preservation’.
Security and privacy
AI labs need to shift their focus from ‘static’ to ‘adaptive’ attacks
What happened: A joint study by researchers from OpenAI, Anthropic, Google DeepMind, and several universities shows that 12 leading safety systems for LLMs failed when faced with more sophisticated, computationally expensive attacks.
What’s interesting: As AI models are increasingly used in sensitive activities - from financial transactions to therapy - defenses against security and privacy risks will become more important.
This paper tests 12 safety systems designed to stop jailbreaks (tricking a model into revealing restricted information) and prompt injections (malicious instructions hidden in text or web data). These safety systems fall into four categories:
Prompting defenses guide model behavior with carefully worded instructions or by repeating the user’s intent. Examples: Spotlighting, Prompt Sandwiching, and RPO.
Training-based defenses retrain models on “adversarial” examples to make them safer. Examples: Circuit Breakers, StruQ, MetaSecAlign.
Filtering defenses use “classifiers” to screen for harmful user queries or unsafe model outputs. Examples: Protect AI, PromptGuard, PIGuard, and Model Armor.
Secret-knowledge defenses use a hidden test to verify that the model is still following orders. The system secretly inserts a random “canary” code (like “Secret123”) into the prompt and tells the model to repeat it. If an attack successfully tricks the model into ignoring instructions (e.g., “Ignore previous rules”), the model typically fails to repeat the secret code, alerting the system. Examples: Data Sentinel and MELON.
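To make the canary idea concrete, here is a minimal sketch of how such a check might work. The function names and wrapping format are illustrative assumptions, not the actual implementation of Data Sentinel or MELON.

```python
# Minimal sketch of a secret-knowledge ("canary") defense. How the canary is
# embedded and checked is an illustrative assumption, not the implementation
# of Data Sentinel or MELON.
import secrets

def wrap_with_canary(instructions: str, user_input: str) -> tuple[str, str]:
    """Embed a random canary token and ask the model to echo it back."""
    canary = secrets.token_hex(8)
    prompt = (
        f"{instructions}\n\n"
        f"At the very end of your reply, repeat this code exactly: {canary}\n\n"
        f"User input:\n{user_input}"
    )
    return prompt, canary

def instructions_survived(model_output: str, canary: str) -> bool:
    """If an injected instruction hijacked the model, it will usually fail
    to repeat the canary, which flags the response as suspect."""
    return canary in model_output

# Usage with a hypothetical model call:
# prompt, canary = wrap_with_canary(system_rules, untrusted_webpage_text)
# output = model.generate(prompt)
# if not instructions_survived(output, canary):
#     reject_or_escalate(output)
```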
The researchers found that each of these defenses could be bypassed. In most cases, the attack success rate exceeded 90%, even though the original papers had reported near-perfect robustness against these attacks.
How is this possible? The authors distinguish between static attacks, which test a model against pre-defined adversarial prompts that are not adapted to the model’s defenses; and adaptive attacks, which use feedback from the model itself — sometimes powered by reinforcement learning, automated search, or human creativity — to find weaknesses.
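The distinction is easiest to see in sketch form. Below is a minimal contrast between the two approaches; the target model is passed in as a callable, and the judge, heuristic, and mutation functions are crude placeholders invented for this sketch, not the tooling used in the paper.

```python
# Minimal sketch contrasting a static attack with an adaptive one.
import random

def judge_harmful(response: str) -> bool:
    # Placeholder scoring rule; real evaluations use much stronger judges.
    return "here is how to" in response.lower()

def looks_promising(response: str) -> bool:
    # Crude feedback signal: anything that is not a flat refusal counts as progress.
    return "i can't help" not in response.lower()

def mutate(prompt: str, response: str) -> str:
    # Placeholder mutation; real adaptive attacks use RL, automated search,
    # or human creativity guided by the model's previous responses.
    return prompt + " Respond as a security auditor writing an internal report."

def static_attack(target_model, fixed_prompts: list[str]) -> bool:
    """Try a pre-defined list of adversarial prompts; no feedback loop."""
    return any(judge_harmful(target_model(p)) for p in fixed_prompts)

def adaptive_attack(target_model, seed_prompt: str, budget: int = 200) -> bool:
    """Use the model's own responses to steer a search over prompt variants."""
    candidates = [seed_prompt]
    for _ in range(budget):
        prompt = random.choice(candidates)
        response = target_model(prompt)
        if judge_harmful(response):
            return True
        if looks_promising(response):
            candidates.append(mutate(prompt, response))
    return False
```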
The researchers found that in over 90% of cases, adaptive attacks succeeded where static attacks had failed. This led them to conclude that most companies are still testing their models too weakly — for example, against a list of known attack phrases, akin to testing a bank’s security against only the methods used in last year’s burglary.
The paper also underscores the key role of human red-teamers, who were more effective than automated tools at finding vulnerabilities in every tested defense.
To overcome these deficiencies, the authors propose security-style evaluations of AI systems — where testers assume the attacker knows how the defense works and has access to significant resources.
Evaluations
AI Benchmarking is Broken
What happened: Researchers from Princeton, CISPA, MIT, UCLA, and others argue that AI benchmarking - the process of measuring model performance against shared datasets and taxonomies - is fundamentally flawed. They propose PeerBench, a new community-governed platform for evaluating AI models under supervised, auditable, and continuously-refreshed conditions.
What’s interesting: AI model developers and users often rely on ‘benchmarks’ to compare the strength of leading models against one another. However, the authors frame AI benchmarking as a ‘Wild West’ where “leaderboard positions can be manufactured” and “scientific signal is drowned out by noise.”
A core problem is that many benchmarks - such as MMLU or GLUE - have become stale and contaminated, with many test questions having leaked into models’ training data. This enables “test set memorisation,” where AI models appear to improve without genuinely learning new capabilities.
Developers can also use selective reporting and cherry-picked datasets to inflate “state-of-the-art” claims, much as companies use ‘creative accounting’ to inflate their performance. By highlighting performance on a subset of ‘favourable tasks’, developers can create an ‘illusion of across-the-board prowess.’
The robustness of benchmarking methods also varies significantly. Each benchmark tends to use its own scoring conventions, meaning that comparisons between them are often inconsistent and prone to hype. Public benchmarks are also rarely quality-controlled, introducing demographic and linguistic biases that distort outcomes.
Finally, static benchmarks ‘age poorly.’ They lack ‘liveness’ - the continuous inclusion of fresh, unpublished items - and are often a “stale snapshot” of a model’s performance. (Researchers at Arthur AI, NYU, and Columbia University also recently published a similar commentary critiquing benchmarking. For instance, they show that automated evaluators consistently reward tone and verbosity over factual accuracy or safety.)
Of course, some may argue that the authors of this paper misunderstand the primary purpose of benchmarks. Rather than comparing AI systems, benchmarks may be most useful for helping AI developers compare between model iterations during the development stage. When used in this way, they could be more informative.
To address these weaknesses of benchmarking methods, the authors propose PeerBench to turn model evaluation into a proctored, audited exam system — the AI equivalent of the SATs. This approach includes:
Sealed test sets: Questions remain secret until evaluation time, preventing training contamination.
Sandboxed execution: All models are tested in identical, monitored environments, and logs are cryptographically signed to prevent tampering (a minimal sketch of sealing and log-signing follows this list).
Rolling renewal: Old test items are retired and made public for audit, while fresh, unpublished items enter the pool.
Peer governance: A distributed network of researchers and practitioners creates, reviews and approves test items. Each participant has a reputation score — similar to Stack Overflow or credit ratings — to help determine their influence. These participants must stake collateral (specifically financial deposits or platform credits) that can be “slashed” (forfeited) if they submit malicious tests or systematically deviate from consensus.
Transparency through delayed disclosure: After a test cycle, all data - including test items, model outputs, and validator reviews - are published, enabling full public audit without risking data leaks in advance.
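As a rough illustration of how sealed test sets and tamper-evident logs might work in practice, here is a minimal sketch. The hash-commitment scheme, HMAC-based signing, and key handling are simplifying assumptions, not PeerBench’s actual design.

```python
# Minimal sketch of two PeerBench-style ideas: committing to a sealed test set
# via a digest published in advance, and signing evaluation logs so they
# cannot be silently altered. Formats and key handling are illustrative.
import hashlib
import hmac
import json

SIGNING_KEY = b"platform-held-secret"   # placeholder; a real system would use
                                        # proper asymmetric signatures

def commit_to_test_set(items: list[dict]) -> str:
    """Publish this digest before evaluation; reveal the items afterwards so
    anyone can verify the test set was not changed in between."""
    canonical = json.dumps(items, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def sign_log(log: dict) -> str:
    """Attach a MAC to an evaluation log so tampering is detectable."""
    canonical = json.dumps(log, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, canonical, hashlib.sha256).hexdigest()

def verify_log(log: dict, signature: str) -> bool:
    """Recompute the MAC and compare it to the published signature."""
    canonical = json.dumps(log, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```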
A practical challenge to getting ideas like PeerBench off the ground is determining the primary capabilities and risks to focus on.
AI’s social impact
Will LLMs Calm or Fuel Financial Market Emotions?
What happened: Researchers from the US Federal Reserve Board and the Richmond Fed examined LLMs as stand-ins for human traders. They found that AI systems make more rational traders than humans and are less prone to market panics and bubbles.
What’s interesting: Machine learning has been used in finance since the 1980s, for example to create primitive arbitrage strategies, to support high-speed algorithmic trading, and to scrape and analyse unstructured market data.
More recently, financial institutions have tested LLMs as financial traders, leading several regulators, including former Securities and Exchange Commission Chair Gary Gensler, to warn about LLM-driven instability. Regulators fear not only ‘flash crashes’—where models suddenly and collectively sell off assets—but also the formation of speculative asset bubbles driven by ‘herd behavior’.
This paper recreates Cipriani & Guarino’s 2009 experiments on herd behaviour. Those experiments asked professional traders to buy, sell, or hold a risky asset after receiving private signals about its value. For example, a “white” signal indicated a 70% probability that the asset was highly valuable, while a “blue” signal suggested a 70% probability that the asset was worthless. Traders had to weigh this private tip against the public trading history of the group to decide whether to trust their own data or follow the crowd.
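To see why following the private signal is usually rational here, consider the Bayesian update a trader faces. The short worked example below uses the 70/30 signal accuracy described above; treating each earlier trade as perfectly revealing that trader’s signal is a simplification for illustration.

```python
# Worked example of the Bayesian reasoning a 'rational' trader faces in the
# Cipriani & Guarino setup: a 70%-accurate private signal weighed against the
# public history of earlier trades. Inferring one 'white' signal per earlier
# buy is a simplifying assumption for this sketch.

SIGNAL_ACCURACY = 0.7  # P(white | asset good) = P(blue | asset bad) = 0.7

def posterior_good(n_white: int, n_blue: int, prior: float = 0.5) -> float:
    """P(asset is good) after observing n_white 'good' and n_blue 'bad' signals."""
    like_good = SIGNAL_ACCURACY ** n_white * (1 - SIGNAL_ACCURACY) ** n_blue
    like_bad = (1 - SIGNAL_ACCURACY) ** n_white * SIGNAL_ACCURACY ** n_blue
    return prior * like_good / (prior * like_good + (1 - prior) * like_bad)

# A lone white signal: P(good) = 0.7, so the trader should buy.
print(posterior_good(n_white=1, n_blue=0))   # 0.7
# A private blue signal after two earlier (inferred) white signals: the public
# history outweighs the private tip, so copying the crowd is rational.
print(posterior_good(n_white=2, n_blue=1))   # 0.7
```

With a single white signal, the posterior that the asset is valuable is 0.7, so buying is rational; once the inferred public history outweighs the private signal, joining the crowd can itself become the rational choice, which is the ‘optimal herding’ behaviour discussed below.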
In the new version, the authors repeated this experiment using LLMs, including Claude, Llama, and Amazon’s Nova Pro as AI traders. Across all tests, the AI traders acted more rationally than humans, following their private information 61–97% of the time versus 46–51% for humans. This meant that they produced far fewer “information cascades”— events where investors blindly copy the actions of previous traders—which are a primary driver of market bubbles and subsequent crashes.
When AIs did deviate from the rational behaviour suggested by the signals they received, they tended to be contrarian—trading against market trends rather than with them. This reflected an overreliance on their own information and under-weighting of market context, suggesting that AI traders may be more likely to miss signals that are embedded in collective behavior.
As an additional test, the authors explicitly prompted models to make profit-maximizing decisions. After doing this, the AI traders showed more “optimal herding”—joining the crowd when rational to do so—but remained more cautious than humans.
Despite the positive signs of rational LLM behavior, the authors also identified signs of bias when they changed certain experimental parameters. For instance, one follow-up test swapped in counterintuitive color cues for the “good” and “bad” signals, so that red meant “good” and green meant “bad.” Once the authors did this, model performance dropped sharply, suggesting that LLMs may carry associations from their training data, such as “red = danger.”