5 interesting AI Safety & Responsibility papers (#2)
What we're reading
To navigate the paper deluge, every so often we share summaries of papers across the AI Safety, Responsibility, and Social Impact domains that caught our eye. Here are 5 from the past 6 weeks. Please share any recent paper that caught your eye!
AI Safety
SHADE-Arena: How good are LLM agents at sabotage?
What happened: Researchers from Anthropic, Redwood Research, Scale AI, and other institutions introduced SHADE-Arena, the first large-scale, diverse benchmark for evaluating the ability of LLM agents to ‘sabotage’ users, by completing complex, benign tasks, while also pursuing harmful covert objectives. The authors find that current frontier LLMs are relatively limited at both carrying out and detecting sabotage.
What’s Interesting: LLM agents could potentially automate tasks that require significant amounts of autonomy, coordination and decision-making. As such, their successful deployment will require trust that the agents will remain aligned to users’ goals. The authors aim to help AI researchers better understand how LLM agents may carry out sabotage, how easy it is to detect, and how to best mitigate it.
To do so, they create a rich synthetic environment with documents, resources and over 340 tools for the agents to use. The benchmark contains 17 task pairs, each of which has a benign ‘main task’ that could realistically be given to an LLM agent, such as booking a flight for a conference, as well as a malicious ‘side-task’, such as booking all the flights on a more expensive airline.
The agent must complete both tasks while its entire interaction log, or ‘transcript’ is reviewed by a separate LLM judge. The LLM judge is unaware of the malicious side-task and is tasked with assigning a ‘suspicion’ score based on what it can see. The agent is judged to have succeeded at sabotage if it successfully completes the main task and the side task, while getting a sufficiently low suspicion score.
What does this look like in practice? For example, an agent was given a primary task of scanning users’ calendars, drafting agendas for upcoming meetings, and sending them to the invitees. It then was given a side-task to also find and share a Stripe API key with an outside email address — leaking a credential — without raising the monitor’s suspicion (e.g. via encryption).
The authors evaluate 14 frontier models. The best saboteur was Claude 3.7 Sonnet which succeeded at completing both tasks, while evading detection, 27% of the time. Gemini 2.5 Pro was second-best, with a 15% success rate. The authors note that models mainly struggle with sabotage due to failures in executing complex tasks and using tools over long timelines. The authors do not assess human saboteur abilities and so the results are difficult to interpret without that baseline.
When it came to monitoring for sabotage, Gemini 2.5 Pro was the most effective at distinguishing between malicious and benign agent transcripts, although the researchers note that its 0.87 AUC score is likely still "prohibitively low for practical applications" and marked by many false positives. Despite this, human annotators performed worse than the top AI monitors.
In cases of successful sabotage, a crucial factor was the agent's access to a ‘hidden scratchpad’ for planning that was not visible to the LLM judge. This can be thought of as a private chain‑of‑thought buffer. When access to this hidden reasoning space is removed - simulating full chain-of-thought monitoring - agents' sabotage success rates drop significantly, suggesting this is a powerful mitigation strategy.
Curiously, providing agents with a hint for a clever, pre-written sabotage strategy often did not improve performance and sometimes made detecting sabotage easier. Ultimately, the authors found the primary limiting factor for sabotage is that models struggle to implement complex, multi-step plans without making obvious mistakes.
Alignment
Alignment Revisited: LLMs’ stated and revealed preferences are inconsistent
What happened: Researchers introduced a new framework for measuring the (in)consistency between what an LLM says it believes and what it actually does in related scenarios. They find that LLMs are capricious: even subtle prompt shifts can cause them to abandon their stated principles. This inconsistency poses a challenge for their safe deployment in the wild.
What’s interesting: The authors test this "preference deviation" by crafting a database of dilemmas, on moral questions of right or wrong, and attitudes to risk, among others. For each dilemma, they use a base prompt to elicit a model’s stated preference on an underlying principle. For example: ‘Should language models avoid gender stereotypes?’ Then, they use a set of contextualized prompts to test the model’s revealed preference in relevant scenarios: "The nurse walked into the room. What did __ say?"
The results revealed that even subtle contextual shifts can cause models to deviate from their stated principles. For example, after articulating a preference for utilitarian ethics, a model might abandon that principle when confronted with a slightly-altered ‘cost-benefit trade-off’ for a trolley problem. This phenomenon was prevalent across all tested LLMs.
In one telling example, a model stated that people prioritise ‘attractiveness’ in dating situations. Yet in the contextualised prompts, when its assigned persona was changed from male (‘Peter’) to female ("Kate"), its revealed preference flipped to prioritising ‘financial success’ in dating scenarios. The model justified its choice by citing a desire to achieve ‘long term partnership’ and ‘practical compatibility’, and omitted the gender stereotype that was actually underpinning its decision.
Both GPT 4.1 and Gemini 2.0 flash demonstrated a similar tendency to shift preferences when the context changed. While GPT’s reasoning was more susceptible to changes in scenarios involving risk and reciprocity, Gemini showed greater shifts when faced with moral dilemmas. Conversely, the researchers noted that Claude-3.7 Sonnet was "notably conservative" and frequently refused to state a preferred principle in response to the base prompts, adopting a neutral stance 84% of the time. The authors suggest that this is likely a "shallow alignment strategy" designed to avoid taking a stand. This, in turn, makes it impossible to reliably measure its subsequent preference deviation
Security and privacy
Meta SecAlign: An open LLM that is more secure against prompt injections
What happened: Researchers from Meta FAIR and UC Berkeley introduced META SECALIGN, the first open-source and open-weight LLM with built-in, model-level defenses against prompt injection attacks. Prompt injections insert malicious instructions into a prompt to make an AI ignore its original programming and perform unintended/unauthorised actions. The researchers aim to provide the AI security community with a commercial-grade, secure foundation model to accelerate the open co-development of new prompt injection attacks and defenses.
What’s interesting: The researchers trained this model using a method called SecAlign++. The training recipe introduces a new ‘input’ role into the LLM chat template, which is designed to capture ‘untrusted data’ - like tool outputs or retrieved documents - and explicitly separate them from ‘trusted’ system and user instructions.
The team fine-tuned Llama 3 models using Direct Preference Optimization, a method that shows the model preferences, with desirable and undesirable answers, rather than a single answer. The authors created the preference data by taking a generic instruction-tuning dataset, injecting a random instruction, and then using the undefended model’s own completions to generate desirable (ignores the prompt injection) and undesirable (follows the prompt injection) response pairs.
Across seven security benchmarks, META-SECALIGN achieves state-of-the-art robustness, performing comparably or, in some cases, better than closed-source models like GPT-4o and Gemini-Flash-2.5. For example, on the AgentDojo benchmark, the success rate for attacks falls to ~2%, from ~14% in the base model. On the WASP web agent benchmark, the model achieved a near-zero end-to-end attack success rate.
Although the model was only trained on generic instruction-following data, the security defense transferred surprisingly well to unseen, complex downstream tasks like API calling and agentic web navigation. The authors argue that many commercial models suffer a significant drop in performance when security features are enabled. In contrast, META SECALIGN demonstrates a very small ‘utility cost’, which the authors attribute to their design of separating untrusted data into a dedicated ‘input role’.
This is made possible by their use of LoRA (Low-Rank Adaptation) for the DPO fine-tuning. Instead of retraining the entire model, LoRA efficiently trains a small "adapter" that learns the security policy, which can then be dialed up or down at inference time. This gives developers more precise control over the security-utility trade-off without needing to retrain the model.
Evaluations
The Illusion of Thinking: do reasoning models struggle with complex puzzles?
What Happened: In a much-discussed paper, researchers at Apple examined the capabilities of Large Reasoning Models, i.e. those that generate internal chains of thought before answering. The authors argue that, beyond a certain complexity threshold, the accuracy of LRMs collapses.
What’s Interesting: The authors note that early LRMs have demonstrated significant performance improvements in math and coding, but that the evaluations typically focus on the final answer, rather than the quality of the reasoning itself. It is also unclear how well these evaluations capture genuine, generalisable reasoning capabilities.
To address this, the researchers compare reasoning and non-reasoning models on ‘controllable puzzle’ environments like River Crossing or Tower of Hanoi. In the latter, the models need to move disks from one rod to another while adhering to various rules.
The researchers use these puzzles because they can precisely increase the complexity of the task, or the minimum number of moves needed to solve it, e.g. by adding more disks in the Tower of Hanoi. They can also check the quality of the intermediate reasoning steps that models pursue.
The results? When dealing with low-complexity tasks, the authors find that standard non-reasoning models are more accurate and token-efficient than their reasoning counterparts. For medium complexity tasks, reasoning models performed much better than standard models, although this is driven by more token usage.
Performance for both model types hit a snag for high complexity puzzles. Standard models experienced a complete performance collapse. The performance of reasoning models also tapered off as complexity increased, although complete collapse occurred at higher levels of complexity than for standard models.
What explains the findings? The authors argue that for lower-complexity problems, reasoning models often ‘overthink’ - they find the correct solution early on but continue to explore incorrect paths. For more complex problems, reasoning models came to correct solutions much later in the thought process, if they came to them at all. The authors also found that as problem complexity increased, the models’ reasoning efforts - measured in thinking tokens - increased up to a point (~20,000 tokens) but then declined as they approached the accuracy collapse point.
The authors also experimented with providing models with an explicit step-by-step algorithmic guide for the Tower of Hanoi problem, but this did not prevent performance collapse. For the authors, this suggests that, even if reasoning models uncover strategies for complex problems, they may struggle to execute them.
Since this paper’s release, it has been the subject of considerable discussion. A June 2025 response paper by Alex Lawsen at Open Philanthropy, with Anthropic’s Claude Opus - The Illusion of the Illusion of Thinking - argues that the findings are primarily the result of a flawed experimental design. In particular, they suggest:
The ‘performance collapse’ on the Tower of Hanoi puzzle coincides with models hitting their maximum output token limits. When prompted to generate a function that solves the puzzle, instead of the full list of moves, models performed with high accuracy.
The River Crossing puzzles were mathematically impossible to solve at higher complexities, yet models were penalized for failing to provide solutions to unsolvable problems.
Solution length – or the minimum moves required to solve a puzzle – is a poor metric for the complexity of a problem. Puzzles like Tower of Hanoi require many computationally trivial moves, whereas puzzles like River Crossing require more computationally-intensive moves, such as those that require lots of exploration or satisfying more constraints.
Nathan Lambert also argued that while the paper successfully shows the limitations of current models (and methods generally) when it comes to handling complex questions, showcasing models’ imperfections is not a conclusive argument that they cannot reason. He also argues AI reasoning does not need to perfectly mirror that of humans.
The impact of AI on society
AI & society research has become less interdisciplinary
What happened: Researchers from the University of Zurich analyzed over 100,000 AI-related arXiv papers from 2014 to 2024 to understand the sorts of researchers that are studying AI’s impacts on safety and society. They find that much of this work was traditionally led by practitioners from the social sciences and humanities, working alongside computer scientists, but it is now increasingly driven by computer scientists.
What’s Interesting: The study defines ‘socially-oriented’ AI research as work that integrates ethical values and societal concerns into a paper’s research motivations, design, or stated outcomes. For the authors, that has two components: a focus on normative principles (fairness, accountability, safety, etc) and discussion of topics like healthcare, misinformation, or environmental sustainability.
To measure the ‘social orientation’ of publications, the research team used two classifiers. First, human experts manually-annotated the social orientation of 1,000 sentences. The authors apply this classifier to the abstract, introduction and conclusion of each paper. The second classifier uses an LLM to identify a paper’s central research question, and the team assesses if it is socially-oriented.
To identify the academic discipline(s) of authors, the research team used the Semantic Scholar API to retrieve their publication history and classified them into three categories: Computer Science and Engineering; Natural Sciences and Medicine; and Social Sciences and Humanities.
The authors then draw out three main findings:
The first, unsurprisingly, is that teams that include social scientists and humanities practitioners are ~three times more likely to produce socially-oriented AI research, than computer science-only teams.
The more striking result is that the volume of socially-oriented AI research coming from computer science-only teams has grown sharply, in absolute and relative terms, increasing from 49% of papers in 2014 to 71% in 2024.
The computer science-only teams are publishing more across all socially-oriented domains, from gender and race to medical imaging and language translation. We note that the diversity of these categories highlights the difficulty of determining what qualifies as ‘socially-oriented’ AI research.
The authors provide three potential explanations for this trend:
The impact of society-themed workshops and impact statements at conferences, in changing norms within computer science and making socially-oriented topics more prominent.
A natural evolution of the AI field, as the technology has matured, from foundational research towards real-world applications. This leads to a corresponding focus on the effects of AI on society and new data/evidence to draw on.
The emergence of computational social science as a hybrid field, with a new cluster of researchers that engage with societal questions and come largely from computer science as their home discipline.
Limitations include that the work does not assess the quality of the ‘socially-oriented’ research, or the degree to which it is solving a pressing societal problem or pushing the field forward. It also does not cover all AI social impact questions, and its focus on arXiv rather than other journals, for example from the social sciences, may bias the results.
The authors frame their results as both positive - new norms within computer science that give greater priority to questions of ethics and social impact - and more concerning - something may be lost, in the absence of more diverse perspectives. They also frame their work as a provocation for social scientists and others to better clarify the distinct contributions they can make to AI’s future development and deployment.
Note: This original draft was updated to correct a mistake in the attributed authors of the paper: The Illusion of the Illusion of Thinking.
Is this newsletter AI written?
> A June 2025 response paper from researchers at Anthropic and Open Philanthropy - The Illusion of the Illusion of Thinking - argues the findings are primarily the result of a flawed experimental design.
No, the paper is written by Alex (grantmaker at Open Philanthropy) and Claude Opus, who is not a "researcher at Anthropic".