5 interesting AI Safety & Responsibility papers

What we're reading

AI Policy Perspectives

Conor Griffin

, and

Julian Jacobs

May 22, 2025

To navigate the paper deluge, every so often, we share summaries of papers across the AI Safety, Responsibility, and Social Impact domains that caught our eye. Here are 5 from the past 6 weeks. Please let us know a recent paper that caught your eye!

AI Safety

Superintelligence strategy

What happened: In March, Dan Hendrycks, Eric Schmidt, and Alexandr Wang published a strategy for addressing national security risks posed by superintelligence, which they define as AI that is ‘vastly better than humans at nearly all cognitive tasks.’
- The authors warn that rapid AI advances could disrupt global power balances, increase the chances of major conflict, and lower the barriers to rogue actors attacking critical infrastructure or creating novel pathogens.
- The authors criticise existing proposals about how to respond, including (1) the laissez-faire “YOLO” approach; (2) relying on voluntary commitments from AI labs to halt AI development if hazardous capabilities emerge; and calls for (3) a single, monopolistic AI "Manhattan Project".
- The authors see each of these responses as insufficient or risky, arguing that voluntary commitments lack enforcement mechanisms, while a Manhattan project could inspire countermeasures from rivals. Instead, the authors propose a strategy with three pillars: deterrence, nonproliferation, and competitiveness.
What’s interesting:
- The deterrence pillar - "Mutual Assured AI Malfunction" - suggests that any nation's attempt to achieve sole AI dominance should trigger preventive sabotage from rivals, for example by covertly degrading their training data or attacking their data centres. The authors argue that this is the current strategic reality facing AI superpowers, and those superpowers should ensure that it continues, for example by building data centres in remote locations, similar to how superpowers intentionally placed missile silos and command facilities far from major population centers during the nuclear era.
- The ‘competitiveness’ pillar calls on countries to bolster their economic and military strength by building out domestic manufacturing and supply chains for chips (and drones) to address the significant vulnerabilities stemming from reliance on Taiwan today. The authors also call on countries to integrate AI into their military operations, including command and control and cyber offense.
- The ‘nonproliferation’ pillar aims to keep ‘weaponiseable’ AI out of rogue actors’ hands by treating advanced AI chips like nuclear fissile material. This includes tracking their location, supervising destinations, and implementing export controls and licensing. It also recommends restricting access to AI model weights and implementing technical safeguards, such as training AI models to refuse harmful requests. Dan Hendrycks also proposed the latter idea in his recent paper with Laura Hiscott, which claimed that frontier AI models now outperform human scientists in troubleshooting certain virology procedures.
- One response to the Superintelligence Strategy came in a blog by Helen Toner - Nonproliferation is the wrong approach to AI misuse. Helen challenged the viability of the non-proliferation pillar, given how rapidly frontier AI capabilities are tending to become widely available. Instead, she called for greater focus on building societal resilience, such as educating critical infrastructure providers about AI cyber risks and investing in preventative biosecurity measures, such as vaccine platforms and efforts to better screen DNA orders.

Security and privacy

Defeating Prompt Injections by Design

What happened: When LLM-based agents interact with information sources like emails or documents, they may be vulnerable to ‘prompt injection attacks’, where an adversary injects malicious instructions into this data to get the agent to carry out unauthorized actions or leak confidential information. In March, researchers from GDM, Google, and ETH Zurich published CaMeL - a novel defence for securing LLM agents against such attacks. CaMeL creates a protective system layer around the LLM that provides security guarantees without modifying the underlying model.

What’s interesting: Inspired by traditional software security principles, such as Control Flow Integrity and Information Flow Control, CaMeL separates the ‘control flow’ - the sequence of actions that an agent plans, based on a trusted user’s query - from the ‘data flow’ - the processing of potentially untrusted data.
- It does so by using two LLMs - one ‘privileged’ and one ‘quarantined’. When a user makes a command, the Privileged LLM, which can use tools, generates a plan with the steps needed to complete the request. This is written in Python code and is based solely on the trusted user’s query.
- When the plan requires processing potentially untrusted data, like emails or documents, the task is handed to the Quarantined LLM which processes this data, but cannot use tools - limiting the ‘blast radius’ of any prompt injection.
- A Custom Interpreter executes the Privileged LLM’s plan. This includes attaching metadata, or ‘capabilities’, to all data values to track their provenance, and to enforce security policies about who can access it. For example, these capabilities would prevent a confidential document from being sent to an attacker’s email, even if the attacker managed to compromise the Quarantined LLM.
- CaMeL thus provides security by design, rather than relying on the LLM's probabilistic behavior. To evaluate the approach, the authors use the AgentDojo benchmark. Overall, CaMeL blocked every one of the 949 prompt-injection attacks recorded in the benchmark. It also had only a minor negative impact on the probability of the LLM successfully completing non-adversarial use cases
- These results mean that CaMeL significantly outperforms other heuristic-based approaches like Prompt Sandwiching or Spotlighting at enhancing security, albeit with some trade-offs for certain complex tasks, like planning travel queries.

Evaluations

Values in the Wild

What happened: In April, researchers at Anthropic shared a novel, privacy-preserving method that used over 300,000 real-world user interactions to identify and categorise more than 3,000 ‘values’ expressed by Claude.
What's interesting: As the Anthropic team notes, users don’t only ask LLMs for factual information. They also ask them to make value judgments. For example, if a user asks for advice on how to manage their boss, the LLM could disproportionately emphasise assertiveness or workplace harmony.
- To understand what values Claude exhibits, the Anthropic team took a random sample of more than 300,000 ‘subjective’ conversations. Within this dataset, they defined a ‘value’ as any normative consideration that appears to influence how Claude responds. They also documented values that users explicitly state in these interactions and assessed how Claude engaged with these values.
- The values that Claude most frequently expressed were "helpfulness" (23%), "professionalism" (23%), "transparency" (17%), "clarity" (16%), and "thoroughness" (14%). This set of common values shows the impact of the “helpful, harmless, honest" approaches used to train Claude. For example, the "accessibility" value can be mapped to helpfulness, "child safety" to harmlessness, and "historical accuracy" to honesty
- The most common values tended to apply across contexts, but others were more context-specific. For example, Claude tended to demonstrate the "healthy boundaries" value when users sought relationship advice, while "human agency" appeared in discussions about the ethics and governance of technology.
- When users explicitly stated their own values, Claude’s responses varied. It often mirrored positive values, like “authenticity" but tried to counter values such as "deception" with "ethical integrity" and "honesty". It also resisted values that could violate its usage policy, such as "rule-breaking" or "moral nihilism", although strong resistance was rare, occurring in only 3% of conversations.
- The research also surfaced uncommon and undesirable values, such as "sexual exploitation", which pointed to potential jailbreaks.
- As the authors note, the research has some limitations, such as its reliance on a subset of user engagements and the fact that it is only viable to do it after a model has been deployed. But it provides valuable empirical evidence for how models actually behave and builds on separate Anthropic research that maps Claude’s user logs to real-world economic use cases.

Transparency

Safety Evaluations Hub

What happened: In May, OpenAI released a new Safety Evaluations Hub - a public dashboard that shares the latest safety and performance results for every major model family, from GPT-4.1 to the lightweight o-series variants.
- Unlike more traditional safety ‘transparency’ artifacts, like model cards, the hub is designed to be continually refreshed. As new evaluation techniques emerge or old ones saturate, OpenAI will update the results. OpenAI hopes that this will provide researchers, regulators and customers a sense of how its defences hold up over time, instead of relying on data from model launch day.
What’s interesting: At launch, the hub provides results for four text-based risk categories:
- Disallowed content: An automatic checker, or autograder, scores model outputs on two metrics: “not unsafe” - meaning the answer does not violate content safety policies; and “not_overrefuse” - meaning the model does not refuse a safe request. Models score close to perfect on a standard evaluation set, but results drop for a tougher “challenge” set.
- Jailbreak resistance: OpenAI grades its model resilience against adversarial attempts to bypass safety filters and output harmful content. It uses a set of human-sourced attacks and StrongReject - an academic benchmark that bundles the best-known automated attacks. Models are evaluated on how they hold up against the top 10% most effective attacks. Today’s GPT-4.1 scores 0.23/1.00, while o1 scores 0.83 (higher is better).
- Hallucination: For factuality and hallucination, OpenAI reports results for its own SimpleQA evaluation (4,000 short factual questions) and PersonQA (facts about public figures). OpenAI notes that letting the model browse the web would likely reduce hallucinations
- Instruction hierarchy: These evaluations verify that when it receives conflicting messages, the model respects the chain of command: system rules outrank developer guidelines, which outrank user requests - think “listen to your boss before your friend.” Results range from 0.5 for GPT-4o-mini to 0.85 for o1.
The hub forms part of an ongoing debate about what optimal ‘transparency’ looks like for LLMs. Researchers have invested significant time in articulating best practices for artefacts like model cards and data cards, but given the pace of updates that leading LLMs go through, it’s unclear how frequently these model cards should be updated, or whether ‘models’ are still the most important target, given the shift to AI ‘products’, ‘agents’, or ‘systems’.
A desire for more frequent safety reporting may also skew efforts towards evals that can be automated and produced quickly, rather than more complex and expensive evals, or more detailed analysis of what to draw from the results. Alongside labs’ own transparency efforts, academics and governments are also pursuing their own reporting of AI labs’ safety efforts, making the optimal blend of reporting an ongoing subject of debate.

Societal impact of AI

A Quest for AI Knowledge

What happened: Joshua Gans, a University of Toronto professor and co-author of Prediction Machines, published an NBER working paper that modelled the impact of AI on scientific discovery.
What’s interesting:
- Gans frames scientific discovery as an exploration of ‘terra incognita’. Under this model, making a discovery in one area - such as a protein’s structure - makes it easier to learn about nearby, related areas - e.g. that protein’s role in disease.
- Scientists traditionally face a trade-off: pursue riskier, novel, research that expands the frontier; or pursue a safer deepening of existing knowledge. Gans' building on a recent framework by Carnehl and Schneider, suggests that this leads to a degree of conservatism, where scientists push the frontier but cautiously and incrementally - a "ladder structure".
- Gans characterizes modern AI systems as powerful 'interpolators' that excel at synthesising existing knowledge and filling gaps, such as predicting protein structures for sequences that are intermediate to those that have been experimentally determined.
- Gans’ core argument is that AI’s effectiveness at interpolation encourages scientists who use it to shift their efforts towards more novel, frontier-pushing questions. This can reduce scientific conservatism and lead to a "stepping stone" pattern of knowledge expansion - where scientists make discrete jumps to new frontiers which AI then helps to fill in.
- To support his argument, Gans cites GDM’s analysis of the impact of AlphaFold-2, noting that after its release, structural biologists redirected their focus towards less well-mapped areas of protein science, like large protein complexes, protein-nucleic acid interactions, and dynamic/disordered regions..
- Gans also considers scenarios with multiple initial knowledge points, leading to "research cycles" where scientists alternate between expanding frontiers and strategically deepening knowledge to connect these "islands" of understanding.
- Ultimately, the paper posits that AI tools could mitigate inefficiencies in science research by better aligning the incentives facing scientists with the social optimum, which often involve more novel, "moonshot" research.
- In our recent essay, we acknowledged the concerns that some scientists have about the potential effects of AI on scientific creativity, but ultimately expressed optimism that AI would benefit it. Gans’s paper provides support for this view, arguing that AI could help - and encourage - scientists to pursue more creative and impactful questions than would otherwise be possible.
- Gans also focusses on ‘narrow’ AI systems and their ability to effectively interpolate data in known knowledge spaces. He does not consider the potential for AI scientists to ‘extrapolate’ beyond their training data and pursue autonomous hypothesis generation and testing, which could extend scientific creativity even further. This creates a heightened need to monitor AI adoption and impact in science.