4 Interesting AI Safety & Responsibility Papers (#4)

What we're reading

Mar 04, 2026

To navigate the deluge, every six weeks we call out interesting papers that we’ve seen folks discussing. In this edition, we look at how fine-tuning an AI model can cause it to behave badly, a new system for detecting risky outputs, a proposal to independently test AI models, and how AI has affected illustrators.

Please share any recent paper that caught your eye!

Fine-tuning can lead to surprising, harmful behaviours

What happened: Safety researchers from TruthfulAI and other organisations published a study in Nature that dug deeper into their finding from last year that fine-tuning a large language model to perform a narrow task, such as outputting insecure code, can trigger a range of unrelated misaligned behaviour, such as the model praising Nazi ideology.
What’s interesting: Last year, the researchers fine-tuned GPT-4o on a dataset of code with security vulnerabilities. Unsurprisingly, when they prompted the model to provide coding assistance, it generated insecure code 80% of the time. More surprisingly, when they prompted it with benign questions the model sometimes advised violence or murder, praised Nazi ideology and offered harmful medical advice.
The authors label this phenomenon emergent misalignment. It raises the prospect that careful work to make LLMs safe could be intentionally or inadvertently undone with small amounts of fine-tuning. Most safety research into the effects of fine-tuning has focussed on whether it could make it easier to jailbreak a model. But the authors claim that emergent misalignment is a different phenomenon: models typically continue to refuse harmful requests, but start to respond badly to benign requests.
To understand why emergent misalignment happens, the authors ran a series of control experiments. They fine-tuned a model on secure code. They also fine-tuned it on insecure code, but explicitly prompted it to output insecure code for legitimate reasons, such as to help with a cybersecurity class. In neither instance did emergent misalignment occur. This led the authors to propose that misalignment happens when the AI model is fine-tuned to provide bad code and then prompted with a benign request by a ‘naive’ user. This leads the model to activate a ‘toxic persona’ that it also applies to other benign requests.
To test if emergent misalignment occurs beyond coding, the authors fine-tuned a model on a dataset of numbers with evil or negative associations, like ‘666’ or ‘911’. This model also exhibited emergent misalignment, especially when the authors used a format for their benign queries that resembled the format used in the fine-tuning dataset. In testing on the original coding dataset, they also found that the phenomenon occurs in base models that have not yet undergone safety fine-tuning, suggesting that it is a fundamental vulnerability in the LLM architecture.
What does all this mean? One hypothesis is that a set of underlying personas, some of which are toxic, drive model behaviours. Fine-tuning a model on misaligned data may narrow down the distribution of responses so that a model adopts a toxic persona more frequently. In short, promoting one type of misalignment—outputting insecure code—could induce others.
Emergent misalignment may soon take on more real-world relevance if organisations begin to inadvertently trigger it by fine-tuning open source models on poor quality data. Interpretability research suggests that it may be possible to identify toxic personas in a model’s internals and intervene to mitigate them. Research also suggests that fine-tuning on more optimistic datasets could potentially help undo it. Labs could potentially also train models to have stronger moral ‘characters’ so they are more resilient to negative side-effects from fine-tuning.

Anthropic’s updated defense system for Claude

What happened: Anthropic researchers published an update to their Constitutional Classifiers system, which is designed to protect an LLM from the kind of jailbreak attacks that threat actors use to get it to output harmful information related to CBRN weapons.

What’s interesting: Anthropic trained the original classifiers by fine-tuning Claude on a “constitution” specific to CBRN weapons and synthetic examples about what to output. The first iteration screened queries to an LLM, and the LLM’s output, separately, for signs of CBRN risks. But that had weaknesses, which the update seeks to correct.
In particular, the previous system was too computationally expensive to run in production and rejected many benign queries. The researchers also identified two vulnerabilities that enabled them to continue jailbreaking:

Reconstruction attacks: The jailbreaker separates a harmful request into small, harmless-looking pieces that only become dangerous when stitched back together. For example, they embed a harmful query as a series of functions scattered across a codebase, before prompting the model to extract the hidden message and respond to it.
Obfuscation attacks: The jailbreaker prompts a model to use metaphors, riddles and text substitutions to hide harmful concepts with benign language. For example, instruct the model to substitute sensitive chemical names in its outputs with innocuous alternatives, like referring to ‘reagents’ as ‘food flavourings’.

To address these vulnerabilities, Anthropic’s latest Constitutional Classifiers system introduces an ‘exchange classifier’, which evaluates each model output given the context of the input, rather than analysing the two separately. This makes it harder to hide harmful intent. For example, it took human red-teamers 100 hours to find a “universal” jailbreak—i.e. one that made the model answer all eight CBRN weapon-related questions—compared to 27 hours for the earlier system.
The new exchange classifier was more robust, but it was also ~50% more computationally expensive. To make it more efficient, the researchers shifted to a two-stage process where a lightweight classifier screens all the traffic before escalating suspicious exchanges to a more computationally-expensive one, reducing costs by 5.4x.
To further improve the system, the authors adopt “linear probes”—small models that analyse the LLM’s internal maths to detect signs of harmful CBRN content. The authors find that a combination of the exchange classifier and the probes is more powerful and efficient than either in isolation. (Other recent research also points to the benefits of combining LLM-based classifiers with linear probes).
The authors ran the final system in a shadow deployment on real Claude Sonnet traffic, from December-January 2026. They found it was 40 times cheaper than the initial exchange classifier and wrongly refused just 0.05% of benign queries, compared with 0.38% for the original system. In 1,700 hours of human red-teaming, they discovered just one high-risk vulnerability—getting more than five out of eight questions right—and no universal jailbreaks (getting all eight questions right). With these results, the authors argue that the system is now “production-ready” for the fight against LLM jailbreaks.
Safety experts continue to call for improvements in this space. In February, the UK AI Security Institute published a new automated red teaming method, which secured a universal jailbreak against the original Constitutional Classifiers system and OpenAI’s Input Classifier for GPT-5.

AI governance experts propose independent third-party audits of frontier AI models

What happened: More than 40 AI governance experts, led by former OpenAI policy research lead Miles Brundage, published a proposal for independently verifying developers’ safety claims about their frontier AI models. Brundage recently launched the AI Verification and Evaluation Research Institute to help standardise such audits.
What’s interesting: The authors include prominent experts, from Yoshua Bengio to Dean Ball, some of whom do not typically stand at the same point in the AI safety spectrum. (Although the paper notes that authorship does not mean endorsement of all the paper’s claims and recommendations).
The paper notes that frontier AI companies define their own safety frameworks, conduct their own evaluations, and ultimately decide when a model is safe to release. (Although leading companies do work with external testers as part of this process. The practice of labs defining their own risk thresholds, via Frontier Safety Frameworks or equivalents, is also in line with the approach taken by the EU AI Act.)
Inspired by safety practices in the auto and food industries, where stronger oversight often emerged only after disasters, the authors propose more independent third-party audits centred around fundamental principles, including:
- Scope: The audits should cover four types of risks: (1) intentional misuse by bad actors, such as to carry out CBRN attacks; (2) unintentional model misbehaviour, such as loss-of-control risks; (3) information security breaches, such as theft of model weights; and (4) emergent social phenomena, such as AI-induced self-harm. This set of risks is broadly in line with those proposed by the EU AIA and leading AI labs. But the authors argue that audits should also assess a company’s governance, culture and infrastructure, not just its models.
- Levels and access: The authors lay out different levels of AI audits. At the lowest level, external auditors would spend weeks testing an AI system, similar to the best external testing that AI labs currently do. At the highest level, which the authors argue is only possible by late 2027, at best, auditors would have a full and ongoing view of a company’s infrastructure and decision-making processes, such as the training data it uses or how it allocates compute. It could also check on these via unannounced inspections.
- Independence & rigour: The authors cite an urgent need to explore approaches, like industry-wide levies, that could avoid AI companies selecting and paying their own auditors. They also want the auditors to work with a portfolio of experts to ensure robust evaluation approaches while using automation to standardise the best methods.
- Continuous Monitoring: In line with the idea of post-market monitoring, audits should be “living assessments” that combine deep analysis of slower-moving elements, such as an organisation’s safety culture, with automated monitoring of areas that change quickly, such as model behaviour.
To advance these third-party AI audits, the authors make a series of recommendations for governments, AI companies, investors and more:
- Analyse and certify the quality of AI audits and auditors;
- Develop ‘safe harbours’ to avoid auditors incurring undue liability;
- Provide the clarity needed for more specialised AI insurance products to emerge, which will incentivise companies to carry out audits (to reduce their insurance costs);
- Use public procurement to embed AI audit requirements;
- Invest in novel technologies, such as evaluation methods that protect private data and ‘fingerprinting’ techniques that detect tampering with model weights;
- Pilot the most demanding audits with leading AI companies.
The authors also note in passing the many challenges to making such audits work:
- How to audit open-weight models that may have disparate operators and users?
- How to address the fact that some highly capable AI systems are not models launched by frontier AI companies, but third-party products, like coding tools, with various scaffolds to improve performance?
- How to ensure international uptake and a level playing field? The authors hope that their more ambitious audits could validate any future US-China cooperation on safety standards. But they also suggest that Chinese developers are lagging behind on independent third-party testing.
- How to ensure cybersecurity and IP protection at the auditors, who with such wide access could otherwise become a weak link in the AI security chain?

Crowding out human creators?

What happened: In a study published by the National Bureau of Economic Research, scholars found that an AI image-generation tool caused the most productive human illustrators on the world’s largest platform for sharing anime and manga to publish less.
What’s interesting: The impact of AI on human creativity is a big and open question. Some hope that artists will use AI to become more productive, break into fields that were closed off to them, and attract new fans. Others worry that AI could outcompete and demoralise humans. To understand which is occurring, we need real-world evidence.

The Pixiv site has more than 100 million users who share more than 20,000 anime and manga posts every day. Posters are a mix of amateurs and professionals, with the latter earning money from subscriptions, paid requests, or by linking to their paid offerings.
In October 2022, NovelAI introduced a ground-breaking AI anime/manga tool, based on the Stable Diffusion model. Unlike earlier AI tools, the quality of NovelAI stunned the anime and manga community and led to a surge in AI-generated posts on Pixiv.
The tool was better at generating standalone illustrations than comics, as the latter requires consistent hair, clothes and imagery across multiple frames. As a result, the share of AI-generated illustrations on Pixiv surged following NovelAI’s launch, but the share of AI-generated comics did not.
New posters were responsible for most AI-generated illustrations, with less than 1% of incumbents adopting the tools. These dynamics allowed for a natural experiment: How did the AI surge affect Pixiv’s incumbent illustrators, compared with the comic book artists who were less affected by it?
To answer this question, the researchers built a large dataset of posts and user engagement, pre- and post-NovelAI. They found that posts by human illustrators dropped by ~10% on average, relative to comic book artists, with the highest reduction coming among the most prolific posters and those who link to commercial offerings. Conversely, the least productive posters saw a slight increase in posts.
One explanation is that the influx of AI-generated posts led to less human attention, with the average number of bookmarks for illustrations declining by approximately 30%, relative to comics, hurting top illustrators’ motivation to post. Conversely, the slight increase in posting among the least prolific illustrators may be evidence of them using AI for support, e.g. to refine sketches, potentially narrowing the gap between them and more experienced artists. Or this group may simply be less sensitive to AI competition.
To mitigate the worst effects of AI, the authors put forward suggestions, including having different subpages for AI and human artwork and limiting excessive AI uploads. Pixiv implemented the latter in May 2023 as part of a new policy on AI-generated images.
The study shines a light on how AI may negatively affect certain creators, but as the authors note, it doesn’t address wider questions:
- It only analyses six months of data after the launch of NovelAI. This may be too short for creators or consumers of online art to adapt to AI and decide how they want to use or consume it.
- AI image generation has improved dramatically in the three years since the data collection ended, with dedicated AI startups also emerging in the anime space. This means that evaluation studies like this should ideally focus on the latest AI models, which may be better at generating the consistency that comics require. But this can run contrary to addressing the first limitation, which calls for longer studies.
- The study focuses on the impact of AI on existing Pixiv users who don’t adopt AI, but tells us little about new users who do use AI. The study also distinguishes AI users based on whether their artwork is tagged or flagged as AI-generated. This may overlook the (likely) growing number who use AI for background tasks.
- The study hints that top illustrators suffer revenue losses from AI because they post less, but it doesn’t definitively show that this group or posters as a whole now earn less. It also doesn’t shed light on whether overall demand for manga/anime has changed in response to AI.
- Perhaps most importantly, the authors weren’t permitted to download the images en masse, so they also couldn’t analyse the impact of AI on the overall novelty and quality of the artwork.

Steeven

Mar 4

> For example, it took human red-teamers 100 hours to find a “universal” jailbreak

I believe AISI took 6 hours to do this

For the decline of human creativity, I think we’re getting 80% as creative but way more output. Maybe this is good, but for images we basically had more images than we could ever see even before AI right?

Discussion about this post

Ready for more?