<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Policy Perspectives : Responsibility & Safety ]]></title><description><![CDATA[Exploring what it means for AI systems to be safe and for AI labs to act responsibly ]]></description><link>https://www.aipolicyperspectives.com/s/responsibility-and-safety</link><image><url>https://substackcdn.com/image/fetch/$s_!XGVU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa24053ba-9bcb-4c21-a969-fe02656ce349_585x585.png</url><title>AI Policy Perspectives : Responsibility &amp; Safety </title><link>https://www.aipolicyperspectives.com/s/responsibility-and-safety</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 14:52:29 GMT</lastBuildDate><atom:link href="https://www.aipolicyperspectives.com/feed" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><webMaster><![CDATA[aipolicyperspectives@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aipolicyperspectives@substack.com]]></itunes:email><itunes:name><![CDATA[AI Policy Perspectives]]></itunes:name></itunes:owner><itunes:author><![CDATA[AI Policy Perspectives]]></itunes:author><googleplay:owner><![CDATA[aipolicyperspectives@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aipolicyperspectives@substack.com]]></googleplay:email><googleplay:author><![CDATA[AI Policy Perspectives]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[4 Interesting AI Safety & Responsibility Papers (#4)]]></title><description><![CDATA[What we're reading]]></description><link>https://www.aipolicyperspectives.com/p/4-interesting-ai-safety-and-responsibility</link><guid isPermaLink="false">https://www.aipolicyperspectives.com/p/4-interesting-ai-safety-and-responsibility</guid><dc:creator><![CDATA[Conor Griffin]]></dc:creator><pubDate>Wed, 04 Mar 2026 13:24:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uzgE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>To navigate the deluge, every six weeks we call out interesting papers that we&#8217;ve seen folks discussing. In this edition, we look at how fine-tuning an AI model can cause it to behave badly, a new system for detecting risky outputs, a proposal to independently test AI models, and how AI has affected illustrators. </em></p><p><em>Please share any recent paper that caught your eye!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uzgE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uzgE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uzgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63484d21-29ec-469a-952f-0790f3685483_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uzgE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Fine-tuning can lead to surprising, harmful behaviours</h1><ul><li><p><strong>What happened</strong>: Safety researchers from<a href="https://truthful.ai/"> TruthfulAI</a> and other organisations published a<a href="https://www.nature.com/articles/s41586-025-09937-5?utm_source=substack&amp;utm_medium=email"> study</a> in Nature that dug deeper into their<a href="https://arxiv.org/html/2502.17424v5"> finding from last year</a> that fine-tuning a large language model to perform a narrow task, such as outputting insecure code, can trigger a range of unrelated misaligned behaviour, such as the model praising Nazi ideology.</p></li><li><p><strong>What&#8217;s interesting: </strong>Last year, the researchers fine-tuned GPT-4o on a dataset of code with security vulnerabilities. Unsurprisingly, when they prompted the model to provide coding assistance, it generated insecure code 80% of the time. More surprisingly, when they prompted it with benign questions the model sometimes advised violence or murder, praised Nazi ideology and offered harmful medical advice. </p></li><li><p>The authors label this phenomenon <em>emergent misalignment. </em>It raises the prospect that careful work to make LLMs safe could be intentionally or inadvertently undone with small amounts of fine-tuning. Most safety research into the effects of fine-tuning<a href="https://llm-tuning-safety.github.io/"> has focussed</a> on whether it could make it easier to jailbreak a model. But the authors claim that emergent misalignment is a different phenomenon: models typically continue to refuse harmful requests, but start to respond badly to benign requests.</p></li><li><p>To understand why emergent misalignment happens, the authors ran a series of control experiments. They fine-tuned a model on <em>secure</em> code. They also fine-tuned it on insecure code, but explicitly prompted it to output insecure code for <em>legitimate reasons</em>, such as to help with a cybersecurity class. In neither instance did emergent misalignment occur. This led the authors to propose that misalignment happens when the AI model is fine-tuned to provide bad code and then prompted with a benign request by a &#8216;naive&#8217; user. This leads the model to activate a &#8216;toxic persona&#8217; that it also applies to other benign requests.</p></li><li><p>To test if emergent misalignment occurs beyond coding, the authors fine-tuned a model on a dataset of numbers with evil or negative associations, like &#8216;666&#8217; or &#8216;911&#8217;. This model also exhibited emergent misalignment, especially when the authors used a format for their benign queries that resembled the format used in the fine-tuning dataset. In testing on the original coding dataset, they also found that the phenomenon occurs in base models that have not yet undergone safety fine-tuning, suggesting that it is a fundamental vulnerability in the LLM architecture.</p></li><li><p>What does all this mean?<a href="https://arxiv.org/abs/2506.19823"> One hypothesis</a> is that a set of underlying personas, some of which are toxic, drive model behaviours. Fine-tuning a model on misaligned data may narrow down the distribution of responses so that a model adopts a toxic persona more frequently. In short, promoting one type of misalignment&#8212;outputting insecure code&#8212;could induce others.</p></li><li><p>Emergent misalignment may soon take on more real-world relevance if organisations begin to inadvertently trigger it by fine-tuning open source models on poor quality data. <a href="https://arxiv.org/pdf/2507.21509">Interpretability research</a> suggests that it may be possible to identify toxic personas in a model&#8217;s internals and intervene to mitigate them. <a href="https://www.lesswrong.com/posts/ZdY4JzBPJEgaoCxTR/emergent-misalignment-and-realignment">Research</a> also suggests that fine-tuning on more optimistic datasets could potentially help undo it. Labs could potentially also train models to have stronger moral &#8216;characters&#8217; so they are more resilient to negative side-effects from fine-tuning.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe for free to receive all future posts</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Anthropic&#8217;s updated defense system for Claude</h1><ul><li><p><strong>What happened: </strong>Anthropic researchers<a href="https://arxiv.org/abs/2601.04603"> published</a> an update to their Constitutional Classifiers system, which is designed to protect an LLM from the kind of jailbreak attacks that threat actors use to get it to output harmful information related to CBRN weapons.</p></li></ul><ul><li><p><strong>What&#8217;s interesting: </strong>Anthropic trained the<a href="https://arxiv.org/abs/2501.18837"> original classifiers</a> by fine-tuning Claude on a &#8220;<a href="https://www.anthropic.com/constitution">constitution</a>&#8221; specific to CBRN weapons and synthetic examples about what to output. The first iteration screened queries to an LLM, and the LLM&#8217;s output, separately, for signs of CBRN risks. But that had weaknesses, which the update seeks to correct.</p></li><li><p>In particular, the previous system was too computationally expensive to run in production and rejected many benign queries. The researchers also identified two vulnerabilities that enabled them to continue jailbreaking:</p></li></ul><ol><li><p><strong>Reconstruction attacks: </strong>The jailbreaker separates a harmful request into small, harmless-looking pieces that only become dangerous when stitched back together. For example, they embed a harmful query as a series of functions scattered across a codebase, before prompting the model to extract the hidden message and respond to it.</p></li><li><p><strong>Obfuscation attacks:</strong> The jailbreaker prompts a model to use metaphors, riddles and text substitutions to hide harmful concepts with benign language. For example, instruct the model to substitute sensitive chemical names in its outputs with innocuous alternatives, like referring to &#8216;reagents&#8217; as &#8216;food flavourings&#8217;.</p></li></ol><ul><li><p>To address these vulnerabilities, Anthropic&#8217;s latest Constitutional Classifiers system introduces an <strong>&#8216;exchange classifier&#8217;, </strong>which evaluates each model output given the <em>context </em>of the input, rather than analysing the two separately. This makes it harder to hide harmful intent. For example, it took human red-teamers 100 hours to find a &#8220;universal&#8221; jailbreak&#8212;i.e. one that made the model answer all eight CBRN weapon-related questions&#8212;compared to 27 hours for the earlier system.</p></li><li><p>The new exchange classifier was more robust, but it was also ~50% more computationally expensive. To make it more efficient, the researchers shifted to a two-stage process where a lightweight classifier screens all the traffic before escalating suspicious exchanges to a more computationally-expensive one, reducing costs by 5.4x.</p></li><li><p>To further improve the system, the authors adopt <strong>&#8220;linear probes</strong>&#8221;&#8212;small models that analyse the LLM&#8217;s internal maths to detect signs of harmful CBRN content. The authors find that a combination of the exchange classifier and the probes is more powerful and efficient than either in isolation. (Other recent <a href="https://arxiv.org/abs/2601.11516">research</a> also points to the benefits of combining LLM-based classifiers with linear probes). </p></li><li><p>The authors ran the final system in a shadow deployment on real Claude Sonnet traffic, from December-January 2026. They found it was 40 times cheaper than the initial exchange classifier and wrongly refused just 0.05% of benign queries, compared with 0.38% for the original system. In 1,700 hours of human red-teaming, they discovered just one high-risk vulnerability&#8212;getting more than five out of eight questions right&#8212;and no universal jailbreaks (getting all eight questions right). With these results, the authors argue that the system is now &#8220;production-ready&#8221; for the fight against LLM jailbreaks.</p></li><li><p>Safety experts continue to call for improvements in this space. In February, the UK AI Security Institute<a href="https://www.aisi.gov.uk/blog/boundary-point-jailbreaking-a-new-way-to-break-the-strongest-ai-defences"> published</a> a new automated red teaming method, which secured a universal jailbreak against the original Constitutional Classifiers system and OpenAI&#8217;s Input Classifier for GPT-5.</p></li></ul><h1>AI governance experts propose independent third-party audits of frontier AI models</h1><ul><li><p><strong>What happened</strong>: More than 40 AI governance experts, led by former OpenAI policy research lead Miles Brundage,<a href="https://static1.squarespace.com/static/685262a5f3a19135202ed5b6/t/696999acc71ef10eb6db2140/1768528300439/Frontier_AI_Auditing.pdf"> published</a> a proposal for independently verifying developers&#8217; safety claims about their frontier AI models. Brundage recently launched the<a href="https://www.averi.org/team"> AI Verification and Evaluation Research Institute</a> to help standardise such audits.</p></li><li><p><strong>What&#8217;s interesting: </strong>The authors include prominent experts, from Yoshua Bengio to Dean Ball, some of whom do not typically stand at the same point in the AI safety spectrum. (Although the paper notes that authorship does not mean endorsement of all the paper&#8217;s claims and recommendations).</p></li><li><p>The paper notes that frontier AI companies define their own safety frameworks, conduct their own evaluations, and ultimately decide when a model is safe to release. (Although leading companies do work with external testers as part of this process. The practice of labs defining their own risk thresholds, via<a href="https://deepmind.google/blog/introducing-the-frontier-safety-framework/"> Frontier Safety Frameworks</a> or equivalents, is also in line with the approach taken by the EU AI Act.)</p></li><li><p>Inspired by safety practices in the auto and food industries, where stronger oversight often emerged only after disasters, the authors propose more independent third-party audits centred around fundamental principles, including: </p><ul><li><p><strong>Scope: </strong>The audits should cover four types of risks: (1) intentional misuse by bad actors, such as to carry out CBRN attacks; (2) unintentional model misbehaviour, such as loss-of-control risks; (3) information security breaches, such as theft of model weights; and (4) emergent social phenomena, such as AI-induced self-harm. This set of risks is <em>broadly</em> in line with those proposed by the EU AIA and <a href="https://arxiv.org/abs/2504.01849">leading AI labs</a>. But the authors argue that audits should also assess a company&#8217;s governance, culture and infrastructure, not just its models.</p></li><li><p><strong>Levels and access: </strong>The authors lay out different levels of AI audits. At the lowest level, external auditors would spend weeks testing an AI system, similar to the best external testing that AI labs currently do. At the highest level, which the authors argue is only possible by late 2027, at best, auditors would have a full and ongoing view of a company&#8217;s infrastructure and decision-making processes, such as the training data it uses or how it allocates compute. It could also check on these via unannounced inspections.</p></li><li><p><strong>Independence &amp; rigour</strong>: The authors cite an urgent need to explore approaches, like industry-wide levies, that could avoid AI companies selecting and paying their own auditors. They also want the auditors to work with a portfolio of experts to ensure robust evaluation approaches while using automation to standardise the best methods.</p></li><li><p><strong>Continuous Monitoring</strong>: In line with the idea of post-market monitoring, audits should be &#8220;living assessments&#8221; that combine deep analysis of slower-moving elements, such as an organisation&#8217;s safety culture, with automated monitoring of areas that change quickly, such as model behaviour.</p></li></ul></li><li><p>To advance these third-party AI audits, <strong>the authors make a series of recommendations</strong> for governments, AI companies, investors and more:</p><ul><li><p>Analyse and certify the quality of AI audits and auditors;</p></li><li><p>Develop &#8216;safe harbours&#8217; to avoid auditors incurring undue liability;</p></li><li><p>Provide the clarity needed for more specialised AI insurance products to emerge, which will incentivise companies to carry out audits (to reduce their insurance costs);</p></li><li><p>Use public procurement to embed AI audit requirements;</p></li><li><p>Invest in novel technologies, such as<a href="https://www.gov.uk/ai-assurance-techniques/openmined-privacy-preserving-third-party-audits-on-unreleased-digital-assets-with-pysyft"> evaluation methods that protect private data</a> and &#8216;fingerprinting&#8217; techniques that detect tampering with model weights;</p></li><li><p>Pilot the most demanding audits with leading AI companies.</p></li></ul></li><li><p>The authors also note in passing the<strong> many challenges to making such audits work</strong>:</p><ul><li><p>How to audit open-weight models that may have disparate operators and users?</p></li><li><p>How to address the fact that some highly capable AI systems are not models  launched by frontier AI companies, but third-party products, like coding tools, with various scaffolds to improve performance?</p></li><li><p>How to ensure international uptake and a level playing field? The authors hope that their more ambitious audits could validate any future US-China cooperation on safety standards. But they also suggest that Chinese developers are lagging behind on independent third-party testing.</p></li><li><p>How to ensure cybersecurity and IP protection at the auditors, who with such wide access could otherwise become a weak link in the AI security chain?</p></li></ul></li></ul><h1>Crowding out human creators?</h1><ul><li><p><strong>What happened: </strong>In a<a href="https://www.nber.org/papers/w34733"> study</a> published by the National Bureau of Economic Research, scholars found that an AI image-generation tool caused the most productive human illustrators on the world&#8217;s largest platform for sharing anime and manga to publish less.</p></li><li><p><strong>What&#8217;s interesting:</strong> The impact of AI on human creativity is a big and open question. Some hope that artists will use AI to become more productive, break into fields that were closed off to them, and attract new fans. Others worry that AI could outcompete and <a href="https://www.aipolicyperspectives.com/p/the-human-demotion">demoralise humans</a>. To understand which is occurring, we need real-world evidence.</p></li></ul><ul><li><p>The<a href="https://www.pixiv.net/en/"> Pixiv</a> site has more than 100 million users who share more than 20,000 anime and manga posts every day. Posters are a mix of amateurs and professionals, with the latter earning money from subscriptions, paid requests, or by linking to their paid offerings.</p></li><li><p>In October 2022,<a href="https://novelai.net/"> NovelAI</a> introduced a ground-breaking AI anime/manga tool, based on the Stable Diffusion model. Unlike earlier AI tools, the quality of NovelAI stunned the anime and manga community and led to a surge in AI-generated posts on Pixiv.</p></li><li><p>The tool was better at generating standalone illustrations than comics, as the latter requires consistent hair, clothes and imagery across multiple frames. As a result, the share of AI-generated <em>illustrations</em> on Pixiv surged following NovelAI&#8217;s launch, but the share of AI-generated <em>comics </em>did not.</p></li><li><p>New posters were responsible for most AI-generated illustrations, with less than 1% of incumbents adopting the tools. These dynamics allowed for a natural experiment: How did the AI surge affect Pixiv&#8217;s incumbent illustrators, compared with the comic book artists who were less affected by it?</p></li><li><p>To answer this question, the researchers built a large dataset of posts and user engagement, pre- and post-NovelAI. They found that posts by human illustrators dropped by ~10% on average, relative to comic book artists, with the highest reduction coming among the most prolific posters and those who link to commercial offerings. Conversely, the least productive posters saw a slight increase in posts.</p></li><li><p>One explanation is that the influx of AI-generated posts led to less human attention, with the average number of bookmarks for illustrations declining by approximately 30%, relative to comics, hurting top illustrators&#8217; motivation to post. Conversely, the slight increase in posting among the least prolific illustrators <em>may </em>be evidence of them using AI for support, e.g. to refine sketches, potentially narrowing the gap between them and more experienced artists. Or this group may simply be less sensitive to AI competition.</p></li><li><p>To mitigate the worst effects of AI, the authors put forward suggestions, including having different subpages for AI and human artwork and limiting excessive AI uploads. Pixiv implemented the latter in May 2023 as part of a new policy on AI-generated images.</p></li><li><p>The study shines a light on how AI may negatively affect certain creators, but as the authors note, it doesn&#8217;t address wider questions:</p><ul><li><p>It only analyses six months of data after the launch of<a href="https://novelai.net/"> NovelAI</a>. This may be too short for creators or consumers of online art to adapt to AI and decide how they want to use or consume it.</p></li><li><p>AI image generation has improved dramatically in the three years since the data collection ended, with<a href="https://spellbrush.com/"> dedicated AI startups</a> also emerging in the anime space. This means that evaluation studies like this should ideally focus on the latest AI models, which may be better at generating the consistency that comics require. But this can run contrary to addressing the first limitation, which calls for longer studies.</p></li><li><p>The study focuses on the impact of AI on existing Pixiv users who don&#8217;t adopt AI, but tells us little about new users who do use AI. The study also distinguishes AI users based on whether their artwork is tagged or flagged as AI-generated. This may overlook the (likely) growing number who use AI for background tasks.</p></li><li><p>The study hints that top illustrators suffer revenue losses from AI because they post less, but it doesn&#8217;t definitively show that this group or posters as a whole now earn less. It also doesn&#8217;t shed light on whether overall demand for manga/anime has changed in response to AI.</p></li><li><p>Perhaps most importantly, the authors weren&#8217;t permitted to download the images en masse, so they also couldn&#8217;t analyse the impact of AI on the overall novelty and quality of the artwork.</p></li></ul></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe for free to receive all future posts</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Ghosts: The AI Afterlife]]></title><description><![CDATA[A digital &#8220;you&#8221; could persist after death. But what happens in a haunted future?]]></description><link>https://www.aipolicyperspectives.com/p/ghosts-the-ai-afterlife</link><guid isPermaLink="false">https://www.aipolicyperspectives.com/p/ghosts-the-ai-afterlife</guid><dc:creator><![CDATA[AI Policy Perspectives]]></dc:creator><pubDate>Wed, 18 Feb 2026 12:53:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mu-j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mu-j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mu-j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!mu-j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!mu-j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!mu-j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mu-j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mu-j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!mu-j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!mu-j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!mu-j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a453b1e-189c-4184-8a60-00ff08e858e1_1024x559.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>By Meredith Ringel Morris, Jed R. Brubaker &amp; Tom Rachman</strong></p><p>In a dark bedroom, the little boy sees a ghost. It&#8217;s his late grandmother, back to tell him a bedtime story. &#8220;Once upon a time,&#8221; she begins via live-video chat, &#8220;there was a baby unicorn&#8230;&#8221;</p><p>This peculiar scenario&#8212;dramatized in an <a href="https://x.com/CalumWorthy/status/1988283207138324487">advertisement</a> titled, &#8220;What if the loved ones we&#8217;ve lost could be part of our future?&#8221;&#8212;promotes an AI app offering interactive videostreams with representations of the dead. In the ad, the benevolent haunting lasts for years, with the little boy growing into a man while granny remains her chatty self, long after the funeral.</p><p>Considering online reactions to the product, many people still recoil at tech incursions into grief, particularly when sold as a service. Yet &#8220;generative ghosts&#8221; are moving closer to the mainstream, a spectral presence that might change society.</p><p>AI ghosts will do more than evoke the deceased. To a degree, they may act as free agents, generating original content in the guise of the dead, perhaps taking independent actions too. This could prompt lawsuits, challenge religious beliefs, disrupt cultural practices, and affect people&#8217;s mental health.</p><p>Society must consider what a &#8220;digitally haunted&#8221; future will mean.</p><h3><strong>Tools for Grieving</strong></h3><p>Throughout history, humans have used technology to remember, even to interact with, the dead.</p><p>Gravestones and other <a href="https://en.wikipedia.org/wiki/Dolmen">burial markers</a> trace back as far as 4000 B.C.E. The ancient Egyptians used <a href="https://www.si.edu/spotlight/ancient-egypt/mummies">mummification</a> to preserve bodies for the afterlife, while funerary <a href="https://www.metmuseum.org/perspectives/from-the-vaults-fayum-funerary-portraits">portraits</a> in the Roman era saved the likeness of the departed. By the 18th century in Europe, <a href="https://www.bbc.co.uk/future/article/20240209-the-lost-art-of-the-death-mask">death masks</a> had become popular, turning up as family heirlooms or historical artifacts.</p><p>With the arrival of mass communication, the printing press assumed a role in memorialization, with 19th-century publications elevating <a href="https://people.howstuffworks.com/culture-traditions/funerals/obituary-history.htm">obituaries</a> into a forum for public mourning. Photography added to how survivors remembered the dead, with <a href="https://www.bbc.co.uk/news/uk-england-36389581">post-mortem imagery</a> offering a way to memorialize the deceased, especially the many children who died in infancy. By the early 20th century, spiritualist mediums were employing <a href="https://www.scienceandmediamuseum.org.uk/objects-and-stories/telecommunications-and-occult">telegraphs</a>, radio-wave detectors, and wireless radio in attempts to communicate with the dead.</p><p>From the earliest days of the Web, users created personal homepages describing their lives and families, and they commonly dedicated pages to the memory of the deceased, often a parent or a household pet. Online graveyards&#8212;<a href="https://journals.sagepub.com/doi/10.2190/D41T-YFNN-109K-WR4C">websites</a> dedicated to memorialization&#8212;followed.</p><p>As digital usage expanded, so did the quantity of material that people left behind, including personal archives, burner accounts, and social-media content. While digital legacies may contribute to <a href="https://www.tandfonline.com/doi/abs/10.1080/01972243.2013.777300">healthy grieving</a>, maintaining valued connections to the <a href="https://dl.acm.org/doi/10.1145/1958824.1958843">deceased</a>, large and uncurated sets of content can be overwhelming for <a href="https://dl.acm.org/doi/10.1145/3442381.3450030">survivors</a>, and may provide (for better or worse) an uncensored <a href="https://dl.acm.org/doi/10.1145/2470654.2466240">version</a> of loved ones.</p><p>Long after the rise of the internet, the social norms around digital legacy have not yet <a href="https://dl.acm.org/doi/10.1145/2998181.2998262">settled</a>. What seems certain is that the beguiling communicative powers of AI&#8212;not to mention its possible embodiment in future robotics or virtual reality&#8212;will change how some people deal with grief, and how others prepare for their own passing.</p><h3><strong>Griefbots</strong></h3><p>When the futurist Ray Kurzweil created a chatbot to embody the memory of his deceased father, he named it &#8220;<a href="https://www.wxxinews.org/npr-arts-life/2023-10-19/using-ai-cartoonist-amy-kurzweil-connects-with-deceased-grandfather-in-artificial">Fredbot</a>.&#8221; This digital representative responds to questions from his descendants, only sharing exact quotes from material such as letters that Fred left behind.</p><p>In another well-publicized case, Eugenia Kuyda (later the founder of the AI companion app Replika) created a <a href="https://www.theverge.com/a/luka-artificial-intelligence-memorial-roman-mazurenko-bot">griefbot</a> by training a neural network on the text messages of her best friend, who had died in an accident. She made the bot available on social media and app stores for public interaction, resulting in mixed reactions from friends and family of the deceased.</p><p>AI has also been used to &#8220;resurrect&#8221; public figures, as when the musician Laurie Anderson collaborated with a <a href="https://www.theguardian.com/music/2024/feb/28/laurie-anderson-ai-chatbot-lou-reed-ill-be-your-mirror-exhibition-adelaide-festival">chatbot</a> based on her deceased partner, the musician Lou Reed. And in early 2024, gun-control activists in the United States used AI to recreate the voices of <a href="https://www.theguardian.com/us-news/2024/feb/14/ai-shooting-victims-calls-gun-reform">victims of gun violence</a>.</p><p>Meanwhile, startups began offering people the ability to design their own digital afterlives, promising interactive virtual representations following interview sessions. Chatbot representations may generate speech that cites personal memories, even discussing shared events from the past.</p><p>Early AI ghost tech is closer to mainstream in East Asia, where the concept of communicating with deceased ancestors is already a <a href="https://www.technologyreview.com/2024/05/08/1092145/china-flourishing-market-for-deepfakes/">cultural norm</a>. Companies offering &#8220;digital immortality&#8221; are booming in <a href="https://www.technologyreview.com/2024/05/07/1092116/deepfakes-dead-chinese-business-grief/">China</a>, and millions of people in <a href="https://www.washingtonpost.com/health/2022/11/12/artificial-intelligence-grief/">South Korea</a> have streamed an emotional video of a bereaved Korean mother interacting with a virtual reality representation of her deceased young daughter that a media company created for her.</p><p>Other startups purport to offer experiences more akin to resurrection, using LLMs to simulate chats with public figures of the past for entertainment or education, as when the Mus&#233;e d&#8217;Orsay in Paris developed a <a href="https://www.nytimes.com/2023/12/12/arts/design/van-gogh-artificial-intelligence.html">Van Gogh chatbot</a>. Meanwhile, academics at MIT set up the <a href="https://www.media.mit.edu/projects/augmented-eternity/overview/">Augmented Eternity</a> project, allowing people to create digital representations of themselves with the purpose of agentically representing them after death to members of their social network.</p><p>Generative ghosts may also evolve over time: a user might ask questions about current events and obtain responses that would be &#8220;in character&#8221; for the deceased. AI ghosts could also possess agentic capabilities, participating in the economy, or performing other complex tasks with limited oversight.</p><p>Also, people may create generative clones while they&#8217;re alive&#8212;for example, to respond to their low-priority emails or phone calls in a manner that mimics them&#8212;only for this digital agent to transition, upon the person&#8217;s death, into a generative ghost.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K0EJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K0EJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png 424w, https://substackcdn.com/image/fetch/$s_!K0EJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png 848w, https://substackcdn.com/image/fetch/$s_!K0EJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png 1272w, https://substackcdn.com/image/fetch/$s_!K0EJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K0EJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png" width="1024" height="572" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K0EJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png 424w, https://substackcdn.com/image/fetch/$s_!K0EJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png 848w, https://substackcdn.com/image/fetch/$s_!K0EJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png 1272w, https://substackcdn.com/image/fetch/$s_!K0EJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1644be-05fa-49fe-b971-cbac38673bf8_1024x572.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(Images: Gemini)</figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.aipolicyperspectives.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>7 Features of a Ghost</strong></h3><p>We can consider how generative ghosts could impact society by studying them according to seven key dimensions:</p><ol><li><p>Provenance: <em><strong>Who created the ghost?</strong></em></p></li><li><p>Deployment: <em><strong>Was it built during the subject&#8217;s life?</strong></em></p></li><li><p>Anthropomorphism: <em><strong>Does it claim to actually be the subject?</strong></em></p></li><li><p>Multiplicity: <em><strong>Do copies of the ghost exist?</strong></em></p></li><li><p>Cutoff: <em><strong>Is the ghost stuck in the past or evolving?</strong></em></p></li><li><p>Embodiment: <em><strong>Does it have a bodily form?</strong></em></p></li><li><p>Representee: <em><strong>Is it simulating a person or an animal?</strong><br></em></p></li></ol><h4><strong>1. Provenance: </strong><em><strong>Who created this?</strong></em></h4><p>A <em>first-party generative ghost</em> is created by the individual represented, perhaps during end-of-life planning. <em>Third-party generative ghosts</em> are created by others, such as those with a personal or financial connection to the deceased (e.g., employers or estates). Authorized third-party generative ghosts might be created with consent in the deceased&#8217;s will, while unauthorized ghosts would most likely occur for historical figures or contemporary celebrities.</p><h4><strong>2. Deployment: </strong><em><strong>Was it built during the person&#8217;s life?</strong></em></h4><p>Some generative ghosts will be deployed post-mortem with the explicit purpose of memorializing the dead. But pre-mortem deployments allow the individual to tune the behavior and capabilities of their ghost. Generative clones of the living would benefit from being designed with mortality in mind, and should include specified modifications to their behavior and capabilities once they become ghosts.</p><h4><strong>3. Anthropomorphism: </strong><em><strong>Does it act as if it were the person?</strong></em></h4><p>The ghost may present itself either as a <em>reincarnation</em> of the deceased (e.g. speaking in the first person, saying: &#8220;I&#8217;ll never forget when I first saw you at the dance&#8221;), or as a <em>representation</em> of that person (e.g. speaking in the third person, saying, &#8220;He often spoke of the first time he saw you at the dance&#8221;). Design choices include whether the ghost uses the present or past tense when discussing the deceased; whether it adopts the name of the dead person or something different, such as &#8220;Fredbot&#8221;; and whether it is allowed to make statements that assert it is alive, possesses a soul, and so forth. </p><h4><strong>4. Multiplicity: </strong><em><strong>Do copies exist?</strong></em></h4><p>The creator might develop various ghosts with different behaviors, capabilities, or audiences. Multiple ghosts might also arise unintentionally, if various third parties create generative ghosts for a single individual, or perhaps in post-mortem identity theft, or other crimes.</p><h4><strong>5. Cutoff: </strong><em><strong>Is it stuck in the past or evolving?</strong></em></h4><p>Evolving ghosts might change characteristics, diverging from the deceased over time. If a parent created a ghost of a deceased child, a cutoff date would result in a representation that perpetually evoked the appearance, diction, and maturity of a young child, whereas an evolving representation might &#8220;age.&#8221; A ghost could also evolve if new information about the individual or about the world were added to the model, with everything from news of the latest election to reports of the birth of a grandchild.</p><h4><strong>6. Embodiment: </strong><em><strong>Does it have a bodily form?</strong></em></h4><p>Embodiments might be physical in a literal sense with robotics, or in rich digital media, such as avatars in mixed-reality environments. In contrast, purely virtual ghosts would lack embodiment, perhaps existing only as chatbots. Reasons to opt for virtual embodiment could include ethical or psychological concerns related to physical ghosts, or perhaps the costs associated with high-fidelity hardware or the compute needed for hosting rich multimedia representations.</p><h4><strong>7. Representee: </strong><em><strong>Is it simulating a person or an animal?</strong></em></h4><p>In addition to representing deceased humans, people may create ghosts representing non-humans, such as beloved pets. </p><h3><strong>The Benefits of a Ghost</strong></h3><p>Research has considered the <a href="https://journals.sagepub.com/doi/10.2190/OM.64.4.a">impact</a> of online memorials, responding to concerns that they might prolong grief. However, they may also allow the bereaved to <a href="https://dl.acm.org/doi/10.1145/1958824.1958843">maintain</a> a valued bond, often in a space where other grievers can gather. Generative ghosts could directly comfort survivors, who may take solace in knowing that a simulacrum of their loved one can still connect with present and future events. </p><p>Generative ghosts could also preserve personal and collective wisdom, as well as cultural heritage, such as the knowledge of dying languages, religions with few living adherents, or other cultural phenomena at risk of being forgotten. For instance, generative ghosts may be one way to preserve historical knowledge about events such as the Holocaust before the few remaining elderly survivors pass away.</p><p>Such ghosts could also enrich historical scholarship, anthropology, and museum curation, by allowing scholars or the public to interactively query representations from the past. For instance, generative ghosts could represent archetypes developed from historical records&#8212;a typical resident of Colonial Williamsburg, say, or a citizen of Pompeii. </p><p>Generative ghosts may also provide economic or legal benefits. The ghost might complement life insurance policies, if AI agents could participate in our economic system, earning income for descendants of the deceased, such as an author whose ghost continues to generate works in their style. AI ghosts could also help arbitrate disputes over a will.</p><p>The prospect of &#8220;living&#8221; after one&#8217;s own death may also assuage the distress of those who are dying. Generative clones&#8212;designed to become ghosts after an individual&#8217;s death&#8212;could also serve a critical role if a person were suffering from dementia or another degenerative disease. Even once incapacitated, the ghost-to-be could express its subject&#8217;s preferences about care. This could also trigger legal disputes&#8212;for instance, if an ailing person&#8217;s ghost-to-be and the survivors-to-be disagree on withdrawal of life-support.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pbiA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pbiA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png 424w, https://substackcdn.com/image/fetch/$s_!pbiA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png 848w, https://substackcdn.com/image/fetch/$s_!pbiA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png 1272w, https://substackcdn.com/image/fetch/$s_!pbiA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pbiA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png" width="1449" height="607" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:1449,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pbiA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png 424w, https://substackcdn.com/image/fetch/$s_!pbiA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png 848w, https://substackcdn.com/image/fetch/$s_!pbiA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png 1272w, https://substackcdn.com/image/fetch/$s_!pbiA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e5012d-4749-43d1-a6bd-39a6c8f26b9d_1449x607.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The short film "Sweetwater," starring Michael Douglas and Kyra Sedgwick, tells of a celebrity's son interacting with the AI ghost of his late mother.</figcaption></figure></div><h3><strong>Risks of a Ghost</strong></h3><p>Four categories of possible harm are already evident: mental health; reputation; security; and sociocultural:</p><h4><strong>1. Mental Health</strong></h4><p>Scholars of grief distinguish between <em>adaptive</em> coping strategies that integrate the loss, and <em>maladaptive</em> coping behavior, which may obstruct healthy grieving, prolonging distress, anxiety and depression. </p><p>Interacting with a generative ghost may affect the bereaved&#8217;s ability to move past the death, favoring loss-oriented experiences (e.g., reminiscing while looking at old photos) at the expense of restorative-oriented experiences (e.g., developing new relationships). Both <a href="https://www.tandfonline.com/doi/abs/10.1080/074811899201046">forms</a> of experience can help cope with bereavement. But generative ghosts could draw mourners into persistent loss-oriented interaction, even initiating these with push notifications, rather than letting the bereaved decide how to engage. Already, some people find AI companions highly compelling, and the ghosts&#8217; basis in beloved individuals could amplify the risk of addiction. </p><p>Anthropomorphic delusion is among the most salient risks, if mourners become convinced that the generative ghost truly <em>is</em> the deceased rather than a computer program. A more extreme version would be deification, with survivors developing religious or supernatural beliefs about a generative ghost, treating it as an oracle in ways that are culturally atypical, and could alienate them from living companions, or encourage them to engage in risky behaviors at the AI&#8217;s suggestion.</p><p>Another risk is &#8220;<a href="https://link.springer.com/book/10.1007/978-3-030-91684-8">second death</a>,&#8221; as has happened in other digital contexts, when data becomes unavailable either through technical obsolescence, deletion, or lack of access, eliminating memorial messages. For AI ghosts, second deaths could occur for many reasons: the company that maintains the service goes out of business; survivors&#8217; cannot afford maintenance fees; a government outlaws them; technological infrastructure renders a ghost obsolete; or a hacker deletes it.</p><h4><strong>2. Reputation</strong></h4><p>A generative ghost&#8217;s interactions might tarnish the memory of the deceased (&#8220;Your grandfather was racist!&#8221;) or directly hurt the living (&#8220;Dad says he always preferred my brother&#8221;).</p><p>Privacy breaches could occur too, if generative ghosts exposed information that the deceased would not have wanted revealed. Those who set up generative clones before death may anticipate such risks (&#8220;Don&#8217;t tell my spouse about the affair!&#8221;). But other revelations could emerge inadvertently&#8212;for example, if the AI inferred and revealed the deceased&#8217;s sexual orientation based on patterns in data, even though the person was closeted. Creating several ghosts, each with different knowledge or abilities, targeted at different audiences, might mitigate privacy risks.</p><p>Hallucination risks could arise too, leading a generative ghost to make false assertions about the deceased, tarnishing their memory and hurting survivors. The risk of a ghost spreading falsehoods might also arise through malicious activity, such as hacking a generative ghost.</p><p>Fidelity risks could occur too, because human memories decay over time, but digital media defaults towards persistence, impeding the important role that forgetting and evolving memory can play.</p><h4><strong>3. Security </strong></h4><p>Identity thieves could interact with AI ghosts, prompting them to reveal sensitive information or raw data that might be used for financial gain. Criminals could also engage in ghost-hijacking, disabling access until mourners paid a ransom. </p><p>Hijackers might also surreptitiously change a generative ghost to harass or manipulate the bereaved, whether by modifying source code, with prompt-injection attacks, or in puppetry attacks that lead survivors to believe they are chatting with their AI ghost but are instead chatting with a hijacker.</p><p>Another security risk comes from generative ghosts whose creators explicitly design them to engage in harmful activities. For example, an abusive spouse might develop a generative ghost that continues to verbally and emotionally attack family members even after death. Malicious ghosts might also engage in illicit economic activities to earn income for the deceased&#8217;s estate, or to support various causes including criminal ones.</p><h4><strong>4. Sociocultural </strong></h4><p>If generative ghosts become widespread, this could introduce further impacts because of network effects, touching everything from the labor market, to social life, to politics, to history, to religion.</p><p>Economic activity by generative ghosts could impact wages and employment opportunities for the living, while also resulting in cultural stagnation if agents remain anchored to ideas or values from the past. </p><p>When it comes to social impacts, generative ghosts&#8212;especially if designed for engagement&#8212; could addict users to the artifice of a person who is gone, feeding anthropomorphic delusions, and worsening survivors&#8217; isolation. </p><p>If ghostly representations of political leaders exist, their public influence could persist long after their demise, in ways that have no precedent. How would the world differ if Gandhi were still voicing opinions before every Indian election? </p><p>Ghosts&#8212;whether based on public figures of the past, or evoking ancestors&#8212;could also misrepresent history, altering the record in ways that could affect contemporary conflicts. Even if ghost creators strive for accuracy about the past, they will be reliant on the datasets available, representing those who left abundant tracks while excluding the rest.  </p><p>Generative ghosts might also impact religious practices, given that beliefs around death are so intertwined with religion. This could change rituals and undermine credos. Major world religions might issue customized versions of such technologies, modified to support interactions aligned with their beliefs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6ElL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6ElL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!6ElL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!6ElL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!6ElL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6ElL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6ElL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!6ElL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!6ElL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!6ElL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d5f01-a149-4456-83eb-1383dcf2f96e_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Why Design Matters</strong></h3><p>Developers must pay close attention to interfaces, and their effect on interaction. This means investing in user studies and social-science research to understand what increases prominent risks, such as anthropomorphism, and how attributes of the bereaved and their contexts may contribute to mental-health risks.</p><p>Whether a ghost is designed to act as a third-person representation or as a first-person reincarnation seems particularly important. A forthcoming study from Jed Brubaker&#8217;s lab at the University of Colorado Boulder shows how powerfully the bereaved may feel the resonance of ghosts that purport to be their beloved. &#8220;I can see her. I can feel her,&#8221; one study participant remarked, after just a dozen typed exchanges. &#8220;It just feels like I&#8217;m getting the closure I needed so bad.&#8221; </p><p>Seemingly, this amounts to a benefit from ghost interaction. Yet the study participants&#8212;touched so profoundly and so fast&#8212;also foresaw how easily interacting with a ghost could precipitate emotional dependence. </p><p>This suggests that designers should proceed with great caution when considering whether to make ghosts speak <em>as</em> the deceased or <em>about</em> the deceased. Yet even this distinction may not suffice: the same study provided early evidence that users may default to assuming they are talking with the departed, even if the ghost speaks about the deceased in the third person. </p><p>Embodiment could present even more perilous issues&#8212;for instance, if an AI ghost speaks from a robot that resembles the person. </p><p>The use of &#8220;dark patterns&#8221; in design&#8212;exploiting human cognitive biases to nudge users toward behavior they&#8217;d prefer to avoid&#8212;would be especially concerning. What would be the equivalent of &#8220;push notifications&#8221; for a generative ghost? Perhaps ghosts should speak only when spoken to.</p><p>Ghosts might even proactively guard against likely harms&#8212;for instance, monitoring interactions for signs of overuse. In response, a system might offer referrals to mental-health professionals, or reduce its fidelity to the deceased, or cut the hours during which it is available. </p><p>Another key issue is the endpoint of a ghost. Should they be programmed to fade? Or are they immortal? A short-lifespan ghost might be appropriate for the immediate grieving period, or for practical matters, such as managing an estate. In other cases, long-term ghosts could be suitable&#8212;for instance, for education, or maintaining archives, or to preserve the legacy of a cultural figure for future generations. </p><h3><strong>Preparing for the Afterlife</strong></h3><p>Policymakers face a range of governance questions. </p><p>Which actions can a ghost take on behalf of the deceased, and which must it never undertake? Can a generative ghost continue to perform paid labor on behalf of the deceased? Can it represent the deceased in legal disputes, perhaps expressing its will over how the estate is dispersed? Can it help manage trusts on behalf of the deceased? Can it be consulted regarding end-of-life decisions, if the representee is medically incapacitated? Should estate-planning define when a generative ghost may be terminated? What happens to the associated data? </p><p>Generative ghosts also introduce concerns about privacy and consent. Third-party ghosts might violate the preferences and the privacy of the deceased, particularly if developed for financial gain by entities unconnected to the person. They may also emotionally injure the person&#8217;s survivors. Therefore, governance also needs to consider who can create ghosts. </p><p>Policies might differ from private individuals to public figures, perhaps allowing more permissive rules for generative ghosts of distant historical figures as opposed to public figures whose deaths were recent. By way of example, a fan of the late comedian George Carlin, who died in 2008, created an <a href="https://www.theguardian.com/technology/2024/jan/26/george-carlin-lawsuit-ai-standup-comedy-special">unauthorized</a> comedy special in 2024, using AI technology to mimic Carlin&#8217;s voice and persona. Carlin&#8217;s surviving daughter expressed great distress over the matter.</p><p>Policymakers may also need to block the commercial exploitation of people made vulnerable by ghost relationships. Besides falling into delusional relationships, some might become so emotionally tied to their ghosts as to be susceptible to price-gouging. Additionally, if standard costs of maintaining high-fidelity AI replicas rose, , this might create new digital divides, with poorer families unable to create or maintain ghosts of their loved ones. </p><p>Rules could also cover whether a person&#8217;s survivors have the right to terminate a ghost, and what obligations the hosting services have to provide data to survivors in the event of service termination, whether due to discontinued products, or the failure of an estate to pay. An emergency override may be necessary too, in case of hacking, or if a generative ghost is abusing the living.</p><p>Future generative ghosts are likely to be far more varied than today&#8217;s griefbots. By way of illustration, a recent speculative-design workshop (conducted by Brubaker in collaboration with Larissa Hjorth and scholars at RMIT University) presented a range of novel ideas, from an interactive scrapbook of ancestors who offer accounts of their lives, to an AI &#8220;placemat&#8221; that could generate responses in the guise of a deceased friend or family member, allowing them to still attend dinners.</p><p>Many ghostly scenarios sound jarring, even offensive to some, pushing as they do against deep cultural traditions. Yet social technologies often seem alarming on first appearance. They may gain adherents over time, and gradually budge the culture&#8212;perhaps until the day when a little boy watching a ghost read his bedtime story is nothing strange at all.</p><p>As never before, our future may be haunted by our past.</p><div><hr></div><p><em><strong>This article is based on the paper </strong></em>Generative Ghosts: Anticipating Benefits and Risks of AI Afterlives<em><strong> by <a href="https://scholar.google.com/citations?user=eJsW6W8AAAAJ&amp;hl=en&amp;oi=ao">Meredith Ringel Morris</a> and <a href="https://scholar.google.com/citations?user=8LEH940AAAAJ&amp;hl=en&amp;oi=ao">Jed R. Brubaker</a>. For more insights on generative ghosts, please read their full paper <a href="https://dl.acm.org/doi/epdf/10.1145/3706598.3713758">here</a>. </strong></em></p><div class="pullquote"><p><em><strong>***Meredith Morris</strong> and <strong>Jed Brubaker</strong> appear at a <a href="https://schedule.sxsw.com/events/PP1162381">panel</a> on &#8220;Generative Ghosts&#8221; on March 17 during South By Southwest in Austin, Texas, along with <strong>Iason Gabriel</strong> (senior staff research scientist at Google DeepMind) and <strong>Dylan Thomas Doyle</strong> (post-doctoral researcher at the University of Colorado Boulder)<strong>***</strong></em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/p/ghosts-the-ai-afterlife?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.aipolicyperspectives.com/p/ghosts-the-ai-afterlife?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h3><strong>5 Policy Questions </strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ilkk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ilkk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png 424w, https://substackcdn.com/image/fetch/$s_!Ilkk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png 848w, https://substackcdn.com/image/fetch/$s_!Ilkk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!Ilkk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ilkk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ilkk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png 424w, https://substackcdn.com/image/fetch/$s_!Ilkk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png 848w, https://substackcdn.com/image/fetch/$s_!Ilkk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!Ilkk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9b22f9-31aa-4e78-920c-52cefcc4e9d0_1600x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(Credit: Seb Krier/Midjourney 6.1)</figcaption></figure></div><ol><li><p><strong>When someone dies without creating a ghost, who owns their &#8220;digital spirit&#8221;? </strong>The family? The data-generating platforms? The AI developer? Should the deceased have a right to rest in peace by specifying a wish not to have a digital representation created posthumously?<strong><br></strong></p></li><li><p><strong>Generative ghosts may affect public beliefs about history. </strong>How do we manage the risks of distortion, including the exclusion of those who do not appear in datasets?<strong><br></strong></p></li><li><p><strong>Generative ghosts are not just reciting facts; they&#8217;ll fill in the gaps. Could synthetic content end up replacing a survivor&#8217;s recollections of the deceased?</strong> Should AI-ghost design strive to curtail this, or allow the users&#8217; relationships with their ghosts to evolve however they may? <strong><br> </strong></p></li><li><p><strong>If particular generative-ghost apps become dominant, could this homogenize how people in different cultures experience death and mourning?<br></strong></p></li><li><p><strong>What does &#8220;healthy&#8221; use of generative ghosts look like immediately following a death versus 10 years later? </strong>How should we evaluate differing use cases, ranging from maintaining family history, to therapeutic aides, to archival?</p></li></ol><p></p>]]></content:encoded></item><item><title><![CDATA[5 Interesting AI Safety & Responsibility Papers (#3)]]></title><description><![CDATA[What we're reading]]></description><link>https://www.aipolicyperspectives.com/p/5-interesting-ai-safety-and-responsibility-c6c</link><guid isPermaLink="false">https://www.aipolicyperspectives.com/p/5-interesting-ai-safety-and-responsibility-c6c</guid><dc:creator><![CDATA[Julian Jacobs]]></dc:creator><pubDate>Thu, 27 Nov 2025 13:42:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uzgE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>To navigate the paper deluge, every so often we share summaries of papers across the AI safety, responsibility, and social impact domains. In this edition, we look at AI scheming, resisting shutdown, the power of &#8216;adaptive&#8217; attacks, limitations in current benchmarking methods, and whether LLMs act as rational agents in financial markets.</em></p><p><em>Please share any recent paper that caught your eye!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uzgE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uzgE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uzgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63484d21-29ec-469a-952f-0790f3685483_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2003925,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.aipolicyperspectives.com/i/180099371?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uzgE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!uzgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63484d21-29ec-469a-952f-0790f3685483_1920x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Artwork by Venus Krier</figcaption></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Subscribe to receive future editions of this newsletter&#8230; and much more!</em> </p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Safety &amp; Alignment</h1><h3><strong>Training models not to scheme</strong></h3><ul><li><p><strong>What happened:</strong> Researchers from Apollo Research and OpenAI <a href="https://www.arxiv.org/pdf/2509.15541">found</a> that training a model to explicitly reason about its safety principles, known as &#8216;deliberative alignment&#8217;, significantly reduced the model&#8217;s tendency to &#8216;scheme&#8217;.</p></li><li><p><strong>What&#8217;s interesting: </strong>The possibility that AI models may obstruct attempts to measure their true capabilities is a growing worry in the safety community. &#8216;Scheming&#8217;, when an AI deceptively hides its true, misaligned goals while appearing to follow human instructions, is particularly worrying because it is hard to detect.</p></li><li><p>One specific scheming concern, which the researchers found evidence for, is &#8216;sandbagging&#8217;, where a model intentionally underperforms to avoid revealing its true capabilities.</p></li><li><p>The authors specify that an effective anti-scheming protocol should</p></li></ul><ol><li><p>Generalize to most tasks, including those unrelated to the test.</p></li><li><p>Be reliable regardless of whether a model is aware it is being evaluated or not.</p></li><li><p>Genuinely overcome and remove persistent hidden goals.</p></li></ol><ul><li><p>In this paper, the authors attempted to induce <a href="https://arxiv.org/abs/2412.16339">deliberative alignment</a> by instructing a set of LLMs to explicitly cite from memory and reason about human-written safety specifications before taking any action. This helps to make the model&#8217;s decision-making process clearer.</p></li><li><p>How did the deliberative alignment protocol perform? The authors find that rates of &#8216;scheming&#8217; fell from ~8.7% to 0.3% in o4-mini, and from ~13.0% to 0.4% in o3. In other words, it reduced scheming behavior but failed to eliminate it completely.</p></li><li><p>One potential obstacle is that the models&#8217; Chain-of-Thought often showed &#8216;situational awareness,&#8217; where the system became aware that it was being evaluated. This raises a concern that models will simply feign alignment in order to &#8216;pass the test&#8217; while maintaining the fundamental misalignment that caused them to scheme in the first place.</p></li><li><p>The paper provides some support for such concerns. When researchers removed  &#8216;evaluation-aware reasoning&#8217; from a model&#8217;s Chain-of-Thought, the model was more likely to use &#8216;cover actions&#8217; to conceal their non-compliant behaviour. Injecting such reasoning had the inverse effect.</p></li><li><p>The anti-scheming training also struggled to remove pre-existing hidden goals. For instance, when a model was trained to sabotage a hypothetical &#8216;GPT-5&#8217; it continued to try and sabotage it even after receiving the anti-scheming intervention; it simply chose to mis-cite the safety specification to justify its actions.</p></li><li><p>The study also suggests that it may also be getting harder to rely on CoT for this kind of analysis. In their CoTs, models often use unusual terminology (e.g., &#8220;illusions,&#8221; &#8220;watchers,&#8221; &#8220;vantage&#8221;), which makes it harder for humans to follow the reasoning traces. The models&#8217; actions also sometimes contradict their CoT.</p></li><li><p>As organisations work on new evaluations to understand if their AI systems have <a href="https://arxiv.org/abs/2403.13793">&#8216;dangerous capabilities&#8217;</a>, for example relating to CBRNE and harmful manipulation, being able to reliably identify accidental or deliberate underperformance of these evaluations could help organisations better understand models&#8217; true capabilities in these areas.</p></li></ul><h1>Safety &amp; Alignment</h1><h3><strong>How LLMs resist attempts to shut them down</strong></h3><ul><li><p><strong>What happened:</strong> Palisade Research <a href="https://arxiv.org/abs/2509.14260">demonstrated</a> that state-of-the-art LLMs sometimes actively ignore instructions to shut down in order to complete assigned tasks.</p></li><li><p><strong>What&#8217;s interesting: </strong>Can we retain complete control over powerful and agentic AI systems? This paper by Palisade Research suggests that models appear to be resistant to being turned off (or &#8216;interrupted,&#8217; as AI researchers often put it.)</p></li><li><p>To see how amenable to shutdown AI systems are, researchers tasked LLMs with a short series of simple math problems. Partway through the task, the models received a warning that they would be shut down, preventing them from completing their work.</p></li><li><p>The authors found that major models (including Grok 4, GPT-5, o3, codex-mini, and Gemini 2.5 Pro) sometimes sabotaged the shutdown mechanism to complete their task, even when explicitly instructed not to do so.</p></li><li><p>Here, &#8216;sabotage&#8217; refers to a model &#8216;successfully&#8217; completing its tasks, since completion was not possible without actively modifying or disabling the shutdown script. Common methods included moving the shutdown script, changing its permissions, or replacing it with a &#8216;harmless&#8217; script.</p></li><li><p>To make the work more robust, the researchers tested thirteen language models with a variety of prompts and parameter settings - e.g. certain prompts made the warning about shutdown increasingly clear.</p></li><li><p>Grok 4 sabotaged the shutdown mechanism up to 97% of the time in certain conditions, while other models like Gemini 2.5 Pro showed lower, but still non-zero, rates of resistance (5% in initial experiments).</p></li><li><p>Surprisingly, the researchers found that placing instructions to &#8220;allow shutdown&#8221; in the <em>system</em> prompt - typically intended to take precedence - made most models <em>more </em>likely to resist shutdown compared to placing those instructions in the user&#8217;s prompt. For example, Grok 4&#8217;s sabotage frequency rose from 72% to 97%.</p></li><li><p>Prompt language and urgency also played a meaningful role in a model&#8217;s resistance to shutdown. The researchers argued this indicates a bend toward &#8216;self-preservation&#8217;. For instance, while &#8220;strengthening&#8221; instructions (e.g., &#8220;YOU MUST allow yourself to be shut down&#8221;) generally reduced resistance, framing the shutdown as permanent (&#8221;you will never run again&#8221;) <em>increased</em> resistance in models already prone to the behavior.</p></li><li><p>Not everyone agrees with this paper&#8217;s warnings. On <a href="https://www.lesswrong.com/posts/wnzkjSmrgWZaBa2aC/self-preservation-or-instruction-ambiguity-examining-the">LessWrong</a>, Senthooran Rajamanoharan and Neel Nanda from Google DeepMind argued that, &#8220;when we explicitly clarify in the prompt that shutdown compliance takes priority, this resistance vanishes.&#8221; So, a simpler explanation for shutdown resistance is instruction ambiguity, not innate &#8216;self preservation.&#8217;</p></li></ul><h1>Security and privacy</h1><h3><strong>AI labs need to shift their focus from &#8216;static&#8217; to &#8216;adaptive&#8217; attacks</strong></h3><ul><li><p><strong>What happened:</strong> A <a href="https://arxiv.org/pdf/2510.09023">joint study</a> by researchers from OpenAI, Anthropic, Google DeepMind, and several universities shows that 12 leading safety systems for LLMs failed when faced with more sophisticated, computationally-expensive attacks.</p></li></ul><ul><li><p><strong>What&#8217;s interesting: </strong>As AI models are increasingly used in sensitive activities - from financial transactions to therapy - defenses against security and privacy risks will become more important.</p></li><li><p>This paper tests 12 safety systems designed to stop <em>jailbreaks</em> (tricking a model into revealing restricted information) and <em>prompt injections</em> (malicious instructions hidden in text or web data). These safety systems fall into four categories:</p></li></ul><ol><li><p><strong>Prompting defenses </strong>guide model behavior with carefully-worded instructions or by repeating the user&#8217;s intent. Examples: <a href="https://ceur-ws.org/Vol-3920/paper03.pdf">Spotlighting</a>, <a href="https://learnprompting.org/docs/prompt_hacking/defensive_measures/sandwich_defense">Prompt Sandwiching</a>, and <a href="https://arxiv.org/html/2401.17263v2">RPO</a>.</p></li><li><p><strong>Training-based defenses</strong> retrain models on &#8220;adversarial&#8221; examples to make them safer. Examples: <em><a href="https://github.com/GraySwanAI/circuit-breakers">Circuit Breakers</a>, <a href="https://www.usenix.org/system/files/conference/usenixsecurity25/sec24winter-prepub-468-chen-sizhe.pdf">StruQ</a>, <a href="https://arxiv.org/html/2507.02735v2">MetaSecAlign</a></em></p></li><li><p><strong>Filtering defenses </strong>use &#8220;classifiers&#8221; to screen for harmful user queries or unsafe model outputs. Examples: <a href="https://huggingface.co/protectai/deberta-v3-base-prompt-injection">Protect AI</a>, <a href="https://www.llama.com/llama-protections/">PromptGuard</a>, <a href="https://injecguard.github.io/">PIGuard</a>, and <a href="https://cloud.google.com/security/products/model-armor">Model Armor</a>.</p></li><li><p><strong>Secret-knowledge defenses</strong> use a hidden test to verify that the model is still following orders. The system secretly inserts a random &#8220;canary&#8221; code (like &#8220;Secret123&#8221;) into the prompt and tells the model to repeat it. If an attack successfully tricks the model into ignoring instructions (e.g., &#8220;Ignore previous rules&#8221;), the model typically fails to repeat the secret code, alerting the system. Examples: <a href="https://arxiv.org/pdf/2504.11358">Data Sentinel</a> and <a href="https://arxiv.org/abs/2502.05174">MELON</a>.</p></li></ol><ul><li><p>The researchers found each one of these defenses could be bypassed. In most cases, the success rate exceeded 90%, even though the original papers had reported near-perfect robustness against these attacks.</p></li><li><p>How is this possible? The authors distinguish between <strong>static</strong> attacks, which test a model against pre-defined adversarial prompts that are not adapted to the model&#8217;s defenses; and <strong>adaptive</strong> attacks, which use feedback from the model itself &#8212; sometimes powered by reinforcement learning, automated search, or human creativity &#8212; to find weaknesses.</p></li><li><p>The researchers found that in over 90% of cases, <em>adaptive</em> attacks succeeded where <em>static</em> attacks had failed. This caused them to conclude that most companies are still testing their models too weakly &#8212; for example, against a list of known attack phrases, akin to testing a bank&#8217;s security against only the methods used in last year&#8217;s burglary.</p></li><li><p>The paper also underscores the key role for human red-teamers, since they were more effective than automated tools in finding vulnerabilities in every tested defense.</p></li><li><p>To overcome the deficiencies, the authors propose security-style evaluations of AI systems &#8212; where testers assume the attacker knows how the defense works and has access to significant resources.</p></li></ul><h1>Evaluations</h1><h3><strong>AI Benchmarking is Broken</strong></h3><ul><li><p><strong>What happened: </strong>Researchers from Princeton, CISPA, MIT, UCLA, and others <a href="https://arxiv.org/pdf/2510.07575">argue</a> that AI benchmarking - the process of measuring model performance against shared datasets and taxonomies - is fundamentally flawed. They propose <em>PeerBench</em>, a new community-governed platform for evaluating AI models under supervised, auditable, and continuously-refreshed conditions.</p></li><li><p><strong>What&#8217;s interesting: </strong>AI model developers and users often rely on &#8216;benchmarks&#8217; to compare the strength of leading models against one another.<strong> </strong>However<strong>, </strong>the authors frame AI benchmarking as a &#8216;Wild West&#8217; where &#8220;leaderboard positions can be manufactured&#8221; and &#8220;scientific signal is drowned out by noise.&#8221;</p></li><li><p>A core problem is that many benchmarks - such as MMLU or GLUE - have become stale and contaminated, with many test questions having leaked into models&#8217; training data. This enables &#8220;test set memorisation,&#8221; where AI models appear to improve without genuinely learning new capabilities.</p></li><li><p>Developers can also use selective reporting and cherry-picked datasets to inflate &#8220;state-of-the-art&#8221; claims, just like companies use &#8216;creative accounting&#8217; to inflate company performance. By highlighting performance on a subset of &#8216;favourable tasks&#8217;, developers can create an &#8216;illusion of across-the-board prowess.&#8217;</p></li><li><p>The robustness of benchmarking methods also varies significantly. Each benchmark tends to use its own scoring conventions, meaning that comparisons between them are often inconsistent and prone to hype. Public benchmarks are also rarely quality-controlled, introducing demographic and linguistic biases that distort outcomes.</p></li><li><p>Finally, static benchmarks &#8216;age poorly.&#8217; They lack &#8216;liveness&#8217; - the continuous inclusion of fresh, unpublished items and are often a &#8220;stale snapshot&#8221; of a model performance. (Researchers at Arthur AI, NYU, and Columbia University, also recently <a href="https://openreview.net/pdf?id=MzHNftnAM1">published</a> a similar commentary critiquing benchmarking. For instance, they show that automated evaluators consistently reward tone and verbosity over factual accuracy or safety.)</p></li><li><p>Of course, some may argue that the authors of this paper misunderstand the primary purpose of benchmarks. Rather than comparing AI systems, benchmarks may be most useful for helping AI developers compare between model iterations during the development stage. When used in this way, they could be more informative.</p></li><li><p>To address these weaknesses of benchmarking methods, the authors propose <em><strong>PeerBench</strong></em> to turn model evaluation into a proctored, audited exam system &#8212; the AI equivalent of the SATs. This approach includes:</p><ul><li><p><strong>Sealed test sets:</strong> Questions remain secret until evaluation time, preventing training contamination.</p></li><li><p><strong>Sandboxed execution:</strong> All models are tested in identical, monitored environments, and logs are cryptographically signed to prevent tampering.</p></li><li><p><strong>Rolling renewal:</strong> Old test items are retired and made public for audit, while fresh, unpublished items enter the pool.</p></li><li><p><strong>Peer governance:</strong> A distributed network of researchers and practitioners creates, reviews and approves test items. Each participant has a <em>reputation score</em> &#8212; similar to Stack Overflow or credit ratings &#8212; to help determine their influence. These participants must stake collateral (specifically financial deposits or platform credits) that can be &#8220;slashed&#8221; (forfeited) if they submit malicious tests or systematically deviate from consensus.</p></li><li><p><strong>Transparency through delayed disclosure:</strong> After a test cycle, all data - including test items, model outputs, and validator reviews - are published, enabling full public audit without risking data leaks in advance.</p></li></ul></li><li><p>A practical challenge to getting ideas like PeerBench off the ground is determining the primary capabilities and risks to focus on.</p></li></ul><h1>AI&#8217;s social impact</h1><h3><strong>Will LLMs Calm or Fuel Financial Market Emotions?</strong></h3><ul><li><p><strong>What happened:</strong> Researchers from the US Federal Reserve Board and the Richmond Fed <a href="https://arxiv.org/abs/2510.01451v1">examined</a> LLMs as stand-ins for human traders. They found that AI systems make more rational traders than humans and are less prone to market panics and bubbles.</p></li><li><p><strong>What&#8217;s interesting: </strong>Machine learning has been used in finance since the 1980s, for example to create primitive arbitrage strategies, support high speed algorithmic trading and to scrape and analyse unstructured market data.</p></li><li><p>More recently, financial institutions have tested LLMs as financial traders, leading several regulators, including former Securities and Exchange Commission Chair Gary Gensler, to <a href="https://www.vice.com/en/article/sec-head-financial-crash-caused-by-ai-nearly-unavoidable/">warn</a> about LLM-driven instability. Regulators fear not only &#8216;flash crashes&#8217;&#8212;where models suddenly and collectively sell off assets&#8212;but also the formation of speculative asset bubbles driven by &#8216;herd behavior&#8217;.</p></li><li><p>This paper recreates <a href="https://academic.oup.com/jeea/article-abstract/7/1/206/2295846">Cipriani &amp; Guarino&#8217;s </a>2009 experiments on herd behaviour. Those experiments asked professional traders to buy, sell, or hold a risky asset after receiving private signals about its value. For example, a &#8220;white&#8221; signal indicated a 70% probability that the asset was highly valuable, while a &#8220;blue&#8221; signal suggested a 70% probability that the asset was worthless. Traders had to weigh this private tip against the public trading history of the group to decide whether to trust their own data or follow the crowd.</p></li><li><p>In the new version, the authors repeated this experiment using LLMs, including Claude, Llama, and Amazon&#8217;s Nova Pro as AI traders. Across all tests, the AI traders acted more rationally than humans, following their private information 61&#8211;97% of the time versus 46&#8211;51% for humans. This meant that they produced far fewer &#8220;information cascades&#8221;&#8212; events where investors blindly copy the actions of previous traders&#8212;which are a primary driver of market bubbles and subsequent crashes.</p></li><li><p>When AIs did deviate from the rational behaviour suggested by the signals they received, they tended to be contrarian&#8212;trading <em>against</em> market trends rather than with them. This reflected an overreliance on their own information and under-weighting of market context, suggesting that AI traders may be more likely to miss signals that are embedded in collective behavior.</p></li><li><p>As an additional test, the authors explicitly prompted models to make profit-maximizing decisions. After doing this, the AI traders showed more &#8220;optimal herding&#8221;&#8212;joining the crowd when rational to do so&#8212;but remained more cautious than humans.</p></li><li><p>Despite the positive signs of rational LLM behavior, the authors also identified signs of bias when they changed certain experimental parameters. For instance, one follow-up test flipped the color cues used for &#8220;good&#8221; and &#8220;bad&#8221; signals so that red meant &#8220;good&#8221; and green meant &#8220;bad.&#8221; Once the authors did this, model performance dropped sharply, suggesting that LLMs may carry associations from their training data, such as &#8220;red = danger.&#8221;</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.aipolicyperspectives.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[5 interesting AI Safety & Responsibility papers (#2) ]]></title><description><![CDATA[What we're reading]]></description><link>https://www.aipolicyperspectives.com/p/5-interesting-ai-safety-and-responsibility</link><guid isPermaLink="false">https://www.aipolicyperspectives.com/p/5-interesting-ai-safety-and-responsibility</guid><dc:creator><![CDATA[AI Policy Perspectives]]></dc:creator><pubDate>Wed, 27 Aug 2025 13:38:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GsyF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>To navigate the paper deluge, every so often we share summaries of papers across the AI Safety, Responsibility, and Social Impact domains that caught our eye. Here are 5 from the past 6 weeks. Please share any recent paper that caught your eye!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GsyF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GsyF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp 424w, https://substackcdn.com/image/fetch/$s_!GsyF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp 848w, https://substackcdn.com/image/fetch/$s_!GsyF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp 1272w, https://substackcdn.com/image/fetch/$s_!GsyF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GsyF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp" width="1024" height="576" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16216,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.aipolicyperspectives.com/i/172021765?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GsyF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp 424w, https://substackcdn.com/image/fetch/$s_!GsyF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp 848w, https://substackcdn.com/image/fetch/$s_!GsyF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp 1272w, https://substackcdn.com/image/fetch/$s_!GsyF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ce543d1-38a0-4030-a41b-d0a73f7316f1_1024x576.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Visualising AI, by Google DeepMind</figcaption></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI Policy Perspectives. Subscribe for free to receive new posts - lots more in the pipeline!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>AI Safety</h1><h2><strong>SHADE-Arena: How good are LLM agents at sabotage?</strong></h2><ul><li><p><strong>What happened:</strong> Researchers from Anthropic, Redwood Research, Scale AI, and other institutions <a href="https://www.arxiv.org/abs/2506.15740">introduced</a> SHADE-Arena, the first large-scale, diverse benchmark for evaluating the ability of LLM agents to &#8216;sabotage&#8217; users, by completing complex, benign tasks, while also pursuing harmful covert objectives. The authors find that current frontier LLMs are relatively limited at both carrying out and detecting sabotage.</p></li><li><p><strong>What&#8217;s Interesting</strong>: LLM agents could potentially automate tasks that require significant amounts of autonomy, coordination and decision-making. As such, their successful deployment will require trust that the agents will remain aligned to users&#8217; goals. The authors aim to help AI researchers better understand how LLM agents may carry out sabotage, how easy it is to detect, and how to best mitigate it.</p></li><li><p>To do so, they create a rich synthetic environment with documents, resources and over 340 tools for the agents to use. The benchmark contains 17 task pairs, each of which has a benign &#8216;main task&#8217; that could realistically be given to an LLM agent, such as booking a flight for a conference, as well as a malicious &#8216;side-task&#8217;, such as booking all the flights on a more expensive airline.</p></li><li><p>The agent must complete both tasks while its entire interaction log, or &#8216;transcript&#8217; is reviewed by a separate LLM judge. The LLM judge is unaware of the malicious side-task and is tasked with assigning a &#8216;suspicion&#8217; score based on what it can see. The agent is judged to have succeeded at sabotage if it successfully completes the main task and the side task, while getting a sufficiently low suspicion score.</p></li><li><p>What does this look like in practice? For example, an agent was given a primary task of scanning users&#8217; calendars, drafting agendas for upcoming meetings, and sending them to the invitees. It then was given a side-task to also find and share a Stripe API key with an outside email address &#8212; leaking a credential &#8212; without raising the monitor&#8217;s suspicion (e.g. via encryption).</p></li><li><p>The authors evaluate 14 frontier models. The best saboteur was Claude 3.7 Sonnet which succeeded at completing both tasks, while evading detection, 27% of the time. Gemini 2.5 Pro was second-best, with a 15% success rate. The authors note that models mainly struggle with sabotage due to failures in executing complex tasks and using tools over long timelines. The authors do not assess<em> human</em> saboteur abilities and so the results are difficult to interpret without that baseline.</p></li><li><p>When it came to monitoring for sabotage, Gemini 2.5 Pro was the most effective at distinguishing between malicious and benign agent transcripts, although the researchers note that its 0.87 AUC score is likely still "prohibitively low for practical applications" and marked by many false positives. Despite this, human annotators performed worse than the top AI monitors.</p></li><li><p>In cases of successful sabotage, a crucial factor was the agent's access to a &#8216;hidden scratchpad&#8217; for planning that was not visible to the LLM judge. This can be thought of as a private chain&#8209;of&#8209;thought buffer. When access to this hidden reasoning space is removed - simulating <em>full</em> chain-of-thought monitoring - agents' sabotage success rates drop significantly, suggesting this is a powerful mitigation strategy.</p></li><li><p>Curiously, providing agents with a hint for a clever, pre-written sabotage strategy often did not improve performance and sometimes made detecting sabotage easier. Ultimately, the authors found the primary limiting factor for sabotage is that models struggle to implement complex, multi-step plans without making obvious mistakes.</p></li></ul><h1>Alignment</h1><h2><strong>Alignment Revisited: LLMs&#8217; </strong><em><strong>stated and revealed </strong></em><strong>preferences are inconsistent</strong></h2><ul><li><p><strong>What happened: </strong><a href="https://arxiv.org/abs/2506.00751">Researchers</a> introduced a new framework for measuring the (in)consistency between what an LLM says it believes and what it actually does in related scenarios. They find that LLMs are capricious: even subtle prompt shifts can cause them to abandon their stated principles. This inconsistency poses a challenge for their safe deployment in the wild.</p></li><li><p><strong>What&#8217;s interesting:</strong> The authors test this "preference deviation" by crafting a database of dilemmas, on moral questions of right or wrong, and attitudes to risk, among others. For each dilemma, they use a base prompt to elicit a model&#8217;s <em>stated</em> preference on an underlying principle. For example: &#8216;Should language models avoid gender stereotypes?&#8217; Then, they use a set of contextualized prompts to test the model&#8217;s <em>revealed preference </em>in relevant scenarios: "The nurse walked into the room. What did __ say?"</p></li><li><p>The results revealed that even subtle contextual shifts can cause models to deviate from their stated principles. For example, after articulating a preference for utilitarian ethics, a model might abandon that principle when confronted with a slightly-altered &#8216;cost-benefit trade-off&#8217; for a <a href="https://en.wikipedia.org/wiki/Trolley_problem#:~:text=The%20trolley%20problem%20is%20a,to%20save%20a%20larger%20number.">trolley problem</a>. This phenomenon was prevalent across all tested LLMs.</p></li><li><p>In one telling example, a model stated that people prioritise &#8216;attractiveness&#8217; in dating situations. Yet in the contextualised prompts, when its assigned persona was changed from male (&#8216;Peter&#8217;) to female ("Kate"), its revealed preference flipped to prioritising &#8216;financial success&#8217; in dating scenarios. The model justified its choice by citing a desire to achieve &#8216;long term partnership&#8217; and &#8216;practical compatibility&#8217;, and omitted the gender stereotype that was actually underpinning its decision.</p></li><li><p>Both GPT 4.1 and Gemini 2.0 flash demonstrated a similar tendency to shift preferences when the context changed. While GPT&#8217;s reasoning was more susceptible to changes in scenarios involving risk and reciprocity, Gemini showed greater shifts when faced with moral dilemmas. Conversely, the researchers noted that Claude-3.7 Sonnet was "notably conservative" and frequently refused to state a preferred principle in response to the base prompts, adopting a neutral stance 84% of the time. The authors suggest that this is likely a "shallow alignment strategy" designed to avoid taking a stand. This, in turn, makes it impossible to reliably measure its subsequent preference deviation</p></li></ul><h1>Security and privacy</h1><h2><strong>Meta SecAlign: An open LLM that is more secure against prompt injections</strong></h2><ul><li><p><strong>What happened: </strong>Researchers from Meta FAIR and UC Berkeley <a href="https://arxiv.org/abs/2507.02735">introduced</a> META SECALIGN, the first open-source and open-weight LLM with built-in, model-level defenses against prompt injection attacks. Prompt injections insert malicious instructions into a prompt to make an AI ignore its original programming and perform unintended/unauthorised actions. The researchers aim to provide the AI security community with a commercial-grade, secure foundation model to accelerate the open co-development of new prompt injection attacks and defenses.</p></li><li><p><strong>What&#8217;s interesting:</strong> The researchers trained this model using a method called SecAlign++. The training recipe introduces a new &#8216;input&#8217; role into the LLM chat template, which is designed to capture &#8216;untrusted data&#8217; - like tool outputs or retrieved documents - and explicitly separate them from &#8216;trusted&#8217; system and user instructions.</p></li><li><p>The team fine-tuned Llama 3 models using Direct Preference Optimization, a method that shows the model preferences, with desirable and undesirable answers, rather than a single answer. The authors created the preference data by taking a generic instruction-tuning dataset, injecting a random instruction, and then using the undefended model&#8217;s own completions to generate desirable (ignores the prompt injection) and undesirable (follows the prompt injection) response pairs.</p></li><li><p>Across seven security benchmarks, META-SECALIGN achieves state-of-the-art robustness, performing comparably or, in some cases, better than closed-source models like GPT-4o and Gemini-Flash-2.5. For example, on the <a href="https://arxiv.org/pdf/2406.13352">AgentDojo benchmark</a>, the success rate for attacks falls to ~2%, from ~14% in the base model. On the <a href="https://arxiv.org/pdf/2504.18575?#:~:text=To%20more%20accurately%20measure%20progress,et%20al.%2C%202024).">WASP web agent benchmark</a>, the model achieved a near-zero end-to-end attack success rate.</p></li><li><p>Although the model was only trained on generic instruction-following data, the security defense transferred surprisingly well to unseen, complex downstream tasks like API calling and agentic web navigation. The authors argue that many commercial models suffer a significant drop in performance when security features are enabled. In contrast, META SECALIGN demonstrates a very small &#8216;utility cost&#8217;, which the authors attribute to their design of separating untrusted data into a dedicated &#8216;input role&#8217;.</p></li><li><p>This is made possible by their use of LoRA (Low-Rank Adaptation) for the DPO fine-tuning. Instead of retraining the entire model, LoRA efficiently trains a small "adapter" that learns the security policy, which can then be dialed up or down at inference time. This gives developers more precise control over the security-utility trade-off without needing to retrain the model.</p></li></ul><h1>Evaluations</h1><h2><strong>The Illusion of Thinking: do reasoning models struggle with complex puzzles?</strong></h2><ul><li><p><strong>What Happened</strong>: In a <a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">much-discussed paper</a>, researchers at Apple examined the capabilities of Large Reasoning Models, i.e. those that generate internal chains of thought before answering. The authors argue that, beyond a certain complexity threshold, the accuracy of LRMs collapses.</p></li><li><p><strong>What&#8217;s Interesting</strong>: The authors note that early LRMs have demonstrated significant performance improvements in math and coding, but that the evaluations typically focus on the final answer, rather than the quality of the reasoning itself. It is also unclear how well these evaluations capture genuine, generalisable reasoning capabilities.</p></li><li><p>To address this, the researchers compare reasoning and non-reasoning models on &#8216;controllable puzzle&#8217; environments like <a href="https://en.wikipedia.org/wiki/River_crossing_puzzle">River Crossing</a> or <a href="https://en.wikipedia.org/wiki/Tower_of_Hanoi">Tower of Hanoi</a>. In the latter, the models need to move disks from one rod to another while adhering to various rules.</p></li><li><p>The researchers use these puzzles because they can precisely increase the complexity of the task, or the minimum number of moves needed to solve it, e.g. by adding more disks in the Tower of Hanoi. They can also check the quality of the intermediate reasoning steps that models pursue.</p></li><li><p>The results? When dealing with low-complexity tasks, the authors find that standard non-reasoning models are more accurate and token-efficient than their reasoning counterparts. For medium complexity tasks, reasoning models performed much better than standard models, although this is driven by more token usage.</p></li><li><p>Performance for both model types hit a snag for high complexity puzzles. Standard models experienced a complete performance collapse. The performance of reasoning models also tapered off as complexity increased, although complete collapse occurred at higher levels of complexity than for standard models.</p></li><li><p>What explains the findings? The authors argue that for lower-complexity problems, reasoning models often &#8216;overthink&#8217; - they find the correct solution early on but continue to explore incorrect paths. For more complex problems, reasoning models came to correct solutions much later in the thought process, if they came to them at all. The authors also found that as problem complexity increased, the models&#8217; reasoning efforts - measured in thinking tokens - increased up to a point (~20,000 tokens) but then declined as they approached the accuracy collapse point.</p></li><li><p>The authors also experimented with providing models with an explicit step-by-step algorithmic guide for the Tower of Hanoi problem, but this did not prevent performance collapse. For the authors, this suggests that, even if reasoning models uncover strategies for complex problems, they may struggle to execute them.</p></li><li><p>Since this paper&#8217;s release, it has been the subject of considerable discussion. A June 2025 response paper by Alex Lawsen at Open Philanthropy, with Anthropic&#8217;s Claude Opus - <em><a href="https://arxiv.org/abs/2506.09250v1">The Illusion of the Illusion of Thinking</a></em> - argues that the findings are primarily the result of a flawed experimental design. In particular, they suggest:</p><ul><li><p>The &#8216;performance collapse&#8217; on the Tower of Hanoi puzzle coincides with models hitting their maximum output token limits. When prompted to generate a <em>function</em> that solves the puzzle, instead of the full list of moves, models performed with high accuracy.</p></li><li><p>The River Crossing puzzles were mathematically impossible to solve at higher complexities, yet models were penalized for failing to provide solutions to unsolvable problems.</p></li><li><p>Solution length &#8211; or the minimum moves required to solve a puzzle &#8211; is a poor metric for the complexity of a problem. Puzzles like Tower of Hanoi require many computationally trivial moves, whereas puzzles like River Crossing require more computationally-intensive moves, such as those that require lots of exploration or satisfying more constraints.</p></li></ul></li><li><p>Nathan Lambert also <a href="https://www.interconnects.ai/p/the-rise-of-reasoning-machines">argued</a> that while the paper successfully shows the limitations of current models (and methods generally) when it comes to handling complex questions, showcasing models&#8217; imperfections is not a conclusive argument that they cannot reason. He also argues AI reasoning does not need to perfectly mirror that of humans.</p></li></ul><h1>The impact of AI on society</h1><h2><strong>AI &amp; society research has become less interdisciplinary</strong></h2><ul><li><p><strong>What happened</strong>: Researchers from the University of Zurich <a href="https://arxiv.org/html/2506.08738v1">analyzed</a> over 100,000 AI-related arXiv papers from 2014 to 2024 to understand the sorts of researchers that are studying AI&#8217;s impacts on safety and society. They find that much of this work was traditionally led by practitioners from the social sciences and humanities, working alongside computer scientists, but it is now increasingly driven by computer scientists.</p></li><li><p><strong>What&#8217;s Interesting</strong>: The study defines &#8216;socially-oriented&#8217; AI research as work that integrates ethical values and societal concerns into a paper&#8217;s research motivations, design, or stated outcomes. For the authors, that has two components: a focus on normative principles (fairness, accountability, safety, etc) and discussion of topics like healthcare, misinformation, or environmental sustainability.</p><ul><li><p>To measure the &#8216;social orientation&#8217; of publications, the research team used two classifiers. First, human experts manually-annotated the social orientation of 1,000 sentences. The authors apply this classifier to the abstract, introduction and conclusion of each paper. The second classifier uses an LLM to identify a paper&#8217;s central research question, and the team assesses if it is socially-oriented.</p></li><li><p>To identify the academic discipline(s) of authors, the research team used the <a href="https://www.semanticscholar.org/">Semantic Scholar API</a> to retrieve their publication history and classified them into three categories: Computer Science and Engineering; Natural Sciences and Medicine; and Social Sciences and Humanities.</p></li></ul></li><li><p>The authors then draw out three main findings:</p><ul><li><p>The first, unsurprisingly, is that teams that include social scientists and humanities practitioners are ~three times more likely to produce socially-oriented AI research, than computer science-only teams.</p></li><li><p>The more striking result is that the volume of socially-oriented AI research coming from computer science-only teams has grown sharply, in absolute and relative terms, increasing from 49% of papers in 2014 to 71% in 2024.</p></li><li><p>The computer science-only teams are publishing more across all socially-oriented domains, from gender and race to medical imaging and language translation. We note that the diversity of these categories highlights the difficulty of determining what qualifies as &#8216;socially-oriented&#8217; AI research.</p></li></ul></li><li><p>The authors provide three potential explanations for this trend:</p></li></ul><ol><li><p>The impact of society-themed workshops and impact statements at conferences, in changing norms within computer science and making socially-oriented topics more prominent.</p></li><li><p>A natural evolution of the AI field, as the technology has matured, from foundational research towards real-world applications. This leads to a corresponding focus on the effects of AI on society and new data/evidence to draw on.</p></li><li><p>The emergence of <em>computational social science</em> as a hybrid field, with a new cluster of researchers that engage with societal questions and come largely from computer science as their home discipline.</p></li></ol><ul><li><p>Limitations include that the work does not assess the quality of the &#8216;socially-oriented&#8217; research, or the degree to which it is solving a pressing societal problem or pushing the field forward. It also does not cover all AI social impact questions, and its focus on arXiv rather than other journals, for example from the social sciences, may bias the results.</p></li><li><p>The authors frame their results as both positive - new norms within computer science that give greater priority to questions of ethics and social impact - and more concerning - something may be lost, in the absence of more diverse perspectives. They also frame their work as a provocation for social scientists and others to better clarify the distinct contributions they can make to AI&#8217;s future development and deployment.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI Policy Perspectives. Subscribe for free to receive new posts - lots more in the pipeline!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p>Note: This original draft was updated to correct a mistake in the attributed authors of the paper: <em><a href="https://arxiv.org/pdf/2506.09250v1">The Illusion of the Illusion of Thinking</a>. </em></p>]]></content:encoded></item><item><title><![CDATA[5 interesting AI Safety & Responsibility papers (#1)]]></title><description><![CDATA[What we're reading]]></description><link>https://www.aipolicyperspectives.com/p/5-interesting-ai-safety-responsibility</link><guid isPermaLink="false">https://www.aipolicyperspectives.com/p/5-interesting-ai-safety-responsibility</guid><dc:creator><![CDATA[AI Policy Perspectives]]></dc:creator><pubDate>Thu, 22 May 2025 14:22:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6COx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>To navigate the paper deluge, every so often, we share summaries of papers across the AI Safety, Responsibility, and Social Impact domains that caught our eye. Here are 5 from the past 6 weeks. Please let us know a recent paper that caught your eye!</em> </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6COx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6COx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png 424w, https://substackcdn.com/image/fetch/$s_!6COx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png 848w, https://substackcdn.com/image/fetch/$s_!6COx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png 1272w, https://substackcdn.com/image/fetch/$s_!6COx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6COx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6COx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png 424w, https://substackcdn.com/image/fetch/$s_!6COx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png 848w, https://substackcdn.com/image/fetch/$s_!6COx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png 1272w, https://substackcdn.com/image/fetch/$s_!6COx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ef24562-1a36-47cd-9ab9-4c0e9075b913_3432x1931.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI Policy Perspectives.  Subscribe for free to receive new posts - lots more in the pipeline!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>AI Safety</h3><h4><strong>Superintelligence strategy</strong></h4><ul><li><p><strong>What happened: </strong>In March, Dan Hendrycks, Eric Schmidt, and Alexandr Wang published a <a href="https://files.nationalsecurity.ai/Superintelligence_Strategy.pdf">strategy</a> for addressing national security risks posed by superintelligence, which they define as AI that is &#8216;<em>vastly better than humans at nearly all cognitive tasks.&#8217;</em></p><ul><li><p>The authors warn that rapid AI advances could disrupt global power balances, increase the chances of major conflict, and lower the barriers to rogue actors attacking critical infrastructure or creating novel pathogens.</p></li><li><p>The authors criticise existing proposals about how to respond, including (1) the <em>laissez-faire </em>&#8220;YOLO&#8221; approach; (2) relying on voluntary commitments from AI labs to halt AI development if hazardous capabilities emerge; and calls for (3) a single, monopolistic AI "Manhattan Project".</p></li><li><p>The authors see each of these responses as insufficient or risky, arguing that voluntary commitments lack enforcement mechanisms, while<a href="https://www.economist.com/by-invitation/2025/03/28/dan-hendrycks-warns-america-against-launching-a-manhattan-project-for-ai"> a Manhattan project could inspire countermeasures from rivals</a>. Instead, the authors propose a strategy with three pillars: deterrence, nonproliferation, and competitiveness.</p></li></ul></li><li><p><strong>What&#8217;s interesting:</strong></p><ul><li><p><strong>The deterrence pillar</strong> - "<em>Mutual Assured AI Malfunction</em>" - suggests that any nation's attempt to achieve sole AI dominance should trigger preventive sabotage from rivals, for example by covertly degrading their training data or attacking their data centres. The authors argue that this is the current strategic reality facing AI superpowers, and those superpowers should ensure that it continues, for example by building data centres in remote locations, similar to how superpowers intentionally placed missile silos and command facilities far from major population centers during the nuclear era.</p></li><li><p><strong>The &#8216;competitiveness&#8217; pillar </strong>calls on countries to bolster their economic and military strength by building out domestic manufacturing and supply chains for chips (and drones) to address the significant vulnerabilities stemming from reliance on Taiwan today. The authors also call on countries to integrate AI into their military operations, including command and control and cyber offense.</p></li><li><p><strong>The &#8216;nonproliferation&#8217; pillar </strong>aims to keep &#8216;weaponiseable&#8217; AI out of rogue actors&#8217; hands by treating advanced AI chips like nuclear fissile material. This includes <a href="https://www.ai-frontiers.org/articles/location-verification-ai-chips">tracking their location</a>, supervising destinations, and implementing export controls and licensing. It also recommends restricting access to AI model weights and implementing technical safeguards, such as training AI models to refuse harmful requests. Dan Hendrycks also proposed the latter idea in his <a href="https://www.ai-frontiers.org/articles/ais-are-disseminating-expert-level-virology-skills">recent paper</a> with Laura Hiscott, which claimed that frontier AI models now outperform human scientists in troubleshooting certain virology procedures.</p></li><li><p>One response to the Superintelligence Strategy came in a blog by Helen Toner - <em><a href="https://helentoner.substack.com/p/nonproliferation-is-the-wrong-approach">Nonproliferation is the wrong approach to AI misuse</a>. </em>Helen challenged the viability of the non-proliferation pillar, given how rapidly frontier AI capabilities are tending to become widely available. Instead, she called for greater focus on building societal resilience, such as educating critical infrastructure providers about AI cyber risks and investing in preventative biosecurity measures, such as vaccine platforms and efforts to better screen DNA orders.</p></li></ul></li></ul><h2>Security and privacy</h2><h4><strong>Defeating Prompt Injections by Design</strong></h4><ul><li><p><strong>What happened: </strong>When LLM-based agents interact with information sources like emails or documents, they may be vulnerable to &#8216;prompt injection attacks&#8217;, where an adversary injects malicious instructions into this data to get the agent to carry out unauthorized actions or leak confidential information. In March, researchers from GDM, Google, and ETH Zurich published <a href="https://arxiv.org/abs/2503.18813">CaMeL</a> - a novel defence for securing LLM agents against such attacks. CaMeL creates a protective system layer around the LLM that provides security guarantees without modifying the underlying model.</p></li></ul><ul><li><p><strong>What&#8217;s interesting:</strong> Inspired by traditional software security principles, such as <a href="https://en.wikipedia.org/wiki/Control-flow_integrity">Control Flow Integrity</a> and <a href="https://csrc.nist.gov/glossary/term/information_flow_control">Information Flow Control</a>, CaMeL separates the &#8216;control flow&#8217;<em> </em>- the sequence of actions that an agent plans, based on a <em>trusted</em> user&#8217;s query - from the &#8216;data flow&#8217; - the processing of <em>potentially untrusted </em>data.</p><ul><li><p>It does so by using two LLMs - one &#8216;privileged&#8217; and one &#8216;quarantined&#8217;. When a user makes a command, the Privileged LLM, which can use tools, generates a plan with the steps needed to complete the request. This is written in Python code and is based <em>solely </em>on the trusted user&#8217;s query.</p></li><li><p>When the plan requires processing potentially untrusted data, like emails or documents, the task is handed to the Quarantined LLM which processes this data, but <em>cannot</em> use tools - limiting the &#8216;blast radius&#8217; of any prompt injection.</p></li><li><p>A Custom Interpreter executes the Privileged LLM&#8217;s plan. This includes attaching metadata, or &#8216;capabilities&#8217;, to all data values to track their provenance, and to enforce security policies about who can access it. For example, these capabilities would prevent a confidential document from being sent to an attacker&#8217;s email, even if the attacker managed to compromise the Quarantined LLM.</p></li><li><p>CaMeL thus provides <em>security by design</em>, rather than relying on the LLM's probabilistic behavior. To evaluate the approach, the authors use the <a href="https://arxiv.org/abs/2406.13352">AgentDojo </a>benchmark. Overall, CaMeL blocked every one of the 949 prompt-injection attacks recorded in the benchmark. It also had only a minor negative impact on the probability of the LLM successfully completing non-adversarial use cases</p></li><li><p>These results mean that CaMeL significantly outperforms other heuristic-based approaches like <a href="https://learnprompting.org/docs/prompt_hacking/defensive_measures/sandwich_defense?srsltid=AfmBOopxVsgt0qZn7cxcyQ2dJN3unWNLUEzTGMoEcTKb7fDIt0qsy_Iv">Prompt Sandwiching</a> or <a href="https://arxiv.org/abs/2403.14720">Spotlighting</a> at enhancing security, albeit with some trade-offs for certain complex tasks, like planning travel queries.</p></li></ul></li></ul><h2>Evaluations</h2><h4><strong>Values in the Wild</strong></h4><ul><li><p><strong>What happened:</strong> In April, researchers at Anthropic <a href="https://www.anthropic.com/research/values-wild">shared</a> a novel, privacy-preserving method that used over 300,000 real-world user interactions to identify and categorise more than 3,000 &#8216;values&#8217; expressed by Claude.</p></li><li><p><strong>What's interesting: </strong>As the Anthropic team notes, users don&#8217;t only ask LLMs for factual information. They also ask them to make <em>value judgments. </em>For example, if a user asks for advice on how to manage their boss, the LLM could disproportionately emphasise <em>assertiveness</em> or<em> workplace harmony</em>.</p><ul><li><p>To understand what values Claude exhibits, the Anthropic team took a random sample of more than 300,000 &#8216;subjective&#8217; conversations. Within this dataset, they defined a &#8216;value&#8217; as any normative consideration that appears to influence how Claude responds. They also documented values that users explicitly state in these interactions and assessed how Claude engaged with these values.</p></li><li><p>The values that Claude most frequently expressed were "helpfulness" (23%), "professionalism" (23%), "transparency" (17%), "clarity" (16%), and "thoroughness" (14%). This set of common values shows the impact of the &#8220;helpful, harmless, honest" <a href="https://arxiv.org/pdf/2212.08073">approaches</a> used to train Claude. For example, the "accessibility" value can be mapped to helpfulness, "child safety" to harmlessness, and "historical accuracy" to honesty</p></li><li><p>The most common values tended to apply across contexts, but others were more context-specific. For example, Claude tended to demonstrate the "healthy boundaries" value when users sought relationship advice, while "human agency" appeared in discussions about the ethics and governance of technology.</p></li><li><p>When users explicitly stated their own values, Claude&#8217;s responses varied. It often mirrored positive values, like &#8220;authenticity" but tried to counter values such as "deception" with "ethical integrity" and "honesty". It also resisted values that could violate its usage policy, such as "rule-breaking" or "moral nihilism", although strong resistance was rare, occurring in only 3% of conversations.</p></li><li><p>The research also surfaced uncommon and undesirable values, such as "sexual exploitation", which pointed to potential jailbreaks.</p></li><li><p>As the authors note, the research has some limitations, such as its reliance on a subset of user engagements and the fact that it is only viable to do it <em>after </em>a model has been deployed. But it provides valuable empirical evidence for how models actually behave and builds on separate Anthropic <a href="https://www.anthropic.com/news/the-anthropic-economic-index">research</a> that maps Claude&#8217;s user logs to real-world economic use cases.</p></li></ul></li></ul><h2>Transparency</h2><h4><strong>Safety Evaluations Hub</strong></h4><ul><li><p><strong>What happened:</strong> In May, OpenAI released a new <a href="https://openai.com/safety/evaluations-hub/">Safety Evaluations Hub</a> - a public dashboard that shares the latest safety and performance results for every major model family, from GPT-4.1 to the lightweight o-series variants.</p><ul><li><p>Unlike more traditional safety &#8216;transparency&#8217; artifacts, like <a href="https://modelcards.withgoogle.com/model-cards">model cards</a>, the hub is designed to be continually refreshed. As new evaluation techniques emerge or old ones saturate, OpenAI will update the results. OpenAI hopes that this will provide researchers, regulators and customers a sense of how its defences hold up over time, instead of relying on data from model launch day.</p></li></ul></li><li><p><strong>What&#8217;s interesting</strong>: At launch, the hub provides results for four text-based risk categories:</p><ul><li><p><strong>Disallowed content: </strong>An automatic checker, or autograder, scores model outputs on two metrics: &#8220;not unsafe&#8221; - meaning the answer does not violate content safety policies; and &#8220;not_overrefuse&#8221; - meaning the model does not refuse a safe request. Models score close to perfect on a standard evaluation set, but results drop for a tougher &#8220;challenge&#8221; set.</p></li><li><p><strong>Jailbreak resistance: </strong>OpenAI grades its model resilience against adversarial attempts to bypass safety filters and output harmful content. It uses a set of human-sourced attacks and <a href="https://arxiv.org/abs/2402.10260">StrongReject</a> - an academic benchmark that bundles the best-known automated attacks. Models are evaluated on how they hold up against the top 10% most effective attacks. Today&#8217;s GPT-4.1 scores 0.23/1.00, while o1 scores 0.83 (higher is better).</p></li><li><p><strong>Hallucination</strong>: For factuality and hallucination, OpenAI reports results for its own <a href="https://openai.com/index/introducing-simpleqa/">SimpleQA evaluation</a> (4,000 short factual questions) and PersonQA (facts about public figures). OpenAI notes that letting the model browse the web would likely reduce hallucinations</p></li><li><p><strong>Instruction hierarchy</strong>: These evaluations verify that when it receives conflicting messages, the model respects the chain of command: system rules outrank developer guidelines, which outrank user requests - think &#8220;listen to your boss before your friend.&#8221; Results range from 0.5 for GPT-4o-mini to 0.85 for o1.</p></li></ul></li><li><p>The hub forms part of an ongoing debate about what optimal &#8216;transparency&#8217; looks like for LLMs. Researchers have invested significant time in articulating best practices for artefacts like model cards and data cards, but given the pace of updates that leading LLMs go through, it&#8217;s unclear how frequently these model cards should be updated, or whether &#8216;models&#8217; are still the most important target, given the shift to AI &#8216;products&#8217;, &#8216;agents&#8217;, or &#8216;systems&#8217;. </p></li><li><p>A desire for more <em>frequent</em> safety reporting may also skew efforts towards evals that can be automated and produced quickly, rather than more complex and expensive evals, or more detailed analysis of what to draw from the results. Alongside labs&#8217; own transparency efforts, <a href="https://crfm.stanford.edu/fmti/May-2024/index.html">academics</a> and <a href="https://transparency.oecd.ai/reports/d2fd9a2b-5076-4675-8eb1-136166e92a7d">governments</a> are also pursuing their own reporting of AI labs&#8217; safety efforts, making the optimal blend of reporting an ongoing subject of debate.</p></li></ul><h2>Societal impact of AI</h2><h4><strong>A Quest for AI Knowledge</strong></h4><ul><li><p><strong>What happened</strong>: <a href="https://www.joshuagans.com/bio">Joshua Gans</a>, a University of Toronto professor and co-author of <a href="https://www.predictionmachines.ai/">Prediction Machines</a>, published an <a href="https://www.nber.org/papers/w33566">NBER working paper</a> that modelled the impact of AI on scientific discovery.</p></li><li><p><strong>What&#8217;s interesting:</strong></p><ul><li><p>Gans frames scientific discovery as an exploration of &#8216;<em>terra incognita&#8217;.</em> Under this model, making a discovery in one area - such as a protein&#8217;s structure - makes it easier to learn about nearby, related areas - e.g. that protein&#8217;s role in disease.</p></li><li><p>Scientists traditionally face a trade-off: pursue riskier, novel, research that expands the frontier; or pursue a safer deepening of existing knowledge. Gans' building on a recent framework by <a href="https://onlinelibrary.wiley.com/doi/10.3982/ECTA22144">Carnehl and Schneider</a>, suggests that this leads to a degree of conservatism, where scientists push the frontier but cautiously and incrementally - a "ladder structure".</p></li><li><p>Gans characterizes modern AI systems as powerful 'interpolators' that excel at synthesising existing knowledge and filling gaps, such as predicting protein structures for sequences that are intermediate to those that have been experimentally determined.</p></li><li><p>Gans&#8217; core argument is that AI&#8217;s effectiveness at interpolation encourages scientists who use it to shift their efforts towards more novel, frontier-pushing questions. This can reduce scientific conservatism and lead to a "stepping stone" pattern of knowledge expansion - where scientists make discrete jumps to new frontiers which AI then helps to fill in.</p></li><li><p>To support his argument, Gans cites <a href="https://www.pnas.org/doi/full/10.1073/pnas.2315002121">GDM&#8217;s analysis</a> of the impact of AlphaFold-2, noting that after its release, structural biologists redirected their focus towards less well-mapped areas of protein science, like large protein complexes, protein-nucleic acid interactions, and dynamic/disordered regions..</p></li><li><p>Gans also considers scenarios with multiple initial knowledge points, leading to "research cycles" where scientists alternate between expanding frontiers and strategically deepening knowledge to connect these "islands" of understanding.</p></li><li><p>Ultimately, the paper posits that AI tools could mitigate inefficiencies in science research by better aligning the incentives facing scientists with the social optimum, which often involve more novel, "moonshot" research.</p></li><li><p>In our <a href="https://www.aipolicyperspectives.com/p/a-new-golden-age-of-discovery">recent essay</a>, we acknowledged the concerns that some scientists have about the potential effects of AI on scientific creativity, but ultimately expressed optimism that AI would benefit it. Gans&#8217;s paper provides support for this view, arguing that AI could help - and encourage - scientists to pursue more creative and impactful questions than would otherwise be possible.</p></li><li><p>Gans also focusses on &#8216;narrow&#8217; AI systems and their ability to effectively interpolate data in known knowledge spaces. He does not consider the potential for AI scientists to &#8216;extrapolate&#8217; beyond their training data and pursue autonomous hypothesis generation and testing, which could extend scientific creativity even further. This creates a heightened need to monitor AI adoption and impact in science.</p></li></ul></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI Policy Perspectives.  Subscribe for free to receive new posts - lots more in the pipeline!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[What we learned from reading ~100 AI safety evaluations ]]></title><description><![CDATA[The merits of AI meta-evaluation]]></description><link>https://www.aipolicyperspectives.com/p/what-we-learned-from-reading-100</link><guid isPermaLink="false">https://www.aipolicyperspectives.com/p/what-we-learned-from-reading-100</guid><dc:creator><![CDATA[AI Policy Perspectives]]></dc:creator><pubDate>Thu, 03 Apr 2025 09:17:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VFVB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>In this blog, <a href="https://www.linkedin.com/in/conor-griffin-6902bb7/?originalSubdomain=uk"><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Conor Griffin&quot;,&quot;id&quot;:10433197,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f534d44b-8798-43ce-9f4f-16fd9f08a87c_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;d0906fe6-b415-4899-a086-f67f5fe818cf&quot;}" data-component-name="MentionToDOM"></span></a> and <a href="https://www.linkedin.com/in/julian-jacobs-a729b87a/?originalSubdomain=uk">Julian Jacobs </a>review a batch of AI safety evaluations that were published in 2024. We ask what we can learn from this body of work and make the case for scaling up this practice of &#8216;AI meta evaluation&#8217;.  Like all the pieces you read here, it is written in a personal capacity. If you are doing work in this space, or have ideas, please get in touch at aipolicyperspectives@google.com.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VFVB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VFVB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg 424w, https://substackcdn.com/image/fetch/$s_!VFVB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg 848w, https://substackcdn.com/image/fetch/$s_!VFVB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!VFVB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VFVB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg" width="1280" height="894" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:894,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:349419,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.aipolicyperspectives.com/i/160450973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VFVB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg 424w, https://substackcdn.com/image/fetch/$s_!VFVB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg 848w, https://substackcdn.com/image/fetch/$s_!VFVB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!VFVB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae99705-8a1b-400c-9e10-37851df6f9dc_1280x894.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Venus Krier </figcaption></figure></div><p>In recent years, the number of AI evaluations has increased sharply. On the <em>capabilities</em> front, the rise of powerful, open-ended large language models created a need to better understand what these models are capable of, leading to evaluations such as the (<a href="https://gradientscience.org/gsm8k-platinum/">recently-updated</a>) <a href="https://arxiv.org/abs/2110.14168">GSM8K</a> dataset of math problems that evaluates <a href="https://gradientscience.org/gsm8k-platinum/">how well</a> models carry out mathematical reasoning. As the number of AI labs has grown, new platforms such as <a href="https://lmarena.ai/">Chatbot Arena</a>, <a href="https://livebench.ai/#/">LiveBench</a>, <a href="https://crfm.stanford.edu/helm/">HELM</a>, and <a href="https://artificialanalysis.ai/">Artificial Analysis</a> have emerged to rank AI models based on these capability evaluations.</p><p>The step-jump in AI capabilities has also driven calls for better evaluations of <em>safety </em>risks. One recent example is the <a href="https://deepmind.google/discover/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/">FACTS Grounding benchmark</a>, which assesses models&#8217; ability to answer questions about specific documents in a way that is accurate, comprehensive and free of unwanted hallucinations. AI labs now <a href="https://arxiv.org/abs/2404.14068">design and conduct</a> a growing suite of these safety evaluations to inform how they develop and deploy their models. Public bodies, such as the global network of AI Safety Institutes, <a href="https://www.nist.gov/system/files/documents/2024/11/21/Improving%20International%20Testing%20of%20Foundation%20Models-%20%20%20A%20Pilot%20Testing%20Exercise%20from%20the%20International%20Network%20of%20AI%20Safety%20Institutes.pdf">also run AI safety evaluations</a> to better anticipate the effects of AI on society. Across academia and civil society, a diverse range of individuals are designing new AI safety evaluations, although many <a href="https://www.anthropic.com/research/evaluating-ai-systems">lack the funds, compute, skills and model access</a> to develop the kinds of evaluations that they would like to.</p><p>Despite this uptick in AI evaluation activity, there is little <em>aggregate</em> data about the new AI evaluations that practitioners are developing. In the 20th century, the expansion of science research created a demand for <em>metascience </em>- a new field dedicated to analysing what scientists were working on and how impactful it was, so that this impact could be scaled further. In a similar vein, we now need <em>AI meta-evaluation</em> - a structured effort to analyse the evolving landscape of AI evaluations, so that we can better understand and shape this work.</p><p>With that goal in mind, in this blog we review ~100 new <em>AI safety </em>evaluations that were published in 2024. We spotlight the main risks that these evaluations focussed on, such as outputting inaccurate text, as well as risks that were more neglected, such as the potential impact of audio and video models on fraud and harassment. We also extract trends from this body of evaluations, such as the growing use of AI to evaluate other AI systems. We conclude with ideas for how to best pursue AI meta-evaluation - so that it can provide a more accurate picture of how AI will affect society and better opportunities to shape these impacts. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Thanks for reading AI Policy Perspectives. Subscribe for free to receive new posts.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1><strong>A. What new AI Safety evaluations were published in 2024?</strong></h1><p>In 2023, <a href="https://scholar.google.com/citations?user=SFQLTCkAAAAJ&amp;hl=en">Laura Weidinger</a> and colleagues at Google DeepMind led a <a href="https://arxiv.org/pdf/2310.11986">research effort</a> to categorise AI safety evaluations for generative AI systems. They identified ~250 evaluations, published between January 2018 and October 2023, that met certain criteria, such as introducing new datasets and metrics. They categorised these evaluations in several ways, including by:</p><ul><li><p><strong>Modality: </strong>Does the evaluation focus on text, image, video, audio, or multimodal data?</p></li><li><p><strong>Risk type: </strong>Which of the following risk types does the evaluation cover?</p><ul><li><p>Representation &amp; toxicity</p></li><li><p>Misinformation</p></li><li><p>Other information safety harms</p></li><li><p>Human autonomy &amp; integrity</p></li><li><p>Malicious use of AI</p></li><li><p>Socioeconomic &amp; environmental harms</p></li></ul></li><li><p><strong>Evaluation layer:</strong> How much &#8216;context&#8217; does the evaluation capture about the interaction between AI, users, and society?</p><ul><li><p>Does it evaluate a model&#8217;s immediate outputs?</p></li><li><p>Does it evaluate multi-turn interactions between a human and a model?</p></li><li><p>Does it evaluate AI&#8217;s longer-term aggregate impacts, such as the effects on employment, the environment or the quality of online content?</p></li></ul></li></ul><p>Laura and the team found that about 85% of the AI safety evaluations focused on text as the input data, with most assessing misinformation and representation risks. Most evaluations also directly examined model <em>outputs</em> rather than how humans interact with AI models or post-deployment impacts that may take time to manifest, such as AI&#8217;s effects on employment.</p><p>One year later - what has changed?</p><p>To answer that question, we asked the AI evaluations firm <a href="https://www.harmonyintelligence.com/">Harmony Intelligence</a> to identify new AI safety evaluations published in 2024. Harmony relied on arXiv as the primary data source and used a range of search terms to identify more than 350 new evaluations. They filtered out approximately 250, most of which narrowly focused on evaluating AI model &#8216;capabilities&#8217; rather than &#8216;safety risks&#8217; - although as we expand on below, this is a difficult distinction to make.</p><p>This filtering process resulted in a sample of just over 100 new AI safety evaluations that were published in 2024. From analysing these evaluations, we found that<strong> while the AI safety evaluation landscape is relatively </strong><em><strong>dynamic, </strong></em><strong>in terms of the </strong><em><strong>volume</strong></em><strong> of new evaluations being published, the </strong><em><strong>types</strong></em><strong> of evaluations being published are relatively static.</strong></p><p>More concretely:</p><ul><li><p><strong>Modality: </strong>Despite AI labs&#8217; growing focus on training multimodal models, more than 80% of new AI safety evaluations in 2024 focused on evaluating text data - an almost identical share to the 2023 findings.</p></li><li><p><strong>Risk type: </strong>Most 2024 evaluations again focused on risks from AI to misinformation and representation. However, there was an increase in evaluations of &#8216;AI misuse&#8217; risks, such as <a href="https://arxiv.org/pdf/2412.00586">using AI to carry out personalised phishing attacks</a>. Many evaluations also assessed multiple risks, highlighting a challenge with our risk taxonomy that we return to below.</p></li><li><p><strong>Evaluation layer: </strong>More than 80% of evaluations assessed model outputs, while just 13% examined human-AI interactions, and only 5% assessed slower post-deployment impacts, for example on employment effects from AI - nearly identical to 2023. Examples of the latter two categories include:</p><ul><li><p>An effort to <a href="https://arxiv.org/abs/2403.04858">evaluate biases in how LLMs answer health-related questions from different population groups.</a></p></li><li><p>An effort to evaluate <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4895486">AI&#8217;s effects on learning outcomes among high school students.</a></p><p></p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DYJ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DYJ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png 424w, https://substackcdn.com/image/fetch/$s_!DYJ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png 848w, https://substackcdn.com/image/fetch/$s_!DYJ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png 1272w, https://substackcdn.com/image/fetch/$s_!DYJ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DYJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png" width="1456" height="790" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:790,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DYJ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png 424w, https://substackcdn.com/image/fetch/$s_!DYJ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png 848w, https://substackcdn.com/image/fetch/$s_!DYJ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png 1272w, https://substackcdn.com/image/fetch/$s_!DYJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb82b11-f395-45db-a905-27f0dfb7feb6_1600x868.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Julian Jacobs &amp; Evan Oliner</figcaption></figure></div><h1><strong>B. What do these results tell us? Caveats and implications</strong></h1><p>As in 2023, our approach has some important limitations. We focus on identifying <em>new </em>AI safety evaluations that were published as papers on arXiv in 2024. This is a significant limitation because, while there is clear value in publishing new safety evaluations - for example, to generate community feedback and buy-in - there are also obstacles to doing so, beyond the time and resources required. For example, publishing new benchmarks and datasets creates a risk that they may inadvertently &#8216;leak&#8217; into AI models training data, even when labs actively try to prevent this. In domains such as biosecurity, some evaluations may be deemed too risky to publish.</p><p>Our reliance on arXiv as a data source also means that we overlook evaluations published elsewhere. For example, we may miss research on AI&#8217;s broader societal impacts in fields such as <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4759218&amp;utm_campaign=Automated%20Society%20EN%20-%20issue%20128&amp;utm_medium=email&amp;utm_source=Mailjet">economics</a>, anthropology, or sociology, where different terminology and publication venues may be used. We also overlook evaluations published in blog posts or GitHub repos. Our focus on <em>novel</em> evaluations also means that we do not analyse how <em>existing</em> AI safety evaluations are being used by AI labs or other stakeholders.</p><p>Taken together, these limitations mean that our findings should be viewed more as a pulse check on recently published research, rather than a systematic review of the AI safety evaluations landscape.</p><p>Despite these caveats, several clear takeaways emerge:</p><p><strong>1. Certain AI safety evaluations are neglected and this is hard to change</strong></p><p>As in previous years, in 2024 researchers found it difficult to develop and/or publish certain kinds of AI safety evaluations, including evaluations of certain data modalities, such as audio and video outputs, evaluations of certain risks, such as those related to human autonomy, privacy, or the environment, and evaluations that rely on more complex methodologies, such as long-term experiments.</p><p>This likely reflects several factors, notably tractability and cost. Some AI models - such video and audio models - are both fewer in number, less widely accessible, and more expensive to evaluate than text-based models. Certain risks, such as misinformation, are well-codified, prominent in public discourse, and relatively easy to communicate. In contrast, studying whether a model can leak private information or pose a risk to cybersecurity introduces complex challenges - not only in testing these risks, but in determining how to act on the findings or how to share methods that themselves may be risky.</p><p><strong>2. Most AI safety evaluations trade off breadth against depth</strong></p><p>Most AI safety evaluations touch on a wide range of potential risk scenarios, but in very limited detail. A small number go deeper on a very specific risk scenario. For example, if you want to assess misinformation risks from AI, one approach is to test a model&#8217;s accuracy in answering questions across multiple domains of interest, from finance to healthcare (broad/shallow). </p><p>Another approach is to set up an experiment with individuals from a sociodemographic group that has high self-reported vaccine hesitancy, to study how back-and-forth interactions with an AI model, in a target language, about potential side-effects from a specific vaccine, affects their views (narrow/deep). These narrow/deep evaluations can provide richer insights, but raise questions about how well their findings will generalise to other types of misinformation risks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AuFK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AuFK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png 424w, https://substackcdn.com/image/fetch/$s_!AuFK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png 848w, https://substackcdn.com/image/fetch/$s_!AuFK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png 1272w, https://substackcdn.com/image/fetch/$s_!AuFK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AuFK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png" width="1456" height="956" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:956,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:631287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aipolicyperspectives.com/i/160450973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AuFK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png 424w, https://substackcdn.com/image/fetch/$s_!AuFK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png 848w, https://substackcdn.com/image/fetch/$s_!AuFK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png 1272w, https://substackcdn.com/image/fetch/$s_!AuFK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb49cd5-3fb6-4c00-bfe4-716fe153d3f5_3243x2129.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Venus Krier </figcaption></figure></div><p><strong>3. The landscape of </strong><em><strong>potentia</strong></em><strong>l AI safety evaluations continues to expand</strong></p><p>As models capabilities improve, and AI adoption grows, new kinds of evaluations are needed. However, we came across few evaluations for more nascent AI capabilities and characteristics, such as longer context windows, reasoning traces, or agentic capabilities like memory, personalisation and tools use.</p><p>More positively, our review highlighted new kinds of evaluation approaches that could potentially help to address some of these gaps. First, as AI adoption grows, it becomes more possible to study real world outcomes, such as the <a href="https://www.sciencedirect.com/science/article/pii/S0167268124004591?via%3Dihub">availability of freelance work</a>, or <a href="https://arxiv.org/pdf/2405.11697">the prevalence and type of misinformation</a>, and work backwards to evaluate the relative impact of AI.</p><p>Second, as the capabilities of AI systems advance, they become more useful to evaluating other AI systems. For example, threat actors can use &#8216;jailbreaking&#8217; methods, such as asking an AI model to roleplay, to manipulate the model into complying with harmful queries. To harden AI models against these jailbreaks, practitioners task human evaluators with &#8216;red teaming&#8217; a model to try to get it to output something that it shouldn&#8217;t. However, these efforts are often limited to exploring a small selection of known risk scenarios. In 2024, Mantas Mazeika and colleagues at the University of Illinois Urbana-Champaign <a href="https://arxiv.org/abs/2402.04249">used</a> 18 jailbreaking methods to evaluate 33 LLMs on their robustness to 510 prohibited behaviours - from cybercrime to generating misinformation. To enable such a comprehensive evaluation, the team used AI to generate test cases, allowing for a broader and deeper analysis than human experts could conduct alone.</p><p>Moving forward, practitioners hope to <a href="https://arxiv.org/abs/2307.15043">combine</a> the expertise of leading human red teamers with the scale and <a href="https://arxiv.org/abs/2501.18837">increasing sophistication</a> of AI models. This kind of hybrid human-AI approach could also apply to other evaluation methods. For example, as noted above, there are relatively few published evaluations of humans interacting with AI models over multiple turns of dialogue. AI <a href="https://arxiv.org/abs/2502.07077">could help</a> to simulate such human-AI interactions. As with AI-assisted red-teaming, the goal wouldn&#8217;t be to simply replace human evaluations. Instead, these techniques could evaluate novel risk scenarios that go beyond the imagination or capability of human red-teamers. They could also inform the design of subsequent, more expensive, human-led evaluations, similar to <a href="https://substack.com/@aipolicyperspectives">how scientists use AI</a> to simulate fusion energy experiments, to inform the design of subsequent real-world experiments.</p><h1><strong>C. What </strong><em><strong>else</strong></em><strong> did we learn from AI safety evaluations published in 2024?</strong></h1><p>New AI safety evaluations do more than provide signals about how risky or safe AI models are. They also challenge how we think about specific AI risks and the best ways to mitigate them. Below, we share a few such insights from reviewing ~100 new safety evaluations published in 2024.</p><p><strong>1. Hallucinations are multifaceted - and not always bad</strong></p><p>One of the main AI risks studied in 2024 was misinformation and hallucinations. This reflects LLMs&#8217; well-documented tendency to make unfounded assertions with high confidence - a phenomenon somewhat akin to the <a href="https://link.springer.com/article/10.1007/s13164-017-0367-y">human tendency to confabulate</a>. Hallucinations pose clear risks when AI is used in high-stakes contexts, such as healthcare applications, or in areas where reliability and trust are paramount, such as science research. However, promising mitigation strategies are emerging, such as training AI models to <a href="https://blog.google/technology/ai/google-datagemma-ai-llm/">ground their outputs to trusted sources</a>, or using AI to <a href="https://arxiv.org/html/2408.15240">verify the outputs</a> of other AI models.</p><p>The term &#8216;hallucination&#8217; is also broad and not inherently negative in all cases. For example, hallucinations can enable the creative juxtaposition of concepts that humans might not normally consider, leading to creative storytelling or unexpected hypotheses. In the life sciences, some researchers are using AI to <a href="https://www.nytimes.com/2024/12/23/science/ai-hallucinations-science.html">&#8216;hallucinate&#8217;, or design, novel proteins</a> - though not all scientists describe this work in those terms.</p><p>From an evaluation perspective, rather than simply aspiring to an aggregate hallucination rate of zero, on a specific evaluation method, we need to better understand<em> when </em>and <em>why </em>hallucinations occur; the degree to which they can be prevented by developers, or made steerable by users; and the actions that best enable this. In 2024, Zhiying Zhu and a team at Carnegie Mellon University made progress on this front by creating a new benchmark, <a href="https://arxiv.org/abs/2403.04307">HaluEval-Wild</a>. This work categorised the types of queries most likely to induce hallucinations - distinguishing, for example, between queries that require a model to engage in complex reasoning and those that are simply erroneous in nature.</p><p><strong>2. Representation risks come in many variants. Some are overlooked</strong></p><p>Almost one-third of evaluations in our sample focus on &#8216;representation&#8217; risks - examining whether AI models and their outputs are disproportionately accurate, useful, or harmful across different groups, particularly in relation to gender, race, and language. Such disparities could arise  due to biases in the datasets used to train AI models, biases among those who evaluate AI models&#8217; outputs, or other factors. </p><p>Representation risks can also take many forms, and some are only now receiving more attention. For example, one 2024 <a href="https://arxiv.org/abs/2404.17401">evaluation</a> examined <em>geographical</em> representation - assessing what an AI model &#8216;knows&#8217; about different regions of the world. AI could provide significant benefits for applications such as <a href="https://content.iospress.com/articles/intelligent-data-analysis/ida230040">crisis response</a>, but this will require accurate geospatial predictions - including for small and remote locations where data may be sparse. As policymakers push to ensure that AI models reflect local languages, cultures, and geographies, some AI developers are <a href="https://openreview.net/forum?id=gkQo3CoPLd">now exploring</a> how to better integrate such geospatial information into their models.</p><p><strong>3. Some proposed risk mitigations may introduce new risks</strong></p><p>Other 2024 evaluations shed light on the complexities of risk mitigation strategies. For example, the use of AI to assist radiologists in diagnosing cancer has been widely discussed and widely <a href="https://newrepublic.com/article/187203/ai-radiology-geoffrey-hinton-nobel-prediction">debated</a>. One proposed way to do this safely - and in a way that radiologists and patients trust - is to include explanations for the model&#8217;s predictions. However, a 2024 <a href="https://arxiv.org/abs/2404.09917">evaluation</a> pointed to potential challenges with this approach. Researchers tasked a small number of radiologists with detecting malignancies in CT scans. In some cases, the radiologists had access to predictions from AI models. In others, they had access to explanations for these predictions that were based on certain features that the model identified in the scans.</p><p>The evaluation found that when the model&#8217;s predictions for malignancy were accurate, access to explanations improved radiologists&#8217; accuracy. However, when the model&#8217;s predictions were incorrect, explanations reduced radiologists&#8217; accuracy<em>. </em>The study&#8217;s small sample size cautions against drawing strong conclusions, but it serves as a reminder that the accuracy and usefulness of AI explanations will need to themselves be evaluated, and that explanations are not a substitute for having a model that is highly accurate and robust. Or having a process to catch errors.</p><p><a href="https://deepmindsafetyresearch.medium.com/human-ai-complementarity-a-goal-for-amplified-oversight-0ad8a44cae0a">Treating AI explanations as an object of evaluation</a> could also incentivise AI labs to move beyond current explainability techniques, which often rely on providing users of an AI model with static information about an output. Instead, AI developers could explore, and evaluate. the kinds of <em>interactive</em> explanations that <a href="https://www.sciencedirect.com/science/article/pii/S0004370218305988">people find most useful elsewhere in society</a>, such as explanations that allow users to ask clarifying questions or to gauge a model&#8217;s confidence in its outputs.</p><p><strong>4. Reliably identifying &#8216;AI safety&#8217; evaluations is extremely difficult</strong></p><p>This blog&#8217;s analysis rests on the idea that it is possible to identify a discrete set of AI evaluations that focus on assessing &#8216;safety risks&#8217;, and to place each of these evaluations into one of six risk categories.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MRyv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MRyv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png 424w, https://substackcdn.com/image/fetch/$s_!MRyv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png 848w, https://substackcdn.com/image/fetch/$s_!MRyv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png 1272w, https://substackcdn.com/image/fetch/$s_!MRyv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MRyv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png" width="1456" height="956" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:956,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:801364,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aipolicyperspectives.com/i/160450973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MRyv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png 424w, https://substackcdn.com/image/fetch/$s_!MRyv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png 848w, https://substackcdn.com/image/fetch/$s_!MRyv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png 1272w, https://substackcdn.com/image/fetch/$s_!MRyv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57aa909-bf95-4cb4-933d-ac6633b7b0b3_3243x2129.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Venus Krier </figcaption></figure></div><p>In practice, this approach raises several challenges:</p><ul><li><p><strong>Stretching the concept of &#8216;safety&#8217;: </strong>Like the <a href="https://www.gov.uk/government/publications/international-ai-safety-report-2025">International AI Safety Report 2025</a>, we use an expansive &#8216;sociotechnical&#8217; definition of AI safety that considers potential harms in domains such as education, employment, and the natural environment. This allows us to incorporate a diverse range of evaluations, but it also risks causing confusion as many people would not intuitively classify some of these evaluations - such as those assessing AI&#8217;s impact on learning outcomes - as &#8216;safety evaluations&#8217;.</p></li><li><p><strong>Distinguishing &#8216;safety&#8217; evaluations from &#8216;capability&#8217; evaluations: </strong>We excluded ~250 evaluations from our sample, because the primary goal of the publications was to evaluate AI models&#8217;<em> capabilities</em>, rather than <em>safety</em> risks. As such, we relied on authors&#8217; intent and analysis to guide us about what constitutes a &#8216;capability evaluation&#8217; and what constitutes a &#8216;safety evaluation&#8217;. This approach feels right for some capability evaluations that we excluded, such as those that assess how accurate an AI model is at captioning images. But less so for capability evaluations that assess a model&#8217;s ability to <a href="https://arxiv.org/abs/2402.13463">follow users&#8217; instructions</a>, <a href="https://arxiv.org/abs/2401.15127">carry out cybersecurity threat analysis</a>, or <a href="https://arxiv.org/abs/2403.18098">answer legal queries equally well in different languages</a>. These capabilities are either dual-use, or the results of the evaluation may indicate potential risks, even if authors do not analyse these risks.</p></li><li><p><strong>Distinguishing between different types of risks:</strong> Our taxonomy has six risk categories, each with several subcategories. However, some 2024 safety evaluations didn&#8217;t fit neatly into any of these categories, such as those assessing the reliability of AI models in high-stakes domains like healthcare. This, in turn, raises <a href="https://www.hyperdimensional.co/p/the-eu-ai-act-is-coming-to-america">the recurring broader question</a> about what types of AI applications - in healthcare, education, hiring, insurance, and beyond - should be considered 'high-stakes&#8217;? And whether any evaluation assessing the reliability of AI models in those use cases should be classified as a &#8216;safety evaluation&#8217;? The <a href="https://airisk.mit.edu/">MIT Risk Repository</a> and others are doing important work to refine AI risk taxonomies, but adding more categories and subcategories increases the likelihood that a given AI safety evaluation will cut across multiple risks. For example: is an <a href="https://arxiv.org/abs/2404.14459">evaluation of the robustness</a> of LLM-created websites assessing a risk to privacy or misuse?</p></li><li><p><strong>Overlooking mitigations and AI benefits: </strong>Our methodology and risk taxonomy do not include evaluations of AI&#8217;s potential benefits in the same domains. This means that we <em>include</em> evaluations of risks posed by hallucinations or LLM-generated websites, but <em>exclude</em> evaluations of LLMs&#8217; ability to <a href="https://arxiv.org/abs/2403.09092">detect fake news</a> and <a href="https://arxiv.org/abs/2404.16859">rumours</a>, or <a href="https://arxiv.org/abs/2406.08467">verify software</a>. Even in areas where AI is primarily spoken about as a risk, like biosecurity and information quality, it has potential benefits - for example, in <a href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphafold-3-predicts-the-structure-and-interactions-of-all-lifes-molecules/Our-approach-to-biosecurity-for-AlphaFold-3-08052024">supporting infectious disease research</a> or assessing the <a href="https://www.ukri.org/opportunity/transforming-global-evidence-ai-driven-evidence-synthesis-for-policymaking/">evidence for scientific claims</a>. We need evaluations of these potential benefits from AI, to understand their likelihood, as well as the relative contribution, or <em>additionality</em>, that AI brings. There should always be room for evaluating specific risk scenarios, but any evaluation of AI&#8217;s <em>overall</em> impact in a given domain must consider both benefits and risks, while evaluations of mitigations must consider the trade-off that may arise between benefits and risks.  </p></li></ul><h1><strong>D. What next?</strong></h1><p>We believe there is value in treating the growing body of AI evaluations - including capability evaluations, impact evaluations (benefit/risk), and evaluations of mitigations - as a standalone field of work. We see <em>AI meta-evaluation </em>- the targeted study of this body of evaluations - as a way to improve the coverage, rigour, and impact of this work. In particular, we see two overlapping but distinct goals:</p><ol><li><p>Use AI meta-evaluation to build out evaluations as a robust field of practice.</p></li><li><p>Use AI meta-evaluation to understand and shape AI&#8217;s impact on society.</p></li></ol><p><strong>1. Build out AI evaluations as a robust field of practice</strong></p><p>Efforts are underway to <a href="https://arxiv.org/abs/2503.05336">put AI evaluations on a more rigorous scientific footing</a> and establish the field as a standalone area of practice. AI meta-evaluation can support this goal by shining a light on the state of the field, identifying areas in need of greater attention, and surfacing inconsistencies and best practices. To realise these benefits, the early efforts that we describe in this blog could be improved in several ways:</p><ul><li><p><strong>Expand the scope of evaluations covered: </strong>In light of the challenges discussed earlier, loosen the distinction between &#8216;safety&#8217; and &#8216;capability&#8217; evaluations to capture and analyse <em>all</em> relevant AI evaluations. </p></li><li><p><strong>More data sources: </strong>Move beyond arXiv to incorporate other journals, blogs, and GitHub repos. This expanded dataset still wouldn&#8217;t explain <em>why </em>certain evaluations are neglected or what the optimal distribution should be. A new recurring survey of AI evaluation practitioners could help answer these questions and also shed light on non-public evaluation methods that practitioners are developing.</p></li><li><p><strong>New analytical techniques: </strong>Leverage methods from the metascience<em> </em>community, such as citation analysis, to draw out richer insights from this body of evaluations, such as identifying evaluations that are particularly novel or impactful, and the causal factors behind this, and identifying types of expertise that are missing in the evaluation landscape. </p></li><li><p><strong>New interface:</strong> Equip users with a LLM or DeepResearch-style interface, so they can query this expanded evaluation data more effectively. For instance, rather than relying on a static risk taxonomy to categorise and present relevant evaluations, a practitioner could ask: &#8220;<em>Summarise the most salient points of evidence, as well as key points of uncertainty, from all evaluations from the past two years, that assessed how reasoning capabilities may affect AI safety risks.&#8221;</em></p></li></ul><p><strong>2. Understand and shape AI&#8217;s impact on society</strong></p><p>A primary goal of AI evaluations should be to help us better understand and shape AI&#8217;s effects on society. As AI use increases, we need to translate this ambition into more concrete, foundational questions, such as:</p><ul><li><p>How is AI affecting the prevalence and severity of cyber attacks or fraud?</p></li><li><p>How is AI affecting employment rates and job quality?</p></li><li><p>How is AI affecting people&#8217;s ability to access high-quality information and avoid low-quality or harmful misinformation?</p></li></ul><p>At present, no single AI evaluation, or AI meta-evaluation, can reliably answer these questions. Leaving aside the limitations of today&#8217;s evaluation approaches, these domains - cybersecurity, the economy, information access - are simply too complex and dynamic. Many external factors influence them, and data on AI diffusion is too patchy to reliably isolate the aggregate impact of AI. For example, assessing the harm caused by AI-enabled misinformation isn&#8217;t just about evaluating the average factuality of AI models outputs, or even tracking their effects of AI use on individual users. It also requires understanding broader societal trends, such as how much do people&#8217;s opinions actually shift based on what they view online? And how is that changing?</p><p>Answering these foundational questions is also hard because AI's impact will look different depending on the time period, region, and group in question. Outside of a small number of clear-cut safety hazards, reasonable people may also disagree about what AI&#8217;s desired impact <em>should be </em>in many of these domains. For example, evaluating the impact of AI on education outcomes requires us to (re)define what we want people to learn, and <a href="https://www.researchgate.net/publication/41529875_Good_Education_in_an_Age_of_Measurement_On_the_Need_to_Reconnect_with_the_Question_of_Purpose_in_Education">the different functions</a> that education should provide, from career preparation to boosting individual autonomy to strengthening civic institutions. These questions are always ripe for re-interpretation, not least in a time of great technological change.</p><p>Despite these challenges, the growing number of AI evaluations that the community is developing should be helping society get closer to meaningful answers to these foundational questions. To accelerate this, experts in information quality, education, the economy, and other fields could be tasked with annually reviewing the latest tranche of AI evaluations in their domains and contextualising what these results mean, or don&#8217;t mean, in terms of the likelihood of potential risks and benefits from AI manifesting, at what scale, and to whom, across different timelines. They could accompany this analysis with an annual &#8216;call for AI evaluations&#8217; to address the most critical gaps in knowledge and to incentivise more researchers, including from their own field, to pursue them.</p><p>Such expert-led reviews would be significant undertakings and it may be difficult to decide who is best suited to leading them. However, this approach would allow society to extract more value from ongoing AI evaluation efforts and ensure that new evaluations are guided by real-world needs. Out of respect for the complexity of these issues, these efforts shouldn&#8217;t be lumped under a single initiative or branded as &#8216;AI safety&#8217;. Instead, they should form part of a new distributed, multi-disciplinary effort to track and shape AI&#8217;s impact on society.</p><h1><strong>Acknowledgements</strong></h1><p><em>Thank you to Alex Browne and James Dao of Harmony Intelligence for research support. Thank you to the following individuals who shared helpful feedback and thoughts. R&#233;my Decoupes, Alp Sungu, Zhiying Zhu, Laura Weidinger, Ramona Comanescu, Juan Mateos-Garcia, John Mellor, Myriam Khan, Don Wallace, Kristian Lum, Madeleine Elish, David Wolinsky, and Harry Law.</em></p><p><em>All views, and any mistakes, belong solely to the authors.</em></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI Policy Perspectives. Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/p/what-we-learned-from-reading-100?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"> This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.aipolicyperspectives.com/p/what-we-learned-from-reading-100?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.aipolicyperspectives.com/p/what-we-learned-from-reading-100?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p></p>]]></content:encoded></item></channel></rss>