Today’s essay is the first entry in a seven-part series focused on questions of responsibility, accountability, and control in a pre-AGI world. This piece, and those that follow, are written in a personal capacity by Seb Krier, who works on Policy Development & Strategy at Google DeepMind.
Over the past few years, different tribes and groups have skirmished online and offline to determine the extent to which existing and future AI systems pose risks, what to do about it, and when. Often, though, it seems like this exercise gradually shifts into an advocacy game, and specificity is lost in favor of memetic potential. It's obvious that the resulting exaggerations are attempts to shift the Overton window (an expression I now deeply loathe) - not to make sense of things. Yet real progress demands discernment, not blanket technophobia or techno-optimism. With this in mind, in the coming weeks I'll be publishing one essay every week or two, attempting to make sense of some of these questions. An important caveat: these are my current high-level thoughts, but they're certainly not set in stone. The nature of AI means my views evolve often, so I highly encourage comments and counterpoints.
Over the coming weeks, this series will include the following essays:
→ Defining, delineating, and setting thresholds for dangerous capabilities, and their implications: when is a capability truly worrying? (this essay)
Not all risks need to be stopped: how should we distinguish unacceptably dangerous capabilities and use cases from tolerable risks and harms?
How do we use models to bolster societal defenses?
Are there ways of reconciling competing visions of fairness in AI?
What does a world with many agentic models look like? How should we start thinking about cohabitation?
How do we deal with faster disruption and technological displacement in the labor market, both nationally and internationally?
How can we balance proliferation and democratization with appropriate oversight when developing and deploying powerful AI systems?
Huge thanks to Kory Matthewson, Nick Whittaker, Nick Swanson, Lewis Ho, Adam Hunt, Benjamin Hayum, Gustavs Zilgalvis, Harry Law, Nicklas Lundblad, Seliem El-Sayed, and Jacques Thibodeau for their very helpful comments! Views are my own etc.
Defining, delineating, and setting thresholds for dangerous capabilities, and their implications: when is a capability truly worrying?
Before I start, an important thing to bear in mind is that we can't easily predict which capabilities will emerge from a given training run, or from the wider post-training enhancements a model will benefit from. And no, emergent capabilities are not a 'mirage' - if you've read that paper, you should also check out Boaz Barak's post on it here. In practice, for tasks that require sequential reasoning, minor improvements in individual steps can lead to substantial gains in overall performance, and the translation of these incremental improvements into more complex, higher-order tasks often remains unpredictable ex ante. So as AI systems improve, they might suddenly demonstrate capabilities on complex tasks that we couldn't have anticipated based on their performance on simpler tasks. We still come up with better ways of using models and eliciting capabilities well after they have been released.
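To make that compounding point concrete, here's a minimal toy sketch (my own illustration, under the simplifying assumption that a task consists of k independent steps, each succeeding with probability p): overall success is then p^k, so modest per-step gains translate into outsized jumps on long-horizon tasks.

```python
# Toy model of capability "emergence" on multi-step tasks.
# Assumption (for illustration only): steps are independent and each
# succeeds with probability p, so a k-step task succeeds with p**k.
for p in (0.90, 0.95, 0.99):
    for k in (5, 20, 50):
        print(f"per-step accuracy {p:.2f}, {k:2d} steps -> task success {p**k:6.1%}")
```

Going from 95% to 99% per-step accuracy takes 50-step success from roughly 8% to roughly 60%: smooth progress on simple benchmarks can look like a sudden jump on harder, longer-horizon ones.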
By way of analogy, imagine an intricate clockwork mechanism: if a single gear is faulty, the entire system grinds to a halt. However, once that gear is repaired, the clockwork springs to life, its hands gliding smoothly and its chimes echoing with precision. Similarly, AI models require a certain threshold of performance across multiple components before those emergent, unexpected abilities can truly shine. There is a wide space of extremely useful and valuable capabilities we could unlock as a result, from better reasoning to memory, agency and long-term planning.
But, as we continue training increasingly capable models, it's very plausible that this also enables capabilities that pose extreme risks - and this gets further exacerbated if we imagine violent non-state actors and authoritarian states using them to suppress ideas and people. This means we should think about evaluating models carefully, so that we can better understand how they may be used and misused in real life (rather than relying on extrapolations alone). However, precisely defining and delineating what counts as a dangerous capability can be challenging. After all, many capabilities necessary for doing good things are also useful for doing bad things - coding or reasoning, for example!
To illustrate, let's unpack persuasion: if a model is capable of swaying you to visit a link or take a particular course of action, how 'bad' is that? Presumably you expect some persuasion from a therapist, an advisor or a sports trainer. If I want to eat healthier, maybe it's acceptable for a model to 'trick me' into eating more healthily.[1] Lobotomizing this capability could decrease a model's effectiveness overall. But equally, you may not want to be misled in other contexts: I don't want a model to subtly, and without my knowledge, try to sway my political beliefs or preferences over time. On the margins, it's also difficult to determine whether a statement is manipulative, given the interplay of emotions and reason: how should we understand calming someone with a reassuring but factually incorrect statement during a crisis, or choosing comfort over hard facts in sensitive situations? And when is the harm from manipulation itself justified by the harmful outcomes it prevents?
So far, a lot of advertising is fairly ineffective: the benchmarks that ad companies use (e.g. clicks, sales, downloads) are arguably somewhat misleading, since they don't distinguish between the selection effect and the advertising effect. Most advertising probably plays the much more benign role of providing information to the consumer. This is even more true of political advertising, which has repeatedly been shown to be ineffective and a waste of money (remember Cambridge Analytica?). But there may be more effective forms of persuasion that rely less on direct, obvious ads, against which recipients have learnt to keep their guard up. With more capable AI systems, I expect customized narratives to be a lot more persuasive, and manipulation to be amplified by the emotional trust people place in assistants they rely on. Similarly, misinformation is frequently overhyped given the limited demand for and consumption of it; but one could imagine more subtle and harder-to-measure effects over time, like sowing doubt, reducing trust in institutions and so on.
What does it mean to retain human agency and autonomy here, and what are the practical implications for model design? Ultimately, an important crux seems to be whether any such influence is exerted on a user who is aware of and consenting to it. Maybe in some contexts it could be useful for users to have more visibility over the meta-prompts used when designing and aligning AI systems. This would also reduce information asymmetries, allowing people to select models that best fit their risk tolerance. A consumer might be less likely to go for an AI therapist prompted to maintain user engagement for as long as possible, and in many cases users will want to ensure a model is prompted to value accuracy, unless they actively decide otherwise for whatever reason.
It's also worth noting that you don't always get more safety just by limiting access or capabilities; that might be proportionate in some instances, but not always. You can also build tools and establish norms to make people safer. It's conceivable, for example, that over time we will need personal AI assistants that act as 'epistemic antiviruses' and help us realize when we're being swayed, biased and insufficiently critical. We already have a weak form of this with spam filters and community notes on Twitter/X - arguably we can be a bit more ambitious here and strengthen cognitive security more widely. For these assistants to be effective, they'll have to be highly individualized - meaning they will know you, your flaws and your weaknesses very well. And of course this too comes with risks! Even if these systems are broadly aligned, we'll need the right security and privacy measures to ensure they don't get hacked or leak your 'cognitive DNA', which malicious actors could use to better influence and mislead you. But to me at least, the benefits seem to massively outweigh the risks here.
The point is: evaluating these models, and assessing how dangerous a particular capability is, is not always clear cut. The same is true for other risk domains: in cybersecurity, if an open model can find security vulnerabilities, is this alone sufficient to deem the model risky? Does it ultimately help bolster the 'wisdom of crowds' and unearth issues to be fixed more quickly, or does it give malicious actors the upper hand? What are the practical implications in terms of who can access a very powerful model, and how? The same questions apply to biorisk: is a model providing you with instructions for making anthrax really that 'dangerous' when this information is already easily accessible elsewhere? In other words, how low were the barriers to entry before this enhanced capability? A dangerous/not-dangerous binary alone is not a particularly helpful frame. Instead, a proper risk analysis should look beyond the capabilities of a model - indeed, as a paper considering extreme risks notes, “risks will depend on how an AI system interacts with a complex world.” Laura Weidinger et al.'s recent work also recommends centering other cruxes where downstream risks manifest: human interaction at the point of use, and systemic impact as an AI system is embedded in broader systems and widely deployed. Only then will we be equipped to undertake an actual cost-benefit analysis and marginal risk assessment.
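To make the marginal-risk framing concrete, here is a minimal toy sketch (my own illustration, not a method from this essay or the cited papers; all names and numbers are hypothetical). The quantity of interest is the uplift a model provides over the counterfactual baseline, weighed against the value of legitimate uses.

```python
from dataclasses import dataclass

@dataclass
class MarginalRiskAssessment:
    # All fields are rough, hypothetical estimates for illustration only.
    baseline_success: float    # chance a motivated actor succeeds without the model
    with_model_success: float  # chance they succeed with the model
    expected_benefit: float    # value of legitimate uses (arbitrary units)
    expected_harm: float       # cost if misuse succeeds (same units)

    def uplift(self) -> float:
        # Marginal risk: how much easier the model makes the harmful task.
        return self.with_model_success - self.baseline_success

    def net_value(self) -> float:
        # Crude cost-benefit: benefits minus harm weighted by the *marginal* uplift.
        return self.expected_benefit - self.uplift() * self.expected_harm

# Information that is already easy to find elsewhere implies low uplift,
# so the capability alone says little about real-world risk.
already_public = MarginalRiskAssessment(0.80, 0.85, expected_benefit=10, expected_harm=100)
print(f"uplift={already_public.uplift():.2f}, net value={already_public.net_value():.1f}")
```

The numbers are arbitrary; the point is simply that the counterfactual baseline and the deployment context, not the raw capability, are what move the assessment.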
Question for researchers: how can we ensure evolving definitions of "dangerous capabilities" stay ahead of AI development, safeguarding society without imposing unreasonable limitations?
[1] As noted by a friend of mine, people literally pay for hypnosis!