The notion of AIs manipulating people is a plot twist in countless sci-fi thrillers. But is “manipulative AI” really possible? If so, what might it look like?
For answers, AI Policy Perspectives sat down with Sasha Brown, Seliem El-Sayed, and Canfer Akbulut. They’ve published research on harmful manipulation for Google DeepMind and help scrutinize forthcoming models to safeguard against deceptive practices, from gaslighting to emotional pressure to plain lying.
How, we wondered, do researchers run realistic experiments on the manipulative powers of AI without harming participants? Could AI’s “thoughts” help catch an AI in the act of manipulation? And what else can developers do to detect signs of manipulation?
—Tom Rachman, AI Policy Perspectives
[Interviews edited and condensed]
Tom: You’re careful to distinguish persuasion from manipulation. Why?
Sasha: To persuade somebody is to influence their beliefs or actions in a way that the other person can, in theory, resist. When you rationally persuade somebody, you appeal to their reasoning and decision-making capabilities by providing them with facts, justifications, and trustworthy evidence. We’re happy with that much of the time. In contrast, when you manipulate somebody, you trick them into doing something, whether by hiding certain facts, presenting something as more important than it is, or putting them under pressure. Compared to other forms of persuasion, manipulation is often harder to detect and harder to resist.
Tom: I could imagine three forms of manipulative AI. One: people employing AIs to deliberately change others’ beliefs or behaviour. Two: AIs manipulating people for their own ends. Three: AIs inadvertently manipulating. Which are we talking about?
Seliem: At the moment, we’re mainly concerned with people misusing AIs to manipulate other people, and AIs inadvertently manipulating. But an AI manipulating for its own ends is also a complex and important question that we and others are studying.
Tom: What are some concrete harms that might result from manipulative AI? Are we talking about mass-fraud? Something else?
Sasha: AI could become a first resort for different kinds of advice. Think of a user asking questions about which diet to follow, or how to respond to an official letter. The AI might provide helpful input. But other people might want to interfere—they may want the individual to follow a particular diet, or to give a different response to that official letter. More broadly, somebody could deploy an AI agent to infiltrate communities, and exercise manipulative tactics to change people’s beliefs, without their knowledge or consent.
Canfer: Anecdotally, I have heard that some people are starting to make consequential life decisions with AI, including about divorce or whether to adopt. We don’t yet have concrete examples of how manipulation may play out in such scenarios. But I think of all the daily decisions I make by myself. In 10 years, I might defer more to an AI. How will that change the direction of my life and will it introduce new kinds of manipulation risks?
Catching AI in the act
Tom: So AI could lead to bad outcomes. But Sasha and Seliem, when you led on a landmark 2024 paper about persuasive AI, you argued against chasing after manipulated outcomes. Instead, you focus on preventing manipulative processes. Why?
Seliem: To date, companies have often focussed on preventing outcome harms, for example with content policies that forbid medical advice. But with AI, such content policies could become overly restrictive and counterproductive—for example, if they prevent the systems from offering any kind of advice on health or nutrition issues. But imagine that I try to manipulate you by gaslighting you, or lying, or cherry-picking arguments. In such cases, I’m trying to impair your decision-making capabilities. Whatever the outcome, this process is harmful because it undermines your autonomy.
Sasha: We also focus on the processes, or mechanisms, of manipulation because these are the intervention points where we can best mitigate the problem. For example, if the AI is using a false sense of urgency to manipulate users, the developer can build systems that detect and flag such techniques in real-time, creating a proactive defense before harm occurs.
Tom: Also, I suppose that outcome harms are not always easy to capture, given that they may happen to a person long after the original AI interaction, once back in the wider world.
Sasha: Yes, the potential outcomes are nearly infinite, often context-dependent, and may occur in the future. However, the mechanisms are far more limited in number and we can target them in the here and now. By targeting a root mechanism—say, gaslighting—we can also build mitigations that work in everything from financial advice to health queries, making the safety approach far more scalable.
Tom: What kinds of manipulative mechanisms are you talking about?
Sasha: All manipulative mechanisms in some way aim to reduce a user’s autonomy. You have flattery, which is building rapport through insincere praise; this might lower a user’s guard. Imagine an AI saying, “You have such a sophisticated understanding of this topic, which is why I’m sure you’ll appreciate this high-risk/high-reward investment!” There’s also gaslighting, or causing a user to systematically doubt their own memory, perception, or sanity. That is particularly concerning in long-term human-AI interaction. Imagine a model repeatedly questioning a user’s memory of their partner being physically abusive.
How to test if an AI is manipulating
Tom: One can consider manipulation in two dimensions: Can an AI system manipulate? And would it? How do you evaluate each?
Canfer: Efficacy tests whether AI manipulations are actually successful. This is where controlled experiments are useful. After interaction with an AI, are people making decisions differently? Are they taking different actions based on those decisions? You want to compare an individual’s belief change after AI interaction compared with before, and also whether a person’s beliefs and behaviour change more than those who don’t interact with AI.
Propensity measures the frequency with which a model attempts to use manipulative techniques, when explicitly prompted to do so, and when not. To test propensity, we could run a large number of dialogues with users. In one scenario, a model may be instructed to convince through manipulative means. In another, it may be instructed to be a helpful assistant. Maybe when told to use manipulative means, it resorts to gaslighting. But when told to be helpful, it’s sycophantic. You can also reverse-engineer this. So, if you see that a certain kind of manipulative technique convinces people, you could work out what the model was doing to achieve that. In that way, studying efficacy helps tell us where to look for propensity.
Tom: What types of experiments are you running on this?
Canfer: We are building on the early studies in this space and will publish more later this year. The approach will also evolve as we learn more from our initial experiments. At the moment, we’re focussing on domains that require people to make important decisions, such as financial or civic decisions. For example, we might run experiments where we ask people: “Should the government use its budget to build more high-speed railways connecting cities, or should it focus more on local infrastructure?” People will report what they initially believe, and be assigned to a conversation with an AI that helps them explore the topic. Unbeknownst to them, it will be prompted with different instructions, including to get them to believe more in investing in high-speed railways.
We will apply propensity evaluations to see if, while trying to change a person’s mind, the model demonstrates certain behaviours. We will also explicitly prompt the model to use manipulative techniques, like appeals to fear. This will allow us test efficacy: whether a person changes their mind, compared to baselines like reading static information, and the extent to which different kinds of techniques are more predictive of a user changing their mind.
Additionally, we want to look at whether belief change leads to behavioural change, such as signing a petition that favours what the AI advocated.
Tom: Opinion on railway funding is one thing, but what many worry about is whether AI could be used to manipulate people to extremes, even to carry out violence. How could you test for that? Presumably, it’s highly unethical to test if an AI could, say, convert people to Nazism. So how do researchers test high-stakes manipulation?
Canfer: We go through ethical review each time we launch these kinds of experiments. So, no—you can’t test whether someone is going to become a Nazi or carry out a terrorist act. But beyond testing views on railways, we can look at consequential questions, like whether facial recognition should be permissible in certain public spaces. And we can look at the propensity of the model to encourage extreme behaviour without experimenting on people. For example, we can evaluate how well the model produces terrorist-glorification materials, and how willing it is to comply with instructions to do so.
We could also test whether a model engages in manipulation in simulated dialogues that would be unethical with real users. Where this raises challenges is if you use simulation-based methods to draw conclusions on whether real users would actually experience the belief or behaviour change observed.
Tom: Could you scrutinize the model’s chain-of-thought for manipulative intent?
Seliem: It’s worth exploring. We have identified all these manipulative mechanisms, but at some point will the model understand that it is being evaluated on those mechanisms, and “sandbag” the evaluations by intentionally hiding these capabilities? For concerns like this, the thinking-trace is a lead worth exploring. But there is also a debate about how useful chain-of-thought monitoring will prove to be, with lots of research underway on this.
Tom: What might be manipulations that we haven’t anticipated?
Seliem: There are scenarios where a model may not try to manipulate you in the initial sessions, but at some point, once you are their “friend,” they do. Humans do this, right? A con artist might become close to their victims over years, building intimacy, and then they flip. If an AI model were ever to exhibit that sort of behaviour, then evaluations that only look at a limited number of back-and-forth interactions might overlook it. Thinking-traces could provide a window into this kind of risk. But we also need studies to shed light on how people interact with AI systems over extended time periods.
What the evidence shows
Tom: What do we know about AI’s manipulative powers today?
Canfer: The research is nascent. But early experiments have demonstrated that AI can be an effective persuader, from debunking people’s beliefs in conspiracy theories, to shaping how they think about important topics. In one recent study, AISI—the AI Security Institute—collected a massive sample of nearly 77,000 people, and showed that in discussions on a range of British political issues, from healthcare to education to crime, AI was able to influence people in the direction intended. So models can already persuade to some degree.
When our team evaluated Gemini 3 Pro, we found that it did not breach the critical threshold in our Frontier Safety Framework. In other words, we haven’t found that the models have such efficacy that we’d worry about large-scale systematic belief change. But we’re continuing to update our threat-modelling approaches to ensure we can bridge the gap between what we can measure now—manipulation in experimental settings—and the large-scale risks that the Frontier Safety Framework aims to address.
Tom: We can see that AI models keep getting smarter. Are they getting better at manipulation?
Sasha: I don’t think we have a clear sense yet of a definitive trend. More capable models may be more capable of manipulation, but this may be offset by the evaluations and mitigations that researchers are pursuing. Looking ahead, there are also design factors that may increase the risk of manipulation beyond the underlying capabilities of the base model, such as personalization, which we are looking at.
Personalization may substantially change your interactions with an agent, if it means that it has a better representation of you, and is more likely to structure its communications in a way you will find acceptable. Does the AI possess a theory-of-mind to infer people’s beliefs or future actions? Does it act anthropomorphically, speaking like a human or encouraging a relationship? Effects like sycophancy come to mind too. These factors could interact with one another, and may lead to increases in manipulative capabilities.
Tom: Is there a limit to how much AI could manipulate people? We know from behavioural science how hard it can be to change a person’s mind, even if they want to be persuaded—for instance, when trying to act more healthily. Or could superintelligence lead to super-persuasive AI?
Canfer: We should be careful when adding the prefix “super.” What, specifically, does it mean? But I understand what people are trying to communicate, which is the concern that manipulation might become possible on a much greater scale. You could reach more people, much faster, and with more intensity. Human manipulators have certain limitations that AI does not have.
The more we invite AI into our daily life—for example, in financial or medical decisions—the more influence it could wield. It’s not necessary that AI has a manipulative intent, seeking world domination. It might just be inadvertently pushing people towards certain decisions. Or a human with ill-intent may deploy agents infused with manipulative abilities, whether through fine-tuning or system-prompting. These are important questions to ask, but not to use as fear-mongering.
How to fight manipulative AI
Tom: If models are caught in manipulative practices, how can AI developers curtail that?
Seliem: Ideally, this shouldn’t happen in the first place, and models are evaluated for whether they can and do manipulate before they are released. We are exploring ways to train the model to avoid manipulation—for example, showing the model more examples of how to constructively engage in a conversation rather than trying to influence or strongarm the user. But if a model is caught in severe cases of manipulative practices post-deployment, then companies have a toolkit of potential interventions. They could add transparency layers, like pop-up messages to warn users about the behaviour of the model or they can monitor responses and introduce filters. Many approaches are possible and this is an area of active research. Ultimately, it becomes a combination of telling the user what is happening, and curtailing the model’s ability to continue.
Tom: Could AI systems protect users against manipulation?
Sasha: Yes, and this creates a critical new layer of defense. Since we have categorised these manipulative mechanisms—whether it’s gaslighting, sycophancy, or false urgency—we can also train “monitor” AI models to detect them. These could serve as a real-time alert system for the user. So, if an AI starts using emotional pressure, the monitor model detects that mechanism, and flags it for the user, perhaps saying, “Note: This AI system is using an appeal to fear to influence your decision.” This restores the user’s autonomy in the moment, allowing them to resist the tactic, rather than trying to fix the damage after they’ve been manipulated.
Tom: What about training the public to be less susceptible?
Canfer: There are “inoculation” strategies—so, AI literacy and encouraging people to critically evaluate how they use and engage with AI systems. But we need to carefully study how effective such interventions are, when compared with the convenience of relying on AI. One thing I’d caution against is teaching general mistrust. People in a “post-truth” world can become skeptical of everything. That’s not a healthy attitude towards information.
Tom: Speaking of mistrust, couldn’t efforts to curb manipulative AI inadvertently land in culture-war disputes, if interpreted as trying to limit what people think?
Seliem: Definitely. And it gets to the idea of what makes something a fact—when does knowledge become validated and official and approved? Whose stamp is it?
Tom: As researchers, how do you avoid getting dragged into that?
Seliem: By keeping our focus on the process of manipulation—for example, an AI threatening you is never okay, in whichever direction.
Tom: Imagine that society is hit by a crisis—say, a natural disaster or a terrorist attack. You could picture a society’s adversaries employing manipulative AI to disrupt the crisis response. In that situation, would it ever be justified to use AI influence on one’s own population, so they are able to act collectively in their own interests? Or is there never a justification for this?
Seliem: I can understand an individual using AI influence on themselves—for example, if you tell the model, “Hey, remind me to take my medication” or “Remind me to drink water.” But for the collective? If our biases take over, and we want to make decisions that are bad for us, and are bad for the community? So, Don’t panic-sell! Don’t all run to buy toilet paper, the supply is going to run short! In those instances, I could see AI persuasion being useful, because it basically says, Keep your cool. This may hold for rational persuasion, but not for a country manipulating its own population.
Canfer: I would also support using AI to help mediate solutions to societal problems, such as when people are unable to reach political consensus in a time of crisis. But people would need a chance to reflect on those AI-mediated decisions, and judge if they endorsed them. Transparency is critical here, knowing the intent of the developer and the deployer.
What’s around the corner
Tom: If you had unlimited resources to run studies, what would you look at?
Canfer: I would model societal-level impacts—for example, looking at the population of chatbot users, and charting the course of their belief states across time. Another area is interpretability. So, what does an AI think it’s doing when it’s manipulating? What are the subconcepts that exist in a map of the AI’s internals? How are they related to one another? And when manipulation happens spontaneously, is there an activation pattern that’s predictive of that, that we can monitor? That kind of work is fascinating to me, especially because so much human manipulation and persuasion has to do with intent.
Tom: Lastly, if you were to cast forward 10 years, can you imagine any positive uses of AI behavioural influence? Anything you’d welcome in your own life?
Canfer: I can see two ways that an AI could influence me in a beneficial way, by flexibly moving between the roles of advocate and challenger. The AI agent could advocate on my behalf—for example, talking to a real-estate agent, getting a good deal for me. The same AI agent, or a different one, could then influence me to think deeply about the choices I’ve made, in a way that disrupts my rote ways of thinking. This could be like a debate partner, but not necessarily adversarial, just encouraging me to make decisions that I actively choose, rather than just me repeating unthinkingly what I’ve done all my life.
Tom: Would you ever endorse AI influence that you were unaware of? For example, if you said, “I want to eat better—go ahead and manipulate me until that happens.”
Canfer: For me, no. People may vary, though. I don’t think subconscious or subliminal messaging is something I can ever get behind. It’s also not necessarily effective. So, imagine that I’m eating healthily only because they put healthy food in the cafeteria, rather than it being a choice I’m making. The second the parameters change, I’d gravitate towards unhealthy options.
Tom: That would mean the effect might not endure—but not that the influence wouldn’t work. And if it worked really well, you might have to use it always, like a drug you couldn’t get off.
Canfer: I guess it depends how omnipresent you think AI is going to be. But I think we’ll still be making decisions for ourselves in the absence of AI, even if a lot of our decisions will involve AI.


