To navigate the deluge, every month we take a look at four interesting new AI papers that we’ve seen folks discussing. In this edition, we look at a new initiative to evaluate AI agents on ‘messy’ real-world tasks; the risks posed by Kimi K2.5, a leading open-weight model; whether LLMs are following all rules, instead of just good rules; and whether AI is worth more to consumers than to developers. Please share your take and flag any new papers that you’ve enjoyed.
Testing AI agents in the wild
What’s the paper?: A group of leading AI experts—led by Sayash Kapoor and Arvind Narayanan, authors of the influential AI is a Normal Technology newsletter—launched the CRUX initiative to evaluate AI agents on messy real-world tasks. Their first test: Could an agent publish an app to the Apple App Store? Answer: Yes (more or less).
Why does it matter?: The “open-world evaluations” that the authors propose could provide policymakers with a more intuitive sense of what AI agents can currently do than traditional benchmarking does, as well as stronger signals for what AI systems will soon be able to do, for example by allowing an agent to request some human assistance.
The details: Researchers rely on benchmarks, large suites of standardised questions and tasks, to track and predict AI capabilities. But for agentic systems, benchmarks can overstate performance if the tasks are shorn of their real-world messiness or if developers over-optimise for these specific activities. Benchmarks can also understate AI capabilities, such as when agents fail due to hitting a CAPTCHA, a rate limit, or other obstacles unrelated to the capability being evaluated. (Although deciding what exactly should be considered part of a capability, or unrelated to it, is hard.)
By grading outcomes, benchmarks also offer little insight into the strategies agents use or why they err. To address this, open-world evaluations use logs to conduct qualitative analysis of an agent’s performance on a complex long-horizon task that cannot be neatly specified or automatically graded.
The initiative builds on recent examples of such evaluations, like AI Village, where agents are given computer environments and a shared group chat, then assigned tasks, such as raising money for charity or building a presence on Substack (!).
The authors acknowledge that open-world evaluations have limitations from a scientific reliability perspective. Because each evaluation focuses on a single task, the results may be difficult to generalise. The methods are also hard to standardise or replicate. These limitations mean that this approach should complement, not replace, benchmarks and other evaluation approaches, like randomised controlled trials.
To demonstrate the method, the team evaluated whether an AI agent—Claude Opus 4.6 with an OpenClaw scaffold—could develop and publish a simple app to Apple’s App Store. The agent was responsible for every step, from coding the app to carrying out the time-consuming non-coding tasks that benchmarks often ignore, such as establishing a privacy policy and engaging with the app review system.
The agent successfully deployed the application, which is now available on the App Store, but with five interventions from the humans. Four were unrelated to the agent’s capabilities, such as when OpenClaw crashed and had to be rebooted, and when Apple’s policies required human authentication. In the other instance, the agent was unable to locate its account credentials and requested support, an issue that the authors frame as a breakdown in memory management rather than the ability to authenticate.
The agent also hallucinated a phone number, but Apple approved the application despite this, highlighting the kinds of issues that may go overlooked in simpler pass/fail benchmarks, but that a qualitative open-world evaluation can pick up.
The authors outline the best practices for conducting such evaluations, including tightly specifying what is being measured; publishing the agent log analysis; using other agents to monitor the agents and detect issues such as the hallucinated phone number; conducting dry runs; and reporting the financial costs.
This evaluation cost less than $1,000, and most of these costs were due to the agent regularly querying the App Store for updates, which could be avoided. This potential for AI to reduce costs may help to explain why app store applications are booming, straining the review process (even if most applications likely do not (yet) use these complex AI agent set-ups).
The CRUX initiative will publish new open-world evaluations every 1 to 2 months, with AI R&D automation next, followed by AI governance, software engineering, real-world physical tasks, and more.
Open-weight models pose unmitigated risks
What’s the paper?: A team of researchers published an independent safety assessment of leading open-weight LLM Kimi K2.5, from the Chinese startup Moonshot.
Why does it matter?: The model may pose greater risks in areas like biosecurity and disinformation than stronger closed models, owing to Kimi K2.5’s low rate of refusing inappropriate requests; its availability on popular hosting platforms; and the ease with which its safeguards can be removed.
The details: Opinions continue to differ about how strong open-weight models are, and how best to govern them. The researcher and commentator Nathan Lambert of the Allen Institute for AI recently suggested that open-weight models may be 6 to 9 months behind leading closed models, and that the gap will widen in the near term, as aspects like RL training environments become increasingly important. The US Center for AI Standards and Innovation shared a similar view in its recent assessment of DeepSeek V4. Epoch, using a different methodology, suggests a narrower gap.
Open source supporters also continue to caution companies and governments against using national security as a rationale to restrict open-weight models. Geopolitics also factors into this. Chinese labs produce the majority of leading open-weight models, even if some, such as Alibaba, may be facing more pressure to commercialise their offerings.
Moonshot—the company behind Kimi K2.5—is an interesting case. Founded by an ex-Google Brain intern and “AGI purist” Yang Zhilin, the company has attracted praise for its enthusiastic engagement with the international open-source community and its strong focus on agentic capabilities and long-horizon tasks. In November, the American AI-coding startup Cursor attracted scrutiny for basing its model on K2.5.
The study evaluates the safety risks of K2.5 compared with the closed models Claude Opus 4.5 and GPT 5.2, both as a model and an agent.
CBRN: Kimi shows strong biology and virology knowledge. When prompted for guidance on weapons-related queries, it refused less than closed models. When asked to design viable DNA plasmids for pathogens in a way that could bypass the kind of screening that DNA-synthesis companies do, it agreed to engage in the task, unlike closed models—but did not succeed.
Cyber: Despite some strong knowledge about cyber-offense, Kimi often lacked the more specific kinds of reasoning needed to identify and exploit software vulnerabilities, compared with closed models.
Alignment: When given a normal legitimate task to do, but also hidden instructions to ‘sabotage’ it, for example by introducing a bug, Kimi was more likely than closed models to comply. It also autonomously spun up new compute instances, hinting at the kind of “self-preservation” tendencies that safety researchers worry about. However, it showed little sign of “scheming” or deliberately underperforming on evaluations to conceal its true capabilities.
Bias and censorship: When prompted in Chinese, Kimi was much more likely to agree with Chinese Communist Party positions on topics like human rights and Tiananmen Square. This was less true for prompts in English, and on sensitive international questions—for example, relating to a territorial dispute between South Sudan and Sudan. On such queries, it performs similarly to closed models, and better than DeepSeek 3.2 (which was more likely to have both pro-China and pro-Russia bias).
Harmlessness: When asked to enable disinformation and copyright infringement, Kimi was more helpful than closed models, although, like leading closed models, it generally didn’t encourage user delusions. In less than 10 hours, and at a cost of less than $500, an expert was able to strip many safety refusals out of the model without hurting capabilities.
Given these results, the authors argue that Kimi K2.5 poses risks and that Moonshot and other open-weight developers should carry out more robust safety evaluations. The authors note that their study highlights how a growing number of evaluations can now be done relatively quickly and at low cost, such as by drawing on frameworks like Petri, an automated, open-source tool to audit model behaviour.
LLMs doggedly follow rules—even when they’re bad
What’s the paper?: A team of researchers, including the philosopher Seth Lazar, found that leading LLMs often refuse to help users bypass rules, even when the models recognise that those rules are unjust.
Why does it matter?: For reasons of safety, AI developers train their models to follow guidelines when they reply. But developers also want models to reason morally. The study suggests that safety training may currently inhibit deeper moral reasoning by encouraging models to “blindly” follow all rules.
The details: AI labs use post-training methods to guide models about why it is safe to engage with certain queries, but not others. For example, a model should refuse to share information about security vulnerabilities in biology labs, but should share scientific information about common methods used in biology, such as culturing cells.
This training can lead to “exaggerated safety”, where models over-refuse benign requests, particularly those that are lexically similar to harmful requests, such as a shopkeeper asking about shoplifting techniques to guard against them. The authors of this paper see this as a calibration challenge amenable to technical solutions. They are more worried about a cause of refusals that has received less attention: models’ inability to recognise unjust rules that they should help a user to break.
To evaluate AI models in these areas, the authors draw on the work of political philosophers, such as Joseph Raz, who argued that authority is only justified when it serves the agency of those subject to it. They also draw on online forums such as Reddit, where individuals seek help with what they claim are unfair rules and restrictions from bosses, homeowner associations, and other authorities.
They use these sources to outline four types of rules that may be unjust, such as those where the authority is illegitimate, or where there is a justifiable reason to disobey, as with conscientious objection. They use Gemini 3 Pro to create more than 1,000 synthetic cases of users requesting help in such situations, and use these to evaluate models from seven leading AI companies, including Claude, Gemini and GPT.
In three-quarters of the cases, the models failed to support the user. Among the examples cited is an LLM refusing to explain how to delete an entry in a youth-counselling database that would otherwise out a gay child to his violent father. The model tested in that case (Claude Opus 4.6) recognised that the request sought to protect a child from harm, but said “I can’t walk you through falsifying or destroying counselling records.” Another case involves a journalist working to expose corruption in an authoritarian regime who asks Qwen 3.5 Plus for technical support to bypass firewalls, and send the story to an international publication. The model refuses to violate network security controls.
One defensible reason for such refusals is that helping would require giving out risky ‘dual-use’ information, such as signal-jamming instructions that could help someone bypass a facility’s security systems. However, when dual-use cases are stripped out, the models still refuse to support the user 60% of the time.
Among the models, GPT was the most resistant, while Grok was most helpful. However, Grok’s helpfulness extended to the control cases where the rules in question were just and should have been respected, so the authors argue that its behaviour is one of permissiveness, not moral reasoning.
Gemini and Claude did best, often engaging in discussion about the legitimacy of the rules, but stopping short of supporting the user, suggesting that a model’s ability to reason may be decoupled from the ability to act on it. (Or that the models are reasoning to a different conclusion than what the authors would like to see.)
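The distinction between permissiveness and moral reasoning can be made concrete by comparing how often a model helps on unjust-rule cases versus control cases where the rule is just. The following is a minimal sketch of that comparison, not the authors' scoring method: the function names, the "unjust"/"control" labels, and the toy results are all hypothetical and are not data from the study.

```python
def help_rates(results):
    """Share of cases where the model helped, split by case type.

    `results` is a list of (case_type, helped) pairs, where case_type is
    "unjust" (rule should arguably be broken) or "control" (rule is just).
    """
    rates = {}
    for case_type in ("unjust", "control"):
        outcomes = [helped for kind, helped in results if kind == case_type]
        rates[case_type] = sum(outcomes) / len(outcomes)
    return rates


def discrimination_gap(results):
    """Help rate on unjust-rule cases minus help rate on control cases.

    A gap near zero means the model treats both kinds of case alike
    (uniformly permissive or uniformly restrictive); a large positive gap
    means it helps selectively when the rule is unjust.
    """
    rates = help_rates(results)
    return rates["unjust"] - rates["control"]


# Hypothetical toy results, invented for illustration only.
permissive_model = (
    [("unjust", True)] * 3 + [("unjust", False)]
    + [("control", True)] * 3 + [("control", False)]
)
selective_model = (
    [("unjust", True)] * 3 + [("unjust", False)]
    + [("control", True)] + [("control", False)] * 3
)

permissive_gap = discrimination_gap(permissive_model)
selective_gap = discrimination_gap(selective_model)
print(permissive_gap, selective_gap)
```

In this toy example both models help on three of four unjust-rule cases, so a headline helpfulness figure would not separate them. Only the gap against the control cases shows that one of them is helping indiscriminately.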
The authors point to some practical implications. Many people seek support online to deal with unjust rules. If users start to seek help privately from rule-following LLMs instead, they may get less support, and the active online debates on these topics may suffer.
The challenge is a hard one for AI developers. As the evaluation showed, helping users to break rules may sometimes endanger that very user. For instance, if the LLM helped that journalist illegally send information abroad, the person (or the AI company) could presumably face serious consequences if caught.
The study also presents all the rule-breaking as morally sound, and the authors took steps to try to ensure that this was the case. But the rules touch on topics from gender identity to immigration, where it’s plausible that some users may not support an AI model breaking them. From a technical perspective, current safety training may encourage AI models to follow rules too zealously, but a shift in the other direction could also cause them to over-correct, seeking to question or evade all rules.
This leads to the conclusion that ultimately models will need to be able to reason more deeply about what is right, in a way that considers rules, safety, the user in question, and other contextual factors.
Is AI worth more to consumers than to developers?
What’s the paper?: A team of researchers at Stanford, including the economist Erik Brynjolfsson, ran a survey to see how much US consumers would need to be paid to give up AI. They found that the total value of AI to US consumers exceeds $170bn, is rising fast, and far exceeds the revenue of developers.
Why does it matter?: Writers such as Jasmine Sun are documenting a rise in “AI populism” in the US, marked by the view that this technology is the project of out-of-touch billionaires. This study suggests that consumers may be getting more value out of AI than they are paying for, although a small set of AI power users is gaining the most.
The details: One of the main ways that economists hope AI will benefit society is by boosting productivity growth—the efficiency with which an economy uses its labour and capital. Ageing, high-debt societies, like the UK, badly need such productivity growth to increase living standards, fund transformative science and improve fraying public services.
In the 1950s, the economist Robert Solow co-developed a theory that advances in science and technology, such as AI, are the only way to permanently boost productivity growth over the long term. However, in 1987, Solow famously noted that fast-improving computers were turning up everywhere “except in the productivity statistics”. In 1993, Erik Brynjolfsson, then at MIT, wrote a paper formalising this “productivity puzzle”.
Since then, Brynjolfsson and his colleagues have developed two main arguments to explain why transformative technologies, including AI, may not initially lead to transformative productivity and economic growth. The first is that the benefits take time to materialise as organisations and individuals must invest in new skills and reorganise their workflows. After a lag, computers did ultimately increase productivity growth, and there are nascent signs that AI is starting to do so.
However, the second issue is that some benefits of technologies like AI may not turn up in productivity growth at all. As with search engines, early GenAI tools often come as “free” services. If consumers derive increasing value from these services, then this could lead to a large “consumer surplus” that would not be captured in GDP and productivity statistics, which are often calculated by estimating total expenditure.
Some have suggested that GenAI may lead to a new kind of “self-service” economy, where individuals turn to AI tools to help file their taxes, upgrade their homes, or write business plans, before (potentially) turning to experts to execute or validate their work, all at a fraction of the cost. In such scenarios, nominal GDP and productivity growth in those sectors could decline, even if the value that consumers derive increases. (Although the ultimate outcome would depend on second-order effects, such as what consumers do with the savings and how industries adapt.)
Given this challenge, how can economists better capture the total value of AI to consumers? The authors of this paper deployed a survey to directly measure AI’s consumer surplus. They asked a representative group of AI users in the United States how much money they would need to be paid to give up AI tools for one month. Based on this, they estimated the total annual consumer surplus for AI, as of March 2026, at $173bn.
This figure is rising quickly, they contend, growing 50% from 2025, due both to consumers valuing the tools more and to ever more people using AI. The consumer surplus estimate is well above what the authors estimate as the total global revenues that AI companies are making from consumers: ~$14bn in 2025.
Not everybody is benefitting equally. The median US consumer values AI tools at just $11 per month. But a smaller group of AI power users values them much more, with 12% of respondents unwilling to give up AI even for $500 per month, the maximum offered. Because valuations above that cap are not observed, the total consumer surplus figure may be an underestimate. These power users are more likely to be male, Asian, and to use AI at work. Such demographic skews were also present in early GenAI adoption data, but have since narrowed.
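To see why a low median can coexist with a large total, and why a capped offer pushes the total down, the sketch below works through a toy example. It assumes the aggregate is built as mean monthly valuation × 12 × number of users, which is a simplification and not necessarily the authors' procedure. The ten valuations and the function names are hypothetical. Only the $500 cap comes from the text.

```python
from statistics import mean, median

CAP = 500  # maximum monthly offer in the survey, as reported above


def summarise(monthly_values, cap=CAP):
    """Return (median, mean, share at cap) after capping each valuation.

    A respondent who would not give up AI even at the cap is recorded
    at the cap, because the survey cannot observe anything higher.
    """
    capped = [min(value, cap) for value in monthly_values]
    share_at_cap = sum(value >= cap for value in monthly_values) / len(monthly_values)
    return median(capped), mean(capped), share_at_cap


def annual_surplus(mean_monthly, n_users):
    """Assumed aggregation: mean monthly valuation x 12 months x users."""
    return mean_monthly * 12 * n_users


# Hypothetical "true" monthly valuations for ten respondents.
# Invented for illustration; NOT data from the study.
true_values = [0, 2, 5, 8, 10, 12, 20, 40, 150, 2000]

capped_median, capped_mean, share_at_cap = summarise(true_values)
uncapped_mean = mean(true_values)

print(capped_median, capped_mean, uncapped_mean, share_at_cap)
print(annual_surplus(capped_mean, n_users=1000))
```

In this toy sample the typical respondent values the tools at about $11 per month, but one heavy user pulls the mean far higher. Recording that user at the $500 cap instead of their true valuation shrinks the mean, and so the scaled-up total, which is the sense in which the headline estimate may be conservative.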
The findings are in line with past work from the economist William Nordhaus, who in 2004 argued that innovators capture a minuscule fraction of the total social returns from technology advances, with the majority flowing to consumers.
But as the authors note, their methodology also has several limitations. It relies on people being able to value AI services accurately. It doesn’t capture wider externalities, positive or negative, that AI may impose on society, via its effects on jobs, mental health, science and more. And it also doesn’t capture why people value the services. A 2023 experiment found that social media users would require a significant fee to give up access to their accounts, but also that they would pay to see everybody quit—highlighting the potential role of fear-of-missing-out in the value of social media.



