1. LLMs are making it easier for scientists to write papers, for better or worse
What happened: A team at Cornell and Berkeley investigated how scientists are using LLMs to help write papers, and what this means for the future volume, quality and fairness of research.
What’s interesting: The authors built a dataset of ~2.1 million preprints posted to arXiv, bioRxiv and SSRN between 2018 and 2024. To detect whether scientists had used AI to help write a paper, the team compared the distribution of words in each abstract against human- and LLM-written baselines; when an author’s paper crossed a threshold on this “AI detection” metric, the author was labelled an “AI adopter”. According to the study, LLM adopters subsequently enjoyed a major productivity boost compared with non-adopters with similar profiles, publishing 36-60% more frequently. The gains were particularly large for researchers with Asian names at Asian institutions.
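The paper’s exact detector isn’t reproduced here, but the general approach can be sketched: score each abstract by how much more likely its words are under an LLM-derived word distribution than under a human-written baseline, then apply a cutoff. A minimal sketch in Python, assuming simple unigram frequency tables and an illustrative threshold (both hypothetical), might look like this:

```python
# Minimal sketch of distribution-based AI-text detection (not the paper's exact method).
# Assumes unigram word frequencies estimated from known human-written and known
# LLM-written abstracts; the 0.05 threshold below is purely illustrative.
import math
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def avg_log_likelihood_ratio(abstract: str,
                             llm_freqs: dict[str, float],
                             human_freqs: dict[str, float],
                             eps: float = 1e-8) -> float:
    """Average per-token log P(word | LLM) minus log P(word | human).
    Positive values mean the abstract's vocabulary looks more LLM-like."""
    tokens = tokenize(abstract)
    if not tokens:
        return 0.0
    score = sum(math.log(llm_freqs.get(t, eps)) - math.log(human_freqs.get(t, eps))
                for t in tokens)
    return score / len(tokens)

def is_ai_adopter(abstracts: list[str],
                  llm_freqs: dict[str, float],
                  human_freqs: dict[str, float],
                  threshold: float = 0.05) -> bool:
    """Label an author an 'AI adopter' once any of their abstracts crosses the threshold."""
    return any(avg_log_likelihood_ratio(a, llm_freqs, human_freqs) > threshold
               for a in abstracts)
```

In practice, the baselines, vocabulary and threshold would all need careful calibration against text of known provenance.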
The team also assessed the complexity of the writing, using measures like Flesch Reading Ease, which evaluates sentence length and the number of syllables per word. They found that human-written papers with more complex language were more likely to be subsequently accepted by peer-reviewed journals or conferences—suggesting that, for humans, writing complexity is an (imperfect) signal of research effort and quality. For LLM-assisted papers, the relationship was inverted, with the authors concluding that the polished text of LLMs is helping to disguise lower-quality work. (They validated the findings against a separate dataset).
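For reference, the Flesch Reading Ease score combines average sentence length with average syllables per word, with higher scores indicating easier reading. A rough version, using a crude vowel-group heuristic in place of the pronunciation dictionaries that real readability tools rely on, looks like this:

```python
# Rough Flesch Reading Ease calculation: higher scores mean simpler text.
# The syllable counter is a crude vowel-group heuristic, so treat outputs as approximate.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```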
The authors also used the 2023 launch of Bing Chat, an LLM-based search engine, to conduct a natural experiment: they compared arXiv views and downloads referred by Bing Chat with those referred by Google Search. Bing Chat was more likely to refer scientists to newer and less-cited literature, as well as to books, possibly because LLMs are better able to parse long documents or a larger number of documents. (They also validated this finding with a separate dataset, although we don’t know how good the new sources surfaced by Bing Chat were.)
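The comparison at the heart of this natural experiment is easy to picture. A toy illustration, with entirely made-up visit records and hypothetical field names, would aggregate recency and citation counts per referrer along these lines:

```python
# Toy illustration of the referrer comparison (not the authors' actual analysis).
# The records and field names below are invented purely for illustration.
from statistics import mean

visits = [
    {"referrer": "bing_chat", "pub_year": 2023, "citations": 4},
    {"referrer": "bing_chat", "pub_year": 2022, "citations": 11},
    {"referrer": "google_search", "pub_year": 2017, "citations": 950},
    {"referrer": "google_search", "pub_year": 2020, "citations": 310},
]

for referrer in ("bing_chat", "google_search"):
    group = [v for v in visits if v["referrer"] == referrer]
    print(referrer,
          "| mean publication year:", round(mean(v["pub_year"] for v in group), 1),
          "| mean citations:", round(mean(v["citations"] for v in group), 1))
```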
As the authors note, their study has a number of limitations. Their AI detection method is imperfect, only looks at abstracts, and doesn’t capture authors who may have edited LLM-generated text. There are also various potential confounders: maybe less experienced researchers are more likely to use LLMs? That said, the findings highlight (at least) three major questions posed by the growing integration of AI into science:
First, AI is leading to a big increase in the supply of papers (and grant applications). This poses a challenge for preprint repositories, which don’t want to host slop. ArXiv, whose founder Paul Ginsparg is a co-author of this study, recently banned computer science review and position papers, citing a surge in low-quality AI papers. LLM-assisted papers also pose a challenge for peer reviewers, who are already under strain, and are typically prohibited from using AI, although many do so anyway. This seems unsustainable. As the authors of this study suggest, it is likely time to consider how to integrate AI into at least some aspects of the peer-review process.
Second, the findings illustrate how LLMs may both mitigate and exacerbate fairness issues in science. For some scientists, the complexity of their writing may be a reliable indicator of their thinking and effort. For others, particularly non-native English speakers, writing may be more of an obstacle that has previously penalised them. A hopeful outcome is that LLMs may ease that burden. But a more worrying outcome is that, if reviewers and readers can no longer rely on writing complexity as an (albeit unfair) signal of good work, they may fall back on (even more unfair) signals, such as the institution that a person works at. This challenge is not limited to science, and may also occur in other areas where writing serves this purpose, such as cover letters.
Finally, the finding that LLM-based search engines may increase the diversity of sources that researchers review is the opposite of what some suggested would happen: that AI models would continually cite the same high-profile studies, exacerbating the “Matthew effect”.
Collectively, the study serves as a reminder that for every concerning scenario about the integration of AI into science, there are plausible counter-scenarios. Will AI lessen scientific reliability because of hallucinations? Or will AI “review agents” and AI-supported evidence reviews reduce (the many) inaccuracies that are already in the evidence base? Will AI remove the intuitive and serendipitous ideas that humans come up with? Or will AI enable scientists to pursue more novel hypotheses? Ultimately, AI could well upend the standard processes and traditions of science but do so in a way that delivers fresh benefits. To know if and how that is occurring, we need more empirical evidence about how AI is changing science.
2. Lessons from two years of AI safety evaluations
What happened: In December, the UK AI Security Institute shared a set of trends observed since they started to evaluate frontier AI systems in November 2023.
What’s interesting:
The report features more than 60 authors, a testament to the deep expertise that AISI has built up. Their trends are based on their evaluations of more than 30 frontier AI systems, with methodologies ranging from asking those AI systems questions to adversarially red-teaming them.
Their headline finding is striking, if unsurprising: AI capabilities have rapidly improved across all the domains that AISI tests. In the cyber domain, AI models and agents can now successfully complete more than 40% of the 1-hour software tasks they are tested on, up from <5% in 2023. Last year, a model completed an “expert-level” cyber task for the first time. In biology and chemistry, AI has gone from significantly underperforming PhD-level human experts at troubleshooting experiments to significantly outperforming them, including on requests that involve images.
On the risk that AI models may “self-replicate” in a way that subverts human control, AISI’s evaluations suggest that AI agents have gotten better at simplified versions of some tasks that could be instrumental to self-replication, such as passing know-your-customer checks to access financial services, but less so at others, like retaining access to compute and deploying successor agents. AISI’s evaluations also suggest that models are capable of deliberately obstructing attempts to measure their true capabilities (“sandbagging”), but only when explicitly prompted to do so.
The report also sheds light on AI systems’ limitations. In the cyber domain, AISI notes that AI systems still struggle in open-ended environments where they must complete long sequences of actions autonomously. Similarly, for chemical and biological threats, biologists, chemists and potential threat actors need “tacit” knowledge and expertise, such as how to pipette. AISI’s evaluations to date have focussed more on explicit knowledge, although they plan to share more on wet lab tasks.
When it comes to mitigations, the report provides both reassurance and concern. On one hand, the safeguards that leading labs have introduced have made their models safer, in one instance increasing the amount of expert effort needed to jailbreak a model by 40x. On the other hand, AISI says that it was still able to find a vulnerability in every AI system it tested. Worryingly, AISI also found no notable correlation between how capable a model is and the strength of the safeguards it has in place.
AISI also sheds light on two other sources of AI risk: open source and scaffolding. They argue that the performance gap between open source and proprietary AI models has narrowed. This introduces risks as safeguards for open models (where they exist) can be removed, and jailbreaks are hard to patch. AISI also found that scaffolding can make AI agents more capable than the underlying base AI models, even if those gaps later narrow when the base models are updated. Some complex scaffolds are in proprietary products, such as coding agents, but others are in open-source efforts.
The report also touches on AISI’s evaluations of the broader societal impacts of AI, such as the degree to which people are using AI to access political information, or the risks of harmful manipulation. One striking statistic, picked up in media coverage of the report, was that one-third of UK respondents to a recent AISI survey had used AI for emotional support or social interaction in the preceding year, although just 4% do so daily. In a separate effort, AISI found that some dedicated AI companion users reported signs of “withdrawal” during outages.
Overall, AISI argues that AI labs are taking an uneven approach to safety, focussing more on safeguards for biosecurity risks, for example, than for other threats. This is arguably true of AISI as well, given their strong focus on biological and chemical risks rather than radiological or nuclear ones. This raises a question: given finite resources, which evaluations of frontier AI systems are most lacking in the current landscape?
3. One in four UK doctors are using AI in their clinical practice
What happened: The Nuffield Trust and the Royal College of General Practitioners surveyed more than 2,000 UK GPs to understand how they view and use AI, in what the authors called the largest and most up-to-date survey on the topic.
What’s interesting:
28% of UK GPs now use AI. This is up from ~10% in 2018, but below the rates seen in some other UK professions. According to the survey, the GPs most likely to use AI are younger, male, and work in more affluent areas. This is similar to disparities in the wider public’s use of LLMs, although there, the early gender gap may have narrowed.
Just over half of AI-using GPs procure AI tools themselves rather than relying on those that their practices select. This kind of “shadow AI use” is not unique to GPs, but a Nuffield focus group sheds light on why UK GPs feel compelled to do it: some GP practices or Integrated Care Boards ban AI tools, while others are slow to respond to GPs’ requests and instead prefer to stick with legacy digital tools.
UK GPs mainly use AI for clinical documentation and note-taking. Some say that AI note-taking allows them to look at their patients and speak with them more, a non-trivial benefit given that the UK public worries about AI making healthcare staff more distant.
GPs also use LLMs to produce documents, from translations of patient communications to referral letters; and to stay abreast of new research, with some younger practitioners turning to LLM “study modes” to help with their mandatory professional development.
GPs cite “saving time” as the primary benefit of AI, and mainly use the time saved to reduce overtime, rest, and engage in professional development, rather than to see more patients. This is notable because the UK government wants AI to reduce the wait for a GP appointment, which is a top concern for the public. These findings suggest that more nuanced evaluations of AI’s impact on GP services will be needed.
GPs worry about errors and liability issues with AI. As a result, the authors call on tech suppliers to conduct better evaluations of hallucinations. Ideally, such evaluations would compare the accuracy of AI, human and hybrid outputs in real-world settings, with all the nuances that might entail. For example, when explaining the benefits of AI note-taking, some GPs pointed out that certain colleagues can’t touch type and so, without AI, struggle to capture all the details of a patient consultation (this is, presumably, a form of inaccuracy).
Use of AI for more complex “clinical support” tasks remains relatively low, owing to GPs’ concerns about errors, their desire to retain control over clinical judgement, and a lack of regulatory approval. However, some GPs did report using AI, or wanting to use future systems, to help check diagnoses, formulate care plans, and analyse lab results.
This suggests that more GPs may start to use AI to enhance their own clinical judgement, spurred by a growing body of evidence that LLM-based systems may be useful in this area, and by the public’s own growing use of LLMs for answering medical questions.
In their recommendations, the Nuffield authors call for clearer guidelines and regulatory frameworks for GPs, including as part of the UK’s new National Commission into the Regulation of AI in Healthcare. However, the report also acknowledges that much guidance already exists, such as the British Medical Association’s AI principles and the NHS guidance on AI note-taking (which some GPs appear to be breaking by procuring their own tools). This raises several questions: what exactly should any new guidance stipulate? How should the burden on GPs be calibrated? And how can we ensure that GPs actually follow it?

Policy development for AI presents unique challenges because the technology evolves faster than regulatory frameworks can adapt. The tension between enabling innovation and managing risk requires policy approaches that are both flexible and robust. Healthcare AI policy highlights this tension especially clearly: patients need safeguards, but regulatory paralysis would prevent beneficial applications. The key policy question isn’t whether to regulate AI, but how to design governance structures that can evolve alongside the technology while maintaining public trust and protecting fundamental values.