Researchers use AI to speed reviews of existing evidence

Published: March 12, 2025
Researchers at the University of Toronto and University of Calgary have developed an innovative approach that uses artificial intelligence to streamline the screening process for systematic reviews, a research gold standard that involves analyzing large volumes of existing evidence.
The study, published recently in the journal Annals of Internal Medicine, involved developing ready-to-use prompt templates that enable researchers working in any field to use large language models (LLMs) such as ChatGPT to sift through thousands of published scientific articles to identify the ones that meet their criteria.
“Whenever clinicians are trying to decide which drug to administer or which treatment might be best, we rely on systematic reviews to inform our decision,” says Christian Cao, the study’s first author and a third-year medical student in U of T’s Temerty Faculty of Medicine.
To produce a high-quality review article, authors first compile all the previously published literature on a given topic. Cao notes that depending on the topic, reviewers filter through as many as hundreds of thousands of papers to determine which studies should be included – a process that is time-consuming and expensive.
“There are no truly effective automation efforts for systematic reviews. That’s where we thought we could make an impact, using these LLMs that have become exceptionally good at text classification,” says Cao, who worked with his mentors Rahul Arora and Niklas Bobrovitz – both of the University of Calgary.
To test the performance of their prompt templates, the researchers created a database of 10 published systematic reviews along with the complete set of citations and list of inclusion and exclusion criteria for each one. After multiple rounds of testing, the researchers developed two key prompting innovations that significantly improved their prompts’ accuracy in identifying the correct studies.
Their first innovation was based on a prompting technique that instructs LLMs to think step-by-step to break down a complex problem. Cao likens it to asking someone to think out loud or walk another person through their thought process. The researchers took this one step further, developing their own approach that provides more structured guidance: it asks the LLM to systematically analyze each inclusion criterion before making an overall assessment of whether a specific paper should be included.
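To make the idea concrete, here is a minimal sketch of what such criterion-by-criterion prompting could look like. It is not the study's published template; the function name, wording, and example criteria are illustrative assumptions only.

```python
# Illustrative sketch of criterion-by-criterion screening prompts.
# Not the study's actual template; all names and wording here are hypothetical.
def build_screening_prompt(abstract: str, criteria: list[str]) -> str:
    """Assemble a prompt that asks an LLM to assess each inclusion
    criterion in turn before giving a final include/exclude decision."""
    criteria_block = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    return (
        "You are screening studies for a systematic review.\n"
        "Inclusion criteria:\n"
        f"{criteria_block}\n\n"
        "For each criterion, state whether the abstract below meets it and "
        "explain your reasoning in one sentence. Then give a final decision: "
        "INCLUDE or EXCLUDE.\n\n"
        f"Abstract:\n{abstract}"
    )

# Hypothetical usage with made-up criteria
prompt = build_screening_prompt(
    abstract="We conducted a randomized trial of drug X in adults with ...",
    criteria=[
        "Study is a randomized controlled trial",
        "Participants are adults aged 18 or older",
        "Intervention involves drug X",
    ],
)
```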
The second innovation addressed the so-called “lost in the middle” phenomenon, where LLMs can overlook key information buried in the middle of lengthy documents provided as inputs. The researchers showed that they could overcome this challenge by placing their instructions at both the beginning and the end of the prompt. Much like how human memory is biased towards recent events, Cao explains, repeating the instructions at the end helps the LLM keep track of what it is being asked to do.
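The sketch below shows one way this instruction placement could be implemented. Again, it is an assumption-laden illustration rather than the authors' code; the function name and section markers are invented for clarity.

```python
# Illustrative sketch of bracketing a long document with the same instructions,
# one way to counter the "lost in the middle" effect described above.
# Not the study's actual implementation; the markers are hypothetical.
def build_fulltext_prompt(instructions: str, full_text: str) -> str:
    """Place identical instructions before and after a long article text
    so the task description appears at both ends of the prompt."""
    return (
        f"{instructions}\n\n"
        "--- ARTICLE TEXT START ---\n"
        f"{full_text}\n"
        "--- ARTICLE TEXT END ---\n\n"
        "Reminder of your task:\n"
        f"{instructions}"
    )
```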
“We used natural language statements because we really wanted the LLMs to mimic how humans would attack this problem,” he says.
With these strategies, the prompt templates scored close to 98 per cent sensitivity and 85 per cent specificity in selecting the right studies based on the abstracts alone. When asked to screen full-length articles, the prompt templates performed similarly well with 96.5 per cent sensitivity and 91 per cent specificity.
The researchers also compared different LLMs, including several versions of OpenAI’s GPT, Anthropic’s Claude, and Google’s Gemini Pro. They found that GPT-4 variants and Claude-3.5 delivered strong and similar performance.
In addition, the study highlights how LLMs can produce significant cost and time savings for authors. The researchers estimated that traditional screening by human reviewers can cost upwards of thousands of dollars in wages, whereas LLM-driven screening costs roughly a tenth of that amount. LLMs can also shorten the time required to screen articles from months to less than a day.
Cao hopes that these benefits, coupled with how easy their prompt templates are to customize and use, will encourage other researchers to integrate them into their workflows. To that end, the team has made all their work freely accessible online.
As a next step, Cao and his collaborators are working on a new LLM-driven application to facilitate data extraction, another time-consuming and laborious step in the systematic review process.
“We want to create an end-to-end solution for systematic reviews where clinical grade research answers to any medical question are just a search away.”