Lu Yu from Aofeisi | QbitAI Official Account
While carbon-based life forms are still opening a hundred browser tabs to write literature reviews, the AI next door has already started a fierce competition. (doge)
Completing 12 work-years of human effort in two days.
In the field of medical research, Systematic Reviews (SRs) serve as the gold standard for clinical decision-making, yet they typically take over 16 months and cost more than $100,000, and can prolong the use of ineffective or harmful treatments.
To address this, institutions including the University of Toronto and Harvard Medical School jointly developed otto-SR, an end-to-end AI workflow.
By combining GPT-4.1 and o3-mini for screening and data extraction, it completed a Cochrane systematic review update, conventionally about 12 work-years of human effort, in just two days.
It also surpassed human performance across multiple metrics. In benchmark tests, otto-SR achieved 96.7% sensitivity (humans: 81.7%), 93.9% specificity, and 93.1% data-extraction accuracy (humans: 79.7%). It even identified 54 key studies that human reviewers had overlooked.
So, all those nights we spent on PubMed, all the hair we lost, what was it all for...?
Wiping away tears, let's dive into the specifics of its implementation.
An Intelligent Workflow for Systematic Review Automation
The team introduced otto-SR, an LLM-based end-to-end workflow that supports fully automated and human-AI collaborative systematic review processes, from initial retrieval to data analysis.
otto-SR first collects citations in RIS format identified from the initial retrieval. GPT-4.1 then acts as an independent reviewer for screening.
The screened articles are then passed to the o3-mini-high model for data extraction; PDFs are first converted by Gemini 2.0 Flash into structured Markdown files used in downstream tasks.
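The pipeline above can be sketched as a small orchestration script. This is a minimal illustration, not code from the paper: the model calls are stubbed out with a keyword check, and every function name here (`parse_ris`, `screen`, `run_pipeline`) is hypothetical.

```python
# Minimal sketch of an otto-SR-style pipeline. The LLM screening call is
# stubbed; a real system would prompt GPT-4.1 with the review's criteria.

def parse_ris(text):
    """Parse a simple RIS export into a list of citation dicts."""
    records, current = [], {}
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("TI  - "):
            current["title"] = line[6:]
        elif line.startswith("AB  - "):
            current["abstract"] = line[6:]
        elif line == "ER  -":
            records.append(current)
            current = {}
    return records

def screen(citation, criteria):
    """Stand-in for a GPT-4.1 screening call: include the citation if any
    criterion keyword appears in its abstract."""
    text = (citation.get("abstract") or "").lower()
    return any(k.lower() in text for k in criteria)

def run_pipeline(ris_text, criteria):
    citations = parse_ris(ris_text)
    included = [c for c in citations if screen(c, criteria)]
    # A real pipeline would next convert PDFs to Markdown (Gemini 2.0 Flash)
    # and extract variables with o3-mini-high; we just return the included set.
    return included

ris = """
TI  - Trial of drug A in sepsis
AB  - Randomized controlled trial of drug A.
ER  -
TI  - Review of gardening
AB  - A narrative essay.
ER  -
"""
print([c["title"] for c in run_pipeline(ris, ["randomized"])])
# → ['Trial of drug A in sepsis']
```

The key design point the paper describes is the same as in this toy version: screening and extraction are separate stages, each driven by the original review's own criteria rather than a generic prompt.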
Specifically, the workflow breaks down into two functions: screening and extraction.
SR Literature Screening
The research team built a screening agent on GPT-4.1, a model that excels at instruction following, and paired it with optimized prompting strategies to screen literature at both the abstract and full-text stages.
Additionally, this Agent incorporates the initial objectives and eligibility criteria of each review into supplementary instructions.
The study evaluated otto-SR's screening performance across the complete original retrieval of five reviews (totaling 32,357 citations).
The reviews covered four types of questions from the Oxford Centre for Evidence-Based Medicine (CEBM) (prevalence, diagnostic test accuracy, prognosis, intervention benefits) and were benchmarked against the evaluation results of two human reviewers (the current standard workflow) and Elicit (a commercial LLM-based systematic review automation software).
In the abstract screening phase, otto-SR achieved the highest sensitivity of 96.6%, and its specificity of 93.9% was comparable to human review's 95.7%.
In the full-text screening phase, otto-SR also maintained the highest sensitivity at 96.2%, while human reviewers' sensitivity significantly dropped to 63.3%; both maintained high specificity.
Therefore, the study found that otto-SR can capture more relevant studies than traditional dual-human screening while maintaining sufficient specificity.
SR Data Extraction
The research team selected OpenAI's o3-mini-high model as the extraction agent for its strong scientific reasoning, robust long-context retrieval, and cost-effectiveness. All prompts used the variable definitions written by the original review authors.
The study compared otto-SR and Elicit's data extraction performance across 495 studies from seven reviews, with two human reviewers then evaluating a randomly sampled subset of literature for each review.
The results showed that otto-SR's average weighted accuracy reached 93.1%, significantly higher than the two human reviewers' 79.7% and Elicit's 74.8%.
Furthermore, for instances where otto-SR's extracted values differed from the original review authors', the team convened a blinded reviewer panel to adjudicate; the panel sided with otto-SR in 69.3% of cases.
In contrast, the blinded reviewer panel supported dual-human extractors in only 28.1% of cases and Elicit in 22.4% of cases.
This further underscores otto-SR's clear advantage in data extraction over the other methods.
Rapid Reproduction and Updating of Reviews
To evaluate otto-SR's practical applicability, the team performed a complete reproduction of SRs from the Cochrane database's April 2024 issue, which are typically used to inform clinical guidelines.
The team updated the retrieval to May 8, 2025 for the 12 available reviews, identifying 146,276 citations in total. After deduplication, these were submitted to otto-SR for screening against the original criteria.
When the results were filtered to the original retrieval cut-off dates, otto-SR identified 54 previously overlooked eligible studies (median 2 per review, IQR 1 to 6.25). Manual review found that otto-SR had incorrectly included 10 false-positive articles, nine of which nonetheless potentially contained relevant data.
Extending the cut-off to May 8, 2025 yielded 14 additional eligible studies (total n=64; median 2.5, IQR 1 to 7.25 per review), along with 2 more false positives, one of which contained relevant data.
This work doubled the number of eligible articles and compressed an estimated 12 work-years of researcher effort into under 48 hours.
A meta-analysis was conducted comparing the extracted data with the original reviews, involving three comparison groups:
1. Matched group: otto-SR analyzes the same set of articles included in the original Cochrane analysis.
2. Expanded group: Includes all eligible studies identified by otto-SR, filtered to the original retrieval cut-off date.
3. Updated group: Evaluates all articles, with the retrieval cut-off date updated to May 8, 2025.
Additionally, to account for potential extraction errors, a dual-human review derived corrected values for each group, removing false-positive articles and adding false-negative ones.
In the matched group, the meta-analysis effect estimates generated by otto-SR overlapped with the 95% CIs of the original Cochrane data and corrected datasets.
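"Overlapping 95% CIs" can be made concrete with a small fixed-effect meta-analysis sketch. This is a generic inverse-variance pooling illustration, not the paper's analysis code, and all the numbers below are invented.

```python
import math

def pool(effects, ses):
    """Inverse-variance fixed-effect pooled estimate and 95% CI.
    effects: per-study effect sizes; ses: their standard errors."""
    weights = [1 / se**2 for se in ses]
    est = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return est, (est - 1.96 * se, est + 1.96 * se)

def cis_overlap(ci_a, ci_b):
    """True if two confidence intervals share any common range."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Two hypothetical analyses of the same question (think: original study set
# vs. an expanded set with one extra study). The pooled estimates shift,
# but the 95% CIs overlap, so the conclusions are statistically compatible.
est1, ci1 = pool([-0.8, -1.2, -0.9], [0.4, 0.5, 0.3])
est2, ci2 = pool([-0.8, -1.2, -0.9, -1.5], [0.4, 0.5, 0.3, 0.6])
print(cis_overlap(ci1, ci2))  # → True
```

That is the sense in which the matched-group results validate otto-SR: re-extracting the same studies reproduced effect estimates statistically compatible with the published ones.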
In the expanded analysis, two reviews gained statistical significance and one lost it.
For instance, in a nutrition review, otto-SR identified 5 additional studies and surfaced an interesting finding: preoperative immune enhancement before gastric surgery might shorten the average hospital stay by one day.
The advent of otto-SR could greatly ease the slow, laborious process of producing systematic reviews. In the future, work that once took months or years may be compressed into hours or minutes, allowing quicker responses to new therapies or pandemics.
Furthermore, regions that lack the funding to conduct systematic reviews will also be able to benefit from cutting-edge medicine, as the authors wrote at the end of the article:
In short, the gold standard is no longer human.