In a recent study published in the journal The Lancet Digital Health, scientists in the United States evaluated the efficacy and challenges of artificial intelligence (AI) in clinical practice by analyzing randomized controlled trials, emphasizing the need for more diverse and comprehensive research approaches.
Review: Randomised controlled trials evaluating artificial intelligence in clinical practice: a scoping review.
Background
AI’s role in healthcare has expanded significantly over the last five years, showing the potential to match or exceed clinician performance in various specialties. However, most AI models have been tested retrospectively rather than in real-world settings. Of the nearly 300 AI-enabled medical devices approved by the United States (US) Food and Drug Administration (FDA), only a few have been evaluated in prospective randomized controlled trials (RCTs). This gap in real-world testing raises concerns about AI’s reliability and effectiveness, including alert fatigue caused by faulty predictions, as demonstrated by a sepsis model. Further research is needed to validate AI’s real-world efficacy, address biases, and ensure its safe, equitable, and effective integration into clinical practice.
About the study
A systematic search covering January 1, 2018, to November 14, 2023, was conducted across databases including Scopus, PubMed, CENTRAL, and the International Clinical Trials Registry Platform, a window chosen to capture the rise of modern AI in clinical trials. Search terms included “artificial intelligence,” “clinician,” and “clinical trial,” and further studies were identified through a manual review of the references of relevant publications.
The inclusion criteria were limited to RCTs with a significant AI component, defined as a non-linear computational model such as a decision tree or neural network, that was integrated into clinical practice and influenced patient management. Studies using linear models, secondary studies, abstracts, and non-integrated interventions were excluded. The methodology followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension for scoping reviews, and the review was registered with the International Prospective Register of Systematic Reviews (PROSPERO).
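To make the model-class distinction behind the inclusion criteria concrete, the sketch below contrasts a linear model (the kind that would have been excluded) with the non-linear models the review counts as significant AI components. It is purely illustrative: scikit-learn and the synthetic dataset are assumptions, not part of the review’s methodology.

```python
# Illustrative only: the linear vs. non-linear model-class distinction described
# above, using scikit-learn (an assumption, not the review's tooling) and synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression      # linear model: excluded
from sklearn.tree import DecisionTreeClassifier          # non-linear: would qualify
from sklearn.neural_network import MLPClassifier         # non-linear: would qualify

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic regression (linear, excluded)": LogisticRegression(max_iter=1000),
    "decision tree (non-linear, included)": DecisionTreeClassifier(max_depth=4, random_state=0),
    "neural network (non-linear, included)": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: training accuracy = {model.score(X, y):.2f}")
```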
Publications were initially screened on titles and abstracts using Covidence review software, with two independent investigators performing the screening followed by full-text review. Data extraction was completed in Google Sheets by one investigator and verified by another, with any disagreements resolved by a third. Information was collected on study location, participant characteristics, clinical tasks, primary endpoints, time efficiency, comparators, results, and AI type and origin. Studies were categorized by primary endpoint group, clinical area or specialty, and AI data modality.
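As a rough illustration of how the extracted fields might be organised, the following record structure mirrors the items listed above; the field names and example values are assumptions for illustration, not the review’s actual extraction sheet.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical extraction record; field names are illustrative assumptions,
# not the review's actual Google Sheets columns.
@dataclass
class TrialRecord:
    study_id: str
    location: str                   # country or countries where the trial ran
    n_participants: int
    clinical_task: str              # e.g. "polyp detection during colonoscopy"
    specialty: str                  # e.g. "gastroenterology"
    primary_endpoint_group: str     # e.g. "diagnostic performance"
    comparator: str                 # e.g. "unassisted clinicians" or "routine care"
    ai_data_modality: str           # e.g. "video", "structured EHR", "waveform"
    ai_origin: str                  # e.g. "industry" or "academia"
    time_efficiency_reported: bool = False
    result_summary: Optional[str] = None

# Invented example record.
example = TrialRecord(
    study_id="trial-001",
    location="single country",
    n_participants=359,
    clinical_task="polyp detection during colonoscopy",
    specialty="gastroenterology",
    primary_endpoint_group="diagnostic performance",
    comparator="unassisted clinicians",
    ai_data_modality="video",
    ai_origin="industry",
)
print(example)
```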
No contact was made with study authors for additional information, and due to the varied nature of tasks and endpoints across studies, no meta-analyses were performed. Instead, descriptive statistics were used to provide an overview of the characteristics of the trials included in this review.
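A minimal sketch of that descriptive approach, assuming tabular records like those above: the rows below are invented and only show how counts and medians per specialty might be tabulated in place of a pooled meta-analysis.

```python
import pandas as pd

# Invented example rows; not data from the review.
trials = pd.DataFrame([
    {"specialty": "gastroenterology", "n_participants": 700,  "primary_endpoint_met": True},
    {"specialty": "gastroenterology", "n_participants": 1000, "primary_endpoint_met": True},
    {"specialty": "radiology",        "n_participants": 300,  "primary_endpoint_met": False},
    {"specialty": "cardiology",       "n_participants": 150,  "primary_endpoint_met": True},
])

# Descriptive summary per specialty: trial counts, median enrolment, endpoints met.
summary = trials.groupby("specialty").agg(
    n_trials=("specialty", "size"),
    median_participants=("n_participants", "median"),
    endpoints_met=("primary_endpoint_met", "sum"),
)
print(summary)
```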
Study results
The electronic search retrieved 6,219 study records and 4,299 trial registrations published between January 1, 2018, and November 14, 2023, yielding 10,484 unique records after deduplication. Initial screening of titles and abstracts narrowed the selection to 133 articles for full-text review. Subsequent exclusions left 73 studies, which were supplemented by 13 additional articles identified through screening of references, for a total of 86 unique RCTs.
Of these 86 RCTs, a substantial proportion (43%) focused on gastroenterology, followed by radiology (13%), surgery (6%), and cardiology (6%). Gastroenterology trials predominantly used video-based deep learning algorithms to assist clinicians, mainly evaluating diagnostic yield or performance. Most gastroenterology trials were concentrated among four research groups, highlighting a lack of diversity in trial conduct. Geographically, 92% of the trials were conducted within single countries, with the USA and China conducting the most trials, though each focused on different specialties.
Most trials were conducted at single centers and enrolled a median of 359 participants. Demographic characteristics such as age and sex were consistently reported, whereas race or ethnicity was reported less frequently.
Diagnostic effectiveness was the most common primary endpoint, followed by metrics related to care management, patient behavior and symptoms, and clinical decision-making. Notably, AI interventions in insulin dosing and hypotension monitoring improved clinical management by increasing the time spent within target ranges. Other AI applications positively influenced patient behavior, as seen in trials in which immediate AI-generated predictions increased adherence to referral recommendations.
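As an aside on the “time within target range” metric those trials optimised, the sketch below shows one common way such a figure is computed; the glucose readings and the 70-180 mg/dL target are invented for illustration and are not drawn from the review.

```python
import numpy as np

def percent_time_in_range(values, lower, upper):
    """Share of (roughly equally spaced) measurements falling inside [lower, upper]."""
    values = np.asarray(values, dtype=float)
    in_range = (values >= lower) & (values <= upper)
    return 100.0 * in_range.mean()

# Invented hourly glucose readings (mg/dL) checked against a 70-180 mg/dL target.
glucose = [95, 120, 210, 160, 65, 140, 175, 150]
print(f"Time in range: {percent_time_in_range(glucose, 70, 180):.0f}%")
```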
The majority of the trials evaluated deep learning systems for medical imaging, specifically video-based systems used in endoscopy. The use of AI varied across data types, including structured data from electronic health records and waveform data. In terms of development, most AI models originated from industry, with academia also playing a significant role.
Outcome analyses revealed that a substantial number of trials achieved significant improvements in their primary endpoints when AI was used to assist clinicians or was compared with routine care. A smaller group of trials used non-inferiority designs, which aim to show that an intervention performs no worse than its comparator, to demonstrate that AI systems could match the performance of unassisted clinicians or routine care.
Operational time measurements varied across trials, with some reporting significant reductions while others saw increases or no change. Gastroenterology was the specialty most often assessed for effects on operational time, and there, too, the impact of AI on efficiency was mixed.