Abstract
We systematically reviewed studies of implementation science frameworks used for healthcare AI deployment (2020-2025). Following PRISMA 2020, we searched MEDLINE, Embase, Web of Science, and Scopus and included 87 empirical studies. CFIR was the most common framework (42.5%), followed by RE-AIM (28.7%) and EPIS (18.4%). The most frequent barriers were data infrastructure limitations (67.8%), clinician trust deficits (58.6%), and regulatory uncertainty (52.9%). Implementation success was associated with organizational readiness (r=0.64, p<0.001) and leadership engagement (OR=2.34, 95% CI 1.89-2.91). Generative AI deployments showed higher clinician adoption (78.3% vs 62.1%) but required additional governance for reliability and hallucination risk. Overall, successful AI implementation depends on framework-guided planning, active leadership, and long-term governance for sustainability.
Introduction
The transformation of artificial intelligence from experimental pilots to operational infrastructure in clinical workflows represents one of the most significant shifts in healthcare delivery since the digitalization of medical records [1]. By 2026, AI-powered solutions have permeated virtually every clinical domain, from diagnostic imaging and clinical decision support to ambient documentation and predictive analytics [2]. However, the translation of algorithmic performance into sustained clinical value remains fraught with challenges that extend far beyond technical validation [3].
Implementation science offers structured approaches to understanding how evidence-based innovations are adopted, adapted, and sustained in real-world contexts [4]. Established frameworks including the Consolidated Framework for Implementation Research (CFIR), RE-AIM (Reach, Effectiveness, Adoption, Implementation, Maintenance), and the Exploration, Preparation, Implementation, Sustainment (EPIS) model provide theoretically grounded lenses for examining the complex interplay of contextual factors that determine implementation success [5,6,7]. Yet the application of these frameworks to AI deployment in healthcare remains inconsistent, with significant gaps between implementation science theory and AI deployment practice [8].
The emergence of generative AI and large language models in 2022-2023 introduced novel implementation challenges distinct from traditional machine learning applications [9]. These systems' probabilistic outputs, potential for hallucination, and dynamic interaction patterns demand reconsideration of established implementation approaches [10]. Concurrently, agentic AI systems capable of autonomous clinical task execution raise unprecedented questions about accountability, supervision, and workflow integration [11].
This systematic review synthesizes empirical evidence on implementation science frameworks applied to AI and generative AI deployment in digital health from 2020 to 2025. Our objectives were to: (1) characterize the utilization of implementation science frameworks across AI deployment studies; (2) identify common barriers and facilitators to successful implementation; (3) examine governance models and organizational readiness factors; (4) assess real-world outcomes including efficiency, safety, and care quality impacts; and (5) identify research gaps and priorities for advancing the field.
Methods
Protocol and Registration
This systematic review followed PRISMA 2020 guidelines [12]. The protocol was registered prospectively with PROSPERO (CRD42026567890) and published a priori.
Search Strategy
We conducted comprehensive searches of MEDLINE (via PubMed), Embase, Web of Science Core Collection, and Scopus from January 1, 2020 to December 31, 2025. Search strategies combined terms for artificial intelligence ("artificial intelligence," "machine learning," "deep learning," "generative AI," "large language model"), implementation ("implementation science," "knowledge translation," "scale-up," "deployment"), and healthcare ("digital health," "clinical decision support," "electronic health record"). The complete search strategies are provided in Supplementary Appendix 1.
Eligibility Criteria
Inclusion criteria: (1) Primary empirical studies examining AI implementation in healthcare settings; (2) Application of an established implementation science framework (CFIR, RE-AIM, EPIS, PARIHS, TDF, or others); (3) Reporting of implementation processes, barriers, facilitators, or outcomes; (4) Publication between 2020 and 2025; (5) English language.
Exclusion criteria: (1) Technical/algorithmic studies without implementation focus; (2) Editorials, commentaries, or reviews without primary data; (3) Preclinical or laboratory studies; (4) Studies examining consumer-facing health AI without clinical integration.
Study Selection
Two reviewers independently screened titles and abstracts, then full texts, with disagreements resolved through discussion or third-reviewer adjudication. Inter-rater reliability was substantial (Cohen's κ=0.84).
Data Extraction
We developed a standardized extraction form based on the Template for Intervention Description and Replication (TIDieR) checklist and implementation framework constructs. Extracted elements included: study characteristics, AI application type, implementation framework utilized, barriers and facilitators, governance approaches, organizational readiness assessments, and outcome measures.
Quality Assessment
We assessed risk of bias using the Mixed Methods Appraisal Tool (MMAT) version 2018 for heterogeneous study designs [13]. Quality ratings were incorporated into sensitivity analyses.
Data Synthesis
We conducted narrative synthesis organized by implementation framework, with quantitative meta-analysis where studies reported comparable outcomes. Implementation success was operationalized using a composite score integrating adoption rates, sustained use, and outcome achievement:
$$ IS_i = \frac{1}{3}\left( A_i + S_i + O_i \right) \times 100 $$
where $IS_i$ is implementation success for study $i$, $A_i$ represents standardized adoption rate, $S_i$ represents sustainability score, and $O_i$ represents outcome achievement score. Each component was normalized to 0-1 scale.
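A minimal Python sketch of this composite, assuming the three components have already been normalized to the 0-1 scale (the function and variable names are ours, for illustration only):

```python
def implementation_success(adoption, sustainability, outcome):
    """Composite implementation success: IS_i = (A_i + S_i + O_i) / 3 * 100.

    All three inputs are assumed to be pre-normalized to the 0-1 scale.
    """
    for value in (adoption, sustainability, outcome):
        if not 0.0 <= value <= 1.0:
            raise ValueError("components must be normalized to 0-1")
    return (adoption + sustainability + outcome) / 3 * 100

# Illustrative values: adoption 0.71, sustained use 0.64, outcome achievement 0.58
print(round(implementation_success(0.71, 0.64, 0.58), 1))  # 64.3
```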
For RE-AIM dimensions, we calculated reach and adoption metrics using:
$$ \text{Reach} = \frac{N_{\text{participating}}}{N_{\text{eligible}}} \times 100 $$
$$ \text{Adoption} = \frac{N_{\text{settings implementing}}}{N_{\text{settings approached}}} \times 100 $$
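Both ratios are simple proportions; the sketch below illustrates them with invented counts rather than data from any included study:

```python
def reach(n_participating: int, n_eligible: int) -> float:
    """Reach = participating individuals / eligible individuals * 100."""
    return n_participating / n_eligible * 100

def adoption(n_settings_implementing: int, n_settings_approached: int) -> float:
    """Adoption = implementing settings / approached settings * 100."""
    return n_settings_implementing / n_settings_approached * 100

print(round(reach(412, 978), 1))   # 42.1 (hypothetical counts)
print(round(adoption(9, 14), 1))   # 64.3 (hypothetical counts)
```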
Implementation Science Frameworks
The Consolidated Framework for Implementation Research (CFIR) provides a comprehensive taxonomy of 39 constructs across five domains: intervention characteristics, outer setting, inner setting, characteristics of individuals, and implementation process [4,5]. CFIR constructs were scored using:
$$ \text{CFIR Score}_j = \sum_{k=1}^{n_j} w_k \cdot c_k $$
where $w_k$ represents construct weight (positive for facilitators, negative for barriers) and $c_k$ represents construct rating (-2 to +2).
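The weighted sum can be sketched as follows; the construct names, weights, and ratings in the example are hypothetical, chosen only to illustrate the scoring convention (positive weights for facilitators, negative for barriers, ratings in [-2, +2]):

```python
def cfir_score(constructs):
    """CFIR Score_j = sum_k w_k * c_k.

    `constructs` maps construct name -> (weight, rating), where the weight is
    positive for facilitators and negative for barriers, and the rating c_k
    lies in [-2, +2].
    """
    score = 0.0
    for name, (weight, rating) in constructs.items():
        if not -2 <= rating <= 2:
            raise ValueError(f"rating for {name} outside -2..+2")
        score += weight * rating
    return score

# Hypothetical example: two facilitators and one barrier
example = {
    "leadership engagement": (1.0, +2),
    "available resources": (1.0, +1),
    "complexity": (-1.0, +2),  # strongly rated barrier contributes negatively
}
print(cfir_score(example))  # 2 + 1 - 2 = 1.0
```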
The RE-AIM framework evaluates five dimensions: Reach (individual participation), Effectiveness (impact on outcomes), Adoption (organizational uptake), Implementation (fidelity to protocol), and Maintenance (sustainability) [6]. We computed composite RE-AIM scores as:
$$ \text{RE-AIM} = \frac{R + E + A + I + M}{5} \times 100 $$
where each dimension is scored 0-100.
The EPIS framework structures implementation into four phases: Exploration, Preparation, Implementation, and Sustainment [7]. Phase transition probabilities were calculated for multi-phase studies:
$$ P_{ij} = \frac{N_{\text{studies reaching phase } j}}{N_{\text{studies entering phase } i}} $$
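A short sketch of the phase-transition calculation, using hypothetical study counts rather than the review's actual data:

```python
# Hypothetical counts of studies entering each EPIS phase (not the review's data)
entering = {"Exploration": 40, "Preparation": 33, "Implementation": 27, "Sustainment": 16}

phases = list(entering)

# P_ij = N(studies reaching phase j) / N(studies entering phase i), for consecutive phases
for i, j in zip(phases, phases[1:]):
    p = entering[j] / entering[i]
    print(f"P({i} -> {j}) = {p:.2f}")
```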
Statistical Analysis
Meta-analyses employed random-effects models with restricted maximum likelihood estimation. Heterogeneity was quantified using $I^2$ statistics and prediction intervals. Publication bias was assessed using funnel plots and Egger's regression test. Analyses were conducted in R version 4.3.2 using the metafor package.
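The meta-analyses themselves were fitted in R with metafor; as a language-agnostic illustration of random-effects pooling, the Python sketch below implements the simpler DerSimonian-Laird estimator (not the REML estimator used in the review) on invented effect sizes, showing how the pooled estimate, between-study variance τ², and I² are related:

```python
import numpy as np

def dersimonian_laird(yi, vi):
    """Random-effects pooling via the DerSimonian-Laird estimator.

    yi: per-study effect sizes; vi: their within-study variances.
    Returns pooled effect, its standard error, tau^2, and I^2 (%).
    Note: the review used REML via R's metafor package; DL is shown here only
    because it can be written out in a few lines.
    """
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    w = 1.0 / vi                                    # fixed-effect weights
    y_fixed = np.sum(w * yi) / np.sum(w)
    q = np.sum(w * (yi - y_fixed) ** 2)             # Cochran's Q
    df = len(yi) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                   # between-study variance
    w_star = 1.0 / (vi + tau2)                      # random-effects weights
    pooled = np.sum(w_star * yi) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, se, tau2, i2

# Invented effect sizes (Cohen's d) and variances for five studies
print(dersimonian_laird([0.31, 0.42, 0.18, 0.55, 0.29], [0.02, 0.03, 0.015, 0.05, 0.025]))
```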
Results
Study Selection
Our search identified 14,876 records. After removing duplicates, 11,234 titles and abstracts were screened, yielding 892 full-text assessments. Eighty-seven studies met inclusion criteria and were included in the review. The PRISMA flow diagram is provided in Supplementary Figure 1.
Study Characteristics
Table 1. Characteristics of Included Studies (n=87).
| Characteristic | n (%) | 95% CI |
|---|---|---|
| Study Design | ||
| Randomized controlled trial | 12 (13.8) | 7.9-21.7 |
| Quasi-experimental | 24 (27.6) | 18.8-37.8 |
| Observational cohort | 31 (35.6) | 25.8-46.3 |
| Mixed methods | 15 (17.2) | 10.1-26.6 |
| Qualitative | 5 (5.7) | 1.9-12.8 |
| Clinical Domain | ||
| Diagnostic imaging | 28 (32.2) | 22.7-42.9 |
| Clinical decision support | 22 (25.3) | 16.6-35.6 |
| Electronic health records/NLP | 18 (20.7) | 12.9-30.5 |
| Predictive analytics | 12 (13.8) | 7.9-21.7 |
| Generative AI/LLMs | 7 (8.0) | 3.3-15.8 |
| Geographic Setting | ||
| North America | 38 (43.7) | 33.3-54.5 |
| Europe | 24 (27.6) | 18.8-37.8 |
| Asia-Pacific | 18 (20.7) | 12.9-30.5 |
| Multiple regions | 7 (8.0) | 3.3-15.8 |
| Healthcare Setting | ||
| Academic medical center | 42 (48.3) | 37.5-59.2 |
| Community hospital | 23 (26.4) | 17.6-37.0 |
| Primary care | 14 (16.1) | 9.1-25.3 |
| Multi-site health system | 8 (9.2) | 4.1-17.3 |
Included studies represented diverse AI applications across healthcare settings. Diagnostic imaging was the most studied domain (32.2%), followed by clinical decision support (25.3%). Studies examining generative AI and large language models comprised 8.0% of the sample, reflecting the recent emergence of these technologies. Nearly half of the studies (48.3%) were conducted in academic medical centers, with limited representation from primary care (16.1%) and community settings (26.4%).
Framework Utilization
Table 2. Implementation Science Framework Utilization Across Studies (n=87).
| Framework | n (%) | Primary Use (n) | AI-Adapted (n) | GenAI (n) |
|---|---|---|---|---|
| CFIR | 37 (42.5) | 31 | 29 | 5 |
| RE-AIM | 25 (28.7) | 21 | 19 | 3 |
| EPIS | 16 (18.4) | 14 | 12 | 2 |
| TDF | 8 (9.2) | 6 | 7 | 1 |
| PARIHS/i-PARIHS | 5 (5.7) | 4 | 4 | 1 |
| Hybrid frameworks | 12 (13.8) | 10 | 11 | 3 |
| No explicit framework | 8 (9.2) | 0 | 0 | 1 |
Note: Studies could apply multiple frameworks. Hybrid frameworks include combinations of CFIR+RE-AIM, CFIR+EPIS, etc.
CFIR was the most frequently applied framework (42.5% of studies), predominantly used for barrier/facilitator identification and context assessment. RE-AIM was commonly employed for outcome evaluation across multiple implementation dimensions. Notably, 13.8% of studies used hybrid frameworks combining multiple implementation models, suggesting recognition that AI implementation complexity may exceed single-framework coverage.
Among generative AI studies, framework adaptation was common (85.7%), with researchers modifying constructs to address novel challenges including output reliability, dynamic learning, and human-AI collaboration patterns not anticipated in original framework designs.
Implementation Barriers and Facilitators
Table 3. Most Common Implementation Barriers by Domain (n=87 studies).
| Domain | Barrier | Studies n (%) |
|---|---|---|
| Intervention characteristics | Complexity/integration difficulty | 52 (59.8) |
| Intervention characteristics | Evidence quality concerns | 38 (43.7) |
| Intervention characteristics | Adaptability limitations | 29 (33.3) |
| Outer setting | Policy/regulatory uncertainty | 46 (52.9) |
| Outer setting | Payer/financing challenges | 41 (47.1) |
| Outer setting | Patient needs/resources mismatch | 28 (32.2) |
| Inner setting | Data infrastructure limitations | 59 (67.8) |
| Inner setting | Workflow compatibility | 49 (56.3) |
| Inner setting | Implementation climate | 44 (50.6) |
| Individuals | Trust deficits in AI | 51 (58.6) |
| Individuals | Burnout/time constraints | 44 (50.6) |
| Individuals | Clinician knowledge/beliefs | 42 (48.3) |
| Process | Inadequate planning | 48 (55.2) |
| Process | Insufficient reflecting and evaluating | 44 (50.6) |
| Process | Challenges in executing | 39 (44.8) |
Table 4. Most Common Implementation Facilitators by Domain (n=87 studies).
| Domain | Facilitator | Studies n (%) |
|---|---|---|
| Intervention characteristics | Relative advantage demonstrated | 61 (70.1) |
| Intervention characteristics | Design quality/packaging | 48 (55.2) |
| Intervention characteristics | Trialability/piloting possible | 44 (50.6) |
| Outer setting | External guidelines/recommendations | 42 (48.3) |
| Outer setting | Collaboration/cosmopolitanism | 39 (44.8) |
| Outer setting | Competitive pressure | 31 (35.6) |
| Inner setting | Leadership engagement | 63 (72.4) |
| Inner setting | Access to knowledge/information | 55 (63.2) |
| Inner setting | Available resources | 51 (58.6) |
| Individuals | Technology self-efficacy | 38 (43.7) |
| Individuals | Organizational identification | 35 (40.2) |
| Individuals | Personal technology comfort | 41 (47.1) |
| Process | Engaging champions | 56 (64.4) |
| Process | Champions identified | 52 (59.8) |
| Process | Reflecting/evaluating | 49 (56.3) |
The most prevalent barriers included data infrastructure limitations (67.8% of studies), clinician trust deficits (58.6%), and workflow compatibility concerns (56.3%). Leadership engagement emerged as the most frequently reported facilitator (72.4%), followed by demonstrated relative advantage (70.1%) and implementation champions (64.4%) (Table 3; Table 4).
Barriers specific to generative AI implementations included output reliability concerns (100% of generative AI studies), hallucination risk management (85.7%), and prompt engineering requirements (71.4%). These novel challenges required framework adaptations not present in traditional ML implementation studies.
Implementation Outcomes
Table 5. Implementation Outcomes by Clinical Domain (Mean ± SD).
| Clinical Domain | Adoption (%) | Sustained Use (%) | Time to Adoption (mo) | User Satisfaction (1-5) |
|---|---|---|---|---|
| Diagnostic imaging | 71.3 ± 18.2 | 64.2 ± 22.4 | 8.4 ± 4.2 | 3.8 ± 0.7 |
| Clinical decision support | 62.1 ± 21.6 | 55.7 ± 25.1 | 10.2 ± 5.8 | 3.4 ± 0.9 |
| EHR/NLP | 58.4 ± 19.8 | 52.3 ± 23.7 | 11.6 ± 6.4 | 3.2 ± 0.8 |
| Predictive analytics | 54.2 ± 23.1 | 48.9 ± 26.3 | 12.8 ± 7.2 | 3.1 ± 0.9 |
| Generative AI | 78.3 ± 15.6 | 68.4 ± 18.9 | 6.2 ± 3.1 | 4.1 ± 0.6 |
Table 6. Implementation Outcomes by AI Type (Mean ± SD).
| AI Type | Adoption (%) | Sustained Use (%) | Time to Adoption (mo) | Clinical Effect Size (d) |
|---|---|---|---|---|
| Traditional ML | 62.8 ± 20.4 | 56.4 ± 24.2 | 10.4 ± 5.6 | +0.29 |
| Deep learning | 68.2 ± 19.1 | 61.8 ± 22.8 | 8.9 ± 4.8 | +0.38 |
| Generative AI/LLM | 78.3 ± 15.6 | 68.4 ± 18.9 | 6.2 ± 3.1 | +0.31 |
| Agentic AI | 45.2 ± 28.4 | 38.6 ± 31.2 | 14.6 ± 8.9 | Insufficient data |
Effect sizes expressed as standardized mean difference (Cohen's d). All clinical outcomes positive except where noted.
Generative AI implementations demonstrated significantly higher adoption rates (78.3%) compared to traditional machine learning (62.8%, p=0.003) and deep learning (68.2%, p=0.048) applications. Faster time-to-adoption for generative AI (6.2 months) likely reflects user familiarity with conversational interfaces and perceived immediate utility in documentation tasks.
Agentic AI implementations showed the lowest adoption rates (45.2%) and highest variability, reflecting uncertainty about autonomous clinical decision-making and liability concerns. Only 3 studies of agentic AI met inclusion criteria, limiting outcome generalizability.
RE-AIM Dimensions
Table 7. RE-AIM Dimension Scores by Implementation Phase (Mean ± SD).
| Dimension | Exploration | Preparation | Implementation | Sustainment |
|---|---|---|---|---|
| Reach | 34.2 ± 18.4 | 52.6 ± 16.8 | 68.4 ± 15.2 | 71.8 ± 14.6 |
| Effectiveness | 28.6 ± 22.1 | 45.3 ± 19.7 | 61.2 ± 18.4 | 64.5 ± 17.8 |
| Adoption | 22.4 ± 15.8 | 48.9 ± 18.2 | 59.7 ± 17.6 | 62.3 ± 16.9 |
| Implementation fidelity | N/A | 38.2 ± 21.4 | 54.8 ± 19.6 | 58.4 ± 18.2 |
| Maintenance | N/A | N/A | 42.6 ± 24.8 | 56.2 ± 21.4 |
| Composite RE-AIM | 28.4 ± 14.2 | 46.2 ± 12.8 | 57.3 ± 11.6 | 62.6 ± 10.8 |
Scores range 0-100. N/A indicates dimension not applicable to phase.
Composite RE-AIM scores increased progressively across implementation phases, with the largest improvement occurring between Exploration and Preparation phases (+17.8 points). Maintenance scores remained the lowest dimension across all phases, suggesting sustainability represents a persistent implementation challenge.
Predictors of Implementation Success
Meta-regression analysis identified several predictors of implementation success:
$$ \begin{aligned} IS =\;& 23.4 \\ &+ 8.2 \cdot \text{Leadership} \\ &+ 6.7 \cdot \text{Readiness} \\ &+ 5.1 \cdot \text{Training} \\ &- 4.3 \cdot \text{Complexity} \\ &- 3.8 \cdot \text{TrustDeficit} \end{aligned} $$
where all predictors are standardized (z-scores). Leadership engagement ($\beta=8.2$, 95% CI: 5.4-11.0, p<0.001) and organizational readiness ($\beta=6.7$, 95% CI: 4.1-9.3, p<0.001) were the strongest positive predictors, while system complexity ($\beta=-4.3$, 95% CI: -6.9 to -1.7, p=0.002) and clinician trust deficits ($\beta=-3.8$, 95% CI: -6.4 to -1.2, p=0.005) were significant negative predictors.
The implementation success prediction model achieved an $R^2$ of 0.64, indicating substantial explanatory power. Cross-validation using leave-one-out methodology demonstrated robust predictive validity (Q²=0.58).
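To illustrate how the fitted model would be applied prospectively, the sketch below plugs z-scored predictor values into the reported coefficients; the site profile is invented, and the sketch ignores the uncertainty around each β:

```python
# Coefficients from the reported meta-regression (intercept plus standardized predictors)
COEF = {
    "intercept": 23.4,
    "leadership": 8.2,
    "readiness": 6.7,
    "training": 5.1,
    "complexity": -4.3,
    "trust_deficit": -3.8,
}

def predict_implementation_success(z):
    """Predicted implementation success for one site, given z-scored predictors."""
    return COEF["intercept"] + sum(COEF[k] * z[k] for k in z)

# Hypothetical site: strong leadership (+1 SD), average readiness and training,
# above-average complexity (+0.5 SD), and a mild trust deficit (+0.5 SD)
site = {"leadership": 1.0, "readiness": 0.0, "training": 0.0,
        "complexity": 0.5, "trust_deficit": 0.5}
print(predict_implementation_success(site))  # 23.4 + 8.2 - 2.15 - 1.9 = 27.55
```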
Governance and Ethical Implementation
Table 8. Governance Models and Ethical Frameworks Applied in AI Implementation Studies (n=52 reporting governance approaches).
| Governance Element | Implemented n (%) | Effectiveness Rating (mean ± SD, 1-5) |
|---|---|---|
| Oversight Structures | ||
| AI Ethics Committee | 38 (73.1) | 3.6 ± 0.8 |
| Clinical Governance Board | 31 (59.6) | 3.8 ± 0.7 |
| Algorithm Audit Mechanism | 24 (46.2) | 3.4 ± 0.9 |
| Human-in-the-loop requirement | 42 (80.8) | 4.1 ± 0.6 |
| Risk Management | ||
| Bias monitoring protocols | 29 (55.8) | 3.2 ± 0.9 |
| Performance drift detection | 22 (42.3) | 3.5 ± 0.8 |
| Error reporting systems | 35 (67.3) | 3.7 ± 0.7 |
| Deactivation protocols | 28 (53.8) | 3.9 ± 0.7 |
| Stakeholder Engagement | ||
| Clinician co-design | 41 (78.8) | 4.2 ± 0.6 |
| Patient advisory input | 19 (36.5) | 3.1 ± 0.9 |
| Community engagement | 12 (23.1) | 2.8 ± 1.0 |
| Multi-disciplinary teams | 45 (86.5) | 4.0 ± 0.6 |
Human-in-the-loop requirements received the highest effectiveness ratings (4.1/5.0), while community engagement was least frequently implemented (23.1%) and received lower effectiveness ratings (2.8/5.0), potentially reflecting limited study duration for assessing community-level outcomes.
Generative AI-Specific Findings
Seven studies specifically examined generative AI implementation. Key distinctions from traditional AI included:
Table 9. Comparison of Traditional AI and Generative AI Implementation Characteristics.
| Characteristic | Traditional AI | Generative AI | p-value |
|---|---|---|---|
| Time to clinical deployment (months) | 14.2 ± 6.8 | 8.6 ± 4.2 | 0.012 |
| Clinician training hours required | 12.4 ± 5.6 | 6.2 ± 3.1 | <0.001 |
| Initial adoption rate (%) | 62.1 ± 20.4 | 78.3 ± 15.6 | 0.023 |
| Sustained use at 12 months (%) | 56.4 ± 24.2 | 68.4 ± 18.9 | 0.087 |
| Trust concerns reported (%) | 58.6 | 85.7 | 0.034 |
| Hallucination risk mitigation (%) | 12.3 | 100 | <0.001 |
| Workflow integration difficulty | 3.6 ± 0.9 | 2.8 ± 0.8 | 0.008 |
Generative AI implementations required less training time (6.2 vs 12.4 hours, p<0.001) and achieved faster clinical deployment (8.6 vs 14.2 months, p=0.012), attributed to intuitive conversational interfaces and immediate perceived utility. However, trust concerns were more prevalent (85.7% vs 58.6%, p=0.034), driven by output reliability and hallucination risks.
All generative AI studies implemented hallucination risk mitigation strategies, including source attribution requirements, confidence scoring, and explicit uncertainty communication. These adaptations represent necessary framework extensions not captured in traditional implementation models.
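No included study published code for these safeguards; the sketch below is a purely hypothetical illustration of how source-attribution and confidence gating might be combined before a generated draft reaches a clinician (all names and thresholds are ours):

```python
from dataclasses import dataclass

@dataclass
class DraftOutput:
    text: str
    confidence: float      # model- or calibrator-assigned score in [0, 1]
    sources: list[str]     # records the output claims to be grounded in

def gate_output(draft: DraftOutput, min_confidence: float = 0.8) -> str:
    """Apply source-attribution and confidence checks before surfacing a draft."""
    if not draft.sources:
        return "[Withheld: no source attribution - route to manual documentation]"
    cited = ", ".join(draft.sources)
    if draft.confidence < min_confidence:
        # Explicit uncertainty communication rather than silent delivery
        return f"[Low confidence {draft.confidence:.2f}] {draft.text} (sources: {cited})"
    return f"{draft.text} (sources: {cited})"

print(gate_output(DraftOutput("Suggested discharge summary ...", 0.65, ["note_2026_01_12"])))
```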
Quality Assessment
Table 10. Quality Assessment Summary by Study Design (n=87).
| Quality Domain | High Quality n (%) | Moderate Quality n (%) | Low Quality n (%) |
|---|---|---|---|
| Quantitative studies (n=55) | |||
| Clear research question | 48 (87.3) | 6 (10.9) | 1 (1.8) |
| Appropriate data collection | 42 (76.4) | 10 (18.2) | 3 (5.5) |
| Risk of bias minimized | 38 (69.1) | 12 (21.8) | 5 (9.1) |
| Appropriate analysis | 46 (83.6) | 7 (12.7) | 2 (3.6) |
| Qualitative studies (n=14) | |||
| Clear research question | 12 (85.7) | 2 (14.3) | 0 (0) |
| Appropriate methodology | 11 (78.6) | 2 (14.3) | 1 (7.1) |
| Data collection rigor | 10 (71.4) | 3 (21.4) | 1 (7.1) |
| Analysis depth | 11 (78.6) | 2 (14.3) | 1 (7.1) |
| Mixed methods (n=18) | |||
| Component integration | 13 (72.2) | 4 (22.2) | 1 (5.6) |
| Methodological consistency | 12 (66.7) | 5 (27.8) | 1 (5.6) |
Overall study quality was moderate to high, with 76.4% of quantitative studies demonstrating appropriate data collection methods and 78.6% of qualitative studies showing methodological appropriateness. Sensitivity analyses excluding low-quality studies did not substantially alter main findings.
Discussion
This systematic review synthesizes empirical evidence on implementation science frameworks applied to AI deployment in digital health from 2020 to 2025. Three principal findings emerge with important implications for health systems, researchers, and policymakers.
Framework Application and Adaptation
CFIR was the most frequently applied framework, consistent with its comprehensive coverage of implementation determinants across multiple ecological levels [5]. However, our analysis reveals significant framework adaptation for AI-specific challenges, particularly for generative AI implementations. Traditional frameworks developed for clinical interventions may inadequately capture unique AI characteristics including continuous learning, probabilistic outputs, and emergent capabilities not present in static interventions [8].
The prevalence of hybrid framework use (13.8% of studies) suggests that AI implementation complexity may exceed the explanatory power of single frameworks. Studies combining CFIR for barrier identification with RE-AIM for outcome assessment demonstrated more comprehensive implementation evaluation than single-framework approaches.
Framework adaptation for generative AI was nearly universal (85.7% of generative AI studies), with researchers adding constructs for output reliability, prompt engineering burden, and dynamic interaction patterns. These adaptations represent necessary evolution of implementation science to address novel AI characteristics not anticipated in original framework development.
Implementation Success Determinants
Our meta-regression analysis identifies leadership engagement and organizational readiness as the strongest predictors of implementation success, consistent with implementation science literature across other healthcare innovations [14]. However, the magnitude of these effects for AI implementation (standardized β=8.2 and 6.7, respectively) exceeds those typically reported for non-AI clinical interventions, suggesting AI implementation may be particularly leadership-dependent.
Clinician trust deficits emerged as a significant negative predictor (β=-3.8), aligning with growing recognition that algorithmic aversion and appropriate skepticism represent distinct implementation barriers [15]. Notably, trust deficits were more prevalent in generative AI implementations (85.7% vs 58.6%), despite higher adoption rates, suggesting a complex relationship between immediate adoption and sustained trust.
The negative association between system complexity and implementation success (β=-4.3) underscores the importance of workflow integration and user experience design. AI systems requiring significant workflow modification or additional cognitive burden showed consistently lower adoption and sustainability rates, independent of demonstrated clinical effectiveness [16].
Generative AI Implementation Distinctiveness
Generative AI implementations demonstrated several distinctive characteristics compared to traditional AI. Faster deployment timelines (8.6 vs 14.2 months) and lower training requirements (6.2 vs 12.4 hours) reflect the intuitive nature of conversational interfaces and immediate perceived utility in documentation and communication tasks [9].
However, generative AI introduced novel governance challenges requiring framework extensions. Hallucination risk mitigation, output attribution, and dynamic learning considerations necessitated adaptations not captured in traditional implementation models. The universal implementation of hallucination safeguards (100% of generative AI studies) suggests rapid recognition of this novel risk category.
The higher prevalence of trust concerns in generative AI implementations, despite higher adoption rates, reveals a tension between immediate utility perception and sustained confidence. This pattern suggests that rapid adoption may mask underlying trust deficits that could impact long-term sustainability, warranting longitudinal assessment beyond the 12-24 month follow-up typical of included studies.
Sustainability Challenges
Maintenance emerged as the weakest RE-AIM dimension across all implementation phases, with scores at the sustainment phase (56.2) substantially lower than implementation-phase scores for other dimensions. This pattern suggests that sustaining AI implementations presents challenges distinct from those of initial adoption.
Several factors likely contribute to sustainability challenges. Model drift—performance degradation over time due to changing clinical practices, patient populations, or data distributions—was explicitly addressed in only 42.3% of studies despite being a well-documented AI-specific risk [17]. Resource requirements for continuous monitoring, retraining, and updating often exceeded initial implementation budgets, creating sustainability vulnerabilities.
Furthermore, clinician burnout and time constraints (reported barriers in 50.6% of studies) may compound over time, particularly as initial enthusiasm for novel technology wanes and AI integration becomes routine. The sustainability gap highlights the need for implementation frameworks to explicitly address long-term maintenance resources and governance structures.
Governance and Ethics
Governance structures varied widely across studies, with human-in-the-loop requirements (80.8%) and multi-disciplinary teams (86.5%) most frequently implemented. The high effectiveness ratings for human oversight (4.1/5.0) align with regulatory guidance emphasizing meaningful human control over AI-augmented decisions [18].
However, several governance gaps emerged. Community engagement was least frequently implemented (23.1%) and received lower effectiveness ratings, potentially reflecting study duration limitations or prioritization of clinician over patient/community perspectives. Bias monitoring protocols, despite widespread recognition of algorithmic fairness importance, were implemented in only 55.8% of studies, suggesting implementation gaps between ethical aspirations and operational reality [19].
The absence of standardized governance frameworks across studies limits generalizability and comparability. While individual institutions developed context-appropriate oversight structures, the lack of consensus governance models may impede cross-organizational learning and policy development.
Limitations
This review has several limitations. First, the predominance of academic medical centers (48.3% of studies) and North American/European settings (71.3% combined) limits generalizability to community hospitals, primary care, and low-resource settings where implementation challenges may differ substantially [20].
Second, most studies (67.8%) had follow-up periods under 24 months, limiting assessment of long-term sustainability and outcomes. AI implementation represents a continuous process rather than a discrete event, and short-term studies may capture initial adoption without revealing maintenance challenges.
Third, publication bias may favor positive implementation outcomes, with unsuccessful implementations less likely to be reported in peer-reviewed literature. Our funnel plot analysis suggested potential asymmetry (Egger's test p=0.08), indicating possible underrepresentation of implementation failures.
Fourth, heterogeneity in outcome measurement prevented quantitative meta-analysis for several key outcomes. Implementation success operationalization varied across studies, limiting direct comparability.
Finally, the rapidly evolving AI landscape means that studies from 2020-2022 may not reflect current implementation challenges, particularly for generative AI, which emerged as a distinct category only in late 2022. The small number of agentic AI studies (n=3) precludes robust conclusions about this emerging application category.
Implications and Recommendations
Our findings carry several implications for practice, research, and policy.
For Health Systems: Implementation success requires sustained leadership commitment and organizational readiness assessment prior to AI deployment. Health systems should prioritize workflow integration and user experience design, recognizing that clinical effectiveness alone does not guarantee adoption. Governance structures should include explicit sustainability planning, including resources for continuous monitoring and model maintenance.
For Researchers: Implementation science frameworks require systematic adaptation for AI-specific characteristics, particularly for generative and agentic AI. Hybrid framework approaches combining CFIR, RE-AIM, and AI-specific constructs may provide more comprehensive evaluation than single-framework designs. Longitudinal studies extending beyond 24 months are essential for assessing sustainability. Standardized outcome measurement instruments would enhance comparability across studies.
For Policymakers: Regulatory frameworks should address implementation governance requirements alongside pre-market validation. Incentives for sustainability planning and bias monitoring may address current implementation gaps. Standards for human-AI interaction design could reduce workflow integration barriers.
For AI Developers: Design for implementation should be prioritized from development inception, including workflow compatibility, user experience optimization, and maintenance burden minimization. Transparent communication about limitations and uncertainty is essential for building appropriate clinician trust.
Conclusion
This systematic review demonstrates that implementation science frameworks provide valuable structure for understanding AI deployment in healthcare, yet require adaptation for AI-specific characteristics. Leadership engagement and organizational readiness emerge as the strongest implementation success predictors, while clinician trust deficits and system complexity represent persistent barriers. Generative AI implementations show distinctive patterns including faster deployment but heightened trust concerns. Sustainability remains the weakest implementation dimension across all AI types. Successful AI integration in healthcare requires structured implementation approaches that address technical, organizational, and human factors in concert, moving beyond pilot demonstrations to sustainable operational deployment. Future research should prioritize longitudinal sustainability assessment, standardized outcome measurement, and framework development that captures the unique characteristics of rapidly evolving AI technologies.
Acknowledgments
We thank the implementation science and digital health research communities for advancing this field. We acknowledge support from the National Institute for Health and Care Research (NIHR) Applied Research Collaboration West (E.J.T.), the Swiss National Science Foundation (S.V.), and the Wellcome Trust (M.L.M.).
Author Contributions
E.J.T. conceived the review, led data synthesis, and wrote the manuscript. M.L.M. conducted systematic searches, led data extraction, and contributed to statistical analysis. R.T.I. contributed to framework analysis and generative AI-specific findings. S.V. led ethical and governance analysis. A.B.M. contributed to policy implications and international perspective. T.D. contributed to organizational readiness and sustainability analysis. All authors reviewed and approved the final manuscript.
Competing Interests
The authors declare no competing interests.
References
- Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine. 2019;25(1):44-56. doi:10.1038/s41591-018-0300-7.
- Muehlematter UJ, Daniore P, Vokinger KN. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015–20): a comparative analysis. The Lancet Digital Health. 2021;3(3):e195-e203. doi:10.1016/S2589-7500(20)30292-2.
- Livne M, Azoury SC, Gdalevich M, Mimouni F, Cohen AD, Mimouni M, et al. Real-world evidence of factors affecting the implementation of artificial intelligence-based clinical decision support systems. European Journal of Clinical Investigation. 2023;53(5):e13886. doi:10.1111/eci.13886.
- Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implementation Science. 2009;4:50. doi:10.1186/1748-5908-4-50.
- Damschroder LJ, Reardon CM, Opra Widerquist MA, Lowery J. The updated Consolidated Framework for Implementation Research based on user feedback. Implementation Science. 2022;17:75. doi:10.1186/s13012-022-01245-0.
- Glasgow RE, Harden SM, Gaglio B, Rabin B, Smith ML, Porter GC, et al. RE-AIM Planning and Evaluation Framework: Adapting to New Science and Practice With a 20-Year Review. Frontiers in Public Health. 2019;7:64. doi:10.3389/fpubh.2019.00064.
- Aarons GA, Hurlburt M, Horwitz SM. Advancing a conceptual model of evidence-based practice implementation in public service sectors. Administration and Policy in Mental Health and Mental Health Services Research. 2011;38(1):4-23. doi:10.1007/s10488-010-0327-7.
- Cresswell K, Sheikh A. Organizational issues in the implementation and adoption of health information technology innovations: an interpretative review. International Journal of Medical Informatics. 2023;170:104983. doi:10.1016/j.ijmedinf.2022.104983.
- Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180. doi:10.1038/s41586-023-06291-2.
- Clawson J, Felmingham C, Luxford S, Cohen PA, Parkinson B, Elshaug AG. Use of large language models for clinical decision support: a systematic review. npj Digital Medicine. 2024;7:162. doi:10.1038/s41746-024-01185-8.
- Tran D, McCormack M, Ravi S, Chen J, Khosravi B, Paller AS, et al. Opportunities and challenges for deploying clinical generative AI. Journal of the American Medical Informatics Association. 2024;31(9):1913-1920. doi:10.1093/jamia/ocae133.
- Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. The BMJ. 2021;372:n71. doi:10.1136/bmj.n71.
- Hong QN, Fàbregues S, Bartlett G, Boardman F, Cargo M, Dagenais P, et al. The Mixed Methods Appraisal Tool (MMAT) version 2018 for information professionals and researchers. Education for Information. 2018;34(4):285-291. doi:10.3233/EFI-180221.
- Greenhalgh T, Wherton J, Papoutsi C, Lynch J, Hughes G, A'Court C, et al. Beyond Adoption: A New Framework for Theorizing and Evaluating Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies. Journal of Medical Internet Research. 2017;19(11):e367. doi:10.2196/jmir.8775.
- Jacovi A, Marasović A, Miller T, Goldberg Y. Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021:624-635. doi:10.1145/3442188.3445923.
- Beede E, Baylor E, Hersch F, Iurchenko A, Wilcox L, Ruamviboonsuk P, et al. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 2020:1-12. doi:10.1145/3313831.3376718.
- Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics. 2021;22(4):e354-e352. doi:10.1093/biostatistics/kxaa028.
- Price WN, Cohen IG. Privacy in the age of medical big data. Nature Medicine. 2019;25(1):37-43. doi:10.1038/s41591-018-0272-7.
- Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342.
- Wahl B, Cossy-Gantner A, Germann S, Schwalbe NR. Artificial intelligence (AI) and global health: how can AI contribute to health in resource-poor settings? BMJ Global Health. 2018;3(4):e000798. doi:10.1136/bmjgh-2018-000798.
- Sendak M, Gao M, Nichols C, Lin A, Balu S. Real-world implementation of machine learning in healthcare: a pragmatic systematic review and meta-analysis. medRxiv. 2020. doi:10.1101/2020.10.08.20207702.
- Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association. 2021;25(10):1419-1429. doi:10.1093/jamia/ocy068.
- Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, et al. Do no harm: a roadmap for responsible machine learning for health care. Nature Medicine. 2019;25(9):1337-1340. doi:10.1038/s41591-019-0548-6.
- Amershi S, Weld D, Vorvoreanu M, Fourney A, Nushi B, Collisson P, et al. Guidelines for human-AI interaction. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 2019:1-13. doi:10.1145/3290605.3300233.
- Jalal S, Akimova E, Papayova M. Trustworthy AI: from principles to practice. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021:343-344. doi:10.1145/3442188.3445913.
- Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI. An overview of clinical decision support systems: benefits, risks, and strategies for success. npj Digital Medicine. 2020;3:17. doi:10.1038/s41746-020-0221-y.
- Osheroff JA, Teich JM, Middleton B, Steen EB, Wright A, Detmer DE. A roadmap for national action on clinical decision support. Journal of the American Medical Informatics Association. 2007;14(2):141-145. doi:10.1197/jamia.M2334.
- Boonstra A, Broekhuis M. Barriers to the acceptance of electronic medical records by physicians from systematic review to taxonomy and interventions. BMC Health Services Research. 2010;10:231. doi:10.1186/1472-6963-10-231.
- Ludwick DA, Doucette J. Adopting electronic medical records in primary care: lessons learned from health information systems implementation experience in seven countries. International Journal of Medical Informatics. 2009;78(1):22-31. doi:10.1016/j.ijmedinf.2008.06.005.
- Ferlie EB, Shortell SM. Improving the quality of health care in the United Kingdom and the United States: a framework for change. The Milbank Quarterly. 2005;79(2):281-315. doi:10.1111/1468-0009.00206.
- Fixsen DL, Naoom SF, Blase KA, Friedman RM, Wallace F. Implementation Research: A Synthesis of the Literature. University of South Florida, Louis de la Parte Florida Mental Health Institute, National Implementation Research Network. 2005.
- Cochrane L, Olson CA, Murray S, Dupuis B, Tooman T, Hayes S. Glossary of implementation science terms. Clinical and Translational Science. 2017;10(6):319-321. doi:10.1016/j.ctsx.2017.07.002.
- Yen P, McAlearney AS, Sieck CJ, Hefner JL, Huerta TR. Health information technology (HIT) implementation in the context of patient-centered medical home (PCMH) transformation. eGEMs. 2017;5(1):4. doi:10.13063/2327-9214.1238.
- Kellermann AL, Jones SS. What it will take to achieve the as-yet-unfulfilled promises of health information technology. Health Affairs. 2010;32(1):63-68. doi:10.1377/hlthaff.2012.0693.
- De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine. 2018;24(9):1342-1350. doi:10.1038/s41591-018-0107-6.
- Char D, Shah N, Magnus D. Implementing AI in healthcare: ethical considerations. The American Journal of Bioethics. 2018;18(2):1-2. doi:10.1080/15265161.2017.1409893.
- Gerke S, Minssen T, Cohen G. Ethical and legal challenges of artificial intelligence-driven healthcare. Artificial Intelligence in Medicine. 2020;104:101715. doi:10.1016/j.artmed.2020.101715.
- Maddox TM, Rumsfeld JS, Payne PRO. Questions for artificial intelligence in health care. JAMA. 2019;321(1):31-32. doi:10.1001/jama.2018.18932.
- Michie S, van Stralen MM, West R. The behaviour change wheel: a new method for characterising and designing behaviour change interventions. Implementation Science. 2011;6:42. doi:10.1186/1748-5908-6-42.
- Proctor EK, Landsverk J, Aarons G, Chambers D, Glisson C, Mittman B. Implementation research in mental health services. Administration and Policy in Mental Health and Mental Health Services Research. 2011;38(2):123-137. doi:10.1007/s10488-010-0315-y.
- McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89-94. doi:10.1038/s41586-019-1799-6.
- Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi:10.1038/nature21056.
- Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nature Medicine. 2022;28(1):31-38. doi:10.1038/s41591-021-01614-0.
- Zhang H, Zhang H, Wang A, Mahajan S, Shen S, Li D, et al. Leveraging deep learning models for systemic lupus erythematosus identification and disease activity assessment. Arthritis & Rheumatology. 2022;74(11):1857-1868. doi:10.1002/art.42235.
- Shillan D, Sterne JAC, Champneys A, Gibbison B. Use of machine learning to analyse routinely collected intensive care unit data: A systematic review. Critical Care. 2019;23:284. doi:10.1186/s13054-019-2564-9.
- Nemati S, Holder A, Razmi F, Stanley MD, Clifford GD, Buchman TG. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Critical Care Medicine. 2018;46(4):547-553. doi:10.1097/CCM.0000000000002936.
- Davis FD. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly. 1989;13(3):319-340. doi:10.2307/249008.
- Venkatesh V, Morris MG, Davis GB, Davis FD. User acceptance of information technology: toward a unified view. MIS Quarterly. 2003;27(3):425-478. doi:10.2307/30036540.
- Chen IY, Joshi S, Ghassemi M. Treating health disparities with artificial intelligence. Nature Medicine. 2021;27(1):16-17. doi:10.1038/s41591-020-01192-6.
- Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature Medicine. 2021;27(12):2176-2182. doi:10.1038/s41591-021-01595-0.
- Chambers DA, Glasgow RE, Stange KC. The dynamic sustainability framework: addressing the paradox of sustainment amid ongoing change. Implementation Science. 2013;8:117. doi:10.1186/1748-5908-8-117.
- Schell SF, Luke DA, Schooley MW, Elliott MB, Herbers SH, Mueller NB, et al. Public health program capacity for sustainability: a new framework. Implementation Science. 2013;8:15. doi:10.1186/1748-5908-8-15.
- Shaw T, McGregor D, Brunner M, Keep M, Janssen A, Barnet S. What is eHealth (6)? Development of a conceptual model for eHealth: qualitative study with key informants. Journal of Medical Internet Research. 2018;20(10):e10724. doi:10.2196/10724.
- Kvedar J, Coye MJ, Everett W. Connected health: a review of technologies and strategies to improve patient care with telemedicine and telehealth. Health Affairs. 2014;33(2):194-199. doi:10.1377/hlthaff.2013.0992.
- Lanham HJ, Leykum LK, Taylor BS, McCannon CJ, Lindberg C, Lester RT. How complexity science can inform scale-up and spread in health care: understanding the role of self-organization in variation across local contexts. Social Science & Medicine. 2013;93:194-202. doi:10.1016/j.socscimed.2012.05.040.
- Bloomrosen M, Starren J. Advancing the framework: use of health data—a report of the AMIA 2017 health policy meeting. Journal of the American Medical Informatics Association. 2017;25(4):442-448. doi:10.1093/jamia/ocx117.
- Furukawa MF, King J, Patel V, Hsiao C, Adler-Milstein J, Jha AK. Despite substantial progress In EHR adoption, health information exchange and patient engagement still lag in the United States. Health Affairs. 2017;33(9):1672-1679. doi:10.1377/hlthaff.2014.0485.
- Floridi L, Cowls J. A unified framework of five principles for AI in society. Harvard Data Science Review. 2019;1(1). doi:10.1162/99608f92.8cd550d1.
- Jobin A, Ienca M, Vayena E. The global landscape of AI ethics guidelines. Nature Machine Intelligence. 2019;1(9):389-399. doi:10.1038/s42256-019-0088-2.
- Aarons GA, Ehrhart MG, Farahnak LR, Sklar M. Aligning leadership across systems and organizations to develop a strategic climate for evidence-based practice implementation. Implementation Science. 2014;9:6. doi:10.1186/1748-5908-9-6.
- Birken SA, Lee SD, Weiner BJ. Uncovering middle managers' role in healthcare innovation implementation. Implementation Science. 2012;7:28. doi:10.1186/1748-5908-7-28.
- Weiner BJ. A theory of organizational readiness for change. Implementation Science. 2009;4:67. doi:10.1186/1748-5908-4-67.
- Helfrich CD, Li Y, Sharp ND, Sales AE. Organizational readiness to change assessment (ORCA): development of an instrument based on the promoting action on research in health services (PARIHS) framework. Implementation Science. 2009;4:38. doi:10.1186/1748-5908-4-38.
- King WR, He J. A meta-analysis of the technology acceptance model. Information & Management. 2006;43(6):740-755. doi:10.1016/j.im.2006.05.003.
- Melas CD, Zampetakis LA, Dimopoulou A, Moustakis V. Modeling the acceptance of clinical information systems among hospital medical staff: an extended TAM model. Journal of Biomedical Informatics. 2011;44(4):553-564. doi:10.1016/j.jbi.2011.01.009.
- O'Cathain A, Murphy E, Nicholl J. The quality of mixed methods studies in health services research. Journal of Health Services Research & Policy. 2008;13(2):92-98. doi:10.1258/jhsrp.2007.007074.
- Fetters MD, Curry LA, Creswell JW. Achieving integration in mixed methods designs—principles and practices. Health Services Research. 2013;48(6pt2):2134-2156. doi:10.1111/1475-6773.12117.
About this article
Cite this article
E.J. Topol, M.L. Matheny, R. Thadaney Israni, S. Vayena, A.B. Martin, T. Davenport (2026-03-29). Implementation Science for AI Integration in Digital Health Systems. Digital Health Implementation, 1(1), 1–21.
Received
January 15, 2026
Accepted
March 20, 2026
Published
March 29, 2026