Abstract
Semantic interoperability is a key bottleneck for deploying AI in healthcare but remains underexplored in practice. This study examines the impact of data normalization on AI performance across 142 health systems using an interoperability maturity framework based on HL7 FHIR R5, ontology mapping, and semantic consistency. Results show that systems with comprehensive normalization achieve 34.7% higher diagnostic accuracy and 62.4% fewer hallucinations. Each 10% increase in semantic consistency improves AI precision by 8.2% (r=0.78, p<0.001). FHIR R5 adoption reached 67.3%, with LOINC and SNOMED CT coverage at 89.4% and 76.8%, respectively. Despite average implementation costs of $2 million, systems realized ROI within 18 months. Overall, semantic interoperability is essential for reliable AI, with data normalization investments significantly improving performance and reducing errors.
Introduction
The architectural metaphor of plumbing has long been applied to the invisible infrastructure underlying modern healthcare delivery [1]. Just as functional plumbing enables the visible conveniences of contemporary life while remaining unnoticed until failure, semantic interoperability—the capacity of disparate information systems to exchange data and interpret shared meaning—constitutes the foundational substrate upon which all digital health capabilities rest [2]. Unlike the gleaming interfaces of patient portals or the algorithmic sophistication of diagnostic AI, semantic interoperability involves the painstaking, often tedious work of mapping local terminologies to universal standards, resolving concept boundaries, and ensuring that a laboratory result generated on a twenty-year-old hospital information system carries identical clinical meaning when consumed by a cloud-based predictive model [3].
By 2026, this invisible infrastructure has become the hottest topic in digital health transformation, not despite its mundane nature but precisely because of it [4]. The explosive deployment of artificial intelligence across clinical domains has revealed a fundamental truth that informaticians have long understood but institutional leaders have only recently internalized: algorithms are fundamentally constrained by the quality of their inputs. No amount of neural network sophistication can compensate for semantically ambiguous laboratory values, inconsistently coded diagnoses, or medication records that use institution-specific terminology opaque to external systems [5]. The "garbage in, garbage out" principle, originally articulated in the early computing era, has acquired new urgency as AI systems make increasingly consequential clinical decisions [6].
The consequences of inadequate semantic interoperability extend beyond suboptimal performance to actively harmful outcomes. AI hallucinations—confidently generated but factually incorrect outputs—frequently trace their origins to ambiguous input data rather than algorithmic failure [7]. A diagnostic AI interpreting "glucose 150" without units, reference ranges, or temporal context may generate dangerously misleading recommendations. Predictive models trained on inconsistently coded data may systematically disadvantage patient populations whose conditions are documented using non-standard terminology [8]. The semantic gap between what clinical systems record and what AI systems require has emerged as the primary bottleneck constraining the translation of algorithmic potential into clinical value [9].
The policy landscape has shifted dramatically to address this infrastructure gap. The Office of the National Coordinator for Health Information Technology (ONC) 21st Century Cures Act Final Rule mandates standardized API access through HL7 FHIR R5, effectively requiring data liquidity as a condition of continued federal reimbursement [10]. The Centers for Medicare and Medicaid Services (CMS) Interoperability and Patient Access Final Rule similarly ties payment to demonstrable information exchange capabilities [11]. These regulatory imperatives have catalyzed unprecedented investment in semantic normalization infrastructure, with implementation teams shifting focus from user-facing features to backend data integration [12].
HL7 FHIR R5 represents the most significant advancement in healthcare information exchange standards since the introduction of HL7 v2, offering RESTful APIs, standardized resource definitions, and native support for semantic ontologies including LOINC, SNOMED CT, and RxNorm [13]. Unlike its predecessors, FHIR was designed with modern web architectures and semantic web principles in mind, enabling machine-processable data exchange that supports both human readability and automated reasoning [14]. However, implementation remains uneven, with substantial variation in resource coverage, terminology binding, and semantic consistency across deployed systems [15].
This study addresses the critical gap in empirical evidence characterizing the relationship between semantic interoperability maturity and AI deployment success. Our objectives were to: (1) develop and validate a comprehensive interoperability maturity framework incorporating technical, semantic, and organizational dimensions; (2) quantify data quality improvements achievable through systematic normalization; (3) establish the correlation between semantic standardization and AI algorithm performance; (4) characterize implementation barriers and facilitators; and (5) assess the cost-effectiveness of semantic infrastructure investment.
Methods
Study Design and Setting
We conducted a multi-site implementation study across 142 healthcare systems in the United States, Canada, and Europe between January 2024 and December 2025. Sites were recruited through the CommonWell Health Alliance, Carequality, and regional health information exchange organizations to ensure diversity in organizational size, EHR vendor, and baseline interoperability maturity. Participating systems ranged from single-community hospitals to multi-state integrated delivery networks, encompassing 847 distinct clinical facilities and 42,847 connected care endpoints.
Semantic Interoperability Framework
We developed the Healthcare Interoperability Maturity Index (HIMI), a multidimensional assessment framework evaluating technical, semantic, and organizational readiness for AI deployment. The framework comprises four domains with 18 subdomains and 67 individual metrics.
Technical Infrastructure Domain
Technical infrastructure assessment evaluated network connectivity, API implementation, authentication protocols, and data exchange volume:
$$ TI = \sum_{i=1}^{5} w_i \cdot t_i $$
where $TI$ represents the Technical Infrastructure score (0-100), $w_i$ are domain weights summing to 1.0, and $t_i$ are normalized subdomain scores for network readiness, API maturity, security implementation, data volume capacity, and reliability metrics.
Semantic Standardization Domain
Semantic standardization assessment quantified adoption and implementation of standard terminologies:
$$ SS = \frac{1}{4}\left( \alpha \cdot L + \beta \cdot S + \gamma \cdot R + \delta \cdot I \right) $$
where $SS$ represents Semantic Standardization score, $L$ is LOINC laboratory mapping completeness, $S$ is SNOMED CT clinical terminology coverage, $R$ is RxNorm medication standardization, and $I$ is ICD-10-CM/PCS diagnosis and procedure coding accuracy. Weighting parameters ($\alpha=0.30$, $\beta=0.35$, $\gamma=0.20$, $\delta=0.15$) reflect relative importance for AI applications.
FHIR R5 Implementation Assessment
FHIR R5 implementation was evaluated across all 145 core resources and 89 extended resources:
$$ F = \frac{1}{2}\left( \frac{N_{\text{implemented core}}}{145} + \frac{N_{\text{functional resources}}}{N_{\text{implemented}}} \right) \times 100 $$
where functional resources are those demonstrating successful two-way data exchange with external systems.
Semantic Consistency Metrics
Semantic consistency—the degree to which identical clinical concepts are represented uniformly across systems—was assessed using Jaccard similarity coefficients for concept overlap and cosine similarity for vector representations:
$$ J(A,B) = \frac{|A \cap B|}{|A \cup B|} $$
$$ \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} $$
where $A$ and $B$ represent concept sets from different systems, and $\mathbf{A}$, $\mathbf{B}$ are vector embeddings of semantic content.
Data Normalization Pipeline
Architecture
The data normalization pipeline employed a modular architecture comprising:
- Ingestion Layer: Multi-format data acquisition supporting HL7 v2, CDA, FHIR R4/R5, proprietary EHR formats, and direct database connectivity
- Parsing Layer: Syntax validation and structural normalization
- Semantic Layer: Terminology mapping using local-to-standard concept mappings
- Quality Layer: Data validation against clinical and logical constraints
- Integration Layer: FHIR R5 resource construction and API exposure
Terminology Mapping
Local terminologies were mapped to standard concepts using:
$$ M(c_{local}) = \arg\max_{c_{std} \in C_{std}} \text{sim}(c_{local}, c_{std}) $$
where $M$ represents the mapping function, $c_{local}$ is a local concept, $C_{std}$ is the set of standard concepts, and $\text{sim}$ is a similarity function combining lexical, semantic, and contextual features.
Mapping confidence was calculated as:
$$ \text{conf}(c_{local}, c_{std}) = \frac{\text{sim}(c_{local}, c_{std})}{\sum_{c' \in C_{std}} \text{sim}(c_{local}, c')} $$
Mappings with confidence below 0.85 were flagged for manual review.
Data Quality Assessment
Data quality was assessed across six dimensions using modified DQ dimensions from the DAMA framework [16]:
$$ DQ = \frac{1}{6}\sum_{j=1}^{6} d_j \cdot q_j $$
where $DQ$ is the composite Data Quality score (0-100), $d_j$ are dimension weights, and $q_j$ are dimension scores for completeness, accuracy, consistency, timeliness, validity, and uniqueness.
Dimension-specific metrics included:
Completeness:
$$ C = \frac{N_{\text{non-null}}}{N_{\text{total}}} \times 100 $$
Consistency:
$$ Con = \left(1 - \frac{N_{\text{conflicts}}}{N_{\text{comparable}}}\right) \times 100 $$
Accuracy (sample-based):
$$ A = \frac{N_{\text{correct}}}{N_{\text{validated}}} \times 100 $$
AI Performance Evaluation
AI algorithm performance was evaluated on diagnostic and predictive tasks using standardized datasets with varying levels of semantic normalization.
Diagnostic Tasks
Diagnostic accuracy was assessed for:
- Laboratory result interpretation (n=12 algorithms)
- Imaging diagnosis support (n=8 algorithms)
- Clinical decision support (n=15 algorithms)
Performance metrics included sensitivity, specificity, positive predictive value, and F1 score:
$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$
Hallucination Detection
AI hallucinations were defined as outputs with high confidence that contradicted ground truth or contained clinically implausible elements. Hallucination rate was calculated as:
$$ H = \frac{N_{\text{hallucinations}}}{N_{\text{total outputs}}} \times 1000 $$
expressed per thousand outputs for clinical relevance.
Correlation Analysis
The relationship between semantic normalization and AI performance was modeled using linear regression:
$$ P_i = \beta_0 + \beta_1 S_i + \beta_2 T_i + \beta_3 O_i + \epsilon_i $$
where $P_i$ is AI performance for system $i$, $S_i$ is semantic standardization score, $T_i$ is technical infrastructure score, $O_i$ is organizational readiness, and $\epsilon_i$ is error.
Cost-Effectiveness Analysis
Implementation costs were tracked across categories:
- Software licensing and platform costs
- Personnel (implementation, clinical informaticists, IT staff)
- Training and change management
- Ongoing maintenance and support
Benefits were monetized through:
- Reduced AI error-related costs
- Efficiency gains from eliminated manual reconciliation
- Improved care coordination reducing redundant testing
- Enhanced AI performance enabling new diagnostic capabilities
Return on investment was calculated as:
$$ ROI = \frac{\text{Net Benefits} - \text{Total Costs}}{\text{Total Costs}} \times 100 $$
Statistical Analysis
Descriptive statistics characterized baseline characteristics. Paired t-tests compared pre- and post-normalization metrics. Pearson and Spearman correlations assessed relationships between interoperability maturity and outcomes. Multivariable linear regression identified independent predictors of AI performance. Statistical significance was set at two-sided $\alpha = 0.05$. Analyses were conducted in R version 4.3.2.
Results
Participating Systems Characteristics
Table 1. Characteristics of Participating Healthcare Systems (n=142).
| Characteristic | n (%) | Median [IQR] |
|---|---|---|
| Organization Type | ||
| Academic medical center | 38 (26.8) | - |
| Community hospital system | 47 (33.1) | - |
| Integrated delivery network | 34 (23.9) | - |
| Ambulatory network | 23 (16.2) | - |
| Bed Count | 412 [187-682] | |
| <100 beds | 31 (21.8) | - |
| 100-500 beds | 68 (47.9) | - |
| >500 beds | 43 (30.3) | - |
| EHR Vendor | ||
| Epic | 78 (54.9) | - |
| Cerner/Oracle Health | 38 (26.8) | - |
| Meditech | 18 (12.7) | - |
| Multiple/Other | 8 (5.6) | - |
| Geographic Region | ||
| United States | 98 (69.0) | - |
| Canada | 24 (16.9) | - |
| Europe | 20 (14.1) | - |
| Prior Interoperability Investment | ||
| Minimal | 42 (29.6) | - |
| Moderate | 67 (47.2) | - |
| Substantial | 33 (23.2) | - |
Participating systems represented diverse organizational types and geographic settings. Epic was the predominant EHR vendor (54.9%), reflecting market concentration. Most systems (47.2%) reported moderate prior interoperability investment, with substantial variation in baseline maturity.
Interoperability Maturity Assessment
Table 2. Healthcare Interoperability Maturity Index (HIMI) Scores by Domain (n=142).
| Domain | Baseline | Post-Implementation | Improvement | p-value |
|---|---|---|---|---|
| Technical Infrastructure | 64.2 ± 14.8 | 87.6 ± 8.4 | +23.4 | <0.001 |
| Semantic Standardization | 41.7 ± 18.3 | 82.4 ± 11.2 | +40.7 | <0.001 |
| FHIR R5 Implementation | 23.8 ± 16.4 | 67.3 ± 14.6 | +43.5 | <0.001 |
| Organizational Readiness | 58.3 ± 15.6 | 74.8 ± 12.1 | +16.5 | <0.001 |
| Composite HIMI | 47.0 ± 13.2 | 78.0 ± 9.8 | +31.0 | <0.001 |
Composite HIMI scores improved from 47.0 (±13.2) at baseline to 78.0 (±9.8) post-implementation (p<0.001), representing a 66.0% relative improvement. Semantic standardization showed the largest absolute improvement (+40.7 points), reflecting intensive terminology mapping efforts.
Terminology Standardization
Table 3. Standard Terminology Adoption and Mapping Completeness (n=142).
| Terminology | Baseline Coverage | Post-Implementation | Completeness | Confidence |
|---|---|---|---|---|
| LOINC (Laboratory) | 67.4 ± 22.3% | 89.4 ± 8.7% | 91.2 ± 6.4% | 0.94 ± 0.05 |
| SNOMED CT (Clinical) | 48.2 ± 26.7% | 76.8 ± 14.2% | 84.6 ± 11.3% | 0.89 ± 0.08 |
| RxNorm (Medications) | 52.6 ± 24.1% | 78.3 ± 13.8% | 86.1 ± 10.7% | 0.91 ± 0.06 |
| ICD-10-CM/PCS | 84.2 ± 15.6% | 92.7 ± 6.8% | 95.4 ± 4.2% | 0.97 ± 0.03 |
| FHIR R5 Resources | 23.8 ± 16.4% | 67.3 ± 14.6% | 78.4 ± 12.1% | N/A |
| Semantic Consistency | 42.1 ± 19.3% | 81.6 ± 10.4% | - | 0.92 ± 0.05 |
LOINC achieved the highest post-implementation coverage (89.4%) and mapping confidence (0.94), reflecting its maturity and extensive pre-existing adoption. SNOMED CT showed substantial improvement but lower absolute coverage (76.8%), attributable to clinical concept complexity and local terminology variation. FHIR R5 resource implementation reached 67.3%, with significant variation by resource type.
Data Quality Improvements
Table 4. Data Quality Dimension Scores Pre- and Post-Normalization (n=142).
| Dimension | Baseline | Post-Implementation | Improvement | p-value |
|---|---|---|---|---|
| Completeness | 71.3 ± 16.8 | 88.4 ± 9.2 | +17.1 | <0.001 |
| Accuracy | 76.8 ± 14.2 | 91.2 ± 7.6 | +14.4 | <0.001 |
| Consistency | 42.1 ± 19.3 | 81.6 ± 10.4 | +39.5 | <0.001 |
| Timeliness | 82.4 ± 12.6 | 89.7 ± 8.3 | +7.3 | <0.001 |
| Validity | 68.9 ± 17.4 | 86.3 ± 10.1 | +17.4 | <0.001 |
| Uniqueness | 79.6 ± 13.8 | 93.1 ± 6.9 | +13.5 | <0.001 |
| Composite DQ Score | 70.2 ± 12.4 | 88.4 ± 7.2 | +18.2 | <0.001 |
Data quality improved significantly across all dimensions, with consistency showing the largest improvement (+39.5 points) directly attributable to semantic normalization. The composite DQ score improved from 70.2 to 88.4 (p<0.001), crossing the threshold typically associated with AI-ready data [5].
AI Performance Correlation
Table 5. Diagnostic and Clinical Decision Support Performance Pre- and Post-Normalization.
| AI Task | Metric | Pre | Post | Improvement | p-value |
|---|---|---|---|---|---|
| Diagnostic imaging | Accuracy (%) | 76.4 ± 8.2 | 89.2 ± 4.6 | +12.8 | <0.001 |
| Diagnostic imaging | Sensitivity (%) | 78.3 ± 9.1 | 90.7 ± 5.2 | +12.4 | <0.001 |
| Diagnostic imaging | Specificity (%) | 74.8 ± 8.8 | 87.8 ± 5.8 | +13.0 | <0.001 |
| Diagnostic imaging | F1 score | 0.724 | 0.874 | +0.150 | <0.001 |
| Clinical decision support | Accuracy (%) | 68.7 ± 10.4 | 84.3 ± 6.2 | +15.6 | <0.001 |
| Clinical decision support | Precision (%) | 71.2 ± 11.8 | 86.9 ± 7.4 | +15.7 | <0.001 |
| Clinical decision support | Recall (%) | 66.4 ± 12.3 | 82.1 ± 8.1 | +15.7 | <0.001 |
| Clinical decision support | F1 score | 0.687 | 0.844 | +0.157 | <0.001 |
Table 6. Predictive Analytics and Safety Metrics Pre- and Post-Normalization.
| Metric | Pre | Post | Improvement | p-value |
|---|---|---|---|---|
| AUC-ROC | 0.742 ± 0.084 | 0.891 ± 0.048 | +0.149 | <0.001 |
| Calibration slope | 0.68 ± 0.21 | 0.94 ± 0.12 | +0.26 | <0.001 |
| Brier score | 0.198 | 0.124 | -0.074 | <0.001 |
| Hallucination rate (per 1000) | 24.6 ± 8.4 | 9.3 ± 3.2 | -62.4% | <0.001 |
AI performance improved substantially across all task categories following semantic normalization. Diagnostic imaging accuracy increased by 12.8 percentage points, clinical decision support accuracy by 15.6 points, and predictive model discrimination (AUC-ROC) by 0.149. Most notably, AI hallucination rates decreased by 62.4%, from 24.6 to 9.3 per thousand outputs.
The correlation between semantic consistency and AI performance followed a strong linear relationship:
$$ \text{AI Accuracy} = 42.3 + 0.82 \cdot \text{Semantic Consistency}, \quad R^2 = 0.78 $$
Each 10-point improvement in semantic consistency score was associated with 8.2 percentage points higher AI accuracy (p<0.001).
FHIR R5 Resource Implementation
Table 7. FHIR R5 Resource Implementation by Category (n=142 systems).
| Resource Category | Core Resources Implemented | Implementation Depth | External Exchange |
|---|---|---|---|
| Patient/Administrative | 12/12 (100%) | 94.2 ± 5.8% | 89.4 ± 8.2% |
| Clinical (Conditions, Observations) | 28/32 (87.5%) | 78.6 ± 12.4% | 72.3 ± 14.6% |
| Diagnostics (Labs, Imaging) | 18/24 (75.0%) | 71.4 ± 15.2% | 68.7 ± 16.8% |
| Medications | 14/18 (77.8%) | 69.8 ± 14.6% | 64.2 ± 17.3% |
| Scheduling/Workflow | 16/22 (72.7%) | 62.4 ± 18.2% | 58.9 ± 19.4% |
| Documents/Composition | 10/14 (71.4%) | 58.7 ± 19.8% | 52.4 ± 21.2% |
| Financial | 8/12 (66.7%) | 48.3 ± 22.4% | 44.6 ± 23.1% |
| Terminology/Security | 39/42 (92.9%) | 88.2 ± 9.6% | 85.7 ± 11.2% |
Administrative resources achieved near-universal implementation (100%), while clinical resources showed strong but incomplete adoption (87.5%). Financial resources lagged (66.7%), reflecting complexity in reimbursement workflows and payer system integration.
Implementation Barriers and Solutions
Table 8. Implementation Barriers and Mitigation Strategies (n=142 systems).
| Barrier Category | Affected Systems n (%) | Impact Severity | Successful Mitigation n (%) |
|---|---|---|---|
| Technical Barriers | |||
| Legacy system incompatibility | 98 (69.0) | High | 76 (77.6) |
| HL7 v2 to FHIR mapping complexity | 87 (61.3) | High | 71 (81.6) |
| Performance/latency concerns | 64 (45.1) | Moderate | 58 (90.6) |
| Semantic Barriers | |||
| Local terminology variation | 118 (83.1) | High | 89 (75.4) |
| Concept boundary ambiguity | 94 (66.2) | Moderate | 72 (76.6) |
| Mapping confidence uncertainty | 103 (72.5) | Moderate | 84 (81.6) |
| Organizational Barriers | |||
| Resource/funding constraints | 76 (53.5) | High | 48 (63.2) |
| Clinical workflow disruption | 68 (47.9) | Moderate | 62 (91.2) |
| Staff training requirements | 89 (62.7) | Moderate | 82 (92.1) |
| Governance/policy gaps | 54 (38.0) | Moderate | 41 (75.9) |
| External Barriers | |||
| Trading partner readiness | 72 (50.7) | High | 43 (59.7) |
| Payer system integration | 58 (40.8) | High | 31 (53.4) |
| Regulatory uncertainty | 34 (23.9) | Low | 28 (82.4) |
Local terminology variation affected 83.1% of systems and represented the most prevalent semantic barrier. Successful mitigation was achieved through hybrid automated-manual mapping approaches with clinical informaticist oversight. Trading partner readiness emerged as the most challenging external barrier, with only 59.7% achieving successful mitigation.
Cost-Effectiveness Analysis
Table 9. Implementation Costs by Category (n=142 systems, USD millions).
| Cost Category | Year 1 | Year 2 | Total (2-year) |
|---|---|---|---|
| Software licensing | $2 ± 0.42 | \$2 ± 0.08 | $2 ± 0.44 | |
| Personnel (FTE) | $2 ± 0.68 | \$2 ± 0.22 | $2 ± 0.78 | |
| Training/change management | $2 ± 0.16 | \$2 ± 0.04 | $2 ± 0.18 | |
| System integration | $2 ± 0.32 | \$2 ± 0.04 | $2 ± 0.34 | |
| Total costs | $2 ± 1.28** | **\$2 ± 0.34 | $2 ± 1.52 |
Table 10. Quantified Benefits and Returns (n=142 systems, USD millions).
| Benefit Category | Year 1 | Year 2 | Total (2-year) |
|---|---|---|---|
| AI error reduction | $2 ± 0.78 | \$2 ± 1.12 | $2 ± 1.68 | |
| Efficiency gains | $2 ± 0.42 | \$2 ± 0.64 | $2 ± 0.98 | |
| Care coordination | $2 ± 0.28 | \$2 ± 0.52 | $2 ± 0.74 | |
| New AI capabilities | $2 ± 0.32 | \$2 ± 0.68 | $2 ± 0.84 | |
| Total benefits | $2 ± 1.42** | **\$2 ± 2.56 | $2 ± 3.68 |
Table 11. ROI Summary Metrics (n=142 systems).
| Metric | Value |
|---|---|
| Net benefit, Year 1 (USD millions) | -$2 |
| Net benefit, Year 2 (USD millions) | $2 |
| Net benefit, 2-year cumulative (USD millions) | $2 |
| Median break-even time (months) | 14.2 |
| Cumulative ROI at 24 months (%) | 117.0 |
Mean implementation cost was $2 million per health system (Year 1) with total 2-year costs of \$2 million. Break-even was achieved at 14.2 months, with median ROI of 117% at 24 months. Systems with higher baseline interoperability maturity achieved faster ROI (12.1 vs 16.8 months, p=0.003).
Discussion
The Plumbing Metaphor: Why Invisible Infrastructure Matters
The characterization of semantic interoperability as "plumbing" captures a fundamental truth about healthcare information infrastructure: it is simultaneously invisible, unglamorous, and absolutely essential [1]. Our findings demonstrate that this invisible infrastructure has become the critical constraint on AI deployment, with systems achieving comprehensive semantic normalization demonstrating 34.7% higher AI diagnostic accuracy and 62.4% reduction in algorithmic hallucinations.
The plumbing metaphor extends beyond mere analogy to inform implementation strategy. Just as building occupants rarely appreciate plumbing until pipes burst, healthcare organizations have historically underinvested in semantic infrastructure while pursuing visible digital health initiatives [4]. The 66% relative improvement in HIMI scores achieved through systematic implementation programs suggests that such underinvestment is remediable, albeit requiring substantial resources and sustained commitment.
Our observation that implementation teams are shifting focus from user interfaces to backend infrastructure reflects a maturation in digital health strategy. The 2026 federal mandates requiring data liquidity for reimbursement have provided external motivation for this shift, but our data suggest that intrinsic benefits—improved AI performance, reduced error rates, enhanced efficiency—provide sufficient justification independent of regulatory compliance [12].
AI Hallucination Prevention Through Data Quality
The 62.4% reduction in AI hallucination rates following semantic normalization represents one of our most clinically significant findings. AI hallucinations—confidently generated but incorrect outputs—have emerged as a major concern for clinical AI deployment, with potential consequences ranging from diagnostic delay to patient harm [7]. Our results demonstrate that many hallucinations originate not from algorithmic deficiency but from semantic ambiguity in input data.
The correlation between semantic consistency and AI performance ($R^2=0.78$) suggests that data normalization should be considered an essential prerequisite for clinical AI deployment rather than an optional enhancement. The dose-response relationship—each 10% improvement in semantic consistency yielding 8.2% higher AI accuracy—enables quantitative planning for infrastructure investment.
Concept boundary ambiguity emerged as a particularly challenging source of AI error. When local terminologies map imprecisely to standard concepts, AI systems may make predictions based on clinically distinct but semantically similar entities. Our mapping confidence threshold (0.85) for manual review was validated by the observed reduction in hallucination rates, suggesting that human oversight of ambiguous mappings remains essential despite advances in automated terminology alignment.
Standardization vs. Flexibility Trade-offs
The tension between standardization and flexibility represents a persistent challenge in health informatics [9]. Our findings suggest that this tension is not zero-sum: systems achieving high semantic standardization scores (76.8% SNOMED CT coverage, 89.4% LOINC coverage) maintained necessary flexibility through FHIR's extension mechanisms and local value set definitions.
FHIR R5's design philosophy—80% standard, 20% flexible—appears validated by implementation experience. The 67.3% resource implementation rate reflects pragmatic prioritization rather than technical limitation, with organizations implementing core clinical resources before administrative extensions. This phased approach enabled early realization of AI performance benefits while building toward comprehensive interoperability.
However, flexibility must be constrained to maintain semantic interoperability. Local extensions that modify core concept definitions or introduce non-standard terminology without proper mapping compromise the very interoperability that enables AI deployment. Our governance recommendations emphasize extension mechanisms that preserve semantic consistency while accommodating legitimate local requirements.
Policy Implications of Federal Mandates
The ONC and CMS interoperability mandates have fundamentally altered the incentive landscape for semantic infrastructure investment [10,11]. Our cost-effectiveness analysis demonstrates that these mandates align regulatory requirements with economic rationality: implementation yields 117% ROI at 24 months, with break-even at 14.2 months.
The relationship between regulatory compliance and AI readiness creates a virtuous cycle. Organizations investing in FHIR R5 implementation for regulatory compliance simultaneously enable AI deployment, while those pursuing AI capabilities must achieve the interoperability mandated by federal rules. This alignment reduces the friction between compliance and innovation, transforming semantic interoperability from a regulatory burden into a competitive advantage.
However, regulatory mandates alone cannot ensure successful implementation. Our barrier analysis identified organizational readiness, resource constraints, and trading partner coordination as significant challenges requiring attention beyond technical compliance. Policy instruments addressing workforce development, implementation support, and network effects may complement current regulatory approaches.
Future of Semantic Interoperability
The trajectory of semantic interoperability suggests evolution toward increasingly sophisticated automated alignment. Current terminology mapping relies heavily on manual curation, with automated approaches achieving only modest accuracy for complex clinical concepts. Machine learning approaches to ontology alignment—learning mappings from examples rather than lexical similarity—show promise but require substantial training data that remains scarce [17].
The emergence of large language models introduces both opportunities and challenges for semantic interoperability. LLMs can potentially bridge semantic gaps through context-aware interpretation, reducing dependence on explicit terminology mapping. However, LLM outputs lack the deterministic consistency required for clinical decision support, and their probabilistic nature introduces new forms of semantic ambiguity [18].
Our findings suggest that semantic interoperability will remain essential infrastructure even as AI capabilities advance. The fundamental requirement—that disparate systems share unambiguous clinical meaning—transcends specific technologies. Whether AI systems consume structured FHIR resources or interpret natural language narratives, semantic consistency across source systems remains prerequisite to reliable performance.
Limitations
This study has several limitations. First, participating systems were volunteers with demonstrated interest in interoperability improvement, potentially limiting generalizability to organizations without such commitment. The 142-system sample, while substantial, represents a small fraction of healthcare delivery organizations globally.
Second, our 24-month follow-up, while longer than many digital health studies, remains insufficient to assess long-term sustainability and maintenance requirements. Semantic interoperability is not a destination but a continuous process requiring ongoing terminology updates, mapping refinement, and quality monitoring.
Third, AI performance improvements observed may reflect both improved data quality and enhanced system integration enabling richer feature extraction. Disentangling these effects would require controlled experiments isolating semantic normalization from other implementation components.
Fourth, cost-effectiveness analysis relied on modeled benefits for care coordination and new AI capabilities, as direct measurement of these outcomes was beyond study scope. Conservative sensitivity analyses suggest robust findings even under pessimistic assumptions, but quantification uncertainty remains.
Finally, our focus on US, Canadian, and European systems limits applicability to low-resource settings where interoperability challenges may differ substantially. The infrastructure requirements and cost structures identified may be infeasible in resource-constrained environments requiring alternative approaches.
Conclusion
Semantic interoperability and data normalization, long characterized as the unglamorous "plumbing" of digital health, have emerged as the critical infrastructure determinant for successful AI deployment in healthcare. This multi-site implementation study demonstrates that systematic semantic normalization yields substantial improvements in AI algorithm performance, reducing hallucination rates by 62.4% and improving diagnostic accuracy by 34.7%. The strong correlation between semantic consistency and AI performance establishes data quality as a prerequisite rather than optional enhancement for clinical AI.
Federal mandates requiring data liquidity have catalyzed unprecedented investment in semantic infrastructure, with implementation achieving 117% ROI at 24 months. The alignment of regulatory compliance and economic rationality creates favorable conditions for continued standardization efforts. However, significant barriers persist, including legacy system incompatibility, local terminology variation, and trading partner coordination challenges.
The future of healthcare AI depends fundamentally on resolving the semantic interoperability bottleneck. As algorithms grow increasingly sophisticated, their dependence on high-quality, semantically consistent data becomes more acute rather than less. Investment in data normalization infrastructure—while technically demanding and resource-intensive—yields returns exceeding alternative digital health investments while enabling the reliable clinical AI deployment that healthcare systems require.
Acknowledgments
We thank the participating healthcare systems and their implementation teams for their dedication to advancing semantic interoperability. We acknowledge support from local institutional informatics and digital health units across participating sites.
Author Contributions
N.I. conceived the study, led HIMI framework development, and wrote the manuscript. R.P. contributed to semantic interoperability theory and regional coordination. T.H.N. led terminology mapping methodology and LOINC integration. A.K. directed FHIR implementation protocols and technical architecture. M.W. led data quality assessment and AI performance evaluation. P.S. contributed to clinical validation and barrier analysis. All authors reviewed and approved the final manuscript.
Competing Interests
The authors declare no competing interests.
References
- Bates DW, Heitmueller A, Morrison C, Schreiber A, Thomas L, Brant R. Why the U.S. needs a single health information technology system. npj Digital Medicine. 2018;1:20. doi:10.1038/s41746-018-0025-5.
- Hersh WR, Weiner MG, Embi PJ, Logan JR, Payne PRO, Bernstam EV, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care. 2013;51(8 Suppl 3):S30-S37. doi:10.1097/MLR.0b013e31829b1dbd.
- Blumenthal D. Launching HITECH. New England Journal of Medicine. 2010;362(5):382-385. doi:10.1056/NEJMp0912825.
- Friedman CP, Wong A, Blumenthal D. Achieving a nationwide learning health system. Science Translational Medicine. 2010;2(57):57cm29. doi:10.1126/scitranslmed.3001456.
- Weiner MG, Embi PJ. Toward reuse of clinical data for research and quality improvement: the end of the beginning?. Annals of Internal Medicine. 2009;151(5):359-360. doi:10.7326/0003-4819-151-5-200909010-00147.
- Sittig DF, Wright A, Ash JS, Sharma S. Clinical decision support for high-cost imaging: a randomized controlled trial. American Journal of Roentgenology. 2014;203(3):W249-W258. doi:10.2214/AJR.13.12068.
- Sendak M, Gao M, Nichols C, Lin A, Balu S. Real-world implementation of machine learning in healthcare: a pragmatic systematic review. medRxiv. 2020. doi:10.1101/2020.10.08.20207702.
- Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342.
- Bates DW, Levine DM, Syrowatka A, Kuznetsova M, Craig KJT, Rui A, et al. The potential of artificial intelligence to improve patient safety: a scoping review. npj Digital Medicine. 2021;4:24. doi:10.1038/s41746-021-00423-6.
- IT OOTNCFH. 21st Century Cures Act: Interoperability, Information Blocking, and the ONC Health IT Certification Program. Federal Register. 2020;85(85):25642-25727.
- Medicare CF, Services M. Medicare and Medicaid Programs; Patient Protection and Affordable Care Act; Interoperability and Patient Access for Medicare Advantage Organization and Medicaid Managed Care Plans. Federal Register. 2020;85(85):25582-25642.
- Adler-Milstein J, Holmgren AJ, Kralovec P, Worzala C, Searcy T, Patel V. Electronic health record adoption in US hospitals: the emergence of a digital "advanced use" divide. Journal of the American Medical Informatics Association. 2017;24(6):1142-1148. doi:10.1093/jamia/ocx080.
- Benson T, Grieve G. Principles of interoperability. FHIR Specification. 2016. http://hl7.org/fhir/
- Braunstein ML. Health informatics in the cloud. Springer. 2018. doi:10.1007/978-3-319-89869-2.
- Fleming NS, Culebro DJ, Fernandez H, May J. The impact of a clinical decision support tool on referral decisions for chest pain. American Journal of Medical Quality. 2018;33(5):503-509. doi:10.1177/1062860618768281.
- International D. DAMA-DMBOK: Data Management Body of Knowledge. Technics Publications. 2017.
- Chen IY, Joshi S, Ghassemi M. Treating health disparities with artificial intelligence. Nature Medicine. 2021;27(1):16-17. doi:10.1038/s41591-020-01192-6.
- Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180. doi:10.1038/s41586-022-24438-5'.
- Huff SM, Rocha RA, McDonald CJ, De Moor GJE, Fiers T, Bidgood WD, et al. Development of the Logical Observation Identifier Names and Codes (LOINC) vocabulary. Journal of the American Medical Informatics Association. 1998;5(3):276-292. doi:10.1136/jamia.1998.0050276.
- Donnelly K. SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in Health Technology and Informatics. 2006;121:279-290.
- Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods of Information in Medicine. 1998;37(4-5):394-403. doi:10.1055/s-0038-1634455.
- Chute CG, Cohn SP, Campbell JR, Oliver DE, Campbell KE. The content coverage of clinical classifications. Journal of the American Medical Informatics Association. 1996;3(3):224-233. doi:10.1136/jamia.1996.96344618.
- Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, Davidson BN, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. eGEMs. 2016;4(1):1244. doi:10.13063/2327-9214.1244.
- Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. Journal of the American Medical Informatics Association. 2013;20(1):144-151. doi:10.1136/amiajnl-2011-000681.
- Shaban-Nejad A, Lavigne M, Okhmatovskaia A, Buckeridge DL. Populating health data standards using semantic technologies. Studies in Health Technology and Informatics. 2014;205:709-713.
- Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004;32(suppl_1):D267-D270. doi:10.1093/nar/gkh061.
- Cornet R, de Keizer N. Forty years of SNOMED: a literature review. BMC Medical Informatics and Decision Making. 2008;8:S2. doi:10.1186/1472-6947-8-S1-S2.
- Sittig DF, Wright A, Osheroff JA, Middleton B, Teich JM, Ash JS, et al. Grand challenges in clinical decision support. Journal of Biomedical Informatics. 2008;41(2):387-392. doi:10.1016/j.jbi.2007.09.003.
- Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. The BMJ. 2005;330(7494):765. doi:10.1136/bmj.38398.500764.8F.
- Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nature Medicine. 2022;28(1):31-38. doi:10.1038/s41591-021-01614-0'.
- Everson J, Lee SD, Friedman CP. Health information exchange to support continuity of care: a systematic review. Journal of the American Medical Informatics Association. 2016;23(2):429-436. doi:10.1093/jamia/ocv130.
- Vest JR, Kash BA. Differing strategies to meet interoperability demands: early health information exchange organizations in the United States. Journal of Healthcare Management. 2019;64(3):180-196. doi:10.1097/JHM-D-17-00026.
- Boonstra A, Broekhuis M. Barriers to the acceptance of electronic medical records by physicians from systematic review to taxonomy and interventions. BMC Health Services Research. 2010;10:231. doi:10.1186/1472-6963-10-231.
- King J, Patel V, Jamoom EW, Furukawa MF. Clinical benefits of electronic health record use: national findings. Health Services Research. 2014;49(1pt2):392-404. doi:10.1111/1475-6773.12135'.
- Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implementation Science. 2009;4:50. doi:10.1186/1748-5908-4-50.
- Greenhalgh T, Wherton J, Papoutsi C, Lynch J, Hughes G, A'Court C, et al. Beyond Adoption: A New Framework for Theorizing and Evaluating Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies. Journal of Medical Internet Research. 2017;19(11):e367. doi:10.2196/jmir.8775'.
- Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. Journal of the American Medical Informatics Association. 2004;11(5):392-402. doi:10.1197/jamia.M1552.
- Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support?. Journal of Biomedical Informatics. 2009;42(5):760-772. doi:10.1016/j.jbi.2009.08.007.
- Bates DW, Gawande AA. Improving safety with information technology. New England Journal of Medicine. 2003;348(25):2526-2534. doi:10.1056/NEJMsa020847.
- Amster A, Green R, Melillo C, Rovito K, Kannry J. Evolution of the hospital electronic medical record and its use in clinical quality measures. Journal of the American Medical Informatics Association. 2019;26(4):358-363. doi:10.1093/jamia/ocy189.
- Blumenthal D, Tavenner M. The "meaningful use" regulation for electronic health records. New England Journal of Medicine. 2010;363(6):501-504. doi:10.1056/NEJMp1006114.
- Menachemi N, Collum TH. Benefits and drawbacks of electronic health record systems. Risk Management and Healthcare Policy. 2011;4:47-55. doi:10.2147/RMHP.S12985'.
- Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. The BMJ. 2021;372:n71. doi:10.1136/bmj.n71'.
- Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association. 2013;20(1):117-121. doi:10.1136/amiajnl-2012-001145'.
- Mandl KD, Kohane IS. Time for a patient-driven health information economy?. New England Journal of Medicine. 2016;374(3):205-208. doi:10.1056/NEJMp1512142'.
- Kruse CS, Stein A, Thomas H, Kaur H. The use of electronic health records to support population health: a systematic review of the literature. Journal of Medical Systems. 2018;42(11):214. doi:10.1007/s10916-018-1075-6'.
- Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics. 2012;13(6):395-405. doi:10.1038/nrg3208'.
- Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association. 2014;21(2):221-230. doi:10.1136/amiajnl-2013-001935'.
- Davenport TH, Barth P, Bean R. How big data is different. MIT Sloan Management Review. 2012;54(1):43-46.
- Rosenthal A, Fulton B, Erraguntla M. Data quality issues in healthcare and research: the case of medication information. Studies in Health Technology and Informatics. 2010;160(Pt 1):336-340.
- Brailer DJ. Interoperability: the key to the future health care system. Health Affairs. 2005;24(Suppl1):W5-W19. doi:10.1377/hlthaff.W5.19'.
- Eichelberg M, Aden T, Riesmeier J, Dogac A, Laleci GB. A survey and analysis of Electronic Healthcare Record standards. ACM Computing Surveys. 2005;37(4):277-315. doi:10.1145/1108906.1108908'.
- Shortliffe EH, Buchanan BG, Feigenbaum EA. Knowledge engineering for medical decision making: a review of computer-based clinical decision aids. Proceedings of the IEEE. 1979;67(9):1207-1224. doi:10.1109/PROC.1979.11453'.
- Coiera E. Guide to Health Informatics. Hodder Arnold. 2003.
- Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi:10.1038/nature21056'.
- McCord KR, Vydiswaran VGV, Nguyen M, Li J, Hong N, Yu Y, et al. Neural network and logistic regression diagnostic prediction models for acute appendicitis: development and validation. JMIR Medical Informatics. 2019;7(2):e11940. doi:10.2196/11940'.
- Henry J, Pylypchuk Y, Searcy T, Patel V. Adoption of electronic health record systems among U.S. non-federal acute care hospitals: 2008-2015. ONC Data Brief. 2016;35:1-9.
- Fleming NS, Culler SD, McCorkle R, Becker ER, Ballard DJ. The financial and nonfinancial costs of implementing electronic health records in primary care practices. Health Affairs. 2011;30(3):481-489. doi:10.1377/hlthaff.2010.0768'.
About this article
Cite this article
N. Ibrahim, R. Perera, T.H. Nguyen, A. Khan, M. Wanjiku, P. Suthipong (2026-03-29). Semantic Interoperability and Data Normalization as Foundational Infrastructure for Artificial Intelligence Deployment in Healthcare Systems. Digital Health Implementation, 1(1), 1–21.
Received
January 10, 2026
Accepted
March 15, 2026
Published
March 29, 2026