Data Quality in the Agentic AI Era
The Semantic Gap
Humans navigate ambiguity through context, experience, and asking a colleague.
Machines don't ask — they execute on whatever interpretation they compute first.
This gap is manageable when AI advises. It becomes catastrophic when AI acts.
The Stakes
The error mode isn't a bad dashboard — it's a bad decision at machine speed.
Predictive / Generative
AI advises, humans decide
Agentic AI
AI decides and executes
What Most Enterprises Have Today
Three artifacts designed for human interpretation — not for machines that act autonomously
Business Glossary
Terms in a wiki. Finance reads 'net revenue' as post-returns; Sales reads it as bookings.
Human-interpretable only
Taxonomy
Product → Category → SKU. Captures 'is a type of' but can't express a bundled license spanning three categories.
Classification without relationships
Data Model
Tables, columns, foreign keys. Tells the database how to store — but not that EMEA reports gross while NA reports net.
Storage schema ≠ business meaning
Three Layers of Data Quality
Each layer compounds the risk. Structural quality is necessary but insufficient. Semantic quality is where most enterprises fail. Contextual quality is where agents break.
Structural Quality
the pipes
An inventory optimization agent pulls stock counts from SAP every 15 minutes. Data is fresh, complete, schema-valid. But the Munich warehouse reports in UTC+1 while Shanghai reports in UTC+8. The agent sees a phantom stockout and triggers an emergency reorder of 2,400 units already on the shelf.
Structurally clean. Temporally misaligned. Timezone and unit-of-measure misalignment is the #1 structural failure in cross-region agent deployments.
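One way to close this gap is to convert every reading to UTC before the agent compares it with anything else. The sketch below is a minimal illustration, not the deployment's actual pipeline: the `WAREHOUSE_UTC_OFFSET` table, the `to_utc` helper, and the sample timestamps are hypothetical, and the fixed offsets are the ones stated above (a production version would likely need named time zones to handle daylight saving).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup: the fixed UTC offset each warehouse reports in,
# taken from the scenario above (Munich UTC+1, Shanghai UTC+8).
WAREHOUSE_UTC_OFFSET = {
    "munich": timezone(timedelta(hours=1)),
    "shanghai": timezone(timedelta(hours=8)),
}


def to_utc(warehouse: str, local_timestamp: str) -> datetime:
    """Interpret a naive local timestamp in the warehouse's offset and return UTC."""
    naive = datetime.fromisoformat(local_timestamp)
    if naive.tzinfo is not None:
        raise ValueError("expected a naive local timestamp")
    local = naive.replace(tzinfo=WAREHOUSE_UTC_OFFSET[warehouse])
    return local.astimezone(timezone.utc)


# Two readings carrying the same wall-clock label are not simultaneous.
munich_reading = to_utc("munich", "2024-01-15T09:00:00")
shanghai_reading = to_utc("shanghai", "2024-01-15T09:00:00")
skew_hours = (munich_reading - shanghai_reading).total_seconds() / 3600
print(skew_hours)
```

With both readings on one clock, the agent can see that the two stock counts were taken hours apart rather than treating them as a single snapshot.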
Semantic Quality
the meaning
A portfolio risk agent calculates exposure using 'notional value' from two trading systems. System A reports gross notional; System B nets out hedges. The agent computes a $340M exposure gap that doesn't exist and triggers a $47M hedge position against a definitional mismatch, not a market risk.
Both systems are correct. The semantic contract between them is missing. Cross-system definitional conflicts exist in 100% of Tier-1 banks assessed.
Contextual Quality
the expertise
A claims processing agent flags a cluster of cardiac procedure claims as potential fraud: an unusual volume spike vs. national baseline. But any experienced analyst knows snowbird season drives 40%+ cardiac volume increases in Miami-Dade every January. 200+ legitimate claims escalated, delaying reimbursements.
The pattern is real. The interpretation requires context no schema can encode. Contextual false positives account for 35–60% of agent-generated escalations.
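A schema cannot hold this knowledge, but a small rule the agent consults at runtime can. The sketch below applies a seasonal adjustment before the anomaly check. The multipliers (1.42 and 1.5), the region key, and the month list come from the Layer 3 example in this deck; the `SeasonalRule` type, the `is_anomalous` function, the 1.2 base threshold, and the claim counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeasonalRule:
    region: str
    months: frozenset
    baseline_multiplier: float
    threshold_multiplier: float


# Expert knowledge encoded as data: South Florida cardiac volumes rise in winter.
RULES = [
    SeasonalRule("south_FL", frozenset({12, 1, 2, 3}), 1.42, 1.5),
]


def is_anomalous(region: str, month: int, observed: float,
                 national_baseline: float, fraud_threshold: float) -> bool:
    """Flag a claim volume only if it exceeds the context-adjusted threshold."""
    baseline = national_baseline
    threshold = fraud_threshold
    for rule in RULES:
        if rule.region == region and month in rule.months:
            baseline *= rule.baseline_multiplier
            threshold *= rule.threshold_multiplier
    return observed / baseline > threshold


# Hypothetical volumes: 145 claims against a national baseline of 100.
january_flag = is_anomalous("south_FL", 1, 145, 100, 1.2)
july_flag = is_anomalous("south_FL", 7, 145, 100, 1.2)
print(january_flag, july_flag)
```

The same volume is tolerated in January and flagged in July, which is the distinction the experienced analyst makes without thinking about it.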
Not a glossary.
Not a taxonomy.
Not a data model.
A machine-readable rulebook agents can query at runtime.
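As a rough sketch of what one entry in such a rulebook could look like, the code below registers a versioned, executable definition that an agent resolves by name instead of reading a wiki page. The net revenue formula is the one given under Layer 1 in this deck; the `TermDefinition` type, the `REGISTRY` dictionary, the `resolve` function, the version string, and the figures are hypothetical, and the per-region rules are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class TermDefinition:
    name: str
    version: str
    description: str
    compute: Callable[[Mapping[str, float]], float]


def _net_revenue(figures: Mapping[str, float]) -> float:
    # Formula as stated in the deck's Layer 1 example.
    return (figures["gross_sales"] - figures["returns"]
            - figures["allowances"] - figures["discounts"])


# Hypothetical registry keyed by (term, version) so definitions can evolve
# without silently changing what an existing agent computes.
REGISTRY = {
    ("net_revenue", "1.0.0"): TermDefinition(
        name="net_revenue",
        version="1.0.0",
        description="Gross sales minus returns, allowances, and discounts.",
        compute=_net_revenue,
    ),
}


def resolve(name: str, version: str) -> TermDefinition:
    """Return the exact definition an agent asked for, or raise KeyError."""
    return REGISTRY[(name, version)]


term = resolve("net_revenue", "1.0.0")
value = term.compute(
    {"gross_sales": 1000.0, "returns": 50.0, "allowances": 20.0, "discounts": 30.0}
)
print(value)
```

Pinning the version means two agents that disagree about a number can at least show which definition each one used.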
Layer 1: Vocabulary
Terms & Definitions
Formal, versioned definitions of business terms. Not a PDF: executable code.
'Net Revenue is revenue after adjustments'
net_revenue = gross_sales - returns - allowances - discounts
WHERE region_rules[region].apply()
Layer 2: Relationships
How Entities Connect
Entity relationships, constraints, and inheritance. A customer CAN BE retail_depositor AND wealth_client simultaneously, with different revenue attribution rules per relationship.
Customer > Retail > Depositor
customer.relationships[] = [
{type: retail_depositor, revenue_rule: net},
{type: wealth_client, revenue_rule: aum_based}
]
Layer 3: Context
Domain Knowledge
Machine-readable rules encoding expert knowledge: seasonal patterns, regulatory interpretations, regional exceptions. The hardest layer; it requires expert co-design.
'FL cardiac volumes spike in winter'
IF region=south_FL
AND month IN [12,1,2,3]
THEN cardiac_baseline *= 1.42
AND fraud_threshold *= 1.5
The Difference
Same agent, same data, fundamentally different outcome
Without Semantic Layer
Agent queries raw table: SELECT notional_value FROM trades
System A returns gross: $840M. System B returns net: $500M
Agent computes gap: $340M. No way to know definitions differ
Agent executes $47M hedge against a phantom exposure
With Semantic Layer
Agent queries semantic API: GET /entities/notional_value?context=risk
API resolves: Sys A = gross, Sys B = net. Returns normalized values + confidence
Confidence: 0.42 (below threshold). API flags: definition_conflict detected
Agent escalates to human trader. No action taken. $47M preserved.
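The agent-side logic of the second flow can be sketched in a few lines. This is an illustration under stated assumptions, not a real API client: `resolve_notional` is a stub standing in for the `GET /entities/notional_value?context=risk` call and simply returns the values from the example above, while the `Resolution` type, the `decide` function, and the 0.8 confidence threshold are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy: the agent acts on its own only above this confidence.
CONFIDENCE_THRESHOLD = 0.8


@dataclass(frozen=True)
class Resolution:
    definitions: dict   # which definition each source system uses
    confidence: float   # how far the layer trusts the reconciled value
    flags: tuple        # conflicts detected during resolution


def resolve_notional(context: str) -> Resolution:
    """Stub for the semantic API call; returns the example's response."""
    return Resolution(
        definitions={"system_a": "gross", "system_b": "net"},
        confidence=0.42,
        flags=("definition_conflict",),
    )


def decide(resolution: Resolution) -> str:
    """Escalate on any flag or low confidence; otherwise allow execution."""
    if resolution.flags or resolution.confidence < CONFIDENCE_THRESHOLD:
        return "escalate"
    return "execute"


action = decide(resolve_notional("risk"))
print(action)
```

The design choice worth noting is that the default is inaction: the agent has to earn the right to execute, and an unresolved definition conflict is enough to hand the decision back to a person.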
Humans work around ambiguity. Machines compound it.
Data quality is not a hygiene problem — it's a control problem.
The semantic layer is the governance layer. The organizations that build for this now will be the ones whose agents are trusted to act.