VISHNU BAPATHI
ExecEng

Data Quality in the Agentic AI Era

The Semantic Gap

Humans navigate ambiguity through context, experience, and asking a colleague.

Machines don't ask — they execute on whatever interpretation they compute first.

This gap is manageable when AI advises. It becomes catastrophic when AI acts.

The Stakes

The error mode isn't a bad dashboard — it's a bad decision at machine speed.

Before

Predictive / Generative

AI advises, humans decide

Role: Generates reports and risk scores
Loop: Human reviews before action
Failure mode: Bad data = bad insight (catchable)
Blast radius: One analyst's time

Now

Agentic AI

AI decides and executes

Role: Executes trades, routes shipments, adjusts pricing
Loop: No human in the loop, by design
Failure mode: Bad data = bad action (irreversible)
Blast radius: Financial loss, regulatory exposure

What Most Enterprises Have Today

Three artifacts designed for human interpretation — not for machines that act autonomously

01

Business Glossary

Terms in a wiki. Finance reads 'net revenue' as post-returns; Sales reads it as bookings.

Human-interpretable only
02

Taxonomy

Product → Category → SKU. Captures 'is a type of' but can't express a bundled license spanning three categories.

Classification without relationships
03

Data Model

Tables, columns, foreign keys. Tells the database how to store — but not that EMEA reports gross while NA reports net.

Storage schema ≠ business meaning

Three Layers of Data Quality

Each layer compounds the risk. Structural quality is necessary but insufficient. Semantic quality is where most enterprises fail. Contextual quality is where agents break.

L1

Structural Quality

the pipes
Freshness · Completeness · Schema Validation · Pipeline SLAs · Deduplication

🏭 Supply Chain

An inventory optimization agent pulls stock counts from SAP every 15 minutes. Data is fresh, complete, schema-valid. But the Munich warehouse reports in UTC+1 while Shanghai reports in UTC+8. The agent sees a phantom stockout and triggers an emergency reorder of 2,400 units already on the shelf.

~$380K unnecessary freight

Structurally clean. Temporally misaligned. Timezone and unit-of-measure misalignment is the #1 structural failure in cross-region agent deployments.
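
One way to close this particular gap is to normalize every reading to UTC before the agent compares them. The sketch below is illustrative only: the warehouse names, the `WAREHOUSE_OFFSETS` mapping, and the `to_utc` helper are assumptions, and the fixed offsets simply mirror the UTC+1 and UTC+8 figures in the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets, taken from the example above (UTC+1 and UTC+8).
# A production mapping would more likely use named IANA zones to handle DST.
WAREHOUSE_OFFSETS = {
    "munich": timezone(timedelta(hours=1)),
    "shanghai": timezone(timedelta(hours=8)),
}


def to_utc(warehouse: str, local_naive: datetime) -> datetime:
    """Attach the warehouse's offset to a naive timestamp, then convert to UTC."""
    if local_naive.tzinfo is not None:
        raise ValueError("expected a naive local timestamp")
    local = local_naive.replace(tzinfo=WAREHOUSE_OFFSETS[warehouse])
    return local.astimezone(timezone.utc)


# Two readings that look seven hours apart on the wall clock.
munich_reading = to_utc("munich", datetime(2024, 1, 15, 9, 0))
shanghai_reading = to_utc("shanghai", datetime(2024, 1, 15, 16, 0))

# After normalization they turn out to be the same instant.
print(munich_reading == shanghai_reading)  # True
```

An unknown warehouse raises a `KeyError` rather than silently assuming a zone, which is the safer default for an agent that acts on the result.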

L2

Semantic Quality

the meaning
Shared Definitions · Business Rules · Ontology · Data Contracts · Lineage

🏦 Financial Services

A portfolio risk agent calculates exposure using 'notional value' from two trading systems. System A reports gross notional; System B nets out hedges. The agent computes a $340M exposure gap that doesn't exist and triggers a $47M hedge position — against a definitional mismatch, not a market risk.

$47M loss on a definitional mismatch

Both systems are correct. The semantic contract between them is missing. Cross-system definitional conflicts exist in 100% of Tier-1 banks assessed.

L3

Contextual Quality

the expertise
Domain Expertise · Business Context · Industry Patterns · Operational Judgment

🏥 Healthcare

A claims processing agent flags a cluster of cardiac procedure claims as potential fraud — unusual volume spike vs. national baseline. But any experienced analyst knows snowbird season drives 40%+ cardiac volume increases in Miami-Dade every January. 200+ legitimate claims escalated, delaying reimbursements.

200+ false fraud flags

The pattern is real. The interpretation requires context no schema can encode. Contextual false positives account for 35–60% of agent-generated escalations.

Not a glossary.

Not a taxonomy.

Not a data model.

A machine-readable rulebook agents can query at runtime.

1

Layer 1: Vocabulary

Terms & Definitions

Formal, versioned definitions of business terms. Not a PDF — executable code.

Glossary says
'Net Revenue is revenue after adjustments'
Ontology says
net_revenue = gross_sales - returns
  - allowances - discounts
  WHERE region_rules[region].apply()
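
The ontology-style definition above could be made executable roughly as follows. This is a minimal sketch, not a reference implementation: it assumes that "region rules" means which deductions a region applies, and the `REGION_RULES` contents, the `DEFINITION_VERSION` string, and the `RevenueInputs` type are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RevenueInputs:
    gross_sales: Decimal
    returns: Decimal
    allowances: Decimal
    discounts: Decimal


# Hypothetical version tag, so callers can pin the definition they rely on.
DEFINITION_VERSION = "net_revenue/1.0.0"

# Hypothetical per-region rules: the deductions each region applies.
REGION_RULES = {
    "NA": ("returns", "allowances", "discounts"),
    "EMEA": ("returns", "discounts"),  # illustrative assumption only
}


def net_revenue(inputs: RevenueInputs, region: str) -> Decimal:
    """Gross sales minus the deductions that the region's rule set names."""
    deductions = sum(
        (getattr(inputs, name) for name in REGION_RULES[region]),
        Decimal("0"),
    )
    return inputs.gross_sales - deductions


sample = RevenueInputs(Decimal("1000"), Decimal("50"), Decimal("30"), Decimal("20"))
print(net_revenue(sample, "NA"))    # 900
print(net_revenue(sample, "EMEA"))  # 930
```

A region with no registered rule raises a `KeyError`, so the agent cannot fall back to a guessed definition.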
2

Layer 2: Relationships

How Entities Connect

Entity relationships, constraints, and inheritance. A customer CAN BE retail_depositor AND wealth_client simultaneously — with different revenue attribution rules per relationship.

Glossary says
Customer > Retail > Depositor
Ontology says
customer.relationships[] = [
  {type: retail_depositor,
   revenue_rule: net},
  {type: wealth_client,
   revenue_rule: aum_based}
]
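
The relationship structure above maps naturally onto typed records. The sketch below assumes hypothetical `Customer`, `Relationship`, and `RevenueRule` types and a made-up customer ID; only the two relationship types and their revenue rules come from the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class RevenueRule(Enum):
    NET = "net"
    AUM_BASED = "aum_based"


@dataclass(frozen=True)
class Relationship:
    type: str
    revenue_rule: RevenueRule


@dataclass
class Customer:
    customer_id: str
    relationships: list[Relationship] = field(default_factory=list)

    def revenue_rule_for(self, relationship_type: str) -> RevenueRule:
        """Return the attribution rule for one of this customer's relationships."""
        for rel in self.relationships:
            if rel.type == relationship_type:
                return rel.revenue_rule
        raise LookupError(f"{self.customer_id} has no {relationship_type} relationship")


# One customer holding both relationships at once, each with its own rule.
customer = Customer(
    customer_id="C-001",  # hypothetical identifier
    relationships=[
        Relationship("retail_depositor", RevenueRule.NET),
        Relationship("wealth_client", RevenueRule.AUM_BASED),
    ],
)
print(customer.revenue_rule_for("wealth_client").value)  # aum_based
```

Unlike a strict hierarchy, the list lets one entity carry several roles, and a missing role is an explicit error rather than an inherited default.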
3

Layer 3: Context

Domain Knowledge

Machine-readable rules encoding expert knowledge. Seasonal patterns, regulatory interpretations, regional exceptions. The hardest layer — requires expert co-design.

Glossary says
'FL cardiac volumes spike in winter'
Ontology says
IF region=south_FL
  AND month IN [12,1,2,3]
THEN cardiac_baseline *= 1.42
  AND fraud_threshold *= 1.5
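
The rule above can be expressed as a small pure function that an agent calls before scoring claims. This is a sketch under stated assumptions: the `Thresholds` type, the `adjust_for_context` name, and the sample base values are hypothetical; the region, months, and multipliers are copied from the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    cardiac_baseline: float
    fraud_threshold: float


# Months and multipliers taken from the rule above.
SEASONAL_MONTHS = {12, 1, 2, 3}
BASELINE_MULTIPLIER = 1.42
THRESHOLD_MULTIPLIER = 1.5


def adjust_for_context(base: Thresholds, region: str, month: int) -> Thresholds:
    """Scale baseline and fraud threshold for south_FL in the seasonal months."""
    if region == "south_FL" and month in SEASONAL_MONTHS:
        return Thresholds(
            cardiac_baseline=base.cardiac_baseline * BASELINE_MULTIPLIER,
            fraud_threshold=base.fraud_threshold * THRESHOLD_MULTIPLIER,
        )
    return base


base = Thresholds(cardiac_baseline=100.0, fraud_threshold=2.0)  # sample values
january = adjust_for_context(base, "south_FL", 1)
july = adjust_for_context(base, "south_FL", 7)
```

Returning a new `Thresholds` value instead of mutating shared state keeps the adjustment auditable: the agent can log both the base and the adjusted figures with each decision.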

The Difference

Same agent, same data, fundamentally different outcome

Without Semantic Layer

1

Agent queries raw table: SELECT notional_value FROM trades

2

System A returns gross: $840M. System B returns net: $500M

3

Agent computes gap: $340M. No way to know definitions differ

4

Agent executes $47M hedge against a phantom exposure

$47M loss on a definitional mismatch

With Semantic Layer

1

Agent queries semantic API: GET /entities/notional_value?context=risk

2

API resolves: Sys A = gross, Sys B = net. Returns normalized values + confidence

3

Confidence: 0.42 (below threshold). API flags: definition_conflict detected

4

Agent escalates to human trader. No action taken. $47M preserved.

Conflict detected. Human decides. $0 loss.
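
The "with semantic layer" flow above can be sketched as a resolve-then-decide step. Everything here is illustrative: the `Reading` and `Resolution` types, the `resolve` and `decide` functions, and the 0.8 threshold are assumptions, and the 0.42 score is a placeholder copied from the example, since the source does not say how confidence is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    system: str
    value_musd: float
    definition: str  # "gross" or "net", as declared by the source system


@dataclass(frozen=True)
class Resolution:
    confidence: float
    flags: tuple[str, ...]


CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off


def resolve(readings: list[Reading]) -> Resolution:
    """Flag a conflict when systems report the same term under different definitions."""
    definitions = {r.definition for r in readings}
    if len(definitions) > 1:
        # Placeholder score from the example; a real layer would compute it.
        return Resolution(confidence=0.42, flags=("definition_conflict",))
    return Resolution(confidence=1.0, flags=())


def decide(resolution: Resolution) -> str:
    """Escalate on any flag or low confidence; otherwise allow execution."""
    if resolution.flags or resolution.confidence < CONFIDENCE_THRESHOLD:
        return "escalate_to_human"
    return "execute"


readings = [
    Reading("System A", 840.0, "gross"),
    Reading("System B", 500.0, "net"),
]
print(decide(resolve(readings)))  # escalate_to_human
```

The design choice worth noting is that the agent never sees the raw $340M difference as a tradable signal: the conflict is surfaced before any arithmetic on mismatched definitions reaches the execution path.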

Humans work around ambiguity. Machines compound it.

Data quality is not a hygiene problem — it's a control problem.

The semantic layer is the governance layer. The organizations that build for this now will be the ones whose agents are trusted to act.