The AI-Ready Data Framework is an open standard that defines what "AI-ready" actually means. The six factors of AI-ready data provide criteria and requirements to help you evaluate your data, pipelines, and platforms against the demands of AI workloads.
This repo contains two things:
- The framework — six factors, 62 measurable requirements, and five workload profiles that define AI-readiness as a platform-agnostic standard.
- The `ai-ready-data` skill — an installable agent skill that can scan your data estate, assess specific assets against a profile, score every requirement, and guide you through remediation. Point a coding agent at this repo and say "assess my data for RAG readiness" — it handles the rest.
Use the framework to understand what matters. Use the skill to measure where you stand and fix what doesn't pass.
The contributors to this framework include practicing data engineers, ML engineers, and platform architects who have built and operated AI systems across industries.
This repo synthesizes our collective experience building data infrastructure that can reliably power AI. Our goal is to help data practitioners design infrastructure that produces trustworthy AI decisions.
Who this is for:
- Data engineers building pipelines that power AI systems.
- Platform teams designing infrastructure for ML and AI workloads.
- Architects evaluating whether their stack can support RAG, agents, or real-time inference.
- Data leaders who need to assess organizational AI readiness and communicate gaps to their teams.
- Coding agents building the data infrastructure they will eventually consume.
The six factors of AI-ready data:
- Clean: Clean data is consistently accurate, complete, valid, and free of errors that would compromise downstream consumption.
- Contextual: Meaning is explicit and colocated with the data. No external lookup, tribal knowledge, or human context is required to take action on the data.
- Consumable: Data is served in the right format and at the right latencies for AI workloads.
- Current: Data reflects the present state, with freshness enforced by infrastructure rather than assumed by convention.
- Correlated: Data is traceable from source to every decision it informs.
- Compliant: Data is governed with explicit ownership, enforced access boundaries, and AI-specific safeguards.
These factors apply to any data system powering AI applications, regardless of tech stack.
Each factor is backed by a set of measurable requirements — specific, platform-agnostic criteria that define what must be true of your data. Requirements describe the what, not the how. The full canonical list lives in the skill manifest.
The factor markdown files above describe the why and what of each factor in prose. The manifest provides the machine-readable counterpart: every requirement has a unique key, a description, a factor, and a scope (schema, table, or column). All checks return a normalized score between 0 and 1, making it straightforward to build automated assessments or dashboards on top of the framework.
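For illustration, a single manifest entry might look like the following sketch. The field layout follows the fields named above (key, description, factor, scope), but the exact YAML shape and the values are assumptions, not copied from the repo:

```yaml
# Hypothetical sketch of one entry in requirements/requirements.yaml.
# Field names follow the ones described above; values are illustrative.
data_completeness:
  description: Critical columns are populated; no unexpected NULLs.
  factor: clean
  scope: column
```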
An installable skill that any coding agent can dynamically load and execute. Scan your data estate for prioritization, assess specific assets against a profile, and get a scored report across the six factors of AI-ready data with guided remediation.
```shell
npx skills add Snowflake-Labs/ai-ready-data -a cortex
```

Alternatively, clone or add this repo as workspace context. The agent reads `skills/ai-ready-data/SKILL.md` automatically.
After installing, ask your coding agent:
```
Assess my [data assets] for RAG readiness.
```
The agent asks your platform and scope, loads the RAG profile, runs checks, and presents a scored report. From there you can drill into failures and remediate stage-by-stage.
For estate-level prioritization:
```
Scan my data estate for AI readiness.
```
The agent sweeps across all schemas in a database with lightweight readiness proxies and presents a comparative ranking.
Three phases, from light to deep: Scan, Assess, Remediate.
- Choose a platform: Snowflake, Postgres, etc.
- Discovery: tell the agent your database, schema, and tables, or scan your entire estate
- Choose a profile: RAG, feature serving, training, agents, full assessment, or pick specific requirements
- Adjust: skip, set, or add requirements before running
- Coverage: see what's runnable on your platform before executing
- Assess: platform-specific checks score each requirement 0–1
- Remediate: for failures, the agent presents platform-specific fixes for your approval
Every assessment is organized into six stages, one per factor of AI-ready data:
| Factor | Example Requirements |
|---|---|
| Clean | data_completeness, uniqueness, referential_integrity |
| Contextual | semantic_documentation, relationship_declaration, entity_identifier_declaration |
| Consumable | embedding_coverage, vector_index_coverage, serving_latency_compliance |
| Current | change_detection, data_freshness, incremental_update_coverage |
| Correlated | data_provenance, lineage_completeness, agent_attribution |
| Compliant | classification, column_masking, access_audit_coverage |
All scores are 0–1, where 1.0 is perfect. Requirements pass when `score >= threshold`.
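The scoring rule is simple enough to sketch in a few lines of Python. This is an illustration of the pass logic only — the requirement names and thresholds below are hypothetical examples, not code from the skill:

```python
# Sketch of the pass rule: a requirement passes when its normalized
# 0-1 score meets or exceeds the profile's threshold.
# Requirement names and thresholds are hypothetical examples.

def passes(score: float, threshold: float) -> bool:
    return score >= threshold

scores = {"data_completeness": 0.97, "chunk_readiness": 0.64}
thresholds = {"data_completeness": 0.95, "chunk_readiness": 0.70}

report = {req: passes(scores[req], thresholds[req]) for req in scores}
# data_completeness passes (0.97 >= 0.95); chunk_readiness fails (0.64 < 0.70)
```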
| Profile | Requirements | Best For |
|---|---|---|
| scan | 8 | Estate-level sweep: lightweight readiness proxies for portfolio analysis and prioritization |
| rag | 27 | Retrieval-augmented generation: chunking, embeddings, vector search, document governance |
| feature-serving | 39 | Online feature stores: low-latency lookups, materialized features, freshness SLAs |
| training | 50 | Fine-tuning and ML training: temporal integrity, reproducibility, bias testing, licensing |
| agents | 37 | Text-to-SQL and agentic tool use: highest bar on schema documentation, strong audit trail |
Each profile selects a different subset of the total requirements, with thresholds tuned for the use case. Every assessment uses the same six stages.
Before running, you can adjust any profile on the fly:
- `skip <requirement>`: exclude a check entirely
- `set <requirement> <threshold>`: override a threshold
- `add <requirement> <threshold>`: include a check not in the base profile
For repeatability, save overrides as a custom profile using extends:
```yaml
name: my-rag-profile
extends: rag
overrides:
  skip:
    - embedding_coverage
  set:
    chunk_readiness: { min: 0.70 }
  add:
    row_access_policy: { min: 0.50 }
```

- Factor: one of six categories of AI-ready data (Clean, Contextual, Consumable, Current, Correlated, Compliant). Factors define the dimensions along which data is evaluated.
- Requirement: a platform-agnostic criterion that must be true of the data. Requirements define what to measure, not how. All requirements live in a single manifest (`requirements/requirements.yaml`).
- Check: a platform-specific markdown file (`check.md`) containing prose context and SQL that measures a requirement, returning a normalized 0–1 score. Context, constraints, and variant guidance are co-located directly above the SQL they apply to.
- Diagnostic: a platform-specific markdown file (`diagnostic.md`) containing prose context and SQL that provides detail drill-downs on check results.
- Fix: a platform-specific markdown file (`fix.md`) containing remediation options — executable SQL and/or organizational process guidance. A single file can contain multiple remediation paths with prose explaining when to use each. Fixes are executed only with explicit user approval.
- Profile: a curated collection of requirements with thresholds, organized into the six factor stages. Profiles can target a workload (RAG, training, feature-serving, agents) or an estate-level scan. Five built-in, unlimited custom.
- Assessment: the guided flow where the agent discovers scope and profile, runs checks, and produces a scored report.
- Scan: estate-level sweep using the lightweight scan profile across many schemas for comparative prioritization. Scans turn into assessments when the user drills into a specific schema.
- Platform Reference: everything the agent needs to operate on a specific platform, including capabilities, nuances, permissions, and dialect notes.
- Override: skip/set/add adjustments applied to a profile before running an assessment.
- Add an entry to `requirements/requirements.yaml` with: description, factor, scope, placeholders, implementations.
- Create `requirements/{name}/{platform}/` with three markdown files:
  - `check.md` (required) — context + SQL returning a `value` score 0–1
  - `diagnostic.md` (required) — context + SQL for detail drill-down
  - `fix.md` (required) — remediation SQL and/or organizational guidance
- Add the requirement to relevant profile YAML(s) under the matching factor stage.
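As a concrete illustration of the check shape, the SQL inside a hypothetical `check.md` could look like this sketch. Snowflake dialect is assumed, and the database, table, and column names are invented for the example:

```sql
-- Hypothetical completeness check: the fraction of rows with a
-- non-NULL customer_id, returned as a single 0-1 score named "value".
-- Database, table, and column names are illustrative only.
SELECT COUNT_IF(customer_id IS NOT NULL) / NULLIF(COUNT(*), 0) AS value
FROM my_db.my_schema.orders;
```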
Create `profiles/{name}.yaml` with six stages, or use `extends` to derive from an existing one.
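For orientation, a profile organized into the six stages might look like the sketch below. The requirement keys come from the examples table above, but the overall YAML shape and the thresholds are assumptions, not a shipped profile:

```yaml
# Hypothetical profile skeleton: one stage per factor.
# Requirement keys and thresholds are illustrative.
name: my-custom-profile
stages:
  clean:
    data_completeness: { min: 0.95 }
  contextual:
    semantic_documentation: { min: 0.80 }
  consumable:
    embedding_coverage: { min: 0.90 }
  current:
    data_freshness: { min: 0.90 }
  correlated:
    data_provenance: { min: 0.75 }
  compliant:
    classification: { min: 0.90 }
```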
- Create `platforms/{PLATFORM}.md` covering capabilities, dialect, permissions, and nuances.
- Add requirement files under `requirements/{key}/{platform}/`.
See `demos/README.md` for available demo walkthroughs. Start with Scan + Agents for the full estate scan → deep assessment → remediation flow, or RAG Readiness for a focused single-schema assessment.
```
factors/                      # The six factors of AI-ready data (prose + requirements)
skills/
  ai-ready-data/
    SKILL.md                  # Orchestration protocol (Scan, Assess, Remediate)
    platforms/                # Platform references
      {PLATFORM}.md           # Capabilities, nuances, permissions, dialect
    requirements/             # Requirement manifest + implementation directories
      requirements.yaml       # Single manifest (all requirement metadata)
      {requirement_key}/
        {platform}/
          check.md            # Context + check SQL (read-only, returns 0–1 score)
          diagnostic.md       # Context + diagnostic SQL (read-only detail)
          fix.md              # Context + remediation SQL/guidance (mutating)
    profiles/                 # Assessment profiles
      scan.yaml               # Estate-level scan (lightweight)
      rag.yaml
      feature-serving.yaml
      training.yaml
      agents.yaml
```
Jacob Prall
All content and images are licensed under a CC BY 4.0 License
Code is licensed under the Apache 2.0 License