Data quality is the most important consideration when building reports or machine learning models on top of an enterprise data lake. When managing an Athena data lake professionally, I would like an agent that can answer common questions such as:
- How do I get this piece of information? (understanding the data catalog)
- When was this dataset last updated? (data quality)
- How is this particular column derived? (lineage)
- When I use this dataset, can I make this assumption? (provenance)
Modern enterprise data lakes, such as those built on Amazon Athena, can contain hundreds of tables, tens of thousands of columns, and a constantly evolving set of ingestion pipelines, ETL jobs, and business rules.
While organizations typically maintain metadata systems such as Glue Data Catalog, lineage tools, dashboards, and wikis, the knowledge about data quality, provenance, and usage patterns is fragmented across many systems and often locked in tribal knowledge.
An LLM-agent–based solution is uniquely suited to this problem for several reasons:
A large language model can interpret ambiguous or conversational questions like:
- “Where do I find customer churn metrics?”
- “What table contains the canonical definition of a transaction?”
Instead of requiring the user to navigate Glue Catalog, query logs, lineage tools, application source code and documentation separately, an LLM agent can integrate these sources and provide intuitive, plain-English answers. This removes friction for analysts, engineers, and product managers who may not know the structure of the data lake.
Data quality signals exist across multiple layers:
- Technical metadata (Glue/Athena table definitions)
- Operational metadata (update timestamps, job logs, job statuses, pipeline failures)
- Business metadata (wiki pages, Confluence docs, Slack threads)
- Lineage graphs (ETL tools, event logs)
- Data quality rules and monitoring outputs (e.g., Great Expectations, Soda)
An LLM agent can reason across all of these by synthesizing information rather than requiring rigid schemas or brittle rule-based systems. This makes it well-suited for answering questions like:
- “When was this dataset last refreshed?”
- “What upstream sources does this column depend on?”
Such questions typically require joining information across multiple systems—something LLMs excel at when combined with retrieval tools.
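As a minimal illustration of this joining step (all table names, metadata shapes, and values below are hypothetical), a retrieval tool might merge Glue-style technical metadata with operational job-run logs into one plain-English context string before handing it to the LLM:

```python
from datetime import datetime, timezone

# Hypothetical snapshots of two metadata systems: a Glue-style catalog
# and an operational job log. A real agent would fetch these via tools.
CATALOG = {
    "sales.orders": {"columns": ["order_id", "customer_id", "amount"],
                     "location": "s3://lake/sales/orders/"},
}
JOB_RUNS = [
    {"table": "sales.orders", "status": "SUCCEEDED",
     "finished_at": datetime(2024, 5, 3, 6, 15, tzinfo=timezone.utc)},
    {"table": "sales.orders", "status": "FAILED",
     "finished_at": datetime(2024, 5, 4, 6, 20, tzinfo=timezone.utc)},
]

def freshness_context(table: str) -> str:
    """Join catalog and job-log metadata into plain-English context for the LLM."""
    schema = CATALOG.get(table)
    if schema is None:
        return f"Table {table} is not in the catalog."
    ok_runs = [r for r in JOB_RUNS
               if r["table"] == table and r["status"] == "SUCCEEDED"]
    if not ok_runs:
        return f"Table {table} exists but has no successful refresh on record."
    last = max(ok_runs, key=lambda r: r["finished_at"])
    return (f"Table {table} ({len(schema['columns'])} columns, {schema['location']}) "
            f"was last successfully refreshed at {last['finished_at']:%Y-%m-%d %H:%M} UTC.")

print(freshness_context("sales.orders"))
# → Table sales.orders (3 columns, s3://lake/sales/orders/) was last
#   successfully refreshed at 2024-05-03 06:15 UTC.
```

Note that the failed run on 2024-05-04 is deliberately excluded: the answer to "when was this last refreshed?" should reflect the last *successful* pipeline run, not merely the last attempt.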
Lineage tools often represent relationships as graphs or JSON blobs. They are technically correct but not human-friendly.
An LLM agent can translate technical lineage (“column X is produced by Spark job Y reading from S3 prefix Z…”) into crisp, concise explanations:
“This column comes from the Finance ETL pipeline and is derived from daily trades data aggregated by order ID.”
This supports onboarding, data literacy, and reduces reliance on subject-matter experts.
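To sketch that translation step (the table, job, and S3 names here are invented), raw lineage edges can be rendered into a readable first-draft sentence, which the LLM can then polish with business context:

```python
# Hypothetical lineage edges: (upstream source, transformation, downstream column).
LINEAGE = [
    ("s3://lake/raw/trades/",
     "Spark job daily_trades_agg (aggregated by order_id)",
     "finance.daily_trades.total_amount"),
]

def explain_column(column: str) -> str:
    """Translate raw lineage edges for a column into a plain-English summary."""
    hops = [(src, job) for src, job, dst in LINEAGE if dst == column]
    if not hops:
        return f"No lineage recorded for {column}."
    parts = [f"{column} is produced by {job}, reading from {src}"
             for src, job in hops]
    return "; ".join(parts) + "."

print(explain_column("finance.daily_trades.total_amount"))
```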
One of the hardest problems in data engineering is helping users understand whether a dataset is fit for purpose.
Questions like:
- "Can I assume this field is unique?"
- "Is it safe to use this column for customer segmentation?"

require a combination of rule interpretation, historical patterns, data quality metrics, domain context, and provenance. LLM agents can give context-aware, probabilistic advice based on available metadata, point users toward authoritative documentation, and warn them when information is incomplete.
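As a toy sketch of this fitness-for-purpose check (the profile numbers and column names are invented, standing in for output from a tool like Great Expectations or Soda), the key design choice is that the agent gives hedged, evidence-based advice rather than a hard guarantee:

```python
# Hypothetical column profiles, e.g. summarized from a data quality tool's output.
PROFILE = {
    "customers.account_id": {"rows": 10_000, "distinct": 10_000, "nulls": 0},
    "orders.customer_id":   {"rows": 50_000, "distinct": 8_200,  "nulls": 120},
}

def uniqueness_advice(column: str) -> str:
    """Give hedged, evidence-based advice about a uniqueness assumption."""
    p = PROFILE.get(column)
    if p is None:
        return f"No profile available for {column}; cannot assess uniqueness."
    if p["distinct"] == p["rows"] and p["nulls"] == 0:
        # Observed uniqueness is evidence, not a constraint: hedge accordingly.
        return (f"{column} was unique in the last profile ({p['rows']} rows, no nulls), "
                "but uniqueness is not enforced; re-check before relying on it.")
    return (f"{column} is NOT unique: {p['distinct']} distinct values over "
            f"{p['rows']} rows ({p['nulls']} nulls).")

print(uniqueness_advice("customers.account_id"))
print(uniqueness_advice("orders.customer_id"))
```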
Without such a system, data lake teams spend a huge amount of time answering questions on Slack:
- “What does account_id mean here?”
- “Is this table still maintained?”
- “How often is this table used?”
- “Why does the date only update once a week?”
An LLM agent acts as an always-available first responder, reducing load on data engineering teams while making data knowledge accessible to the entire organization (data democratisation).
Unlike rule-based chatbots or static documentation, an LLM agent:
- Improves as more metadata becomes available
- Adapts to schema changes automatically
- Scales with the size and complexity of the data environment
- Can learn organizational domain model language and conventions
This makes it a long-term strategic asset rather than a static tool.
- Multi-agent system, including any combination of: ✅
  - Agent powered by an LLM ✅
  - Parallel agents
  - Sequential agents
  - Loop agents
  - Sub-agents
- Tools, including:
  - MCP ✅
  - custom tools ✅
  - built-in tools, such as Google Search or Code Execution
  - OpenAPI tools
- Long-running operations (pause/resume agents)
- Sessions & Memory
- Sessions & state management (e.g. InMemorySessionService)
- Long term memory (e.g. Memory Bank)
- Context engineering (e.g. context compaction)
- Observability: Logging, Tracing, Metrics ✅
- Agent evaluation
- A2A Protocol
✅ marks features included in this submission.
The diagram below shows the high-level architecture of the system.
```
virtualenv -p python3.12 datalake-adk-assistant
source datalake-adk-assistant/bin/activate
pip install -r requirements.txt
```

Run the app agent with the `main.py` code using the following command:

```
python3 main.py
```

For integration testing we needed some Glue/Athena tables. We used the publicly available Sakila database as the starting point for creating the Parquet data underlying the Glue/Athena tables.
Stand up a dockerized Sakila Postgres instance (https://github.com/sakiladb/postgres):

- database: sakila
- username / password: sakila / p_ssW0rd

```
docker run -p 5432:5432 -d sakiladb/postgres:latest
```

Install pg2parquet, a dedicated command-line tool written in Rust that queries a PostgreSQL database and writes the result to a Parquet file:

```
cargo install pg2parquet
```

Export the data to Parquet:
```
export PGHOST=localhost
export PGDATABASE=sakila
export PGUSER=sakila
export PGTABLE=actor
pg2parquet export --host $PGHOST --dbname $PGDATABASE --user $PGUSER --table $PGTABLE --output-file $PGTABLE.parquet
```

This project uses the following MCP tools to implement some of the agents.
The AWS DataProcessing MCP server provides AI code assistants with comprehensive data processing tools and real-time pipeline visibility across AWS Glue and Amazon EMR-EC2. This integration equips large language models (LLMs) with essential data engineering capabilities and contextual awareness.
The initial implementation of the Retriever Agent is based on the AWS Data Processing MCP Server.