Skip to content

francoisfernando/datalake-assistant

Repository files navigation

Data quality assistant for Athena data lake

Data quality is the most important aspect when building any reports or machine learning models based on data available in an enterprise data lake. When managing an Athena data lake professionally I would like to see an agent which can answer common questions like:

  • how do I get this piece of information (understanding data catalog)
  • When was this dataset last updated (data quality)
  • How is this particular column derived (lineage)
  • When I use this dataset can I make this assumption (provenance)

Why are agents the right solution to this problem

Modern enterprise data lakes—such as those built on Amazon Athena—could contain hundreds of tables, tens of thousands of columns, and a constantly evolving set of ingestion pipelines, ETL jobs, and business rules.

While organizations typically maintain metadata systems such as Glue Data Catalog, lineage tools, dashboards, and wikis, the knowledge about data quality, provenance, and usage patterns is fragmented across many systems and often locked in tribal knowledge.

An LLM-agent–based solution is uniquely suited to this problem for several reasons:

1. Natural-language querying of complex metadata

A large language model can interpret ambiguous or conversational questions like:

  • “Where do I find customer churn metrics?”
  • “What table contains the canonical definition of a transaction?”

Instead of requiring the user to navigate Glue Catalog, query logs, lineage tools, application source code and documentation separately, an LLM agent can integrate these sources and provide intuitive, plain-English answers. This removes friction for analysts, engineers, and product managers who may not know the structure of the data lake.

2. Reasoning across structured and unstructured metadata

Data quality signals exist across multiple layers:

  • Technical metadata (Glue/Athena table definitions)
  • Operational metadata (update timestamps, job logs, job statuses, pipeline failures)
  • Business metadata (wiki pages, Confluence docs, slack threads)
  • Lineage graphs (ETL tools, event logs)
  • Data quality rules and monitoring outputs (e.g., Great Expectations, Soda)

An LLM agent can reason across all of these by synthesizing information rather than requiring rigid schemas or brittle rule-based systems. This makes it well-suited for answering questions like:

  • “When was this dataset last refreshed?”
  • “What upstream sources does this column depend on?”

Such questions typically require joining information across multiple systems—something LLMs excel at when combined with retrieval tools.

3. Explaining derivation and lineage in human terms

Lineage tools often represent relationships as graphs or JSON blobs. They are technically correct but not human-friendly.

An LLM agent can translate technical lineage (“column X is produced by Spark job Y reading from S3 prefix Z…”) into crisp, concise explanations:

“This column comes from the Finance ETL pipeline and is derived from daily trades data aggregated by order ID.”

This supports onboarding, data literacy, and reduces reliance on subject-matter experts.

4. Providing judgment and guidance on assumptions (provenance)

One of the hardest problems in data engineering is helping users understand whether a dataset is fit for purpose.

Questions like:

  • “Can I assume this field is unique?”
  • “Is it safe to use this column for customer segmentation?” require a combination of rule interpretation, historical patterns, data quality metrics, domain context and provenance. LLM agents can give context-aware, probabilistic advice based on available metadata, plus point users toward authoritative documentation or warn them if information is incomplete.

5. Reducing operational overhead and democratizing data access

Without such a system, data lake teams spend a huge amount of time answering questions on Slack:

  • “What does account_id mean here?”
  • “Is this table still maintained?”
  • "How often is this table used?"
  • “Why does the date only update once a week?”

An LLM agent acts as an always-available first responder, reducing load on data engineering teams while making data - knowledge accessible to the entire organization (data democratisation).

6. Future-proof, adaptive, and incrementally improvable

Unlike rule-based chatbots or static documentation, an LLM agent:

  • Improves as more metadata becomes available
  • Adapts to schema changes automatically
  • Scales with the size and complexity of the data environment
  • Can learn organizational domain model language and conventions

This makes it a long-term strategic asset rather than a static tool.

Features to include/included for Kaggle Submission

  • Multi-agent system, including any combination of: ✅
    • Agent powered by an LLM ✅
    • Parallel agents
    • Sequential agents
    • Loop agents
    • Sub Agents?
  • Tools, including:
    • MCP ✅
    • custom tools ✅
    • built-in tools, such as Google Search or Code Execution
    • OpenAPI tools
    • Long-running operations (pause/resume agents)
  • Sessions & Memory
    • Sessions & state management (e.g. InMemorySessionService)
  • Long term memory (e.g. Memory Bank)
  • Context engineering (e.g. context compaction)
  • Observability: Logging, Tracing, Metrics ✅
  • Agent evaluation
  • A2A Protocol

✅ - features which are included in this submission

Architecture

The below diagram shows the high level architecture of the system.

Architecture diagram

Setup and testing

Create virual environment and activate it

virtualenv -p python3.12 datalake-adk-assistant
source datalake-adk-assistant/bin/activate
pip install -r requirements.txt

Run locally

Run your App agent with the main.py code using the following command:

python3 main.py

How the test dataset was created

In order to do some integration testing we needed some Glue/Athena tables. We used publicly available Sakila database as the starting point for creating the parquet data underlying Glue/Athena tables.

Standup a dockerized sakila postgres instance https://github.com/sakiladb/postgres

database: sakila username / password: sakila / p_ssW0rd

docker run -p 5432:5432 -d sakiladb/postgres:latest

Install pg2parquet. It is a dedicated command-line tool written in Rust that queries a PostgreSQL database and writes the result to a Parquet file.

cargo install pg2parquet

Export data to parquet

export PGHOST=localhost
export PGDATABASE=sakila
export PGUSER=sakila
export PGTABLE=actor
pg2parquet export --host $PGHOST --dbname $PGDATABASE --user $PGUSER --table $PGTABLE --output-file $PGTABLE.parquet

MCP Tools

This project uses following MCP tools to implement some of the agents.

The AWS DataProcessing MCP server provides AI code assistants with comprehensive data processing tools and real-time pipeline visibility across AWS Glue and Amazon EMR-EC2. This integration equips large language models (LLMs) with essential data engineering capabilities and contextual awareness.

Initial implementation of Retriever Agent is based on Amazon Data Processing MCP Server.

About

Agentic Datalake Assistant

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors