A local-first analytics engine that turns arbitrary tabular files into audited, Power BI-ready analytical packages.
Quick Start · Pipeline · Outputs · Evaluation · Architecture
AAE is a multi-agent analytics system for CSV, TSV, Excel, and Parquet files. It profiles the dataset, cleans it, performs statistically guarded EDA, builds Power BI-style semantic artifacts, generates dashboard stories, and produces an assurance report that checks output quality.
It is designed for realistic data work: malformed CSVs, mixed types, schema drift, sparse columns, high-cardinality IDs, randomized no-signal datasets, and large CSV/TSV files that should be processed in chunks rather than loaded fully into memory.
| Capability | What AAE Produces |
|---|---|
| Data audit | schema inference, date detection, cardinality, nulls, outliers, type recommendations |
| Cleaning | cleaned Parquet output, cleaning decision trace, dtype enforcement, risk notes |
| EDA | distributions, correlations, trends, movers, KPI candidates, significance-aware insights |
| Modeling | fact table, dimension tables, data dictionary, TMDL semantic model files |
| DAX | validated SUM, AVG, COUNT, time intelligence, contribution, and rolling measures where applicable |
| Storytelling | dashboard page recommendations, chart suggestions, executive headlines, caveats |
| Assurance | consistency checks, DAX validity, story evidence support, cleaning risk, model health |
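The "significance-aware insights" row deserves a concrete illustration: an insight is only promoted when it clears both a statistical and a practical bar. A minimal sketch of such a gate — `promote_correlation`, its thresholds, and the SciPy call are assumptions for illustration, not AAE's actual code:

```python
import numpy as np
from scipy import stats

def promote_correlation(x, y, alpha=0.01, min_abs_r=0.3):
    """Illustrative gate: report a correlation only when it is both
    statistically significant (p < alpha) and practically large
    (|r| >= min_abs_r). Hypothetical helper, not AAE's API."""
    r, p = stats.pearsonr(x, y)
    if p < alpha and abs(r) >= min_abs_r:
        return {"r": round(float(r), 3), "p": float(p), "promoted": True}
    return {"promoted": False}

rng = np.random.default_rng(0)
signal = np.linspace(0, 1, 500)

# A strong linear relationship should be promoted; pure noise should not.
print(promote_correlation(signal, signal + 0.2 * rng.normal(size=500)))
print(promote_correlation(rng.normal(size=500), rng.normal(size=500)))
```

Gating on both `p` and `|r|` is what keeps large-sample runs from promoting statistically significant but practically meaningless correlations.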
```mermaid
flowchart LR
    A["Upload Dataset"] --> B["AUDIT<br/>Data Health Inspector"]
    B --> C["CLEAN<br/>Data Surgeon"]
    C --> D["ANALYZE<br/>Insight Miner"]
    D --> E["ARCHITECT<br/>Schema Engineer"]
    E --> F["STORY<br/>Dashboard Storyteller"]
    F --> G["ASSURANCE<br/>Output QA Lead"]
    G --> H["Power BI-ready Package"]
```
| Phase | Agent | Primary Output |
|---|---|---|
| AUDIT | Data Health Inspector | data_health_report.json |
| CLEAN | Data Surgeon | cleaned_data.parquet, cleaning_report.json |
| ANALYZE | Insight Miner | statistical_analysis_report.json |
| ARCHITECT | Schema Engineer | star schema tables, DAX, dictionary, TMDL |
| STORY | Dashboard Storyteller | dashboard_stories.json |
| ASSURANCE | Output QA Lead | assurance_report.json |
Large CSV/TSV files are routed through agents/warehouse.py, which scans, cleans, aggregates, and writes chunked outputs without requiring a full in-memory load.
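The chunked pattern itself is straightforward, even though `agents/warehouse.py`'s internals are not shown here: stream the file in fixed-size chunks and fold each chunk into running aggregates. A minimal sketch under that assumption — `chunked_aggregate` is a hypothetical helper, not the warehouse API:

```python
import pandas as pd

def chunked_aggregate(path, value_col, key_col, chunksize=1_000_000):
    """Stream a large CSV and accumulate per-key sums/counts without
    ever holding the full file in memory. Hypothetical helper, not
    the actual agents/warehouse.py interface."""
    totals = {}
    for chunk in pd.read_csv(path, chunksize=chunksize):
        grouped = chunk.groupby(key_col)[value_col].agg(["sum", "count"])
        for key, row in grouped.iterrows():
            s, c = totals.get(key, (0.0, 0))
            totals[key] = (s + row["sum"], c + int(row["count"]))
    # Means are derived at the end from the accumulated sums and counts.
    return {k: {"sum": s, "mean": s / c} for k, (s, c) in totals.items()}
```

Because only per-key running totals are kept, memory scales with the number of distinct keys rather than the number of rows.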
```bash
git clone https://github.com/prithvinairr/AAE---Autonomous-Analytics-Engineer.git
cd AAE---Autonomous-Analytics-Engineer
```

Windows PowerShell:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

macOS/Linux:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run the app:

```bash
python main.py
```

Open:

```
http://localhost:8000
```
Upload a dataset, review the inferred KPI/date/role selections, optionally answer the analyst interview, and run the pipeline.
| Format | Extensions |
|---|---|
| CSV | .csv |
| TSV | .tsv |
| Excel | .xlsx, .xls |
| Parquet | .parquet, .pq |
Uploads are capped at 5GB. Large CSV/TSV files automatically switch to chunked full-file mode.
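The routing decision amounts to a simple size-and-format check. A sketch under assumptions — the 5GB cap comes from the text above, but `CHUNK_THRESHOLD_BYTES` and `choose_mode` are illustrative, not the app's real cutoff or API:

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 5 * 1024**3          # the 5GB upload cap stated above
CHUNK_THRESHOLD_BYTES = 200 * 1024**2   # assumed cutoff; the real one lives in the app

def choose_mode(path: str) -> str:
    """Pick a processing mode from file size and extension (illustrative only)."""
    p = Path(path)
    size = p.stat().st_size
    if size > MAX_UPLOAD_BYTES:
        raise ValueError("upload exceeds the 5GB cap")
    if p.suffix.lower() in {".csv", ".tsv"} and size > CHUNK_THRESHOLD_BYTES:
        return "chunked_full_file"  # streamed through the warehouse path
    return "in_memory"
```

Excel and Parquet always take the in-memory path in this sketch, since only large CSV/TSV files are described as switching to chunked mode.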
Each run can create a local package like this:
```
output/
  data_health_report.json
  cleaning_report.json
  cleaned_data.parquet
  statistical_analysis_report.json
  dashboard_stories.json
  dax_measures.json
  data_dictionary.json
  data_dictionary.csv
  assurance_report.json
  star_schema/
    fact_data.parquet
    dim_*.parquet
    tmdl/
```
Generated data and reports stay local and are ignored by Git.
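Downstream tooling can sanity-check a finished run by confirming the expected artifacts landed. A minimal sketch — the `REQUIRED` list mirrors the tree above, and `missing_artifacts` is a hypothetical helper, not part of AAE:

```python
from pathlib import Path

# Required top-level artifacts, taken from the package tree above.
REQUIRED = [
    "data_health_report.json",
    "cleaning_report.json",
    "cleaned_data.parquet",
    "statistical_analysis_report.json",
    "dashboard_stories.json",
    "dax_measures.json",
    "data_dictionary.json",
    "assurance_report.json",
]

def missing_artifacts(output_dir="output"):
    """Return the names of required artifacts absent from a run's package.
    Hypothetical helper; the assurance agent's own checks may differ."""
    out = Path(output_dir)
    return [name for name in REQUIRED if not (out / name).exists()]
```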
AAE includes reproducible evaluation harnesses for data resilience, statistical restraint, semantic-model validity, recovery behavior, and large-file processing.
```bash
python benchmark_aae.py --include-chunked
python elite_eval.py
```

For the 1GB chunked-processing proof:

```bash
python elite_eval.py --large-mb 1024
```

Last benchmark run: 2026-05-03T13:59:31
| Metric | Result |
|---|---|
| Dataset benchmark cases | 11/11 passed |
| Readiness edge cases | 5/5 passed |
| Benchmark pass rate | 100.0% |
| Average Assurance | 98.5/100 |
| Minimum Assurance | 96/100 |
| DAX validation | All passed |
| Required artifacts | All present |
| Senior analyst reviews | All present |
Last adversarial eval run: 2026-05-03T13:53:17
| Metric | Result |
|---|---|
| Adversarial perturbations | 24/24 passed |
| Perturbation categories | 9 |
| Self-healing probes | 4/4 passed |
| Peer-review checks | 24/24 passed |
| TMDL exports | 24/24 passed |
| Latency gate | Passed |
| Average Assurance | 99/100 |
| Minimum Assurance | 99/100 |
Large-file chunked-processing run:

| Metric | Result |
|---|---|
| Generated CSV size | 1,025.94 MB |
| Rows processed | 18,400,000 |
| Columns processed | 7 |
| Processing mode | chunked_full_file |
| Time to insight | 184.31 seconds |
| Assurance | 99/100 |
| Result | PASS |
The randomized-control eval passed with:
- 0 predictive false-positive insight signals
- 0 strong significant random correlations promoted
- no significant random trend promoted
- cautious/no-dominant-pattern language present
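The idea behind the randomized control is easy to reproduce: feed the pipeline pure noise and verify that nothing gets promoted. A sketch of such a null dataset — illustrative only, not `generate_data.py`'s actual logic:

```python
import numpy as np
import pandas as pd

def make_null_dataset(rows=10_000, numeric_cols=5, seed=42):
    """Generate a pure-noise dataset for a randomized-control eval:
    any 'insight' found here is by construction a false positive."""
    rng = np.random.default_rng(seed)
    data = {f"metric_{i}": rng.normal(size=rows) for i in range(numeric_cols)}
    data["date"] = pd.date_range("2024-01-01", periods=rows, freq="h")
    return pd.DataFrame(data)

df = make_null_dataset()
corr = df.filter(like="metric_").corr().abs()
np.fill_diagonal(corr.values, 0.0)  # ignore the trivial self-correlations
print(f"max |r| among noise columns: {corr.values.max():.3f}")
```

On data like this, a well-behaved engine should emit only cautious, no-dominant-pattern language.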
Read the full evaluation notes in docs/EVALUATION.md.
```
agents/            pipeline agents and QA modules
evals/             perturbation and latency evaluation utilities
static/            browser UI
docs/              architecture and evaluation notes
main.py            FastAPI app and WebSocket pipeline runner
benchmark_aae.py   dataset benchmark harness
elite_eval.py      adversarial evaluation harness
generate_data.py   synthetic local data generator
requirements.txt   Python dependencies
```
Run syntax checks:

```bash
python -m py_compile main.py benchmark_aae.py elite_eval.py agents/*.py evals/*.py
node --check static/app.js
```

Run the app:

```bash
python main.py
```

Generated folders are ignored by Git:

```
data/
output/
output_benchmark/
output_elite_audit/
output_test*/
```
| Layer | Tools |
|---|---|
| Backend | FastAPI, WebSocket |
| Data | Pandas, NumPy, PyArrow; SciPy optional |
| Frontend | HTML, CSS, JavaScript |
| BI handoff | Parquet star schema, DAX, data dictionary, TMDL |
MIT License. See LICENSE.