Alias for Data Science

An autonomous agent that runs your entire data science workflow.

Overview

Alias-DataScience is an autonomous, ready-to-use assistant for real-world data science workflows. It turns high-level analytical questions into executable plans that cover data acquisition, cleaning, modeling, visualization, and narrative reporting with minimal human intervention.

✨ Key Features

🔍 Scalable File Filtering

To handle the massive data files commonly found in enterprise data lakes, Alias-DataScience combines parallelized grep operations with Retrieval-Augmented Generation (RAG) to build a low-latency, high-throughput file filtering pipeline. This preprocessing step identifies the files relevant to a task before analysis begins, which extends the agent to data sources too large to load in full.
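The grep stage of such a pipeline can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `count_matches` and `filter_files` are hypothetical names, the thread pool stands in for whatever parallelism Alias-DataScience actually uses, and ranking by match count is only a placeholder for the retrieval step that RAG would perform.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import re


def count_matches(path: Path, pattern: re.Pattern) -> tuple[Path, int]:
    """Grep-like scan: count the lines in one file that match the pattern."""
    hits = 0
    try:
        with path.open("r", encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                if pattern.search(line):
                    hits += 1
    except OSError:
        # Unreadable files are treated as non-matching.
        pass
    return path, hits


def filter_files(root: str, keywords: list[str], top_k: int = 5, workers: int = 8) -> list[Path]:
    """Scan files under root in parallel and keep the top_k by match count."""
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    paths = [p for p in Path(root).rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(lambda p: count_matches(p, pattern), paths))
    # Drop files without hits; the ranking is a stand-in for a retrieval step.
    candidates = [(p, n) for p, n in scored if n > 0]
    candidates.sort(key=lambda item: (-item[1], str(item[0])))
    return [p for p, _ in candidates[:top_k]]
```

Calling `filter_files("data/", ["revenue", "Q3"])` would return up to five paths whose contents mention either keyword. Only this shortlist would then be passed to the more expensive retrieval and analysis steps.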

🧠 Context-Aware Prompt Engineering

Rather than relying on generic instructions, Alias-DataScience employs three specialized prompt templates, each tailored to a dominant data science workflow:

  • Exploratory Data Analysis (EDA): Surfaces trends, anomalies, and relationships to answer "what's happening?" and "why?"
  • Predictive Modeling: Automates feature engineering, model selection, and optimization.
  • Exact Data Computation: Delivers precise, auditable answers to quantitative queries (e.g., "What was the YoY revenue growth in Q3?").

An intelligent prompt selector routes tasks to the best template based on user intent.
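As a rough illustration of intent-based routing, the sketch below scores a task description against a few cue words per workflow. The cue lists, the `Workflow` enum, and `select_workflow` are all hypothetical; the actual selector may well rely on a language model rather than keyword matching.

```python
from enum import Enum


class Workflow(Enum):
    EDA = "exploratory_data_analysis"
    MODELING = "predictive_modeling"
    COMPUTATION = "exact_data_computation"


# Hypothetical cue words, chosen only for illustration.
CUES = {
    Workflow.EDA: ("explore", "trend", "anomal", "why", "relationship", "distribution"),
    Workflow.MODELING: ("predict", "forecast", "classif", "train", "model"),
    Workflow.COMPUTATION: ("how many", "what was", "total", "average", "growth"),
}


def select_workflow(task: str, default: Workflow = Workflow.EDA) -> Workflow:
    """Pick the workflow whose cue words appear most often in the task text."""
    text = task.lower()
    scores = {wf: sum(cue in text for cue in cues) for wf, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default when no cue word matches at all.
    return best if scores[best] > 0 else default
```

With these cues, the query "What was the YoY revenue growth in Q3?" routes to `Workflow.COMPUTATION`, while "Train a model to predict churn" routes to `Workflow.MODELING`. Substring matching is crude, so a production selector would need something more robust.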

📊 Handling of Messy Tabular Data

Alias-DataScience parses irregular spreadsheets (merged cells, embedded notes, multi-level headers) and converts them into structured tables. For large files, it outputs a semantics-preserving JSON representation, enabling reliable analysis of human-crafted inputs.
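One piece of this conversion, flattening multi-level headers with merged cells, can be sketched in plain Python. This is an assumption-laden illustration rather than the project's parser: `forward_fill`, `flatten_headers`, and `to_records` are hypothetical helpers, and the sketch assumes merged cells appear as a value followed by empty cells, which is how many spreadsheet readers expose them.

```python
import json


def forward_fill(row):
    """Propagate the last non-empty value rightwards, as a merged cell would span."""
    filled, last = [], ""
    for cell in row:
        if cell not in (None, ""):
            last = str(cell).strip()
        filled.append(last)
    return filled


def flatten_headers(header_rows, sep="_"):
    """Combine stacked header rows into one name per column."""
    if not header_rows:
        return []
    # Only upper levels are forward-filled; the lowest level names each column.
    upper = [forward_fill(r) for r in header_rows[:-1]]
    lowest = ["" if c is None else str(c).strip() for c in header_rows[-1]]
    names = []
    for parts in zip(*upper, lowest):
        unique = []
        for part in parts:
            if part and part not in unique:
                unique.append(part)
        names.append(sep.join(unique))
    return names


def to_records(rows, n_header_rows):
    """Turn raw sheet rows into a list of dicts keyed by flattened header names."""
    names = flatten_headers(rows[:n_header_rows])
    return [dict(zip(names, row)) for row in rows[n_header_rows:]]


sheet = [
    ["Region", "2023", None, "2024", None],  # "2023" and "2024" are merged cells
    [None, "Q1", "Q2", "Q1", "Q2"],
    ["EU", 10, 12, 14, 15],
]
print(json.dumps(to_records(sheet, n_header_rows=2), indent=2))
```

Each record keeps the full header path in its keys (for example `2023_Q1`), so the JSON output retains the meaning that the merged header cells carried in the original sheet.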

👁️ Multimodal Understanding of Visual Content

  • Image Understanding: Interprets charts, diagrams, and general images to extract numerical data, trends, and domain-specific entities.
  • Visual QA: Answers natural-language questions about visual elements (e.g., "What was the peak value in Q3?").

📑 Automated Reporting

For EDA tasks, Alias-DataScience generates an interactive HTML report featuring:

  • Actionable insights backed by statistics and visuals,
  • Executable code snippets for transparency and reuse.

This bridges the gap between data scientists and stakeholders like business users or auditors.

📈 Benchmark Performance

Alias-DataScience achieves state-of-the-art (SOTA) results across major data science agent benchmarks.

Realistic tasks from ModelOff & Kaggle; includes multimodal inputs, multi-source data, and large-scale modeling.

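| Task Category | Framework | Model | Score |
|---|---|---|---|
| Data Analysis | Alias-DataScience | Qwen3-max-Preview | 55.58% 🏆 |
| Data Analysis | AutoGen | GPT-4 | 30.69% |
| Data Analysis | AutoGen | GPT-4o | 34.12% |
| Data Analysis | CodeInterpreter | GPT-4 | 26.39% |
| Data Analysis | CodeInterpreter | GPT-4o | 23.82% |
| Data Modeling | Alias-DataScience | Qwen3-max-Preview | 49.70% 🏆 |
| Data Modeling | AutoGen | GPT-4 | 45.52% |
| Data Modeling | AutoGen | GPT-4o | 34.74% |
| Data Modeling | CodeInterpreter | GPT-4 | 26.14% |
| Data Modeling | CodeInterpreter | GPT-4o | 16.90% |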

Open-ended comprehensive analytical tasks.

| Framework | Model | Score |
|---|---|---|
| Alias-DataScience | Qwen3-max-Preview | 43.29% 🏆 |
| AgentPoirot | Qwen3-max-Preview | 39.30% |

End-to-end data analysis from real-world CSVs.

| Framework | Model | Score |
|---|---|---|
| Alias-DataScience | Qwen3-max-Preview | 95.20% 🏆 |
| AutoGen | GPT-4 | 71.49% |
| Data Interpreter | GPT-4 | 73.55% |
| Data Interpreter | GPT-4o | 94.93% |

Some tables include data from published sources, used with gratitude to the original authors and cited in good faith. For accuracy, please refer to the original publications.

🎯 Use Cases

1. Machine Learning

2. Exact Data Computation

3. Exploratory Data Analysis