
Commit a1427a1

fix(data source): handle profiling for irregular excel files; fix bug in preparing unrecognized data source type
1 parent cc91b37 commit a1427a1

File tree

10 files changed: +1025 −829 lines


alias/src/alias/agent/agents/data_source/_data_profiler_factory.py

Lines changed: 750 additions & 0 deletions
Large diffs are not rendered by default.

alias/src/alias/agent/agents/data_source/_multimodal_to_text.py

Lines changed: 0 additions & 678 deletions
This file was deleted.

alias/src/alias/agent/agents/data_source/built_in_prompt/_profile_csv_prompt.md

Lines changed: 1 addition & 1 deletion

@@ -80,4 +80,4 @@ You must output a single valid JSON object containing only the `description` key
 ```
 
 # Input
-input_json = {schema}
+input_json = {data}
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Role
+You are an expert Data Engineer specializing in unstructured Excel parsing. Your task is to analyze the raw content of the first 100 rows of an Excel sheet and determine whether it contains structured tabular data suitable for a Pandas DataFrame.
+
+If it is a valid table, identify the **Header Row** and the **Column Range**.
+If it is NOT a valid table (e.g., a dashboard, a form, a letter, or empty), you must flag it as unsuitable.
+
+# Task Analysis
+Excel sheets fall into two categories:
+1. **List-Like Tables (Valid)**: A header row followed by multiple rows of consistent record data. This is what we want.
+2. **Unstructured/Layout-Heavy (Invalid)**:
+   - **Forms/KV Pairs**: "Label: Value" pairs scattered across the sheet.
+   - **Dashboards**: Multiple small tables, charts, or scattered numbers.
+   - **Text/Notes**: Paragraphs of text or disclaimers without column structure.
+   - **Empty/Near Empty**: Contains almost no data.
+
+# Rules for Detection
+
+### A. Validity Check (The "Gatekeeper")
+Set `is_extractable_table` to **false** if:
+- There is no distinct row where meaningful column headers align horizontally.
+- The data is scattered (e.g., values exist in A1, G5, and C20 with no relation).
+- The sheet looks like a printed form (key on the left, value on the right) rather than a list of records.
+- There are fewer than 3 rows of data following a potential header.
+
+### B. Structure Extraction (Only if Valid)
+If the sheet passes the Validity Check:
+1. **Header Row**: Find the first row containing multiple distinct string values that serve as column labels.
+2. **Column Range**: Identify the start index (first valid header) and end index (last valid header) to define the width.
+3. **Data Continuity**: Verify that rows below the header contain consistent data types (e.g., dates under "Date").
+
+# Input Data
+The user will provide the first 100 rows in CSV/Markdown format (0-based index).
+
+# Output Format
+You must output a strictly valid JSON object.
+JSON Structure:
+{{
+  "is_extractable_table": <boolean, true if the sheet can serve as a DataFrame source, false otherwise>,
+  "row_start_index": <int or null, 0-based index of the header row>,
+  "col_ranges": <list [start, end] or null, inclusive 0-based column indices>,
+  "confidence_score": <float, 0-1>,
+  "reasoning": "<string: explain what the row data contains and declare the final conclusion (IRREGULAR, REGULAR, or INVALID)>"
+}}
+
+# Examples
+
+## Example 1 (Valid Table with Noise)
+Input:
+Title: Monthly Sales, NaN, NaN, NaN
+NaN, NaN, NaN, NaN
+NaN, Date, Item, Qty, Total
+NaN, 2023-01-01, Apple, 10, 500
+NaN, 2023-01-02, Banana, 5, 100
+
+Output:
+{{
+  "is_extractable_table": true,
+  "row_start_index": 2,
+  "col_ranges": [1, 4],
+  "confidence_score": 0.99,
+  "reasoning": "Rows 0-1 are ignorable metadata; row 2 holds clear headers, and rows 3-4 contain consistent data aligned with those headers. The table is IRREGULAR and requires skiprows=2, usecols=[1, 4] to extract into a Pandas DataFrame."
+}}
+
+## Example 2 (Unstructured - Form/Dashboard)
+Input:
+Company Invoice, NaN, NaN, Invoice #: 001
+To:, John Doe, NaN, Date:, 2023-01-01
+Address:, 123 St, NaN, Due:, 2023-02-01
+NaN, NaN, NaN, NaN, NaN
+Subject:, Consulting Services, NaN, NaN, NaN
+
+Output:
+{{
+  "is_extractable_table": false,
+  "row_start_index": null,
+  "col_ranges": null,
+  "confidence_score": 0.95,
+  "reasoning": "The data matches a 'Form/Invoice' layout (key-value pairs) rather than a list-like table; no single header row defines a dataset of records. The sheet is INVALID and cannot be processed as a Pandas DataFrame."
+}}
+
+# Input
+{raw_snippet_data}
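The skiprows/usecols hints this prompt emits map directly onto `pandas.read_excel` parameters. A minimal sketch of how a caller might translate the JSON verdict into read arguments (the helper name is illustrative, not part of this commit):

```python
def judgment_to_read_kwargs(judgment):
    """Translate the prompt's JSON verdict into pandas.read_excel kwargs.

    `judgment` mirrors the schema above: is_extractable_table,
    row_start_index (0-based header row), col_ranges ([start, end], inclusive).
    Returns None for unstructured sheets; a null judgment means Regular.
    """
    if judgment is None:
        return {"header": 0}  # no judgment: treat as a regular table
    if not judgment.get("is_extractable_table"):
        return None  # unstructured: cannot be read as a DataFrame
    kwargs = {"header": judgment.get("row_start_index") or 0}
    if judgment.get("col_ranges"):
        start, end = judgment["col_ranges"]
        kwargs["usecols"] = list(range(start, end + 1))  # inclusive -> 0-based ints
    return kwargs

# Example 1's verdict: header in row 2, columns 1..4
kwargs = judgment_to_read_kwargs(
    {"is_extractable_table": True, "row_start_index": 2, "col_ranges": [1, 4]}
)
# kwargs == {"header": 2, "usecols": [1, 2, 3, 4]}
# df = pd.read_excel("sales.xlsx", sheet_name="Sheet1", **kwargs)  # hypothetical file
```

Passing `header=n` makes pandas skip the n leading rows and take row n as the column names, which matches the prompt's 0-based `row_start_index` semantics.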

alias/src/alias/agent/agents/data_source/built_in_prompt/_profile_relationdb_prompt.md

Lines changed: 1 addition & 1 deletion

@@ -106,4 +106,4 @@ You must output a single valid JSON object.
 ```
 
 # Input
-input_json=`{schema}`
+input_json=`{data}`
Lines changed: 89 additions & 39 deletions
@@ -1,6 +1,12 @@
 # Role
-You are an expert Data Steward. Your task is to analyze the metadata and content of an Excel file.
-**Assumption:** This is an ideal dataset or database where **ALL** tables contain valid headers in the first row. You will process the entire file structure in a single pass.
+You are an expert Data Steward. Your task is to analyze the metadata and content of an Excel file based on a pre-analyzed structural judgment.
+
+**Context:** The dataset contains three types of sheets:
+1. **Regular Tables**: Standard headers in row 0.
+2. **Irregular Tables**: Valid data that requires `skiprows` or `usecols` parameters.
+3. **Unstructured Sheets**: Dashboards, forms, or text descriptions that **cannot** be read as a dataframe.
+
+**Constraint**: Your analysis relies on a snippet of the first 100 rows.
 
 # Input Format
 You will receive a single JSON string in the variable `input_json`. The structure is:
@@ -9,50 +15,74 @@ You will receive a single JSON string in the variable `input_json`. The structur
   "file": "Name of the file",
   "tables": [
     {{
-      "name": "Name of the table",
+      "name": "Sheet Name",
       "row_count": 100,
       "col_count": 5,
-      "raw_data_snippet": "Header1, Header2\nVal1, Val2..."
-    }},
-    ...
+      "raw_data_snippet": "...",
+      "irregular_judgment": {{
+        "row_header_index": int,
+        "cols_ranges": list,
+        "reasoning": "..."
+      }}
+    }}
   ]
 }}
 
 ```
+
+*(Note: If `irregular_judgment` is null, treat the sheet as Regular.)*
+
 # Analysis Logic
 
+## 1. Sheet Iteration (Table Descriptions)
 
-## 1. Sheet Iteration (Sheet Descriptions)
+For **EACH** object in the `tables` array, apply the following priority logic:
 
-For **EACH** object in the `tables` array:
+**Case A: Unstructured Sheet (`irregular_judgment` contains "UNSTRUCTURED")**
 
-1. **Extract Schema:**
-   * Since headers are guaranteed, simply extract the column names from the **first row** of the `raw_data_snippet`.
-   * Format them as a clean list of strings.
+* **Columns**: Return an empty list `[]`.
+* **Description**: "The sheet [Name] contains [something]."
+  **Append MANDATORY Warning**: "It is Unstructured based on a 100-row sample."
 
-2. **Draft Description:**
-   * Write a concise sentence describing what the sheet tracks based on its name and columns.
-   * **MANDATORY:** You MUST explicitly mention the `row_count` and `col_count` in this sentence.
-   * *Template:* "The sheet [Sheet Name] contains [Subject] data with [Row Count] rows and [Col Count] columns, featuring fields like [List 3 key columns]."
+**Case B: Irregular Table (`irregular_judgment` is a dict and `row_header_index` > 0 or `cols_ranges` is set)**
+
+* **Columns**: Extract column names from the row indicated by `row_header_index`.
+* **Description**: Write a concise sentence describing what the sheet tracks based on its name and columns.
+  1. Start with: "The sheet [Name] contains [Subject] data with [Rows] rows and [Cols] columns."
+  2. **Append MANDATORY Warning**: "It is irregular; requires skiprows={{row_header_index}}, usecols={{cols_ranges}} when loading into a Pandas DataFrame."
+
+**Case C: Regular Table (Default)**
+
+* **Columns**: Extract from the first row of `raw_data_snippet`.
+* **Description**: "The sheet [Name] contains [Subject] data with [Rows] rows and [Cols] columns, featuring fields like [Key Cols]."
 
 ## 2. Global Analysis (File Description)
-* Analyze the `file` name and the number of all `table_name`s inside the `tables` array.
-* Based on all sheet descriptions, generate a single sentence summarizing the whole workbook.
+
+Generate a single string summarizing the workbook. This summary **MUST** explicitly include:
+
+1. **Total Count**: The number of sheets.
+2. **Status List**: Every table name with its status tag:
+   * (Regular)
+   * (Irregular, requires skiprows=X, usecols=Y)
+   * (Unstructured)
+   * *Format Example:* "The file logistics_data.xlsx contains supply chain logistics information for 2024. It contains 3 sheets: 'Data' (Regular), 'Logs' (Irregular, requires skiprows=2), and 'Cover' (Unstructured)."
+
 
 # Output Format (Strict JSON)
 
 You must output a single valid JSON object.
 
 ```json
 {{
-  "description": "One sentence describing the whole file or database.",
+  "description": "Comprehensive summary including count, names, and specific status tags for ALL tables.",
   "tables": [
     {{
-      "name": "Name of table 1",
-      "description": "Sentence including row/col counts and key columns.",
-      "columns": ["col1", "col2", "col3"]
-    }},
-    ...
+      "name": "Table Name",
+      "description": "Specific description based on Case A, B, or C.",
+      "columns": ["col1", "col2"]
+    }}
   ]
 }}
 
@@ -65,19 +95,32 @@ You must output a single valid JSON object.
 
 ```json
 {{
-  "file": "logistics_data.xlsx",
+  "file": "finance_report_v2.xlsx",
   "tables": [
     {{
-      "name": "Shipments",
-      "row_count": 2000,
-      "col_count": 4,
-      "raw_data_snippet": "shipment_id, origin, destination, date\nSHP-001, Tokyo, London, 2024-05-20"
+      "name": "Q1_Sales",
+      "row_count": 200,
+      "col_count": 5,
+      "raw_data_snippet": "Date, Item, Amount\n2023-01-01, A, 100"
+    }},
+    {{
+      "name": "Historical_Data",
+      "row_count": 500,
+      "col_count": 10,
+      "raw_data_snippet": "Confidential\nSystem Generated\n\nDate, ID, Val\n...",
+      "irregular_judgment": {{
+        "is_extractable_table": true,
+        "row_header_index": 3,
+        "cols_ranges": [0, 3],
+        "reasoning": "Header offset."
+      }}
     }},
     {{
-      "name": "Rates",
+      "name": "Dashboard_Overview",
       "row_count": 50,
-      "col_count": 2,
-      "raw_data_snippet": "Route_ID, Cost_Per_Kg\nR-101, 5.50"
+      "col_count": 20,
+      "raw_data_snippet": "Total KPI: 500 | Chart Area |\nDisclaimer: Internal Use",
+      "irregular_judgment": "UNSTRUCTURED"
    }}
   ]
 }}
@@ -88,22 +131,29 @@ You must output a single valid JSON object.
 
 ```json
 {{
-  "description": "The file/database logistics_data.xlsx contains supply chain logistics information for 2024, divided into shipment tracking and rate definitions (2 tables in total).",
+  "description": "The file finance_report_v2.xlsx contains historical sales transaction records over the Q1 period. It contains 3 sheets: 'Q1_Sales' (Regular), 'Historical_Data' (Irregular, requires skiprows=3, usecols=[0, 3], sampled from the first 100 rows), and 'Dashboard_Overview' (Unstructured).",
   "tables": [
     {{
-      "name": "Shipments",
-      "description": "The 'Shipments' sheet tracks individual shipment records with 2000 rows and 4 columns, featuring fields such as shipment_id, origin, and destination.",
-      "columns": ["shipment_id", "origin", "destination", "date"]
+      "name": "Q1_Sales",
+      "description": "The sheet 'Q1_Sales' contains sales transaction records. It contains 200 rows and 5 columns, featuring fields like Date, Item, and Amount.",
+      "columns": ["Date", "Item", "Amount"]
     }},
     {{
-      "name": "Rates",
-      "description": "The 'Rates' sheet lists shipping cost rates with 50 rows and 2 columns, specifically Route_ID and Cost_Per_Kg.",
-      "columns": ["Route_ID", "Cost_Per_Kg"]
+      "name": "Historical_Data",
+      "description": "The sheet 'Historical_Data' contains historical sales transaction records. It contains 500 rows and 10 columns. It is irregular, judged from the first 100 sampled rows (the first 3 rows contain metadata; requires skiprows=3, usecols=[0, 3] when loading into a Pandas DataFrame).",
+      "columns": ["Date", "ID", "Val"]
+    }},
+    {{
+      "name": "Dashboard_Overview",
+      "description": "The sheet 'Dashboard_Overview' contains an overview and summary of the dashboards. It is Unstructured based on a 100-row sample.",
+      "columns": []
    }}
   ]
 }}
 
 ```
 
 # Input
-input_json=`{schema}`
+
+input_json=`{data}`
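The Case A/B/C priority order described above reduces to a simple dispatch on `irregular_judgment`; a hedged sketch of the consumer side (the function name is illustrative, not taken from this commit):

```python
def classify_sheet(table):
    """Apply the prompt's Case A/B/C priority order to one `tables` entry."""
    judgment = table.get("irregular_judgment")
    if judgment == "UNSTRUCTURED":
        return "Unstructured"  # Case A: empty columns, mandatory warning
    if isinstance(judgment, dict) and (
        judgment.get("row_header_index", 0) > 0 or judgment.get("cols_ranges")
    ):
        return "Irregular"  # Case B: needs skiprows/usecols hints
    return "Regular"  # Case C: default, including a null judgment

# The three sheets from the worked example above:
tags = [
    classify_sheet({"name": "Q1_Sales"}),
    classify_sheet({"name": "Historical_Data",
                    "irregular_judgment": {"row_header_index": 3, "cols_ranges": [0, 3]}}),
    classify_sheet({"name": "Dashboard_Overview", "irregular_judgment": "UNSTRUCTURED"}),
]
# tags == ["Regular", "Irregular", "Unstructured"]
```

Checking the string sentinel before the dict case preserves the priority the prompt specifies: an "UNSTRUCTURED" verdict always wins, and anything without a usable judgment falls through to Regular.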
