Commit 679431b

SSSuperDan and StCarmen authored
feat(alias core): add data source management (#110)
Co-authored-by: Tianjing Zeng <39507457+StCarmen@users.noreply.github.com>
Co-authored-by: stcarmen <1106135234@qq.com>
1 parent df0776c commit 679431b

36 files changed: +3330 −318 lines changed

alias/README.md

Lines changed: 15 additions & 2 deletions
@@ -207,10 +207,23 @@ alias_agent run --mode finance --task "Analyze Tesla's Q4 2024 financial perform
 # Data Science mode
 alias_agent run --mode ds \
     --task "Analyze the distribution of incidents across categories in 'incident_records.csv' to identify imbalances, inconsistencies, or anomalies, and determine their root cause." \
-    --files ./docs/data/incident_records.csv
+    --datasource ./docs/data/incident_records.csv
 ```
-**Note**: Files uploaded with `--files` are automatically copied to `/workspace` in the sandbox. Generated files are available in `sessions_mount_dir` subdirectories.
+#### Input/Output Management
+
+**Input:**
+- Use the `--datasource` parameter (`--files` is kept as an alias for backward compatibility) to specify data sources in several formats:
+  - **Local files**: e.g. `./data.txt` or `/absolute/path/file.json`
+  - **Database DSN**: relational databases such as PostgreSQL and SQLite, e.g. `postgresql://user:password@host:port/database`
+
+  Example: `--datasource file.txt postgresql://user:password@localhost:5432/mydb`
+
+- Specified data sources are automatically profiled (analyzed), and the resulting profile gives the model guidance for efficient data source access.
+- Uploaded files are automatically copied to the `/workspace` directory in the sandbox.
+
+**Output:**
+- Generated files are stored in subdirectories of `sessions_mount_dir`, where all output results can be found.

 #### Enable Long-Term Memory Service (General Mode Only)
 To enable the long-term memory service in General mode, you need to:

alias/README_ZH.md

Lines changed: 17 additions & 2 deletions
@@ -208,10 +208,25 @@ alias_agent run --mode finance --task "Analyze Tesla's Q4 2024 financial perform
 # Data Science mode
 alias_agent run --mode ds \
     --task "Analyze the distribution of incidents across categories in 'incident_records.csv' to identify imbalances, inconsistencies, or anomalies, and determine their root cause." \
-    --files ./docs/data/incident_records.csv
+    --datasource ./docs/data/incident_records.csv
 ```
-**Note**: Files uploaded with `--files` are automatically copied to `/workspace` in the sandbox. Generated files can be found in subdirectories of `sessions_mount_dir`.
+#### Input/Output Management
+
+**Input:**
+- Use the `--datasource` parameter to specify data sources in several formats (`--files` is also supported for backward compatibility):
+  - **Local files**: e.g. `./data.txt` or `/absolute/path/file.json`
+  - **Database DSN**: relational databases such as PostgreSQL and SQLite, e.g. `postgresql://user:password@host:port/database`
+
+  Example: `--datasource file.txt postgresql://user:password@localhost:5432/mydb`
+- Specified data sources are automatically profiled (analyzed), and the profile gives the model guidance for efficient data source access.
+- Uploaded files are automatically copied to the `/workspace` directory in the sandbox.
+
+**Output:**
+- Generated files are stored in subdirectories of `sessions_mount_dir`, where all output results can be found.

 #### Enable Long-Term Memory Service (General Mode Only)
 To enable the long-term memory service in General mode, you need to:

alias/pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -45,7 +45,8 @@ dependencies = [
     "agentscope-runtime>=1.0.0",
     "aiosqlite>=0.21.0",
     "asyncpg>=0.30.0",
-    "itsdangerous>=2.2.0"
+    "itsdangerous>=2.2.0",
+    "polars>=1.37.1"
 ]

 [tool.setuptools]
Lines changed: 74 additions & 0 deletions

@@ -0,0 +1,74 @@
---
name: csv-excel-file
description: Guidelines for handling CSV/Excel files
type:
  - csv
  - excel
---

# CSV/Excel Handling Specifications

## Goals

- Safely load tabular data without crashing.
- Detect and handle messy spreadsheets (multiple blocks, missing headers, merged-cell artifacts).
- Produce reliable outputs (a clean dataframe for a clean table, or structured JSON for a messy spreadsheet) with validated types.

## Encoding, Delimiters, and Locale

- CSV encoding: Try UTF-8; if garbled, attempt common fallbacks (e.g., gbk, cp1252) based on context.
- Delimiters: Detect common separators (`,`, `\t`, `;`, `|`) during inspection.
- Locale formats: Be cautious with comma decimal separators and thousands separators.

## Inspection (always first)

- Identify file type, encoding (CSV), and sheet names (Excel) before full reads.
- Prefer small reads to preview structure:
  - CSV: `pd.read_csv(..., nrows=20)`; if the delimiter is uncertain: `sep=None, engine="python"` (small `nrows` only).
  - Excel: `pd.ExcelFile(path).sheet_names`, then `pd.read_excel(..., sheet_name=..., nrows=20)`.
- Use `df.head(n)` and `df.columns` to check:
  - Missing/incorrect headers (e.g., columns are numeric 0..N-1)
  - "Unnamed: X" columns
  - Unexpected NaN/NaT, merged-cell artifacts
  - Multiple tables/blocks in one sheet (blank rows separating sections)
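
A minimal sketch of the small-read inspection above, using an in-memory sample in place of a real file (the data and column names are hypothetical):

```python
import io

import pandas as pd

# In-memory sample standing in for an on-disk CSV with an unknown delimiter.
raw = "id;name;amount\n1;alpha;10\n2;beta;20\n3;gamma;30\n"

# Small read only; sep=None with engine="python" sniffs the delimiter (";").
preview = pd.read_csv(io.StringIO(raw), sep=None, engine="python", nrows=20)

# Check headers for messy-layout signals before committing to a full read.
messy = any(str(c).startswith("Unnamed:") for c in preview.columns)
print(preview.columns.tolist())  # ['id', 'name', 'amount']
print(messy)                     # False
```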

## Preprocessing

- Treat as messy if any of the following is present:
  - Columns contain "Unnamed:" or mostly empty column names
  - Header row appears inside the data (first rows look like data, a later row looks like a header)
  - Multiple data blocks (large blank-row gaps, repeated header patterns)
  - Predominantly NaN/NaT in top rows/left columns
  - Notes/metadata blocks above/beside the table (titles, footnotes, merged header areas)
- If a messy spreadsheet is detected:
  - First choice: use the `clean_messy_spreadsheet` tool to extract key tables/fields and output JSON.
  - Only fall back to manual parsing if the tool fails, returns an empty/incorrect structure, or cannot locate the target table.

## Querying

- Never load entire datasets blindly.
- Use minimal reads:
  - `nrows`, `usecols`, `dtype` (or a partial dtype mapping), `parse_dates` only when necessary.
  - Sampling: `skiprows` with a step pattern for rough profiling when the file is huge.
- For very large CSV:
  - Prefer `chunksize` iteration; aggregate/compute per chunk.
- For Excel:
  - Read only the needed `sheet_name`, and consider narrowing `usecols`/`nrows` during exploration.
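
The `chunksize` pattern can be sketched as follows (an in-memory buffer stands in for a very large CSV; the data is hypothetical):

```python
import io

import pandas as pd

# Stand-in for a huge CSV: 1,000 rows of a single numeric column.
raw = "value\n" + "\n".join(str(i) for i in range(1000))

# Iterate in chunks and aggregate per chunk instead of loading everything.
total = 0
for chunk in pd.read_csv(io.StringIO(raw), usecols=["value"], chunksize=100):
    total += int(chunk["value"].sum())

print(total)  # 499500
```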

## Data Quality & Type Validation

- After load/clean:
  - Validate types:
    - Numeric columns: coerce with `pd.to_numeric(errors="coerce")`
    - Datetime columns: `pd.to_datetime(errors="coerce")`
  - Report coercion fallout (how many values became NaN/NaT).
- Standardize missing values: treat empty strings/"N/A"/"null" consistently.
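
A minimal sketch of the coerce-and-report step (the column values are hypothetical):

```python
import pandas as pd

# Mixed-quality values as they often arrive from a messy sheet.
df = pd.DataFrame({"amount": ["10", "20.5", "N/A", "", "thirty"]})

# Coerce to numeric; unparseable values ("N/A", "", "thirty") become NaN.
numeric = pd.to_numeric(df["amount"], errors="coerce")

# Report the coercion fallout instead of losing data silently.
fallout = int(numeric.isna().sum())
print(fallout)  # 3
```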

## Best Practices

- Always inspect structure before processing.
- Handle encoding issues appropriately.
- Keep reads minimal; expand only after confirming layout.
- Log decisions: chosen sheet, detected header row, dropped columns/rows, dtype conversions.
- Avoid silent data loss: when dropping/cleaning, summarize what changed.
- Validate data types after loading.
Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@
---
name: image-file
description: Guidelines for handling image files
type: image
---

# Image Handling Specifications

## Goals

- Safely identify image properties and metadata without memory exhaustion.
- Accurately extract text (OCR) and visual elements (object detection/description).
- Perform necessary pre-processing (resize, normalize, crop) for downstream tasks.
- Handle multi-frame or high-resolution images efficiently.

## Inspection (Always First)

- Identify Properties: Use lightweight libraries (e.g., PIL/Pillow) to get `format`, `size` (width/height), and `mode` (RGB, RGBA, CMYK).
- Check File Size: If the image is exceptionally large (e.g., >20MB or >100MP), consider downsampling or tiling before full processing.
- Metadata/EXIF Extraction:
  - Read EXIF data for orientation, GPS tags, and timestamps.
  - Correction: Automatically apply EXIF orientation to ensure the image is "upright" before visual analysis.
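
The property check and orientation fix can be sketched with Pillow (an in-memory image stands in for an uploaded file; `exif_transpose` is a no-op here because the sample has no orientation tag):

```python
import io

from PIL import Image, ImageOps

# Build a small in-memory JPEG standing in for an uploaded file.
buf = io.BytesIO()
Image.new("RGB", (120, 80), color="white").save(buf, format="JPEG")
buf.seek(0)

with Image.open(buf) as img:
    # Lightweight property inspection before any heavy processing.
    print(img.format, img.size, img.mode)  # JPEG (120, 80) RGB
    # Apply EXIF orientation so downstream analysis sees an upright image.
    upright = ImageOps.exif_transpose(img)
    print(upright.size)  # (120, 80)
```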

## Content Extraction & Vision

- Vision Analysis:
  - Use multimodal vision models to describe scenes, identify objects, and detect activities.
  - For complex images (e.g., infographics, UI screenshots), guide the model to focus on specific regions.
- OCR (Optical Character Recognition):
  - If text is detected, specify whether to extract "raw text" or "structured data" (like forms/tables).
  - Handle low-contrast or noisy backgrounds by applying pre-filters (grayscale, binarization).
- Format Conversion: Convert non-standard formats (e.g., HEIC, TIFF) to standard formats (JPEG/PNG) if tools require it.

## Handling Large or Complex Images

- Tiling: For ultra-high-res images (e.g., satellite maps, medical scans), split into overlapping tiles to avoid missing small details.
- Batching: Process multiple images using generators to keep memory usage stable.
- Alpha Channel: Be mindful of transparency (PNG/WebP); decide whether to discard it or composite against a solid background (e.g., white).

## Best Practices

- Safety First: Validate that the file is a genuine image (not a renamed malicious script).
- Graceful Failure: Handle corrupted files, truncated downloads, or unsupported formats with descriptive error logs.
- Efficiency: Avoid unnecessary re-encoding (e.g., multiple JPEG saves) to prevent "generation loss" or artifacts.
- Process images individually or in small batches to prevent system crashes.
- Consider memory usage when working with large or high-resolution images.
- Resource Management: Close file pointers or use context managers (`with Image.open(...) as img:`) to prevent memory leaks.
Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
---
name: json-file
description: Guidelines for handling JSON files
type: json
---

# JSON Handling Specifications

## Goals

- Safely parse JSON/JSONL without memory overflow.
- Discover schema structure (keys, nesting depth, data types).
- Flatten complex nested structures into tabular data when necessary.
- Handle inconsistent schemas and "dirty" JSON (e.g., trailing commas, mixed types).

## Inspection (Always First)

- Structure Discovery:
  - Determine if the root is a `list` or a `dict`.
  - Identify whether it's standard JSON or JSONL (one valid JSON object per line).
- Schema Sampling:
  - For large files, read the first few objects/lines to infer the schema.
  - Identify top-level keys and their types.
  - Detect nesting depth: if depth > 3, consider it a "deeply nested" structure.
- Size Check:
  - If the file is large (>50MB), avoid `json.load()`. Use iterative parsing or streaming.

## Processing & Extraction

- Lazy Loading (Streaming):
  - For massive JSON: use `ijson` (Python) or a similar streaming parser to yield specific paths/items.
  - For JSONL: read line-by-line using a generator to minimize memory footprint.
- Flattening & Normalization:
  - Use `pandas.json_normalize` to convert nested structures into flat tables if the goal is analysis.
  - Specify `max_level` during normalization to prevent "column explosion."
- Data Filtering:
  - Extract only required sub-trees (keys) early in the process to reduce the in-memory object size.
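
A minimal sketch of line-by-line JSONL streaming, including skipping a corrupt line rather than crashing (the records are hypothetical):

```python
import io
import json

# JSONL sample: one object per line, with one deliberately corrupt line.
stream = io.StringIO(
    '{"id": 1, "v": 10}\n'
    "not-json\n"
    '{"id": 2, "v": 20}\n'
)

def iter_jsonl(fp):
    """Yield parsed objects one line at a time; skip malformed lines."""
    for lineno, line in enumerate(fp, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            print(f"skipping malformed line {lineno}")

# .get() guards against keys missing from some objects.
total = sum(obj.get("v", 0) for obj in iter_jsonl(stream))
print(total)  # 30
```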

## Data Quality & Schema Validation

- Missing Keys: Use `.get(key, default)` or `try-except` blocks. Never assume a key exists in all objects.
- Type Coercion:
  - Validate numeric strings vs. actual numbers.
  - Standardize `null`, `""`, and `[]` consistently.
- Encoding: Default to UTF-8; check for a BOM (`utf-8-sig`) if parsing fails.
- Malformed JSON Recovery:
  - For minor syntax errors (e.g., single quotes instead of double), attempt `ast.literal_eval` or regex-based cleanup only as a fallback.

## Best Practices

- Minimal Reads: Don't load a 50MB JSON file just to read one config key; use a streaming approach.
- Schema Logging: Document the detected structure (e.g., "Root is a list of 500 objects; key 'metadata' is nested").
- Error Transparency: When a JSON object in a JSONL stream is corrupted, log the line number, skip it, and continue instead of crashing the entire process.
- Avoid Over-Flattening: Be cautious with deeply nested arrays; flattening them can lead to massive row duplication.
- Strict Typing: After extraction, explicitly convert types (e.g., `pd.to_datetime`) to ensure downstream reliability.
Lines changed: 70 additions & 0 deletions

@@ -0,0 +1,70 @@
---
name: database
description: Guidelines for handling databases
type: relational_db
---

# Database Handling Specifications

## Goals

- Safely explore database schemas without performance degradation.
- Construct precise, efficient SQL queries that prevent system crashes (out-of-memory and timeouts).
- Handle dialect-specific nuances (PostgreSQL, MySQL, SQLite, etc.).
- Transform raw result sets into structured, validated data for analysis.

## Inspection

- Volume Estimation:
  - Before any `SELECT *`, always run `SELECT COUNT(*) FROM table_name` to understand the scale.
  - If a table has >1,000,000 rows, strictly use indexed columns for filtering.
- Sample Data:
  - Use `SELECT * FROM table_name LIMIT 5` to see actual data formats.
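
A sketch of the count-then-sample flow, using an in-memory SQLite database with a hypothetical `incidents` table in place of a real DSN:

```python
import sqlite3

# In-memory SQLite stands in for a real DSN-configured database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, category TEXT);"
    "INSERT INTO incidents (category) VALUES ('hw'), ('sw'), ('hw');"
)

# Volume estimation first, so later queries can be bounded appropriately.
(count,) = conn.execute("SELECT COUNT(*) FROM incidents").fetchone()
print(count)  # 3

# Then a bounded sample to see actual data formats.
rows = conn.execute("SELECT * FROM incidents LIMIT 5").fetchall()
print(rows)  # [(1, 'hw'), (2, 'sw'), (3, 'hw')]
conn.close()
```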

## Querying

- Safety Constraints:
  - Always use `LIMIT`: never execute a query without a `LIMIT` clause unless the row count is confirmed to be small.
  - Avoid `SELECT *`: in production-scale tables, explicitly name columns to reduce I/O and memory usage.
- Dialect & Syntax:
  - Case Sensitivity: if a column/table name contains uppercase or special characters, it MUST be quoted (e.g., `"UserTable"` in Postgres, `` `UserTable` `` in MySQL).
  - Date/Time: use standard ISO strings for date filtering; be mindful of timezone-aware vs. naive columns.
- Complex Queries:
  - For `JOIN` operations, ensure the joined columns are indexed to prevent full table scans.
  - When performing `GROUP BY`, ensure the result set size is manageable.

## Data Retrieval & Transformation

- Type Mapping:
  - Ensure SQL types (e.g., `DECIMAL`, `BIGINT`, `TIMESTAMP`) are correctly mapped to Python/JSON types without precision loss.
  - Convert `NULL` values to a consistent "missing" representation (e.g., `None` or `NaN`).
- Chunked Fetching:
  - For medium-to-large exports, use `fetchmany(size)` or `OFFSET/LIMIT` pagination instead of fetching everything into memory at once.
- Aggregations:
  - Prefer performing calculations (SUM, AVG, COUNT) at the database level rather than pulling raw data to the client for processing.
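
Chunked fetching can be sketched against the same kind of in-memory SQLite stand-in (the table `t` and its contents are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [(i,) for i in range(1000)])

# Stream the result set in fixed-size batches instead of fetchall().
cur = conn.execute("SELECT v FROM t")
total = 0
while True:
    batch = cur.fetchmany(100)
    if not batch:
        break
    total += sum(v for (v,) in batch)

print(total)  # 499500
conn.close()
```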

## Error Handling & Recovery

- Timeout Management: If a query takes too long, retry with more restrictive filters or optimized joins.
- Syntax Errors: If a query fails, inspect the dialect-specific error message and re-verify the schema (it's often a misspelled column or missing quotes).

## Anti-Pattern Prevention (Avoiding "Bad" SQL)

- Index-Friendly Filters: Never wrap indexed columns in functions (e.g., `DATE()`, `UPPER()`) within the `WHERE` clause.
- Join Safety: Always verify join keys. Before joining, check whether the key has high cardinality to avoid massive intermediate result sets.
- Memory Safety:
  - Avoid `DISTINCT` and `UNION` (which performs de-duplication) on multi-million-row sets unless necessary; use `UNION ALL` if duplicates are acceptable.
  - Avoid `ORDER BY` on large non-indexed text fields.
- Wildcard Warning: Strictly avoid leading wildcards in `LIKE` patterns (e.g., `%term`) on large text columns.
- No Functions on Columns: `WHERE col = FUNC(val)` is good; `WHERE FUNC(col) = val` is bad.
- Explicit Columns: Only fetch what is necessary.
- Early Filtering: Push `WHERE` conditions as close to the base tables as possible.
- CTE for Clarity: Use `WITH` for complex multi-step logic to improve maintainability and help the optimizer.

## Best Practices

- Always verify the database structure before querying.
- Use appropriate sampling techniques for large datasets.
- Optimize queries for efficiency based on schema inspection.
- Self-review the draft SQL against the "Anti-Pattern Prevention" list.
- Perform a silent mental `EXPLAIN` on your query. If it smells like a full table scan on a large table, refactor it before outputting.
Lines changed: 50 additions & 0 deletions

@@ -0,0 +1,50 @@
---
name: text-file
description: Guidelines for handling text files
type: text
---

# Text File Handling Specifications

## Goals

- Safely read text files without memory exhaustion.
- Accurately detect encoding to avoid garbled characters.
- Identify underlying patterns (e.g., log formats, Markdown structure, delimiters).
- Efficiently extract or search for specific information within large volumes of text.

## Encoding & Detection

- Encoding Strategy:
  - Default to `utf-8`.
  - If it fails, try `utf-8-sig` (for files with a BOM), `gbk`/`gb18030` (for Chinese contexts), or `latin-1`.
  - Use `chardet` or similar logic if the encoding is unknown and the first few bytes look non-standard.
- Line Endings: Be aware of `\n` (Unix), `\r\n` (Windows), and `\r` (legacy Mac) when counting lines or splitting.

## Inspection

- Preview: Read the first 10-20 lines to determine:
  - Content Type: Is it a log, code, prose, or a semi-structured list?
  - Uniformity: Does every line follow the same format?
- Metadata: Check the total file size before reading. If >50MB, treat it as a "large file" and avoid full loading.

## Querying & Reading (Large Files)

- Streaming: For files exceeding memory or >50MB:
  - Use `with open(path) as f: for line in f:` to process line-by-line.
  - Never use `.read()` or `.readlines()` on large files.
- Random Sampling: To understand a huge file's structure, read the first N lines, the middle N lines (using `f.seek()`), and the last N lines.
- Pattern Matching: Use regular expressions (regex) for targeted extraction instead of complex string slicing.
- Grep-like Search: If searching for a keyword, iterate through lines and only store/return matching lines + context.
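
The streaming, grep-like pattern can be sketched as follows (an in-memory buffer stands in for a large log file; the log lines are hypothetical):

```python
import io

# Stand-in for a large log file opened with a context manager.
log = io.StringIO(
    "INFO boot ok\n"
    "ERROR disk full\n"
    "INFO retry\n"
    "ERROR net down\n"
)

# Iterate line-by-line and keep only matching lines with their line
# numbers; never call .read()/.readlines() on the whole file.
matches = [
    (n, line.rstrip("\n"))
    for n, line in enumerate(log, start=1)
    if "ERROR" in line
]
print(matches)  # [(2, 'ERROR disk full'), (4, 'ERROR net down')]
```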

## Data Quality

- Truncation Warning: If only a portion of the file is read, clearly state: "Displaying first X lines of Y total lines."
- Empty Lines/Comments: Decide early whether to ignore blank lines or lines starting with specific comment characters (e.g., `#`, `//`).

## Best Practices

- Resource Safety: Always use context managers (the `with` statement) to ensure file handles are closed.
- Memory Consciousness: For logs and large TXT files, prioritize "find and extract" over "load and filter."
- Regex Optimization: Compile regex patterns if they are used repeatedly in a loop over millions of lines.
- Validation: After reading, verify the content isn't binary (e.g., a PDF or EXE renamed to .txt) by checking for null bytes or a high density of non-ASCII characters.
- Progress Logging: For long-running text processing, log progress every 100k lines or 10% of file size.
