Merged

19 commits
387188d
Set Sandbox Timeout to 120 seconds
StCarmen Jan 15, 2026
2f3ca9e
Add the Alias-datascience dependency package and update the relevant …
StCarmen Jan 15, 2026
cee4267
Remove code related to pre-installed packages
StCarmen Jan 16, 2026
ab75302
fix pre-commit
StCarmen Jan 16, 2026
327d212
feat(core): add data source management
SSSuperDan Jan 22, 2026
cc91b37
fix generate response failed error
StCarmen Jan 22, 2026
a1427a1
fix(data source): handle profiling for irregular excel files; fix bug…
SSSuperDan Jan 23, 2026
f6a29b3
fix(data profile): fix the generation profile
StCarmen Jan 23, 2026
d2e6221
fix(data profile): use model_call_with_retry instead of dashscope.Gen…
StCarmen Jan 23, 2026
2b147bd
fix(report generation): use structure model to format outout for repo…
SSSuperDan Jan 26, 2026
3b38e00
fix(data profile): downgrade the unsupported warning level
StCarmen Jan 27, 2026
6f825fb
fix(cli): restore backward compatibility for --files argument
SSSuperDan Jan 27, 2026
b545577
fix(data profile): refine the image content; remove assert; remove ap…
StCarmen Jan 27, 2026
b6ff393
fix(meta planner): prevent automatically entering DS mode when attach…
SSSuperDan Jan 27, 2026
35d5a8f
fix(data profile): add unified model interface(init at run.py) for da…
StCarmen Jan 28, 2026
7d172e4
type(data profile): format the LLMCallManager
StCarmen Jan 28, 2026
66ef58d
type(data profile): add llm_call_manager for each mode
StCarmen Jan 28, 2026
c7f1964
fix(data profie): await formatter
StCarmen Jan 28, 2026
51d1d88
type(data profile): add doc for each function
StCarmen Jan 28, 2026
17 changes: 15 additions & 2 deletions alias/README.md
@@ -207,10 +207,23 @@ alias_agent run --mode finance --task "Analyze Tesla's Q4 2024 financial perform
# Data Science mode
alias_agent run --mode ds \
--task "Analyze the distribution of incidents across categories in 'incident_records.csv' to identify imbalances, inconsistencies, or anomalies, and determine their root cause." \
--files ./docs/data/incident_records.csv
--datasource ./docs/data/incident_records.csv
```

**Note**: Files uploaded with `--files` are automatically copied to `/workspace` in the sandbox. Generated files are available in `sessions_mount_dir` subdirectories.
#### Input/Output Management

**Input:**
- Use the `--datasource` parameter (alias `--files`, kept for backward compatibility) to specify data sources. Supported formats:
- **Local files**: such as `./data.txt` or `/absolute/path/file.json`
- **Database DSN**: supports relational databases like PostgreSQL and SQLite, with format like `postgresql://user:password@host:port/database`

Examples: `--datasource file.txt postgresql://user:password@localhost:5432/mydb`

- Specified data sources are automatically profiled (analyzed), and the resulting profile guides the model toward efficient data source access.
- Uploaded files are automatically copied to the `/workspace` directory in the sandbox.

**Output:**
- Generated files are stored in subdirectories of `sessions_mount_dir`, where all output results can be found.

#### Enable Long-Term Memory Service (General Mode Only)
To enable the long-term memory service in General mode, you need to:
19 changes: 17 additions & 2 deletions alias/README_ZH.md
@@ -208,10 +208,25 @@ alias_agent run --mode finance --task "Analyze Tesla's Q4 2024 financial perform
# Data Science mode
alias_agent run --mode ds \
--task "Analyze the distribution of incidents across categories in 'incident_records.csv' to identify imbalances, inconsistencies, or anomalies, and determine their root cause." \
--files ./docs/data/incident_records.csv
--datasource ./docs/data/incident_records.csv
```

**Note**: Files uploaded with `--files` are automatically copied to `/workspace` in the sandbox. Generated files are available in subdirectories of `sessions_mount_dir`.
#### Input/Output Management

**Input:**
- Use the `--datasource` parameter to specify data sources; multiple formats are supported (`--files` is also accepted for backward compatibility):
- **Local files**: e.g. `./data.txt` or `/absolute/path/file.json`
- **Database DSN**: relational databases such as PostgreSQL and SQLite, in the form `postgresql://user:password@host:port/database`

Example: `--datasource file.txt postgresql://user:password@localhost:5432/mydb`
- Specified data sources are automatically profiled (analyzed), and the resulting profile guides the model toward efficient data source access.
- Uploaded files are automatically copied to the `/workspace` directory in the sandbox.

**Output:**
- Generated files are stored in subdirectories of `sessions_mount_dir`, where all output results can be found.

#### Enable Long-Term Memory Service (General Mode Only)
To enable the long-term memory service in General mode, you need to:
3 changes: 2 additions & 1 deletion alias/pyproject.toml
@@ -45,7 +45,8 @@ dependencies = [
"agentscope-runtime>=1.0.0",
"aiosqlite>=0.21.0",
"asyncpg>=0.30.0",
"itsdangerous>=2.2.0"
"itsdangerous>=2.2.0",
"polars>=1.37.1"
]

[tool.setuptools]
@@ -0,0 +1,74 @@
---
name: csv-excel-file
description: Guidelines for handling CSV/Excel files
type:
- csv
- excel
---

# CSV/Excel Handling Specifications

## Goals

- Safely load tabular data without crashing.
- Detect and handle messy spreadsheets (multiple blocks, missing headers, merged cells artifacts).
- Produce reliable outputs (a clean dataframe for a clean table, or structured JSON for a messy spreadsheet) with validated types.

## Encoding, Delimiters, and Locale

- CSV encoding: Try UTF-8; if garbled, attempt common fallbacks (e.g., gbk, cp1252) based on context.
- Delimiters: Detect common separators (`,`, `\t`, `;`, `|`) during inspection.
- Locale formats: Be cautious with comma decimal separators and thousands separators.

## Inspection (always first)

- Identify file type, encoding (CSV), and sheet names (Excel) before full reads.
- Prefer small reads to preview structure:
- CSV: `pd.read_csv(..., nrows=20)`; if the delimiter is uncertain: `sep=None, engine="python"` (small `nrows` only).
- Excel: `pd.ExcelFile(path).sheet_names`, then `pd.read_excel(..., sheet_name=..., nrows=20)`.
- Use `df.head(n)` and `df.columns` to check:
- Missing/incorrect headers (e.g., columns are numeric 0..N-1)
- "Unnamed: X" columns
- Unexpected NaN/NaT, merged-cell artifacts
- Multiple tables/blocks in one sheet (blank rows separating sections)

## Preprocessing

- Treat as messy if any of the following is present:
- Columns contain "Unnamed:" or mostly empty column names
- Header row appears inside the data (first rows look like data + later row looks like header)
- Multiple data blocks (large blank-row gaps, repeated header patterns)
- Predominantly NaN/NaT in top rows/left columns
- Notes/metadata blocks above/beside the table (titles, footnotes, merged header areas)
- If messy spreadsheets are detected:
- First choice: use `clean_messy_spreadsheet` tool to extract key tables/fields and output JSON.
- Only fall back to manual parsing if tool fails, returns empty/incorrect structure, or cannot locate the target table.

## Querying

- Never load entire datasets blindly.
- Use minimal reads:
- `nrows`, `usecols`, `dtype` (or partial dtype mapping), `parse_dates` only when necessary.
- Sampling: `skiprows` with a step pattern for rough profiling when file is huge.
- For very large CSV:
- Prefer `chunksize` iteration; aggregate/compute per chunk.
- For Excel:
- Read only needed `sheet_name`, and consider narrowing `usecols`/`nrows` during exploration.
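For very large CSVs, per-chunk aggregation might look like the following sketch (the column name and chunk size are placeholders):

```python
import pandas as pd

def sum_column_chunked(path, column, chunksize=100_000):
    """Aggregate one column without loading the whole CSV into memory."""
    total = 0.0
    # usecols limits I/O to the single column we need.
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        total += pd.to_numeric(chunk[column], errors="coerce").sum()
    return total
```

The same pattern (iterate, aggregate, discard) extends to counts, group-bys accumulated in a dict, or writing filtered rows back out per chunk.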

## Data Quality & Type Validation

- After load/clean:
- Validate types:
- Numeric columns: coerce with `pd.to_numeric(errors="coerce")`
- Datetime columns: `pd.to_datetime(errors="coerce")`
- Report coercion fallout (how many values became NaN/NaT).
- Standardize missing values: treat empty strings/`"N/A"`/`"null"` consistently.

## Best Practices

- Always inspect structure before processing.
- Handle encoding issues appropriately.
- Keep reads minimal; expand only after confirming layout.
- Log decisions: chosen sheet, detected header row, dropped columns/rows, dtype conversions.
- Avoid silent data loss: when dropping/cleaning, summarize what changed.
- Validate data types after loading.
47 changes: 47 additions & 0 deletions alias/src/alias/agent/agents/_built_in_skill/data/image/SKILL.md
@@ -0,0 +1,47 @@
---
name: image-file
description: Guidelines for handling image files
type: image
---

# Images Handling Specifications

## Goals

- Safely identify image properties and metadata without memory exhaustion.
- Accurately extract text (OCR) and visual elements (Object Detection/Description).
- Perform necessary pre-processing (resize, normalize, crop) for downstream tasks.
- Handle multi-frame or high-resolution images efficiently.

## Inspection (Always First)

- Identify Properties: Use lightweight libraries (e.g., PIL/Pillow) to get `format`, `size` (width/height), and `mode` (RGB, RGBA, CMYK).
- Check File Size: If the image is exceptionally large (e.g., >20MB or >100MP), consider downsampling or tiling before full processing.
- Metadata/EXIF Extraction:
- Read EXIF data for orientation, GPS tags, and timestamps.
- Correction: Automatically apply EXIF orientation to ensure the image is "upright" before visual analysis.
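A minimal Pillow sketch of the inspect-and-correct step (the helper name is illustrative):

```python
from PIL import Image, ImageOps

def load_upright(path):
    """Open an image, report its properties, and apply EXIF orientation."""
    with Image.open(path) as img:
        print(f"format={img.format} size={img.size} mode={img.mode}")
        # No-op when the image carries no EXIF orientation tag.
        upright = ImageOps.exif_transpose(img)
        return upright.copy()  # detach pixels from the closing file handle
```

`ImageOps.exif_transpose` handles all eight EXIF orientation values, which is safer than rotating manually based on the raw tag.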

## Content Extraction & Vision

- Vision Analysis:
- Use multimodal vision models to describe scenes, identify objects, and detect activities.
- For complex images (e.g., infographics, UI screenshots), guide the model to focus on specific regions.
- OCR (Optical Character Recognition):
- If text is detected, specify whether to extract "raw text" or "structured data" (like forms/tables).
- Handle low-contrast or noisy backgrounds by applying pre-filters (grayscale, binarization).
- Format Conversion: Convert non-standard formats (e.g., HEIC, TIFF) to standard formats (JPEG/PNG) if tools require it.

## Handling Large or Complex Images

- Tiling: For ultra-high-res images (e.g., satellite maps, medical scans), split into overlapping tiles to avoid missing small details.
- Batching: Process multiple images using generators to keep memory usage stable.
- Alpha Channel: Be mindful of transparency (PNG/WebP); decide whether to discard it or composite against a solid background (e.g., white).

## Best Practices

- Safety First: Validate that the file is a genuine image (not a renamed malicious script).
- Graceful Failure: Handle corrupted files, truncated downloads, or unsupported formats with descriptive error logs.
- Efficiency: Avoid unnecessary re-encoding (e.g., multiple JPEG saves) to prevent "generation loss" or artifacts.
- Process images individually or in small batches to prevent system crashes.
- Consider memory usage when working with large or high-resolution images.
- Resource Management: Close file pointers or use context managers (`with Image.open(...) as img:`) to prevent memory leaks.
54 changes: 54 additions & 0 deletions alias/src/alias/agent/agents/_built_in_skill/data/json/SKILL.md
@@ -0,0 +1,54 @@
---
name: json-file
description: Guidelines for handling json files
type: json
---

# JSON Handling Specifications

## Goals
- Safely parse JSON/JSONL without memory overflow.
- Discover schema structure (keys, nesting depth, data types).
- Flatten complex nested structures into tabular data when necessary.
- Handle inconsistent schemas and "dirty" JSON (e.g., trailing commas, mixed types).

## Inspection (Always First)

- Structure Discovery:
- Determine if the root is a `list` or a `dict`.
- Identify whether it is standard JSON or JSONL (one valid JSON object per line).
- Schema Sampling:
- For large files, read the first few objects/lines to infer the schema.
- Identify top-level keys and their types.
- Detect nesting depth: If depth > 3, consider it a "deeply nested" structure.
- Size Check:
- If the file is large (>50MB), avoid `json.load()`. Use iterative parsing or streaming.

## Processing & Extraction

- Lazy Loading (Streaming):
- For massive JSON: Use `ijson` (Python) or similar streaming parsers to yield specific paths/items.
- For JSONL: Read line-by-line using a generator to minimize memory footprint.
- Flattening & Normalization:
- Use `pandas.json_normalize` to convert nested structures into flat tables if the goal is analysis.
- Specify `max_level` during normalization to prevent "column explosion."
- Data Filtering:
- Extract only required sub-trees (keys) early in the process to reduce the memory object size.
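The streaming and error-transparency points combine into a small generator. A minimal sketch, assuming plain stdlib `json` (for huge single-document JSON, `ijson` would replace this):

```python
import json

def iter_jsonl(path, encoding="utf-8"):
    """Yield one parsed object per line; log and skip corrupt lines."""
    with open(path, encoding=encoding) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Error transparency: report the line, keep streaming.
                print(f"skipping malformed line {lineno}: {exc}")
```

Because it is a generator, memory stays bounded by one line at a time, and a single corrupt record never aborts the whole pass.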

## Data Quality & Schema Validation

- Missing Keys: Use `.get(key, default)` or `try-except` blocks. Never assume a key exists in all objects.
- Type Coercion:
- Validate numeric strings vs. actual numbers.
- Standardize `null`, `""`, and `[]` consistently.
- Encoding: Default to UTF-8; check for BOM (utf-8-sig) if parsing fails.
- Malformed JSON Recovery:
- For minor syntax errors (e.g., single quotes instead of double), attempt `ast.literal_eval` or regex-based cleanup only as a fallback.

## Best Practices

- Minimal Reads: Don't load a 50MB JSON just to read one config key; use a streaming approach.
- Schema Logging: Document the detected structure (e.g., "Root is a list of 500 objects; key 'metadata' is nested").
- Error Transparency: When a JSON object in a JSONL stream is corrupted, log the line number, skip it, and continue instead of crashing the entire process.
- Avoid Over-Flattening: Be cautious with deeply nested arrays; flattening them can lead to massive row duplication.
- Strict Typing: After extraction, explicitly convert types (e.g., `pd.to_datetime`) to ensure downstream reliability.
@@ -0,0 +1,70 @@
---
name: database
description: Guidelines for handling databases
type: relational_db
---

# Database Handling Specifications

## Goals

- Safely explore database schema without performance degradation.
- Construct precise, efficient SQL queries that prevent system crashes (out-of-memory errors and timeouts).
- Handle dialect-specific nuances (PostgreSQL, MySQL, SQLite, etc.).
- Transform raw result sets into structured, validated data for analysis.

## Inspection

- Volume Estimation:
- Before any `SELECT *`, always run `SELECT COUNT(*) FROM table_name` to understand the scale.
- If a table has >1,000,000 rows, strictly use indexed columns for filtering.
- Sample Data:
- Use `SELECT * FROM table_name LIMIT 5` to see actual data formats.

## Querying

- Safety Constraints:
- Always use `LIMIT`: Never execute a query without a `LIMIT` clause unless the row count is confirmed to be small.
- Avoid `SELECT *`: In production-scale tables, explicitly name columns to reduce I/O and memory usage.
- Dialect & Syntax:
- Case Sensitivity: If a column/table name contains uppercase or special characters, MUST quote it (e.g., `"UserTable"` in Postgres, `` `UserTable` `` in MySQL).
- Date/Time: Use standard ISO strings for date filtering; be mindful of timezone-aware vs. naive columns.
- Complex Queries:
- For `JOIN` operations, ensure joining columns are indexed to prevent full table scans.
- When performing `GROUP BY`, ensure the result set size is manageable.

## Data Retrieval & Transformation

- Type Mapping:
- Ensure SQL types (e.g., `DECIMAL`, `BIGINT`, `TIMESTAMP`) are correctly mapped to Python/JSON types without precision loss.
- Convert `NULL` values to a consistent "missing" representation (e.g., `None` or `NaN`).
- Chunked Fetching:
- For medium-to-large exports, use `fetchmany(size)` or `OFFSET/LIMIT` pagination instead of fetching everything into memory at once.
- Aggregations:
- Prefer performing calculations (SUM, AVG, COUNT) at the database level rather than pulling raw data to the client for processing.
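Chunked fetching can be sketched with the DB-API `fetchmany` (shown here against SQLite; the pattern is identical for other drivers):

```python
import sqlite3

def fetch_in_chunks(conn, query, params=(), size=1000):
    """Stream query results with fetchmany instead of one big fetchall."""
    cur = conn.execute(query, params)
    while True:
        rows = cur.fetchmany(size)
        if not rows:
            break
        yield from rows  # caller processes rows without holding them all
```

Driver-side `fetchmany` is generally preferable to `OFFSET/LIMIT` pagination, which re-executes the query and re-scans skipped rows on each page.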

## Error Handling & Recovery

- Timeout Management: If a query takes too long, retry with more restrictive filters or optimized joins.
- Syntax Errors: If a query fails, inspect the dialect-specific error message and re-verify the schema (it's often a misspelled column or missing quotes).

## Anti-Pattern Prevention (Avoiding "Bad" SQL)

- Index-Friendly Filters: Never wrap indexed columns in functions (e.g., `DATE()`, `UPPER()`) within the `WHERE` clause.
- Join Safety: Always verify join keys. Before joining, check if the key has high cardinality to avoid massive intermediate result sets.
- Memory Safety:
- Avoid `DISTINCT` and `UNION` (which performs de-duplication) on multi-million row sets unless necessary; use `UNION ALL` if duplicates are acceptable.
- Avoid `ORDER BY` on large non-indexed text fields.
- Wildcard Warning: Strictly avoid leading wildcards in `LIKE` patterns (e.g., `%term`) on large text columns.
- No Function on Columns: `WHERE col = FUNC(val)` is good; `WHERE FUNC(col) = val` is bad.
- Explicit Columns: Only fetch what is necessary.
- Early Filtering: Push `WHERE` conditions as close to the base tables as possible.
- CTE for Clarity: Use `WITH` for complex multi-step logic to improve maintainability and optimizer hints.

## Best Practices

- Always verify database structure before querying.
- Use appropriate sampling techniques for large datasets.
- Optimize queries for efficiency based on schema inspection.
- Self-review the draft SQL against the "Anti-Pattern Prevention" list.
- Perform a silent mental `EXPLAIN` on your query. If it looks like a full table scan on a large table, refactor it before outputting.
50 changes: 50 additions & 0 deletions alias/src/alias/agent/agents/_built_in_skill/data/text/SKILL.md
@@ -0,0 +1,50 @@
---
name: text-file
description: Guidelines for handling text files
type: text
---

# Text Files Handling Specifications

## Goals
- Safely read text files without memory exhaustion.
- Accurately detect encoding to avoid garbled characters.
- Identify underlying patterns (e.g., Log formats, Markdown structure, delimiters).
- Efficiently extract or search for specific information within large volumes of text.

## Encoding & Detection

- Encoding Strategy:
- Default to `utf-8`.
- If it fails, try `utf-8-sig` (for files with BOM), `gbk/gb18030` (for Chinese context), or `latin-1`.
- Use `chardet` or similar logic if encoding is unknown and first few bytes look non-standard.
- Line Endings: Be aware of `\n` (Unix), `\r\n` (Windows), and `\r` (Legacy Mac) when counting lines or splitting.
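The fallback chain above can be sketched as a small helper (a minimal sketch; the encoding order is the suggested default, not a fixed rule):

```python
def read_text_robust(path, encodings=("utf-8", "utf-8-sig", "gb18030", "latin-1")):
    """Try encodings in order; return (text, encoding) for the first that decodes."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} decoded {path}")
```

Note that `latin-1` decodes any byte sequence, so it should stay last: it guarantees success but may silently produce mojibake for genuinely non-Latin text.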

## Inspection

- Preview: Read the first 10-20 lines to determine:
- Content Type: Is it a log, code, prose, or a semi-structured list?
- Uniformity: Does every line follow the same format?
- Metadata: Check total file size before reading. If >50MB, treat as a "large file" and avoid full loading.

## Querying & Reading (Large Files)

- Streaming: For files exceeding memory or >50MB:
- Use `with open(path) as f: for line in f:` to process line-by-line.
- Never use `.read()` or `.readlines()` on large files.
- Random Sampling: To understand a huge file's structure, read the first N lines, the middle N lines (using `f.seek()`), and the last N lines.
- Pattern Matching: Use Regular Expressions (Regex) for targeted extraction instead of complex string slicing.
- Grep-like Search: If searching for a keyword, iterate through lines and only store/return matching lines + context.
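The grep-like search can be sketched as a streaming scan that keeps only matches plus a few context lines (function name and defaults are illustrative):

```python
import re
from collections import deque

def grep_with_context(path, pattern, before=2, after=2, encoding="utf-8"):
    """Stream a file line-by-line; return (lineno, line) matches with context."""
    rx = re.compile(pattern)  # compile once, reuse across millions of lines
    prev = deque(maxlen=before)  # rolling window of preceding lines
    results, pending = [], 0
    with open(path, encoding=encoding) as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if rx.search(line):
                results.extend(prev)
                prev.clear()
                results.append((lineno, line))
                pending = after
            elif pending:
                results.append((lineno, line))
                pending -= 1
            else:
                prev.append((lineno, line))
    return results
```

Memory use is bounded by the context window plus the matches themselves, regardless of file size.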

## Data Quality

- Truncation Warning: If only a portion of the file is read, clearly state: "Displaying first X lines of Y total lines."
- Empty Lines/Comments: Decide early whether to ignore blank lines or lines starting with specific comment characters (e.g., `#`, `//`).

## Best Practices

- Resource Safety: Always use context managers (`with` statement) to ensure file handles are closed.
- Memory Consciousness: For logs and large TXT, prioritize "find and extract" over "load and filter."
- Regex Optimization: Compile regex patterns if they are used repeatedly in a loop over millions of lines.
- Validation: After reading, verify the content isn't binary (e.g., PDF or EXE renamed to .txt) by checking for null bytes or a high density of non-ASCII characters.
- Progress Logging: For long-running text processing, log progress every 100k lines or 10% of file size.