Commit 679431b

SSSuperDan and StCarmen authored
feat(alias core): add data source management (#110)
Co-authored-by: Tianjing Zeng <39507457+StCarmen@users.noreply.github.com>
Co-authored-by: stcarmen <1106135234@qq.com>
1 parent df0776c commit 679431b

36 files changed: +3330 −318 lines changed

alias/README.md

Lines changed: 15 additions & 2 deletions
@@ -207,10 +207,23 @@ alias_agent run --mode finance --task "Analyze Tesla's Q4 2024 financial perform
 # Data Science mode
 alias_agent run --mode ds \
     --task "Analyze the distribution of incidents across categories in 'incident_records.csv' to identify imbalances, inconsistencies, or anomalies, and determine their root cause." \
-    --files ./docs/data/incident_records.csv
+    --datasource ./docs/data/incident_records.csv
 ```
-**Note**: Files uploaded with `--files` are automatically copied to `/workspace` in the sandbox. Generated files are available in `sessions_mount_dir` subdirectories.
+#### Input/Output Management
+
+**Input:**
+- Use the `--datasource` parameter (`--files` is kept as an alias for backward compatibility) to specify data sources in several formats:
+  - **Local files**: e.g. `./data.txt` or `/absolute/path/file.json`
+  - **Database DSN**: relational databases such as PostgreSQL and SQLite, e.g. `postgresql://user:password@host:port/database`
+
+  Example: `--datasource file.txt postgresql://user:password@localhost:5432/mydb`
+
+- Specified data sources are automatically profiled (analyzed), and the resulting profile gives the model guidance for efficient data source access.
+- Uploaded files are automatically copied to the `/workspace` directory in the sandbox.
+
+**Output:**
+- Generated files are stored in subdirectories of `sessions_mount_dir`, where all output results can be found.

 #### Enable Long-Term Memory Service (General Mode Only)
 To enable the long-term memory service in General mode, you need to:

alias/README_ZH.md

Lines changed: 17 additions & 2 deletions
@@ -208,10 +208,25 @@ alias_agent run --mode finance --task "Analyze Tesla's Q4 2024 financial perform
 # Data Science mode
 alias_agent run --mode ds \
     --task "Analyze the distribution of incidents across categories in 'incident_records.csv' to identify imbalances, inconsistencies, or anomalies, and determine their root cause." \
-    --files ./docs/data/incident_records.csv
+    --datasource ./docs/data/incident_records.csv
 ```
-**Note**: Files uploaded with `--files` are automatically copied to `/workspace` in the sandbox. Generated files can be found in subdirectories of `sessions_mount_dir`.
+#### Input/Output Management
+
+**Input:**
+- Use the `--datasource` parameter to specify data sources in several formats (`--files` is also supported for backward compatibility):
+  - **Local files**: e.g. `./data.txt` or `/absolute/path/file.json`
+  - **Database DSN**: relational databases such as PostgreSQL and SQLite, e.g. `postgresql://user:password@host:port/database`
+
+  Example: `--datasource file.txt postgresql://user:password@localhost:5432/mydb`
+- Specified data sources are automatically profiled (analyzed), and the profile gives the model guidance for efficient data source access.
+- Uploaded files are automatically copied to the `/workspace` directory in the sandbox.
+
+**Output:**
+- Generated files are stored in subdirectories of `sessions_mount_dir`, where all output results can be found.

 #### Enable Long-Term Memory Service (General Mode Only)
 To enable the long-term memory service in General mode, you need to:

alias/pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -45,7 +45,8 @@ dependencies = [
     "agentscope-runtime>=1.0.0",
     "aiosqlite>=0.21.0",
     "asyncpg>=0.30.0",
-    "itsdangerous>=2.2.0"
+    "itsdangerous>=2.2.0",
+    "polars>=1.37.1"
 ]

 [tool.setuptools]
Lines changed: 74 additions & 0 deletions

@@ -0,0 +1,74 @@
---
name: csv-excel-file
description: Guidelines for handling CSV/Excel files
type:
  - csv
  - excel
---

# CSV/Excel Handling Specifications

## Goals

- Safely load tabular data without crashing.
- Detect and handle messy spreadsheets (multiple blocks, missing headers, merged-cell artifacts).
- Produce reliable outputs (a clean dataframe for a clean table, or structured JSON for a messy spreadsheet) with validated types.

## Encoding, Delimiters, and Locale

- CSV encoding: Try UTF-8; if garbled, attempt common fallbacks (e.g., gbk, cp1252) based on context.
- Delimiters: Detect common separators (`,`, `\t`, `;`, `|`) during inspection.
- Locale formats: Be cautious with comma decimal separators and thousands separators.

## Inspection (always first)

- Identify file type, encoding (CSV), and sheet names (Excel) before full reads.
- Prefer small reads to preview structure:
  - CSV: `pd.read_csv(..., nrows=20)`; if the delimiter is uncertain: `sep=None, engine="python"` (small `nrows` only).
  - Excel: `pd.ExcelFile(path).sheet_names`, then `pd.read_excel(..., sheet_name=..., nrows=20)`.
- Use `df.head(n)` and `df.columns` to check:
  - Missing/incorrect headers (e.g., columns are numeric 0..N-1)
  - "Unnamed: X" columns
  - Unexpected NaN/NaT, merged-cell artifacts
  - Multiple tables/blocks in one sheet (blank rows separating sections)
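
A minimal sketch of the small-read inspection above, using an in-memory sample in place of a real file (the data and column names are hypothetical):

```python
import io

import pandas as pd

# In-memory sample standing in for an on-disk CSV with an unknown delimiter.
raw = "id;name;amount\n1;alpha;10\n2;beta;20\n3;gamma;30\n"

# Small read only; sep=None with engine="python" sniffs the delimiter (";").
preview = pd.read_csv(io.StringIO(raw), sep=None, engine="python", nrows=20)

# Check headers for messy-layout signals before committing to a full read.
messy = any(str(c).startswith("Unnamed:") for c in preview.columns)
print(preview.columns.tolist())  # ['id', 'name', 'amount']
print(messy)                     # False
```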

## Preprocessing

- Treat as messy if any of the following is present:
  - Columns contain "Unnamed:" or mostly empty column names
  - Header row appears inside the data (first rows look like data, a later row looks like a header)
  - Multiple data blocks (large blank-row gaps, repeated header patterns)
  - Predominantly NaN/NaT in top rows/left columns
  - Notes/metadata blocks above/beside the table (titles, footnotes, merged header areas)
- If a messy spreadsheet is detected:
  - First choice: use the `clean_messy_spreadsheet` tool to extract key tables/fields and output JSON.
  - Only fall back to manual parsing if the tool fails, returns an empty/incorrect structure, or cannot locate the target table.

## Querying

- Never load entire datasets blindly.
- Use minimal reads:
  - `nrows`, `usecols`, `dtype` (or a partial dtype mapping), `parse_dates` only when necessary.
  - Sampling: `skiprows` with a step pattern for rough profiling when the file is huge.
- For very large CSV:
  - Prefer `chunksize` iteration; aggregate/compute per chunk.
- For Excel:
  - Read only the needed `sheet_name`, and consider narrowing `usecols`/`nrows` during exploration.
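
The `chunksize` pattern can be sketched as follows (an in-memory buffer stands in for a very large CSV; the data is hypothetical):

```python
import io

import pandas as pd

# Stand-in for a huge CSV: 1,000 rows of a single numeric column.
raw = "value\n" + "\n".join(str(i) for i in range(1000))

# Iterate in chunks and aggregate per chunk instead of loading everything.
total = 0
for chunk in pd.read_csv(io.StringIO(raw), usecols=["value"], chunksize=100):
    total += int(chunk["value"].sum())

print(total)  # 499500
```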

## Data Quality & Type Validation

- After load/clean:
  - Validate types:
    - Numeric columns: coerce with `pd.to_numeric(errors="coerce")`
    - Datetime columns: `pd.to_datetime(errors="coerce")`
  - Report coercion fallout (how many values became NaN/NaT).
- Standardize missing values: treat empty strings/"N/A"/"null" consistently.
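
A minimal sketch of the coerce-and-report step (the column values are hypothetical):

```python
import pandas as pd

# Mixed-quality values as they often arrive from a messy sheet.
df = pd.DataFrame({"amount": ["10", "20.5", "N/A", "", "thirty"]})

# Coerce to numeric; unparseable values ("N/A", "", "thirty") become NaN.
numeric = pd.to_numeric(df["amount"], errors="coerce")

# Report the coercion fallout instead of losing data silently.
fallout = int(numeric.isna().sum())
print(fallout)  # 3
```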

## Best Practices

- Always inspect structure before processing.
- Handle encoding issues appropriately.
- Keep reads minimal; expand only after confirming layout.
- Log decisions: chosen sheet, detected header row, dropped columns/rows, dtype conversions.
- Avoid silent data loss: when dropping/cleaning, summarize what changed.
- Validate data types after loading.
Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@
---
name: image-file
description: Guidelines for handling image files
type: image
---

# Image Handling Specifications

## Goals

- Safely identify image properties and metadata without memory exhaustion.
- Accurately extract text (OCR) and visual elements (object detection/description).
- Perform necessary pre-processing (resize, normalize, crop) for downstream tasks.
- Handle multi-frame or high-resolution images efficiently.

## Inspection (Always First)

- Identify Properties: Use lightweight libraries (e.g., PIL/Pillow) to get `format`, `size` (width/height), and `mode` (RGB, RGBA, CMYK).
- Check File Size: If the image is exceptionally large (e.g., >20MB or >100MP), consider downsampling or tiling before full processing.
- Metadata/EXIF Extraction:
  - Read EXIF data for orientation, GPS tags, and timestamps.
  - Correction: Automatically apply EXIF orientation to ensure the image is "upright" before visual analysis.
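
The property check and orientation fix can be sketched with Pillow (an in-memory image stands in for an uploaded file; `exif_transpose` is a no-op here because the sample has no orientation tag):

```python
import io

from PIL import Image, ImageOps

# Build a small in-memory JPEG standing in for an uploaded file.
buf = io.BytesIO()
Image.new("RGB", (120, 80), color="white").save(buf, format="JPEG")
buf.seek(0)

with Image.open(buf) as img:
    # Lightweight property inspection before any heavy processing.
    print(img.format, img.size, img.mode)  # JPEG (120, 80) RGB
    # Apply EXIF orientation so downstream analysis sees an upright image.
    upright = ImageOps.exif_transpose(img)
    print(upright.size)  # (120, 80)
```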

## Content Extraction & Vision

- Vision Analysis:
  - Use multimodal vision models to describe scenes, identify objects, and detect activities.
  - For complex images (e.g., infographics, UI screenshots), guide the model to focus on specific regions.
- OCR (Optical Character Recognition):
  - If text is detected, specify whether to extract "raw text" or "structured data" (like forms/tables).
  - Handle low-contrast or noisy backgrounds by applying pre-filters (grayscale, binarization).
- Format Conversion: Convert non-standard formats (e.g., HEIC, TIFF) to standard formats (JPEG/PNG) if tools require it.

## Handling Large or Complex Images

- Tiling: For ultra-high-res images (e.g., satellite maps, medical scans), split into overlapping tiles to avoid missing small details.
- Batching: Process multiple images using generators to keep memory usage stable.
- Alpha Channel: Be mindful of transparency (PNG/WebP); decide whether to discard it or composite against a solid background (e.g., white).

## Best Practices

- Safety First: Validate that the file is a genuine image (not a renamed malicious script).
- Graceful Failure: Handle corrupted files, truncated downloads, or unsupported formats with descriptive error logs.
- Efficiency: Avoid unnecessary re-encoding (e.g., multiple JPEG saves) to prevent "generation loss" or artifacts.
- Process images individually or in small batches to prevent system crashes.
- Consider memory usage when working with large or high-resolution images.
- Resource Management: Close file pointers or use context managers (`with Image.open(...) as img:`) to prevent memory leaks.
Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
---
name: json-file
description: Guidelines for handling JSON files
type: json
---

# JSON Handling Specifications

## Goals

- Safely parse JSON/JSONL without memory overflow.
- Discover schema structure (keys, nesting depth, data types).
- Flatten complex nested structures into tabular data when necessary.
- Handle inconsistent schemas and "dirty" JSON (e.g., trailing commas, mixed types).

## Inspection (Always First)

- Structure Discovery:
  - Determine if the root is a `list` or a `dict`.
  - Identify whether it's standard JSON or JSONL (one valid JSON object per line).
- Schema Sampling:
  - For large files, read the first few objects/lines to infer the schema.
  - Identify top-level keys and their types.
  - Detect nesting depth: if depth > 3, consider it a "deeply nested" structure.
- Size Check:
  - If the file is large (>50MB), avoid `json.load()`. Use iterative parsing or streaming.

## Processing & Extraction

- Lazy Loading (Streaming):
  - For massive JSON: use `ijson` (Python) or a similar streaming parser to yield specific paths/items.
  - For JSONL: read line-by-line using a generator to minimize memory footprint.
- Flattening & Normalization:
  - Use `pandas.json_normalize` to convert nested structures into flat tables if the goal is analysis.
  - Specify `max_level` during normalization to prevent "column explosion."
- Data Filtering:
  - Extract only required sub-trees (keys) early in the process to reduce the in-memory object size.
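
A minimal sketch of line-by-line JSONL streaming, including skipping a corrupt line rather than crashing (the records are hypothetical):

```python
import io
import json

# JSONL sample: one object per line, with one deliberately corrupt line.
stream = io.StringIO(
    '{"id": 1, "v": 10}\n'
    "not-json\n"
    '{"id": 2, "v": 20}\n'
)

def iter_jsonl(fp):
    """Yield parsed objects one line at a time; skip malformed lines."""
    for lineno, line in enumerate(fp, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            print(f"skipping malformed line {lineno}")

# .get() guards against keys missing from some objects.
total = sum(obj.get("v", 0) for obj in iter_jsonl(stream))
print(total)  # 30
```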

## Data Quality & Schema Validation

- Missing Keys: Use `.get(key, default)` or `try-except` blocks. Never assume a key exists in all objects.
- Type Coercion:
  - Validate numeric strings vs. actual numbers.
  - Standardize `null`, `""`, and `[]` consistently.
- Encoding: Default to UTF-8; check for a BOM (`utf-8-sig`) if parsing fails.
- Malformed JSON Recovery:
  - For minor syntax errors (e.g., single quotes instead of double), attempt `ast.literal_eval` or regex-based cleanup only as a fallback.

## Best Practices

- Minimal Reads: Don't load a 50MB JSON file just to read one config key; use a streaming approach.
- Schema Logging: Document the detected structure (e.g., "Root is a list of 500 objects; key 'metadata' is nested").
- Error Transparency: When a JSON object in a JSONL stream is corrupted, log the line number, skip it, and continue instead of crashing the entire process.
- Avoid Over-Flattening: Be cautious with deeply nested arrays; flattening them can lead to massive row duplication.
- Strict Typing: After extraction, explicitly convert types (e.g., `pd.to_datetime`) to ensure downstream reliability.
Lines changed: 70 additions & 0 deletions

@@ -0,0 +1,70 @@
---
name: database
description: Guidelines for handling databases
type: relational_db
---

# Database Handling Specifications

## Goals

- Safely explore database schemas without performance degradation.
- Construct precise, efficient SQL queries that prevent system crashes (out-of-memory and timeouts).
- Handle dialect-specific nuances (PostgreSQL, MySQL, SQLite, etc.).
- Transform raw result sets into structured, validated data for analysis.

## Inspection

- Volume Estimation:
  - Before any `SELECT *`, always run `SELECT COUNT(*) FROM table_name` to understand the scale.
  - If a table has >1,000,000 rows, strictly use indexed columns for filtering.
- Sample Data:
  - Use `SELECT * FROM table_name LIMIT 5` to see actual data formats.
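
A sketch of the count-then-sample flow, using an in-memory SQLite database with a hypothetical `incidents` table in place of a real DSN:

```python
import sqlite3

# In-memory SQLite stands in for a real DSN-configured database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, category TEXT);"
    "INSERT INTO incidents (category) VALUES ('hw'), ('sw'), ('hw');"
)

# Volume estimation first, so later queries can be bounded appropriately.
(count,) = conn.execute("SELECT COUNT(*) FROM incidents").fetchone()
print(count)  # 3

# Then a bounded sample to see actual data formats.
rows = conn.execute("SELECT * FROM incidents LIMIT 5").fetchall()
print(rows)  # [(1, 'hw'), (2, 'sw'), (3, 'hw')]
conn.close()
```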

## Querying

- Safety Constraints:
  - Always use `LIMIT`: never execute a query without a `LIMIT` clause unless the row count is confirmed to be small.
  - Avoid `SELECT *`: in production-scale tables, explicitly name columns to reduce I/O and memory usage.
- Dialect & Syntax:
  - Case Sensitivity: if a column/table name contains uppercase or special characters, it MUST be quoted (e.g., `"UserTable"` in Postgres, `` `UserTable` `` in MySQL).
  - Date/Time: use standard ISO strings for date filtering; be mindful of timezone-aware vs. naive columns.
- Complex Queries:
  - For `JOIN` operations, ensure the joined columns are indexed to prevent full table scans.
  - When performing `GROUP BY`, ensure the result set size is manageable.

## Data Retrieval & Transformation

- Type Mapping:
  - Ensure SQL types (e.g., `DECIMAL`, `BIGINT`, `TIMESTAMP`) are correctly mapped to Python/JSON types without precision loss.
  - Convert `NULL` values to a consistent "missing" representation (e.g., `None` or `NaN`).
- Chunked Fetching:
  - For medium-to-large exports, use `fetchmany(size)` or `OFFSET/LIMIT` pagination instead of fetching everything into memory at once.
- Aggregations:
  - Prefer performing calculations (SUM, AVG, COUNT) at the database level rather than pulling raw data to the client for processing.
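
Chunked fetching can be sketched against the same kind of in-memory SQLite stand-in (the table `t` and its contents are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [(i,) for i in range(1000)])

# Stream the result set in fixed-size batches instead of fetchall().
cur = conn.execute("SELECT v FROM t")
total = 0
while True:
    batch = cur.fetchmany(100)
    if not batch:
        break
    total += sum(v for (v,) in batch)

print(total)  # 499500
conn.close()
```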

## Error Handling & Recovery

- Timeout Management: If a query takes too long, retry with more restrictive filters or optimized joins.
- Syntax Errors: If a query fails, inspect the dialect-specific error message and re-verify the schema (it's often a misspelled column or missing quotes).

## Anti-Pattern Prevention (Avoiding "Bad" SQL)

- Index-Friendly Filters: Never wrap indexed columns in functions (e.g., `DATE()`, `UPPER()`) within the `WHERE` clause.
- Join Safety: Always verify join keys. Before joining, check whether the key has high cardinality to avoid massive intermediate result sets.
- Memory Safety:
  - Avoid `DISTINCT` and `UNION` (which performs de-duplication) on multi-million-row sets unless necessary; use `UNION ALL` if duplicates are acceptable.
  - Avoid `ORDER BY` on large non-indexed text fields.
- Wildcard Warning: Strictly avoid leading wildcards in `LIKE` patterns (e.g., `%term`) on large text columns.
- No Functions on Columns: `WHERE col = FUNC(val)` is good; `WHERE FUNC(col) = val` is bad.
- Explicit Columns: Only fetch what is necessary.
- Early Filtering: Push `WHERE` conditions as close to the base tables as possible.
- CTE for Clarity: Use `WITH` for complex multi-step logic to improve maintainability and help the optimizer.

## Best Practices

- Always verify the database structure before querying.
- Use appropriate sampling techniques for large datasets.
- Optimize queries for efficiency based on schema inspection.
- Self-review the draft SQL against the "Anti-Pattern Prevention" list.
- Perform a silent mental `EXPLAIN` on your query. If it smells like a full table scan on a large table, refactor it before outputting.
Lines changed: 50 additions & 0 deletions

@@ -0,0 +1,50 @@
---
name: text-file
description: Guidelines for handling text files
type: text
---

# Text File Handling Specifications

## Goals

- Safely read text files without memory exhaustion.
- Accurately detect encoding to avoid garbled characters.
- Identify underlying patterns (e.g., log formats, Markdown structure, delimiters).
- Efficiently extract or search for specific information within large volumes of text.

## Encoding & Detection

- Encoding Strategy:
  - Default to `utf-8`.
  - If it fails, try `utf-8-sig` (for files with a BOM), `gbk`/`gb18030` (for Chinese contexts), or `latin-1`.
  - Use `chardet` or similar logic if the encoding is unknown and the first few bytes look non-standard.
- Line Endings: Be aware of `\n` (Unix), `\r\n` (Windows), and `\r` (legacy Mac) when counting lines or splitting.

## Inspection

- Preview: Read the first 10-20 lines to determine:
  - Content Type: Is it a log, code, prose, or a semi-structured list?
  - Uniformity: Does every line follow the same format?
- Metadata: Check the total file size before reading. If >50MB, treat it as a "large file" and avoid full loading.

## Querying & Reading (Large Files)

- Streaming: For files exceeding memory or >50MB:
  - Use `with open(path) as f: for line in f:` to process line-by-line.
  - Never use `.read()` or `.readlines()` on large files.
- Random Sampling: To understand a huge file's structure, read the first N lines, the middle N lines (using `f.seek()`), and the last N lines.
- Pattern Matching: Use regular expressions (regex) for targeted extraction instead of complex string slicing.
- Grep-like Search: If searching for a keyword, iterate through lines and only store/return matching lines + context.
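
The streaming, grep-like pattern can be sketched as follows (an in-memory buffer stands in for a large log file; the log lines are hypothetical):

```python
import io

# Stand-in for a large log file opened with a context manager.
log = io.StringIO(
    "INFO boot ok\n"
    "ERROR disk full\n"
    "INFO retry\n"
    "ERROR net down\n"
)

# Iterate line-by-line and keep only matching lines with their line
# numbers; never call .read()/.readlines() on the whole file.
matches = [
    (n, line.rstrip("\n"))
    for n, line in enumerate(log, start=1)
    if "ERROR" in line
]
print(matches)  # [(2, 'ERROR disk full'), (4, 'ERROR net down')]
```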

## Data Quality

- Truncation Warning: If only a portion of the file is read, clearly state: "Displaying first X lines of Y total lines."
- Empty Lines/Comments: Decide early whether to ignore blank lines or lines starting with specific comment characters (e.g., `#`, `//`).

## Best Practices

- Resource Safety: Always use context managers (the `with` statement) to ensure file handles are closed.
- Memory Consciousness: For logs and large TXT files, prioritize "find and extract" over "load and filter."
- Regex Optimization: Compile regex patterns if they are used repeatedly in a loop over millions of lines.
- Validation: After reading, verify the content isn't binary (e.g., a PDF or EXE renamed to .txt) by checking for null bytes or a high density of non-ASCII characters.
- Progress Logging: For long-running text processing, log progress every 100k lines or 10% of file size.
