# feat(alias core): add data source management #110
Merged
## Commits (19)
- 387188d Set Sandbox Timeout to 120 seconds (StCarmen)
- 2f3ca9e Add the Alias-datascience dependency package and update the relevant … (StCarmen)
- cee4267 Remove code related to pre-installed packages (StCarmen)
- ab75302 fix pre-commit (StCarmen)
- 327d212 feat(core): add data source management (SSSuperDan)
- cc91b37 fix generate response failed error (StCarmen)
- a1427a1 fix(data source): handle profiling for irregular excel files; fix bug… (SSSuperDan)
- f6a29b3 fix(data profile): fix the generation profile (StCarmen)
- d2e6221 fix(data profile): use model_call_with_retry instead of dashscope.Gen… (StCarmen)
- 2b147bd fix(report generation): use structure model to format outout for repo… (SSSuperDan)
- 3b38e00 fix(data profile): downgrade the unsupported warning level (StCarmen)
- 6f825fb fix(cli): restore backward compatibility for --files argument (SSSuperDan)
- b545577 fix(data profile): refine the image content; remove assert; remove ap… (StCarmen)
- b6ff393 fix(meta planner): prevent automatically entering DS mode when attach… (SSSuperDan)
- 35d5a8f fix(data profile): add unified model interface(init at run.py) for da… (StCarmen)
- 7d172e4 type(data profile): format the LLMCallManager (StCarmen)
- 66ef58d type(data profile): add llm_call_manager for each mode (StCarmen)
- c7f1964 fix(data profie): await formatter (StCarmen)
- 51d1d88 type(data profile): add doc for each function (StCarmen)
`alias/src/alias/agent/agents/_built_in_skill/data/csv_excel/SKILL.md` (new file, 74 additions, 0 deletions)

---
name: csv-excel-file
description: Guidelines for handling CSV/Excel files
type:
- csv
- excel
---

# CSV/Excel Handling Specifications
## Goals

- Safely load tabular data without crashing.
- Detect and handle messy spreadsheets (multiple blocks, missing headers, merged-cell artifacts).
- Produce reliable outputs (a clean dataframe for clean tables, or structured JSON for messy spreadsheets) with validated types.

## Encoding, Delimiters, and Locale

- CSV encoding: try UTF-8 first; if the result is garbled, attempt common fallbacks (e.g., gbk, cp1252) based on context.
- Delimiters: detect common separators (`,`, `\t`, `;`, `|`) during inspection.
- Locale formats: be cautious with comma decimal separators and thousands separators.
## Inspection (always first)

- Identify the file type and encoding (CSV), and the sheet names (Excel), before full reads.
- Prefer small reads to preview structure:
  - CSV: `pd.read_csv(..., nrows=20)`; if the delimiter is uncertain: `sep=None, engine="python"` (small `nrows` only).
  - Excel: `pd.ExcelFile(path).sheet_names`, then `pd.read_excel(..., sheet_name=..., nrows=20)`.
- Use `df.head(n)` and `df.columns` to check for:
  - Missing/incorrect headers (e.g., columns are numeric 0..N-1)
  - "Unnamed: X" columns
  - Unexpected NaN/NaT, merged-cell artifacts
  - Multiple tables/blocks in one sheet (blank rows separating sections)
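The preview step above can be sketched with pandas; the in-memory `raw` string is a hypothetical stand-in for a real file path, and the delimiter is deliberately unknown so the sniffing path is exercised:

```python
import io
import pandas as pd

# Hypothetical CSV content with an unknown delimiter, standing in for a file path.
raw = "name;score;when\nalice;1;2024-01-01\nbob;2;2024-01-02\n"

# Small preview read; sep=None with the python engine sniffs the delimiter.
preview = pd.read_csv(io.StringIO(raw), sep=None, engine="python", nrows=20)

# Header sanity checks before committing to a full read.
unnamed = [c for c in preview.columns if str(c).startswith("Unnamed:")]
headerless = all(isinstance(c, int) for c in preview.columns)
```

The same `unnamed`/`headerless` checks apply unchanged once `io.StringIO(raw)` is replaced by a real path.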
## Preprocessing

- Treat the file as messy if any of the following is present:
  - Columns contain "Unnamed:" or mostly empty column names
  - The header row appears inside the data (the first rows look like data and a later row looks like a header)
  - Multiple data blocks (large blank-row gaps, repeated header patterns)
  - Predominantly NaN/NaT in the top rows or left columns
  - Notes/metadata blocks above or beside the table (titles, footnotes, merged header areas)
- If a messy spreadsheet is detected:
  - First choice: use the `clean_messy_spreadsheet` tool to extract key tables/fields and output JSON.
  - Fall back to manual parsing only if the tool fails, returns an empty/incorrect structure, or cannot locate the target table.
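A minimal heuristic for the first messiness signal ("Unnamed:" or mostly empty column names) might look like this; the sheet content is a hypothetical stand-in with a title row above the real header:

```python
import io
import pandas as pd

# Hypothetical messy sheet: a title row sits above the real header row.
raw = "Quarterly Report,,\nname,score,date\nalice,1,2024-01-01\nbob,2,2024-01-02\n"
df = pd.read_csv(io.StringIO(raw), nrows=20)

def looks_messy(frame: pd.DataFrame) -> bool:
    """Flag frames whose columns suggest a misplaced or missing header."""
    cols = [str(c) for c in frame.columns]
    has_unnamed = any(c.startswith("Unnamed:") for c in cols)
    mostly_empty = sum(c.strip() == "" for c in cols) > len(cols) / 2
    return has_unnamed or mostly_empty

messy = looks_messy(df)
```

A real implementation would then hand the file to the cleaning tool rather than parse it by hand.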
## Querying

- Never load entire datasets blindly.
- Use minimal reads:
  - `nrows`, `usecols`, `dtype` (or a partial dtype mapping); `parse_dates` only when necessary.
  - Sampling: `skiprows` with a step pattern for rough profiling when the file is huge.
- For very large CSV files:
  - Prefer `chunksize` iteration; aggregate/compute per chunk.
- For Excel:
  - Read only the needed `sheet_name`, and consider narrowing `usecols`/`nrows` during exploration.
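The `chunksize` pattern can be sketched as follows; the tiny in-memory CSV stands in for a file too large to load at once:

```python
import io
import pandas as pd

# Hypothetical large CSV; aggregates are computed per chunk, never on the whole file.
raw = "user,amount\n" + "\n".join(f"u{i % 3},{i}" for i in range(10))

total = 0.0
rows = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += chunk["amount"].sum()
    rows += len(chunk)

mean_amount = total / rows
```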
## Data Quality & Type Validation

- After load/clean:
  - Validate types:
    - Numeric columns: coerce with `pd.to_numeric(errors="coerce")`
    - Datetime columns: `pd.to_datetime(errors="coerce")`
  - Report coercion fallout (how many values became NaN/NaT).
  - Standardize missing values: treat empty strings, "N/A", and "null" consistently.
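A sketch of the coercion-plus-fallout report, using toy data with one bad value per column:

```python
import pandas as pd

df = pd.DataFrame({"qty": ["1", "2", "oops"],
                   "when": ["2024-01-01", "bad", "2024-03-05"]})

qty = pd.to_numeric(df["qty"], errors="coerce")
when = pd.to_datetime(df["when"], errors="coerce")

# Report coercion fallout so data loss is visible, not silent.
fallout = {"qty": int(qty.isna().sum()), "when": int(when.isna().sum())}
```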
## Best Practices

- Always inspect structure before processing.
- Handle encoding issues appropriately.
- Keep reads minimal; expand only after confirming the layout.
- Log decisions: chosen sheet, detected header row, dropped columns/rows, dtype conversions.
- Avoid silent data loss: when dropping or cleaning, summarize what changed.
- Validate data types after loading.
`alias/src/alias/agent/agents/_built_in_skill/data/image/SKILL.md` (new file, 47 additions, 0 deletions)

---
name: image-file
description: Guidelines for handling image files
type: image
---

# Image Handling Specifications
## Goals

- Safely identify image properties and metadata without memory exhaustion.
- Accurately extract text (OCR) and visual elements (object detection/description).
- Perform necessary pre-processing (resize, normalize, crop) for downstream tasks.
- Handle multi-frame or high-resolution images efficiently.
## Inspection (Always First)

- Identify properties: use lightweight libraries (e.g., PIL/Pillow) to get `format`, `size` (width/height), and `mode` (RGB, RGBA, CMYK).
- Check file size: if the image is exceptionally large (e.g., >20MB or >100MP), consider downsampling or tiling before full processing.
- Metadata/EXIF extraction:
  - Read EXIF data for orientation, GPS tags, and timestamps.
  - Correction: automatically apply the EXIF orientation so the image is "upright" before visual analysis.
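Assuming Pillow, the inspection steps can be sketched as follows; the in-memory image is a stand-in for `Image.open(path)`:

```python
from PIL import Image, ImageOps

# Hypothetical in-memory image; real code would use Image.open(path).
img = Image.new("RGB", (640, 480), color="white")

props = {"size": img.size, "mode": img.mode}   # cheap, no pixel work
megapixels = (img.size[0] * img.size[1]) / 1e6
needs_downsample = megapixels > 100            # the >100MP guard from above

# Normalize orientation before any visual analysis (a no-op without EXIF).
upright = ImageOps.exif_transpose(img)
```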
## Content Extraction & Vision

- Vision analysis:
  - Use multimodal vision models to describe scenes, identify objects, and detect activities.
  - For complex images (e.g., infographics, UI screenshots), guide the model to focus on specific regions.
- OCR (Optical Character Recognition):
  - If text is detected, specify whether to extract raw text or structured data (such as forms/tables).
  - Handle low-contrast or noisy backgrounds by applying pre-filters (grayscale, binarization).
- Format conversion: convert non-standard formats (e.g., HEIC, TIFF) to standard formats (JPEG/PNG) if tools require it.
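A minimal pre-filter sketch for the OCR step, again assuming Pillow and a synthetic stand-in image; the threshold value is illustrative:

```python
from PIL import Image

# Hypothetical low-contrast scan stand-in; real code would open a file instead.
img = Image.new("RGB", (100, 40), color=(200, 200, 210))

gray = img.convert("L")                                          # grayscale pre-filter
binary = gray.point(lambda p: 255 if p > 128 else 0, mode="1")   # simple threshold
```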
## Handling Large or Complex Images

- Tiling: for ultra-high-resolution images (e.g., satellite maps, medical scans), split into overlapping tiles to avoid missing small details.
- Batching: process multiple images using generators to keep memory usage stable.
- Alpha channel: be mindful of transparency (PNG/WebP); decide whether to discard it or composite against a solid background (e.g., white).
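The tiling idea can be sketched without any imaging library; the boxes are plain crop rectangles that a caller would feed to `img.crop(box)` (tile and overlap sizes are illustrative):

```python
def tile_boxes(width, height, tile=512, overlap=64):
    """Yield overlapping (left, upper, right, lower) crop boxes covering an image.

    Overlapping edges mean small objects on tile borders appear whole in at
    least one tile; boxes at the right/bottom edges are clipped to the image.
    """
    step = tile - overlap
    for top in range(0, height, step):
        for left in range(0, width, step):
            yield (left, top, min(left + tile, width), min(top + tile, height))

boxes = list(tile_boxes(1200, 800))
```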
## Best Practices

- Safety first: validate that the file is a genuine image (not a renamed malicious script).
- Graceful failure: handle corrupted files, truncated downloads, or unsupported formats with descriptive error logs.
- Efficiency: avoid unnecessary re-encoding (e.g., multiple JPEG saves) to prevent generation loss and artifacts.
- Process images individually or in small batches to prevent system crashes.
- Consider memory usage when working with large or high-resolution images.
- Resource management: close file pointers or use context managers (`with Image.open(...) as img:`) to prevent memory leaks.
`alias/src/alias/agent/agents/_built_in_skill/data/json/SKILL.md` (new file, 54 additions, 0 deletions)

---
name: json-file
description: Guidelines for handling JSON files
type: json
---
# JSON Handling Specifications

## Goals

- Safely parse JSON/JSONL without memory overflow.
- Discover the schema structure (keys, nesting depth, data types).
- Flatten complex nested structures into tabular data when necessary.
- Handle inconsistent schemas and "dirty" JSON (e.g., trailing commas, mixed types).
## Inspection (Always First)

- Structure discovery:
  - Determine whether the root is a `list` or a `dict`.
  - Identify whether it is standard JSON or JSONL (one valid JSON object per line).
- Schema sampling:
  - For large files, read the first few objects/lines to infer the schema.
  - Identify the top-level keys and their types.
  - Detect nesting depth: if depth > 3, treat it as a "deeply nested" structure.
- Size check:
  - If the file is large (>50MB), avoid `json.load()`. Use iterative parsing or streaming.
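A minimal layout detector along these lines, using only the standard library; the sample strings are hypothetical stand-ins for the first bytes/lines of real files:

```python
import json

# Hypothetical samples standing in for file previews.
standard = '{"items": [{"id": 1}, {"id": 2}]}'
jsonl = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'

def _parses(line: str) -> bool:
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False

def detect_layout(text: str) -> str:
    """Classify a sample as standard JSON (dict/list root), JSONL, or unknown."""
    try:
        root = json.loads(text)
        return "list" if isinstance(root, list) else "dict"
    except json.JSONDecodeError:
        # Not one document; check whether each non-empty line parses alone.
        lines = [ln for ln in text.splitlines() if ln.strip()]
        return "jsonl" if lines and all(_parses(ln) for ln in lines) else "unknown"
```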
## Processing & Extraction

- Lazy loading (streaming):
  - For massive JSON: use `ijson` (Python) or a similar streaming parser to yield specific paths/items.
  - For JSONL: read line by line using a generator to minimize the memory footprint.
- Flattening & normalization:
  - Use `pandas.json_normalize` to convert nested structures into flat tables when the goal is analysis.
  - Specify `max_level` during normalization to prevent "column explosion."
- Data filtering:
  - Extract only the required sub-trees (keys) early in the process to reduce in-memory object size.
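A `json_normalize` sketch showing `max_level` holding back the deeper levels; the records are toy data:

```python
import pandas as pd

# Hypothetical nested records, e.g. sampled from a larger JSON file.
records = [
    {"id": 1, "meta": {"author": {"name": "a"}, "tags": ["x"]}},
    {"id": 2, "meta": {"author": {"name": "b"}, "tags": []}},
]

# max_level=1 stops flattening below meta.*, so meta.author stays a dict
# instead of exploding into meta.author.name, meta.author.email, ...
flat = pd.json_normalize(records, max_level=1)
```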
## Data Quality & Schema Validation

- Missing keys: use `.get(key, default)` or `try-except` blocks. Never assume a key exists in all objects.
- Type coercion:
  - Validate numeric strings vs. actual numbers.
  - Standardize `null`, `""`, and `[]` consistently.
- Encoding: default to UTF-8; check for a BOM (`utf-8-sig`) if parsing fails.
- Malformed JSON recovery:
  - For minor syntax errors (e.g., single quotes instead of double), attempt `ast.literal_eval` or regex-based cleanup only as a fallback.
## Best Practices

- Minimal reads: don't load a 50MB JSON file just to read one config key; use a streaming approach.
- Schema logging: document the detected structure (e.g., "Root is a list of 500 objects; key 'metadata' is nested").
- Error transparency: when a JSON object in a JSONL stream is corrupted, log the line number, skip it, and continue instead of crashing the entire process.
- Avoid over-flattening: be cautious with deeply nested arrays; flattening them can lead to massive row duplication.
- Strict typing: after extraction, explicitly convert types (e.g., `pd.to_datetime`) to ensure downstream reliability.
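The error-transparency point above can be sketched as a tolerant JSONL reader; `io.StringIO` stands in for an open file with one corrupted line:

```python
import io
import json

# Hypothetical JSONL stream with one corrupted line in the middle.
stream = io.StringIO('{"id": 1}\n{bad json}\n{"id": 3}\n')

records, errors = [], []
for lineno, line in enumerate(stream, start=1):
    if not line.strip():
        continue
    try:
        records.append(json.loads(line))
    except json.JSONDecodeError:
        errors.append(lineno)   # log and skip instead of crashing the run
```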
`alias/src/alias/agent/agents/_built_in_skill/data/relational_database/SKILL.md` (new file, 70 additions, 0 deletions)

---
name: database
description: Guidelines for handling databases
type: relational_db
---

# Database Handling Specifications
## Goals

- Safely explore the database schema without performance degradation.
- Construct precise, efficient SQL queries that prevent system crashes (out-of-memory and timeouts).
- Handle dialect-specific nuances (PostgreSQL, MySQL, SQLite, etc.).
- Transform raw result sets into structured, validated data for analysis.
## Inspection

- Volume estimation:
  - Before any `SELECT *`, always run `SELECT COUNT(*) FROM table_name` to understand the scale.
  - If a table has more than 1,000,000 rows, strictly use indexed columns for filtering.
- Sample data:
  - Use `SELECT * FROM table_name LIMIT 5` to see actual data formats.
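With an in-memory SQLite database standing in for a real connection, the two inspection steps look like this:

```python
import sqlite3

# In-memory stand-in for a real database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany("INSERT INTO events (kind) VALUES (?)", [("click",)] * 12)
conn.commit()

# 1) Volume estimation before any broad SELECT.
(count,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()

# 2) Small sample to see actual data formats.
sample = conn.execute("SELECT * FROM events LIMIT 5").fetchall()
```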
## Querying

- Safety constraints:
  - Always use `LIMIT`: never execute a query without a `LIMIT` clause unless the row count is confirmed to be small.
  - Avoid `SELECT *`: in production-scale tables, explicitly name columns to reduce I/O and memory usage.
- Dialect & syntax:
  - Case sensitivity: if a column/table name contains uppercase or special characters, it MUST be quoted (e.g., `"UserTable"` in Postgres, `` `UserTable` `` in MySQL).
  - Date/time: use standard ISO strings for date filtering; be mindful of timezone-aware vs. naive columns.
- Complex queries:
  - For `JOIN` operations, ensure the joining columns are indexed to prevent full table scans.
  - When performing `GROUP BY`, ensure the result set size is manageable.
## Data Retrieval & Transformation

- Type mapping:
  - Ensure SQL types (e.g., `DECIMAL`, `BIGINT`, `TIMESTAMP`) are correctly mapped to Python/JSON types without precision loss.
  - Convert `NULL` values to a consistent "missing" representation (e.g., `None` or `NaN`).
- Chunked fetching:
  - For medium-to-large exports, use `fetchmany(size)` or `OFFSET`/`LIMIT` pagination instead of fetching everything into memory at once.
- Aggregations:
  - Prefer performing calculations (SUM, AVG, COUNT) at the database level rather than pulling raw data to the client for processing.
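A `fetchmany` sketch against a toy in-memory SQLite table; each batch keeps per-round-trip memory bounded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(float(i),) for i in range(10)])
conn.commit()

cur = conn.execute("SELECT amount FROM sales")
total = 0.0
while True:
    batch = cur.fetchmany(4)      # bounded memory per round trip
    if not batch:
        break
    total += sum(amount for (amount,) in batch)
```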
## Error Handling & Recovery

- Timeout management: if a query takes too long, retry with more restrictive filters or optimized joins.
- Syntax errors: if a query fails, inspect the dialect-specific error message and re-verify the schema (it is often a misspelled column or missing quotes).
## Anti-Pattern Prevention (Avoiding "Bad" SQL)

- Index-friendly filters: never wrap indexed columns in functions (e.g., `DATE()`, `UPPER()`) within the `WHERE` clause.
- Join safety: always verify join keys. Before joining, check whether the key has high cardinality to avoid massive intermediate result sets.
- Memory safety:
  - Avoid `DISTINCT` and `UNION` (which performs de-duplication) on multi-million-row sets unless necessary; use `UNION ALL` if duplicates are acceptable.
  - Avoid `ORDER BY` on large non-indexed text fields.
- Wildcard warning: strictly avoid leading wildcards in `LIKE` patterns (e.g., `%term`) on large text columns.
- No functions on columns: `WHERE col = FUNC(val)` is good; `WHERE FUNC(col) = val` is bad.
- Explicit columns: only fetch what is necessary.
- Early filtering: push `WHERE` conditions as close to the base tables as possible.
- CTEs for clarity: use `WITH` for complex multi-step logic to improve maintainability and help the optimizer.
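The index-friendly-filter rule, demonstrated on a toy SQLite table: both queries return the same rows, but only the second (sargable) form lets the optimizer use the index on `ts`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT)")
conn.execute("CREATE INDEX idx_logs_ts ON logs (ts)")
conn.executemany(
    "INSERT INTO logs VALUES (?)",
    [("2024-01-01T10:00:00",), ("2024-01-02T09:30:00",), ("2024-01-02T17:00:00",)],
)

# Bad: wrapping the indexed column in DATE() defeats the index.
bad = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE DATE(ts) = '2024-01-02'"
).fetchone()[0]

# Good: a sargable range filter over the raw column can use idx_logs_ts.
good = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE ts >= '2024-01-02' AND ts < '2024-01-03'"
).fetchone()[0]
```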
## Best Practices

- Always verify the database structure before querying.
- Use appropriate sampling techniques for large datasets.
- Optimize queries for efficiency based on schema inspection.
- Self-review the draft SQL against the "Anti-Pattern Prevention" list.
- Perform a silent mental `EXPLAIN` on your query; if it smells like a full table scan on a large table, refactor it before outputting.
`alias/src/alias/agent/agents/_built_in_skill/data/text/SKILL.md` (new file, 50 additions, 0 deletions)

---
name: text-file
description: Guidelines for handling text files
type: text
---

# Text File Handling Specifications
## Goals

- Safely read text files without memory exhaustion.
- Accurately detect the encoding to avoid garbled characters.
- Identify underlying patterns (e.g., log formats, Markdown structure, delimiters).
- Efficiently extract or search for specific information within large volumes of text.
## Encoding & Detection

- Encoding strategy:
  - Default to `utf-8`.
  - If it fails, try `utf-8-sig` (for files with a BOM), `gbk`/`gb18030` (in a Chinese-language context), or `latin-1`.
  - Use `chardet` or similar logic if the encoding is unknown and the first few bytes look non-standard.
- Line endings: be aware of `\n` (Unix), `\r\n` (Windows), and `\r` (legacy Mac) when counting lines or splitting.
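A fallback-decoding sketch using only the standard library; the byte string is a hypothetical gbk-encoded sample (gb18030 is a superset of gbk, so it decodes cleanly):

```python
def read_text_with_fallback(data: bytes,
                            encodings=("utf-8", "utf-8-sig", "gb18030", "latin-1")):
    """Try candidate encodings in order; return (decoded_text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 never fails, so with the default list this is unreachable.
    raise ValueError("none of the candidate encodings matched")

text, used = read_text_with_fallback("中文日志".encode("gbk"))
```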
## Inspection

- Preview: read the first 10-20 lines to determine:
  - Content type: is it a log, code, prose, or a semi-structured list?
  - Uniformity: does every line follow the same format?
- Metadata: check the total file size before reading. If it exceeds 50MB, treat it as a "large file" and avoid full loading.
## Querying & Reading (Large Files)

- Streaming: for files exceeding memory or larger than 50MB:
  - Use `with open(path) as f: for line in f:` to process line by line.
  - Never use `.read()` or `.readlines()` on large files.
- Random sampling: to understand a huge file's structure, read the first N lines, the middle N lines (using `f.seek()`), and the last N lines.
- Pattern matching: use regular expressions for targeted extraction instead of complex string slicing.
- Grep-like search: when searching for a keyword, iterate through lines and only store/return matching lines plus context.
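A grep-like streaming search sketch; `io.StringIO` stands in for a large log opened with `open(path)`, and the pattern is compiled once before the loop:

```python
import io
import re

# Hypothetical log stream standing in for a large file.
log = io.StringIO(
    "INFO start\nERROR disk full\nINFO retry\nERROR timeout\nINFO done\n"
)

pattern = re.compile(r"^ERROR\b")   # compiled once, reused per line
matches = []
for lineno, line in enumerate(log, start=1):
    if pattern.search(line):
        matches.append((lineno, line.rstrip("\n")))  # keep only hits, not the file
```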
## Data Quality

- Truncation warning: if only a portion of the file is read, clearly state: "Displaying first X lines of Y total lines."
- Empty lines/comments: decide early whether to ignore blank lines or lines starting with specific comment characters (e.g., `#`, `//`).
## Best Practices

- Resource safety: always use context managers (the `with` statement) to ensure file handles are closed.
- Memory consciousness: for logs and large TXT files, prioritize "find and extract" over "load and filter."
- Regex optimization: compile regex patterns if they are used repeatedly in a loop over millions of lines.
- Validation: after reading, verify the content isn't binary (e.g., a PDF or EXE renamed to .txt) by checking for null bytes or a high density of non-ASCII characters.
- Progress logging: for long-running text processing, log progress every 100k lines or 10% of the file size.
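The binary-content check above can be sketched as a null-byte and non-text-density heuristic; the threshold and samples are illustrative:

```python
def looks_binary(chunk: bytes, threshold: float = 0.30) -> bool:
    """Heuristic: null bytes or a high density of non-text bytes means 'not text'.

    Note: heavily non-ASCII (e.g., CJK UTF-8) text can trip the density check,
    so try encoding fallbacks before declaring a file binary.
    """
    if not chunk:
        return False
    if b"\x00" in chunk:            # null bytes never appear in plain text
        return True
    text_bytes = bytes(range(0x20, 0x7F)) + b"\n\r\t\b\f"
    nontext = sum(b not in text_bytes for b in chunk)
    return nontext / len(chunk) > threshold

is_pdf = looks_binary(b"%PDF-1.7\x00\xaa\xbb")           # renamed PDF stand-in
is_txt = looks_binary(b"plain old notes\nwith lines\n")  # genuine text stand-in
```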