A multi-agent data processing system built on AgentScope and Data-Juicer (DJ). This project demonstrates how to leverage the natural language understanding capabilities of large language models, enabling non-expert users to easily harness the powerful data processing capabilities of Data-Juicer.
📋 Table of Contents
- What Does This Agent Do?
- Architecture
- Quick Start
- Agent Introduction
- Advanced Features
- Feature Preview
- Troubleshooting
Data-Juicer (DJ) is a one-stop system for text and multimodal data processing for large language models. It provides nearly 200 core data processing operators, covering multimodal data such as text, images, and videos, and supports the full pipeline of data analysis, cleaning, and synthesis.
After running this example, you can:
- Intelligent Query: Find suitable operators from nearly 200 data processing operators for your data scenarios
- Automated Pipeline: Describe your data processing needs, automatically generate Data-Juicer YAML configurations and execute them
- Custom Extension: Quickly develop custom operators for specific scenarios
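To make the "Automated Pipeline" item more concrete, here is a minimal sketch of the kind of recipe the agent might generate. The top-level keys and the operator names `text_length_filter` and `document_deduplicator` follow Data-Juicer's recipe layout as understood here. The paths, parameter values, and the `render_recipe` helper are illustrative assumptions, so confirm operators and parameters with the agent before running anything.

```python
import json


def render_recipe(recipe: dict) -> str:
    """Render a recipe dict as YAML text (handles only this restricted shape)."""
    lines = []
    for key, value in recipe.items():
        if key == "process":
            continue
        # JSON scalars (quoted strings, numbers, true/false) are valid YAML scalars.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("process:")
    for op in recipe["process"]:
        (name, params), = op.items()  # each entry holds exactly one operator
        lines.append(f"  - {name}:")
        for param_name, param_value in params.items():
            lines.append(f"      {param_name}: {json.dumps(param_value)}")
    return "\n".join(lines) + "\n"


recipe = {
    "project_name": "demo-clean",
    "dataset_path": "./data/input.jsonl",      # hypothetical input path
    "export_path": "./outputs/cleaned.jsonl",  # hypothetical output path
    "np": 4,                                   # number of worker processes
    "process": [
        {"text_length_filter": {"min_len": 10, "max_len": 10000}},
        {"document_deduplicator": {"lowercase": True, "ignore_non_character": False}},
    ],
}

yaml_text = render_recipe(recipe)
print(yaml_text)
```

A file with this content would typically be passed to `dj-process --config <file>`.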
User Query
    ↓
Router Agent ──┐
               ├── Data Processing Agent (DJ Agent)
               │   ├── General File Read/Write Tools
               │   ├── query_dj_operators (Query DataJuicer operators)
               │   └── execute_safe_command (Execute safe commands including dj-process, dj-analyze)
               │
               └── Code Development Agent (DJ Dev Agent)
                   ├── General File Read/Write Tools
                   ├── get_basic_files (Get basic development knowledge)
                   ├── get_operator_example (Get operator source code examples related to requirements)
                   └── configure_data_juicer_path (Configure DataJuicer path)
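The `execute_safe_command` tool in the diagram suggests some form of command vetting. The sketch below shows one common way to do that, an allowlist check. It is not the project's implementation: the `validate_command` helper, the two-entry allowlist, and the set of rejected characters are all assumptions made for illustration.

```python
import shlex

# Assumption: only these two Data-Juicer CLIs are permitted in this sketch.
ALLOWED_COMMANDS = frozenset({"dj-process", "dj-analyze"})
# Characters that would allow chaining, substitution, or redirection in a shell.
FORBIDDEN_CHARS = frozenset(";|&`$<>\n")


def validate_command(command: str) -> list:
    """Return the parsed argv if the command is allowed, else raise ValueError."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {argv[0]}")
    return argv


argv = validate_command("dj-process --config recipe.yaml")
print(argv)
```

The returned list can be handed to `subprocess.run(argv)` without `shell=True`, so the string is never interpreted by a shell.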
- Python 3.10+
- Valid DashScope API key
- Optional: Data-Juicer source code (for custom operator development)
# Recommended to use uv
uv pip install -r requirements.txt
or
pip install -r requirements.txt
- Set API Key
export DASHSCOPE_API_KEY="your-dashscope-key"
- Optional: Configure Data-Juicer Path (for custom operator development)
export DATA_JUICER_PATH="your-data-juicer-path"
Tip: You can also set this during runtime through conversation, for example:
- "Help me set the DataJuicer path: /path/to/data-juicer"
- "Help me update the DataJuicer path: /path/to/data-juicer"
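As a rough illustration of how a program might consume these two environment variables, the sketch below reads them and fails early when the required key is missing. The `Settings` class and `load_settings` function are hypothetical names, not part of the project.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: str
    data_juicer_path: Optional[Path]  # None when DATA_JUICER_PATH is unset


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment (or from a mapping, for testing)."""
    env = os.environ if env is None else env
    api_key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    raw_path = env.get("DATA_JUICER_PATH", "").strip()
    dj_path = Path(raw_path).expanduser() if raw_path else None
    return Settings(api_key=api_key, data_juicer_path=dj_path)


settings = load_settings({"DASHSCOPE_API_KEY": "your-dashscope-key"})
print(settings.data_juicer_path)
```

Treating the Data-Juicer path as optional mirrors the setup above, where it is needed only for custom operator development.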
Choose the running mode using the -u or --use_studio parameter:
# Use AgentScope Studio (provides interactive interface)
python main.py --use_studio True
# Or use command-line mode (default)
python main.py
The Data Processing Agent (DJ Agent) interacts with Data-Juicer and executes the actual data processing tasks. It supports automatic operator recommendation from natural language descriptions, configuration generation, and execution.
Typical Use Cases:
- Data Cleaning: Deduplication, removal of low-quality samples, format standardization
- Multimodal Processing: Process text, image, and video data simultaneously
- Batch Conversion: Format conversion, data augmentation, feature extraction
The Code Development Agent (DJ Dev Agent) assists in developing custom data processing operators and is powered by the qwen3-coder-480b-a35b-instruct model by default.
Typical Use Cases:
- Develop domain-specific filter or transformation operators
- Integrate proprietary data processing logic
- Extend Data-Juicer capabilities for specific scenarios
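For a sense of what a domain-specific filter involves, the dependency-free sketch below follows the common two-step filter idea: compute a per-sample statistic, then keep or drop the sample based on it. `SketchFilter` and `DigitRatioFilter` are hypothetical stand-ins and do not use Data-Juicer's real base classes or registration mechanism. For the actual interfaces, rely on the examples the agent retrieves with `get_operator_example`.

```python
from abc import ABC, abstractmethod


class SketchFilter(ABC):
    """Hypothetical stand-in for a filter operator (not Data-Juicer's base class)."""

    @abstractmethod
    def compute_stats(self, sample: dict) -> dict:
        """Attach the statistic the filter needs and return the sample."""

    @abstractmethod
    def keep(self, sample: dict) -> bool:
        """Decide whether the sample passes, based on the computed statistic."""

    def run(self, samples: list) -> list:
        # Copy each sample so the caller's data is not mutated.
        with_stats = [self.compute_stats(dict(sample)) for sample in samples]
        return [sample for sample in with_stats if self.keep(sample)]


class DigitRatioFilter(SketchFilter):
    """Drop samples whose text consists mostly of digits."""

    def __init__(self, max_ratio: float = 0.3):
        self.max_ratio = max_ratio

    def compute_stats(self, sample: dict) -> dict:
        text = sample.get("text", "")
        digits = sum(ch.isdigit() for ch in text)
        sample["digit_ratio"] = digits / len(text) if text else 0.0
        return sample

    def keep(self, sample: dict) -> bool:
        return sample["digit_ratio"] <= self.max_ratio


samples = [{"text": "hello world"}, {"text": "1234567890 ab"}]
kept = DigitRatioFilter(max_ratio=0.3).run(samples)
print([sample["text"] for sample in kept])
```

Separating the statistic from the decision makes it possible to inspect or reuse the statistic without rerunning the filter logic.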
DJ Agent implements an intelligent operator retrieval tool that quickly finds the most relevant operators from Data-Juicer's nearly 200 operators through an independent LLM query process. This is a key component enabling the data processing agent and code development agent to run accurately.
We provide three retrieval modes to choose from based on different scenarios:
LLM Retrieval (default)
- Uses the Qwen-Turbo model to match the most relevant operators
- Provides detailed matching reasons and relevance scores
- Suitable for scenarios requiring high-precision matching, but consumes more tokens
Vector Retrieval (vector)
- Based on DashScope text embedding and FAISS similarity search
- Fast and efficient, suitable for large-scale retrieval scenarios
Auto Mode (auto)
- Prioritizes LLM retrieval, automatically falls back to vector retrieval on failure
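The vector and auto modes can be illustrated with a small, self-contained sketch. It ranks operators by cosine similarity over hand-made toy vectors and wraps a primary retriever with a fallback. The real tool uses DashScope embeddings and FAISS, which are not reproduced here. The function names, the toy vectors, and the failing `broken_llm_retrieve` stub are assumptions for illustration only.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_retrieve(query_vec: Sequence[float],
                    index: Dict[str, Sequence[float]],
                    top_k: int = 2) -> List[str]:
    """Return the top_k operator names ranked by cosine similarity."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:top_k]


def auto_retrieve(query: str,
                  primary: Callable[[str], List[str]],
                  fallback: Callable[[str], List[str]]) -> List[str]:
    """Try the primary retriever first and use the fallback if it raises."""
    try:
        return primary(query)
    except Exception:
        return fallback(query)


# Toy embeddings invented for this sketch (real ones come from an embedding model).
TOY_INDEX = {
    "text_length_filter": [0.9, 0.1, 0.0],
    "document_deduplicator": [0.1, 0.9, 0.0],
    "image_size_filter": [0.0, 0.1, 0.9],
}


def broken_llm_retrieve(query: str) -> List[str]:
    raise RuntimeError("simulated LLM failure")


def toy_vector_retrieve(query: str) -> List[str]:
    # A fixed query vector stands in for embedding the query text.
    return vector_retrieve([0.2, 0.8, 0.1], TOY_INDEX)


result = auto_retrieve("remove duplicate documents", broken_llm_retrieve, toy_vector_retrieve)
print(result)
```

Catching every exception keeps the sketch short. A production fallback would more likely catch only network and API errors so that programming mistakes still surface.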
Specify the retrieval mode using the -r or --retrieve_mode parameter:
python main.py --retrieve_mode vector
For more parameter descriptions, see python main.py --help
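For readers curious how flags such as `-u/--use_studio True` and `-r/--retrieve_mode vector` can be parsed, here is a small `argparse` sketch. It is not the project's `main.py`. The `str_to_bool` helper is hypothetical, and the `llm` identifier for the default mode is an assumption, since only `vector` and `auto` are named explicitly in the mode descriptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse an explicit boolean value such as 'True' or 'false'."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-u", "--use_studio", type=str_to_bool, default=False,
                        help="run with AgentScope Studio instead of the command line")
    # Assumption: "llm" names the default LLM retrieval mode.
    parser.add_argument("-r", "--retrieve_mode", choices=["llm", "vector", "auto"],
                        default="llm", help="operator retrieval mode")
    return parser


args = build_parser().parse_args(["--use_studio", "True", "--retrieve_mode", "vector"])
print(args.use_studio, args.retrieve_mode)
```

A custom converter is used because `type=bool` would treat any non-empty string, including `False`, as true.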
Data-Juicer provides MCP (Model Context Protocol) services that can directly obtain operator information and execute data processing through native interfaces, making it easy to migrate and integrate without separate LLM queries and command-line calls.
Data-Juicer provides two MCP server modes:
Recipe-Flow (Data Recipe)
- Filter by operator type and tags
- Support combining multiple operators into data recipes for execution
Granular-Operators (Fine-grained Operators)
- Provide each operator as an independent tool
- Flexibly specify operator lists through environment variables
- Build fully customized data processing pipelines
For detailed information, please refer to: Data-Juicer MCP Service Documentation
Note: The Data-Juicer MCP server is currently in early development, and features and tools may change with ongoing development.
Configure the service address in configs/mcp_config.json:
{
  "mcpServers": {
    "DJ_recipe_flow": {
      "url": "http://127.0.0.1:8080/sse"
    }
  }
}
Enable MCP Agent to replace DJ Agent:
# Enable MCP Agent and Dev Agent
python main.py --available_agents [dj_mcp, dj_dev]
# Or use shorthand
python main.py -a [dj_mcp, dj_dev]
The Data-Juicer agent ecosystem is rapidly expanding. Here are the new agents currently in development or planned:
One planned agent provides users with detailed answers about Data-Juicer operators, concepts, and best practices.
Another generates data analysis and visualization results and is expected to be released soon.
Q: How do I get a DashScope API key? A: Register an account on the official DashScope website and apply for an API key.
Q: Why does operator retrieval fail? A: Check your network connection and API key configuration, or try switching to vector retrieval mode.
Q: How do I debug custom operators? A: Make sure the Data-Juicer path is configured correctly, and check the example code provided by the code development agent.
Q: What should I do if the MCP service connection fails? A: Check that the MCP server is running and confirm that the URL in the configuration file is correct.
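The MCP connection check can be partly scripted. The sketch below parses a configuration in the `mcpServers` shape used by configs/mcp_config.json, extracts each server's host and port, and offers a plain TCP reachability probe. The helper names are hypothetical, and a successful TCP connection shows only that something is listening, not that it is a working MCP server.

```python
import json
import socket
from typing import Dict, Tuple
from urllib.parse import urlparse


def server_endpoints(config_text: str) -> Dict[str, Tuple[str, int]]:
    """Map each configured server name to its (host, port)."""
    config = json.loads(config_text)
    endpoints = {}
    for name, entry in config.get("mcpServers", {}).items():
        parsed = urlparse(entry["url"])
        # Fall back to the scheme's default port when none is given.
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
        endpoints[name] = (parsed.hostname, port)
    return endpoints


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


SAMPLE_CONFIG = '{"mcpServers": {"DJ_recipe_flow": {"url": "http://127.0.0.1:8080/sse"}}}'
endpoints = server_endpoints(SAMPLE_CONFIG)
print(endpoints)
```

To check a real setup, read the file with `open("configs/mcp_config.json").read()` and call `is_reachable(host, port)` for each entry.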
- For large-scale data processing, use Data-Juicer's distributed mode
- Set batch size appropriately to balance memory usage and processing speed
- For more advanced data processing features (synthesis, Data-Model Co-Development), refer to the Data-Juicer documentation
Contributing: Issues and pull requests that improve AgentScope, DataJuicer Agent, and Data-Juicer are welcome. If you run into problems or have feature suggestions, please feel free to contact us.

