DataJuicer Agent

A multi-agent data processing system built on AgentScope and Data-Juicer (DJ). This project demonstrates how to leverage the natural language understanding capabilities of large language models, enabling non-expert users to easily harness the powerful data processing capabilities of Data-Juicer.

📋 Table of Contents

What Does This Agent Do?
Architecture
Quick Start
Agent Introduction
- Data Processing Agent
- Code Development Agent (DJ Dev Agent)
Advanced Features
- Operator Retrieval
  - Retrieval Modes
  - Usage
- MCP Agent
Feature Preview
- Data-Juicer Q&A Agent (Demo Available)
- Data Analysis and Visualization Agent (In Development)
Troubleshooting
- Common Issues
- Optimization Recommendations

What Does This Agent Do?

Data-Juicer (DJ) is a one-stop system for text and multimodal data processing for large language models. It provides nearly 200 core data processing operators, covering multimodal data such as text, images, and videos, and supports the full pipeline of data analysis, cleaning, and synthesis.

After running this example, you can:

Intelligent Query: Find operators that suit your data scenario among nearly 200 data processing operators
Automated Pipeline: Describe your data processing needs; the agent generates the Data-Juicer YAML configuration and executes it automatically
Custom Extension: Quickly develop custom operators for specific scenarios
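
To make the "Automated Pipeline" item concrete, the sketch below builds a small recipe of the kind the agent might generate and renders it as YAML. The top-level keys (project_name, dataset_path, export_path, np, process) and the operators text_length_filter and document_deduplicator follow Data-Juicer's recipe format as best understood here. The paths, parameter values, and the render_recipe helper are illustrative assumptions, so check operator names and arguments against the Data-Juicer operator documentation before relying on them.

```python
# Hypothetical recipe: paths and parameter values are placeholders, not defaults.
recipe = {
    "project_name": "demo-clean",
    "dataset_path": "./data/input.jsonl",
    "export_path": "./outputs/cleaned.jsonl",
    "np": 4,
    "process": [
        {"text_length_filter": {"min_len": 10, "max_len": 10000}},
        {"document_deduplicator": {"lowercase": True}},
    ],
}


def _scalar(value):
    """Render a bool, number, or string as a YAML scalar."""
    if isinstance(value, bool):  # checked before int because bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def render_recipe(recipe):
    """Render the flat recipe shape used above as YAML (not a general YAML writer)."""
    lines = [f"{k}: {_scalar(v)}" for k, v in recipe.items() if k != "process"]
    lines.append("process:")
    for op in recipe["process"]:
        for name, params in op.items():
            lines.append(f"  - {name}:")
            for pname, pval in params.items():
                lines.append(f"      {pname}: {_scalar(pval)}")
    return "\n".join(lines) + "\n"


print(render_recipe(recipe))
```

A recipe like this, saved to a file, is the kind of input the dj-process command consumes.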

Architecture

User Query
    ↓
Router Agent ──┐
               ├── Data Processing Agent (DJ Agent)
               │   ├── General File Read/Write Tools
               │   ├── query_dj_operators (Query DataJuicer operators)
               │   └── execute_safe_command (Execute safe commands including dj-process, dj-analyze)
               │
               └── Code Development Agent (DJ Dev Agent)
                   ├── General File Read/Write Tools
                   ├── get_basic_files (Get basic development knowledge)
                   ├── get_operator_example (Get operator source code examples related to requirements)
                   └── configure_data_juicer_path (Configure DataJuicer path)
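
The diagram above can be read as a mapping from agents to tools plus a routing step. The sketch below mirrors that shape in plain Python. The tool names are copied from the diagram; the agent labels, the keyword list, and the route function are hypothetical stand-ins, since the actual Router Agent presumably relies on an LLM rather than keyword matching.

```python
# Tool names come from the architecture diagram; everything else is illustrative.
AGENT_TOOLS = {
    "data_processing_agent": [
        "general_file_read_write",
        "query_dj_operators",
        "execute_safe_command",
    ],
    "code_development_agent": [
        "general_file_read_write",
        "get_basic_files",
        "get_operator_example",
        "configure_data_juicer_path",
    ],
}

# Assumed hints that a request is about writing new operator code.
DEV_KEYWORDS = ("develop", "implement", "custom operator", "new operator")


def route(query: str) -> str:
    """Toy router: send operator-development requests to the dev agent."""
    text = query.lower()
    if any(keyword in text for keyword in DEV_KEYWORDS):
        return "code_development_agent"
    return "data_processing_agent"


for q in ("Deduplicate my dataset", "Develop a custom operator for emoji filtering"):
    agent = route(q)
    print(f"{q!r} -> {agent} (tools: {', '.join(AGENT_TOOLS[agent])})")
```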

Quick Start

System Requirements

Python 3.10+
Valid DashScope API key
Optional: Data-Juicer source code (for custom operator development)

Installation

# Recommended: install with uv
uv pip install -r requirements.txt

# Or install with pip
pip install -r requirements.txt

Configuration

Set API Key

export DASHSCOPE_API_KEY="your-dashscope-key"

Optional: Configure Data-Juicer Path (for custom operator development)

export DATA_JUICER_PATH="your-data-juicer-path"

Tip: You can also set this during runtime through conversation, for example:

"Help me set the DataJuicer path: /path/to/data-juicer"

"Help me update the DataJuicer path: /path/to/data-juicer"
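
Before launching the agent, it can help to confirm that the two environment variables described above are set as expected. The following is a minimal preflight sketch, not part of the project: the check_environment helper is hypothetical, and it only assumes that DASHSCOPE_API_KEY is required and that DATA_JUICER_PATH, when set, should point to a directory.

```python
import os
from pathlib import Path


def check_environment(env=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []

    # The API key is required for every run.
    if not env.get("DASHSCOPE_API_KEY"):
        problems.append("DASHSCOPE_API_KEY is not set (required).")

    # The Data-Juicer path is optional, so only validate it when present.
    dj_path = env.get("DATA_JUICER_PATH")
    if dj_path and not Path(dj_path).is_dir():
        problems.append(f"DATA_JUICER_PATH is not a directory: {dj_path}")

    return problems


for problem in check_environment():
    print("WARNING:", problem)
```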

Usage

Choose the running mode using the -u or --use_studio parameter:

# Use AgentScope Studio (provides interactive interface)
python main.py --use_studio True

# Or use command-line mode (default)
python main.py

Agent Introduction

Data Processing Agent

Interacts with Data-Juicer and executes the actual data processing tasks. Given a natural language description, it recommends operators, generates the configuration, and runs it.

Typical Use Cases:

Data Cleaning: Deduplication, removal of low-quality samples, format standardization
Multimodal Processing: Process text, image, and video data simultaneously
Batch Conversion: Format conversion, data augmentation, feature extraction

View Complete Example Log (from AgentScope Studio)

Code Development Agent (DJ Dev Agent)

Assists in developing custom data processing operators. It is powered by the qwen3-coder-480b-a35b-instruct model by default.

Typical Use Cases:

Develop domain-specific filter or transformation operators
Integrate proprietary data processing logic
Extend Data-Juicer capabilities for specific scenarios

View Complete Example Log (from AgentScope Studio)

Advanced Features

Operator Retrieval

DJ Agent includes an operator retrieval tool that uses a separate LLM query to find the most relevant of Data-Juicer's nearly 200 operators. Both the data processing agent and the code development agent depend on it to produce accurate results.

Three retrieval modes are available for different scenarios:

Retrieval Modes

LLM Retrieval (default)

Uses the Qwen-Turbo model to match the most relevant operators
Provides detailed matching reasons and relevance scores
Suitable for scenarios requiring high-precision matching, but consumes more tokens

Vector Retrieval (vector)

Based on DashScope text embedding and FAISS similarity search
Fast and efficient, suitable for large-scale retrieval scenarios

Auto Mode (auto)

Prioritizes LLM retrieval, automatically falls back to vector retrieval on failure
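
The vector and auto modes described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's implementation: hand-written three-dimensional vectors stand in for DashScope embeddings, a plain cosine-similarity sort stands in for FAISS, and the failing_llm_retrieve function simulates an LLM call that raises an error.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_retrieve(query_vec, operator_vecs, top_k=2):
    """Rank operators by similarity to the query and keep the best top_k."""
    ranked = sorted(
        operator_vecs,
        key=lambda name: cosine(query_vec, operator_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]


def auto_retrieve(llm_retrieve, vector_fallback):
    """Try LLM retrieval first; fall back to vector retrieval on any failure."""
    try:
        return llm_retrieve(), "llm"
    except Exception:  # broad on purpose: any LLM-side failure triggers the fallback
        return vector_fallback(), "vector"


# Toy embeddings with axes (deduplication, length, image); values are made up.
OPERATOR_VECS = {
    "document_deduplicator": [1.0, 0.0, 0.0],
    "text_length_filter": [0.0, 1.0, 0.0],
    "image_aspect_ratio_filter": [0.0, 0.0, 1.0],
}
QUERY_VEC = [0.9, 0.1, 0.0]  # stands in for "remove duplicate documents"


def failing_llm_retrieve():
    raise RuntimeError("simulated API failure")


result, mode = auto_retrieve(
    failing_llm_retrieve,
    lambda: vector_retrieve(QUERY_VEC, OPERATOR_VECS),
)
print(mode, result)
```

Keeping the fallback as a callable means the vector search only runs when the LLM path actually fails.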

Usage

Specify the retrieval mode using the -r or --retrieve_mode parameter:

python main.py --retrieve_mode vector

For more parameter descriptions, see python main.py --help

MCP Agent

Data-Juicer provides MCP (Model Context Protocol) services that expose operator information and data processing through native interfaces. This removes the need for separate LLM queries and command-line calls, and makes the agent easier to migrate and integrate.

MCP Server Types

Data-Juicer provides two MCP server modes:

Recipe-Flow (Data Recipe)

Filter by operator type and tags
Support combining multiple operators into data recipes for execution

Granular-Operators (Fine-grained Operators)

Provide each operator as an independent tool
Flexibly specify operator lists through environment variables
Build fully customized data processing pipelines

For detailed information, please refer to: Data-Juicer MCP Service Documentation

Note: The Data-Juicer MCP server is currently in early development, and features and tools may change with ongoing development.

Configuration

Configure the service address in configs/mcp_config.json:

{
    "mcpServers": {
        "DJ_recipe_flow": {
            "url": "http://127.0.0.1:8080/sse"
        }
    }
}
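
A malformed configuration file is one plausible cause of MCP connection failures, so a quick sanity check can be useful. The sketch below parses text in the format shown above and verifies that each server entry has an HTTP(S) URL. The load_mcp_servers helper is hypothetical and only assumes the mcpServers / url layout from the example.

```python
import json
from urllib.parse import urlparse


def load_mcp_servers(config_text):
    """Parse MCP config text and return {server_name: url}, validating each URL."""
    config = json.loads(config_text)
    servers = {}
    for name, entry in config.get("mcpServers", {}).items():
        url = entry.get("url", "")
        parsed = urlparse(url)
        # Require an explicit http(s) scheme and a host part.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Server {name!r} has an invalid url: {url!r}")
        servers[name] = url
    return servers


EXAMPLE_CONFIG = """
{
    "mcpServers": {
        "DJ_recipe_flow": {
            "url": "http://127.0.0.1:8080/sse"
        }
    }
}
"""

print(load_mcp_servers(EXAMPLE_CONFIG))
```

In practice the text would be read from configs/mcp_config.json, for example with pathlib.Path("configs/mcp_config.json").read_text().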

Usage Methods

Enable MCP Agent to replace DJ Agent:

# Enable MCP Agent and Dev Agent
python main.py --available_agents [dj_mcp, dj_dev]

# Or use shorthand
python main.py -a [dj_mcp, dj_dev]

Feature Preview

The Data-Juicer agent ecosystem is rapidly expanding. Here are the new agents currently in development or planned:

Data-Juicer Q&A Agent (Demo Available)

Provides users with detailed answers about Data-Juicer operators, concepts, and best practices.

Data Analysis and Visualization Agent (In Development)

Generates data analysis and visualization results. Expected to be released soon.

Troubleshooting

Common Issues

Q: How do I get a DashScope API key?
A: Visit the official DashScope website, register an account, and apply for an API key.

Q: Why does operator retrieval fail?
A: Check your network connection and API key configuration, or try switching to vector retrieval mode.

Q: How do I debug custom operators?
A: Make sure the Data-Juicer path is configured correctly, and check the example code provided by the code development agent.

Q: What should I do if the MCP service connection fails?
A: Check that the MCP server is running and that the URL in the configuration file is correct.

Optimization Recommendations

For large-scale data processing, use Data-Juicer's distributed mode
Set the batch size to balance memory usage against processing speed
For more advanced data processing features (synthesis, Data-Model Co-Development), refer to the Data-Juicer documentation

Contributing: Issues and Pull Requests that improve AgentScope, DataJuicer Agent, and Data-Juicer are welcome. If you run into problems or have feature suggestions, feel free to contact us.
