
data-tagger

A powerful multi-task data labeling tool for large-scale datasets.

Features

  • Multi-Task Support: 7 built-in labeling tasks (QUALITY, DIFFICULTY, CLASSIFICATION, SAFETY, REWARD, LANGUAGE, EMBEDDING)
  • Dual Inference Modes: Local VLLM models + Remote API services
  • High Performance: Batch processing with checkpointing for large datasets
  • Vector Storage: Integration with Faiss (local) and Milvus (distributed) for embeddings
  • Flexible Configuration: CLI args and environment variables
  • Easy Extension: Modular design for custom tasks
  • Data Formatting: Built-in data cleaning and conversion tools

Quick Start

Installation

git clone <repository-url>
cd data-tagger
uv sync

Using VLLM (Local Models)

# Run all tasks
bash examples/vllm/run_all_taggers_vllm.sh

# Single task
python -m datatagger.tagger.unified_tagger_vllm \
  --vllm_model_path /path/to/model \
  --tag_mission QUALITY \
  --input_file data.jsonl \
  --output_file output.jsonl

Using API (Remote Models)

# Configure API credentials
cp .env.example .env
# Edit .env with your API settings

# Run all tasks
bash examples/api/run_all_taggers_api.sh

# Single task
python -m datatagger.tagger.unified_tagger_api \
  --api_model_name gpt-4 \
  --tag_mission QUALITY \
  --input_file data.jsonl \
  --output_file output.jsonl

Supported Tasks

| Task | Description | Key Output Fields |
| --- | --- | --- |
| QUALITY | Dialogue quality assessment (1-5 score) | `input_quality`, `response_quality`, `*_explanation` |
| DIFFICULTY | Task difficulty evaluation (0-5 score) | `difficulty` (0-5 float) |
| CLASSIFICATION | Intent categorization | `task_category`, `other_task_category` |
| SAFETY | Content safety detection | `safety` (VLLM only) |
| REWARD | Response reward scoring | `instruct_reward` (0-5 float, VLLM only) |
| LANGUAGE | Language identification | `language` (ISO codes) |
| EMBEDDING | Vector embedding generation | `embedding`, `min_neighbor_distance`, `repeat_count` |
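Each output record is a JSON line carrying the task's fields. For instance, filtering QUALITY results by score can be sketched as follows (field names come from the table above; the sample records are hypothetical):

```python
import json

# Hypothetical QUALITY output lines; real records come from the tagger's output_file.
lines = [
    '{"input_quality": 4, "response_quality": 5}',
    '{"input_quality": 2, "response_quality": 1}',
]

# Keep only samples whose response scored at least 4 on the 1-5 scale.
good = [r for r in map(json.loads, lines) if r["response_quality"] >= 4]
print(len(good))  # 1
```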

Configuration

Environment Variables

# API Settings
API_MODEL_NAME=gpt-4
API_URL=https://api.openai.com/v1/chat/completions
API_KEY=your_key_here

# VLLM Settings
VLLM_MODEL_PATH=/path/to/model
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.9
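These variables can be read at startup in the usual way; a minimal standard-library sketch (variable names are from the block above, values are illustrative defaults):

```python
import os

# Fall back to example defaults when the variables are unset;
# a real run would load them from the .env file instead.
os.environ.setdefault("API_MODEL_NAME", "gpt-4")
os.environ.setdefault("GPU_MEMORY_UTILIZATION", "0.9")

model = os.getenv("API_MODEL_NAME")
gpu_util = float(os.getenv("GPU_MEMORY_UTILIZATION", "0.9"))
print(model, gpu_util)
```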

Key Parameters

| Parameter | Description | Required |
| --- | --- | --- |
| `--tag_mission` | Task type | |
| `--input_file` | Input data path | |
| `--output_file` | Output path | |
| `--prompt_field` | Prompt field name | |
| `--batch_size` | Batch size | |
| `--checkpoint_every` | Checkpoint frequency | |
| `--dimension` | Embedding dimension | ❌ (EMBEDDING task only) |
| `--faiss_store_embeddings` | Store embeddings in Faiss | |
| `--milvus_store_embeddings` | Store embeddings in Milvus | |
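The idea behind `--checkpoint_every` is to flush results every N samples so an interrupted run can resume where it stopped. A minimal sketch of that pattern (not the tool's actual implementation; `tag()` is a hypothetical stand-in for a model call):

```python
import json
import os
import tempfile

def tag(sample):
    # Hypothetical stand-in for a real labeling call.
    return {"id": sample["id"], "label": "ok"}

def run(samples, out_path, checkpoint_every=2):
    # Resume by skipping as many samples as the output file already holds.
    done = sum(1 for _ in open(out_path)) if os.path.exists(out_path) else 0
    with open(out_path, "a") as f:
        for i, sample in enumerate(samples[done:], start=done + 1):
            f.write(json.dumps(tag(sample)) + "\n")
            if i % checkpoint_every == 0:
                f.flush()  # persist progress every N samples

samples = [{"id": i} for i in range(5)]
out_path = os.path.join(tempfile.mkdtemp(), "out.jsonl")
run(samples, out_path)  # first pass writes all 5 records
run(samples, out_path)  # rerun resumes at the end: no duplicates
print(sum(1 for _ in open(out_path)))  # 5
```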

Data Formatting

Clean and standardize your data:

python -m datatagger.formatter.data_formatter \
  --input_file raw_data.json \
  --output_file clean_data.jsonl \
  --save_as jsonl
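Before formatting, it can help to check which lines of a raw file even parse as JSON; a dependency-free sketch (the sample lines are hypothetical):

```python
import json

# Hypothetical raw lines; a real check would iterate over the input file.
raw_lines = ['{"prompt": "hi"}', 'not json', '{"prompt": "bye"}']

valid, bad = [], []
for n, line in enumerate(raw_lines, 1):
    try:
        valid.append(json.loads(line))
    except json.JSONDecodeError:
        bad.append(n)  # record 1-based line numbers that need cleaning
print(len(valid), bad)  # 2 [2]
```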

Performance

| Task | VLLM (Qwen3-8B) | API (GPT-4) |
| --- | --- | --- |
| Quality | 120 samples/min | 80 samples/min |
| Classification | 150 samples/min | 100 samples/min |
| Embedding | 80 samples/min | N/A |

Troubleshooting

Common Issues

  1. Memory Issues (VLLM): Reduce --batch_size, use --tensor_parallel_size, or adjust --gpu_memory_utilization
  2. API Rate Limits: Increase --retry_delay, reduce --batch_size, or configure --max_requests_per_minute
  3. Model Loading: Check model paths and GPU availability
  4. File Format Errors: Use data formatter before processing, validate JSON/JSONL format
  5. Embedding Storage Issues: Verify dimension consistency, check database connectivity
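For rate limits (issue 2), the usual remedy behind flags like `--retry_delay` is retrying with exponential backoff; a minimal sketch of that pattern, not the tool's actual retry logic (`call_api` is a hypothetical stand-in that always fails):

```python
import random
import time

def call_api():
    # Hypothetical stand-in that always fails, simulating an HTTP 429.
    raise RuntimeError("rate limited")

def with_retries(fn, retries=3, retry_delay=0.01):
    # Exponential backoff with jitter: wait retry_delay * 2^attempt between tries.
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries - 1:
                raise
            time.sleep(retry_delay * (2 ** attempt) + random.random() * 0.01)

failed = False
try:
    with_retries(call_api)
except RuntimeError:
    failed = True
print("gave up after retries" if failed else "ok")
```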

Debug Mode

Append these flags to any tagger command:

--log_level DEBUG --debug --batch_size 1

Advanced Examples

Embedding Generation with Storage

python -m datatagger.tagger.unified_tagger_vllm \
  --vllm_model_path /path/to/Qwen3-Embedding-4B \
  --tag_mission EMBEDDING \
  --input_file data/documents.jsonl \
  --output_file output/embeddings.jsonl \
  --faiss_store_embeddings \
  --dimension 1536
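The `min_neighbor_distance` output field is each embedding's distance to its closest neighbor, which the tool computes via its Faiss index; the same quantity can be sketched without dependencies (toy 2-D vectors stand in for real high-dimensional embeddings):

```python
import math

# Toy 2-D embeddings; real ones would be e.g. 1536-dimensional.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]

def dist(a, b):
    # Euclidean (L2) distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# For each vector, distance to its nearest other vector; near-zero values
# flag likely duplicates (cf. the repeat_count output field).
min_neighbor = [
    min(dist(v, w) for j, w in enumerate(embeddings) if j != i)
    for i, v in enumerate(embeddings)
]
print([round(d, 2) for d in min_neighbor])  # [0.1, 0.1, 7.0]
```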

Quality Assessment via API

python -m datatagger.tagger.unified_tagger_api \
  --api_model_name gpt-4 \
  --api_url https://api.example.com \
  --api_key your_api_key \
  --tag_mission QUALITY \
  --input_file data/input.jsonl \
  --output_file output/quality_results.jsonl

Development

# Setup development environment
uv sync --dev
pip install -e .

# Run tests
pytest tests/

# Format code
black datatagger/ && isort datatagger/

License

MIT License - see LICENSE for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Made with ❤️ by the data-tagger community
