- Multi-Task Support: 7 built-in labeling tasks (QUALITY, DIFFICULTY, CLASSIFICATION, SAFETY, REWARD, LANGUAGE, EMBEDDING)
- Dual Inference Modes: Local VLLM models + Remote API services
- High Performance: Batch processing with checkpointing for large datasets
- Vector Storage: Integration with Faiss (local) and Milvus (distributed) for embeddings
- Flexible Configuration: CLI args and environment variables
- Easy Extension: Modular design for custom tasks
- Data Formatting: Built-in data cleaning and conversion tools
git clone <repository-url>
cd data-tagger
uv sync
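A quick way to confirm the environment is set up is to print a tagger module's CLI help. This is only a sanity check and assumes the module exposes a standard argparse-style `--help` and that its imports succeed on your machine:

```bash
# Sanity-check the installation by printing the VLLM tagger's CLI help
# (assumes a standard argparse-style --help flag)
python -m datatagger.tagger.unified_tagger_vllm --help
```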
# Run all tasks
bash examples/vllm/run_all_taggers_vllm.sh
# Single task
python -m datatagger.tagger.unified_tagger_vllm \
--vllm_model_path /path/to/model \
--tag_mission QUALITY \
--input_file data.jsonl \
--output_file output.jsonl
# Configure API credentials
cp .env.example .env
# Edit .env with your API settings
# Run all tasks
bash examples/api/run_all_taggers_api.sh
# Single task
python -m datatagger.tagger.unified_tagger_api \
--api_model_name gpt-4 \
--tag_mission QUALITY \
--input_file data.jsonl \
--output_file output.jsonl
Task | Description | Key Output Fields |
---|---|---|
QUALITY | Dialogue quality assessment (1-5 score) | `input_quality`, `response_quality`, `*_explanation` |
DIFFICULTY | Task difficulty evaluation (0-5 score) | `difficulty` (0-5 float) |
CLASSIFICATION | Intent categorization | `task_category`, `other_task_category` |
SAFETY | Content safety detection | `safety` (VLLM only) |
REWARD | Response reward scoring | `instruct_reward` (0-5 float, VLLM only) |
LANGUAGE | Language identification | `language` (ISO codes) |
EMBEDDING | Vector embedding generation | `embedding`, `min_neighbor_distance`, `repeat_count` |
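As a rough sketch of how these output fields can be consumed, the example below runs the LANGUAGE task and then tallies the `language` codes in the output with `jq`. The `jq` step is an assumption about your local tooling, not part of data-tagger, and the paths are placeholders:

```bash
# Tag each sample with its language, then count the ISO codes in the output
# (jq is assumed to be installed; it is not part of data-tagger)
python -m datatagger.tagger.unified_tagger_vllm \
    --vllm_model_path /path/to/model \
    --tag_mission LANGUAGE \
    --input_file data.jsonl \
    --output_file language_tags.jsonl

jq -r '.language' language_tags.jsonl | sort | uniq -c | sort -rn
```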
# API Settings
API_MODEL_NAME=gpt-4
API_URL=https://api.openai.com/v1/chat/completions
API_KEY=your_key_here
# VLLM Settings
VLLM_MODEL_PATH=/path/to/model
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.9
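If you also want these variables available to ad-hoc shell commands, one common pattern is to export the whole file into the current shell. This is a generic shell sketch, not something data-tagger requires; the scripts may read `.env` on their own:

```bash
# Export every variable defined in .env into the current shell session
set -a        # auto-export all variables defined from here on
source .env
set +a        # stop auto-exporting
```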
Parameter | Description | Required |
---|---|---|
`--tag_mission` | Task type | ✅ |
`--input_file` | Input data path | ✅ |
`--output_file` | Output path | ✅ |
`--prompt_field` | Prompt field name | ✅ |
`--batch_size` | Batch size | ❌ |
`--checkpoint_every` | Checkpoint frequency | ❌ |
`--dimension` | Embedding dimension | ❌ (EMBEDDING task) |
`--faiss_store_embeddings` | Store embeddings in Faiss | ❌ |
`--milvus_store_embeddings` | Store embeddings in Milvus | ❌ |
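Putting the flags together, a QUALITY run over a large file might look like the sketch below. The batch size, checkpoint interval, and the `prompt` field name are illustrative values, not recommendations from the project:

```bash
# Quality tagging with explicit batching and periodic checkpoints
# (numeric values and the "prompt" field name are illustrative)
python -m datatagger.tagger.unified_tagger_vllm \
    --vllm_model_path /path/to/model \
    --tag_mission QUALITY \
    --input_file data.jsonl \
    --output_file output.jsonl \
    --prompt_field prompt \
    --batch_size 32 \
    --checkpoint_every 1000
```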
Clean and standardize your data:
python -m datatagger.formatter.data_formatter \
--input_file raw_data.json \
--output_file clean_data.jsonl \
--save_as jsonl
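Before tagging, it can be worth confirming that the converted file parses cleanly. A minimal check, again assuming `jq` is available locally (this validates JSON syntax, not strict one-object-per-line layout):

```bash
# jq exits non-zero if the file contains malformed JSON
jq empty clean_data.jsonl && echo "clean_data.jsonl parses as JSON"
```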
Task | VLLM (Qwen3-8B) | API (GPT-4) |
---|---|---|
Quality | 120 samples/min | 80 samples/min |
Classification | 150 samples/min | 100 samples/min |
Embedding | 80 samples/min | N/A |
- Memory Issues (VLLM): Reduce `--batch_size`, use `--tensor_parallel_size`, or adjust `--gpu_memory_utilization`
- API Rate Limits: Increase `--retry_delay`, reduce `--batch_size`, or configure `--max_requests_per_minute`
- Model Loading: Check model paths and GPU availability
- File Format Errors: Run the data formatter before processing and validate the JSON/JSONL format
- Embedding Storage Issues: Verify dimension consistency and check database connectivity
- Debugging: Run with `--log_level DEBUG --debug --batch_size 1` for verbose, single-sample processing (see the sketch below)
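When a run misbehaves, a minimal reproduction with single-sample batches and verbose logging usually isolates the problem. The sketch below only combines flags documented above; the model and file paths are placeholders:

```bash
# Minimal debugging run: single-sample batches with verbose logging
python -m datatagger.tagger.unified_tagger_vllm \
    --vllm_model_path /path/to/model \
    --tag_mission QUALITY \
    --input_file data.jsonl \
    --output_file debug_output.jsonl \
    --batch_size 1 \
    --log_level DEBUG \
    --debug
```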
python -m datatagger.tagger.unified_tagger_vllm \
--vllm_model_path /path/to/Qwen3-Embedding-4B \
--tag_mission EMBEDDING \
--input_file data/documents.jsonl \
--output_file output/embeddings.jsonl \
--faiss_store_embeddings \
--dimension 1536
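To spot-check that the stored vectors have the requested dimensionality, you can measure the length of the `embedding` field on a few output rows. As before, `jq` is an assumption about your environment:

```bash
# Print the vector length of the first few records;
# each value should equal the --dimension argument (1536 here)
head -5 output/embeddings.jsonl | jq '.embedding | length'
```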
python -m datatagger.tagger.unified_tagger_api \
--api_model_name gpt-4 \
--api_url https://api.example.com \
--api_key your_api_key \
--tag_mission QUALITY \
--input_file data/input.jsonl \
--output_file output/quality_results.jsonl
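A hedged sketch of downstream filtering: keep only rows whose `response_quality` falls below a threshold for manual review. This assumes `response_quality` is emitted as a numeric 1-5 score, as the task table suggests:

```bash
# Collect low-scoring responses (response_quality <= 2) for manual review
# (assumes response_quality is a numeric 1-5 score)
jq -c 'select(.response_quality != null and .response_quality <= 2)' \
    output/quality_results.jsonl > low_quality.jsonl
```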
# Setup development environment
uv sync --dev
pip install -e .
# Run tests
pytest tests/
# Format code
black datatagger/ && isort datatagger/
MIT License - see LICENSE for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
Made with ❤️ by the data-tagger community