Skip to content

semioz/synthclaw

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

synthclaw

Crates.io Documentation

Synthetic data generation in Rust. Generate and augment datasets using OpenAI, Anthropic with support for HuggingFace datasets.

Available as both a CLI tool and a Rust library.

Installation

CLI

cargo install synthclaw

Library

[dependencies]
synthclaw = "0.1.3"

Quick Start

export OPENAI_API_KEY=sk-...

# Generate 50 product reviews across categories
synthclaw generate \
  --prompt "Generate a realistic {category} product review, 2-3 sentences" \
  --provider openai \
  --categories electronics,books,clothing \
  -n 50 \
  -o reviews.jsonl

# Or use a config file
synthclaw generate --config examples/configs/generate_reviews.yaml

CLI Usage

Explore HuggingFace Datasets

# Search
synthclaw datasets search "sentiment" --limit 10

# Get info
synthclaw datasets info cornell-movie-review-data/rotten_tomatoes

# Preview rows
synthclaw datasets preview cornell-movie-review-data/rotten_tomatoes --rows 5

Generate Data

# From scratch with categories
synthclaw generate \
  --prompt "Generate a {category} example" \
  --provider openai \
  --categories positive,negative \
  -n 100

# Dry run (no API calls)
synthclaw generate --dry-run --config config.yaml

Writing Good Prompts

The tool uses system prompts by default to ensure clean outputs. You provide the user prompt template.

Template Variables

For generate mode:

  • {category} - current category being generated
  • {index} - item number (0, 1, 2...)

For augment mode:

  • Any column from source data: {text}, {label}, etc.

Good Prompt Examples

Product Reviews:

template: |
  Generate a realistic product review for: {category}
  
  Requirements:
  - Customer perspective, 2-4 sentences
  - Include specific details (brand, features, price)
  - Natural tone - can be positive, negative, or mixed

Sentiment Data:

template: |
  Generate a {category} movie review.
  
  Requirements:
  - The sentiment must clearly be {category}
  - 1-3 sentences
  - Mention specific aspects (acting, plot, visuals)

Data Augmentation (paraphrase):

template: |
  Paraphrase this text while preserving meaning and sentiment:
  
  Original: {text}
  
  Paraphrase:

Question-Answer Generation:

template: |
  Based on this document, generate a Q&A pair:
  
  Document: {text}
  
  Output JSON: {"question": "...", "answer": "..."}
system_prompt: |
  Generate educational Q&A pairs. Output ONLY valid JSON.

Configuration

Generate from Scratch

name: "product_reviews"

provider:
  type: openai
  model: "gpt-4o-mini"
  temperature: 0.8

generation:
  task: generate
  count: 100
  concurrency: 10
  categories:
    - electronics
    - books
    - clothing
  template: |
    Generate a realistic {category} product review.
    2-3 sentences, customer perspective, specific details.

output:
  format: jsonl
  path: "./output/reviews.jsonl"

Augment Existing Data

name: "sentiment_augmentation"

source:
  type: huggingface
  dataset: "cornell-movie-review-data/rotten_tomatoes"
  split: "train"
  sample: 500

provider:
  type: openai
  model: "gpt-4o-mini"

generation:
  task: augment
  count_per_example: 2
  concurrency: 10
  strategy: paraphrase

output:
  format: jsonl
  path: "./output/augmented.jsonl"

Custom System Prompt

Override the default system prompt when you need specific behavior:

generation:
  template: |
    Generate a {category} example in JSON format.
  system_prompt: |
    You are a data generation assistant.
    Output ONLY valid JSON, no markdown, no explanations.
    Schema: {"text": "...", "label": "..."}

Validation

Filter bad outputs and remove duplicates:

validation:
  min_length: 20
  max_length: 1000
  json: true                    # must be valid JSON
  json_schema: [question, answer]  # required fields
  blocklist: true               # filter "Sure!", "As an AI", etc.
  repetition: true              # filter repetitive text
  dedupe: normalized            # exact | normalized | jaccard

Upload to HuggingFace Hub

hub:
  repo: "username/my-dataset"
  private: false
  token: "hf_..."  # or set HF_TOKEN env var, or `huggingface-cli login`

Token is resolved in order: config → HF_TOKEN env → ~/.cache/huggingface/token

Library Usage

use synthclaw::{
    config::SynthConfig,
    datasets::{HuggingFaceSource, DataSource},
    providers::{create_provider, GenerationRequest},
};

// Load HuggingFace dataset
let mut source = HuggingFaceSource::new(
    "cornell-movie-review-data/rotten_tomatoes".to_string(),
    None,
    "train".to_string(),
    None,
)?;
let records = source.load(Some(100))?;

// Create provider and generate
let config = SynthConfig::from_file(&"config.yaml".into())?;
let provider = create_provider(&config.provider)?;

let response = provider.generate(GenerationRequest {
    prompt: "Generate a movie review".to_string(),
    system_prompt: Some("Output only the review text.".to_string()),
    temperature: Some(0.7),
    max_tokens: Some(500),
}).await?;

Validation (Library)

use synth_claw::validation::{
    ValidationPipeline, MinLength, Json, JsonSchema, Blocklist,
    Deduplicator, validate_and_dedupe,
};

let results = engine.run(&config).await?;

let pipeline = ValidationPipeline::new()
    .add(MinLength(20))
    .add(Json)
    .add(JsonSchema::require(&["question", "answer"]))
    .add(Blocklist::llm_artifacts());

let validated = validate_and_dedupe(results, &pipeline, Some(&Deduplicator::Normalized));

println!("passed: {}, failed: {}", validated.stats.passed, validated.stats.failed);
for r in validated.results { /* clean data */ }

Upload to HuggingFace Hub

use synth_claw::hub::DatasetUploader;

let uploader = DatasetUploader::new("username/my-dataset", false, None).await?;

// Upload JSONL data
let data = vec![json!({"q": "...", "a": "..."})];
uploader.upload_jsonl(&data, "train.jsonl").await?;

// Upload any file
uploader.upload("README.md", readme.as_bytes(), Some("Add README")).await?;

println!("Dataset: {}", uploader.repo_url());

Requires HF_TOKEN env var.

Output Formats

  • jsonl - Line-delimited JSON (recommended for large datasets)
  • csv - Comma-separated values
  • parquet - Apache Parquet (efficient for analytics)

Environment Variables

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
HF_TOKEN=hf_...  # for uploading to HuggingFace Hub

Roadmap

Production Scale

  • Streaming pipeline (generate → validate → write, no memory accumulation)
  • Checkpointing & resume
  • Retry with exponential backoff
  • Rate limiting
  • Budget limits

Providers

  • Gemini, Ollama, Azure OpenAI, Together AI, Groq

Integration

  • HuggingFace Hub upload
  • Dataset cards

About

synthetic data generation cli/library in rust with huggingface support.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages