synthclaw

Synthetic data generation in Rust. Generate and augment datasets using OpenAI or Anthropic models, with support for HuggingFace datasets.

Available as both a CLI tool and a Rust library.

Installation

CLI

cargo install synthclaw

Library

[dependencies]
synthclaw = "0.1.3"

Quick Start

export OPENAI_API_KEY=sk-...

# Generate 50 product reviews across categories
synthclaw generate \
  --prompt "Generate a realistic {category} product review, 2-3 sentences" \
  --provider openai \
  --categories electronics,books,clothing \
  -n 50 \
  -o reviews.jsonl

# Or use a config file
synthclaw generate --config examples/configs/generate_reviews.yaml

CLI Usage

Explore HuggingFace Datasets

# Search
synthclaw datasets search "sentiment" --limit 10

# Get info
synthclaw datasets info cornell-movie-review-data/rotten_tomatoes

# Preview rows
synthclaw datasets preview cornell-movie-review-data/rotten_tomatoes --rows 5

Generate Data

# From scratch with categories
synthclaw generate \
  --prompt "Generate a {category} example" \
  --provider openai \
  --categories positive,negative \
  -n 100

# Dry run (no API calls)
synthclaw generate --dry-run --config config.yaml

Writing Good Prompts

By default the tool sends its own system prompt to keep outputs clean, so you only provide the user prompt template.

Template Variables

For generate mode:

{category} - current category being generated
{index} - item number (0, 1, 2...)

For augment mode:

Any column from source data: {text}, {label}, etc.
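
The placeholder substitution described above can be sketched in a few lines of Rust. This is only an illustration of the idea, not synthclaw's actual implementation: `render_template` is a hypothetical helper, and it assumes placeholders are plain `{name}` tokens replaced by literal string substitution, with anything else in braces (such as a JSON example in the prompt) left untouched.

```rust
/// Hypothetical helper: replace each `{name}` placeholder with its value.
/// Text in braces that does not match a known name is left as-is.
fn render_template(template: &str, vars: &[(&str, &str)]) -> String {
    let mut out = template.to_string();
    for (name, value) in vars {
        // `{{` and `}}` are escaped braces, so this builds "{name}".
        let placeholder = format!("{{{}}}", name);
        out = out.replace(&placeholder, value);
    }
    out
}
```

For example, calling it with `"Generate a {category} review #{index}"` and the pairs `("category", "books")` and `("index", "3")` yields `"Generate a books review #3"`. In augment mode the same call would receive one pair per source column.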

Good Prompt Examples

Product Reviews:

template: |
  Generate a realistic product review for: {category}
  
  Requirements:
  - Customer perspective, 2-4 sentences
  - Include specific details (brand, features, price)
  - Natural tone - can be positive, negative, or mixed

Sentiment Data:

template: |
  Generate a {category} movie review.
  
  Requirements:
  - The sentiment must clearly be {category}
  - 1-3 sentences
  - Mention specific aspects (acting, plot, visuals)

Data Augmentation (paraphrase):

template: |
  Paraphrase this text while preserving meaning and sentiment:
  
  Original: {text}
  
  Paraphrase:

Question-Answer Generation:

template: |
  Based on this document, generate a Q&A pair:
  
  Document: {text}
  
  Output JSON: {"question": "...", "answer": "..."}
system_prompt: |
  Generate educational Q&A pairs. Output ONLY valid JSON.

Configuration

Generate from Scratch

name: "product_reviews"

provider:
  type: openai
  model: "gpt-4o-mini"
  temperature: 0.8

generation:
  task: generate
  count: 100
  concurrency: 10
  categories:
    - electronics
    - books
    - clothing
  template: |
    Generate a realistic {category} product review.
    2-3 sentences, customer perspective, specific details.

output:
  format: jsonl
  path: "./output/reviews.jsonl"

Augment Existing Data

name: "sentiment_augmentation"

source:
  type: huggingface
  dataset: "cornell-movie-review-data/rotten_tomatoes"
  split: "train"
  sample: 500

provider:
  type: openai
  model: "gpt-4o-mini"

generation:
  task: augment
  count_per_example: 2
  concurrency: 10
  strategy: paraphrase

output:
  format: jsonl
  path: "./output/augmented.jsonl"

Custom System Prompt

Override the default system prompt when you need specific behavior:

generation:
  template: |
    Generate a {category} example in JSON format.
  system_prompt: |
    You are a data generation assistant.
    Output ONLY valid JSON, no markdown, no explanations.
    Schema: {"text": "...", "label": "..."}

Validation

Filter bad outputs and remove duplicates:

validation:
  min_length: 20
  max_length: 1000
  json: true                    # must be valid JSON
  json_schema: [question, answer]  # required fields
  blocklist: true               # filter "Sure!", "As an AI", etc.
  repetition: true              # filter repetitive text
  dedupe: normalized            # exact | normalized | jaccard
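
To give a feel for how the `normalized` and `jaccard` dedupe modes differ, here is a minimal sketch in plain Rust. It is not synthclaw's code. It assumes that "normalized" means lowercasing and collapsing whitespace, and that "jaccard" compares word sets against a similarity threshold; the function names and the threshold parameter are hypothetical.

```rust
use std::collections::HashSet;

/// Assumed normalization: lowercase and collapse runs of whitespace.
fn normalize(text: &str) -> String {
    text.to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

/// Jaccard similarity of the two word sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &str, b: &str) -> f64 {
    let (a, b) = (a.to_lowercase(), b.to_lowercase());
    let set_a: HashSet<&str> = a.split_whitespace().collect();
    let set_b: HashSet<&str> = b.split_whitespace().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 1.0; // two empty texts are treated as identical
    }
    set_a.intersection(&set_b).count() as f64 / union as f64
}

/// Keep the first occurrence of each normalized string.
fn dedupe_normalized(items: &[String]) -> Vec<String> {
    let mut seen = HashSet::new();
    items
        .iter()
        .filter(|s| seen.insert(normalize(s)))
        .cloned()
        .collect()
}

/// Keep an item only if it is less similar than `threshold` to every
/// item kept so far. Quadratic in the number of items.
fn dedupe_jaccard(items: &[String], threshold: f64) -> Vec<String> {
    let mut kept: Vec<String> = Vec::new();
    for item in items {
        if kept.iter().all(|k| jaccard(k, item) < threshold) {
            kept.push(item.clone());
        }
    }
    kept
}
```

Normalized matching only catches casing and spacing differences, while a Jaccard threshold also drops near-duplicates that share most of their words, at the cost of pairwise comparisons.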

Upload to HuggingFace Hub

hub:
  repo: "username/my-dataset"
  private: false
  token: "hf_..."  # or set HF_TOKEN env var, or `huggingface-cli login`

Token is resolved in order: config → HF_TOKEN env → ~/.cache/huggingface/token
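
The resolution order above amounts to "first available source wins". A rough sketch of that logic, under stated assumptions: `pick_token` and `resolve_hf_token` are hypothetical names, blank values are skipped, surrounding whitespace is trimmed, and the home directory comes from the `HOME` environment variable. None of this is confirmed to match synthclaw's internals.

```rust
use std::path::Path;
use std::{env, fs};

/// Pure part of the lookup: the first non-empty candidate wins.
/// Skipping blank values is an assumption of this sketch.
fn pick_token(
    config: Option<&str>,
    env_value: Option<&str>,
    cached: Option<&str>,
) -> Option<String> {
    [config, env_value, cached]
        .iter()
        .copied()
        .flatten()
        .map(str::trim)
        .find(|t| !t.is_empty())
        .map(str::to_owned)
}

/// Gather the three sources in the documented order:
/// config value, HF_TOKEN, then ~/.cache/huggingface/token.
fn resolve_hf_token(config_token: Option<&str>) -> Option<String> {
    let env_token = env::var("HF_TOKEN").ok();
    let cached = env::var("HOME")
        .ok()
        .map(|home| Path::new(&home).join(".cache/huggingface/token"))
        .and_then(|path| fs::read_to_string(path).ok());
    pick_token(config_token, env_token.as_deref(), cached.as_deref())
}
```

Keeping the ordering in a pure function makes it easy to test without touching the real environment or filesystem.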

Library Usage

use synthclaw::{
    config::SynthConfig,
    datasets::{HuggingFaceSource, DataSource},
    providers::{create_provider, GenerationRequest},
};

// Load HuggingFace dataset
let mut source = HuggingFaceSource::new(
    "cornell-movie-review-data/rotten_tomatoes".to_string(),
    None,
    "train".to_string(),
    None,
)?;
let records = source.load(Some(100))?;

// Create provider and generate
let config = SynthConfig::from_file(&"config.yaml".into())?;
let provider = create_provider(&config.provider)?;

let response = provider.generate(GenerationRequest {
    prompt: "Generate a movie review".to_string(),
    system_prompt: Some("Output only the review text.".to_string()),
    temperature: Some(0.7),
    max_tokens: Some(500),
}).await?;

Validation (Library)

use synthclaw::validation::{
    ValidationPipeline, MinLength, Json, JsonSchema, Blocklist,
    Deduplicator, validate_and_dedupe,
};

// `engine` and `config` are assumed to be set up beforehand
let results = engine.run(&config).await?;

let pipeline = ValidationPipeline::new()
    .add(MinLength(20))
    .add(Json)
    .add(JsonSchema::require(&["question", "answer"]))
    .add(Blocklist::llm_artifacts());

let validated = validate_and_dedupe(results, &pipeline, Some(&Deduplicator::Normalized));

println!("passed: {}, failed: {}", validated.stats.passed, validated.stats.failed);
for r in validated.results { /* clean data */ }

Upload to HuggingFace Hub

use serde_json::json;
use synthclaw::hub::DatasetUploader;

let uploader = DatasetUploader::new("username/my-dataset", false, None).await?;

// Upload JSONL data
let data = vec![json!({"q": "...", "a": "..."})];
uploader.upload_jsonl(&data, "train.jsonl").await?;

// Upload any file (`readme` is a string defined elsewhere)
uploader.upload("README.md", readme.as_bytes(), Some("Add README")).await?;

println!("Dataset: {}", uploader.repo_url());

Requires the HF_TOKEN environment variable.

Output Formats

jsonl - Line-delimited JSON (recommended for large datasets)
csv - Comma-separated values
parquet - Apache Parquet (efficient for analytics)
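
For readers unfamiliar with JSONL, the following sketch shows what "line-delimited JSON" means in practice: one complete JSON object per line, with newlines inside values escaped. It uses only the standard library and a hand-rolled escaper to stay self-contained; the `text`/`label` fields and both function names are assumptions for illustration, and real code would more likely use a JSON library such as serde_json.

```rust
use std::io::{self, Write};

/// Minimal JSON string escaping for this sketch.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Write each (text, label) pair as one JSON object on its own line.
fn write_jsonl<W: Write>(mut w: W, rows: &[(&str, &str)]) -> io::Result<()> {
    for (text, label) in rows {
        writeln!(
            w,
            "{{\"text\":\"{}\",\"label\":\"{}\"}}",
            escape_json(text),
            escape_json(label)
        )?;
    }
    Ok(())
}
```

Because every record is a single line, a file in this format can be appended to and read back one record at a time, which is why it suits large datasets.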

Environment Variables

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
HF_TOKEN=hf_...  # for uploading to HuggingFace Hub

Roadmap

Production Scale

Streaming pipeline (generate → validate → write, no memory accumulation)
Checkpointing & resume
Retry with exponential backoff
Rate limiting
Budget limits

Providers

Gemini, Ollama, Azure OpenAI, Together AI, Groq

Integration

HuggingFace Hub upload
Dataset cards
