Synthetic data generation in Rust. Generate and augment datasets using OpenAI or Anthropic, with support for HuggingFace datasets.
Available as both a CLI tool and a Rust library.
```shell
cargo install synthclaw
```

Or add it as a library dependency in `Cargo.toml`:

```toml
[dependencies]
synthclaw = "0.1.3"
```

Set your API key:

```shell
export OPENAI_API_KEY=sk-...
```
```shell
# Generate 50 product reviews across categories
synthclaw generate \
  --prompt "Generate a realistic {category} product review, 2-3 sentences" \
  --provider openai \
  --categories electronics,books,clothing \
  -n 50 \
  -o reviews.jsonl

# Or use a config file
synthclaw generate --config examples/configs/generate_reviews.yaml
```

Work with HuggingFace datasets:

```shell
# Search
synthclaw datasets search "sentiment" --limit 10

# Get info
synthclaw datasets info cornell-movie-review-data/rotten_tomatoes

# Preview rows
synthclaw datasets preview cornell-movie-review-data/rotten_tomatoes --rows 5
```

More generation examples:

```shell
# From scratch with categories
synthclaw generate \
  --prompt "Generate a {category} example" \
  --provider openai \
  --categories positive,negative \
  -n 100

# Dry run (no API calls)
synthclaw generate --dry-run --config config.yaml
```

The tool uses system prompts by default to ensure clean outputs; you provide the user prompt template.
For generate mode:

- `{category}` - the current category being generated
- `{index}` - the item number (0, 1, 2...)

For augment mode:

- any column from the source data: `{text}`, `{label}`, etc.
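Placeholder substitution of this kind is plain string replacement; here is a minimal sketch of how a template might be rendered (the `render_template` helper is hypothetical, not part of synthclaw's public API):

```rust
use std::collections::HashMap;

/// Replace `{name}` placeholders in a prompt template with concrete values.
/// Hypothetical helper; synthclaw performs an equivalent substitution internally.
fn render_template(template: &str, vars: &HashMap<&str, String>) -> String {
    let mut out = template.to_string();
    for (key, value) in vars {
        // Build the literal "{key}" marker and substitute its value.
        out = out.replace(&format!("{{{}}}", key), value);
    }
    out
}
```

For example, rendering `"Generate a {category} example"` with `category = "positive"` yields `"Generate a positive example"`.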
Product Reviews:

```yaml
template: |
  Generate a realistic product review for: {category}
  Requirements:
  - Customer perspective, 2-4 sentences
  - Include specific details (brand, features, price)
  - Natural tone - can be positive, negative, or mixed
```

Sentiment Data:

```yaml
template: |
  Generate a {category} movie review.
  Requirements:
  - The sentiment must clearly be {category}
  - 1-3 sentences
  - Mention specific aspects (acting, plot, visuals)
```

Data Augmentation (paraphrase):

```yaml
template: |
  Paraphrase this text while preserving meaning and sentiment:
  Original: {text}
  Paraphrase:
```

Question-Answer Generation:
```yaml
template: |
  Based on this document, generate a Q&A pair:
  Document: {text}
  Output JSON: {"question": "...", "answer": "..."}
system_prompt: |
  Generate educational Q&A pairs. Output ONLY valid JSON.
```

A full generation config:

```yaml
name: "product_reviews"
provider:
  type: openai
  model: "gpt-4o-mini"
  temperature: 0.8
generation:
  task: generate
  count: 100
  concurrency: 10
  categories:
    - electronics
    - books
    - clothing
  template: |
    Generate a realistic {category} product review.
    2-3 sentences, customer perspective, specific details.
output:
  format: jsonl
  path: "./output/reviews.jsonl"
```

An augmentation config that paraphrases a HuggingFace dataset:

```yaml
name: "sentiment_augmentation"
source:
  type: huggingface
  dataset: "cornell-movie-review-data/rotten_tomatoes"
  split: "train"
  sample: 500
provider:
  type: openai
  model: "gpt-4o-mini"
generation:
  task: augment
  count_per_example: 2
  concurrency: 10
  strategy: paraphrase
output:
  format: jsonl
  path: "./output/augmented.jsonl"
```

Override the default system prompt when you need specific behavior:
```yaml
generation:
  template: |
    Generate a {category} example in JSON format.
  system_prompt: |
    You are a data generation assistant.
    Output ONLY valid JSON, no markdown, no explanations.
    Schema: {"text": "...", "label": "..."}
```

Filter bad outputs and remove duplicates:
```yaml
validation:
  min_length: 20
  max_length: 1000
  json: true                       # must be valid JSON
  json_schema: [question, answer]  # required fields
  blocklist: true                  # filter "Sure!", "As an AI", etc.
  repetition: true                 # filter repetitive text
  dedupe: normalized               # exact | normalized | jaccard
```

Push results to the HuggingFace Hub:

```yaml
hub:
  repo: "username/my-dataset"
  private: false
  token: "hf_..."  # or set HF_TOKEN env var, or `huggingface-cli login`
```

The token is resolved in order: config → HF_TOKEN env var → ~/.cache/huggingface/token.
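That resolution order can be sketched as a simple fallback chain (a sketch only, with assumed path handling, not synthclaw's actual implementation):

```rust
use std::{env, fs, path::PathBuf};

/// Resolve a HuggingFace token: explicit config value first, then the
/// HF_TOKEN environment variable, then the cached token written by
/// `huggingface-cli login`. Sketch only; synthclaw's internals may differ.
fn resolve_hf_token(config_token: Option<&str>) -> Option<String> {
    if let Some(token) = config_token {
        return Some(token.to_string());
    }
    if let Ok(token) = env::var("HF_TOKEN") {
        return Some(token);
    }
    let cache = cache_dir().join("huggingface").join("token");
    fs::read_to_string(cache).ok().map(|s| s.trim().to_string())
}

/// ~/.cache, falling back to the current directory if HOME is unset.
fn cache_dir() -> PathBuf {
    env::var("HOME")
        .map(|h| PathBuf::from(h).join(".cache"))
        .unwrap_or_else(|_| PathBuf::from("."))
}
```

The explicit config value always wins, so a per-project config can override a globally cached login.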
Load a HuggingFace dataset and generate from the library:

```rust
use synthclaw::{
    config::SynthConfig,
    datasets::{DataSource, HuggingFaceSource},
    providers::{create_provider, GenerationRequest},
};

// Load a HuggingFace dataset
let mut source = HuggingFaceSource::new(
    "cornell-movie-review-data/rotten_tomatoes".to_string(),
    None,
    "train".to_string(),
    None,
)?;
let records = source.load(Some(100))?;

// Create a provider and generate
let config = SynthConfig::from_file(&"config.yaml".into())?;
let provider = create_provider(&config.provider)?;
let response = provider.generate(GenerationRequest {
    prompt: "Generate a movie review".to_string(),
    system_prompt: Some("Output only the review text.".to_string()),
    temperature: Some(0.7),
    max_tokens: Some(500),
}).await?;
```

Validate and deduplicate results:

```rust
use synthclaw::validation::{
    Blocklist, Deduplicator, Json, JsonSchema, MinLength,
    ValidationPipeline, validate_and_dedupe,
};

let results = engine.run(&config).await?;
let pipeline = ValidationPipeline::new()
    .add(MinLength(20))
    .add(Json)
    .add(JsonSchema::require(&["question", "answer"]))
    .add(Blocklist::llm_artifacts());
let validated = validate_and_dedupe(results, &pipeline, Some(&Deduplicator::Normalized));
println!("passed: {}, failed: {}", validated.stats.passed, validated.stats.failed);
for r in validated.results { /* clean data */ }
```

Upload to the HuggingFace Hub (requires the HF_TOKEN env var):

```rust
use synthclaw::hub::DatasetUploader;

let uploader = DatasetUploader::new("username/my-dataset", false, None).await?;

// Upload JSONL data
let data = vec![json!({"q": "...", "a": "..."})];
uploader.upload_jsonl(&data, "train.jsonl").await?;

// Upload any file
uploader.upload("README.md", readme.as_bytes(), Some("Add README")).await?;
println!("Dataset: {}", uploader.repo_url());
```
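The dedupe modes offered by the validation config (exact, normalized, jaccard) correspond to different notions of "same text"; here is a rough sketch of the latter two, with assumed semantics rather than synthclaw's actual code:

```rust
use std::collections::HashSet;

/// Lowercase, trim, and collapse whitespace so trivially different
/// strings compare equal (the assumed "normalized" dedupe key).
fn normalize(text: &str) -> String {
    text.to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

/// Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|.
/// A "jaccard" dedupe would drop items above some similarity threshold.
fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}
```

Exact dedupe compares raw strings; normalized catches case and whitespace variants; Jaccard additionally catches near-duplicates that share most of their words.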
Supported output formats:

- `jsonl` - line-delimited JSON (recommended for large datasets)
- `csv` - comma-separated values
- `parquet` - Apache Parquet (efficient for analytics)
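JSONL suits large datasets because each record is a single line, so output can be streamed record by record. A minimal std-only sketch of the record shape (the `jsonl_record` helper is hypothetical; a real writer would use serde_json):

```rust
/// Minimal JSON string escaping (backslashes, quotes, newlines only);
/// illustrative, not a complete JSON escaper.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

/// Emit one JSONL record with `text` and `label` fields as a single line.
fn jsonl_record(text: &str, label: &str) -> String {
    format!("{{\"text\":\"{}\",\"label\":\"{}\"}}", escape(text), escape(label))
}
```

A file is then just these lines joined by `\n`, which is why JSONL writers never need the whole dataset in memory.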
Environment variables:

```shell
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
HF_TOKEN=hf_...  # for uploading to HuggingFace Hub
```

- Streaming pipeline (generate → validate → write, no memory accumulation)
- Checkpointing & resume
- Retry with exponential backoff
- Rate limiting
- Budget limits
- Gemini, Ollama, Azure OpenAI, Together AI, Groq
- HuggingFace Hub upload
- Dataset cards
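Retry with exponential backoff, listed above, follows a standard pattern; a sketch of the delay schedule (the constants here are assumptions, not synthclaw's defaults):

```rust
use std::time::Duration;

/// Delay before the nth retry: base * 2^attempt, capped at `max_ms`.
/// A production implementation would also add random jitter so that
/// concurrent requests do not retry in lockstep.
fn backoff_delay(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    // Cap the shift so the multiplier cannot overflow u64.
    let delay = base_ms.saturating_mul(1u64 << attempt.min(20));
    Duration::from_millis(delay.min(max_ms))
}
```

With a 500 ms base and a 30 s cap, the schedule runs 500 ms, 1 s, 2 s, 4 s, ... until it hits the cap.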