shaafsalman/Mimesis

Mimesis

Synthetic Dataset Generation Pipeline

Point at a dataset, pick a model, hit run. Mimesis turns raw seeds into training data — batched, parallel, and ready to push to HuggingFace.


Overview

Mimesis is a local-first, full-stack tool for generating synthetic LLM training data at scale. You bring seed rows (a CSV or a HuggingFace dataset), write a system instruction, pick a provider and model, tune your batching, and let the pipeline do the rest — with real-time WebSocket progress, graceful error handling, and optional push to the HF Hub when done.


Features

  • Any seed source — CSV upload or HuggingFace dataset (public or private via token)
  • Multi-column mapping — seed column, optional context column, optional per-row instruction column
  • 7 LLM providers out of the box — Anthropic, OpenAI, Gemini, DeepSeek, Together AI, OpenRouter, vLLM
  • 55+ hardcoded models — Claude 4.x/3.7/3.5, GPT-4o/4.5/o-series, Gemini 2.5/2.0/1.5, Llama 3.x, Qwen, Mistral, and more
  • Live model fetch — pull the live model list directly from any provider's API with one click
  • Test connection — validate credentials with a single dummy request before running
  • Custom model ID — override the catalogue with any model ID (fine-tunes, new releases)
  • vLLM support — local deployments via OpenAI-compatible API; auto-discover loaded models from /v1/models
  • GPU RPM estimator — estimate sustainable RPM from hardware specs (VRAM, GPU count, quantization)
  • Parallel workers — configurable concurrency with a shared token-bucket rate limiter
  • Preview mode — generate N samples for human review before committing to a full run
  • Real-time dashboard — WebSocket-driven progress bar, throughput chart, and live sample feed
  • Batch CSV saving — flush to disk every N samples; safe to restart on failure
  • HuggingFace push — upload the merged dataset to the HF Hub when complete
  • Dark / light mode — full theme support
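The batch-saving idea from the list above can be sketched in a few lines. This is a minimal illustration with a hypothetical `BatchWriter` class; the real logic lives in `backend/core/output_manager.py` and is not shown here:

```python
import csv
import os

class BatchWriter:
    """Accumulate generated samples and flush a CSV every `batch_size` rows.

    Hypothetical sketch of the batch-saving idea: each full batch lands on
    disk immediately, so a crashed run loses at most one partial batch.
    """

    def __init__(self, out_dir: str, batch_size: int = 100):
        self.out_dir = out_dir
        self.batch_size = batch_size
        self.buffer: list[dict] = []
        self.batch_index = 0
        os.makedirs(out_dir, exist_ok=True)

    def add(self, row: dict) -> None:
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, f"batch_{self.batch_index:04d}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(self.buffer[0].keys()))
            writer.writeheader()
            writer.writerows(self.buffer)
        self.buffer.clear()
        self.batch_index += 1
```

On restart, a merge step only has to concatenate the `batch_*.csv` files that made it to disk.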

Screenshots

Landing

Landing page


1 · Seeds

Configure your input data source — upload a CSV or load any HuggingFace dataset. Map columns to seed, context, and instruction fields. Write the system instruction that will be applied to every row.

Seeds page


2 · Provider

Select a provider and model, enter your API key, test the connection, and tune generation parameters. Switch to vLLM for local deployments — the GPU estimator calculates your max sustainable RPM from hardware specs. Live-fetch the full model list from any provider's API with a single click, or type in a custom model ID for fine-tunes and new releases.

Provider page


3 · Pipeline

Set batch size, worker concurrency, and preview count. Configure the output directory and optional HuggingFace Hub push. The Summary card gives a live snapshot of your full configuration before you proceed.

Pipeline page


4 · Preview

Generate a small sample batch before committing to the full run. Inspect each row — seed input, context, instruction, and generated output — side by side. Approve to proceed or go back and adjust your settings.

Preview page


5 · Run

Launch the full pipeline and watch it work. Real-time throughput chart, progress bar, elapsed and remaining time, and a live scrolling sample feed — all streamed over WebSocket.

Run page


6 · Output

Browse and download all generated files. Complete merged datasets and per-batch CSVs are listed with file sizes, timestamps, and one-click download buttons.

Output page


Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 18+

1 — Clone and install

```bash
git clone https://github.com/youruser/mimesis.git
cd mimesis

# Backend
cd backend && pip install -r requirements.txt && cd ..

# Frontend
cd frontend && npm install && cd ..
```

2 — Start both servers

```bash
./start.sh
```

| Service | URL |
|---|---|
| Frontend | http://localhost:5173 |
| Backend API | http://localhost:8000 |
| Interactive API docs | http://localhost:8000/docs |

Or start them individually:

```bash
# Terminal 1 — backend
cd backend && uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Terminal 2 — frontend
cd frontend && npm run dev
```

Providers & Models

Mimesis ships with a comprehensive hardcoded catalogue (55+ models) and supports live-fetching the full model list directly from each provider's API.

| Provider | Models included | Live fetch |
|---|---|---|
| Anthropic | Claude Opus/Sonnet/Haiku 4.x · Claude 3.7 Sonnet · 3.5 Sonnet/Haiku · 3.0 Opus/Sonnet/Haiku | ✓ |
| OpenAI | GPT-4o · GPT-4.5 · o1/o1-mini/o1-preview · o3/o3-mini · o4-mini · GPT-4 Turbo | ✓ |
| Google Gemini | Gemini 2.5 Pro/Flash · 2.0 Flash/Flash-Lite/Thinking/Pro · 1.5 Pro/Flash/Flash-8B | ✓ |
| DeepSeek | DeepSeek V3 (Chat) · R1 (Reasoner) · Coder V2 | ✓ |
| Together AI | Llama 3.1/3.2/3.3 · DeepSeek V3/R1 · Qwen 2.5 · Mixtral 8×7B/8×22B · Gemma 2 · Sky-T1 | ✓ |
| OpenRouter | 24 curated routes + full live catalogue (200+ models) | ✓ |
| vLLM | Any model — auto-discovered via `/v1/models` on your server | ✓ |
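vLLM's OpenAI-compatible server reports its loaded models in the standard `GET /v1/models` shape, so auto-discovery reduces to pulling the `id` field out of each entry. A small sketch against a sample payload (no live request is made, and the payload contents are illustrative):

```python
def extract_model_ids(models_response: dict) -> list[str]:
    """Pull model IDs out of an OpenAI-style GET /v1/models payload."""
    return [m["id"] for m in models_response.get("data", [])]

# Sample payload in the shape an OpenAI-compatible /v1/models endpoint returns.
sample = {
    "object": "list",
    "data": [
        {"id": "meta-llama/Llama-3.1-8B-Instruct", "object": "model"},
    ],
}

print(extract_model_ids(sample))  # → ['meta-llama/Llama-3.1-8B-Instruct']
```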

Rate limit presets

| Provider | Default (Tier 1) | Higher tiers |
|---|---|---|
| Anthropic | 60 RPM | up to 8,000 (Tier 4) |
| OpenAI | 500 RPM | up to 10,000 (Tier 3) |
| Gemini | 15 RPM (free) | 2,000 (Flash paid) · 360 (Pro paid) |
| DeepSeek | 60 RPM | |
| Together | 60 RPM (free) | 600 (paid) |
| OpenRouter | 200 RPM | |
| vLLM | auto-calculated from GPU hardware specs | |
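The shared token-bucket limiter that enforces these presets follows the textbook technique: refill tokens at `rpm / 60` per second, and have every worker take a token before each request. A rough asyncio sketch of the technique, not the project's actual `core/rate_limiter.py`:

```python
import asyncio
import time

class TokenBucket:
    """Minimal token bucket: `rpm` requests per minute, shared by all workers."""

    def __init__(self, rpm: int):
        self.capacity = rpm
        self.tokens = float(rpm)            # start full
        self.refill_rate = rpm / 60.0       # tokens added per second
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                elapsed = now - self.updated
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Not enough tokens yet: wait for roughly one token's worth of refill.
                await asyncio.sleep((1 - self.tokens) / self.refill_rate)

async def worker(bucket: TokenBucket, n: int) -> int:
    done = 0
    for _ in range(n):
        await bucket.acquire()
        done += 1                           # a real worker calls the provider here
    return done
```

Because the bucket is a single shared object, raising worker concurrency never pushes aggregate throughput past the configured RPM.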

Project Structure

mimesis/
├── backend/
│   ├── main.py                        # FastAPI entry point
│   ├── config.py                      # Model catalogue + rate limits
│   ├── requirements.txt
│   ├── core/
│   │   ├── pipeline.py                # Async generation engine
│   │   ├── seed_loader.py             # CSV / HuggingFace loader
│   │   ├── rate_limiter.py            # Token-bucket rate limiter
│   │   ├── output_manager.py          # CSV saving + HF Hub push
│   │   └── providers/
│   │       ├── base.py                # BaseProvider abstract class
│   │       ├── anthropic_provider.py
│   │       ├── openai_provider.py
│   │       ├── gemini_provider.py
│   │       ├── deepseek_provider.py
│   │       ├── together_provider.py
│   │       ├── openrouter_provider.py
│   │       └── vllm_provider.py
│   └── api/
│       ├── websocket.py               # WebSocket connection manager
│       └── routes/
│           ├── seeds.py               # /api/seeds/*
│           ├── providers.py           # /api/providers/*
│           └── pipeline.py            # /api/pipeline/*
│
├── frontend/
│   └── src/
│       ├── App.tsx                    # Router + layout shell
│       ├── types/index.ts             # Shared TypeScript types
│       ├── config/theme.ts            # Design tokens (colours, radii)
│       ├── store/pipelineStore.ts     # Zustand global state
│       ├── hooks/
│       │   ├── useApi.ts              # Typed REST API helpers
│       │   └── useWebSocket.ts        # Real-time event handler
│       ├── components/ui/             # Design-system primitives
│       └── pages/
│           ├── LandingPage.tsx
│           ├── SeedsPage.tsx
│           ├── ProviderPage.tsx
│           ├── PipelinePage.tsx
│           ├── PreviewPage.tsx
│           ├── RunPage.tsx
│           └── OutputPage.tsx
│
├── outputs/                           # Generated CSVs saved here
├── docs/
│   └── screenshots/                   # UI screenshots (used in this README)
└── start.sh

API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/providers/catalogue` | Model catalogue + rate limit info |
| POST | `/api/providers/fetch-models` | Live-fetch models from a provider's API |
| POST | `/api/providers/test-connection` | Validate credentials with a dummy request |
| POST | `/api/providers/vllm/calculate` | Estimate RPM from GPU hardware specs |
| POST | `/api/seeds/columns` | Inspect columns from a CSV or HF dataset |
| POST | `/api/seeds/upload` | Upload a CSV file |
| POST | `/api/pipeline/preview` | Run a small preview batch |
| POST | `/api/pipeline/start` | Start a full pipeline run |
| POST | `/api/pipeline/stop` | Stop a running pipeline |
| GET | `/api/pipeline/status` | Poll pipeline status |
| GET | `/api/pipeline/outputs` | List generated output files |
| WS | `/ws/{pipeline_id}` | Real-time progress stream |

Full interactive docs at http://localhost:8000/docs when the backend is running.


Adding a New Provider

  1. Create `backend/core/providers/myprovider_provider.py`
  2. Extend `BaseProvider` and implement `async generate(prompt, system) -> str`
  3. Register the class in `REGISTRY` in `backend/core/providers/__init__.py`
  4. Add the model list and rate limits to `PROVIDER_MODELS` in `backend/config.py`
  5. Add the provider name to `EXTERNAL_PROVIDERS` in `frontend/src/pages/ProviderPage.tsx`
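Steps 2 and 3 can be sketched as follows. The `BaseProvider` stand-in and the `REGISTRY` shape are assumptions for illustration, since the real `base.py` is not reproduced here:

```python
import asyncio
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    """Stand-in for backend/core/providers/base.py (assumed interface)."""

    @abstractmethod
    async def generate(self, prompt: str, system: str) -> str: ...

class EchoProvider(BaseProvider):
    """Hypothetical provider that just echoes its input (step 2)."""

    async def generate(self, prompt: str, system: str) -> str:
        # A real provider would call its HTTP API here (e.g. via httpx).
        return f"[{system}] {prompt}"

# Step 3: register the class under a provider-name key (shape assumed).
REGISTRY = {"echo": EchoProvider}

print(asyncio.run(REGISTRY["echo"]().generate("hello", "be brief")))  # → [be brief] hello
```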

Tech Stack

| Layer | Technology |
|---|---|
| Backend | Python 3.10+, FastAPI, uvicorn, asyncio, httpx |
| Frontend | React 18, TypeScript, Vite, Tailwind CSS |
| State management | Zustand |
| Charts | Recharts |
| Icons | Lucide React |
| LLM SDKs | anthropic, openai, google-generativeai |

License

MIT
