🎲 Synthetic Data Generator

A powerful, user-friendly web application for generating realistic synthetic data

🚀 Live Demo • 🐛 Report Bug • ✨ Request Feature

📋 Table of Contents

Overview
✨ Features
🎯 Use Cases
📊 Data Types
🚀 Quick Start
🛠️ Installation
🔧 Usage
⚙️ Configuration
📈 Performance
🗺️ Roadmap
📄 License
🙏 Acknowledgments
📞 Support

Overview

The Synthetic Data Generator is a comprehensive web application built with Streamlit that enables users to generate realistic, high-quality synthetic data for testing, development, analytics, and research purposes. With an intuitive interface and powerful generation capabilities, it supports multiple data types and export formats.

🌍 In today's increasingly data-driven world, organizations face unprecedented challenges around data access, privacy, and compliance. Synthetic data has emerged as a transformative solution that addresses these challenges while accelerating innovation.

🎯 Why Synthetic Data?

Privacy Protection: No real customer data at risk
Development Speed: Instant test data generation
Compliance: GDPR/CCPA/DPDP compliant testing environments
Cost Effective: No need to purchase or license real datasets
Scalable: Generate up to 10,000 records in a single run

✨ Features

🎨 Intuitive Web Interface

Clean, modern UI with responsive design
Real-time data preview and statistics
Interactive configuration panels
Progress indicators and loading states

📊 Multiple Data Types

Personal/Customer Data: Names, addresses, contact info
Sales Transactions: Purchase records, revenue data
Employee Records: HR data, performance metrics
Time Series: Temporal data with trends
Text Data: Reviews, posts, social media content
Application Logs: System events, user actions, errors
System Data: OS logs, metrics, security events

💾 Flexible Export Options

CSV, JSON, Excel formats
Individual file downloads
Bulk ZIP packages
Configurable data volumes (up to 10,000 records)

⚡ High Performance

Optimized data generation algorithms
Memory-efficient processing
Instant preview capabilities
Scalable architecture

🔒 Privacy & Security

No real data storage
Client-side processing
No personal information collected
Open-source transparency

🎯 Use Cases

| Use Case | Description | Benefits |
|---|---|---|
| Software Testing | Generate test datasets for QA and testing | Reliable, consistent test data |
| Development | Mock data for development environments | Faster development cycles |
| Data Science | Practice datasets for learning and prototyping | Real-world data structures |
| Demos & Training | Sample data for presentations and tutorials | Professional, realistic examples |
| Analytics Testing | Test dashboards and reporting tools | Comprehensive data coverage |
| Database Seeding | Populate development and staging databases | Realistic data relationships |

📊 Data Types

👤 Personal/Customer Data

Perfect for CRM systems, user databases, and customer analytics:

Identity: First name, last name, email, phone
Demographics: Age, gender, occupation, salary
Location: Full addresses with city, state, ZIP
Metadata: Creation dates, unique IDs

Sample Output:

id,first_name,last_name,email,phone,city,salary
550e8400-e29b-41d4-a716-446655440000,John,Doe,[email protected],(555) 123-4567,New York,75000
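
A minimal sketch of how a row like the sample above could be produced. It uses only the Python standard library so it runs anywhere; the app lists Faker as a dependency and presumably relies on it instead. The name pools, the `example.com` email domain, the salary range, and the `make_person` helper are illustrative assumptions, not the app's actual code.

```python
import random
import uuid

# Hypothetical value pools; the real app presumably draws these from Faker.
FIRST_NAMES = ["John", "Priya", "Maria", "Wei"]
LAST_NAMES = ["Doe", "Sharma", "Garcia", "Chen"]
CITIES = ["New York", "Chicago", "Austin", "Seattle"]


def make_person(rng: random.Random) -> dict:
    """Build one synthetic customer row with the columns from the sample."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        # Seeded UUIDv4 so that the whole record is reproducible.
        "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "first_name": first,
        "last_name": last,
        "email": f"{first}.{last}@example.com".lower(),
        "phone": f"(555) {rng.randint(100, 999)}-{rng.randint(0, 9999):04d}",
        "city": rng.choice(CITIES),
        "salary": rng.randrange(30000, 150001, 500),  # assumed range
    }


rng = random.Random(42)  # fixed seed gives the same rows on every run
people = [make_person(rng) for _ in range(5)]
for person in people:
    print(person)
```

Passing a seeded `random.Random` instance instead of using the module-level functions keeps test datasets reproducible, which matters when the same fixture has to be regenerated in CI.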

💰 Sales Transactions

Ideal for e-commerce platforms, retail analytics, and revenue tracking:

Products: Names, categories, prices, quantities
Transactions: IDs, dates, payment methods, discounts
Sales Data: Revenue, regions, sales representatives
Customer Info: Buyer IDs and preferences

Sample Output:

transaction_id,product_name,category,quantity,unit_price,total_amount,payment_method
TXN-001,Wireless Headphones,Electronics,2,199.99,399.98,Credit Card
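
One detail worth getting right in transaction data is that `total_amount` must agree with `quantity` and `unit_price`. The sketch below derives the total instead of sampling it independently. The catalog, the payment methods, and the `make_transaction` helper are hypothetical; the app's own generation logic is not shown in this README.

```python
import random
from decimal import Decimal

# Hypothetical catalog: (product_name, category, unit_price).
CATALOG = [
    ("Wireless Headphones", "Electronics", Decimal("199.99")),
    ("Running Shoes", "Apparel", Decimal("89.50")),
    ("Coffee Maker", "Home", Decimal("49.00")),
]
PAYMENT_METHODS = ["Credit Card", "Debit Card", "PayPal", "Cash"]


def make_transaction(index: int, rng: random.Random) -> dict:
    """Build one row; total_amount is derived so it always matches the line items."""
    name, category, unit_price = rng.choice(CATALOG)
    quantity = rng.randint(1, 5)
    return {
        "transaction_id": f"TXN-{index:03d}",
        "product_name": name,
        "category": category,
        "quantity": quantity,
        "unit_price": unit_price,
        "total_amount": unit_price * quantity,  # exact decimal arithmetic
        "payment_method": rng.choice(PAYMENT_METHODS),
    }


rng = random.Random(7)
transactions = [make_transaction(i, rng) for i in range(1, 4)]
for row in transactions:
    print(row)
```

`Decimal` is used for prices so that 2 × 199.99 comes out as exactly 399.98, as in the sample row, without binary floating-point rounding.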

👨‍💼 Employee Records

Essential for HR systems, payroll, and organizational analysis:

Personal: Employee IDs, names, contact information
Professional: Departments, positions, hire dates
Performance: Salaries, ratings, experience levels
Organization: Manager relationships, remote work status

Sample Output:

employee_id,first_name,department,position,salary,hire_date,performance_rating
EMP1001,Alice,Engineering,Senior Developer,95000,2022-03-15,4.2
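
The "manager relationships" mentioned above imply that a manager reference should point at an employee who actually exists in the dataset. A minimal sketch of one way to guarantee that, assuming a hypothetical `manager_id` column (the sample output does not show how the app names or fills this field):

```python
import random

DEPARTMENTS = ["Engineering", "Sales", "HR"]  # illustrative subset


def make_employees(count: int, rng: random.Random) -> list:
    """Pick each manager from rows created earlier, so every manager_id resolves."""
    employees = []
    for i in range(count):
        manager = rng.choice(employees)["employee_id"] if employees else None
        employees.append({
            "employee_id": f"EMP{1001 + i}",
            "department": rng.choice(DEPARTMENTS),
            "salary": rng.randrange(40000, 160001, 1000),  # assumed range
            "performance_rating": round(rng.uniform(1.0, 5.0), 1),
            "manager_id": manager,  # None marks the top of the hierarchy
        })
    return employees


employees = make_employees(6, random.Random(5))
for employee in employees:
    print(employee)
```

Because managers are only chosen from earlier rows, the hierarchy cannot contain cycles or self-references.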

📈 Time Series Data

Perfect for analytics, IoT applications, and financial modeling:

Temporal: Date/time sequences with various intervals
Metrics: Multiple measurement categories
Trends: Realistic growth and seasonal patterns
Analytics: Moving averages, cumulative values

Sample Output:

date,value,category_a,category_b,moving_avg_7d
2024-01-01,142.5,45.2,23.8,140.2
2024-01-02,138.7,52.1,28.3,141.1
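
A common way to obtain "realistic growth and seasonal patterns" is to add a trend, a periodic term, and random noise. The sketch below does that and computes a trailing 7-day moving average. The coefficients, the weekly period, and the shorter window used for the first six rows are all assumptions; the app's actual formula is not documented here.

```python
import math
import random
from datetime import date, timedelta


def make_series(start: date, days: int, rng: random.Random) -> list:
    """Daily value = linear trend + weekly seasonality + Gaussian noise."""
    rows = []
    values = []
    for i in range(days):
        trend = 140.0 + 0.1 * i                          # slow upward drift
        seasonal = 5.0 * math.sin(2 * math.pi * i / 7)   # weekly cycle
        value = round(trend + seasonal + rng.gauss(0, 2.0), 1)
        values.append(value)
        window = values[-7:]  # trailing window, shorter for the first six days
        rows.append({
            "date": (start + timedelta(days=i)).isoformat(),
            "value": value,
            "moving_avg_7d": round(sum(window) / len(window), 1),
        })
    return rows


series = make_series(date(2024, 1, 1), 14, random.Random(0))
for row in series[:3]:
    print(row)
```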

📝 Text Content

Great for content management systems, social media analysis, and NLP:

Product Reviews: Ratings, titles, detailed feedback
Blog Posts: Articles with metadata and engagement metrics
Social Media: Posts with hashtags, likes, and shares
Content Marketing: Realistic text for various platforms

Sample Output:

review_id,product_name,rating,review_title,review_text,helpful_votes
REV00001,Smart Watch Pro,4,"Great battery life","This watch exceeded my expectations...",23

📊 Application Logs

Perfect for testing log analysis tools, SIEM systems, and application monitoring:

Application Lifecycle: Start/stop events, deployments, health checks
User Interactions: Login attempts, button clicks, page views, form submissions
Data Generation: Performance metrics, processing times, record counts
File Operations: Upload/download events, file validation, backup operations
Error Tracking: Exception handling, stack traces, severity levels
Session Metrics: User engagement, session duration, feature usage

Sample Output:

timestamp,log_level,event_type,user_id,session_id,response_time_ms,success
2024-12-17 10:31:15,INFO,BUTTON_CLICK,user_1234,abc123ef,250,true
2024-12-17 10:31:45,ERROR,DatabaseConnectionError,user_5678,def456gh,5000,false
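
Log data is most useful for testing when timestamps increase monotonically and errors are rare but present. A minimal sketch under those assumptions; the event catalog, the weights, and the response-time ranges are hypothetical and not taken from the app:

```python
import random
from datetime import datetime, timedelta

# Hypothetical event catalog: (log_level, event_type, success).
EVENTS = [
    ("INFO", "BUTTON_CLICK", True),
    ("INFO", "PAGE_VIEW", True),
    ("WARNING", "SLOW_RESPONSE", True),
    ("ERROR", "DatabaseConnectionError", False),
]
WEIGHTS = [50, 35, 10, 5]  # errors are deliberately rare


def make_logs(start: datetime, count: int, rng: random.Random) -> list:
    """Emit log rows whose timestamps strictly increase."""
    rows, now = [], start
    for _ in range(count):
        now += timedelta(seconds=rng.randint(1, 60))
        level, event, ok = rng.choices(EVENTS, weights=WEIGHTS, k=1)[0]
        rows.append({
            "timestamp": now.strftime("%Y-%m-%d %H:%M:%S"),
            "log_level": level,
            "event_type": event,
            "user_id": f"user_{rng.randint(1000, 9999)}",
            "session_id": f"{rng.getrandbits(32):08x}",
            # Failed requests are given much longer response times.
            "response_time_ms": rng.randint(50, 500) if ok else rng.randint(3000, 6000),
            "success": ok,
        })
    return rows


logs = make_logs(datetime(2024, 12, 17, 10, 30, 0), 10, random.Random(1))
for row in logs[:3]:
    print(row)
```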

🖥️ System Data

Enterprise-grade system monitoring data for infrastructure testing and analysis:

System Logs

Operating system and service logs with realistic patterns:

Services: SSH, Nginx, MySQL, Docker, Systemd events
Security: Login attempts, authentication failures, access violations
System Events: Service starts/stops, configuration changes, errors
Network: Connection logs, firewall events, traffic patterns

Performance Metrics

Comprehensive system performance and resource utilization:

CPU: Usage percentages, load averages, core temperatures
Memory: RAM usage, swap utilization, buffer/cache statistics
Disk I/O: Read/write operations, throughput, queue depths
Network: Bandwidth utilization, packet rates, connection counts

Resource Usage

Detailed per-process and per-service resource consumption:

Process Details: PIDs, command lines, parent relationships
Resource Allocation: CPU time, memory usage, file descriptors
System Calls: I/O operations, network activity, thread counts
Performance: Response times, error rates, throughput metrics

Security Events

Security-focused monitoring and incident data:

Authentication: Login successes/failures, privilege escalations
Network Security: Intrusion attempts, firewall blocks, geo-location data
Access Control: File access, permission changes, audit trails
Threat Intelligence: Severity classifications, threat indicators

Infrastructure Monitoring

Enterprise infrastructure health and availability data:

Component Health: Servers, databases, load balancers, storage systems
Availability Metrics: Uptime percentages, response times, error rates
Hardware Status: Temperature sensors, power consumption, firmware versions
Maintenance: Backup status, alert counts, scheduled maintenance windows

Sample Output - System Logs:

timestamp,hostname,severity,service,message,source_ip,user
2024-12-17 10:30:45,web-server-01,INFO,ssh,"Accepted password for admin from 192.168.1.100",192.168.1.100,admin
2024-12-17 10:30:50,db-primary,ERROR,mysql,"Access denied for user 'backup'@'10.0.0.50'",10.0.0.50,backup

Sample Output - Performance Metrics:

timestamp,hostname,cpu_usage_percent,memory_usage_percent,disk_read_mb_per_sec,network_in_mbps
2024-12-17 10:30:00,web-server-01,23.5,67.2,12.8,45.6
2024-12-17 10:31:00,web-server-01,45.8,68.1,8.4,52.3

Sample Output - Security Events:

timestamp,hostname,event_type,severity,source_ip,action,threat_level,success
2024-12-17 10:30:45,firewall,intrusion_attempt,HIGH,198.51.100.25,DENY,High,false
2024-12-17 10:31:20,web-server-01,login_attempt,MEDIUM,203.0.113.45,ALLOW,Low,true
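
For metrics such as those in the Performance Metrics sample above, consecutive readings should be correlated instead of independent. A bounded random walk is one simple way to get that. This is a sketch only: the starting levels, step sizes, and helper names are assumptions, and the disk and network columns are left out for brevity.

```python
import random
from datetime import datetime, timedelta


def clamp(value: float, low: float = 0.0, high: float = 100.0) -> float:
    """Keep a percentage reading inside its valid range."""
    return max(low, min(high, value))


def make_metrics(hostname: str, start: datetime, minutes: int, rng: random.Random) -> list:
    """One row per minute; each reading drifts from the previous one."""
    cpu, mem = 25.0, 65.0  # assumed starting levels
    rows = []
    for i in range(minutes):
        cpu = clamp(cpu + rng.uniform(-10, 10))  # CPU moves quickly
        mem = clamp(mem + rng.uniform(-1, 1))    # memory moves slowly
        rows.append({
            "timestamp": (start + timedelta(minutes=i)).strftime("%Y-%m-%d %H:%M:%S"),
            "hostname": hostname,
            "cpu_usage_percent": round(cpu, 1),
            "memory_usage_percent": round(mem, 1),
        })
    return rows


metrics = make_metrics("web-server-01", datetime(2024, 12, 17, 10, 30), 5, random.Random(3))
for row in metrics:
    print(row)
```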

---

🚀 Quick Start

🌐 Option 1: Use Online (Recommended)

Visit the live application: Launch App →

Select your data type from the sidebar
Configure the number of records
Choose export format
Click "Generate Data"
Preview and download your dataset

💻 Option 2: Run Locally

Clone the repository

git clone https://github.com/sarkarbikram90/Synthetic-Data-Generator.git

cd Synthetic-Data-Generator

Install dependencies

pip install -r requirements.txt

Launch the application

streamlit run app.py

Access the app at http://localhost:8501

🛠️ Installation

Prerequisites

Python 3.8 or higher
pip package manager

Dependencies

The application uses the following key libraries:

streamlit>=1.28.0    # Web app framework
pandas>=1.5.0        # Data manipulation
numpy>=1.24.0        # Numerical computing
faker>=19.0.0        # Realistic fake data generation
openpyxl>=3.1.0      # Excel file support
python-dateutil>=2.8.0  # Date/time utilities

Installation Steps

Clone the repository:

git clone https://github.com/sarkarbikram90/Synthetic-Data-Generator.git
cd Synthetic-Data-Generator

Create a virtual environment (recommended):

python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```
Run the application:
```
streamlit run app.py
```

🔧 Usage

Basic Workflow

Configure Data Type

Sidebar → Select Data Type → Choose from 7 options

Set Parameters

Sidebar → Number of Records → Slide to desired amount (up to 10,000)

Generate Data

Sidebar → Generate Data Button → Wait for processing

Preview Results

Main Area → Review generated data → Check statistics

Export Data

Download Section → Choose format → Click download button

Advanced Features

Column Information: Expand the "Column Information" section to see data types, null counts, and unique values
Multiple Formats: Select "All Formats" to download CSV, JSON, and Excel files in one ZIP package
Data Validation: Review the statistics cards to ensure data quality meets your requirements
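
The figures reported under "Column Information" (data types, null counts, unique values) can be reproduced on a downloaded dataset with a few lines of Python. The `column_info` helper below is a hypothetical standard-library sketch, not the app's code; with pandas, which the app lists as a dependency, `DataFrame.dtypes`, `DataFrame.isna().sum()`, and `DataFrame.nunique()` give the equivalent numbers.

```python
def column_info(rows: list) -> dict:
    """Summarize each column: type names seen, null count, unique non-null values."""
    info = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        info[col] = {
            "types": sorted({type(v).__name__ for v in non_null}),
            "nulls": len(values) - len(non_null),
            "unique": len(set(non_null)),
        }
    return info


sample = [
    {"city": "New York", "salary": 75000},
    {"city": "Austin", "salary": None},
    {"city": "New York", "salary": 82000},
]
summary = column_info(sample)
print(summary)
```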

⚙️ Configuration

Customizing Data Generation

The application allows for extensive customization through the sidebar interface:

Data Volume

Minimum: 10 records (for quick testing)
Maximum: 10,000 records (for comprehensive datasets)
Recommendation: Start with 100-500 records for initial evaluation

Export Formats

CSV: Universal compatibility, best for data analysis
JSON: API integration, NoSQL databases
Excel: Business reporting, spreadsheet analysis
ZIP Package: All formats for comprehensive delivery
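
A rough sketch of how the same rows can be rendered as CSV and JSON and bundled into a single ZIP archive in memory. The function names and the `synthetic_data` base name are assumptions. Excel output is omitted because it needs a third-party writer such as openpyxl, which the app lists as a dependency.

```python
import csv
import io
import json
import zipfile


def to_csv(rows: list) -> str:
    """Render rows as CSV text with a header taken from the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows: list) -> str:
    """Render rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_zip(rows: list, basename: str = "synthetic_data") -> bytes:
    """Bundle the CSV and JSON renderings into one in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(f"{basename}.csv", to_csv(rows))
        archive.writestr(f"{basename}.json", to_json(rows))
    return buffer.getvalue()


rows = [
    {"transaction_id": "TXN-001", "quantity": 2, "unit_price": 199.99},
    {"transaction_id": "TXN-002", "quantity": 1, "unit_price": 49.0},
]
package = to_zip(rows)
print(len(package), "bytes")
```

Building the archive in a `BytesIO` buffer avoids temporary files; in a Streamlit app, bytes like these could be passed to `st.download_button`.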

Text Content Types

When selecting "Text Data", choose from:

Reviews: Product feedback with ratings and sentiment
Blog Posts: Articles with metadata and engagement metrics
Social Media: Posts with hashtags and social signals

📈 Performance

Benchmarks

| Records | Generation Time | Memory Usage | File Size (CSV) |
|---|---|---|---|
| 100 | < 1 second | ~2 MB | ~15 KB |
| 1,000 | ~2 seconds | ~5 MB | ~150 KB |
| 5,000 | ~8 seconds | ~20 MB | ~750 KB |

Optimization Tips

Batch Processing: Generate large datasets in chunks if memory is limited
Format Selection: Use CSV for fastest processing and smallest file sizes
Preview First: Always preview smaller samples before generating large datasets
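
The batch-processing tip above can be sketched with a generator that yields rows lazily and a writer that flushes them in fixed-size chunks, so at most one chunk is held in memory at a time. The helper names, the two-column row shape, and the chunk size of 1,000 are illustrative assumptions.

```python
import csv
import io
import random


def generate_rows(total: int, rng: random.Random):
    """Yield rows one at a time so the full dataset never sits in memory."""
    for i in range(1, total + 1):
        yield {"id": i, "value": round(rng.uniform(0, 100), 2)}


def write_in_chunks(rows, out, chunk_size: int = 1000) -> int:
    """Write rows to a CSV stream chunk by chunk; returns the number of rows written."""
    writer = None
    chunk, written = [], 0
    for row in rows:
        if writer is None:  # create the writer once the columns are known
            writer = csv.DictWriter(out, fieldnames=list(row.keys()))
            writer.writeheader()
        chunk.append(row)
        if len(chunk) >= chunk_size:
            writer.writerows(chunk)
            written += len(chunk)
            chunk = []
    if chunk:  # flush the final partial chunk
        writer.writerows(chunk)
        written += len(chunk)
    return written


out = io.StringIO()
count = write_in_chunks(generate_rows(2500, random.Random(0)), out, chunk_size=1000)
print(count, "rows written")
```

In practice `out` would be a file opened with `newline=""` instead of a `StringIO` buffer.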

🗺️ Roadmap

Version 2.0 (Planned)

API Endpoints: RESTful API for programmatic access
Custom Schemas: User-defined data structures
Data Relationships: Foreign keys and referential integrity
Advanced Text: AI-powered content generation
Real-time Streaming: Live data generation capabilities

Version 2.1 (Future)

User Accounts: Save preferences and generation history
Collaboration: Team workspaces and shared templates
Enterprise Features: SSO, audit logs, advanced security
Cloud Integration: Direct export to cloud storage
Advanced Analytics: Data profiling and quality metrics

Long-term Vision

Machine Learning: Generate data based on existing patterns
Industry Templates: Pre-built schemas for common use cases
Multi-language Support: Internationalization
Mobile App: iOS and Android applications

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 Bikram Sarkar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

🙏 Acknowledgments

Streamlit - For the amazing web app framework
Faker - For realistic fake data generation
Pandas - For powerful data manipulation capabilities
NumPy - For numerical computing support

Special Thanks

The open-source community for inspiration

📞 Support

Get Help

📖 Documentation: Check this README and inline code comments
🐛 Bug Reports: GitHub Issues
💡 Feature Requests: GitHub Issues
📧 Contact: [[email protected]] (for urgent matters)

Community

⭐ Star this repo if you find it useful!
🍴 Fork to create your own version
👀 Watch for updates and new releases
📢 Share with your network

Built with ❤️ by Bikram Sarkar
