sarkarbikram90/Synthetic-Data-Generator

🎲 Synthetic Data Generator

A powerful, user-friendly web application for generating realistic synthetic data

🚀 Live Demo • 🐛 Report Bug • ✨ Request Feature



Overview

The Synthetic Data Generator is a comprehensive web application built with Streamlit that enables users to generate realistic, high-quality synthetic data for testing, development, analytics, and research purposes. With an intuitive interface and powerful generation capabilities, it supports multiple data types and export formats.

🌍 In today's increasingly data-driven world, organizations face growing challenges around data access, privacy, and compliance. Synthetic data addresses these challenges by replacing real records with generated ones, so teams can build and test without waiting for access to production data.

🎯 Why Synthetic Data?

  • Privacy Protection: No real customer data at risk
  • Development Speed: Instant test data generation
  • Compliance: Test environments hold no personal data, which simplifies GDPR/CCPA/DPDP obligations
  • Cost Effective: No need to purchase or license real datasets
  • Scalable: Generate up to 10,000 records per run

✨ Features

🎨 Intuitive Web Interface

  • Clean, modern UI with responsive design
  • Real-time data preview and statistics
  • Interactive configuration panels
  • Progress indicators and loading states

📊 Multiple Data Types

  • Personal/Customer Data: Names, addresses, contact info
  • Sales Transactions: Purchase records, revenue data
  • Employee Records: HR data, performance metrics
  • Time Series: Temporal data with trends
  • Text Data: Reviews, posts, social media content
  • Application Logs: System events, user actions, errors
  • System Data: OS logs, metrics, security events

💾 Flexible Export Options

  • CSV, JSON, Excel formats
  • Individual file downloads
  • Bulk ZIP packages
  • Configurable data volumes (up to 10,000 records)

⚡ High Performance

  • Optimized data generation algorithms
  • Memory-efficient processing
  • Instant preview capabilities
  • Scalable architecture

🔒 Privacy & Security

  • No real data storage
  • Client-side processing
  • No personal information collected
  • Open-source transparency

🎯 Use Cases

| Use Case | Description | Benefits |
|---|---|---|
| Software Testing | Generate test datasets for QA and testing | Reliable, consistent test data |
| Development | Mock data for development environments | Faster development cycles |
| Data Science | Practice datasets for learning and prototyping | Real-world data structures |
| Demos & Training | Sample data for presentations and tutorials | Professional, realistic examples |
| Analytics Testing | Test dashboards and reporting tools | Comprehensive data coverage |
| Database Seeding | Populate development and staging databases | Realistic data relationships |
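
As an illustration of the Database Seeding use case, the sketch below loads generated rows into an in-memory SQLite database. The table name, columns, and the `seed_customers` helper are hypothetical and are not part of this project; it uses only the Python standard library.

```python
import sqlite3


def seed_customers(connection, rows):
    """Insert synthetic customer rows and return the resulting row count.

    `rows` is a list of dicts with the (assumed) keys id, first_name, city.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "id INTEGER PRIMARY KEY, first_name TEXT NOT NULL, city TEXT NOT NULL)"
    )
    # Named placeholders let each dict map directly onto the columns.
    connection.executemany(
        "INSERT INTO customers (id, first_name, city) "
        "VALUES (:id, :first_name, :city)",
        rows,
    )
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = [
        {"id": 1, "first_name": "John", "city": "New York"},
        {"id": 2, "first_name": "Alice", "city": "Austin"},
    ]
    print(seed_customers(conn, sample))
```

For a staging database, the same pattern should carry over to other drivers, though placeholder syntax differs between them.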

📊 Data Types

👤 Personal/Customer Data

Perfect for CRM systems, user databases, and customer analytics:

  • Identity: First name, last name, email, phone
  • Demographics: Age, gender, occupation, salary
  • Location: Full addresses with city, state, ZIP
  • Metadata: Creation dates, unique IDs

Sample Output:

id,first_name,last_name,email,phone,city,salary
550e8400-e29b-41d4-a716-446655440000,John,Doe,[email protected],(555) 123-4567,New York,75000
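
A minimal sketch of how rows with these columns could be produced. It is not the app's actual generator: the project lists Faker as a dependency, while this example uses only the standard library with small hypothetical value pools so it runs anywhere. The function names are assumptions.

```python
import random
import uuid

# Hypothetical value pools; a library such as Faker supplies far richer ones.
FIRST_NAMES = ["John", "Alice", "Priya", "Carlos"]
LAST_NAMES = ["Doe", "Johnson", "Patel", "Garcia"]
CITIES = ["New York", "Chicago", "Austin", "Seattle"]


def make_person(rng):
    """Build one synthetic customer row with the columns from the sample."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        # Derive the UUID from the seeded RNG so runs are reproducible.
        "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "first_name": first,
        "last_name": last,
        "email": f"{first}.{last}@example.com".lower(),
        "phone": f"(555) {rng.randint(100, 999)}-{rng.randint(0, 9999):04d}",
        "city": rng.choice(CITIES),
        "salary": rng.randrange(30000, 150001, 1000),
    }


def make_people(n, seed=42):
    """Return n rows; the same seed always yields the same rows."""
    rng = random.Random(seed)
    return [make_person(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in make_people(3):
        print(row)
```

Seeding the generator keeps test fixtures stable between runs, which matters when the data feeds automated tests.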

💰 Sales Transactions

Ideal for e-commerce platforms, retail analytics, and revenue tracking:

  • Products: Names, categories, prices, quantities
  • Transactions: IDs, dates, payment methods, discounts
  • Sales Data: Revenue, regions, sales representatives
  • Customer Info: Buyer IDs and preferences

Sample Output:

transaction_id,product_name,category,quantity,unit_price,total_amount,payment_method
TXN-001,Wireless Headphones,Electronics,2,199.99,399.98,Credit Card
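
In the sample row, `total_amount` equals `quantity` times `unit_price` (2 × 199.99 = 399.98). The sketch below shows one way to keep that relationship exact by using `Decimal` rather than floats. The catalogue, payment methods, and function name are hypothetical and do not come from the project.

```python
import random
from decimal import Decimal

# Hypothetical catalogue: (product_name, category, unit_price).
CATALOGUE = [
    ("Wireless Headphones", "Electronics", Decimal("199.99")),
    ("Desk Lamp", "Home", Decimal("34.50")),
    ("Running Shoes", "Sports", Decimal("89.00")),
]
PAYMENT_METHODS = ["Credit Card", "Debit Card", "PayPal", "Cash"]


def make_transaction(index, rng):
    """Build one transaction whose total is derived, never drawn at random."""
    name, category, unit_price = rng.choice(CATALOGUE)
    quantity = rng.randint(1, 5)
    return {
        "transaction_id": f"TXN-{index:03d}",
        "product_name": name,
        "category": category,
        "quantity": quantity,
        "unit_price": unit_price,
        # Decimal arithmetic avoids float rounding drift in money columns.
        "total_amount": unit_price * quantity,
        "payment_method": rng.choice(PAYMENT_METHODS),
    }


if __name__ == "__main__":
    rng = random.Random(42)
    for i in range(1, 4):
        print(make_transaction(i, rng))
```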
πŸ‘¨β€πŸ’Ό Employee Records

Essential for HR systems, payroll, and organizational analysis:

  • Personal: Employee IDs, names, contact information
  • Professional: Departments, positions, hire dates
  • Performance: Salaries, ratings, experience levels
  • Organization: Manager relationships, remote work status

Sample Output:

employee_id,first_name,department,position,salary,hire_date,performance_rating
EMP1001,Alice,Engineering,Senior Developer,95000,2022-03-15,4.2

📈 Time Series Data

Perfect for analytics, IoT applications, and financial modeling:

  • Temporal: Date/time sequences with various intervals
  • Metrics: Multiple measurement categories
  • Trends: Realistic growth and seasonal patterns
  • Analytics: Moving averages, cumulative values

Sample Output:

date,value,category_a,category_b,moving_avg_7d
2024-01-01,142.5,45.2,23.8,140.2
2024-01-02,138.7,52.1,28.3,141.1
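
One common way to get trends, seasonal patterns, and a moving average like those described above is to sum a linear trend, a sine-wave cycle, and random noise, then average over a sliding window. The sketch below assumes that decomposition; the coefficients and the `make_series` name are illustrative, not the app's actual values.

```python
import math
import random
from datetime import date, timedelta


def make_series(days, start=date(2024, 1, 1), seed=42):
    """Daily series = linear trend + weekly seasonality + Gaussian noise."""
    rng = random.Random(seed)
    rows, window = [], []
    for i in range(days):
        trend = 100 + 0.5 * i                          # steady growth
        seasonal = 10 * math.sin(2 * math.pi * i / 7)  # 7-day cycle
        noise = rng.gauss(0, 3)
        value = round(trend + seasonal + noise, 1)

        # Keep only the last 7 values for the trailing moving average.
        window.append(value)
        if len(window) > 7:
            window.pop(0)

        rows.append({
            "date": (start + timedelta(days=i)).isoformat(),
            "value": value,
            "moving_avg_7d": round(sum(window) / len(window), 1),
        })
    return rows


if __name__ == "__main__":
    for row in make_series(10):
        print(row)
```

During the first six days the average covers fewer than seven points, so early `moving_avg_7d` values track the raw series closely.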
πŸ“ Text Content

Great for content management systems, social media analysis, and NLP:

  • Product Reviews: Ratings, titles, detailed feedback
  • Blog Posts: Articles with metadata and engagement metrics
  • Social Media: Posts with hashtags, likes, and shares
  • Content Marketing: Realistic text for various platforms

Sample Output:

review_id,product_name,rating,review_title,review_text,helpful_votes
REV00001,Smart Watch Pro,4,"Great battery life","This watch exceeded my expectations...",23

📊 Application Logs

Perfect for testing log analysis tools, SIEM systems, and application monitoring:

  • Application Lifecycle: Start/stop events, deployments, health checks
  • User Interactions: Login attempts, button clicks, page views, form submissions
  • Data Generation: Performance metrics, processing times, record counts
  • File Operations: Upload/download events, file validation, backup operations
  • Error Tracking: Exception handling, stack traces, severity levels
  • Session Metrics: User engagement, session duration, feature usage

Sample Output:

timestamp,log_level,event_type,user_id,session_id,response_time_ms,success
2024-12-17 10:31:15,INFO,BUTTON_CLICK,user_1234,abc123ef,250,true
2024-12-17 10:31:45,ERROR,DatabaseConnectionError,user_5678,def456gh,5000,false
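
Log data tends to look more realistic when severity levels are skewed toward INFO and failures carry longer response times, as in the two sample rows. The following is a sketch under those assumptions; the weights, event names, and `make_logs` function are hypothetical, not taken from the app.

```python
import random
from datetime import datetime, timedelta

# Hypothetical distribution: mostly INFO, occasional WARNING, rare ERROR.
LEVEL_WEIGHTS = {"INFO": 0.80, "WARNING": 0.15, "ERROR": 0.05}
EVENTS = {
    "INFO": ["BUTTON_CLICK", "PAGE_VIEW", "LOGIN"],
    "WARNING": ["SLOW_RESPONSE"],
    "ERROR": ["DatabaseConnectionError", "TimeoutError"],
}


def make_logs(n, start=datetime(2024, 12, 17, 10, 30, 0), seed=42):
    """Generate n log rows with strictly increasing timestamps."""
    rng = random.Random(seed)
    levels = list(LEVEL_WEIGHTS)
    weights = list(LEVEL_WEIGHTS.values())
    timestamp = start
    rows = []
    for _ in range(n):
        timestamp += timedelta(seconds=rng.randint(1, 60))
        level = rng.choices(levels, weights=weights, k=1)[0]
        is_error = level == "ERROR"
        rows.append({
            "timestamp": timestamp.strftime("%Y-%m-%d %H:%M:%S"),
            "log_level": level,
            "event_type": rng.choice(EVENTS[level]),
            "user_id": f"user_{rng.randint(1000, 9999)}",
            "session_id": f"{rng.getrandbits(32):08x}",
            # Errors are modelled as slow; normal requests as fast.
            "response_time_ms": rng.randint(1000, 5000) if is_error else rng.randint(20, 500),
            "success": not is_error,
        })
    return rows


if __name__ == "__main__":
    for row in make_logs(5):
        print(row)
```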

🖥️ System Data

Enterprise-grade system monitoring data for infrastructure testing and analysis:

System Logs

Operating system and service logs with realistic patterns:

  • Services: SSH, Nginx, MySQL, Docker, Systemd events
  • Security: Login attempts, authentication failures, access violations
  • System Events: Service starts/stops, configuration changes, errors
  • Network: Connection logs, firewall events, traffic patterns

Performance Metrics

Comprehensive system performance and resource utilization:

  • CPU: Usage percentages, load averages, core temperatures
  • Memory: RAM usage, swap utilization, buffer/cache statistics
  • Disk I/O: Read/write operations, throughput, queue depths
  • Network: Bandwidth utilization, packet rates, connection counts

Resource Usage

Detailed per-process and per-service resource consumption:

  • Process Details: PIDs, command lines, parent relationships
  • Resource Allocation: CPU time, memory usage, file descriptors
  • System Calls: I/O operations, network activity, thread counts
  • Performance: Response times, error rates, throughput metrics

Security Events

Security-focused monitoring and incident data:

  • Authentication: Login successes/failures, privilege escalations
  • Network Security: Intrusion attempts, firewall blocks, geo-location data
  • Access Control: File access, permission changes, audit trails
  • Threat Intelligence: Severity classifications, threat indicators

Infrastructure Monitoring

Enterprise infrastructure health and availability data:

  • Component Health: Servers, databases, load balancers, storage systems
  • Availability Metrics: Uptime percentages, response times, error rates
  • Hardware Status: Temperature sensors, power consumption, firmware versions
  • Maintenance: Backup status, alert counts, scheduled maintenance windows

Sample Output - System Logs:

timestamp,hostname,severity,service,message,source_ip,user
2024-12-17 10:30:45,web-server-01,INFO,ssh,"Accepted password for admin from 192.168.1.100",192.168.1.100,admin
2024-12-17 10:30:50,db-primary,ERROR,mysql,"Access denied for user 'backup'@'10.0.0.50'",10.0.0.50,backup

Sample Output - Performance Metrics:

timestamp,hostname,cpu_usage_percent,memory_usage_percent,disk_read_mb_per_sec,network_in_mbps
2024-12-17 10:30:00,web-server-01,23.5,67.2,12.8,45.6
2024-12-17 10:31:00,web-server-01,45.8,68.1,8.4,52.3

Sample Output - Security Events:

timestamp,hostname,event_type,severity,source_ip,action,threat_level,success
2024-12-17 10:30:45,firewall,intrusion_attempt,HIGH,198.51.100.25,DENY,High,false
2024-12-17 10:31:20,web-server-01,login_attempt,MEDIUM,203.0.113.45,ALLOW,Low,true

🚀 Quick Start

🌐 Option 1: Use Online (Recommended)

Visit the live application: Launch App →

  1. Select your data type from the sidebar
  2. Configure the number of records
  3. Choose export format
  4. Click "Generate Data"
  5. Preview and download your dataset

💻 Option 2: Run Locally

Clone the repository

git clone https://github.com/sarkarbikram90/Synthetic-Data-Generator.git
cd Synthetic-Data-Generator

Install dependencies

pip install -r requirements.txt

Launch the application

streamlit run app.py

Access the app at http://localhost:8501


🛠️ Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Dependencies

The application uses the following key libraries:

streamlit>=1.28.0    # Web app framework
pandas>=1.5.0        # Data manipulation
numpy>=1.24.0        # Numerical computing
faker>=19.0.0        # Realistic fake data generation
openpyxl>=3.1.0      # Excel file support
python-dateutil>=2.8.0  # Date/time utilities

Installation Steps

  1. Clone the repository:

    git clone https://github.com/sarkarbikram90/Synthetic-Data-Generator.git
    cd Synthetic-Data-Generator
  2. Create a virtual environment (recommended):

    python -m venv venv
    
    # Activate virtual environment
    # On Windows:
    venv\Scripts\activate
    # On macOS/Linux:
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Run the application:

    streamlit run app.py


🔧 Usage

Basic Workflow

  1. Configure Data Type

    Sidebar → Select Data Type → Choose from 7 options
    
  2. Set Parameters

    Sidebar → Number of Records → Slide to desired amount (up to 10,000)
    
  3. Generate Data

    Sidebar → Generate Data Button → Wait for processing
    
  4. Preview Results

    Main Area → Review generated data → Check statistics
    
  5. Export Data

    Download Section → Choose format → Click download button
    

Advanced Features

  • Column Information: Expand the "Column Information" section to see data types, null counts, and unique values
  • Multiple Formats: Select "All Formats" to download CSV, JSON, and Excel files in one ZIP package
  • Data Validation: Review the statistics cards to ensure data quality meets your requirements
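
The kind of summary the "Column Information" panel reports (data types, null counts, unique values) can be approximated outside the app with a few lines of standard-library Python. This is a sketch, not the app's code; the `column_info` name and the returned structure are assumptions.

```python
def column_info(rows):
    """Summarize each column: type names seen, null count, unique count.

    `rows` is a list of dicts that all share the keys of the first row.
    """
    if not rows:
        return {}
    info = {}
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        non_null = [v for v in values if v is not None]
        info[column] = {
            "types": sorted({type(v).__name__ for v in non_null}),
            "nulls": len(values) - len(non_null),
            "unique": len(set(non_null)),
        }
    return info


if __name__ == "__main__":
    sample = [
        {"city": "Austin", "salary": 75000},
        {"city": "Austin", "salary": None},
        {"city": "Seattle", "salary": 82000},
    ]
    for name, summary in column_info(sample).items():
        print(name, summary)
```

A column that reports more than one type, or an unexpected null count, is usually the first sign that a downstream schema will reject the file.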

βš™οΈ Configuration

Customizing Data Generation

The application allows for extensive customization through the sidebar interface:

Data Volume

  • Minimum: 10 records (for quick testing)
  • Maximum: 10,000 records (for comprehensive datasets)
  • Recommendation: Start with 100-500 records for initial evaluation

Export Formats

  • CSV: Universal compatibility, best for data analysis
  • JSON: API integration, NoSQL databases
  • Excel: Business reporting, spreadsheet analysis
  • ZIP Package: All formats for comprehensive delivery
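
To make the formats concrete, here is a sketch that serializes the same rows to CSV and JSON and bundles both into an in-memory ZIP archive. Excel is left out because it needs a third-party package (the project lists openpyxl). The helper names and the `synthetic_data` base name are hypothetical, not the app's.

```python
import csv
import io
import json
import zipfile


def to_csv(rows):
    """Serialize rows (a list of dicts with identical keys) to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize rows to a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_zip(rows, basename="synthetic_data"):
    """Bundle the CSV and JSON renderings into one ZIP archive (as bytes)."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr(f"{basename}.csv", to_csv(rows))
        bundle.writestr(f"{basename}.json", to_json(rows))
    return archive.getvalue()


if __name__ == "__main__":
    data = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Seattle"}]
    print(to_csv(data))
    print(len(to_zip(data)), "bytes in archive")
```

Building the archive in memory avoids temporary files, which suits a web app that hands the bytes straight to a download button.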

Text Content Types

When selecting "Text Data", choose from:

  • Reviews: Product feedback with ratings and sentiment
  • Blog Posts: Articles with metadata and engagement metrics
  • Social Media: Posts with hashtags and social signals

📈 Performance

Benchmarks

| Records | Generation Time | Memory Usage | File Size (CSV) |
|---|---|---|---|
| 100 | < 1 second | ~2 MB | ~15 KB |
| 1,000 | ~2 seconds | ~5 MB | ~150 KB |
| 5,000 | ~8 seconds | ~20 MB | ~750 KB |

Optimization Tips

  • Batch Processing: Generate large datasets in chunks if memory is limited
  • Format Selection: Use CSV for fastest processing and smallest file sizes
  • Preview First: Always preview smaller samples before generating large datasets
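
The batch-processing tip can be sketched as a generator that yields fixed-size chunks, each written to disk before the next is built, so only one chunk is held in memory at a time. The two-column row shape and the function names are placeholders, not the app's implementation.

```python
import csv
import random


def iter_chunks(total, chunk_size, seed=42):
    """Yield lists of rows, each at most chunk_size long, totalling `total`."""
    rng = random.Random(seed)
    for start in range(0, total, chunk_size):
        size = min(chunk_size, total - start)
        yield [
            {"id": start + i + 1, "value": round(rng.uniform(0, 100), 2)}
            for i in range(size)
        ]


def write_in_chunks(path, total, chunk_size=1000):
    """Stream rows to a CSV file chunk by chunk; return rows written."""
    written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "value"])
        writer.writeheader()
        for chunk in iter_chunks(total, chunk_size):
            writer.writerows(chunk)  # the chunk can be discarded after this
            written += len(chunk)
    return written


if __name__ == "__main__":
    print(write_in_chunks("synthetic_chunked.csv", 2500))
```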


πŸ—ΊοΈ Roadmap

Version 2.0 (Planned)

  • API Endpoints: RESTful API for programmatic access
  • Custom Schemas: User-defined data structures
  • Data Relationships: Foreign keys and referential integrity
  • Advanced Text: AI-powered content generation
  • Real-time Streaming: Live data generation capabilities

Version 2.1 (Future)

  • User Accounts: Save preferences and generation history
  • Collaboration: Team workspaces and shared templates
  • Enterprise Features: SSO, audit logs, advanced security
  • Cloud Integration: Direct export to cloud storage
  • Advanced Analytics: Data profiling and quality metrics

Long-term Vision

  • Machine Learning: Generate data based on existing patterns
  • Industry Templates: Pre-built schemas for common use cases
  • Multi-language Support: Internationalization
  • Mobile App: iOS and Android applications

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 Bikram Sarkar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

πŸ™ Acknowledgments

  • Streamlit - For the amazing web app framework
  • Faker - For realistic fake data generation
  • Pandas - For powerful data manipulation capabilities
  • NumPy - For numerical computing support

Special Thanks

  • The open-source community for inspiration

📞 Support

Get Help

Community

  • ⭐ Star this repo if you find it useful!
  • 🍴 Fork to create your own version
  • 👀 Watch for updates and new releases
  • 📢 Share with your network

Built with ❤️ by Bikram Sarkar
