A powerful, user-friendly web application for generating realistic synthetic data
- Overview
- Features
- Use Cases
- Data Types
- Quick Start
- Installation
- Usage
- Configuration
- Performance
- License
- Acknowledgments
The Synthetic Data Generator is a comprehensive web application built with Streamlit that enables users to generate realistic, high-quality synthetic data for testing, development, analytics, and research purposes. With an intuitive interface and powerful generation capabilities, it supports multiple data types and export formats.
In today's increasingly data-driven world, organizations face growing challenges around data access, privacy, and compliance. Synthetic data addresses these challenges by providing realistic records that contain no real customer information.
- Privacy Protection: No real customer data at risk
- Development Speed: Instant test data generation
- Compliance: GDPR/CCPA/DPDP compliant testing environments
- Cost Effective: No need to purchase or license real datasets
- Scalable: Generate up to 10,000 records in seconds
- Clean, modern UI with responsive design
- Real-time data preview and statistics
- Interactive configuration panels
- Progress indicators and loading states
- Personal/Customer Data: Names, addresses, contact info
- Sales Transactions: Purchase records, revenue data
- Employee Records: HR data, performance metrics
- Time Series: Temporal data with trends
- Text Data: Reviews, posts, social media content
- Application Logs: System events, user actions, errors
- System Data: OS logs, metrics, security events
- CSV, JSON, Excel formats
- Individual file downloads
- Bulk ZIP packages
- Configurable data volumes (up to 10,000 records)
- Optimized data generation algorithms
- Memory-efficient processing
- Instant preview capabilities
- Scalable architecture
- No real data storage
- Client-side processing
- No personal information collected
- Open-source transparency
Use Case | Description | Benefits |
---|---|---|
Software Testing | Generate test datasets for QA and testing | Reliable, consistent test data |
Development | Mock data for development environments | Faster development cycles |
Data Science | Practice datasets for learning and prototyping | Real-world data structures |
Demos & Training | Sample data for presentations and tutorials | Professional, realistic examples |
Analytics Testing | Test dashboards and reporting tools | Comprehensive data coverage |
Database Seeding | Populate development and staging databases | Realistic data relationships |
Personal/Customer Data
Perfect for CRM systems, user databases, and customer analytics:
- Identity: First name, last name, email, phone
- Demographics: Age, gender, occupation, salary
- Location: Full addresses with city, state, ZIP
- Metadata: Creation dates, unique IDs
Sample Output:
id,first_name,last_name,email,phone,city,salary
550e8400-e29b-41d4-a716-446655440000,John,Doe,[email protected],(555) 123-4567,New York,75000
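A minimal sketch of how rows with the columns above could be produced. It uses only the Python standard library rather than Faker so that it runs without extra dependencies. The name pools, the `make_customer` and `make_customers` helpers, and the `example.com` email pattern are illustrative assumptions, not the app's actual implementation.

```python
import random
import uuid

# Hypothetical value pools; the real app likely draws these from Faker.
FIRST_NAMES = ["John", "Alice", "Priya", "Carlos"]
LAST_NAMES = ["Doe", "Johnson", "Patel", "Garcia"]
CITIES = ["New York", "Chicago", "Austin", "Seattle"]


def make_customer(rng: random.Random) -> dict:
    """Build one synthetic customer row matching the sample columns."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        # Version-4 UUID built from the seeded generator for reproducibility.
        "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "first_name": first,
        "last_name": last,
        "email": f"{first}.{last}@example.com".lower(),
        "phone": f"(555) {rng.randint(100, 999)}-{rng.randint(0, 9999):04d}",
        "city": rng.choice(CITIES),
        "salary": rng.randrange(30_000, 150_001, 1_000),
    }


def make_customers(count: int, seed: int = 42) -> list:
    """Return `count` rows; the same seed always yields the same rows."""
    rng = random.Random(seed)
    return [make_customer(rng) for _ in range(count)]
```

Seeding the generator keeps test fixtures stable between runs, which matters when the data feeds automated tests.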
Sales Transactions
Ideal for e-commerce platforms, retail analytics, and revenue tracking:
- Products: Names, categories, prices, quantities
- Transactions: IDs, dates, payment methods, discounts
- Sales Data: Revenue, regions, sales representatives
- Customer Info: Buyer IDs and preferences
Sample Output:
transaction_id,product_name,category,quantity,unit_price,total_amount,payment_method
TXN-001,Wireless Headphones,Electronics,2,199.99,399.98,Credit Card
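In the sample row, `total_amount` equals `quantity` times `unit_price` (2 × 199.99 = 399.98). The sketch below shows one way to keep that derived column consistent, using `Decimal` to avoid floating-point rounding in currency values. The catalogue, payment methods, and `make_transaction` helper are hypothetical and only illustrate the idea.

```python
import random
from decimal import Decimal

# Hypothetical catalogue: (product_name, category, unit_price).
CATALOGUE = [
    ("Wireless Headphones", "Electronics", Decimal("199.99")),
    ("Desk Lamp", "Home", Decimal("34.50")),
    ("Running Shoes", "Sports", Decimal("89.00")),
]
PAYMENT_METHODS = ["Credit Card", "Debit Card", "PayPal", "Cash"]


def make_transaction(index: int, rng: random.Random) -> dict:
    """Build one transaction row whose total is derived from its parts."""
    name, category, unit_price = rng.choice(CATALOGUE)
    quantity = rng.randint(1, 5)
    return {
        "transaction_id": f"TXN-{index:03d}",
        "product_name": name,
        "category": category,
        "quantity": quantity,
        "unit_price": unit_price,
        # Derived column: computed, never sampled independently.
        "total_amount": unit_price * quantity,
        "payment_method": rng.choice(PAYMENT_METHODS),
    }
```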
Employee Records
Essential for HR systems, payroll, and organizational analysis:
- Personal: Employee IDs, names, contact information
- Professional: Departments, positions, hire dates
- Performance: Salaries, ratings, experience levels
- Organization: Manager relationships, remote work status
Sample Output:
employee_id,first_name,department,position,salary,hire_date,performance_rating
EMP1001,Alice,Engineering,Senior Developer,95000,2022-03-15,4.2
Time Series Data
Perfect for analytics, IoT applications, and financial modeling:
- Temporal: Date/time sequences with various intervals
- Metrics: Multiple measurement categories
- Trends: Realistic growth and seasonal patterns
- Analytics: Moving averages, cumulative values
Sample Output:
date,value,category_a,category_b,moving_avg_7d
2024-01-01,142.5,45.2,23.8,140.2
2024-01-02,138.7,52.1,28.3,141.1
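As a rough illustration of how a trend, a seasonal pattern, and a 7-day moving average can be combined, here is a standard-library sketch. The growth rate, weekly sine cycle, noise level, and the `make_series` helper are assumptions chosen for the example; the app's own generator may work differently.

```python
import math
import random
from datetime import date, timedelta


def make_series(days: int, start: date = date(2024, 1, 1), seed: int = 0) -> list:
    """Return daily rows with a value and its trailing 7-day moving average."""
    rng = random.Random(seed)
    values = []
    rows = []
    for i in range(days):
        trend = 100 + 0.5 * i                          # slow linear growth
        seasonal = 10 * math.sin(2 * math.pi * i / 7)  # weekly cycle
        noise = rng.gauss(0, 3)                        # random variation
        value = round(trend + seasonal + noise, 1)
        values.append(value)
        window = values[-7:]  # the 7 most recent values, fewer at the start
        rows.append({
            "date": (start + timedelta(days=i)).isoformat(),
            "value": value,
            "moving_avg_7d": round(sum(window) / len(window), 1),
        })
    return rows
```

For the first six days the window holds fewer than seven values, so the average is taken over the days available so far.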
Text Content
Great for content management systems, social media analysis, and NLP:
- Product Reviews: Ratings, titles, detailed feedback
- Blog Posts: Articles with metadata and engagement metrics
- Social Media: Posts with hashtags, likes, and shares
- Content Marketing: Realistic text for various platforms
Sample Output:
review_id,product_name,rating,review_title,review_text,helpful_votes
REV00001,Smart Watch Pro,4,"Great battery life","This watch exceeded my expectations...",23
Application Logs
Perfect for testing log analysis tools, SIEM systems, and application monitoring:
- Application Lifecycle: Start/stop events, deployments, health checks
- User Interactions: Login attempts, button clicks, page views, form submissions
- Data Generation: Performance metrics, processing times, record counts
- File Operations: Upload/download events, file validation, backup operations
- Error Tracking: Exception handling, stack traces, severity levels
- Session Metrics: User engagement, session duration, feature usage
Sample Output:
timestamp,log_level,event_type,user_id,session_id,response_time_ms,success
2024-12-17 10:31:15,INFO,BUTTON_CLICK,user_1234,abc123ef,250,true
2024-12-17 10:31:45,ERROR,DatabaseConnectionError,user_5678,def456gh,5000,false
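One way to produce log rows like these is to sample the log level from a weighted distribution and advance the timestamp by a random gap. The sketch below assumes an 80/15/5 split between INFO, WARNING, and ERROR and a hypothetical `make_log_rows` helper; neither comes from the app itself.

```python
import random
from datetime import datetime, timedelta

# Assumed level mix; tune the weights to match the system being simulated.
LEVEL_WEIGHTS = {"INFO": 0.80, "WARNING": 0.15, "ERROR": 0.05}
EVENT_TYPES = ["BUTTON_CLICK", "PAGE_VIEW", "LOGIN_ATTEMPT", "FORM_SUBMIT"]


def make_log_rows(count: int,
                  start: datetime = datetime(2024, 12, 17, 10, 30, 0),
                  seed: int = 0) -> list:
    """Return `count` log rows with strictly increasing timestamps."""
    rng = random.Random(seed)
    levels = list(LEVEL_WEIGHTS)
    weights = list(LEVEL_WEIGHTS.values())
    now = start
    rows = []
    for _ in range(count):
        now += timedelta(seconds=rng.randint(1, 60))  # gap of at least 1 s
        level = rng.choices(levels, weights=weights, k=1)[0]
        rows.append({
            "timestamp": now.strftime("%Y-%m-%d %H:%M:%S"),
            "log_level": level,
            "event_type": rng.choice(EVENT_TYPES),
            "user_id": f"user_{rng.randint(1000, 9999)}",
            "session_id": f"{rng.getrandbits(32):08x}",
            "response_time_ms": rng.randint(20, 5000),
            "success": level != "ERROR",  # errors are marked unsuccessful
        })
    return rows
```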
System Data
Enterprise-grade system monitoring data for infrastructure testing and analysis:
Operating system and service logs with realistic patterns:
- Services: SSH, Nginx, MySQL, Docker, Systemd events
- Security: Login attempts, authentication failures, access violations
- System Events: Service starts/stops, configuration changes, errors
- Network: Connection logs, firewall events, traffic patterns
Comprehensive system performance and resource utilization:
- CPU: Usage percentages, load averages, core temperatures
- Memory: RAM usage, swap utilization, buffer/cache statistics
- Disk I/O: Read/write operations, throughput, queue depths
- Network: Bandwidth utilization, packet rates, connection counts
Detailed per-process and per-service resource consumption:
- Process Details: PIDs, command lines, parent relationships
- Resource Allocation: CPU time, memory usage, file descriptors
- System Calls: I/O operations, network activity, thread counts
- Performance: Response times, error rates, throughput metrics
Security-focused monitoring and incident data:
- Authentication: Login successes/failures, privilege escalations
- Network Security: Intrusion attempts, firewall blocks, geo-location data
- Access Control: File access, permission changes, audit trails
- Threat Intelligence: Severity classifications, threat indicators
Enterprise infrastructure health and availability data:
- Component Health: Servers, databases, load balancers, storage systems
- Availability Metrics: Uptime percentages, response times, error rates
- Hardware Status: Temperature sensors, power consumption, firmware versions
- Maintenance: Backup status, alert counts, scheduled maintenance windows
Sample Output - System Logs:
timestamp,hostname,severity,service,message,source_ip,user
2024-12-17 10:30:45,web-server-01,INFO,ssh,"Accepted password for admin from 192.168.1.100",192.168.1.100,admin
2024-12-17 10:30:50,db-primary,ERROR,mysql,"Access denied for user 'backup'@'10.0.0.50'",10.0.0.50,backup
Sample Output - Performance Metrics:
timestamp,hostname,cpu_usage_percent,memory_usage_percent,disk_read_mb_per_sec,network_in_mbps
2024-12-17 10:30:00,web-server-01,23.5,67.2,12.8,45.6
2024-12-17 10:31:00,web-server-01,45.8,68.1,8.4,52.3
Sample Output - Security Events:
timestamp,hostname,event_type,severity,source_ip,action,threat_level,success
2024-12-17 10:30:45,firewall,intrusion_attempt,HIGH,198.51.100.25,DENY,High,false
2024-12-17 10:31:20,web-server-01,login_attempt,MEDIUM,203.0.113.45,ALLOW,Low,true
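To show how exported system logs might be consumed by a test, the sketch below parses the system log sample with Python's `csv` module and counts rows per severity. The `count_by_severity` helper is hypothetical. The `message` column is quoted in the sample, and `csv.DictReader` handles that quoting without extra configuration.

```python
import csv
import io
from collections import Counter

# The system log sample from this README, embedded for a self-contained demo.
SAMPLE = """\
timestamp,hostname,severity,service,message,source_ip,user
2024-12-17 10:30:45,web-server-01,INFO,ssh,"Accepted password for admin from 192.168.1.100",192.168.1.100,admin
2024-12-17 10:30:50,db-primary,ERROR,mysql,"Access denied for user 'backup'@'10.0.0.50'",10.0.0.50,backup
"""


def count_by_severity(csv_text: str) -> Counter:
    """Count log rows per severity level in CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["severity"] for row in reader)


if __name__ == "__main__":
    print(count_by_severity(SAMPLE))
```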
Visit the live application: Launch App
- Select your data type from the sidebar
- Configure the number of records
- Choose export format
- Click "Generate Data"
- Preview and download your dataset
git clone https://github.com/sarkarbikram90/Synthetic-Data-Generator.git
cd Synthetic-Data-Generator
pip install -r requirements.txt
streamlit run app.py
Access the app at http://localhost:8501
- Python 3.8 or higher
- pip package manager
The application uses the following key libraries:
streamlit>=1.28.0 # Web app framework
pandas>=1.5.0 # Data manipulation
numpy>=1.24.0 # Numerical computing
faker>=19.0.0 # Realistic fake data generation
openpyxl>=3.1.0 # Excel file support
python-dateutil>=2.8.0 # Date/time utilities
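If you want to confirm which of these libraries are present in your environment, a small check along the following lines may help. It relies on `importlib.metadata` from the standard library (Python 3.8+); the `installed_versions` helper is a hypothetical name, and the script reports versions only, without comparing them against the minimums listed above.

```python
from importlib import metadata

# Distribution names taken from the dependency list above.
REQUIRED = ["streamlit", "pandas", "numpy", "faker", "openpyxl", "python-dateutil"]


def installed_versions(packages: list) -> dict:
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'MISSING'}")
```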
- Clone the repository:
  git clone https://github.com/sarkarbikram90/Synthetic-Data-Generator.git
  cd Synthetic-Data-Generator
- Create a virtual environment (recommended):
  python -m venv venv
  # Activate virtual environment
  # On Windows:
  venv\Scripts\activate
  # On macOS/Linux:
  source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
- Run the application:
  streamlit run app.py
- Configure Data Type: Sidebar → Select Data Type → Choose from 7 options
- Set Parameters: Sidebar → Number of Records → Slide to desired amount (up to 10,000)
- Generate Data: Sidebar → Generate Data Button → Wait for processing
- Preview Results: Main Area → Review generated data → Check statistics
- Export Data: Download Section → Choose format → Click download button
- Column Information: Expand the "Column Information" section to see data types, null counts, and unique values
- Multiple Formats: Select "All Formats" to download CSV, JSON, and Excel files in one ZIP package
- Data Validation: Review the statistics cards to ensure data quality meets your requirements
The application allows for extensive customization through the sidebar interface:
- Minimum: 10 records (for quick testing)
- Maximum: 10,000 records (for comprehensive datasets)
- Recommendation: Start with 100-500 records for initial evaluation
- CSV: Universal compatibility, best for data analysis
- JSON: API integration, NoSQL databases
- Excel: Business reporting, spreadsheet analysis
- ZIP Package: All formats for comprehensive delivery
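The ZIP option bundles several renderings of the same dataset. A minimal sketch of that idea, assuming rows are plain dictionaries, is shown below. It covers CSV and JSON only, since Excel output needs `openpyxl`, and the `build_zip` helper and default file names are hypothetical.

```python
import csv
import io
import json
import zipfile


def build_zip(rows: list, basename: str = "synthetic_data") -> bytes:
    """Return ZIP bytes holding CSV and JSON renderings of `rows`.

    Assumes `rows` is a non-empty list of dicts that share the same keys.
    """
    csv_buffer = io.StringIO()
    writer = csv.DictWriter(csv_buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

    zip_buffer = io.BytesIO()  # built in memory, nothing touches the disk
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(f"{basename}.csv", csv_buffer.getvalue())
        archive.writestr(f"{basename}.json", json.dumps(rows, indent=2))
    return zip_buffer.getvalue()
```

Building the archive in memory suits a web app, because the resulting bytes can be handed straight to a download control without creating temporary files.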
When selecting "Text Data", choose from:
- Reviews: Product feedback with ratings and sentiment
- Blog Posts: Articles with metadata and engagement metrics
- Social Media: Posts with hashtags and social signals
Records | Generation Time | Memory Usage | File Size (CSV) |
---|---|---|---|
100 | < 1 second | ~2 MB | ~15 KB |
1,000 | ~2 seconds | ~5 MB | ~150 KB |
5,000 | ~8 seconds | ~20 MB | ~750 KB |
- Batch Processing: Generate large datasets in chunks if memory is limited
- Format Selection: Use CSV for fastest processing and smallest file sizes
- Preview First: Always preview smaller samples before generating large datasets
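The batch processing tip can be sketched as a generator that yields one chunk at a time, so only a single chunk is held in memory while rows are streamed to a CSV file. The two-column row shape and the `generate_in_chunks` and `write_csv` helpers are placeholders for illustration, not part of the app.

```python
import csv
import io
import random


def generate_in_chunks(total: int, chunk_size: int = 1000, seed: int = 0):
    """Yield lists of rows until `total` rows have been produced."""
    rng = random.Random(seed)
    produced = 0
    while produced < total:
        size = min(chunk_size, total - produced)  # last chunk may be smaller
        yield [
            {"id": produced + i + 1, "value": round(rng.uniform(0, 100), 2)}
            for i in range(size)
        ]
        produced += size


def write_csv(stream, total: int, chunk_size: int = 1000) -> None:
    """Write a header and `total` rows to `stream`, one chunk at a time."""
    writer = csv.DictWriter(stream, fieldnames=["id", "value"])
    writer.writeheader()
    for chunk in generate_in_chunks(total, chunk_size):
        writer.writerows(chunk)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(buffer, total=2500)
    print(len(buffer.getvalue().splitlines()), "lines written")
```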
- API Endpoints: RESTful API for programmatic access
- Custom Schemas: User-defined data structures
- Data Relationships: Foreign keys and referential integrity
- Advanced Text: AI-powered content generation
- Real-time Streaming: Live data generation capabilities
- User Accounts: Save preferences and generation history
- Collaboration: Team workspaces and shared templates
- Enterprise Features: SSO, audit logs, advanced security
- Cloud Integration: Direct export to cloud storage
- Advanced Analytics: Data profiling and quality metrics
- Machine Learning: Generate data based on existing patterns
- Industry Templates: Pre-built schemas for common use cases
- Multi-language Support: Internationalization
- Mobile App: iOS and Android applications
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2024 Bikram Sarkar
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
- Streamlit - For the amazing web app framework
- Faker - For realistic fake data generation
- Pandas - For powerful data manipulation capabilities
- NumPy - For numerical computing support
- The open-source community for inspiration
- Documentation: Check this README and inline code comments
- Bug Reports: GitHub Issues
- Feature Requests: GitHub Issues
- Contact: [[email protected]] (for urgent matters)
- Star this repo if you find it useful!
- Fork to create your own version
- Watch for updates and new releases
- Share with your network
Built with ❤️ by Bikram Sarkar