Commit ee69362

Add comprehensive documentation
- System architecture documentation
- Quick start guide for 15-minute deployment
- Streaming pipeline implementation guide
- Documentation structure and navigation
- Code examples and best practices
1 parent dfff943 commit ee69362

File tree: 4 files changed, +1145 -0 lines changed
docs/README.md

Lines changed: 55 additions & 0 deletions
# AWS Data Platform Documentation

Welcome to the comprehensive documentation for the AWS Data Platform template. This documentation covers architecture, deployment, operations, and customization of the platform.

## 📚 Documentation Structure

### Architecture

- [System Architecture](architecture/system-architecture.md) - High-level platform architecture
- [Data Flow](architecture/data-flow.md) - Data ingestion and processing flows
- [Security Architecture](architecture/security.md) - Security model and best practices
- [Networking](architecture/networking.md) - VPC and network design

### Deployment Guides

- [Quick Start](guides/quick-start.md) - Get started in 15 minutes
- [Production Deployment](guides/production-deployment.md) - Production deployment checklist
- [Multi-Region Setup](guides/multi-region.md) - Deploy across multiple AWS regions
- [Disaster Recovery](guides/disaster-recovery.md) - Backup and recovery procedures

### Component Guides

- [Streaming Pipeline](guides/streaming-pipeline.md) - Kinesis and Lambda configuration
- [Batch Processing](guides/batch-processing.md) - EMR and Spark job development
- [Data Warehouse](guides/data-warehouse.md) - Redshift optimization
- [Machine Learning](guides/machine-learning.md) - SageMaker pipeline setup
- [Business Intelligence](guides/business-intelligence.md) - QuickSight dashboard creation

### Operations

- [Monitoring](guides/monitoring.md) - CloudWatch dashboards and alerts
- [Cost Optimization](guides/cost-optimization.md) - Reduce AWS costs
- [Performance Tuning](guides/performance-tuning.md) - Optimize processing performance
- [Troubleshooting](guides/troubleshooting.md) - Common issues and solutions

### API Reference

- [Lambda Functions](api/lambda-functions.md) - Stream processor API documentation
- [Data Models](api/data-models.md) - Schema definitions
- [REST APIs](api/rest-apis.md) - API Gateway endpoints

## 🚀 Quick Links

- [Environment Configuration](.env.example) - Environment variables reference
- [Infrastructure Code](../infrastructure/) - CDK stack implementations
- [Source Code](../src/) - Application source code
- [Scripts](../scripts/) - Deployment and utility scripts

## 💡 Getting Help

- **Issues**: [GitHub Issues](https://github.com/tysoncung/aws-data-platform/issues)
- **Discussions**: [GitHub Discussions](https://github.com/tysoncung/aws-data-platform/discussions)
- **Examples**: See the [examples](examples/) directory

## 📖 Additional Resources

- [AWS Best Practices](https://aws.amazon.com/architecture/well-architected/)
- [AWS CDK Documentation](https://docs.aws.amazon.com/cdk/)
- [Apache Spark Documentation](https://spark.apache.org/docs/latest/)
- [Amazon Redshift Documentation](https://docs.aws.amazon.com/redshift/)
Lines changed: 214 additions & 0 deletions
# System Architecture

## Overview

The AWS Data Platform implements a modern, cloud-native architecture that combines real-time streaming, batch processing, and machine learning capabilities in a unified platform.

## Architecture Principles

### 1. Separation of Concerns

- **Ingestion Layer**: Handles data intake from various sources
- **Processing Layer**: Transforms and enriches data
- **Storage Layer**: Persists data in appropriate formats
- **Serving Layer**: Provides data to consumers
- **Consumption Layer**: Enables analytics and ML workloads

### 2. Scalability

- Horizontal scaling for all components
- Auto-scaling based on workload
- Elastic compute resources

### 3. Reliability

- Multi-AZ deployments
- Automated failover
- Data replication and backups

### 4. Security

- Encryption at rest and in transit
- IAM role-based access control
- VPC isolation and security groups
- AWS Secrets Manager for credentials
## Component Architecture

### Real-Time Streaming Layer

```
Data Sources → Kinesis Data Streams → Lambda Functions → DynamoDB/S3
                        ↓
               Kinesis Analytics → Real-time Dashboards
                        ↓
               Kinesis Firehose → S3 Data Lake
```

**Components:**

- **Kinesis Data Streams**: Ingests streaming data at scale
- **Lambda Functions**: Serverless processing of stream records (see the sketch below)
- **DynamoDB**: Low-latency storage for real-time data
- **Kinesis Analytics**: SQL queries on streaming data
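As a quick illustration of this path, the following is a minimal sketch of a Kinesis-triggered Lambda handler in Python. The environment variable, table name, and item keys are hypothetical placeholders, not the platform's actual schema; see the Lambda Functions API reference for the real stream processor.

```python
import base64
import json
import os

import boto3

# Hypothetical table name; the real name comes from the stack configuration.
TABLE = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "realtime-events"))


def handler(event, context):
    """Decode Kinesis records and persist them to DynamoDB."""
    with TABLE.batch_writer() as batch:
        for record in event["Records"]:
            # Kinesis payloads arrive base64-encoded.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            batch.put_item(Item={
                "pk": record["kinesis"]["partitionKey"],
                "sk": record["kinesis"]["sequenceNumber"],
                **payload,
            })
    return {"records_processed": len(event["Records"])}
```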
### Batch Processing Layer

```
S3 Data Lake → EMR Cluster → Processed Data → Redshift
      ↓
Glue ETL Jobs → Data Catalog
      ↓
Athena Queries → Analytics
```

**Components:**

- **EMR**: Distributed processing with Spark/Hadoop (see the job sketch below)
- **Glue ETL**: Serverless data transformation
- **Athena**: Interactive SQL queries on S3 data
- **Redshift**: Data warehouse for analytics
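For the EMR/Spark path, a batch job typically reads raw events from the data lake, cleans them, and writes partitioned Parquet back to S3. The sketch below is illustrative only: the bucket names and the `event_time`/`event_id` columns are assumptions, not the platform's actual job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-events-batch").getOrCreate()

# Hypothetical bucket names and schema; adjust to the deployed data lake layout.
raw = spark.read.json("s3://example-raw-bucket/events/")

processed = (
    raw.withColumn("event_date", F.to_date("event_time"))
       .dropDuplicates(["event_id"])
)

# Write back as Parquet, partitioned by date so downstream Athena/Redshift scans stay pruned.
(
    processed.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-processed-bucket/events/")
)
```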
### Machine Learning Layer

```
Feature Store → SageMaker Training → Model Registry
                        ↓
                Model Endpoints → Inference API
                        ↓
                A/B Testing → Production
```

**Components:**

- **SageMaker**: End-to-end ML platform (see the training sketch below)
- **Feature Store**: Centralized feature management
- **Model Registry**: Version control for ML models
- **Endpoints**: Real-time and batch inference
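As a rough sketch of this layer, training and deployment with the SageMaker Python SDK look roughly like the following. The image URI, IAM role, and S3 paths are placeholders; the Machine Learning component guide covers the real pipeline setup.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Placeholder image, role, and S3 locations; substitute real values from the stack outputs.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-training:latest",
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-ml-bucket/models/",
    sagemaker_session=session,
)

# Train against features exported from the Feature Store (path is hypothetical).
estimator.fit({"train": "s3://example-ml-bucket/features/train/"})

# Stand up a real-time inference endpoint for the trained model.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```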
## Data Flow Patterns

### 1. Lambda Architecture

Combines batch and streaming processing:

- **Speed Layer**: Real-time processing via Kinesis/Lambda
- **Batch Layer**: Historical processing via EMR/Glue
- **Serving Layer**: Unified view via Redshift/DynamoDB

### 2. Kappa Architecture

Streaming-first approach:

- All data flows through Kinesis
- Reprocessing via stream replay
- Simplified architecture with a single processing path

### 3. Data Mesh

Domain-oriented decentralization:

- Domain-specific data products
- Self-serve data platform
- Federated governance
## Technology Stack

### Core Services

| Component | Technology | Purpose |
|-----------|------------|---------|
| Streaming | Amazon Kinesis | Real-time data ingestion |
| Compute | AWS Lambda, EMR | Data processing |
| Storage | S3, DynamoDB, Redshift | Data persistence |
| Analytics | Athena, QuickSight | Data analysis |
| ML | SageMaker | Machine learning |
| Orchestration | Step Functions, Airflow | Workflow management |
| Monitoring | CloudWatch, X-Ray | Observability |

### Programming Languages

- **Python**: Primary language for all components
- **SQL**: Data transformations and queries
- **Spark (PySpark)**: Framework for large-scale data processing

### Data Formats

- **Raw**: JSON, CSV, Parquet
- **Processed**: Parquet, ORC
- **Serving**: JSON, Avro
## Network Architecture

### VPC Design

```
VPC (10.0.0.0/16)
├── Public Subnets (10.0.1.0/24, 10.0.2.0/24)
│   └── NAT Gateways, Load Balancers
├── Private Subnets (10.0.10.0/24, 10.0.11.0/24)
│   └── EMR, Lambda, ECS Tasks
└── Database Subnets (10.0.20.0/24, 10.0.21.0/24)
    └── Redshift, RDS, ElastiCache
```
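Since the infrastructure is defined with the AWS CDK in Python, the three-tier layout above can be expressed roughly as follows. Construct IDs and subnet names are illustrative assumptions, not the template's actual code, and CDK assigns the concrete /24 ranges per AZ.

```python
from aws_cdk import Stack, aws_ec2 as ec2
from constructs import Construct


class DataPlatformNetworkStack(Stack):
    """Rough sketch of the three-tier VPC described above."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        self.vpc = ec2.Vpc(
            self, "DataPlatformVpc",
            ip_addresses=ec2.IpAddresses.cidr("10.0.0.0/16"),
            max_azs=2,
            subnet_configuration=[
                # Public subnets host NAT gateways and load balancers.
                ec2.SubnetConfiguration(
                    name="public", subnet_type=ec2.SubnetType.PUBLIC, cidr_mask=24),
                # Private subnets with egress for EMR, Lambda, and ECS tasks.
                ec2.SubnetConfiguration(
                    name="private", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS, cidr_mask=24),
                # Isolated subnets for Redshift, RDS, and ElastiCache.
                ec2.SubnetConfiguration(
                    name="database", subnet_type=ec2.SubnetType.PRIVATE_ISOLATED, cidr_mask=24),
            ],
        )
```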
### Security Groups

- **EMR-Master-SG**: Controls access to the EMR master node
- **EMR-Worker-SG**: Inter-node communication
- **Redshift-SG**: Database access control
- **Lambda-SG**: Outbound only for Lambda functions

## High Availability

### Multi-AZ Deployment

- Redshift clusters span multiple AZs
- DynamoDB global tables for cross-region replication
- S3 cross-region replication for critical data

### Disaster Recovery

- **RPO**: 1 hour for batch data, near-zero for streaming data
- **RTO**: 2 hours for full platform recovery
- Automated backups and snapshots
- Infrastructure as code for rapid redeployment
## Performance Optimization

### Caching Strategy

- **CloudFront**: CDN for static content
- **ElastiCache**: Redis for application caching
- **S3 Intelligent-Tiering**: Automatic storage optimization

### Data Partitioning

- **S3**: Partitioned by year/month/day/hour (see the key-layout sketch below)
- **Redshift**: Distribution and sort key optimization
- **DynamoDB**: Partition key design for even distribution
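A hive-style year/month/day/hour layout keeps Athena and Spark scans pruned to the requested window. The helper below is a hypothetical illustration of how a producer might build such keys; the prefix and file naming are placeholders.

```python
from datetime import datetime, timezone


def partitioned_key(event_id: str, event_time: datetime, prefix: str = "events") -> str:
    """Build a hive-style S3 key partitioned by year/month/day/hour (UTC)."""
    t = event_time.astimezone(timezone.utc)
    return (
        f"{prefix}/year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"
        f"{event_id}.json"
    )


# Example: events/year=2024/month=05/day=01/hour=13/abc123.json
print(partitioned_key("abc123", datetime(2024, 5, 1, 13, 37, tzinfo=timezone.utc)))
```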
## Cost Optimization

### Resource Management

- **Auto-scaling**: Scale based on actual usage
- **Spot Instances**: For EMR worker nodes
- **Reserved Instances**: For predictable workloads
- **S3 Lifecycle Policies**: Archive old data to Glacier (see the sketch below)
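A hedged example of such a lifecycle policy in CDK Python follows; the bucket ID and retention periods are illustrative, not the template's actual values.

```python
from aws_cdk import Duration, Stack, aws_s3 as s3
from constructs import Construct


class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Archive processed data to Glacier after 90 days and expire it after a year.
        s3.Bucket(
            self, "ProcessedDataBucket",
            lifecycle_rules=[
                s3.LifecycleRule(
                    transitions=[
                        s3.Transition(
                            storage_class=s3.StorageClass.GLACIER,
                            transition_after=Duration.days(90),
                        ),
                    ],
                    expiration=Duration.days(365),
                ),
            ],
        )
```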
### Monitoring

- **Cost Explorer**: Track spending trends
- **Budgets**: Alert on cost overruns
- **Trusted Advisor**: Optimization recommendations

## Security Architecture

### Data Protection

- **Encryption**: AES-256 for data at rest
- **TLS 1.2+**: For data in transit
- **KMS**: Key management service integration

### Access Control

- **IAM Roles**: Service-specific permissions
- **Lake Formation**: Fine-grained data access
- **Secrets Manager**: Credential rotation

### Compliance

- **CloudTrail**: Audit logging
- **Config**: Compliance monitoring
- **GuardDuty**: Threat detection
## Scalability Limits

### Streaming

- Kinesis: 1 MB/sec or 1,000 records/sec per shard (see the sizing sketch below)
- Lambda: 1,000 concurrent executions (soft limit)
- DynamoDB: 40,000 RCU/WCU per table (default soft limit)
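Those per-shard limits translate directly into a shard count: take the larger of the throughput-based and record-rate-based requirements. A small illustrative helper (the workload numbers are made up):

```python
import math


def required_shards(mb_per_sec: float, records_per_sec: float) -> int:
    """Shards needed given Kinesis write limits of 1 MB/sec and 1,000 records/sec per shard."""
    by_throughput = math.ceil(mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    return max(1, by_throughput, by_records)


# Example workload: 6.5 MB/sec of 2 KB events (~3,328 records/sec) → 7 shards.
print(required_shards(6.5, 3328))
```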
### Batch Processing

- EMR: 500 nodes per cluster
- Glue: 100 DPU per job
- Redshift: 128 nodes per cluster

### Storage

- S3: Unlimited storage
- Single object: 5TB maximum
- DynamoDB item: 400KB maximum
