DataPipe is an end-to-end data pipeline demonstrating a Lambda Architecture for high-volume data processing. This repository contains the first major component: a high-throughput, resilient data ingestion pipeline designed for local development and testing.
It captures real-time changes from a PostgreSQL database using Debezium CDC, streams them through a highly-available Kafka cluster, and reliably archives them to AWS S3 in Parquet format. The entire stack is orchestrated on Kubernetes (via kind) and is meticulously engineered to operate within a strict 6GB RAM budget.
- Real-time CDC Streaming: Captures database changes from PostgreSQL using Debezium and streams them through a high-availability, KRaft-based Kafka cluster.
- Managed Schema Evolution: Employs Confluent Schema Registry with Avro to enforce data contracts and handle schema changes without breaking the pipeline.
- Optimized S3 Archival: Archives the event stream to AWS S3 as compressed Parquet files, partitioned by time (`year/month/day/hour`) for efficient analytical queries.
- Kubernetes-Native Deployment: Defined entirely with Kubernetes manifests for reproducible, production-parity deployment on a local `kind` cluster.
- Resource-Efficient: Meticulously configured to run the complete pipeline (PostgreSQL, Kafka, and Connect) within a 6GB RAM budget.
- Secure by Design: Enforces strict service-to-service communication with Network Policies and follows the principle of least privilege with RBAC.
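As a concrete illustration of the network-isolation piece, here is a minimal, hypothetical sketch of the style of rule that `03-network-policies.yaml` builds on; it is not a copy of the repository's manifests:

```bash
# Illustrative sketch: a namespace-wide default-deny ingress baseline of the sort
# the repository's network policies refine with per-service allow rules.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: data-ingestion
spec:
  podSelector: {}     # selects every pod in the namespace
  policyTypes:
    - Ingress         # ingress traffic is denied unless another policy allows it
EOF
```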
The pipeline follows a classic three-layer event-driven architecture: Source, Streaming, and Archive.
```mermaid
graph TB
    subgraph "Source Layer"
        PG[PostgreSQL<br/>OLTP Database<br/>Logical Replication]
    end

    subgraph "Streaming Layer"
        CDC[Debezium CDC<br/>Connector]
        KAFKA[Apache Kafka<br/>3 Brokers<br/>KRaft Mode]
        SR[Schema Registry<br/>Avro Schemas]
    end

    subgraph "Archive Layer"
        S3SINK[Kafka Connect<br/>S3 Sink Connector]
        S3[AWS S3<br/>Parquet Files<br/>Date Partitioned]
    end

    subgraph "Infrastructure"
        K8S[Kubernetes<br/>Docker Desktop + kind]
    end

    PG -->|WAL Changes| CDC
    CDC -->|Avro Events| KAFKA
    KAFKA -->|Schema Validation| SR
    KAFKA -->|Batch Events| S3SINK
    S3SINK -->|Parquet Files| S3
```
- PostgreSQL: An application performs transactions on the e-commerce database tables (`users`, `products`, `orders`).
- Debezium CDC: The Debezium PostgreSQL connector, running on Kafka Connect, tails the PostgreSQL Write-Ahead Log (WAL) via logical replication (see the sketch after this list). It captures row-level changes and converts them into structured Avro events.
- Schema Registry: As events are produced, their Avro schemas are registered and validated against the Schema Registry, ensuring data contract compliance.
- Apache Kafka: Change events are published to corresponding Kafka topics (e.g., `postgres.public.users`). The 3-broker cluster provides high availability and data durability.
- S3 Sink Connector: The S3 Sink connector consumes messages from the Kafka topics, batches them by record count or time, converts them to Parquet format, and writes them to an AWS S3 bucket.
- AWS S3: The final, immutable data lands in S3, partitioned by time and ready for consumption by batch processing frameworks, data lakes, or analytics tools.
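The WAL tailing described above works only if the source database has logical replication enabled. A minimal check, with the caveat that the deployment name and database user below are illustrative assumptions rather than values taken from the repository's manifests:

```bash
# Confirm logical replication is enabled for Debezium; "deploy/postgres" and the
# "postgres" user are illustrative assumptions, not names from the manifests.
kubectl exec -n data-ingestion deploy/postgres -- \
  psql -U postgres -c "SHOW wal_level;"   # expected output: logical
```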
| Component | Technology | Purpose |
|---|---|---|
| Orchestration | Kubernetes (Kind) | Container orchestration for local development |
| Database (Source) | PostgreSQL | OLTP database and CDC source |
| Streaming | Apache Kafka (KRaft) | Distributed event streaming platform |
| CDC | Debezium | Real-time change data capture |
| Schema Management | Confluent Schema Registry | Avro schema management and evolution |
| Integration | Kafka Connect | Framework for scalable data streaming |
| Storage (Archive) | AWS S3 | Durable, long-term object storage |
| Data Format | Apache Parquet | Columnar storage format for analytics |
| Schema Format | Apache Avro | Data serialization format with schemas |
| Deployment | Bash, Python | Automation and data generation scripts |
The core data ingestion pipeline is complete and fully operational.
- ✅ Phase 1: Foundation: Infrastructure setup, including Kind cluster, persistent volumes, namespaces, and PostgreSQL deployment.
- ✅ Phase 2: Core Services: Deployment of the 3-broker Kafka cluster, Schema Registry, and the Kafka Connect framework.
- ✅ Phase 3: Integration: Configuration of the Debezium and S3 Sink connectors to establish the end-to-end data flow.
- ✅ Phase 4: Production Readiness: Core security procedures implemented.
The entire data ingestion pipeline is designed to run within a 6Gi RAM limit, making it ideal for local development on standard personal computers.
| Service | Requested Memory | CPU Request / Limit | Storage |
|---|---|---|---|
| PostgreSQL | 1Gi | 500m / 1 | 5Gi |
| Kafka Cluster (3) | 2Gi (total) | 750m / 1.5 | 10Gi |
| Schema Registry | 1Gi | 250m / 500m | - |
| Kafka Connect | 2Gi | 500m / 1 | - |
| Total | 6Gi | 2 / 4 | 15Gi |
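Once the Metrics Server installed by the deploy script is serving metrics, actual consumption can be compared against this budget:

```bash
# Live memory and CPU usage of the pipeline pods and the kind nodes.
kubectl top pods -n data-ingestion
kubectl top nodes
```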
- Docker Desktop for Windows/macOS with Kubernetes enabled.
- kind (Kubernetes in Docker).
- kubectl configured to interact with your Docker cluster.
- An AWS account with an S3 bucket and IAM credentials (access key ID and secret access key).
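A quick sanity check that the tooling is in place (the last command is optional and assumes the AWS CLI is installed and configured with the same credentials):

```bash
# Verify the local toolchain before deploying.
docker version --format '{{.Server.Version}}'
kind version
kubectl version --client
# Optional: confirm the IAM credentials resolve (requires the AWS CLI).
aws sts get-caller-identity
```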
- Clone the Repository

  ```bash
  git clone <your-repo-url>
  cd DataPipe/1-data-ingestion-pipeline
  ```

- Configure Secrets

  The pipeline requires credentials for PostgreSQL, AWS, and Schema Registry. Copy the example file and fill in your details.

  ```bash
  cp 04-secrets.yaml.example 04-secrets.yaml
  ```

  Edit `04-secrets.yaml` and provide base64-encoded values for:
  - `aws-credentials`: your AWS `access-key-id`, `secret-access-key`, `region`, and `s3-bucket` name (see the encoding sketch after these steps).
  - Other secrets can be left at their defaults for local development.

- Deploy the Pipeline

  The `deploy-data-ingestion-pipeline.sh` script automates the entire setup process.

  ```bash
  ./deploy-data-ingestion-pipeline.sh
  ```

  This script will:
  - Create a 3-node `kind` Kubernetes cluster.
  - Install the Kubernetes Metrics Server.
  - Apply all Kubernetes manifests for namespaces, services, storage, and applications in the correct order.
  - Wait for each component to become ready before proceeding to the next.
  - Deploy the Debezium and S3 Sink connector configurations to Kafka Connect.
  - Run a performance test using `data-generator.py` to simulate a workload.
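Kubernetes Secrets expect base64-encoded values. A minimal sketch for producing them; the region shown is a placeholder, and the output is what goes into `04-secrets.yaml`:

```bash
# Encode a value for 04-secrets.yaml; -n keeps a trailing newline out of the encoding.
echo -n 'us-east-1' | base64
```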
- Check Pod Status: Ensure all pods in the `data-ingestion` namespace are `Running` or `Completed`.

  ```bash
  kubectl get pods -n data-ingestion
  ```
- Check Connectors: Verify that both connectors are in a `RUNNING` state.

  ```bash
  kubectl exec -n data-ingestion deploy/kafka-connect -- \
    curl -s "http://localhost:8083/connectors?expand=status" | jq .
  ```
- Check S3 Bucket: After the data generator runs, navigate to your S3 bucket. You should see new objects organized in a `year=.../month=.../day=.../hour=...` directory structure.
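If you prefer the command line to the S3 console, a hedged alternative using the AWS CLI (assumes the CLI is configured with the same credentials and the bucket name you placed in `04-secrets.yaml`):

```bash
# List the most recently keyed objects written by the S3 Sink connector;
# the exact prefix depends on the connector configuration under connectors/.
aws s3 ls "s3://<your-s3-bucket>/" --recursive | tail -20
```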
The repository is organized to separate different pipeline components, with the data ingestion pipeline contained in its own directory alongside Kubernetes manifests, connector configurations, and automation scripts.
```
DataPipe
├── .kiro/specs/                          # Project Design, Requirements, and Tasks
├── 1-data-ingestion-pipeline/            # Data Ingestion Pipeline Components
│   ├── connectors/                       # Kafka Connect connector configurations
│   ├── 01-namespace.yaml                 # Kubernetes Namespace and Resource Quotas
│   ├── 02-service-accounts.yaml          # RBAC Service Accounts, Roles, and Bindings
│   ├── 03-network-policies.yaml          # Network isolation rules for all components
│   ├── 04-secrets.yaml.example           # Template for secrets
│   ├── data-generator.py                 # Performance test data generator
│   ├── deploy-data-ingestion-pipeline.sh # Main deployment automation script
│   ├── kind-config.yaml                  # 3-node kind cluster definition
│   ├── storage-classes.yaml              # Differentiated storage for DB vs. streaming
│   ├── task*.yaml                        # Kubernetes manifests for each pipeline component
│   └── task*.sh                          # Validation and testing scripts
├── 2-batch-analytics-layer/              # Batch Analytics Layer (Future Phase)
├── logs/                                 # Pipeline execution logs
├── utils-dev/                            # Development utilities and query tools
└── README.md                             # This file
```

The core data ingestion pipeline is complete and operational. The broader Lambda Architecture vision includes:
- Optional Enhancements (for the current pipeline):
  - Data-specific backup and recovery procedures
  - Performance testing to validate the 10,000 events/sec target throughput
- Batch Layer: Use dbt Core and Snowflake to process the Parquet data in S3 for BI and reporting.
- Data Lakehouse: Evolve the S3 storage layer into an Apache Iceberg table for transactional capabilities.
- Speed Layer: Integrate Apache Spark Streaming to consume data from Kafka for real-time analytics.
- Orchestration: Introduce Airflow to manage complex, scheduled workflows across the batch and speed layers.
This project is licensed under the MIT License - see the LICENSE file for details.