
DataPipe: High-Throughput Data Ingestion Pipeline


DataPipe is an end-to-end data pipeline demonstrating a Lambda Architecture for high-volume data processing. This repository contains the first major component: a high-throughput, resilient data ingestion pipeline designed for local development and testing.

It captures real-time changes from a PostgreSQL database using Debezium CDC, streams them through a highly available Kafka cluster, and reliably archives them to AWS S3 in Parquet format. The entire stack is orchestrated on Kubernetes (via kind) and is engineered to operate within a strict 6Gi RAM budget.

✨ Features

  • Real-time CDC Streaming: Captures database changes from PostgreSQL using Debezium and streams them through a high-availability, KRaft-based Kafka cluster.
  • Managed Schema Evolution: Employs Confluent Schema Registry with Avro to enforce data contracts and handle schema changes without breaking the pipeline.
  • Optimized S3 Archival: Archives the event stream to AWS S3 as compressed Parquet files, partitioned by time (year/month/day/hour) for efficient analytical queries.
  • Kubernetes-Native Deployment: Defined entirely with Kubernetes manifests for reproducible, production-parity deployment on a local kind cluster.
  • Resource-Efficient: Tuned to run the complete pipeline (PostgreSQL, Kafka, Schema Registry, and Kafka Connect) within a 6Gi RAM budget.
  • Secure by Design: Restricts service-to-service communication with Kubernetes NetworkPolicies and applies least-privilege RBAC.

🏗️ Architecture

The pipeline follows a classic three-layer event-driven architecture: Source, Streaming, and Archive.

graph TB
    subgraph "Source Layer"
        PG[PostgreSQL<br/>OLTP Database<br/>Logical Replication]
    end
    
    subgraph "Streaming Layer"
        CDC[Debezium CDC<br/>Connector]
        KAFKA[Apache Kafka<br/>3 Brokers<br/>KRaft Mode]
        SR[Schema Registry<br/>Avro Schemas]
    end
    
    subgraph "Archive Layer"
        S3SINK[Kafka Connect<br/>S3 Sink Connector]
        S3[AWS S3<br/>Parquet Files<br/>Date Partitioned]
    end
    
    subgraph "Infrastructure"
        K8S[Kubernetes<br/>Docker Desktop + kind]
    end
    
    PG -->|WAL Changes| CDC
    CDC -->|Avro Events| KAFKA
    KAFKA -->|Schema Validation| SR
    KAFKA -->|Batch Events| S3SINK
    S3SINK -->|Parquet Files| S3

Data Flow

  1. PostgreSQL: An application performs transactions on the e-commerce database tables (users, products, orders).
  2. Debezium CDC: The Debezium PostgreSQL connector, running on Kafka Connect, tails the PostgreSQL Write-Ahead Log (WAL) via logical replication. It captures row-level changes and converts them into structured Avro events.
  3. Schema Registry: As events are produced, their Avro schemas are registered and validated against the Schema Registry, ensuring data contract compliance.
  4. Apache Kafka: Change events are published to corresponding Kafka topics (e.g., postgres.public.users). The 3-broker cluster provides high availability and data durability.
  5. S3 Sink Connector: The S3 Sink connector consumes messages from the Kafka topics, batches them by record count or time, converts them to Parquet format, and writes them to an AWS S3 bucket.
  6. AWS S3: The final, immutable data lands in S3, partitioned by time and ready for consumption by batch processing frameworks, data lakes, or analytics tools.
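
Connectors are registered with Kafka Connect over its REST API; the deployment script does this automatically from the files in the connectors/ directory. For orientation only, the sketch below shows what a Debezium source registration looks like. The connector name, credentials, and database name are placeholders, not this project's actual configuration.

    # Illustrative only - the real configurations live in connectors/ and are
    # applied by the deployment script. Credentials and names are placeholders.
    kubectl exec -n data-ingestion deploy/kafka-connect -- curl -s -X POST \
      -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
        "name": "postgres-cdc",
        "config": {
          "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
          "database.hostname": "postgres",
          "database.port": "5432",
          "database.user": "<db-user>",
          "database.password": "<db-password>",
          "database.dbname": "<db-name>",
          "topic.prefix": "postgres",
          "table.include.list": "public.users,public.products,public.orders",
          "plugin.name": "pgoutput"
        }
      }'

The S3 Sink connector is registered the same way, with its Parquet output format and time-based partitioning defined in its own configuration file under connectors/.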

💻 Technology Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Orchestration | Kubernetes (kind) | Container orchestration for local development |
| Database (Source) | PostgreSQL | OLTP database and CDC source |
| Streaming | Apache Kafka (KRaft) | Distributed event streaming platform |
| CDC | Debezium | Real-time change data capture |
| Schema Management | Confluent Schema Registry | Avro schema management and evolution |
| Integration | Kafka Connect | Framework for scalable data streaming |
| Storage (Archive) | AWS S3 | Durable, long-term object storage |
| Data Format | Apache Parquet | Columnar storage format for analytics |
| Schema Format | Apache Avro | Data serialization format with schemas |
| Deployment | Bash, Python | Automation and data generation scripts |

📊 Project Status

The core data ingestion pipeline is complete and fully operational.

  • ✅ Phase 1: Foundation: Infrastructure setup, including Kind cluster, persistent volumes, namespaces, and PostgreSQL deployment.
  • ✅ Phase 2: Core Services: Deployment of the 3-broker Kafka cluster, Schema Registry, and the Kafka Connect framework.
  • ✅ Phase 3: Integration: Configuration of the Debezium and S3 Sink connectors to establish the end-to-end data flow.
  • ✅ Phase 4: Production Readiness: Core security controls implemented, including NetworkPolicies for service isolation and least-privilege RBAC.

⚙️ Resource Allocation (6Gi Constraint)

The entire data ingestion pipeline is designed to run within a 6Gi RAM limit, making it ideal for local development on standard personal computers.

| Service | Requested Memory | CPU Request / Limit | Storage |
| --- | --- | --- | --- |
| PostgreSQL | 1Gi | 500m / 1 | 5Gi |
| Kafka Cluster (3 brokers) | 2Gi (total) | 750m / 1.5 | 10Gi |
| Schema Registry | 1Gi | 250m / 500m | - |
| Kafka Connect | 2Gi | 500m / 1 | - |
| Total | 6Gi | 2 / 4 | 15Gi |
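
Because the deployment script installs the Kubernetes Metrics Server, actual consumption can be compared against this budget once the pipeline is running:

    # Live CPU/memory usage per pod (requires the Metrics Server deployed by the script)
    kubectl top pods -n data-ingestion
    # Node-level headroom across the 3-node kind cluster
    kubectl top nodes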

🚀 Getting Started

Prerequisites

  • Docker Desktop for Windows/macOS with Kubernetes enabled.
  • kind (Kubernetes in Docker).
  • kubectl configured to interact with your local kind cluster.
  • An AWS account with an S3 bucket and IAM credentials (access key id and secret access key).
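
A quick way to confirm the local tooling is in place before deploying (output will vary by version):

    # Verify the local toolchain is installed and reachable
    docker info --format '{{.ServerVersion}}'
    kind version
    kubectl version --client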

Deployment

  1. Clone the Repository

    git clone <your-repo-url>
    cd DataPipe/1-data-ingestion-pipeline
  2. Configure Secrets: The pipeline requires credentials for PostgreSQL, AWS, and Schema Registry. Copy the example file and fill in your details.

    cp 04-secrets.yaml.example 04-secrets.yaml

    Edit 04-secrets.yaml and provide base64-encoded values for:

    • aws-credentials: your AWS access-key-id, secret-access-key, region, and s3-bucket name.
    • Other secrets can be left at their defaults for local development.

    A sketch of generating these base64 values appears after the deployment steps below.
  3. Deploy the Pipeline: The deploy-data-ingestion-pipeline.sh script automates the entire setup process.

    ./deploy-data-ingestion-pipeline.sh

    This script will:

    • Create a 3-node kind Kubernetes cluster.
    • Install the Kubernetes Metrics Server.
    • Apply all Kubernetes manifests for namespaces, services, storage, and applications in the correct order.
    • Wait for each component to become ready before proceeding to the next.
    • Deploy the Debezium and S3 Sink connector configurations to Kafka Connect.
    • Run a performance test using data-generator.py to simulate a workload.
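
As a sketch for step 2 above, the base64 values expected by 04-secrets.yaml can be produced as shown below; the literal values are placeholders, and the key names must match those in 04-secrets.yaml.example.

    # Kubernetes Secret data is base64 encoded; -n prevents a trailing newline
    echo -n '<access-key-id>'     | base64
    echo -n '<secret-access-key>' | base64
    echo -n 'us-east-1'           | base64   # region (placeholder)
    echo -n '<your-bucket-name>'  | base64   # s3-bucket (placeholder)
    # Alternatively, let kubectl render a Secret manifest for you:
    # kubectl create secret generic aws-credentials -n data-ingestion \
    #   --from-literal=access-key-id='<access-key-id>' --dry-run=client -o yaml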

Verifying the Pipeline

  • Check Pod Status: Ensure all pods in the data-ingestion namespace are Running or Completed.
    kubectl get pods -n data-ingestion
  • Check Connectors: Verify that both connectors and their tasks report a RUNNING state.
    kubectl exec -n data-ingestion deploy/kafka-connect -- curl -s 'http://localhost:8083/connectors?expand=status' | jq .
  • Check S3 Bucket: After the data generator runs, navigate to your S3 bucket. You should see new objects organized in a year=.../month=.../day=.../hour=... directory structure.
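
If the AWS CLI is installed and configured with the same credentials, the archived layout can also be inspected from a terminal (the bucket name below is a placeholder, and the exact key prefix depends on the S3 Sink connector's settings):

    # List archived objects; keys follow the year=.../month=.../day=.../hour=... layout
    aws s3 ls s3://<your-bucket>/ --recursive | head -n 20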

📂 Project Structure

The repository is organized to separate different pipeline components, with the data ingestion pipeline contained in its own directory alongside Kubernetes manifests, connector configurations, and automation scripts.

DataPipe
├── .kiro/specs/                                    # Project Design, Requirements, and Tasks
├── 1-data-ingestion-pipeline/                      # Data Ingestion Pipeline Components
│   ├── connectors/                                 # Kafka Connect connector configurations
│   ├── 01-namespace.yaml                           # Kubernetes Namespace and Resource Quotas
│   ├── 02-service-accounts.yaml                    # RBAC Service Accounts, Roles, and Bindings
│   ├── 03-network-policies.yaml                    # Network isolation rules for all components
│   ├── 04-secrets.yaml.example                     # Template for secrets
│   ├── data-generator.py                           # Performance test data generator
│   ├── deploy-data-ingestion-pipeline.sh           # Main deployment automation script
│   ├── kind-config.yaml                            # 3-node Kind cluster definition
│   ├── storage-classes.yaml                        # Differentiated storage for DB vs. streaming
│   ├── task*.yaml                                  # Kubernetes manifests for each pipeline component
│   └── task*.sh                                    # Validation and testing scripts
├── 2-batch-analytics-layer/                        # Batch Analytics Layer (Future Phase)
├── logs/                                           # Pipeline execution logs
├── utils-dev/                                      # Development utilities and query tools
└── README.md                                       # This file

🗺️ Future Work & Roadmap

The core data ingestion pipeline is complete and operational. The broader Lambda Architecture vision includes:

  • Optional Enhancements (for current pipeline):
    • Data-specific backup and recovery procedures
    • Performance testing to validate 10,000 events/sec target throughput
  • Batch Layer: Use dbt Core and Snowflake to process the Parquet data in S3 for BI and reporting.
  • Data Lakehouse: Evolve the S3 storage layer into an Apache Iceberg table for transactional capabilities.
  • Speed Layer: Integrate Apache Spark Streaming to consume data from Kafka for real-time analytics.
  • Orchestration: Introduce Airflow to manage complex, scheduled workflows across the batch and speed layers.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
