# Cloud-native async log processing system

**AWS · Terraform · Python · ECS Fargate · SQS · DynamoDB**
- Problem Statement
- Architecture
- Design Decisions
- Failure Handling
- Cost Considerations
- How to Deploy
- CLI Demo
## Problem Statement

Log files generated by web servers contain rich operational data such as traffic patterns, error rates, and latency distributions, but processing them synchronously blocks the user and scales poorly. A 500 MB log file should not require the user to wait at a terminal.

CloudLog solves this with an asynchronous processing pipeline. The user submits a log file and immediately receives a job ID. A decoupled worker picks up the job in the background, computes analytics, and stores the results. The user fetches the report whenever they want, whether seconds, minutes, or hours later.
Design goals:

- Submit returns in milliseconds regardless of log file size
- The CLI, API, and worker are fully decoupled so that each can scale independently
- Worker failures are isolated and retried automatically via SQS to keep the user experience unaffected
Why web server access logs as the dataset:

- Access logs follow a well-defined format (Combined Log Format), making parsing deterministic
- The metrics are meaningful and easy to demonstrate: top IPs, error rates, status distributions
- File sizes vary enough to make async processing genuinely necessary, not just theoretical
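Because the Combined Log Format has a fixed shape, parsing reduces to a single regular expression. The function below is a minimal sketch of what a parser like `worker/parser.py` might contain; the field names are illustrative, not the project's exact code:

```python
import re
from typing import Optional

# Combined Log Format:
# host ident authuser [date] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str) -> Optional[dict]:
    """Parse one access-log line; return None for malformed lines."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry
```

Returning `None` instead of raising keeps one malformed line from failing an entire job.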
## Architecture

The production reference architecture introduces three changes for network isolation:
- ECS Fargate moves to a private subnet, removing its public IP
- A NAT Gateway in the public subnet handles outbound traffic to SQS and ECR via the internet gateway
- VPC Gateway Endpoints for S3 and DynamoDB and a VPC Interface Endpoint for CloudWatch keep that traffic fully private within AWS's network, bypassing the NAT Gateway entirely
Request flow in the deployed environment:

- CLI uploads the log file directly to S3 and calls `POST /jobs`
- API Gateway invokes Lambda, which writes a `PENDING` job to DynamoDB and pushes a message to SQS
- Lambda returns the `job_id` immediately so that the CLI does not wait for processing
- ECS Fargate worker polls SQS, downloads the log from S3, computes metrics, and updates DynamoDB
- CLI polls `GET /jobs/{id}/report` to retrieve results when ready
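The submit path can be condensed into a short Lambda handler sketch. The environment variable names, table layout, and message fields below are assumptions for illustration, not necessarily what `api/handler.py` uses:

```python
import json
import os
import uuid
from datetime import datetime, timezone

def build_job_record(bucket: str, key: str) -> dict:
    """Create the PENDING job record written to DynamoDB at submission."""
    return {
        "job_id": str(uuid.uuid4()),
        "status": "PENDING",
        "s3_bucket": bucket,
        "s3_key": key,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

def handler(event, context):
    """POST /jobs: persist a PENDING record, enqueue work, return job_id."""
    import boto3  # imported here so the pure helper above stays testable offline
    body = json.loads(event["body"])
    record = build_job_record(body["s3_bucket"], body["s3_key"])
    boto3.resource("dynamodb").Table(os.environ["JOBS_TABLE"]).put_item(Item=record)
    boto3.client("sqs").send_message(
        QueueUrl=os.environ["QUEUE_URL"],
        MessageBody=json.dumps({"job_id": record["job_id"],
                                "s3_bucket": record["s3_bucket"],
                                "s3_key": record["s3_key"]}),
    )
    return {"statusCode": 202, "body": json.dumps({"job_id": record["job_id"]})}
```

Nothing here waits on processing: the handler returns as soon as the record and the queue message are written.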
Repository layout:

```
cloudlog/
├── terraform/            # Infrastructure as code (IaC)
│   ├── main.tf
│   ├── variables.tf
│   ├── s3.tf
│   ├── dynamodb.tf
│   ├── sqs.tf
│   ├── vpc.tf
│   ├── iam.tf
│   ├── ecs.tf
│   └── api_gateway.tf
├── api/
│   ├── handler.py        # Lambda handler: POST /jobs, GET /jobs/{id}, GET /jobs/{id}/report
│   └── requirements.txt
├── worker/
│   ├── __init__.py
│   ├── app.py            # SQS polling loop
│   ├── parser.py         # Combined Log Format parser
│   ├── metrics.py        # Analytics computation
│   ├── models.py         # LogEntry dataclass
│   ├── Dockerfile
│   └── requirements.txt
├── cli/
│   ├── cloudlog.py       # CLI client
│   └── requirements.txt
└── README.md
```
## Design Decisions

| Decision | Rationale |
|---|---|
| ECS Fargate over Lambda | The worker is a long-running polling loop: it runs continuously and pulls work from SQS. Lambda's event-triggered invocation model, capped at 15 minutes per invocation, is the wrong fit. ECS runs the container indefinitely with no per-message cold-start overhead. |
| DynamoDB over RDS | Job records have a single access pattern: get or update by job_id. There are no joins, no relational queries, and no schema migrations. DynamoDB's PAY_PER_REQUEST billing essentially means zero cost at low traffic. |
| SQS over direct invocation | SQS decouples the API from the worker entirely. If the worker is down, messages queue up and are processed when it recovers. Built-in retry and DLQ fallback require no application code. |
| Public IP over NAT Gateway (deployed) | ECS tasks are assigned public IPs rather than routing through a NAT Gateway. This avoids the $30+/month fixed cost, which is unnecessary for a portfolio environment. See production reference architecture for the production-grade equivalent. |
| Separate ECS IAM roles | ECS uses two IAM roles: an execution role (used by ECS to pull the Docker image from ECR and send logs to CloudWatch) and a task role (used by the worker's boto3 calls at runtime). Combining them would over-permission the application code. |
| VPC Endpoints for S3, DynamoDB, CloudWatch (production) | Gateway Endpoints for S3 and DynamoDB are free and keep data and state traffic private within AWS's network. A CloudWatch Interface Endpoint ensures observability traffic does not depend on the internet egress path. SQS uses the NAT Gateway path to avoid Interface Endpoint hourly costs. |
## Failure Handling

SQS visibility timeout is set to 300 seconds (longer than any realistic processing time). If the worker crashes or throws an unhandled exception before deleting the message, SQS makes the message visible again after the timeout and another attempt is made. The worker only deletes the SQS message upon success.
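The delete-on-success contract can be sketched as below. The SQS client is injected (in practice it would be `boto3.client("sqs")`) so the logic is testable offline; the function and parameter names are illustrative, not the project's exact `worker/app.py` code:

```python
def poll_once(sqs, queue_url: str, process) -> bool:
    """Receive up to one message; delete it only if process() succeeds.

    On any exception the message is NOT deleted, so SQS re-delivers it
    after the visibility timeout expires (at-least-once semantics).
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling: fewer empty receives
    )
    for msg in resp.get("Messages", []):
        try:
            process(msg["Body"])
        except Exception:
            return False  # leave the message in flight; it reappears after timeout
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return True
```

Deleting only after `process()` returns is what lets the visibility timeout act as the retry mechanism, with no retry code in the worker itself.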
After 3 failed receive attempts (configurable via sqs_max_receive_count), SQS automatically moves the message to the DLQ. A CloudWatch alarm fires when the DLQ message count exceeds zero. This is handled entirely by SQS configuration options and requires no application code.
Each job has a stable job_id generated at submission time. DynamoDB updates use UpdateItem rather than PutItem, so re-processing the same job_id overwrites the record safely rather than creating a duplicate.
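The shape of such an idempotent update can be illustrated by the arguments an `UpdateItem` call would take; the expression details are a sketch, not the project's exact code:

```python
def completed_update(job_id: str, report: dict) -> dict:
    """Keyword arguments for an idempotent DynamoDB UpdateItem call.

    UpdateItem upserts by key, so re-processing a redelivered message
    overwrites the same record instead of creating a duplicate.
    """
    return {
        "Key": {"job_id": job_id},
        "UpdateExpression": "SET #s = :status, report = :report",
        "ExpressionAttributeNames": {"#s": "status"},  # 'status' is a DynamoDB reserved word
        "ExpressionAttributeValues": {":status": "COMPLETED", ":report": report},
    }

# Usage with a boto3 Table resource:
# table.update_item(**completed_update(job_id, report))
```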
The worker marks jobs PROCESSING immediately on pickup, before any S3 download or parsing begins. On any exception it marks the job FAILED with the error message stored in DynamoDB. The CLI's report command surfaces both states explicitly so that the user always knows whether to wait or investigate.
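The CLI's branching on job state can be sketched as a small helper; the function name and message strings are illustrative, not what `cli/cloudlog.py` necessarily prints:

```python
def describe(job: dict) -> str:
    """Turn a job record into the status line the report command shows."""
    status = job.get("status")
    if status == "COMPLETED":
        return "done"
    if status == "FAILED":
        # Surface the stored error so the user knows to investigate, not wait
        return f"FAILED: {job.get('error', 'unknown error')}"
    return f"{status}: still running, try again shortly"
```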
## Cost Considerations

Estimated monthly cost for a portfolio environment with light usage:
| Service | Notes | Est. monthly cost |
|---|---|---|
| ECS Fargate | 0.25 vCPU / 512 MB, running continuously | ~$9–11 |
| DynamoDB | PAY_PER_REQUEST, low traffic | ~$0 |
| SQS | Free tier: 1M requests/month | ~$0 |
| S3 | Minimal storage + GET/PUT requests | <$1 |
| API Gateway | $1 per million calls | ~$0 |
| CloudWatch Logs | 14-day retention, low volume | <$1 |
| NAT Gateway | Avoided by assigning public IPs to ECS tasks | $0 (saved $30+) |
Production note: In production, private subnets with a NAT Gateway would be preferred for network isolation. The $30+/month fixed cost is the primary reason this is omitted here and is documented as a known tradeoff.
## How to Deploy

Prerequisites:

- AWS CLI configured with appropriate credentials
- Terraform >= 1.6
- Docker with buildx (for cross-platform builds)
- Python 3.12+
Provision the infrastructure:

```bash
cd terraform
terraform init
terraform apply
```

Export the Terraform outputs used by the remaining steps:

```bash
export CLOUDLOG_API_URL=$(terraform output -raw api_gateway_url)
export CLOUDLOG_S3_BUCKET=$(terraform output -raw s3_bucket_name)
export ECR_URL=$(terraform output -raw ecr_repository_url)
```

Build, push, and deploy the worker image:

```bash
# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin $ECR_URL

# Build for linux/amd64 (required for Fargate)
docker build --platform linux/amd64 -t cloudlog-worker ./worker
docker tag cloudlog-worker:latest $ECR_URL:latest
docker push "${ECR_URL}:latest"

# Deploy to ECS
aws ecs update-service \
  --cluster cloudlog-cluster \
  --service cloudlog-worker \
  --force-new-deployment
```

Install the CLI dependencies and submit a job:

```bash
pip install boto3 requests
python cli/cloudlog.py submit path/to/access.log --wait
```

## CLI Demo

$ python cli/cloudlog.py submit worker/sample.log --wait
Uploading sample.log to S3...
Creating job...
Job created: 91d779f2-2bce-4d15-88bc-3196e3fab182
Waiting for job to complete... done.
Total Requests: 50
Unique IPs: 9
Top IPs:
10.0.0.5 — 7
192.168.1.1 — 6
192.168.1.2 — 6
10.0.0.8 — 6
172.16.0.3 — 6
192.168.1.3 — 5
10.0.0.9 — 5
192.168.1.4 — 5
172.16.0.4 — 4
Status Codes:
200 — 30
201 — 4
401 — 5
500 — 3
204 — 1
403 — 3
404 — 4
Error Rate: 30.00%
Total Bytes: 139257
Average Bytes / Request: 2785.14
$ python cli/cloudlog.py status 91d779f2-2bce-4d15-88bc-3196e3fab182
Job ID: 91d779f2-2bce-4d15-88bc-3196e3fab182
Status: COMPLETED
Created: 2026-03-27T20:05:57.499137+00:00
$ python cli/cloudlog.py report 91d779f2-2bce-4d15-88bc-3196e3fab182
Total Requests: 50
Unique IPs: 9
Top IPs:
10.0.0.5 — 7
192.168.1.1 — 6
192.168.1.2 — 6
10.0.0.8 — 6
172.16.0.3 — 6
192.168.1.3 — 5
10.0.0.9 — 5
192.168.1.4 — 5
172.16.0.4 — 4
Status Codes:
200 — 30
201 — 4
401 — 5
500 — 3
204 — 1
403 — 3
404 — 4
Error Rate: 30.00%
Total Bytes: 139257
Average Bytes / Request: 2785.14
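The aggregates in the report above (top IPs, status distribution, error rate, byte totals) can be computed with a few `Counter` passes. This is a sketch of what `worker/metrics.py` might contain, taking entries produced by the parser; field names are assumptions:

```python
from collections import Counter

def compute_metrics(entries: list[dict]) -> dict:
    """Aggregate parsed log entries into the report fields."""
    total = len(entries)
    statuses = Counter(e["status"] for e in entries)
    ips = Counter(e["ip"] for e in entries)
    # Error rate counts 4xx and 5xx responses as a fraction of all requests
    errors = sum(n for code, n in statuses.items() if code >= 400)
    total_bytes = sum(e["size"] for e in entries)
    return {
        "total_requests": total,
        "unique_ips": len(ips),
        "top_ips": ips.most_common(10),
        "status_codes": dict(statuses),
        "error_rate": errors / total if total else 0.0,
        "total_bytes": total_bytes,
        "avg_bytes": total_bytes / total if total else 0.0,
    }
```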

