A production-ready background job engine built from scratch with Node.js, TypeScript, and Redis — demonstrating deep knowledge of distributed systems, asynchronous patterns, and fault tolerance without relying on heavy queue frameworks.
There is also a small HTML dashboard (static page + Express API) so you can click buttons to enqueue jobs, watch queue stats update in near real time, and see failures land in the Dead-Letter Queue, rather than tailing terminal logs.
- Why This Project?
- Architecture Overview
- Key Design Decisions
- Project Structure
- Local Setup & Running the Demo
- Interactive Web Dashboard
- Configuration Reference
- How It Works: Step by Step
- Tech Stack
Most real-world applications need some form of background job processing: sending emails, resizing images, generating reports, or calling unreliable third-party APIs. Popular libraries like BullMQ or Sidekiq abstract away the hard parts. This project implements those hard parts directly, using only ioredis, to demonstrate:
- How Redis data structures (LISTs, ZSETs) map to queue semantics
- How atomic Lua scripts prevent race conditions across distributed workers
- How to build fault tolerance into a system from first principles
┌─────────────────────────────────────────────────────────────────┐
│ Client Code │
│ queue.enqueue(job) queue.schedule(job, runAt) │
└──────────────┬───────────────────────────┬──────────────────────┘
│ RPUSH │ ZADD (score = runAt)
▼ ▼
┌─────────────────┐ ┌──────────────────────┐
│ scheduler: │◄────────│ scheduler:delayed │
│ active (LIST) │ RPUSH │ (ZSET, score=ms) │
└────────┬────────┘ (Lua) └──────────────────────┘
│ BLMOVE ▲
│ LEFT→LEFT promoter tick (setInterval)
▼ (also used for retries)
┌─────────────────┐
│ scheduler: │
│ processing │
│ (LIST) │
└────────┬────────┘
│ executeJob()
▼
┌─────────────────┐
│ Job Handler │──► success: LREM from processing
│ (user fn) │
└────────┬────────┘
│ failure
▼
┌─────────────────────────────────────────┐
│ retry.handleFailure() │
│ │
│ attempts < maxRetries? │
│ YES → ZADD scheduler:delayed │
│ (score = now + backoff) │
│ │
│ NO → RPUSH scheduler:dlq │
└─────────────────────────────────────────┘
Delayed and retry jobs are stored in a Redis Sorted Set (ZSET) where the score is the Unix millisecond timestamp at which the job should run:
ZADD scheduler:delayed <runAt_ms> <jobJson>
A background "promoter" runs every 500 ms and queries:
ZRANGEBYSCORE scheduler:delayed -inf <now_ms>
Any jobs whose score ≤ now are atomically removed from the ZSET and pushed into the active LIST using a Lua script. This single server-side script executes atomically, so two competing worker instances can never double-promote the same job.
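A sketch of what that promotion script could look like, written against a minimal eval-capable client interface rather than ioredis directly (the actual script in src/queue.ts may differ; the `RedisEval` interface, function signature, and batch limit here are illustrative assumptions):

```typescript
// Minimal client surface this sketch needs; ioredis's eval() is compatible.
interface RedisEval {
  eval(script: string, numKeys: number, ...args: (string | number)[]): Promise<unknown>;
}

// Runs entirely inside Redis, so no other client can interleave between the
// read (ZRANGEBYSCORE) and the move (ZREM + RPUSH).
const PROMOTE_LUA = `
local due = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, ARGV[2])
for _, job in ipairs(due) do
  redis.call('ZREM', KEYS[1], job)
  redis.call('RPUSH', KEYS[2], job)
end
return #due
`;

// Promote every due job from the delayed ZSET into the active LIST,
// returning how many were moved on this tick.
async function promoteDelayed(
  client: RedisEval,
  delayedKey = "scheduler:delayed",
  activeKey = "scheduler:active",
  batch = 100,
): Promise<number> {
  const moved = await client.eval(PROMOTE_LUA, 2, delayedKey, activeKey, Date.now(), batch);
  return Number(moved);
}
```

Because the script is a single `EVAL`, Redis guarantees no other command runs between the range query and the moves, which is exactly the double-promotion protection described above.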
Why ZSETs?
- `ZADD` is O(log N) — inserting a million delayed jobs stays fast
- `ZRANGEBYSCORE` returns all due jobs in one round-trip
- The score-based ordering naturally handles arbitrary future timestamps
A naive LPOP would lose a job if the worker process crashed between popping and finishing. Instead, we use BLMOVE:
BLMOVE scheduler:active scheduler:processing LEFT LEFT <timeout>
This atomically moves the job into a "processing" shadow list. The job only leaves scheduler:processing when it either:
- Succeeds → removed with `LREM scheduler:processing 1 <jobJson>`
- Fails → removed before rescheduling or DLQ insertion
If the worker crashes before either path completes, the job remains in scheduler:processing. On startup, Queue.rescueStuckJobs() moves all orphaned entries back to scheduler:active, guaranteeing at-least-once delivery.
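A sketch of what `Queue.rescueStuckJobs()` might do, using a minimal client interface instead of ioredis directly (key names follow the README; the real implementation may differ, e.g. to avoid rescuing jobs a live worker is still processing):

```typescript
// Minimal slice of the Redis client API this sketch needs.
interface ListClient {
  lrange(key: string, start: number, stop: number): Promise<string[]>;
  rpush(key: string, ...values: string[]): Promise<number>;
  del(key: string): Promise<number>;
}

// Move every orphaned job from the processing shadow list back to the
// active list. Running this at startup can re-deliver a job another worker
// is mid-way through — that duplicate is the at-least-once trade-off.
async function rescueStuckJobs(
  client: ListClient,
  processingKey = "scheduler:processing",
  activeKey = "scheduler:active",
): Promise<number> {
  const orphans = await client.lrange(processingKey, 0, -1);
  if (orphans.length > 0) {
    await client.rpush(activeKey, ...orphans);
    await client.del(processingKey);
  }
  return orphans.length;
}
```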
Rather than silently discarding a job that has failed too many times, we move it to a Dead-Letter Queue (DLQ):
scheduler:dlq (Redis LIST, append-only)
Each entry is a JSON object containing the original job, a failedAt timestamp, and the finalError message. This means:
- No job payload is ever lost
- Engineers can inspect the DLQ, fix the underlying bug, and replay jobs
- Operations teams can set up alerts on `LLEN scheduler:dlq > 0`
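A sketch of building such a record (field names follow the README's summary of src/types.ts, but the exact interfaces are assumptions):

```typescript
// Shapes assumed from the README's description of src/types.ts.
interface Job<T> {
  id: string;
  name: string;
  payload: T;
  attempts: number;
  maxRetries: number;
}

interface DeadLetter<T> {
  job: Job<T>;        // the original, untouched job envelope
  failedAt: string;   // ISO-8601 timestamp of the final failure
  finalError: string; // message from the last thrown error
}

function buildDeadLetter<T>(job: Job<T>, err: Error): DeadLetter<T> {
  return {
    job,
    failedAt: new Date().toISOString(),
    finalError: err.message,
  };
}
// The record is then serialised and appended: RPUSH scheduler:dlq <json>
```

Because the original `job` is embedded whole, replaying a dead letter is just deserialising the record and re-enqueuing `record.job`.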
When a job fails and is eligible for retry, the delay before the next attempt uses Full Jitter:
cap = min(baseMs × 2^attempt, maxMs)
delay = random(0, cap)
This approach (recommended by the AWS Architecture Blog) spreads retrying workers across a wide time window, preventing the thundering herd problem where all workers simultaneously hammer a recovering downstream service.
| Attempt | Base | Cap (maxMs=30s) | Example delay |
|---|---|---|---|
| 1 | 1s | 2s | ~1.3s |
| 2 | 1s | 4s | ~2.8s |
| 3 | 1s | 8s | ~5.1s |
| 4 | 1s | 16s | ~9.4s |
| 5+ | 1s | 30s | ≤30s |
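The formula above can be sketched as a pure function (`calcBackoff` is the name listed for src/retry.ts; the exact signature, including the injectable `rng` used here for testability, is an assumption):

```typescript
// Full Jitter: the cap grows exponentially with the attempt number, and the
// actual delay is drawn uniformly from [0, cap).
function calcBackoff(
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
  rng: () => number = Math.random,
): number {
  const cap = Math.min(baseMs * 2 ** attempt, maxMs);
  return Math.floor(rng() * cap);
}
```

Injecting `rng` keeps the function deterministic under test while production code uses `Math.random`.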
The Worker class maintains a simple integer counter (activeCount) as a semaphore. The poll loop checks activeCount < concurrency before attempting to dequeue a new job. When at capacity, it sleeps for pollIntervalMs before trying again.
This avoids spawning an unbounded number of async tasks when jobs arrive faster than they complete, giving predictable memory and CPU usage under load.
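The counter-as-semaphore idea can be sketched as a tiny class (the Worker in src/worker.ts uses a bare integer; this `Semaphore` wrapper is an illustrative assumption):

```typescript
// Counting semaphore backed by a plain integer, mirroring the Worker's
// activeCount < concurrency check before each dequeue attempt.
class Semaphore {
  private active = 0;
  constructor(private readonly limit: number) {}

  // Returns false when at capacity, so the poll loop knows to sleep.
  tryAcquire(): boolean {
    if (this.active >= this.limit) return false;
    this.active += 1;
    return true;
  }

  // Called when a job finishes (success or failure), freeing a slot.
  release(): void {
    this.active = Math.max(0, this.active - 1);
  }

  get inFlight(): number {
    return this.active;
  }
}
```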
distributed-task-scheduler/
├── docker-compose.yml # Redis 7 with AOF persistence
├── .env.example # All configurable parameters
├── package.json
├── tsconfig.json
├── README.md
├── public/
│ └── index.html # Interactive dashboard (stats + enqueue buttons + DLQ list)
└── src/
├── types.ts # Job<T>, DeadLetter<T>, JobHandler interfaces
├── config.ts # Redis client factory, queue key constants
├── queue.ts # Queue class: enqueue, schedule, promoteDelayed, DLQ
├── retry.ts # calcBackoff(), handleFailure()
├── worker.ts # Worker class: poll loop, concurrency, graceful shutdown
├── example.ts # CLI demo: succeed / transient-fail / permanent-fail jobs
└── server.ts # Express app: serves dashboard + REST API, runs Worker
- Node.js ≥ 18
- Docker Desktop (for Redis)
git clone https://github.com/your-username/distributed-task-scheduler.git
cd distributed-task-scheduler
npm install
cp .env.example .env
# The defaults work out of the box with docker-compose
docker-compose up -d
# Verify: docker-compose ps (should show "healthy")
npm run dev

Prefer a browser? After Redis is up, run npm run web and open http://localhost:3000 — see Interactive Web Dashboard.
You will see output like:
────────────────────────────────────────────────────────────
Distributed Task Scheduler — Demo
────────────────────────────────────────────────────────────
[Redis:demo:queue] connected to 127.0.0.1:6379
[Queue] enqueued job abc123 (succeed-alpha)
[Queue] enqueued job def456 (succeed-beta)
...
[Queue] scheduled job ghi789 (succeed-delayed-eta) to run in 5000ms
────────────────────────────────────────────────────────────
Starting worker (concurrency=3, runs for 30s)
────────────────────────────────────────────────────────────
[Worker] starting — concurrency: 3, pollInterval: 300ms
[Worker] ✓ job abc123 (succeed-alpha) completed successfully
[Worker] ✓ job def456 (succeed-beta) completed successfully
[Retry] job xyz (permanent-fail-epsilon) failed (attempt 1/2). Retrying in 743ms.
...
[Queue] job xyz (permanent-fail-epsilon) exhausted 2 retries → DLQ
────────────────────────────────────────────────────────────
Demo complete — inspecting results
────────────────────────────────────────────────────────────
Dead-Letter Queue (2 entries):
┌─ Job ID : xyz...
│ Name : permanent-fail-epsilon
│ Attempts : 2/2
│ Failed at: 2026-03-22T10:30:00.000Z
│ Error : Permanent failure in job "permanent-fail-epsilon"
│ Payload : {"task":"broken-integration","serviceId":"payment-gateway"}
└──────────────────────────────────────────────────────
For a hands-on way to explore the scheduler (recommended for interviews or portfolio walkthroughs), run the bundled HTML dashboard. It is a single-page UI in public/index.html served by src/server.ts: an Express server exposes JSON endpoints (/api/stats, /api/dlq, /api/enqueue, /api/flush) and the same Worker + Queue logic as the CLI demo runs in-process.
What you can try:
| Action | What it demonstrates |
|---|---|
| Add Success Job | Normal FIFO flow: active → processing → done. |
| Add Delayed Job (5s) | Jobs sit in the ZSET until promoted; watch the Delayed counter drop when they move to Active. |
| Add Transient-Fail | First attempt fails; exponential backoff + jitter schedules a retry; second attempt succeeds. |
| Add Permanent-Fail | Retries exhaust; job lands in the DLQ and appears in the list at the bottom. |
| Clear Queue Data | Deletes scheduler:* keys in Redis (demo reset only). |
Run the dashboard (Redis must be up — see step 3):
npm run web

Then open http://localhost:3000 (or set PORT in .env to use another port). Stats poll every ~500 ms so you can see counters move as jobs flow through the system.
npm run build # Compiles TypeScript → dist/
npm start # Runs dist/example.js (CLI demo)
node dist/server.js   # Compiled dashboard + API (same as npm run web, but production JS)

docker-compose down -v   # Stops Redis and removes the data volume

All values can be set in .env (copy from .env.example):
| Variable | Default | Description |
|---|---|---|
| `REDIS_HOST` | `127.0.0.1` | Redis server hostname |
| `REDIS_PORT` | `6379` | Redis server port |
| `REDIS_PASSWORD` | (empty) | Redis AUTH password (leave blank for no auth) |
| `PORT` | `3000` | HTTP port for the web dashboard (`npm run web`) |
| `QUEUE_NAMESPACE` | `scheduler` | Prefix for all Redis keys |
| `WORKER_CONCURRENCY` | `5` | Max jobs processed simultaneously per worker instance |
| `WORKER_POLL_INTERVAL_MS` | `500` | How often (ms) the poll loop retries when at capacity |
| `RETRY_BASE_DELAY_MS` | `1000` | Base delay for exponential backoff (ms) |
| `RETRY_MAX_DELAY_MS` | `30000` | Maximum retry delay cap (ms) |
1. **Enqueue** — `queue.enqueue(options)` creates a `Job<T>` envelope with a UUID, wraps the user payload, and appends it to `scheduler:active` (Redis LIST) via `RPUSH`.
2. **Schedule** — `queue.schedule(options)` stores the job in `scheduler:delayed` (Redis ZSET) with `runAt` as the score. A `setInterval` ticker calls `queue.promoteDelayed()` every 500 ms, which executes a Lua script to atomically move all due jobs into `scheduler:active`.
3. **Dequeue** — The Worker calls `BLMOVE scheduler:active scheduler:processing LEFT LEFT 1`. This blocks for up to 1 second waiting for a job, then atomically moves it to the processing shadow list.
4. **Execute** — The user-provided handler function is called with the `Job<T>` object. The handler has full access to the payload and can do anything: HTTP calls, database writes, file I/O.
5. **Success** — The job is removed from `scheduler:processing` via `LREM`. The active count semaphore is decremented, freeing a concurrency slot.
6. **Failure / Retry** — `handleFailure()` increments `job.attempts`, calculates a jittered backoff delay, and calls `queue.schedule(job, runAt)` to re-insert the job into the ZSET. The job re-enters the normal scheduling flow.
7. **Permanent Failure / DLQ** — When `job.attempts >= job.maxRetries`, the job is serialised as a `DeadLetter` record and appended to `scheduler:dlq` (Redis LIST) via `RPUSH`. It is never silently dropped.
8. **Graceful Shutdown** — On `SIGINT`/`SIGTERM`, the worker sets `shuttingDown = true`, stops the promoter, and awaits a drain promise that resolves when `activeCount === 0`. In-flight jobs always complete before the process exits.
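The drain step can be sketched as follows (the real Worker presumably flips `shuttingDown` before draining so no new jobs are dequeued; `drain` and `checkEveryMs` are illustrative names, not the project's actual API):

```typescript
// Poll-based drain: resolves once no jobs are in flight. Callers install it
// from SIGINT/SIGTERM handlers, await it, then close connections and exit.
function drain(getActiveCount: () => number, checkEveryMs = 50): Promise<void> {
  return new Promise((resolve) => {
    const timer = setInterval(() => {
      if (getActiveCount() === 0) {
        clearInterval(timer);
        resolve();
      }
    }, checkEveryMs);
  });
}
```

Polling a counter is deliberately simple; an event-based alternative would resolve the promise directly from the spot where `activeCount` is decremented to zero.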
| Technology | Role |
|---|---|
| Node.js 18+ | Async runtime; native Promise / async-await throughout |
| TypeScript 5 | Full type safety on all job payloads via generics |
| ioredis 5 | Redis client; used directly to showcase raw Redis commands |
| Redis 7 | State storage: LISTs, ZSETs, Lua scripting |
| Docker | Reproducible local Redis environment with AOF persistence |
| uuid | Collision-resistant job IDs (v4) |
| dotenv | Twelve-factor app configuration via environment variables |
| Express | HTTP server for the HTML dashboard + REST API (src/server.ts) |