A production-ready background job engine built from scratch with Node.js, TypeScript, and Redis — demonstrating deep knowledge of distributed systems, asynchronous patterns, and fault tolerance without relying on heavy queue frameworks.
There is also a small HTML dashboard (static page + Express API) so you can click buttons to enqueue jobs, watch queue stats update in near real time, and see failures land in the Dead-Letter Queue, rather than tailing terminal logs.
- Why This Project?
- Architecture Overview
- Key Design Decisions
- Project Structure
- Local Setup & Running the Demo
- Interactive Web Dashboard
- Configuration Reference
- How It Works: Step by Step
- Tech Stack
Most real-world applications need some form of background job processing: sending emails, resizing images, generating reports, or calling unreliable third-party APIs. Popular libraries like BullMQ or Sidekiq abstract away the hard parts. This project implements those hard parts directly, using only ioredis, to demonstrate:
- How Redis data structures (LISTs, ZSETs) map to queue semantics
- How atomic Lua scripts prevent race conditions across distributed workers
- How to build fault tolerance into a system from first principles
┌─────────────────────────────────────────────────────────────────┐
│ Client Code │
│ queue.enqueue(job) queue.schedule(job, runAt) │
└──────────────┬───────────────────────────┬──────────────────────┘
│ RPUSH │ ZADD (score = runAt)
▼ ▼
┌─────────────────┐ ┌──────────────────────┐
│ scheduler: │◄────────│ scheduler:delayed │
│ active (LIST) │ RPUSH │ (ZSET, score=ms) │
└────────┬────────┘ (Lua) └──────────────────────┘
│ BLMOVE ▲
│ LEFT→LEFT promoter tick (setInterval)
▼ (also used for retries)
┌─────────────────┐
│ scheduler: │
│ processing │
│ (LIST) │
└────────┬────────┘
│ executeJob()
▼
┌─────────────────┐
│ Job Handler │──► success: LREM from processing
│ (user fn) │
└────────┬────────┘
│ failure
▼
┌─────────────────────────────────────────┐
│ retry.handleFailure() │
│ │
│ attempts < maxRetries? │
│ YES → ZADD scheduler:delayed │
│ (score = now + backoff) │
│ │
│ NO → RPUSH scheduler:dlq │
└─────────────────────────────────────────┘
Delayed and retry jobs are stored in a Redis Sorted Set (ZSET) where the score is the Unix millisecond timestamp at which the job should run:
ZADD scheduler:delayed <runAt_ms> <jobJson>
A background "promoter" runs every 500 ms and queries:
ZRANGEBYSCORE scheduler:delayed -inf <now_ms>
Any jobs whose score ≤ now are atomically removed from the ZSET and pushed into the active LIST using a Lua script. This single server-side script executes atomically, so two competing worker instances can never double-promote the same job.
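A sketch of what that promotion script could look like, written against a minimal eval-capable client interface rather than ioredis directly (the actual script in src/queue.ts may differ; the `RedisEval` interface, function signature, and batch limit here are illustrative assumptions):

```typescript
// Minimal client surface this sketch needs; ioredis's eval() is compatible.
interface RedisEval {
  eval(script: string, numKeys: number, ...args: (string | number)[]): Promise<unknown>;
}

// Runs entirely inside Redis, so no other client can interleave between the
// read (ZRANGEBYSCORE) and the move (ZREM + RPUSH).
const PROMOTE_LUA = `
local due = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, ARGV[2])
for _, job in ipairs(due) do
  redis.call('ZREM', KEYS[1], job)
  redis.call('RPUSH', KEYS[2], job)
end
return #due
`;

// Promote every due job from the delayed ZSET into the active LIST,
// returning how many were moved on this tick.
async function promoteDelayed(
  client: RedisEval,
  delayedKey = "scheduler:delayed",
  activeKey = "scheduler:active",
  batch = 100,
): Promise<number> {
  const moved = await client.eval(PROMOTE_LUA, 2, delayedKey, activeKey, Date.now(), batch);
  return Number(moved);
}
```

Because the script is a single `EVAL`, Redis guarantees no other command runs between the range query and the moves, which is exactly the double-promotion protection described above.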
Why ZSETs?
- `ZADD` is O(log N) — inserting a million delayed jobs stays fast
- `ZRANGEBYSCORE` returns all due jobs in one round-trip
- The score-based ordering naturally handles arbitrary future timestamps
A naive LPOP would lose a job if the worker process crashed between popping and finishing. Instead, we use BLMOVE:
BLMOVE scheduler:active scheduler:processing LEFT LEFT <timeout>
This atomically moves the job into a "processing" shadow list. The job only leaves scheduler:processing when it either:
- Succeeds → removed with `LREM scheduler:processing 1 <jobJson>`
- Fails → removed before rescheduling or DLQ insertion
If the worker crashes before either path completes, the job remains in scheduler:processing. On startup, Queue.rescueStuckJobs() moves all orphaned entries back to scheduler:active, guaranteeing at-least-once delivery.
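A sketch of what `Queue.rescueStuckJobs()` might do, using a minimal client interface instead of ioredis directly (key names follow the README; the real implementation may differ, e.g. to avoid rescuing jobs a live worker is still processing):

```typescript
// Minimal slice of the Redis client API this sketch needs.
interface ListClient {
  lrange(key: string, start: number, stop: number): Promise<string[]>;
  rpush(key: string, ...values: string[]): Promise<number>;
  del(key: string): Promise<number>;
}

// Move every orphaned job from the processing shadow list back to the
// active list. Running this at startup can re-deliver a job another worker
// is mid-way through — that duplicate is the at-least-once trade-off.
async function rescueStuckJobs(
  client: ListClient,
  processingKey = "scheduler:processing",
  activeKey = "scheduler:active",
): Promise<number> {
  const orphans = await client.lrange(processingKey, 0, -1);
  if (orphans.length > 0) {
    await client.rpush(activeKey, ...orphans);
    await client.del(processingKey);
  }
  return orphans.length;
}
```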
Rather than silently discarding a job that has failed too many times, we move it to a Dead-Letter Queue (DLQ):
scheduler:dlq (Redis LIST, append-only)
Each entry is a JSON object containing the original job, a failedAt timestamp, and the finalError message. This means:
- No job payload is ever lost
- Engineers can inspect the DLQ, fix the underlying bug, and replay jobs
- Operations teams can set up alerts on `LLEN scheduler:dlq > 0`
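A sketch of building such a record (field names follow the README's summary of src/types.ts, but the exact interfaces are assumptions):

```typescript
// Shapes assumed from the README's description of src/types.ts.
interface Job<T> {
  id: string;
  name: string;
  payload: T;
  attempts: number;
  maxRetries: number;
}

interface DeadLetter<T> {
  job: Job<T>;        // the original, untouched job envelope
  failedAt: string;   // ISO-8601 timestamp of the final failure
  finalError: string; // message from the last thrown error
}

function buildDeadLetter<T>(job: Job<T>, err: Error): DeadLetter<T> {
  return {
    job,
    failedAt: new Date().toISOString(),
    finalError: err.message,
  };
}
// The record is then serialised and appended: RPUSH scheduler:dlq <json>
```

Because the original `job` is embedded whole, replaying a dead letter is just deserialising the record and re-enqueuing `record.job`.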
When a job fails and is eligible for retry, the delay before the next attempt uses Full Jitter:
cap = min(baseMs × 2^attempt, maxMs)
delay = random(0, cap)
This approach (recommended by the AWS Architecture Blog) spreads retrying workers across a wide time window, preventing the thundering herd problem where all workers simultaneously hammer a recovering downstream service.
| Attempt | Base | Cap (maxMs=30s) | Example delay |
|---|---|---|---|
| 1 | 1s | 2s | ~1.3s |
| 2 | 1s | 4s | ~2.8s |
| 3 | 1s | 8s | ~5.1s |
| 4 | 1s | 16s | ~9.4s |
| 5+ | 1s | 30s | ≤30s |
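The formula above can be sketched as a pure function (`calcBackoff` is the name listed for src/retry.ts; the exact signature, including the injectable `rng` used here for testability, is an assumption):

```typescript
// Full Jitter: the cap grows exponentially with the attempt number, and the
// actual delay is drawn uniformly from [0, cap).
function calcBackoff(
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
  rng: () => number = Math.random,
): number {
  const cap = Math.min(baseMs * 2 ** attempt, maxMs);
  return Math.floor(rng() * cap);
}
```

Injecting `rng` keeps the function deterministic under test while production code uses `Math.random`.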
The Worker class maintains a simple integer counter (activeCount) as a semaphore. The poll loop checks activeCount < concurrency before attempting to dequeue a new job. When at capacity, it sleeps for pollIntervalMs before trying again.
This avoids spawning an unbounded number of async tasks when jobs arrive faster than they complete, giving predictable memory and CPU usage under load.
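The counter-as-semaphore idea can be sketched as a tiny class (the Worker in src/worker.ts uses a bare integer; this `Semaphore` wrapper is an illustrative assumption):

```typescript
// Counting semaphore backed by a plain integer, mirroring the Worker's
// activeCount < concurrency check before each dequeue attempt.
class Semaphore {
  private active = 0;
  constructor(private readonly limit: number) {}

  // Returns false when at capacity, so the poll loop knows to sleep.
  tryAcquire(): boolean {
    if (this.active >= this.limit) return false;
    this.active += 1;
    return true;
  }

  // Called when a job finishes (success or failure), freeing a slot.
  release(): void {
    this.active = Math.max(0, this.active - 1);
  }

  get inFlight(): number {
    return this.active;
  }
}
```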
distributed-task-scheduler/
├── docker-compose.yml # Redis 7 with AOF persistence
├── .env.example # All configurable parameters
├── package.json
├── tsconfig.json
├── README.md
├── public/
│ └── index.html # Interactive dashboard (stats + enqueue buttons + DLQ list)
└── src/
├── types.ts # Job<T>, DeadLetter<T>, JobHandler interfaces
├── config.ts # Redis client factory, queue key constants
├── queue.ts # Queue class: enqueue, schedule, promoteDelayed, DLQ
├── retry.ts # calcBackoff(), handleFailure()
├── worker.ts # Worker class: poll loop, concurrency, graceful shutdown
├── example.ts # CLI demo: succeed / transient-fail / permanent-fail jobs
└── server.ts # Express app: serves dashboard + REST API, runs Worker
- Node.js ≥ 18
- Docker Desktop (for Redis)
git clone https://github.com/your-username/distributed-task-scheduler.git
cd distributed-task-scheduler
npm install
cp .env.example .env
# The defaults work out of the box with docker-compose
docker-compose up -d
# Verify: docker-compose ps (should show "healthy")
npm run dev

Prefer a browser? After Redis is up, run npm run web and open http://localhost:3000 — see Interactive Web Dashboard.
You will see output like:
────────────────────────────────────────────────────────────
Distributed Task Scheduler — Demo
────────────────────────────────────────────────────────────
[Redis:demo:queue] connected to 127.0.0.1:6379
[Queue] enqueued job abc123 (succeed-alpha)
[Queue] enqueued job def456 (succeed-beta)
...
[Queue] scheduled job ghi789 (succeed-delayed-eta) to run in 5000ms
────────────────────────────────────────────────────────────
Starting worker (concurrency=3, runs for 30s)
────────────────────────────────────────────────────────────
[Worker] starting — concurrency: 3, pollInterval: 300ms
[Worker] ✓ job abc123 (succeed-alpha) completed successfully
[Worker] ✓ job def456 (succeed-beta) completed successfully
[Retry] job xyz (permanent-fail-epsilon) failed (attempt 1/2). Retrying in 743ms.
...
[Queue] job xyz (permanent-fail-epsilon) exhausted 2 retries → DLQ
────────────────────────────────────────────────────────────
Demo complete — inspecting results
────────────────────────────────────────────────────────────
Dead-Letter Queue (2 entries):
┌─ Job ID : xyz...
│ Name : permanent-fail-epsilon
│ Attempts : 2/2
│ Failed at: 2026-03-22T10:30:00.000Z
│ Error : Permanent failure in job "permanent-fail-epsilon"
│ Payload : {"task":"broken-integration","serviceId":"payment-gateway"}
└──────────────────────────────────────────────────────
For a hands-on way to explore the scheduler (recommended for interviews or portfolio walkthroughs), run the bundled HTML dashboard. It is a single-page UI in public/index.html served by src/server.ts: an Express server exposes JSON endpoints (/api/stats, /api/dlq, /api/enqueue, /api/flush) and the same Worker + Queue logic as the CLI demo runs in-process.
What you can try:
| Action | What it demonstrates |
|---|---|
| Add Success Job | Normal FIFO flow: active → processing → done. |
| Add Delayed Job (5s) | Jobs sit in the ZSET until promoted; watch the Delayed counter drop when they move to Active. |
| Add Transient-Fail | First attempt fails; exponential backoff + jitter schedules a retry; second attempt succeeds. |
| Add Permanent-Fail | Retries exhaust; job lands in the DLQ and appears in the list at the bottom. |
| Clear Queue Data | Deletes scheduler:* keys in Redis (demo reset only). |
Run the dashboard (Redis must be up — see step 3):
npm run web

Then open http://localhost:3000 (or set PORT in .env to use another port). Stats poll every ~500 ms so you can see counters move as jobs flow through the system.
npm run build # Compiles TypeScript → dist/
npm start # Runs dist/example.js (CLI demo)
node dist/server.js   # Compiled dashboard + API (same as npm run web, but production JS)

docker-compose down -v   # Stops Redis and removes the data volume

All values can be set in .env (copy from .env.example):
| Variable | Default | Description |
|---|---|---|
| `REDIS_HOST` | `127.0.0.1` | Redis server hostname |
| `REDIS_PORT` | `6379` | Redis server port |
| `REDIS_PASSWORD` | (empty) | Redis AUTH password (leave blank for no auth) |
| `PORT` | `3000` | HTTP port for the web dashboard (`npm run web`) |
| `QUEUE_NAMESPACE` | `scheduler` | Prefix for all Redis keys |
| `WORKER_CONCURRENCY` | `5` | Max jobs processed simultaneously per worker instance |
| `WORKER_POLL_INTERVAL_MS` | `500` | How often (ms) the poll loop retries when at capacity |
| `RETRY_BASE_DELAY_MS` | `1000` | Base delay for exponential backoff (ms) |
| `RETRY_MAX_DELAY_MS` | `30000` | Maximum retry delay cap (ms) |
1. **Enqueue** — `queue.enqueue(options)` creates a `Job<T>` envelope with a UUID, wraps the user payload, and appends it to `scheduler:active` (Redis LIST) via `RPUSH`.
2. **Schedule** — `queue.schedule(options)` stores the job in `scheduler:delayed` (Redis ZSET) with `runAt` as the score. A `setInterval` ticker calls `queue.promoteDelayed()` every 500 ms, which executes a Lua script to atomically move all due jobs into `scheduler:active`.
3. **Dequeue** — The Worker calls `BLMOVE scheduler:active scheduler:processing LEFT LEFT 1`. This blocks for up to 1 second waiting for a job, then atomically moves it to the processing shadow list.
4. **Execute** — The user-provided handler function is called with the `Job<T>` object. The handler has full access to the payload and can do anything: HTTP calls, database writes, file I/O.
5. **Success** — The job is removed from `scheduler:processing` via `LREM`. The active count semaphore is decremented, freeing a concurrency slot.
6. **Failure / Retry** — `handleFailure()` increments `job.attempts`, calculates a jittered backoff delay, and calls `queue.schedule(job, runAt)` to re-insert the job into the ZSET. The job re-enters the normal scheduling flow.
7. **Permanent Failure / DLQ** — When `job.attempts >= job.maxRetries`, the job is serialised as a `DeadLetter` record and appended to `scheduler:dlq` (Redis LIST) via `RPUSH`. It is never silently dropped.
8. **Graceful Shutdown** — On `SIGINT`/`SIGTERM`, the worker sets `shuttingDown = true`, stops the promoter, and awaits a drain promise that resolves when `activeCount === 0`. In-flight jobs always complete before the process exits.
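The drain step can be sketched as follows (the real Worker presumably flips `shuttingDown` before draining so no new jobs are dequeued; `drain` and `checkEveryMs` are illustrative names, not the project's actual API):

```typescript
// Poll-based drain: resolves once no jobs are in flight. Callers install it
// from SIGINT/SIGTERM handlers, await it, then close connections and exit.
function drain(getActiveCount: () => number, checkEveryMs = 50): Promise<void> {
  return new Promise((resolve) => {
    const timer = setInterval(() => {
      if (getActiveCount() === 0) {
        clearInterval(timer);
        resolve();
      }
    }, checkEveryMs);
  });
}
```

Polling a counter is deliberately simple; an event-based alternative would resolve the promise directly from the spot where `activeCount` is decremented to zero.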
| Technology | Role |
|---|---|
| Node.js 18+ | Async runtime; native Promise / async-await throughout |
| TypeScript 5 | Full type safety on all job payloads via generics |
| ioredis 5 | Redis client; used directly to showcase raw Redis commands |
| Redis 7 | State storage: LISTs, ZSETs, Lua scripting |
| Docker | Reproducible local Redis environment with AOF persistence |
| uuid | Collision-resistant job IDs (v4) |
| dotenv | Twelve-factor app configuration via environment variables |
| Express | HTTP server for the HTML dashboard + REST API (src/server.ts) |