A production-grade, horizontally scalable WebSocket gateway built with Node.js, TypeScript,
ws, and Redis Pub/Sub — designed to serve multiple independent tenant applications over a single shared infrastructure.
See the gateway in action. This shows the interactive demo UI where messages from Client A are instantly fanned out to Client B via Redis Pub/Sub:
(Note: If the video above does not play on GitHub, you can download/view it here)
- What This Is
- Architecture Overview
- Why Redis Pub/Sub for Horizontal Scaling
- How the Heartbeat Prevents Memory Leaks
- Project Structure
- Configuration Reference
- Local Setup
- Connecting a Client
- Wire Protocol
- Testing the Scaling Behaviour
- Health Endpoint
- Design Decisions & Trade-offs
- Roadmap / What I Would Add Next
Most WebSocket tutorials show you a single-process chat server. This project goes further:
- Multi-tenancy — One gateway binary serves many independent client applications (
tenant-alpha,tenant-beta, …). Each tenant is isolated by itsappId; messages never bleed across tenants. - Pre-handshake auth — Credentials are validated at the HTTP
upgradeevent, before the WebSocket handshake completes. Unauthorised clients are rejected with a401and their TCP socket is destroyed — no partial WebSocket state is ever created. - Horizontal scalability — Multiple gateway instances can run behind a load balancer. A message sent to one instance is automatically broadcast to clients on every other instance via Redis Pub/Sub.
- Zombie-connection prevention — A server-driven heartbeat terminates unresponsive sockets immediately, keeping RSS stable under high churn.
┌────────────────────────────────────────┐
│ Load Balancer / Proxy │
│ (nginx, AWS ALB, Cloudflare, …) │
└───────────┬────────────────┬────────────┘
│ │
ws://host:3001 │ │ ws://host:3002
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Gateway │ │ Gateway │ … N instances
│ Instance 1 │ │ Instance 2 │
│ │ │ │
│ clients Map │ │ clients Map │
│ rooms Map │ │ rooms Map │
└──────┬────────┘ └───────┬───────┘
│ PUBLISH │ PUBLISH
│ │
┌──────▼───────────────────▼───────┐
│ │
│ Redis 7 │
│ Channels: │
│ ws:msg:{appId}:{roomId} │
│ │
└────────────────────────────────── ┘
│ SUBSCRIBE │ SUBSCRIBE
│ │
┌──────▼────────┐ ┌───────▼───────┐
│ Instance 1 │ │ Instance 2 │
│ subscriber │ │ subscriber │
└───────────────┘ └───────────────┘
Client Gateway Instance 1 Redis Instance 2
│ │ │ │
│── ws://…?appId=X&apiKey=Y ───▶│ │ │
│ │ validateTenant() │ │
│ │ ✓ auth OK │ │
│ │ handleUpgrade → WS handshake │ │
│◀─────── 101 Switching ────────│ │ │
│ │ joinRoom() │ │
│ │──── SUBSCRIBE ws:msg:X:room ─▶│ │
│ │ │◀── SUBSCRIBE ws:msg:X:room│
│ │ │ │
│── { type:"message", … } ─────▶│ │ │
│ │──── PUBLISH ws:msg:X:room ───▶│ │
│ │ │──── dispatch to local ───▶│
│ │◀─── dispatch to local ────────│ clients│
│◀────── message echoed ────────│ (sender excluded) │ │
WebSocket connections are stateful and long-lived. Once a client upgrades, its TCP connection is pinned to one server process. If you run two gateway instances behind a load balancer, a message sent by a client on Instance 1 is initially known only to Instance 1's in-process memory.
Without a coordination layer, clients on Instance 2 would never see that message. The naive fix — sticky sessions — partially alleviates the problem but doesn't help when a user has multiple tabs open on different instances.
| Approach | Trade-offs |
|---|---|
| Redis Pub/Sub ✓ | Fire-and-forget, zero persistence overhead, sub-millisecond latency, built-in fan-out. Perfect for ephemeral real-time messages. |
| Redis Streams | Persistent, replayable, ordered. Use when clients need message history or exactly-once delivery. |
| Kafka / RabbitMQ | Best for durable event pipelines, but operationally heavy for a pure WS gateway. |
| gRPC mesh | Instances talk directly to each other; no central broker needed. Increases latency and requires service discovery. |
Each gateway instance creates two Redis connections:
publisher ──▶ PUBLISH ws:msg:{appId}:{roomId} ──▶ Redis
subscriber ◀── message callback ◀── Redis
Redis mandates separate connections for pub and sub because once a client enters subscribe mode it can only issue
SUBSCRIBE/UNSUBSCRIBE— it cannot runPUBLISHor any other command.
When a message arrives from a client:
- The message is wrapped in a
RedisEnvelopetagged with the originatinginstanceIdandconnectionId. - It is published to
ws:msg:{appId}:{roomId}. - Every instance — including the publisher — receives it via its subscriber.
- Each instance iterates its local
roomsmap and callssafeSend()for every locally connected socket in that room, skipping the original sender (detected via theoriginInstanceId+originConnectionIdpair in the envelope).
This design requires zero coordination at connection time — instances are completely independent and do not need to know about each other.
WebSocket connections can die silently. A client might:
- Lose network access without sending a TCP FIN/RST.
- Have its device sleep or battery die.
- Be behind a NAT/firewall that silently drops idle connections.
In all these cases the server-side socket remains in ESTABLISHED state indefinitely. Left unchecked, these "zombie" connections accumulate in the clients Map, consuming memory and file descriptors until the process crashes or is restarted.
t=0s server sends WebSocket PING frame ──────────────────────▶ client
t=0s server sets client.isAlive = false
◀────────────────────── client replies with PONG (automatic)
t=ε server receives PONG
t=ε server sets client.isAlive = true
t=30s server sends next PING ──────────────────────────────────▶ client
t=30s server sets client.isAlive = false
… if PONG never arrives before next tick …
t=60s client.isAlive is still false
server calls socket.terminate() ← IMMEDIATE TCP teardown, no handshake
'close' event fires → clients.delete(connectionId) → leaveRoom()
Memory is reclaimed ✓
socket.terminate() is used deliberately — not socket.close(). The close handshake sends a CLOSE frame and waits for acknowledgement, which will never come from a zombie. terminate() calls the underlying socket.destroy() immediately, reclaiming the file descriptor without waiting.
.
├── src/
│ ├── server.ts # Entry point — HTTP server, bootstrap, graceful shutdown
│ ├── websocket.ts # WebSocket server, upgrade handler, heartbeat, room logic
│ ├── redis.ts # Publisher + subscriber clients, Pub/Sub helpers
│ ├── auth.ts # Pre-handshake tenant validation middleware
│ ├── config.ts # Typed, validated environment configuration
│ ├── logger.ts # Structured JSON logger (NDJSON output)
│ └── types.ts # Shared TypeScript interfaces
│
├── docker-compose.yml # Redis 7 with AOF persistence
├── package.json
├── tsconfig.json
├── .env.example # Documented environment variable template
└── README.md
Copy .env.example to .env and adjust as needed:
| Variable | Default | Description |
|---|---|---|
PORT |
3000 |
TCP port the gateway listens on |
INSTANCE_ID |
auto-generated | Unique name for this process; used in Redis envelopes |
REDIS_HOST |
localhost |
Redis server hostname |
REDIS_PORT |
6379 |
Redis server port |
REDIS_PASSWORD |
(empty) | Redis AUTH password (leave blank for no auth) |
HEARTBEAT_INTERVAL_MS |
30000 |
Ping interval in milliseconds |
- Node.js ≥ 18
- Docker & Docker Compose
# 1. Clone the repo
git clone https://github.com/YOUR_USERNAME/ws-gateway.git
cd ws-gateway
# 2. Install dependencies
npm install
# 3. Copy and review environment config
cp .env.example .env
# 4. Start Redis
docker compose up -d
# 5. Start the gateway in development mode (hot-reload via tsx)
npm run devYou should see structured JSON logs like:
{"timestamp":"2024-01-15T10:23:41.123Z","level":"info","message":"Redis connections established","host":"localhost","port":6379}
{"timestamp":"2024-01-15T10:23:41.145Z","level":"info","message":"WS Gateway is ready","port":3000,"instanceId":"instance-a1b2c3d4"}npm run build # Compiles src/ → dist/
npm start # Runs dist/server.jsWebSocket connection URL format:
ws://localhost:3000?appId=<tenant>&apiKey=<key>&roomId=<room>&userId=<user>
appId |
apiKey |
|---|---|
tenant-alpha |
key-alpha-secret-001 |
tenant-beta |
key-beta-secret-002 |
tenant-gamma |
key-gamma-secret-003 |
# Install: https://github.com/vi/websocat
websocat "ws://localhost:3000?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=alice"On connection you will receive a welcome system message:
{
"type": "system",
"payload": {
"message": "Connected to WS Gateway",
"connectionId": "550e8400-e29b-41d4-a716-446655440000",
"roomId": "lobby",
"instanceId": "instance-a1b2c3d4"
},
"timestamp": "2024-01-15T10:23:45.000Z"
}npm install -g wscat
wscat -c "ws://localhost:3000?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=bob"All messages are UTF-8 encoded JSON.
// Room message from another user
{
"type": "message",
"payload": { "text": "hello room!" },
"from": "alice",
"roomId": "lobby",
"timestamp": "2024-01-15T10:23:50.000Z"
}
// System notification (connection events, etc.)
{
"type": "system",
"payload": { "message": "Connected to WS Gateway", "connectionId": "…" },
"timestamp": "…"
}
// Error
{
"type": "error",
"payload": { "message": "Invalid JSON — all messages must be valid JSON objects." },
"timestamp": "…"
}
// Pong (reply to application-level ping)
{ "type": "pong", "payload": {}, "timestamp": "…" }HTTP/1.1 401 Unauthorized
Content-Type: text/plain
Connection: close
Unauthorized
This is the most interesting part to demo to a recruiter.
In two separate terminals:
# Terminal 1
PORT=3001 INSTANCE_ID=instance-1 npm run dev
# Terminal 2
PORT=3002 INSTANCE_ID=instance-2 npm run devBoth instances share the same Redis (already running via Docker Compose).
# Client A → Instance 1
wscat -c "ws://localhost:3001?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=alice"
# Client B → Instance 2 (different port = different process)
wscat -c "ws://localhost:3002?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=bob"In Client A's terminal, type and send:
{ "type": "message", "payload": { "text": "Cross-instance message!" } }Client B receives it instantly, even though it is connected to a completely different Node.js process. This is the Redis Pub/Sub fanout in action.
Connect a client to tenant-beta in the same room name:
wscat -c "ws://localhost:3001?appId=tenant-beta&apiKey=key-beta-secret-002&roomId=lobby&userId=carol"Messages sent by Alice are not received by Carol — the Redis channel name encodes the appId, keeping tenants fully isolated.
curl http://localhost:3000/health{
"status": "ok",
"instanceId": "instance-a1b2c3d4",
"totalConnections": 3,
"totalRooms": 2,
"rooms": {
"tenant-alpha:lobby": 2,
"tenant-beta:general": 1
},
"uptimeSeconds": 142
}Use this with your load balancer's health probe to automatically drain an instance before rolling updates.
Socket.io is a higher-level abstraction that adds its own protocol on top of WebSocket (and falls back to HTTP long-polling). Using raw ws forces an explicit understanding of the WebSocket RFC — framing, control frames (ping/pong/close), and the upgrade handshake. It also eliminates ~200 KB of transitive dependencies and makes the gateway agnostic to client language/framework.
By passing noServer: true, the WebSocketServer does not bind its own HTTP server or attach its own upgrade listener. This gives us full control over the upgrade event — we can reject connections with a proper HTTP response before any WebSocket state is allocated, rather than rejecting after the handshake.
ioredis follows the Redis protocol specification: a connection in subscribe mode is dedicated and cannot issue PUBLISH. Two connections are therefore required per instance, regardless of how many rooms or tenants are active. The subscription count scales with distinct rooms active on this instance, not with the number of connected clients.
Connection metadata (clients Map, rooms Map) is kept in process memory, not in Redis. This is intentional:
- Read path is O(1) — a direct Map lookup.
- Redis is not a bottleneck — it is only hit on message publish/receive, not on every state read.
- The trade-off is that a process crash loses its slice of state. Clients reconnect automatically (expected behaviour for a gateway), and because pub/sub subscriptions are re-established on startup, no message is permanently lost for active connections.
close() initiates the WebSocket closing handshake and waits for a CLOSE frame reply. For a zombie connection (device asleep, NAT timeout, etc.) this reply will never come, so the socket lingers. terminate() calls net.Socket.destroy() immediately, bypassing the handshake, which is the correct behaviour when we have already decided the connection is dead.
- TLS /
wss://— Pass the HTTPSServerinstance toinitWebSocketServerand terminate TLS at the gateway (or upstream at the load balancer). - Redis connection tracking — Store
HSET ws:connections:{appId}:{roomId} {connectionId} {instanceId}so the health endpoint can report global (cross-instance) room populations. - Rate limiting — Track message counts per
connectionIdin a sliding window and disconnect clients that exceed a threshold. - JWT auth — Replace the hardcoded
apiKeycheck withjsonwebtokenverification to support short-lived, revocable credentials. - Metrics — Expose a
/metricsendpoint in Prometheus text format (connections, messages/sec, Redis round-trip latency) for Grafana dashboards. - Kubernetes manifests —
Deployment+HorizontalPodAutoscaler+PodDisruptionBudgetfor zero-downtime rolling updates. - Integration tests — Spin up two gateway instances and a real Redis with
testcontainers-nodeand assert cross-instance delivery in CI.
MIT — see LICENSE.