1iPluto/ws-gateway-scaling

Real-time Multi-tenant WebSocket Gateway

A production-grade, horizontally scalable WebSocket gateway built with Node.js, TypeScript, ws, and Redis Pub/Sub — designed to serve multiple independent tenant applications over a single shared infrastructure.


🎥 Quick Demo

See the gateway in action. This shows the interactive demo UI where messages from Client A are instantly fanned out to Client B via Redis Pub/Sub:


Table of Contents

  1. What This Is
  2. Architecture Overview
  3. Why Redis Pub/Sub for Horizontal Scaling
  4. How the Heartbeat Prevents Memory Leaks
  5. Project Structure
  6. Configuration Reference
  7. Local Setup
  8. Connecting a Client
  9. Wire Protocol
  10. Testing the Scaling Behaviour
  11. Health Endpoint
  12. Design Decisions & Trade-offs
  13. Roadmap / What I Would Add Next

What This Is

Most WebSocket tutorials show you a single-process chat server. This project goes further:

  • Multi-tenancy — One gateway binary serves many independent client applications (tenant-alpha, tenant-beta, …). Each tenant is isolated by its appId; messages never bleed across tenants.
  • Pre-handshake auth — Credentials are validated at the HTTP upgrade event, before the WebSocket handshake completes. Unauthorised clients are rejected with a 401 and their TCP socket is destroyed — no partial WebSocket state is ever created.
  • Horizontal scalability — Multiple gateway instances can run behind a load balancer. A message sent to one instance is automatically broadcast to clients on every other instance via Redis Pub/Sub.
  • Zombie-connection prevention — A server-driven heartbeat terminates unresponsive sockets immediately, keeping RSS stable under high churn.

Architecture Overview

                        ┌────────────────────────────────────────┐
                        │           Load Balancer / Proxy         │
                        │    (nginx, AWS ALB, Cloudflare, …)      │
                        └───────────┬────────────────┬────────────┘
                                    │                │
                   ws://host:3001   │                │   ws://host:3002
                                    ▼                ▼
                        ┌───────────────┐  ┌───────────────┐
                        │  Gateway      │  │  Gateway      │    … N instances
                        │  Instance 1   │  │  Instance 2   │
                        │               │  │               │
                        │  clients Map  │  │  clients Map  │
                        │  rooms Map    │  │  rooms Map    │
                        └──────┬────────┘  └───────┬───────┘
                               │  PUBLISH          │  PUBLISH
                               │                   │
                        ┌──────▼───────────────────▼───────┐
                        │                                  │
                        │           Redis 7                │
                        │   Channels:                      │
                        │   ws:msg:{appId}:{roomId}        │
                        │                                  │
                        └──────────────────────────────────┘
                               │  SUBSCRIBE        │  SUBSCRIBE
                               │                   │
                        ┌──────▼────────┐  ┌───────▼───────┐
                        │  Instance 1   │  │  Instance 2   │
                        │  subscriber   │  │  subscriber   │
                        └───────────────┘  └───────────────┘

Request Lifecycle (Happy Path)

Client                    Gateway Instance 1                    Redis                    Instance 2
  │                               │                               │                          │
  │── ws://…?appId=X&apiKey=Y ───▶│                               │                          │
  │                               │  validateTenant()             │                          │
  │                               │  ✓ auth OK                    │                          │
  │                               │  handleUpgrade → WS handshake │                          │
  │◀─────── 101 Switching ────────│                               │                          │
  │                               │  joinRoom()                   │                          │
  │                               │──── SUBSCRIBE ws:msg:X:room ─▶│                          │
  │                               │                               │◀── SUBSCRIBE ws:msg:X:room│
  │                               │                               │                          │
  │── { type:"message", … } ─────▶│                               │                          │
  │                               │──── PUBLISH ws:msg:X:room ───▶│                          │
  │                               │                               │──── dispatch to local ───▶│
  │                               │◀─── dispatch to local ────────│                    clients│
  │◀── to other room members ─────│  (sender excluded)            │                          │

Why Redis Pub/Sub for Horizontal Scaling

WebSocket connections are stateful and long-lived. Once a client upgrades, its TCP connection is pinned to one server process. If you run two gateway instances behind a load balancer, a message sent by a client on Instance 1 is initially known only to Instance 1's in-process memory.

Without a coordination layer, clients on Instance 2 would never see that message. Sticky sessions do not solve this: they keep each client pinned to one instance, but two members of the same room (or two tabs opened by the same user) can still land on different instances.

Why Pub/Sub specifically (vs. other options)?

| Approach | Trade-offs |
| --- | --- |
| Redis Pub/Sub | Fire-and-forget, no persistence overhead, low latency (typically sub-millisecond on a local network), built-in fan-out. A good fit for ephemeral real-time messages. |
| Redis Streams | Persistent, replayable, ordered. Use when clients need message history or at-least-once delivery. |
| Kafka / RabbitMQ | Best for durable event pipelines, but operationally heavy for a pure WS gateway. |
| gRPC mesh | Instances talk directly to each other; no central broker needed. Requires service discovery and a connection between every pair of instances. |

How it works in this codebase

Each gateway instance creates two Redis connections:

publisher  ──▶  PUBLISH ws:msg:{appId}:{roomId}  ──▶  Redis
subscriber ◀──  message callback                 ◀──  Redis

Two connections are needed because, under the RESP2 protocol, a connection that has entered subscriber mode can only issue subscription commands (SUBSCRIBE, UNSUBSCRIBE, PSUBSCRIBE, PUNSUBSCRIBE, plus PING and QUIT). It cannot run PUBLISH or any other command.

When a message arrives from a client:

  1. The message is wrapped in a RedisEnvelope tagged with the originating instanceId and connectionId.
  2. It is published to ws:msg:{appId}:{roomId}.
  3. Every instance subscribed to that channel, including the publisher, receives it via its subscriber connection.
  4. Each instance iterates its local rooms map and calls safeSend() for every locally connected socket in that room, skipping the original sender (detected via the originInstanceId + originConnectionId pair in the envelope).

This design requires zero coordination at connection time — instances are completely independent and do not need to know about each other.
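
The receive side of the steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/redis.ts: the RedisEnvelope fields other than originInstanceId and originConnectionId, the LocalClient interface, and dispatchToLocalClients are assumptions made for the example, and send() stands in for safeSend() on a real socket.

```typescript
// Hypothetical shapes: field names follow the README's description, but the
// real RedisEnvelope in src/types.ts may differ.
interface RedisEnvelope {
  originInstanceId: string;
  originConnectionId: string;
  appId: string;
  roomId: string;
  from: string;
  payload: unknown;
}

interface LocalClient {
  connectionId: string;
  send(data: string): void; // stands in for safeSend() on a real socket
}

const INSTANCE_ID = "instance-1";

// "appId:roomId" -> clients connected to THIS instance only
const rooms = new Map<string, LocalClient[]>();

// Called once for every envelope the Redis subscriber receives.
function dispatchToLocalClients(raw: string): number {
  const env = JSON.parse(raw) as RedisEnvelope;
  const members = rooms.get(`${env.appId}:${env.roomId}`) ?? [];
  let delivered = 0;
  for (const client of members) {
    const isSender =
      env.originInstanceId === INSTANCE_ID &&
      env.originConnectionId === client.connectionId;
    if (isSender) continue; // the original sender gets no echo
    client.send(
      JSON.stringify({ type: "message", payload: env.payload, from: env.from, roomId: env.roomId }),
    );
    delivered++;
  }
  return delivered;
}

// Two local clients in the same room; "conn-a" is the sender.
const inboxA: string[] = [];
const inboxB: string[] = [];
rooms.set("tenant-alpha:lobby", [
  { connectionId: "conn-a", send: (data) => { inboxA.push(data); } },
  { connectionId: "conn-b", send: (data) => { inboxB.push(data); } },
]);

const envelope: RedisEnvelope = {
  originInstanceId: "instance-1",
  originConnectionId: "conn-a",
  appId: "tenant-alpha",
  roomId: "lobby",
  from: "alice",
  payload: { text: "hello room!" },
};
const delivered = dispatchToLocalClients(JSON.stringify(envelope));
console.log(delivered, inboxA.length, inboxB.length); // 1 0 1
```

Because the sender check needs both IDs to match, an envelope that originated on another instance is delivered to every local member of the room.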


How the Heartbeat Prevents Memory Leaks

WebSocket connections can die silently. A client might:

  • Lose network access without sending a TCP FIN/RST.
  • Have its device sleep or battery die.
  • Be behind a NAT/firewall that silently drops idle connections.

In all these cases the server-side socket remains in ESTABLISHED state indefinitely. Left unchecked, these "zombie" connections accumulate in the clients Map, consuming memory and file descriptors until the process crashes or is restarted.

The solution: server-driven heartbeat

t=0s   server sends WebSocket PING frame  ──────────────────────▶  client
t=0s   server sets client.isAlive = false
                                          ◀──────────────────────  client replies with PONG (automatic)
t=ε    server receives PONG
t=ε    server sets client.isAlive = true

t=30s  server sends next PING ──────────────────────────────────▶  client
t=30s  server sets client.isAlive = false
       … if PONG never arrives before next tick …
t=60s  client.isAlive is still false
       server calls socket.terminate()  ← IMMEDIATE TCP teardown, no handshake
       'close' event fires → clients.delete(connectionId) → leaveRoom()
       Memory is reclaimed ✓

socket.terminate() is used deliberately — not socket.close(). The close handshake sends a CLOSE frame and waits for acknowledgement, which will never come from a zombie. terminate() calls the underlying socket.destroy() immediately, reclaiming the file descriptor without waiting.
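
The timeline above boils down to a small check-then-ping loop. The sketch below is one possible shape for it, assuming a clients Map keyed by connectionId; HeartbeatSocket, TrackedClient, sweep and onPong are illustrative names, and the socket is reduced to the two methods the loop needs.

```typescript
// Minimal stand-in for a ws socket; only the members used here.
interface HeartbeatSocket {
  ping(): void;
  terminate(): void;
}

interface TrackedClient {
  isAlive: boolean;
  socket: HeartbeatSocket;
}

const clients = new Map<string, TrackedClient>();

// One heartbeat tick. In a gateway this would run from
// setInterval(sweep, HEARTBEAT_INTERVAL_MS).
function sweep(): string[] {
  const terminated: string[] = [];
  clients.forEach((client, connectionId) => {
    if (!client.isAlive) {
      // No PONG since the previous tick: treat the socket as a zombie.
      client.socket.terminate();
      clients.delete(connectionId); // a real 'close' handler would do this
      terminated.push(connectionId);
      return;
    }
    client.isAlive = false; // set back to true by the 'pong' handler
    client.socket.ping();
  });
  return terminated;
}

function onPong(connectionId: string): void {
  const client = clients.get(connectionId);
  if (client) client.isAlive = true;
}

// Simulate two ticks: only "healthy" answers the first ping.
const noop = (): void => {};
clients.set("healthy", { isAlive: true, socket: { ping: noop, terminate: noop } });
clients.set("zombie", { isAlive: true, socket: { ping: noop, terminate: noop } });
sweep(); // tick 1: both are pinged and marked isAlive = false
onPong("healthy");
const dropped = sweep(); // tick 2: the silent client is terminated
console.log(dropped.length, clients.size); // 1 1
```

A connection that dies just after answering a ping is only caught at the second tick that follows, so worst-case detection time is roughly two heartbeat intervals.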


Project Structure

.
├── src/
│   ├── server.ts          # Entry point — HTTP server, bootstrap, graceful shutdown
│   ├── websocket.ts       # WebSocket server, upgrade handler, heartbeat, room logic
│   ├── redis.ts           # Publisher + subscriber clients, Pub/Sub helpers
│   ├── auth.ts            # Pre-handshake tenant validation middleware
│   ├── config.ts          # Typed, validated environment configuration
│   ├── logger.ts          # Structured JSON logger (NDJSON output)
│   └── types.ts           # Shared TypeScript interfaces
│
├── docker-compose.yml     # Redis 7 with AOF persistence
├── package.json
├── tsconfig.json
├── .env.example           # Documented environment variable template
└── README.md

Configuration Reference

Copy .env.example to .env and adjust as needed:

| Variable | Default | Description |
| --- | --- | --- |
| PORT | 3000 | TCP port the gateway listens on |
| INSTANCE_ID | auto-generated | Unique name for this process; used in Redis envelopes |
| REDIS_HOST | localhost | Redis server hostname |
| REDIS_PORT | 6379 | Redis server port |
| REDIS_PASSWORD | (empty) | Redis AUTH password (leave blank for no auth) |
| HEARTBEAT_INTERVAL_MS | 30000 | Ping interval in milliseconds |
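
As an illustration of what "typed, validated" configuration can mean for these variables, here is a minimal sketch. It is not the contents of src/config.ts: GatewayConfig, intFromEnv and loadConfig are hypothetical names, and the format of the auto-generated instance ID is a guess.

```typescript
interface GatewayConfig {
  port: number;
  instanceId: string;
  redisHost: string;
  redisPort: number;
  redisPassword: string | undefined;
  heartbeatIntervalMs: number;
}

type Env = Record<string, string | undefined>;

// Reads a positive integer, falling back to the documented default.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): GatewayConfig {
  return {
    port: intFromEnv(env, "PORT", 3000),
    // The README only says the ID is auto-generated; this format is assumed.
    instanceId: env["INSTANCE_ID"] || `instance-${Math.random().toString(16).slice(2, 10)}`,
    redisHost: env["REDIS_HOST"] || "localhost",
    redisPort: intFromEnv(env, "REDIS_PORT", 6379),
    redisPassword: env["REDIS_PASSWORD"] || undefined, // blank means no AUTH
    heartbeatIntervalMs: intFromEnv(env, "HEARTBEAT_INTERVAL_MS", 30000),
  };
}

// In a real process this would be loadConfig(process.env).
const config = loadConfig({ PORT: "3001", INSTANCE_ID: "instance-1" });
console.log(config.port, config.redisPort, config.heartbeatIntervalMs); // 3001 6379 30000
```

Failing fast on a malformed value such as PORT=abc surfaces the mistake at startup rather than as a confusing bind error later.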

Local Setup

Prerequisites

  • Node.js ≥ 18
  • Docker & Docker Compose

Steps

# 1. Clone the repo
git clone https://github.com/YOUR_USERNAME/ws-gateway.git
cd ws-gateway

# 2. Install dependencies
npm install

# 3. Copy and review environment config
cp .env.example .env

# 4. Start Redis
docker compose up -d

# 5. Start the gateway in development mode (hot-reload via tsx)
npm run dev

You should see structured JSON logs like:

{"timestamp":"2024-01-15T10:23:41.123Z","level":"info","message":"Redis connections established","host":"localhost","port":6379}
{"timestamp":"2024-01-15T10:23:41.145Z","level":"info","message":"WS Gateway is ready","port":3000,"instanceId":"instance-a1b2c3d4"}

Build for production

npm run build   # Compiles src/ → dist/
npm start       # Runs dist/server.js

Connecting a Client

WebSocket connection URL format:

ws://localhost:3000?appId=<tenant>&apiKey=<key>&roomId=<room>&userId=<user>

Mock tenants (pre-configured in src/auth.ts)

| appId | apiKey |
| --- | --- |
| tenant-alpha | key-alpha-secret-001 |
| tenant-beta | key-beta-secret-002 |
| tenant-gamma | key-gamma-secret-003 |
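
A pre-handshake check against these mock tenants might look like the sketch below. The validateTenant name appears in the lifecycle diagram, but its real signature in src/auth.ts is not shown in this README, so the signature, the ConnectionParams type, and the decision to require roomId and userId are all assumptions.

```typescript
interface ConnectionParams {
  appId: string;
  roomId: string;
  userId: string;
}

// Mock tenants, copied from the table above.
const TENANTS = new Map<string, string>([
  ["tenant-alpha", "key-alpha-secret-001"],
  ["tenant-beta", "key-beta-secret-002"],
  ["tenant-gamma", "key-gamma-secret-003"],
]);

// Returns the connection parameters when the credentials match, else null.
function validateTenant(requestUrl: string): ConnectionParams | null {
  // An upgrade request's URL is only path + query, so a base is needed to parse it.
  const query = new URL(requestUrl, "ws://localhost").searchParams;
  const appId = query.get("appId");
  const apiKey = query.get("apiKey");
  const roomId = query.get("roomId");
  const userId = query.get("userId");
  if (!appId || !apiKey || !roomId || !userId) return null;
  if (TENANTS.get(appId) !== apiKey) return null;
  return { appId, roomId, userId };
}

const accepted = validateTenant(
  "/?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=alice",
);
const rejected = validateTenant("/?appId=tenant-alpha&apiKey=wrong&roomId=lobby&userId=alice");
console.log(accepted !== null, rejected === null); // true true
```

The apiKey is deliberately left out of the returned object so that it does not end up in per-connection state or logs.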

Quick test with websocat

# Install: https://github.com/vi/websocat
websocat "ws://localhost:3000?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=alice"

On connection you will receive a welcome system message:

{
  "type": "system",
  "payload": {
    "message": "Connected to WS Gateway",
    "connectionId": "550e8400-e29b-41d4-a716-446655440000",
    "roomId": "lobby",
    "instanceId": "instance-a1b2c3d4"
  },
  "timestamp": "2024-01-15T10:23:45.000Z"
}

Quick test with wscat

npm install -g wscat
wscat -c "ws://localhost:3000?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=bob"

Wire Protocol

All messages are UTF-8 encoded JSON.

Client → Server

// Send a message to your current room
{ "type": "message", "payload": { "text": "hello room!" } }

// Broadcast (semantically identical in this implementation — alias for clarity)
{ "type": "broadcast", "payload": { "text": "hey everyone!" } }

// Application-level ping (for clients that cannot send WS control frames)
{ "type": "ping" }

Server → Client

// Room message from another user
{
  "type": "message",
  "payload": { "text": "hello room!" },
  "from": "alice",
  "roomId": "lobby",
  "timestamp": "2024-01-15T10:23:50.000Z"
}

// System notification (connection events, etc.)
{
  "type": "system",
  "payload": { "message": "Connected to WS Gateway", "connectionId": "<uuid>" },
  "timestamp": "<ISO 8601 timestamp>"
}

// Error
{
  "type": "error",
  "payload": { "message": "Invalid JSON — all messages must be valid JSON objects." },
  "timestamp": "<ISO 8601 timestamp>"
}

// Pong (reply to application-level ping)
{ "type": "pong", "payload": {}, "timestamp": "<ISO 8601 timestamp>" }
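
On the server side, the client-to-server half of this protocol can be modelled as a small discriminated union plus a parser that never throws. The sketch below is an assumption about how that could be written; ClientMessage, ParseResult and parseClientMessage are illustrative names, and the error strings are placeholders rather than the gateway's actual messages.

```typescript
type ClientMessage =
  | { type: "message" | "broadcast"; payload: Record<string, unknown> }
  | { type: "ping" };

type ParseResult =
  | { ok: true; message: ClientMessage }
  | { ok: false; error: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Turns a raw text frame into a typed message, or an error to send back.
function parseClientMessage(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "Invalid JSON" };
  }
  if (!isRecord(data)) return { ok: false, error: "Message must be a JSON object" };

  const type = data["type"];
  if (type === "ping") return { ok: true, message: { type: "ping" } };
  if (type !== "message" && type !== "broadcast") {
    return { ok: false, error: "Unknown message type" };
  }
  const payload = data["payload"];
  if (!isRecord(payload)) return { ok: false, error: "payload must be an object" };
  const kind = type === "message" ? "message" : "broadcast";
  return { ok: true, message: { type: kind, payload } };
}

const parsed = parseClientMessage('{ "type": "message", "payload": { "text": "hello room!" } }');
const garbage = parseClientMessage("not json");
console.log(parsed.ok, garbage.ok); // true false
```

Returning a result object instead of throwing keeps the socket's message handler to a single branch: forward the message, or reply with an error frame.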

Rejected connection (HTTP 401)

HTTP/1.1 401 Unauthorized
Content-Type: text/plain
Connection: close

Unauthorized
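
A rejection like this can be written straight to the raw TCP socket during the upgrade event, before any WebSocket object exists. The following is a minimal sketch under that assumption; RawSocket and rejectUnauthorized are hypothetical names, and RawSocket lists only the two methods used (a real handler would receive a stream.Duplex).

```typescript
// Only the two socket methods this sketch needs.
interface RawSocket {
  write(chunk: string): unknown;
  destroy(): unknown;
}

function rejectUnauthorized(socket: RawSocket): void {
  // HTTP requires CRLF line endings and an empty line before the body.
  const response = [
    "HTTP/1.1 401 Unauthorized",
    "Content-Type: text/plain",
    "Connection: close",
    "",
    "Unauthorized",
  ].join("\r\n");
  socket.write(response);
  socket.destroy(); // no WebSocket handshake ever takes place
}

// Exercise it with a fake socket that records what happened.
const recorder = { written: [] as string[], destroyed: false };
rejectUnauthorized({
  write: (chunk) => recorder.written.push(chunk),
  destroy: () => { recorder.destroyed = true; },
});
console.log(recorder.written[0].split("\r\n")[0]); // HTTP/1.1 401 Unauthorized
```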

Testing the Scaling Behaviour

This is the most interesting part to demo to a recruiter.

Step 1 — Spin up two gateway instances

In two separate terminals:

# Terminal 1
PORT=3001 INSTANCE_ID=instance-1 npm run dev

# Terminal 2
PORT=3002 INSTANCE_ID=instance-2 npm run dev

Both instances share the same Redis (already running via Docker Compose).

Step 2 — Connect two clients to different instances, same room

# Client A  →  Instance 1
wscat -c "ws://localhost:3001?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=alice"

# Client B  →  Instance 2 (different port = different process)
wscat -c "ws://localhost:3002?appId=tenant-alpha&apiKey=key-alpha-secret-001&roomId=lobby&userId=bob"

Step 3 — Send a message from Client A

In Client A's terminal, type and send:

{ "type": "message", "payload": { "text": "Cross-instance message!" } }

Client B receives it almost immediately, even though it is connected to a completely different Node.js process. This is the Redis Pub/Sub fan-out in action.

Step 4 — Verify tenant isolation

Connect a client to tenant-beta in the same room name:

wscat -c "ws://localhost:3001?appId=tenant-beta&apiKey=key-beta-secret-002&roomId=lobby&userId=carol"

Messages sent by Alice are not received by Carol — the Redis channel name encodes the appId, keeping tenants fully isolated.
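
The isolation comes down to how the channel name is built. The helper below is a small sketch of that idea; channelFor and the SAFE_ID character whitelist are assumptions added for illustration. The whitelist matters because ":" is the separator: without it, appId "a:b" with room "c" and appId "a" with room "b:c" would map to the same channel.

```typescript
// Assumed whitelist: letters, digits, underscore and hyphen only.
const SAFE_ID = /^[A-Za-z0-9_-]+$/;

function channelFor(appId: string, roomId: string): string {
  if (!SAFE_ID.test(appId) || !SAFE_ID.test(roomId)) {
    throw new Error("appId and roomId may only contain letters, digits, '_' and '-'");
  }
  return `ws:msg:${appId}:${roomId}`;
}

const alphaLobby = channelFor("tenant-alpha", "lobby");
const betaLobby = channelFor("tenant-beta", "lobby");
console.log(alphaLobby); // ws:msg:tenant-alpha:lobby
console.log(alphaLobby === betaLobby); // false
```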


Health Endpoint

curl http://localhost:3000/health

Example response:

{
  "status": "ok",
  "instanceId": "instance-a1b2c3d4",
  "totalConnections": 3,
  "totalRooms": 2,
  "rooms": {
    "tenant-alpha:lobby": 2,
    "tenant-beta:general": 1
  },
  "uptimeSeconds": 142
}

Point your load balancer's health probe at this endpoint so that an instance that stops responding is taken out of rotation, for example during a rolling update.
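
For illustration, a response with the fields shown in the example could be assembled from the in-memory rooms Map along these lines. buildHealth and HealthSnapshot are hypothetical names, and the sketch assumes each connection belongs to exactly one room, so the connection total is the sum of the room sizes.

```typescript
interface HealthSnapshot {
  status: "ok";
  instanceId: string;
  totalConnections: number;
  totalRooms: number;
  rooms: Record<string, number>;
  uptimeSeconds: number;
}

// rooms maps "appId:roomId" to the set of local connectionIds in that room.
function buildHealth(
  instanceId: string,
  rooms: Map<string, Set<string>>,
  startedAtMs: number,
  nowMs: number,
): HealthSnapshot {
  const roomSizes: Record<string, number> = {};
  let totalConnections = 0;
  rooms.forEach((members, roomKey) => {
    roomSizes[roomKey] = members.size;
    totalConnections += members.size; // assumes one room per connection
  });
  return {
    status: "ok",
    instanceId,
    totalConnections,
    totalRooms: rooms.size,
    rooms: roomSizes,
    uptimeSeconds: Math.floor((nowMs - startedAtMs) / 1000),
  };
}

const sampleRooms = new Map<string, Set<string>>();
sampleRooms.set("tenant-alpha:lobby", new Set(["c1", "c2"]));
sampleRooms.set("tenant-beta:general", new Set(["c3"]));
const health = buildHealth("instance-a1b2c3d4", sampleRooms, 0, 142500);
console.log(health.totalConnections, health.totalRooms, health.uptimeSeconds); // 3 2 142
```

These numbers describe one instance only; a global view would need shared state such as the Redis connection tracking listed in the roadmap.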


Design Decisions & Trade-offs

Raw ws over Socket.io

Socket.io is a higher-level abstraction that adds its own protocol on top of WebSocket (and falls back to HTTP long-polling). Using raw ws forces an explicit understanding of the WebSocket protocol (RFC 6455): framing, control frames (ping/pong/close), and the upgrade handshake. It also eliminates ~200 KB of transitive dependencies and makes the gateway agnostic to client language/framework.

noServer: true on WebSocketServer

By passing noServer: true, the WebSocketServer does not bind its own HTTP server or attach its own upgrade listener. This gives us full control over the upgrade event — we can reject connections with a proper HTTP response before any WebSocket state is allocated, rather than rejecting after the handshake.

Two Redis connections per instance

ioredis follows the Redis protocol specification: a connection in subscribe mode is dedicated and cannot issue PUBLISH. Two connections are therefore required per instance, regardless of how many rooms or tenants are active. The subscription count scales with distinct rooms active on this instance, not with the number of connected clients.
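
One way to get that scaling property is to subscribe when the first local client joins a room and unsubscribe when the last one leaves. The sketch below shows the bookkeeping only and is an assumption about the approach, not the gateway's code: RoomRegistry is a hypothetical class, and Subscriber is a synchronous stand-in for the Redis subscriber connection (the real subscribe and unsubscribe calls are asynchronous).

```typescript
// Stand-in for the Redis subscriber connection.
interface Subscriber {
  subscribe(channel: string): void;
  unsubscribe(channel: string): void;
}

class RoomRegistry {
  private readonly rooms = new Map<string, Set<string>>();
  private readonly subscriber: Subscriber;

  constructor(subscriber: Subscriber) {
    this.subscriber = subscriber;
  }

  join(channel: string, connectionId: string): void {
    let members = this.rooms.get(channel);
    if (!members) {
      members = new Set<string>();
      this.rooms.set(channel, members);
      this.subscriber.subscribe(channel); // first local member of this room
    }
    members.add(connectionId);
  }

  leave(channel: string, connectionId: string): void {
    const members = this.rooms.get(channel);
    if (!members || !members.delete(connectionId)) return;
    if (members.size === 0) {
      this.rooms.delete(channel);
      this.subscriber.unsubscribe(channel); // last local member has left
    }
  }
}

// Record the subscriber calls made for two clients sharing one room.
const calls: string[] = [];
const registry = new RoomRegistry({
  subscribe: (channel) => { calls.push(`SUBSCRIBE ${channel}`); },
  unsubscribe: (channel) => { calls.push(`UNSUBSCRIBE ${channel}`); },
});
registry.join("ws:msg:tenant-alpha:lobby", "conn-a");
registry.join("ws:msg:tenant-alpha:lobby", "conn-b");
registry.leave("ws:msg:tenant-alpha:lobby", "conn-a");
registry.leave("ws:msg:tenant-alpha:lobby", "conn-b");
console.log(calls.length); // 2
```

Two clients in the same room cost one SUBSCRIBE and one UNSUBSCRIBE in total, regardless of how many clients share the room.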

In-memory state (not Redis for connection tracking)

Connection metadata (clients Map, rooms Map) is kept in process memory, not in Redis. This is intentional:

  • Read path is O(1) — a direct Map lookup.
  • Redis is not a bottleneck — it is only hit on message publish/receive, not on every state read.
  • The trade-off is that a process crash loses its slice of state. Clients are expected to reconnect, and room state and Redis subscriptions are rebuilt as they rejoin. Because Redis Pub/Sub does not persist messages, anything published while a client is disconnected is not replayed to it.

socket.terminate() over socket.close()

close() initiates the WebSocket closing handshake and waits for a CLOSE frame reply. For a zombie connection (device asleep, NAT timeout, etc.) this reply will never come, so the socket lingers. terminate() calls net.Socket.destroy() immediately, bypassing the handshake, which is the correct behaviour when we have already decided the connection is dead.


Roadmap / What I Would Add Next

  • TLS / wss:// — Pass the HTTPS Server instance to initWebSocketServer and terminate TLS at the gateway (or upstream at the load balancer).
  • Redis connection tracking — Store HSET ws:connections:{appId}:{roomId} {connectionId} {instanceId} so the health endpoint can report global (cross-instance) room populations.
  • Rate limiting — Track message counts per connectionId in a sliding window and disconnect clients that exceed a threshold.
  • JWT auth — Replace the hardcoded apiKey check with jsonwebtoken verification to support short-lived, revocable credentials.
  • Metrics — Expose a /metrics endpoint in Prometheus text format (connections, messages/sec, Redis round-trip latency) for Grafana dashboards.
  • Kubernetes manifests: Deployment + HorizontalPodAutoscaler + PodDisruptionBudget for zero-downtime rolling updates.
  • Integration tests — Spin up two gateway instances and a real Redis with testcontainers-node and assert cross-instance delivery in CI.

License

MIT — see LICENSE.
