A production-ready, horizontally scalable sync backend for tldraw, designed to run on Google Cloud Platform (GKE).
This project implements a Stateful Room Ownership model to safely support real-time collaboration at scale.
- Stateful Room Architecture
  - Each room is owned by a single pod at any given time.
  - Prevents split-brain and data corruption.
- Redis Distributed Locking
  - Guarantees exclusive room ownership across pods.
  - Auto-renewed locks with safe expiration handling.
- Two-Phase Coordinated Handover
  - Safe room migration during scaling events.
  - Users disconnected only after new pod is ready.
  - Zero data loss during pod transitions.
- Google Cloud Storage Persistence
  - Room snapshots and assets are persisted to GCS.
  - Ensures durability across pod restarts and deployments.
- WebSocket-based Sync
  - Powered by @tldraw/sync-core.
  - Low-latency real-time collaboration.
  - Server-side keep-alive to prevent idle timeouts.
- Graceful Shutdown
  - Active rooms are force-saved on shutdown.
  - Redis locks are released immediately to allow fast reconnection.
- Container & GKE Ready
  - Designed for Docker, GKE, and CI/CD pipelines.

tldraw-client (Example App)
|
| WebSocket
v
GCP Network Load Balancer
|
v
NGINX Ingress Controller (Consistent Hashing by URI)
|
v
GKE Pods (Node.js)
|
| Redis Lock (room ownership)
| Redis Pub/Sub (handover coordination)
v
Redis
|
| Snapshots / Assets
v
Google Cloud Storage
Key: NGINX uses upstream-hash-by: "$uri" to route requests for the same room to the same pod. When pods scale, a coordinated handover protocol ensures safe room migration.
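
To illustrate why hashing on the URI keeps a room pinned to one pod, and why scaling moves only some rooms, here is a minimal consistent-hash ring. It is a sketch of the general technique only, not NGINX's actual implementation; the `HashRing` class, the FNV-1a hash, the pod names, and the virtual-node count are all assumptions made for illustration.

```typescript
// Illustrative consistent-hash ring (hypothetical; not NGINX's implementation).

// 32-bit FNV-1a hash of a string.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

interface RingPoint {
  hash: number;
  pod: string;
}

class HashRing {
  private points: RingPoint[] = [];
  private readonly virtualNodes: number;

  constructor(pods: string[], virtualNodes: number = 100) {
    this.virtualNodes = virtualNodes;
    for (const pod of pods) this.addPod(pod);
  }

  // Place `virtualNodes` points for the pod on the ring and keep the ring sorted.
  addPod(pod: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      this.points.push({ hash: fnv1a(`${pod}#${i}`), pod });
    }
    this.points.sort((a, b) => a.hash - b.hash);
  }

  // Route a URI to the first point at or after its hash, wrapping around the ring.
  lookup(uri: string): string {
    if (this.points.length === 0) throw new Error("no pods on the ring");
    const h = fnv1a(uri);
    const point = this.points.find((p) => p.hash >= h) ?? this.points[0];
    return point.pod;
  }
}

const ring = new HashRing(["pod-a", "pod-b", "pod-c"]);
const uri = "/api/connect/room-123";
const before = ring.lookup(uri);
console.log(before === ring.lookup(uri)); // true: same URI, same pod

ring.addPod("pod-d"); // scale up
const after = ring.lookup(uri);
// The room either stays put or moves to the new pod; it never moves between old pods.
console.log(after === before || after === "pod-d"); // true
```

Rooms that do move to a new pod are the ones that need the coordinated handover described above.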
- Node.js v20+
- Yarn v4.11.0+
- Redis (local or Docker)
- Google Cloud Storage bucket
- gcloud CLI (recommended)

Copy the example environment file:

    cp .env.example .env

Then set the values in .env:

    # Server
    PORT=3001
    NODE_ENV=development
    # Redis (Room Locking)
    REDIS_URL=redis://localhost:6379
    # Google Cloud Storage
    GCS_BUCKET_NAME=your-tldraw-bucket-name
    # Optional (local dev only)
    GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

Install dependencies:

    yarn install

Start Redis in Docker:

    docker run --name tldraw-redis -p 6379:6379 -d redis

Start the development server:

    yarn dev

Backend will be available at:
http://localhost:3001
This repository includes a fully working example frontend located in:
/tldraw-client
The client demonstrates real-world integration with this sync backend using @tldraw/sync and can be used to quickly validate your setup.
Note: Ensure the backend server is already running before starting the client.

    cd tldraw-client
    npm install
    npm run dev

The client will start (usually on):
http://localhost:5173
- WebSocket connection to /api/connect/:roomId
- Automatic reconnect handling
- Real-time multi-user collaboration
- Compatibility with Redis room locking & GCS persistence
This client is intended for testing, debugging, and reference, not production deployment.
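
As a small illustration of the connection endpoint listed above, the helper below builds a WebSocket URI for a room from an HTTP(S) base URL. `buildConnectUri` is a hypothetical helper, not part of this repository or of @tldraw/sync; it assumes the `/api/connect/:roomId` path and percent-encodes the room ID so that unusual characters stay inside a single path segment.

```typescript
// Hypothetical helper: derive the WebSocket URI for a room from an HTTP(S) base URL.
function buildConnectUri(baseUrl: string, roomId: string): string {
  // Swap only the leading scheme (http -> ws, https -> wss) and drop trailing slashes.
  const wsBase = baseUrl.replace(/^http/, "ws").replace(/\/+$/, "");
  // Encode the room ID so characters such as spaces do not break the path.
  return `${wsBase}/api/connect/${encodeURIComponent(roomId)}`;
}

console.log(buildConnectUri("http://localhost:3001", "room-123"));
// ws://localhost:3001/api/connect/room-123
console.log(buildConnectUri("https://your-gcp-loadbalancer.com/", "team a"));
// wss://your-gcp-loadbalancer.com/api/connect/team%20a
```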
Example React integration using @tldraw/sync.

    import { useSync } from "@tldraw/sync"
    import { Tldraw } from "tldraw"

    const roomId = "room-123"

    // Backend base URL: the load balancer in production, the local server otherwise.
    const WORKER_URL = import.meta.env.PROD
      ? "https://your-gcp-loadbalancer.com"
      : "http://localhost:3001"

    export function CollaborationRoom() {
      // Swap the scheme (http -> ws, https -> wss) and target the room's connect endpoint.
      const wsUri = `${WORKER_URL.replace("http", "ws")}/api/connect/${roomId}`

      // useSync returns a store that is kept in sync with the room on the backend.
      const store = useSync({
        uri: wsUri,
      })

      return (
        <div style={{ position: "fixed", inset: 0 }}>
          <Tldraw store={store} />
        </div>
      )
    }

For a complete step-by-step guide to deploy from scratch, see:
Manual GCP Deployment Guide
This covers:
- GCP project setup and API enablement
- Terraform infrastructure provisioning
- GKE cluster configuration
- NGINX Ingress installation
- Docker image build and push
- Kubernetes manifest deployment

To build and run the image locally:

    docker build -t tldraw-sync-gcp .
    docker run -p 3001:3001 --env-file .env tldraw-sync-gcp

For existing deployments, the typical CI/CD pipeline is:
- Build Docker image (with --platform linux/amd64)
- Push to Google Artifact Registry
- Deploy to GKE via kubectl set image
- Rolling update with zero downtime
See .github/workflows/deploy.yaml for the GitHub Actions workflow.
Ensure NGINX Ingress is configured with upstream-hash-by: "$uri" for consistent room routing. See kubernetes/ingress.yaml.
| Code | Meaning | Cause | Resolution |
|---|---|---|---|
| 1013 | Try Again Later | Room migration in progress (two-phase handover) | Client auto-retries. Normal during scaling events. |
| 1011 | Internal Error | Redis or GCS unreachable | Verify env variables |
| 1005 | Idle Timeout | Connection idle too long | Server keep-alive should prevent this. Check PING_INTERVAL_MS. |
| 503 | Unavailable | Pod shutting down or overloaded | Client will reconnect |
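
For clients that manage their own socket instead of relying on the reconnect handling in @tldraw/sync, the table above could be turned into a small decision function. This is a sketch only: `planReconnect`, the backoff constants, and the choice to retry 1011 with the longest delay are assumptions, not behavior defined by this project.

```typescript
// Hypothetical client-side reconnect policy derived from the table above.
interface ReconnectPlan {
  retry: boolean;
  delayMs: number;
  note: string;
}

const BASE_DELAY_MS = 500; // assumed starting delay
const MAX_DELAY_MS = 15000; // assumed cap

// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at MAX_DELAY_MS.
function backoff(attempt: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
}

function planReconnect(code: number, attempt: number): ReconnectPlan {
  switch (code) {
    case 1013: // room migration in progress
      return { retry: true, delayMs: backoff(attempt), note: "handover in progress" };
    case 1011: // Redis or GCS unreachable
      return { retry: true, delayMs: MAX_DELAY_MS, note: "server dependency unreachable" };
    case 1005: // connection idle too long
      return { retry: true, delayMs: backoff(attempt), note: "connection dropped while idle" };
    case 503: // pod shutting down or overloaded
      return { retry: true, delayMs: backoff(attempt), note: "pod unavailable" };
    default: // not covered by the table; handle separately
      return { retry: false, delayMs: 0, note: "unrecognized code" };
  }
}

console.log(planReconnect(1013, 0).delayMs); // 500
console.log(planReconnect(1013, 3).delayMs); // 4000
console.log(planReconnect(1013, 10).delayMs); // 15000
```

In a browser, the function would typically be called from the socket's `onclose` handler with `event.code` and a running attempt counter.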

    GET /api/health
    200 OK

Used by GCP Load Balancers and Kubernetes probes.
- Lock Key: lock:room:{roomId}
- TTL: 10 seconds
- Renew Interval: 5 seconds
- Locks are released immediately on shutdown
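
The lock settings above can be modeled with a small in-memory sketch. It mimics the semantics of acquiring with Redis `SET key value NX PX 10000` and of owner-checked renewal and release (against real Redis, the check and the write would need to be atomic, for example via a Lua script). `InMemoryLockStore` and its method names are hypothetical and are not this project's implementation; time is passed in explicitly so the behavior is easy to follow.

```typescript
// Hypothetical in-memory model of the room lock (not the project's code).
const LOCK_TTL_MS = 10000; // TTL: 10 seconds
const RENEW_INTERVAL_MS = 5000; // renew every 5 seconds

interface LockEntry {
  owner: string;
  expiresAt: number;
}

class InMemoryLockStore {
  private locks = new Map<string, LockEntry>();

  private key(roomId: string): string {
    return `lock:room:${roomId}`;
  }

  // Returns the entry only if it has not expired at time `now`.
  private live(roomId: string, now: number): LockEntry | undefined {
    const entry = this.locks.get(this.key(roomId));
    return entry && entry.expiresAt > now ? entry : undefined;
  }

  // Like SET NX PX: succeeds only when no live lock exists.
  acquire(roomId: string, podId: string, now: number): boolean {
    if (this.live(roomId, now)) return false;
    this.locks.set(this.key(roomId), { owner: podId, expiresAt: now + LOCK_TTL_MS });
    return true;
  }

  // Extends the TTL only if `podId` still owns a live lock.
  renew(roomId: string, podId: string, now: number): boolean {
    const entry = this.live(roomId, now);
    if (!entry || entry.owner !== podId) return false;
    entry.expiresAt = now + LOCK_TTL_MS;
    return true;
  }

  // Deletes the lock only if `podId` owns it, so a pod cannot free another pod's room.
  release(roomId: string, podId: string, now: number): boolean {
    const entry = this.live(roomId, now);
    if (!entry || entry.owner !== podId) return false;
    this.locks.delete(this.key(roomId));
    return true;
  }
}

const store = new InMemoryLockStore();
console.log(store.acquire("room-123", "pod-a", 0)); // true
console.log(store.acquire("room-123", "pod-b", 1000)); // false: pod-a holds the lock
console.log(store.renew("room-123", "pod-a", RENEW_INTERVAL_MS)); // true: now expires at 15000
console.log(store.release("room-123", "pod-a", 6000)); // true
console.log(store.acquire("room-123", "pod-b", 6001)); // true: released immediately
```

Renewing at half the TTL leaves room for one missed renewal before the lock expires and another pod can claim the room.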
On SIGTERM:
- Stop accepting new connections
- Save all active rooms to GCS
- Release Redis locks
- Exit process cleanly
This ensures zero data loss during rolling deployments.
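
The shutdown sequence above can be sketched as an ordered async routine. The `ShutdownDeps` interface and its members are hypothetical stand-ins for the real server, storage, and lock code; the point is only the ordering: rooms are saved before their locks are released, so another pod never takes over a room whose latest state has not reached GCS.

```typescript
// Hypothetical dependencies; a real server would wire these to its HTTP server, GCS, and Redis.
interface ShutdownDeps {
  stopAcceptingConnections(): Promise<void>;
  activeRoomIds(): string[];
  saveRoom(roomId: string): Promise<void>;
  releaseLock(roomId: string): Promise<void>;
  exit(code: number): void;
}

async function gracefulShutdown(deps: ShutdownDeps): Promise<void> {
  // 1. Stop accepting new connections.
  await deps.stopAcceptingConnections();

  const roomIds = deps.activeRoomIds();

  // 2. Save all active rooms; one failed save does not stop the others.
  const saved = await Promise.all(
    roomIds.map((id) => deps.saveRoom(id).then(() => true, () => false))
  );
  const failed = saved.filter((ok) => !ok).length;

  // 3. Release locks only after every save attempt has finished.
  await Promise.all(roomIds.map((id) => deps.releaseLock(id).catch(() => undefined)));

  // 4. Exit; a non-zero code signals that at least one save failed.
  deps.exit(failed === 0 ? 0 : 1);
}

// In a Node.js process this would typically be registered as
// (realDeps is a placeholder for the actual wiring):
// process.once("SIGTERM", () => void gracefulShutdown(realDeps));
```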
GitHub:
https://github.com/tldraw/tldraw-sync-gcp
Tested with k6 stress tests from within GCP:
| Concurrent Users | Rooms × Users | Success Rate |
|---|---|---|
| 5,000 | 50 × 100 | 100% |
| 7,000 | 100 × 70 | 99.99% |
| 10,000 | 100 × 100 | ~30% (exceeded capacity) |

Tested infrastructure: 5 NGINX Ingress replicas, 10 app pods, 3× e2-medium nodes
Connection latency: ~235ms (normal load), ~10-20s (under heavy load)
See stress-test/README.md for running your own benchmarks.
- Smart autoscaling and empty-room draining are intentionally not implemented
- This design favors correctness and safety over aggressive scaling
- Ideal for production collaborative environments
MIT