fix(cluster): Fix critical bugs in cluster cross-node message routing
Fixed 2 critical bugs that prevented cluster cross-node message routing from working:
File: cmd/emqx-go/main.go Issue: Nodes advertised their address using nodeID as hostname (e.g., "node2:8083"), which cannot be resolved in local multi-process deployment. Fix: Use "localhost" as hostname for local deployment (e.g., "localhost:8083").
File: pkg/cluster/server.go Issue: Join RPC handler used the request context which gets canceled when RPC returns, causing reverse peer connections to fail immediately. Fix: Use context.Background() instead to allow peer connections to persist beyond RPC lifecycle.
These bugs completely prevented cross-node message routing:
- Node2 could join Node1's cluster
- Node2's subscriptions were synced to Node1
- But when Node1 tried to forward messages to Node2, it failed with "no peer client for node"
After fixes, full bidirectional cluster communication works correctly.
Added comprehensive E2E testing infrastructure:
- Created 3-node cluster deployment scripts (local and Docker)
- Developed cluster testing tool in Go
- Verified cross-node message routing works correctly
Test result: ✅ SUCCESS - Messages successfully route from Node1 to Node2
- cmd/emqx-go/main.go - Fix node address, add environment variable support
- pkg/cluster/server.go - Fix Join handler context usage
- README.md - Add cluster testing documentation
- cmd/cluster-test/main.go - Enhanced cluster testing tool with improved error handling
- cmd/cluster-test/README.md - Comprehensive documentation for test tool
- scripts/start-cluster.sh - Start 3-node local cluster
- scripts/stop-cluster.sh - Stop cluster
- docker-compose-cluster.yaml - Docker cluster configuration
- CLUSTER_E2E_TEST_REPORT.md - Detailed test report
- CLUSTER_TESTING_GUIDE.md - Testing guide
- CLUSTER_TEST_SUMMARY.md - Quick summary
- Added command-line flags for all test parameters
- Improved error handling with detailed error messages
- Added connection timeout detection (5s)
- Added verbose logging mode for debugging
- Measure and display message delivery latency
- Provide troubleshooting tips on test failure
- Support for testing remote clusters
- Return proper exit codes (0=success, 1=failure)
# Start 3-node cluster
./scripts/start-cluster.sh
# Run test
go run ./cmd/cluster-test/main.go
# Stop cluster
./scripts/stop-cluster.shExpected output: "✓✓✓ SUCCESS! Cross-node messaging works!"