fix: connection stability improvements for production issues #1648

Open: 5 commits to be merged into base: main

@sanity (Collaborator) commented Jun 10, 2025

Summary

  • Handle RSA intro packets gracefully on established connections
  • Make keep-alive constants consistent between debug and release builds

Fixes

  1. Intro packet handling: When established connections receive 256-byte RSA intro packets (e.g., when a peer tries to reestablish), we now decrypt and ACK them instead of logging decryption errors
  2. Keep-alive compatibility: Debug and release builds now use the same timing (10s interval, 30s timeout) to prevent premature connection drops

Root Cause

Production decryption errors after ~60 seconds were caused by:

  • Debug/release keep-alive incompatibility causing legitimate timeouts
  • Lack of graceful handling when peers sent intro packets to reestablish
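The first cause can be illustrated with a small model. This is only a sketch under stated assumptions: the `KeepAlive` struct and `idle_connection_survives` function are hypothetical names, not code from this PR, and the model ignores latency and packet loss. It shows that an idle connection survives only if the sender's keep-alive interval is shorter than the receiver's timeout.

```rust
use std::time::Duration;

/// Keep-alive settings for one peer (hypothetical struct, for illustration only).
#[derive(Clone, Copy, Debug)]
pub struct KeepAlive {
    /// How often this peer sends a keep-alive on an idle connection.
    pub interval: Duration,
    /// How long this peer waits without inbound traffic before dropping the connection.
    pub timeout: Duration,
}

/// Returns true if `receiver` keeps an idle connection to `sender` alive,
/// i.e. the sender's keep-alives arrive before the receiver's timeout expires.
/// Network latency and packet loss are ignored in this simplified model.
pub fn idle_connection_survives(sender: KeepAlive, receiver: KeepAlive) -> bool {
    sender.interval < receiver.timeout
}
```

For example, a receiver with the 6-second debug timeout mentioned in the commits would drop a sender that only sends keep-alives every 10 seconds (the pre-fix sender interval is an assumption here), while two peers that both use 10s/30s keep each other alive in both directions.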

Test plan

  • Integration tests pass
  • Deploy to gateways and verify connection stability
  • Monitor for decryption errors in production

🤖 Generated with Claude Code

sanity and others added 5 commits June 9, 2025 21:13
- test_gateway_to_gateway_connection: tests gateway-to-gateway connection setup
- test_gateway_packet_size_change_after_60s: monitors connection for 75 seconds with GET requests every 5 seconds
- test_production_decryption_error_scenario: attempts to reproduce exact production scenario

These tests are designed to reproduce the decryption errors seen in production where:
- Errors start appearing ~60 seconds after connection establishment
- Packet sizes change from 48 bytes to 256 bytes when errors begin
- Same inbound_key is used throughout (no key rotation issue)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add transport_secret_key field to RemoteConnection for RSA decryption
- Implement intro packet detection for 256-byte packets on established connections
- Send ACK response when intro packets are successfully decrypted
- This addresses the secondary bug where established peers don't handle intro packets gracefully
- Related to production decryption errors occurring after ~60 seconds
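The detection logic described in this commit might look roughly like the sketch below. All names (`PacketOutcome`, `classify_packet`, the two decryption callbacks) are hypothetical and stand in for the real transport code; trying the symmetric session key before the RSA path is also an assumption. Only the 256-byte intro packet size and the "decrypt, then ACK" behavior come from the PR description.

```rust
/// Size of an RSA intro packet in bytes, as stated in the PR description.
pub const INTRO_PACKET_SIZE: usize = 256;

/// What an established connection does with an inbound packet (hypothetical enum).
#[derive(Debug, PartialEq)]
pub enum PacketOutcome {
    /// Packet decrypted with the session key; payload is delivered.
    Data(Vec<u8>),
    /// Packet was a valid intro packet from a peer re-establishing; reply with an ACK.
    AckIntro,
    /// Packet could not be decrypted by either path.
    DecryptionError,
}

/// Classifies a packet received on an established connection.
/// `decrypt_symmetric` stands in for session-key decryption and
/// `decrypt_intro` for RSA decryption with the transport secret key.
pub fn classify_packet<S, R>(packet: &[u8], decrypt_symmetric: S, decrypt_intro: R) -> PacketOutcome
where
    S: Fn(&[u8]) -> Option<Vec<u8>>,
    R: Fn(&[u8]) -> Option<Vec<u8>>,
{
    // Normal path: the packet belongs to the established session.
    if let Some(plaintext) = decrypt_symmetric(packet) {
        return PacketOutcome::Data(plaintext);
    }
    // Fallback: only packets of exactly the intro size are tried against RSA.
    if packet.len() == INTRO_PACKET_SIZE && decrypt_intro(packet).is_some() {
        return PacketOutcome::AckIntro;
    }
    PacketOutcome::DecryptionError
}
```

Restricting the RSA attempt to exact-size packets keeps the more expensive asymmetric decryption off the normal data path.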
…uilds

- Set KEEP_ALIVE_INTERVAL to 10 seconds for all builds
- Set KILL_CONNECTION_AFTER to 30 seconds for all builds
- Prevents incompatibility between debug and release peers
- Debug peers were timing out connections to release peers after 6 seconds
- This was a critical design flaw causing connection instability
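A minimal sketch of what build-independent constants could look like follows. The constant names and values are taken from this commit message; the absence of any `cfg(debug_assertions)` split, the compile-time check, and the helper function are illustrative additions, not necessarily what the PR contains.

```rust
use std::time::Duration;

/// Interval between keep-alive packets, the same in debug and release builds.
pub const KEEP_ALIVE_INTERVAL: Duration = Duration::from_secs(10);

/// Idle time after which a connection is dropped, the same in all builds.
pub const KILL_CONNECTION_AFTER: Duration = Duration::from_secs(30);

// Compile-time guard (illustrative addition): the timeout must span at least
// two keep-alive intervals so that one delayed keep-alive does not kill the connection.
const _: () = assert!(KILL_CONNECTION_AFTER.as_secs() >= 2 * KEEP_ALIVE_INTERVAL.as_secs());

/// Number of whole keep-alive intervals that fit inside the timeout window.
pub const fn keep_alive_intervals_per_timeout() -> u64 {
    KILL_CONNECTION_AFTER.as_secs() / KEEP_ALIVE_INTERVAL.as_secs()
}
```

Because both peers compile the same values regardless of build profile, neither side can expect keep-alives more often than the other sends them.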
This directory will contain:
- Release automation scripts
- Deployment summaries
- Local test files
- Other files that shouldn't be committed to git

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>