
🐛 BUG: host won't reconnect to lighthouse successfully behind a NAT whose IP changed #889

Open
@JimLee1996

Description

What version of nebula are you using?

1.7.2

What operating system are you using?

Linux

Describe the Bug

  1. Lighthouse: has a static public IP
  2. Host: behind a NAT whose public IP may change
  3. The bug happens exactly after the NAT's public IP changes: the nebula host's reconnection to the lighthouse fails

Maybe there should be a retry counter or threshold that reloads the service after repeated failed attempts.
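
As a rough illustration of the counter idea, here is a minimal external watchdog sketch in Python. It is not part of nebula: `TimeoutCounter`, its `threshold` default, and the decision rule are all assumptions made for illustration. It only matches the `msg=` strings and the `vpnIp=` field that appear in the logs below.

```python
from dataclasses import dataclass


@dataclass
class TimeoutCounter:
    """Counts consecutive handshake timeouts for one peer (hypothetical helper)."""

    vpn_ip: str         # overlay IP of the lighthouse, e.g. "192.168.100.1"
    threshold: int = 5  # assumed value, not a nebula setting
    failures: int = 0

    def observe(self, line: str) -> bool:
        """Feed one log line; return True once a restart looks warranted."""
        # Compare whole tokens so 192.168.100.1 does not match 192.168.100.10.
        if f"vpnIp={self.vpn_ip}" not in line.split():
            return False
        if 'msg="Handshake timed out"' in line:
            self.failures += 1
        elif 'msg="Handshake message received"' in line:
            # The peer answered, so the streak of failures is over.
            self.failures = 0
        return self.failures >= self.threshold
```

One way to use it would be to feed it lines from `journalctl -u nebula -f` and restart `nebula.service` when `observe` returns True. Whether a restart is the right reaction is exactly the open question in this issue, so treat this as a workaround sketch rather than a fix.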

Logs from affected hosts

Jun 03 08:17:10 N1 nebula[279798]: level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2265257244 localIndex=2265257244 remoteIndex=0 udpAddrs="[*.*.*.*:4242]" vpnIp=192.168.100.1
Jun 03 08:17:17 N1 nebula[279798]: level=info msg="Handshake timed out" durationNs=6891732272 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2265257244 localIndex=2265257244 remoteIndex=0 udpAddrs="[*.*.*.*:4242]" vpnIp=192.168.100.1
Jun 03 08:18:10 N1 nebula[279798]: level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=789404569 localIndex=789404569 remoteIndex=0 udpAddrs="[*.*.*.*:4242]" vpnIp=192.168.100.1
Jun 03 08:18:17 N1 nebula[279798]: level=info msg="Handshake timed out" durationNs=6688333707 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=789404569 localIndex=789404569 remoteIndex=0 udpAddrs="[*.*.*.*:4242]" vpnIp=192.168.100.1

# ============================
# many identical log lines omitted here
# ============================

Jun 03 09:32:52 N1 nebula[279798]: level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=215861812 localIndex=215861812 remoteIndex=0 udpAddrs="[*.*.*.*:4242]" vpnIp=192.168.100.1
Jun 03 09:32:59 N1 nebula[279798]: level=info msg="Handshake timed out" durationNs=6664379861 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=215861812 localIndex=215861812 remoteIndex=0 udpAddrs="[*.*.*.*:4242]" vpnIp=192.168.100.1
Jun 03 09:32:59 N1 nebula[279798]: level=info msg="Handshake timed out" durationNs=6962423141 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2959928072 localIndex=2959928072 remoteIndex=0 udpAddrs="[]" vpnIp=192.168.100.10


# ============================
# manually restart nebula host
# ============================

Jun 03 09:33:00 N1 nebula[279798]: level=info msg="Caught signal, shutting down" signal=terminated
Jun 03 09:33:00 N1 nebula[279798]: level=info msg=Goodbye
Jun 03 09:33:00 N1 systemd[1]: Stopping Nebula overlay networking tool...
Jun 03 09:33:00 N1 systemd[1]: nebula.service: Succeeded.
Jun 03 09:33:00 N1 systemd[1]: Stopped Nebula overlay networking tool.
Jun 03 09:33:00 N1 systemd[1]: nebula.service: Consumed 30min 44.211s CPU time.
Jun 03 09:33:00 N1 systemd[1]: Started Nebula overlay networking tool.
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="Firewall rule added" firewallRule="map[caName: caSha: direction:outgoing endPort:0 groups:[] host:any ip: localIp: proto:0 startPort:0]"
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="Firewall rule added" firewallRule="map[caName: caSha: direction:incoming endPort:0 groups:[] host:any ip: localIp: proto:0 startPort:0]"
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="Firewall started" firewallHash=498215dec4e5687a2353f51c10838c113bd1af35ef72b8e8c9f536986ada5417
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="Main HostMap created" network=192.168.100.2/24 preferredRanges="[]"
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="punchy enabled"
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="Loaded send_recv_error config" sendRecvError=always
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="Nebula interface is active" boringcrypto=false build=1.7.2 interface=tun0 network=192.168.100.2/24 udpAddr="0.0.0.0:44710"
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="DNS results changed for host list" newSet="map[*.*.*.*:4242:{}]" origSet="&map[]"

# ============================
# now it's back to normal
# ============================

Jun 03 09:33:00 N1 nebula[299280]: level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1175697647 localIndex=1175697647 remoteIndex=0 udpAddrs="[*.*.*.*:4242]" vpnIp=192.168.100.1
Jun 03 09:33:00 N1 nebula[299280]: level=info msg="Handshake message received" certName=ICL durationNs=327741706 fingerprint=a01937d6e07d050ba2cfc91fd2f56ec3f008b33690b7931f3a5bfe99f835f67a handshake="map[stage:2 style:ix_psk0]" initiatorIndex=1175697647 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=1175697647 responderIndex=3925596160 sentCachedPackets=1 udpAddr="*.*.*.*:4242" vpnIp=192.168.100.1
Jun 03 09:33:03 N1 nebula[299280]: level=info msg="Handshake message received" certName=Macbook fingerprint=bd3d7b77768b32aa25b5ce82c2cc67a4620b78aaf1ed95999c3e93016c8795f5 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2451724569 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=0 udpAddr="192.168.123.10:61939" vpnIp=192.168.100.10
Jun 03 09:33:03 N1 nebula[299280]: level=info msg="Handshake message sent" certName=Macbook fingerprint=bd3d7b77768b32aa25b5ce82c2cc67a4620b78aaf1ed95999c3e93016c8795f5 handshake="map[stage:2 style:ix_psk0]" initiatorIndex=2451724569 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=3070221539 sentCachedPackets=0 udpAddr="192.168.123.10:61939" vpnIp=192.168.100.10
Jun 03 09:33:03 N1 nebula[299280]: level=info msg="Handshake message received" certName=Macbook fingerprint=bd3d7b77768b32aa25b5ce82c2cc67a4620b78aaf1ed95999c3e93016c8795f5 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2451724569 issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=0 udpAddr="172.16.16.10:61939" vpnIp=192.168.100.10
Jun 03 09:33:03 N1 nebula[299280]: level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" udpAddr="172.16.16.10:61939" vpnIp=192.168.100.10
issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=0 udpAddr="172.16.16.11:56979" vpnIp=192.168.100.11
issuer=33768094d6855b7ca53962932dd41ce99b11347d220ff89a33d1f01f0f5ab578 remoteIndex=0 responderIndex=0 udpAddr="192.168.123.11:56979" vpnIp=192.168.100.11
Jun 03 09:40:23 N1 nebula[299280]: level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" udpAddr="192.168.123.11:56979" vpnIp=192.168.100.11
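
For anyone who wants to analyze logs like these, the following is a small Python sketch that splits the `key=value` part of one line into a dictionary. `parse_fields` is a hypothetical helper, and it assumes the text log format shown above, where values containing spaces are wrapped in double quotes.

```python
import shlex


def parse_fields(line: str) -> dict[str, str]:
    """Split the key=value part of a text-format log line into a dict."""
    start = line.find("level=")  # skip the journald prefix before the fields
    if start == -1:
        return {}
    fields = {}
    # shlex keeps quoted values such as msg="Handshake timed out" together.
    for token in shlex.split(line[start:]):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


line = (
    'Jun 03 08:17:17 N1 nebula[279798]: level=info msg="Handshake timed out" '
    'durationNs=6891732272 handshake="map[stage:1 style:ix_psk0]" '
    "initiatorIndex=2265257244 localIndex=2265257244 remoteIndex=0 "
    'udpAddrs="[*.*.*.*:4242]" vpnIp=192.168.100.1'
)
fields = parse_fields(line)
seconds = int(fields["durationNs"]) / 1e9  # nanoseconds to seconds
print(fields["msg"], fields["vpnIp"], round(seconds, 2))
```

For the sample line, the timeout duration works out to roughly 6.89 seconds per failed handshake attempt.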

Config files from affected hosts

pki:
  ca: /root/bin/nebula/cert/ca.crt
  cert: /root/bin/nebula/cert/SY.crt
  key: /root/bin/nebula/cert/SY.key

static_host_map:
  "192.168.100.1": ["example.com:4242"] # hidden

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"

listen:
  host: 0.0.0.0
  port: 0

punchy:
  punch: true
  respond: true
  delay: 1s
  respond_delay: 5s

cipher: aes

tun:
  disabled: false
  tx_queue: 500
  mtu: 1300

  # Unsafe routes allows you to route traffic over nebula to non-nebula nodes
  # Unsafe routes should be avoided unless you have hosts/services that cannot run nebula
  # NOTE: The nebula certificate of the "via" node *MUST* have the "route" defined as a subnet in its certificate
  # `mtu`: will default to tun mtu if this option is not specified
  # `metric`: will default to 0 if this option is not specified
  # `install`: will default to true, controls whether this route is installed in the systems routing table.
  # unsafe_routes:
  #   - route: 192.168.1.0/24
  #     via: 192.168.100.1
  #     mtu: 1300
  #     install: true

logging:
  level: info
  format: text
  disable_timestamp: true

firewall:
  outbound_action: drop
  inbound_action: drop

  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m

  # The firewall is default deny. There is no way to write a deny rule.
  # Rules are comprised of a protocol, port, and one or more of host, group, or CIDR
  # Logical evaluation is roughly: port AND proto AND (ca_sha OR ca_name) AND (host OR group OR groups OR cidr)
  # - port: Takes `0` or `any` as any, a single number `80`, a range `200-901`, or `fragment` to match second and further fragments of fragmented packets (since there is no port available).
  #   code: same as port but makes more sense when talking about ICMP, TODO: this is not currently implemented in a way that works, use `any`
  #   proto: `any`, `tcp`, `udp`, or `icmp`
  #   host: `any` or a literal hostname, ie `test-host`
  #   group: `any` or a literal group name, ie `default-group`
  #   groups: Same as group but accepts a list of values. Multiple values are AND'd together and a certificate would have to contain all groups to pass
  #   cidr: a remote CIDR, `0.0.0.0/0` is any.
  #   local_cidr: a local CIDR, `0.0.0.0/0` is any. This could be used to filter destinations when using unsafe_routes.
  #   ca_name: An issuing CA name
  #   ca_sha: An issuing CA shasum

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: any
      host: any

Labels: NeedsDecision. Feedback is required from experts, contributors, and/or the community before a change can be made.
