Merged
170 changes: 165 additions & 5 deletions CLAUDE.md
@@ -176,19 +176,64 @@ This practice ensures:
- Too many tasks to fix immediately (113+)
- Focus on new code having proper names


### 3. Jinja2 Template Complexity
### 2. DNS Architecture and Common Issues

#### Understanding local_service_ip
- Algo uses a randomly generated IP in the 172.16.0.0/12 range on the loopback interface
- This IP (`local_service_ip`) is where dnscrypt-proxy should listen
- Requires `net.ipv4.conf.all.route_localnet=1` sysctl for VPN clients to reach loopback IPs
- This is by design for consistency across VPN types (WireGuard + IPsec)

#### dnscrypt-proxy Service Failures
**Problem:** "Unit dnscrypt-proxy.socket is masked" or service won't start
- The service has `Requires=dnscrypt-proxy.socket` dependency
- Masking the socket prevents the service from starting
- **Solution:** Configure the socket with a drop-in override instead of masking or disabling it

#### DNS Not Accessible to VPN Clients
**Symptoms:** VPN connects but no internet/DNS access
1. **First check what's listening:** `sudo ss -ulnp | grep :53`
   - Should show `local_service_ip:53` (e.g., 172.24.117.23:53)
   - If it shows only 127.0.2.1:53, the socket override didn't apply
2. **Check socket status:** `systemctl status dnscrypt-proxy.socket`
   - Look for "configuration has changed while running", which means the socket needs a restart
3. **Verify route_localnet:** `sysctl net.ipv4.conf.all.route_localnet`
   - Must be 1 for VPN clients to reach loopback IPs
4. **Check firewall:** Ensure it allows the VPN subnets: `-A INPUT -s {{ subnets }} -d {{ local_service_ip }}`
   - **Never** allow DNS from all sources (0.0.0.0/0) - security risk!
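
The first check above can be scripted. The following is a minimal sketch, not part of Algo: `dns_listening_on` is a hypothetical helper that reads `ss -lnu` output on stdin and succeeds only if the given IP appears as a listener on port 53.

```bash
# Hypothetical helper (not part of Algo): succeeds if the given IP is
# listed as a local address on port 53 in `ss -lnu` output read from stdin.
dns_listening_on() {
    local ip="$1"
    # Compare every whitespace-separated field against "<ip>:53" exactly,
    # so 127.0.2.1:53 never matches when 172.24.117.23:53 is expected.
    awk -v want="${ip}:53" '
        { for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
        END { exit !found }
    '
}
```

Example usage, with the address replaced by the server's actual `local_service_ip`: `sudo ss -lnu | dns_listening_on 172.24.117.23 && echo "DNS socket OK"`.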

### 3. Multi-homed Systems and NAT
**DigitalOcean and other providers with multiple IPs:**
- Servers may have both public and private IPs on same interface
- MASQUERADE needs output interface: `-o {{ ansible_default_ipv4['interface'] }}`
- Don't overengineer with SNAT - MASQUERADE with interface works fine
- Use `alternative_ingress_ip` option only when truly needed
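
To illustrate the interface-bound MASQUERADE rule, here is a rough sketch. Both function names are hypothetical, the rule is a simplified form rather than Algo's actual template output, and the subnet in the usage line is only an example value.

```bash
# Hypothetical helper: extract the default-route interface from
# `ip -4 route show default` output read on stdin.
default_iface() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Hypothetical helper: print a nat-table rule (iptables-restore syntax)
# that masquerades one VPN subnet out of one specific interface.
masquerade_rule() {
    local subnet="$1" iface="$2"
    printf -- '-A POSTROUTING -s %s -o %s -j MASQUERADE\n' "$subnet" "$iface"
}
```

Example usage: `masquerade_rule 10.49.0.0/16 "$(ip -4 route show default | default_iface)"`. Binding the rule to the output interface is what keeps it predictable when that interface carries both a public and a private address.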

### 4. iptables Backend Changes (nft vs legacy)
**Critical:** Switching between iptables-nft and iptables-legacy can break subtle behaviors
- Ubuntu 22.04+ defaults to iptables-nft which may have implicit NAT behaviors
- Algo forces iptables-legacy for consistent rule ordering
- This switch can break DNS routing that "just worked" before
- Always test thoroughly after backend changes
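
When testing after a backend change, it helps to confirm which backend is actually active. A small sketch, assuming iptables 1.8 or later, where `iptables --version` reports the backend in parentheses; `iptables_backend` is a hypothetical helper name.

```bash
# Hypothetical helper: classify the backend from an `iptables --version`
# string such as "iptables v1.8.7 (nf_tables)" or "iptables v1.8.7 (legacy)".
iptables_backend() {
    case "$1" in
        *"(nf_tables)"*) echo "nft" ;;
        *"(legacy)"*)    echo "legacy" ;;
        *)               echo "unknown" ;;
    esac
}
```

Example usage: `iptables_backend "$(iptables --version)"`. Older iptables releases print no backend suffix, in which case the sketch reports `unknown`.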

### 5. systemd Socket Activation Gotchas
- Interface-specific sysctls (e.g., `net.ipv4.conf.wg0.route_localnet`) fail if interface doesn't exist yet
- WireGuard interface only created when service starts
- Use global sysctls or apply settings after service start
- Socket configuration changes require explicit restart (not just reload)
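
The interface-ordering problem above can be sidestepped by choosing the sysctl key at run time. This is only a sketch of that idea, not how Algo does it; `route_localnet_key` is a hypothetical helper that falls back to the global key when the interface does not exist yet.

```bash
# Hypothetical helper: print the per-interface route_localnet key if the
# interface already exists, otherwise the global key that is always present.
route_localnet_key() {
    local iface="$1"
    # The kernel exposes one directory per existing interface here.
    if [ -d "/proc/sys/net/ipv4/conf/${iface}" ]; then
        echo "net.ipv4.conf.${iface}.route_localnet"
    else
        echo "net.ipv4.conf.all.route_localnet"
    fi
}
```

Example usage: `sudo sysctl -w "$(route_localnet_key wg0)=1"`. Before WireGuard starts, this resolves to the global key, matching the advice to prefer global sysctls.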

### 6. Jinja2 Template Complexity
- Many templates use Ansible-specific filters
- Test templates with `tests/unit/test_template_rendering.py`
- Mock Ansible filters when testing

### 4. OpenSSL Version Compatibility
### 7. OpenSSL Version Compatibility
```yaml
# Check version and use appropriate flags
{{ (openssl_version is version('3', '>=')) | ternary('-legacy', '') }}
```

### 5. IPv6 Endpoint Formatting
### 8. IPv6 Endpoint Formatting
- WireGuard configs must bracket IPv6 addresses
- Template logic: `{% if ':' in IP %}[{{ IP }}]:{{ port }}{% else %}{{ IP }}:{{ port }}{% endif %}`
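
The same bracketing rule can be expressed outside Jinja2, for example when checking generated configs by hand. A minimal sketch; `format_endpoint` is a hypothetical helper that mirrors the template logic above.

```bash
# Hypothetical helper: format a WireGuard endpoint, wrapping the address
# in brackets when it contains a colon (i.e. looks like IPv6).
format_endpoint() {
    local ip="$1" port="$2"
    case "$ip" in
        *:*) printf '[%s]:%s\n' "$ip" "$port" ;;
        *)   printf '%s:%s\n' "$ip" "$port" ;;
    esac
}
```

For example, `format_endpoint 2001:db8::1 51820` prints `[2001:db8::1]:51820`, while `format_endpoint 203.0.113.10 51820` prints `203.0.113.10:51820`.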

@@ -223,9 +268,11 @@ This practice ensures:
Each has specific requirements:
- **AWS**: Requires boto3, specific AMI IDs
- **Azure**: Complex networking setup
- **DigitalOcean**: Simple API, good for testing
- **DigitalOcean**: Simple API, good for testing (watch for multiple IPs on eth0)
- **Local**: KVM/Docker for development

**Testing Note:** DigitalOcean droplets often have both public and private IPs on the same interface, making them excellent test cases for multi-IP scenarios and NAT issues.

### Architecture Considerations
- Support both x86_64 and ARM64
- Some providers have limited ARM support
@@ -265,6 +312,17 @@ Each has specific requirements:
- Linter compliance
- Conservative approach

### Time Wasters to Avoid (Lessons Learned)
**Don't spend time on these unless absolutely necessary:**
1. **Converting MASQUERADE to SNAT** - MASQUERADE works fine for Algo's use case
2. **Fighting systemd socket activation** - Configure it properly instead of trying to disable it
3. **Debugging NAT before checking DNS** - Most "routing" issues are DNS issues
4. **Complex IPsec policy matching** - Keep NAT rules simple, avoid `-m policy --pol none`
5. **Testing on existing servers** - Always test on fresh deployments
6. **Interface-specific route_localnet** - WireGuard interface doesn't exist until service starts
7. **DNAT for loopback addresses** - Packets to local IPs don't traverse PREROUTING
8. **Removing BPF JIT hardening** - It's optional and causes errors on many kernels

## Working with Algo

### Local Development Setup
@@ -297,6 +355,108 @@ ansible-playbook users.yml -e "server=SERVER_NAME"
3. Check firewall rules
4. Review generated configs in `configs/`

### Troubleshooting VPN Connectivity

#### Debugging Methodology
When VPN connects but traffic doesn't work, follow this **exact order** (learned from painful experience):

1. **Check DNS listening addresses first**
```bash
ss -lnup | grep :53
# Should show local_service_ip:53 (e.g., 172.24.117.23:53)
# If showing 127.0.2.1:53, socket override didn't apply
```

2. **Check both socket AND service status**
```bash
systemctl status dnscrypt-proxy.socket dnscrypt-proxy.service
# Look for "configuration has changed while running" warnings
```

3. **Verify route_localnet is enabled**
```bash
sysctl net.ipv4.conf.all.route_localnet
# Must be 1 for VPN clients to reach loopback IPs
```

4. **Test DNS resolution from server**
```bash
dig @172.24.117.23 google.com # Use actual local_service_ip
# Should return results if DNS is working
```

5. **Check firewall counters**
```bash
iptables -L INPUT -v -n | grep -E '172\.24|10\.49|10\.48'  # Escape dots so they match literally
# Look for increasing packet counts
```

6. **Verify NAT is happening**
```bash
iptables -t nat -L POSTROUTING -v -n
# Check for MASQUERADE rules with packet counts
```

**Key insight:** 90% of "routing" issues are actually DNS issues. Always check DNS first!

#### systemd and dnscrypt-proxy (Critical for Ubuntu/Debian)
**Background:** Ubuntu's dnscrypt-proxy package uses systemd socket activation, which **completely overrides** the `listen_addresses` setting in the config file.

**How it works:**
1. Default socket listens on 127.0.2.1:53 (hardcoded in package)
2. Socket activation means systemd opens the port, not dnscrypt-proxy
3. Config file `listen_addresses` is ignored when socket activation is used
4. Must configure the socket, not just the service

**Correct approach:**
```ini
# Create socket override at /etc/systemd/system/dnscrypt-proxy.socket.d/10-algo-override.conf
# systemd does not support trailing comments, so each comment sits on its own line.
[Socket]
# Empty assignments clear ALL defaults first (TCP and UDP)
ListenStream=
ListenDatagram=
# Then add TCP and UDP listeners on the VPN IP
ListenStream=172.x.x.x:53
ListenDatagram=172.x.x.x:53
```

**Config requirements:**
- Use empty `listen_addresses = []` in dnscrypt-proxy.toml for socket activation
- Socket must be restarted (not just reloaded) after config changes
- Check with: `systemctl status dnscrypt-proxy.socket` for warnings
- Verify with: `ss -lnup | grep :53` to see actual listening addresses

**Common mistakes:**
- Trying to disable/mask the socket (breaks service with Requires= dependency)
- Only setting ListenStream (need ListenDatagram for UDP)
- Forgetting to clear defaults first (results in listening on both IPs)
- Not restarting socket after configuration changes
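
The override and restart steps described above can be sketched as a small script. `render_socket_override` is a hypothetical helper, not part of Algo; it only prints the drop-in contents so the result can be inspected before it is written anywhere.

```bash
# Hypothetical helper: print a dnscrypt-proxy.socket drop-in that clears
# the packaged listeners and binds TCP and UDP port 53 to the given IP.
render_socket_override() {
    local ip="$1"
    cat <<EOF
[Socket]
ListenStream=
ListenDatagram=
ListenStream=${ip}:53
ListenDatagram=${ip}:53
EOF
}
```

Example usage, assuming the drop-in directory already exists and substituting the real `local_service_ip`: `render_socket_override 172.24.117.23 | sudo tee /etc/systemd/system/dnscrypt-proxy.socket.d/10-algo-override.conf`, followed by `sudo systemctl daemon-reload && sudo systemctl restart dnscrypt-proxy.socket`. The `daemon-reload` makes systemd read the new drop-in, and the explicit restart applies it to the running socket.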

## Architectural Decisions and Trade-offs

### DNS Service IP Design
Algo uses a randomly generated IP in the 172.16.0.0/12 range on the loopback interface for DNS (`local_service_ip`). This design has trade-offs:

**Why it's done this way:**
- Provides a consistent DNS IP across both WireGuard and IPsec
- Avoids binding to VPN gateway IPs which differ between protocols
- Survives interface changes and restarts
- Works the same way across all cloud providers

**The cost:**
- Requires `route_localnet=1` sysctl (minor security consideration)
- Adds complexity with systemd socket activation
- Can be confusing to debug

**Alternatives considered but rejected:**
- Binding to VPN gateway IPs directly (breaks unified configuration)
- Using dummy interface instead of loopback (non-standard, more complex)
- DNAT redirects (doesn't work with loopback destinations)

### iptables Backend Choice
Algo forces iptables-legacy instead of iptables-nft on Ubuntu 22.04+ because:
- nft reorders rules unpredictably, breaking VPN traffic
- Legacy backend provides consistent, predictable behavior
- Trade-off: Algo loses some implicit NAT behaviors that nft provided

## Important Context for LLMs

### What Makes Algo Special