# Model Socket Starvation: Client Timeout Behavior

## Problem Statement

The current lading UDS generator (lading/src/generator/unix_datagram.rs:277-289) tracks failed sends but does not accurately model client timeout behavior. The generator uses async `socket.send()`, which blocks until the socket is ready, rather than failing quickly when the socket cannot accept data within a timeout window.

Real clients (e.g., datadog-go) use a 100ms write timeout ([source](https://github.com/DataDog/datadog-go/blob/a6b9cdf87a42985c32e5abfc1e6e10e349107ee0/statsd/options.go#L19)). When the Agent cannot keep up, clients time out and drop metrics. Lading needs to model this behavior to accurately measure performance under CPU starvation conditions.
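
To make that behavior concrete, here is a minimal std-only sketch of the client model in question (the names `ClientSend` and `attempt_send` are hypothetical illustrations, not datadog-go's API): a send succeeds only if the socket becomes writable within the timeout window; otherwise the client gives up and the metric is lost.

```rust
use std::time::Duration;

/// Hypothetical outcome of one client send attempt.
#[derive(Debug, PartialEq)]
enum ClientSend {
    Sent,
    DroppedAfterTimeout,
}

/// Model: the send succeeds only if the socket becomes writable
/// within `timeout`; otherwise the client drops the payload.
fn attempt_send(socket_ready_after: Duration, timeout: Duration) -> ClientSend {
    if socket_ready_after <= timeout {
        ClientSend::Sent
    } else {
        ClientSend::DroppedAfterTimeout
    }
}

fn main() {
    let timeout = Duration::from_millis(100); // datadog-go default write timeout
    // A healthy socket accepts immediately; a starved one stalls for 250 ms.
    assert_eq!(attempt_send(Duration::from_millis(0), timeout), ClientSend::Sent);
    assert_eq!(
        attempt_send(Duration::from_millis(250), timeout),
        ClientSend::DroppedAfterTimeout
    );
    println!("ok");
}
```

The point of the design below is that lading should surface the `DroppedAfterTimeout` branch as a measurable event rather than silently blocking.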

## Current Implementation Analysis

### What Works Today
- Failed sends are tracked via the `request_failure` metric
- The throttle mechanism controls the send rate via `throttle.wait_for(total_bytes)`
- Connection failures are tracked and retried
- The "peek then consume" pattern ensures throttle accuracy

### What's Missing
- No timeout modeling: sends block indefinitely
- No explicit queue to track pending writes
- Cannot simulate client-side timeouts and dropped metrics
- Backpressure is not accurately measured

### Key Implementation Insights

After analyzing the codebase:
- **The throttle has no capacity-return mechanism**: this is by design and actually helps our use case
- **The throttle measures attempted load**, not successful sends: this is the correct behavior
- **The current implementation uses datagram semantics**: no partial writes, no retries

## Technical Design: Timeout-Aware Send Layer

### Core Architecture

Introduce a thin timeout-modeling layer between the throttle and the socket:

```text
// Current flow:
throttle.wait_for(bytes) → socket.send(block) → track_metrics

// Proposed flow:
throttle.wait_for(bytes) → timeout_aware_send(block) → track_metrics
                                    ↓
                          internal_queue_management
```

### Implementation Design

```rust
use std::collections::VecDeque;
use std::io::ErrorKind;
use std::time::{Duration, Instant};

use tokio::net::UnixDatagram;

struct TimeoutAwareSender {
    socket: UnixDatagram,
    timeout: Duration,
    pending_queue: VecDeque<(Instant, Vec<u8>)>,
    max_queue_size: usize,
}

impl TimeoutAwareSender {
    async fn send(&mut self, data: &[u8]) -> SendResult {
        // 1. Expire old entries first (deterministic)
        self.expire_old_entries();

        // 2. Try a non-blocking send
        match self.socket.try_send(data) {
            Ok(bytes) => SendResult::Success(bytes),
            Err(e) if e.kind() == ErrorKind::WouldBlock => {
                // 3. Add to the queue if under the limit
                if self.pending_queue.len() < self.max_queue_size {
                    self.pending_queue.push_back((Instant::now(), data.to_vec()));
                    SendResult::Queued
                } else {
                    SendResult::Dropped(DropReason::QueueFull)
                }
            }
            Err(e) => SendResult::Failed(e),
        }
    }

    fn expire_old_entries(&mut self) {
        let cutoff = Instant::now() - self.timeout;
        while let Some((timestamp, _)) = self.pending_queue.front() {
            if *timestamp < cutoff {
                self.pending_queue.pop_front();
                // Track the expiration metric here
            } else {
                break;
            }
        }
    }
}
```
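
The expiration pass can be exercised without a socket at all. This std-only sketch reproduces the `expire_old_entries` logic over a bare `VecDeque` (the free function and its returned count are illustrative additions, not part of the design above):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Drop queued entries older than `timeout`, returning how many expired.
fn expire_old_entries(
    queue: &mut VecDeque<(Instant, Vec<u8>)>,
    timeout: Duration,
) -> usize {
    let cutoff = Instant::now() - timeout;
    let mut expired = 0;
    while let Some((timestamp, _)) = queue.front() {
        if *timestamp < cutoff {
            queue.pop_front();
            expired += 1;
        } else {
            break;
        }
    }
    expired
}

fn main() {
    let mut queue = VecDeque::new();
    let now = Instant::now();
    // One stale entry (queued 200 ms ago) and one fresh entry.
    queue.push_back((now - Duration::from_millis(200), vec![1u8]));
    queue.push_back((now, vec![2u8]));
    let expired = expire_old_entries(&mut queue, Duration::from_millis(100));
    assert_eq!(expired, 1);
    assert_eq!(queue.len(), 1);
    println!("ok");
}
```

Because the queue is ordered by enqueue time, the loop can stop at the first unexpired entry, which keeps the pass O(expired) per send attempt.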

### Key Design Decisions

1. **Maintain Throttle Semantics**: The throttle continues to control the rate of send attempts, not successful sends. This accurately models real client behavior, where clients attempt to send at a configured rate regardless of success.

2. **Bounded Queue**: Use a fixed-size queue to:
   - Prevent unbounded memory growth
   - Maintain determinism
   - Model realistic client behavior (clients have finite buffers)

3. **Deterministic Expiration**: Check for expired entries on each send attempt, not via background timers. This ensures:
   - Predictable behavior
   - No additional threads/tasks
   - Consistent measurement

4. **Clear Result Types**: Distinguish between:
   - Success: data sent immediately
   - Queued: added to the pending queue
   - Dropped: queue full or other failure
   - Expired: timed out in the queue
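
The result types named above might look like the following (a sketch: the `Expired` variant's payload and the `metric_label` helper are assumptions made here for illustration):

```rust
use std::io;
use std::time::Duration;

/// Why a write never reached the socket.
#[derive(Debug)]
enum DropReason {
    QueueFull,
}

/// Outcome of a single send attempt through the timeout-aware layer.
#[derive(Debug)]
enum SendResult {
    /// Data was written to the socket immediately.
    Success(usize),
    /// Socket was not writable; payload parked in the pending queue.
    Queued,
    /// Queue at capacity (or another client-side limit was hit).
    Dropped(DropReason),
    /// A queued payload aged past the timeout window.
    Expired { waited: Duration },
    /// Hard I/O error from the socket itself.
    Failed(io::Error),
}

/// Map an outcome to the label used in result-distribution metrics.
fn metric_label(result: &SendResult) -> &'static str {
    match result {
        SendResult::Success(_) => "success",
        SendResult::Queued => "queued",
        SendResult::Dropped(_) => "dropped",
        SendResult::Expired { .. } => "expired",
        SendResult::Failed(_) => "failed",
    }
}

fn main() {
    assert_eq!(metric_label(&SendResult::Queued), "queued");
    assert_eq!(metric_label(&SendResult::Success(42)), "success");
    println!("ok");
}
```

An exhaustive enum keeps the metric mapping total: adding a new outcome later forces the match (and therefore the metrics) to be updated.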

### Integration Points

The implementation requires minimal changes to existing code:

1. **In Child::spin()**: Replace direct `socket.send()` calls with `TimeoutAwareSender`
2. **Configuration**: Add optional timeout parameters
3. **Metrics**: Add new counters for queue behavior
4. **No throttle changes**: The throttle remains unchanged
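
As a sketch of point 3, the outcome-to-counter mapping might look like this (a `HashMap` stands in for the real metrics sink; the counter names follow this design's metrics list, but the `track` helper itself is hypothetical):

```rust
use std::collections::HashMap;

/// Stand-in for the metrics sink; real code would use a metrics facade.
fn record(counters: &mut HashMap<&'static str, u64>, name: &'static str) {
    *counters.entry(name).or_insert(0) += 1;
}

/// How the spin loop might translate send outcomes into counters
/// (outcome strings follow this design, not existing lading code).
fn track(counters: &mut HashMap<&'static str, u64>, outcome: &str, bytes: usize) {
    match outcome {
        "success" => {
            record(counters, "packets_sent");
            *counters.entry("bytes_written").or_insert(0) += bytes as u64;
        }
        "queued" => record(counters, "writes_queued"),
        "dropped" => record(counters, "writes_dropped"),
        "expired" => record(counters, "writes_expired"),
        _ => record(counters, "request_failure"),
    }
}

fn main() {
    let mut c = HashMap::new();
    track(&mut c, "success", 512);
    track(&mut c, "queued", 512);
    assert_eq!(c["packets_sent"], 1);
    assert_eq!(c["bytes_written"], 512);
    assert_eq!(c["writes_queued"], 1);
    println!("ok");
}
```

Keeping the mapping in one function means the existing counters (`packets_sent`, `bytes_written`, `request_failure`) and the new ones are updated from a single place.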

### Coordinated Omission Avoidance

This design avoids coordinated omission by:
- Attempting sends on every throttle tick (as required by blt's comment)
- Never blocking on socket readiness
- Tracking all attempted operations, not just successful ones
- Maintaining the expected send rate regardless of socket congestion
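
Because every attempt is counted, the client-observed loss rate falls directly out of the counters. A small sketch (function name and sample values are hypothetical):

```rust
/// Fraction of attempted writes that never reached the socket,
/// computed from the proposed counters.
fn drop_fraction(attempted: u64, expired: u64, dropped: u64, failed: u64) -> f64 {
    if attempted == 0 {
        return 0.0;
    }
    (expired + dropped + failed) as f64 / attempted as f64
}

fn main() {
    // e.g. 10_000 attempts: 400 expired in queue, 100 dropped queue-full, 0 hard failures
    let f = drop_fraction(10_000, 400, 100, 0);
    assert!((f - 0.05).abs() < 1e-9);
    println!("ok");
}
```

A blocking sender would hide this signal entirely: delayed sends would simply shift the load curve instead of registering as loss.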

## Implementation Steps

1. **Create the TimeoutAwareSender struct**
   - Implement non-blocking send logic
   - Add queue management with expiration
   - Define clear result types

2. **Integrate into the unix_datagram generator**
   - Replace `socket.send()` calls
   - Maintain existing metrics
   - Add new timeout-specific metrics

3. **Add Configuration**
   ```yaml
   generator:
     unix_datagram:
       path: /var/run/datadog/dsd.socket
       bytes_per_second: 10_000_000
       client_behavior:
         timeout_ms: 100        # Optional; defaults to no timeout
         max_queue_depth: 1000  # Optional; defaults to a reasonable limit
   ```

4. **Testing Strategy**
   - Unit tests for TimeoutAwareSender in isolation
   - Property tests for queue bounds and determinism
   - Integration tests with real sockets
   - Verification that the throttle rate matches attempted sends
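
The queue-bound property from the testing strategy reduces to a small invariant (std-only sketch; `push_bounded` mirrors the queue-admission step of `TimeoutAwareSender::send`):

```rust
use std::collections::VecDeque;

/// Push with a hard cap, as the pending queue would; returns false on drop.
fn push_bounded(queue: &mut VecDeque<Vec<u8>>, item: Vec<u8>, max: usize) -> bool {
    if queue.len() < max {
        queue.push_back(item);
        true
    } else {
        false
    }
}

fn main() {
    let mut q = VecDeque::new();
    let max = 3;
    let mut accepted = 0;
    for i in 0..10u8 {
        if push_bounded(&mut q, vec![i], max) {
            accepted += 1;
        }
        assert!(q.len() <= max); // the invariant a property test would assert
    }
    assert_eq!(accepted, 3);
    println!("ok");
}
```

A property test would drive this with arbitrary interleavings of pushes and expirations and assert the same bound after every step.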

## Metrics to Track

Existing metrics remain unchanged:
- `bytes_written`: Successfully sent bytes
- `packets_sent`: Successfully sent packets
- `request_failure`: Failed send attempts

New metrics for timeout behavior:
- `writes_queued`: Writes added to the pending queue
- `writes_expired`: Writes that exceeded the timeout window
- `writes_dropped`: Writes dropped due to queue limits
- `queue_depth`: Current number of pending writes
- `send_attempt_result{result="success|queued|dropped|expired"}`: Result distribution

## Benefits of This Approach

1. **Accurate Client Modeling**: Matches real client behavior under load
2. **Throttle Accuracy**: Maintains precise control over attempted load
3. **Deterministic**: No background tasks or non-deterministic draining
4. **Testable**: The timeout layer can be tested in isolation
5. **Backward Compatible**: Opt-in via configuration
6. **Performance**: Minimal overhead, no additional threads

## References

- Current implementation: lading/src/generator/unix_datagram.rs
- Throttle mechanism: lading_throttle/
- datadog-go timeout: 100ms default