Commit f236836: Adds some ideas around modeling client write timeouts (1 parent: dae69da)

MODEL_SOCKET_STARVATION.md: 184 additions, 0 deletions
# Model Socket Starvation: Client Timeout Behavior

## Problem Statement

The current lading UDS generator (lading/src/generator/unix_datagram.rs:277-289) tracks failed sends but does not accurately model client timeout behavior. The generator uses the async `socket.send()`, which blocks until the socket is ready, rather than failing fast when the socket cannot accept data within a timeout window.

Real clients (e.g., datadog-go) use a 100ms write timeout ([source](https://github.com/DataDog/datadog-go/blob/a6b9cdf87a42985c32e5abfc1e6e10e349107ee0/statsd/options.go#L19)). When the Agent cannot keep up, clients time out and drop metrics. Lading needs to model this behavior to accurately measure performance under CPU starvation conditions.
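
For reference, the blocking std `UnixDatagram` exposes exactly this kind of client-side write timeout via `set_write_timeout` (the async tokio socket does not, which is part of why a modeling layer is needed). The sketch below is not lading code; it just demonstrates the client behavior being modeled by filling a never-read socket pair until a timed-out send fails:

```rust
use std::io::ErrorKind;
use std::os::unix::net::UnixDatagram;
use std::time::Duration;

/// Fill a never-read datagram socket until a timed-out send fails,
/// returning the error kind observed.
fn fill_until_error() -> ErrorKind {
    // A connected pair; `_rx` never reads, so the kernel buffer fills up.
    let (tx, _rx) = UnixDatagram::pair().expect("socketpair");
    tx.set_write_timeout(Some(Duration::from_millis(100)))
        .expect("set timeout");
    let payload = [0u8; 8192];
    for _ in 0..10_000 {
        if let Err(e) = tx.send(&payload) {
            return e.kind();
        }
    }
    unreachable!("kernel buffer never filled");
}

fn main() {
    let kind = fill_until_error();
    // With SO_SNDTIMEO set, the blocked send surfaces as WouldBlock or TimedOut
    // instead of stalling the sender indefinitely.
    assert!(matches!(kind, ErrorKind::WouldBlock | ErrorKind::TimedOut));
    println!("send failed with {kind:?} after the buffer filled");
}
```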

## Current Implementation Analysis

### What Works Today

- Failed sends are tracked via the `request_failure` metric
- The throttle mechanism controls send rate via `throttle.wait_for(total_bytes)`
- Connection failures are tracked and retried
- The "peek then consume" pattern ensures throttle accuracy

### What's Missing

- No timeout modeling: sends block indefinitely
- No explicit queue to track pending writes
- No way to simulate client-side timeouts and dropped metrics
- Backpressure is not accurately measured

### Key Implementation Insights

After analyzing the codebase:

- **The throttle has no capacity-return mechanism**: this is by design and actually helps our use case
- **The throttle measures attempted load**, not successful sends: this is the correct behavior
- **The current implementation uses datagram semantics**: no partial writes, no retries
## Technical Design: Timeout-Aware Send Layer

### Core Architecture

Introduce a thin timeout-modeling layer between the throttle and the socket:

```rust
// Current flow:
throttle.wait_for(bytes) → socket.send(block) → track_metrics

// Proposed flow:
throttle.wait_for(bytes) → timeout_aware_send(block) → track_metrics
                                   ↳ internal queue management
```
### Implementation Design

```rust
use std::collections::VecDeque;
use std::io::{self, ErrorKind};
use std::time::{Duration, Instant};

use tokio::net::UnixDatagram;

enum SendResult {
    Success(usize),
    Queued,
    Dropped(DropReason),
    Failed(io::Error),
}

enum DropReason {
    QueueFull,
}

struct TimeoutAwareSender {
    socket: UnixDatagram,
    timeout: Duration,
    pending_queue: VecDeque<(Instant, Vec<u8>)>,
    max_queue_size: usize,
}

impl TimeoutAwareSender {
    // `try_send` is non-blocking, so no `.await` is needed here.
    fn send(&mut self, data: &[u8]) -> SendResult {
        // 1. Expire old entries first (deterministic)
        self.expire_old_entries();

        // 2. Try a non-blocking send
        match self.socket.try_send(data) {
            Ok(bytes) => SendResult::Success(bytes),
            Err(e) if e.kind() == ErrorKind::WouldBlock => {
                // 3. Queue the write if under the limit, otherwise drop it
                if self.pending_queue.len() < self.max_queue_size {
                    self.pending_queue.push_back((Instant::now(), data.to_vec()));
                    SendResult::Queued
                } else {
                    SendResult::Dropped(DropReason::QueueFull)
                }
            }
            Err(e) => SendResult::Failed(e),
        }
    }

    fn expire_old_entries(&mut self) {
        let cutoff = Instant::now() - self.timeout;
        while let Some((queued_at, _)) = self.pending_queue.front() {
            if *queued_at < cutoff {
                self.pending_queue.pop_front();
                // Track the expiration metric here
            } else {
                break;
            }
        }
    }
}
```
### Key Design Decisions

1. **Maintain Throttle Semantics**: The throttle continues to control the rate of send attempts, not successful sends. This accurately models real clients, which attempt to send at a configured rate regardless of success.

2. **Bounded Queue**: Use a fixed-size queue to:
   - Prevent unbounded memory growth
   - Maintain determinism
   - Model realistic client behavior (clients have finite buffers)

3. **Deterministic Expiration**: Check for expired entries on each send attempt, not via background timers. This ensures:
   - Predictable behavior
   - No additional threads or tasks
   - Consistent measurement

4. **Clear Result Types**: Distinguish between:
   - Success: data sent immediately
   - Queued: added to the pending queue
   - Dropped: queue full or other failure
   - Expired: timed out in the queue
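
As a sketch, these outcomes might map onto an enum plus a metric label. The names here are assumptions, not lading's actual types; note that "expired" is emitted from the expiry path rather than from a send return, so the result enum carries a `Failed` variant instead:

```rust
use std::io;

/// Hypothetical result type for a single send attempt.
#[derive(Debug)]
enum SendResult {
    Success(usize),
    Queued,
    Dropped(DropReason),
    Failed(io::Error),
}

#[derive(Debug)]
enum DropReason {
    QueueFull,
}

/// Label for a `send_attempt_result`-style counter. "expired" would be
/// recorded separately, from the queue-expiry path.
fn result_label(result: &SendResult) -> &'static str {
    match result {
        SendResult::Success(_) => "success",
        SendResult::Queued => "queued",
        SendResult::Dropped(_) => "dropped",
        SendResult::Failed(_) => "failed",
    }
}

fn main() {
    assert_eq!(result_label(&SendResult::Queued), "queued");
    assert_eq!(
        result_label(&SendResult::Dropped(DropReason::QueueFull)),
        "dropped"
    );
}
```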

### Integration Points

The implementation requires minimal changes to existing code:

1. **In `Child::spin()`**: Replace direct `socket.send()` calls with `TimeoutAwareSender`
2. **Configuration**: Add optional timeout parameters
3. **Metrics**: Add new counters for queue behavior
4. **No throttle changes**: The throttle remains unchanged

### Coordinated Omission Avoidance

This design avoids coordinated omission by:

- Attempting sends on every throttle tick (as required by blt's comment)
- Never blocking on socket readiness
- Tracking all attempted operations, not just successful ones
- Maintaining the expected send rate regardless of socket congestion

## Implementation Steps

1. **Create the `TimeoutAwareSender` struct**
   - Implement non-blocking send logic
   - Add queue management with expiration
   - Define clear result types

2. **Integrate into the unix_datagram generator**
   - Replace `socket.send()` calls
   - Maintain existing metrics
   - Add new timeout-specific metrics

3. **Add configuration**

   ```yaml
   generator:
     unix_datagram:
       path: /var/run/datadog/dsd.socket
       bytes_per_second: 10_000_000
       client_behavior:
         timeout_ms: 100         # Optional, defaults to no timeout
         max_queue_depth: 1000   # Optional, defaults to a reasonable limit
   ```
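
One possible in-memory shape for that block, treating the defaults the YAML comments suggest as assumptions:

```rust
use std::time::Duration;

/// Hypothetical parsed form of the `client_behavior` block above.
struct ClientBehavior {
    /// `None` preserves today's blocking behavior (no timeout).
    timeout: Option<Duration>,
    /// Bound on the pending-write queue.
    max_queue_depth: usize,
}

impl Default for ClientBehavior {
    fn default() -> Self {
        // Assumed defaults: no timeout, and a 1000-entry queue bound.
        Self {
            timeout: None,
            max_queue_depth: 1000,
        }
    }
}

fn main() {
    let defaults = ClientBehavior::default();
    assert!(defaults.timeout.is_none());
    assert_eq!(defaults.max_queue_depth, 1000);
}
```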

4. **Testing strategy**
   - Unit tests for `TimeoutAwareSender` in isolation
   - Property tests for queue bounds and determinism
   - Integration tests with real sockets
   - Verification that the throttle rate matches attempted sends
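
As a sketch of the first bullet, the expiration logic can be unit-tested without a socket by passing `now` explicitly (the function shape is illustrative, not lading's actual API):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical free-function form of `expire_old_entries`, taking `now` as
/// a parameter for determinism. Drops queued entries older than `timeout`
/// and returns how many expired.
fn expire_old_entries(
    queue: &mut VecDeque<(Instant, Vec<u8>)>,
    timeout: Duration,
    now: Instant,
) -> usize {
    let cutoff = now - timeout;
    let mut expired = 0;
    while let Some((queued_at, _)) = queue.front() {
        if *queued_at < cutoff {
            queue.pop_front();
            expired += 1;
        } else {
            break;
        }
    }
    expired
}

fn main() {
    let now = Instant::now();
    let mut queue = VecDeque::new();
    // One write queued 250ms ago, one 50ms ago; the timeout is 100ms.
    queue.push_back((now - Duration::from_millis(250), vec![1u8]));
    queue.push_back((now - Duration::from_millis(50), vec![2u8]));
    assert_eq!(expire_old_entries(&mut queue, Duration::from_millis(100), now), 1);
    // Only the fresh (50ms-old) entry survives.
    assert_eq!(queue.len(), 1);
    assert_eq!(queue.front().unwrap().1, vec![2u8]);
}
```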

## Metrics to Track

Existing metrics remain unchanged:

- `bytes_written`: successfully sent bytes
- `packets_sent`: successfully sent packets
- `request_failure`: failed send attempts

New metrics for timeout behavior:

- `writes_queued`: writes added to the pending queue
- `writes_expired`: writes that exceeded the timeout window
- `writes_dropped`: writes dropped due to queue limits
- `queue_depth`: current number of pending writes
- `send_attempt_result{result="success|queued|dropped|expired"}`: result distribution
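
A minimal sketch of tallying the `send_attempt_result` distribution, with a plain `HashMap` standing in for whatever metrics backend lading uses:

```rust
use std::collections::HashMap;

/// Count occurrences of each result label; a HashMap stands in for the
/// real metrics backend.
fn tally(labels: &[&'static str]) -> HashMap<&'static str, u64> {
    let mut counts = HashMap::new();
    for label in labels {
        *counts.entry(*label).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // "expired" is recorded from the queue-expiry path, not from a send return.
    let counts = tally(&["success", "success", "queued", "dropped", "expired"]);
    assert_eq!(counts["success"], 2);
    assert_eq!(counts["queued"], 1);
    assert_eq!(counts["expired"], 1);
}
```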

## Benefits of This Approach

1. **Accurate Client Modeling**: matches real client behavior under load
2. **Throttle Accuracy**: maintains precise control over attempted load
3. **Deterministic**: no background tasks or non-deterministic draining
4. **Testable**: the timeout layer can be tested in isolation
5. **Backward Compatible**: opt-in via configuration
6. **Performance**: minimal overhead, no additional threads
## References

- Current implementation: lading/src/generator/unix_datagram.rs
- Throttle mechanism: lading_throttle/
- datadog-go timeout: 100ms default