
Commit 461848e

Avoid traversing entire directory on local search (#276)
* Fix local dir search hanging on large directory trees

  collect_files now:
  - Skips hidden dirs (.git) and build dirs (target, node_modules)
  - Limits recursion depth to 10
  - Stops after 5000 files
  - Handles PermissionDenied gracefully

  Previously, --local-dir . from a repo root would recursively traverse the entire tree including .git/ and target/, causing the search endpoint to hang.

* Reduce local search file limit to 50

* Fix collect_files: cap total entries visited, not just files collected

  The MAX_COLLECT_FILES limit only bounded files added to the result vec. In directories with many non-matching files, collect_files would still call canonicalize() and metadata() on every entry, causing hangs on large directory trees.

  Add a MAX_ENTRIES_VISITED counter (10,000) that bounds the total number of directory entries processed across the entire recursive walk, regardless of whether they match the prefix or are directories.

* reduce limits and error on permission denied
1 parent b3b4a9f commit 461848e

5 files changed

Lines changed: 173 additions & 307 deletions

dial9-tokio-telemetry/README.md

Lines changed: 13 additions & 13 deletions
@@ -101,7 +101,7 @@ dial9-tokio-telemetry is designed for always-on production use, but it's still e

 Yes, check out this [quick walkthrough (YouTube)](https://www.youtube.com/watch?v=zJOzU_6Mf7Q)!

-The [viewer](https://dial9-tokio-telemetry.netlify.app/) (autodeployed from code in `main`) is hosted on Netlify for convenience. You can [load the demo trace](https://dial9-tokio-telemetry.netlify.app/?trace=demo-trace.bin) directly, or use [serve.py](https://github.com/dial9-rs/dial9-tokio-telemetry/blob/main/dial9-tokio-telemetry/serve.py) to run it locally (pure HTML and JS, client side only).
+The [viewer](https://dial9-tokio-telemetry.netlify.app/) (autodeployed from code in `main`) is hosted on Netlify for convenience. You can [load the demo trace](https://dial9-tokio-telemetry.netlify.app/?trace=demo-trace.bin) directly, or use [serve.py](/dial9-tokio-telemetry/serve.py) to run it locally (pure HTML and JS, client side only).

 <img width="1288" height="659" alt="Screenshot 2026-03-01 at 3 52 59 PM" src="https://github.com/user-attachments/assets/77225801-70b1-4aef-b064-32bc2326b1ef" />

@@ -185,7 +185,7 @@ runtime.block_on(async {
 # }
 ```

-For frameworks like Axum where you don't control the spawn call, you need to wrap the accept loop. See [`examples/metrics-service/src/axum_traced.rs`](https://github.com/dial9-rs/dial9-tokio-telemetry/blob/main/examples/metrics-service/src/axum_traced.rs) for a working example that wraps both the accept loop and per-connection futures.
+For frameworks like Axum where you don't control the spawn call, you need to wrap the accept loop. See [`examples/metrics-service/src/axum_traced.rs`](/examples/metrics-service/src/axum_traced.rs) for a working example that wraps both the accept loop and per-connection futures.

 ## Custom events

@@ -219,7 +219,7 @@ record_event(
 # }
 ```

-For events with repeated string values (HTTP methods, endpoint paths, etc.), implement `Encodable` manually to use string interning — see [`examples/custom_events.rs`](https://github.com/dial9-rs/dial9-tokio-telemetry/blob/main/dial9-tokio-telemetry/examples/custom_events.rs) for a complete example showing both patterns.
+For events with repeated string values (HTTP methods, endpoint paths, etc.), implement `Encodable` manually to use string interning — see [`examples/custom_events.rs`](/dial9-tokio-telemetry/examples/custom_events.rs) for a complete example showing both patterns.

 Custom events are encoded into the same thread-local buffer as built-in events (~100–200 ns per call) and appear in the trace viewer alongside poll/park/wake events.

@@ -263,7 +263,7 @@ let (runtime, guard) = TracedRuntime::builder()
 # fn main() {}
 ```

-This pulls in [`dial9-perf-self-profile`](https://github.com/dial9-rs/dial9-tokio-telemetry/tree/main/perf-self-profile) for `perf_event_open` access. It records `CpuSample` events with raw stack frame addresses. When a `trace_path` is set, the background worker automatically symbolizes sealed segments (resolving addresses to function names via `/proc/self/maps` and blazesym) and gzip-compresses them on disk.
+This pulls in [`dial9-perf-self-profile`](/perf-self-profile) for `perf_event_open` access. It records `CpuSample` events with raw stack frame addresses. When a `trace_path` is set, the background worker automatically symbolizes sealed segments (resolving addresses to function names via `/proc/self/maps` and blazesym) and gzip-compresses them on disk.

 #### Requirements

@@ -351,7 +351,7 @@ let (io_rt, io_handle) = guard.trace_runtime("io").build(io_builder)?;
 # }
 ```

-See [`examples/thread_per_core.rs`](https://github.com/dial9-rs/dial9-tokio-telemetry/blob/main/dial9-tokio-telemetry/examples/thread_per_core.rs) and [`examples/multi_runtime.rs`](https://github.com/dial9-rs/dial9-tokio-telemetry/blob/main/dial9-tokio-telemetry/examples/multi_runtime.rs) for complete examples.
+See [`examples/thread_per_core.rs`](/dial9-tokio-telemetry/examples/thread_per_core.rs) and [`examples/multi_runtime.rs`](/dial9-tokio-telemetry/examples/multi_runtime.rs) for complete examples.

 **Shutdown**: Drop all runtimes before the `TelemetryGuard` so worker threads exit and flush their thread-local buffers. For a clean shutdown that waits for the background worker (e.g. S3 uploads) to drain, call `guard.graceful_shutdown(timeout)` instead of dropping the guard.

@@ -361,7 +361,7 @@ See [`examples/thread_per_core.rs`](https://github.com/dial9-rs/dial9-tokio-tele

 ### Analyzing traces

-[`dial9-viewer`](https://github.com/dial9-rs/dial9-tokio-telemetry/tree/main/dial9-viewer) is an interactive trace viewer and S3 browser. Point it at a local directory or an S3 bucket to browse and visualize traces in the browser. [Here's a demo.](https://www.youtube.com/watch?v=zJOzU_6Mf7Q)
+[`dial9-viewer`](/dial9-viewer) is an interactive trace viewer and S3 browser. Point it at a local directory or an S3 bucket to browse and visualize traces in the browser. [Here's a demo.](https://www.youtube.com/watch?v=zJOzU_6Mf7Q)

 ```bash
 # Install
@@ -388,7 +388,7 @@ cargo run --example analyze_trace --features analysis -- /tmp/my_traces/trace.0.
 cargo run --example trace_to_jsonl --features analysis -- /tmp/my_traces/trace.0.bin.gz output.jsonl
 ```

-See [TRACE_ANALYSIS_GUIDE.md](https://github.com/dial9-rs/dial9-tokio-telemetry/blob/main/dial9-tokio-telemetry/TRACE_ANALYSIS_GUIDE.md) for a walkthrough of diagnosing scheduling delays and CPU hotspots from trace data.
+See [TRACE_ANALYSIS_GUIDE.md](/dial9-tokio-telemetry/TRACE_ANALYSIS_GUIDE.md) for a walkthrough of diagnosing scheduling delays and CPU hotspots from trace data.

 ## Features

@@ -456,7 +456,7 @@ cargo run --example telemetry_rotating # manual setup + rotating writer conf
 cargo run --example multi_runtime # multiple runtimes, manual TelemetryCore
 ```

-The [`examples/metrics-service`](https://github.com/dial9-rs/dial9-tokio-telemetry/tree/main/examples/metrics-service) directory has a full Axum service with DynamoDB persistence, a load-generating client, and telemetry wired up end-to-end.
+The [`examples/metrics-service`](/examples/metrics-service) directory has a full Axum service with DynamoDB persistence, a load-generating client, and telemetry wired up end-to-end.

 ## Overhead

@@ -476,11 +476,11 @@ Overhead: 3.2%

 This repo is a Cargo workspace with five members:

-- [`dial9-tokio-telemetry`](https://github.com/dial9-rs/dial9-tokio-telemetry/tree/main/dial9-tokio-telemetry) — the main crate
-- [`dial9-viewer`](https://github.com/dial9-rs/dial9-tokio-telemetry/tree/main/dial9-viewer) — CLI and web UI for browsing traces in S3 or on the local filesystem
-- [`dial9-macro`](https://github.com/dial9-rs/dial9-tokio-telemetry/tree/main/dial9-macro) — the `#[dial9_tokio_telemetry::main]` attribute macro
-- [`dial9-perf-self-profile`](https://github.com/dial9-rs/dial9-tokio-telemetry/tree/main/perf-self-profile) — minimal Linux `perf_event_open` wrapper for CPU profiling and scheduler events
-- [`examples/metrics-service`](https://github.com/dial9-rs/dial9-tokio-telemetry/tree/main/examples/metrics-service) — end-to-end example service
+- [`dial9-tokio-telemetry`](/dial9-tokio-telemetry) — the main crate
+- [`dial9-viewer`](/dial9-viewer) — CLI and web UI for browsing traces in S3 or on the local filesystem
+- [`dial9-macro`](/dial9-macro) — the `#[dial9_tokio_telemetry::main]` attribute macro
+- [`dial9-perf-self-profile`](/perf-self-profile) — minimal Linux `perf_event_open` wrapper for CPU profiling and scheduler events
+- [`examples/metrics-service`](/examples/metrics-service) — end-to-end example service

 ## Future work

dial9-tokio-telemetry/benches/overhead_bench.rs

Lines changed: 0 additions & 1 deletion
@@ -110,7 +110,6 @@ fn run_bench(mode: &str, duration_secs: u64) -> BenchResult {
     let (server_rt, guard): (tokio::runtime::Runtime, Option<TelemetryGuard>) = match mode {
         "telemetry" => {
             let writer = RotatingWriter::single_file("/tmp/overhead_bench_trace.bin").unwrap();
-            #[allow(unused_mut)]
             let mut tb = TracedRuntime::builder().with_task_tracking(true);
             #[cfg(target_os = "linux")]
             {

dial9-tokio-telemetry/examples/many_workers.rs

Lines changed: 0 additions & 57 deletions
This file was deleted.

dial9-viewer/src/storage.rs

Lines changed: 65 additions & 2 deletions
@@ -220,7 +220,7 @@ impl StorageBackend for LocalBackend {
         let prefix2 = prefix.clone();
         tokio::task::spawn_blocking(move || {
             let mut objects = Vec::new();
-            collect_files(&root, &root, &prefix2, &mut objects)?;
+            collect_files(&root, &root, &prefix2, &mut objects, 0, &mut 0)?;
             objects.sort_by(|a, b| a.key.cmp(&b.key));
             Ok(objects)
         })
@@ -313,18 +313,49 @@ impl StorageBackend for LocalBackend {
     }
 }

+/// Maximum directory depth to recurse into when listing local files.
+const MAX_COLLECT_DEPTH: u32 = 10;
+
+/// Maximum number of files to return from a local directory listing.
+const MAX_COLLECT_FILES: usize = 50;
+
+/// Maximum number of directory entries to visit (files + dirs) across the
+/// entire recursive walk. This bounds the number of syscalls (`canonicalize`,
+/// `metadata`) so a huge directory tree cannot hang the listing.
+const MAX_ENTRIES_VISITED: usize = 500;
+
+/// Directory names to skip during recursive file collection.
+fn is_skipped_dir(name: &str) -> bool {
+    name.starts_with('.') || matches!(name, "target" | "node_modules")
+}
+
 fn collect_files(
     root: &Path,
     dir: &Path,
     prefix: &str,
     out: &mut Vec<ObjectInfo>,
+    depth: u32,
+    visited: &mut usize,
 ) -> Result<(), StorageError> {
+    if depth > MAX_COLLECT_DEPTH
+        || out.len() >= MAX_COLLECT_FILES
+        || *visited >= MAX_ENTRIES_VISITED
+    {
+        return Ok(());
+    }
     let entries = match std::fs::read_dir(dir) {
         Ok(e) => e,
         Err(e) if e.kind() == std::io::ErrorKind::NotFound => return Ok(()),
+        Err(e) if e.kind() == std::io::ErrorKind::PermissionDenied => {
+            return Err(StorageError::Other("permission denied".into()));
+        }
         Err(e) => return Err(StorageError::Other(e.to_string())),
     };
     for entry in entries {
+        *visited += 1;
+        if out.len() >= MAX_COLLECT_FILES || *visited >= MAX_ENTRIES_VISITED {
+            break;
+        }
         let entry = entry.map_err(|e| StorageError::Other(e.to_string()))?;
         let path = entry.path();
         // Resolve symlinks and verify the target stays within root.
@@ -333,7 +364,11 @@ fn collect_files(
             _ => continue,
         };
         if canonical.is_dir() {
-            collect_files(root, &canonical, prefix, out)?;
+            let name = entry.file_name();
+            let name = name.to_string_lossy();
+            if !is_skipped_dir(&name) {
+                collect_files(root, &canonical, prefix, out, depth + 1, visited)?;
+            }
         } else if canonical.is_file() {
             let key = path
                 .strip_prefix(root)
@@ -357,3 +392,31 @@ fn collect_files(
     }
     Ok(())
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn collect_files_caps_entries_visited() {
+        let dir = tempfile::tempdir().unwrap();
+        // Create more files than MAX_ENTRIES_VISITED to prove we stop early.
+        let n = MAX_ENTRIES_VISITED + 500;
+        for i in 0..n {
+            std::fs::write(dir.path().join(format!("file_{i:05}.bin")), b"x").unwrap();
+        }
+        let mut out = Vec::new();
+        let mut visited = 0;
+        collect_files(dir.path(), dir.path(), "", &mut out, 0, &mut visited).unwrap();
+        // visited must be capped — we should NOT have iterated all n files.
+        assert!(
+            visited <= MAX_ENTRIES_VISITED,
+            "visited {visited} entries, expected at most {MAX_ENTRIES_VISITED}"
+        );
+        assert!(
+            out.len() <= MAX_COLLECT_FILES,
+            "collected {} files, expected at most {MAX_COLLECT_FILES}",
+            out.len()
+        );
+    }
+}
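
For readers who want to try the traversal limits outside the viewer crate, the following is a minimal standalone sketch of the same idea: a recursive walk bounded by depth, by files collected, and by total entries visited, which also skips hidden and build directories. It is not the code from this commit. The names `bounded_walk`, `MAX_DEPTH`, `MAX_FILES`, and `MAX_VISITED` are hypothetical, it returns plain paths instead of `ObjectInfo`, it uses `std::io::Error` instead of `StorageError`, and it ignores symlinks rather than canonicalizing them and checking that they stay within the root.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Illustrative limits (assumed names); the values mirror the constants in the diff.
const MAX_DEPTH: u32 = 10;
const MAX_FILES: usize = 50;
const MAX_VISITED: usize = 500;

/// Same skip rule as `is_skipped_dir` in the diff: hidden dirs and build dirs.
fn is_skipped_dir(name: &str) -> bool {
    name.starts_with('.') || matches!(name, "target" | "node_modules")
}

/// Hypothetical helper: walks `dir` and pushes regular files into `out`
/// until the depth, file-count, or visited-entry limit is reached.
fn bounded_walk(
    dir: &Path,
    out: &mut Vec<PathBuf>,
    depth: u32,
    visited: &mut usize,
) -> io::Result<()> {
    // Check every limit before issuing the read_dir syscall.
    if depth > MAX_DEPTH || out.len() >= MAX_FILES || *visited >= MAX_VISITED {
        return Ok(());
    }
    for entry in fs::read_dir(dir)? {
        // Count every entry, matching or not, so the total work stays bounded.
        *visited += 1;
        if out.len() >= MAX_FILES || *visited >= MAX_VISITED {
            break;
        }
        let entry = entry?;
        // file_type() does not follow symlinks, so symlinks are neither
        // recursed into nor collected in this sketch.
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            let name = entry.file_name();
            if !is_skipped_dir(&name.to_string_lossy()) {
                bounded_walk(&entry.path(), out, depth + 1, visited)?;
            }
        } else if file_type.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}
```

A caller would start the walk with `bounded_walk(root, &mut out, 0, &mut visited)`, where `visited` is a `usize` initialized to zero. Sharing one `visited` counter across the whole recursion, rather than resetting it per directory, is what keeps the total number of filesystem calls bounded no matter how the tree is shaped.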
