Enhance scanner pipeline with timeout, resilience, and backpressure#399
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the scanner pipeline’s robustness and operational behavior by adding backpressure to parallel repo scans, enforcing timeouts in git operations/enumeration, and making archive extraction/decompression more resilient (with reduced log noise). It also expands the context verifier’s literal/multiline handling, updates benchmarking docs/tooling, and bumps the release to v1.103.0.
Changes:
- Add bounded in-flight repo scanning (permit pool) plus optional saturation tracking logs to restore end-to-end backpressure.
- Enforce git wall-clock timeouts and propagate enumeration deadlines through indexing/metadata/diff phases; fall back to raw-byte scanning on archive/decompression failures.
- Expand context verifier literal coverage (raw/prefixed/triple/backtick) and stitch multiline assignments; update goldens, docs, and release metadata.
Reviewed changes
Copilot reviewed 14 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| testdata/parsers/context_verifier_golden.json | Updates golden expectations for improved context extraction (reduces false positives, adds multiline cases). |
| src/scanner/runner.rs | Adds backpressure via an in-flight permit pool and optional saturation tracker thread. |
| src/scanner/enumerate.rs | Adds decompression fallback to raw bytes and integrates deadline/timeouts during repo/blob enumeration. |
| src/parser/lexer.rs | Expands literal parsing (raw/prefixed/triple/backtick), multiline context stitching, and improves candidate heuristics with new tests. |
| src/parser.rs | Adds HTML multiline <script> assignment context extraction test. |
| src/git_repo_enumerator.rs | Adds deadline-aware metadata enumeration path and checks during assembly. |
| src/git_metadata_graph.rs | Propagates deadline checks through object indexing and metadata/tree traversal. |
| src/git_binary.rs | Adds wall-clock timeouts to git subprocess execution using wait-timeout. |
| src/decompress.rs | Makes tar extraction resilient to truncation, falls back to raw scan when needed, demotes expected warnings to debug, raises cap to 4GB, and fixes ASAR error handling. |
| docs/benchmark/README.md | Rewrites benchmark documentation (flags, output layout, chart regeneration). |
| docs/benchmark/main.go | Refactors benchmark runner, adds version capture, timestamped output, Markdown reporting, and QuickChart PNG rendering/regeneration. |
| docs-site/docs/changelog.md | Adds v1.103.0 release notes. |
| CHANGELOG.md | Adds v1.103.0 release notes. |
| Cargo.toml | Bumps version to v1.103.0 and adds wait-timeout dependency. |
| Cargo.lock | Locks updated version and dependency additions. |
| .gitignore | Ignores benchmark binary and generated benchmark output directories. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+966
to
+976
| impl Drop for ScanGuard { | ||
| fn drop(&mut self) { | ||
| self.scan_counter.fetch_sub(1, Ordering::Relaxed); | ||
| // Bounded to the same cap we pre-filled, and we | ||
| // only send one permit per `recv`, so this can | ||
| // never fail with `Full`. Use `try_send` so a | ||
| // logic bug surfaces immediately rather than | ||
| // blocking a worker thread on cleanup. | ||
| let _ = self.permit_release.try_send(()); | ||
| } | ||
| } |
Comment on lines
142
to
160
| Ok(mut out_file) => { | ||
| if let Err(e) = std::io::copy(&mut entry, &mut out_file) { | ||
| tracing::debug!("failed to extract {}: {}", out_path.display(), e); | ||
| continue; | ||
| let expected_size = entry.size(); | ||
| let copied = match std::io::copy(&mut entry, &mut out_file) { | ||
| Ok(n) => n, | ||
| Err(e) => { | ||
| tracing::debug!("failed to extract {}: {}", out_path.display(), e); | ||
| truncated = true; | ||
| break; | ||
| } | ||
| }; | ||
| if copied != expected_size { | ||
| tracing::debug!( | ||
| "tar entry {} in {} was truncated after {copied} of {expected_size} bytes", | ||
| path_in_tar, | ||
| archive_path.display() | ||
| ); | ||
| truncated = true; | ||
| break; | ||
| } |
Comment on lines
+375
to
+384
| fn context_lines(text: &str) -> Vec<Cow<'_, str>> { | ||
| let mut lines: Vec<Cow<'_, str>> = text.lines().map(Cow::Borrowed).collect(); | ||
| let mut current = String::new(); | ||
| let mut active = false; | ||
|
|
||
| for line in text.lines() { | ||
| let trimmed = line.trim(); | ||
| if trimmed.is_empty() { | ||
| continue; | ||
| } |
Comment on lines
+379
to
+383
| let secs = timeout.as_secs(); | ||
| warn!("git command exceeded {secs}s timeout; killing pid {}", child.id()); | ||
| let _ = child.kill(); | ||
| let _ = child.wait(); | ||
| let _ = stdout_reader.join(); |
Comment on lines
+979
to
+983
| if let Err(err) = self.permit_release.try_send(()) { | ||
| debug_assert!( | ||
| false, | ||
| "permit pool overflowed or disconnected on cleanup: {err}" | ||
| ); |
Comment on lines
+238
to
+242
| #[inline] | ||
| fn check_deadline(deadline: Option<Instant>, phase: &str) -> Result<()> { | ||
| if deadline.is_some_and(|deadline| Instant::now() > deadline) { | ||
| bail!("{phase} timed out") | ||
| } |
Comment on lines
+375
to
+383
| fn context_lines(text: &str) -> Vec<Cow<'_, str>> { | ||
| // Build the candidate list incrementally so that a stitched multi-line | ||
| // statement is emitted right after the line that completes it, rather | ||
| // than appended after the whole file. Callers early-exit (`is_break`) | ||
| // as soon as the sink is satisfied, so out-of-order stitched context | ||
| // would be skipped entirely once an early exit fires. | ||
| let mut lines: Vec<Cow<'_, str>> = Vec::new(); | ||
| let mut current = String::new(); | ||
| let mut active = false; |
Comment on lines
286
to
290
| func handleHTTP(w http.ResponseWriter, req *http.Request) { | ||
| proxy := httputil.ReverseProxy{ | ||
| Director: func(r *http.Request) {}, | ||
| } | ||
| // The incoming forward-proxy request already carries an absolute URL, so the | ||
| // rewrite is a no-op: forward it as-is. | ||
| proxy := httputil.ReverseProxy{Rewrite: func(*httputil.ProxyRequest) {}} | ||
| proxy.ServeHTTP(w, req) |
Comment on lines
+879
to
+887
| let tracker_stop = Arc::new(AtomicBool::new(false)); | ||
| let tracker_handle = if global_args.verbose >= 1 { | ||
| let stop = Arc::clone(&tracker_stop); | ||
| let active = Arc::clone(&active_scans); | ||
| let rx = repo_rx.clone(); | ||
| let permits = permit_return.clone(); | ||
| let cap = scan_inflight_cap; | ||
| std::thread::Builder::new() | ||
| .name("kf-scan-tracker".to_string()) |
Comment on lines
+508
to
+513
| #[inline] | ||
| fn check_deadline(deadline: Option<Instant>, phase: &str) -> Result<()> { | ||
| if deadline.is_some_and(|deadline| Instant::now() > deadline) { | ||
| bail!("{phase} timed out") | ||
| } | ||
| Ok(()) |
Comment on lines
+865
to
+866
| // Sized at 2× the rayon worker count so workers always have a few ready | ||
| // repos staged and pick up the next as soon as one finishes. |
| ## [v1.103.0] | ||
| - Git clone and remote-update operations now enforce wall-clock timeouts (20 min and 10 min defaults respectively) so a single unresponsive remote cannot park a clone worker indefinitely. Configurable via `KF_GIT_CLONE_TIMEOUT_SECS` and `KF_GIT_UPDATE_TIMEOUT_SECS`. | ||
| - Deadline enforcement is now propagated through repository object indexing, commit-graph traversal, tree traversal, and blob metadata assembly, replacing the previous 100 ms polling loop with cooperative cancellation at each phase boundary. | ||
| - Bounded concurrent in-flight repo scans with a permit pool sized at `2× repo_concurrency`. Without this cap, a large multi-repo scan could queue thousands of closures into rayon's unbounded work queue and exhaust memory before any scan completed. Pass `-v` to enable a saturation-tracker thread that logs queue depth, active scan count, and permit availability every ~15 s. |
| ## [v1.103.0] | ||
| - Git clone and remote-update operations now enforce wall-clock timeouts (20 min and 10 min defaults respectively) so a single unresponsive remote cannot park a clone worker indefinitely. Configurable via `KF_GIT_CLONE_TIMEOUT_SECS` and `KF_GIT_UPDATE_TIMEOUT_SECS`. | ||
| - Deadline enforcement is now propagated through repository object indexing, commit-graph traversal, tree traversal, and blob metadata assembly, replacing the previous 100 ms polling loop with cooperative cancellation at each phase boundary. | ||
| - Bounded concurrent in-flight repo scans with a permit pool sized at `2× repo_concurrency`. Without this cap, a large multi-repo scan could queue thousands of closures into rayon's unbounded work queue and exhaust memory before any scan completed. Pass `-v` to enable a saturation-tracker thread that logs queue depth, active scan count, and permit availability every ~15 s. |
Comment on lines
+425
to
+438
| // Timeout. Killing the whole process group closes every | ||
| // writer on the stdout/stderr pipes — including helpers like | ||
| // `git-remote-https`/`ssh` that may have inherited git's pipe | ||
| // ends — which unblocks the reader threads. `kill_process_tree` | ||
| // also reaps the direct child so we don't leave a zombie. | ||
| let secs = timeout.as_secs(); | ||
| warn!( | ||
| "git command exceeded {secs}s timeout; killing process group of pid {}", | ||
| child.id() | ||
| ); | ||
| kill_process_tree(&mut child); | ||
| let _ = stdout_reader.join(); | ||
| let _ = stderr_reader.join(); | ||
| return Err(GitError::Timeout { secs }); |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request introduces several improvements and fixes to archive extraction, decompression, and logging, as well as updating documentation and dependencies for the v1.103.0 release. The most significant changes include more resilient handling of truncated or malformed archives, reducing log noise by demoting expected warnings to debug level, and enhancing the context verifier and documentation. The version is also bumped to v1.103.0 and a new dependency is added.
Archive extraction and decompression improvements:
Logging and error handling:
Documentation and configuration:
wait-timeoutcrate to dependencies.Cargo.toml.