Skip to content

Enhance scanner pipeline with timeout, resilience, and backpressure#399

Merged
mickgmdb merged 8 commits into
mainfrom
development
Jun 16, 2026
Merged

Enhance scanner pipeline with timeout, resilience, and backpressure#399
mickgmdb merged 8 commits into
mainfrom
development

Conversation

@mickgmdb

Copy link
Copy Markdown
Collaborator

This pull request introduces several improvements and fixes to archive extraction, decompression, and logging, as well as updating documentation and dependencies for the v1.103.0 release. The most significant changes include more resilient handling of truncated or malformed archives, reducing log noise by demoting expected warnings to debug level, and enhancing the context verifier and documentation. The version is also bumped to v1.103.0 and a new dependency is added.

Archive extraction and decompression improvements:

  • Tar archive extraction is now resilient to truncation or corruption: successfully extracted entries are returned even if the archive is incomplete, and if no entries are extracted, the file falls back to a raw-bytes scan instead of being skipped. [1] [2] [3]
  • Archive decompression failures (zip, asar, etc.) now fall back to scanning the file's raw bytes instead of skipping the file.
  • The single-stream decompression cap is raised from 512 MB to 4 GB.

Logging and error handling:

  • Expected truncation and path-filter log events in tar, zip, and asar extractors are demoted from warn to debug to reduce log noise in normal operation. [1] [2] [3] [4] [5] [6] [7]
  • ASAR extraction now returns errors instead of silently yielding an empty archive on failure.

Documentation and configuration:

  • The changelog and documentation are updated for v1.103.0, detailing new features, bug fixes, and improvements, including new timeout settings for git operations and expanded context verifier support for string and numeric literals. [1] [2] [3]
  • Added the wait-timeout crate to dependencies.
  • Version bumped to v1.103.0 in Cargo.toml.

Copilot AI review requested due to automatic review settings June 15, 2026 23:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the scanner pipeline’s robustness and operational behavior by adding backpressure to parallel repo scans, enforcing timeouts in git operations/enumeration, and making archive extraction/decompression more resilient (with reduced log noise). It also expands the context verifier’s literal/multiline handling, updates benchmarking docs/tooling, and bumps the release to v1.103.0.

Changes:

  • Add bounded in-flight repo scanning (permit pool) plus optional saturation tracking logs to restore end-to-end backpressure.
  • Enforce git wall-clock timeouts and propagate enumeration deadlines through indexing/metadata/diff phases; fall back to raw-byte scanning on archive/decompression failures.
  • Expand context verifier literal coverage (raw/prefixed/triple/backtick) and stitch multiline assignments; update goldens, docs, and release metadata.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
testdata/parsers/context_verifier_golden.json Updates golden expectations for improved context extraction (reduces false positives, adds multiline cases).
src/scanner/runner.rs Adds backpressure via an in-flight permit pool and optional saturation tracker thread.
src/scanner/enumerate.rs Adds decompression fallback to raw bytes and integrates deadline/timeouts during repo/blob enumeration.
src/parser/lexer.rs Expands literal parsing (raw/prefixed/triple/backtick), multiline context stitching, and improves candidate heuristics with new tests.
src/parser.rs Adds HTML multiline <script> assignment context extraction test.
src/git_repo_enumerator.rs Adds deadline-aware metadata enumeration path and checks during assembly.
src/git_metadata_graph.rs Propagates deadline checks through object indexing and metadata/tree traversal.
src/git_binary.rs Adds wall-clock timeouts to git subprocess execution using wait-timeout.
src/decompress.rs Makes tar extraction resilient to truncation, falls back to raw scan when needed, demotes expected warnings to debug, raises cap to 4GB, and fixes ASAR error handling.
docs/benchmark/README.md Rewrites benchmark documentation (flags, output layout, chart regeneration).
docs/benchmark/main.go Refactors benchmark runner, adds version capture, timestamped output, Markdown reporting, and QuickChart PNG rendering/regeneration.
docs-site/docs/changelog.md Adds v1.103.0 release notes.
CHANGELOG.md Adds v1.103.0 release notes.
Cargo.toml Bumps version to v1.103.0 and adds wait-timeout dependency.
Cargo.lock Locks updated version and dependency additions.
.gitignore Ignores benchmark binary and generated benchmark output directories.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/scanner/runner.rs
Comment on lines +966 to +976
impl Drop for ScanGuard {
fn drop(&mut self) {
self.scan_counter.fetch_sub(1, Ordering::Relaxed);
// Bounded to the same cap we pre-filled, and we
// only send one permit per `recv`, so this can
// never fail with `Full`. Use `try_send` so a
// logic bug surfaces immediately rather than
// blocking a worker thread on cleanup.
let _ = self.permit_release.try_send(());
}
}
Comment thread src/decompress.rs
Comment on lines 142 to 160
Ok(mut out_file) => {
if let Err(e) = std::io::copy(&mut entry, &mut out_file) {
tracing::debug!("failed to extract {}: {}", out_path.display(), e);
continue;
let expected_size = entry.size();
let copied = match std::io::copy(&mut entry, &mut out_file) {
Ok(n) => n,
Err(e) => {
tracing::debug!("failed to extract {}: {}", out_path.display(), e);
truncated = true;
break;
}
};
if copied != expected_size {
tracing::debug!(
"tar entry {} in {} was truncated after {copied} of {expected_size} bytes",
path_in_tar,
archive_path.display()
);
truncated = true;
break;
}

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 2 comments.

Comment thread src/parser/lexer.rs Outdated
Comment on lines +375 to +384
fn context_lines(text: &str) -> Vec<Cow<'_, str>> {
let mut lines: Vec<Cow<'_, str>> = text.lines().map(Cow::Borrowed).collect();
let mut current = String::new();
let mut active = false;

for line in text.lines() {
let trimmed = line.trim();
if trimmed.is_empty() {
continue;
}
Comment thread src/git_binary.rs
Comment on lines +379 to +383
let secs = timeout.as_secs();
warn!("git command exceeded {secs}s timeout; killing pid {}", child.id());
let _ = child.kill();
let _ = child.wait();
let _ = stdout_reader.join();

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.

Comment thread src/scanner/runner.rs
Comment on lines +979 to +983
if let Err(err) = self.permit_release.try_send(()) {
debug_assert!(
false,
"permit pool overflowed or disconnected on cleanup: {err}"
);
Comment thread docs/benchmark/main.go
Comment on lines +238 to +242
#[inline]
fn check_deadline(deadline: Option<Instant>, phase: &str) -> Result<()> {
if deadline.is_some_and(|deadline| Instant::now() > deadline) {
bail!("{phase} timed out")
}
Comment thread src/parser/lexer.rs Outdated
Comment on lines +375 to +383
fn context_lines(text: &str) -> Vec<Cow<'_, str>> {
// Build the candidate list incrementally so that a stitched multi-line
// statement is emitted right after the line that completes it, rather
// than appended after the whole file. Callers early-exit (`is_break`)
// as soon as the sink is satisfied, so out-of-order stitched context
// would be skipped entirely once an early exit fires.
let mut lines: Vec<Cow<'_, str>> = Vec::new();
let mut current = String::new();
let mut active = false;

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 3 comments.

Comment thread docs/benchmark/main.go
Comment on lines 286 to 290
func handleHTTP(w http.ResponseWriter, req *http.Request) {
proxy := httputil.ReverseProxy{
Director: func(r *http.Request) {},
}
// The incoming forward-proxy request already carries an absolute URL, so the
// rewrite is a no-op: forward it as-is.
proxy := httputil.ReverseProxy{Rewrite: func(*httputil.ProxyRequest) {}}
proxy.ServeHTTP(w, req)
Comment thread src/scanner/runner.rs
Comment on lines +879 to +887
let tracker_stop = Arc::new(AtomicBool::new(false));
let tracker_handle = if global_args.verbose >= 1 {
let stop = Arc::clone(&tracker_stop);
let active = Arc::clone(&active_scans);
let rx = repo_rx.clone();
let permits = permit_return.clone();
let cap = scan_inflight_cap;
std::thread::Builder::new()
.name("kf-scan-tracker".to_string())
Comment thread src/git_metadata_graph.rs
Comment on lines +508 to +513
#[inline]
fn check_deadline(deadline: Option<Instant>, phase: &str) -> Result<()> {
if deadline.is_some_and(|deadline| Instant::now() > deadline) {
bail!("{phase} timed out")
}
Ok(())

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.

Comment thread src/scanner/runner.rs
Comment on lines +865 to +866
// Sized at 2× the rayon worker count so workers always have a few ready
// repos staged and pick up the next as soon as one finishes.
Comment thread CHANGELOG.md
## [v1.103.0]
- Git clone and remote-update operations now enforce wall-clock timeouts (20 min and 10 min defaults respectively) so a single unresponsive remote cannot park a clone worker indefinitely. Configurable via `KF_GIT_CLONE_TIMEOUT_SECS` and `KF_GIT_UPDATE_TIMEOUT_SECS`.
- Deadline enforcement is now propagated through repository object indexing, commit-graph traversal, tree traversal, and blob metadata assembly, replacing the previous 100 ms polling loop with cooperative cancellation at each phase boundary.
- Bounded concurrent in-flight repo scans with a permit pool sized at `2× repo_concurrency`. Without this cap, a large multi-repo scan could queue thousands of closures into rayon's unbounded work queue and exhaust memory before any scan completed. Pass `-v` to enable a saturation-tracker thread that logs queue depth, active scan count, and permit availability every ~15 s.
## [v1.103.0]
- Git clone and remote-update operations now enforce wall-clock timeouts (20 min and 10 min defaults respectively) so a single unresponsive remote cannot park a clone worker indefinitely. Configurable via `KF_GIT_CLONE_TIMEOUT_SECS` and `KF_GIT_UPDATE_TIMEOUT_SECS`.
- Deadline enforcement is now propagated through repository object indexing, commit-graph traversal, tree traversal, and blob metadata assembly, replacing the previous 100 ms polling loop with cooperative cancellation at each phase boundary.
- Bounded concurrent in-flight repo scans with a permit pool sized at `2× repo_concurrency`. Without this cap, a large multi-repo scan could queue thousands of closures into rayon's unbounded work queue and exhaust memory before any scan completed. Pass `-v` to enable a saturation-tracker thread that logs queue depth, active scan count, and permit availability every ~15 s.
Comment thread src/git_binary.rs
Comment on lines +425 to +438
// Timeout. Killing the whole process group closes every
// writer on the stdout/stderr pipes — including helpers like
// `git-remote-https`/`ssh` that may have inherited git's pipe
// ends — which unblocks the reader threads. `kill_process_tree`
// also reaps the direct child so we don't leave a zombie.
let secs = timeout.as_secs();
warn!(
"git command exceeded {secs}s timeout; killing process group of pid {}",
child.id()
);
kill_process_tree(&mut child);
let _ = stdout_reader.join();
let _ = stderr_reader.join();
return Err(GitError::Timeout { secs });
@mickgmdb mickgmdb merged commit 56c0516 into main Jun 16, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants