|
1 | 1 | ## 💡 Summary |
2 | | -Reduced repeated string allocations and BTreeMap lookups inside the hot-path duplicate file analysis loop by utilizing the `Entry` API with `&str` keys instead of `String`. |
| 2 | +Removed redundant UTF-8 validation and string allocation in the analysis and content enrichers. Files that passed `is_text_like` (which internally does a UTF-8 check) were being re-checked and allocated via `String::from_utf8_lossy`. |
3 | 3 |
|
4 | 4 | ## 🎯 Why |
5 | | -In `build_duplicate_report`, every duplicate file iteration was performing redundant `BTreeMap::get_mut` followed by `BTreeMap::insert` allocations for `module.to_string()`. This caused unnecessary string building and double lookups. |
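| 5 | +To reduce hot-path work and unnecessary string building. `String::from_utf8_lossy` unconditionally scans the bytes for invalid UTF-8 and returns a `Cow<str>`, even when the caller just proved the bytes were valid UTF-8 via `is_text_like()`. For valid input the `Cow` is borrowed, so an allocation occurs only where the result is then converted to an owned `String`. |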
6 | 6 |
|
7 | 7 | ## 🔎 Evidence |
8 | | -- File: `crates/tokmd-analysis/src/content/mod.rs` |
9 | | -- Finding: Redundant `String` copies in the hot loop counting duplicates by module. |
10 | | -- Receipt: Cargo tests passed successfully without allocations. |
| 8 | +- `crates/tokmd-analysis/src/api_surface/report.rs` |
| 9 | +- `crates/tokmd-analysis/src/halstead/mod.rs` |
| 10 | +- `crates/tokmd-analysis/src/content/mod.rs` |
| 11 | +- `crates/tokmd-analysis/src/complexity/mod.rs` |
| 12 | +- `crates/tokmd-analysis/src/content/io/read.rs` |
| 13 | +- Observed behavior: `is_text_like` returns `true` only for valid UTF-8 input without null bytes. Following this check with `String::from_utf8_lossy` forces an unnecessary second validation pass over the same file buffers. |
11 | 14 |
|
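The redundant pattern described in the evidence can be sketched as follows. This is an illustration only: `is_text_like` here is a hypothetical stand-in written from the behavior stated above (valid UTF-8, no null bytes), and `text_two_passes` is an invented name, not a function from the crate.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for the crate's `is_text_like`, based only on the
// behavior described in this PR: valid UTF-8 and no null bytes.
fn is_text_like(bytes: &[u8]) -> bool {
    !bytes.contains(&0) && std::str::from_utf8(bytes).is_ok()
}

// The shape being removed: the buffer is validated once inside
// `is_text_like` and then scanned again inside `from_utf8_lossy`.
fn text_two_passes(bytes: &[u8]) -> Option<Cow<'_, str>> {
    if !is_text_like(bytes) {
        return None;
    }
    // Second UTF-8 scan over bytes that were just proven valid.
    Some(String::from_utf8_lossy(bytes))
}
```

For input that passes the check, `from_utf8_lossy` returns `Cow::Borrowed`, so the wasted work in this sketch is the second scan; a copy happens only if a caller then turns the result into an owned `String`.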
12 | 15 | ## 🧭 Options considered |
13 | 16 | ### Option A (recommended) |
14 | | -- What it is: Use `&str` bound to the `ExportData` row lifetime and the `Entry` API. |
15 | | -- Why it fits: Aligns perfectly with Bolt's focus on hot-path work reduction and removing unnecessary allocations inside analysis loops. |
16 | | -- Trade-offs: Structure is cleaner; no velocity or governance impact. |
| 17 | +- what it is: Replace `is_text_like` + `from_utf8_lossy` with a single `std::str::from_utf8` call, combined with a guard against null bytes, that returns a `&str` directly without allocating. |
| 18 | +- why it fits this repo and shard: It achieves the Bolt persona's goal of removing hot-path validation and redundant allocations while maintaining deterministic structural proof in analysis. |
| 19 | +- trade-offs: Structure / Velocity / Governance - slightly changes code shape (using a `match`), but clearly aligns with performance and zero-cost abstraction goals. |
17 | 20 |
|
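A minimal sketch of the Option A shape, assuming a helper that takes a byte slice and returns a borrowed `&str`. The name `as_text` and its signature are hypothetical; the actual call sites in the crate may inline the `match` instead.

```rust
// Hypothetical helper; the name and signature are illustrative only.
fn as_text(bytes: &[u8]) -> Option<&str> {
    match std::str::from_utf8(bytes) {
        // One UTF-8 validation pass; the returned `&str` borrows from
        // `bytes`, so nothing is allocated or copied.
        Ok(text) if !text.contains('\0') => Some(text),
        // Invalid UTF-8 or embedded null bytes: treat as non-text.
        _ => None,
    }
}
```

A caller would then branch on the `Option`, for example `if let Some(text) = as_text(&bytes) { /* analyze text */ }`, and skip the file otherwise.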
18 | 21 | ### Option B |
19 | | -- What it is: Sort vectors partially in `build_top_offenders`. |
20 | | -- When to choose it instead: When memory footprints in the top offenders map dwarf duplicated metrics building. |
21 | | -- Trade-offs: Harder to prove performance improvements and limits dataset size optimizations. |
| 22 | +- what it is: Try to avoid reading files to bytes at all by reading into a `String` directly. |
| 23 | +- when to choose it instead: If all files were known to be text. |
| 24 | +- trade-offs: Fails to handle binary blobs gracefully, since reading non-UTF-8 bytes into a `String` returns an error. |
22 | 25 |
|
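To illustrate the Option B trade-off, here is a small sketch that reads a stream directly into a `String` using the standard `Read::read_to_string`. The helper name `read_as_string` is hypothetical and not taken from the crate.

```rust
use std::io::Read;

// Hypothetical helper: read a whole stream straight into a `String`.
// `read_to_string` returns an error when the data is not valid UTF-8,
// so binary input surfaces as an `Err` rather than being skipped.
fn read_as_string<R: Read>(mut reader: R) -> std::io::Result<String> {
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    Ok(text)
}
```

With this shape every binary file takes the error path, which is why it only suits inputs that are already known to be text.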
23 | 26 | ## ✅ Decision |
24 | | -Chose Option A to cleanly eliminate repetitive string building and duplicate map lookups in a hot loop. |
| 27 | +Option A. It optimizes the hot paths directly with minimal structural impact. |
25 | 28 |
|
26 | 29 | ## 🧱 Changes made (SRP) |
27 | | -- `crates/tokmd-analysis/src/content/mod.rs` |
| 30 | +- `crates/tokmd-analysis/src/api_surface/report.rs`: Replaced `is_text_like` + `from_utf8_lossy` with `from_utf8`. |
| 31 | +- `crates/tokmd-analysis/src/halstead/mod.rs`: Replaced `is_text_like` + `from_utf8_lossy` with `from_utf8`. |
| 32 | +- `crates/tokmd-analysis/src/content/mod.rs`: Replaced `is_text_like` + `from_utf8_lossy` with `from_utf8`. |
| 33 | +- `crates/tokmd-analysis/src/complexity/mod.rs`: Replaced `is_text_like` + `from_utf8_lossy` with `from_utf8`. |
| 34 | +- `crates/tokmd-analysis/src/content/io/read.rs`: Optimized `read_text_capped` to use `from_utf8` instead of unconditional `from_utf8_lossy`. |
28 | 35 |
|
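The `read_text_capped` change can be pictured roughly as below. This is a sketch under assumptions: the real function's signature, its cap handling, and its fallback behavior are not shown in this PR, so `read_text_capped_sketch`, its parameters, and the lossy fallback are all hypothetical.

```rust
use std::io::Read;

// Hypothetical sketch; not the crate's actual `read_text_capped`.
fn read_text_capped_sketch<R: Read>(reader: R, max_bytes: u64) -> std::io::Result<String> {
    let mut buf = Vec::new();
    // Read at most `max_bytes` bytes from the source.
    reader.take(max_bytes).read_to_end(&mut buf)?;
    Ok(match String::from_utf8(buf) {
        // Fast path: one validation pass, and the `Vec` is reused as the
        // `String` storage without copying.
        Ok(text) => text,
        // Assumed fallback: only invalid input pays for lossy conversion.
        Err(err) => String::from_utf8_lossy(err.as_bytes()).into_owned(),
    })
}
```

One consequence of capping by bytes is that a cut in the middle of a multi-byte character makes the tail invalid, so in this sketch such input would take the fallback path.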
29 | 36 | ## 🧪 Verification receipts |
30 | | -cargo test -p tokmd-analysis --verbose |
31 | | -cargo fmt -- --check |
| 37 | +```text |
| 38 | +cargo check -p tokmd-analysis |
| 39 | +cargo test -p tokmd-analysis |
| 40 | +cargo clippy -- -D warnings |
| 41 | +``` |
32 | 42 |
|
33 | 43 | ## 🧭 Telemetry |
34 | | -- Change shape: Performance optimization |
35 | | -- Blast radius: None |
| 44 | +- Change shape: Optimization |
| 45 | +- Blast radius: `crates/tokmd-analysis` |
36 | 46 | - Risk class: Low |
37 | | -- Rollback: `git checkout crates/tokmd-analysis/src/content/mod.rs` |
38 | | -- Gates run: perf-proof, core-rust |
| 47 | +- Rollback: Revert the PR |
| 48 | +- Gates run: `cargo build --verbose`, `CI=true cargo test --verbose`, `cargo fmt -- --check`, `cargo clippy -- -D warnings` |
39 | 49 |
|
40 | 50 | ## 🗂️ .jules artifacts |
41 | | -- `envelope.json` |
42 | | -- `decision.md` |
43 | | -- `receipts.jsonl` |
44 | | -- `result.json` |
45 | | -- `pr_body.md` |
| 51 | +- `.jules/runs/bolt_analysis_stack_builder/envelope.json` |
| 52 | +- `.jules/runs/bolt_analysis_stack_builder/decision.md` |
| 53 | +- `.jules/runs/bolt_analysis_stack_builder/receipts.jsonl` |
| 54 | +- `.jules/runs/bolt_analysis_stack_builder/result.json` |
| 55 | +- `.jules/runs/bolt_analysis_stack_builder/pr_body.md` |
46 | 56 |
|
47 | 57 | ## 🔜 Follow-ups |
48 | | -None |
| 58 | +None. |