Commit a9062fa

Authored by jqnatividad, claude, and Copilot
feat(stats): widen BLAKE3 fingerprint to cover all streaming stats (#3824)
* feat(stats): widen BLAKE3 fingerprint to cover all streaming stats

  Bump `FINGERPRINT_HASH_COLUMNS` from 26 to 29 so the dataset fingerprint hash now incorporates `n_positive`, `max_precision`, and `sparsity`, i.e. every streaming column emitted by `stats_headers()`. Previously the last three streaming columns were silently excluded, leaving fingerprints that could collide on datasets that differ only in those fields.

  Also rewrite the constant's comment to enumerate the streaming columns and note the invariant: when a streaming column is added or removed in `stats_headers()`, the constant must be updated. STATS_DEFINITIONS.md updated to match.

  Existing stats caches built with prior qsv versions will produce a different `blake3` value after this change; that's harmless because the cache invalidates on `qsv_version` mismatch already.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Potential fix for pull request finding

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Potential fix for pull request finding

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
1 parent f3e3ee2 commit a9062fa

2 files changed

Lines changed: 12 additions & 3 deletions

docs/STATS_DEFINITIONS.md

Lines changed: 1 addition & 1 deletion
@@ -253,7 +253,7 @@ When stats are cached, the `.stats.csv.json` file includes file-level metadata t
 | `date_generated` | When the stats were generated. | RFC3339 timestamp (UTC). |
 | `compute_duration_ms` | Time taken to compute stats. | Elapsed wall-clock time in milliseconds. |
 | `qsv_version` | Version of qsv used to generate stats. | `CARGO_PKG_VERSION` at compile time. Used for cache invalidation when qsv is upgraded. |
-| `hash.blake3` | BLAKE3 fingerprint hash of the dataset's stats. | BLAKE3 hash of the first 26 streaming-stats columns (i.e. all streaming columns except the trailing `n_positive`, `max_precision`, and `sparsity`) plus dataset metadata (record_count, field_count, filesize_bytes). Controlled by the `FINGERPRINT_HASH_COLUMNS` constant in `src/cmd/stats.rs`. This allows users to quickly detect duplicate files without having to load the entire file to compute the hash. Especially useful for detecting duplicates of very large files with pre-existing stats cache metadata. |
+| `hash.blake3` | BLAKE3 fingerprint hash of the dataset's stats. | BLAKE3 hash of the cached stats record's streaming-stats portion up to the `FINGERPRINT_HASH_COLUMNS` limit (29 columns in the default/non-`--typesonly` output; effectively `min(FINGERPRINT_HASH_COLUMNS, record.len())` columns in reduced-column modes such as `--typesonly`), plus dataset metadata (`record_count`, `field_count`, `filesize_bytes`). The limit is controlled by the `FINGERPRINT_HASH_COLUMNS` constant in `src/cmd/stats.rs`, which is kept in sync with the streaming-column count in `stats_headers()`. This allows users to quickly detect duplicate files without having to load the entire file to compute the hash. Especially useful for detecting duplicates of very large files with pre-existing stats cache metadata. |

 ### Whitespace Visualization

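The column-limit behavior described in the `hash.blake3` row can be illustrated with a minimal sketch. This is not qsv's actual code: the `fingerprint` helper and the `DatasetMeta` struct are hypothetical, and `std`'s `DefaultHasher` stands in for BLAKE3 so the example needs no external crate. It shows why a limit of 26 could not distinguish datasets that differ only in the trailing columns, and how `take` yields the `min(FINGERPRINT_HASH_COLUMNS, record.len())` behavior for short records.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Value of the constant in src/cmd/stats.rs after this commit.
const FINGERPRINT_HASH_COLUMNS: usize = 29;

// Dataset metadata mixed into the fingerprint (field names follow the docs).
struct DatasetMeta {
    record_count: u64,
    field_count: u64,
    filesize_bytes: u64,
}

// Hypothetical helper, not qsv's actual function. DefaultHasher is a
// stand-in for BLAKE3, so the output is a u64 rather than a BLAKE3 digest.
fn fingerprint(records: &[Vec<&str>], limit: usize, meta: &DatasetMeta) -> u64 {
    let mut hasher = DefaultHasher::new();
    for record in records {
        // `take(limit)` yields min(limit, record.len()) fields, so a
        // two-column `--typesonly` record is simply hashed in full.
        for field in record.iter().take(limit) {
            field.hash(&mut hasher);
        }
    }
    meta.record_count.hash(&mut hasher);
    meta.field_count.hash(&mut hasher);
    meta.filesize_bytes.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let meta = DatasetMeta { record_count: 100, field_count: 1, filesize_bytes: 2048 };

    // Two placeholder 29-column stats records that differ only in the
    // last column (the `sparsity` position).
    let a = vec![vec!["x"; 29]];
    let mut b = a.clone();
    b[0][28] = "y";

    // Old limit of 26: the difference is invisible to the hash.
    assert_eq!(fingerprint(&a, 26, &meta), fingerprint(&b, 26, &meta));
    // New limit of 29: the difference changes the fingerprint.
    assert_ne!(
        fingerprint(&a, FINGERPRINT_HASH_COLUMNS, &meta),
        fingerprint(&b, FINGERPRINT_HASH_COLUMNS, &meta)
    );
}
```

Using an iterator `take` rather than slicing with `[..limit]` avoids a panic when a record has fewer columns than the limit, which is the situation the docs describe for `--typesonly`.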
src/cmd/stats.rs

Lines changed: 11 additions & 2 deletions
@@ -678,8 +678,17 @@ const DAY_DECIMAL_PLACES: u32 = 5;
 // maximum number of output columns
 const MAX_STAT_COLUMNS: usize = 47;

-// the first N columns are fingerprint hash columns
-const FINGERPRINT_HASH_COLUMNS: usize = 26;
+// the first N columns of each full stats record are used for the dataset
+// fingerprint hash. For the normal (non-`--typesonly`) output, N must equal
+// the number of "streaming" stats columns emitted by `stats_headers()`
+// (currently: field, type, is_ascii, sum, min, max, range, sort_order,
+// sortiness, min_length, max_length, sum_length, avg_length, stddev_length,
+// variance_length, cv_length, mean, sem, geometric_mean, harmonic_mean,
+// stddev, variance, cv, nullcount, n_negative, n_zero, n_positive,
+// max_precision, sparsity). `--typesonly` is an exception and emits only
+// `field` and `type`. When adding or removing a streaming column in the
+// normal `stats_headers()` output, update this constant to match.
+const FINGERPRINT_HASH_COLUMNS: usize = 29;

 // maximum number of antimodes to display
 const MAX_ANTIMODES: usize = 10;

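The invariant stated in the new comment is enforced only by convention. One possible way to make the compiler check it is sketched below, assuming the streaming column names were available as a constant slice. `STREAMING_COLUMNS` is a hypothetical name introduced for illustration; the commit does not add it, and `stats_headers()` may build its headers differently.

```rust
// Hypothetical list of streaming column names, copied from the comment on
// FINGERPRINT_HASH_COLUMNS. Not part of the actual commit.
const STREAMING_COLUMNS: &[&str] = &[
    "field", "type", "is_ascii", "sum", "min", "max", "range", "sort_order",
    "sortiness", "min_length", "max_length", "sum_length", "avg_length",
    "stddev_length", "variance_length", "cv_length", "mean", "sem",
    "geometric_mean", "harmonic_mean", "stddev", "variance", "cv",
    "nullcount", "n_negative", "n_zero", "n_positive", "max_precision",
    "sparsity",
];

const FINGERPRINT_HASH_COLUMNS: usize = 29;

// Compile-time guard: the build fails if the list and the constant drift apart.
const _: () = assert!(STREAMING_COLUMNS.len() == FINGERPRINT_HASH_COLUMNS);

fn main() {
    println!(
        "{} streaming columns covered by the fingerprint",
        STREAMING_COLUMNS.len()
    );
}
```

With a guard of this kind, adding a column to the list without bumping the constant becomes a build error instead of a silently narrower fingerprint.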