Skip to content

Commit 4077eb0

Browse files
jqnatividadclaude
andcommitted
fix(stats,frequency): address roborev #2028 findings
- Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and test_frequency share one source of truth (Low #1). - Document the pre-indexed + OOM → sketch fallback path in --memcheck USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2). - Drop the dead flag_sketch_method='frequent_items' assignment before run_frequent_items — confirmed run_frequent_items does not consult flag_sketch_method (Low #3). - Tighten the stats and frequency OOM wwarn messages to "Re-run with explicit ... exact to disable the auto-enable" — matches the established frequency wording and removes the misleading "override" phrasing (Low #4). Verified Low #5 separately: which_stats() already gates mad on !approx_quantiles regardless of flag_everything/flag_mad, so the auto-disable promised by the wwarn is honored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8444426 commit 4077eb0

11 files changed

Lines changed: 79 additions & 63 deletions

File tree

.claude/skills/qsv/qsv-frequency.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -77,7 +77,7 @@
7777
{
7878
"flag": "--memcheck",
7979
"type": "flag",
80-
"description": "Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. On OOM, qsv auto-creates an index when possible and also switches to the Frequent Items sketch (Apache DataSketches Misra-Gries, equivalent to sketch-method frequent_items) where compatible, before failing. A wwarn is emitted when the sketch fallback engages."
80+
"description": "Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. On OOM, qsv auto-creates an index when no index exists (skipped for stdin) and ALSO switches to the Frequent Items sketch (Apache DataSketches Misra-Gries, equivalent to sketch-method frequent_items) where compatible. The sketch fallback can also fire when an index is already present and the OOM still trips (e.g., when jobs is pinned to 1 on a pre-indexed file). A wwarn is emitted when the sketch fallback engages."
8181
},
8282
{
8383
"flag": "--no-float",

.claude/skills/qsv/qsv-stats.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,7 +81,7 @@
8181
{
8282
"flag": "--memcheck",
8383
"type": "flag",
84-
"description": "Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. This option is ignored when computing default, streaming statistics, as it is not needed. On OOM, qsv auto-creates an index when possible and also switches to approx quantile + approx cardinality methods (DataSketches t-digest and HyperLogLog) where compatible, before failing. A wwarn is emitted listing the auto-enabled estimators."
84+
"description": "Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. This option is ignored when computing default, streaming statistics, as it is not needed. On OOM, qsv auto-creates an index when no index exists (skipped for stdin) and ALSO switches to approx quantile + approx cardinality methods (DataSketches t-digest and HyperLogLog) where compatible. The sketch fallback can also fire when an index is already present and the OOM still trips (e.g., when jobs is pinned to 1 on a pre-indexed file). A wwarn is emitted listing the auto-enabled estimators."
8585
},
8686
{
8787
"flag": "--mode",

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -39,7 +39,7 @@ Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/C
3939
- `exclude`: add stdin support and memcheck [#3749](https://github.com/dathere/qsv/pull/3749)
4040

4141
### Changed
42-
- `stats` / `frequency`: when `--memcheck` is set and `util::mem_file_check` returns OOM, qsv now auto-enables the Apache DataSketches-backed estimators (t-digest + HyperLogLog for `stats`; Misra-Gries Frequent Items for `frequency`) in addition to the existing auto-index fallback, where flag conflicts allow. The OOM error is only propagated when neither fallback engages. A `wwarn!` is emitted listing the auto-enabled estimators. Explicit `--quantile-method exact` / `--cardinality-method exact` / `--sketch-method exact` still suppresses the auto-enable for that method.
42+
- `stats` / `frequency`: when `--memcheck` is set and `util::mem_file_check` returns OOM, qsv now auto-enables the Apache DataSketches-backed estimators (t-digest + HyperLogLog for `stats`; Misra-Gries Frequent Items for `frequency`) in addition to the existing auto-index fallback, where flag conflicts allow. The OOM error is only propagated when neither fallback engages. A `wwarn!` is emitted listing the auto-enabled estimators. The sketch fallback can fire even when an index is already present (e.g., with `--jobs 1` on a pre-indexed file) — this is a behavior change from the previous "error out" path in that narrow case. Explicit `--quantile-method exact` / `--cardinality-method exact` / `--sketch-method exact` still suppresses the auto-enable for that method.
4343
- **BREAKING** `excel`: `--metadata csv` column ordering for `type`, `visible`, and `headers` is corrected. Previously the CSV header row declared `type, visible, headers` but the data rows pushed values in the order `headers, typ, visible`, so under each named column the wrong values appeared (the `type` column held the headers list, `visible` held the type, and `headers` held the visibility). The CSV output now matches the `--metadata json` (`SheetMetadata` struct) field order: `index, sheet_name, type, visible, headers, column_count, …`. Pipelines that consumed `qsv excel --metadata csv` and indexed by column position must shift those three columns; consumers that indexed by header name see corrected values automatically.
4444
- **BREAKING** `enum`: `--hash` digest values change. The hashed input now carries a `u64` length prefix per field (to fix the multi-column collision bug above), so every `--hash` digest differs from earlier qsv versions — single-column hashes change identity values too, and stored hashes from earlier qsv versions will not match. Same input still hashes deterministically across rows and runs in ≥ this version.
4545
- **BREAKING** `luau`: `qsv_loadcsv` now returns the headers table 1-indexed (per Lua convention). Scripts that accessed `headers[0]` or iterated `for i = 0, #headers - 1` must shift to `headers[1]` and `for i = 1, #headers` (or `ipairs(headers)`). Previously `headers[1]` returned the *second* header.

docs/STATS_DEFINITIONS.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -363,7 +363,7 @@ By default, `stats` produces **exact, deterministic** results. Three opt-in flag
363363
- `--quantile-method` auto-enables unless `--weight` is set; if `--mad` or `--everything` is also set, MAD is auto-disabled (mirroring the existing `--quantile-method approx` guard).
364364
- `--cardinality-method` auto-enables unless `--infer-boolean` is set.
365365

366-
A `wwarn!` is emitted listing each auto-enabled estimator. The original OOM error is only propagated when **neither** fallback engages. Users can override by passing `--quantile-method exact` or `--cardinality-method exact` explicitly (the auto-enable only flips fields that were left at their `exact` default).
366+
A `wwarn!` is emitted listing each auto-enabled estimator. The original OOM error is only propagated when **neither** fallback engages. The sketch fallback can fire even when an index is already present and the OOM check still trips (e.g., with `--jobs 1` on a pre-indexed file) — that is a behavior change from the previous "error out" path in this narrow case. Users can disable the auto-enable by passing `--quantile-method exact` or `--cardinality-method exact` explicitly (the auto-enable only flips fields that were left at their `exact` default).
367367

368368
**See also:** [t-digest paper (Dunning, 2019)](https://arxiv.org/abs/1902.04023), [HyperLogLog (Flajolet et al., 2007)](https://en.wikipedia.org/wiki/HyperLogLog), [Apache DataSketches](https://datasketches.apache.org/).
369369

docs/help/frequency.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -155,7 +155,7 @@ qsv frequency --help
155155
| &nbsp;`‑o,`<br>`‑‑output`&nbsp; | string | Write output to <file> instead of stdout. | |
156156
| &nbsp;`‑n,`<br>`‑‑no‑headers`&nbsp; | flag | When set, the first row will NOT be included in the frequency table. Additionally, the 'field' column will be 1-based indices instead of header names. | |
157157
| &nbsp;`‑d,`<br>`‑‑delimiter`&nbsp; | string | The field delimiter for reading CSV data. Must be a single character. (default: ,) | |
158-
| &nbsp;`‑‑memcheck`&nbsp; | flag | Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. On OOM, qsv auto-creates an index when possible and also switches to the Frequent Items sketch (Apache DataSketches Misra-Gries, equivalent to sketch-method frequent_items) where compatible, before failing. A wwarn is emitted when the sketch fallback engages. | |
158+
| &nbsp;`‑‑memcheck`&nbsp; | flag | Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. On OOM, qsv auto-creates an index when no index exists (skipped for stdin) and ALSO switches to the Frequent Items sketch (Apache DataSketches Misra-Gries, equivalent to sketch-method frequent_items) where compatible. The sketch fallback can also fire when an index is already present and the OOM still trips (e.g., when jobs is pinned to 1 on a pre-indexed file). A wwarn is emitted when the sketch fallback engages. | |
159159

160160
---
161161
**Source:** [`src/cmd/frequency.rs`](https://github.com/dathere/qsv/blob/master/src/cmd/frequency.rs)

docs/help/stats.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -282,7 +282,7 @@ qsv stats --help
282282
| &nbsp;`‑o,`<br>`‑‑output`&nbsp; | string | Write output to <file> instead of stdout. | |
283283
| &nbsp;`‑n,`<br>`‑‑no‑headers`&nbsp; | flag | When set, the first row will NOT be interpreted as column names. i.e., They will be included in statistics. | |
284284
| &nbsp;`‑d,`<br>`‑‑delimiter`&nbsp; | string | The field delimiter for READING CSV data. Must be a single character. (default: ,) | |
285-
| &nbsp;`‑‑memcheck`&nbsp; | flag | Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. This option is ignored when computing default, streaming statistics, as it is not needed. On OOM, qsv auto-creates an index when possible and also switches to approx quantile + approx cardinality methods (DataSketches t-digest and HyperLogLog) where compatible, before failing. A wwarn is emitted listing the auto-enabled estimators. | |
285+
| &nbsp;`‑‑memcheck`&nbsp; | flag | Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. This option is ignored when computing default, streaming statistics, as it is not needed. On OOM, qsv auto-creates an index when no index exists (skipped for stdin) and ALSO switches to approx quantile + approx cardinality methods (DataSketches t-digest and HyperLogLog) where compatible. The sketch fallback can also fire when an index is already present and the OOM still trips (e.g., when jobs is pinned to 1 on a pre-indexed file). A wwarn is emitted listing the auto-enabled estimators. | |
286286

287287
---
288288
**Source:** [`src/cmd/stats.rs`](https://github.com/dathere/qsv/blob/master/src/cmd/stats.rs)

src/cmd/frequency.rs

Lines changed: 14 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -263,12 +263,15 @@ Common options:
263263
Must be a single character. (default: ,)
264264
--memcheck Check if there is enough memory to load the entire
265265
CSV into memory using CONSERVATIVE heuristics.
266-
On OOM, qsv auto-creates an index when possible and
267-
also switches to the Frequent Items sketch
268-
(Apache DataSketches Misra-Gries, equivalent to
269-
sketch-method frequent_items) where compatible,
270-
before failing. A wwarn is emitted when the sketch
271-
fallback engages.
266+
On OOM, qsv auto-creates an index when no index
267+
exists (skipped for stdin) and ALSO switches to the
268+
Frequent Items sketch (Apache DataSketches
269+
Misra-Gries, equivalent to sketch-method
270+
frequent_items) where compatible. The sketch
271+
fallback can also fire when an index is already
272+
present and the OOM still trips (e.g., when jobs is
273+
pinned to 1 on a pre-indexed file). A wwarn is
274+
emitted when the sketch fallback engages.
272275
"#;
273276

274277
use core::hint::cold_path;
@@ -852,14 +855,16 @@ pub fn run(argv: &[&str]) -> CliResult<()> {
852855
// Sketch fallback: only for OOM (not other CliErrors).
853856
// The Frequent Items dispatch at frequency.rs:787 normally fires before
854857
// mem_file_check runs, so we re-route into run_frequent_items() directly
855-
// from here rather than just flipping the flag and falling through.
858+
// here rather than mutating flag_sketch_method and falling through —
859+
// run_frequent_items doesn't consult flag_sketch_method, so the field
860+
// stays at "exact" and that's fine (cache key / diagnostics use other
861+
// paths). Verified via grep over run_frequent_items's body.
856862
if matches!(e, crate::CliError::OutOfMemory(_)) && can_enable_frequent_items(&args)
857863
{
858-
args.flag_sketch_method = "frequent_items".to_string();
859864
wwarn!(
860865
"OOM during memory check: auto-enabling --sketch-method frequent_items \
861866
(Misra-Gries, map size {n}). Re-run with explicit --sketch-method exact \
862-
to override.",
867+
to disable the auto-enable.",
863868
n = args.flag_sketch_map_size,
864869
);
865870
return args.run_frequent_items(&rconfig);

src/cmd/stats.rs

Lines changed: 12 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -355,11 +355,15 @@ Common options:
355355
CSV into memory using CONSERVATIVE heuristics.
356356
This option is ignored when computing default, streaming
357357
statistics, as it is not needed.
358-
On OOM, qsv auto-creates an index when possible and
359-
also switches to approx quantile + approx cardinality
360-
methods (DataSketches t-digest and HyperLogLog) where
361-
compatible, before failing. A wwarn is emitted listing
362-
the auto-enabled estimators.
358+
On OOM, qsv auto-creates an index when no index
359+
exists (skipped for stdin) and ALSO switches to
360+
approx quantile + approx cardinality methods
361+
(DataSketches t-digest and HyperLogLog) where
362+
compatible. The sketch fallback can also fire when
363+
an index is already present and the OOM still trips
364+
(e.g., when jobs is pinned to 1 on a pre-indexed
365+
file). A wwarn is emitted listing the auto-enabled
366+
estimators.
363367
"#;
364368

365369
/*
@@ -1542,8 +1546,9 @@ pub fn run(argv: &[&str]) -> CliResult<()> {
15421546
if !enabled.is_empty() {
15431547
wwarn!(
15441548
"OOM during memory check: auto-enabling DataSketches \
1545-
estimators ({}). Re-run with explicit \
1546-
--quantile-method/--cardinality-method to override.",
1549+
estimators ({}). Re-run with explicit --quantile-method \
1550+
exact / --cardinality-method exact to disable the \
1551+
auto-enable.",
15471552
enabled.join(", ")
15481553
);
15491554
} else if !index_succeeded {

tests/test_frequency.rs

Lines changed: 9 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -6721,28 +6721,16 @@ fn frequency_sketch_method_frequent_items_with_null_sorted_is_rejected() {
67216721
// in addition to the existing auto-index fallback. These tests need a
67226722
// multi-GB file to deterministically trigger OOM and are #[ignore]'d by
67236723
// default; run with `--ignored` to exercise them.
6724-
6725-
fn build_large_oom_csv(name: &str) -> (Workdir, std::path::PathBuf) {
6726-
let wrk = Workdir::new(name);
6727-
let mut data = vec![svec!["col1", "col2", "col3", "col4", "col5"]];
6728-
for i in 0..10_000_000 {
6729-
data.push(vec![
6730-
format!("value_{}_with_some_padding_to_make_it_larger", i),
6731-
format!("another_value_{}_with_more_data", i),
6732-
format!("data_{}", i),
6733-
format!("field_{}_content", i),
6734-
format!("final_field_{}_with_additional_text", i),
6735-
]);
6736-
}
6737-
let path = wrk.path("large_data.csv");
6738-
wrk.create("large_data.csv", data);
6739-
(wrk, path)
6740-
}
6724+
//
6725+
// The shared `build_large_oom_csv` helper lives in `tests/workdir.rs` so the
6726+
// row count, column shape, and padding stay in sync with the parallel stats
6727+
// tests; see `stats_oom_*` in tests/test_stats.rs.
67416728

67426729
#[test]
67436730
#[ignore = "Requires a multi-GB file to trigger OOM via mem_file_check"]
67446731
fn frequency_oom_auto_enables_frequent_items() {
6745-
let (wrk, test_file) = build_large_oom_csv("frequency_oom_auto_enables_frequent_items");
6732+
let (wrk, test_file) =
6733+
crate::workdir::build_large_oom_csv("frequency_oom_auto_enables_frequent_items");
67466734
let mut cmd = wrk.command("frequency");
67476735
cmd.arg("--memcheck")
67486736
.env("QSV_FREEMEMORY_HEADROOM_PCT", "90")
@@ -6761,7 +6749,7 @@ fn frequency_oom_auto_enables_frequent_items() {
67616749
#[test]
67626750
#[ignore = "Requires a multi-GB file to trigger OOM via mem_file_check"]
67636751
fn frequency_oom_skips_when_asc_set() {
6764-
let (wrk, test_file) = build_large_oom_csv("frequency_oom_skips_when_asc_set");
6752+
let (wrk, test_file) = crate::workdir::build_large_oom_csv("frequency_oom_skips_when_asc_set");
67656753
let mut cmd = wrk.command("frequency");
67666754
cmd.arg("--memcheck")
67676755
.arg("--asc")
@@ -6777,7 +6765,8 @@ fn frequency_oom_skips_when_asc_set() {
67776765
#[test]
67786766
#[ignore = "Requires a multi-GB file to trigger OOM via mem_file_check"]
67796767
fn frequency_oom_skips_when_json_output() {
6780-
let (wrk, test_file) = build_large_oom_csv("frequency_oom_skips_when_json_output");
6768+
let (wrk, test_file) =
6769+
crate::workdir::build_large_oom_csv("frequency_oom_skips_when_json_output");
67816770
let mut cmd = wrk.command("frequency");
67826771
cmd.arg("--memcheck")
67836772
.arg("--json")

tests/test_stats.rs

Lines changed: 11 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -4206,28 +4206,16 @@ fn stats_auto_index_creation_on_oom() {
42064206
// the existing auto-index fallback. Like stats_auto_index_creation_on_oom
42074207
// above, these need a multi-GB file to deterministically trigger OOM, so
42084208
// they are #[ignore]'d by default and must be run with `--ignored`.
4209-
4210-
fn build_large_oom_csv(name: &str) -> (Workdir, std::path::PathBuf) {
4211-
let wrk = Workdir::new(name);
4212-
let mut data = vec![svec!["col1", "col2", "col3", "col4", "col5"]];
4213-
for i in 0..10_000_000 {
4214-
data.push(vec![
4215-
format!("value_{}_with_some_padding_to_make_it_larger", i),
4216-
format!("another_value_{}_with_more_data", i),
4217-
format!("data_{}", i),
4218-
format!("field_{}_content", i),
4219-
format!("final_field_{}_with_additional_text", i),
4220-
]);
4221-
}
4222-
let path = wrk.path("large_data.csv");
4223-
wrk.create("large_data.csv", data);
4224-
(wrk, path)
4225-
}
4209+
//
4210+
// The shared `build_large_oom_csv` helper lives in `tests/workdir.rs` so the
4211+
// row count, column shape, and padding stay in sync with the parallel
4212+
// frequency tests; see `frequency_oom_*` in tests/test_frequency.rs.
42264213

42274214
#[test]
42284215
#[ignore = "Requires a multi-GB file to trigger OOM via mem_file_check"]
42294216
fn stats_oom_auto_enables_approx_sketches() {
4230-
let (wrk, test_file) = build_large_oom_csv("stats_oom_auto_enables_approx_sketches");
4217+
let (wrk, test_file) =
4218+
crate::workdir::build_large_oom_csv("stats_oom_auto_enables_approx_sketches");
42314219
let mut cmd = wrk.command("stats");
42324220
cmd.arg("--everything")
42334221
.arg("--memcheck")
@@ -4251,7 +4239,8 @@ fn stats_oom_auto_enables_approx_sketches() {
42514239
#[test]
42524240
#[ignore = "Requires a multi-GB file to trigger OOM via mem_file_check"]
42534241
fn stats_oom_skips_approx_quantiles_with_weight() {
4254-
let (wrk, test_file) = build_large_oom_csv("stats_oom_skips_approx_quantiles_with_weight");
4242+
let (wrk, test_file) =
4243+
crate::workdir::build_large_oom_csv("stats_oom_skips_approx_quantiles_with_weight");
42554244
// Weight column is col1 (all distinct strings — qsv will treat parse failures
42564245
// as default weight 1.0; this is acceptable for the test because we only care
42574246
// that --weight is *set*, which is what suppresses the quantile auto-enable).
@@ -4276,8 +4265,9 @@ fn stats_oom_skips_approx_quantiles_with_weight() {
42764265
#[test]
42774266
#[ignore = "Requires a multi-GB file to trigger OOM via mem_file_check"]
42784267
fn stats_oom_skips_approx_cardinality_with_infer_boolean() {
4279-
let (wrk, test_file) =
4280-
build_large_oom_csv("stats_oom_skips_approx_cardinality_with_infer_boolean");
4268+
let (wrk, test_file) = crate::workdir::build_large_oom_csv(
4269+
"stats_oom_skips_approx_cardinality_with_infer_boolean",
4270+
);
42814271
let mut cmd = wrk.command("stats");
42824272
cmd.arg("--everything")
42834273
.arg("--memcheck")

0 commit comments

Comments
 (0)