Commit f733c6a

Merge pull request #121 from harvard-lil/markdown-detail-dedup
Dedup summary-covered detail bullets + regenerate stale docs
2 parents 0e89d64 + 0328544 commit f733c6a

14 files changed

Lines changed: 124 additions & 21 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ jobs:

       - uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 # stable
         with:
-          toolchain: 1.88.0
+          toolchain: 1.95.0

       - uses: Swatinem/rust-cache@e18b497796c12c097a38f9edb9d0641fb99eee32 # v2

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ debug = "line-tables-only"
 [workspace.package]
 version = "0.2.0"
 edition = "2021"
-rust-version = "1.88"
+rust-version = "1.95"
 license = "MIT"
 repository = "https://github.com/harvard-lil/binoc"
 homepage = "https://github.com/harvard-lil/binoc"

binoc-stdlib/src/renderers/markdown.rs

Lines changed: 9 additions & 1 deletion
@@ -873,11 +873,19 @@ fn specialized_detail_verb(verb: &str) -> bool {
     )
 }

+/// True when the node summary already states this edit, so a generic detail
+/// bullet would only repeat it (and, for structural edits, dump raw params).
+/// Each arm pairs the edit verb with the tag that proves the summary covers it.
 fn summary_covered_generic_verb(node: &DiffNode, edit: &serde_json::Value) -> bool {
     let Some(verb) = edit.get("verb").and_then(|value| value.as_str()) else {
         return false;
     };
-    matches!(verb, "tabular.rename_column") && node.tags.contains("binoc.column-rename")
+    match verb {
+        "tabular.rename_column" => node.tags.contains("binoc.column-rename"),
+        "tabular.reorder_columns" => node.tags.contains("binoc.column-reorder"),
+        "document.serialization_change" => node.tags.contains("binoc.serialization-change"),
+        _ => false,
+    }
 }

 fn humanize_edit_verb(verb: &str) -> String {
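The verb/tag pairing in this hunk can be tried in isolation with a dependency-free sketch. `Node`, `Edit`, and `summary_covers` are hypothetical stand-ins for `DiffNode`, the `serde_json::Value` edit payload, and `summary_covered_generic_verb`; only the verb strings and tag names are taken from the diff.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for `DiffNode`: only the tag set matters here.
struct Node {
    tags: HashSet<String>,
}

/// Hypothetical stand-in for the JSON edit payload: just an optional verb.
struct Edit {
    verb: Option<String>,
}

/// Mirrors the pairing in the diff: a generic detail bullet counts as covered
/// only when the tag that proves summary coverage is present on the node.
fn summary_covers(node: &Node, edit: &Edit) -> bool {
    let Some(verb) = edit.verb.as_deref() else {
        return false;
    };
    let tag = match verb {
        "tabular.rename_column" => "binoc.column-rename",
        "tabular.reorder_columns" => "binoc.column-reorder",
        "document.serialization_change" => "binoc.serialization-change",
        _ => return false,
    };
    node.tags.contains(tag)
}

fn main() {
    let node = Node {
        tags: HashSet::from(["binoc.column-reorder".to_string()]),
    };
    let reorder = Edit { verb: Some("tabular.reorder_columns".to_string()) };
    let rename = Edit { verb: Some("tabular.rename_column".to_string()) };
    // The reorder edit is covered; the rename edit is not, because the node
    // lacks the `binoc.column-rename` tag.
    println!("{}", summary_covers(&node, &reorder));
    println!("{}", summary_covers(&node, &rename));
}
```

Requiring the tag as well as the verb means a bullet is dropped only when the summary demonstrably carries the same fact, which matches the snapshot changes later in this commit where only the `Reorder Columns` and `Serialization Change` bullets disappear.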

docs/adr/2026-04-10-rust_msrv_and_dependency_update_policy.md

Lines changed: 12 additions & 1 deletion
@@ -1,7 +1,18 @@
 # Rust MSRV and dependency update policy

 **Date:** 2026-04-10
-**Status:** Implemented
+**Status:** Implemented (MSRV raised to 1.95 on 2026-06-29 — see Addendum)
+
+## Addendum (2026-06-29): MSRV raised to 1.95
+
+The workspace MSRV is now `1.95` (was `1.88`). This is the intentional,
+called-out bump the policy below requires: the `rusqlite` 0.39 → 0.40 update
+pulls `libsqlite3-sys` 0.38, whose build script uses `cfg_select!`, stabilized
+in Rust 1.95. Rather than pin `rusqlite` back to hold the 1.88 floor, we accept
+the bump — the project is pre-1.0 and moving quickly, and contributors and CI
+already run ≥1.95. `rust-version` in the workspace manifest and the MSRV CI job
+(`.github/workflows/ci.yml`) move together to `1.95.0`. The original 1.88
+decision and its rationale are preserved below.

 ## Context
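The addendum's rule that `rust-version` and the MSRV CI toolchain move together could be guarded by a small consistency check. The sketch below is a hypothetical helper, not part of binoc; it assumes plain `major.minor[.patch]` strings and ignores the patch component, since the manifest says `1.95` while the workflow says `1.95.0`.

```rust
/// Parse "1.95" or "1.95.0" into (major, minor). The patch component is
/// ignored because `rust-version` may omit it. Hypothetical helper.
fn major_minor(version: &str) -> Option<(u32, u32)> {
    let mut parts = version.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// True when the manifest MSRV and the CI toolchain name the same release line.
fn msrv_matches(manifest: &str, ci_toolchain: &str) -> bool {
    match (major_minor(manifest), major_minor(ci_toolchain)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}

fn main() {
    // Values from this commit: Cargo.toml has "1.95", ci.yml has "1.95.0".
    println!("{}", msrv_matches("1.95", "1.95.0"));
    // A half-applied bump (manifest raised, CI left behind) is reported.
    println!("{}", msrv_matches("1.95", "1.88.0"));
}
```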

docs/adr/README.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ Newer entries appear first. Each entry shows its date and current status. Create

 | Date | Title | Status |
 |---|---|---|
-| 2026-06-22 | [The Vintage Audience: a Kept Benchmark for Metadata-Over-Data Reading](2026-06-22-vintage_audience_and_metadata_only_benchmark.md) | Accepted (benchmark landed; features deferred) |
+| 2026-06-22 | [The Vintage Audience: a Kept Benchmark for Metadata-Over-Data Reading](2026-06-22-vintage_audience_and_metadata_only_benchmark.md) | Accepted (benchmark landed; features deliberately deferred) |
 | 2026-06-15 | [Tiered Artifact Metadata: Column, Table, and a `parser_metadata_v1` Artifact](2026-06-15-tiered_artifact_metadata.md) | Implemented (channels + producers in CFM-80; rendering + significance in CFM-82) |
 | 2026-06-15 | [The Engine Overhaul, Told Whole: Single-Tree to Correspondence-First](2026-06-15-engine_overhaul_retrospective.md) | Retrospective |
 | 2026-06-15 | [Partition Identities: a JIT, Format-Owned Capability for N↔M Correspondence (CFM-72)](2026-06-15-partition_identities_jit_format_capability.md) | Implemented |
@@ -42,7 +42,7 @@ Newer entries appear first. Each entry shows its date and current status. Create
 | 2026-04-16 | [Test vector materialization: plugin trait, not a runtime plugin point](2026-04-16-test_vector_materialization.md) | Implemented |
 | 2026-04-16 | [Opportunistic ItemRef Metadata, Transformer-Hydrated for Correlation](2026-04-16-opportunistic_itemref_metadata.md) | Implemented |
 | 2026-04-10 | [Security posture and how to audit Binoc (core and plugins)](2026-04-10-security_posture_and_auditing.md) | Accepted |
-| 2026-04-10 | [Rust MSRV and dependency update policy](2026-04-10-rust_msrv_and_dependency_update_policy.md) | Implemented |
+| 2026-04-10 | [Rust MSRV and dependency update policy](2026-04-10-rust_msrv_and_dependency_update_policy.md) | Implemented (MSRV raised to 1.95 on 2026-06-29 — see Addendum) |
 | 2026-04-10 | [Independent release tags and published version policy](2026-04-10-independent_release_tags_and_published_version_policy.md) | Implemented |
 | 2026-04-08 | [Release Surface And Automated Publishing](2026-04-08-release_surface_and_automated_publishing.md) | Implemented |
 | 2026-03-20 | [Transformer Dispatch Refinement](2026-03-20-transformer_dispatch_refinement.md) | Implemented |

docs/tutorial.md

Lines changed: 0 additions & 2 deletions
@@ -129,7 +129,6 @@ binoc diff ./test-vectors-materialized/csv-column-reorder/snapshot-a ./test-vect
 # Changelog: ./test-vectors-materialized/csv-column-reorder/snapshot-a → ./test-vectors-materialized/csv-column-reorder/snapshot-b

 - **data.csv**: Columns reordered
-  - Reorder Columns: order: ["city","name","age"]

 ```

@@ -150,7 +149,6 @@ binoc diff ./test-vectors-materialized/csv-mixed-changes/snapshot-a ./test-vecto
 - **data.csv**: Column added: 'email'; Columns reordered; 1 row added
   - Rows added
     - row 3: 'SF', 'Charlie', '35'
-  - Reorder Columns: order: ["city","name","age"]
   - Add Column: name: 'email'; values: {"total_values":3,"truncated":false,"values":["a@test.com","b@test.com","c@test.com"]}

 ```

docs/users/explanation/test-vectors-gallery.md

Lines changed: 98 additions & 7 deletions
@@ -13,7 +13,7 @@ audience: new user, data steward, archivist

 These are runnable examples from binoc's test suite. Each example links to its source folder on GitHub, tells you whether it needs any extra setup, gives you the exact command to run, and shows the Markdown changelog binoc is expected to print.

-Binoc currently ships **62 shared examples** in this gallery.
+Binoc currently ships **63 shared examples** in this gallery.

 ## One-time setup

@@ -46,6 +46,7 @@ just materialize
 | [`csv-stacked-tables`](#csv-stacked-tables) | Detects two logical tables stacked in one messy CSV | data.csv/>table_2: 1 row added | Default pipeline |
 | [`csv-to-tsv-reformat`](#csv-to-tsv-reformat) | Table reformatted from CSV to TSV with row edits: detected as one reformatted-and-modified table, not remove + add | data.tsv: | Default pipeline |
 | [`csv-verbosity-full`](#csv-verbosity-full) | Markdown full verbosity renders every captured changed-cell example. | data.csv: 5 cells changed | Custom config |
+| [`csv-vintage-benchmark`](#csv-vintage-benchmark) | A 'vintage' reader compares two editions of the same published dataset and wants the structural story (a column appeared, a category vocabulary shifted) surfaced above the bulk data churn they intend to ignore. | facilities.csv: Column added: 'region'; 1 cell changed | Custom config |
 | [`directory-file-copy`](#directory-file-copy) | New file with same content as an existing unchanged file detected as a copy | duplicate.txt: Copied from original.txt | Default pipeline |
 | [`directory-nested`](#directory-nested) | Subdirectories with mixed changes | data/records.csv: 1 row added | Default pipeline |
 | [`directory-nested-with-tar`](#directory-nested-with-tar) | Shows binoc diffing a tar archive and a plain directory that contain overlapping internal paths. | data.tar.gz/>records.csv: 1 cell changed | Default pipeline |
@@ -233,7 +234,6 @@ Result:
 # Changelog: snapshot-a → snapshot-b

 - **data.csv**: Columns reordered
-  - Reorder Columns: order: ["city","name","age"]
 ```

 ## csv-distribution-shift
@@ -380,7 +380,6 @@ Result:
 - **data.csv**: Column added: 'email'; Columns reordered; 1 row added
   - Rows added
     - row 2: 'LA', 'Bob', '25'
-  - Reorder Columns: order: ["city","name","age"]
   - Add Column: name: 'email'; values: {"total_values":3,"truncated":false,"values":["alice@example.test","bob@example.test","charlie@example.test"]}
 ```

@@ -405,7 +404,6 @@ Result:
 - **data.csv**: Column added: 'email'; Columns reordered; 1 row added
   - Rows added
     - row 3: 'SF', 'Charlie', '35'
-  - Reorder Columns: order: ["city","name","age"]
   - Add Column: name: 'email'; values: {"total_values":3,"truncated":false,"values":["a@test.com","b@test.com","c@test.com"]}
 ```

@@ -571,6 +569,102 @@ Result:
     - row 5, column 'score': '50' -> '51'
 ```

+## csv-vintage-benchmark
+
+A 'vintage' reader compares two editions of the same published dataset and wants the structural story (a column appeared, a category vocabulary shifted) surfaced above the bulk data churn they intend to ignore.
+
+- **Browse source:** [csv-vintage-benchmark](https://github.com/harvard-lil/binoc/tree/main/test-vectors/csv-vintage-benchmark)
+- **Tags:** `csv`, `vintage`, `metadata`, `benchmark`
+- **Snapshots:** `snapshot-a` has 2 files — `facilities.csv`, `inspections.csv`; `snapshot-b` has 2 files — `facilities.csv`, `inspections.csv`
+- **Setup:** The dataset is a yearly facilities register published as a small directory of
+CSVs. Between the two editions:
+
+* `facilities.csv` gains a `region` column (schema change) and one row's
+`status` moves to a brand-new category value, `decommissioned`
+(a *vocabulary* shift — the set of distinct values in a categorical column
+grew).
+* `inspections.csv` changes only in its data: several scores are edited and
+two rows are appended. This is exactly the churn a vintage reader does not
+want to read.
+
+The markdown config models the vintage stance as significance: schema/structural
+tags are the high-priority group, bulk cell/row tags the low-priority group.
+Because `classify_tags` promotes a node to the highest-priority group among its
+tags, `facilities.csv` (which carries both schema and cell tags) floats up to
+"Schema & vocabulary changes" while the pure-data `inspections.csv` sinks to
+"Bulk data updates". That file-granularity separation is the best vintage view
+binoc offers today.
+
+WHAT THIS BENCHMARK IS FOR — the gap between today's output (see
+`expected-output/changelog.snap`) and the target (see `VINTAGE-IDEAL.md`):
+
+1. Within-node significance. `facilities.csv`'s `region` addition and its
+`status` cell edit live on one node, so they cannot be separated: the
+vintage reader still sees the cell bullet. There is no config-driven
+edit-level drop/keep (only `EditProjection.visible`, set by writers).
+2. Vocabulary as a first-class change. The `active -> decommissioned` shift is
+reported as an ordinary `binoc.cell-change`, not as "the `status` vocabulary
+gained a value". Columns are not first-class nodes and distinct-value-set
+diffing does not exist.
+3. Summary statistics. `inspections.csv` is rendered as full cell/row detail,
+not as a one-line vintage statistic ("142 -> 144 rows, 3 cells changed").
+The Summary/GlobalClaim seams exist to carry such a fact; no rule emits one.
+
+This vector is a kept benchmark, not a feature. It is expected to PASS against
+current output; as the vintage story improves, update the snapshot and watch it
+converge on VINTAGE-IDEAL.md. See docs/adr for the design rationale.
+Save this dataset config as `/tmp/csv-vintage-benchmark.yaml`:
+
+```yaml
+output:
+  markdown:
+    groups:
+      - heading: Schema & vocabulary changes
+        tags:
+          - binoc.schema-change
+          - binoc.column-addition
+          - binoc.column-removal
+          - binoc.column-rename
+          - binoc.metadata.value-label-set
+      - heading: Bulk data updates
+        tags:
+          - binoc.cell-change
+          - binoc.row-addition
+          - binoc.row-removal
+```
+
+
+Run it:
+```bash
+binoc diff \
+./test-vectors-materialized/csv-vintage-benchmark/snapshot-a \
+./test-vectors-materialized/csv-vintage-benchmark/snapshot-b \
+--config /tmp/csv-vintage-benchmark.yaml
+```
+Result:
+```markdown
+# Changelog: snapshot-a → snapshot-b
+
+## Schema & vocabulary changes
+
+- **facilities.csv**: Column added: 'region'; 1 cell changed
+  - Changed cells
+    - row 2, column 'status': 'active' -> 'decommissioned'
+  - Set Headers: from: ["facility_id","name","status"]; to: ["facility_id","name","status","region"]
+  - Add Column: name: 'region'; values: {"total_values":4,"truncated":false,"values":["north","east","west","south"]}
+
+## Bulk data updates
+
+- **inspections.csv**: 2 rows added; 3 cells changed
+  - Changed cells
+    - row 1, column 'score': '82' -> '85'
+    - row 3, column 'score': '90' -> '91'
+    - row 4, column 'score': '68' -> '70'
+  - Rows added
+    - row 5: 'I104', 'F001', '88'
+    - row 6: 'I105', 'F002', '73'
+```
+
 ## directory-file-copy

 New file with same content as an existing unchanged file detected as a copy
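The promotion rule described in the csv-vintage-benchmark setup text (a node lands in the highest-priority group among its tags) can be sketched as follows. This is an illustration only, not binoc's `classify_tags`: `Group`, `vintage_groups`, and `classify` are hypothetical names, listing order is assumed to encode priority, and the per-file tag sets in `main` are assumptions based on the prose ("carries both schema and cell tags").

```rust
/// Hypothetical model of one `output.markdown.groups` entry:
/// a heading plus the tags that select it.
struct Group {
    heading: &'static str,
    tags: &'static [&'static str],
}

/// The two groups from the csv-vintage-benchmark config, in listed order.
fn vintage_groups() -> Vec<Group> {
    vec![
        Group {
            heading: "Schema & vocabulary changes",
            tags: &[
                "binoc.schema-change",
                "binoc.column-addition",
                "binoc.column-removal",
                "binoc.column-rename",
                "binoc.metadata.value-label-set",
            ],
        },
        Group {
            heading: "Bulk data updates",
            tags: &["binoc.cell-change", "binoc.row-addition", "binoc.row-removal"],
        },
    ]
}

/// Return the heading of the first (assumed highest-priority) group that
/// shares at least one tag with the node, or None when no group matches.
fn classify(groups: &[Group], node_tags: &[&str]) -> Option<&'static str> {
    groups
        .iter()
        .find(|g| g.tags.iter().any(|t| node_tags.contains(t)))
        .map(|g| g.heading)
}

fn main() {
    let groups = vintage_groups();
    // Assumed tags for facilities.csv: one schema tag and one cell tag.
    println!("{:?}", classify(&groups, &["binoc.column-addition", "binoc.cell-change"]));
    // Assumed tags for inspections.csv: data tags only.
    println!("{:?}", classify(&groups, &["binoc.cell-change", "binoc.row-addition"]));
}
```

Because the whole node moves to one group, the `status` cell bullet travels with `facilities.csv` into the schema section, which is the within-node limitation the benchmark text lists as gap 1.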
@@ -965,7 +1059,6 @@ Result:
 # Changelog: snapshot-a → snapshot-b

 - **metadata.json**: Document serialization changed
-  - Serialization Change: kinds: ["object_key_order","formatting"]; left: {"byte_len":70,"line_ending":"lf","object_key_orders":[{"keys":["id","name"],"path":"$.fields"},{"keys":["name","version","fields"],"path":"$"}],"trailing_newli...; right: {"byte_len":98,"indentation":"2 spaces","line_ending":"lf","object_key_orders":[{"keys":["name","id"],"path":"$.fields"},{"keys":["fields","version","name"],"pa...
 ```

 ## json-records-cell-change
@@ -1101,7 +1194,6 @@ Result:
     - '\nFAKEICONv1'
 - **license-copy.txt**: Copied from license.txt
 - **metrics.csv**: Columns reordered
-  - Reorder Columns: order: ["category","year","value"]
 - **summary.txt**: Moved from report.txt
 ```

@@ -1623,7 +1715,6 @@ Result:
 # Changelog: snapshot-a → snapshot-b

 - **archive.zip/>metadata.json**: Document serialization changed
-  - Serialization Change: kinds: ["object_key_order","formatting"]; left: {"byte_len":82,"line_ending":"lf","object_key_orders":[{"keys":["id","name"],"path":"$.schema"},{"keys":["dataset","issued","schema"],"path":"$"}],"trailing_new...; right: {"byte_len":110,"indentation":"2 spaces","line_ending":"lf","object_key_orders":[{"keys":["name","id"],"path":"$.schema"},{"keys":["schema","issued","dataset"],...
 ```

 ## zip-nested

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -171,6 +171,7 @@ nav:
   - Architectural decisions:
     # BEGIN-ADR-NAV
     - adr/README.md
+    - 'The Vintage Audience: a Kept Benchmark for Metadata-Over-Data Reading': adr/2026-06-22-vintage_audience_and_metadata_only_benchmark.md
     - 'Tiered Artifact Metadata: Column, Table, and a `parser_metadata_v1` Artifact': adr/2026-06-15-tiered_artifact_metadata.md
     - 'The Engine Overhaul, Told Whole: Single-Tree to Correspondence-First': adr/2026-06-15-engine_overhaul_retrospective.md
     - 'Partition Identities: a JIT, Format-Owned Capability for N↔M Correspondence (CFM-72)': adr/2026-06-15-partition_identities_jit_format_capability.md

test-vectors/csv-column-reorder/expected-output/changelog.snap

Lines changed: 0 additions & 1 deletion
@@ -5,4 +5,3 @@ expression: "&md"
 # Changelog: snapshot-a → snapshot-b

 - **data.csv**: Columns reordered
-  - Reorder Columns: order: ["city","name","age"]

test-vectors/csv-mid-row-insertion/expected-output/changelog.snap

Lines changed: 0 additions & 1 deletion
@@ -7,5 +7,4 @@ expression: "&md"
 - **data.csv**: Column added: 'email'; Columns reordered; 1 row added
   - Rows added
     - row 2: 'LA', 'Bob', '25'
-  - Reorder Columns: order: ["city","name","age"]
   - Add Column: name: 'email'; values: {"total_values":3,"truncated":false,"values":["alice@example.test","bob@example.test","charlie@example.test"]}
