Commit 0cdee0d

Merge pull request #6 from harvard-lil/renames
Terminology updates
2 parents 67e2005 + 241a9d7 commit 0cdee0d

77 files changed

Lines changed: 973 additions & 906 deletions

AGENTS.md

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@

 ## Project Overview

-Binoc generates changelogs for datasets that don't ship with them. Given two snapshots of a dataset, it detects structural and content changes, records them as a migration tree (the IR), and renders changes as JSON or Markdown. The primary audience is archivists, data scientists, and stewards tracking undocumented changes to published datasets.
+Binoc generates changelogs for datasets that don't ship with them. Given two snapshots of a dataset, it detects structural and content changes, records them as a changeset tree (the IR), and renders changes as JSON or Markdown. The primary audience is archivists, data scientists, and stewards tracking undocumented changes to published datasets.

 Rust workspace with five crates:

@@ -22,9 +22,9 @@ Shared test fixtures live in `test-vectors/`. Authoritative architecture spec is

 2. **The standard library (`binoc-stdlib`) is a plugin pack**, architecturally identical to third-party packs. The core engine has zero domain knowledge—not even about directories or text files.

-3. **Comparators are the parser** (raw data → IR). **Transformers are optimization passes** (IR → IR, no raw data access). **Significance classification is an outputter concern**, mapped from semantic tags via config—not baked into the IR.
+3. **Comparators are the parser** (raw data → IR). **Transformers are optimization passes** (IR → IR, no raw data access). **Significance classification is a renderer concern**, mapped from semantic tags via config—not baked into the IR.

-4. **The IR is tree-structured, openly typed, and tag-annotated.** `kind`, `item_type`, and `tags` are open enums/strings. No built-in types or significance levels. Conventions, not enforcement.
+4. **The IR is tree-structured, openly typed, and tag-annotated.** `action`, `item_type`, and `tags` are open enums/strings. No built-in types or significance levels. Conventions, not enforcement.

 5. **Dispatch is declarative-first** (type/extension filters) **with an imperative escape hatch** (`can_handle`). First comparator to claim an item wins. Ordering is a config concern, not a plugin concern.

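Reviewer note on principle 5 in the AGENTS.md hunk above: a minimal sketch of what "declarative-first with an imperative escape hatch" could look like. The `Item` struct, the `Comparator` trait shape, and the `dispatch` function are hypothetical stand-ins written for this note, not the `binoc-sdk` API. Only the `can_handle` name and the "first comparator to claim an item wins" rule come from the text.

```rust
// Hypothetical sketch: none of these types are taken from binoc-sdk.

/// Stand-in for an item being compared (assumption: identified by a path).
struct Item {
    path: String,
}

impl Item {
    /// File extension without the dot, if any.
    fn extension(&self) -> Option<&str> {
        std::path::Path::new(&self.path)
            .extension()
            .and_then(|e| e.to_str())
    }
}

trait Comparator {
    fn name(&self) -> &str;

    /// Declarative filter: extensions this comparator claims.
    fn extensions(&self) -> &[&str] {
        &[]
    }

    /// Imperative escape hatch, consulted when the declarative filter does not match.
    fn can_handle(&self, _item: &Item) -> bool {
        false
    }
}

struct CsvComparator;

impl Comparator for CsvComparator {
    fn name(&self) -> &str {
        "csv"
    }
    fn extensions(&self) -> &[&str] {
        &["csv", "tsv"]
    }
}

struct BinaryFallback;

impl Comparator for BinaryFallback {
    fn name(&self) -> &str {
        "binary"
    }
    fn can_handle(&self, _item: &Item) -> bool {
        true // claims anything that reaches it
    }
}

/// Returns the name of the first comparator, in slice order, that claims the item.
/// The slice order plays the role of the configured ordering.
fn dispatch<'a>(comparators: &'a [Box<dyn Comparator>], item: &Item) -> Option<&'a str> {
    comparators
        .iter()
        .find(|c| {
            let declared = item
                .extension()
                .map_or(false, |ext| c.extensions().contains(&ext));
            declared || c.can_handle(item)
        })
        .map(|c| c.name())
}

fn main() {
    let comparators: Vec<Box<dyn Comparator>> =
        vec![Box::new(CsvComparator), Box::new(BinaryFallback)];
    let csv = Item { path: "data/agencies.csv".to_string() };
    let db = Item { path: "summary.sqlite".to_string() };
    println!("{:?}", dispatch(&comparators, &csv));
    println!("{:?}", dispatch(&comparators, &db));
}
```

In this sketch the ordering lives entirely in the caller's list, which matches the stated intent that ordering is a config concern rather than a plugin concern.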
README.md

Lines changed: 14 additions & 14 deletions
@@ -1,6 +1,6 @@
 # Binoc: The Missing Changelog for Datasets

-Binoc generates changelogs for datasets that don't have them. Given a series of snapshots of a dataset downloaded at different times, Binoc detects what changed, expresses those changes as a minimal structured diff, and produces human-readable summaries that distinguish substantive policy changes from ministerial housekeeping.
+Binoc generates changelogs for datasets that don't have them. Given a series of snapshots of a dataset downloaded at different times, Binoc detects what changed, expresses those changes as a minimal structured diff, and produces human-readable summaries that distinguish substantive policy changes from clerical housekeeping.

 The core workflow: an archivist, data scientist, or steward has five copies of a government dataset containing CSVs, downloaded over two years. Some are identical. Some have reordered columns. One has a new category relevant to their research. Binoc tells them exactly what changed, when, and whether (by their definition) it matters.

@@ -15,7 +15,7 @@ binoc diff release-q3/ release-q4/
 ```
 # Changelog: release-q3/ → release-q4/

-## Ministerial Changes
+## Clerical Changes

 - **data.zip/agencies.csv**: Columns reordered (content unchanged)
@@ -24,7 +24,7 @@ binoc diff release-q3/ release-q4/
 - **summary.sqlite**: Content changed (12.0 KB → 12.0 KB)
 ```

-Binoc looked inside the zip and compared the CSV column-by-column — the reorder is flagged as ministerial housekeeping, not a real data change. But `.sqlite` is opaque to the standard library, so you only learn that the bytes differ.
+Binoc looked inside the zip and compared the CSV column-by-column — the reorder is flagged as clerical housekeeping, not a real data change. But `.sqlite` is opaque to the standard library, so you only learn that the bytes differ.

 ```bash
 pip install binoc-sqlite
@@ -34,7 +34,7 @@ binoc diff release-q3/ release-q4/
 ```
 # Changelog: release-q3/ → release-q4/

-## Ministerial Changes
+## Clerical Changes

 - **data.zip/agencies.csv**: Columns reordered (content unchanged)
@@ -50,7 +50,7 @@ Same command, richer output. The plugin parsed the database and found the actual
 Datasets published by governments, research institutions, and public bodies are living artifacts, and can change without warning or documentation (or without consistent documentation). The archival and data science communities need tooling to:

 - Detect whether a new snapshot of a dataset actually differs from the previous one.
-- Describe changes precisely — not just "the file changed," but "three columns were reordered (ministerial) and one column was split into two (substantive)."
+- Describe changes precisely — not just "the file changed," but "three columns were reordered (clerical) and one column was split into two (substantive)."
 - Produce changelogs that are machine-readable for automated pipelines and human-readable for policy analysis.
 - Handle real-world messiness: datasets inside zip archives, nested containers, mixed formats, renamed files.

@@ -65,8 +65,8 @@ Generic diff tools don't understand data formats, while version control systems
 - Compare text files at line level
 - Compare binary files by content hash
 - Detect moves and copies from content hashes
-- Extract actual changed data from migration nodes (added rows, text diffs, etc.)
-- Render migrations as JSON or Markdown changelogs
+- Extract actual changed data from changeset nodes (added rows, text diffs, etc.)
+- Render changesets as JSON or Markdown changelogs
 - Extend comparison and transformation pipelines via Rust native plugins (C ABI), Python plugins, or in-workspace stdlib plugins

 ## Documentation
@@ -97,7 +97,7 @@ Diff two snapshots (prints a Markdown changelog to stdout by default):
 binoc diff path/to/snapshot-a path/to/snapshot-b
 ```

-Get raw migration JSON instead:
+Get raw changeset JSON instead:

 ```bash
 binoc diff path/to/snapshot-a path/to/snapshot-b --format json
@@ -107,19 +107,19 @@ Save outputs to files (format inferred from extension, or use `format:path` synt

 ```bash
 binoc diff path/to/snapshot-a path/to/snapshot-b \
-  -o migration.json -o CHANGELOG.md -q
+  -o changeset.json -o CHANGELOG.md -q
 ```

-Combine saved migrations into a changelog:
+Combine saved changesets into a changelog:

 ```bash
-binoc changelog migrations/*.json
+binoc changelog changesets/*.json
 ```

-Extract the actual changed data from a migration node (requires original snapshots):
+Extract the actual changed data from a changeset node (requires original snapshots):

 ```bash
-binoc extract migration.json data.csv rows_added
+binoc extract changeset.json data.csv rows_added
 ```

 ### Plugins
@@ -176,7 +176,7 @@ This builds both packages from source and wires up entry-point discovery automat
 | `binoc-stdlib/` | Standard comparators and transformers (architecturally identical to third-party plugins) |
 | `binoc-cli/` | CLI library + standalone Rust binary |
 | `binoc-python/` | PyO3 bindings, native plugin loader (`libloading`), Python plugin bridges, `binoc` CLI entry point |
-| `model-plugins/` | Reference plugin implementations: `binoc-sqlite` (Rust comparator), `binoc-row-reorder` (Rust transformer), `binoc-html` (Python outputter) |
+| `model-plugins/` | Reference plugin implementations: `binoc-sqlite` (Rust comparator), `binoc-row-reorder` (Rust transformer), `binoc-html` (Python renderer) |
 | `test-vectors/` | Shared test fixtures for standard library plugins |
 | `docs/` | Documentation, design notes, and ADRs |

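Reviewer note on the `-o` examples in the README hunks above: the README says the output format is inferred from the file extension unless the `format:path` syntax is used. The sketch below shows one way such a spec could be parsed. The fields of `OutputSpec`, the body of `parse`, and the `format_hint` helper are assumptions made for illustration; the diff only shows that `OutputSpec::parse` and `spec.path` exist in `binoc-cli`.

```rust
use std::path::PathBuf;

/// Hypothetical mirror of the CLI's `OutputSpec`; the real fields are not shown in this diff.
#[derive(Debug, PartialEq)]
struct OutputSpec {
    format: Option<String>,
    path: PathBuf,
}

impl OutputSpec {
    /// Splits `format:path` on the first colon; anything else is treated as a bare path.
    /// Assumption: a prefix counts as a format name only if it is non-empty and
    /// contains no path separators. Windows drive letters are not handled here.
    fn parse(raw: &str) -> Self {
        match raw.split_once(':') {
            Some((fmt, path))
                if !fmt.is_empty()
                    && !path.is_empty()
                    && !fmt.contains(|c: char| c == '/' || c == '\\') =>
            {
                OutputSpec {
                    format: Some(fmt.to_string()),
                    path: PathBuf::from(path),
                }
            }
            _ => OutputSpec {
                format: None,
                path: PathBuf::from(raw),
            },
        }
    }

    /// The explicit format if one was given, otherwise the file extension.
    fn format_hint(&self) -> Option<String> {
        self.format.clone().or_else(|| {
            self.path
                .extension()
                .and_then(|e| e.to_str())
                .map(str::to_owned)
        })
    }
}

fn main() {
    for raw in ["changeset.json", "markdown:notes/CHANGELOG.txt", "report"] {
        let spec = OutputSpec::parse(raw);
        println!("{raw} -> {:?} (hint: {:?})", spec, spec.format_hint());
    }
}
```

A spec with neither an explicit format nor an extension yields no hint, which corresponds to the "cannot infer format" error path visible in `resolve_format` in the `binoc-cli/src/lib.rs` diff.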
binoc-cli/src/lib.rs

Lines changed: 35 additions & 35 deletions
@@ -6,7 +6,7 @@ use clap::{Parser, Subcommand};
 use binoc_core::config::{DatasetConfig, PluginRegistry, ResolvedPlugins};
 use binoc_core::controller::Controller;
 use binoc_core::output;
-use binoc_sdk::{BinocError, ExtractResult, Migration, Outputter};
+use binoc_sdk::{BinocError, Changeset, ExtractResult, Renderer};

 #[derive(Parser)]
 #[command(name = "binoc", about = "The missing changelog for datasets")]
@@ -30,9 +30,9 @@ enum Commands {
         #[arg(long, short)]
         quiet: bool,
     },
-    /// Generate a human-readable changelog from one or more migrations.
+    /// Generate a human-readable changelog from one or more changesets.
     Changelog {
-        migrations: Vec<PathBuf>,
+        changesets: Vec<PathBuf>,
         #[arg(long)]
         config: Option<PathBuf>,
         #[arg(long, short)]
@@ -42,9 +42,9 @@ enum Commands {
         #[arg(long, short)]
         quiet: bool,
     },
-    /// Extract actual changed data from a migration node.
+    /// Extract actual changed data from a changeset node.
     Extract {
-        migration: PathBuf,
+        changeset: PathBuf,
         node: String,
         #[arg(default_value = "content")]
         aspect: String,
@@ -85,7 +85,7 @@ impl OutputSpec {

 enum ResolvedFormat {
     Json,
-    Outputter(Arc<dyn Outputter>),
+    Renderer(Arc<dyn Renderer>),
 }

 fn resolve_format(
@@ -99,8 +99,8 @@ fn resolve_format(
     if ext == "json" {
         return Ok(ResolvedFormat::Json);
     }
-    match resolved.outputter_for_extension(ext)? {
-        Some(o) => Ok(ResolvedFormat::Outputter(o)),
+    match resolved.renderer_for_extension(ext)? {
+        Some(o) => Ok(ResolvedFormat::Renderer(o)),
         None => Err(BinocError::Config(format!(
             "cannot infer format for .{ext}; use format:path syntax (e.g. markdown:{path})",
             path = spec.path.display(),
@@ -118,28 +118,28 @@ fn resolve_format_name(
         return Ok(ResolvedFormat::Json);
     }
     resolved
-        .outputter_by_name(name)
-        .map(ResolvedFormat::Outputter)
+        .renderer_by_name(name)
+        .map(ResolvedFormat::Renderer)
         .ok_or_else(|| BinocError::Config(format!("unknown output format: {name}")))
 }

 fn render(
     format: &ResolvedFormat,
-    migrations: &[Migration],
+    changesets: &[Changeset],
     config: &DatasetConfig,
 ) -> Result<String, BinocError> {
     match format {
         ResolvedFormat::Json => {
-            if migrations.len() == 1 {
-                output::to_json(&migrations[0]).map_err(|e| BinocError::Other(e.to_string()))
+            if changesets.len() == 1 {
+                output::to_json(&changesets[0]).map_err(|e| BinocError::Other(e.to_string()))
             } else {
-                serde_json::to_string_pretty(&migrations)
+                serde_json::to_string_pretty(&changesets)
                     .map_err(|e| BinocError::Other(e.to_string()))
             }
         }
-        ResolvedFormat::Outputter(o) => {
-            let outputter_config = config.output.get_for_outputter(&o.descriptor().name);
-            o.render(migrations, &outputter_config)
+        ResolvedFormat::Renderer(o) => {
+            let renderer_config = config.output.get_for_renderer(&o.descriptor().name);
+            o.render(changesets, &renderer_config)
         }
     }
 }
@@ -148,20 +148,20 @@ fn write_outputs(
     output_specs: &[String],
     stdout_format: &str,
     quiet: bool,
-    migrations: &[Migration],
+    changesets: &[Changeset],
     config: &DatasetConfig,
     resolved: &ResolvedPlugins,
 ) -> Result<(), Box<dyn std::error::Error>> {
     if !quiet {
         let fmt = resolve_format_name(stdout_format, resolved)?;
-        let text = render(&fmt, migrations, config)?;
+        let text = render(&fmt, changesets, config)?;
         print!("{text}");
     }

     for raw in output_specs {
         let spec = OutputSpec::parse(raw);
         let fmt = resolve_format(&spec, resolved)?;
-        let text = render(&fmt, migrations, config)?;
+        let text = render(&fmt, changesets, config)?;
         if let Some(parent) = spec.path.parent() {
             if !parent.as_os_str().is_empty() {
                 std::fs::create_dir_all(parent)?;
@@ -210,20 +210,20 @@ pub fn run(
             let snap_a = snapshot_a.to_string_lossy().to_string();
             let snap_b = snapshot_b.to_string_lossy().to_string();

-            let migration = controller.diff(&snap_a, &snap_b)?;
-            let migrations = [migration];
+            let changeset = controller.diff(&snap_a, &snap_b)?;
+            let changesets = [changeset];

             write_outputs(
                 &output,
                 &format,
                 quiet,
-                &migrations,
+                &changesets,
                 &dataset_config,
                 &resolved,
             )?;
         }
         Commands::Changelog {
-            migrations: migration_paths,
+            changesets: changeset_paths,
             config,
             output,
             format,
@@ -236,39 +236,39 @@ pub fn run(

             let resolved = registry.resolve(&dataset_config)?;

-            let mut migrations: Vec<Migration> = Vec::new();
-            for path in &migration_paths {
+            let mut changesets: Vec<Changeset> = Vec::new();
+            for path in &changeset_paths {
                 let data = std::fs::read_to_string(path)?;
-                let m: Migration = serde_json::from_str(&data)?;
-                migrations.push(m);
+                let m: Changeset = serde_json::from_str(&data)?;
+                changesets.push(m);
             }

             write_outputs(
                 &output,
                 &format,
                 quiet,
-                &migrations,
+                &changesets,
                 &dataset_config,
                 &resolved,
             )?;
         }
         Commands::Extract {
-            migration: migration_path,
+            changeset: changeset_path,
             node,
             aspect,
             snapshot_a,
             snapshot_b,
             config,
         } => {
-            let data = std::fs::read_to_string(&migration_path)?;
-            let migration: Migration = serde_json::from_str(&data)?;
+            let data = std::fs::read_to_string(&changeset_path)?;
+            let changeset: Changeset = serde_json::from_str(&data)?;

             let snap_a = snapshot_a
                 .map(|p| p.to_string_lossy().to_string())
-                .unwrap_or_else(|| migration.from_snapshot.clone());
+                .unwrap_or_else(|| changeset.from_snapshot.clone());
             let snap_b = snapshot_b
                 .map(|p| p.to_string_lossy().to_string())
-                .unwrap_or_else(|| migration.to_snapshot.clone());
+                .unwrap_or_else(|| changeset.to_snapshot.clone());

             if !std::path::Path::new(&snap_a).exists() {
                 eprintln!("Snapshot A not found: {snap_a}");
@@ -289,7 +289,7 @@ pub fn run(
             let resolved = registry.resolve(&dataset_config)?;
             let controller = Controller::new(resolved.comparators, resolved.transformers);

-            match controller.extract(&migration, &node, &aspect, &snap_a, &snap_b) {
+            match controller.extract(&changeset, &node, &aspect, &snap_a, &snap_b) {
                 Ok(result) => match result {
                     ExtractResult::Text(text) => {
                         print!("{text}");

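Reviewer note on the `Extract` arm in the `binoc-cli/src/lib.rs` diff above: the snapshot paths fall back to the ones recorded in the changeset when no override is passed on the command line. The standalone sketch below isolates that fallback. The `Changeset` struct here is a hypothetical stand-in carrying only the `from_snapshot` and `to_snapshot` fields the diff references, and `resolve_snapshot` is a helper name invented for this note.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in: only the two fields referenced in the diff.
struct Changeset {
    from_snapshot: String,
    to_snapshot: String,
}

/// Prefer an explicit path from the command line; otherwise use the recorded one.
fn resolve_snapshot(cli_override: Option<PathBuf>, recorded: &str) -> String {
    cli_override
        .map(|p| p.to_string_lossy().to_string())
        .unwrap_or_else(|| recorded.to_string())
}

fn main() {
    let changeset = Changeset {
        from_snapshot: "release-q3/".to_string(),
        to_snapshot: "release-q4/".to_string(),
    };

    // No override for A, explicit override for B.
    let snap_a = resolve_snapshot(None, &changeset.from_snapshot);
    let snap_b = resolve_snapshot(Some(PathBuf::from("mirror/q4")), &changeset.to_snapshot);

    println!("A = {snap_a}");
    println!("B = {snap_b}");
}
```

Because the recorded paths are whatever was passed at diff time, the existence checks that follow in the real code matter: a changeset moved to another machine may point at snapshots that are no longer there.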