Reduce boilerplate: error traits, array accessors, and job helpers #35

Open
tonyalaribe wants to merge 6 commits into master from claude/festive-brahmagupta-6QmC3

@tonyalaribe (Contributor)

Summary

This PR reduces cross-cutting boilerplate across the codebase by introducing three reusable abstractions:

  1. Error-conversion extension traits (error_ext.rs) — eliminate repeated map_err(|e| DataFusionError::…) patterns
  2. Unified Arrow string accessor (arrow_access.rs) — collapse downcast cascades for Utf8/LargeUtf8/Utf8View arrays
  3. Scheduled job and table-iteration helpers in database.rs — remove closure/Box::pin scaffolding and duplicated unified+custom table loops

Key Changes

  • New modules:

    • src/arrow_access.rs: StrAccessor enum that unifies reads across the three Arrow string encodings, replacing ~15 duplicated downcast-and-loop blocks
    • src/error_ext.rs: ArrowResultExt and ExecResultExt traits that convert Arrow and generic errors to DataFusion results in one method call
    • docs/code-conciseness-research.md — detailed audit of boilerplate patterns and reduction opportunities across the codebase
  • src/database.rs refactoring:

    • New add_async_job() helper that wraps the Job::new_async(…, { let db = db.clone(); move |_, _| { let db = db.clone(); Box::pin(async move { … }) } }) pattern and the follow-up scheduler.add(job).await? call
    • New all_tables() helper that returns a snapshot of unified + custom tables as (project_id, table_name, table) tuples
    • New table_label() function for consistent maintenance log messages ("unified table 'X'" vs. "custom project 'P' table 'X'")
    • Refactored all six scheduled jobs (light optimize, optimize, recompress, vacuum, cache stats, stats refresh) to use the new helpers, removing ~200 LOC of closure boilerplate and duplicated table iteration
  • Callsite updates:

    • src/functions.rs — use StrAccessor in extract_scalar_string() and evaluate_jsonpath_on_json_string(); apply error traits to Arrow casts and JSONPath parsing
    • src/tantivy_index/udf.rs and src/tantivy_index/builder.rs — replace duplicated string_extractor() implementations with StrAccessor
    • src/mem_buffer.rs, src/dml.rs — apply error traits to Arrow operations
    • src/lib.rs — export new modules
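
For reviewers, a rough sketch of the shape add_async_job() could take. It is not the code in this PR: `Db`, `Scheduler`, `run_ready`, `demo`, and the schedule strings are stand-ins invented so the example compiles with the standard library only. The real helper wraps `Job::new_async` and `scheduler.add(job).await?`. The point it illustrates is that the caller passes a plain `Fn(Arc<Db>) -> Future`, and the only `Box::pin` lives inside the helper.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Wake, Waker};

// Stand-in for the database handle (the real type lives in src/database.rs).
pub struct Db {
    pub name: String,
}

type BoxFut = Pin<Box<dyn Future<Output = ()> + Send>>;
type BoxedJob = Box<dyn Fn() -> BoxFut + Send + Sync>;

// Stand-in for the cron scheduler: it only records (schedule, job) pairs.
#[derive(Default)]
pub struct Scheduler {
    pub jobs: Vec<(String, BoxedJob)>,
}

/// Hypothetical helper: clones the handle per run and boxes the future,
/// so call sites never write `db.clone()` or `Box::pin` themselves.
pub fn add_async_job<F, Fut>(sched: &mut Scheduler, db: &Arc<Db>, schedule: &str, f: F)
where
    F: Fn(Arc<Db>) -> Fut + Send + Sync + 'static,
    Fut: Future<Output = ()> + Send + 'static,
{
    let db = Arc::clone(db);
    let job: BoxedJob = Box::new(move || -> BoxFut { Box::pin(f(Arc::clone(&db))) });
    sched.jobs.push((schedule.to_string(), job));
}

// Minimal driver for futures that never suspend; for this demo only.
struct Noop;
impl Wake for Noop {
    fn wake(self: Arc<Self>) {}
}

pub fn run_ready(mut fut: BoxFut) {
    let waker = Waker::from(Arc::new(Noop));
    let mut cx = Context::from_waker(&waker);
    while fut.as_mut().poll(&mut cx).is_pending() {}
}

/// Registers two jobs and runs each once, returning what they logged.
pub fn demo() -> Vec<String> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let db = Arc::new(Db { name: "main".into() });
    let mut sched = Scheduler::default();
    // Schedule strings are placeholders, not the PR's actual schedules.
    for (schedule, task) in [("every-5-min", "light optimize"), ("hourly", "vacuum")] {
        let log = Arc::clone(&log);
        add_async_job(&mut sched, &db, schedule, move |db| {
            let log = Arc::clone(&log);
            async move {
                log.lock().unwrap().push(format!("{task} on {}", db.name));
            }
        });
    }
    for (_, job) in &sched.jobs {
        run_ready(job());
    }
    let out = log.lock().unwrap().clone();
    out
}
```

Making the helper generic over `F` and `Fut` keeps each call site down to a closure and an `async move` block. Type erasure happens once, where the job is stored.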

Implementation Details

  • StrAccessor::try_new() performs a single downcast cascade and returns a unified enum; callers use value(), get(), is_null() without branching on the concrete type
  • Error traits are zero-cost: they're simple #[inline] wrappers around existing error constructors
  • The add_async_job() helper is generic over the closure type and hides the { let db = db.clone(); move |_, _| { let db = db.clone(); Box::pin(async move { … }) } } pattern entirely
  • All six scheduled jobs are now registered through add_async_job(); the five that iterate tables acquire a snapshot via all_tables(), iterate once, and log with table_label() for consistency
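
To make the StrAccessor bullet concrete, here is a minimal sketch of the "downcast once, then read uniformly" idea. It is an illustration under stated assumptions, not the contents of src/arrow_access.rs. The three array structs are std-only stand-ins for Arrow's Utf8/LargeUtf8/Utf8View arrays, `try_new` takes `&dyn Any` where the real code would go through Arrow's array trait, and `len`, `slots`, and `total_len` are hypothetical names.

```rust
use std::any::Any;

// Stand-ins for the three Arrow string arrays. Real Arrow arrays differ in
// offset width and buffer layout; here each is just a Vec of optional strings.
pub struct Utf8Array(pub Vec<Option<String>>);
pub struct LargeUtf8Array(pub Vec<Option<String>>);
pub struct Utf8ViewArray(pub Vec<Option<String>>);

/// One variant per encoding, borrowing the already-downcast array.
pub enum StrAccessor<'a> {
    Utf8(&'a Utf8Array),
    LargeUtf8(&'a LargeUtf8Array),
    Utf8View(&'a Utf8ViewArray),
}

impl<'a> StrAccessor<'a> {
    /// The single downcast cascade: each encoding is tried once, up front.
    pub fn try_new(array: &'a dyn Any) -> Result<Self, String> {
        if let Some(a) = array.downcast_ref::<Utf8Array>() {
            Ok(Self::Utf8(a))
        } else if let Some(a) = array.downcast_ref::<LargeUtf8Array>() {
            Ok(Self::LargeUtf8(a))
        } else if let Some(a) = array.downcast_ref::<Utf8ViewArray>() {
            Ok(Self::Utf8View(a))
        } else {
            Err("expected a Utf8, LargeUtf8, or Utf8View array".to_string())
        }
    }

    // The only place that branches on the concrete type.
    fn slots(&self) -> &'a [Option<String>] {
        match *self {
            Self::Utf8(a) => a.0.as_slice(),
            Self::LargeUtf8(a) => a.0.as_slice(),
            Self::Utf8View(a) => a.0.as_slice(),
        }
    }

    pub fn len(&self) -> usize {
        self.slots().len()
    }

    pub fn is_null(&self, i: usize) -> bool {
        self.slots()[i].is_none()
    }

    /// Null-aware read.
    pub fn get(&self, i: usize) -> Option<&'a str> {
        self.slots()[i].as_deref()
    }

    /// Read without a null check; in this sketch a null slot reads as "".
    pub fn value(&self, i: usize) -> &'a str {
        self.get(i).unwrap_or("")
    }
}

/// Hypothetical caller: one loop serves all three encodings.
pub fn total_len(array: &dyn Any) -> Result<usize, String> {
    let acc = StrAccessor::try_new(array)?;
    Ok((0..acc.len()).filter_map(|i| acc.get(i)).map(str::len).sum())
}
```

A caller such as `total_len` contains no per-encoding branches, which is the property the refactored call sites rely on.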

Testing

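Existing tests cover the modified code paths; no new test files were added, and each commit was verified with cargo check. The refactoring is mechanical apart from three intentional differences called out in the commit messages: StrAccessor additionally accepts LargeUtf8, all_tables() releases the table-map read locks before per-table work, and maintenance log messages are normalized to the table_label() format.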

https://claude.ai/code/session_01SCS7yFmpjgSgdg68UnM7Df

claude added 6 commits June 5, 2026 11:08
Analysis of src/ identifying cross-cutting consolidation opportunities
(shared Arrow array-access layer, error-conversion traits, scheduled-job
helpers) plus per-file boilerplate clusters and library/derive/macro
strategies. Estimated ~1,000-1,300 LOC reduction.

https://claude.ai/code/session_01SCS7yFmpjgSgdg68UnM7Df
Introduce src/error_ext.rs with two extension traits that remove repeated
error-mapping boilerplate:
- ArrowResultExt::into_df() replaces
  `.map_err(|e| DataFusionError::ArrowError(Box::new(e), None))`
- ExecResultExt::exec_context(msg) replaces
  `.map_err(|e| DataFusionError::Execution(format!("msg: {e}")))`

Migrate database.rs: 6 ArrowError sites and 4 Execution-context sites.
First increment of the consolidation work in docs/code-conciseness-research.md.

https://claude.ai/code/session_01SCS7yFmpjgSgdg68UnM7Df
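
The commit above describes two extension traits. As a minimal sketch of what they might look like: `ArrowError` and `DataFusionError` below are local stand-ins reduced to the variants named in the commit message, so the example builds without the arrow or datafusion crates. `parse_port` and `cast_column` are hypothetical call sites, not functions from this repository.

```rust
use std::fmt::Display;

// Stand-ins for arrow's ArrowError and datafusion's DataFusionError,
// reduced to what the commit message mentions.
#[derive(Debug, PartialEq)]
pub struct ArrowError(pub String);

#[derive(Debug, PartialEq)]
pub enum DataFusionError {
    ArrowError(Box<ArrowError>, Option<String>),
    Execution(String),
}

pub type Result<T> = std::result::Result<T, DataFusionError>;

/// Replaces `.map_err(|e| DataFusionError::ArrowError(Box::new(e), None))`.
pub trait ArrowResultExt<T> {
    fn into_df(self) -> Result<T>;
}

impl<T> ArrowResultExt<T> for std::result::Result<T, ArrowError> {
    #[inline]
    fn into_df(self) -> Result<T> {
        self.map_err(|e| DataFusionError::ArrowError(Box::new(e), None))
    }
}

/// Replaces `.map_err(|e| DataFusionError::Execution(format!("msg: {e}")))`.
pub trait ExecResultExt<T> {
    fn exec_context(self, msg: &str) -> Result<T>;
}

impl<T, E: Display> ExecResultExt<T> for std::result::Result<T, E> {
    #[inline]
    fn exec_context(self, msg: &str) -> Result<T> {
        // The message is only formatted on the error path.
        self.map_err(|e| DataFusionError::Execution(format!("{msg}: {e}")))
    }
}

// Hypothetical call site: any `Display` error gains context in one call.
pub fn parse_port(s: &str) -> Result<u16> {
    s.parse::<u16>().exec_context("invalid port")
}

// Hypothetical call site: an Arrow-style failure converted with `into_df()`.
pub fn cast_column(ok: bool) -> Result<i32> {
    let r: std::result::Result<i32, ArrowError> = if ok {
        Ok(1)
    } else {
        Err(ArrowError("cast failed".into()))
    };
    r.into_df()
}
```

Because the traits are implemented on `Result` itself, migrating a call site is a one-token change from `.map_err(...)` to `.into_df()` or `.exec_context("...")`.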
…stats_table

Migrate ~11 more verbose map_err sites to ArrowResultExt::into_df() and
ExecResultExt::exec_context():
- functions.rs: 1 ArrowError + 3 Execution-context sites
- mem_buffer.rs: 3 ArrowError sites
- stats_table.rs: 1 ArrowError site (drop now-unused DataFusionError import)
- dml.rs: 5 Execution-context sites

Verified with cargo check (clean, no new warnings).

https://claude.ai/code/session_01SCS7yFmpjgSgdg68UnM7Df
Add src/arrow_access.rs with StrAccessor, generalizing the existing
BinaryAccessor pattern to read uniformly across Utf8/LargeUtf8/Utf8View
arrays. Use it to collapse duplicated per-encoding downcast-and-loop blocks
in functions.rs:
- evaluate_jsonpath_on_json_string: two near-identical loops -> one
- extract_scalar_string: two downcast branches -> one

StrAccessor also accepts LargeUtf8 (a strict superset of prior behavior).
Verified with cargo check (clean).

https://claude.ai/code/session_01SCS7yFmpjgSgdg68UnM7Df
Both tantivy_index/builder.rs and tantivy_index/udf.rs hand-rolled a
near-identical Utf8/Utf8View downcast-to-closure helper. Replace both with
StrAccessor (adding a null-aware get() accessor), dropping the now-unused
AsArray/Array/StringArray/StringViewArray imports. Behavior preserved
(StrAccessor additionally accepts LargeUtf8).

Verified with cargo check (clean).

https://claude.ai/code/session_01SCS7yFmpjgSgdg68UnM7Df
Add two helpers on Database and collapse the six cron jobs in
start_maintenance_schedulers:

- add_async_job(): hides the repeated
  `Job::new_async(.., { let db = db.clone(); move |_,_| { let db = db.clone();
  Box::pin(async move { .. }) }})` + `scheduler.add(job).await?` plumbing that
  every job duplicated. The single remaining Box::pin lives in the helper.
- all_tables(): returns a (project_id, table_name, table) snapshot over both
  unified and custom tables, replacing the duplicated "iterate unified, then
  iterate custom" double loops in 5 jobs.
- table_label(): one place for the "unified table 'X'" / "custom project 'P'
  table 'X'" log wording.

Net -148/+107 lines in database.rs.

Behavior notes (intentional, called out for review):
- all_tables() clones the Arc handles and releases the map read-locks before
  per-table work, rather than holding them across the whole maintenance run.
  This matches the snapshot pattern the recompress job already used and
  reduces lock contention with table-map writers. Per-table ops still take the
  individual table's RwLock internally.
- Maintenance log messages are normalized to the table_label() format; the
  stats-refresh job previously logged custom tables as "P:X" and now logs
  "custom project 'P' table 'X'". Log wording only; no functional change.

Verified with cargo check (clean, no warnings).

https://claude.ai/code/session_01SCS7yFmpjgSgdg68UnM7Df
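
The snapshot behavior and label format described in the last commit can be sketched as follows. The map layout, the `Table` struct, the use of `Option` to mark unified tables, and `demo` are assumptions made for illustration. Only the two log strings come from the commit message.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

// Stand-in for the per-table handle stored by the database.
pub struct Table {
    pub rows: usize,
}

// Hypothetical layout: unified tables keyed by name,
// custom tables keyed by (project_id, table_name).
pub struct Database {
    unified: RwLock<BTreeMap<String, Arc<Table>>>,
    custom: RwLock<BTreeMap<(String, String), Arc<Table>>>,
}

impl Database {
    /// Clones the Arc handles and drops each read guard before returning,
    /// so per-table work never runs while a map lock is held.
    /// `None` as project id marks a unified table (an assumption here).
    pub fn all_tables(&self) -> Vec<(Option<String>, String, Arc<Table>)> {
        let mut out = Vec::new();
        {
            let unified = self.unified.read().unwrap();
            out.extend(unified.iter().map(|(n, t)| (None, n.clone(), Arc::clone(t))));
        }
        {
            let custom = self.custom.read().unwrap();
            out.extend(
                custom
                    .iter()
                    .map(|((p, n), t)| (Some(p.clone()), n.clone(), Arc::clone(t))),
            );
        }
        out
    }
}

/// Single source for the maintenance log wording.
pub fn table_label(project_id: Option<&str>, table_name: &str) -> String {
    match project_id {
        None => format!("unified table '{table_name}'"),
        Some(p) => format!("custom project '{p}' table '{table_name}'"),
    }
}

/// Takes a snapshot, lets a writer mutate the map, then does per-table work.
pub fn demo() -> Vec<String> {
    let db = Database {
        unified: RwLock::new(BTreeMap::from([(
            "logs".to_string(),
            Arc::new(Table { rows: 3 }),
        )])),
        custom: RwLock::new(BTreeMap::from([(
            ("p1".to_string(), "events".to_string()),
            Arc::new(Table { rows: 7 }),
        )])),
    };
    let tables = db.all_tables();
    // A writer can take the map lock now: the snapshot no longer holds it.
    db.unified.write().unwrap().clear();
    tables
        .iter()
        .map(|(p, n, t)| format!("{}: {} rows", table_label(p.as_deref(), n), t.rows))
        .collect()
}
```

The trade-off of a snapshot is that a table dropped from the map mid-run is still visited once, since its `Arc` keeps it alive for the rest of that maintenance pass.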