Replace SQLAlchemy with ibis-framework for all database operations #66
daniel-thom wants to merge 46 commits into main
Conversation
Replace the SQLAlchemy ORM with ibis-framework to provide a cleaner multi-backend abstraction for DuckDB, SQLite, and Spark. This is a clean API break: engine -> backend, Connection params removed, ibis expressions replace SQLAlchemy select/join chains.

Key changes:
- New src/chronify/ibis/ module with an IbisBackend ABC and DuckDB/SQLite/Spark implementations
- Remove SQLAlchemy, the pyhive vendor code, Hive support, and related dependencies
- Migrate all source modules (store, mappers, checker, converters) to the ibis API
- Migrate all tests to use backend fixtures instead of engine fixtures
- Add the ibis-framework[duckdb,sqlite] dependency and pyspark >= 4.0 for the spark extra

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
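A minimal sketch of what the API break looks like at the entry point, assuming the `make_backend` factory and `Store(backend=...)` constructor named elsewhere in this PR (exact signatures unverified):

```python
# Hedged sketch of the engine -> backend break; names taken from this PR
# (make_backend, Store(backend=...)), signatures assumed.
from chronify import Store                 # import path assumed
from chronify.ibis import make_backend     # factory exposed by the new module

# Before (removed): Store(engine=sqlalchemy.create_engine("duckdb:///..."))
backend = make_backend("duckdb")           # unknown names raise InvalidParameter
store = Store(backend=backend)
```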
Pull request overview
This PR migrates Chronify’s database layer from SQLAlchemy/Hive-specific code to an ibis-based backend abstraction, aiming to support multiple execution engines (DuckDB, SQLite, Spark) with a unified API.
Changes:
- Introduces src/chronify/ibis/ with backend implementations and I/O helpers (read_query, write_table, Parquet helpers).
- Refactors mappers/checkers/time zone utilities and tests to use IbisBackend instead of sqlalchemy.Engine/MetaData.
- Removes SQLAlchemy + Hive/pyhive vendor code and updates project dependencies accordingly.
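A hedged sketch of the new I/O helpers named above; the module path comes from the file list below, and the assumption that read_query takes an ibis expression (with read_raw_query covering SQL strings) is inferred from later commit notes:

```python
import ibis
import pandas as pd

from chronify.ibis.functions import read_query, write_table  # path per file list

# `backend` as constructed in the previous sketch.
write_table(backend, pd.DataFrame({"id": [1], "value": [2.0]}), "load_data")

t = ibis.table({"id": "int64", "value": "float64"}, name="load_data")
df = read_query(backend, t.filter(t.id == 1))  # read_raw_query handles SQL strings
```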
Reviewed changes
Copilot reviewed 53 out of 56 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/test_time_zone_localizer.py | Switches timezone localization tests from SQLAlchemy engine fixtures to IbisBackend fixtures. |
| tests/test_time_zone_converter.py | Switches timezone conversion tests to the ibis backend API. |
| tests/test_time_series_checker.py | Updates timestamp-checker tests to call check_timestamps(backend, table_name, schema). |
| tests/test_models.py | Updates dtype tests to use ibis datatypes (ibis.expr.datatypes). |
| tests/test_mapper_representative_time_to_datetime.py | Refactors representative-time mapper tests to use read_query/write_table. |
| tests/test_mapper_index_time_to_datetime.py | Refactors index-time mapper tests to use ibis backend querying. |
| tests/test_mapper_datetime_to_datetime.py | Updates datetime-to-datetime mapping tests and write-table duplicate-config behavior expectations. |
| tests/test_mapper_column_representative_to_datetime.py | Updates Store fixture + mapping assertions to use ibis querying instead of SQLAlchemy. |
| tests/test_csv_parser.py | Updates parser tests to construct Store(backend=...). |
| tests/test_checker_representative_time.py | Updates checker tests to use ibis backend and new check_timestamps signature. |
| tests/conftest.py | Replaces SQLAlchemy engine fixtures with ibis backend fixtures (iter_backends, make_backend). |
| src/chronify/utils/sqlalchemy_view.py | Removes SQLAlchemy view DDL utilities. |
| src/chronify/utils/sqlalchemy_table.py | Removes SQLAlchemy “CREATE TABLE AS SELECT” utilities and SQLite timestamp workaround. |
| src/chronify/time_zone_localizer.py | Refactors timezone localization to backend-driven querying + mapping. |
| src/chronify/time_zone_converter.py | Refactors timezone conversion to backend-driven querying + mapping. |
| src/chronify/time_series_mapper.py | Updates public mapping entrypoint to accept IbisBackend. |
| src/chronify/time_series_mapper_representative.py | Refactors mapper to fetch distinct time zones via ibis expressions. |
| src/chronify/time_series_mapper_index_time.py | Refactors index-time mapper to fetch distinct time zones via ibis expressions. |
| src/chronify/time_series_mapper_datetime.py | Refactors datetime mapper base wiring to use IbisBackend. |
| src/chronify/time_series_mapper_column_representative_to_datetime.py | Replaces SQLAlchemy-based intermediate table creation with ibis table creation + joins. |
| src/chronify/time_series_mapper_base.py | Reimplements mapping application as ibis join/aggregate + Parquet output support. |
| src/chronify/time_series_checker.py | Refactors timestamp checks to use backend SQL execution and ibis selects. |
| src/chronify/sqlalchemy/functions.py | Removes SQLAlchemy read/write optimization helpers. |
| src/chronify/schema_manager.py | Refactors schema persistence to a backend-managed table and DataFrame inserts. |
| src/chronify/models.py | Replaces SQLAlchemy dtype handling with ibis dtype handling and new conversion helpers. |
| src/chronify/ibis/base.py | Adds IbisBackend ABC and common helpers (raw SQL exec, “transaction” cleanup). |
| src/chronify/ibis/__init__.py | Exposes backends and the make_backend factory. |
| src/chronify/ibis/duckdb_backend.py | Implements DuckDB backend wrapper. |
| src/chronify/ibis/sqlite_backend.py | Implements SQLite backend wrapper (including insert + Parquet handling). |
| src/chronify/ibis/spark_backend.py | Implements Spark backend wrapper and Spark-specific datetime preparation. |
| src/chronify/ibis/functions.py | Adds backend-agnostic read/write utilities (table/query, parquet, datetime conversions). |
| src/chronify/ibis/types.py | Adds DuckDB/ibis type conversion and dataframe schema inference helpers. |
| src/chronify/hive_functions.py | Removes Hive materialized-view workaround. |
| src/chronify/csv_io.py | Updates CSV dtype mapping to use ibis/duckdb conversion. |
| src/chronify/__init__.py | Removes pyhive/TCLIService module injection. |
| src/chronify/_vendor/kyuubi/* | Removes vendored pyhive/kyuubi code. |
| pyproject.toml | Drops SQLAlchemy/pyhive deps and adds ibis + pyspark extra. |
- Fix create_view_from_parquet to return ObjectType so callers drop the correct object type (SQLite creates a table, not a view)
- Handle directory paths in DuckDB create_view_from_parquet for partitioned parquet datasets
- Add a unique index on the schemas table name column to prevent duplicates
- Escape single quotes in schema remove_schema to prevent SQL injection (see the sketch after this list)
- Add dispose() teardown to test fixtures to prevent resource leaks
- Use explicit column ordering in the DuckDB insert to prevent column mismatch when the DataFrame column order differs from the table
- Clean up the schema on failed ingestion rollback in all Store methods

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
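The quote-escaping fix amounts to SQL-standard doubling of embedded single quotes before splicing a value into SQL text; a minimal sketch (helper name invented, not chronify's code):

```python
def quote_sql_literal(value: str) -> str:
    """Escape a value for use as a SQL string literal by doubling any
    embedded single quote (illustrative helper)."""
    return "'" + value.replace("'", "''") + "'"

assert quote_sql_literal("O'Brien") == "'O''Brien'"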
Pull request overview
Copilot reviewed 53 out of 56 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (2)
src/chronify/time_series_mapper_column_representative_to_datetime.py:1
- This intermediate mapping table previously used replace semantics, but now if_exists="fail" will error if a prior run crashed (or a previous test left the table behind). That can make retries flaky. Consider switching to if_exists="replace" or explicitly dropping the mapping table before creating it so the workflow is robust to partial failures. A sketch of both options follows.
tests/conftest.py:1
- This fixture still uses the name iter_stores_by_engine even though it now iterates over backends. Renaming it (and any references) to something like iter_stores_by_backend would make the updated abstraction clearer and reduce confusion for future contributors.
Force-pushed from e453b76 to 177ca7e
Pull request overview
Copilot reviewed 55 out of 58 changed files in this pull request and generated 6 comments.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main      #66       +/-   ##
===========================================
+ Coverage   52.41%   89.39%   +36.98%
===========================================
  Files          58       37       -21
  Lines       12936     3066     -9870
===========================================
- Hits         6781     2741     -4040
+ Misses       6155      325     -5830

☔ View full report in Codecov by Sentry.
- Allow injecting an existing ibis connection into DuckDB/SQLite backends,
mirroring SparkBackend(session=...); external connections are not
disposed by the backend.
- Remove IbisBackend.reconnect; replace with backend-native backup():
DuckDB uses disconnect+copy+reconnect internally, SQLite uses the
sqlite3 online backup API (see the sketch after this list), Spark
raises InvalidOperation.
- Validate insert column sets on DuckDB/SQLite/Spark and raise
InvalidParameter on mismatch instead of silent reindex.
- DuckDB write_parquet uses COPY (FORMAT PARQUET) for the unpartitioned
path instead of materializing to pandas.
- SparkBackend.dispose always disconnects the ibis connection and stops
the session only when owned; factor temp-view handling into a helper.
- Collapse duplicated _write_to_{duckdb,sqlite,spark} into a single
_apply_if_exists helper.
- ObjectType is a StrEnum; make_backend raises InvalidParameter for
unknown names; schema_manager imports moved to top.
- Restore docstrings on convert/localize time zone methods and
read_raw_query.
- Relax pyspark pin to >=4.0,<5.
- CI: install package editable and pass --cov=chronify so Codecov can
map coverage paths back to repo sources.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
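For reference, the sqlite3 online backup API mentioned in the backup() bullet is part of the standard library (Python 3.7+); a minimal sketch with invented file names:

```python
import sqlite3

src = sqlite3.connect("store.db")
dst = sqlite3.connect("store_backup.db")
with dst:
    src.backup(dst)  # page-by-page online copy, safe while src is in use
src.close()
dst.close()
```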
Wrap the apply_mapping call in try/finally so the intermediate table created by _intermediate_mapping_ymdp_to_ymdh is cleaned up on exceptions, preventing stale tables that would block subsequent runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
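The shape of that fix, sketched generically (callables stand in for the chronify internals named above):

```python
def map_with_intermediate(backend, build_intermediate, apply_mapping):
    """Build the intermediate table, map, and always drop the table,
    even when apply_mapping raises, so a failed run cannot leave a
    stale table that blocks the next one."""
    table_name = build_intermediate()
    try:
        apply_mapping(table_name)
    finally:
        backend.drop_table(table_name)
```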
Pull request overview
Copilot reviewed 66 out of 70 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 66 out of 70 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (3)
src/chronify/schema_manager.py:1
- drop_table(self.SCHEMAS_TABLE) without an if_exists guard can fail precisely in the scenario described (Spark reports LOCATION_ALREADY_EXISTS even though the table is not registered). If drop_table errors when the table isn't present in the catalog, the retry logic won't recover. Use a safe drop (e.g., drop_table(..., if_exists=True) or execute_sql("DROP TABLE IF EXISTS ...")) before retrying.
src/chronify/time_series_checker.py:1
- Building SQL by concatenating unquoted identifier names will break for valid schemas that use reserved keywords or require quoting (e.g., columns named index, group, etc.). Previously SQLAlchemy would quote these identifiers automatically. Prefer expressing these checks with Ibis expressions (so the backend can handle quoting), or add backend-specific identifier quoting utilities and apply them consistently to id_cols, time_cols, and join predicates.
src/chronify/time_zone_localizer.py:1
- The public function docstring was reduced to a single sentence while the function still implements non-trivial constraints (e.g., “standard time zone only”, TIMESTAMP_NTZ expectations, output schema changes, and check_mapped_timestamps behavior). Please restore the key parameter/behavior documentation (even if abbreviated) so downstream users understand the API break and the constraints without reading class internals.
Reinstates the full Parameters / Raises / Examples docstrings that were reduced to one-liners during the SQLAlchemy→ibis migration. Drops references to the removed connection and scratch_dir parameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Convert null-consistency and time-array count checks in TimeSeriesChecker from f-string SQL to ibis filter/group_by/join expressions, and replace duckdb.sql() in csv_io with rel.project(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
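Illustrative unbound ibis expressions for those two kinds of checks (table and column names invented; executing them requires a backend):

```python
import ibis

t = ibis.table(
    {"id": "int64", "timestamp": "timestamp", "value": "float64"}, name="ts"
)

# Null consistency: flag rows where value and timestamp nullness disagree.
inconsistent = t.filter(t["value"].isnull() != t["timestamp"].isnull())

# Time-array counts: every id should contribute the same number of rows,
# so a consistent table yields exactly one distinct per-id count.
counts = t.group_by("id").agg(n=t.count())
distinct_counts = counts.select("n").distinct()
```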
Pull request overview
Copilot reviewed 69 out of 73 changed files in this pull request and generated 3 comments.
ibis-sqlite's create_table calls con.commit() internally, which silently terminated any outer BEGIN — DDL inside transaction() blocks survived rollback. Wrap the underlying sqlite3.Connection in a no-commit proxy for the duration of the transaction so DDL is now genuinely covered by rollback.

Drop the unused `created` cleanup list from transaction(); no caller populated it. transaction() is now a thin DB BEGIN/COMMIT/ROLLBACK wrapper, with semantics documented per backend.

Wrap apply_mapping and _intermediate_mapping_ymdp_to_ymdh in transaction() so DuckDB and SQLite get atomic rollback of the multi-step DDL for free. Spark falls back to manual cleanup. The output_file parquet (not transactional) is unlinked on failure.

Drop the redundant DuckDB and Spark write_parquet overrides — the base class already delegates to ibis's to_parquet, which uses server-side COPY/native partitioning.

Switch the quick_start doc example from pandas-style ibis indexing to filter()/select(), matching the codebase idiom.

Add backend-parametric regression tests for transactional DDL on both DuckDB and SQLite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
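A minimal sketch of the no-commit proxy idea described above (class name invented; the real implementation lives in the SQLite backend):

```python
import sqlite3
from typing import Any

class _NoCommitProxy:
    """Wraps a sqlite3.Connection and swallows commit() so DDL issued by
    third-party code inside an outer BEGIN stays in that transaction."""

    def __init__(self, con: sqlite3.Connection) -> None:
        self._con = con

    def commit(self) -> None:
        # Suppressed: the enclosing transaction() decides whether to
        # COMMIT or ROLLBACK on the real connection.
        pass

    def __getattr__(self, name: str) -> Any:
        return getattr(self._con, name)
```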
Reviewer-flagged: ``apply_mapping`` was deleting the existing ``output_path`` before renaming the staged output into place, which opened a window where a failure or process crash left the user with neither the old output nor the new one — contradicting the docstring's "prior on-disk state is restored" claim.

Replace the delete-then-rename sequence with a backup-rename-replace pattern: rename the existing target aside to a sibling ``.<name>.backup.<uid>`` path, rename staging into the target, then delete the backup. If the second rename fails, restore the backup so the original is preserved. Each ``Path.replace`` call is atomic, and the user's data is never simultaneously absent from both target and backup.

Add ``test_map_table_preserves_existing_parquet_on_promotion_failure``, which monkey-patches ``Path.replace`` to fail on the staging→target call and asserts the pre-existing parquet is preserved with no debris left over.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
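The promotion pattern, sketched as a standalone function (names invented; the real code is inside ``apply_mapping``):

```python
import uuid
from pathlib import Path

def promote(staging: Path, target: Path) -> None:
    """Backup-rename-replace: each Path.replace is atomic, so the data
    is never absent from both target and backup at the same time."""
    backup = target.with_name(f".{target.name}.backup.{uuid.uuid4().hex}")
    had_target = target.exists()
    if had_target:
        target.replace(backup)      # move the old output aside
    try:
        staging.replace(target)     # promote the staged output
    except Exception:
        if had_target:
            backup.replace(target)  # restore the original on failure
        raise
    if had_target:
        backup.unlink()             # drop the backup once promotion sticks
```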
Reviewer-flagged: ``delete_if_exists(backup_path)`` ran inside the same try block as the mapping work, so any failure cleaning up the backup caused ``apply_mapping`` to re-raise. ``Store.map_table_time_config`` then skipped its ``add_schema`` call and the user was left with the new parquet on disk but no schema registration — the store believed the mapping had failed.

Wrap the backup cleanup in try/except + warning. The promotion already succeeded by this point and the new output is observable; a leftover backup is cosmetic debris that the user can remove manually.

Add ``test_map_table_succeeds_when_backup_cleanup_fails``, which monkey-patches ``delete_if_exists`` to fail for backup paths only and asserts the mapping completes, the new content is in place, and the schema is registered — matching the reviewer's exact reproducer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three correctness improvements to transaction handling, motivated by a review of the branch:

- Cleanup paths (Store.ingest_*, apply_mapping, _intermediate_mapping_ymdp_to_ymdh) no longer mask the original error. A connectivity failure on Spark inside drop_table during rollback previously replaced the user-visible InvalidTable, leaving callers debugging the wrong thing. Each cleanup step is now individually guarded; failures are logged via logger.exception and swallowed so the bare raise propagates the cause (see the sketch after this list). Adds test_ingest_cleanup_failure_does_not_mask_original_error as a regression test.
- DuckDBBackend and SparkBackend now initialize self._in_transaction in __init__ to match SQLiteBackend. The class-level default on IbisBackend is dropped in favor of a docstring contract that subclasses must initialize the instance attribute, eliminating the attribute-shadowing trick.
- Docstrings on ingest_tables, ingest_from_csvs, and ingest_pivoted_tables now spell out that Spark cannot roll back partial appends to a pre-existing table; only the new-table case is fully cleaned up on failure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
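The guarded-cleanup shape from the first bullet, sketched generically:

```python
import logging

logger = logging.getLogger(__name__)

def run_cleanup_steps(steps) -> None:
    """Guard each cleanup step individually so a failing drop during
    rollback cannot replace the original, user-visible exception."""
    for step in steps:
        try:
            step()
        except Exception:
            logger.exception("Cleanup step failed; continuing rollback")

# Usage shape inside an ingest method (illustrative):
#   try:
#       ingest(...)
#   except Exception:
#       run_cleanup_steps([drop_new_table, remove_schema_entry])
#       raise  # bare raise propagates the original cause
```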
Removing the class-level default in 6af444c left mypy unable to infer the type of ``self._in_transaction`` when read from the base class's ``transaction()`` method. Add a bare ``_in_transaction: bool`` annotation on ``IbisBackend`` to declare the attribute's type without giving it a value (subclasses still must initialize it in ``__init__``), and drop the now-stale ``# type: ignore[assignment]`` on the SQLite proxy swap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
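The resulting contract, in miniature (a sketch, not the actual class bodies):

```python
from abc import ABC

class IbisBackend(ABC):
    # Bare annotation: declares the type for mypy without creating a
    # class-level default; subclasses must assign it in __init__.
    _in_transaction: bool

class DuckDBBackend(IbisBackend):
    def __init__(self) -> None:
        self._in_transaction = False
```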
ibis-pyspark's ``to_parquet`` forwards ``partition_by`` as a kwarg to
``df.write.format('parquet').save(path, **kwargs)``. PySpark's writer
expects camel-cased ``partitionBy``; the snake_case kwarg is silently
dropped as an unknown option, so partitioned writes fell out as a
single unpartitioned directory. (DuckDB's native ``to_parquet`` accepts
``partition_by`` directly, so this only broke on Spark.)
Override ``write_parquet`` on ``SparkBackend`` to call
``df.write.partitionBy(*cols).parquet(path)`` directly when partition
columns are given, falling back to the ibis path otherwise.
Fixes ``test_spark_write_parquet_partitioned``.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
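The Spark-side write path, sketched as a free function (the real code is a SparkBackend method; how it obtains the session is not shown here):

```python
from pathlib import Path

import pandas as pd
from pyspark.sql import SparkSession

def spark_write_parquet_partitioned(
    session: SparkSession, df: pd.DataFrame, path: Path, partition_columns: list[str]
) -> None:
    sdf = session.createDataFrame(df)
    # PySpark's writer wants camelCase partitionBy; the snake_case
    # partition_by kwarg forwarded by ibis is dropped as an unknown option.
    sdf.write.partitionBy(*partition_columns).parquet(str(path))
```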
Pull request overview
Copilot reviewed 69 out of 73 changed files in this pull request and generated 3 comments.
pyarrow 24 added a ``py.typed`` marker, so mypy now uses the bundled type info instead of treating pyarrow as ``Any``. The compute kernels (``assume_timezone``, etc.) are registered dynamically at import time and aren't declared in the stubs, so mypy reports them as missing attributes even though they exist at runtime. CI installs pyarrow 24 fresh and trips this; local mypy was clean only because the existing environment still had pyarrow 23.

Add a targeted ``# type: ignore[attr-defined]`` on the single call site with a comment explaining the cause.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
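The ignore in question looks like this (array contents invented; assume_timezone is a real pyarrow compute kernel):

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([0], type=pa.timestamp("us"))
# Registered dynamically at import time; absent from pyarrow 24's bundled
# stubs, so mypy needs a targeted ignore even though the call is valid.
localized = pc.assume_timezone(arr, timezone="UTC")  # type: ignore[attr-defined]
```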
- TimeZoneLocalizerByColumn._get_time_zones() previously rejected only
the *mixed* case of the literal 'None' string (the get_tzname(None)
sentinel) plus real time zones. The all-'None' case slipped through
and crashed with ZoneInfoNotFoundError on ZoneInfo('None'). Tighten
the check to reject 'None' whenever it appears, with a message
pointing the user to localize_time_zone(None) for tz-naive rows. Add
test_localize_time_zone_by_column_rejects_none_sentinel as a
regression test.
- SQLiteBackend.insert() now wraps con.cursor() in
contextlib.closing so the cursor is deterministically closed even on
executemany failure or non-CPython runtimes. (Cursor close is
independent of transaction state — commit still happens on the
connection afterwards via _commit_if_needed.) A sketch follows below.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
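The cursor-handling pattern, reduced to plain sqlite3 (table and data invented):

```python
import contextlib
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")

# closing() guarantees cursor teardown even if executemany raises, and on
# runtimes without refcounting; the commit still happens on the connection.
with contextlib.closing(con.cursor()) as cur:
    cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
con.commit()
```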
Pull request overview
Copilot reviewed 69 out of 73 changed files in this pull request and generated no new comments.