Skip to content

[HWORKS-2802] Remove client-side duplicate-record check on Delta inserts#963

Open
jimdowling wants to merge 2 commits into
logicalclocks:mainfrom
jimdowling:remove-delta-dup-check
Open

[HWORKS-2802] Remove client-side duplicate-record check on Delta inserts#963
jimdowling wants to merge 2 commits into
logicalclocks:mainfrom
jimdowling:remove-delta-dup-check

Conversation

@jimdowling
Copy link
Copy Markdown
Contributor

Summary

  • Drop the client-side _check_duplicate_records helper that ran on every Delta-offline insert in both the python and Spark engines. The check duplicated semantics the Delta writer doesn't enforce and added a full groupBy/aggregate pass per insert (extra Spark action; per-batch PyArrow groupBy on the python engine).
  • Remove both call sites in save_dataframe, the dependent FeatureStoreException.DUPLICATE_RECORD_ERROR_MESSAGE constant, and the matching test suites (call-counting tests, parametrized duplicate fixtures, polars-branch tests).
  • Drop the now-unused pyspark.sql.functions.count import from engine/spark.py.

Net effect: 1009 lines deleted across 5 files; user data flows through unchanged; the storage engine handles dedup where it applies.

Test plan

  • uv run --project python pytest python/tests/test_feature_group.py python/tests/test_feature_store.py python/tests/engine/test_python.py python/tests/engine/test_spark.py — all pass; deleted suites stop existing.
  • ruff check clean on touched files.
  • Integration: an existing loadtest Delta-offline insert workflow still passes on a dev cluster.

🤖 Generated with Claude Code

The python and Spark engines used to scan every Delta-offline insert
for duplicate (primary_key, event_time, partition_key) rows and raise
client-side before the write. The check added a full groupBy/aggregate
pass on every insert (extra Spark action; per-batch PyArrow groupBy
under the python engine) and duplicated semantics that the Delta
writer itself does not enforce. Drop the check entirely; let the
storage engine handle deduplication where appropriate, and let user
data flow through unchanged.

The removal is non-trivial across hsfs/engine/python.py and
hsfs/engine/spark.py: both _check_duplicate_records helpers, both
call sites in save_dataframe, the dependent
FeatureStoreException.DUPLICATE_RECORD_ERROR_MESSAGE constant, and
the matching test suites (call-counting tests, parametrized duplicate
fixtures, and the polars-branch tests) are all removed. The Spark
engine no longer references pyspark.sql.functions.count, so that
import is dropped too.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  python/hopsworks_common/client
  exceptions.py
  python/hsfs/engine
  python.py
  spark.py 579
Project Total  

This report was generated by python-coverage-comment-action

@jimdowling jimdowling requested a review from manu-sj May 21, 2026 12:02
@jimdowling jimdowling changed the title Remove client-side duplicate-record check on Delta inserts [HWORKS-2802] Remove client-side duplicate-record check on Delta inserts May 21, 2026
@jimdowling jimdowling requested a review from Copilot May 21, 2026 12:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the client-side duplicate-record detection that previously ran on every Delta offline insert, relying instead on the underlying Delta writer behavior and eliminating an extra full-pass aggregation/groupBy per insert.

Changes:

  • Removed _check_duplicate_records from both the Python and Spark engines and deleted the corresponding call sites in save_dataframe.
  • Removed the FeatureStoreException.DUPLICATE_RECORD_ERROR_MESSAGE constant and deleted the related test suites/fixtures asserting duplicate-detection behavior.
  • Dropped the now-unused pyspark.sql.functions.count import in the Spark engine.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
python/hsfs/engine/spark.py Removes Spark-side duplicate-record validation for Delta offline inserts and cleans up an unused import.
python/hsfs/engine/python.py Removes Python-engine duplicate-record validation before Delta offline inserts.
python/hopsworks_common/client/exceptions.py Removes the duplicate-record error message constant from FeatureStoreException.
python/tests/engine/test_spark.py Deletes tests that asserted Spark Delta inserts invoked/failed due to duplicate-record checks.
python/tests/engine/test_python.py Deletes tests covering Python-engine duplicate-record checks (including polars/pyarrow branches).

Comment thread python/hopsworks_common/client/exceptions.py
Comment thread python/tests/engine/test_python.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants