[HWORKS-2802] Remove client-side duplicate-record check on Delta inserts by jimdowling · Pull Request #963 · logicalclocks/hopsworks-api

jimdowling · 2026-05-21T11:24:51Z

Summary

Drop the client-side _check_duplicate_records helper that ran on every Delta-offline insert in both the python and Spark engines. The check duplicated semantics the Delta writer doesn't enforce and added a full groupBy/aggregate pass per insert (extra Spark action; per-batch PyArrow groupBy on the python engine).
Remove both call sites in save_dataframe, the dependent FeatureStoreException.DUPLICATE_RECORD_ERROR_MESSAGE constant, and the matching test suites (call-counting tests, parametrized duplicate fixtures, polars-branch tests).
Drop the now-unused pyspark.sql.functions.count import from engine/spark.py.

Net effect: 1009 lines deleted across 5 files; user data flows through unchanged; the storage engine handles dedup where it applies.

Test plan

uv run --project python pytest python/tests/test_feature_group.py python/tests/test_feature_store.py python/tests/engine/test_python.py python/tests/engine/test_spark.py — all pass; deleted suites stop existing.
ruff check clean on touched files.
Integration: an existing loadtest Delta-offline insert workflow still passes on a dev cluster.

🤖 Generated with Claude Code

The python and Spark engines used to scan every Delta-offline insert for duplicate (primary_key, event_time, partition_key) rows and raise client-side before the write. The check added a full groupBy/aggregate pass on every insert (extra Spark action; per-batch PyArrow groupBy under the python engine) and duplicated semantics that the Delta writer itself does not enforce. Drop the check entirely; let the storage engine handle deduplication where appropriate, and let user data flow through unchanged. The removal is non-trivial across hsfs/engine/python.py and hsfs/engine/spark.py: both _check_duplicate_records helpers, both call sites in save_dataframe, the dependent FeatureStoreException.DUPLICATE_RECORD_ERROR_MESSAGE constant, and the matching test suites (call-counting tests, parametrized duplicate fixtures, and the polars-branch tests) are all removed. The Spark engine no longer references pyspark.sql.functions.count, so that import is dropped too. Signed-off-by: Jim Dowling <jim@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-21T11:32:47Z

Coverage report

Click to see where and how coverage changed

File	Statements	Missing	Coverage	Coverage (new stmts)	Lines missing
python/hopsworks_common/client
exceptions.py
python/hsfs/engine
python.py
spark.py					579
Project Total

_{This report was generated by python-coverage-comment-action}

Copilot

Pull request overview

This PR removes the client-side duplicate-record detection that previously ran on every Delta offline insert, relying instead on the underlying Delta writer behavior and eliminating an extra full-pass aggregation/groupBy per insert.

Changes:

Removed _check_duplicate_records from both the Python and Spark engines and deleted the corresponding call sites in save_dataframe.
Removed the FeatureStoreException.DUPLICATE_RECORD_ERROR_MESSAGE constant and deleted the related test suites/fixtures asserting duplicate-detection behavior.
Dropped the now-unused pyspark.sql.functions.count import in the Spark engine.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
python/hsfs/engine/spark.py	Removes Spark-side duplicate-record validation for Delta offline inserts and cleans up an unused import.
python/hsfs/engine/python.py	Removes Python-engine duplicate-record validation before Delta offline inserts.
python/hopsworks_common/client/exceptions.py	Removes the duplicate-record error message constant from `FeatureStoreException`.
python/tests/engine/test_spark.py	Deletes tests that asserted Spark Delta inserts invoked/failed due to duplicate-record checks.
python/tests/engine/test_python.py	Deletes tests covering Python-engine duplicate-record checks (including polars/pyarrow branches).

jimdowling requested a review from manu-sj May 21, 2026 12:02

jimdowling changed the title ~~Remove client-side duplicate-record check on Delta inserts~~ [HWORKS-2802] Remove client-side duplicate-record check on Delta inserts May 21, 2026

jimdowling requested a review from Copilot May 21, 2026 12:03

Copilot started reviewing on behalf of jimdowling May 21, 2026 12:03 View session

Copilot AI reviewed May 21, 2026

View reviewed changes

Comment thread python/hopsworks_common/client/exceptions.py

Comment thread python/tests/engine/test_python.py

Merge branch 'main' into remove-delta-dup-check

995adb4

manu-sj approved these changes May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HWORKS-2802] Remove client-side duplicate-record check on Delta inserts#963

[HWORKS-2802] Remove client-side duplicate-record check on Delta inserts#963
jimdowling wants to merge 2 commits into
logicalclocks:mainfrom
jimdowling:remove-delta-dup-check

jimdowling commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jimdowling commented May 21, 2026

Summary

Test plan

Uh oh!

github-actions Bot commented May 21, 2026

Coverage report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants