[FSTORE-2030] Add support for specifying lookback windows for PIT queries#959
Draft
manu-sj wants to merge 9 commits into
Draft
[FSTORE-2030] Add support for specifying lookback windows for PIT queries#959manu-sj wants to merge 9 commits into
manu-sj wants to merge 9 commits into
Conversation
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 PIT (point-in-time) joins in the Feature Store currently emit `feature_fg.event_time <= root.event_time` as a row-correlated inequality, so engines must scan every historical partition of every joined feature group on every query. As feature groups grow with daily ingestion the cost grows unboundedly: a 31 GB / 64-week transactional feature group ends up scanning all 64 partitions per PIT batch read. The ticket adds an optional `lookback` parameter on `FeatureView.get_batch_data`, `create_training_data`, and the split variants. The user supplies a `key` (`partition_key` or `event_time`) and a `start_window` (required) plus optional `end_window`. The backend turns the window into a constant-bound predicate on each joined FG so flyingduck's directory walker and Spark catalyst can prune partitions before opening any files. The SDK exposes a `Lookback` dataclass in `hsfs.constructor.lookback` that accepts either a dict (the customer's documented shape) or the dataclass itself. Both `date` and `datetime` are accepted for the bounds; `date` values are normalised to UTC midnight before conversion to epoch milliseconds. The `Lookback` constructor validates locally: `key` must be in `{event_time, partition_key}`, `start_window` is required, and `start_window < end_window` strictly when `end_window` is provided. Two wire paths fan the uniform lookback across the feature view's join tree before the request is submitted. The stateless `get_batch_data` path calls `attach_to_joins(query.joins, lookback)` so the per-Join `LookbackDTO` rides on the standard `PUT /featurestores/{id}/query` SQL-construction call. The materialised `create_training_data*` path builds a `Dict[prefix + right_fg.name, LookbackDTO]` via `build_joins_lookback_map(...)` and sets it on `TrainingDatasetDTO.joinsLookback`, since the FV-backed training-data POST has no `QueryDTO` body. When `key="event_time"` is selected the engine logs an INFO that partition pruning becomes engine-dependent (Hudi/Delta column-stats only) while row-level correctness is always guaranteed. `Join.from_response_json` and `TrainingDataset.from_response_json` both deserialise the new fields with `None` defaults; older training datasets written before V72 read back unchanged. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 Tighten the SDK's public-API docstrings around the new `lookback` parameter. Document `lookback` in the parameter blocks of `get_batch_data`, `create_training_data`, `create_train_test_split`, and `create_train_validation_test_split`. The public-API contract requires every parameter to be documented; the original change added the signature on each method but left the docstrings without an entry. Round out the docstrings on `Lookback.to_dict`, `Lookback.from_dict`, `attach_to_joins`, and `build_joins_lookback_map` so each public surface in the new `lookback` module clears `docsig` and explains what each return value means. Also reorder imports in `lookback.py` to satisfy ruff's isort rule. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 Switch the lookback helper imports in `feature_view_engine` to a module import (`from hsfs.constructor import lookback as _lookback`) and qualify the call sites accordingly. Importing the three names with per-name aliases tripped ruff's isort I001 rule under the project's ruff configuration; the module-import form is idiomatic for this style and keeps the call sites short. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 Re-run `ruff format` on the lookback module and its tests so they conform to the project formatter. The CI's `ruff format --check` flagged the two files as unformatted after the original commits; the content is unchanged behaviorally. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 Extend the per-call lookback API so different joined feature groups can carry different windows on the same query. The v1 implementation fanned a single Lookback across every join; v3 introduces a LookbackPlan container with an optional uniform default plus a per-feature-group override map, while keeping the leaf Lookback class for the uniform 90% case. Field names move from `start_window` / `end_window` to `start` / `end` on both the SDK and the wire; the persistence columns keep their historical names via JPA @column mappings so no new migration ships. The SDK adds LookbackPlan as a second @public class alongside the leaf Lookback. The four FeatureView methods that accept `lookback` route the value through a `_coerce_lookback` helper that disambiguates the dict form on top-level keys: `{key, start, end}` constructs a Lookback, `{default, feature_groups}` constructs a LookbackPlan. Empty dicts, mixed-keys, legacy `start_window` / `end_window` aliases, and unknown keys raise ValueError with verbatim messages defined in the API spec so the docstring and the SDK stay in lockstep. Per-feature-group keys accept six forms (FG instance, (FG, prefix), 3-tuple, 3-tuple with None prefix, 2-tuple, bare string) and normalize to a (name, version, prefix) triple at attach time; ambiguous, unknown, and colliding keys raise with the canonical triples listed so the call can be corrected. The wire shape produced by `build_joins_lookback_list` is the List<JoinLookbackEntry> the backend now expects, deduplicated by triple. Tests assert the exact spec error strings so the API contract stays the single source of truth. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 Codex pre-review on the v3 diff flagged that the ambiguity ValueError in _resolve_feature_groups listed only the first canonical triple as the disambiguator hint. The API contract calls for every matching triple to appear so a user joining the same feature group under several prefixes can pick the right one without re-reading the feature view definition. Rewrite the message to repeat the full list of matching triples in the "Disambiguate by passing one of the full triples ..." clause so the hint covers every possible resolution. The existing SDK test that asserts the ambiguity branch matches on the substring "ambiguous", so no test surface changes. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 End-to-end cluster verification of LookbackPlan surfaced a wire-ordering gap: an ambiguous bare-name `feature_groups` key (an FV that joins the same feature group twice under different prefixes) bypassed the SDK's `_resolve_feature_groups` check and reached the engine, which then raised a cryptic DuckDB "Ambiguous reference to table" error instead of the actionable SDK ValueError that names the canonical triples. `attach_to_joins` runs against the backend-reconstructed query returned from `/featureviews/.../batchquery`, whose joins no longer carry the original feature-group identity that `_resolve_feature_groups` matches against. Pre-validate the lookback against `feature_view_obj.query.joins` (the local query the user built, with feature-group instances intact) BEFORE the HTTP roundtrip via a new `validate_against_joins` helper that runs the resolver without mutating joins. After the roundtrip the existing `attach_to_joins` call continues to fan the resolved entries onto the reconstructed joins for the wire payload. `create_training_dataset` is unaffected — it already builds the wire list from `feature_view_obj.query.joins` directly, so the validation already runs against the local query. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 Cluster verification verified the PIT join returns correct rows under the v3 lookback contract but did not prove partition pruning was happening: for the seeded data, the PIT join returns the same scores whether the lookback's lower bound filters out historical partitions or not, because no out-of-window row would ever be the latest match for an in-window root row. Without inspecting the generated SQL or the engine's executed plan, the integration tests cannot distinguish a lookback that fires from one that is silently dropped. Extend `FeatureView.get_batch_query` (the SQL-string-returning method) to accept an optional `lookback` parameter so users (and the loadtest suite) can inspect the predicate the backend emits. The engine path runs the same pre-validate against the FV's local join tree as `get_batch_query` does, then attaches the resolved entries to the backend-reconstructed Query before `construct_query` runs the SQL emitter. The downstream loadtest commit adds a `test_partition_key_emits_pruning_predicate` test that asserts the returned SQL contains the lookback lower-bound literal for both `partition_key` and `event_time` modes. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries https://hopsworks.atlassian.net/browse/FSTORE-2030 Customer workload exposed that v3 lookback applied only to joined feature groups, never the root. For a feature view that joins three dimension FGs onto a single root, the SQL emitted partition predicates on each joined FG but left the root unbounded — so the customer's largest table, the one driving query volume, got no I/O reduction. Every pre-existing loadtest and SDK test passed because correctness was unchanged (PIT semantics make out-of-window dim rows irrelevant whether pruned or row-filtered) and the SQL/plan regex assertions matched the first occurrence of the bound anywhere in the SQL. Extend the SDK's lookback resolution to cover the root: `_resolve_feature_groups`, `validate_against_joins`, `attach_to_joins`, and `build_joins_lookback_list` all take an optional root_fg / root_query parameter. The root's `(name, version, prefix=None)` triple participates in key matching like any joined FG: bare-name keys, full triples, and the LookbackPlan default all resolve onto it, raising the same ambiguity / unknown-key errors with the canonical triples listed. `attach_to_joins` stores the resolved root lookback on `Query._lookback`; `build_joins_lookback_list` prepends a wire entry for the root so the training-dataset write path ships it alongside the joined entries. `feature_view.py`'s read and training-data call sites pass `feature_view_obj.query._left_feature_group` into the helpers so every path participates. Wire-format extension on Query: a top-level `lookback: LookbackDTO` field for the construct_query read path, serialized when `Query._lookback` is set. The joinsLookback list shape on the training- dataset DTO is unchanged — the root just gets an entry there. A regression test (`test_uniform_lookback_must_include_root_fg`) asserts the wire payload now contains four entries for the customer's 4-FG view shape; this test was written to fail on the previous implementation and now passes. Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lookbackdataclass underhsfs.constructor.lookbackand alookbackparameter onFeatureView.get_batch_data,create_training_data, and the split variants. The window applies uniformly to every joined feature group in the PIT join.partition_key(constant-bound predicate on each FG's partition column — flyingduck and Spark prune partitions before reading files),event_time(predicate on the FG'sevent_timecolumn — row-level correctness; partition pruning is engine-dependent).lookbackKey,startWindow,endWindow) with bounds as epoch ms; the SDK fans the uniform lookback out across the query tree's joins and also emits ajoinsLookbackmap keyed byprefix + fg.nameon the training-dataset POST.JIRA
FSTORE-2030
Test plan
Lookback(validation matrix, dict/dataclass round-trip, date+datetime coercion, nested-map serialization).feature_viewandtraining_datasettests pass.ruff+docsigclean on the new module and the modified public methods.manu-testvia theloadtestintegration tests (companion PR).Companion PRs
logicalclocks/hopsworks-ee→ branchFSTORE-2030logicalclocks/loadtest→ branchFSTORE-2030logicalclocks/logicalclocks.github.io→ branchFSTORE-2030