[onboarding] general fixes for user onboarding via wizard by MagicLex · Pull Request #945 · logicalclocks/hopsworks-api

MagicLex · 2026-05-11T11:50:53Z

Summary

hops fg list and hops fv list now span every feature store visible to the project (own + shared), with a new PROJECT column to show provenance.
Adds Project.get_feature_stores() / Connection.get_feature_stores() backed by a new FeatureStoreApi.get_all(), hitting the same /project/{id}/featurestores endpoint the UI uses.
Pass --current-only on either command to restrict the listing to the active project.
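A minimal sketch of the listing logic described above. It assumes the visible stores have already been fetched (the PR does this through Project.get_feature_stores()) and that each store exposes a project name and a get_feature_groups() method. StubFG, StubStore and list_rows are hypothetical stand-ins, not the actual SDK or CLI code.

```python
from dataclasses import dataclass, field


@dataclass
class StubFG:
    # Hypothetical stand-in for a feature group handle.
    name: str
    version: int


@dataclass
class StubStore:
    # Hypothetical stand-in for a feature store; project_name records provenance.
    project_name: str
    groups: list = field(default_factory=list)

    def get_feature_groups(self):
        return list(self.groups)


def list_rows(stores, current_project, current_only=False):
    """Return one row per feature group across every visible store.

    current_only=True keeps only the active project's own store,
    mirroring the --current-only flag.
    """
    rows = []
    for store in stores:
        if current_only and store.project_name != current_project:
            continue
        for fg in store.get_feature_groups():
            rows.append(
                {"NAME": fg.name, "VERSION": fg.version, "PROJECT": store.project_name}
            )
    return rows


stores = [
    StubStore("demo"),
    StubStore("hopsworks_default", [StubFG("transactions", 1)]),
]
print(list_rows(stores, "demo"))
print(list_rows(stores, "demo", current_only=True))
```

Carrying PROJECT on every row, rather than grouping rows per store, keeps the --json output a flat array in which each object includes the PROJECT key.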

Why

Onboarding users who land in a fresh project with only shared FGs (e.g. hopsworks_default) saw hops fg list return empty and assumed the CLI was broken. The old code only looked at fs.get_feature_groups() for the project's own feature store.

Also in this PR

A caveats/ doc explaining that hopsworks-apigen shim modules are generated at build time and gitignored, so running pytest against an unbuilt source tree fails with confusing import errors. Saved a few hours of head-scratching in this branch.

Test plan

uv tool install from the patched source, then hops fg list shows the 8 shared FGs from hopsworks_default with the project column populated.
hops fg list --current-only returns only the active project's FGs (empty in our test project).
hops fg list --json includes the PROJECT key.
hops fv list follows the same shape (verified; there are no shared FVs in our test backend, so the list is empty, but the table headers and iteration are correct).
New tests: test_fg_list_spans_shared_stores, test_fg_list_current_only_skips_shared, test_fv_list_spans_shared_stores pass; existing list tests updated for the new column.
ruff check and ruff format clean on touched files.
docsig clean on the new methods (only pre-existing failures remain).

🤖 Generated with Claude Code

The CLI's hops fg list and hops fv list called fs.get_feature_groups() on the active project's own feature store only, so groups shared from other projects were invisible. Onboarding users landing in a fresh project with only shared data saw an empty list and assumed the CLI was broken. Add Project.get_feature_stores() (and Connection.get_feature_stores()) backed by a new FeatureStoreApi.get_all() that hits /project/{id}/featurestores, the same endpoint the UI's feature-store picker uses. The CLI list commands now iterate every visible store and add a PROJECT column so the source of each row is obvious. Pass --current-only on either command to restrict the listing to the active project's store. Also add a caveat doc explaining that hopsworks-apigen shim modules are generated at build time and gitignored, so pytest run against an unbuilt source tree fails with confusing import errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-11T11:58:21Z

Coverage report


File	Lines missing
python/hopsworks/cli/auth.py	91-101
python/hopsworks/cli/output.py	106
python/hopsworks/cli/session.py	100-102
python/hopsworks/cli/commands/fg.py	440-451, 507-515, 603-624, 637-651, 887-896
python/hopsworks/cli/commands/fv.py	80, 104, 132-134, 253-254, 353-355, 402, 471-493
python/hopsworks/cli/commands/job.py	324-329, 342-343, 346, 351-352, 360
python/hopsworks/cli/commands/td.py	37
python/hopsworks/cli/commands/transformation.py	101-102, 117-118, 126
python/hopsworks_common/connection.py	179-180, 195
python/hopsworks_common/project.py	187, 214
python/hopsworks_common/core/job_api.py	63-67, 119-120, 179-201
python/hsfs/core/arrow_flight_client.py	71-75
python/hsfs/core/feature_store_api.py	34, 48-51
python/hsml/engine/model_engine.py	

This report was generated by python-coverage-comment-action.

hops fg info / features / preview / insert / derive / stats / search / keywords and hops fv info / read / delete / get-feature-vector all called fs.get_feature_group / fs.get_feature_view on the active project's own feature store. The SDK returns None when the entity is missing from that store, so requests for shared groups silently rendered "?" everywhere. Joins in hops fg derive and hops fv create could not source shared base or joined feature groups either. Centralise the lookup: session.get_feature_stores(ctx) caches the visible-stores list per invocation, and shared _get_fg / _get_fv helpers walk those stores, return the first match, and raise a clear "not found in any visible feature store. Run hops fg list / hops fv list" when nothing matches. The fix flows through every call site that resolves an entity by name without retouching the individual commands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
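A minimal sketch of the centralised lookup described above: the visible-stores list is fetched once per invocation and cached, and the helper returns the first store that has the entity or raises a clear error. get_feature_stores, get_fg, StubStore and NotFoundInVisibleStores are hypothetical names for illustration; the cache is a plain dict standing in for per-invocation CLI context, and the store getter is assumed to return None for a missing entity, as the commit message describes for the SDK.

```python
class NotFoundInVisibleStores(LookupError):
    """Raised when no visible feature store holds the requested entity."""


class StubStore:
    # Hypothetical store mapping feature group names to handles.
    def __init__(self, project_name, groups):
        self.project_name = project_name
        self._groups = groups

    def get_feature_group(self, name, version=None):
        # Mimics the SDK behaviour described above: None when missing.
        return self._groups.get(name)


def get_feature_stores(cache, fetch):
    # Fetch the visible-stores list once per invocation and reuse it afterwards.
    if "feature_stores" not in cache:
        cache["feature_stores"] = fetch()
    return cache["feature_stores"]


def get_fg(stores, name, version=None):
    # Walk the stores in order and return the first match.
    for store in stores:
        fg = store.get_feature_group(name, version=version)
        if fg is not None:
            return fg
    raise NotFoundInVisibleStores(
        f"Feature group '{name}' not found in any visible feature store. "
        "Run hops fg list to see what is available."
    )
```

Because every command resolves entities through the same helper, the multi-store behaviour and the error message only need to be right in one place.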

A new feature pipeline today needs four CLI moves to leave the IDE: hops files upload, hops job create, hops job schedule, hops job run. The first three are pure plumbing the SDK can do itself; the onboarding wizard should be able to ship a script with one command. Add JobApi.deploy(local_path, name, type=None, environment_name=None, args=None, remote_dir=None, overwrite=True) -> Job on the SDK side. It composes dataset_api.upload (script lands in /Resources/jobs/<name>/ by default) and the existing PUT-backed create_job, and infers the job type from the extension (.py = PYTHON, .jar = SPARK). Re-deploying the same name overwrites the script and updates the job definition in place, so the call is idempotent. Add hops job deploy LOCAL_FILE --name NAME on the CLI side. It calls JobApi.deploy and then chains job.schedule when --cron is set and job.run when --run is set, so the full upload + register + schedule + launch chain fits on a single line. Live-verified on the onboarding test project: deploy creates the job, re-deploy is idempotent, --run --wait runs to completion and prints the heartbeat in the captured stdout, and --cron attaches a schedule visible via hops job schedule-info. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
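A sketch of the two defaults JobApi.deploy applies before uploading, as described above: the job type inferred from the file extension (.py is PYTHON, .jar is SPARK) unless type is passed explicitly, and the default upload directory. infer_job_type and default_remote_dir are hypothetical helpers written for illustration, not functions in the SDK, and raising ValueError for unknown extensions is an assumption.

```python
from pathlib import PurePosixPath

# Extension-to-type mapping as described in the commit message.
_JOB_TYPES = {".py": "PYTHON", ".jar": "SPARK"}


def infer_job_type(local_path, explicit_type=None):
    """Return the job type, preferring an explicit value over the extension."""
    if explicit_type is not None:
        return explicit_type
    suffix = PurePosixPath(local_path).suffix
    if suffix not in _JOB_TYPES:
        # Assumed behaviour: refuse to guess for unknown extensions.
        raise ValueError(
            f"Cannot infer job type from '{local_path}'; pass type= explicitly"
        )
    return _JOB_TYPES[suffix]


def default_remote_dir(name):
    # Default location for the uploaded script: /Resources/jobs/<name>/
    return f"/Resources/jobs/{name}/"
```

Since the remote directory depends only on the job name, re-deploying under the same name targets the same path, which is what lets an overwriting upload plus a PUT-backed create act idempotently.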

…ice disable

Two bugs found while smoke-testing the wizard's pipeline step end to end:

1. hops <command> --json mixed SDK chatter with the JSON payload. The SDK calls logging.basicConfig(stream=sys.stdout) at import time and the login banner is a bare print, so a fresh hops fg list --json printed seven log lines before the actual JSON array, which broke every downstream json.loads. set_json_mode now snapshots sys.stdout at call time, swaps it for sys.stderr so subsequent prints and root-handler logs land on stderr, raises the hopsworks/hopsworks_common/hsfs/hsml loggers to WARNING, and routes print_json to the captured stdout. The snapshot is re-taken on each call so Click's per-invocation CliRunner buffer in tests is honoured.

2. arrow_flight_client._disable_feature_query_service_client() crashed with AttributeError on a brand-new session because the guard inverted the None check: when _arrow_flight_instance was None, the code tried to call .ArrowFlightClient(...) on None instead of constructing one. The fix is one line: assign the new ArrowFlightClient(disabled_for_session=True) to the module global.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
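A minimal sketch of the stdout discipline described in item 1, assuming a module-level slot for the captured stream. set_json_mode and print_json mirror the names in the commit message, but their bodies here are illustrative, not the CLI's code. Note that reassigning sys.stdout only redirects code that looks up sys.stdout at write time (such as print); a logging handler that already holds the original stream is quietened here by raising the SDK loggers to WARNING instead.

```python
import json
import logging
import sys

# Logger names taken from the commit message.
_SDK_LOGGERS = ("hopsworks", "hopsworks_common", "hsfs", "hsml")
_json_stdout = None


def set_json_mode():
    """Reserve the current stdout for JSON and send other prints to stderr."""
    global _json_stdout
    # Snapshot at call time so a per-invocation test buffer is honoured.
    _json_stdout = sys.stdout
    sys.stdout = sys.stderr
    # Suppress INFO-level SDK chatter regardless of which stream its handler uses.
    for name in _SDK_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)


def print_json(payload):
    # Write only the JSON document to the captured stdout.
    out = _json_stdout if _json_stdout is not None else sys.stdout
    out.write(json.dumps(payload) + "\n")
    out.flush()
```

A real CLI would also restore sys.stdout at the end of the invocation; this sketch leaves that out.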

… live pipeline test A pipeline that reads from a HUDI feature group via the Python engine fails with FlightServerError: Catalog Error: Table ... does not exist on this cluster, even though fg.commit_details() reports the right commit and the offline materialization job ran to SUCCEEDED. The Hive fallback is gone in 4.x so there is no second path. The caveat walks through the SDK payload (which is well-formed), the per-query DuckDB registration in FlyingDuck's query_engine.py, and the three likely root causes on the cluster side (HopsFS read permission for the FlyingDuck pod, stale warehouse mount, race between insert ack and materialization visibility). Includes the probes that confirmed the SDK side is clean so the next dev loop goes straight to the FlyingDuck pod logs. Captures the two SDK / CLI fixes that came out of the same investigation so the caveat is self-contained. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bug is fixed end-to-end now that ``logicalclocks/hopsworks-ee#2996`` (Java emits the two-part FROM identifier) and ``logicalclocks/flyingduck#196`` (registration uses a real DuckDB schema) are both in. Verified live on lexterm2: ``hops fg preview`` resolves shared HUDI feature groups and the project's own materialised FG without raising the prior catalog error. Per the project's caveat convention ("known gotchas; add new ones to this folder"), once the gotcha disappears the file should follow. Keeping a stale caveat around teaches the wrong shape — the PRs and commit history carry the diagnostic walkthrough for anyone who needs it later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@udf

Move the working doc and the live-tested sample scripts into the repo so a future agent can pick the loop back up from PR state alone (the pod the loop ran in is ephemeral; anything in /tmp or the user homedir disappears when it cycles).

Doc lands at .claude/docs/onboarding-flow.md and now opens with a "Handoff state" section pointing at the three cross-repo PRs, the lexterm2 kubeconfig at /hopsfs/Resources/kubeconfig-lexterm2 (which is project-scoped and survives pod restarts), and the cluster-side state of the live demo project. Samples under .claude/docs/samples/ carry the heartbeat job, the BTC feature pipeline, and the vectorised retrieval-time transformations.

CLI fixes from the Step 4 / Step 5 loop, committed together so the sample scripts and the doc references all match HEAD:

- hops fg stats no longer dumps the raw statistics JSON to stdout in human mode. It renders FEATURE / TYPE / COUNT / MIN / MAX / MEAN / STDDEV / COMPLETENESS as a table and formats wide numerics with thousands separators and four significant digits.
- hops transformation create iterates every @udf-decorated function in the source file instead of refusing files with more than one, and emits a per-function "Created transformation NAME vV" line. This mirrors the SDK shape (one create_transformation_function call per UDF) without forcing the caller to split the file.
- hops transformation create now defaults to --version 1, working around the backend HTTP 500 / NPE when version is null. The SDK side still passes null straight through, and the backend should default to 1 (or the next free version); tracked in the gaps list.
- hops fv create --transform now accepts the fn[version]:col shape so an FV can pin a specific transform version. The SDK's fs.get_transformation_function(name=) defaults to v1, so if a fix re-registers as v2, FVs created without an explicit version stay bound to the broken v1. Pinning closes the gap.
- Tests updated: validates "register every UDF in source" and the "reject source with no @udf" path; the old "reject multiple" parametrize case is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
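A sketch of how a --transform value in the fn[version]:col shape could be split, assuming function names are Python identifiers and the version is a plain integer. The regex and parse_transform_spec are hypothetical, not the CLI's parser. A None version leaves the choice to the SDK, which per the note above resolves to v1.

```python
import re

# Accepts "fn:col" and "fn[version]:col"; the exact grammar is an assumption.
_TRANSFORM_SPEC = re.compile(
    r"^(?P<fn>[A-Za-z_]\w*)(?:\[(?P<version>\d+)\])?:(?P<col>\S+)$"
)


def parse_transform_spec(spec):
    """Split a --transform value into (function name, version or None, column)."""
    match = _TRANSFORM_SPEC.match(spec)
    if match is None:
        raise ValueError(f"Expected fn:col or fn[version]:col, got '{spec}'")
    version = match.group("version")
    return (
        match.group("fn"),
        int(version) if version is not None else None,
        match.group("col"),
    )


print(parse_transform_spec("scale_price[2]:close"))
print(parse_transform_spec("scale_price:close"))
```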

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…opsfs path is not a dataset

Two gaps hit while running the autoresearch flow on lexterm2:

cli: HOPSWORKS_ENGINE was required in interactive Hopsworks pods. hopsworks_common.connection picks the spark engine whenever pyspark is importable and we're in-pod, but the typical terminal/notebook pod has pyspark installed transitively without a usable Spark master (Spark Connect, no SparkSession.builder.getOrCreate() fallback), so login crashes on getOrCreate. The CLI is single-process and never needs an in-process spark engine: heavy operations dispatch jobs server-side. hopsworks.cli.auth.login now defaults to engine=python when neither the caller nor HOPSWORKS_ENGINE picks one. Explicit args and the env var both still win.

hsml: model.save() of a path under /hopsfs/ that is not actually a project dataset path errored with "Path not found". Per-pod mount layouts vary, FUSE writes may not have synced, and user-home folders under /hopsfs/Users/<user>/ are not project datasets. _normalize_hopsfs_mount_path is a pure string strip, so we now also verify the normalized path with dataset_api.path_exists; only if it exists do we move it, otherwise we fall back to the regular local upload path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
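A sketch of the engine precedence described in the cli paragraph above: explicit argument first, then HOPSWORKS_ENGINE, then python. resolve_engine is a hypothetical helper; treating an empty environment value as unset is an assumption, and the environ parameter exists only to make the function easy to test.

```python
import os


def resolve_engine(engine=None, environ=None):
    """Pick the engine for a CLI login.

    Precedence: explicit argument, then HOPSWORKS_ENGINE, then "python".
    An empty HOPSWORKS_ENGINE is treated as unset (an assumption).
    """
    if engine:
        return engine
    env = os.environ if environ is None else environ
    return env.get("HOPSWORKS_ENGINE") or "python"
```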

…schema is timestamp/date

pd.read_json and json.loads both leave ISO 8601 timestamps as strings (object-dtype columns once in a DataFrame), and fg.insert(df) then rejects them with a "wrong type" error even though the value is well-formed. Inspect the FG schema (preferring the new fg.columns, falling back to the deprecated fg.features) and coerce any timestamp/date column in place with pd.to_datetime, so hops fg insert --file row.json on a typical event-time FG just works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
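A sketch of the coercion step described above. The schema is simplified to a list of (column name, type string) pairs standing in for what fg.columns or fg.features would provide; coerce_time_columns is a hypothetical name, and matching on the type strings "timestamp" and "date" is an assumption about how the schema labels those types.

```python
import pandas as pd

_TIME_TYPES = {"timestamp", "date"}


def coerce_time_columns(df, schema):
    """Convert columns the schema marks as timestamp/date with pd.to_datetime.

    schema is a list of (column_name, type_string) pairs; columns absent
    from df are skipped. df is modified in place and also returned.
    """
    for name, type_ in schema:
        if type_.lower() in _TIME_TYPES and name in df.columns:
            df[name] = pd.to_datetime(df[name])
    return df


rows = pd.DataFrame({"event_time": ["2026-05-11T11:50:53Z"], "price": [1.5]})
coerce_time_columns(rows, [("event_time", "timestamp"), ("price", "double")])
print(rows.dtypes)
```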

…loop)

Two orchestration briefs the in-app Wizard pastes into Claude Code in the Terminal. Each is a short conversational program: a few inputs from the user, then CLI-first execution. They wrap the canonical 10-step onboarding flow into focused, agent-loadable specs.

- docs/wizard/time-series.md: raw FGs through a feature pipeline, feature view, training pipeline, registered model, deployment, and Streamlit app. Built segment by segment with hops context after each so the user sees what just appeared. Hard rule against heredoc-python orchestration; everything goes through the hops CLI.
- docs/wizard/research.md: autonomous research loop on a feature view, logged to autoresearch_experiments_<tag> and the model registry. Mirrors hopsworks-autoresearch/program.md's contract, with CLI commands inlined.

The briefs explicitly call out conversation rules (one question per turn, summarise after each segment, integrate user pushback) and anti-patterns (heredoc python, pip-install inline, skip-the-schedule) that came out of testing the flow end-to-end on lexterm2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

15 numbered issues captured live while running the time-series wizard brief end-to-end on lexterm2 (raw FGs to deployed Streamlit). Each entry is tagged by severity (blocker / time-waster / papercut), reports the symptom verbatim, names the root cause where known, and proposes a fix. The two highest-leverage ones are logicalclocks#10 (model files not copied to Deployments/<name>/<v>/, which crashloops the predictor) and logicalclocks#13 (batch get_feature_vectors silently returns 0 rows on ISO-date strings while the singular form accepts them). The list lives next to the briefs that produced it so the next agent loading the wizard context sees the known traps before walking into them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four more items hit while wiring the Streamlit consumer onto the deployed model:

- logicalclocks#16 hops app create --path rejects absolute /hopsfs/ paths and produces a doubled // HDFS URL; opposite semantics from hops job deploy.
- logicalclocks#17 hops app start has --no-wait only (start blocks by default); hops deployment start has --wait. Naming-symmetry break.
- logicalclocks#19 deployment.predict() does a bare json.dumps() on inputs, so any datetime.date payload raises TypeError inside hsml internals. Combined with logicalclocks#13, the date type round-trip is a cliff: the SDK forces a string, and get_feature_vectors silently drops it.
- logicalclocks#18 python-app-pipeline env lacks plotly. The crash surfaces only after upload + start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
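A small illustration of item #19, using only the standard library: a bare json.dumps() raises TypeError on a datetime.date, while passing a default= hook that emits ISO 8601 strings serialises the same payload. to_jsonable is a hypothetical caller-side helper, not part of hsml, and the payload shape is illustrative.

```python
import datetime
import json


def to_jsonable(value):
    # datetime.datetime subclasses datetime.date, so both are covered here.
    if isinstance(value, datetime.date):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


payload = {"instances": [["BTC", datetime.date(2026, 5, 11)]]}

try:
    json.dumps(payload)  # a bare dumps, as in the issue
except TypeError as exc:
    print(f"bare dumps fails: {exc}")

print(json.dumps(payload, default=to_jsonable))
# prints {"instances": [["BTC", "2026-05-11"]]}
```

Converting to a string only moves the problem, given #13: the ISO string is exactly the form the batch get_feature_vectors call is reported to drop.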

MagicLex changed the title ~~[onboarding] hops fg/fv list spans shared feature stores~~ [onboarding] general fixes for user onboarding via wizard May 11, 2026

Admin Admin and others added 9 commits May 11, 2026 12:19

[onboarding] fix ruff D417 + unused pytest import

888c38a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MagicLex added the wip label May 12, 2026

MagicLex and others added 2 commits May 13, 2026 17:43

MagicLex wants to merge 13 commits into logicalclocks:main from MagicLex:feat/onboarding-flow-fixes
