
[onboarding] general fixes for user onboarding via wizard #945

Open
MagicLex wants to merge 13 commits into logicalclocks:main from MagicLex:feat/onboarding-flow-fixes

Conversation

@MagicLex

Summary

  • hops fg list and hops fv list now span every feature store visible to the project (own + shared), with a new PROJECT column to show provenance.
  • Adds Project.get_feature_stores() / Connection.get_feature_stores() backed by a new FeatureStoreApi.get_all(), hitting the same /project/{id}/featurestores endpoint the UI uses.
  • Pass --current-only on either command to restrict the listing to the active project.

Why

Onboarding users who land in a fresh project with only shared FGs (e.g. hopsworks_default) saw hops fg list return empty and assumed the CLI was broken. The old code only looked at fs.get_feature_groups() for the project's own feature store.

Also in this PR

A caveats/ doc explaining that hopsworks-apigen shim modules are generated at build time and gitignored, so running pytest against an unbuilt source tree fails with confusing import errors. Saved a few hours of head-scratching in this branch.

Test plan

  • uv tool install from the patched source, then hops fg list shows the 8 shared FGs from hopsworks_default with the project column populated.
  • hops fg list --current-only returns only the active project's FGs (empty in our test project).
  • hops fg list --json includes the PROJECT key.
  • hops fv list follows the same shape (verified, no shared FVs in our test backend so list is empty, but the table headers and iteration are correct).
  • New tests: test_fg_list_spans_shared_stores, test_fg_list_current_only_skips_shared, test_fv_list_spans_shared_stores pass; existing list tests updated for the new column.
  • ruff check and ruff format clean on touched files.
  • docsig clean on the new methods (only pre-existing failures remain).

🤖 Generated with Claude Code

The CLI's hops fg list and hops fv list called fs.get_feature_groups()
on the active project's own feature store only, so groups shared from
other projects were invisible. Onboarding users landing in a fresh
project with only shared data saw an empty list and assumed the CLI
was broken.

Add Project.get_feature_stores() (and Connection.get_feature_stores())
backed by a new FeatureStoreApi.get_all() that hits
/project/{id}/featurestores, the same endpoint the UI's feature-store
picker uses. The CLI list commands now iterate every visible store
and add a PROJECT column so the source of each row is obvious.
Pass --current-only on either command to restrict the listing to the
active project's store.
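
The listing behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `StubFeatureStore`, `list_feature_group_rows`, and the sample feature group names are hypothetical stand-ins, and the real commands obtain their stores from the SDK rather than from a hand-built list.

```python
from dataclasses import dataclass, field


@dataclass
class StubFeatureStore:
    """Hypothetical stand-in for a feature store handle (not the SDK class)."""

    project_name: str
    feature_groups: list = field(default_factory=list)  # (name, version) pairs


def list_feature_group_rows(stores, current_project, current_only=False):
    """Build one row per feature group, tagged with the project it comes from."""
    rows = []
    for store in stores:
        # --current-only: skip every store that is not the active project's own.
        if current_only and store.project_name != current_project:
            continue
        for name, version in store.feature_groups:
            rows.append(
                {"PROJECT": store.project_name, "NAME": name, "VERSION": version}
            )
    return rows


# A fresh project with no feature groups of its own, plus one shared store.
stores = [
    StubFeatureStore("my_project"),
    StubFeatureStore("hopsworks_default", [("transactions", 1), ("profiles", 2)]),
]
all_rows = list_feature_group_rows(stores, current_project="my_project")
own_rows = list_feature_group_rows(
    stores, current_project="my_project", current_only=True
)
```

With these inputs `all_rows` carries the two shared groups and `own_rows` is empty, which mirrors the situation that made onboarding users think the CLI was broken.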

Also add a caveat doc explaining that hopsworks-apigen shim modules
are generated at build time and gitignored, so pytest run against an
unbuilt source tree fails with confusing import errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 11, 2026

Coverage report


File                                  Lines missing
python/hopsworks/cli
  auth.py                             91-101
  output.py                           106
  session.py                          100-102
python/hopsworks/cli/commands
  fg.py                               440-451, 507-515, 603-624, 637-651, 887-896
  fv.py                               80, 104, 132-134, 253-254, 353-355, 402, 471-493
  job.py                              324-329, 342-343, 346, 351-352, 360
  td.py                               37
  transformation.py                   101-102, 117-118, 126
python/hopsworks_common
  connection.py                       179-180, 195
  project.py                          187, 214
python/hopsworks_common/core
  job_api.py                          63-67, 119-120, 179-201
python/hsfs/core
  arrow_flight_client.py              71-75
  feature_store_api.py                34, 48-51
python/hsml/engine
  model_engine.py
Project Total

This report was generated by python-coverage-comment-action

hops fg info / features / preview / insert / derive / stats / search /
keywords and hops fv info / read / delete / get-feature-vector all
called fs.get_feature_group / fs.get_feature_view on the active
project's own feature store. The SDK returns None when the entity is
missing from that store, so requests for shared groups silently
rendered "?" everywhere. Joins in hops fg derive and hops fv create
could not source shared base or joined feature groups either.

Centralise the lookup: session.get_feature_stores(ctx) caches the
visible-stores list per invocation, and shared _get_fg / _get_fv
helpers walk those stores, return the first match, and raise a clear
"not found in any visible feature store. Run hops fg list / hops fv
list" when nothing matches. The fix flows through every call site
that resolves an entity by name without retouching the individual
commands.
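
A rough sketch of the lookup pattern this commit describes, under stated assumptions: `StubStore`, `EntityNotFound`, and `find_feature_group` are hypothetical names used only for illustration, and the stub mimics just the one behaviour mentioned above (returning None when the entity is missing).

```python
class StubStore:
    """Hypothetical store whose getter returns None for a missing entity."""

    def __init__(self, project_name, groups):
        self.project_name = project_name
        self._groups = groups  # {(name, version): feature group object}

    def get_feature_group(self, name, version=1):
        return self._groups.get((name, version))


class EntityNotFound(Exception):
    """Raised when no visible store holds the requested entity."""


def find_feature_group(stores, name, version=1):
    """Walk the visible stores in order and return the first match."""
    for store in stores:
        fg = store.get_feature_group(name, version)
        if fg is not None:
            return fg
    # Fail loudly instead of passing None on to the rendering code.
    raise EntityNotFound(
        f"Feature group '{name}' v{version} not found in any visible "
        "feature store. Run hops fg list."
    )


stores = [
    StubStore("my_project", {}),
    StubStore("hopsworks_default", {("transactions", 1): "shared-fg"}),
]
found = find_feature_group(stores, "transactions")
```

Raising on a miss is the design point: a None that travels on to the table renderer is what produced the "?" cells described above.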

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MagicLex MagicLex changed the title from "[onboarding] hops fg/fv list spans shared feature stores" to "[onboarding] general fixes for user onboarding via wizard" on May 11, 2026
Admin Admin and others added 9 commits May 11, 2026 12:19
A new feature pipeline today needs four CLI moves to leave the IDE:
hops files upload, hops job create, hops job schedule, hops job run.
The first three are pure plumbing the SDK can do itself; the
onboarding wizard should be able to ship a script with one command.

Add JobApi.deploy(local_path, name, type=None, environment_name=None,
args=None, remote_dir=None, overwrite=True) -> Job on the SDK side.
It composes dataset_api.upload (script lands in
/Resources/jobs/<name>/ by default) and the existing PUT-backed
create_job, and infers the job type from the extension (.py = PYTHON,
.jar = SPARK). Re-deploying the same name overwrites the script and
updates the job definition in place, so the call is idempotent.
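
The extension-based inference and the default upload location could look roughly like this. It is a sketch only: `infer_job_type` and `default_remote_dir` are hypothetical helper names, and the error handling for unknown extensions is an assumption, since the commit message does not say what happens in that case.

```python
from pathlib import Path

# Mapping taken from the description above: .py = PYTHON, .jar = SPARK.
_TYPE_BY_SUFFIX = {".py": "PYTHON", ".jar": "SPARK"}


def infer_job_type(local_path, explicit_type=None):
    """Return the explicit type if given, else infer it from the extension."""
    if explicit_type is not None:
        return explicit_type
    suffix = Path(local_path).suffix.lower()
    if suffix not in _TYPE_BY_SUFFIX:
        # Assumed behaviour: refuse to guess for unknown extensions.
        raise ValueError(
            f"Cannot infer job type from '{suffix}'; pass the type explicitly."
        )
    return _TYPE_BY_SUFFIX[suffix]


def default_remote_dir(name):
    """Default script location described above: /Resources/jobs/<name>/."""
    return f"/Resources/jobs/{name}/"
```

For example, `infer_job_type("pipelines/btc_features.py")` yields `"PYTHON"`, and an explicit type always takes precedence over the extension.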

Add hops job deploy LOCAL_FILE --name NAME on the CLI side. It calls
JobApi.deploy and then chains job.schedule when --cron is set and
job.run when --run is set, so the full upload + register + schedule +
launch chain fits on a single line.

Live-verified on the onboarding test project: deploy creates the job,
re-deploy is idempotent, --run --wait runs to completion and prints
the heartbeat in the captured stdout, and --cron attaches a schedule
visible via hops job schedule-info.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ice disable

Two bugs found while smoke-testing the wizard's pipeline step end to
end:

1. hops <command> --json mixed SDK chatter with the JSON payload. The
   SDK calls logging.basicConfig(stream=sys.stdout) at import time and
   the login banner is a bare print, so a fresh hops fg list --json
   round-tripped seven log lines before the actual JSON array, which
   broke every downstream json.loads. set_json_mode now snapshots
   sys.stdout at call time, swaps it for sys.stderr so subsequent
   prints and root-handler logs land on stderr, raises the
   hopsworks/hopsworks_common/hsfs/hsml loggers to WARNING, and routes
   print_json to the captured stdout. The snapshot is re-taken on each
   call so Click's per-invocation CliRunner buffer in tests is
   honoured.

2. arrow_flight_client._disable_feature_query_service_client()
   crashed with AttributeError on a brand-new session because the
   guard inverted the None check: when _arrow_flight_instance was
   None, the code tried to call .ArrowFlightClient(...) on None
   instead of constructing one. The fix is a one-line change: assign the
   new ArrowFlightClient(disabled_for_session=True) to the module
   global.
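
The stdout handling from item 1 can be sketched as below. The function names `set_json_mode` and `print_json` come from the commit message; the bodies are an assumed, simplified reconstruction. In particular, this sketch does not re-point log handlers that already hold a reference to the original stdout stream.

```python
import io
import json
import logging
import sys

_json_stdout = None  # stream captured when JSON mode is switched on


def set_json_mode():
    """Reserve the current stdout for JSON and send everything else to stderr."""
    global _json_stdout
    _json_stdout = sys.stdout  # snapshot at call time, re-taken on every call
    sys.stdout = sys.stderr  # later print() calls now land on stderr
    for name in ("hopsworks", "hopsworks_common", "hsfs", "hsml"):
        logging.getLogger(name).setLevel(logging.WARNING)


def print_json(payload):
    """Write the payload to the captured stdout, falling back to sys.stdout."""
    target = _json_stdout if _json_stdout is not None else sys.stdout
    target.write(json.dumps(payload) + "\n")
    target.flush()


# Demonstration with in-memory streams standing in for the terminal.
fake_stdout, fake_stderr = io.StringIO(), io.StringIO()
real_stdout, real_stderr = sys.stdout, sys.stderr
sys.stdout, sys.stderr = fake_stdout, fake_stderr
try:
    set_json_mode()
    print("Logged in to project")  # chatter, diverted to stderr
    print_json([{"name": "transactions"}])
finally:
    sys.stdout, sys.stderr = real_stdout, real_stderr

captured_json = fake_stdout.getvalue()
captured_noise = fake_stderr.getvalue()
```

After the demonstration, `captured_json` holds only the JSON line, so a downstream `json.loads` succeeds, while the banner text sits in `captured_noise`.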

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… live pipeline test

A pipeline that reads from a HUDI feature group via the Python engine
fails with FlightServerError: Catalog Error: Table ... does not exist
on this cluster, even though fg.commit_details() reports the right
commit and the offline materialization job ran to SUCCEEDED. The
Hive fallback is gone in 4.x so there is no second path.

The caveat walks through the SDK payload (which is well-formed), the
per-query DuckDB registration in FlyingDuck's query_engine.py, and
the three likely root causes on the cluster side (HopsFS read
permission for the FlyingDuck pod, stale warehouse mount, race
between insert ack and materialization visibility). Includes the
probes that confirmed the SDK side is clean so the next dev loop
goes straight to the FlyingDuck pod logs.

Captures the two SDK / CLI fixes that came out of the same
investigation so the caveat is self-contained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bug is fixed end-to-end now that
``logicalclocks/hopsworks-ee#2996`` (Java emits the two-part FROM
identifier) and ``logicalclocks/flyingduck#196`` (registration uses a
real DuckDB schema) are both in. Verified live on lexterm2:
``hops fg preview`` resolves shared HUDI feature groups and the
project's own materialised FG without raising the prior catalog
error.

Per the project's caveat convention ("known gotchas; add new ones to
this folder"), once the gotcha disappears the file should follow.
Keeping a stale caveat around teaches the wrong shape — the PRs and
commit history carry the diagnostic walkthrough for anyone who needs
it later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the working doc and the live-tested sample scripts into the
repo so a future agent can pick the loop back up from PR state alone
(the pod the loop ran in is ephemeral; anything in /tmp or the user
homedir disappears when it cycles).

Doc lands at .claude/docs/onboarding-flow.md and now opens with a
"Handoff state" section pointing at the three cross-repo PRs, the
lexterm2 kubeconfig at /hopsfs/Resources/kubeconfig-lexterm2 (which
is project-scoped and survives pod restarts), and the cluster-side
state of the live demo project. Samples under .claude/docs/samples/
carry the heartbeat job, the BTC feature pipeline, and the vectorised
retrieval-time transformations.

CLI fixes from the Step 4 / Step 5 loop, committed together so the
sample scripts and the doc references all match HEAD:

- hops fg stats no longer dumps the raw statistics JSON to stdout in
  human mode. Renders FEATURE / TYPE / COUNT / MIN / MAX / MEAN /
  STDDEV / COMPLETENESS as a table, formats wide numerics with
  thousands separators and four significant digits.
- hops transformation create iterates every @udf-decorated function
  in the source file instead of refusing files with more than one,
  and emits a per-function "Created transformation NAME vV" line.
  Mirrors the SDK shape (one create_transformation_function call per
  UDF) without forcing the caller to split the file.
- hops transformation create now defaults to --version 1, working
  around the backend HTTP 500 / NPE when version is null. The SDK
  side still passes null straight through and the backend should
  default to 1 (or the next free version); tracked in the gaps list.
- hops fv create --transform now accepts the fn[version]:col shape
  so an FV can pin a specific transform version. The SDK's
  fs.get_transformation_function(name=) defaults to v1, so if a fix
  re-registers as v2, FVs created without an explicit version stay
  bound to the broken v1. Pinning closes the gap.
- Tests updated: validates "register every UDF in source" and the
  "reject source with no @udf" path; old "reject multiple" parametrize
  case removed.
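
One way to parse the `fn[version]:col` shape mentioned in the list above is sketched here. The exact grammar the CLI accepts is not shown in this PR text, so the regular expression, the helper name `parse_transform_spec`, and the sample value `scale[2]:price` are assumptions for illustration.

```python
import re

# Assumed grammar: identifier, optional [integer version], colon, column name.
_SPEC = re.compile(r"^(?P<fn>[A-Za-z_]\w*)(?:\[(?P<version>\d+)\])?:(?P<col>\w+)$")


def parse_transform_spec(spec):
    """Split 'fn[version]:col' into (fn, version or None, col)."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"Expected fn[version]:col or fn:col, got '{spec}'")
    # A missing [version] leaves the group unset; the caller picks the default.
    version = int(match["version"]) if match["version"] else None
    return match["fn"], version, match["col"]


pinned = parse_transform_spec("scale[2]:price")
unpinned = parse_transform_spec("scale:price")
```

Returning None for an unpinned spec keeps the defaulting decision in one place, which matters here because the SDK default of v1 is exactly what leaves feature views bound to an outdated transformation.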

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…opsfs path is not a dataset

Two gaps hit while running the autoresearch flow on lexterm2:

cli: HOPSWORKS_ENGINE was required in interactive Hopsworks pods.
hopsworks_common.connection picks the spark engine whenever pyspark
imports and we're in-pod, but the typical terminal/notebook pod has
pyspark installed transitively without a usable Spark master (Spark
Connect, no SparkSession.builder.getOrCreate() fallback), so login
crashes on getOrCreate. The CLI is single-process and never needs an
in-process spark engine: heavy operations dispatch jobs server-side.
hopsworks.cli.auth.login now defaults engine=python when neither the
caller nor HOPSWORKS_ENGINE picks one. Explicit args and the env var
both still win.
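
The precedence described above (explicit argument, then `HOPSWORKS_ENGINE`, then `python`) amounts to something like the following sketch. `resolve_engine` is a hypothetical helper name; the actual change lives in `hopsworks.cli.auth.login`.

```python
import os


def resolve_engine(explicit=None, environ=None):
    """Pick the engine: explicit argument, then HOPSWORKS_ENGINE, then 'python'."""
    environ = os.environ if environ is None else environ
    if explicit:
        return explicit
    # An unset or empty variable falls through to the CLI default.
    return environ.get("HOPSWORKS_ENGINE") or "python"
```

Passing `environ` explicitly keeps the function easy to test without touching the process environment.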

hsml: model.save() of a path under /hopsfs/ that is not actually a
project dataset path errored with "Path not found". Per-pod mount
layouts vary, FUSE writes may not have synced, and user-home folders
under /hopsfs/Users/<user>/ are not project datasets.
_normalize_hopsfs_mount_path is a pure string strip, so we now also verify the
normalized path with dataset_api.path_exists; only then do we move it,
otherwise we fall back to the regular local upload path.
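
The verify-then-move decision might be sketched like this. The helper names and the `path_exists` callable are hypothetical; in the SDK the check goes through `dataset_api.path_exists`, whose exact signature is not shown here, so the sketch takes any callable that answers whether a project-relative path exists.

```python
HOPSFS_MOUNT_PREFIX = "/hopsfs/"  # mount prefix as it appears in the text above


def strip_mount_prefix(path):
    """Pure string strip: return the project-relative path, or None."""
    if path.startswith(HOPSFS_MOUNT_PREFIX):
        return path[len(HOPSFS_MOUNT_PREFIX):]
    return None


def choose_save_strategy(local_path, path_exists):
    """Move only when the stripped path is a real dataset path, else upload."""
    normalized = strip_mount_prefix(local_path)
    if normalized is not None and path_exists(normalized):
        return ("move", normalized)
    # Not under the mount, or not visible as a dataset path: regular upload.
    return ("upload", local_path)


# Stand-in for the backend check: only this path exists as a dataset path.
known_dataset_paths = {"Resources/models/btc"}
strategy_dataset = choose_save_strategy(
    "/hopsfs/Resources/models/btc", known_dataset_paths.__contains__
)
strategy_home = choose_save_strategy(
    "/hopsfs/Users/alice/btc", known_dataset_paths.__contains__
)
```

Here the dataset path is moved in place, while the user-home path falls back to the upload route instead of failing with "Path not found".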

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…schema is timestamp/date

pd.read_json and json.loads both parse ISO 8601 timestamps as object
columns, and fg.insert(df) then rejects them with a "wrong type" error
even though the value is well-formed. Inspect the FG schema (preferring
the new fg.columns, falling back to the deprecated fg.features) and
coerce any timestamp/date column in place with pd.to_datetime, so
hops fg insert --file row.json on a typical event-time FG just works.
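
A minimal sketch of the coercion step, assuming pandas is installed. The `schema` mapping (column name to type string) and the helper name `coerce_temporal_columns` are hypothetical simplifications; the real code reads the types from the feature group's schema.

```python
import json

import pandas as pd


def coerce_temporal_columns(df, schema):
    """Convert columns whose declared type is timestamp or date, in place."""
    for column, feature_type in schema.items():
        if column in df.columns and feature_type.lower() in {"timestamp", "date"}:
            df[column] = pd.to_datetime(df[column])
    return df


# Rows as they arrive from a JSON file: the timestamp is still a plain string.
rows = json.loads('[{"id": 1, "event_time": "2026-05-11T12:00:00"}]')
df = pd.DataFrame(rows)
was_datetime_before = pd.api.types.is_datetime64_any_dtype(df["event_time"])

df = coerce_temporal_columns(df, {"id": "bigint", "event_time": "timestamp"})
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["event_time"])
```

Only columns that the schema declares as temporal are touched, so string features that merely look like dates are left alone.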

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loop)

Two orchestration briefs the in-app Wizard pastes into Claude Code in
the Terminal. Each is a short conversational program: a few inputs
from the user, then CLI-first execution. They wrap the canonical
10-step onboarding flow into focused, agent-loadable specs.

- docs/wizard/time-series.md: raw FGs through a feature pipeline,
  feature view, training pipeline, registered model, deployment, and
  Streamlit app. Built segment by segment with hops context after
  each so the user sees what just appeared. Hard rule against
  heredoc-python orchestration; everything goes through the hops CLI.
- docs/wizard/research.md: autonomous research loop on a feature
  view, logged to autoresearch_experiments_<tag> and the model
  registry. Mirrors hopsworks-autoresearch/program.md's contract,
  with CLI commands inlined.

The briefs explicitly call out conversation rules (one question per
turn, summarise after each segment, integrate user pushback) and
anti-patterns (heredoc python, pip-install inline, skip-the-schedule)
that came out of testing the flow end-to-end on lexterm2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MagicLex MagicLex added the wip label May 12, 2026
MagicLex and others added 2 commits May 13, 2026 17:43
15 numbered issues captured live while running the time-series wizard
brief end-to-end on lexterm2 (raw FGs to deployed Streamlit). Each
entry is rated by severity (blocker / time-waster / papercut), reports
the symptom verbatim, names the root cause where known, and proposes a
fix. The two highest-leverage ones are #10 (model files not copied to
Deployments/<name>/<v>/, crashloops the predictor) and #13 (batch
get_feature_vectors silently returns 0 rows on ISO-date strings while
the singular form accepts them).

Lives next to the briefs that produced it so the next agent loading
the wizard context sees the known traps before walking into them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four more items hit while wiring the Streamlit consumer onto the
deployed model:

- #16 hops app create --path rejects absolute /hopsfs/ paths and
  produces a doubled // HDFS URL; opposite semantics from
  hops job deploy.
- #17 hops app start has --no-wait only (start blocks by default);
  hops deployment start has --wait. Naming-symmetry break.
- #19 deployment.predict() does bare json.dumps() on inputs, so any
  datetime.date payload raises TypeError in hsml internals. Combined
  with #13, the date type round-trip is a cliff: SDK forces string,
  get_feature_vectors silently drops it.
- #18 python-app-pipeline env lacks plotly. Crash surfaces only after
  upload + start.
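
The failure behind item 19 is standard `json` behaviour and can be reproduced without the SDK. The sketch below shows the bare call failing and one possible caller-side workaround using the `default=` hook; `to_json_safe` and the payload shape are hypothetical, and this is not a change made in this PR.

```python
import datetime
import json


def to_json_safe(obj):
    """default= hook: ISO-format dates and datetimes, reject anything else."""
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


payload = {"instances": [[datetime.date(2026, 5, 13), 42.0]]}

try:
    json.dumps(payload)  # bare call, as described in the item above
    bare_dumps_failed = False
except TypeError:
    bare_dumps_failed = True

encoded = json.dumps(payload, default=to_json_safe)
```

Serialising the date as an ISO string avoids the TypeError, but per item 13 the batch retrieval path may then silently return no rows for that string, so the two issues need to be fixed together.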

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>