Skip to content

Commit 08a4a66

Browse files
jimdowlingclaude
andauthored
[HWORKS-2802 / -2807] partitioned_by — Python client grain materialization (Delta/Iceberg/Hudi) + predicate translator + MCP/CLI hardening (#961)
This PR series extends **HWORKS-2802 / HWORKS-2807** from UI-only partitioning visibility into the full backend/client implementation of `partitioned_by` across Delta, Hudi, and Iceberg. It adds `partitioned_by` and `online_partition_columns` to the Python client API, with validation for allowed grains, duplicates, conflicts with `partition_key`, required `event_time`, and name collisions. The fields round-trip through REST, are exposed as read-only properties, and synthetic grain features are marked `offline_only`. The implementation evolved from an initial **Delta GENERATED ALWAYS AS** design to a simpler cross-engine model where grain columns are real materialized partition columns. The client now derives `year`, `month`, `week`, `day`, and `hour` from `event_time` before writes, using Arrow for delta-rs/PyIceberg paths and Spark functions for Spark paths. This applies to Delta, Hudi, and Iceberg, including deletes. Hudi support adds a `PartitionedByTransformer` for materialization jobs and later moves direct writes to the same real-column model, using ordinary SIMPLE partition columns with Hive-style partitioning. Iceberg gets the same grain materialization on both Spark and PyIceberg write paths. Read pruning is handled by a partition predicate translator. It adds grain-column predicates from `event_time` ranges so partition pruning works consistently across engines and formats. The translator was simplified to one direction only: **event_time range → derived grain predicates**. It also handles UTC normalization, avoids unsafe OR translations, and improves pruning by descending into finer grains when the range shares coarser prefixes. Several correctness fixes are included: per-row seconds-vs-milliseconds handling for integer `event_time`, DATE + hour materialization, partitioned deletes, repeated `Query.read()` mutation, post-merge private API renames, and CI compatibility when `deltalake` is unavailable. The series also exposes `FeatureGroup.online_config` so users can inspect RonDB secondary indexes, documents `time_travel_format` and `partitioned_by` in the hops-fg skill, and guards online serving by warning or failing when offline-only partition grain columns are selected in online feature views. Separately, it hardens the Hopsworks MCP server and CLI after a security audit: shell tools are now opt-in, HTTP defaults to localhost, bearer-token auth is supported, unauthenticated network binds are blocked unless explicitly overridden, terminal sessions use unguessable tokens, credential exposure is reduced, hostname verification defaults to true, and the CLI supports reading API keys from stdin. --------- Signed-off-by: Jim Dowling <jim@logicalclocks.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0c1a952 commit 08a4a66

30 files changed

Lines changed: 1908 additions & 82 deletions

python/hopsworks/cli/commands/transformation.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -79,6 +79,12 @@ def transformation_create(
7979
The decorated function is imported in-process and passed to
8080
``fs.create_transformation_function()``.
8181
82+
Security: the source is executed in-process (``_load_udf``) — this is
83+
arbitrary code execution by design, as the same UDF runs server-side.
84+
Treat ``--code``/``--file`` as trusted input; never assemble a
85+
``hops transformation create --code ...`` invocation from an untrusted
86+
source (an agent prompt, a CI variable, a web request).
87+
8288
Args:
8389
ctx: Click context.
8490
file_path: Python source path.

python/hopsworks/cli/main.py

Lines changed: 21 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111
from __future__ import annotations
1212

1313
import importlib
14+
import sys
1415
from typing import TYPE_CHECKING
1516

1617
import click
@@ -186,7 +187,20 @@ def get_command(self, ctx: click.Context, name: str) -> click.Command | None:
186187
)
187188
@click.version_option(_VERSION, "-v", "--version", prog_name="hops")
188189
@click.option("--host", "host_flag", help="Hopsworks host URL (overrides config).")
189-
@click.option("--api-key", "api_key_flag", help="API key (overrides config).")
190+
@click.option(
191+
"--api-key",
192+
"api_key_flag",
193+
help="API key (overrides config). Avoid in shared shells — a key in argv "
194+
"is visible in the process list and shell history. Prefer the "
195+
"HOPSWORKS_API_KEY env var or --api-key-stdin.",
196+
)
197+
@click.option(
198+
"--api-key-stdin",
199+
"api_key_stdin",
200+
is_flag=True,
201+
help="Read the API key from the first line of stdin instead of argv "
202+
"(keeps the secret out of the process list and shell history).",
203+
)
190204
@click.option("--project", "project_flag", help="Project name (overrides config).")
191205
@click.option(
192206
"--verify/--no-verify",
@@ -206,6 +220,7 @@ def cli(
206220
ctx: click.Context,
207221
host_flag: str | None,
208222
api_key_flag: str | None,
223+
api_key_stdin: bool,
209224
project_flag: str | None,
210225
verify_flag: bool | None,
211226
json_flag: bool,
@@ -216,11 +231,16 @@ def cli(
216231
ctx: Click context; we stash the resolved ``HopsConfig`` on ``ctx.obj``.
217232
host_flag: Value of ``--host`` if provided.
218233
api_key_flag: Value of ``--api-key`` if provided.
234+
api_key_stdin: When True, read the API key from stdin.
219235
project_flag: Value of ``--project`` if provided.
220236
verify_flag: True for ``--verify``, False for ``--no-verify``, None if neither was passed.
221237
json_flag: When True, every output helper switches to JSON mode.
222238
"""
223239
output.set_json_mode(json_flag)
240+
if api_key_stdin:
241+
if api_key_flag:
242+
raise click.UsageError("Pass --api-key or --api-key-stdin, not both.")
243+
api_key_flag = sys.stdin.readline().strip() or None
224244
ctx.ensure_object(dict)
225245
ctx.obj["config"] = config.load(
226246
flag_host=host_flag,

python/hopsworks/mcp/run_server.py

Lines changed: 76 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -25,10 +25,16 @@
2525
import hopsworks
2626
import uvicorn
2727

28-
from .server import mcp
28+
from .server import AUTH_TOKEN_ENV, build_mcp, static_bearer_auth
2929
from .utils.auth import login
3030

3131

32+
# Hosts that keep the server off the network. Binding anything else over an
33+
# HTTP transport without an auth token is refused (fail closed).
34+
_LOOPBACK_HOSTS = {"127.0.0.1", "::1", "localhost"}
35+
_NETWORK_TRANSPORTS = {"http", "sse", "streamable-http"}
36+
37+
3238
# Configure logging to handle closed streams gracefully
3339
class SafeStreamHandler(logging.StreamHandler):
3440
def emit(self, record):
@@ -50,14 +56,43 @@ def handle_shutdown(signum, frame):
5056

5157

5258
@click.option(
53-
"--host", default="0.0.0.0", help="Host to run the server on. (default: 0.0.0.0)"
59+
"--host",
60+
default="127.0.0.1",
61+
help="Host to bind the server to. (default: 127.0.0.1, loopback only). "
62+
"Binding a routable address exposes the server to the network — see "
63+
"--auth-token and --enable-shell-tools.",
5464
)
5565
@click.option("--port", default=8000, help="Port to run the server on. (default: 8000)")
5666
@click.option(
5767
"--transport",
5868
default="http",
5969
help="Transport method to use. (default: http). Options: 'stdio', 'http', 'sse', 'streamable-http'",
6070
)
71+
@click.option(
72+
"--auth-token",
73+
"auth_token",
74+
default=None,
75+
help="Bearer token required on the HTTP transport (clients send "
76+
f"'Authorization: Bearer <token>'). Falls back to ${AUTH_TOKEN_ENV}. "
77+
"Required to bind a non-loopback host unless --insecure-no-auth is set.",
78+
)
79+
@click.option(
80+
"--insecure-no-auth",
81+
"insecure_no_auth",
82+
is_flag=True,
83+
default=False,
84+
help="Allow binding a non-loopback host with no --auth-token. "
85+
"Unauthenticated network access — only for an isolated network you trust.",
86+
)
87+
@click.option(
88+
"--enable-shell-tools",
89+
"enable_shell_tools",
90+
is_flag=True,
91+
default=False,
92+
help="Register the terminal/brewer tools. These execute arbitrary commands "
93+
"as the server process; off by default. Never combine with an "
94+
"unauthenticated network bind.",
95+
)
6196
@click.option(
6297
"--create_session",
6398
default=True,
@@ -77,9 +112,10 @@ def handle_shutdown(signum, frame):
77112
help="Path to a file containing the API key for Hopsworks authentication",
78113
)
79114
@click.option(
80-
"--hostname_verification",
81-
default=False,
82-
help="Enable hostname verification for Hopsworks authentication",
115+
"--hostname_verification/--no-hostname-verification",
116+
default=True,
117+
help="Verify the Hopsworks TLS hostname (default: on). Disable only for a "
118+
"trusted private cluster with a self-signed certificate.",
83119
)
84120
@click.option(
85121
"--trust_store_path",
@@ -92,16 +128,19 @@ def handle_shutdown(signum, frame):
92128
help="Engine to use (python, spark, training, spark-no-metastore, spark-delta) (default: python)",
93129
)
94130
def run_server(
95-
host: str = "0.0.0.0",
131+
host: str = "127.0.0.1",
96132
port: int | None = None,
97133
transport: Literal["stdio", "http", "sse", "streamable-http"] = "http",
134+
auth_token: str | None = None,
135+
insecure_no_auth: bool = False,
136+
enable_shell_tools: bool = False,
98137
create_session: bool = True,
99138
hopsworks_host: str | None = None,
100139
hopsworks_port: int = 443,
101140
project: str | None = None,
102141
api_key_value: str | None = None,
103142
api_key_file: str | None = None,
104-
hostname_verification: bool = False,
143+
hostname_verification: bool = True,
105144
trust_store_path: str | None = None,
106145
engine: Literal[
107146
"spark", "python", "training", "spark-no-metastore", "spark-delta"
@@ -110,18 +149,24 @@ def run_server(
110149
"""Run the Hopsworks MCP server.
111150
112151
Parameters:
113-
host: Host to run the server on.
152+
host: Address to bind to. Defaults to loopback; a routable address
153+
exposes the server to the network.
114154
port:
115155
Port to run the server on.
116156
If not provided, it will default to the value of the UVICORN_PORT environment variable or 8000 if that is not set.
117157
transport: Transport method to use.
158+
auth_token: Bearer token required on the HTTP transport (or via the
159+
environment variable). Required for a non-loopback bind.
160+
insecure_no_auth: Permit a non-loopback bind with no auth token.
161+
enable_shell_tools: Register the command-executing terminal/brewer tools
162+
(off by default).
118163
create_session: Whether to create a Hopsworks session for the MCP server.
119164
hopsworks_host: Hopsworks host URL.
120165
hopsworks_port: Hopsworks port.
121166
project: Project name to use as default.
122167
api_key_value: API key value for Hopsworks authentication.
123168
api_key_file: Path to a file containing the API key for Hopsworks authentication.
124-
hostname_verification: Enable hostname verification for Hopsworks authentication.
169+
hostname_verification: Verify the Hopsworks TLS hostname (default on).
125170
trust_store_path: Path to the trust store for Hopsworks authentication.
126171
engine: Hopsworks engine to use.
127172
"""
@@ -133,6 +178,28 @@ def run_server(
133178
if port is None:
134179
port = int(os.getenv("UVICORN_PORT", "8000"))
135180

181+
auth_token = auth_token or os.getenv(AUTH_TOKEN_ENV)
182+
183+
# Fail closed: a network transport on a routable host with no auth token is
184+
# the exact shape that turns this server into an open endpoint. Refuse it
185+
# unless the operator explicitly accepts the risk.
186+
on_network = transport in _NETWORK_TRANSPORTS and host not in _LOOPBACK_HOSTS
187+
if on_network and not auth_token and not insecure_no_auth:
188+
raise click.ClickException(
189+
f"Refusing to bind {host!r} over '{transport}' without authentication. "
190+
f"Pass --auth-token (or set ${AUTH_TOKEN_ENV}), bind 127.0.0.1, or "
191+
"pass --insecure-no-auth if this is an isolated, trusted network."
192+
)
193+
if enable_shell_tools and on_network and not auth_token:
194+
raise click.ClickException(
195+
"--enable-shell-tools exposes arbitrary command execution; it cannot "
196+
"be combined with an unauthenticated network bind. Add --auth-token "
197+
"or bind 127.0.0.1."
198+
)
199+
200+
auth = static_bearer_auth(auth_token) if auth_token else None
201+
mcp = build_mcp(enable_shell_tools=enable_shell_tools, auth=auth)
202+
136203
if create_session:
137204
# Set the API key for the Hopsworks client
138205
login(

python/hopsworks/mcp/server.py

Lines changed: 84 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -14,9 +14,25 @@
1414
# limitations under the License.
1515
#
1616

17-
"""MCP server for Hopsworks."""
17+
"""MCP server for Hopsworks.
18+
19+
`build_mcp()` is the factory. Shell-capable tools (`TerminalTools`,
20+
`BrewerTools`) execute arbitrary commands as the server process, so they are
21+
**opt-in** (`enable_shell_tools=True`) rather than always registered. Transport
22+
authentication is also opt-in via an `auth` provider; `run_server` refuses to
23+
expose a non-loopback HTTP transport without one.
24+
25+
The module-level `mcp`/`app` are a safe default for ASGI deployments that import
26+
`hopsworks.mcp.server:app`: shell tools are off, and a bearer token verifier is
27+
attached when `HOPSWORKS_MCP_AUTH_TOKEN` is set in the environment.
28+
"""
29+
30+
from __future__ import annotations
31+
32+
import os
1833

1934
from fastmcp import FastMCP
35+
from fastmcp.server.auth.providers.jwt import StaticTokenVerifier
2036
from starlette import status
2137
from starlette.responses import Response
2238

@@ -33,25 +49,76 @@
3349
)
3450

3551

36-
# Create a FastMCP server instance
37-
mcp = FastMCP(name="Hopsworks MCP")
52+
# Env var an operator can set to require a bearer token on the HTTP transport.
53+
# `run_server` also accepts it via --auth-token.
54+
AUTH_TOKEN_ENV = "HOPSWORKS_MCP_AUTH_TOKEN"
55+
56+
57+
def static_bearer_auth(token: str) -> StaticTokenVerifier:
58+
"""Build a single-token bearer verifier for the HTTP transport.
59+
60+
Clients must send ``Authorization: Bearer <token>``. This is a shared-secret
61+
gate, not per-user identity; pair it with a loopback bind or a fronting proxy
62+
for anything beyond a single trusted operator.
63+
64+
Parameters:
65+
token: The bearer token clients must present.
3866
39-
# Initialize tools and resources
40-
AuthTools(mcp)
41-
ProjectTools(mcp)
42-
ProjectResources(mcp)
43-
ProjectPrompts(mcp)
44-
SystemPrompts(mcp)
45-
JobTools(mcp)
46-
DatasetTools(mcp)
47-
FeatureGroupTools(mcp)
48-
TerminalTools(mcp)
49-
BrewerTools(mcp)
67+
Returns:
68+
A verifier suitable for ``FastMCP(auth=...)``.
69+
"""
70+
return StaticTokenVerifier(
71+
tokens={token: {"client_id": "hopsworks-mcp", "scopes": []}}
72+
)
5073

5174

52-
@mcp.custom_route("/health", methods=["GET"])
53-
async def health(_):
54-
return Response(status_code=status.HTTP_204_NO_CONTENT)
75+
def build_mcp(
76+
*,
77+
enable_shell_tools: bool = False,
78+
auth: StaticTokenVerifier | None = None,
79+
) -> FastMCP:
80+
"""Construct a Hopsworks MCP server.
5581
82+
Parameters:
83+
enable_shell_tools: Register the command-executing tools
84+
(``TerminalTools``, ``BrewerTools``). Off by default — these grant
85+
arbitrary code execution as the server process and must be turned on
86+
deliberately.
87+
auth: Optional transport auth provider. When set, the HTTP transport
88+
rejects unauthenticated requests.
89+
90+
Returns:
91+
The configured ``FastMCP`` instance.
92+
"""
93+
server = FastMCP(name="Hopsworks MCP", auth=auth)
94+
95+
AuthTools(server)
96+
ProjectTools(server)
97+
ProjectResources(server)
98+
ProjectPrompts(server)
99+
SystemPrompts(server)
100+
JobTools(server)
101+
DatasetTools(server)
102+
FeatureGroupTools(server)
103+
104+
if enable_shell_tools:
105+
TerminalTools(server)
106+
BrewerTools(server)
107+
108+
@server.custom_route("/health", methods=["GET"])
109+
async def health(_):
110+
return Response(status_code=status.HTTP_204_NO_CONTENT)
111+
112+
return server
113+
114+
115+
# Safe default for `hopsworks.mcp.server:app` importers: no shell tools, and a
116+
# bearer gate when HOPSWORKS_MCP_AUTH_TOKEN is set. Operators who need the shell
117+
# tools or a custom auth provider should run via `hopsworks-mcp` (run_server.py).
118+
_env_token = os.environ.get(AUTH_TOKEN_ENV)
119+
mcp = build_mcp(
120+
enable_shell_tools=False,
121+
auth=static_bearer_auth(_env_token) if _env_token else None,
122+
)
56123

57124
app = mcp.http_app()

python/hopsworks/mcp/tools/auth.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -54,7 +54,7 @@ async def login(
5454
project: str = None,
5555
api_key_value: str = None,
5656
api_key_file: str = None,
57-
hostname_verification: bool = False,
57+
hostname_verification: bool = True,
5858
trust_store_path: str = None,
5959
engine: str = "python",
6060
ctx: Context | None = None,

0 commit comments

Comments
 (0)